Welcome to IBM Analog Hardware Acceleration Kit’s documentation!¶
Installation¶
The preferred way to install this package is by using the Python package index:
pip install aihwkit
Note
During the initial beta stage, we do not provide pip wheels (that is, pre-compiled binaries) for all possible platform, version, and architecture combinations (in particular, only CPU versions are provided).
Please refer to the Advanced installation guide page for instructions on how to compile the library for your environment in case you encounter errors when installing from pip.
The package requires the following runtime libraries to be installed in your system:
OpenBLAS: 0.3.3+
CUDA Toolkit: 9.0+ (only required for the GPU-enabled simulator [1])
Note
Please note that the current pip wheels are only compatible with PyTorch 1.6.0. If you need to use a different PyTorch version, please refer to the Advanced installation guide section in order to compile a custom version. More details about the PyTorch compatibility can be found in this issue.
Verifying the installation¶
If the library was installed correctly, you can use the following snippet for creating an analog layer and predicting the output:
from torch import Tensor
from aihwkit.nn import AnalogLinear
model = AnalogLinear(3, 2)
model(Tensor([[0.1, 0.2, 0.4], [0.3, 0.4, 0.5]]))
If you encounter any issues during the installation or while executing the snippet, please refer to the Advanced installation guide section for more details, and don't hesitate to use the issue tracker for additional support.
Next steps¶
You can read more about the PyTorch layers in the Using the pytorch integration section, and about the internal analog tiles in the Using analog tiles section.
[1] Note that GPU support is not available on OSX, as it requires a platform that has official CUDA support.
Advanced installation guide¶
Compilation¶
The build system for aihwkit is based on cmake, making use of scikit-build for generating the Python packages.
Some of the dependencies and tools are Python-based. For convenience, we suggest creating a virtual environment as a way to isolate your environment:
$ python3 -m venv aihwkit_env
$ cd aihwkit_env
$ source bin/activate
(aihwkit_env) $
Note
The following sections assume that the command line examples are executed in the activated aihwkit_env environment.
Dependencies¶
For compiling aihwkit, the following dependencies are required:

| Dependency | Version | Notes |
|---|---|---|
| C++11 compatible compiler | | |
| cmake | 3.18+ | |
| pybind11 | 2.6.0+ | Version 2.6.0 can be installed using pip |
| scikit-build | 0.11.0+ | |
| Python | 3.6+ | |
| OpenBLAS | 0.3.3+ | BLAS implementation |
| PyTorch | 1.5+ | The libtorch library and headers are needed [1] |
| OpenMP | 11.0.0+ | Optional, OpenMP library and headers [2] |
| CUDA | 9.0+ | Optional, for GPU-enabled simulator |
| Nvidia CUB | 1.8.0 | Optional, for GPU-enabled simulator [4] |
| googletest | 1.10.0 | Optional, for building the C++ tests [4] |
Please refer to your operating system documentation for instructions on how to install the different dependencies. The following sections contain quick instructions for several operating systems:
Debian-based¶
On a Debian-based operating system, the following commands can be used for installing the minimal dependencies:
$ sudo apt-get install python3-dev libopenblas-dev
$ pip install cmake scikit-build torch pybind11
OSX¶
On an OSX-based system, the following commands can be used for installing the minimal dependencies (note that Xcode needs to be installed):
$ brew install openblas
$ pip install cmake scikit-build torch pybind11
miniconda¶
On a miniconda-based system, the following commands can be used for installing the minimal dependencies [3]:
$ conda install cmake openblas pybind11
$ conda install -c conda-forge scikit-build
$ conda install -c pytorch pytorch
Windows (Experimental)¶
On a Windows-based system, we recommend installing OpenBLAS by following this OpenBLAS - Visual Studio installation and usage guide. It requires installing MS Visual Studio 2019 and Miniconda.
After compiling and installing OpenBLAS, in the same Miniconda terminal, the following commands can be used for installing the minimal dependencies:
$ conda install pybind11 scikit-build
$ conda install pytorch -c pytorch
For compiling aihwkit, it is recommended to use the x64 Native Tools Command Prompt for VS 2019.
Note: If you want to use pip instead of conda, the following commands can be used:
$ pip install cmake scikit-build pybind11
$ pip install torch -f https://download.pytorch.org/whl/torch_stable.html
Installing and compiling¶
Once the dependencies are in place, the following command can be used for compiling and installing the Python package:
$ pip install -v aihwkit
This command will:
- download the source tarball for the library.
- invoke scikit-build, which in turn will invoke cmake for the compilation.
- execute the commands in verbose mode, to help with troubleshooting issues.
- install the Python package.
If there are any issues with the dependencies or the compilation, the output of the command will help diagnose the issue.
Note
Please note that the instructions on this page refer to installing the package as an end user. If you are planning to contribute to the project, an alternative setup and tips can be found in the Development setup section, which is more tuned towards the needs of a development cycle.
[1] This library uses PyTorch as both a build dependency and a runtime dependency. Please ensure that your torch installation includes libtorch and the development headers - they are included by default if installing torch from pip.
[2] Support for parts of OpenMP 4.0+. Some compilers such as LLVM or Clang do not ship with OpenMP support. If you want to add shared memory processing support to the library using one of these compilers, you will need to install the OpenMP library on your system.
[3] Please note that support for conda-based distributions is currently experimental, and further commands might be needed.
[4] Both Nvidia CUB and googletest are downloaded and compiled automatically during the build process. As a result, they do not need to be installed manually.
Analog AI¶
What is analog AI and an analog chip?¶
In a traditional hardware architecture, computation and memory are siloed in different locations. Information is moved back and forth between computation and memory units every time an operation is performed, creating a limitation called the von Neumann bottleneck.
In-memory computing delivers radical performance improvements by combining compute and memory in a single device, eliminating the von Neumann bottleneck. By leveraging the physical properties of memory devices, computation happens at the same place where the data is stored, drastically reducing energy consumption. Many types of memory devices such as phase-change memory (PCM), resistive random-access memory (RRAM), and Flash memory can be used for in-memory computing [1]. Because there is no movement of data, tasks can be performed in a fraction of the time and with much less energy. This is different from a conventional computer, where the data is transferred from the memory to the CPU every time a computation is done.

In deep learning, data propagation through multiple layers of a neural network involves a sequence of matrix multiplications, as each layer can be represented as a matrix of synaptic weights. These weights can be stored in the analog charge state or conductance state of memory devices. The devices are arranged in crossbar arrays, creating an artificial neural network where all matrix multiplications are performed in-place in an analog manner. This structure allows deep learning models to be run at reduced energy consumption [1].
An in-memory computing chip typically consists of multiple crossbar arrays of memory devices that communicate with each other. A neural network layer can be implemented on (at least) one crossbar, in which the weights of that layer are stored in the charge or conductance state of the memory devices at the crosspoints. Usually, at least two devices per weight are used: one encoding the positive part of the synaptic weight and the other encoding the negative part. The propagation of data through that layer is performed in a single step by inputting the data to the crossbar rows and deciphering the results at the columns. The results are then passed through the neuron nonlinear function and input to the next layer. The neuron nonlinear function is typically implemented at the crossbar periphery, using analog or digital circuits. Because every layer of the network is stored physically on different arrays, each array needs to communicate at least with the array(s) storing the next layer for feed-forward networks, such as multi-layer perceptrons (MLPs) or convolutional neural networks (CNNs). For recurrent neural networks (RNNs), the output of an array needs to communicate with its input.
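To make the mapping concrete, the following minimal sketch (plain PyTorch tensors only, no aihwkit API; the sizes and values are illustrative assumptions) shows how a weight matrix stored as a differential pair of conductance arrays performs the matrix multiplication described above:

import torch

# Illustrative conductance arrays for a layer with 3 inputs and 4 outputs:
# rows correspond to crossbar rows (inputs), columns to crossbar columns (outputs).
# One array encodes the positive part of each weight, the other the negative part.
g_plus = torch.tensor([[0.8, 0.0, 0.2, 0.0],
                       [0.0, 0.5, 0.0, 0.3],
                       [0.1, 0.0, 0.6, 0.0]])
g_minus = torch.tensor([[0.0, 0.4, 0.0, 0.7],
                        [0.2, 0.0, 0.1, 0.0],
                        [0.0, 0.3, 0.0, 0.5]])

# Input activations applied to the rows (one value per input).
x = torch.tensor([0.2, -0.1, 0.5])

# Summing the currents along each column implements the multiply-accumulate in a
# single step; the effective weight of each crosspoint is the difference of the pair.
y = x @ (g_plus - g_minus)
print(y)  # one result per column (output), to be passed through the neuron nonlinearity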

The efficient matrix multiplication realized via in-memory computing is very attractive for inference-only applications, where data is propagated through the network on offline-trained weights. In this scenario, the weights are typically trained using conventional GPU-based hardware, and are subsequently programmed into the in-memory computing chip, which performs inference. Because of device- and circuit-level non-idealities in the analog in-memory computing chip, custom techniques must be included in the training algorithm to mitigate their effect on the network accuracy (so-called hardware-aware training [2]).
In-memory computing can also be used in the context of supervised training of neural networks with backpropagation. This training involves three stages: forward propagation of labelled data through the network, backward propagation of the error gradients from output to the input of the network, and weight update based on the computed gradients with respect to the weights of each layer. This procedure is repeated over a large dataset of labelled examples for multiple epochs until satisfactory performance is reached by the network. When performing training of a neural network encoded in crossbar arrays, forward propagation is performed in the same way as for the inference described above. The only difference is that all the activations \(x_i\) of each layer have to be stored locally in the periphery. Next, backward propagation is performed by inputting the error gradient \(δ_j\) from the subsequent layer onto the columns of the current layer and deciphering the result from the rows. The resulting sum \(\sum_i δ_jW_{ij}\) needs to be multiplied by the derivative of the neuron nonlinear function, which is computed externally, to obtain the error gradient of the current layer. Finally, the weight update is implemented based on the outer product of activations and error gradients \(x_iδ_j\) of each layer. The weight update is performed in-memory by applying suitable electrical pulses to the devices which will increase their conductance in proportion to the desired weight update. See references [1, 3, 4, 5] for details on different techniques that have been proposed to perform weight updates with in-memory computing chips.
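The three training stages can be summarized with a short tensor sketch (illustrative only, not aihwkit code), where the matrix stands for the effective weights of a single layer:

import torch

torch.manual_seed(0)

w = torch.randn(2, 4)   # effective weights of one layer (outputs x inputs)
x = torch.randn(4)      # activations, stored at the periphery during the forward pass
lr = 0.1

# Forward propagation through the crossbar.
y = w @ x

# Backward propagation: the error gradient from the next layer is applied to the
# columns and read from the rows (a transposed read of the same array); the result
# still needs to be multiplied by the derivative of the neuron nonlinearity.
delta = torch.randn(2)
grad_x = w.t() @ delta

# In-memory weight update: outer product of activations and error gradients,
# realized by applying coincident pulses to the rows and columns of the array.
w -= lr * torch.outer(delta, x)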
References¶
[1] Memory devices and applications for in-memory computing. Nature Nanotechnology, 2020.
[2] Accurate deep neural network inference using computational phase-change memory. Nature Communications, 2020.
[3] Acceleration of deep neural network training with resistive cross-point devices: Design considerations. Frontiers in Neuroscience.
[4] Mixed-precision deep learning based on computational memory. Frontiers in Neuroscience.
[5] Equivalent-accuracy accelerated neural-network training using analogue memory. Nature.
Using the pytorch integration¶
This library exposes most of its higher-level features as PyTorch primitives, in order to take advantage of the rest of the PyTorch framework and integrate analog layers and other features in the regular workflow.
The following table lists the main modules that provide integration with PyTorch:
| Module | Notes |
|---|---|
| aihwkit.nn | Analog Modules (layers) and Functions |
| aihwkit.optim | Analog Optimizers |
Analog layers¶
An analog layer is a neural network module that stores its weights in an analog tile. The library currently includes the following analog layers (a brief sketch of the convolutional layer follows this list):
- AnalogLinear: applies a linear transformation to the input data. It is the counterpart of the PyTorch nn.Linear layer.
- AnalogConv2d: applies a 2D convolution over an input signal composed of several input planes. It is the counterpart of the PyTorch nn.Conv2d layer.
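As a brief sketch of the convolutional counterpart (mirroring the nn.Conv2d-style arguments used later in this document; the full set of supported keyword arguments may vary between versions):

from torch import randn
from aihwkit.nn import AnalogConv2d

# A 2D analog convolution with 1 input channel, 16 output channels and a 5x5 kernel.
conv = AnalogConv2d(in_channels=1, out_channels=16, kernel_size=5, stride=1)

# Apply it to a batch containing one single-channel 28x28 image.
output = conv(randn(1, 1, 28, 28))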
Using analog layers¶
The analog layers provided by the library can be used in a similar way to a standard PyTorch layer, by creating an object. For example, the following snippet would create a linear layer with 5 input features and 3 output features:
from aihwkit.nn import AnalogLinear
model = AnalogLinear(5, 3)
By default, the AnalogLinear layer will use a bias, and will use a FloatingPointTile as the underlying tile for the analog operations. These values can be modified by passing additional arguments to the constructor.
The analog layers will perform the forward and backward passes directly in the underlying tile.
Overall, the layer can be combined and used as if it were a standard torch layer. As an example, it can be mixed with existing layers:
from aihwkit.nn import AnalogLinear
from torch.nn import Linear, Sequential
model = Sequential(
    AnalogLinear(2, 3),
    Linear(3, 3),
    AnalogLinear(3, 1)
)
Note
When using analog layers, please be aware that the Parameters of the layers (model.weight and model.bias) are not guaranteed to be in sync with the actual weights and biases used internally by the analog tile, as reading back the weights has a performance cost. If you need to ensure that the tensors are synced, please use the set_weights() and get_weights() methods.
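For instance, a minimal sketch of keeping the tensors in sync (assuming nn.Linear-style tensor shapes; the exact signatures of these methods may differ between versions):

from torch import Tensor
from aihwkit.nn import AnalogLinear

model = AnalogLinear(3, 2)

# Write known values into the analog tile ([out_features, in_features] weights).
model.set_weights(Tensor([[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]), Tensor([0.0, 0.0]))

# Read the weights back from the tile, so that the returned tensors reflect
# the actual analog state.
weights, biases = model.get_weights()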
Customizing the analog tile properties¶
The snippet from the previous section can be extended for specifying that the underlying analog tile should use a ConstantStep resistive device, with a specific value for one of its parameters (w_min):
from aihwkit.nn import AnalogLinear
from aihwkit.simulator.configs import SingleRPUConfig
from aihwkit.simulator.configs.devices import ConstantStepDevice
config = SingleRPUConfig(device=ConstantStepDevice(w_min=-0.4))
model = AnalogLinear(5, 3, bias=False, rpu_config=config)
You can read more about analog tiles in the Using analog tiles section.
Using CUDA¶
If your version of the library is compiled with CUDA support, you can use GPU-aware analog layers for improved performance:
model = model.cuda()
This would move the layer parameters (the weights and biases tensors) to CUDA tensors, and move the analog tiles of the layers to CUDA-enabled analog tiles.
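A defensive sketch that only moves the model when the library reports CUDA support, using the check described later in the Using analog tiles section:

from aihwkit.simulator.rpu_base import cuda

# Only move the analog layers to the GPU if the library was compiled with CUDA.
if cuda.is_compiled():
    model = model.cuda()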
Note
Note that if you use analog layers that are children of other modules, some of the features need to be invoked manually on the analog layers directly (instead of only on the parent module). Please check the rest of the document for more information about using AnalogSequential as the parent class instead of nn.Sequential, for convenience.
Optimizers¶
An analog optimizer is a representation of an algorithm that determines the training strategy taking into account the particularities of the analog layers involved. The library currently includes the following optimizers:
- AnalogSGD: implements stochastic gradient descent for analog layers. It is the counterpart of the PyTorch optim.SGD optimizer.
Using analog optimizers¶
The analog optimizers provided by the library can be used in a similar way to a standard PyTorch optimizer, by creating an object. For example, the following snippet would create an analog-aware stochastic gradient descent optimizer with a learning rate of 0.1, and set it up for use with the analog layers of the model:
from aihwkit.optim import AnalogSGD
optimizer = AnalogSGD(model.parameters(), lr=0.1)
optimizer.regroup_param_groups(model)
Note
The regroup_param_groups() method needs to be invoked in order to set up the parameter groups, as they are used for handling the analog layers correctly.
The AnalogSGD optimizer will behave in the same way as the regular optim.SGD optimizer for non-analog layers in the model. For the analog layers, the weight update is performed directly in the underlying analog tile, according to the properties set for that particular layer.
Training example¶
The following example combines the usage of analog layers and analog optimizer in order to perform training:
from torch import Tensor
from torch.nn.functional import mse_loss
from aihwkit.nn import AnalogLinear
from aihwkit.optim import AnalogSGD
x = Tensor([[0.1, 0.2, 0.4, 0.3], [0.2, 0.1, 0.1, 0.3]])
y = Tensor([[1.0, 0.5], [0.7, 0.3]])
model = AnalogLinear(4, 2)
optimizer = AnalogSGD(model.parameters(), lr=0.1)
optimizer.regroup_param_groups(model)
for epoch in range(10):
    # Reset the gradients accumulated in the previous iteration.
    optimizer.zero_grad()

    pred = model(x)
    loss = mse_loss(pred, y)
    loss.backward()

    optimizer.step()
    print("Loss error: " + str(loss))
Using analog layers as part of other modules¶
When using analog layers in other modules, you can use the usual torch mechanisms for including them as part of the model.
However, as a number of torch functions are applied only to the parameters and buffers of a regular module, in some cases they would need to be applied directly to the analog layers themselves (as opposed to being applied only to the parent container).
In order to bypass the need of applying the functions to the analog layers, you can use AnalogSequential both as a compatible replacement for nn.Sequential, and as the superclass in the case of custom analog modules. By using this convenience module, the operations are guaranteed to be applied correctly to its children. For example:
from aihwkit.nn import AnalogLinear, AnalogSequential
model = AnalogSequential(
    AnalogLinear(10, 20)
)

model.cuda()
model.eval()
model.program_analog_weights()
Or in the case of custom classes:
from aihwkit.nn import AnalogConv2d, AnalogSequential
class Example(AnalogSequential):

    def __init__(self):
        super().__init__()

        self.feature_extractor = AnalogConv2d(
            in_channels=1, out_channels=16, kernel_size=5, stride=1
        )
Using analog tiles¶
The core functionality of the package is provided by the rpucuda simulator. The simulator contains the primitives and functionality written in C++ and with CUDA (if enabled), and is exposed to the rest of the package through a Python interface.
The following table lists the main modules involved in accessing the simulator:
| Module | Notes |
|---|---|
| aihwkit.simulator.tiles | Entry point for instantiating analog tiles |
| aihwkit.simulator.configs | Configurations and parameters for analog tiles |
| aihwkit.simulator.rpu_base | Low-level bindings of the C++ simulator members |
Analog tiles¶
The basic primitives involved in the simulation are analog tiles. An analog tile is a two-dimensional array of resistive devices that determine its behavior and properties, i.e. the material response properties when a single update pulse is given (a coincidence between a row and a column pulse train).
The following types of analog tiles are available:
| Tile class | Description |
|---|---|
| FloatingPointTile | Implements a floating point or ideal analog tile. |
| AnalogTile | Implements an abstract analog tile with many cycle-to-cycle non-idealities and systematic parameter-spreads that can be user-defined. |
| InferenceTile | Implements an analog tile for inference and hardware-aware training. |
Creating an analog tile¶
The simplest way of constructing a tile is by instantiating its class. For example, the following snippet would create a floating point tile of the specified dimensions (10x20):
from aihwkit.simulator.tiles import FloatingPointTile
tile = FloatingPointTile(10, 20)
GPU-stored tiles¶
By default, the tiles will be set to perform their computations in the CPU. They can be moved to the GPU by invoking their .cuda() method:
from aihwkit.simulator.tiles import FloatingPointTile
cpu_tile = FloatingPointTile(10, 20)
gpu_tile = cpu_tile.cuda()
This method returns a counterpart of the original tile (for example, for a FloatingPointTile it will return a CudaFloatingPointTile). The GPU-stored tiles share the same interface as the CPU-stored tiles, and their methods can be used in the same manner.
Note
For GPU-stored tiles to be used, the library needs to be compiled with GPU support. This can be checked by inspecting the return value of the aihwkit.simulator.rpu_base.cuda.is_compiled() function.
Using analog tiles¶
Analog tiles are low-level constructs that contain a number of functions that allow using them in the context of neural networks. A full description of the available tiles and their methods can be found at aihwkit.simulator.tiles.
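As a hedged sketch of using a tile directly (assuming, as in the 10x20 example above, that the constructor takes the output size first and that forward() accepts a batch of input vectors):

from torch import ones
from aihwkit.simulator.tiles import FloatingPointTile

# A tile with 10 outputs and 20 inputs (assumption: arguments are out_size, in_size).
tile = FloatingPointTile(10, 20)

# Analog forward pass over a batch of two input vectors.
x = ones(2, 20)
y = tile.forward(x)  # expected shape: [2, 10]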
Resistive processing units¶
A resistive processing unit is each of the elements on the crossbar array. The following types of resistive devices are available:
Floating point devices¶
| Resistive device class | Description |
|---|---|
| FloatingPointDevice | Floating point reference, which implements ideal device forward/backward/update behavior. |
Single resistive devices¶
| Resistive device class | Description |
|---|---|
| PulsedDevice | Pulsed update resistive device containing the common properties of all pulsed devices. |
| IdealDevice | Ideal update behavior (using floating point), but forward/backward might be non-ideal. |
| ConstantStepDevice | Pulsed update behavioral model: constant step, where the update step of the material is constant throughout the resistive range (up to hard bounds). |
| LinearStepDevice | Pulsed update behavioral model: linear step, where the update step response size of the material is linearly dependent on the resistance (up to hard bounds). |
| SoftBoundsDevice | Pulsed update behavioral model: soft bounds, where the update step response size of the material is linearly dependent and goes to zero at the bounds. |
| ExpStepDevice | Exponential update step or CMOS-like update behavior. |
Unit cell devices¶
| Resistive device class | Description |
|---|---|
| VectorUnitCell | Abstract resistive device that combines multiple pulsed resistive devices in a single 'unit cell'. |
| DifferenceUnitCell | Abstract device model that takes an arbitrary device per crosspoint and implements an explicit plus-minus device pair. |
| ReferenceUnitCell | Abstract device model that takes two arbitrary devices per crosspoint and implements a device with a reference pair. |
Compound devices¶
| Resistive device class | Description |
|---|---|
| TransferCompound | Abstract device model that takes 2 or more devices per crosspoint and implements a 'transfer'-based learning rule such as Tiki-Taka (see Gokmen & Haensch 2020). |
RPU Configurations¶
The combination of the parameters that affect the behavior of a tile and the parameters that determine the characteristics of a resistive processing unit are referred to as RPU configurations.
Creating an RPU configuration¶
A configuration can be created by instantiating the class that corresponds to the desired tile. Each kind of configuration has different parameters depending on the particularities of the tile.
For example, for creating a floating point configuration that has the default values for its parameters:
from aihwkit.simulator.configs import FloatingPointRPUConfig
config = FloatingPointRPUConfig()
Among those parameters is the resistive device that will be used for creating the tile. For example, for creating a single resistive device configuration that uses a ConstantStep device:
from aihwkit.simulator.configs import SingleRPUConfig
from aihwkit.simulator.configs.devices import ConstantStepDevice
config = SingleRPUConfig(device=ConstantStepDevice())
Device parameters¶
The parameters of the resistive devices that are part of a tile can be set by passing a rpu_config= parameter to the constructor:
from aihwkit.simulator.tiles import AnalogTile
from aihwkit.simulator.configs import SingleRPUConfig
from aihwkit.simulator.configs.devices import ConstantStepDevice
config = SingleRPUConfig(device=ConstantStepDevice())
tile = AnalogTile(10, 20, rpu_config=config)
Each configuration and device have a number of parameters. The parameters can be specified during the device instantiation, or accessed as attributes of the device instance.
For example, the following snippet will create a LinearStepDevice resistive device, setting its weight limits to [-0.4, 0.6] and other properties of the tile:
# Note: IOParameters, BackwardIOParameters and UpdateParameters also need to be
# imported (from the configuration utilities module; the exact path and the set
# of available parameter classes depend on the aihwkit version).
from aihwkit.simulator.configs import SingleRPUConfig
from aihwkit.simulator.configs.devices import LinearStepDevice

rpu_config = SingleRPUConfig(
    forward=IOParameters(out_noise=0.1),
    backward=BackwardIOParameters(out_noise=0.2),
    update=UpdateParameters(desired_bl=20),
    device=LinearStepDevice(w_min=-0.4, w_max=0.6)
)
A description of the available parameters for each configuration and device can be found at aihwkit.simulator.configs.
An alternative way of specifying non-default parameters is first generating the config with the correct device and then setting the fields directly:
from aihwkit.simulator.configs import SingleRPUConfig
from aihwkit.simulator.configs.devices import LinearStepDevice
rpu_config = SingleRPUConfig(device=LinearStepDevice())
rpu_config.forward.out_noise = 0.1
rpu_config.backward.out_noise = 0.1
rpu_config.update.desired_bl = 20
rpu_config.device.w_min = -0.4
rpu_config.device.w_max = 0.6
This will generate the same analog tile settings as above.
Unit Cell Device¶
More complicated devices require the specification of sub-devices and may have more parameters. For instance, to configure a device that has 3 resistive device materials per crosspoint, which all have different pulse update behavior, one could do the following (see also Example 7):
from aihwkit.nn import AnalogLinear
from aihwkit.simulator.configs import UnitCellRPUConfig
from aihwkit.simulator.configs.devices import (
    ConstantStepDevice,
    VectorUnitCell,
    LinearStepDevice,
    SoftBoundsDevice
)

# Define a single-layer network, using a vector device having multiple
# devices per crosspoint. Each device can be arbitrarily defined.
rpu_config = UnitCellRPUConfig()
rpu_config.device = VectorUnitCell(
    unit_cell_devices=[
        ConstantStepDevice(),
        LinearStepDevice(w_max_dtod=0.4),
        SoftBoundsDevice()
    ]
)

# More configurations, if needed.

# Only one of the devices should receive a single update that is selected
# randomly; the effective weight is the sum of all weights.
rpu_config.device.single_device_update = True
rpu_config.device.single_device_update_random = True

# Use this configuration for a simple model with one analog tile.
model = AnalogLinear(4, 2, bias=True, rpu_config=rpu_config)

# Print information about all parameters.
print(model.analog_tile.tile)
This analog tile, although very complicated in its hardware configuration, can be used in any given network layer in the same way as simpler analog devices. Also, diffusion or decay might affect the sub-devices in different ways, as they all implement their own version of these operations. For the vector unit cell, each weight contribution simply adds up to form a joint effective weight. During the forward/backward passes this joint effective weight is used. The update, however, is done on each of the "hidden" weights independently.
Transfer Compound Device¶
Compound devices are more complex than unit cell devices, which simply have a number of devices per crosspoint; however, they share the same underlying implementation. For instance, the "Transfer Compound Device" contains (at least) two full crossbar arrays internally, where the stochastic gradient descent update is done on one of them (or a subset of them), and the content of the first array is intermittently partially transferred to the second. This transfer is accomplished by doing an extra forward pass (with a one-hot input vector) on the first array and updating the output onto the second array. The parameters of this extra forward and update step can be specified.
This compound device can be used to implement the tiki-taka learning rule as described in Gokmen & Haensch 2020. For instance, one could use the following tile configuration for that (see also Example 8):
# Imports from aihwkit.
from aihwkit.nn import AnalogLinear
from aihwkit.simulator.configs import UnitCellRPUConfig
from aihwkit.simulator.configs.devices import (
    TransferCompound,
    SoftBoundsDevice
)

# The Tiki-taka learning rule can be implemented using the transfer device.
rpu_config = UnitCellRPUConfig(
    device=TransferCompound(
        # Devices that compose the Tiki-taka compound.
        unit_cell_devices=[
            SoftBoundsDevice(w_min=-0.3, w_max=0.3),
            SoftBoundsDevice(w_min=-0.6, w_max=0.6)
        ],
        # Make some adjustments of the way Tiki-Taka is performed.
        units_in_mbatch=True,    # batch_size=1 anyway
        transfer_every=2,        # every 2 batches do a transfer-read
        n_cols_per_transfer=1,   # one forward read for each transfer
        gamma=0.0,               # all SGD weight in second device
        scale_transfer_lr=True,  # in relative terms to SGD LR
        transfer_lr=1.0,         # same transfer LR as for SGD
    )
)

# Make more adjustments (can be made here or above).
rpu_config.forward.inp_res = 1/64.  # 6 bit DAC

# Same forward/update for transfer-read as for actual SGD.
rpu_config.device.transfer_forward = rpu_config.forward

# SGD update/transfer-update will be done with stochastic pulsing.
rpu_config.device.transfer_update = rpu_config.update

# Use the tile configuration in the model.
model = AnalogLinear(4, 2, bias=True, rpu_config=rpu_config)

# Print some parameter infos.
print(model.analog_tile.tile)
Note that this analog tile will now perform Tiki-Taka as the learning rule instead of plain SGD. Once the configuration is done, the usage of this complex analog tile for testing or training is, from the user's point of view, the same as for other tiles.
Inference and PCM statistical model¶
The analog AI hardware kit provides a state-of-the-art statistical model of a phase-change memory (PCM) array that can be used when performing inference to simulate the various sources of noise that are present in a real hardware [1]. This model is calibrated based on extensive measurements performed on an array containing 1 million PCM devices fabricated at IBM [2].
PCM is a key enabling technology for non-volatile electrical data storage at the nanometer scale, which can be used for analog AI [3]. A PCM device consists of a small active volume of phase-change material sandwiched between two electrodes. In PCM, data is stored by using the electrical resistance contrast between a high-conductive crystalline phase and a low-conductive amorphous phase of the phase-change material. The phase-change material can be switched from low to high conductive state, and vice-versa, through applying electrical current pulses. The stored data can be retrieved by measuring the electrical resistance of the PCM device. An appealing attribute of PCM is that the stored data is retained for a very long time (typically 10 years at room temperature), but is written in only a few nanoseconds.

The model simulates three different sources of noise from the PCM array: programming noise, read noise and temporal drift. The model is only used during inference and therefore it is assumed that network weights have been trained beforehand in software. The diagram below explains how these three sources of noise are incorporated during inference when using the statistical model:

Mapping the trained weights to target conductances¶
This step is typically done offline, after training, before programming the hardware. When the final converged network weights \(W\) have been obtained after training, they must be converted to target conductance values \(G_T\) that will be programmed on the hardware, within the range that it supports. In the statistical model, this range is set to \([0,1]\), where \(1\) corresponds to the largest conductance value \(g_\text{max}\) that can be reliably programmed on the hardware.
The statistical model assumes that each weight is programmed on two PCM devices in a differential configuration. That is, depending on the sign of the weight, either the device encoding the positive part of the weight or the one encoding the negative part is programmed, and the other device is set to 0. Thus, the simplest way to map the weights to conductances is to multiply the weights by a scaling factor \(\beta\), which is different for every network layer. A simple approach is to use \(\beta = 1/w_\text{max}\), where \(w_\text{max}\) is the maximum absolute weight value of a layer.
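A minimal sketch of this mapping (a hypothetical helper for illustration only, not an aihwkit API; the default g_max of 25 is just an example value):

import torch

def weights_to_conductances(weights: torch.Tensor, g_max: float = 25.0):
    """Map trained weights onto a differential pair of target conductances.

    Illustrative only: beta = 1 / w_max scales the layer into [-1, 1], and each
    weight is encoded by either the positive or the negative device of the pair,
    with the other device of the pair set to 0.
    """
    w_max = weights.abs().max()
    scaled = weights / w_max
    g_plus = g_max * scaled.clamp(min=0)
    g_minus = g_max * (-scaled).clamp(min=0)
    return g_plus, g_minus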
Programming noise¶
After the target conductances have been defined, they are programmed on the PCM devices of the hardware using a closed-loop iterative write-read-verify scheme [4]. The conductance values programmed in this way on the hardware will have a certain error compared with the target values. This error is characterized by the programming noise. The programming noise is modeled based on the standard deviation of the iteratively programmed conductance values measured from hardware.
The equations used in the statistical model to implement the programming noise are as follows (where we use lower-case letters for the elements of the matrices \(W\) and \(G_T\), etc., and omit the indices for brevity):
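In generic form, consistent with the description above (the fitted polynomial for \(\sigma_\text{prog}\) as a function of the target conductance is given in [1] and is not reproduced here), the programmed conductance is the target value plus a zero-mean Gaussian whose standard deviation depends on the target state:

\(g_\text{prog} = g_T + {\cal N}(0, \sigma_\text{prog}(g_T))\)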
The fit between this equation and the hardware measurement is shown below:

Drift¶
After they have been programmed, the conductance values of PCM devices drift over time. This drift is an intrinsic property of the phase-change material of a PCM device and is due to structural relaxation of the amorphous phase [5]. Knowing the conductance at time \(t_c\) from the last programming pulse, \(g_\text{prog}\), the conductance evolution can be modeled as:
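A minimal sketch of this relation, in the standard form of the empirical drift model used in [5]:

\(g_\text{drift}(t) = g_\text{prog} \left(\frac{t}{t_c}\right)^{-\nu}\)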
where \(\nu\) is the so-called drift exponent and is sampled from \({\cal N}(\mu_\nu,\sigma_\nu)\). \(\nu\) exhibits variability across a PCM array and a dependence on the target conductance state \(g_T\). The mean drift exponent \(\mu_\nu\) and its standard deviation \(\sigma_\nu\) measured from hardware can be modeled with the following equations:
The fits between these equations and the hardware measurements are shown below:

Read noise¶
When performing a matrix-vector multiplication with the in-memory computing hardware, after the weights have been programmed, there will be instantaneous fluctuations on the hardware conductances due to the intrinsic noise from the PCM devices. PCM exhibits \(1/f\) noise and random telegraph noise characteristics, which alter the effective conductance values used for computation. This noise is referred to as read noise, because it occurs when the devices are read after they have been programmed.
The power spectral density \(S_G\) of the \(1/f\) noise in PCM is given by the following relationship:
The standard deviation of the read noise \(\sigma_{nG}\) at time \(t\) is obtained by integrating the above equation over the measurement bandwidth:
where \(t_{read} = 250\) ns is the width of the pulse applied when reading the devices.
The \(Q_s\) measured from the PCM devices as a function of \(g_T\) is given by:
The final simulated PCM conductance from the model at time \(t\), \(g(t)\), is given by:
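As a sketch of how the contributions described above combine (the drifted conductance plus a zero-mean Gaussian read-noise term):

\(g(t) = g_\text{drift}(t) + {\cal N}(0, \sigma_{nG}(t))\)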
Compensation method to mitigate the effect of drift¶
The conductance drift of PCM devices can have a very detrimental effect on the inference performance of a model mapped to hardware. This is because the magnitude of the PCM weights gradually reduces over time due to drift and this prevents the activations from properly propagating throughout the network. A simple global scaling calibration procedure can be used to compensate for the effect of drift on the matrix-vector multiplications performed with PCM crossbars. As proposed in [5], the summed current of a subset of the columns in the array can be periodically read over time at a constant voltage. The resulting total current is then divided by the summed current of the same columns but read at time \(t_0\). This results in a single scaling factor, \(\hat{\alpha}\), that can be applied to the output of the entire crossbar in order to compensate for a global conductance shift.
The figure below explains how the drift calibration procedure can be performed in hardware:

In the simulator, we implement drift compensation by performing a forward pass with an all-ones vector as input, and then summing the outputs (using the potential non-idealities defined for the forward pass) in an absolute way. This procedure is done once after programming and once after applying the drift expected at the inference time point \(t_\text{inference}\). The ratio of the two numbers is the global drift compensation scaling factor of that layer, and it is applied (in digital) to the (digital) output of the analog tile.
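The idea can be sketched with plain tensors (illustrative only, not the aihwkit implementation; the two callables stand for the noisy analog forward pass of the same tile at the two points in time):

import torch

def drift_compensation_alpha(forward_after_programming, forward_at_inference, in_size):
    """Compute the global drift compensation factor of one tile (sketch).

    Both arguments are callables performing the analog forward pass; the factor
    is the ratio of the absolute-summed outputs for an all-ones input vector,
    taken once right after programming and once at the inference time point.
    """
    ones = torch.ones(1, in_size)
    reference = forward_after_programming(ones).abs().sum()
    drifted = forward_at_inference(ones).abs().sum()
    return (reference / drifted).item()

# The resulting factor is applied (in digital) to the output of the analog tile.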
Note that the drift compensation class BaseDriftCompensation is user-extendable, so that new drift compensation methods can be added easily.
Example of how to use the PCM noise model for inference¶
The above noise model for inference can be used in our package in the following way. Instead of using a regular analog tile, which is catered to doing training on analog hardware with pulsed updates and other features (see Section Using analog tiles), you can use an inference tile that only has non-idealities in the forward pass, but a perfect update and backward pass. Moreover, for inference, weights can be subject to realistic weight noise and drift as described above. To enable these inference features, one has to build a model using our InferenceTile (see also example 5):
# Note: this snippet assumes that InferenceRPUConfig, WeightNoiseType,
# PCMLikeNoiseModel, GlobalDriftCompensation and AnalogLinear have already been
# imported (the exact module paths depend on the aihwkit version).

# Define a single-layer network, using an inference/hardware-aware training tile.
rpu_config = InferenceRPUConfig()

# Specify additional options of the non-idealities in forward to your liking.
rpu_config.forward.inp_res = 1/64.   # 6-bit DAC discretization.
rpu_config.forward.out_res = 1/256.  # 8-bit ADC discretization.
rpu_config.forward.w_noise_type = WeightNoiseType.ADDITIVE_CONSTANT
rpu_config.forward.w_noise = 0.02    # Some short-term w-noise.
rpu_config.forward.out_noise = 0.02  # Some output noise.

# Specify the noise model to be used for inference only.
rpu_config.noise_model = PCMLikeNoiseModel(g_max=25.0)  # the model described above

# Specify the drift compensation.
rpu_config.drift_compensation = GlobalDriftCompensation()

# Build the model (here just one simple linear layer).
model = AnalogLinear(4, 2, rpu_config=rpu_config)
Once the DNN is trained (automatically using hardware-aware training, if the forward pass has some non-idealities and noise included), then the inference with drift and drift compensation is done in the following manner:
model.eval()         # model needs to be in inference mode
t_inference = 3600.  # time of inference in seconds (after programming)

program_analog_weights(model)             # can also be omitted, as it is called below in any case
drift_analog_weights(model, t_inference)  # modifies weights according to the noise model

# Now the model can be evaluated with programmed/drifted/compensated weights.
Note that we have two types of non-idealities included here. For the first, the longer-term weight noise and drift (as described above), we assume that during the evaluation the weight-related PCM noise and the drift are applied once and the weights are then kept constant. Thus, a subsequent test error calculation over the full test set signifies the expected test error of the model at the given time. Ideally, one would want to repeat this for different weight noise and drift instances and/or different inference times to properly assess the accuracy degradation.
The second type of non-idealities is short-term and at the level of a single analog MACC (multiply and accumulate). Noise at that level varies with each use of the analog tile, and is specified in rpu_config.forward.
For details on the implementation of our inference noise model, please consult PCMLikeNoiseModel. In particular, we use a SinglePairConductanceConverter to convert weights into conductance pairs and then apply the noise to both of these pairs. More elaborate mapping schemes can be incorporated by extending BaseConductanceConverter.
References¶
[1] Nandakumar, S. R., Boybat, I., Joshi, V., Piveteau, C., Le Gallo, M., Rajendran, B., … & Eleftheriou, E. Phase-change memory models for deep learning training and inference. In 26th IEEE International Conference on Electronics, Circuits and Systems (ICECS) (pp. 727-730). 2019
[2] Joshi, V., Le Gallo, M., Haefeli, S., Boybat, I., Nandakumar, S. R., Piveteau, C., … & Eleftheriou, E. Accurate deep neural network inference using computational phase-change memory. Nature Communications, 11, 2473. 2020
[3] Le Gallo, M., & Sebastian, A. An overview of phase-change memory device physics. Journal of Physics D: Applied Physics, 53(21), 213002. 2020
[4] Papandreou, N., Pozidis, H., Pantazi, A., Sebastian, A., Breitwisch, M., Lam, C., & Eleftheriou, E. Programming algorithms for multilevel phase-change memory. In IEEE International Symposium of Circuits and Systems (ISCAS) (pp. 329-332). 2011
[5] Le Gallo, M., Krebs, D., Zipoli, F., Salinga, M., & Sebastian, A. Collective Structural Relaxation in Phase‐Change Memory Devices. Advanced Electronic Materials, 4(9), 1700627. 2018
[6] Le Gallo, M., Sebastian, A., Cherubini, G., Giefers, H., & Eleftheriou, E. Compressed sensing with approximate message passing using in-memory computing. IEEE Transactions on Electron Devices, 65(10), 4304-4312. 2018
aihwkit design¶
aihwkit layers¶
The architecture of the library is comprised of several layers:

PyTorch layer¶
The PyTorch layer is the high-level layer that provides primitives to users for using the features on the library from PyTorch, in particular layers and optimizers.
Overall, the elements on this layer take advantage of PyTorch facilities (inheriting from existing PyTorch classes and integrating with the rest of the PyTorch features), replacing the default functionality with calls to a Tiles object from the simulator abstraction layer.
Relevant modules:
Python simulator abstraction layer¶
This layer provides a series of Python objects that can be transparently manipulated and used like any other existing Python functionality, without requiring explicit references to the lower-level constructs. Providing this separate Python interface allows for greater flexibility when defining it, keeping all the extra operations and calls to the real bindings internal and performing any translations on behalf of the user.
The main purpose of this layer is to abstract away the implementation-specific complexities of the simulator layers, and to map the structures and classes into an interface that caters to the needs of the PyTorch layer. This also provides benefits with regard to serialization and to separating concerns overall.
Relevant modules:
- aihwkit.simulator.devices
- aihwkit.simulator.parameters
Pybind Python layer¶
This layer is the bridge between C++ and Python. The Python classes and functions in this layer are built using Pybind, and in general consist of exposing selected classes and methods from the C++ simulator, handling the conversion between specific types.
As a result, using the classes from this layer is very similar to how the C++ classes would be used. This is purposeful: by keeping the mapping close to 1:1 on this layer, we (and users that are experimenting directly with the simulator) benefit from being able to translate code almost directly. However, in general users are encouraged not to use the objects from this layer directly, as it involves extra overhead and precautions that are managed by the upper classes.
aihwkit.simulator.rpu_base.parameters
C++ layer¶
Ultimately, this is the layer where the real operations over Tiles take place, and the one that implements the actual simulation and most of the features. It is not directly accessible from Python - however, it can actually be used directly from other C++ programs by using the provided headers.
Layer interaction example¶
For example, using this excerpt of code:
1 | model = AnalogLinear(2, 1)
2 | opt = AnalogSGD(model.parameters(), lr=0.5)
3 |
4 | ...
5 | for epoch in range(100):
6 |     pred = model(x_b)
7 |     loss = mse_loss(pred, y_b)
8 |     loss.backward()
9 |     opt.step()
The AnalogLinear constructor (line 1) will:
- create an aihwkit.simulator.tiles.FloatingPointTile. As no extra arguments are passed to the constructor, it will also create as a default a FloatingPointResistiveDevice that uses the default FloatingPointResistiveDeviceParameters parameters. These three objects are the ones from the pure-Python layer.
- internally, the aihwkit.simulator.tiles.FloatingPointTile constructor will create an aihwkit.simulator.rpu_base.tiles.FloatingPointTile instance, along with other objects. These objects are not exposed to the PyTorch layer, and are the ones from the Pybind bindings layer at aihwkit.simulator.rpu_base.
- instantiating the bindings classes will create the C++ objects internally.
The AnalogSGD constructor (line 2) will:
- set up the optimizer, using the attributes of the AnalogLinear layer in order to identify which Parameters are to be handled differently during the optimization.
During the training loop (lines 6-8), the forward and backward steps will be performed in the analog tile:
- for the AnalogLinear layer, PyTorch will call the function defined at aihwkit.nn.functions.AnalogFunction.
- these functions will call the forward() and backward() functions defined in the aihwkit.simulator.tiles.FloatingPointTile of the layer.
- in turn, they will delegate to the forward() and backward() functions defined in the bindings, which in turn delegate to the C++ methods.
The optimizer (line 9) will perform the update step in the analog tile:
- using the information constructed during its initialization, the AnalogSGD will retrieve the reference to the aihwkit.simulator.tiles.FloatingPointTile, calling its update() function.
- in turn, it will delegate to the update() function defined in the bindings object, which in turn delegates to the C++ method.
Development setup¶
This section is a complement to the Advanced installation guide section, with the goal of setting up a development environment and a development version of the package.
For convenience, we suggest creating a virtual environment as a way to isolate your development environment:
$ python3 -m venv aihwkit_env
$ cd aihwkit_env
$ source bin/activate
(aihwkit_env) $
Downloading the source¶
The first step is downloading the source of the library:
(aihwkit_env) $ git clone https://github.com/IBM/aihwkit.git
(aihwkit_env) $ cd aihwkit
Note
The following sections assume that the command line examples are executed in the activated aihwkit_env environment, and from the folder where the sources have been cloned.
Compiling the library for development¶
After installing the requirements listed in the Advanced installation guide section, the shared library can be compiled using the following convenience command:
$ python setup.py build_ext --inplace
This will produce a shared library under the src/aihwkit/simulator directory, without installing the package.
As an alternative, you can use cmake directly for finer control over the compilation and to make it easier to debug potential issues:
$ mkdir build
$ cd build
build$ cmake ..
build$ make
Note that the build system uses a temporary _skbuild folder for caching some steps of the compilation. While this is useful when making changes to the source code, in some cases environment changes (such as installing a new version of the dependencies, or switching the compiler) are not picked up correctly, and the output of the compilation can be different than expected if the folder is present.
If the compilation was not successful, it is recommended to manually remove the folder and re-run the compilation in a clean state via:
$ make clean
Using the compiled version of the library¶
Once the library is compiled, the shared library will be created under the src/aihwkit/simulator directory. By default, this folder is not in the path that Python uses for finding modules: it needs to be added to the PYTHONPATH accordingly, by either:
- Updating the environment variable for the session:
$ export PYTHONPATH=src/
- Prepending PYTHONPATH=src/ to the commands where the library needs to be found:
$ PYTHONPATH=src/ python examples/01_simple_layer.py
Note
Please be aware that, if the PYTHONPATH is not modified and there is a version of aihwkit installed via pip, by default Python will use the installed version, as opposed to the custom-compiled version. When developing the library, it is recommended to remove the pip-installed version in order to minimize the risk of confusion:
$ pip uninstall aihwkit
Compilation flags¶
There are several cmake options that can be used for customizing the compilation process:
| Flag | Description | Default |
|---|---|---|
| USE_CUDA | Build with CUDA support | OFF |
| BUILD_TEST | Build the C++ test binaries | OFF |
| RPU_BLAS | BLAS backend of choice (OpenBLAS or MKL) | OpenBLAS |
| RPU_USE_FASTMOD | Use fast mod | 0 |
| RPU_USE_FASTRAND | Use fastrand | 0 |
| RPU_CUDA_ARCHITECTURES | Target CUDA architectures | 60 |
The options can be passed both to setuptools or to cmake directly. For example, for compiling and installing with CUDA support:
$ python setup.py build_ext --inplace -DUSE_CUDA=ON -DRPU_CUDA_ARCHITECTURES="60;70"
or if using cmake directly:
build$ cmake -DUSE_CUDA=ON -DRPU_CUDA_ARCHITECTURES="60;70" ..
Passing other cmake flags¶
In the same way that flags specific to this project can be passed to setup.py, other generic cmake flags can be passed as well. For example, for setting the compiler to clang on OSX systems:
$ python setup.py build_ext --inplace -DCMAKE_C_COMPILER=/usr/bin/clang -DCMAKE_CXX_COMPILER=/usr/bin/clang++
Environment variables¶
The following environment variables are taken into account during the build process:
| Environment variable | Description |
|---|---|
| | If present, sets the |
Development conventions¶
aihwkit is an open source project. This section describes how we organize the work and the conventions and procedures we use for developing the library.
Code conventions¶
In order to keep the codebase consistent and assist us in spotting bugs and issues, we use different tools:
Python:
- pycodestyle: for ensuring that we conform to PEP-8, as the minimal common style standard.
- pylint: for being able to identify common pitfalls and potential issues in the code, along with additional style conventions.
- mypy: for taking advantage of type hints and being able to identify issues before runtime, helping maintenance.

C++:
- clang-format: for providing a unified style to the C++ sources. Note that different versions result in slightly different output - please use the 10.x versions.

Testing:
- pytest: while we strive for keeping the project tests stdlib-compatible, we encourage using pytest as the test runner for its advanced features.
For convenience, a Makefile is provided in the project, in order to invoke the different tools easily. For example:
make pycodestyle
make pylint
make mypy
make clang-format
Continuous integration¶
The project uses continuous integration: when a new pull request is made or updated, the different tools and the tests will automatically be run under different environments (different Python versions, operating systems).
We rely on the result of those checks to help review pull requests: when contributing, please make sure to review the result of the continuous integration in order to help fix potential issues.
Branches and releases¶
For the branches organization:
- the master branch contains the latest changes and updates. We strive for keeping the branch runnable and working, but its contents can be considered experimental and "bleeding edge".
When the time for a new release comes:
- a new git tag is created. This tag can be used for referencing that stable version of the codebase.
- a new package is published on PyPI.
This package uses semantic versioning for the version numbers, albeit with an extra part as we are still in beta. For a version number 0.MAJOR.MINOR, we strive to:
- increase the MAJOR number when we make incompatible API changes.
- increase the MINOR number when we add functionality that is backwards compatible, or make backwards compatible bug fixes.
Please be aware that during the initial development rounds, there are cases where we might not be able to adhere fully to the convention.
Project roadmap¶
You are one of the early users of the IBM Analog Hardware Acceleration Kit. The initial releases have been focused on providing a basic PyTorch integration for exploring selected features of the analog devices simulator, and on setting the basis that will be extended and improved upon:
integration of more simulator features in the PyTorch interface
tools to improve inference accuracy by converting pre-trained models with hardware-aware training
algorithmic tools to improve training accuracy by compensating for material shortcomings
additional analog neural network layers
additional analog optimizers
custom network architectures and dataset/model zoos
integration with the cloud
hardware demonstrators
This document will be updated with more details as the roadmap for the project evolves. As a companion, please refer to the Issues tab in the repository for more in-depth details about the status of the implementation of the different features and a sneak peek into the next release.
We have an ambitious plan to incrementally bring new simulation and hardware features to our users, but we are eager to hear your feedback on the features of value for your work. Please contact us at aihwkit@us.ibm.com for any feedback or information.
Changelog¶
All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning:
- Added for new features.
- Changed for changes in existing functionality.
- Deprecated for soon-to-be removed features.
- Removed for now removed features.
- Fixed for any bug fixes.
- Security in case of vulnerabilities.
[0.2.1] - 2020/11/26¶
Added¶
- The rpu_config is now pretty-printed in a readable manner (excluding the default settings and other readability tweaks). (#60)
- Added a new ReferenceUnitCell which has two devices, where one is fixed and the other updated, and the effective weight is computed as the difference between the two. (#61)
- VectorUnitCell now accepts arbitrary weighting schemes that can be user-defined by using a new gamma_vec property that specifies how to combine the unit cell devices to form the effective weight. (#61)
Changed¶
- The unit cell items in aihwkit.simulator.configs have been renamed, removing their Device suffix, for having a more consistent naming scheme. (#57)
- The Exceptions raised by the library have been revised, making use in some cases of the ones introduced in a new aihwkit.exceptions module. (#49)
- Some VectorUnitCell properties have been renamed and extended with an update policy specifying how to select the hidden devices. (#61)
- The pybind11 version required has been bumped to 2.6.0, which can be installed from pip and makes system-wide installation no longer required. Please update your pybind11 accordingly for compiling the library. (#44)
Removed¶
- The BackwardIOParameters specialization has been removed, as bound management is now automatically ignored for the backward pass. Please use the more general IOParameters instead. (#45)
Fixed¶
- Serialization of Modules that contain children analog layers is now possible, both when using containers such as Sequential and when using analog layers as custom Module attributes. (#74, #80)
- The build system has been improved, with experimental Windows support and support for using CUDA 11 correctly. (#58, #67, #68)
0.2.0 - 2020/10/20¶
Added¶
- Added more types of resistive devices: IdealResistiveDevice, LinearStep, SoftBounds, ExpStep, VectorUnitCell, TransferCompoundDevice, DifferenceUnitCell. (#14)
- Added a new InferenceTile that supports basic hardware-aware training and inference using a statistical noise model that was fitted to real PCM devices. (#25)
- Added a new AnalogSequential layer that can be used in place of Sequential for easier operation on children analog layers. (#34)
Changed¶
- Specifying the tile configuration (resistive device and the rest of the properties) is now based on a new RPUConfig family of classes, which is passed as an rpu_config argument instead of resistive_device to Tiles and Layers. Please check the aihwkit.simulator.config module for more details. (#23)
- The different analog tiles are now organized into an aihwkit.simulator.tiles package. The internal IndexedTiles have been removed, and the rest of the previous top-level imports have been kept. (#29)
Fixed¶
- Improved package compatibility when using non-UTF8 encodings (version file, package description). (#13)
- The build system can now detect and use openblas directly when using the conda-installable version. (#22)
- When using analog layers as children of another module, the tiles are now correctly moved to CUDA if using AnalogSequential (or by the optimizer if using regular torch container modules). (#34)
API Reference¶
| Module | Description |
|---|---|
| aihwkit | Analog hardware library for PyTorch. |
| aihwkit.exceptions | Custom Exceptions for aihwkit. |
| aihwkit.simulator.rpu_base | RPU simulator bindings. |
| aihwkit.simulator.configs | Configurations for resistive processing units. |
| aihwkit.simulator.noise_models | Phenomenological noise models for inference. |
| aihwkit.simulator.tiles | High level analog tiles. |
| aihwkit.nn | Neural network modules. |
| aihwkit.nn.functions | Autograd functions for aihwkit. |
| aihwkit.nn.modules | Neural network modules. |
| aihwkit.nn.modules.base | Base class for analog Modules. |
| aihwkit.nn.modules.conv | Convolution layers. |
| aihwkit.nn.modules.linear | Analog layers. |
| aihwkit.optim | Analog Optimizers. |
| aihwkit.optim.analog_sgd | Analog-aware stochastic gradient descent optimizer. |
IBM Analog Hardware Acceleration Kit is an open source Python toolkit for exploring and using the capabilities of in-memory computing devices in the context of artificial intelligence.
Components¶
The toolkit consists of two main components:
PyTorch integration¶
A series of primitives and features that allow using the toolkit within PyTorch:
Analog neural network modules (fully connected layer, convolution layer).
Analog optimizers (SGD).
Analog devices simulator¶
A high-performance (CUDA-capable) C++ simulator that allows for simulating a wide range of analog devices and crossbar configurations by using abstract functional models of material characteristics with adjustable parameters. Features include:
Forward pass output-referred noise and device fluctuations, as well as adjustable ADC and DAC discretization and bounds
Stochastic update pulse trains for rows and columns with finite weight update size per pulse coincidence
Device-to-device systematic variations, cycle-to-cycle noise and adjustable asymmetry during analog update
Adjustable device behavior for exploration of material specifications for training and inference
State-of-the-art dynamic input scaling, bound management, and update management schemes
Warning
This library is currently in beta and under active development. Please be mindful of potential issues, and keep an eye out for improvements, new features and bug fixes in upcoming versions.
Example¶
from torch import Tensor
from torch.nn.functional import mse_loss
from aihwkit.nn import AnalogLinear
from aihwkit.optim import AnalogSGD
x = Tensor([[0.1, 0.2, 0.4, 0.3], [0.2, 0.1, 0.1, 0.3]])
y = Tensor([[1.0, 0.5], [0.7, 0.3]])
# Define a network using a single Analog layer.
model = AnalogLinear(4, 2)
# Use the analog-aware stochastic gradient descent optimizer.
opt = AnalogSGD(model.parameters(), lr=0.1)
opt.regroup_param_groups(model)
# Train the network.
for epoch in range(10):
    # Reset the gradients accumulated in the previous iteration.
    opt.zero_grad()

    pred = model(x)
    loss = mse_loss(pred, y)
    loss.backward()

    opt.step()
    print('Loss error: {:.16f}'.format(loss))