Welcome to IBM Analog Hardware Acceleration Kit’s documentation!

Installation

The preferred way to install this package is by using the Python package index:

pip install aihwkit

Note

During the initial beta stage, we do not provide pip wheels (as in, pre-compiled binaries) for all the possible platform, version and architecture combinations (in particular, only CPU versions are provided).

Please refer to the Advanced installation guide page for instructions on how to compile the library for your environment in case you encounter errors when installing from pip.

The package requires the following runtime libraries to be installed in your system:

Note

Please note that the current pip wheels are only compatible with PyTorch 1.6.0. If you need to use a different PyTorch version, please refer to the Advanced installation guide section in order to compile a custom version. More details about the PyTorch compatibility can be found in this issue.

Optional features

The package contains optional functionality that is not installed as part of the default installation. In order to install the extra dependencies, the recommended way is to specify the extra visualization dependencies:

pip install aihwkit[visualization]

Verifying the installation

If the library was installed correctly, you can use the following snippet for creating an analog layer and predicting the output:

from torch import Tensor
from aihwkit.nn import AnalogLinear

model = AnalogLinear(2, 2)
model(Tensor([[0.1, 0.2], [0.3, 0.4]]))

If you encounter any issues during the installation or when executing the snippet, please refer to the Advanced installation guide section for more details, and don't hesitate to use the issue tracker for additional support.

Next steps

You can read more about the PyTorch layers in the Using the pytorch integration section, and about the internal analog tiles in the Using aihwkit Simulator section.

1

Note that GPU support is not available in OSX, as it requires a platform with official CUDA support.

Advanced installation guide

Compilation

The build system for aihwkit is based on cmake, making use of scikit-build for generating the Python packages.

Some of the dependencies and tools are Python-based. For convenience, we suggest creating a virtual environment as a way to isolate your environment:

$ python3 -m venv aihwkit_env
$ cd aihwkit_env
$ source bin/activate
(aihwkit_env) $

Note

The following sections assume that the command line examples are executed in the activated aihwkit_env environment.

Dependencies

For compiling aihwkit, the following dependencies are required:

Dependency                     Version    Notes

C++11 compatible compiler
cmake                          3.18+
pybind11                       2.6.2+     Versions 2.6.0+ can be installed using pip (recommended)
scikit-build                   0.11.0+
Python 3 development headers   3.6+
BLAS implementation                       OpenBLAS or Intel MKL
PyTorch                        1.7+       The libtorch library and headers are needed 1
OpenMP                         11.0.0+    Optional, OpenMP library and headers 2
CUDA                           9.0+       Optional, for GPU-enabled simulator
Nvidia CUB                     1.8.0      Optional, for GPU-enabled simulator 4
googletest                     1.10.0     Optional, for building the C++ tests 4

Please refer to your operating system documentation for instructions on how to install the different dependencies. The following sections contain quick instructions for several operating systems:

Debian-based

On a Debian-based operating system, the following commands can be used for installing the minimal dependencies:

$ sudo apt-get install python3-dev libopenblas-dev
$ pip install cmake scikit-build torch pybind11

OSX

On an OSX-based system, the following commands can be used for installing the minimal dependencies (note that Xcode needs to be installed):

$ brew install openblas
$ pip install cmake scikit-build torch pybind11

miniconda

On a miniconda-based system, the following commands can be used for installing the minimal dependencies 3:

$ conda install cmake openblas pybind11
$ conda install -c conda-forge scikit-build
$ conda install -c pytorch pytorch

Windows using conda (Experimental)

On a Windows-based system, the following instructions can be used for installing the dependencies:

  1. Install (regular) Miniconda, install the newest CUDA driver (if available) and the MS Visual Studio 2019 community edition with the Desktop development with C++ workload.

  2. Start anaconda powershell (miniconda) and install the following packages:

    $ conda install pybind11 scikit-build
    $ conda install pytorch -c pytorch
    $ conda install -c intel mkl mkl-devel mkl-static mkl-include
    

Using this method, please make sure that the flags -DRPU_BLAS=MKL and -G "Visual Studio 16 2019" are passed to the installation and compilation commands. In particular, use the following command instead of the default one in the Installing and compiling sub-section:

$ pip install -v aihwkit --install-option="-DUSE_CUDA=ON" --install-option="-DRPU_BLAS=MKL" --install-option="-GVisual Studio 16 2019"

Windows with OpenBLAS (Experimental)

As an alternative on Windows-based systems, compilation using OpenBLAS is also possible. We recommend installing OpenBLAS following this OpenBLAS - Visual Studio installation and usage guide. It requires installing MS Visual Studio 2019 and Miniconda.

After compiling and installing OpenBLAS, in the same Miniconda terminal, the following commands can be used for installing the minimal dependencies:

$ conda install pybind11 scikit-build
$ conda install pytorch -c pytorch

For compiling aihwkit, it is recommended to use the x64 Native Tools Command Prompt for VS 2019.

Note

If you want to use pip instead of conda, the following commands can be used:

$ pip install cmake scikit-build pybind11
$ pip install torch -f https://download.pytorch.org/whl/torch_stable.html

Installing and compiling

Once the dependencies are in place, the following command can be used for compiling and installing the Python package:

$ pip install -v aihwkit

This command will:

  • download the source tarball for the library.

  • invoke scikit-build, which in turn will invoke cmake for the compilation.

  • execute the commands in verbose mode, to help troubleshoot issues.

  • install the Python package.

If there are any issues with the dependencies or the compilation, the output of the command will help diagnose the problem.

Note

Please note that the instructions on this page refer to installing as an end user. If you are planning to contribute to the project, an alternative setup and tips can be found at the Development setup section, which is more tuned towards the needs of a development cycle.

1

This library uses PyTorch as both a build dependency and a runtime dependency. Please ensure that your torch installation includes libtorch and the development headers - they are included by default if installing torch from pip.

2

Support for parts of OpenMP 4.0+ is used. Some compilers, such as LLVM/Clang, do not bundle OpenMP support by default. If you want to add shared memory processing support to the library using one of these compilers, you will need to install the OpenMP library on your system.

3

Please note that currently support for conda-based distributions is experimental, and further commands might be needed.

4

Both Nvidia CUB and googletest are downloaded and compiled automatically during the build process. As a result, they do not need to be installed manually.

Using the pytorch integration

This library exposes most of its higher-level features as PyTorch primitives, in order to take advantage of the rest of the PyTorch framework and integrate analog layers and other features in the regular workflow.

The following table lists the main modules that provide integration with PyTorch:

Module          Notes

aihwkit.nn      Analog Modules (layers) and Functions
aihwkit.optim   Analog Optimizers

Analog layers

An analog layer is a neural network module that stores its weights in an analog tile. The library currently includes the following analog layers:

  • AnalogLinear: applies a linear transformation to the input data. It is the counterpart of PyTorch nn.Linear layer.

  • AnalogConv1d: applies a 1D convolution over an input signal composed of several input planes. It is the counterpart of PyTorch nn.Conv1d layer.

  • AnalogConv2d: applies a 2D convolution over an input signal composed of several input planes. It is the counterpart of PyTorch nn.Conv2d layer.

  • AnalogConv3d: applies a 3D convolution over an input signal composed of several input planes. It is the counterpart of PyTorch nn.Conv3d layer.

Using analog layers

The analog layers provided by the library can be used in a similar way to a standard PyTorch layer, by creating an object. For example, the following snippet creates a linear layer with 5 input features and 3 output features:

from aihwkit.nn import AnalogLinear

model = AnalogLinear(5, 3)

By default, the AnalogLinear layer uses a bias, and uses a FloatingPointTile as the underlying tile for the analog operations. These values can be modified by passing additional arguments to the constructor.
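
For example, a minimal sketch of passing such arguments (FloatingPointRPUConfig is the configuration class described later in the Using aihwkit Simulator section):

from aihwkit.nn import AnalogLinear
from aihwkit.simulator.configs import FloatingPointRPUConfig

# Create a layer without bias and with an explicit tile configuration.
model = AnalogLinear(5, 3, bias=False, rpu_config=FloatingPointRPUConfig())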

The analog layers will perform the forward and backward passes directly in the underlying tile.

Overall, the layer can be combined and used as if it was a standard torch layer. As an example, it can be mixed with existing layers:

from aihwkit.nn import AnalogLinear, AnalogSequential
from torch.nn import Linear

model = AnalogSequential(
    AnalogLinear(2, 3),
    Linear(3, 3),
    AnalogLinear(3, 1)
)

Note

When using analog layers, please be aware that the Parameters of the layers (model.weight and model.bias) are not guaranteed to be in sync with the actual weights and biases used internally by the analog tile, as reading back the weights has a performance cost. If you need to ensure that the tensors are synced, please use the set_weights() and get_weights() methods.
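
For example, a minimal sketch of syncing the tensors explicitly (assuming the layer created above, and that get_weights() returns a (weights, biases) tuple):

# Read the weights back from the analog tile ...
weights, biases = model.get_weights()

# ... or write known tensors into the analog tile.
model.set_weights(weights, biases)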

Customizing the analog tile properties

The snippet from the previous section can be extended for specifying that the underlying analog tile should use a ConstantStep resistive device, with a specific value for one of its parameters (w_min):

from aihwkit.nn import AnalogLinear
from aihwkit.simulator.configs import SingleRPUConfig
from aihwkit.simulator.configs.devices import ConstantStepDevice

config = SingleRPUConfig(device=ConstantStepDevice(w_min=-0.4))
model = AnalogLinear(5, 3, bias=False, rpu_config=config)

You can read more about analog tiles in the Using aihwkit Simulator section.

Using CUDA

If your version of the library is compiled with CUDA support, you can use GPU-aware analog layers for improved performance:

model = model.cuda()

This will move the layers' parameters (the weight and bias tensors) to CUDA tensors, and convert the analog tiles of the layers to CUDA-enabled analog tiles.

Note

Note that if you use analog layers that are children of other modules, some of the features need to be applied manually to the analog layers directly (instead of only to the parent module). Please check the rest of the document for more information about using AnalogSequential instead of nn.Sequential as the parent class, for convenience.

Optimizers

An analog optimizer is a representation of an algorithm that determines the training strategy taking into account the particularities of the analog layers involved. The library currently includes the following optimizers:

  • AnalogSGD: implements stochastic gradient descent for analog layers. It is the counterpart of PyTorch optim.SGD optimizer.

Using analog optimizers

The analog optimizers provided by the library can be used in a similar way to a standard PyTorch optimizer, by creating an object. For example, the following snippet creates an analog-aware stochastic gradient descent optimizer with a learning rate of 0.1, and sets it up for use with the analog layers of the model:

from aihwkit.optim import AnalogSGD

optimizer = AnalogSGD(model.parameters(), lr=0.1)
optimizer.regroup_param_groups(model)

Note

The regroup_param_groups() method needs to be invoked in order to set up the parameter groups, as they are used for handling the analog layers correctly.

The AnalogSGD optimizer will behave in the same way as the regular optim.SGD optimizer for non-analog layers in the model. For the analog layers, the updating of the weights is performed directly in the underlying analog tile, according to the properties set for that particular layer.

Training example

The following example combines the usage of analog layers and analog optimizer in order to perform training:

from torch import Tensor
from torch.nn.functional import mse_loss

from aihwkit.nn import AnalogLinear
from aihwkit.optim import AnalogSGD

x = Tensor([[0.1, 0.2, 0.4, 0.3], [0.2, 0.1, 0.1, 0.3]])
y = Tensor([[1.0, 0.5], [0.7, 0.3]])

model = AnalogLinear(4, 2)
optimizer = AnalogSGD(model.parameters(), lr=0.1)
optimizer.regroup_param_groups(model)

for epoch in range(10):
    # Reset the gradients before each iteration.
    optimizer.zero_grad()
    pred = model(x)
    loss = mse_loss(pred, y)
    loss.backward()
    optimizer.step()
    print("Loss error: " + str(loss.item()))

Using analog layers as part of other modules

When using analog layers in other modules, you can use the usual torch mechanisms for including them as part of the model.

However, as a number of torch functions are applied only to the parameters and buffers of a regular module, in some cases they need to be applied directly to the analog layers themselves (as opposed to only applying them to the parent container).

In order to avoid having to apply the functions to each analog layer individually, you can use AnalogSequential both as a compatible replacement for nn.Sequential and as the superclass of custom analog modules. By using this convenience module, the operations are guaranteed to be applied correctly to its children. For example:

from aihwkit.nn import AnalogLinear, AnalogSequential

model = AnalogSequential(
    AnalogLinear(10, 20)
)
model.cuda()
model.eval()
model.program_analog_weights()

Or in the case of custom classes:

from aihwkit.nn import AnalogConv2d, AnalogSequential

class Example(AnalogSequential):

    def __init__(self):
        super().__init__()

        self.feature_extractor = AnalogConv2d(
            in_channels=1, out_channels=16, kernel_size=5, stride=1
        )

Glossary

Analog AI

What is analog AI and an analog chip?

In a traditional hardware architecture, computation and memory are siloed in different locations. Information is moved back and forth between computation and memory units every time an operation is performed, creating a limitation called the von Neumann bottleneck.

In-memory computing delivers radical performance improvements by combining compute and memory in a single device, eliminating the von Neumann bottleneck. By leveraging the physical properties of memory devices, computation happens at the same place where the data is stored, drastically reducing energy consumption. Many types of memory devices such as phase-change memory (PCM), resistive random-access memory (RRAM), and Flash memory can be used for in-memory computing [1]. Because there is no movement of data, tasks can be performed in a fraction of the time and with much less energy. This is different from a conventional computer, where the data is transferred from the memory to the CPU every time a computation is done.

Analog AI comparison

In deep learning, data propagation through multiple layers of a neural network involves a sequence of matrix multiplications, as each layer can be represented as a matrix of synaptic weights. These weights can be stored in the analog charge state or conductance state of memory devices. The devices are arranged in crossbar arrays, creating an artificial neural network where all matrix multiplications are performed in-place in an analog manner. This structure allows deep learning models to be run at reduced energy consumption [1].

An in-memory computing chip typically consists of multiple crossbar arrays of memory devices that communicate with each other. A neural network layer can be implemented on (at least) one crossbar, in which the weights of that layer are stored in the charge or conductance state of the memory devices at the crosspoints. Usually, at least two devices per weight are used: one encoding the positive part of the synaptic weight and the other encoding the negative part. The propagation of data through that layer is performed in a single step by inputting the data to the crossbar rows and deciphering the results at the columns. The results are then passed through the neuron nonlinear function and input to the next layer. The neuron nonlinear function is typically implemented at the crossbar periphery, using analog or digital circuits. Because every layer of the network is stored physically on different arrays, each array needs to communicate at least with the array(s) storing the next layer for feed-forward networks, such as multi-layer perceptrons (MLPs) or convolutional neural networks (CNNs). For recurrent neural networks (RNNs), the output of an array needs to communicate with its input.
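
As a purely conceptual sketch (plain PyTorch, not aihwkit code), the differential weight encoding and the in-place matrix-vector product described above can be written as:

import torch

# Two conductance matrices encode the positive and negative parts of the weights.
g_plus = torch.rand(3, 4)
g_minus = torch.rand(3, 4)
weights = g_plus - g_minus      # effective synaptic weights of the layer

# Data is applied to the crossbar rows and the result is read at the columns,
# i.e. a single matrix-vector product per layer.
x = torch.rand(4)
y = weights @ x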

Analog chip description

The efficient matrix multiplication realized via in-memory computing is very attractive for inference-only applications, where data is propagated through the network on offline-trained weights. In this scenario, the weights are typically trained using a conventional GPU-based hardware, and then are subsequently programmed into the in-memory-computing chip which performs inference. Because of device and circuit level non-idealities in the analog in-memory computing chip, custom techniques must be included into the training algorithm to mitigate their effect on the network accuracy (so-called hardware-aware training [2]).

In-memory computing can also be used in the context of supervised training of neural networks with backpropagation. This training involves three stages: forward propagation of labelled data through the network, backward propagation of the error gradients from output to the input of the network, and weight update based on the computed gradients with respect to the weights of each layer. This procedure is repeated over a large dataset of labelled examples for multiple epochs until satisfactory performance is reached by the network. When performing training of a neural network encoded in crossbar arrays, forward propagation is performed in the same way as for the inference described above. The only difference is that all the activations \(x_i\) of each layer have to be stored locally in the periphery. Next, backward propagation is performed by inputting the error gradient \(δ_j\) from the subsequent layer onto the columns of the current layer and deciphering the result from the rows. The resulting sum \(\sum_i δ_jW_{ij}\) needs to be multiplied by the derivative of the neuron nonlinear function, which is computed externally, to obtain the error gradient of the current layer. Finally, the weight update is implemented based on the outer product of activations and error gradients \(x_iδ_j\) of each layer. The weight update is performed in-memory by applying suitable electrical pulses to the devices which will increase their conductance in proportion to the desired weight update. See references [1, 3, 4, 5] for details on different techniques that have been proposed to perform weight updates with in-memory computing chips.

Analog AI Hardware

There are many promising candidates for the resistive element in analog in-memory computing, including Phase Change Memory (PCM), Resistive Random-Access Memory (RRAM), Electro-Chemical Random-Access Memory (ECRAM), Complementary Metal Oxide Semiconductor (CMOS), Magnetic RAM (MRAM), Ferroelectric RAM (FeRAM or FeFET), and photonics [1].

Shown in the figure below is not a complete list of possible resistive elements, but examples of how analog resistance levels are achieved with various material and circuit implementations. One type of resistive RAM switches based on the formation and dissolution of a filament in a non-conductive dielectric material; intermediate conductance is achieved either by modulating the width of the filament or by modulating the composition of the conductive path. Electro-chemical RAM modulates the conductance between source and channel using the gate reservoir voltage that drives ions into the channel. CMOS-based capacitive cells can also be used as resistive elements for analog computing, as long as leakage is controlled and the compute and read operations can be completed quickly. Magnetic RAM is a very popular emerging non-volatile memory device, but its usually limited dynamic range forces its use as binary digital memory. However, using a racetrack-memory-type implementation, precise control of domain wall movement can be used to modulate conductance in an analog fashion. Similar domain wall motion implementations can also be applied to ferroelectric devices, though analog in-memory implementations with binary conductance levels but analog accumulation are also actively explored using FeRAM or FeFETs. Last but not least, vector-matrix multiplication with photonic devices, where the transmission of light is modulated in an analog fashion, has also been demonstrated.

Clearly there is still no ideal Analog AI device, as each of the available options has its strengths and weaknesses, as illustrated in the figure. For instance, PCM devices are considered the most mature among storage-class memory types, however they suffer from asymmetry, drift, and abrupt reset.

Various Candidates for Resistive Memory Devices

This is one of the key motivations behind building a simulator like aihwkit: to allow the exploration of various devices, and of the impact of a multitude of device characteristics, on the performance of AI models.

Advantages and Challenges

Why is computation in analog interesting for AI acceleration? There are several advantages and challenges. First, due to the parallel in-memory compute, the matrix-vector product is very power efficient, as the weight data does not need to be fetched from memory for each operation. Second, because the analog matrix-vector product is performed using the laws of electro-physics, it is computed in constant time. This implies that, to a first approximation, a large weight matrix needs the same time to compute the matrix-vector product as a smaller matrix. This is not the case for conventional digital MAC operations, where the time to compute the matrix-vector product typically scales with the number of matrix elements, as illustrated in the figure below.

MAC Operations in Analog vs. Digital

With all these great advantages also come key challenges for analog AI accelerators. Most importantly, the analog vector product is not exact: there are many possible noise sources and non-idealities. For instance, as shown in the graph below, repeating the same matrix-vector product twice will result in a slightly different outcome. This is the main motivation behind developing a flexible simulation tool for analog accelerators, as there is a need to investigate whether acceptable accuracies can be achieved when important AI workloads are accelerated with these future analog chips. Additionally, a simulation tool that simulates the expected noise sources of the analog compute will also help to develop new algorithms or error compensation methods that improve the accuracy and reduce any potential impact of the expected non-idealities of analog compute.

MAC Operations in Analog vs. Digital

Using aihwkit Simulator

The core functionality of the package is provided by the rpucuda simulator. The simulator contains the primitives and functionality written in C++ and with CUDA (if enabled), and is exposed to the rest of the package through a Python interface.

The following table lists the main modules involved in accessing the simulator:

Module                       Notes

aihwkit.simulator.tiles      Entry point for instantiating analog tiles
aihwkit.simulator.configs    Configurations and parameters for analog tiles
aihwkit.simulator.presets    Presets for analog tiles
aihwkit.simulator.rpu_base   Low-level bindings of the C++ simulator members

Analog Tiles Overview

The basic primitives involved in the simulation are analog tiles. An analog tile is a two-dimensional array of resistive devices that determine its behavior and properties, i.e. the material response properties when a single update pulse is given (that is, when a coincidence between a row and a column pulse train happens).

The following types of analog tiles are available:

  • FloatingPointTile: implements a floating point or ideal analog tile.

  • AnalogTile: implements an abstract analog tile with many cycle-to-cycle non-idealities and systematic parameter-spreads that can be user-defined.

  • InferenceTile: implements an analog tile for inference and hardware-aware training.

Creating an Analog Tile

The simplest way of constructing a tile is by instantiating its class. For example, the following snippet would create a floating point tile of the specified dimensions (10x20):

from aihwkit.simulator.tiles import FloatingPointTile

tile = FloatingPointTile(10, 20)

GPU-stored Tiles

By default, the tiles will perform their computations on the CPU. They can be moved to the GPU by invoking their .cuda() method:

from aihwkit.simulator.tiles import FloatingPointTile

cpu_tile = FloatingPointTile(10, 20)
gpu_tile = cpu_tile.cuda()

This method returns a counterpart of the original tile (for example, for a FloatingPointTile it will return a CudaFloatingPointTile). The GPU-stored tiles share the same interface as the CPU-stored tiles, and their methods can be used in the same manner.

Note

For GPU-stored tiles to be used, the library needs to be compiled with GPU support. This can be checked by inspecting the return value of the aihwkit.simulator.rpu_base.cuda.is_compiled() function.
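
For example, a minimal sketch of guarding the conversion on the result of that function:

from aihwkit.simulator.tiles import FloatingPointTile
from aihwkit.simulator.rpu_base import cuda

tile = FloatingPointTile(10, 20)

# Only move the tile to the GPU if the library was compiled with CUDA support.
if cuda.is_compiled():
    tile = tile.cuda()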

Using Analog Tiles

Analog arrays are low-level constructs that contain a number of functions that allow using them in the context of neural networks. A full description of the available arrays and their methods can be found at aihwkit.simulator.tiles.
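
As an illustrative sketch, a tile can also be used directly, outside of a torch layer. The constructor argument order and the shapes expected by set_weights() and forward() are assumptions here; please check aihwkit.simulator.tiles for the authoritative signatures:

from torch import rand
from aihwkit.simulator.tiles import FloatingPointTile

tile = FloatingPointTile(2, 3)       # assumed: 2 outputs, 3 inputs

tile.set_weights(rand(2, 3))         # load a weight matrix into the tile
output = tile.forward(rand(5, 3))    # analog matrix-vector products for a batch of 5
print(output.shape)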

Resistive processing units

A resistive processing unit is each of the elements on the crossbar array. The following types of resistive devices are available:

Floating Point Devices

  • FloatingPointDevice: floating point reference that implements ideal device forward/backward/update behavior.

Single Resistive Devices

  • PulsedDevice: pulsed update resistive device containing the common properties of all pulsed devices.

  • IdealDevice: ideal update behavior (using floating point), but forward/backward might be non-ideal.

  • ConstantStepDevice: pulsed update behavioral model: constant step, where the update step of the material is constant throughout the resistive range (up to hard bounds).

  • LinearStepDevice: pulsed update behavioral model: linear step, where the update step response size of the material is linearly dependent on the resistance (up to hard bounds).

  • SoftBoundsDevice: pulsed update behavioral model: soft bounds, where the update step response size of the material is linearly dependent and goes to zero at the bounds.

  • SoftBoundsPmaxDevice: same model as SoftBoundsDevice, but using a more convenient parameterization for easier fits to experimentally measured update response curves.

  • ExpStepDevice: exponential update step or CMOS-like update behavior.

  • PowStepDevice: update step using a power exponent non-linearity.

Unit Cell Devices

  • VectorUnitCell: abstract resistive device that combines multiple pulsed resistive devices in a single 'unit cell'.

  • OneSidedUnitCell: abstract device model that takes an arbitrary device per crosspoint and implements an explicit plus-minus device pair with one-sided update.

  • ReferenceUnitCell: abstract device model that takes two arbitrary devices per crosspoint and implements a device with a reference pair.

Compound Devices

  • TransferCompound: abstract device model that takes 2 or more devices per crosspoint and implements a 'transfer' based learning rule such as Tiki-Taka (see Gokmen & Haensch 2020).

  • MixedPrecisionCompound: abstract device model that takes one device per crosspoint and implements a 'mixed-precision' based learning rule where the rank update is done in digital instead of using a fully analog parallel write (see Nandakumar et al. 2020).

RPU Configurations

The combination of the parameters that affect the behavior of a tile and the parameters that determine the characteristic of a resistive processing unit are referred to as RPU configurations.

Creating a RPU Configuration

A configuration can be created by instantiating the class that corresponds to the desired tile. Each kind of configuration has different parameters depending on the particularities of the tile.

For example, for creating a floating point configuration that has the default values for its parameters:

from aihwkit.simulator.configs import FloatingPointRPUConfig

config = FloatingPointRPUConfig()

Among those parameters is the resistive device that will be used for creating the tile. For example, for creating a single resistive device configuration that uses a ConstantStep device:

from aihwkit.simulator.configs import SingleRPUConfig
from aihwkit.simulator.configs.devices import ConstantStepDevice

config = SingleRPUConfig(device=ConstantStepDevice())

Device Parameters

The parameters of the resistive devices that are part of a tile can be set by passing an rpu_config= parameter to the constructor:

from aihwkit.simulator.tiles import AnalogTile
from aihwkit.simulator.configs import SingleRPUConfig
from aihwkit.simulator.configs.devices import ConstantStepDevice

config = SingleRPUConfig(device=ConstantStepDevice())
tile = AnalogTile(10, 20, rpu_config=config)

Each configuration and device has a number of parameters. The parameters can be specified during the device instantiation, or accessed as attributes of the device instance.

For example, the following snippet will create a LinearStepDevice resistive device, setting its weight limits to [-0.4, 0.6], as well as other properties of the tile:

from aihwkit.simulator.configs import SingleRPUConfig
from aihwkit.simulator.configs.devices import LinearStepDevice
from aihwkit.simulator.configs.utils import IOParameters, UpdateParameters

rpu_config = SingleRPUConfig(
    forward=IOParameters(out_noise=0.1),
    backward=IOParameters(out_noise=0.2),
    update=UpdateParameters(desired_bl=20),
    device=LinearStepDevice(w_min=-0.4, w_max=0.6)
)

A description of the available parameters each configuration and device can be found at aihwkit.simulator.configs.

An alternative way of specifying non-default parameters is first generating the config with the correct device and then setting the fields directly:

from aihwkit.simulator.configs import SingleRPUConfig
from aihwkit.simulator.configs.devices import LinearStepDevice

rpu_config = SingleRPUConfig(device=LinearStepDevice())

rpu_config.forward.out_noise = 0.1
rpu_config.backward.out_noise = 0.1
rpu_config.update.desired_bl = 20
rpu_config.device.w_min = -0.4
rpu_config.device.w_max = 0.6

This will generate the same analog tile settings as above.

Unit Cell Device

More complicated devices require specification of sub devices and may have more parameters. For instance, to configure a device that has 3 resistive device materials per cross-point, which all have different pulse update behavior, one could do (see also Example 7):

from aihwkit.nn import AnalogLinear
from aihwkit.simulator.configs import UnitCellRPUConfig
from aihwkit.simulator.configs.utils import VectorUnitCellUpdatePolicy
from aihwkit.simulator.configs.devices import (
    ConstantStepDevice,
    VectorUnitCell,
    LinearStepDevice,
    SoftBoundsDevice
)

# Define a single-layer network, using a vector device having multiple
# devices per crosspoint. Each device can be arbitrarily defined

rpu_config = UnitCellRPUConfig()

rpu_config.device = VectorUnitCell(
    unit_cell_devices=[
        ConstantStepDevice(),
        LinearStepDevice(w_max_dtod=0.4),
        SoftBoundsDevice()
    ]
)

# more configurations, if needed

# only one of the devices should receive a single update that is
# selected randomly, the effective weights is the sum of all
# weights
rpu_config.device.update_policy = VectorUnitCellUpdatePolicy.SINGLE_RANDOM

# use this configuration for a simple model with one analog tile
model = AnalogLinear(4, 2, bias=True, rpu_config=rpu_config)

# print information about all parameters
print(model.analog_tile.tile)

This analog tile, although very complicated in its hardware configuration, can be used in any given network layer in the same way as simpler analog devices. Also, diffusion or decay might affect all sub-devices in different ways, as they all implement their own version of these operations. For the vector unit cell, each weight contribution simply adds up to form a joint effective weight. During forward/backward this joint effective weight will be used. The update, however, will be done on each of the "hidden" weights independently.

Transfer Compound Device

Compound devices are more complex than unit cell devices, which have a number of devices per crosspoint; however, they share the underlying implementation. For instance, the "Transfer Compound Device" contains (at least) two full crossbar arrays internally, where the stochastic gradient descent update is done on one (or a subset) of them. It intermittently performs a partial transfer of the content of the first array to the second. This transfer is accomplished by doing an extra forward pass (with a one-hot input vector) on the first array and updating the output onto the second array. The parameters of this extra forward and update step can be specified.

This compound device can be used to implement the tiki-taka learning rule as described in Gokmen & Haensch 2020. For instance, one could use the following tile configuration for that (see also Example 8):

# Imports from aihwkit.
from aihwkit.nn import AnalogLinear
from aihwkit.simulator.configs import UnitCellRPUConfig
from aihwkit.simulator.configs.devices import (
    TransferCompound,
    SoftBoundsDevice
)

# The Tiki-taka learning rule can be implemented using the transfer device.
rpu_config = UnitCellRPUConfig(
    device=TransferCompound(

        # devices that compose the Tiki-taka compound
        unit_cell_devices=[
            SoftBoundsDevice(w_min=-0.3, w_max=0.3),
            SoftBoundsDevice(w_min=-0.6, w_max=0.6)
        ],

        # Make some adjustments of the way Tiki-Taka is performed.
        units_in_mbatch=True,   # batch_size=1 anyway
        transfer_every=2,       # every 2 batches do a transfer-read
        n_cols_per_transfer=1,  # one forward read for each transfer
        gamma=0.0,              # all SGD weight in second device
        scale_transfer_lr=True, # in relative terms to SGD LR
        transfer_lr=1.0,        # same transfer LR as for SGD
    )
)

# make more adjustments (can be made here or above)
rpu_config.forward.inp_res = 1/64. # 6 bit DAC

# same forward/update for transfer-read as for actual SGD
rpu_config.device.transfer_forward = rpu_config.forward

# SGD update/transfer-update will be done with stochastic pulsing
rpu_config.device.transfer_update = rpu_config.update

# use tile configuration in model
model = AnalogLinear(4, 2, bias=True, rpu_config=rpu_config)

# print some parameter infos
print(model.analog_tile.tile)

Note that this analog tile will now perform tiki-taka as the learning rule instead of plain SGD. Once the configuration is done, the usage of this complex analog tile for testing or training is, from the user's point of view, however the same as for other tiles.

Mixed Precision Compound

This abstract device implements an analog SGD optimizer suggested by Nandakumar et al. 2020, where the update is not done directly in analog, but in digital. Thus it uses a digital rank-update of an intermediately stored floating point matrix, which is then used to transfer the information to the analog tile used in the forward and backward passes. This optimizer strategy is in contrast with the default mode in the simulator, which uses stochastic pulse trains to update the analog tile directly in parallel. This has an impact on the hardware design as well as on the expected runtime, as more digital computation needs to be done. For details, see Nandakumar et al. 2020.

To enable mixed-precision one defines for example the following rpu_config:

# Imports from aihwkit.
from aihwkit.nn import AnalogLinear
from aihwkit.simulator.configs import DigitalRankUpdateRPUConfig
from aihwkit.simulator.configs.devices import (
    SoftBoundsDevice, MixedPrecisionCompound
)

rpu_config = DigitalRankUpdateRPUConfig(
    device=MixedPrecisionCompound(
        device=SoftBoundsDevice(),

        # make some adjustments of mixed-precision hyper parameter
        granularity=0.001,
        n_x_bins=0,  # floating point activations for Chi update
        n_d_bins=0,  # floating point delta for Chi update
    )
)

# use tile configuration in model
model = AnalogLinear(4, 2, bias=True, rpu_config=rpu_config)

Now this analog tile will use the mixed-precision optimizer with a soft bounds device model.

Analog Presets

In addition to the building blocks for analog tiles described in the sections above, the toolkit includes:

  • a library of device presets that are calibrated to real hardware data and/or are based on models in the literature.

  • a library of configuration presets that specify a particular device and optimizer choice.

The current list of device and configuration presets can be found in the aihwkit.simulator.presets module. These presets can be used directly instead of manually specifying a RPU Configuration:

from aihwkit.simulator.tiles import AnalogTile
from aihwkit.simulator.presets import TikiTakaEcRamPreset

tile = AnalogTile(10, 20, rpu_config=TikiTakaEcRamPreset())

Using Experiments

Since version 0.3, the toolkit includes support for running Experiments. An Experiment represents a high-level use case, such as training a neural network, in a compact form that makes it easy to run the experiment and variations of it, both locally and remotely.

Experiments

The following types of Experiments are available:

Experiment class   Description

BasicTraining      Simple training of a neural network

Creating an Experiment

An Experiment can be created just by creating an instance of its class:

from torchvision.datasets import FashionMNIST

from torch.nn import Flatten, LogSoftmax, Sigmoid
from aihwkit.nn import AnalogLinear, AnalogSequential

from aihwkit.experiments import BasicTraining


my_experiment = BasicTraining(
    dataset=FashionMNIST,
    model=AnalogSequential(
        Flatten(),
        AnalogLinear(784, 256, bias=True),
        Sigmoid(),
        AnalogLinear(256, 128, bias=True),
        Sigmoid(),
        AnalogLinear(128, 10, bias=True),
        LogSoftmax(dim=1)
    )
)

Each Experiment has its own attributes, providing sensible defaults as needed. For example, the BasicTraining Experiment allows setting attributes that define the characteristics of the training (dataset, model, batch_size, loss_function, epochs, learning_rate), as shown in the sketch below.
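
As a minimal sketch of overriding some of those attributes (the values shown here are illustrative, not the library defaults; NLLLoss is assumed as an accepted loss_function value):

from torch.nn import NLLLoss

# Reusing the imports from the snippet above.
my_experiment = BasicTraining(
    dataset=FashionMNIST,
    model=AnalogSequential(
        Flatten(),
        AnalogLinear(784, 10, bias=True),
        LogSoftmax(dim=1)
    ),
    batch_size=64,
    loss_function=NLLLoss,
    epochs=10,
    learning_rate=0.05
)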

The created Experiment contains the definition of the operation to be performed, but is not executed automatically: that is the role of the Runners.

Runners

A Runner is the object that controls the execution of an Experiment, setting up the environment and providing a convenient way of starting it and retrieving its results.

The following types of Runners are available:

Runner class   Description

LocalRunner    Runner for executing experiments locally
CloudRunner    Runner for executing experiments in the cloud

Running an Experiment Locally

In order to run an Experiment, the first step is creating the appropriate runner:

from aihwkit.experiments.runners import LocalRunner

my_runner = LocalRunner()

Note

Each runner has different configuration options depending on its type. For example, the LocalRunner has an option for setting the device where the model will be executed, which can be used for running on a GPU:

from torch import device as torch_device

my_runner = LocalRunner(device=torch_device('cuda'))

Once the runner is created, the Experiment can be executed via:

result = my_runner.run(my_experiment)

This will start the desired experiment, and return the results of the experiment - in the training case, a list of dictionaries containing the metrics for each epoch:

> print(result)

[{
  'epoch': 0,
  'accuracy': 0.8289,
  'train_loss': 0.4497026850991666,
  'valid_loss': 0.07776954893999771
 },
 {
  'epoch': 1,
  'accuracy': 0.8299,
  'train_loss': 0.43052176381352103,
  'valid_loss': 0.07716381718227858
 },
 {
  'epoch': 2,
  'accuracy': 0.8392,
  'train_loss': 0.41551961805393445,
  'valid_loss': 0.07490375201140385
 },
 ...
]

The local runner will also print information by default while the experiment is being executed (for example, if running the experiment in an interactive session, as a way of tracking progress). This can be turned off via the stdout argument of the run() function:

result = my_runner.run(my_experiment, stdout=False)

Note

The local runner will automatically attempt to download the dataset into a temporary folder if it is FashionMNIST or SVHN. For other datasets, please ensure that the dataset has been downloaded previously, using the dataset_root argument to indicate the location of the data files:

result = my_runner.run(my_experiment, dataset_root='/some/path')

Cloud Runner

Experiments can also be run in the cloud at our companion AIHW Composer application, which allows executing the experiments remotely using hardware acceleration and inspecting the experiments and their results visually, among other features.

Setting up your account

The integration is provided by a Python client included in aihwkit that allows connecting to the AIHW Composer platform. In order to be able to run experiments in the cloud:

  1. Register in the platform and generate an API token in your user page. This token acts as the credentials for connecting with the application.

  2. Store your credentials by creating a ~/.config/aihwkit.conf file with the following contents, replacing YOUR_API_TOKEN with the string from the previous step:

    [cloud]
    api_token = YOUR_API_TOKEN
    

Running an Experiment in the cloud

Once your credentials are configured, running experiments in the cloud can be performed by using the CloudRunner, in an analogous way as running experiments locally:

from aihwkit.experiments.runners import CloudRunner

my_cloud_runner = CloudRunner()
cloud_experiment = my_cloud_runner.run(my_experiment)

Instead of waiting for the experiment to be completed, the run() method returns an object that represents a job in the cloud. As such, it has several convenience methods:

Checking the status of a cloud experiment

The status of a cloud experiment can be retrieved via:

cloud_experiment.status()

The response will provide information about the cloud experiment (a polling sketch follows the list below):
  • WAITING: if the experiment is waiting to be processed.

  • RUNNING: when the experiment is being executed in the cloud.

  • COMPLETED: if the experiment was executed successfully.

  • FAILED: if there was an error during the execution of the experiment.
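
A minimal polling sketch, assuming that status() returns one of the values above as a string (the exact return type is not documented here):

import time

# Poll the cloud job until it leaves the waiting/running states.
while cloud_experiment.status() in ('WAITING', 'RUNNING'):
    time.sleep(30)

result = cloud_experiment.get_result()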

Note

Some actions are only possible if the cloud experiment has finished successfully, for example, retrieving its results. Please also be mindful that some experiments can take a sizeable amount of time to be executed, especially during the initial versions of the platform.

Retrieving the results of a cloud experiment

Once the cloud experiment completes its execution, its results can be retrieved using:

result = cloud_experiment.get_result()

This will display the result of executing the experiment, in a similar form as the output of running an Experiment locally.

Retrieving the content of the experiment

The Experiment can be retrieved using:

experiment = cloud_experiment.get_experiment()

This will return a local Experiment (for example, a BasicTraining) that can be used locally and whose properties can be inspected. In particular, the weights of the model will reflect the results of the experiment.

Retrieving a previous cloud experiment

The list of experiments previously executed in the cloud can be retrieved via:

cloud_experiments = my_cloud_runner.list_experiments()
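
For example, a sketch of inspecting the returned list using the convenience methods described above:

# Check the status of each previously executed cloud experiment.
for cloud_experiment in cloud_experiments:
    print(cloud_experiment.status())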

Specialized Update Algorithms

To accelerate the training of a DNN, the analog accelerator needs to implement the forward, backward, and update passes necessary for computing stochastic gradient descent. We further assume that only the matrix-vector operations are accelerated in analog, not the non-linear activation functions or the pooling functions. The latter are done in the digital domain; we assume that there will be separate digital compute units available on the same chip.

To be able to use digital co-processors along with the analog in-memory compute processors, the activations need to be converted to analog for each crossbar array using digital-to-analog converters (DACs), and converted back using analog-to-digital converters (ADCs). Additionally, there might be additional digital pre- and post-processing, such as activation scaling or bias correction shifting, which will be done in floating point or digital as well.

Accelerating DNN Training with Analog

The toolkit provides a functional simulation of the forward, backward and update passes. Since we want to be able to scale up the simulation to relevant neural network sizes, it is not feasible to simulate the physical system in great detail. For simulating non-idealities of the analog forward and backward passes, we use an abstract way to represent only the effective noise sources and non-idealities, which might have various origins in the physical system; we do not simulate them explicitly. However, we can define different noises and noise strengths for input, output, and weights. Additionally, the value ranges of the input, output, and weights are limited because of physical implementation details and hardware limitations. We also provide a simple way to quantize the input and output to simulate digital-to-analog and analog-to-digital conversion, as well as various pre- and post-processing schemes that can be selected, such as dynamic input range normalization.
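
As a sketch of configuring such input/output non-idealities on a tile (the out_res field is assumed here to follow the same convention as the inp_res field shown earlier in this document):

from aihwkit.simulator.configs import SingleRPUConfig
from aihwkit.simulator.configs.devices import ConstantStepDevice
from aihwkit.simulator.configs.utils import IOParameters

rpu_config = SingleRPUConfig(
    device=ConstantStepDevice(),
    forward=IOParameters(
        inp_res=1 / 64.0,   # quantize the inputs (6-bit DAC)
        out_res=1 / 256.0,  # quantize the outputs (8-bit ADC)
        out_noise=0.06,     # output noise of the analog matrix-vector product
    )
)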

Input and Output Quantization

For the update pass, we have put a lot of effort into the simulator to be able to estimate the impact of the noise characteristics of different material choices, such as asymmetric resistive device update behavior or device-to-device variability. During the update pass, to apply the gradient, the device conductance needs to be incrementally changed by a certain amount. To achieve this behavior, several finite-sized pulses are sent to the device, causing changes in the conductance values. This induced conductance change, however, is very noisy for many device materials, as shown in the plot below.

AIHWKIT Model fit to Real Data Measurements

The upper line shows the conductance change of a measured ReRAM device in response to 500 pulses in the up direction followed by 500 pulses in the down direction. Each of the applied voltage pulses has, in theory, the same strength, yet the response is extremely noisy, as illustrated in the figure. The three example traces show the ReRAM model implemented in the simulator, which captures the measured conductance response curve quite well. One can also see the device-to-device variability, illustrated by the three different colored traces corresponding to three different device updates.

We have implemented 3 different ways to perform the update in Analog and hope to extend the number of available optimizers in the future:

  • Plain SGD: Fully parallel update using stochastic pulse trains by Gokmen & Vlasov [9].

  • Mixed precision: Digital rank update and transfer by Nandakumar et al. [4].

  • Tiki-taka: Momentum-like SGD update by Gokmen & Haensch [10].

These algorithmic improvements and the adaptation of existing algorithms to the characteristics of analog hardware are one of the key focus areas of this toolkit.

The plain SGD optimizer implements a fast way to do the gradient update fully in analog, using coincidences of stochastic pulse trains to compute the outer product, as suggested in the paper of Gokmen & Vlasov [9]. The mixed precision optimizer was proposed by Nandakumar et al. in 2020 [4]. In this optimizer, the outer product that forms the weight gradients is computed in digital. Compared to the first optimizer, which performs the update fully in parallel in analog, this approach requires more digital compute units on the chip; it is, however, a good choice for much more non-ideal devices. The third optimizer, called Tiki-taka, implements an algorithm that is similar to momentum stochastic gradient descent and assumes that both the momentum term and the weight matrix are on analog crossbar arrays, as discussed in [10]. The gradient update computation onto the momentum matrix uses the same fast update as explained in the plain SGD case.

Plain SGD: Fully Parallel Update

We discuss in this section how the parallel update process was implemented, based on the work of Gokmen & Vlasov [9]. During the update pass, we need to compute the weight gradient, which is the outer product between the backpropagated error vector d and the activation vector x, and then add it to the weight matrix. This can be done in analog as follows. To compute the outer product between the backpropagated error vector and the activation vector, each side of the crossbar array receives stochastic pulse trains where the probability of having a pulse is proportional to the activation vector x or the error vector d. Since the pulses are drawn independently, the probability of a coincidence is given by the product of both probabilities. So, when the coincidences cause the incremental conductance changes, the weight gradient update performed in this manner is done in constant time for the full analog array in parallel, which is exactly what is needed to compute the product of d and x and apply the update. In our implementation, we simulate this parallel update in great detail. In particular, we draw the random pulse trains explicitly and apply up or down conductance changes only in case of a coincidence. For each coincidence, the conductance change of the configured device model is applied, which includes full cycle-to-cycle variations, device-to-device variability, and IR drop (the voltage drop due to energy losses in a resistor).
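
As a conceptual sketch (plain PyTorch, not the simulator's internal C++ implementation) of how coincidences of independent stochastic pulse trains approximate the outer product of d and x:

import torch

bl = 31                                    # length of the stochastic pulse trains
x = torch.tensor([0.2, 0.8, 0.5])          # activations (pulse probabilities)
d = torch.tensor([0.4, 0.1])               # error signal (pulse probabilities)

# Draw independent stochastic pulse trains for the activations and the errors.
x_pulses = (torch.rand(bl, x.numel()) < x).float()
d_pulses = (torch.rand(bl, d.numel()) < d).float()

# A coincidence of an x-pulse and a d-pulse triggers one conductance step.
coincidences = d_pulses.t() @ x_pulses

# In expectation, the coincidence counts are proportional to the outer product d x^T.
print(coincidences / bl)
print(d.reshape(-1, 1) * x)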

Analog Parallel Update

Mixed Precision

The mixed precision optimizer is algorithmically similar to momentum SGD. In momentum SGD, the weight gradients are not directly applied to the weight matrix, but are first added in a leaky fashion to the momentum matrix M, and then the momentum matrix is applied to the weight matrix. In the mixed precision optimizer, the matrix M is computed in digital floating-point precision. This matrix is then used to update the weight matrix, which is kept in analog. This way, the analog weight update happens less often than once per mini-batch. The mixed precision optimizer needs a large amount of digital compute, as the outer product is not calculated in analog.

Momentum SGD Mixed Precision SGD

A list of mixed precision presets that implement the mixed-precision optimizer on different analog devices is given below:

  • MixedPrecisionReRamESPreset

  • MixedPrecisionReRamSBPreset

  • MixedPrecisionCapacitorPreset

  • MixedPrecisionEcRamMOPreset

  • MixedPrecisionGokmenVlasovPreset

See example 12 for an illustration of how to use the mixed precision update in aihwkit [4].
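
For instance, a sketch of using one of the presets listed above as the tile configuration of an analog layer (assuming it is importable from the presets module, like the other presets in this document):

from aihwkit.nn import AnalogLinear
from aihwkit.simulator.presets import MixedPrecisionReRamESPreset

model = AnalogLinear(4, 2, bias=True, rpu_config=MixedPrecisionReRamESPreset())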

Tiki-taka: Momentum-like SGD Update

The Tiki-taka optimizer is also algorithmically similar to momentum SGD. The difference here is that the momentum matrix is also in analog. This implies that the outer product update onto the momentum matrix is done in analog in fully parallel mode, using the stochastic pulse trains described earlier. Therefore, this optimizer does not have the potential bottleneck of computing the outer product in digital, as is done in the mixed precision optimizer. A nice feature of this algorithm is how the decay of the momentum term is achieved: because a multiplicative decay of the conductance values of an analog crossbar array is not easily achievable in hardware, the device update asymmetry is instead used to implicitly decay the conductance values through random up and down pulses. This is explained in more detail in this paper.

Tiki-taka: Momentum-like SGD Update

Analog Training Presets

The toolkit includes built-in analog presets that implement different types of devices that could be used to implement an analog neural network training. These presets (except “Idealized analog device”) are calibrated on the measured characteristics of real hardware devices that have been fabricated at IBM. Device non-ideal characteristics, noise, and variability are accurately simulated in all presets.

  • a library of device presets that are calibrated to real hardware data and/or are based on models in the literature.

  • a library of configuration presets that specify a particular device and optimizer choice.

The current list of device and configuration presets can be found in the aihwkit.simulator.presets module. These presets can be used directly instead of manually specifying a RPU Configuration:

from aihwkit.simulator.tiles import AnalogTile
from aihwkit.simulator.presets import TikiTakaEcRamPreset

tile = AnalogTile(10, 20, rpu_config=TikiTakaEcRamPreset())

In what follows, we describe in more detail the characteristics of some of the analog training presets. For a comprehensive list of all available preset configurations, check the aihwkit.simulator.presets.configs module.

ReRAM-ES Preset

Summary: Resistive random-access memory (ReRAM) device based on hafnium oxide using exponential step model.

The example below shows how you can use this preset to create a simple Analog Linear layer network. For more details, check aihwkit.simulator.presets.configs.ReRamESPreset module:

from aihwkit.nn import AnalogLinear
from aihwkit.simulator.configs import SingleRPUConfig
from aihwkit.simulator.presets import ReRamESPreset

# Define a single-layer network, using the ReRAM-ES preset.
rpu_config = SingleRPUConfig(device=ReRamESPreset())
model = AnalogLinear(4, 2, bias=True, rpu_config=rpu_config)

Characterization of the ReRAM-ES Preset Device

Resistive random-access memory (ReRAM) is a non-volatile memory technology with tuneable conductance states that can be used for in-memory computing. The conductance change of a ReRAM device is bidirectional, that is, it is possible to both increase and decrease its conductance incrementally by applying suitable electrical pulses. This capability can be exploited to implement the backpropagation algorithm. The change of conductance in oxide ReRAM is attributed to change in the configuration of the current conducting filament which consists of oxygen vacancies in a metal oxide film.

This preset is based upon the ReRAM device presented in the work of Gong et al. [6]. This device was fabricated with hafnium oxide as the metal oxide switching layer. The preset captures the experimentally measured response of this device to 1000 positive and 1000 negative pulses (shown in the figure above), including the pulse-to-pulse fluctuations. The movement of the oxygen vacancies in response to electrical signals has a probabilistic nature, and it emerges as inherent randomness in conductance changes. Realistic device-to-device variability is also included in the preset to appropriately simulate the behavior of an array of such devices.

The main parameters of the ReRAM-ES preset device are listed below:

Parameter                                                    Value

Number of steps                                              1000
Conductance update model                                     Exponential step
+/- step max. asymmetry                                      -670%
Step size variability across devices                         20%
Step-to-step variability on same device                      500%
Max/min conductance variability                              30%
Instantaneous write noise per step (in % of weight range)    5%

ReRAM-SB Preset

Summary: Resistive random-access memory (ReRAM) device based on hafnium oxide using soft bounds model.

The example below shows how you can use this preset to create a simple Analog Linear layer network. For more details, check aihwkit.simulator.presets.configs.ReRamSBPreset:

from aihwkit.nn import AnalogLinear
from aihwkit.simulator.presets import ReRamSBPreset

rpu_config = ReRamSBPreset()
model = AnalogLinear(4, 2, bias=True, rpu_config=rpu_config)

Characterization of the ReRAM-SB Preset Device

This preset is similar to the ReRAM-ES preset, except that it uses a soft bounds model instead of an exponential step model. The parameters of this preset are shown in the table below:

Parameter                                                     Value
Number of steps                                               1000
Conductance update model                                      Soft Bounds
+/- step asymmetry (at w=0.75)                                -400%
Step size variability across devices                          30%
Step-to-step variability on same device                       375%
Max/min conductance variability                               30%
Instantaneous write noise per step (in % of weight range)     5%

Capacitor-cell Preset

Summary: Capacitor-based unit cell device using trench capacitor in 14nm CMOS technology. For more details, check aihwkit.simulator.presets.configs.CapacitorPreset module.

Characterization of the Capacitor-based Unit Cell Preset Device

A capacitor-based cross-point array can be used to train analog neural networks. A capacitor can serve as an analog memory, connected to the gate of a “readout” pFET. During readout, the synaptic weight can be accessed by measuring the conductance of the readout FET. During weight update, the capacitor is charged/discharged by two “current source” FETs, as controlled by two analog inverters and one digital inverter. Charge can be added or subtracted continuously if the number of electrons is high, so analog and symmetric weight update can be achieved. Capacitor-cells have demonstrated some of the best linearity and symmetry characteristics among analog devices, making them a promising candidate for neural network training.

This preset is based upon the capacitor-based cross-point array presented in the work of Li et al. [7]. The array was fabricated with trench capacitors in 14nm CMOS technology. The preset captures the experimentally measured response of this device to 400 positive update and 400 negative update pulses (shown in Figure 6), including the pulse-to-pulse fluctuations. The reported asymmetry between positive and negative updates of 10% is included, as well as the cell-to-cell asymmetry variations. Capacitor leakage is also simulated by exponentially decreasing the weight values over training mini-batches, and cell-to-cell variations in leakage are included. Realistic pulse-to-pulse and device-to-device variability is also included in the preset to appropriately simulate the effect of non-ideal characteristics such as readout variation and variation of current sources.

Parameter                                                     Value
Number of steps                                               400
Conductance update model                                      Linear step
+/- step max. asymmetry                                       -10%
Step size variability across devices                          10%
Step-to-step variability on same device                       30%
Max/min conductance variability                               7%
Leakage (in mini-batches)                                     10^6

ECRAM Preset

Summary: Electro-Chemical Random-Access Memory (ECRAM) device based on lithium (Li) ion intercalation in tungsten oxide (WO3). For more details, check aihwkit.simulator.presets.configs.EcRamPreset module.

Characterization of the EcRam Device Preset

Electro-Chemical Random-Access Memory (ECRAM) is a three-terminal non-volatile electrochemical switch that has been proposed as an artificial synapse for Analog AI. The electrochemically driven intercalation or redox reaction can be precisely and reversibly controlled by the amount of charge through the gate, so it can provide near-symmetric switching with plentiful discrete states and reduced stochasticity. In exchange for the added cell complexity of a three-terminal device, the read and write operations are decoupled, allowing for better endurance and low-energy switching while maintaining non-volatility. These attributes make ECRAM a promising device candidate for neural network training applications.

This preset is based upon the ECRAM device presented in the work of Tang et al. [7]. Lithium phosphorous oxynitride (LiPON) was used as a solid-state electrolyte. The amount of Li ions intercalated in WO3 is precisely controlled by the gate current and this process is reversible, enabling near-symmetric updates. In operation, a series of positive (negative) current pulses is fed into the gate for potentiation (depression). The preset captures the experimentally measured response of this device to 1000 positive and 1000 negative pulses (shown in Figure 13 in the paper), including the pulse-to-pulse fluctuations. Realistic device-to-device variability is also included in the preset to appropriately simulate the behavior of an array of such devices.

Parameter                                                     Value
Conductance update model                                      Linear step
+/- step max. asymmetry                                       -75%
Step size variability across devices                          10%
Step-to-step variability on same device                       30%
Max/min conductance variability                               5%

Inference and PCM statistical model

The analog AI hardware kit provides a state-of-the-art statistical model of a phase-change memory (PCM) array that can be used when performing inference to simulate the various sources of noise that are present in real hardware [1]. This model is calibrated based on extensive measurements performed on an array containing 1 million PCM devices fabricated at IBM [2].

PCM is a key enabling technology for non-volatile electrical data storage at the nanometer scale, which can be used for analog AI [3]. A PCM device consists of a small active volume of phase-change material sandwiched between two electrodes. In PCM, data is stored by using the electrical resistance contrast between a high-conductive crystalline phase and a low-conductive amorphous phase of the phase-change material. The phase-change material can be switched from low to high conductive state, and vice-versa, through applying electrical current pulses. The stored data can be retrieved by measuring the electrical resistance of the PCM device. An appealing attribute of PCM is that the stored data is retained for a very long time (typically 10 years at room temperature), but is written in only a few nanoseconds.

The model simulates three different sources of noise from the PCM array: programming noise, read noise and temporal drift. The model is only used during inference and therefore it is assumed that network weights have been trained beforehand in software. The diagram below explains how these three sources of noise are incorporated during inference when using the statistical model:

Mapping the trained weights to target conductances

This step is typically done offline, after training, before programming the hardware. When the final converged network weights \(W\) have been obtained after training, they must be converted to target conductance values \(G_T\) that will be programmed on the hardware, within the range that it supports. In the statistical model, this range is set to \([0,1]\), where \(1\) corresponds to the largest conductance value \(g_\text{max}\) that can be reliably programmed on the hardware.

The statistical model assumes that each weight is programmed on two PCM devices in a differential configuration. That is, depending on the sign of the weight, either the device encoding the positive part of the weight or the one encoding the negative part is programmed, and the other device is set to 0. Thus, the simplest way to map the weights to conductances is to multiply the weights by a scaling factor \(\beta\), which is different for every network layer. A simple approach is to use \(\beta = 1/w_\text{max}\), where \(w_\text{max}\) is the maximum absolute weight value of a layer.
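
As an illustration, this mapping can be sketched in plain numpy (a minimal sketch under the assumptions above; the function name and the explicit differential split are illustrative, not the exact routine used by the toolkit):

import numpy as np

# Map trained weights W to target conductances G_T in [0, 1], using the
# per-layer scaling factor beta = 1 / w_max. In the differential configuration,
# positive weights are encoded on the "plus" device and negative weights on the
# "minus" device; the other device of the pair is set to 0.
def weights_to_target_conductances(weights):
    w_max = np.max(np.abs(weights))
    beta = 1.0 / w_max
    g_plus = np.clip(weights, 0.0, None) * beta
    g_minus = np.clip(-weights, 0.0, None) * beta
    return g_plus, g_minus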

Programming noise

After the target conductances have been defined, they are programmed on the PCM devices of the hardware using a closed-loop iterative write-read-verify scheme [4]. The conductance values programmed in this way on the hardware will have a certain error compared with the target values. This error is characterized by the programming noise. The programming noise is modeled based on the standard deviation of the iteratively programmed conductance values measured from hardware.

The equations used in the statistical model to implement the programming noise are (where we use small letters for the elements of the matrices \(W\) and \(G_T\), etc., and omit the indices for brevity):

\[g_\text{prog} = g_{T} + {\cal N}(0,\sigma_\text{prog})\]
\[\sigma_\text{prog} = \max\left(-1.1731 \, g_{T}^2 + 1.9650 \, g_{T} + 0.2635, 0 \right)\]
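
For illustration, the programming step can be written directly in numpy (a minimal sketch; it assumes \(g_T\) is expressed in the same conductance units as the polynomial fit):

import numpy as np

rng = np.random.default_rng()

# Programmed conductance = target conductance plus a Gaussian error whose
# standard deviation depends on the target value (equations above).
def program_conductances(g_target):
    sigma_prog = np.maximum(
        -1.1731 * g_target**2 + 1.9650 * g_target + 0.2635, 0.0)
    return g_target + rng.normal(0.0, sigma_prog)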

The fit between this equation and the hardware measurement is shown below:

Drift

After they have been programmed, the conductance values of PCM devices drift over time. This drift is an intrinsic property of the phase-change material of a PCM device and is due to structural relaxation of the amorphous phase [5]. Knowing the conductance at time \(t_c\) from the last programming pulse, \(g_\text{prog}\), the conductance evolution can be modeled as:

\[g_\text{drift}(t) = g_\text{prog} \left(\frac{t}{t_c}\right)^{-\nu}\]

where \(\nu\) is the so-called drift exponent and is sampled from \({\cal N}(\mu_\nu,\sigma_\nu)\). \(\nu\) exhibits variability across a PCM array and a dependence on the target conductance state \(g_T\). The mean drift exponent \(\mu_\nu\) and its standard deviation \(\sigma_\nu\) measured from hardware can be modeled with the following equations:

\begin{eqnarray*} \mu_\nu &=& \min\left(\max\left(-0.0155 \log (g_T) + 0.0244, 0.049\right), 0.1\right)\\ \sigma_\nu &=& \min\left(\max\left(-0.0125 \log (g_T) - 0.0059, 0.008\right), 0.045\right)\\ \end{eqnarray*}
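
For illustration, the drift step can be sketched as follows (a minimal sketch; as before, conductances are assumed to be in the units used for the fits, with \(g_T > 0\)):

import numpy as np

rng = np.random.default_rng()

# Sample a drift exponent per device from the state-dependent distribution and
# decay the programmed conductance from the programming time t_c to time t.
def drift_conductances(g_prog, g_target, t, t_c):
    mu_nu = np.clip(-0.0155 * np.log(g_target) + 0.0244, 0.049, 0.1)
    sigma_nu = np.clip(-0.0125 * np.log(g_target) - 0.0059, 0.008, 0.045)
    nu = rng.normal(mu_nu, sigma_nu)
    return g_prog * (t / t_c) ** (-nu)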

The fits between these equations and the hardware measurements are shown below:

Read noise

When performing a matrix-vector multiplication with the in-memory computing hardware, after the weights have been programmed, there will be instantaneous fluctuations on the hardware conductances due to the intrinsic noise from the PCM devices. PCM exhibits \(1/f\) noise and random telegraph noise characteristics, which alter the effective conductance values used for computation. This noise is referred to as read noise, because it occurs when the devices are read after they have been programmed.

The power spectral density \(S_G\) of the \(1/f\) noise in PCM is given by the following relationship:

\[S_G/G^2 = Q/f\]

The standard deviation of the read noise \(\sigma_{nG}\) at time \(t\) is obtained by integrating the above equation over the measurement bandwidth:

\[\sigma_{nG}(t) = g_\text{drift}(t) Q_s \sqrt{\log\frac{t+t_\text{read}}{2 t_\text{read}}}\]

where \(t_\text{read} = 250\) ns is the width of the pulse applied when reading the devices.

The \(Q_s\) measured from the PCM devices as a function of \(g_T\) is given by:

\[Q_s=\min\left(0.0088/g_T^{0.65}, 0.2\right)\]

The final simulated PCM conductance from the model at time \(t\), \(g(t)\), is given by:

\[g(t)= g_\text{drift} (t)+ {\cal N}\left(0, \sigma_{nG} (t)\right)\]
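
Putting the last equations together, the read step can be sketched as follows (a minimal sketch, continuing the assumptions above and with \(t \gg t_\text{read}\)):

import numpy as np

rng = np.random.default_rng()

T_READ = 250e-9  # width of the read pulse, in seconds

# Effective conductance at time t: drifted conductance plus an instantaneous
# Gaussian fluctuation derived from the 1/f read noise characteristics.
def read_conductances(g_drift, g_target, t):
    q_s = np.minimum(0.0088 / g_target**0.65, 0.2)
    sigma_read = g_drift * q_s * np.sqrt(np.log((t + T_READ) / (2 * T_READ)))
    return g_drift + rng.normal(0.0, sigma_read)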

Compensation method to mitigate the effect of drift

The conductance drift of PCM devices can have a very detrimental effect on the inference performance of a model mapped to hardware. This is because the magnitude of the PCM weights gradually reduces over time due to drift and this prevents the activations from properly propagating throughout the network. A simple global scaling calibration procedure can be used to compensate for the effect of drift on the matrix-vector multiplications performed with PCM crossbars. As proposed in [5], the summed current of a subset of the columns in the array can be periodically read over time at a constant voltage. The resulting total current is then divided by the summed current of the same columns but read at time \(t_0\). This results in a single scaling factor, \(\hat{\alpha}\), that can be applied to the output of the entire crossbar in order to compensate for a global conductance shift.

The figure below explains how the drift calibration procedure can be performed in hardware:

In the simulator, we implement drift compensation by performing a forward pass with an all-ones vector as input, and then summing the absolute values of the outputs (subject to the potential non-idealities defined for the forward pass). This procedure is done once after programming and once after applying the drift expected at the inference time point \(t_\text{inference}\). The ratio of the two numbers is the global drift compensation scaling factor of that layer, and it is applied (in digital) to the (digital) output of the analog tile.
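
The idea can be sketched as follows (a minimal sketch; forward_at_t0 and forward_at_t are hypothetical callables standing in for the analog forward pass right after programming and at inference time, and are not part of the toolkit API):

import numpy as np

# Ratio of the summed absolute responses to an all-ones input, taken right
# after programming and again at inference time. The result is a single
# digital scale applied to the output of the analog tile.
def global_drift_scale(forward_at_t0, forward_at_t, in_size):
    ones = np.ones(in_size)
    reference = np.sum(np.abs(forward_at_t0(ones)))
    drifted = np.sum(np.abs(forward_at_t(ones)))
    return reference / drifted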

Note that the drift compensation class BaseDriftCompensation is user extendable, so that new drift compensation methods can be added easily.

Example of how to use the PCM noise model for inference

The above noise model for inference can be used in our package in the following way. Instead of using a regular analog tile, which is geared towards analog training with pulsed updates (see the Using Analog Tiles section), you can use an inference tile that only has non-idealities in the forward pass, but a perfect update and backward pass. Moreover, for inference, weights can be subject to realistic weight noise and drift as described above. To enable these inference features, one has to build a model using our InferenceTile (see also example 5):

from aihwkit.nn import AnalogLinear
from aihwkit.simulator.configs import InferenceRPUConfig
from aihwkit.simulator.configs.utils import WeightNoiseType
from aihwkit.inference import PCMLikeNoiseModel, GlobalDriftCompensation

# Define a single-layer network, using inference/hardware-aware training tile
rpu_config = InferenceRPUConfig()

# specify additional options of the non-idealities in forward to your liking
rpu_config.forward.inp_res = 1/64.  # 6-bit DAC discretization.
rpu_config.forward.out_res = 1/256. # 8-bit ADC discretization.
rpu_config.forward.w_noise_type = WeightNoiseType.ADDITIVE_CONSTANT
rpu_config.forward.w_noise = 0.02   # Some short-term w-noise.
rpu_config.forward.out_noise = 0.02 # Some output noise.

# specify the noise model to be used for inference only
rpu_config.noise_model = PCMLikeNoiseModel(g_max=25.0) # the PCM statistical model described above

# specify the drift compensation
rpu_config.drift_compensation = GlobalDriftCompensation()

# build the model (here just one simple linear layer)
model = AnalogLinear(4, 2, rpu_config=rpu_config)

Once the DNN is trained (automatically using hardware-aware training if the forward pass includes non-idealities and noise), inference with drift and drift compensation is done in the following manner:

model.eval()        # model needs to be in inference mode
t_inference = 3600. # time of inference in seconds (after programming)

program_analog_weights(model) # can also be omitted, as it is called below in any case
drift_analog_weights(model, t_inference) # modifies weights according to noise model

# now the model can be evaluated with programmed/drifted/compensated weights

Note that two types of non-idealities are included here. For the first, the longer-term weight noise and drift (as described above), we assume that the weight-related PCM noise and drift are applied once during the evaluation and the weights are then kept constant. Thus, a subsequent test error calculation over the full test set signifies the expected test error for the model at a given time. Ideally, one would want to repeat this for different weight noise and drift instances and/or different inference times to properly assess the accuracy degradation.
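
Such a repeated evaluation could look as follows (a minimal sketch reusing the functions from the snippet above; evaluate is a placeholder for a user-provided test-set evaluation routine):

t_inferences = [1.0, 3600.0, 86400.0]  # 1 second, 1 hour, 1 day after programming
n_repetitions = 5                      # average over noise and drift instances

for t_inference in t_inferences:
    for _ in range(n_repetitions):
        drift_analog_weights(model, t_inference)  # programs and drifts the weights
        accuracy = evaluate(model)                # placeholder evaluation routine
        print(t_inference, accuracy)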

The second type of non-idealities is short-term and at the level of a single analog MACC (Multiply and Accumulate). Noise at that level varies with each use of the analog tile and is specified in rpu_config.forward.
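
For instance, with the forward non-idealities configured above, two evaluations of the same input will generally give slightly different outputs (a minimal sketch, reusing the model defined above):

from torch import Tensor

x = Tensor([[0.1, 0.2, 0.3, 0.4]])

# Repeated forward passes differ because of the short-term non-idealities
# specified in rpu_config.forward.
out_first = model(x)
out_second = model(x)
print(out_first - out_second)  # generally non-zero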

For details on the implementation of our inference noise model, please consult PCMLikeNoiseModel. In particular, we use a SinglePairConductanceConverter to convert weights into conductance pairs and then apply the noise to both conductances of each pair. More elaborate mapping schemes can be incorporated by extending BaseConductanceConverter.

References

aihwkit design

aihwkit architecture

The library is composed of several layers:


PyTorch layer

The PyTorch layer is the high-level layer that provides primitives for using the features of the library from PyTorch, in particular layers and optimizers.

Overall, the elements in this layer take advantage of PyTorch facilities (inheriting from existing PyTorch classes and integrating with the rest of the PyTorch features), replacing the default functionality with calls to a tile object from the simulator abstraction layer.

Relevant modules:

Python simulator abstraction layer

This layer provides a series of Python objects that can be transparently manipulated and used like any other Python functionality, without requiring explicit references to the lower-level constructs. Providing this separate Python interface gives us greater flexibility when defining it, keeping all the extra operations and calls to the real bindings internal and performing any translations on behalf of the user.

The main purpose of this layer is to abstract away the implementation-specific complexities of the simulator layers, and map the structures and classes into an interface that caters to the needs of the PyTorch layer. This also provides benefits with regard to serialization and separation of concerns.

Relevant modules:

Pybind Python layer

This layer is the bridge between C++ and Python. The Python classes and functions in this layer are built using Pybind, and in general consist of exposing selected classes and methods from the C++ simulator, handling the conversion between specific types.

As a result, using the classes from this layer is very similar to how the C++ classes would be used. This is purposeful: by keeping the mapping close to 1:1 in this layer, we (and users experimenting directly with the simulator) benefit from being able to translate code almost directly. However, in general, users are encouraged not to use the objects from this layer directly, as they involve extra overhead and precautions that are otherwise managed by the upper layers.

C++ layer (rpucuda)

Ultimately, this is the layer where the real operations over tiles take place, and the one that implements the actual simulation and most of the features. It is not directly accessible from Python; however, it can be used directly from other C++ programs by using the provided headers.

Layer interaction example

For example, using this excerpt of code:

1  model = AnalogLinear(2, 1)
2  opt = AnalogSGD(model.parameters(), lr=0.5)
3  ...
4
5  for epoch in range(100):
6      pred = model(x_b)
7      loss = mse_loss(pred, y_b)
8      loss.backward()
9      opt.step()

  1. The AnalogLinear constructor (line 1) will:

    • create an aihwkit.simulator.tiles.FloatingPointTile. As no extra arguments are passed to the constructor, it will also create, by default, a FloatingPointResistiveDevice that uses the default FloatingPointResistiveDeviceParameters. These three objects are the ones from the pure-Python layer.

    • internally, the aihwkit.simulator.tiles.FloatingPointTile constructor will create an aihwkit.simulator.rpu_base.tiles.FloatingPointTile instance, along with other objects. These objects are not exposed to the PyTorch layer, and are the ones from the Pybind bindings layer at aihwkit.simulator.rpu_base.

    • instantiating the bindings classes will create the C++ objects internally.

  2. The AnalogSGD constructor (line 2) will:

    • set up the optimizer, using the attributes of the AnalogLinear layer in order to identify which Parameters are to be handled differently during the optimization.

  3. During the training loop (lines 6-8), the forward and backward steps will be performed in the analog tile:

    • for the AnalogLinear layer, PyTorch will call the function defined at aihwkit.nn.functions.AnalogFunction.

    • these functions will call the forward() and backward() functions defined in the aihwkit.simulator.tiles.FloatingPointTile of the layer.

    • in turn, they will delegate to the forward() and backward() functions defined in the bindings, which in turn delegate to the C++ methods.

  4. The optimizer (line 9) will perform the update step in the analog tile:

    • using the information constructed during its initialization, the AnalogSGD will retrieve the reference to the aihwkit.simulator.tiles.FloatingPointTile, calling its update() function.

    • in turn, it will delegate to the update() function defined in the bindings object, which in turn delegates to the C++ method.

Development setup

This section is a complement to the Advanced installation guide section, with the goal of setting up a development environment and a development version of the package.

For convenience, we suggest creating a virtual environment as a way to isolate your development environment:

$ python3 -m venv aihwkit_env
$ cd aihwkit_env
$ source bin/activate
(aihwkit_env) $

Downloading the source

The first step is downloading the source of the library:

(aihwkit_env) $ git clone https://github.com/IBM/aihwkit.git
(aihwkit_env) $ cd aihwkit

Note

The following sections assume that the command line examples are executed in the activated aihwkit_env environment, and from the folder where the sources have been cloned.

Compiling the library for development

After installing the requirements listed in the Advanced installation guide section, the shared library can be compiled using the following convenience command:

$ python setup.py build_ext --inplace

This will produce a shared library under the src/aihwkit/simulator directory, without installing the package.

As an alternative, you can use cmake directly for finer control over the compilation and for easier debugging potential issues:

$ mkdir build
$ cd build
build$ cmake ..
build$ make

Note that the build system uses a temporary _skbuild folder for caching some steps of the compilation. While this is useful when making changes to the source code, in some cases environment changes (such as installing a new version of the dependencies, or switching the compiler) are not picked up correctly and the output of the compilation can be different than expected if the folder is present.

If the compilation was not successful, it is recommended to manually remove the _skbuild folder and re-run the compilation from a clean state via:

$ make clean

Using the compiled version of the library

Once the library is compiled, the shared library will be created under the src/aihwkit/simulator directory. By default, this folder is not in the path that Python uses for finding modules: it needs to be added to the PYTHONPATH accordingly by either:

  1. Updating the environment variable for the session:

    $ export PYTHONPATH=src/
    
  2. Prepending PYTHONPATH=src/ to the commands where the library needs to be found:

    $ PYTHONPATH=src/ python examples/01_simple_layer.py
    

Note

Please be aware that, if the PYTHONPATH is not modified and there is a version of aihwkit installed via pip, by default Python will use the installed version, as opposed to the custom-compiled version. It is recommended to remove the pip-installed version via:

$ pip uninstall aihwkit

when developing the library, in order to minimize the risk of confusion.

Compilation flags

There are several cmake options that can be used for customizing the compilation process:

Flag                      Description                                 Default
USE_CUDA                  Build with CUDA support                     OFF
BUILD_TEST                Build the C++ test binaries                 OFF
RPU_BLAS                  BLAS backend of choice (OpenBLAS or MKL)    OpenBLAS
RPU_USE_FASTMOD           Use fast mod                                ON
RPU_USE_FASTRAND          Use fastrand                                OFF
RPU_CUDA_ARCHITECTURES    Target CUDA architectures                   60

The options can be passed either to setuptools or to cmake directly. For example, for compiling and installing with CUDA support:

$ python setup.py build_ext --inplace -DUSE_CUDA=ON -DRPU_CUDA_ARCHITECTURES="60;70"

or if using cmake directly:

build$ cmake -DUSE_CUDA=ON -DRPU_CUDA_ARCHITECTURES="60;70" ..

Passing other cmake flags

In the same way flags specific to this project can be passed to setup.py, other generic cmake flags can be passed as well. For example, for setting the compiler to clang on OSX systems:

$ python setup.py build_ext --inplace -DCMAKE_C_COMPILER=/usr/bin/clang -DCMAKE_CXX_COMPILER=/usr/bin/clang++

Environment variables

The following environment variables are taken into account during the build process:

Environment variable       Description
TORCH_VERSION_SPECIFIER    If present, sets the PyTorch dependency version in the built Python package

Development conventions

aihwkit is an open source project. This section describes how we organize the work and the conventions and procedures we use for developing the library.

Code conventions

In order to keep the codebase consistent and assist us in spotting bugs and issues, we use different tools:

  • Python:

    • pycodestyle: for ensuring that we conform to PEP-8, as the minimal common style standard.

    • pylint: for being able to identify common pitfalls and potential issues in the code, along with additional style conventions.

    • mypy: for taking advantage of type hints, helping to identify issues before runtime and to ease maintenance.

  • C++:

    • clang-format: for providing a unified style to the C++ sources. Note that different versions result in slightly different output - please use the 10.x versions.

  • Testing:

    • pytest: while we strive for keeping the project tests stdlib compatible, we encourage using pytest as the test runner for its advanced features.

For convenience, a Makefile is provided in the project, in order to invoke the different tools easily. For example:

make pycodestyle
make pylint
make mypy
make clang-format

Continuous integration

The project uses continuous integration: when a new pull request is made or updated, the different tools and the tests are automatically run under different environments (different Python versions and operating systems).

We rely on the result of those checks to help review pull requests: when contributing, please make sure to review the result of the continuous integration in order to help fix potential issues.

Testing

The following environment variables can be used for selecting the subset of tests to be performed:

  • TEST_DATASET: if set, the tests that involve using datasets will be executed (involves downloading the datasets).

  • TEST_CREATE: if set, the tests that involve remote experiment creation will be executed.

Branches and releases

For the branches organization:

  • the master branch contains the latest changes and updates. We strive for keeping the branch runnable and working, but its contents can be considered experimental and “bleeding edge”.

When the time for a new release comes:

  • a new git tag is created. This tag can be used for referencing to that stable version of the codebase.

  • a new package is published on PyPI.

This package uses semantic versioning for the version numbers, albeit with an extra leading part as we are under beta. For a version number 0.MAJOR.MINOR, we strive to:

  1. increase the MAJOR number when we make incompatible API changes.

  2. increase the MINOR number when we add functionality in a backwards-compatible manner, or make backwards-compatible bug fixes.

Please be aware that during the initial development rounds, there are cases where we might not be able to adhere fully to the convention.

Project roadmap

You are one of the early users of the IBM Analog Hardware Acceleration Kit. The initial releases have been focused on providing a basic PyTorch integration for exploring selected features of the analog devices simulator, and on setting the basis that will be extended and improved upon:

  • integration of more simulator features in the PyTorch interface

  • tools to improve inference accuracy by converting pre-trained models with hardware-aware training

  • algorithmic tools to improve training accuracy by compensating for material shortcomings

  • additional analog neural network layers

  • additional analog optimizers

  • custom network architectures and dataset/model zoos

  • integration with the cloud

  • hardware demonstrators

This document will be updated with more details as the roadmap for the project evolves. As a companion, please refer to the Issues tab in the repository for more in-depth details about the status of the implementation of the different features and a sneak peek into the next release.

We have an ambitious plan to incrementally bring new simulation and hardware features to our users, but we are eager to hear your feedback on the features of value for your work. Please contact us at aihwkit@us.ibm.com for any feedback or information.

Changelog

All notable changes to this project will be documented in this file.

The format is based on Keep a Changelog, and this project adheres to Semantic Versioning:

  • Added for new features.

  • Changed for changes in existing functionality.

  • Deprecated for soon-to-be removed features.

  • Removed for now removed features.

  • Fixed for any bug fixes.

  • Security in case of vulnerabilities.

[0.5.1] - 2022/01/27

Added

  • Load model state dict into a new model with modified RPUConfig. (#276)

  • Visualization for noise models for analog inference hardware simulation. (#278)

  • State-independent inference noise model. (#284)

  • Transfer LR parameter for MixedPrecisionCompound. (#283)

  • The bias term can now be handled either by the analog or digital domain by controlling the digital_bias layer parameter. (#307)

  • PCM short-term weight noise. (#312)

  • IR-drop simulation across columns during analog mat-vec. (#312)

  • Transposed-read for TransferCompound. (#312)

  • BufferedTransferCompound and TTv2 presets. (#318)

  • Stochastic rounding for MixedPrecisionCompound. (#318)

  • Decay with arbitrary decay point (to reset bias). (#319)

  • Linear layer AnalogLinearMapped which maps a large weight matrix onto multiple analog tiles. (#320)

  • Convolution layers AnalogConvNdMapped which maps large weight matrix onto multiple tiles if necessary. (#331)

  • In the new mapping field of RPUConfig the max tile input and output sizes can be configured for the *Mapped layers. (#331)

  • Notebooks directory with several notebook examples (#333, #334)

  • Analog information summary function. (#316)

  • The alpha weight scaling factor can now be defined as learnable parameter by switching learn_out_scaling_alpha in the rpu_config.mapping parameters. (#353)

Fixed

  • Removed GPU warning during destruction when using multiple GPUs. (#277)

  • Fixed issue in transfer counter for mixed precision in case of GPU. (#283)

  • Map location keyword for load / save observed. (#293)

  • Fixed issue with CUDA buffer allocation when batch size changed. (#294)

  • Fixed missing load statedict for AnalogSequential. (#295)

  • Fixed issue with hierarchical hidden parameter settings. (#313)

  • Fixed serious issue that loaded model would not update analog gradients. (#320)

  • Fixed cuda import in examples. (#320)

Changed

  • The inference noise models are now located in aihwkit.inference. (#281)

  • Analog state dict structure has changed (shared weights are not saved). (#293)

  • Some of the parameter names of the TransferCompound have changed. (#312)

  • New fast learning rate parameter for TransferCompound; the SGD learning rate is then applied to the slow matrix. (#312)

  • The fixed_value of WeightClipParameter is now applied for all clipping types if set larger than zero. (#318)

  • The use of generators for analog tiles of an AnalogModuleBase. (#320)

  • Digital bias is now accessible through MappingParameter. (#331)

  • The aihwkit documentation, with new content around analog AI concepts, training presets, analog AI optimizers, new references, and examples. (#348)

  • The weight_scaling_omega can now be defined in the rpu_config.mapping. (#353)

Deprecated

  • The module aihwkit.simulator.noise_models has been deprecated in favor of aihwkit.inference. (#281)

0.4.0 - 2021/06/25

Added

  • A number of new config presets added to the library, namely EcRamMOPreset, EcRamMO2Preset, EcRamMO4Preset, TikiTakaEcRamMOPreset, MixedPrecisionEcRamMOPreset. These can be used for tile configuration (rpu_config). They specify a particular device and optimizer choice. (#207)

  • Weight refresh mechanism for OneSidedUnitCell to counteract saturation, by differential read, reset, and re-write. (#209)

  • Complex cycle-to-cycle noise for ExpStepDevice. (#226)

  • Added the following presets: PCMPresetDevice (uni-directional), PCMPresetUnitCell (a pair of uni-directional devices with periodical refresh) and a MixedPrecisionPCMPreset for using the mixed precision optimizer with a PCM pair. (#226)

  • AnalogLinear layer now accepts multi-dimensional inputs in the same way as PyTorch’s Linear layer does. (#227)

  • A new AnalogLSTM module: a recurrent neural network that uses AnalogLinear. (#240)

  • Return of weight gradients for InferenceTile (only), so that the gradient can be handled with any PyTorch optimizer. (#241)

  • Added a generic analog optimizer AnalogOptimizer that allows extending any existing optimizer with analog-specific features. (#242)

  • Conversion tools for converting torch models into a model having analog layers. (#265)

Changed

  • Renamed the DifferenceUnitCell to OneSidedUnitCell which more properly reflects its function. (#209)

  • The BaseTile subclass that is instantiated in the analog layers is now retrieved from the new RPUConfig.tile_class attribute, facilitating the use of custom tiles. (#218)

  • The default parameter for the dataset constructor used by BasicTraining is now the train=bool argument. If using a dataset that requires other arguments or transforms, they can now be specified via overriding get_dataset_arguments() and get_dataset_transform(). (#225)

  • AnalogContext is introduced, along with tile registration function to handle arbitrary optimizers, so that re-grouping param groups becomes unnecessary. (#241)

  • The AnalogSGD optimizer is now implemented based on the generic analog optimizer, and its base module is aihwkit.optim.analog_optimizer. (#242)

  • The default refresh rate is changed to once per mini-batch for PCMPreset (as opposed to once per mat-vec). (#243)

Deprecated

  • Deprecated the CudaAnalogTile, CudaInferenceTile and CudaFloatingPointTile. Now the AnalogTile can be either on CUDA or on CPU (determined by the tile and the device attribute), similar to a torch Tensor. In particular, calling cuda() does not change the AnalogTile to a CudaAnalogTile anymore, but only changes the instance in the tile field, which makes in-place calls to cuda() possible. (#257)

Removed

  • Removed weight and bias of analog layers from the module parameters as these parameters are handled internally for analog tiles. (#241)

Fixed

  • Fixed autograd functionality for recurrent neural networks. (#240)

  • N-D support for AnalogLinear. (#227)

  • Fixed an issue in the Experiments that was causing the epoch training loss to be higher than the epoch validation loss. (#238)

  • Fixed “Wrong device ordinal” errors for CUDA which resulted from a known issue of using CUB together with pytorch. (#250)

  • Renamed persistent weight hidden parameter field to persistent_weights. (#251)

  • Analog tiles now always move correctly to CUDA when model.cuda() or model.to(device) is used. (#252, #257)

  • Added an error message when wrong tile class is used for loading an analog state dict. (#262)

  • Fixed MixedPrecisionCompound being bypassed with floating point compute. (#263)

0.3.0 - 2021/04/14

Added

  • New analog devices:

    • A new abstract device (MixedPrecisionCompound) implementing an SGD optimizer that computes the rank update in digital (assuming digital high precision storage) and then transfers the matrix sequentially to the analog device, instead of using the default fully parallel pulsed update. (#159)

    • A new device model class PowStepDevice that implements a power-exponent type of non-linearity based on the Fusi & Abott synapse model. (#192)

    • New parameterization of the SoftBoundsDevice, called SoftBoundsPmaxDevice. (#191)

  • Analog devices and tiles improvements:

    • Option to choose deterministic pulse trains for the rank-1 update of analog devices during training. (#99)

    • More noise types for hardware-aware training for inference (polynomial). (#99)

    • Additional bound management schemes (worst case, average max, shift). (#99)

    • Cycle-to-cycle output referred analog multiply-and-accumulate weight noise that resembles the conductance dependent PCM read noise statistics. (#99)

    • C++ backend improvements (slice backward/forward/update, direct update). (#99)

    • Option to exclude the bias row from hardware-aware training noise. (#99)

    • Option to automatically scale the digital weights into the full range of the simulated crossbar by applying a fixed output global factor in digital. (#129)

    • Optional power-law drift during analog training. (#158)

    • Cleaner setting of dw_min using device granularity. (#200)

  • PyTorch interface improvements:

    • Two new convolution layers have been added: AnalogConv1d and AnalogConv3d, mimicking their digital counterparts. (#102, #103)

    • The .to() method can now be used in AnalogSequential, along with .cpu() methods in analog layers (albeit GPU to CPU is still not possible). (#142, #149)

  • New modules added:

    • A library of device presets that are calibrated to real hardware data, namely ReRamESPresetDevice, ReRamSBPresetDevice, ECRamPresetDevice, CapacitorPresetDevice, and device presets that are based on models in the literature, e.g. GokmenVlasovPresetDevice and IdealizedPresetDevice. They can be used defining the device field in the RPUConfig. (#144)

    • A library of config presets, such as ReRamESPreset, Capacitor2Preset, TikiTakaReRamESPreset, and many more. These can be used for tile configuration (rpu_config). They specify a particular device and optimizer choice. (#144)

    • Utilities for visualizing the pulse response properties of a given device configuration. (#146)

    • A new aihwkit.experiments module has been added that allows creating and running specific high-level use cases (for example, neural network training) conveniently. (#171, #172)

    • A CloudRunner class has been added that allows executing experiments in the cloud. (#184)

Changed
  • The minimal PyTorch version has been bumped to 1.7+. Please recompile your library and update the dependencies accordingly. (#176)

  • Default value for TransferCompound for transfer_every=0 (#174).

Fixed
  • Issue of number of loop estimations for realistic reads. (#192)

  • Fixed small issues that resulted in warnings for windows compilation. (#99)

  • Faulty backward noise management error message removed for perfect backward and CUDA. (#99)

  • Fixed segfault when using diffusion or reset with vector unit cells for CUDA. (#129)

  • Fixed random states mismatch in IoManager that could cause crashes for CUDA in cases with the same network size and batch size, in particular for TransferCompound. (#132)

  • Fixed wrong update for TransferCompound in case of transfer_every smaller than the batch size. (#132, #174)

  • Period in the modulus of TransferCompound could become zero which caused a floating point exception. (#174)

  • Ceil instead of round for very small transfers in TransferCompound (to avoid zero transfer for extreme settings). (#174)

Removed
  • The legacy NumpyAnalogTile and NumpyFloatingPointTile tiles have been finally removed. The regular, tensor-powered aihwkit.simulator.tiles tiles contain all their functionality and numerous additions. (#122)

0.2.1 - 2020/11/26

  • The rpu_config is now pretty-printed in a readable manner (excluding the default settings and other readability tweaks). (#60)

  • Added a new ReferenceUnitCell which has two devices, where one is fixed and the other is updated, and the effective weight is computed as the difference between the two. (#61)

  • VectorUnitCell now accepts arbitrary weighting schemes that can be user-defined by using a new gamma_vec property that specifies how to combine the unit cell devices to form the effective weight. (#61)

Changed

  • The unit cell items in aihwkit.simulator.configs have been renamed, removing their Device suffix, for having a more consistent naming scheme. (#57)

  • The Exceptions raised by the library have been revised, making use in some cases of the ones introduced in a new aihwkit.exceptions module. (#49)

  • Some VectorUnitCell properties have been renamed and extended with an update policy specifying how to select the hidden devices. (#61)

  • The pybind11 version required has been bumped to 2.6.0, which can be installed from pip and makes system-wide installation no longer required. Please update your pybind11 accordingly for compiling the library. (#44)

Removed

  • The BackwardIOParameters specialization has been removed, as bound management is now automatically ignored for the backward pass. Please use the more general IOParameters instead. (#45)

Fixed

  • Serialization of Modules that contain children analog layers is now possible, both when using containers such as Sequential and when using analog layers as custom Module attributes. (#74, #80)

  • The build system has been improved, with experimental Windows support and supporting using CUDA 11 correctly. (#58, #67, #68)

0.2.0 - 2020/10/20

Added

  • Added more types of resistive devices: IdealResistiveDevice, LinearStep, SoftBounds, ExpStep, VectorUnitCell, TransferCompoundDevice, DifferenceUnitCell. (#14)

  • Added a new InferenceTile that supports basic hardware-aware training and inference using a statistical noise model that was fitted to measurements from real PCM devices. (#25)

  • Added a new AnalogSequential layer that can be used in place of Sequential for easier operation on children analog layers. (#34)

Changed

  • Specifying the tile configuration (resistive device and the rest of the properties) is now based on a new RPUConfig family of classes, that is passed as a rpu_config argument instead of resistive_device to Tiles and Layers. Please check the aihwkit.simulator.config module for more details. (#23)

  • The different analog tiles are now organized into a aihwkit.simulator.tiles package. The internal IndexedTiles have been removed, and the rest of previous top-level imports have been kept. (#29)

Fixed

  • Improved package compatibility when using non-UTF8 encodings (version file, package description). (#13)

  • The build system can now detect and use openblas directly when using the conda-installable version. (#22)

  • When using analog layers as children of another module, the tiles are now correctly moved to CUDA if using AnalogSequential (or by the optimizer if using regular torch container modules). (#34)

0.1.0 - 2020/09/17

Added

  • Initial public release.

  • Added rpucuda C++ simulator, exposed through a pybind interface.

  • Added a PyTorch AnalogLinear neural network model.

  • Added a PyTorch AnalogConv2d neural network model.

API Reference

aihwkit

Analog hardware library for PyTorch.

aihwkit.cloud

Functionality related to the cloud client for AIHW Composer API.

aihwkit.cloud.client

Client for connecting to the the AIHW Composer API.

aihwkit.cloud.converter

Conversion utilities for interacting with the AIHW Composer API.

aihwkit.exceptions

Custom Exceptions for aihwkit.

aihwkit.experiments

High-level interface for executing Experiments.

aihwkit.experiments.experiments

Experiments for aihwkit.

aihwkit.experiments.runners

Experiment Runners for aihwkit.

aihwkit.inference

High level inference tools.

aihwkit.inference.compensation

Compensation methods such as drift compensation during analog inference.

aihwkit.inference.converter

Converter of weight matrix values into conductance values and back for analog inference.

aihwkit.inference.noise

Noise models to apply to converted weight values during analog inference.

aihwkit.simulator

RPU simulator bindings.

aihwkit.simulator.configs

Configurations for resistive processing units.

aihwkit.simulator.presets

Configurations presets for resistive processing units.

aihwkit.simulator.tiles

High level analog tiles.

aihwkit.nn

Neural network modules.

aihwkit.nn.conversion

Digital/analog model conversion utilities.

aihwkit.nn.functions

Autograd functions for aihwkit.

aihwkit.nn.modules

Neural network modules.

aihwkit.nn.modules.base

Base class for analog Modules.

aihwkit.nn.modules.container

Analog Modules that contain children Modules.

aihwkit.nn.modules.conv

Convolution layers.

aihwkit.nn.modules.linear

Analog layers.

aihwkit.nn.modules.lstm

Analog LSTM layers.

aihwkit.optim

Analog Optimizers.

aihwkit.optim.analog_optimizer

Analog-aware inference optimizer.

aihwkit.optim.context

Parameter context for analog tiles.

aihwkit.utils

Utilities and helpers for aihwkit.

aihwkit.utils.visualization

Visualization utilities.

aihwkit.version

Package version string.

IBM Analog Hardware Acceleration Kit is an open source Python toolkit for exploring and using the capabilities of in-memory computing devices in the context of artificial intelligence.

Components

The toolkit consists of two main components:

PyTorch integration

A series of primitives and features that allow using the toolkit within PyTorch:

  • Analog neural network modules (fully connected layer, 1d/2d/3d convolution layers, sequential container).

  • Analog training using torch training workflow:

    • Analog torch optimizers (SGD).

    • Analog in-situ training using customizable device models and algorithms (Tiki-Taka).

  • Analog inference using torch inference workflow:

    • State-of-the-art statistical model of a phase-change memory (PCM) array calibrated on hardware measurements from a 1 million PCM devices chip.

    • Hardware-aware training with hardware non-idealities and noise included in the forward pass.

Analog devices simulator

A high-performance (CUDA-capable) C++ simulator that allows for simulating a wide range of analog devices and crossbar configurations by using abstract functional models of material characteristics with adjustable parameters. Features include:

  • Forward pass output-referred noise and device fluctuations, as well as adjustable ADC and DAC discretization and bounds

  • Stochastic update pulse trains for rows and columns with finite weight update size per pulse coincidence

  • Device-to-device systematic variations, cycle-to-cycle noise and adjustable asymmetry during analog update

  • Adjustable device behavior for exploration of material specifications for training and inference

  • State-of-the-art dynamic input scaling, bound management, and update management schemes

Other features

Along with the two main components, the toolkit includes other functionality:

  • A library of device presets that are calibrated to real hardware data and device presets that are based on models in the literature, along with config presets that specify a particular device and optimizer choice.

  • A module for executing high-level use cases (“experiments”), such as neural network training with minimal code overhead.

  • Integration with the AIHW Composer platform that allows executing experiments in the cloud.

Warning

This library is currently in beta and under active development. Please be mindful of potential issues, and keep an eye out for improvements, new features and bug fixes in upcoming versions.

Example

from torch import Tensor
from torch.nn.functional import mse_loss

from aihwkit.nn import AnalogLinear
from aihwkit.optim import AnalogSGD

x = Tensor([[0.1, 0.2, 0.4, 0.3], [0.2, 0.1, 0.1, 0.3]])
y = Tensor([[1.0, 0.5], [0.7, 0.3]])

# Define a network using a single Analog layer.
model = AnalogLinear(4, 2)

# Use the analog-aware stochastic gradient descent optimizer.
opt = AnalogSGD(model.parameters(), lr=0.1)
opt.regroup_param_groups(model)

# Train the network.
for epoch in range(10):
    pred = model(x)
    loss = mse_loss(pred, y)
    loss.backward()

    opt.step()
    print('Loss error: {:.16f}'.format(loss))

How to cite

In case you are using the IBM Analog Hardware Acceleration Kit for your research, please cite the AICAS21 paper that describes the toolkit:

Note

Malte J. Rasch, Diego Moreda, Tayfun Gokmen, Manuel Le Gallo, Fabio Carta, Cindy Goldberg, Kaoutar El Maghraoui, Abu Sebastian, Vijay Narayanan. “A flexible and fast PyTorch toolkit for simulating training and inference on analog crossbar arrays”, 2021 IEEE 3rd International Conference on Artificial Intelligence Circuits and Systems

https://arxiv.org/abs/2104.02184