Composer CLI
Since version 0.3, the toolkit includes support for running Experiments.
An Experiment represents a high-level use case, such as training a neural
network, in a compact form that makes it easy to run the experiment, and
variations of it, both locally and in the cloud.
Experiments
The following types of Experiments are available:
Experiment class | Description
---|---
BasicTraining | Simple training of a neural network
BasicInferencing | Simple inference of a neural network
Creating an Experiment
A BasicTraining Experiment can be created just by creating an instance of its class:
from torchvision.datasets import FashionMNIST
from torch.nn import Flatten, LogSoftmax, Sigmoid
from aihwkit.nn import AnalogLinear, AnalogSequential
from aihwkit.experiments import BasicTraining
my_experiment = BasicTraining(
    dataset=FashionMNIST,
    model=AnalogSequential(
        Flatten(),
        AnalogLinear(784, 256, bias=True),
        Sigmoid(),
        AnalogLinear(256, 128, bias=True),
        Sigmoid(),
        AnalogLinear(128, 10, bias=True),
        LogSoftmax(dim=1)
    )
)
Similarly, a BasicInferencing Experiment can be created by creating an instance of its class:
from torch.nn import (
    Flatten, LogSoftmax, MaxPool2d, Module, Tanh
)
from torchvision.datasets import FashionMNIST
from aihwkit.nn import AnalogConv2dMapped, AnalogLinearMapped, AnalogSequential
from aihwkit.experiments.experiments.inferencing import BasicInferencing

DATASET = FashionMNIST
# Helper that builds an analog LeNet5 model (defined in the example notebooks).
MODEL = create_analog_lenet5_network()
BATCH_SIZE = 8
REPEATS = 2
I_TIMES = 86400
TEMPLATE_ID = 'hwa-trained-lenet5-mapped'
my_experiment = BasicInferencing(
    dataset=DATASET,
    model=MODEL,
    batch_size=BATCH_SIZE,
    weight_template_id=TEMPLATE_ID,
    inference_repeats=REPEATS,
    inference_time=I_TIMES
)
Each Experiment has its own attributes, providing sensible defaults as needed.
For example, the BasicTraining Experiment allows setting attributes that
define the characteristics of the training (dataset, model, batch_size,
loss_function, epochs, learning_rate).
Similarly, the BasicInferencing Experiment allows setting attributes
that define the characteristics of the Inferencing experiment (dataset,
model, batch_size, inference_repeats, inference_time).
The created Experiment contains the definition of the operation to be performed,
but is not executed automatically. That is the role of the Runners.
Runners
A Runner is the object that controls the execution of an Experiment, setting up the environment and providing a convenient way of starting it and retrieving its results.
The following types of Runners are available:
Runner class | Description
---|---
LocalRunner | Runner for executing training experiments locally
CloudRunner | Runner for executing training experiments in the cloud
InferenceLocalRunner | Runner for executing inference experiments locally
InferenceCloudRunner | Runner for executing inference experiments in the cloud
Running an Experiment Locally
In order to run an Experiment, the first step is creating the appropriate
runner. For executing a training experiment locally, we create a LocalRunner:
from aihwkit.experiments.runners import LocalRunner
my_runner = LocalRunner()
Similarly, for executing an Inferencing Experiment locally, we create an InferenceLocalRunner:
from aihwkit.experiments.runners import InferenceLocalRunner
my_runner = InferenceLocalRunner()
Note
Each runner has different configuration options depending on its type.
For example, the LocalRunner has an option for setting the device on which
the model will be executed, which can be used to run on a GPU:
from torch import device as torch_device
my_runner = LocalRunner(device=torch_device('cuda'))
Similarly, the InferenceLocalRunner also has an option for setting the device
used for inference, for example to use an available GPU:
from torch import device as torch_device
my_runner = InferenceLocalRunner(device=torch_device('cuda'))
Once the runner is created, for either a Training or an Inferencing
experiment, the Experiment can be executed via:
result = my_runner.run(my_experiment)
This will start the desired experiment and return its results. In the training case, the result is a list of dictionaries containing the metrics for each epoch:
print(result)
[{
'epoch': 0,
'accuracy': 0.8289,
'train_loss': 0.4497026850991666,
'valid_loss': 0.07776954893999771
},
{
'epoch': 1,
'accuracy': 0.8299,
'train_loss': 0.43052176381352103,
'valid_loss': 0.07716381718227858
},
{
'epoch': 2,
'accuracy': 0.8392,
'train_loss': 0.41551961805393445,
'valid_loss': 0.07490375201140385
},
...
]
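Since the result is a plain list of per-epoch dictionaries, it can be post-processed with ordinary Python. As a sketch (using the sample metrics above, rounded for brevity), the epoch with the highest accuracy can be picked like this:

```python
# Per-epoch metrics in the same shape as the runner's output above.
result = [
    {'epoch': 0, 'accuracy': 0.8289, 'train_loss': 0.4497, 'valid_loss': 0.0778},
    {'epoch': 1, 'accuracy': 0.8299, 'train_loss': 0.4305, 'valid_loss': 0.0772},
    {'epoch': 2, 'accuracy': 0.8392, 'train_loss': 0.4155, 'valid_loss': 0.0749},
]

# Select the entry with the highest accuracy.
best = max(result, key=lambda entry: entry['accuracy'])
print(f"best epoch: {best['epoch']} (accuracy {best['accuracy']:.4f})")
# best epoch: 2 (accuracy 0.8392)
```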
Both the LocalRunner (for Training experiments) and the InferenceLocalRunner
(for Inferencing experiments) will also print information by default while the
experiment is being executed (for example, as a way of tracking progress when
running the experiment in an interactive session). This can be turned off via
the stdout argument to the run() function:
result = my_runner.run(my_experiment, stdout=False)
Note
The local runner for both Training and Inferencing experiments
will automatically attempt to download the dataset into a temporary
folder if it is FashionMNIST or SVHN. For other datasets,
please ensure that the dataset has been downloaded previously, using the
dataset_root argument to indicate the location of the data files:
result = my_runner.run(my_experiment, dataset_root='/some/path')
Cloud Runner
Experiments can also be run in the cloud at our companion AIHW Composer application, which allows executing the experiments remotely using hardware acceleration and inspecting the experiments and their results visually, among other features.
Setting up your account
The integration is provided by a Python client included in aihwkit that
allows connecting to the AIHW Composer platform. In order to be able to
run experiments in the cloud:
1. Register in the platform and generate an API token in your user page. This token acts as the credentials for connecting with the application.
2. Store your credentials by creating a ~/.config/aihwkit.conf file with the following contents, replacing YOUR_API_TOKEN with the string from the previous step:

[cloud]
api_token = YOUR_API_TOKEN

3. You may need to download the SSL certificates and add them to the certificate store:
https://cacerts.digicert.com/DigiCertTLSRSASHA2562020CA1-1.crt.pem
Append the certificates to the cacert.pem file.
Note
You can run the following command to find the location of the cacert.pem file:
$ python -c "import certifi; print(certifi.where())"
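To double-check that the credentials file is well formed, it can be parsed with the standard configparser module. This is just a hypothetical sanity check, not part of the aihwkit API; the sample below uses an in-memory string in the same shape as ~/.config/aihwkit.conf:

```python
import configparser

# Contents in the same shape as ~/.config/aihwkit.conf (the token is a placeholder).
sample = """
[cloud]
api_token = YOUR_API_TOKEN
"""

config = configparser.ConfigParser()
config.read_string(sample)

# A well-formed file has a [cloud] section with an api_token key.
print(config['cloud']['api_token'])
# YOUR_API_TOKEN
```

To check the real file, replace read_string() with config.read() pointing at the path of your configuration file.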
Running an Experiment in the cloud
Once your credentials are configured, training experiments can be run in the
cloud by using the CloudRunner, in an analogous way to running
experiments locally:
from aihwkit.experiments.runners import CloudRunner
my_cloud_runner = CloudRunner()
cloud_experiment = my_cloud_runner.run(my_experiment)
Similarly, an Inferencing experiment can also be performed in the cloud by using
the InferenceCloudRunner, in an analogous way to running experiments locally:
from aihwkit.experiments.runners import InferenceCloudRunner
cloud_runner = InferenceCloudRunner()
cloud_experiment = cloud_runner.run(my_experiment, analog_info,
noise_model_info, name=NAME, device='gpu')
Instead of waiting for the experiment to be completed, the run()
method
returns an object that represents a job in the cloud. As such, it has several
convenience methods:
Checking the status of a cloud experiment
The status of a cloud experiment, for both Training and Inferencing
experiments, can be retrieved via:
cloud_experiment.status()
The response will provide information about the cloud experiment:
- WAITING: the experiment is waiting to be processed.
- RUNNING: the experiment is being executed in the cloud.
- COMPLETED: the experiment was executed successfully.
- FAILED: there was an error during the execution of the experiment.
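One way to wait for a terminal state is a small polling loop around the status call. The helper below is only a sketch: the wait_for_completion name and the stubbed status sequence are illustrative, not part of the toolkit.

```python
import time

# COMPLETED and FAILED are the terminal states described above.
TERMINAL_STATES = {'COMPLETED', 'FAILED'}

def wait_for_completion(get_status, poll_seconds=0.0, max_polls=100):
    """Poll get_status() until it returns a terminal state."""
    for _ in range(max_polls):
        state = get_status()
        if state in TERMINAL_STATES:
            return state
        time.sleep(poll_seconds)
    raise TimeoutError('experiment did not finish within the polling budget')

# Stand-in for the cloud experiment's status call: a canned sequence of states.
states = iter(['WAITING', 'RUNNING', 'RUNNING', 'COMPLETED'])
final = wait_for_completion(lambda: next(states))
print(final)
# COMPLETED
```

In real use, one would pass the cloud experiment's status method instead of the stub, together with a longer poll interval (for example, tens of seconds).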
Note
Some actions are only possible if the cloud experiment has finished successfully, for example, retrieving its results. Please also be mindful that some experiments can take a sizeable amount of time to execute, especially during the initial versions of the platform.
Retrieving the results of a cloud experiment
Once the cloud experiment (Training or Inferencing) completes its execution,
its results can be retrieved using:
result = cloud_experiment.get_result()
This will display the result of executing the experiment, in a similar form as the output of running an Experiment locally.
Retrieving the content of the experiment
The Experiment can be retrieved using:
experiment = cloud_experiment.get_experiment()
This will return a local Experiment (for example, a BasicTraining or
BasicInferencing) that can be used locally and whose properties can be
inspected. In particular, the weights of the model will reflect the results of
the experiment.
Retrieving a previous cloud experiment
The list of experiments previously executed in the cloud can be retrieved via:
cloud_experiments = my_cloud_runner.list_experiments()
Please see https://github.com/IBM/aihwkit/tree/master/notebooks/cli for the experiment example notebooks.