Supported tech & roadmap

nebulgym has just been launched, and it is already capable of cutting training time in half. At the same time, nebulgym may crash or fail in untested use cases. The project is in its early stages, and there is a lot of room for improvement before nebulgym can become a new paradigm for artificial intelligence training.
nebulgym aims to support every framework, every model, and every hardware platform, and to make the most of your hardware and software capabilities to train your model in a fraction of the time required now. In addition, nebulgym will always be extremely easy to use, to empower any developer to build powerful AI applications.
nebulgym already embeds many great technologies. Below you can find a list of the features already implemented and those that will be implemented soon. More specific tasks can be found in the list of issues on GitHub.
Any ideas about what could be implemented next? Would you like to contribute to this fantastic library? We welcome any ideas, questions, issues and pull requests! For more info go to Questions & contributions.

Supported frameworks

Supported backends

  • PyTorch. Default compiler for models trained in PyTorch.
  • Rammer. A DNN compiler design that optimizes the execution of DNN workloads on massively parallel accelerators. It generates an efficient static spatio-temporal schedule for a DNN at compile time to minimize scheduling overhead. It maximizes hardware utilization by holistically exploiting parallelism through inter- and intra-operator co-scheduling. Rammer achieves this by proposing several novel, hardware-neutral, and clean abstractions for the computation tasks and the hardware accelerators. These abstractions expose a much richer scheduling space to Rammer, which employs several heuristics to explore this space and find efficient schedules. Read more.
  • ONNX Runtime. Training API that leverages some techniques developed for inference optimization. It currently supports only NVIDIA GPUs.
Learn how to switch among nebulgym Supported backends.

Optimization techniques for data loading

  • Cached datasets. nebulgym changes the way data is loaded, with the goal of eliminating the time the processor spends idle while waiting for data to load. A default data loader reads the data from your storage, performs some user-defined preprocessing (e.g. converting the data to normalized tensors, removing biases, resizing images), and then transfers the data to the model. This process is repeated for every sample in every epoch. In the first epoch, the data loader introduced in nebulgym performs the same tasks (data loading and preprocessing) but also writes the preprocessed data (in parallel) to fast-access memory, which is usually an SSD if available. This slows down the first epoch slightly (~20% slower during testing), but from the second epoch onward preprocessing is not computed again, and data is transferred from fast-access memory to RAM (in parallel) to make maximum use of memory bandwidth. This speeds up all subsequent epochs and prevents data loading from becoming a bottleneck for the entire training process, which happens in many cases.
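
The caching idea described above can be illustrated with a minimal sketch in plain Python. This is not nebulgym's implementation: the `CachedDataset` class, the `normalize` preprocessing function and the on-disk pickle layout are all hypothetical names and choices made for illustration, and the parallel writes and reads mentioned above are left out to keep the example short.

```python
import os
import pickle
import tempfile


class CachedDataset:
    """Hypothetical wrapper: caches preprocessed samples on disk after first use."""

    def __init__(self, raw_samples, preprocess, cache_dir):
        self.raw_samples = raw_samples
        self.preprocess = preprocess
        self.cache_dir = cache_dir
        self.preprocess_calls = 0  # counts how often preprocessing really runs
        os.makedirs(cache_dir, exist_ok=True)

    def __len__(self):
        return len(self.raw_samples)

    def __getitem__(self, index):
        path = os.path.join(self.cache_dir, f"{index}.pkl")
        if os.path.exists(path):
            # Second epoch onward: load the preprocessed sample, skip preprocessing.
            with open(path, "rb") as f:
                return pickle.load(f)
        # First epoch: preprocess as usual, then store the result for later epochs.
        sample = self.preprocess(self.raw_samples[index])
        self.preprocess_calls += 1
        with open(path, "wb") as f:
            pickle.dump(sample, f)
        return sample


def normalize(values):
    """Stand-in for user-defined preprocessing: scale values by their maximum."""
    peak = max(values)
    return [v / peak for v in values]


def run_demo():
    """Iterate over a tiny dataset for three epochs and report preprocessing calls."""
    raw = [[1, 2, 4], [3, 6, 12], [5, 5, 10]]
    with tempfile.TemporaryDirectory() as cache_dir:
        dataset = CachedDataset(raw, normalize, cache_dir)
        epochs = [[dataset[i] for i in range(len(dataset))] for _ in range(3)]
        return epochs, dataset.preprocess_calls
```

In this sketch, preprocessing runs once per sample regardless of the number of epochs; the extra write in the first epoch is the source of the initial slowdown described above.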

Model optimization techniques

  • Selective-Backprop. Acceleration is achieved by prioritizing examples with high loss at each iteration. This means using the output of a training example’s forward pass to decide whether to use that example to compute gradients and update parameters, or to skip immediately to the next example. By reducing the number of computationally expensive back-propagation steps performed, nebulgym accelerates training. Further acceleration can be achieved by using stale forward pass results for selection, thus also skipping forward passes of low-priority examples.
  • Sparsified Back Propagation. A traditional neural network consumes a significant amount of computing resources during back propagation, and this simple yet effective technique alleviates the problem. nebulgym computes only a small subset of the full gradient to update the model parameters in back propagation. The gradient vectors are sparsified so that only the elements with the largest magnitude are kept. As a result, a smaller fraction of the weight matrix is modified, leading to a linear reduction in the computational cost. Read more.
  • Layer Replacement (open issue)
  • ModelReshaper (open issue)
  • Distributed training (open issue)
  • Forward gradients (open issue)
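
The selection steps behind Selective-Backprop and Sparsified Back Propagation can be sketched in a few lines of plain Python. This is an illustration under simplifying assumptions, not nebulgym's code: `select_high_loss` and `sparsify_top_k` are hypothetical helper names, the selection here is a deterministic "keep the top fraction by loss" rule (a simplification of loss-based prioritization), and gradients are plain lists rather than tensors.

```python
def select_high_loss(losses, keep_fraction):
    """Return the indices of the highest-loss examples, in ascending index order.

    Only the returned examples would go through the backward pass.
    """
    keep = max(1, round(len(losses) * keep_fraction))
    ranked = sorted(range(len(losses)), key=lambda i: losses[i], reverse=True)
    return sorted(ranked[:keep])


def sparsify_top_k(gradient, k):
    """Keep the k largest-magnitude entries of a gradient vector and zero the rest."""
    if k >= len(gradient):
        return list(gradient)
    ranked = sorted(range(len(gradient)), key=lambda i: abs(gradient[i]), reverse=True)
    keep = set(ranked[:k])
    return [g if i in keep else 0.0 for i, g in enumerate(gradient)]


# Per-example losses from a forward pass over a batch of five examples.
batch_losses = [0.1, 2.3, 0.05, 1.7, 0.4]
selected = select_high_loss(batch_losses, keep_fraction=0.4)

# A gradient vector in which only the two largest-magnitude entries are kept.
sparse_gradient = sparsify_top_k([0.5, -3.0, 0.1, 2.0], k=2)
```

Here `selected` contains the two examples with the largest loss, and `sparse_gradient` keeps only the entries `-3.0` and `2.0`. In a real training loop the same rules would be applied to framework tensors, and the savings come from skipping the work associated with the dropped examples or gradient entries.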

Library installation methods

  • From PyPI
  • Source code

Backend installation methods

  • From the backend source code
  • Automatic installation with an auto-installer (open issue)