Nebullvm API

Optimize model API

The optimize_model function optimizes a model from one of the supported frameworks and returns an optimized model that can be used with the same interface as the original model.
def optimize_model(
    model: Any,
    input_data: Union[Iterable, Sequence],
    metric_drop_ths: Optional[float] = None,
    metric: Union[str, Callable[..., Any], None] = None,
    optimization_time: str = "constrained",
    dynamic_info: Optional[dict] = None,
    config_file: Optional[str] = None,
    ignore_compilers: Optional[List[str]] = None,
    ignore_compressors: Optional[List[str]] = None,
    store_latencies: bool = False,
    device: Optional[str] = None,
    **kwargs: Any
) -> Any


model: Any
The input model. It can belong to one of the following frameworks: PyTorch, TensorFlow, ONNX, or Hugging Face. For ONNX, it is a string (the path to the saved ONNX model); in the other cases it is a torch.nn.Module or a tf.Module.
input_data: Iterable or Sequence
Input data to be used for model optimization, consisting of one or more data samples. If optimization_time is set to "unconstrained", it is preferable to provide at least 100 data samples so that nebullvm can also activate the techniques that require data (pruning, etc.). The data can be passed either as a sequence (accessible by element, e.g. data[i]) or as an iterable (accessible with a loop, e.g. for x in data). For PyTorch, TensorFlow, and ONNX models, tensors must be passed as torch.Tensor, tf.Tensor, and np.ndarray objects, respectively. Each sample must be a tuple whose first element is a tuple of inputs and whose second element is the label. The inputs must be passed as a tuple even when the model takes a single input; in that case, the input tuple contains only one element. Hugging Face models can take both dictionaries and strings as data samples. If a list of strings is passed as input_data, a tokenizer must also be passed as an extra keyword argument named tokenizer. The strings are then converted into data samples by the Hugging Face tokenizer.
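The sample layout described above can be sketched as follows. This is a minimal illustration rather than nebullvm code: make_samples is a hypothetical helper, and plain Python lists stand in for the torch.Tensor, tf.Tensor, or np.ndarray objects that a real call would pass.

```python
from typing import Any, List, Sequence, Tuple

# One sample = ((input_1, ..., input_n), label), as described above.
Sample = Tuple[Tuple[Any, ...], Any]


def make_samples(inputs: Sequence[Any], labels: Sequence[Any]) -> List[Sample]:
    """Wrap single-input data into the ((input,), label) layout.

    Hypothetical helper for illustration; it is not part of nebullvm.
    """
    if len(inputs) != len(labels):
        raise ValueError("inputs and labels must have the same length")
    # The inputs are always a tuple, even when the model takes one input.
    return [((x,), y) for x, y in zip(inputs, labels)]


# Plain lists stand in for tensors here.
fake_images = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
fake_labels = [0, 1, 0]
input_data = make_samples(fake_images, fake_labels)

# input_data is a sequence (input_data[i] works) and also an iterable.
first_inputs, first_label = input_data[0]
```

A list is used because it satisfies both access patterns mentioned above: indexing and looping.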
metric_drop_ths: float, optional
Maximum accepted drop in the specified metric. Any optimized model whose error with respect to the original model exceeds this threshold is discarded, regardless of its speed-up. Default: 0.
metric: str or Callable, optional
Metric used to estimate the error introduced by the optimization techniques and to decide whether that error exceeds metric_drop_ths, in which case the optimization is rejected. It accepts a string, a user-defined function, or None. As a string, it must be the name of a supported metric, currently "numeric_precision" or "accuracy". A user-defined metric is a function that takes as input two tuples of tensors, produced by the base model and by the optimized model, together with the original labels. For more information, see nebullvm.measure.compute_relative_difference and nebullvm.measure.compute_accuracy_drop. If no metric is given but a metric_drop_ths is set, nebullvm.measure.compute_relative_difference is used. Default: "numeric_precision".
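A user-defined metric of the kind described above might look like the sketch below. The argument order (base outputs, optimized outputs, labels), the "lower is better" return convention, and the name mean_abs_difference are assumptions made for illustration; lists of floats stand in for tensors.

```python
from typing import Optional, Sequence, Tuple


def mean_abs_difference(
    base_outputs: Tuple[Sequence[float], ...],
    optimized_outputs: Tuple[Sequence[float], ...],
    labels: Optional[Sequence[float]] = None,
) -> float:
    """Average absolute gap between the two models' outputs.

    Assumed signature: (base outputs, optimized outputs, labels). The labels
    are accepted but unused, since this metric only compares the two models.
    """
    total, count = 0.0, 0
    for base, optimized in zip(base_outputs, optimized_outputs):
        for b, o in zip(base, optimized):
            total += abs(b - o)
            count += 1
    return total / count if count else 0.0


# Lists of floats stand in for the output tensors of each model.
base = ([1.0, 2.0, 3.0],)
optimized = ([1.0, 2.5, 2.5],)
error = mean_abs_difference(base, optimized)

# Role of the threshold described above: reject when the error exceeds it.
metric_drop_ths = 0.5
accepted = error <= metric_drop_ths
```

Such a function would be passed as metric=mean_abs_difference alongside a metric_drop_ths value.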
optimization_time: OptimizationTime, optional
The optimization time mode. It can be "constrained" or "unconstrained". In "constrained" mode, nebullvm uses only compilers and precision reduction techniques, such as quantization. In "unconstrained" mode, it can also use more time-consuming techniques, such as pruning and distillation. Most techniques activated in "unconstrained" mode require fine-tuning, so it is recommended to provide at least 100 samples as input_data. Default: "constrained".
dynamic_info: Dict, optional
Dictionary containing dynamic axis information. It should contain both "input" and "output" as keys, each mapped to a list of dictionaries, where each dictionary holds the dynamic axis information for one input or output tensor. Each inner dictionary maps an integer key, the index of the dynamic axis (counting the batch dimension), to a string value that tags that axis, e.g. "batch_size". Default: None.
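As an illustration of the nesting described above, here is a sketch of a dynamic_info dictionary for a hypothetical model with two inputs and one output. The key names are copied from the description above and are worth confirming against the installed release; the "num_tokens" tag and the check_dynamic_info helper are made up for the example.

```python
# Axis 0 is the batch dimension and axis 1 a variable sequence length.
# The tags are free-form labels chosen by the caller.
dynamic_info = {
    "input": [
        {0: "batch_size", 1: "num_tokens"},  # first input tensor
        {0: "batch_size"},                   # second input tensor
    ],
    "output": [
        {0: "batch_size"},                   # single output tensor
    ],
}


def check_dynamic_info(info: dict) -> None:
    """Hypothetical sanity check; it is not part of nebullvm."""
    for key in ("input", "output"):
        for axes in info[key]:
            for axis, tag in axes.items():
                if not isinstance(axis, int) or not isinstance(tag, str):
                    raise TypeError(
                        f"{key}: expected int axis -> str tag, got {axis!r}: {tag!r}"
                    )


check_dynamic_info(dynamic_info)
```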
config_file: str, optional
Configuration file containing the parameters needed to define the CompressionStep in the pipeline. Default: None.
ignore_compilers: List[str], optional
List of DL compilers to ignore during optimization. Each compiler name should be one of tvm, tensor RT, openvino, onnxruntime, deepsparse, tflite, bladedisc, torchscript, intel_neural_compressor. Default: None.
ignore_compressors: List[str], optional
List of DL compressors to ignore during compression. Each compressor name should be one of sparseml and intel_pruning. Default: None.
store_latencies: bool, optional
If True, the latency of each compiler used by nebullvm is stored in a JSON file created in the working directory. Default: False.
device: str, optional
Device used for inference; it can be cpu or gpu. If not set, gpu is used when available, otherwise cpu. Default: None.
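The optional arguments above can be collected and checked before the call, as in the sketch below. build_options is a hypothetical helper, not part of nebullvm; the accepted names are copied from the parameter descriptions above.

```python
from typing import Any, Dict, List, Optional

# Names copied from the parameter descriptions above.
KNOWN_COMPILERS = {
    "tvm", "tensor RT", "openvino", "onnxruntime", "deepsparse",
    "tflite", "bladedisc", "torchscript", "intel_neural_compressor",
}
KNOWN_COMPRESSORS = {"sparseml", "intel_pruning"}


def build_options(
    ignore_compilers: Optional[List[str]] = None,
    ignore_compressors: Optional[List[str]] = None,
    store_latencies: bool = False,
    device: Optional[str] = None,
) -> Dict[str, Any]:
    """Collect keyword arguments for optimize_model, rejecting unknown names.

    Hypothetical helper for illustration; it is not part of nebullvm.
    """
    unknown = set(ignore_compilers or []) - KNOWN_COMPILERS
    unknown |= set(ignore_compressors or []) - KNOWN_COMPRESSORS
    if unknown:
        raise ValueError(f"unknown compiler or compressor: {sorted(unknown)}")
    if device not in (None, "cpu", "gpu"):
        raise ValueError(f"device must be 'cpu', 'gpu' or None, got {device!r}")
    return {
        "ignore_compilers": ignore_compilers,
        "ignore_compressors": ignore_compressors,
        "store_latencies": store_latencies,
        "device": device,
    }


options = build_options(ignore_compilers=["tvm", "bladedisc"], device="cpu")
# These would then be forwarded as optimize_model(model, input_data, **options).
```

Checking the names up front catches a misspelled entry before the optimization starts.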

Returns: InferenceLearner

Optimized version of the model with the same interface as the input model. For example, optimizing a PyTorch model returns an InferenceLearner object that can be called exactly like a PyTorch model (either with model.forward(input) or model(input)). The optimized model therefore takes torch.Tensor objects as input and returns torch.Tensor objects.
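The "same interface" property can be illustrated with the sketch below. To avoid assuming nebullvm's import path, optimize_model is passed in as an argument, and a stand-in function replaces it so the example runs without nebullvm installed. The names optimize_and_compare, fake_optimize_model, and double are hypothetical.

```python
from typing import Any, Callable, Sequence


def optimize_and_compare(
    optimize_model: Callable[..., Any],
    model: Callable[[Any], Any],
    input_data: Sequence[Any],
) -> bool:
    """Optimize `model`, then check both versions agree on the first sample.

    `optimize_model` is passed in rather than imported, so the sketch makes
    no assumption about where nebullvm exposes it.
    """
    optimized = optimize_model(
        model,
        input_data=input_data,
        optimization_time="constrained",
    )
    (first_input,), _label = input_data[0]
    # Same interface: the optimized model is called exactly like the original.
    # With real tensors, an approximate comparison (for example
    # torch.allclose) would replace the == used here.
    return optimized(first_input) == model(first_input)


# Stand-ins so the sketch runs without nebullvm installed.
def fake_optimize_model(model, input_data, **kwargs):
    return model  # a real call would return an InferenceLearner


def double(x):
    return [2 * v for v in x]


same = optimize_and_compare(fake_optimize_model, double, [(([1, 2, 3],), 0)])
```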