[null,null,["อัปเดตล่าสุด 2025-07-25 UTC"],[],[],null,["# Collaborative Optimization\n\n\u003cbr /\u003e\n\n~Maintained by Arm ML Tooling~\n\nThis document provides an overview of experimental APIs for combining various\ntechniques to optimize machine learning models for deployment.\n\nOverview\n--------\n\nCollaborative optimization is an overarching process that encompasses various\ntechniques to produce a model that, at deployment, exhibits the best balance of\ntarget characteristics such as inference speed, model size and accuracy.\n\nThe idea of collaborative optimizations is to build on individual techniques by\napplying them one after another to achieve the accumulated optimization effect.\nVarious combinations of the following optimizations are possible:\n\n- [Weight pruning](https://medium.com/tensorflow/tensorflow-model-optimization-toolkit-pruning-api-42cac9157a6a)\n- [Weight clustering](https://blog.tensorflow.org/2020/08/tensorflow-model-optimization-toolkit-weight-clustering-api.html)\n- Quantization\n\n - [Post-training quantization](https://medium.com/tensorflow/tensorflow-model-optimization-toolkit-post-training-integer-quantization-b4964a1ea9ba)\n - [Quantization aware training](https://blog.tensorflow.org/2020/04/quantization-aware-training-with-tensorflow-model-optimization-toolkit.html) (QAT)\n\nThe issue that arises when attempting to chain these techniques together is that\napplying one typically destroys the results of the preceding technique, spoiling\nthe overall benefit of simultaneously applying all of them; for example,\nclustering doesn't preserve the sparsity introduced by the pruning API. To solve\nthis problem, we introduce the following experimental collaborative optimization\ntechniques:\n\n- [Sparsity preserving clustering](https://www.tensorflow.org/model_optimization/guide/combine/sparse_clustering_example)\n- [Sparsity preserving quantization aware training](https://www.tensorflow.org/model_optimization/guide/combine/pqat_example) (PQAT)\n- [Cluster preserving quantization aware training](https://www.tensorflow.org/model_optimization/guide/combine/cqat_example) (CQAT)\n- [Sparsity and cluster preserving quantization aware training](https://www.tensorflow.org/model_optimization/guide/combine/pcqat_example)\n\nThese provide several deployment paths that could be used to compress a machine\nlearning model and to take advantage of hardware acceleration at inference time.\nThe diagram below demonstrates several deployment paths that can be explored in\nsearch for the model with desired deployment characteristics, where the leaf\nnodes are deployment-ready models, meaning they are partially or fully quantized\nand in tflite format. The green fill indicates steps where\nretraining/fine-tuning is required and a dashed red border highlights the\ncollaborative optimization steps. The technique used to obtain a model at a\ngiven node is indicated in the corresponding label.\n\nThe direct, quantization-only (post-training or QAT) deployment path is omitted\nin the figure above.\n\nThe idea is to reach the fully optimized model at the third level of the above\ndeployment tree; however, any of the other levels of optimization could prove\nsatisfactory and achieve the required inference latency/accuracy trade-off, in\nwhich case no further optimization is needed. 
The recommended training process is to iterate through the levels of the
deployment tree that apply to the target deployment scenario and check whether
the model meets the inference latency requirements; if it does not, use the
corresponding collaborative optimization technique to compress the model
further, repeating until the model is fully optimized (pruned, clustered, and
quantized), if needed.

The figure below shows the density plots of a sample weight kernel going
through the collaborative optimization pipeline.

The result is a quantized deployment model with a reduced number of unique
values as well as a significant number of sparse weights, depending on the
target sparsity specified at training time. Beyond the significant model
compression advantages, hardware with dedicated support can take advantage of
these sparse, clustered models to significantly reduce inference latency.

Results
-------

Below are some accuracy and compression results we obtained when experimenting
with the PQAT and CQAT collaborative optimization paths.

### Sparsity-preserving quantization aware training (PQAT)

| Model | Items | Baseline | Pruned Model (50% sparsity) | QAT Model | PQAT Model |
|------------------|--------------------------------|--------------------------------|--------------------------------|--------------------------------|--------------------------------|
| DS-CNN-L | FP32 Top-1 Accuracy | **95.23%** | 94.80% | (Fake INT8) 94.721% | (Fake INT8) 94.128% |
| | INT8 full integer quantization | 94.48% | **93.80%** | 94.72% | **94.13%** |
| | Compression | 528,128 → 434,879 (17.66%) | 528,128 → 334,154 (36.73%) | 512,224 → 403,261 (21.27%) | 512,032 → 303,997 (40.63%) |
| Mobilenet_v1-224 | FP32 Top-1 Accuracy | **70.99%** | 70.11% | (Fake INT8) 70.67% | (Fake INT8) 70.29% |
| | INT8 full integer quantization | 69.37% | **67.82%** | 70.67% | **70.29%** |
| | Compression | 4,665,520 → 3,880,331 (16.83%) | 4,665,520 → 2,939,734 (37.00%) | 4,569,416 → 3,808,781 (16.65%) | 4,569,416 → 2,869,600 (37.20%) |

### Cluster-preserving quantization aware training (CQAT)

| Model | Items | Baseline | Clustered Model | QAT Model | CQAT Model |
|--------------------------|--------------------------------|-------------------------------|-------------------------------|-------------------------------|--------------------------------|
| Mobilenet_v1 on CIFAR-10 | FP32 Top-1 Accuracy | **94.88%** | 94.48% | (Fake INT8) 94.80% | (Fake INT8) 94.60% |
| | INT8 full integer quantization | 94.65% | **94.41%** | 94.77% | **94.52%** |
| | Size | 3.00 MB | 2.00 MB | 2.84 MB | 1.94 MB |
| Mobilenet_v1 on ImageNet | FP32 Top-1 Accuracy | **71.07%** | 65.30% | (Fake INT8) 70.39% | (Fake INT8) 65.35% |
| | INT8 full integer quantization | 69.34% | **60.60%** | 70.35% | **65.42%** |
| | Compression | 4,665,568 → 3,886,277 (16.7%) | 4,665,568 → 3,035,752 (34.9%) | 4,569,416 → 3,804,871 (16.7%) | 4,569,472 → 2,912,655 (36.25%) |

### CQAT and PCQAT results for models clustered per channel

The results below were obtained with the
[clustering per channel](https://www.tensorflow.org/model_optimization/guide/clustering)
technique. They illustrate that if the convolutional layers of the model are
clustered per channel, the model accuracy is higher. If your model has many
convolutional layers, we recommend clustering per channel: the compression
ratio remains the same, but the model accuracy will be higher. A minimal sketch
of enabling this option is shown below.
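The sketch assumes a Keras `model` with convolutional layers; the
`cluster_per_channel` flag follows the clustering API linked above, and flag
names or module paths may differ between toolkit versions, so check the
clustering guide for your installed release.

```python
import tensorflow_model_optimization as tfmot

# Configure the clustering step to cluster each Conv2D channel separately
# instead of clustering the whole kernel at once.
clustering_params = {
    'number_of_clusters': 8,
    'cluster_centroids_init':
        tfmot.clustering.keras.CentroidInitialization.KMEANS_PLUS_PLUS,
    'cluster_per_channel': True,  # per-channel clustering for Conv2D layers
}

clustered_model = tfmot.clustering.keras.cluster_weights(model, **clustering_params)
# Fine-tune `clustered_model`, then continue with CQAT and int8 post-training
# quantization as in the pipeline described below.
```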
In our experiments, the model optimization pipeline is 'clustered ->
cluster-preserving QAT -> post-training quantization, int8'.

| Model | Clustered -> CQAT, int8 quantized | Clustered per channel -> CQAT, int8 quantized |
|-----------------------|------------------------------------|------------------------------------------------|
| DS-CNN-L | 95.949% | 96.44% |
| MobileNet-V2 | 71.538% | 72.638% |
| MobileNet-V2 (pruned) | 71.45% | 71.901% |

Examples
--------

For end-to-end examples of the collaborative optimization techniques described
here, please refer to the
[CQAT](https://www.tensorflow.org/model_optimization/guide/combine/cqat_example),
[PQAT](https://www.tensorflow.org/model_optimization/guide/combine/pqat_example),
[sparsity-preserving clustering](https://www.tensorflow.org/model_optimization/guide/combine/sparse_clustering_example),
and
[PCQAT](https://www.tensorflow.org/model_optimization/guide/combine/pcqat_example)
example notebooks.