In this webinar we will introduce changes made to PyTorch to improve the performance of the LLaMA family of models on AArch64. To achieve these improvements we have added two new ATen operators, torch.ops.aten._kai_weights_pack_int4() and torch.ops.aten._kai_input_quant_mm_int4(), which use the highly optimised packing and GEMM kernels available in the KleidiAI library. gpt-fast leverages these operators to, first, quantize the weights to INT4 using symmetric per-channel quantization and pack them together with an additional array of quantization scales, and second, dynamically quantize the activation matrix and execute an INT8 matrix multiplication of activations and weights using the AArch64 I8MM extension. With this approach we reduce the memory footprint of the linear layers in these models by up to 87% and match or exceed the performance of state-of-the-art vertically integrated frameworks such as llama.cpp.
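
To make the flow concrete, here is a minimal sketch of how a linear layer could wrap the two operators: weights are quantized and packed once ahead of time, and each forward call dynamically quantizes the activations and runs the INT8 GEMM. This is an illustration under stated assumptions, not the actual implementation; in particular, the argument lists of _kai_weights_pack_int4 and _kai_input_quant_mm_int4 are hypothetical, and the operators only exist in a PyTorch build that includes these changes.

```python
import torch

class Int4PackedLinear(torch.nn.Module):
    """Illustrative wrapper around the two new operators described above.

    The quantization math below is standard symmetric per-channel INT4;
    the _kai_* call signatures are assumptions made for this sketch.
    """

    def __init__(self, fp32_weight: torch.Tensor):
        super().__init__()
        self.out_features, self.in_features = fp32_weight.shape

        # Symmetric per-channel quantization: one scale per output channel,
        # mapping the channel's max magnitude onto the signed INT4 range [-8, 7].
        scales = fp32_weight.abs().amax(dim=1, keepdim=True) / 7.0
        scales = scales.clamp(min=1e-8)  # avoid division by zero for all-zero rows
        int4_weight = torch.clamp(torch.round(fp32_weight / scales), -8, 7).to(torch.int8)

        # One-time packing of INT4 weights plus the array of quantization scales
        # into the KleidiAI-friendly layout (assumed signature).
        self.register_buffer(
            "packed_weight",
            torch.ops.aten._kai_weights_pack_int4(int4_weight, scales),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Dynamically quantizes the activation matrix to INT8 and executes the
        # INT8 matrix multiplication using the I8MM-based KleidiAI GEMM kernel
        # (assumed signature).
        return torch.ops.aten._kai_input_quant_mm_int4(
            x, self.packed_weight, self.in_features, self.out_features
        )
```

In gpt-fast this pattern would replace the weights of the existing torch.nn.Linear layers at model load time, so that only the packed INT4 buffers are kept in memory during inference.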