Benchmarking Strategies for Large Language Models

Published: October 16, 2024
on the channel: Stephen Blum

How can you test large language models? It's actually pretty simple: you can look at how fast they respond.

This includes measuring time to first token (TTFT), which tells you how quickly the model starts producing a response. Another measure is tokens per second, showing how fast the rest of the response streams out. You can also measure the total time from the request to the final token.
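As a concrete illustration, here is a minimal sketch of how you might capture all three numbers from a single streamed request. It assumes an OpenAI-compatible streaming API via the `openai` Python client; the model name is just a placeholder, and counting streamed chunks is only a rough proxy for counting tokens.

```python
import time
from openai import OpenAI  # assumes an OpenAI-compatible streaming API

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def benchmark(prompt: str, model: str = "gpt-4o-mini") -> dict:
    """Measure TTFT, tokens per second, and total time for one streamed request."""
    start = time.perf_counter()
    ttft = None
    tokens = 0
    stream = client.chat.completions.create(
        model=model,  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            if ttft is None:
                ttft = time.perf_counter() - start  # time to first token
            tokens += 1  # each streamed chunk counted as roughly one token
    total = time.perf_counter() - start
    return {
        "ttft_s": ttft,
        "tokens_per_s": tokens / (total - ttft) if ttft else 0.0,
        "total_s": total,
    }
```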

A fast model isn't much use unless its responses, even short ones, stay high quality. Tie all these measures together and you get a rounded picture of performance. Performance also depends heavily on the model's size, whether it's 70 billion parameters or 200 billion, and model sizes just keep going up.

To estimate end-to-end performance, calculate the total time as TTFT plus the total tokens divided by the tokens per second. These metrics let you identify the quickest AI system. You can find providers offering such systems as an API, serving from multiple locations around the world like Paris, Virginia, and Seattle.
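In code, that estimate is a one-liner. The numbers below are made up purely to show the arithmetic:

```python
def estimate_total_time(ttft_s: float, tokens_per_s: float, total_tokens: int) -> float:
    """Total latency estimate: TTFT plus generation time (tokens divided by rate)."""
    return ttft_s + total_tokens / tokens_per_s

# e.g. 0.3 s TTFT, 60 tokens/s, a 120-token reply: 0.3 + 120 / 60 = 2.3 s
print(estimate_total_time(0.3, 60, 120))  # 2.3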

These providers do a warmup, setting up the TCP or TLS connection before sending the timed requests. When measuring TTFT, they start the clock when the HTTP request is made, which assumes the connection is already warmed up. They also limit the maximum output to 20 tokens for uniformity, since different models would otherwise return responses of varying lengths.
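Here is one way to reproduce that setup with Python's `requests` library. The endpoint URL, model name, and API key are all hypothetical placeholders; the point is that the session's first request warms up the TCP/TLS connection, so the timed request measures the model, not the handshake.

```python
import time
import requests

API_URL = "https://api.example.com/v1/chat/completions"  # hypothetical endpoint
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}       # placeholder credential

session = requests.Session()  # keeps the underlying connection alive

# Warmup: this first request establishes the TCP/TLS connection.
session.post(API_URL, headers=HEADERS, json={
    "model": "example-model",
    "messages": [{"role": "user", "content": "ping"}],
    "max_tokens": 1,
})

# Timed run: the clock starts at the HTTP request, on an already-warm connection.
payload = {
    "model": "example-model",
    "messages": [{"role": "user", "content": "Explain TCP in one sentence."}],
    "max_tokens": 20,  # cap output so every model is compared on equal footing
}
start = time.perf_counter()
session.post(API_URL, headers=HEADERS, json=payload)
print(f"Total request time on a warm connection: {time.perf_counter() - start:.3f}s")
```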

Moreover, they run each test three times and keep the best result. This cuts out outliers, since many other people could be hitting the API at the same time.
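Reusing the hypothetical `benchmark()` sketch from earlier, the best-of-three rule takes only a couple of lines:

```python
# Run the test three times and keep the fastest result, so a run that
# happened to hit a busy moment on the shared API doesn't skew the numbers.
results = [benchmark("Explain TCP in one sentence.") for _ in range(3)]
best = min(results, key=lambda r: r["total_s"])
print(best)
```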