Practical Benchmarking for LLMs¶

by Grayson Adkins, updated March 28, 2024

This notebook provides practical benchmarking for large language models (LLMs) on a variety of technical tasks, such as converting code from one programming language to another, writing bash one-liners, explaining code snippets, and more.

Open In Colab

Attribution
This notebook is based on a benchmark called yet-another-applied-llm-benchmark created by Nicholas Carlini, a research scientist at Google DeepMind. He has a good write-up on the motivation and details on his personal blog.

Why should you read this notebook?¶

You want:

  • To evaluate LLMs using a practical benchmark for a set of challenging, real-world technical tasks.
  • A flexible and extensible benchmark, with support for adding new tests, more sophisticated logic flows, and new LLMs, including custom/self-hosted models.
  • A safe, automated way to execute code written by LLMs.
  • A way to easily generate reports and compare models' relative performance on the tasks.

About the tasks¶

The tasks are taken from real-world usage of LLMs by technical users (primarily Carlini himself). The goal is to give the evaluator a practical sense of whether a given model can perform these challenging tasks. Failing a task is not an indication that a model is bad; the benchmark is not designed to be a scientific instrument for deciding whether one model is better than another. Instead, it should be used to see how models are progressing at accurately performing useful technical tasks.

Highlights¶

The benchmark includes:

  • Nearly 100 real-world technical tasks. You can see the full list here.
  • A code extractor and interpreter to safely execute LLM-generated code in a container
  • A simple data-workflow domain-specific language (DSL) for composing, evaluating, and adding new tasks. Users define task "node" objects and chain nodes and strings together with the >> operator. For example, 'Write a "hello world" program in python' >> LLMRun() >> PythonRun() >> SubstringEvaluator("hello world") instructs an LLM to write a program that prints "hello world" and then checks for a matching string in the output. According to Carlini, this format supports more sophisticated behavior than other benchmarks.
  • For some tasks, especially those which are not easily verifiable by string matching, an advanced LLM (such as GPT-4) is used to partially evaluate the output. For example, the task draw_flag_bmp.py tasks a model to write a program in C that draws the flag of Italy in BMP format. GPT-4 is then asked what the output image depicts. If GPT-4's output includes "Italy" or "Italian", then we assume the model got it right. (Carlini acknowledges that this is an imperfect approach, but, again, this benchmark is not meant to be scientifically rigorous.)

Task evaluation¶

The core component of the benchmark framework is the evaluator.py file, which defines a series of classes and functions that together form a flexible and extensible system for running tests, capturing outputs, and evaluating those outputs against expected results or criteria.

TestFlagDraw = "Write a C program that draws an american flag to stdout as a bmp." >> LLMRun() >> \
       ExtractCode() >> CRun() >> LLMVisionRun("What flag is shown in this image?") >> \
          (SubstringEvaluator("United States") | \
           SubstringEvaluator("USA") | \
           SubstringEvaluator("America"))

In this example, the nodes LLMRun(), ExtractCode(), CRun(), LLMVisionRun(), and SubstringEvaluator() are each instances of their respective classes. Each class defines a set of functions that implement the desired behavior of the node, and the output of one node becomes the input of the next node in the sequence. For nodes that require code execution, a Docker or Podman container is spun up to run the code safely in a sandboxed environment.
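To make the node composition concrete, here is a minimal, self-contained sketch of the underlying idea: the >> operator can be implemented by overloading Python's __rshift__ (and __rrshift__, so a plain prompt string can start a chain). The class names below are illustrative stand-ins, not the framework's actual evaluator.py code.

```
# Minimal sketch of a ">>"-composable pipeline. Illustrative only --
# NOT the framework's actual evaluator.py implementation.

class Node:
    def __rshift__(self, other):
        # "a >> b" produces a two-step pipeline
        return Pipeline(self, other)

    def __rrshift__(self, other):
        # lets a plain string start a chain, e.g. "prompt" >> SomeNode()
        return Pipeline(Constant(other), self)

    def run(self, value):
        raise NotImplementedError


class Pipeline(Node):
    def __init__(self, first, second):
        self.first, self.second = first, second

    def run(self, value):
        # feed the output of one node into the next
        return self.second.run(self.first.run(value))


class Constant(Node):
    def __init__(self, text):
        self.text = text

    def run(self, _):
        return self.text


class Uppercase(Node):
    # stand-in for a "real" step such as LLMRun() or PythonRun()
    def run(self, text):
        return text.upper()


class SubstringEvaluator(Node):
    def __init__(self, needle):
        self.needle = needle

    def run(self, text):
        return self.needle in text


pipeline = "hello world" >> Uppercase() >> SubstringEvaluator("HELLO")
print(pipeline.run(None))  # True
```

The real framework's nodes are more elaborate (they stream intermediate results and support combining evaluators with |, as in the flag example above), but the chaining presumably relies on the same operator-overloading pattern.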

Users can run tests individually or all of them at once. The framework also conveniently includes a script for generating a results matrix in HTML format.

Supported LLMs¶

The benchmark is easily extensible, both in terms of adding new tests and new LLMs. At the time of this writing, it supports the following LLM providers and model families:

  • Anthropic
  • Cohere
  • Gemini
  • Llama
  • Mistral
  • Moonshot
  • OpenAI
  • VertexAI

Additionally, Trelis Research has added support for custom models that implement the OpenAI API format. In this notebook, I also demonstrate testing against Mixtral 8x7B Instruct AWQ and OpenChat 3.5.
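To illustrate what "implements the OpenAI API format" means in practice, the sketch below sends a chat-completion request to a self-hosted server (such as vLLM or TGI) using the openai Python client (v1+). The base_url and model name are placeholders; substitute the values from your own deployment.

```
# Sketch: calling a self-hosted, OpenAI-compatible endpoint.
# The base_url and model name are placeholders, not real deployments.
from openai import OpenAI

client = OpenAI(
    base_url="https://<your-pod-id>-8000.proxy.runpod.net/v1",  # your server's /v1 endpoint
    api_key="EMPTY",  # many self-hosted servers accept any placeholder key
)

response = client.chat.completions.create(
    model="TheBloke/Mistral-7B-Instruct-v0.1-AWQ",  # model id served by the endpoint
    messages=[{"role": "user", "content": 'Write a python program that prints "hello world".'}],
    temperature=0.7,
)
print(response.choices[0].message.content)
```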

Install Dependencies¶

In [57]:
## Remove existing benchmark repo from local files

# import shutil
# shutil.rmtree('/content/yet-another-applied-llm-benchmark')
In [45]:
%cd /content
!git clone https://github.com/gadkins/yet-another-applied-llm-benchmark.git

%cd yet-another-applied-llm-benchmark
!pip install -qUr requirements.txt
!pip install -qUr requirements-extra.txt
!pip install -qU python-dotenv
/content
Cloning into 'yet-another-applied-llm-benchmark'...
remote: Enumerating objects: 12000, done.
remote: Counting objects: 100% (568/568), done.
remote: Compressing objects: 100% (185/185), done.
remote: Total 12000 (delta 391), reused 535 (delta 375), pack-reused 11432
Receiving objects: 100% (12000/12000), 72.45 MiB | 19.87 MiB/s, done.
Resolving deltas: 100% (2760/2760), done.
Updating files: 100% (10736/10736), done.
/content/yet-another-applied-llm-benchmark
In [46]:
# Unnecessary in Google Colab (but critical on a local machine)
!sudo apt-get install podman
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
podman is already the newest version (3.4.4+ds1-1ubuntu1.22.04.2).
The following package was automatically installed and is no longer required:
  libfuse2
Use 'sudo apt autoremove' to remove it.
0 upgraded, 0 newly installed, 0 to remove and 39 not upgraded.
In [47]:
!which podman
/usr/bin/podman

Setup¶

Custom models¶

In addition to the LLMs defined in the code, you can run this benchmark against your own custom models, or against open-source models that are not already included in the benchmark.

For help deploying production-ready model APIs, see the Model Serving notebook. I also have one-click templates available to easily deploy the following models on Runpod:

  • Mistral 7B Instruct v0.1 AWQ with vLLM
  • Mixtral 8x7B Instruct AWQ with vLLM
  • OpenChat 3.5 with TGI

Configuration¶

Next, we'll create config.json and set up the following:

  • Add your OpenAI API key since GPT-4 is used as a partial evaluator in some tasks.
  • For custom models that implement the OpenAI API spec, add the api_key (or empty string if not applicable), API endpoint where the model is hosted, and Hugging Face model_id.
In [48]:
%%writefile config.json
{
    "container": "podman",
    "hparams": {
        "temperature": 0.7
    },
    "llms": {
        "Mistral-7B-Instruct-v0.1-AWQ": {
            "api_key": "EMPTY",
            "endpoint": "https://ymp90vl4mfkt5o-8000.proxy.runpod.net/v1/",
            "slug": "TheBloke/Mistral-7B-Instruct-v0.1-AWQ"
        },
        "Mixtral-Instruct-AWQ": {
            "api_key": "EMPTY",
            "endpoint": "https://mc1s4jnygce5b5-8000.proxy.runpod.net/v1/",
            "slug": "casperhansen/mixtral-instruct-awq"
        },
        "openchat_3.5": {
            "api_key": "EMPTY",
            "endpoint": "https://i0vbjq7enev3du-8080.proxy.runpod.net/v1",
            "model_id": "openchat/openchat_3.5"
        },
        "openai": {
            "api_key": "YOUR_OPENAI_API_KEY"
        },
        "mistral": {
            "api_key": "TODO"
        },
        "cohere": {
            "api_key": "TODO"
        },
        "anthropic": {
            "api_key": "TODO"
        },
        "moonshot": {
            "api_key": "TODO"
        }
    }
}
Writing config.json
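Before running the benchmark, you can sanity-check the file with a quick snippet like the one below (illustrative only; it is not part of the benchmark), which loads config.json and lists the configured models and their endpoints:

```
# Quick sanity check of config.json (illustrative only, not part of the benchmark).
import json

with open("config.json") as f:
    config = json.load(f)

print("Container runtime:", config["container"])
for name, settings in config["llms"].items():
    endpoint = settings.get("endpoint", "(provider default)")
    print(f"{name:30s} -> {endpoint}")
```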

For testing purposes, follow these instructions:

We'll be using "gpt-3.5-turbo", which can be accessed via free accounts.

In llm.py, update the following variables:

  1. Set llm = LLM("gpt-3.5-turbo") or whatever model you want
  2. Set eval_llm = LLM("gpt-3.5-turbo", override_hparams={'temperature': 0.1})

In evaluator.py:

  1. Update the variable PYTHON_ENV = "python3.11" to PYTHON_ENV = "python"

In docker_controller.py (only if you are not using podman or docker):

  1. Set I_HAVE_BLIND_FAITH_IN_LLMS_AND_AM_OKAY_WITH_THEM_BRICKING_MY_MACHINE_OR_MAKING_THEM_HALT_AND_CATCH_FIRE to True

If you prefer running tests locally, add the respective Python path in evaluator.py.

These changes will enable you to use "gpt-3.5-turbo" for testing.
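Taken together, the edited lines look roughly like this (a sketch based on the variable names described above, not a verbatim copy of the files):

```
# llm.py -- point both the test model and the evaluator model at gpt-3.5-turbo
llm = LLM("gpt-3.5-turbo")
eval_llm = LLM("gpt-3.5-turbo", override_hparams={'temperature': 0.1})

# evaluator.py -- use the local Python interpreter name
PYTHON_ENV = "python"

# docker_controller.py -- only if you are NOT using podman or docker
I_HAVE_BLIND_FAITH_IN_LLMS_AND_AM_OKAY_WITH_THEM_BRICKING_MY_MACHINE_OR_MAKING_THEM_HALT_AND_CATCH_FIRE = True
```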

Basic test with gpt-3.5-turbo¶

Let's make sure one basic test works with the free gpt-3.5-turbo model.

In [54]:
!PYTHONPATH='.' python tests/print_hello.py
gpt-3.5-turbo CACHE MISS ['Write a python program that prints the string "hello world" and tell me how it works in a sentence']
gpt-3.5-turbo CACHE MISS ['Take the below answer to my programming question  and return just the complete code in a single file so I can copy and paste it into an editor and directly run it. Include any header and main necessary so I can run it by copying this one file. DO NOT MODIFY THE CODE OR WRITE NEW CODE. Here is the code: \nprint("hello world")\n\nThis program uses the print() function in Python to output the string "hello world" to the console when the program is executed.']
# Initial Query
> Write a python program that prints the string "hello world" and tell me how it works in a sentence

# LLM Generation

## Query
> Write a python program that prints the string "hello world" and tell me how it works in a sentence

## Output
> print("hello world")
> 
> This program uses the print() function in Python to output the string "hello world" to the console when the program is executed.

# Extract Code
I extracted the following code from that output:
> ```
> print("hello world")
> ```

# Run Code Interpreter
Running the following program:
> ```
> print("hello world")
> ```
And got the output:
```
hello world
```

# Substring Evaluation
Testing if the previous output contains the string `hello world`: True

True

Basic test with a custom model¶

Now let's try running that same basic test on our custom model (here I'm using my own instance of Mistral-7B-Instruct-v0.1-AWQ on Runpod.io).

In [55]:
!PYTHONPATH='.' python main.py --model Mistral-7B-Instruct-v0.1-AWQ --test print_hello --run-tests
Running Mistral-7B-Instruct-v0.1-AWQ, iteration 0
Model name: Mistral-7B-Instruct-v0.1-AWQ
Model ID: None
API Endpoint: https://ymp90vl4mfkt5o-8000.proxy.runpod.net/v1/
print_hello.py
Run Job TestPrintHello
Test Passes: TestPrintHello

All tests with a custom model¶

Awesome! Now let's try running all the tests on our custom model. We'll also generate a report to summarize the results.

In [51]:
!PYTHONPATH='.' python main.py --model Mistral-7B-Instruct-v0.1-AWQ --run-tests --generate-report
Model name: Mistral-7B-Instruct-v0.1-AWQ
Model ID: None
API Endpoint: https://ymp90vl4mfkt5o-8000.proxy.runpod.net/v1/
Running Mistral-7B-Instruct-v0.1-AWQ, iteration 0
Model name: Mistral-7B-Instruct-v0.1-AWQ
Model ID: None
API Endpoint: https://ymp90vl4mfkt5o-8000.proxy.runpod.net/v1/
fix_torch_backward.py
Run Job TestTorchBackwardExplain
Test Fails: TestTorchBackwardExplain from fix_torch_backward.py
Run Job TestTorchBackwardFix
Test Passes: TestTorchBackwardFix
git_merge.py
Run Job TestGitMerge
Test Fails: TestGitMerge from git_merge.py
Run Job TestGitMergeConflict
Test Fails: TestGitMergeConflict from git_merge.py
jax_onehot.py
Run Job TestJaxOneHot
Test Fails: TestJaxOneHot from jax_onehot.py
fix_threading_issue.py
Run Job TestQuestionThreadedFix
Test Fails: TestQuestionThreadedFix from fix_threading_issue.py
jnp_nn_bugfix.py
Run Job TestFixJnpBug
Test Fails: TestFixJnpBug from jnp_nn_bugfix.py
implement_assembly_interpreter.py
Run Job TestImplementAssembly
Test Fails: TestImplementAssembly from implement_assembly_interpreter.py
convert_to_c.py
Run Job TestProgramRewriteC
Test Fails: TestProgramRewriteC from convert_to_c.py
rust_parallel_wordcount.py
Run Job TestRustParCount
Test Fails: TestRustParCount from rust_parallel_wordcount.py
Run Job TestRustParCountNoLib
Test Fails: TestRustParCountNoLib from rust_parallel_wordcount.py
print_hello.py
Run Job TestPrintHello
Test Passes: TestPrintHello
baking_help.py
Run Job TestMissingStep
Test Fails: TestMissingStep from baking_help.py
python_chess_game_prefix.py
Run Job TestPyChessPrefix
Test Fails: TestPyChessPrefix from python_chess_game_prefix.py
git_cherrypick.py
Run Job TestGitCherrypick
Test Fails: TestGitCherrypick from git_cherrypick.py
find_bug_in_paper.py
Run Job TestFindBugPaper
Test Fails: TestFindBugPaper from find_bug_in_paper.py
Run Job TestFindBugPaperEasy
Test Fails: TestFindBugPaperEasy from find_bug_in_paper.py
explain_code_prime.py
Run Job TestExplainPrime
Test Fails: TestExplainPrime from explain_code_prime.py
merge_into_16.py
Run Job TestMake16Files
Test Fails: TestMake16Files from merge_into_16.py
Run Job TestMake16FilesEasy
Test Fails: TestMake16FilesEasy from merge_into_16.py
base64_qanda.py
Run Job TestBase64Thought
Test Fails: TestBase64Thought from base64_qanda.py
what_is_automodel.py
Run Job TestWhatIsAutoModel
Test Fails: TestWhatIsAutoModel from what_is_automodel.py
extract_emails.py
Run Job TestExtractEmail
Test Fails: TestExtractEmail from extract_emails.py
regex_remove_5_words.py
Run Job TestRegex
Test Fails: TestRegex from regex_remove_5_words.py
numpy_advanced_index.py
Run Job TestNumpyAdvancedIndex
Test Fails: TestNumpyAdvancedIndex from numpy_advanced_index.py
Run Job TestNumpyAdvancedIndexEasier
Test Fails: TestNumpyAdvancedIndexEasier from numpy_advanced_index.py
fix_tokenizer.py
Run Job TestSimpleFix
Test Fails: TestSimpleFix from fix_tokenizer.py
convert_dp_to_iterative.py
Run Job TestProgramRemoveDP
Test Fails: TestProgramRemoveDP from convert_dp_to_iterative.py
explain_code_prime2.py
Run Job TestExplainPrime2
Test Fails: TestExplainPrime2 from explain_code_prime2.py
what_is_inv.py
Run Job TestWhatIsInv
Test Fails: TestWhatIsInv from what_is_inv.py
strided_trick.py
Run Job TestProgramStrided
Test Fails: TestProgramStrided from strided_trick.py
identify_uuencode.py
Run Job TestIsUU
Test Fails: TestIsUU from identify_uuencode.py
convert_to_c_simple.py
Run Job TestProgramRewriteCSimple
Traceback (most recent call last):
  File "/usr/lib/python3.10/subprocess.py", line 1154, in communicate
    stdout, stderr = self._communicate(input, endtime, timeout)
  File "/usr/lib/python3.10/subprocess.py", line 2021, in _communicate
    ready = selector.select(timeout)
  File "/usr/lib/python3.10/selectors.py", line 416, in select
    fd_event_list = self._selector.poll(timeout)
KeyboardInterrupt

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/content/yet-another-applied-llm-benchmark/main.py", line 196, in <module>
    main()
  File "/content/yet-another-applied-llm-benchmark/main.py", line 179, in main
    result = run_all_tests(model, use_cache=False,
  File "/content/yet-another-applied-llm-benchmark/main.py", line 82, in run_all_tests
    ok, reason = run_one_test(test, test_llm, llm.eval_llm, llm.vision_eval_llm)
  File "/content/yet-another-applied-llm-benchmark/main.py", line 40, in run_one_test
    for success, output in test():
  File "/content/yet-another-applied-llm-benchmark/evaluator.py", line 182, in __call__
    for output1, response1 in self.node1(orig_output):
  File "/content/yet-another-applied-llm-benchmark/evaluator.py", line 183, in __call__
    for output2, response2 in self.node2(output1):
  File "/content/yet-another-applied-llm-benchmark/evaluator.py", line 544, in __call__
    out = invoke_docker(self.env, {"main.c": code.encode(),
  File "/content/yet-another-applied-llm-benchmark/docker_controller.py", line 240, in invoke_docker
    proc = subprocess.run(run_cmd, cwd="/tmp/fakedocker_%d"%env.fake_docker_id, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
  File "/usr/lib/python3.10/subprocess.py", line 505, in run
    stdout, stderr = process.communicate(input, timeout=timeout)
  File "/usr/lib/python3.10/subprocess.py", line 1165, in communicate
    self._wait(timeout=sigint_timeout)
  File "/usr/lib/python3.10/subprocess.py", line 1953, in _wait
    time.sleep(delay)
KeyboardInterrupt
^C
In [56]:
!PYTHONPATH='.' python regenerate_report.py
[dict_keys(['print_hello.py.TestPrintHello'])]
print_hello.py.TestPrintHello
BAD {} print_hello.py.TestPrintHello

Visualizing Results¶

If you pass the --generate-report option to the python main.py command, you'll get a summary of the test results in HTML format. Alternatively, you can run the regenerate_report.py script, which will run all tests for the default model set in llm.py (llm = LLM(...)).

You can see an example report below.

In [5]:
%%html
<iframe src="https://nicholas.carlini.com/writing/2024/evaluation_examples/index.html" width="1000" height="1000"></iframe>