Practical Benchmarking for LLMs¶

by Grayson Adkins, updated March 28, 2024

This notebook provides practical benchmarking for large language models (LLMs) on a variety of technical tasks, such as converting code from one programming language to another, writing bash one-liners, explaining code snippets, and more.

Open In Colab

Attribution
This notebook is based on a benchmark called yet-another-applied-llm-benchmark created by Nicholas Carlini, a research scientist at Google DeepMind. He has a good write-up on the motivation and details on his personal blog.

Why should you read this notebook?¶

You want:

  • To evaluate LLMs using a practical benchmark for a set of challenging, real-world technical tasks.
  • A flexible and extensible benchmark, with support for adding new tests, more sophisticated logic flows, and new LLMs, including custom/self-hosted models.
  • A safe, automated way to execute code written by LLMs.
  • A way to easily generate reports and compare models' relative performance on the tasks.

About the tasks¶

The tasks are taken from real-world usage of LLMs by technical users (primarily Carlini himself). The goal is to give the evaluator a practical sense of whether a given model can perform these challenging tasks. Failing a task is not an indication that a model is bad; the benchmark is not designed to be a scientific instrument for deciding whether one model is better than another. Instead, it should be used to see how models are progressing at accurately performing useful technical tasks.

Highlights¶

The benchmark includes:

  • Nearly 100 real-world technical tasks. You can see the full list here.
  • A code extractor and interpreter to safely execute LLM-generated code in a container
  • A simple data-workflow domain-specific language (DSL) for composing, evaluating, and adding new tasks. Users define task "node" objects and chain nodes and strings together with the >> operator. For example, 'Write a "hello world" program in python' >> LLMRun() >> PythonRun() >> SubstringEvaluator("hello world") instructs an LLM to write a program that prints "hello world" and then checks for a matching string in the output. According to Carlini, this format supports more sophisticated behavior than other benchmarks.
  • For some tasks, especially those which are not easily verifiable by string matching, an advanced LLM (such as GPT-4) is used to partially evaluate the output. For example, the task draw_flag_bmp.py tasks a model to write a program in C that draws the flag of Italy in BMP format. GPT-4 is then asked what the output image depicts. If GPT-4's output includes "Italy" or "Italian", then we assume the model got it right. (Carlini acknowledges that this is an imperfect approach, but, again, this benchmark is not meant to be scientifically rigorous.)

Task evaluation¶

The core component of the benchmark framework is the evaluator.py file, which defines a series of classes and functions that together form a flexible and extensible system for running tests, capturing outputs, and evaluating those outputs against expected results or criteria.

TestFlagDraw = "Write a C program that draws an american flag to stdout as a bmp." >> LLMRun() >> \
       ExtractCode() >> CRun() >> LLMVisionRun("What flag is shown in this image?") >> \
          (SubstringEvaluator("United States") | \
           SubstringEvaluator("USA") | \
           SubstringEvaluator("America"))

In this example, the nodes LLMRun(), ExtractCode(), CRun(), LLMVisionRun(), and SubstringEvaluator() are each instances of their respective classes. Each class defines a set of functions that implement the desired behavior of the node, and the output of one node becomes the input of the next node in the sequence. For nodes that require code execution, a Docker or Podman container is spun up to run the code safely in a sandboxed environment.
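To make the node composition concrete, here is a minimal, self-contained sketch of the underlying idea: the >> operator can be implemented by overloading Python's __rshift__ (and __rrshift__, so a plain prompt string can start a chain). The class names below are illustrative stand-ins, not the framework's actual evaluator.py code.

```
# Minimal sketch of a ">>"-composable pipeline. Illustrative only --
# NOT the framework's actual evaluator.py implementation.

class Node:
    def __rshift__(self, other):
        # "a >> b" produces a two-step pipeline
        return Pipeline(self, other)

    def __rrshift__(self, other):
        # lets a plain string start a chain, e.g. "prompt" >> SomeNode()
        return Pipeline(Constant(other), self)

    def run(self, value):
        raise NotImplementedError


class Pipeline(Node):
    def __init__(self, first, second):
        self.first, self.second = first, second

    def run(self, value):
        # feed the output of one node into the next
        return self.second.run(self.first.run(value))


class Constant(Node):
    def __init__(self, text):
        self.text = text

    def run(self, _):
        return self.text


class Uppercase(Node):
    # stand-in for a "real" step such as LLMRun() or PythonRun()
    def run(self, text):
        return text.upper()


class SubstringEvaluator(Node):
    def __init__(self, needle):
        self.needle = needle

    def run(self, text):
        return self.needle in text


pipeline = "hello world" >> Uppercase() >> SubstringEvaluator("HELLO")
print(pipeline.run(None))  # True
```

The real framework's nodes are more elaborate (they stream intermediate results and support combining evaluators with |, as in the flag example above), but the chaining presumably relies on the same operator-overloading pattern.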

Users can run tests individually or all of them at once. The framework also conveniently includes a script for generating a results matrix in HTML format.

Supported LLMs¶

The benchmark is easily extensible, both in terms of adding new tests and new LLMs. At the time of this writing, it supports the following LLM providers and model families:

  • Anthropic
  • Cohere
  • Gemini
  • Llama
  • Mistral
  • Moonshot
  • OpenAI
  • VertexAI

Additionally, Trelis Research has added support for custom models that implement the OpenAI API format. In this notebook, I also demonstrate testing against Mixtral 8x7B Instruct AWQ and OpenChat 3.5.
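To illustrate what "implements the OpenAI API format" means in practice, the sketch below sends a chat-completion request to a self-hosted server (such as vLLM or TGI) using the openai Python client (v1+). The base_url and model name are placeholders; substitute the values from your own deployment.

```
# Sketch: calling a self-hosted, OpenAI-compatible endpoint.
# The base_url and model name are placeholders, not real deployments.
from openai import OpenAI

client = OpenAI(
    base_url="https://<your-pod-id>-8000.proxy.runpod.net/v1",  # your server's /v1 endpoint
    api_key="EMPTY",  # many self-hosted servers accept any placeholder key
)

response = client.chat.completions.create(
    model="TheBloke/Mistral-7B-Instruct-v0.1-AWQ",  # model id served by the endpoint
    messages=[{"role": "user", "content": 'Write a python program that prints "hello world".'}],
    temperature=0.7,
)
print(response.choices[0].message.content)
```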

Install Dependencies¶

In [57]:
## Remove existing benchmark repo from local files

# import shutil
# shutil.rmtree('/content/yet-another-applied-llm-benchmark')
In [45]:
%cd /content
!git clone https://github.com/gadkins/yet-another-applied-llm-benchmark.git

%cd yet-another-applied-llm-benchmark
!pip install -qUr requirements.txt
!pip install -qUr requirements-extra.txt
!pip install -qU python-dotenv
/content
Cloning into 'yet-another-applied-llm-benchmark'...
remote: Enumerating objects: 12000, done.
remote: Counting objects: 100% (568/568), done.
remote: Compressing objects: 100% (185/185), done.
remote: Total 12000 (delta 391), reused 535 (delta 375), pack-reused 11432
Receiving objects: 100% (12000/12000), 72.45 MiB | 19.87 MiB/s, done.
Resolving deltas: 100% (2760/2760), done.
Updating files: 100% (10736/10736), done.
/content/yet-another-applied-llm-benchmark
In [46]:
# Unnecessary in Google Colab (but critical on a local machine)
!sudo apt-get install podman
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
podman is already the newest version (3.4.4+ds1-1ubuntu1.22.04.2).
The following package was automatically installed and is no longer required:
  libfuse2
Use 'sudo apt autoremove' to remove it.
0 upgraded, 0 newly installed, 0 to remove and 39 not upgraded.
In [47]:
!which podman
/usr/bin/podman

Setup¶

Custom models¶

In addition to the LLMs defined in the code, you can run this benchmark against your own custom models, or against open-source models that are not already included in the benchmark.

For help deploying production-ready model APIs, see the Model Serving notebook. I also have one-click templates available to easily deploy the following models on Runpod:

  • Mistral 7B Instruct v0.1 AWQ with vLLM
  • Mixtral 8x7B Instruct AWQ with vLLM
  • OpenChat 3.5 with TGI

Configuration¶

Next, we'll create config.json and set up the following:

  • Add your OpenAI API key since GPT-4 is used as a partial evaluator in some tasks.
  • For custom models that implement the OpenAI API spec, add the api_key (or empty string if not applicable), API endpoint where the model is hosted, and Hugging Face model_id.
In [48]:
%%writefile config.json
{
    "container": "podman",
    "hparams": {
        "temperature": 0.7
    },
    "llms": {
        "Mistral-7B-Instruct-v0.1-AWQ": {
            "api_key": "EMPTY",
            "endpoint": "https://ymp90vl4mfkt5o-8000.proxy.runpod.net/v1/",
            "slug": "TheBloke/Mistral-7B-Instruct-v0.1-AWQ"
        },
        "Mixtral-Instruct-AWQ": {
            "api_key": "EMPTY",
            "endpoint": "https://mc1s4jnygce5b5-8000.proxy.runpod.net/v1/",
            "slug": "casperhansen/mixtral-instruct-awq"
        },
        "openchat_3.5": {
            "api_key": "EMPTY",
            "endpoint": "https://i0vbjq7enev3du-8080.proxy.runpod.net/v1",
            "model_id": "openchat/openchat_3.5"
        },
        "openai": {
            "api_key": "YOUR_OPENAI_API_KEY"
        },
        "mistral": {
            "api_key": "TODO"
        },
        "cohere": {
            "api_key": "TODO"
        },
        "anthropic": {
            "api_key": "TODO"
        },
        "moonshot": {
            "api_key": "TODO"
        }
    }
}
Writing config.json
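Before running the benchmark, you can sanity-check the file with a quick snippet like the one below (illustrative only; it is not part of the benchmark), which loads config.json and lists the configured models and their endpoints:

```
# Quick sanity check of config.json (illustrative only, not part of the benchmark).
import json

with open("config.json") as f:
    config = json.load(f)

print("Container runtime:", config["container"])
for name, settings in config["llms"].items():
    endpoint = settings.get("endpoint", "(provider default)")
    print(f"{name:30s} -> {endpoint}")
```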

For testing purposes, follow these instructions:

We'll be using "gpt-3.5-turbo", which can be accessed via free accounts.

In llm.py, update the following variables:

  1. Set llm = LLM("gpt-3.5-turbo") or whatever model you want
  2. Set eval_llm = LLM("gpt-3.5-turbo", override_hparams={'temperature': 0.1})

In evaluator.py:

  1. Update the variable PYTHON_ENV = "python3.11" to PYTHON_ENV = "python"

In docker_controller.py (only if you are not using podman or docker):

  1. Set I_HAVE_BLIND_FAITH_IN_LLMS_AND_AM_OKAY_WITH_THEM_BRICKING_MY_MACHINE_OR_MAKING_THEM_HALT_AND_CATCH_FIRE to True

If you prefer running tests locally, add the respective Python path in evaluator.py.

These changes will enable you to use "gpt-3.5-turbo" for testing.
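Taken together, the edited lines look roughly like this (a sketch based on the variable names described above, not a verbatim copy of the files):

```
# llm.py -- point both the test model and the evaluator model at gpt-3.5-turbo
llm = LLM("gpt-3.5-turbo")
eval_llm = LLM("gpt-3.5-turbo", override_hparams={'temperature': 0.1})

# evaluator.py -- use the local Python interpreter name
PYTHON_ENV = "python"

# docker_controller.py -- only if you are NOT using podman or docker
I_HAVE_BLIND_FAITH_IN_LLMS_AND_AM_OKAY_WITH_THEM_BRICKING_MY_MACHINE_OR_MAKING_THEM_HALT_AND_CATCH_FIRE = True
```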

Basic test with gpt-3.5-turbo¶

Let's make sure one basic test works with the free gpt-3.5-turbo model.

In [54]:
!PYTHONPATH='.' python tests/print_hello.py
gpt-3.5-turbo CACHE MISS ['Write a python program that prints the string "hello world" and tell me how it works in a sentence']
gpt-3.5-turbo CACHE MISS ['Take the below answer to my programming question  and return just the complete code in a single file so I can copy and paste it into an editor and directly run it. Include any header and main necessary so I can run it by copying this one file. DO NOT MODIFY THE CODE OR WRITE NEW CODE. Here is the code: \nprint("hello world")\n\nThis program uses the print() function in Python to output the string "hello world" to the console when the program is executed.']
# Initial Query
> Write a python program that prints the string "hello world" and tell me how it works in a sentence

# LLM Generation

## Query
> Write a python program that prints the string "hello world" and tell me how it works in a sentence

## Output
> print("hello world")
> 
> This program uses the print() function in Python to output the string "hello world" to the console when the program is executed.

# Extract Code
I extracted the following code from that output:
> ```
> print("hello world")
> ```

# Run Code Interpreter
Running the following program:
> ```
> print("hello world")
> ```
And got the output:
```
hello world
```

# Substring Evaluation
Testing if the previous output contains the string `hello world`: True

True

Basic test with a custom model¶

Now let's try running that same basic test on our custom model (here I'm using my own instance of Mistral-7B-Instruct-v0.1-AWQ on Runpod.io).

In [55]:
!PYTHONPATH='.' python main.py --model Mistral-7B-Instruct-v0.1-AWQ --test print_hello --run-tests
Running Mistral-7B-Instruct-v0.1-AWQ, iteration 0
Model name: Mistral-7B-Instruct-v0.1-AWQ
Model ID: None
API Endpoint: https://ymp90vl4mfkt5o-8000.proxy.runpod.net/v1/
print_hello.py
Run Job TestPrintHello
Test Passes: TestPrintHello

All tests with a custom model¶

Awesome! Now let's try running all the tests on our custom model. We'll also generate a report to summarize the results.

In [51]:
!PYTHONPATH='.' python main.py --model Mistral-7B-Instruct-v0.1-AWQ --run-tests --generate-report
Model name: Mistral-7B-Instruct-v0.1-AWQ
Model ID: None
API Endpoint: https://ymp90vl4mfkt5o-8000.proxy.runpod.net/v1/
Running Mistral-7B-Instruct-v0.1-AWQ, iteration 0
Model name: Mistral-7B-Instruct-v0.1-AWQ
Model ID: None
API Endpoint: https://ymp90vl4mfkt5o-8000.proxy.runpod.net/v1/
fix_torch_backward.py
Run Job TestTorchBackwardExplain
Test Fails: TestTorchBackwardExplain from fix_torch_backward.py
Run Job TestTorchBackwardFix
Test Passes: TestTorchBackwardFix
git_merge.py
Run Job TestGitMerge
Test Fails: TestGitMerge from git_merge.py
Run Job TestGitMergeConflict
Test Fails: TestGitMergeConflict from git_merge.py
jax_onehot.py
Run Job TestJaxOneHot
Test Fails: TestJaxOneHot from jax_onehot.py
fix_threading_issue.py
Run Job TestQuestionThreadedFix
Test Fails: TestQuestionThreadedFix from fix_threading_issue.py
jnp_nn_bugfix.py
Run Job TestFixJnpBug
Test Fails: TestFixJnpBug from jnp_nn_bugfix.py
implement_assembly_interpreter.py
Run Job TestImplementAssembly
Test Fails: TestImplementAssembly from implement_assembly_interpreter.py
convert_to_c.py
Run Job TestProgramRewriteC
Test Fails: TestProgramRewriteC from convert_to_c.py
rust_parallel_wordcount.py
Run Job TestRustParCount
Test Fails: TestRustParCount from rust_parallel_wordcount.py
Run Job TestRustParCountNoLib
Test Fails: TestRustParCountNoLib from rust_parallel_wordcount.py
print_hello.py
Run Job TestPrintHello
Test Passes: TestPrintHello
baking_help.py
Run Job TestMissingStep
Test Fails: TestMissingStep from baking_help.py
python_chess_game_prefix.py
Run Job TestPyChessPrefix
Test Fails: TestPyChessPrefix from python_chess_game_prefix.py
git_cherrypick.py
Run Job TestGitCherrypick
Test Fails: TestGitCherrypick from git_cherrypick.py
find_bug_in_paper.py
Run Job TestFindBugPaper
Test Fails: TestFindBugPaper from find_bug_in_paper.py
Run Job TestFindBugPaperEasy
Test Fails: TestFindBugPaperEasy from find_bug_in_paper.py
explain_code_prime.py
Run Job TestExplainPrime
Test Fails: TestExplainPrime from explain_code_prime.py
merge_into_16.py
Run Job TestMake16Files
Test Fails: TestMake16Files from merge_into_16.py
Run Job TestMake16FilesEasy
Test Fails: TestMake16FilesEasy from merge_into_16.py
base64_qanda.py
Run Job TestBase64Thought
Test Fails: TestBase64Thought from base64_qanda.py
what_is_automodel.py
Run Job TestWhatIsAutoModel
Test Fails: TestWhatIsAutoModel from what_is_automodel.py
extract_emails.py
Run Job TestExtractEmail
Test Fails: TestExtractEmail from extract_emails.py
regex_remove_5_words.py
Run Job TestRegex
Test Fails: TestRegex from regex_remove_5_words.py
numpy_advanced_index.py
Run Job TestNumpyAdvancedIndex
Test Fails: TestNumpyAdvancedIndex from numpy_advanced_index.py
Run Job TestNumpyAdvancedIndexEasier
Test Fails: TestNumpyAdvancedIndexEasier from numpy_advanced_index.py
fix_tokenizer.py
Run Job TestSimpleFix
Test Fails: TestSimpleFix from fix_tokenizer.py
convert_dp_to_iterative.py
Run Job TestProgramRemoveDP
Test Fails: TestProgramRemoveDP from convert_dp_to_iterative.py
explain_code_prime2.py
Run Job TestExplainPrime2
Test Fails: TestExplainPrime2 from explain_code_prime2.py
what_is_inv.py
Run Job TestWhatIsInv
Test Fails: TestWhatIsInv from what_is_inv.py
strided_trick.py
Run Job TestProgramStrided
Test Fails: TestProgramStrided from strided_trick.py
identify_uuencode.py
Run Job TestIsUU
Test Fails: TestIsUU from identify_uuencode.py
convert_to_c_simple.py
Run Job TestProgramRewriteCSimple
Traceback (most recent call last):
  File "/usr/lib/python3.10/subprocess.py", line 1154, in communicate
    stdout, stderr = self._communicate(input, endtime, timeout)
  File "/usr/lib/python3.10/subprocess.py", line 2021, in _communicate
    ready = selector.select(timeout)
  File "/usr/lib/python3.10/selectors.py", line 416, in select
    fd_event_list = self._selector.poll(timeout)
KeyboardInterrupt

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/content/yet-another-applied-llm-benchmark/main.py", line 196, in <module>
    main()
  File "/content/yet-another-applied-llm-benchmark/main.py", line 179, in main
    result = run_all_tests(model, use_cache=False,
  File "/content/yet-another-applied-llm-benchmark/main.py", line 82, in run_all_tests
    ok, reason = run_one_test(test, test_llm, llm.eval_llm, llm.vision_eval_llm)
  File "/content/yet-another-applied-llm-benchmark/main.py", line 40, in run_one_test
    for success, output in test():
  File "/content/yet-another-applied-llm-benchmark/evaluator.py", line 182, in __call__
    for output1, response1 in self.node1(orig_output):
  File "/content/yet-another-applied-llm-benchmark/evaluator.py", line 183, in __call__
    for output2, response2 in self.node2(output1):
  File "/content/yet-another-applied-llm-benchmark/evaluator.py", line 544, in __call__
    out = invoke_docker(self.env, {"main.c": code.encode(),
  File "/content/yet-another-applied-llm-benchmark/docker_controller.py", line 240, in invoke_docker
    proc = subprocess.run(run_cmd, cwd="/tmp/fakedocker_%d"%env.fake_docker_id, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
  File "/usr/lib/python3.10/subprocess.py", line 505, in run
    stdout, stderr = process.communicate(input, timeout=timeout)
  File "/usr/lib/python3.10/subprocess.py", line 1165, in communicate
    self._wait(timeout=sigint_timeout)
  File "/usr/lib/python3.10/subprocess.py", line 1953, in _wait
    time.sleep(delay)
KeyboardInterrupt
^C
In [56]:
!PYTHONPATH='.' python regenerate_report.py
[dict_keys(['print_hello.py.TestPrintHello'])]
print_hello.py.TestPrintHello
BAD {} print_hello.py.TestPrintHello

Visualizing Results¶

If you pass the --generate-report option to the python main.py command, you'll get a summary of the test results in HTML format. Alternatively, you can run the regenerate_report.py script, which will run all tests for the default model set in llm.py (llm = LLM(...)).

You can see an example report below.

In [5]:
%%html
<iframe src="https://nicholas.carlini.com/writing/2024/evaluation_examples/index.html" width="1000" height="1000"></iframe>