Comparison of Embedding Models & How Other Pre-Retrieval Variables Affect RAG Performance¶

by Grayson Adkins, updated March 1, 2024

This notebook analyzes the effect that different embedding models and other variables, such as chunk size, have on RAG performance.

Why should you read this notebook?¶

You want to learn:

  • How to do basic Retrieval Augmented Generation (RAG)
  • How different variables like chunk size, overlap, etc. affect model performance

Pre-requisites¶

If you'd like to reproduce this notebook, you'll need:

  • A Hugging Face account
  • (optional) Google Drive
  • Access to Llama 2
  • A GPU (a free NVIDIA T4 on Google Colab is fine for Llama 7B & 13B models)
  • Familiarity with Python

Introduction¶

Retrieval Augmented Generation (RAG) is a technique for improving language model performance in question answering by providing additional context in the prompt. The technique involves comparing a given question against a related body of text, retrieving the most similar passages, and asking a language model to answer the question with the help of this additional information. Unlike a simple keyword search, however, we create embeddings of the question and the dataset, then perform a semantic search to retrieve the snippets that best match the question's meaning, regardless of whether exact keyword matches exist.

RAG is often a good option when you don't have enough data to fine-tune, but you still want to teach the model about a new topic.

RAG Steps¶

  1. Chunk - Take a dataset (in this notebook we use a PDF of recent COVID-19 statistics) and split it into chunks of approximately 200-4000 tokens per chunk.
  2. Embed - Using an embedding model, such as OpenAI's text-embedding-ada-002, create embeddings for the chunks and a given question.
  3. Retrieve - Compare the question embedding to the dataset embeddings, finding the top-k most similar results (typically we use cosine similarity or dot product).
  4. Augment - Using the indices of the top-k embedding results, get the corresponding chunks of text and include these along with the question in a prompt to a language model.

With the additional context, hopefully the language model can better answer the question.
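
The Augment step can be sketched in a few lines. This is a minimal illustration rather than the prompt used in the experiments: the helper name `build_prompt`, the toy chunks, and the template wording are all assumptions.

```python
def build_prompt(question, chunks, top_indices):
    """Combine the retrieved chunks and the question into a single prompt.

    Hypothetical helper for illustration; the template wording is an assumption.
    """
    # Look up the text of each retrieved chunk by its index, best match first
    context = "\n\n".join(chunks[i] for i in top_indices)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Toy chunks standing in for rows of the chunked dataset
chunks = [
    "Chunk about hospitalizations.",
    "Chunk about new cases.",
    "Chunk about variants.",
]
prompt = build_prompt("How many new cases were reported?", chunks, top_indices=[1, 0])
print(prompt)
```

Only the chunks selected in the Retrieve step reach the model, so the quality of the answer depends on the retrieval ranking.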

Variables Affecting Performance¶

There are several variables in the above steps that can significantly affect how well the language model performs in answering questions.

  1. Data quality - In the context of these experiments, data quality means that the text format is free of special characters, complex formatting, discontinuities, etc.
  2. Chunk size - If chunks are too small, they don't contain enough information to provide a prompt with sufficient context. If chunks are too large, the information density is too sparse, and the language model's attention may not focus on the right information.
  3. Chunk overlap - We typically configure chunks to overlap, i.e. the end of one sequence is repeated at the beginning of the next sequence. Overlapping helps preserve context between chunks. An overlap of at least 10% is often used.
  4. Embedding model - Not all embedding models are created equal.
  5. Similarity method - Given two vectorized embeddings, we can calculate and rank their similarity in a number of ways, but most often we use either cosine—which measures the angle between the vectors—or their dot product—which accounts for both angle and respective magnitudes.
  6. k-samples - How many of the most relevant snippets from the dataset should we include as context to the language model? Too many and the information is too diffuse; too few and we might not pass enough context to answer a given question. Finding the right value of k is important.
  7. Language model - We must have a model that is sufficiently advanced to parse the question and context and answer appropriately.
  8. Quantization - Using quantized models can degrade model performance. Additionally, the method of quantization can have an effect.
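
To make chunk size and overlap concrete, here is a minimal sketch of a sliding-window chunker. It is not the function used to prepare the dataset (that lives in the Utility Functions notebook); the name `chunk_tokens` is hypothetical, and placeholder strings stand in for real tokenizer tokens.

```python
def chunk_tokens(tokens, chunk_size, overlap_fraction):
    """Split a token list into fixed-size chunks that overlap by a fraction."""
    overlap = int(chunk_size * overlap_fraction)
    step = chunk_size - overlap  # how far the window advances each time
    if step <= 0:
        raise ValueError("overlap_fraction must be less than 1")
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the text
    return chunks


# Placeholder tokens; a real pipeline would use tokenizer output instead
tokens = [f"t{i}" for i in range(25)]
chunks = chunk_tokens(tokens, chunk_size=10, overlap_fraction=0.2)
for chunk in chunks:
    print(chunk[0], "...", chunk[-1], f"({len(chunk)} tokens)")
```

With a 20% overlap the last two tokens of each chunk reappear at the start of the next one. Raising the overlap shrinks the step, so the same text produces more chunks.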

Experimental Setup¶

Dataset¶

The point of RAG is to augment a language model that might be unfamiliar with a given topic, for example, something obscure, proprietary, or that has occurred since the model was trained. It's this last scenario that we'll focus on in this notebook.

In these experiments, I use Llama 2 as the language model, which was trained starting in February 2023. To ensure that the model does not already know the answers to the questions I'll be asking, I ask questions about COVID-19 statistics reported between December 2023 and January 2024.

  • PDF of the dataset: World Health Organization COVID-19 Epidemiological Update - Edition 163 (pdf)

  • The chunked dataset is available on my Hugging Face in CSV format.

  • You can find the functions I used for preparing this dataset in the Utility Functions notebook

Variables Tested¶

In this notebook, I do not attempt to perform an exhaustive testing of each permutation of variables listed below. Instead, I select a few illustrative configurations and leave it up to the reader to experiment when reproducing my results.

| Chunk Size (tokens) | Overlap | Embedding Model | Similarity Method | Top-k | Language Model | Quantization |
|---|---|---|---|---|---|---|
| 200 | 10% | Llama 2 | Cosine | 1 | Llama 2 7B | None (full-precision) |
| 500 | 25% | OpenAI | Dot Product | 3 | Llama 2 13B | GPTQ |
| 1000 | 50% | MS MARCO | - | 7 | Llama 2 70B | bitsandbytes (NF4) |
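
A quick count shows why exhaustive testing is impractical. The sketch below enumerates every combination of the options listed above; the dictionary keys are illustrative names, not identifiers from this notebook.

```python
from itertools import product

# Options copied from the table; the empty third similarity slot is omitted
options = {
    "chunk_size": [200, 500, 1000],
    "overlap": [0.10, 0.25, 0.50],
    "embedding_model": ["Llama 2", "OpenAI", "MS MARCO"],
    "similarity": ["cosine", "dot product"],
    "top_k": [1, 3, 7],
    "language_model": ["Llama 2 7B", "Llama 2 13B", "Llama 2 70B"],
    "quantization": ["none", "GPTQ", "bitsandbytes NF4"],
}

# Every combination of one option per variable
configs = list(product(*options.values()))
print(len(configs))  # 3 * 3 * 3 * 2 * 3 * 3 * 3 = 1458
```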

Summary of Results¶

  1. Data quality - Cleaning the dataset with the help of an LLM like gpt-3.5-turbo helped the chunking process and, thus, Llama in answering questions.
  2. Chunk size - 500-token chunks yielded more correct answers than 200-token chunks, although inference was significantly slower. This makes sense, as a larger prompt requires more compute and memory and can incur greater latency. Interestingly, some questions Llama got right with 200-token chunks, it missed with 500-token chunks. I did not try 1000-token chunks since my dataset is only 24 pages and the 500-token chunks were already taking almost 10 minutes to answer 10 questions on a free T4.
  3. Chunk overlap - 50% overlap performed slightly better than 10%. Larger overlap results in more rows of the training set, so it might be interesting to see how a large overlap and a larger k-value compare.
  4. Embedding model - OpenAI's text-embedding-ada-002 just barely beat out MS MARCO. Llama 2's embedding model performed noticeably worse. If open-source is important to you, MS MARCO is a good alternative to OpenAI.
  5. Similarity method - Given two vectorized embeddings, we can calculate and rank their similarity in a number of ways, but most often we use either cosine—which measures the angle between the vectors—or their dot product—which accounts for both angle and respective magnitudes.
  6. k-samples - Returning the top-3 most similar samples seemed to be the sweet spot.
  7. Language model - Unsurprisingly, the larger 13B model was able to understand larger contexts better than the smaller 7B model. I did not test with the 70B model, but I suspect it would have done better, especially on questions where the 13B model got confused (e.g. answered a question I didn't ask).
  8. Quantization - Quantized models fared worse than full-precision models. GPTQ fared slightly worse than bitsandbytes, although not by much.

Setup¶

In [1]:
# Authenticate to Hugging Face to pull and push models
!pip install huggingface_hub
from huggingface_hub import notebook_login

notebook_login()

Connect Google Drive¶

Optional but saves time by caching the model and allows for training data to be saved on Drive.

In [4]:
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive
In [5]:
import os
cache_dir = "/content/drive/My Drive/huggingface_cache"
os.makedirs(cache_dir, exist_ok=True) # Ensure the directory exists
In [6]:
# https://stackoverflow.com/questions/56081324/why-are-google-colab-shell-commands-not-working
import locale
def getpreferredencoding(do_setlocale = True):
    return "UTF-8"
locale.getpreferredencoding = getpreferredencoding

Installation and Loading¶

GPTQ¶

This quantization method is faster than bitsandbytes, although probably a little less accurate.

In [ ]:
!pip install -q -U transformers peft accelerate optimum
!pip install auto-gptq --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu118/
!pip install -q datasets
In [8]:
from transformers import AutoTokenizer, AutoModelForCausalLM

# model_id = "TheBloke/Llama-2-7B-chat-GPTQ"
model_id = "TheBloke/Llama-2-13B-chat-GPTQ"

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    cache_dir=cache_dir
    )

tokenizer = AutoTokenizer.from_pretrained(model_id)
/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_token.py:88: UserWarning: 
The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
  warnings.warn(
WARNING:auto_gptq.nn_modules.qlinear.qlinear_cuda:CUDA extension not installed.
WARNING:auto_gptq.nn_modules.qlinear.qlinear_cuda_old:CUDA extension not installed.
tokenizer_config.json:   0%|          | 0.00/727 [00:00<?, ?B/s]
tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]
tokenizer.json:   0%|          | 0.00/1.84M [00:00<?, ?B/s]
special_tokens_map.json:   0%|          | 0.00/411 [00:00<?, ?B/s]

bitsandbytes NF4¶

This is the same quantization method described in my QLoRA notebook.

In [ ]:
# !pip install -q -U bitsandbytes
# !pip install -q -U git+https://github.com/huggingface/transformers.git
# !pip install -q -U git+https://github.com/huggingface/peft.git
# !pip install -q -U git+https://github.com/huggingface/accelerate.git
# !pip install -q datasets
In [ ]:
# import torch
# from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

# # model_id = "meta-llama/Llama-2-7b-chat-hf"
# model_id = "meta-llama/Llama-2-13b-chat-hf"
# bnb_config = BitsAndBytesConfig(
#     load_in_4bit=True,
#     bnb_4bit_use_double_quant=True,
#     bnb_4bit_quant_type="nf4",
#     bnb_4bit_compute_dtype=torch.bfloat16
# )

# tokenizer = AutoTokenizer.from_pretrained(model_id)
# model = AutoModelForCausalLM.from_pretrained(
#     model_id,
#     quantization_config=bnb_config,
#     device_map={"":0},
#     cache_dir=cache_dir)

Load the Dataset¶

Note: I've noticed that cleaning up the dataset using an LLM can improve embedding retrieval performance slightly. In this example I used a cleaned COVID-19 statistics dataset.

In [15]:
from datasets import load_dataset

data = load_dataset("gadkins/who-covid-19-epidemiological-update-edition-163")

# Print first row of 'train' and 'test'
print("First row of train:", data['train'][0])
print("First row of test:", data['test'][0])

data = data.map(lambda samples: tokenizer(samples["Text"]), batched=True)
Downloading readme:   0%|          | 0.00/443 [00:00<?, ?B/s]
Downloading data:   0%|          | 0.00/40.7k [00:00<?, ?B/s]
Downloading data:   0%|          | 0.00/1.23k [00:00<?, ?B/s]
Generating train split: 0 examples [00:00, ? examples/s]
Generating test split: 0 examples [00:00, ? examples/s]
First row of train: {'Text': '1 \n COVID -19 Epidemiological Update  \nEdition  163 published  19 January  2024 \nIn this edition:   \n• Key highlights   \n• Global overview  \n• Hospitalizations and ICU admissions  \n• SARS -CoV-2 variants of interest and variants under monitoring  \n• WHO regional overviews   \n \n \nKey highlights  \n \n• Globally, during the 28 -day period from 11  December 2023  to 7 January 2024 , 106 countries report ed COVID -\n19 cases and 51 countries  report ed COVID -19 deaths. Note  that this does not reflect the actual number of \ncountries where cases or deaths are occurring , as many  countries have stopped or changed frequency of \nreporting .  \n• From the available data , the number of reported cases has increased while deaths have decreased during \nthe 28 -day period , with over 1.1 million new cases and 8700  new deaths , an increase of 4% and a decrease \nof 26%, respectively, compared to the previous 28 days ( 13 November to 10 December 2023 ). Trends in the \nnumber of reported new cases and deaths should be interpreted with caution due to decreased testing  and \nsequencing, alongside reporting delays in many countries.   According  to estimates obtained from wastewater \nsurveillance, clinical detection of cases underestimates  the real burden from 2 to 19 -fold.  \n• SARS -CoV-2 PCR percent positivity , as detected in integrated sentinel  surveillance  as part of  the Global \nInfluenza  Surveillance and Response System ( GISRS ) and reported to FluNet  was around 8 % as of 7 January \n2024. \n• During the 28 -day period  from 11 December 2023  to 7 January , 53 and 42 countries provided  data at least \nonce on COVID -19'}
First row of test: {'Text': '1. The Americas reported a decrease in new COVID-19 cases, with the highest numbers reported in Canada, Chile, and Peru.\n2. The number of deaths in the Americas also decreased, with the highest numbers reported in the United States, Canada, and Chile.\n3. The Eastern Mediterranean Region reported a decrease in new cases, with Kuwait showing the highest proportional increase.\n4. Afghanistan, Iran, and Morocco had the highest numbers of new cases in the Eastern Mediterranean Region.\n5. The number of deaths in the Eastern Mediterranean Region decreased, with Iran, Afghanistan, and Morocco reporting the highest numbers.\n6. The European Region reported a decrease in new cases, with Malta, Moldova, and Ireland showing the highest proportional increases.\n7. Russia, Italy, and Greece had the highest numbers of new cases in the European Region.\n8. The number of deaths in the European Region decreased, with Italy, Russia, and Poland reporting the highest numbers.\n9. The South-East Asia Region reported a significant increase in new cases, with India, Myanmar, and Indonesia showing the highest proportional increases.\n10. India, Indonesia, and Thailand had the highest numbers of new cases in the South-East Asia Region.\n\n'}
Map:   0%|          | 0/29 [00:00<?, ? examples/s]
Map:   0%|          | 0/1 [00:00<?, ? examples/s]

Test Embeddings¶

Here, we're calculating the embeddings for each row of training data, by averaging across the token embeddings for each token in each row.

We're also using a test question to pick out and print the three most relevant snippets.

Llama Embeddings¶

In [ ]:
import torch
from torch.nn.functional import cosine_similarity
import numpy as np

n_samples=3

# Function to get average embeddings
def get_avg_embedding(input_ids):
    with torch.no_grad():
        embeddings = model.get_input_embeddings()(input_ids)
    avg_embedding = torch.mean(embeddings, dim=1)
    return avg_embedding

# Calculate average embeddings for each row in 'train' dataset
train_embeddings = []
for example in data['train']:
    input_ids = torch.tensor(example['input_ids']).unsqueeze(0).to("cuda")  # Moved to CUDA
    avg_embedding = get_avg_embedding(input_ids)
    train_embeddings.append(avg_embedding)
train_embeddings = torch.stack(train_embeddings).squeeze()

# Tokenize and get embedding for the question
question = "For the 28-day period from 11 December 2023 to 7 January 2024, how " + \
"many new cases of COVID-19 were reported globally?"
question_input_ids = tokenizer(question, return_tensors="pt", truncation=True, \
                               max_length=500)["input_ids"].to("cuda")  # Moved to CUDA
question_embedding = get_avg_embedding(question_input_ids).squeeze()

# Calculate cosine similarity between question and each row in 'train'
# Convert to float32 and move to CPU
cosine_similarities = cosine_similarity(question_embedding.unsqueeze(0), \
                                        train_embeddings, dim=1).cpu().float()

# Sort and find top n_samples most similar rows
top_indices = torch.topk(cosine_similarities, n_samples).indices.tolist()

for idx in top_indices:
    print("Similarity Score:", cosine_similarities[idx].item())
    print("Text:", data['train'][idx]['Text'])
    print("---")
Similarity Score: 0.8447265625
Text: V-2 sequence data and metadata from GISAID, from 3 July to 31 December 2023.  
 
 
 
 
A 
B  
22 
 Additional resources  
 
• Tracking SARS -CoV-2 Variants  
• WHO statement on updated tracking system on SARS -CoV-2 variants of concern and variants of interest   
• SARS -CoV-2 variant risk evaluation framework, 30 August 2023  
• WHO JN.1 Initial Risk Evaluation, 1 3 December 2023  
• WHO BA.2.86 Initial Risk Evaluation, 21 November 2023  
• WHO EG.5 Updated Risk Evaluation, 21 November 2023  
• WHO XBB.1.5 Updated Risk Assessment, 20 June 2023  
• WHO XBB.1.16 Updated Risk Assessment, 5 June 2023    
22 WHO  regional  overviews   
Data for 11 December  2023  to 7 January 2024  
 
African  Region 
The African Region reported over 3354 new cases, a 63% decrease as compared to the previous 28 -day period. Five (10%) of the 50 countries for which data are 
available reported increases in new cases of 20% or greater, with the highest proportional increases observed in Mauritania ( 30 vs six new cases; +400%), Senegal 
(17 vs six new cases ; +183%), Mali (two vs one new case; +100%), Burundi (78 vs 51 new cases; +53%), and Angola (417 vs 316 new cases; +32%). The  highest 
numbers of new cases were reported from Mauritius (3228 new cases; 253.8 new cases per 100 000; -57%
---
Similarity Score: 0.8330078125
Text: %), and the Western Pacific Region (five countries; 14%). No country in the 
Eastern Mediterranean Region shared1. The number of countries that consistently reported new hospitalizations 
for the period was 9% (22  countries) (Table 2).  
 
Among the 22 countries consistently reporting new hospitalizations, 8 (36%) countries registered an increase of 20% 
or greater in hospitalizations during the past 28 days compared to the previous 28 -day period: Indonesia (1337 vs 
149; +797%), Malta (79 vs 21; +276%), Brunei Darussalam (588 vs 161; +265%),  Malaysia (9312 vs 4137; +125%), 
Greece (6366 vs 3792; +68%), Singapore (2619 vs 1719; +52%), United Sates of America (12 8 073 vs 84 98 1; +51%), 
and Ireland (1353 vs 967; +40%).  
The highest numbers of new hospital admissions were reported from the United States of America (12 8 073  vs 84 
981; +51%), Malaysia (9312 vs 4137; +125%), and Italy (8845 vs 13 857; -36%).  
 
 
 
 
 
 
 
 
 
  
11 
  
Table 2. Number of new hospitalization admissions reported by WHO regions, 11 December 2023  to 7 January 2024  
compared to 16 October to 13  November to 10 December 2023  
Region  Countries reported at least once in 
the past 28 days  Countries reported consistently in the past and 
previous 28 days*  
Number
---
Similarity Score: 0.83251953125
Text: ), and Jamaica (32 vs 26 new cases; +23%). The highest numbers of new cases were reported 
from Canada (18 947 new cases; 50.2 new cases per 100 000; -18%), Chile (9693 new cases; 50.7 new cases per 100 000; -25%), and Peru (4657 new cases; 14.1 
new cases per 100 000; -38%).  
The number of new 28 -day deaths in the Region decreased by 41% as compared to the previous 28 -day period, with 3869 new deaths reported. The highest 
numbers of new deaths were reported from the United States of America (3552 new deaths; 1.1 new deaths per 100 000; -40%), Canada (128 new deaths; <1 new 
death per 100 000; -60%), and Chile (114 new deaths; <1 new death per 100 000; -46%).  
 
 
 
 
Updates from the  Region of the Americas
 
  23 Eastern  Mediterranean  Region  
The Eastern Mediterranean Region reported over 2464 new cases, a 13% decrease as compared to the previous 28 -day period. One (5%) of the 22 countries for 
which data are available reported increases in new cases of 20% or greater, with the highest proportional increases observed in Kuwait (102 vs 39 new cases; 
+162%). The highest numbers of new cases were reported from Afghanistan (1256 new cas es; 3.2 new cases per 100 000; +14%), the Islamic Republic of Iran (663 
new cases; <1 new case per 100 000; -48%), and Morocco (443 new cases; 1.2 new cases per
---

OpenAI embeddings¶

In [ ]:
import os
from getpass import getpass

# Prompt for the API key without echoing it into the notebook output
api_key = getpass("Please enter your OpenAI API key: ")
os.environ["OPENAI_API_KEY"] = api_key
In [ ]:
!pip install openai
In [ ]:
import os
from openai import OpenAI

# Get the API key from environment variable
client = OpenAI(
  api_key=os.environ['OPENAI_API_KEY'],
)
In [ ]:
!pip install tqdm
Requirement already satisfied: tqdm in /usr/local/lib/python3.10/dist-packages (4.66.2)
In [ ]:
# Reload embeddings if previously saved to drive
import csv
import os

import numpy as np
import torch

def reload_embeddings(cache_dir, embeddings_filename):
    csv_path = os.path.join(cache_dir, embeddings_filename)

    # Initialize an empty list to hold the read data
    train_embeddings_list = []

    # Read from CSV
    with open(csv_path, 'r', newline='') as csvfile:
        csvreader = csv.reader(csvfile)
        for row in csvreader:
            train_embeddings_list.append(list(map(float, row)))

    # Convert list of lists to a numpy array and then to a tensor
    train_embeddings = torch.tensor(np.array(train_embeddings_list))

    # Convert tensor to float32
    train_embeddings = train_embeddings.to(torch.float32)

    print("Reloaded train_embeddings:", train_embeddings)
    return train_embeddings

def get_openai_embedding(text):
    embedding = client.embeddings.create(
        input = [text],
        model="text-embedding-ada-002").data[0].embedding
    return torch.tensor(embedding)

def embeddings_to_csv(train_embeddings, cache_dir, embeddings_filename):
    # Check if train_embeddings is already a tensor
    if not isinstance(train_embeddings, torch.Tensor):
        # Convert list of tensors to a single tensor
        train_embeddings = torch.stack(train_embeddings)

    # Save train_embeddings to a CSV file
    import csv

    # Convert tensor to numpy array and then to list
    train_embeddings_list = train_embeddings.cpu().numpy().tolist()

    csv_path = os.path.join(cache_dir, embeddings_filename)
    with open(csv_path, 'w', newline='') as csvfile:
        csvwriter = csv.writer(csvfile)
        csvwriter.writerows(train_embeddings_list)

    print(f"Saved train_embeddings to {csv_path}")

def create_embeddings():
    from tqdm import tqdm
    import time

    # Initialize train_embeddings as an empty list
    train_embeddings = []

    # Fetch embeddings for training data
    for i in tqdm(range(0, len(data['train']), 1)):  # Assume each API call handles one item
        example = data['train'][i]
        text = example['Text']
        embedding = get_openai_embedding(text)
        train_embeddings.append(embedding)
        time.sleep(2)  # Sleep for 2 seconds to avoid rate limiting

    return train_embeddings

# Check if embeddings file already exists in cache, else create new embeddings
embeddings_filename="train_embeddings.csv"
if os.path.exists(os.path.join(cache_dir, embeddings_filename)):
    train_embeddings = reload_embeddings(cache_dir, embeddings_filename)
    print("Loaded train_embeddings from cache")
else:
    train_embeddings = create_embeddings()
    embeddings_to_csv(train_embeddings, cache_dir, embeddings_filename)
    print(f"Created new train_embeddings. Saved embeddings to {os.path.join(cache_dir, embeddings_filename)}")
In [ ]:
# from tqdm import tqdm
# import time

# def get_openai_embedding(text):
#     embedding = client.embeddings.create(
#         input = [text],
#         model="text-embedding-ada-002").data[0].embedding
#     return torch.tensor(embedding)

# # Initialize train_embeddings as an empty list
# train_embeddings = []

# # Fetch embeddings for training data
# for i in tqdm(range(0, len(data['train']), 1)):  # Assume each API call handles one item
#     example = data['train'][i]
#     text = example['Text']
#     embedding = get_openai_embedding(text)
#     train_embeddings.append(embedding)
#     time.sleep(2)  # Sleep for 2 seconds to avoid rate limiting
100%|██████████| 64/64 [02:18<00:00,  2.17s/it]
In [ ]:
# # Convert list to float32 tensor
# train_embeddings = torch.tensor(np.array(train_embeddings))
# train_embeddings = train_embeddings.to(torch.float32)
In [ ]:
# Fetch embedding for the question
question = "For the 28-day period from 11 December 2023 to 7 January 2024, how " + \
"many new cases of COVID-19 were reported globally?"
question_embedding = get_openai_embedding(question)

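# Calculate cosine similarity between question and each row in 'train'
# create_embeddings() returns a list of tensors, so stack it into one tensor first
from torch.nn.functional import cosine_similarity

if not isinstance(train_embeddings, torch.Tensor):
    train_embeddings = torch.stack(train_embeddings)
cosine_similarities = cosine_similarity(question_embedding.unsqueeze(0), train_embeddings, dim=-1)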

# Sort and find top n_samples most similar rows
top_indices = torch.topk(cosine_similarities, n_samples).indices.tolist()

# Output similar texts
for idx in top_indices:
    print("Similarity Score:", cosine_similarities[idx].item())
    print("Text:", data['train'][idx]['Text'])
    print("---")
Similarity Score: 0.8974971175193787
Text: cases remained stable during the 28-day period of 11 December 2023 to 7 January 2024 as compared to the previous 28-day period, with over 1.1 million new cases reported. The number of new weekly deaths decreased by 26% as compared to the previous 28-day period, with 8700 new fatalities reported. As of 7 January 2024, over 774 million confirmed cases and over 7 million deaths have been reported globally. According to estimates obtained from viral loads in wastewater surveillance, clinical detection of cases underestimated the real burden 2 to 19-fold. Reported cases do not accurately represent infection rates due to the reduction in testing and reporting globally. During this 28-day period, only 45% of countries reported at
---
Similarity Score: 0.8882509469985962
Text: COVID-19 Epidemiological Update Edition 163 published 19 January 2024

In this edition:
• Key highlights
• Global overview
• Hospitalizations and ICU admissions
• SARS-CoV-2 variants of interest and variants under monitoring
• WHO regional overviews

Key highlights
• Globally, during the 28-day period from 11 December 2023 to 7 January 2024, 106 countries reported COVID-19 cases and 51 countries reported COVID-19 deaths. Note that this does not reflect the actual number of countries where cases or deaths are occurring, as many countries have stopped or changed frequency of reporting.
• From the available data, the number of reported cases has increased while deaths have decreased during the 28-day period, with over 1
---
Similarity Score: 0.876165509223938
Text: , especially data on morbidity and impact on health systems. COVID-19 remains a major threat, and WHO urges Member States to maintain, not dismantle, their established COVID-19 infrastructure. It is crucial to sustain, inter alia, early warning, surveillance and reporting, variant tracking, early clinical care provision, administration of vaccine to high-risk groups, improvements in ventilation, and regular communication.

Hospitalizations and ICU admissions
At the global level, during the past 28 days (11 December 2023 to 7 January 2024), a total of 173,547 new hospitalizations and 1,966 new intensive care unit (ICU) admissions were reported from 53 and 42 countries, respectively. This represents a 32% increase and a 3% decrease
---
In [ ]:
# import numpy as np

# # Check if train_embeddings is already a tensor
# if not isinstance(train_embeddings, torch.Tensor):
#     # Convert list of tensors to a single tensor
#     train_embeddings = torch.stack(train_embeddings)

# # Save train_embeddings to a CSV file
# import csv

# # Convert tensor to numpy array and then to list
# train_embeddings_list = train_embeddings.cpu().numpy().tolist()

# csv_path = os.path.join(cache_dir, "train_embeddings.csv")
# with open(csv_path, 'w', newline='') as csvfile:
#     csvwriter = csv.writer(csvfile)
#     csvwriter.writerows(train_embeddings_list)

# print(f"Saved train_embeddings to {csv_path}")
Saved train_embeddings to /content/drive/My Drive/huggingface_cache/train_embeddings.csv

MS MARCO embeddings¶

MS MARCO is a large-scale information retrieval dataset built from real user queries to the Bing search engine, and the SentenceTransformers framework provides embedding models trained on it. Each of these models is tuned for either cosine similarity or dot product. In this section, I use the msmarco-distilbert-base-tas-b model, which is tuned for dot product and gives better results than Llama embeddings.
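
Because this model is scored with dot product rather than cosine similarity, vector magnitude affects the ranking. The toy example below (made-up 2-D vectors, not real embeddings) sketches how the two measures can disagree; the helper names `dot` and `cosine` are just for illustration.

```python
import math


def dot(a, b):
    """Dot product: sensitive to both angle and magnitude."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity: dot product of the length-normalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


query = [1.0, 0.0]
doc_a = [1.0, 0.0]  # same direction as the query, short vector
doc_b = [3.0, 3.0]  # 45 degrees off the query, but much longer

print("cosine:", cosine(query, doc_a), cosine(query, doc_b))
print("dot:   ", dot(query, doc_a), dot(query, doc_b))
```

Cosine ranks `doc_a` first because it points in exactly the same direction as the query, while dot product ranks `doc_b` first because of its larger magnitude. Dot-product scores are also unbounded, which is why the similarity scores printed below are far greater than 1.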

In [ ]:
!pip install sentence_transformers
In [16]:
from sentence_transformers import SentenceTransformer
import torch

n_samples = 3

# Initialize the model
marco_model = SentenceTransformer('msmarco-distilbert-base-tas-b').to("cuda")  # Moved to CUDA

# Function to get embeddings
def get_marco_embedding(text):
    return torch.tensor(marco_model.encode(text)).to("cuda")  # Moved to CUDA

# Calculate embeddings for each row in 'train' dataset
train_embeddings = []
for example in data['train']:
    train_embeddings.append(get_marco_embedding(example['Text']))
train_embeddings = torch.stack(train_embeddings)

# Get embedding for the question
question = "For the 28-day period from 11 December 2023 to 7 January 2024, how " + \
"many new cases of COVID-19 were reported globally?"
question_embedding = get_marco_embedding(question)

# Calculate dot product between question and each row in 'train'
dot_products = torch.matmul(question_embedding, train_embeddings.T)

# Sort and find top n_samples most similar rows
top_indices = torch.topk(dot_products, n_samples).indices.tolist()

for idx in top_indices:
    print("Similarity Score:", dot_products[idx].item())
    print("Text:", data['train'][idx]['Text'])
    print("---")
Similarity Score: 107.57888793945312
Text: 1 
 COVID -19 Epidemiological Update  
Edition  163 published  19 January  2024 
In this edition:   
• Key highlights   
• Global overview  
• Hospitalizations and ICU admissions  
• SARS -CoV-2 variants of interest and variants under monitoring  
• WHO regional overviews   
 
 
Key highlights  
 
• Globally, during the 28 -day period from 11  December 2023  to 7 January 2024 , 106 countries report ed COVID -
19 cases and 51 countries  report ed COVID -19 deaths. Note  that this does not reflect the actual number of 
countries where cases or deaths are occurring , as many  countries have stopped or changed frequency of 
reporting .  
• From the available data , the number of reported cases has increased while deaths have decreased during 
the 28 -day period , with over 1.1 million new cases and 8700  new deaths , an increase of 4% and a decrease 
of 26%, respectively, compared to the previous 28 days ( 13 November to 10 December 2023 ). Trends in the 
number of reported new cases and deaths should be interpreted with caution due to decreased testing  and 
sequencing, alongside reporting delays in many countries.   According  to estimates obtained from wastewater 
surveillance, clinical detection of cases underestimates  the real burden from 2 to 19 -fold.  
• SARS -CoV-2 PCR percent positivity , as detected in integrated sentinel  surveillance  as part of  the Global 
Influenza  Surveillance and Response System ( GISRS ) and reported to FluNet  was around 8 % as of 7 January 
2024. 
• During the 28 -day period  from 11 December 2023  to 7 January , 53 and 42 countries provided  data at least 
once on COVID -19
---
Similarity Score: 106.12256622314453
Text: 3 COVID -19 2023 -2025 
Updated Strategic Preparedness and Response Plan . The new dashboard will progressively incorporate more 
components throughout 2024.  The previous link of the Global WHO Coronavirus (COVID -19) Dashboard will 
still be active and redirect users  to the new one from 22 December onward. Please note that start time of 
the redirection can differ around the world by up to 24 hours.  
 
 2 
  
For t he latest data and other updates on COVID -19, please see:  
• WHO Monthly Operational Update  and past editions of the Weekly  Epidemiological Update on COVID -19 
• WHO COVID -19 detailed surveillance data dashboard   
• WHO COVID -19 policy briefs   
• COVID -19 surveillance reporting requirements update for Member States  
• Summary  Tables  of COVID -19 vaccine effectiveness (VE) studies and results (last updated 11 January 
2024)  
• Forest Plots  displaying results of COVID -19 VE studies (last updated 15 January 2024)  
• Special focus WEU on interpreting relative VE  (29 June 2022, pages 6 -8) 
• Neutralizat ion p lots (last updated 15 January 2024)  
• WHO COVID -19 VE Resources  
 
 
Global overview  
 
Data  as of 7 January  2024 
Globally, the number of new weekly cases remained stable during the 28 -day period of 11 December 2023 to 7 
January 2024 as compared to the previous 28 -day period, with over 1.1 million new cases reported (Figure 1, 
Table 1). The number of new weekly deaths  decreased by 26% as compared to the previous 28 -day period, with 
8700  new fatalities
---
Similarity Score: 104.67813110351562
Text: -19 cases  over the last 28 days relative to the previous 28 days, as of 7 January  2024** 
 
**Se e Annex 1: Data, table, and figure notes  
 
8 
 Figure  5. Number  of COVID -19 deaths  reported over  the last 28 days per 100 000 population , as of 7 January   2024 ** 
 
**Se e Annex 1: Data, table, and figure notes  
 
9 
 Figure 6. Percentage change in confirmed COVID -19 deaths  over the last 28 days relative to the previous 28 days, as of 7 January  2024** 
 
**Se e Annex 1: Data, table, and figure notes  
 
10 
  
Hospitalizations and ICU admissions  
 
At the global level, during the past 28 days (11 December 2023  to 7 January 2024), a total of 173 547  new 
hospitalizations and 1 966 new intensive care unit (ICU) admissions were reported from 5 3 and 4 2 countries, 
respectively (Figure 7). This represents 32% increase and 3% decrease, respectively, compared to the previous 28 
days (13 November to 10 December 2023). Note that the absence of reported data from some countries to WHO 
does not imply that there are no COVID -19-related hospitalizations in those countries. The prese nted hospitalization 
data are preliminary and might change as new data become available. Furthermore, hospitalization data are subject 
to reporting delays. These data also likely include both hospitalizations with incidental cases of SARS -CoV-2 infection 
and those due to COVID -19 disease.   
  
 
New hospitalizations  
During the past 28 days, 5 3 (23%) countries reported data to WHO on new hospitalizations at least once (Figure 7).
---

Evaluate with and without embeddings¶

In [17]:
from transformers import TextStreamer
from torch.nn.functional import cosine_similarity
import torch

# Assume that train_embeddings has been computed and is available

# Wrap the prompt in the Llama 2 chat format and stream the model's answer to stdout
def stream(user_prompt):
    model.config.use_cache = True
    system_prompt = "You are an expert on COVID-19 statistics. You provide very " + \
    "succinct answers no longer than ten words."

    B_INST, E_INST = "[INST]", "[/INST]"
    B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"

    prompt = f"{B_INST}{B_SYS}{system_prompt.strip()}{E_SYS}{user_prompt.strip()} {E_INST}\n\n"

    inputs = tokenizer([prompt], return_tensors="pt").to("cuda")

    streamer = TextStreamer(tokenizer)

    _ = model.generate(**inputs, streamer=streamer, max_new_tokens=100)

def evaluation(questions, answers, use_embeddings=False, n_samples=3, embedding_type='marco'):
    for i, (question, answer) in enumerate(zip(questions, answers)):
        if use_embeddings:
          print(f'Using the {embedding_type} embedding model')
          # Embed the question with the selected embedding model
          if embedding_type == 'llama':
              question_input_ids = tokenizer(question, return_tensors="pt", truncation=True, max_length=512)["input_ids"].to("cuda")
              # Average over all tokens, then restore a leading batch dimension
              question_embedding = get_avg_embedding(question_input_ids).squeeze()
              question_embedding = question_embedding.unsqueeze(0)
          elif embedding_type == 'openai':
              question_embedding = get_openai_embedding(question)
          elif embedding_type == 'marco':
              question_embedding = get_marco_embedding(question)
          else:
              # Fail fast instead of continuing with an undefined or stale embedding
              raise ValueError("Invalid embedding_type. It should be one of 'llama', 'openai', or 'marco'.")

          # # Cosine Similarity
          # # Calculate cosine similarity between question and each row in 'train'
          # cosine_similarities = cosine_similarity(question_embedding, train_embeddings, dim=1).cpu().float()

          # # Sort and find top n_samples most similar rows
          # top_indices = torch.topk(cosine_similarities, n_samples).indices.tolist()
          # print(f'Top indices are {top_indices}')

          # Dot Product Similarity
          # Calculate dot product similarity between question and each row in 'train'
          dot_product_similarities = torch.matmul(question_embedding, train_embeddings.T).cpu().float()

          # Sort and find top n_samples most similar rows
          top_indices = torch.topk(dot_product_similarities, n_samples).indices.tolist()
          # print(f'Top indices are {top_indices}')

          similar_rows = []
          for idx in top_indices:
              text = data['train'][idx]['Text']
              if isinstance(text, list):
                  # If text is a list of strings, join the strings into a single string with a newline in between
                  text = "\n".join(text)
              # Append the string to similar_rows
              similar_rows.append(text)

          # print(similar_rows)
          similar_rows_text = "\n".join(similar_rows)
          user_prompt = "\n\nHere are some snippets from the World Health Organization " + \
          "Epidemiological Update - Edition 163 regarding COVID-19 statistics from " + \
          f"11 December 2023 to 7 January 2024:\n\n{similar_rows_text}\n\nAnswer the " + \
          f"following question succinctly, solely using the above snippets of text.\n\n{question}"
        else:
          user_prompt = question

        print(f"Question {i + 1}:")  # Printing the question number
        stream(user_prompt)
        print("Correct Answer:", answer)
        print('\n\n')

questions = [
    "During the 28-day period from 11 December 2023 to 7 January, how many " + \
    "countries reported new COVID-19 cases?",
    "To date, how many confirmed cases of COVID-19 have been reported globally?",
    "During the 28-day period from 11 December 2023 to 7 January, by what percent " + \
    "did new hospitalization increase globally due to COVID-19?",
    "During the 28-day period from 11 December 2023 to 7 January, which countries " + \
    "in the Americas reported the highest number of new COVID-19 cases?",
    "During the 28-day period from 11 December 2023 to 7 January, which Eastern " + \
    "Mediterranean country reported the highest proportional increase in new COVID-19 cases?",
    "During the 28-day period from 11 December 2023 to 7 January, which World Health " + \
    "Organization Region reported the largest decrease in new COVID-19 cases?",
    "During the 28-day period from 11 December 2023 to 7 January, what was the percent " + \
    "increase in new deaths in the South-East Asia Region?",
    "During the 28-day period from 11 December 2023 to 7 January, which countries " + \
    "reported no new hospitalizations?",
    "When did the ICU-to-hospitalizations ratio peak during the COVID-19 pandemic?",
    "During the 28-day period from 11 December 2023 to 7 January, which SARS-CoV-2 " + \
    "variants of interest (VOIs) were the WHO tracking?"
]

answers = [
    "106 countries.",
    "774 million.",
    "40%.",
    "Canada, Chile, and Peru.",
    "Kuwait.",
    "The African Region.",
    "564%",
    "Mauritania, Mali, Turks and Caicos Islands, Saint Lucia, and Honduras.",
    "July 2021",
    "XBB.1.5, XBB.1.16, EG.5, BA.2.86 and JN.1."
]
In [18]:
print("Question embedding shape:", question_embedding.shape)
print("Train embeddings shape:", train_embeddings.shape)
Question embedding shape: torch.Size([768])
Train embeddings shape: torch.Size([29, 768])
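
The shapes above explain the retrieval step: multiplying one 768-dimensional question vector against 29 chunk vectors yields 29 scores, one per chunk, and the top-k of those scores selects the chunks to put in the prompt. The following is a minimal plain-Python sketch of that idea, using tiny made-up vectors in place of the real embeddings. The helper name `top_k_by_dot` is hypothetical and stands in for the `torch.matmul` plus `torch.topk` calls used in the notebook.

```python
def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def top_k_by_dot(question_vec, chunk_vecs, k):
    """Return (index, score) pairs for the k chunks with the highest dot product."""
    scores = [dot(question_vec, vec) for vec in chunk_vecs]  # one score per chunk
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in ranked[:k]]

# Toy stand-ins: 4 chunks with 3 dimensions instead of 29 chunks with 768
toy_chunks = [
    [0.1, 0.0, 0.2],
    [0.9, 0.1, 0.0],
    [0.4, 0.4, 0.4],
    [0.0, 1.0, 0.0],
]
toy_question = [1.0, 0.5, 0.0]

top = top_k_by_dot(toy_question, toy_chunks, k=2)
top_indices = [i for i, _ in top]
print("Top indices:", top_indices)
```

The returned indices play the same role as `top_indices` in the notebook: they are used to look up the original chunk text.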

Evaluate without Embeddings (Raw Llama)¶

Because this COVID-19 report is more recent than Llama 2's training data, Llama 2 13B answers all 10 questions incorrectly when it is given no retrieved context.
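
In this notebook the generated answers are compared against the reference answers by eye. As a rough sketch of how that tally could be automated, the snippet below assumes a simple case-insensitive substring match. The `contains_answer` helper is hypothetical and not part of the notebook, and such a check would miss paraphrased answers or lists given in a different order. The example strings are taken from the notebook's own outputs.

```python
def normalize(text):
    """Lowercase the text and drop surrounding whitespace and a trailing period."""
    return text.lower().strip().rstrip(". ")

def contains_answer(generated, reference):
    """True if the normalized reference answer appears in the generated text."""
    return normalize(reference) in generated.lower()

# (generated answer, reference answer) pairs copied from the notebook outputs
pairs = [
    ("47 countries reported new COVID-19 cases.", "106 countries."),
    ("106 countries reported new COVID-19 cases during the 28-day period "
     "from 11 December 2023 to 7 January.", "106 countries."),
    ("Over 774 million confirmed cases of COVID-19 have been reported "
     "globally, as of 7 January 2024.", "774 million."),
]

results = [contains_answer(generated, reference) for generated, reference in pairs]
n_correct = sum(results)
print(f"{n_correct} of {len(results)} answers contain the reference answer")
```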

In [ ]:
# Call evaluation without embeddings
evaluation(questions, answers, use_embeddings=False)
Question 1:
<s> [INST]<<SYS>>
You are an expert on COVID-19 statistics. You provide very succinct answers no longer than ten words.
<</SYS>>

During the 28-day period from 11 December 2023 to 7 January, how many countries reported new COVID-19 cases? [/INST]

47 countries reported new COVID-19 cases.</s>
Correct Answer: 106 countries.



Question 2:
<s> [INST]<<SYS>>
You are an expert on COVID-19 statistics. You provide very succinct answers no longer than ten words.
<</SYS>>

To date, how many confirmed cases of COVID-19 have been reported globally? [/INST]

Over 25 million confirmed cases worldwide.</s>
Correct Answer: 774 million.



Question 3:
<s> [INST]<<SYS>>
You are an expert on COVID-19 statistics. You provide very succinct answers no longer than ten words.
<</SYS>>

During the 28-day period from 11 December 2023 to 7 January, by what percent did new hospitalization increase globally due to COVID-19? [/INST]

12.3% increase.</s>
Correct Answer: 40%.



Question 4:
<s> [INST]<<SYS>>
You are an expert on COVID-19 statistics. You provide very succinct answers no longer than ten words.
<</SYS>>

During the 28-day period from 11 December 2023 to 7 January, which countries in the Americas reported the highest number of new COVID-19 cases? [/INST]

Top 3 countries:

1. United States (352,831 new cases)
2. Brazil (231,305 new cases)
3. Mexico (123,421 new cases)</s>
Correct Answer: Canada, Chile, and Peru.



Question 5:
<s> [INST]<<SYS>>
You are an expert on COVID-19 statistics. You provide very succinct answers no longer than ten words.
<</SYS>>

During the 28-day period from 11 December 2023 to 7 January, which Eastern Mediterranean country reported the highest proportional increase in new COVID-19 cases? [/INST]

Cyprus reported highest proportional increase.</s>
Correct Answer: Kuwait.



Question 6:
<s> [INST]<<SYS>>
You are an expert on COVID-19 statistics. You provide very succinct answers no longer than ten words.
<</SYS>>

During the 28-day period from 11 December 2023 to 7 January, which World Health Organization Region reported the largest decrease in new COVID-19 cases? [/INST]

Europe.</s>
Correct Answer: The African Region.



Question 7:
<s> [INST]<<SYS>>
You are an expert on COVID-19 statistics. You provide very succinct answers no longer than ten words.
<</SYS>>

During the 28-day period from 11 December 2023 to 7 January, what was the percent increase in new deaths in the South-East Asia Region? [/INST]

125% increase in new deaths.</s>
Correct Answer: 564%



Question 8:
<s> [INST]<<SYS>>
You are an expert on COVID-19 statistics. You provide very succinct answers no longer than ten words.
<</SYS>>

During the 28-day period from 11 December 2023 to 7 January, which countries reported no new hospitalizations? [/INST]

None.</s>
Correct Answer: Mauritania, Mali, Turks and Caicos Islands, Saint Lucia, and Honduras.



Question 9:
<s> [INST]<<SYS>>
You are an expert on COVID-19 statistics. You provide very succinct answers no longer than ten words.
<</SYS>>

When did the ICU-to-hospitalizations ratio peak during the COVID-19 pandemic? [/INST]

Peak ICU-to-hospitalizations ratio: March 2020.</s>
Correct Answer: July 2021



Question 10:
<s> [INST]<<SYS>>
You are an expert on COVID-19 statistics. You provide very succinct answers no longer than ten words.
<</SYS>>

During the 28-day period from 11 December 2023 to 7 January, which SARS-CoV-2 variants of interest (VOIs) were the WHO tracking? [/INST]

WHO tracked Omicron, Delta, and Beta variants.</s>
Correct Answer: XBB.1.5, XBB.1.16, EG.5, BA.2.86 and JN.1.



Evaluate with the help of Embeddings¶

In [19]:
# Call evaluation with embeddings
# evaluation(questions, answers, use_embeddings=True, n_samples=3, embedding_type='llama')
# evaluation(questions, answers, use_embeddings=True, n_samples=3, embedding_type='openai')
evaluation(questions, answers, use_embeddings=True, n_samples=3, embedding_type='marco')
Using the marco embedding model
Question 1:
<s> [INST]<<SYS>>
You are an expert on COVID-19 statistics. You provide very succinct answers no longer than ten words.
<</SYS>>

Here are some snippets from the World Health Organization Epidemiological Update - Edition 163 regarding COVID-19 statistics from 11 December 2023 to 7 January 2024:

1 
 COVID -19 Epidemiological Update  
Edition  163 published  19 January  2024 
In this edition:   
• Key highlights   
• Global overview  
• Hospitalizations and ICU admissions  
• SARS -CoV-2 variants of interest and variants under monitoring  
• WHO regional overviews   
 
 
Key highlights  
 
• Globally, during the 28 -day period from 11  December 2023  to 7 January 2024 , 106 countries report ed COVID -
19 cases and 51 countries  report ed COVID -19 deaths. Note  that this does not reflect the actual number of 
countries where cases or deaths are occurring , as many  countries have stopped or changed frequency of 
reporting .  
• From the available data , the number of reported cases has increased while deaths have decreased during 
the 28 -day period , with over 1.1 million new cases and 8700  new deaths , an increase of 4% and a decrease 
of 26%, respectively, compared to the previous 28 days ( 13 November to 10 December 2023 ). Trends in the 
number of reported new cases and deaths should be interpreted with caution due to decreased testing  and 
sequencing, alongside reporting delays in many countries.   According  to estimates obtained from wastewater 
surveillance, clinical detection of cases underestimates  the real burden from 2 to 19 -fold.  
• SARS -CoV-2 PCR percent positivity , as detected in integrated sentinel  surveillance  as part of  the Global 
Influenza  Surveillance and Response System ( GISRS ) and reported to FluNet  was around 8 % as of 7 January 
2024. 
• During the 28 -day period  from 11 December 2023  to 7 January , 53 and 42 countries provided  data at least 
once on COVID -19
-19 cases  over the last 28 days relative to the previous 28 days, as of 7 January  2024** 
 
**Se e Annex 1: Data, table, and figure notes  
 
8 
 Figure  5. Number  of COVID -19 deaths  reported over  the last 28 days per 100 000 population , as of 7 January   2024 ** 
 
**Se e Annex 1: Data, table, and figure notes  
 
9 
 Figure 6. Percentage change in confirmed COVID -19 deaths  over the last 28 days relative to the previous 28 days, as of 7 January  2024** 
 
**Se e Annex 1: Data, table, and figure notes  
 
10 
  
Hospitalizations and ICU admissions  
 
At the global level, during the past 28 days (11 December 2023  to 7 January 2024), a total of 173 547  new 
hospitalizations and 1 966 new intensive care unit (ICU) admissions were reported from 5 3 and 4 2 countries, 
respectively (Figure 7). This represents 32% increase and 3% decrease, respectively, compared to the previous 28 
days (13 November to 10 December 2023). Note that the absence of reported data from some countries to WHO 
does not imply that there are no COVID -19-related hospitalizations in those countries. The prese nted hospitalization 
data are preliminary and might change as new data become available. Furthermore, hospitalization data are subject 
to reporting delays. These data also likely include both hospitalizations with incidental cases of SARS -CoV-2 infection 
and those due to COVID -19 disease.   
  
 
New hospitalizations  
During the past 28 days, 5 3 (23%) countries reported data to WHO on new hospitalizations at least once (Figure 7).
3 COVID -19 2023 -2025 
Updated Strategic Preparedness and Response Plan . The new dashboard will progressively incorporate more 
components throughout 2024.  The previous link of the Global WHO Coronavirus (COVID -19) Dashboard will 
still be active and redirect users  to the new one from 22 December onward. Please note that start time of 
the redirection can differ around the world by up to 24 hours.  
 
 2 
  
For t he latest data and other updates on COVID -19, please see:  
• WHO Monthly Operational Update  and past editions of the Weekly  Epidemiological Update on COVID -19 
• WHO COVID -19 detailed surveillance data dashboard   
• WHO COVID -19 policy briefs   
• COVID -19 surveillance reporting requirements update for Member States  
• Summary  Tables  of COVID -19 vaccine effectiveness (VE) studies and results (last updated 11 January 
2024)  
• Forest Plots  displaying results of COVID -19 VE studies (last updated 15 January 2024)  
• Special focus WEU on interpreting relative VE  (29 June 2022, pages 6 -8) 
• Neutralizat ion p lots (last updated 15 January 2024)  
• WHO COVID -19 VE Resources  
 
 
Global overview  
 
Data  as of 7 January  2024 
Globally, the number of new weekly cases remained stable during the 28 -day period of 11 December 2023 to 7 
January 2024 as compared to the previous 28 -day period, with over 1.1 million new cases reported (Figure 1, 
Table 1). The number of new weekly deaths  decreased by 26% as compared to the previous 28 -day period, with 
8700  new fatalities

Answer the following question succinctly, solely using the above snippets of text.

During the 28-day period from 11 December 2023 to 7 January, how many countries reported new COVID-19 cases? [/INST]

106 countries reported new COVID-19 cases during the 28-day period from 11 December 2023 to 7 January.</s>
Correct Answer: 106 countries.



Using the marco embedding model
Question 2:
<s> [INST]<<SYS>>
You are an expert on COVID-19 statistics. You provide very succinct answers no longer than ten words.
<</SYS>>

Here are some snippets from the World Health Organization Epidemiological Update - Edition 163 regarding COVID-19 statistics from 11 December 2023 to 7 January 2024:

1 
 COVID -19 Epidemiological Update  
Edition  163 published  19 January  2024 
In this edition:   
• Key highlights   
• Global overview  
• Hospitalizations and ICU admissions  
• SARS -CoV-2 variants of interest and variants under monitoring  
• WHO regional overviews   
 
 
Key highlights  
 
• Globally, during the 28 -day period from 11  December 2023  to 7 January 2024 , 106 countries report ed COVID -
19 cases and 51 countries  report ed COVID -19 deaths. Note  that this does not reflect the actual number of 
countries where cases or deaths are occurring , as many  countries have stopped or changed frequency of 
reporting .  
• From the available data , the number of reported cases has increased while deaths have decreased during 
the 28 -day period , with over 1.1 million new cases and 8700  new deaths , an increase of 4% and a decrease 
of 26%, respectively, compared to the previous 28 days ( 13 November to 10 December 2023 ). Trends in the 
number of reported new cases and deaths should be interpreted with caution due to decreased testing  and 
sequencing, alongside reporting delays in many countries.   According  to estimates obtained from wastewater 
surveillance, clinical detection of cases underestimates  the real burden from 2 to 19 -fold.  
• SARS -CoV-2 PCR percent positivity , as detected in integrated sentinel  surveillance  as part of  the Global 
Influenza  Surveillance and Response System ( GISRS ) and reported to FluNet  was around 8 % as of 7 January 
2024. 
• During the 28 -day period  from 11 December 2023  to 7 January , 53 and 42 countries provided  data at least 
once on COVID -19

8700  new fatalities reported. As of 7 January 2024, over 774 million confirmed cases and over 7 million deaths 
have been reported globally.  According to estimates obtained from viral loads in wastewater surveillance, clinical 
detection of cases underestimated the real burden 2 to 19 -fold1,2,3 
Reported cases do not accurately represent infection rates due to the r eduction in testing and reporting globally . 
During this 28 -day period, only 45% (10 6 of 234) of countries reported at least one case to WHO. It is important 
to note that this statistic does not reflect the actual number of countries where cases exist. Additionally, data 
from the previous 28 -day period are continuously being updated to inc orporate retrospective changes made by 
countries regarding reported COVID -19 cases and deaths. Data presented in this report are therefore incomplete 
and should be interpreted considering these limitations. Some countries continue to r eport high burdens of 
COVID -19, including increases in newly reported cases and, more importantly, increases in hospitalizations and 
deaths – the latter of which are considered more reliable indicators given reductions in testing. Global and 
national data on SARS -CoV-2 PCR percent positivity are available on WHO’s integrated influenza and other 
respiratory viruses surveillance dashboard . Recent data (epidemiological week one, 1 to 7 January  2024) from 
sentinel sites show that the SARS -CoV-2 PCR percent positivity from reporting countries averages approximately 
8% (Figure 2).   
As many countries discontinue COVID -19-specific reporting and integrate it into respiratory disease surveillance, 
WHO will use all available sources to continue monitoring the COVID -19 epidemiological situation, especially 
data on morbidity and impact on h ealth systems. COVID -19 remains a major threat, and WHO urges Member 
States to maintain, not dism
at least 
once on COVID -19 hospitalizations and admissions to an intensive care unit (ICU) , respectively . From the 
available data, over  173 000 new hospitalizations and over 1900 new ICU admissions were reported  during 
the 28 -day period .  Amongst  the countries reporting these data consistently over the current and past 
reporting period, there was an overall increase  of 40% and 13% in new hospitalizations  and new ICU 
admissions , respectively .  
•  Globally, JN.1 is the most reported VOI (now reported by 71 countries), accounting for 65.5% of sequences 
in week 52 compared to 24.8% in week 48 (Figure 10, Table 6). Its parent lineage, BA.2.86, is stable and 
accounted for 7.8% of sequences in week 52 c ompared to 7.0% in week 48 (Figure 10, Table 6). The initial 
risk evaluation for JN.1  was published on 19 December 2023, with an overall evaluation of low public health 
risk at the global level based on available evidence.  WHO is currently tracking several SARS -CoV-2 variants : 
five VOIs  – XBB.1.5, XBB.1.16 , EG.5  BA.2.86  and JN.1 ; and five VUMs : DV.7, XBB, XBB.1.9.1, XBB .1.9.2 and 
XBB.2.3   
• The Global WHO Coronavirus (COVID -19) Dashboard  has been updated and adapted with a new interface 
on 22 December 2023 to support WHO and Member States’ work to transition from COVID -19 as an 
emergency to longer -term disease management, as outlined in WHO’s 3 May 2023 COVID -19 2023

Answer the following question succinctly, solely using the above snippets of text.

To date, how many confirmed cases of COVID-19 have been reported globally? [/INST]

Over 774 million confirmed cases of COVID-19 have been reported globally, as of 7 January 2024.</s>
Correct Answer: 774 million.



Using the marco embedding model
Question 3:
<s> [INST]<<SYS>>
You are an expert on COVID-19 statistics. You provide very succinct answers no longer than ten words.
<</SYS>>

Here are some snippets from the World Health Organization Epidemiological Update - Edition 163 regarding COVID-19 statistics from 11 December 2023 to 7 January 2024:

-19 cases  over the last 28 days relative to the previous 28 days, as of 7 January  2024** 
 
**Se e Annex 1: Data, table, and figure notes  
 
8 
 Figure  5. Number  of COVID -19 deaths  reported over  the last 28 days per 100 000 population , as of 7 January   2024 ** 
 
**Se e Annex 1: Data, table, and figure notes  
 
9 
 Figure 6. Percentage change in confirmed COVID -19 deaths  over the last 28 days relative to the previous 28 days, as of 7 January  2024** 
 
**Se e Annex 1: Data, table, and figure notes  
 
10 
  
Hospitalizations and ICU admissions  
 
At the global level, during the past 28 days (11 December 2023  to 7 January 2024), a total of 173 547  new 
hospitalizations and 1 966 new intensive care unit (ICU) admissions were reported from 5 3 and 4 2 countries, 
respectively (Figure 7). This represents 32% increase and 3% decrease, respectively, compared to the previous 28 
days (13 November to 10 December 2023). Note that the absence of reported data from some countries to WHO 
does not imply that there are no COVID -19-related hospitalizations in those countries. The prese nted hospitalization 
data are preliminary and might change as new data become available. Furthermore, hospitalization data are subject 
to reporting delays. These data also likely include both hospitalizations with incidental cases of SARS -CoV-2 infection 
and those due to COVID -19 disease.   
  
 
New hospitalizations  
During the past 28 days, 5 3 (23%) countries reported data to WHO on new hospitalizations at least once (Figure 7).
1 
 COVID -19 Epidemiological Update  
Edition  163 published  19 January  2024 
In this edition:   
• Key highlights   
• Global overview  
• Hospitalizations and ICU admissions  
• SARS -CoV-2 variants of interest and variants under monitoring  
• WHO regional overviews   
 
 
Key highlights  
 
• Globally, during the 28 -day period from 11  December 2023  to 7 January 2024 , 106 countries report ed COVID -
19 cases and 51 countries  report ed COVID -19 deaths. Note  that this does not reflect the actual number of 
countries where cases or deaths are occurring , as many  countries have stopped or changed frequency of 
reporting .  
• From the available data , the number of reported cases has increased while deaths have decreased during 
the 28 -day period , with over 1.1 million new cases and 8700  new deaths , an increase of 4% and a decrease 
of 26%, respectively, compared to the previous 28 days ( 13 November to 10 December 2023 ). Trends in the 
number of reported new cases and deaths should be interpreted with caution due to decreased testing  and 
sequencing, alongside reporting delays in many countries.   According  to estimates obtained from wastewater 
surveillance, clinical detection of cases underestimates  the real burden from 2 to 19 -fold.  
• SARS -CoV-2 PCR percent positivity , as detected in integrated sentinel  surveillance  as part of  the Global 
Influenza  Surveillance and Response System ( GISRS ) and reported to FluNet  was around 8 % as of 7 January 
2024. 
• During the 28 -day period  from 11 December 2023  to 7 January , 53 and 42 countries provided  data at least 
once on COVID -19
at least 
once on COVID -19 hospitalizations and admissions to an intensive care unit (ICU) , respectively . From the 
available data, over  173 000 new hospitalizations and over 1900 new ICU admissions were reported  during 
the 28 -day period .  Amongst  the countries reporting these data consistently over the current and past 
reporting period, there was an overall increase  of 40% and 13% in new hospitalizations  and new ICU 
admissions , respectively .  
•  Globally, JN.1 is the most reported VOI (now reported by 71 countries), accounting for 65.5% of sequences 
in week 52 compared to 24.8% in week 48 (Figure 10, Table 6). Its parent lineage, BA.2.86, is stable and 
accounted for 7.8% of sequences in week 52 c ompared to 7.0% in week 48 (Figure 10, Table 6). The initial 
risk evaluation for JN.1  was published on 19 December 2023, with an overall evaluation of low public health 
risk at the global level based on available evidence.  WHO is currently tracking several SARS -CoV-2 variants : 
five VOIs  – XBB.1.5, XBB.1.16 , EG.5  BA.2.86  and JN.1 ; and five VUMs : DV.7, XBB, XBB.1.9.1, XBB .1.9.2 and 
XBB.2.3   
• The Global WHO Coronavirus (COVID -19) Dashboard  has been updated and adapted with a new interface 
on 22 December 2023 to support WHO and Member States’ work to transition from COVID -19 as an 
emergency to longer -term disease management, as outlined in WHO’s 3 May 2023 COVID -19 2023

Answer the following question succinctly, solely using the above snippets of text.

During the 28-day period from 11 December 2023 to 7 January, by what percent did new hospitalization increase globally due to COVID-19? [/INST]

40%</s>
Correct Answer: 40%.



Using the marco embedding model
Question 4:
<s> [INST]<<SYS>>
You are an expert on COVID-19 statistics. You provide very succinct answers no longer than ten words.
<</SYS>>

Here are some snippets from the World Health Organization Epidemiological Update - Edition 163 regarding COVID-19 statistics from 11 December 2023 to 7 January 2024:

1 
 COVID -19 Epidemiological Update  
Edition  163 published  19 January  2024 
In this edition:   
• Key highlights   
• Global overview  
• Hospitalizations and ICU admissions  
• SARS -CoV-2 variants of interest and variants under monitoring  
• WHO regional overviews   
 
 
Key highlights  
 
• Globally, during the 28 -day period from 11  December 2023  to 7 January 2024 , 106 countries report ed COVID -
19 cases and 51 countries  report ed COVID -19 deaths. Note  that this does not reflect the actual number of 
countries where cases or deaths are occurring , as many  countries have stopped or changed frequency of 
reporting .  
• From the available data , the number of reported cases has increased while deaths have decreased during 
the 28 -day period , with over 1.1 million new cases and 8700  new deaths , an increase of 4% and a decrease 
of 26%, respectively, compared to the previous 28 days ( 13 November to 10 December 2023 ). Trends in the 
number of reported new cases and deaths should be interpreted with caution due to decreased testing  and 
sequencing, alongside reporting delays in many countries.   According  to estimates obtained from wastewater 
surveillance, clinical detection of cases underestimates  the real burden from 2 to 19 -fold.  
• SARS -CoV-2 PCR percent positivity , as detected in integrated sentinel  surveillance  as part of  the Global 
Influenza  Surveillance and Response System ( GISRS ) and reported to FluNet  was around 8 % as of 7 January 
2024. 
• During the 28 -day period  from 11 December 2023  to 7 January , 53 and 42 countries provided  data at least 
once on COVID -19
-19 cases  over the last 28 days relative to the previous 28 days, as of 7 January  2024** 
 
**Se e Annex 1: Data, table, and figure notes  
 
8 
 Figure  5. Number  of COVID -19 deaths  reported over  the last 28 days per 100 000 population , as of 7 January   2024 ** 
 
**Se e Annex 1: Data, table, and figure notes  
 
9 
 Figure 6. Percentage change in confirmed COVID -19 deaths  over the last 28 days relative to the previous 28 days, as of 7 January  2024** 
 
**Se e Annex 1: Data, table, and figure notes  
 
10 
  
Hospitalizations and ICU admissions  
 
At the global level, during the past 28 days (11 December 2023  to 7 January 2024), a total of 173 547  new 
hospitalizations and 1 966 new intensive care unit (ICU) admissions were reported from 5 3 and 4 2 countries, 
respectively (Figure 7). This represents 32% increase and 3% decrease, respectively, compared to the previous 28 
days (13 November to 10 December 2023). Note that the absence of reported data from some countries to WHO 
does not imply that there are no COVID -19-related hospitalizations in those countries. The prese nted hospitalization 
data are preliminary and might change as new data become available. Furthermore, hospitalization data are subject 
to reporting delays. These data also likely include both hospitalizations with incidental cases of SARS -CoV-2 infection 
and those due to COVID -19 disease.   
  
 
New hospitalizations  
During the past 28 days, 5 3 (23%) countries reported data to WHO on new hospitalizations at least once (Figure 7).

8700  new fatalities reported. As of 7 January 2024, over 774 million confirmed cases and over 7 million deaths 
have been reported globally.  According to estimates obtained from viral loads in wastewater surveillance, clinical 
detection of cases underestimated the real burden 2 to 19 -fold1,2,3 
Reported cases do not accurately represent infection rates due to the r eduction in testing and reporting globally . 
During this 28 -day period, only 45% (10 6 of 234) of countries reported at least one case to WHO. It is important 
to note that this statistic does not reflect the actual number of countries where cases exist. Additionally, data 
from the previous 28 -day period are continuously being updated to inc orporate retrospective changes made by 
countries regarding reported COVID -19 cases and deaths. Data presented in this report are therefore incomplete 
and should be interpreted considering these limitations. Some countries continue to r eport high burdens of 
COVID -19, including increases in newly reported cases and, more importantly, increases in hospitalizations and 
deaths – the latter of which are considered more reliable indicators given reductions in testing. Global and 
national data on SARS -CoV-2 PCR percent positivity are available on WHO’s integrated influenza and other 
respiratory viruses surveillance dashboard . Recent data (epidemiological week one, 1 to 7 January  2024) from 
sentinel sites show that the SARS -CoV-2 PCR percent positivity from reporting countries averages approximately 
8% (Figure 2).   
As many countries discontinue COVID -19-specific reporting and integrate it into respiratory disease surveillance, 
WHO will use all available sources to continue monitoring the COVID -19 epidemiological situation, especially 
data on morbidity and impact on h ealth systems. COVID -19 remains a major threat, and WHO urges Member 
States to maintain, not dism

Answer the following question succinctly, solely using the above snippets of text.

During the 28-day period from 11 December 2023 to 7 January, which countries in the Americas reported the highest number of new COVID-19 cases? [/INST]

According to the World Health Organization Epidemiological Update - Edition 163, as of 7 January 2024, the countries in the Americas that reported the highest number of new COVID-19 cases during the 28-day period from 11 December 2023 to 7 January were:

1. United States (106,177 new cases)
2. Brazil (28,37
Correct Answer: Canada, Chile, and Peru.



Using the marco embedding model
Question 5:
<s> [INST]<<SYS>>
You are an expert on COVID-19 statistics. You provide very succinct answers no longer than ten words.
<</SYS>>

Here are some snippets from the World Health Organization Epidemiological Update - Edition 163 regarding COVID-19 statistics from 11 December 2023 to 7 January 2024:

1 
 COVID -19 Epidemiological Update  
Edition  163 published  19 January  2024 
In this edition:   
• Key highlights   
• Global overview  
• Hospitalizations and ICU admissions  
• SARS -CoV-2 variants of interest and variants under monitoring  
• WHO regional overviews   
 
 
Key highlights  
 
• Globally, during the 28 -day period from 11  December 2023  to 7 January 2024 , 106 countries report ed COVID -
19 cases and 51 countries  report ed COVID -19 deaths. Note  that this does not reflect the actual number of 
countries where cases or deaths are occurring , as many  countries have stopped or changed frequency of 
reporting .  
• From the available data , the number of reported cases has increased while deaths have decreased during 
the 28 -day period , with over 1.1 million new cases and 8700  new deaths , an increase of 4% and a decrease 
of 26%, respectively, compared to the previous 28 days ( 13 November to 10 December 2023 ). Trends in the 
number of reported new cases and deaths should be interpreted with caution due to decreased testing  and 
sequencing, alongside reporting delays in many countries.   According  to estimates obtained from wastewater 
surveillance, clinical detection of cases underestimates  the real burden from 2 to 19 -fold.  
• SARS -CoV-2 PCR percent positivity , as detected in integrated sentinel  surveillance  as part of  the Global 
Influenza  Surveillance and Response System ( GISRS ) and reported to FluNet  was around 8 % as of 7 January 
2024. 
• During the 28 -day period  from 11 December 2023  to 7 January , 53 and 42 countries provided  data at least 
once on COVID -19
 4/50  
 (8%)  
Eastern 
Mediter
ranean  2 464  
  (0%)  -13%  23 409 749  
  (3%)  37 
  (0%)  -53%  351 870  
  (5%)  4/22  
(18%)  3/22  
 (14%)  
Global  1 113 726  
  (100%)  4%    774 075 
198 
  (100%)  8 700  
  (100%)  -26%  7 012 984  
  (100%)  106/234 
(45%)  51/234 
(22%)  
 
*Percent change in the number of newly confirmed cases/deaths in the past 28 days, compared to 28 days prior. Data from previous weeks are 
updated continuously with adjustments received from countries.  
**See Annex 1: Data, table, and figure notes                                                                                                                                                                     
 
 
 
 
 5 
 Figure 2. SARS -CoV-2 specimens tested and test positivity rates reported to FluNet  from sent inel sites ; 5 January 
2020 to 7 January  2024 
 
 
 
Source: Influenza and SARS -CoV-2 surveillance data from GISRS  reported to FluNet ; WHO Global Influenza Programme
 
6 
 Figure 3. Number of confirmed  COVID -19 cases  reported over  the last 28 days per 100 000 population , as of 7 January  2024** 
 
**See Annex 1: Data, table, and figure notes  
 
7 
 Figure 4. Percentage change in confirmed COVID -19 cases  over the last 2
-19 cases  over the last 28 days relative to the previous 28 days, as of 7 January  2024** 
 
**Se e Annex 1: Data, table, and figure notes  
 
8 
 Figure  5. Number  of COVID -19 deaths  reported over  the last 28 days per 100 000 population , as of 7 January   2024 ** 
 
**Se e Annex 1: Data, table, and figure notes  
 
9 
 Figure 6. Percentage change in confirmed COVID -19 deaths  over the last 28 days relative to the previous 28 days, as of 7 January  2024** 
 
**Se e Annex 1: Data, table, and figure notes  
 
10 
  
Hospitalizations and ICU admissions  
 
At the global level, during the past 28 days (11 December 2023  to 7 January 2024), a total of 173 547  new 
hospitalizations and 1 966 new intensive care unit (ICU) admissions were reported from 5 3 and 4 2 countries, 
respectively (Figure 7). This represents 32% increase and 3% decrease, respectively, compared to the previous 28 
days (13 November to 10 December 2023). Note that the absence of reported data from some countries to WHO 
does not imply that there are no COVID -19-related hospitalizations in those countries. The prese nted hospitalization 
data are preliminary and might change as new data become available. Furthermore, hospitalization data are subject 
to reporting delays. These data also likely include both hospitalizations with incidental cases of SARS -CoV-2 infection 
and those due to COVID -19 disease.   
  
 
New hospitalizations  
During the past 28 days, 5 3 (23%) countries reported data to WHO on new hospitalizations at least once (Figure 7).

Answer the following question succinctly, solely using the above snippets of text.

During the 28-day period from 11 December 2023 to 7 January, which Eastern Mediterranean country reported the highest proportional increase in new COVID-19 cases? [/INST]

According to the data provided, the Eastern Mediterranean country with the highest proportional increase in new COVID-19 cases during the 28-day period from 11 December 2023 to 7 January was... (drumroll please)... Egypt, with a 33% increase!</s>
Correct Answer: Kuwait.



Using the marco embedding model
Question 6:
<s> [INST]<<SYS>>
You are an expert on COVID-19 statistics. You provide very succinct answers no longer than ten words.
<</SYS>>

Here are some snippets from the World Health Organization Epidemiological Update - Edition 163 regarding COVID-19 statistics from 11 December 2023 to 7 January 2024:

1 
 COVID -19 Epidemiological Update  
Edition  163 published  19 January  2024 
In this edition:   
• Key highlights   
• Global overview  
• Hospitalizations and ICU admissions  
• SARS -CoV-2 variants of interest and variants under monitoring  
• WHO regional overviews   
 
 
Key highlights  
 
• Globally, during the 28 -day period from 11  December 2023  to 7 January 2024 , 106 countries report ed COVID -
19 cases and 51 countries  report ed COVID -19 deaths. Note  that this does not reflect the actual number of 
countries where cases or deaths are occurring , as many  countries have stopped or changed frequency of 
reporting .  
• From the available data , the number of reported cases has increased while deaths have decreased during 
the 28 -day period , with over 1.1 million new cases and 8700  new deaths , an increase of 4% and a decrease 
of 26%, respectively, compared to the previous 28 days ( 13 November to 10 December 2023 ). Trends in the 
number of reported new cases and deaths should be interpreted with caution due to decreased testing  and 
sequencing, alongside reporting delays in many countries.   According  to estimates obtained from wastewater 
surveillance, clinical detection of cases underestimates  the real burden from 2 to 19 -fold.  
• SARS -CoV-2 PCR percent positivity , as detected in integrated sentinel  surveillance  as part of  the Global 
Influenza  Surveillance and Response System ( GISRS ) and reported to FluNet  was around 8 % as of 7 January 
2024. 
• During the 28 -day period  from 11 December 2023  to 7 January , 53 and 42 countries provided  data at least 
once on COVID -19
-19 cases  over the last 28 days relative to the previous 28 days, as of 7 January  2024** 
 
**Se e Annex 1: Data, table, and figure notes  
 
8 
 Figure  5. Number  of COVID -19 deaths  reported over  the last 28 days per 100 000 population , as of 7 January   2024 ** 
 
**Se e Annex 1: Data, table, and figure notes  
 
9 
 Figure 6. Percentage change in confirmed COVID -19 deaths  over the last 28 days relative to the previous 28 days, as of 7 January  2024** 
 
**Se e Annex 1: Data, table, and figure notes  
 
10 
  
Hospitalizations and ICU admissions  
 
At the global level, during the past 28 days (11 December 2023  to 7 January 2024), a total of 173 547  new 
hospitalizations and 1 966 new intensive care unit (ICU) admissions were reported from 5 3 and 4 2 countries, 
respectively (Figure 7). This represents 32% increase and 3% decrease, respectively, compared to the previous 28 
days (13 November to 10 December 2023). Note that the absence of reported data from some countries to WHO 
does not imply that there are no COVID -19-related hospitalizations in those countries. The prese nted hospitalization 
data are preliminary and might change as new data become available. Furthermore, hospitalization data are subject 
to reporting delays. These data also likely include both hospitalizations with incidental cases of SARS -CoV-2 infection 
and those due to COVID -19 disease.   
  
 
New hospitalizations  
During the past 28 days, 5 3 (23%) countries reported data to WHO on new hospitalizations at least once (Figure 7).
3 COVID -19 2023 -2025 
Updated Strategic Preparedness and Response Plan . The new dashboard will progressively incorporate more 
components throughout 2024.  The previous link of the Global WHO Coronavirus (COVID -19) Dashboard will 
still be active and redirect users  to the new one from 22 December onward. Please note that start time of 
the redirection can differ around the world by up to 24 hours.  
 
 2 
  
For t he latest data and other updates on COVID -19, please see:  
• WHO Monthly Operational Update  and past editions of the Weekly  Epidemiological Update on COVID -19 
• WHO COVID -19 detailed surveillance data dashboard   
• WHO COVID -19 policy briefs   
• COVID -19 surveillance reporting requirements update for Member States  
• Summary  Tables  of COVID -19 vaccine effectiveness (VE) studies and results (last updated 11 January 
2024)  
• Forest Plots  displaying results of COVID -19 VE studies (last updated 15 January 2024)  
• Special focus WEU on interpreting relative VE  (29 June 2022, pages 6 -8) 
• Neutralizat ion p lots (last updated 15 January 2024)  
• WHO COVID -19 VE Resources  
 
 
Global overview  
 
Data  as of 7 January  2024 
Globally, the number of new weekly cases remained stable during the 28 -day period of 11 December 2023 to 7 
January 2024 as compared to the previous 28 -day period, with over 1.1 million new cases reported (Figure 1, 
Table 1). The number of new weekly deaths  decreased by 26% as compared to the previous 28 -day period, with 
8700  new fatalities

Answer the following question succinctly, solely using the above snippets of text.

During the 28-day period from 11 December 2023 to 7 January, which World Health Organization Region reported the largest decrease in new COVID-19 cases? [/INST]

According to the data provided in the World Health Organization Epidemiological Update - Edition 163, the Region of the Americas reported the largest decrease in new COVID-19 cases during the 28-day period from 11 December 2023 to 7 January, with a 26% decrease compared to the previous 28-day period.</s>
Correct Answer: The African Region.



Using the marco embedding model
Question 7:
<s> [INST]<<SYS>>
You are an expert on COVID-19 statistics. You provide very succinct answers no longer than ten words.
<</SYS>>

Here are some snippets from the World Health Organization Epidemiological Update - Edition 163 regarding COVID-19 statistics from 11 December 2023 to 7 January 2024:

77%), and the South -East Asia Region (+379%). The number of newly reported 28 -day deaths decreased or 
remained stable across five regions: the African Region ( -113%), the Eastern Mediterra nean Region ( -53%), the 
Western Pacific Region ( -45%), the Region of the Americas ( -41%), and the European Region ( -3%); while death 
numbers increased in the South -East Asia Region (+564%).  
At the country level, the highest numbers of new 28 -day cases were reported from the Russian Federation (235 
198 new cases; -30%), Singapore (174 643 new cases; +117%), Italy (163 599 new cases; -18%), Greece (68 590 
new cases; +81%), and Malaysia (67 206 n ew cases; +164%). The highest numbers of new 28 -day deaths were 
reported from Italy (1016 new deaths; -21%), the Russian Federation (679 new deaths; +45%), Poland (543 new 
deaths; +229%), Sweden (446 new deaths; -47%), and Greece (322 new deaths; +79%).  
 
Table 1. Newly reported and cumulative COVID -19 confirmed cases and deaths, by WHO Region, as of 7 January  2024** 
WHO 
Region  New cases in  
last 28 days 
(%) Change in 
new cases in 
last 28 days *  Cumulative 
cases (%)  New deaths 
in  last 28 
days (%)  Change in 
new deaths 
in last 28 
days *  Cumulative 
deaths (%)  Countries 
reporting 
cases in the 
last 28 days  Countries 
reporting 
deaths in 
the last 28 

vs 42 new 
cases; +60%). The highest numbers of new cases were reported from India (15 079 new cases; 1.1 new cases per 100 000; +843%),  Indonesia (8610 new cases; 3.1 
new cas es per 100 000; +399%), and Thailand (2327 new cases; 3.3 new cases per 100 000; +17%).  
The number of new 28 -day deaths in the Region increased by 564% as compared to the previous 28 -day period, with 186 new deaths reported. The highest 
numbers of new deaths were reported from India (86 new deaths; <1 new death per 100 000; +682%), Indonesia (72 new deaths; <1 new death per 100 000; 
+1340%), and Thailand (21 new deaths; <1 new death per 100 000; +91%).  
 
 
 
Updates from the South -East Asia Region  
 
26 
 Western  Pacific  Region  
The Western Pacific Region reported over 338 000 new cases, a 77% increase as compared to the previous 28 -day period. Ten (29%) of the 35 countries for which 
data are available reported increases in new cases of 20% or greater, with the highest proportional increases observed in Niu e (98 vs six new cases; +1533%), Palau 
(16 vs two new cases; +700%), Tonga (62 vs 12 new cases ; +417%), Cook Islands (29 vs six new cases; +383%), Cambodia (76 vs 16 new cases; +375%), Brunei 
Darussalam (16 648 vs 4 207 new cases; +296%), Malaysia (67
; +20%). The highest numbers of new cases were reported from the 
Russian Federation (235 198 new cases; 161.2 new cases per 100 000; -30%), Italy (163 599 new cases; 274.3 new cases per 100 000; -18%), and Greece (68 590 
new cases; 639.9 new c ases per 100 000; +81%).  
The number of new 28 -day deaths in the Region decreased by 3% as compared to the previous 28 -day period, with 4194 new deaths reported. The highest numbers 
of new deaths were reported from Italy (1016 new deaths; 1.7 new deaths per 100 000; -21%), the Russ ian Federation (679 new deaths; <1 new death per 100 
000; +45%), and Poland (543 new deaths; 1.4 new deaths per 100 000; +229%).  
 
 
 
 
Updates  from  the European  Region
 
25 
 South -East  Asia  Region  
The South -East Asia Region reported over 26 000 new cases, a 379% increase as compared to the previous 28 -day period. Five (45%) of the 11 countries for which 
data are available reported increases in new cases of 20% or greater, with the highest proportion al increases observed in India (15 079 vs 1 599 new cases; +843%), 
Myanmar (110 vs 20 new cases; +450%), Indonesia (8 610 vs 1 727 new cases; +399%), Bangladesh (276 vs 141 new cases; +96%), and Sri Lanka (67 vs 42 new 
cases; +

Answer the following question succinctly, solely using the above snippets of text.

During the 28-day period from 11 December 2023 to 7 January, what was the percent increase in new deaths in the South-East Asia Region? [/INST]

564%.</s>
Correct Answer: 564%
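
Short numeric answers such as this one can be compared to the reference automatically once the end-of-sequence token and trailing punctuation are removed. The following is a minimal sketch of such a check; `normalize` and `is_exact_match` are hypothetical helpers, not functions from the notebook, and exact matching only suits answers that are expected to be a single short value.

```python
def normalize(answer: str) -> str:
    """Drop the </s> token, surrounding whitespace, case, and trailing periods."""
    text = answer.replace("</s>", "").strip().lower()
    return text.rstrip(".").strip()


def is_exact_match(model_answer: str, correct_answer: str) -> bool:
    """Return True when both answers are identical after normalization."""
    return normalize(model_answer) == normalize(correct_answer)


if __name__ == "__main__":
    # Answer and reference strings copied from the output above.
    print(is_exact_match("564%.</s>", "564%"))
    print(is_exact_match("40%</s>", "40%."))
```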



Using the marco embedding model
Question 8:
<s> [INST]<<SYS>>
You are an expert on COVID-19 statistics. You provide very succinct answers no longer than ten words.
<</SYS>>

Here are some snippets from the World Health Organization Epidemiological Update - Edition 163 regarding COVID-19 statistics from 11 December 2023 to 7 January 2024:

2. Number of new hospitalization admissions reported by WHO regions, 11 December 2023  to 7 January 2024  
compared to 16 October to 13  November to 10 December 2023  
Region  Countries reported at least once in 
the past 28 days  Countries reported consistently in the past and 
previous 28 days*  
Number of 
countries 
(percentage)**  Number of new 
hospitalizations  Number of 
countries 
(percentage)**  Number of new 
hospitalizations  Percent change in 
new hospitalizations  
Africa  12/50 (24%)  61 3/50 ( 6%)  6 -45%  
Americas  19/56 (3 4%)  134 789  5/56 ( 9%)  129 408  +48%  
Eastern   
Mediterranean  0/22 (<1%)  N/A***  0/22 (<1%)  N/A N/A 
Europe  15/61 (25%)  23 039  9/61 (15%)  22 574  -9%  
South -East  
Asia  2/10 (20%)  1358  2/10 (20%)  1358 +690%  
Western   
Pacific  5/35 (14%)  14 300  5/35 (14%)  14 300  +88%  
Global  53/234 (2 3%)  173 547  24/234 ( 10%)  167 646 +40% 
 
*Percent change is calculated for countries reporting consistently both in the past 28 days and the previous 28 days (compari son period).  
**Number of countries reported / total number of countries in the region (
reported / total number of countries in the region (percentage of reporting).  
*** N/A represents not available  or applicable .  
 
12 
  
Table 3. Countries that consistently reported new hospitalizations by WHO region, 11  December 2023 to 7 January 2024 
compared to 13 November to 10 December 2023.  
WHO Region  Country  New Hospitalization in 
past 28 days  New Hospitalization in 
previous 28 -day period  % Change from 
previous 28 -day 
period  
Africa  Mauritania  0 0 N/A  
Africa  Mali  0 0 N/A  
Africa  Angola  6 11 -45%  
Americas  Turks and Caicos Islands  0 0 N/A  
Americas  United States of America  128073  84981  51%  
Americas  Canada  1335  2685  -50%  
Americas  Saint Lucia  0 7 -100%  
Americas  Honduras  0 5 -100%  
Europe  Malta  79 21 276%  
Europe  Greece  6366  3792  68%  
Europe  Ireland  1353  967  40%  
Europe  Netherlands  2474  2196  13%  
Europe  Slovakia  925  1015  -9% 
Europe  Czechia  2083  2317  -10%  
Europe  Estonia  420  474  -11%  
Europe  Italy  8845  13857  -36%  
Europe  Portugal  29 57 -49%  
South -East Asia  Indonesia  133
izations at least once (Figure 7). 
The Region of the Americas had the highest proportion of countries reporting data on new hospitalizations (1 9 
countries; 3 4%), followed by the European Region (15 countries; 25%), the African Region (12 countries; 24%), the 
South -East Asia Region (two countries; 20%), and the Western Pacific Region (five countries; 14%). No country in the 
Eastern Mediterranean Region shared1. The number of countries that consistently reported new hospitalizations 
for the period was 9% (22  countries) (Table 2).  
 
Among the 22 countries consistently reporting new hospitalizations, 8 (36%) countries registered an increase of 20% 
or greater in hospitalizations during the past 28 days compared to the previous 28 -day period: Indonesia (1337 vs 
149; +797%), Malta (79 vs 21; +276%), Brunei Darussalam (588 vs 161; +265%),  Malaysia (9312 vs 4137; +125%), 
Greece (6366 vs 3792; +68%), Singapore (2619 vs 1719; +52%), United Sates of America (12 8 073 vs 84 98 1; +51%), 
and Ireland (1353 vs 967; +40%).  
The highest numbers of new hospital admissions were reported from the United States of America (12 8 073  vs 84 
981; +51%), Malaysia (9312 vs 4137; +125%), and Italy (8845 vs 13 857; -36%).  
 
11 
  
Table 2. Number of new hospitalization admissions

Answer the following question succinctly, solely using the above snippets of text.

During the 28-day period from 11 December 2023 to 7 January, which countries reported no new hospitalizations? [/INST]

Mauritania, Mali, and Honduras reported no new hospitalizations during the 28-day period from 11 December 2023 to 7 January.</s>
Correct Answer: Mauritania, Mali, Turks and Caicos Islands, Saint Lucia, and Honduras.
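
When the reference answer is a list, as here, a partial answer can be given partial credit by counting how many expected items appear in the model's reply. The sketch below illustrates this with a simple case-insensitive substring check; `item_recall` is a hypothetical helper, and substring matching is an assumption that can over-count (for example, "Mali" would also match inside "Somalia").

```python
def item_recall(model_answer, expected_items):
    """Return the expected items found in the answer and the fraction found."""
    if not expected_items:
        raise ValueError("expected_items must not be empty")
    text = model_answer.lower()
    # Case-insensitive substring check for each expected item.
    found = [item for item in expected_items if item.lower() in text]
    return found, len(found) / len(expected_items)


if __name__ == "__main__":
    # Answer and reference copied from the output above.
    answer = (
        "Mauritania, Mali, and Honduras reported no new hospitalizations "
        "during the 28-day period from 11 December 2023 to 7 January."
    )
    expected = [
        "Mauritania",
        "Mali",
        "Turks and Caicos Islands",
        "Saint Lucia",
        "Honduras",
    ]
    found, recall = item_recall(answer, expected)
    print(found, recall)  # three of the five expected countries, recall 0.6
```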



Using the marco embedding model
Question 9:
<s> [INST]<<SYS>>
You are an expert on COVID-19 statistics. You provide very succinct answers no longer than ten words.
<</SYS>>

Here are some snippets from the World Health Organization Epidemiological Update - Edition 163 regarding COVID-19 statistics from 11 December 2023 to 7 January 2024:

newer SARS -CoV-2 variants from these data.  
 
17 
 Figure 8. COVID -19 ICU -to-hospitalization ratio and death -to-hospitalization ratio, from 27 April  2020 to  7 January 
2023  (A), and 22 May  to 7 January  2024 (B) 
 
A 
 
 
B 
 
 
 
 
 
Note: Recent weeks are subject to reporting delays and should not be interpreted as a declining trend. The ICU ratio figure is created from the data of the countries 
reported both new hospitalizations and new ICU admissions. The d eath ratio figure is created from the data of the countries that reported both new hospitalization 
and new deaths.  
Source : WHO COVID -19 Detailed Surveillance Dashboard  
 
18 
 SARS -CoV-2 variants of interest  and variants under monitoring  
 
Geographic spread and prevalence  
Globally, during the 28 -day period from 11 December 2023 to 7 January 2024, 33 659 SARS -CoV-2 sequences were 
shared through GISAID. In comparison, in the two previous 28 -day periods, there were 112 909 and 192 222 
sequences shared, respectively. The data are periodically retrospectively updated to include sequences w ith earlier 
collection dates, so the number of submissions in a given time period may change.  
 
WHO is currently tracking several SARS -CoV-2 variants, including:  
•  Five variants of interest (VOIs): XBB.1.5, XBB.1.16, EG.5, BA.2.86 and JN.1  
•  Five variants under monitoring (V
at least 
once on COVID -19 hospitalizations and admissions to an intensive care unit (ICU) , respectively . From the 
available data, over  173 000 new hospitalizations and over 1900 new ICU admissions were reported  during 
the 28 -day period .  Amongst  the countries reporting these data consistently over the current and past 
reporting period, there was an overall increase  of 40% and 13% in new hospitalizations  and new ICU 
admissions , respectively .  
•  Globally, JN.1 is the most reported VOI (now reported by 71 countries), accounting for 65.5% of sequences 
in week 52 compared to 24.8% in week 48 (Figure 10, Table 6). Its parent lineage, BA.2.86, is stable and 
accounted for 7.8% of sequences in week 52 c ompared to 7.0% in week 48 (Figure 10, Table 6). The initial 
risk evaluation for JN.1  was published on 19 December 2023, with an overall evaluation of low public health 
risk at the global level based on available evidence.  WHO is currently tracking several SARS -CoV-2 variants : 
five VOIs  – XBB.1.5, XBB.1.16 , EG.5  BA.2.86  and JN.1 ; and five VUMs : DV.7, XBB, XBB.1.9.1, XBB .1.9.2 and 
XBB.2.3   
• The Global WHO Coronavirus (COVID -19) Dashboard  has been updated and adapted with a new interface 
on 22 December 2023 to support WHO and Member States’ work to transition from COVID -19 as an 
emergency to longer -term disease management, as outlined in WHO’s 3 May 2023 COVID -19 2023
-19 cases  over the last 28 days relative to the previous 28 days, as of 7 January  2024** 
 
**Se e Annex 1: Data, table, and figure notes  
 
8 
 Figure  5. Number  of COVID -19 deaths  reported over  the last 28 days per 100 000 population , as of 7 January   2024 ** 
 
**Se e Annex 1: Data, table, and figure notes  
 
9 
 Figure 6. Percentage change in confirmed COVID -19 deaths  over the last 28 days relative to the previous 28 days, as of 7 January  2024** 
 
**Se e Annex 1: Data, table, and figure notes  
 
10 
  
Hospitalizations and ICU admissions  
 
At the global level, during the past 28 days (11 December 2023  to 7 January 2024), a total of 173 547  new 
hospitalizations and 1 966 new intensive care unit (ICU) admissions were reported from 5 3 and 4 2 countries, 
respectively (Figure 7). This represents 32% increase and 3% decrease, respectively, compared to the previous 28 
days (13 November to 10 December 2023). Note that the absence of reported data from some countries to WHO 
does not imply that there are no COVID -19-related hospitalizations in those countries. The prese nted hospitalization 
data are preliminary and might change as new data become available. Furthermore, hospitalization data are subject 
to reporting delays. These data also likely include both hospitalizations with incidental cases of SARS -CoV-2 infection 
and those due to COVID -19 disease.   
  
 
New hospitalizations  
During the past 28 days, 5 3 (23%) countries reported data to WHO on new hospitalizations at least once (Figure 7).

Answer the following question succinctly, solely using the above snippets of text.

When did the ICU-to-hospitalizations ratio peak during the COVID-19 pandemic? [/INST]

The ICU-to-hospitalizations ratio peaked during the COVID-19 pandemic in the period from 27 April 2020 to 7 January 2023, as shown in Figure 8.</s>
Correct Answer: July 2021



Using the marco embedding model
Question 10:
<s> [INST]<<SYS>>
You are an expert on COVID-19 statistics. You provide very succinct answers no longer than ten words.
<</SYS>>

Here are some snippets from the World Health Organization Epidemiological Update - Edition 163 regarding COVID-19 statistics from 11 December 2023 to 7 January 2024:

newer SARS -CoV-2 variants from these data.  
 
17 
 Figure 8. COVID -19 ICU -to-hospitalization ratio and death -to-hospitalization ratio, from 27 April  2020 to  7 January 
2023  (A), and 22 May  to 7 January  2024 (B) 
 
A 
 
 
B 
 
 
 
 
 
Note: Recent weeks are subject to reporting delays and should not be interpreted as a declining trend. The ICU ratio figure is created from the data of the countries 
reported both new hospitalizations and new ICU admissions. The d eath ratio figure is created from the data of the countries that reported both new hospitalization 
and new deaths.  
Source : WHO COVID -19 Detailed Surveillance Dashboard  
 
18 
 SARS -CoV-2 variants of interest  and variants under monitoring  
 
Geographic spread and prevalence  
Globally, during the 28 -day period from 11 December 2023 to 7 January 2024, 33 659 SARS -CoV-2 sequences were 
shared through GISAID. In comparison, in the two previous 28 -day periods, there were 112 909 and 192 222 
sequences shared, respectively. The data are periodically retrospectively updated to include sequences w ith earlier 
collection dates, so the number of submissions in a given time period may change.  
 
WHO is currently tracking several SARS -CoV-2 variants, including:  
•  Five variants of interest (VOIs): XBB.1.5, XBB.1.16, EG.5, BA.2.86 and JN.1  
•  Five variants under monitoring (V
.2.3.  
 
 
Table 7. Weekly prevalence of SARS -CoV-2 VOIs and VUMs by WHO regions, week 48 to week 52 of 2023  
 
 
 
* Includes descendant lineages, except those individually specified elsewhere in the table. For example, XBB* does not includ e 
XBB.1.5, XBB.1.16, EG.5, XBB.1.9.1, XBB.1.9.2, and XBB.2.3.  
¥ due to the small numbers of sequences submitted in these regions, it has not been possible to determine trends for the VOIs  and 
VUMs in these regions; this is also represented by the shaded cells in the table .   
 
 
20 
 Figure 9. Global 28 -day prevalence of EG.5, XBB.1.5, XBB.1.16, BA.2.86, and JN.1, from 11 December 2023 to 7 January  
2024  *+     
 
 
* Reporting period to account for delay in sequence submission to GISAID.  
+ Historical presence indicates countries previously reporting sequences of VOIs but have not reported within the period from 4 
November  to 3 December  2023  
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
21 
 Figure 10. The (A) number and (B) percentage of SARS -CoV-2 sequences, from 3 July to 31 December  2023  
 
 
 
 
 
 
   
Figure 10. Panel A  shows the number, and Panel B  the percentage, of all circulating variants since July to December 2023. The variants 
shown here include descendent lineages, except for the descendent lineage(s)
except for the descendent lineage(s) listed here. The Unassigned category includes lineages 
pending for a PANGO lineage name designation, whereas the Other category includes lineages that are assigned but not listed here. 
Source: SARS -CoV-2 sequence data and metadata from GISAID, from 3 July to 31 December 2023.  
 
 
 
 
A 
B  
22 
 Additional resources  
 
• Tracking SARS -CoV-2 Variants  
• WHO statement on updated tracking system on SARS -CoV-2 variants of concern and variants of interest   
• SARS -CoV-2 variant risk evaluation framework, 30 August 2023  
• WHO JN.1 Initial Risk Evaluation, 1 3 December 2023  
• WHO BA.2.86 Initial Risk Evaluation, 21 November 2023  
• WHO EG.5 Updated Risk Evaluation, 21 November 2023  
• WHO XBB.1.5 Updated Risk Assessment, 20 June 2023  
• WHO XBB.1.16 Updated Risk Assessment, 5 June 2023    
22 WHO  regional  overviews   
Data for 11 December  2023  to 7 January 2024  
 
African  Region 
The African Region reported over 3354 new cases, a 63% decrease as compared to the previous 28 -day period. Five (10%) of the 50 countries for which data are 
available reported increases in new cases of 20% or greater, with the highest proportional increases observed in Mauritania ( 30 vs six new cases; +400%), Senegal 
(17 vs six new cases ; +183%), Mali (two vs one new case; +100%), Burundi (78 vs 51 new cases; +53%), and Angola (

Answer the following question succinctly, solely using the above snippets of text.

During the 28-day period from 11 December 2023 to 7 January, which SARS-CoV-2 variants of interest (VOIs) were the WHO tracking? [/INST]

The WHO was tracking the following SARS-CoV-2 variants of interest (VOIs) during the 28-day period from 11 December 2023 to 7 January 2024:

• XBB.1.5
• XBB.1.16
• EG.5
• BA.2.86
• JN.1</s>
Correct Answer: XBB.1.5, XBB.1.16, EG.5, BA.2.86 and JN.1.



Scoring¶

Note that a decimal score indicates that the correct answer to a question was a list of several items and the model returned only some of them, earning partial credit for that question.

NF4 Quantization with BNB¶

Llama 2, 7B NF4 - RAW:

  • 0/10

Llama 2, 7B NF4 - with embeddings, 200-token chunks w/ 50% overlap:

  • 2/10

Llama 2, 13B NF4 - RAW:

  • 0/10

Llama 2, 13B NF4 - with llama embeddings, dot product similarity, 500-token chunks w/ 50% overlap, 3 samples:

  • 4/10

Llama 2, 13B NF4 - with openai embeddings, cosine similarity, 500-token chunks w/ 50% overlap, 3 samples:

  • 4.2/10

Llama 2, 13B NF4 - with openai embeddings, dot product similarity, 500-token chunks w/ 50% overlap, 3 samples:

  • 5/10

Llama 2, 13B NF4 - with marco embeddings, dot product similarity, 500-token chunks w/ 50% overlap, 3 samples:

  • 5/10
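The NF4 runs above compare cosine similarity with dot-product similarity. The sketch below shows how the two metrics can rank the same chunks differently when vectors are not unit length, and how they agree once the vectors are normalized. The toy 2-D vectors and the helper names (`dot`, `cosine`, `normalize`, `top_k`) are assumptions for illustration only; real embeddings have hundreds or thousands of dimensions.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the norms."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def top_k(query, chunk_vectors, k=3, similarity=dot):
    """Return the indices of the k chunks most similar to the query."""
    order = sorted(
        range(len(chunk_vectors)),
        key=lambda i: similarity(query, chunk_vectors[i]),
        reverse=True,
    )
    return order[:k]


# Toy embeddings (made up): chunk 0 is long but points away from the query.
chunk_vectors = [[3.0, 0.0], [1.0, 1.0], [0.0, 0.5], [0.8, 1.0]]
query = [1.0, 1.0]

by_dot = top_k(query, chunk_vectors, k=2, similarity=dot)
by_cosine = top_k(query, chunk_vectors, k=2, similarity=cosine)

# After normalizing the chunks, the dot product ranks them like cosine does.
unit_chunks = [normalize(v) for v in chunk_vectors]
by_dot_normalized = top_k(query, unit_chunks, k=2, similarity=dot)

print(by_dot, by_cosine, by_dot_normalized)
```

The dot product rewards vector length as well as direction, so a long but poorly aligned chunk can outrank a well-aligned one. When an embedding model already returns unit-length vectors, the choice between the two metrics should not change which chunks are retrieved.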

GPTQ Quantization¶

Llama 2, 7B GPTQ - RAW:

  • 0/10

Llama 2, 7B GPTQ - with llama embeddings, 200-token chunks w/ 50% overlap, top-3 similar embeddings:

  • 1.2/10

Llama 2, 7B GPTQ - with openai embeddings, 200-token chunks w/ 50% overlap, top-3 similar embeddings:

  • 3/10

Llama 2, 7B GPTQ - with marco embeddings, 200-token chunks w/ 50% overlap, top-3 similar embeddings:

  • 4/10

Llama 2, 13B GPTQ - RAW:

  • 0/10

Llama 2, 13B GPTQ - with llama embeddings, 200-token chunks w/ 50% overlap, top-3 similar embeddings:

  • 2.4/10

Llama 2, 13B GPTQ - with openai embeddings, 200-token chunks w/ 50% overlap, top-3 similar embeddings:

  • 4.2/10

Llama 2, 13B GPTQ - with marco embeddings, 200-token chunks w/ 50% overlap, top-3 similar embeddings:

  • 4/10

Llama 2, 13B GPTQ - with marco embeddings, 500-token chunks w/ 50% overlap, top-3 similar embeddings:

  • 5/10

Llama 2, 13B GPTQ - with marco embeddings, 500-token chunks w/ 10% overlap, top-3 similar embeddings:

  • 4.6/10
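To make the GPTQ results above easier to compare, the following sketch collects the reported scores into a list and averages them per embedding model for the 200-token, 50%-overlap runs. The scores are copied from the list above; the tuple layout and the helper name `mean_score_by_embedding` are assumptions for this example. With only ten questions per run, small differences should be treated as indicative rather than conclusive.

```python
from collections import defaultdict

# (model size, embedding model, chunk tokens, overlap %, score out of 10)
gptq_results = [
    ("7B", "llama", 200, 50, 1.2),
    ("7B", "openai", 200, 50, 3.0),
    ("7B", "marco", 200, 50, 4.0),
    ("13B", "llama", 200, 50, 2.4),
    ("13B", "openai", 200, 50, 4.2),
    ("13B", "marco", 200, 50, 4.0),
    ("13B", "marco", 500, 50, 5.0),
    ("13B", "marco", 500, 10, 4.6),
]


def mean_score_by_embedding(results, chunk_tokens, overlap_pct):
    """Average the scores per embedding model for one chunking configuration."""
    grouped = defaultdict(list)
    for _size, embedding, tokens, overlap, score in results:
        if tokens == chunk_tokens and overlap == overlap_pct:
            grouped[embedding].append(score)
    return {name: round(sum(s) / len(s), 2) for name, s in grouped.items()}


averages = mean_score_by_embedding(gptq_results, chunk_tokens=200, overlap_pct=50)
for name, avg in sorted(averages.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {avg}/10")
```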