LooGLE is a comprehensive evaluation benchmark for LLM long-context understanding. It contains up-to-date (all post-2022), extremely long, realistic documents (over 24k tokens per document, many exceeding 100k words) and 6,000 newly generated questions spanning diverse domains and categories. Detailed statistics of our dataset are shown in the table below.
Short and long dependency tasks: LooGLE is composed of 7 major tasks to evaluate LLMs' ability to understand both short and long dependency content. We refer to "long dependency" tasks as those that require understanding the inter-dependency across multiple pieces of evidence spread widely over the entire long text. We carefully designed 5 types of long dependency tasks: comprehension and reasoning, computation, timeline reorder, multiple information retrieval, and summarization.
Long context evaluation: To provide more comprehensive and general results, LooGLE relies on automatic metrics based on semantic similarity, GPT-4-as-judge, and human evaluation to obtain an overall performance reference. We evaluate 8 representative LLMs, specifically selecting models that have made notable efforts to address the challenge of long-context understanding through flash attention, position interpolation, optimized Transformer architectures and finetuning, external memory, etc.
LooGLE not only provides a systematic and comprehensive evaluation schema for long-context LLMs, but also sheds light on the future development of enhanced models towards "true long-context understanding".
Statistics of LooGLE
Table of Contents
- Statistics of LooGLE
- Table of Contents
- Capability leaderboard
- Quick Start
- Evaluation
- Main result on short and long dependency tasks
- Citation
- Contacts
Capability leaderboard
The overall performance comparisons of different models on different tasks in our dataset are shown in the figure below.
Quick Start
Step 1. Prerequisites
Clone this repo and install the dependencies. The test environment is under torch 2.0.1+cu121.
cd LooGLE
conda create -n loogle python=3.9
conda activate loogle
pip install -r requirements.txt
export OPENAI_API_KEY="[your_openai_api_key]"
Step 2. Download the data
You can download and load the LooGLE data through Hugging Face datasets (🤗 HF Repo):
from datasets import load_dataset
datasets = ["shortdep_qa", "shortdep_cloze", "longdep_qa", "longdep_summarization"]
for testset in datasets:
    data = load_dataset('bigainlco/LooGLE', testset, split='test')
    # evaluate your model
You can also access our sample data in LooGLE-testdata/.
All data in LooGLE are standardized to the following format:
{
    "input": "The original long input texts",
    "title": "The title of the given document",  // for arXiv papers, "title" is used as the unique ID of the paper
    "qa_pairs": [
        {
            "Q": "Question to ask based on the given input",
            "A": "Groundtruth answer for the question",
            "S": ["One or more pieces of evidence (complete sentences) for answering the question, extracted directly from the original input"]
        }
        // the list contains multiple questions and their corresponding answers (each in JSON format);
        // for non-QA/non-cloze tasks such as arXiv paper summarization, "qa_pairs" is "none"
    ],
    "output": "none"  // the predicted output of the LLM given the long input and instructions, initialized as "none"
}
Note that in the long dependency QA data, we add an extra key "type" to each question in the JSON to indicate which of the 4 long dependency task types it belongs to (apart from summarization).
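For reference only, here is a minimal sketch (not part of the repo) of how the standardized format can be consumed. It assumes the schema above; the helper names in the comments (build_prompt, your_model) are hypothetical placeholders, and qa_pairs is parsed defensively in case the loader returns the field as a string.

# Minimal sketch (not part of the repo) of iterating over the standardized format.
import ast
from datasets import load_dataset

data = load_dataset('bigainlco/LooGLE', 'shortdep_qa', split='test')

for sample in data:
    context = sample["input"]              # the original long document
    qa_pairs = sample["qa_pairs"]
    if isinstance(qa_pairs, str):          # defensive parse if stored as text
        qa_pairs = ast.literal_eval(qa_pairs)
    for qa in qa_pairs:
        question, answer, evidence = qa["Q"], qa["A"], qa["S"]
        # prompt = build_prompt(context, question)       # your own prompt template (hypothetical)
        # sample["output"] = your_model.generate(prompt)  # fill in the prediction here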
Step 3. Generate the prediction results
We test LLMs using three Python scripts under Prediction/, one for each type of model. Select the model to evaluate via --model_name and the specific task via --task. Let's take short dependency QA as an example:
For GPT-3.5-turbo and GPT4:
python Prediction/pred_gpt_models.py --model_name gpt-3.5-turbo-16k --task shortdep_qa --max_length 500
For LlamaIndex:
python Prediction/pred_llamaindex.py --task shortdep_qa --max_length 500
For other open-source models (take chatglm2-6b-32k as an example):
python Prediction/pred_opensource_models.py --model_name chatglm2-6b-32k --task shortdep_qa --max_length 500
Open-source models are downloaded to and loaded from Models/ by default; you can change the path via --model_path. You can also set where the generated outputs for the long texts are written via --output_path.
Please note that config/ provides the prompt format suitable for each task as well as the maximum generation length. The input parameter --max_length limits the maximum length of the input prompt for the selected model. Feel free to modify them to better suit the model you want to evaluate.
We test all the open-source baselines on a single 80G A800 GPU in BF16 precision. For Llama-2 based models, we recommend using Flash Attention for optimization and to save GPU memory.
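Purely as an illustration (the actual loading and truncation logic lives in the Prediction/ scripts and may differ), loading a baseline such as chatglm2-6b-32k in BF16 and enforcing an input-length limit might look like the sketch below; the middle-truncation heuristic and the truncate_middle helper are assumptions, not the repo's exact strategy.

# Illustrative sketch only: load an open-source baseline in BF16 on one GPU and
# cap the input prompt length. The repo's Prediction/ scripts may differ.
import torch
from transformers import AutoModel, AutoTokenizer

model_path = "THUDM/chatglm2-6b-32k"   # or a local copy under Models/
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_path, torch_dtype=torch.bfloat16, trust_remote_code=True
).eval().cuda()                         # assumes a single CUDA device is available

def truncate_middle(prompt: str, max_length: int) -> str:
    """Keep the head and tail of the prompt when it exceeds max_length tokens
    (a common heuristic for long-context evaluation; assumed here, not prescribed)."""
    ids = tokenizer.encode(prompt)
    if len(ids) <= max_length:
        return prompt
    half = max_length // 2
    return tokenizer.decode(ids[:half]) + tokenizer.decode(ids[-half:])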
Prediction for retrieval based methods
To evaluate the effectiveness of retrieval techniques for long dependency questions, we conducted extensive experiments by replacing the base LLM in LlamaIndex with different baseline LLMs.
For retrieval based methods (take chatglm2-6b-32k as an example):
python Retrieval/pred_retrieval_based_method.py --model_name chatglm2-6b-32k --task shortdep_qa --max_length 500 --emb_model_name sentence-transformers/all-mpnet-base-v2
Use --emb_model_name to set the embedding model for retrieval-based methods. We use all-mpnet-base-v2 as the default.
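For intuition, the sketch below shows a bare-bones version of embedding-based retrieval with the default all-mpnet-base-v2 encoder; the repo's Retrieval/ script builds on LlamaIndex instead, so the chunking, top_k value, and function name here are illustrative assumptions.

# Bare-bones embedding retrieval for illustration; the repo's Retrieval/ script
# uses LlamaIndex, so treat the chunking and top_k choices here as assumptions.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")

def retrieve(question: str, chunks: list[str], top_k: int = 3) -> list[str]:
    """Return the top_k document chunks most similar to the question."""
    chunk_emb = encoder.encode(chunks, convert_to_tensor=True)
    q_emb = encoder.encode(question, convert_to_tensor=True)
    scores = util.cos_sim(q_emb, chunk_emb)[0]
    best = scores.topk(min(top_k, len(chunks))).indices.tolist()
    return [chunks[i] for i in best]

# The retrieved chunks are then concatenated into the prompt for the base LLM to answer from.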
Evaluation
Given the prediction files generated in Step 3, we run the evaluation code in Evaluation/.
For automatic evaluation on the short and long dependency QA and summarization tasks (e.g., short dependency QA):
python Evaluation/automatic_eval.py --model_name chatglm2-6b-32k --task shortdep_qa --eval_metric automatic_sim
For automatic evaluation in cloze task:
python Evaluation/automatic_eval.py --model_name chatglm2-6b-32k --task shortdep_cloze --eval_metric automatic_match
For LLM-as-judge on the short and long dependency QA and summarization tasks (e.g., short dependency QA):
python Evaluation/llm_eval.py --model_name chatglm2-6b-32k --task shortdep_qa
Besides the parameters specifying --model_name and --task, we provide --eval_metric so that users can choose the automatic evaluation method from [automatic_sim, automatic_match].
Automatic metrics based on semantic similarity, including BLEU, ROUGE, METEOR, BERTScore, and exact/partial match, are supported. Feel free to add other metrics for your needs in Evaluation/automatic_metrics.py. In addition, the GPT-4 prompt given in the repo can be altered for further evaluation.
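As a rough illustration of what these metrics look like in code (the reference implementation is Evaluation/automatic_metrics.py and may differ in tokenization and averaging), here is a sketch using the nltk, rouge-score, and bert-score packages:

# Rough sketch of the similarity metrics; the reference implementation is
# Evaluation/automatic_metrics.py and may differ in tokenization/averaging.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer
from bert_score import score as bert_score

def similarity_metrics(prediction: str, reference: str) -> dict:
    smooth = SmoothingFunction().method1
    ref_tokens, pred_tokens = reference.split(), prediction.split()
    rouge = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
    r = rouge.score(reference, prediction)
    _, _, f1 = bert_score([prediction], [reference], lang="en")
    return {
        "bleu1": sentence_bleu([ref_tokens], pred_tokens, weights=(1, 0, 0, 0),
                               smoothing_function=smooth),
        "bleu4": sentence_bleu([ref_tokens], pred_tokens,
                               weights=(0.25, 0.25, 0.25, 0.25),
                               smoothing_function=smooth),
        "rouge1": r["rouge1"].fmeasure,
        "rougeL": r["rougeL"].fmeasure,
        "bert_score": f1.mean().item(),
    }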
Evaluation on Timeline reorder task
We provide four metrics: LSD (location square deviation), LMD (location mean deviation), SD (swap deviation), and SDD (swap distance deviation) to measure the similarity of numeric sequences for the timeline reorder task with regularized outputs. Details of the implementations can be found in our paper.
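To give a feel for the location-based metrics, the sketch below shows one plausible reading of LMD and LSD on regularized numeric sequences; it is an illustration only, and the exact formulas (as well as the swap-based SD and SDD) are those defined in the paper and implemented in Reorder/.

# One plausible reading of the location deviation metrics, for illustration only;
# the exact definitions of LSD/LMD (and the swap-based SD/SDD) follow the paper
# and the implementation in Reorder/.
def location_deviations(pred_order: list[int], gold_order: list[int]) -> tuple[float, float]:
    """Compare where each event lands in the predicted vs. the gold ordering."""
    n = len(gold_order)
    pred_pos = {event: i for i, event in enumerate(pred_order)}
    gold_pos = {event: i for i, event in enumerate(gold_order)}
    diffs = [abs(pred_pos[e] - gold_pos[e]) for e in gold_pos if e in pred_pos]
    lmd = sum(diffs) / n                  # location mean deviation
    lsd = sum(d * d for d in diffs) / n   # location square deviation
    return lmd, lsd

# Example: gold order [1, 2, 3, 4], prediction [2, 1, 3, 4] -> LMD = 0.5, LSD = 0.5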
For LLMs on the long dependency timeline reorder task:
python Reorder/automatic_eval.py --model_name chatglm2-6b-32k
Main result on short and long dependency tasks
Performance of the short dependency tasks
Bleu1 through GPT4 score are reported on short dependency QA; Exact Match and Partial Match are reported on cloze.
Models | Context | Bleu1 | Bleu4 | Rouge1 | Rouge4 | RougeL | Meteor score | Bert score | GPT4 score | Exact Match | Partial Match |
---|---|---|---|---|---|---|---|---|---|---|---|
GPT4-32k | 32k | 24.61 | 11.14 | 61.80 | 50.73 | 60.75 | 32.94 | 78.72 | 71.52 | 70.50 | 80.81 |
GPT4-8k | 8k | 27.35 | 14.38 | 67.59 | 56.01 | 65.77 | 38.56 | 87.93 | 53.99 | 66.03 | 76.62 |
GPT3.5-turbo-16k | 16k | 22.67 | 9.62 | 62.56 | 48.63 | 60.66 | 32.58 | 87.04 | 66.82 | 54.64 | 63.42 |
LlamaIndex | - | 33.37 | 21.43 | 58.82 | 42.93 | 57.08 | 37.17 | 86.58 | 59.61 | 58.95 | 66.86 |
ChatGLM2-6B | 32k | 14.29 | 6.07 | 20.50 | 13.16 | 20.36 | 13.08 | 87.28 | 23.65 | 0.05 | 0.98 |
LongLLaMa-3B | 256k | 1.37 | 0.26 | 26.97 | 11.02 | 26.10 | 11.34 | 71.65 | 13.75 | - | 2.13 |
RWKV-4-14B-pile | 8k | 0.80 | 0.04 | 21.70 | 6.39 | 20.64 | 9.41 | 70.42 | 8.93 | - | - |
LLaMA2-7B-32K | 32k | 0.18 | 7.25e-308 | 1.86 | 0.00 | 1.86 | 1.52 | 61.53 | 3.18 | - | 0.58 |
Performance of the long dependency tasks
Models | Context | Bleu1 | Bleu4 | Rouge1 | Rouge4 | RougeL | Meteor score | Bert score | GPT4 score |
---|---|---|---|---|---|---|---|---|---|
arXiv paper summarization | | | | | | | | | |
GPT4-32k | 32k | 24.50 | 0.73 | 27.15 | 7.10 | 24.25 | 19.03 | 84.04 | 82.84 |
GPT4-8k | 8k | 29.02 | 2.09 | 32.08 | 11.11 | 28.85 | 22.64 | 84.92 | 85.42 |
GPT3.5-turbo-16k | 16k | 28.70 | 1.59 | 32.04 | 10.69 | 28.89 | 22.34 | 84.82 | 86.84 |
LlamaIndex | - | 22.53 | 0.63 | 26.28 | 6.97 | 23.73 | 21.07 | 83.09 | 76.35 |
ChatGLM2-6B | 32k | 0.04 | 1.60e-310 | 5.97 | 8.43e-05 | 5.82 | 6.40 | 73.25 | 13.23 |
LongLLaMa-3B | 256k | 4.24 | 9.32e-309 | 4.10 | 0.52 | 3.86 | 3.82 | 73.41 | 12.28 |
RWKV-4-14B-pile | 8k | 6.28 | 4.58e-05 | 6.45 | 0.74 | 6.01 | 6.00 | 75.28 | 7.02 |
LLaMA2-7B-32K | 32k | 0.03 | 4.66e-310 | 0.12 | 0.00 | 0.12 | 0.67 | 71.21 | 7.60 |
Long dependency QA | | | | | | | | | |
GPT4-32k | 32k | 8.55 | 1.40 | 25.59 | 6.36 | 24.04 | 11.13 | 80.16 | 54.09 |
GPT4-8k | 8k | 8.94 | 1.01 | 23.45 | 6.57 | 21.69 | 10.18 | 85.36 | 42.12 |
GPT3.5-turbo-16k | 16k | 6.92 | 1.81 | 25.02 | 6.68 | 23.63 | 10.40 | 83.79 | 45.04 |
LlamaIndex | - | 7.76 | 1.24 | 23.62 | 7.10 | 22.30 | 10.47 | 83.87 | 37.63 |
ChatGLM2-6B | 32k | 5.55 | 0.11 | 9.41 | 1.93 | 8.69 | 4.39 | 85.78 | 11.50 |
LongLLaMa-3B | 256k | 1.04 | 3.12e-307 | 2.96 | 0.03 | 2.71 | 1.66 | 78.60 | 6.48 |
RWKV-4-14B-pile | 8k | 0.71 | 9.52e-307 | 18.54 | 1.55 | 17.69 | 3.45 | 71.36 | 5.33 |
LLaMA2-7B-32K | 32k | 0.08 | 2.44e-308 | 2.05 | 0.00 | 2.05 | 0.46 | 50.28 | 4.18 |
Impact of input length on long dependency tasks
Models | Context | Bleu1 | Bleu4 | Rouge1 | Rouge4 | RougeL | Meteor score | Bert score | GPT4 score |
---|---|---|---|---|---|---|---|---|---|
arXiv paper summarization | | | | | | | | | |
GPT4-32k | 32k | 24.50 | 0.73 | 27.15 | 7.10 | 24.25 | 19.03 | 84.04 | 82.84 |
GPT4-32k | 24k | 25.57 | 0.81 | 27.61 | 7.53 | 24.73 | 19.86 | 84.07 | 83.15 |
GPT4-32k | 16k | 24.80 | 0.70 | 27.29 | 7.26 | 24.28 | 19.12 | 84.11 | 82.82 |
GPT4-32k | 8k | 26.26 | 9.35 | 27.83 | 7.67 | 24.74 | 20.08 | 84.10 | 82.75 |
GPT4-8k | 8k | 29.02 | 2.09 | 32.08 | 11.11 | 28.85 | 22.64 | 84.92 | 85.42 |
Long dependency QA | | | | | | | | | |
GPT4-32k | 32k | 7.64 | 1.24 | 15.53 | 4.46 | 14.60 | 11.12 | 86.07 | 54.65 |
GPT4-32k | 24k | 8.23 | 1.66 | 14.92 | 4.12 | 13.90 | 10.60 | 86.16 | 50.61 |
GPT4-32k | 16k | 8.57 | 1.35 | 16.21 | 4.30 | 14.90 | 11.91 | 86.36 | 47.55 |
GPT4-32k | 8k | 7.46 | 1.77 | 13.75 | 5.08 | 12.89 | 10.01 | 85.77 | 38.34 |
GPT4-8k | 8k | 8.94 | 1.01 | 23.45 | 6.57 | 21.69 | 10.18 | 85.36 | 42.12 |
Citation
If you would like to use our data or find our work interesting, please cite:
@article{li2023loogle,
  title={LooGLE: Can Long-Context Language Models Understand Long Contexts?},
  author={Li, Jiaqi and Wang, Mengmeng and Zheng, Zilong and Zhang, Muhan},
  journal={arXiv preprint arXiv:2311.04939},
  year={2023}
}
Contacts
We sincerely appreciate the human annotators for their valuable contributions to creating high-quality long dependency QA tasks. We are happy to answer any questions about LooGLE: nlp@bigai.ai