Rename to use BaseChatMessage and BaseAgentEvent. Bring back union types. (#6144)
Rename the `ChatMessage` and `AgentEvent` base classes to `BaseChatMessage` and `BaseAgentEvent`. 

Bring back `ChatMessage` and `AgentEvent` as unions of built-in concrete types to avoid breaking existing applications that depend on Pydantic serialization.

Why?

Many existing code bases use containers like this:

```python
from pydantic import BaseModel

from autogen_agentchat.messages import ChatMessage


class AppMessage(BaseModel):
    name: str
    message: ChatMessage


# Serialize like this:
m = AppMessage(...)
m.model_dump_json()

# Fields like HandoffMessage.target will be lost because the message is treated as a
# base class without content or target fields.
```

The assumption that `ChatMessage` or `AgentEvent` is a union of concrete types is likely baked into many existing code bases. So this PR brings back the union types, while keeping method type hints, such as those on `on_messages`, on the `BaseChatMessage` and `BaseAgentEvent` base classes for flexibility.
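To make the intended split concrete, here is a minimal sketch (not taken from the PR itself) of how the two names can be used together, assuming these types are exported by `autogen_agentchat.messages`: the `ChatMessage` union for Pydantic fields, so concrete fields survive serialization, and the `BaseChatMessage` base class for method signatures.

```python
# A sketch of the intended usage, assuming these names are exported by
# autogen_agentchat.messages after this change.
from pydantic import BaseModel

from autogen_agentchat.messages import BaseChatMessage, ChatMessage, HandoffMessage


class AppMessage(BaseModel):
    name: str
    message: ChatMessage  # union of concrete types: concrete fields survive serialization


def handle(message: BaseChatMessage) -> None:
    # Signatures use the base class, so custom message subclasses are also accepted.
    print(type(message).__name__)


m = AppMessage(
    name="example",
    message=HandoffMessage(source="agent_a", target="agent_b", content="take over"),
)
print(m.model_dump_json())  # includes the HandoffMessage.target field
handle(m.message)
```

The same pattern applies to `AgentEvent` and `BaseAgentEvent`.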

GAIA Benchmark

This scenario implements the GAIA agent benchmark. Before you begin, make sure you have followed the instructions in ../README.md to prepare your environment.

Setup Environment Variables for AgBench

Navigate to GAIA

cd benchmarks/GAIA

Update config.yaml to point to your model host, as appropriate. The default configuration points to 'gpt-4o'.

Now initialize the tasks.

python Scripts/init_tasks.py

Note: This will attempt to download GAIA from Hugging Face, which requires authentication (see the sketch below).
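If the download fails for lack of credentials, one way to authenticate is to log in to Hugging Face before rerunning the script. A minimal sketch, assuming the `huggingface_hub` package is installed and your account has an access token with permission to the GAIA dataset:

```python
# Hypothetical authentication step; assumes huggingface_hub is installed and that
# your account has access to the GAIA dataset.
from huggingface_hub import login

# Prompts for a Hugging Face access token and caches it locally so that the
# dataset download in init_tasks.py can authenticate.
login()
```

Running `huggingface-cli login` from the shell achieves the same thing.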

The resulting folder structure should look like this:

.
./Downloads
./Downloads/GAIA
./Downloads/GAIA/2023
./Downloads/GAIA/2023/test
./Downloads/GAIA/2023/validation
./Scripts
./Templates
./Templates/TeamOne

Then run Scripts/init_tasks.py again.

Once the script completes, you should now see a folder in your current directory called Tasks that contains one JSONL file per template in Templates.
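To sanity-check the generated tasks, the sketch below counts the records in one JSONL file and lists the fields of the first record without assuming a particular schema. The file name is just an example; actual names depend on the templates present.

```python
import json
from pathlib import Path

# Example file name; substitute one of the files actually generated in Tasks/.
task_file = Path("Tasks/gaia_validation_level_1__MagenticOne.jsonl")

records = [json.loads(line) for line in task_file.read_text().splitlines() if line.strip()]
print(f"{task_file.name}: {len(records)} tasks")
print("Fields of the first task:", sorted(records[0].keys()))
```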

Running GAIA

Now, to run a specific subset of GAIA, use:

agbench run Tasks/gaia_validation_level_1__MagenticOne.jsonl

You should see the command line print raw logs showing the agents in action. To see a summary of the results (e.g., task completion rates), run the following in a new terminal:

agbench tabulate Results/gaia_validation_level_1__MagenticOne/
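If you want to inspect the raw outputs rather than the tabulated summary, the run artifacts are written under Results/. The exact layout is produced by agbench, so the sketch below simply walks the directory without assuming a particular structure:

```python
from pathlib import Path

# List every file agbench wrote for this task set, relative to the results folder.
results_dir = Path("Results/gaia_validation_level_1__MagenticOne")
for path in sorted(results_dir.rglob("*")):
    if path.is_file():
        print(path.relative_to(results_dir))
```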

References

GAIA: a benchmark for General AI Assistants. Grégoire Mialon, Clémentine Fourrier, Craig Swift, Thomas Wolf, Yann LeCun, Thomas Scialom. https://arxiv.org/abs/2311.12983