# GAIA Benchmark
This scenario implements the [GAIA](https://arxiv.org/abs/2311.12983) agent benchmark. Before you begin, make sure you have followed the instructions in `../README.md` to prepare your environment.
## Setup Environment Variables for AgBench
Navigate to GAIA:

```sh
cd benchmarks/GAIA
```
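The exact variables depend on how your model is hosted. As a minimal sketch, assuming an OpenAI-hosted model and the standard OpenAI client convention, you would export an API key before running AgBench:

```sh
# Assumption: the model client reads the standard OPENAI_API_KEY variable.
export OPENAI_API_KEY="sk-..."   # replace with your actual key
```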
Update `config.yaml` to point to your model host, as appropriate. The default configuration points to `gpt-4o`.
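The shipped `config.yaml` defines the exact schema, so treat the following as a hypothetical sketch only; it assumes the component-style model configuration (`provider` plus `config`) used by AutoGen's extensions:

```yaml
# Hypothetical sketch -- mirror the structure of the shipped config.yaml.
model_config:
  provider: autogen_ext.models.openai.OpenAIChatCompletionClient
  config:
    model: gpt-4o   # the default model
    # base_url: http://localhost:8000/v1  # uncomment to point at a self-hosted OpenAI-compatible endpoint
```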
Now initialize the tasks:

```sh
python Scripts/init_tasks.py
```
Note: This will attempt to download GAIA from Hugging Face, which requires authentication.
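GAIA is a gated dataset, so you must be logged in to Hugging Face first. Assuming the `huggingface_hub` CLI is installed, a one-time login looks like:

```sh
# Prompts for a Hugging Face access token and caches it locally
huggingface-cli login
```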
The resulting folder structure should look like this:

```
.
./Downloads
./Downloads/GAIA
./Downloads/GAIA/2023
./Downloads/GAIA/2023/test
./Downloads/GAIA/2023/validation
./Scripts
./Templates
./Templates/TeamOne
```
Then run `Scripts/init_tasks.py` again.
Once the script completes, you should see a folder in your current directory called `Tasks` that contains one JSONL file per template in `Templates`.
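To sanity-check the generated tasks, you can inspect the first record of one of the JSONL files (the file name below is the one used in the next section):

```sh
# Each line of the JSONL file is one task instance
head -n 1 Tasks/gaia_validation_level_1__MagenticOne.jsonl
```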
## Running GAIA
To run a specific subset of GAIA, use:

```sh
agbench run Tasks/gaia_validation_level_1__MagenticOne.jsonl
```
You should see the command line print raw logs showing the agents in action. To see a summary of the results (e.g., task completion rates), run the following in a new terminal:

```sh
agbench tabulate Results/gaia_validation_level_1__MagenticOne/
```
## References
GAIA: a benchmark for General AI Assistants <br/>
Grégoire Mialon, Clémentine Fourrier, Craig Swift, Thomas Wolf, Yann LeCun, Thomas Scialom <br/>
https://arxiv.org/abs/2311.12983