autogen/notebook/research/math_level5counting.ipynb

785 lines
23 KiB
Plaintext
Raw Permalink Blame History

This file contains invisible Unicode characters

This file contains invisible Unicode characters that are indistinguishable to humans but may be processed differently by a computer. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Copyright (c) Microsoft Corporation. All rights reserved. \n",
"\n",
"Licensed under the MIT License.\n",
"\n",
"# Math Study\n",
"\n",
"In this notebook, we study GPT-4 for math problem solving. We use [the MATH benchmark](https://crfm.stanford.edu/helm/latest/?group=math_chain_of_thought) for measuring mathematical problem solving on competition math problems with chain-of-thoughts style reasoning. \n",
"\n",
"## Requirements\n",
"\n",
"AutoGen requires `Python>=3.8`. To run this notebook example, please install:\n",
"```bash\n",
"pip install flaml[openai]==1.2.2\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"execution": {
"iopub.execute_input": "2023-02-13T23:40:52.317406Z",
"iopub.status.busy": "2023-02-13T23:40:52.316561Z",
"iopub.status.idle": "2023-02-13T23:40:52.321193Z",
"shell.execute_reply": "2023-02-13T23:40:52.320628Z"
}
},
"outputs": [],
"source": [
"# %pip install flaml[openai]==1.2.2 datasets"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Set your OpenAI key:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"execution": {
"iopub.execute_input": "2023-02-13T23:40:52.324240Z",
"iopub.status.busy": "2023-02-13T23:40:52.323783Z",
"iopub.status.idle": "2023-02-13T23:40:52.330570Z",
"shell.execute_reply": "2023-02-13T23:40:52.329750Z"
}
},
"outputs": [],
"source": [
"import os\n",
"\n",
"if \"OPENAI_API_KEY\" not in os.environ:\n",
" os.environ[\"OPENAI_API_KEY\"] = \"<your OpenAI API key here>\""
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Uncomment the following to use Azure OpenAI:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"execution": {
"iopub.execute_input": "2023-02-13T23:40:52.333547Z",
"iopub.status.busy": "2023-02-13T23:40:52.333249Z",
"iopub.status.idle": "2023-02-13T23:40:52.336508Z",
"shell.execute_reply": "2023-02-13T23:40:52.335858Z"
}
},
"outputs": [],
"source": [
"# import openai\n",
"# openai.api_type = \"azure\"\n",
"# openai.api_base = \"https://<your_endpoint>.openai.azure.com/\"\n",
"# openai.api_version = \"2023-03-15-preview\""
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Load dataset\n",
"\n",
"First, we load the competition_math dataset. We use a random sample of 50 examples for testing."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"execution": {
"iopub.execute_input": "2023-02-13T23:40:52.339977Z",
"iopub.status.busy": "2023-02-13T23:40:52.339556Z",
"iopub.status.idle": "2023-02-13T23:40:54.603349Z",
"shell.execute_reply": "2023-02-13T23:40:54.602630Z"
}
},
"outputs": [],
"source": [
"import datasets\n",
"\n",
"seed = 41\n",
"data = datasets.load_dataset(\"competition_math\")\n",
"train_data = data[\"train\"].shuffle(seed=seed)\n",
"test_data = data[\"test\"].shuffle(seed=seed)\n",
"n_tune_data = 20\n",
"tune_data = [\n",
" {\n",
" \"problem\": train_data[x][\"problem\"],\n",
" \"solution\": train_data[x][\"solution\"],\n",
" }\n",
" for x in range(len(train_data)) if train_data[x][\"level\"] == \"Level 5\" and train_data[x][\"type\"] == \"Counting & Probability\"\n",
"][:n_tune_data]\n",
"test_data = [\n",
" {\n",
" \"problem\": test_data[x][\"problem\"],\n",
" \"solution\": test_data[x][\"solution\"],\n",
" }\n",
" for x in range(len(test_data)) if test_data[x][\"level\"] == \"Level 5\" and test_data[x][\"type\"] == \"Counting & Probability\"\n",
"]\n",
"print(len(tune_data), len(test_data))\n"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Check a tuning example:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"execution": {
"iopub.execute_input": "2023-02-13T23:40:54.607152Z",
"iopub.status.busy": "2023-02-13T23:40:54.606441Z",
"iopub.status.idle": "2023-02-13T23:40:54.610504Z",
"shell.execute_reply": "2023-02-13T23:40:54.609759Z"
},
"slideshow": {
"slide_type": "subslide"
},
"tags": []
},
"outputs": [],
"source": [
"print(tune_data[1][\"problem\"])"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Here is one example of the canonical solution:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"execution": {
"iopub.execute_input": "2023-02-13T23:40:54.613590Z",
"iopub.status.busy": "2023-02-13T23:40:54.613168Z",
"iopub.status.idle": "2023-02-13T23:40:54.616873Z",
"shell.execute_reply": "2023-02-13T23:40:54.616193Z"
}
},
"outputs": [],
"source": [
"print(tune_data[1][\"solution\"])"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Import Success Metric\n",
"\n",
"For each math task, we use voting to select a response with the most common answers out of all the generated responses. If it has an equivalent answer to the canonical solution, we consider the task as successfully solved. Then we can optimize the mean success rate of a collection of tasks."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"execution": {
"iopub.execute_input": "2023-02-13T23:40:54.626998Z",
"iopub.status.busy": "2023-02-13T23:40:54.626593Z",
"iopub.status.idle": "2023-02-13T23:40:54.631383Z",
"shell.execute_reply": "2023-02-13T23:40:54.630770Z"
}
},
"outputs": [],
"source": [
"from flaml.autogen.math_utils import eval_math_responses"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Import the oai subpackage from flaml.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"execution": {
"iopub.execute_input": "2023-02-13T23:40:54.634335Z",
"iopub.status.busy": "2023-02-13T23:40:54.633929Z",
"iopub.status.idle": "2023-02-13T23:40:56.105700Z",
"shell.execute_reply": "2023-02-13T23:40:56.105085Z"
},
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"from flaml.autogen import oai"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"For (local) reproducibility and cost efficiency, we cache responses from OpenAI."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"execution": {
"iopub.execute_input": "2023-02-13T23:40:56.109177Z",
"iopub.status.busy": "2023-02-13T23:40:56.108624Z",
"iopub.status.idle": "2023-02-13T23:40:56.112651Z",
"shell.execute_reply": "2023-02-13T23:40:56.112076Z"
},
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"oai.ChatCompletion.set_cache(seed)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"This will create a disk cache in \".cache/{seed}\". You can change `cache_path` in `set_cache()`. The cache for different seeds are stored separately."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"execution": {
"iopub.execute_input": "2023-02-13T23:40:56.115383Z",
"iopub.status.busy": "2023-02-13T23:40:56.114975Z",
"iopub.status.idle": "2023-02-13T23:41:55.045654Z",
"shell.execute_reply": "2023-02-13T23:41:55.044973Z"
}
},
"outputs": [],
"source": [
"prompt = \"{problem} Solve the problem carefully. Simplify your answer as much as possible. Put the final answer in \\\\boxed{{}}.\""
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"### Evaluate the success rate on the test data\n",
"\n",
"You can use `oai.ChatCompletion.test` to evaluate the performance of an entire dataset with a config."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import logging\n",
"\n",
"config_n1 = {\"model\": 'gpt-4', \"prompt\": prompt, \"max_tokens\": 600, \"n\": 1}\n",
"n1_result = oai.ChatCompletion.test(test_data[:50], eval_math_responses, **config_n1)\n",
"print(n1_result)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"oai.ChatCompletion.request_timeout = 120\n",
"config_n10 = {\"model\": 'gpt-4', \"prompt\": prompt, \"max_tokens\": 600, \"n\": 10}\n",
"n10_result = oai.ChatCompletion.test(test_data[:50], eval_math_responses, logging_level=logging.INFO, **config_n10)\n",
"print(n10_result)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"config_n30 = {\"model\": 'gpt-4', \"prompt\": prompt, \"max_tokens\": 600, \"n\": 30}\n",
"n30_result = oai.ChatCompletion.test(test_data[:50], eval_math_responses, logging_level=logging.INFO, **config_n30)\n",
"print(n30_result)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from collections import defaultdict\n",
"import matplotlib.pyplot as plt\n",
"\n",
"prompts = [\"{problem} Solve the problem carefully. Simplify your answer as much as possible. Put the final answer in \\\\boxed{{}}.\"]\n",
"markers = [\"o\", \"s\", \"D\", \"v\", \"p\", \"h\", \"d\", \"P\", \"X\", \"H\", \"8\", \"4\", \"3\", \"2\", \"1\", \"x\", \"+\", \">\", \"<\", \"^\", \"v\", \"1\", \"2\", \"3\", \"4\", \"8\", \"s\", \"p\", \"*\", \"h\", \"H\", \"d\", \"D\", \"|\", \"_\"]\n",
"for j, n in enumerate([10, 30]):\n",
" config = {\"model\": 'gpt-4', \"prompt\": prompts[0], \"max_tokens\": 600, \"n\": n}\n",
" metrics = []\n",
" x, y = [], []\n",
" votes_success = defaultdict(lambda: [0, 0])\n",
" for i, data_i in enumerate(test_data[:50]):\n",
" response = oai.ChatCompletion.create(context=data_i, allow_format_str_template=True, **config)\n",
" responses = oai.ChatCompletion.extract_text(response)\n",
" metrics.append(eval_math_responses(responses, **data_i))\n",
" votes = metrics[-1][\"votes\"]\n",
" success = metrics[-1][\"success_vote\"]\n",
" votes_success[votes][0] += 1\n",
" votes_success[votes][1] += success\n",
" for votes in votes_success:\n",
" x.append(votes)\n",
" y.append(votes_success[votes][1] / votes_success[votes][0])\n",
"\n",
" plt.scatter(x, y, marker=markers[j])\n",
" plt.xlabel(\"top vote\")\n",
" plt.ylabel(\"success rate\")\n",
"plt.legend([\"n=10\", \"n=30\"])"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.16"
},
"vscode": {
"interpreter": {
"hash": "949777d72b0d2535278d3dc13498b2535136f6dfe0678499012e853ee9abcab1"
}
},
"widgets": {
"application/vnd.jupyter.widget-state+json": {
"state": {
"2d910cfd2d2a4fc49fc30fbbdc5576a7": {
"model_module": "@jupyter-widgets/base",
"model_module_version": "2.0.0",
"model_name": "LayoutModel",
"state": {
"_model_module": "@jupyter-widgets/base",
"_model_module_version": "2.0.0",
"_model_name": "LayoutModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "2.0.0",
"_view_name": "LayoutView",
"align_content": null,
"align_items": null,
"align_self": null,
"border_bottom": null,
"border_left": null,
"border_right": null,
"border_top": null,
"bottom": null,
"display": null,
"flex": null,
"flex_flow": null,
"grid_area": null,
"grid_auto_columns": null,
"grid_auto_flow": null,
"grid_auto_rows": null,
"grid_column": null,
"grid_gap": null,
"grid_row": null,
"grid_template_areas": null,
"grid_template_columns": null,
"grid_template_rows": null,
"height": null,
"justify_content": null,
"justify_items": null,
"left": null,
"margin": null,
"max_height": null,
"max_width": null,
"min_height": null,
"min_width": null,
"object_fit": null,
"object_position": null,
"order": null,
"overflow": null,
"padding": null,
"right": null,
"top": null,
"visibility": null,
"width": null
}
},
"454146d0f7224f038689031002906e6f": {
"model_module": "@jupyter-widgets/controls",
"model_module_version": "2.0.0",
"model_name": "HBoxModel",
"state": {
"_dom_classes": [],
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "2.0.0",
"_model_name": "HBoxModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/controls",
"_view_module_version": "2.0.0",
"_view_name": "HBoxView",
"box_style": "",
"children": [
"IPY_MODEL_e4ae2b6f5a974fd4bafb6abb9d12ff26",
"IPY_MODEL_577e1e3cc4db4942b0883577b3b52755",
"IPY_MODEL_b40bdfb1ac1d4cffb7cefcb870c64d45"
],
"layout": "IPY_MODEL_dc83c7bff2f241309537a8119dfc7555",
"tabbable": null,
"tooltip": null
}
},
"577e1e3cc4db4942b0883577b3b52755": {
"model_module": "@jupyter-widgets/controls",
"model_module_version": "2.0.0",
"model_name": "FloatProgressModel",
"state": {
"_dom_classes": [],
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "2.0.0",
"_model_name": "FloatProgressModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/controls",
"_view_module_version": "2.0.0",
"_view_name": "ProgressView",
"bar_style": "success",
"description": "",
"description_allow_html": false,
"layout": "IPY_MODEL_2d910cfd2d2a4fc49fc30fbbdc5576a7",
"max": 1,
"min": 0,
"orientation": "horizontal",
"style": "IPY_MODEL_74a6ba0c3cbc4051be0a83e152fe1e62",
"tabbable": null,
"tooltip": null,
"value": 1
}
},
"6086462a12d54bafa59d3c4566f06cb2": {
"model_module": "@jupyter-widgets/base",
"model_module_version": "2.0.0",
"model_name": "LayoutModel",
"state": {
"_model_module": "@jupyter-widgets/base",
"_model_module_version": "2.0.0",
"_model_name": "LayoutModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "2.0.0",
"_view_name": "LayoutView",
"align_content": null,
"align_items": null,
"align_self": null,
"border_bottom": null,
"border_left": null,
"border_right": null,
"border_top": null,
"bottom": null,
"display": null,
"flex": null,
"flex_flow": null,
"grid_area": null,
"grid_auto_columns": null,
"grid_auto_flow": null,
"grid_auto_rows": null,
"grid_column": null,
"grid_gap": null,
"grid_row": null,
"grid_template_areas": null,
"grid_template_columns": null,
"grid_template_rows": null,
"height": null,
"justify_content": null,
"justify_items": null,
"left": null,
"margin": null,
"max_height": null,
"max_width": null,
"min_height": null,
"min_width": null,
"object_fit": null,
"object_position": null,
"order": null,
"overflow": null,
"padding": null,
"right": null,
"top": null,
"visibility": null,
"width": null
}
},
"74a6ba0c3cbc4051be0a83e152fe1e62": {
"model_module": "@jupyter-widgets/controls",
"model_module_version": "2.0.0",
"model_name": "ProgressStyleModel",
"state": {
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "2.0.0",
"_model_name": "ProgressStyleModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "2.0.0",
"_view_name": "StyleView",
"bar_color": null,
"description_width": ""
}
},
"7d3f3d9e15894d05a4d188ff4f466554": {
"model_module": "@jupyter-widgets/controls",
"model_module_version": "2.0.0",
"model_name": "HTMLStyleModel",
"state": {
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "2.0.0",
"_model_name": "HTMLStyleModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "2.0.0",
"_view_name": "StyleView",
"background": null,
"description_width": "",
"font_size": null,
"text_color": null
}
},
"b40bdfb1ac1d4cffb7cefcb870c64d45": {
"model_module": "@jupyter-widgets/controls",
"model_module_version": "2.0.0",
"model_name": "HTMLModel",
"state": {
"_dom_classes": [],
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "2.0.0",
"_model_name": "HTMLModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/controls",
"_view_module_version": "2.0.0",
"_view_name": "HTMLView",
"description": "",
"description_allow_html": false,
"layout": "IPY_MODEL_f1355871cc6f4dd4b50d9df5af20e5c8",
"placeholder": "",
"style": "IPY_MODEL_ca245376fd9f4354af6b2befe4af4466",
"tabbable": null,
"tooltip": null,
"value": " 1/1 [00:00&lt;00:00, 44.69it/s]"
}
},
"ca245376fd9f4354af6b2befe4af4466": {
"model_module": "@jupyter-widgets/controls",
"model_module_version": "2.0.0",
"model_name": "HTMLStyleModel",
"state": {
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "2.0.0",
"_model_name": "HTMLStyleModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "2.0.0",
"_view_name": "StyleView",
"background": null,
"description_width": "",
"font_size": null,
"text_color": null
}
},
"dc83c7bff2f241309537a8119dfc7555": {
"model_module": "@jupyter-widgets/base",
"model_module_version": "2.0.0",
"model_name": "LayoutModel",
"state": {
"_model_module": "@jupyter-widgets/base",
"_model_module_version": "2.0.0",
"_model_name": "LayoutModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "2.0.0",
"_view_name": "LayoutView",
"align_content": null,
"align_items": null,
"align_self": null,
"border_bottom": null,
"border_left": null,
"border_right": null,
"border_top": null,
"bottom": null,
"display": null,
"flex": null,
"flex_flow": null,
"grid_area": null,
"grid_auto_columns": null,
"grid_auto_flow": null,
"grid_auto_rows": null,
"grid_column": null,
"grid_gap": null,
"grid_row": null,
"grid_template_areas": null,
"grid_template_columns": null,
"grid_template_rows": null,
"height": null,
"justify_content": null,
"justify_items": null,
"left": null,
"margin": null,
"max_height": null,
"max_width": null,
"min_height": null,
"min_width": null,
"object_fit": null,
"object_position": null,
"order": null,
"overflow": null,
"padding": null,
"right": null,
"top": null,
"visibility": null,
"width": null
}
},
"e4ae2b6f5a974fd4bafb6abb9d12ff26": {
"model_module": "@jupyter-widgets/controls",
"model_module_version": "2.0.0",
"model_name": "HTMLModel",
"state": {
"_dom_classes": [],
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "2.0.0",
"_model_name": "HTMLModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/controls",
"_view_module_version": "2.0.0",
"_view_name": "HTMLView",
"description": "",
"description_allow_html": false,
"layout": "IPY_MODEL_6086462a12d54bafa59d3c4566f06cb2",
"placeholder": "",
"style": "IPY_MODEL_7d3f3d9e15894d05a4d188ff4f466554",
"tabbable": null,
"tooltip": null,
"value": "100%"
}
},
"f1355871cc6f4dd4b50d9df5af20e5c8": {
"model_module": "@jupyter-widgets/base",
"model_module_version": "2.0.0",
"model_name": "LayoutModel",
"state": {
"_model_module": "@jupyter-widgets/base",
"_model_module_version": "2.0.0",
"_model_name": "LayoutModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "2.0.0",
"_view_name": "LayoutView",
"align_content": null,
"align_items": null,
"align_self": null,
"border_bottom": null,
"border_left": null,
"border_right": null,
"border_top": null,
"bottom": null,
"display": null,
"flex": null,
"flex_flow": null,
"grid_area": null,
"grid_auto_columns": null,
"grid_auto_flow": null,
"grid_auto_rows": null,
"grid_column": null,
"grid_gap": null,
"grid_row": null,
"grid_template_areas": null,
"grid_template_columns": null,
"grid_template_rows": null,
"height": null,
"justify_content": null,
"justify_items": null,
"left": null,
"margin": null,
"max_height": null,
"max_width": null,
"min_height": null,
"min_width": null,
"object_fit": null,
"object_position": null,
"order": null,
"overflow": null,
"padding": null,
"right": null,
"top": null,
"visibility": null,
"width": null
}
}
},
"version_major": 2,
"version_minor": 0
}
}
},
"nbformat": 4,
"nbformat_minor": 2
}