Merge remote-tracking branch 'upstream/main' into code

Chi Wang 2023-09-16 10:13:38 +00:00
commit 4685f27d02
409 changed files with 105233 additions and 417 deletions

5
.coveragerc Normal file

@@ -0,0 +1,5 @@
[run]
branch = True
source = flaml
omit =
*test*

23
.devcontainer/Dockerfile Normal file

@@ -0,0 +1,23 @@
#-------------------------------------------------------------------------------------------------------------
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License. See LICENSE file in the project root for license information.
#-------------------------------------------------------------------------------------------------------------
FROM mcr.microsoft.com/vscode/devcontainers/python:0-3.9
#
# Update the OS and maybe install packages
#
ENV DEBIAN_FRONTEND=noninteractive
RUN apt-get update \
&& apt-get upgrade -y \
&& apt-get -y install --no-install-recommends build-essential npm \
&& apt-get autoremove -y \
&& apt-get clean -y \
&& rm -rf /var/lib/apt/lists/*
ENV DEBIAN_FRONTEND=dialog
# RUN pip3 --disable-pip-version-check --no-cache-dir install flaml
# For docs
RUN npm install --global yarn
RUN pip install pydoc-markdown==4.5.0

13
.devcontainer/devcontainer.json Normal file

@@ -0,0 +1,13 @@
{
"extensions": ["ms-python.python", "visualstudioexptteam.vscodeintellicode"],
"dockerFile": "Dockerfile",
"settings": {
"terminal.integrated.profiles.linux": {
"bash": {
"path": "/bin/bash"
}
},
"terminal.integrated.defaultProfile.linux": "bash"
},
"updateContentCommand": "pip install -e .[notebook,openai] pre-commit && pre-commit install"
}

18
.github/PULL_REQUEST_TEMPLATE.md vendored Normal file

@@ -0,0 +1,18 @@
<!-- Thank you for your contribution! Please review https://microsoft.github.io/FLAML/docs/Contribute before opening a pull request. -->
<!-- Please add a reviewer to the assignee section when you create a PR. If you don't have the access to it, we will shortly find a reviewer and assign them to your PR. -->
## Why are these changes needed?
<!-- Please give a short summary of the change and the problem this solves. -->
## Related issue number
<!-- For example: "Closes #1234" -->
## Checks
<!-- - I've used [pre-commit](https://microsoft.github.io/FLAML/docs/Contribute#pre-commit) to lint the changes in this PR (note the same is integrated into our CI checks). -->
- [ ] I've included any doc changes needed for https://microsoft.github.io/FLAML/. See https://microsoft.github.io/FLAML/docs/Contribute#documentation to build and test documentation locally.
- [ ] I've added tests (if relevant) corresponding to the changes introduced in this PR.
- [ ] I've made sure all auto checks have passed.

52
.github/workflows/CD.yml vendored Normal file

@@ -0,0 +1,52 @@
# This workflow will build and upload a Python Package using Twine when a release is published
# The conda-forge bot will pick up the new PyPI version and automatically create a new version
# For more information see: https://help.github.com/en/actions/language-and-framework-guides/using-python-with-github-actions#publishing-to-package-registries
name: CD
on:
release:
types: [published]
jobs:
deploy:
strategy:
matrix:
os: ['ubuntu-latest']
python-version: [3.8]
runs-on: ${{ matrix.os }}
environment: package
steps:
- name: Checkout
uses: actions/checkout@v3
- name: Cache conda
uses: actions/cache@v3
with:
path: ~/conda_pkgs_dir
key: conda-${{ matrix.os }}-python-${{ matrix.python-version }}-${{ hashFiles('environment.yml') }}
- name: Setup Miniconda
uses: conda-incubator/setup-miniconda@v2
with:
auto-update-conda: true
auto-activate-base: false
activate-environment: hcrystalball
python-version: ${{ matrix.python-version }}
use-only-tar-bz2: true
- name: Install from source
# This is required for the pre-commit tests
shell: pwsh
run: pip install .
- name: Conda list
shell: pwsh
run: conda list
- name: Build
shell: pwsh
run: |
pip install twine
python setup.py sdist bdist_wheel
- name: Publish to PyPI
env:
TWINE_USERNAME: ${{ secrets.PYPI_USERNAME }}
TWINE_PASSWORD: ${{ secrets.PYPI_PASSWORD }}
shell: pwsh
run: twine upload dist/*

.github/workflows/deploy-website.yml vendored

@@ -4,11 +4,13 @@ on:
pull_request:
branches: [main]
paths:
- 'autogen/*'
- 'website/*'
- '.github/workflows/deploy-website.yml'
push:
branches: [main]
paths:
- 'autogen/*'
- 'website/*'
- '.github/workflows/deploy-website.yml'
workflow_dispatch:
@@ -31,6 +33,13 @@ jobs:
uses: actions/setup-python@v4
with:
python-version: "3.8"
- name: pydoc-markdown install
run: |
python -m pip install --upgrade pip
pip install pydoc-markdown
- name: pydoc-markdown run
run: |
pydoc-markdown
- name: Test Build
run: |
if [ -e yarn.lock ]; then
@@ -58,6 +67,13 @@ jobs:
uses: actions/setup-python@v4
with:
python-version: "3.8"
- name: pydoc-markdown install
run: |
python -m pip install --upgrade pip
pip install pydoc-markdown
- name: pydoc-markdown run
run: |
pydoc-markdown
- name: Build website
run: |
if [ -e yarn.lock ]; then
@@ -74,4 +90,5 @@ jobs:
uses: peaceiris/actions-gh-pages@v3
with:
github_token: ${{ secrets.GITHUB_TOKEN }}
publish_dir: ./website/build
# Build output to publish to the `gh-pages` branch:
publish_dir: ./website/build

76
.github/workflows/openai.yml vendored Normal file

@@ -0,0 +1,76 @@
# This workflow will install Python dependencies, run tests and lint with a variety of Python versions
# For more information see: https://help.github.com/actions/language-and-framework-guides/using-python-with-github-actions
name: OpenAI
on:
pull_request:
branches: ['main']
paths:
- 'flaml/autogen/**'
- 'test/autogen/**'
- 'notebook/autogen_openai_completion.ipynb'
- 'notebook/autogen_chatgpt_gpt4.ipynb'
- '.github/workflows/openai.yml'
jobs:
test:
strategy:
matrix:
os: [ubuntu-latest]
python-version: ["3.9", "3.10", "3.11"]
runs-on: ${{ matrix.os }}
environment: openai
steps:
- uses: actions/checkout@v3
- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@v4
with:
python-version: ${{ matrix.python-version }}
- name: Install packages and dependencies
run: |
docker --version
python -m pip install --upgrade pip wheel
pip install -e .[autogen,blendsearch]
python -c "import flaml"
pip install coverage pytest datasets
- name: Install packages for test when needed
if: matrix.python-version == '3.9'
run: |
pip install docker
- name: Install packages for MathChat when needed
if: matrix.python-version != '3.11'
run: |
pip install -e .[mathchat]
- name: Install packages for RetrieveChat when needed
if: matrix.python-version != '3.11'
run: |
pip install -e .[retrievechat]
- name: Coverage
if: matrix.python-version == '3.9'
env:
OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
AZURE_OPENAI_API_KEY: ${{ secrets.AZURE_OPENAI_API_KEY }}
AZURE_OPENAI_API_BASE: ${{ secrets.AZURE_OPENAI_API_BASE }}
OAI_CONFIG_LIST: ${{ secrets.OAI_CONFIG_LIST }}
run: |
coverage run -a -m pytest test/autogen
coverage xml
- name: Coverage and check notebook outputs
if: matrix.python-version != '3.9'
env:
OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
AZURE_OPENAI_API_KEY: ${{ secrets.AZURE_OPENAI_API_KEY }}
AZURE_OPENAI_API_BASE: ${{ secrets.AZURE_OPENAI_API_BASE }}
WOLFRAM_ALPHA_APPID: ${{ secrets.WOLFRAM_ALPHA_APPID }}
OAI_CONFIG_LIST: ${{ secrets.OAI_CONFIG_LIST }}
run: |
pip install nbconvert nbformat ipykernel
coverage run -a -m pytest test/autogen/test_notebook.py
coverage xml
cat "$(pwd)/test/autogen/executed_openai_notebook_output.txt"
- name: Upload coverage to Codecov
uses: codecov/codecov-action@v3
with:
file: ./coverage.xml
flags: unittests

26
.github/workflows/pre-commit.yml vendored Normal file

@@ -0,0 +1,26 @@
name: Code formatting
# see: https://help.github.com/en/actions/reference/events-that-trigger-workflows
on: # Trigger the workflow on push or pull request, but only for the main branch
push:
branches: [main]
pull_request: {}
defaults:
run:
shell: bash
jobs:
pre-commit-check:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
- uses: actions/setup-python@v4
- name: Set $PY environment variable
run: echo "PY=$(python -VV | sha256sum | cut -d' ' -f1)" >> $GITHUB_ENV
- uses: actions/cache@v3
with:
path: ~/.cache/pre-commit
key: pre-commit|${{ env.PY }}|${{ hashFiles('.pre-commit-config.yaml') }}
- uses: pre-commit/action@v3.0.0

124
.github/workflows/python-package.yml vendored Normal file

@@ -0,0 +1,124 @@
# This workflow will install Python dependencies, run tests and lint with a variety of Python versions
# For more information see: https://help.github.com/actions/language-and-framework-guides/using-python-with-github-actions
name: Build
on:
push:
branches: ['main']
paths:
- 'flaml/**'
- 'test/**'
- 'notebook/**'
- '.github/workflows/python-package.yml'
- 'setup.py'
pull_request:
branches: ['main']
merge_group:
types: [checks_requested]
concurrency:
group: ${{ github.workflow }}-${{ github.ref }}-${{ github.head_ref }}
cancel-in-progress: ${{ github.ref != 'refs/heads/main' }}
jobs:
build:
runs-on: ${{ matrix.os }}
strategy:
fail-fast: false
matrix:
os: [ubuntu-latest, macos-latest, windows-2019]
python-version: ["3.8", "3.9", "3.10"]
steps:
- uses: actions/checkout@v3
- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@v4
with:
python-version: ${{ matrix.python-version }}
- name: On mac + python 3.10, install libomp to facilitate lgbm and xgboost install
if: matrix.os == 'macOS-latest' && matrix.python-version == '3.10'
run: |
# remove libomp version constraint after xgboost works with libomp>11.1.0 on python 3.10
wget https://raw.githubusercontent.com/Homebrew/homebrew-core/679923b4eb48a8dc7ecc1f05d06063cd79b3fc00/Formula/libomp.rb -O $(find $(brew --repository) -name libomp.rb)
brew unlink libomp
brew install libomp
export CC=/usr/bin/clang
export CXX=/usr/bin/clang++
export CPPFLAGS="$CPPFLAGS -Xpreprocessor -fopenmp"
export CFLAGS="$CFLAGS -I/usr/local/opt/libomp/include"
export CXXFLAGS="$CXXFLAGS -I/usr/local/opt/libomp/include"
export LDFLAGS="$LDFLAGS -Wl,-rpath,/usr/local/opt/libomp/lib -L/usr/local/opt/libomp/lib -lomp"
- name: Install packages and dependencies
run: |
python -m pip install --upgrade pip wheel
pip install -e .
python -c "import flaml"
pip install -e .[test]
- name: On Ubuntu python 3.8, install pyspark 3.2.3
if: matrix.python-version == '3.8' && matrix.os == 'ubuntu-latest'
run: |
pip install pyspark==3.2.3
pip list | grep "pyspark"
- name: If linux, install ray 2
if: matrix.os == 'ubuntu-latest'
run: |
pip install "ray[tune]<2.5.0"
- name: If mac, install ray
if: matrix.os == 'macOS-latest'
run: |
pip install -e .[ray]
- name: If linux or mac, install prophet on python < 3.9
if: (matrix.os == 'macOS-latest' || matrix.os == 'ubuntu-latest') && matrix.python-version != '3.9' && matrix.python-version != '3.10'
run: |
pip install -e .[forecast]
- name: Install vw on python < 3.10
if: matrix.python-version != '3.10'
run: |
pip install -e .[vw]
- name: Uninstall pyspark on (python 3.9) or (python 3.8 + windows)
if: matrix.python-version == '3.9' || (matrix.python-version == '3.8' && matrix.os == 'windows-2019')
run: |
# Uninstall pyspark to test env without pyspark
pip uninstall -y pyspark
- name: Test with pytest
if: matrix.python-version != '3.10'
run: |
pytest test
- name: Coverage
if: matrix.python-version == '3.10'
run: |
pip install coverage
coverage run -a -m pytest test
coverage xml
- name: Upload coverage to Codecov
if: matrix.python-version == '3.10'
uses: codecov/codecov-action@v3
with:
file: ./coverage.xml
flags: unittests
# docs:
# runs-on: ubuntu-latest
# steps:
# - uses: actions/checkout@v3
# - name: Setup Python
# uses: actions/setup-python@v4
# with:
# python-version: '3.8'
# - name: Compile documentation
# run: |
# pip install -e .
# python -m pip install sphinx sphinx_rtd_theme
# cd docs
# make html
# - name: Deploy to GitHub pages
# if: ${{ github.ref == 'refs/heads/main' }}
# uses: JamesIves/github-pages-deploy-action@3.6.2
# with:
# GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
# BRANCH: gh-pages
# FOLDER: docs/_build/html
# CLEAN: true

161
.gitignore vendored

@@ -1,2 +1,161 @@
.docusaurus/
node_modules/
# Project
/.vs
.vscode
# Log files
*.log
# Python virtualenv
.venv
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class
# C extensions
*.so
# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST
# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec
# Installer logs
pip-log.txt
pip-delete-this-directory.txt
# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/
cover/
# Translations
*.mo
*.pot
# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal
# Flask stuff:
instance/
.webassets-cache
# Scrapy stuff:
.scrapy
# Sphinx documentation
docs/_build/
# PyBuilder
.pybuilder/
target/
# Jupyter Notebook
.ipynb_checkpoints
# IPython
profile_default/
ipython_config.py
# pyenv
# For a library or package, you might want to ignore these files since the code is
# intended to run in multiple environments; otherwise, check them in:
# .python-version
# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having no cross-platform support, pipenv may install dependencies that don't work, or not
# install all needed dependencies.
#Pipfile.lock
# PEP 582; used by e.g. github.com/David-OConnor/pyflow
__pypackages__/
# Celery stuff
celerybeat-schedule
celerybeat.pid
# SageMath parsed files
*.sage.py
# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/
# Spyder project settings
.spyderproject
.spyproject
# Rope project settings
.ropeproject
# mkdocs documentation
/site
# mypy
.mypy_cache/
.dmypy.json
dmypy.json
# Pyre type checker
.pyre/
# pytype static type analyzer
.pytype/
# Cython debug symbols
cython_debug/
logs
.idea/*
.DS_Store
output/
*.pkl
# local config files
*.config.local

33
.pre-commit-config.yaml Normal file

@@ -0,0 +1,33 @@
default_language_version:
python: python3
ci:
autofix_prs: true
autoupdate_commit_msg: '[pre-commit.ci] pre-commit suggestions'
autoupdate_schedule: 'quarterly'
repos:
- repo: https://github.com/pre-commit/pre-commit-hooks
rev: v4.4.0
hooks:
- id: check-added-large-files
- id: check-ast
- id: check-yaml
- id: check-toml
- id: check-json
- id: check-byte-order-marker
exclude: .gitignore
- id: check-merge-conflict
- id: detect-private-key
- id: trailing-whitespace
- id: end-of-file-fixer
- id: no-commit-to-branch
- repo: https://github.com/psf/black
rev: 23.3.0
hooks:
- id: black
- repo: https://github.com/charliermarsh/ruff-pre-commit
rev: v0.0.261
hooks:
- id: ruff
args: ["--fix"]

18
CITATION.cff Normal file

@@ -0,0 +1,18 @@
preferred-citation:
type: inproceedings
authors:
- family-names: "Wang"
given-names: "Chi"
affiliation: "Microsoft Research, Redmond WA USA"
- family-names: "Wu"
given-names: "Qingyun"
affiliation: "Microsoft Research, Redmond WA USA"
- family-names: "Weimer"
given-names: "Markus"
affiliation: "Microsoft Corporation, Redmond WA USA"
- family-names: "Zhu"
given-names: "Eric"
affiliation: "Microsoft Research, Redmond WA USA"
booktitle: "Proceedings of the 4th MLSys Conference"
title: "FLAML: A Fast and Lightweight AutoML Library"
year: 2021

40
Dockerfile Normal file

@@ -0,0 +1,40 @@
# basic setup
FROM python:3.7
RUN apt-get update && apt-get -y upgrade
RUN apt-get install -y sudo git npm
# Install Spark
RUN sudo apt-get update && sudo apt-get install -y --allow-downgrades --allow-change-held-packages --no-install-recommends \
ca-certificates-java ca-certificates openjdk-17-jdk-headless \
wget \
&& sudo apt-get clean && sudo rm -rf /var/lib/apt/lists/*
RUN wget --progress=dot:giga "https://www.apache.org/dyn/closer.lua/spark/spark-3.3.0/spark-3.3.0-bin-hadoop2.tgz?action=download" -O - | tar -xzC /tmp; archive=$(basename "spark-3.3.0/spark-3.3.0-bin-hadoop2.tgz") bash -c "sudo mv -v /tmp/\${archive/%.tgz/} /spark"
ENV SPARK_HOME=/spark \
PYTHONPATH=/spark/python/lib/py4j-0.10.9.5-src.zip:/spark/python
ENV PATH="${PATH}:${SPARK_HOME}/bin"
# Setup user to not run as root
RUN adduser --disabled-password --gecos '' flaml-dev
RUN adduser flaml-dev sudo
RUN echo '%sudo ALL=(ALL) NOPASSWD:ALL' >> /etc/sudoers
USER flaml-dev
# Pull repo
RUN cd /home/flaml-dev && git clone https://github.com/microsoft/FLAML.git
WORKDIR /home/flaml-dev/FLAML
# Install FLAML (Note: extra components can be installed if needed)
RUN sudo pip install -e .[test,notebook]
# Install pre-commit hooks
RUN pre-commit install
# For docs
RUN sudo npm install --global yarn
RUN sudo pip install pydoc-markdown
# cd does not persist across RUN instructions, so chain it with the yarn install
RUN cd website && yarn install --frozen-lockfile --ignore-engines
# override default image starting point
CMD /bin/bash
ENTRYPOINT []

416
LICENSE

@@ -1,395 +1,21 @@
Attribution 4.0 International
=======================================================================
Creative Commons Corporation ("Creative Commons") is not a law firm and
does not provide legal services or legal advice. Distribution of
Creative Commons public licenses does not create a lawyer-client or
other relationship. Creative Commons makes its licenses and related
information available on an "as-is" basis. Creative Commons gives no
warranties regarding its licenses, any material licensed under their
terms and conditions, or any related information. Creative Commons
disclaims all liability for damages resulting from their use to the
fullest extent possible.
Using Creative Commons Public Licenses
Creative Commons public licenses provide a standard set of terms and
conditions that creators and other rights holders may use to share
original works of authorship and other material subject to copyright
and certain other rights specified in the public license below. The
following considerations are for informational purposes only, are not
exhaustive, and do not form part of our licenses.
Considerations for licensors: Our public licenses are
intended for use by those authorized to give the public
permission to use material in ways otherwise restricted by
copyright and certain other rights. Our licenses are
irrevocable. Licensors should read and understand the terms
and conditions of the license they choose before applying it.
Licensors should also secure all rights necessary before
applying our licenses so that the public can reuse the
material as expected. Licensors should clearly mark any
material not subject to the license. This includes other CC-
licensed material, or material used under an exception or
limitation to copyright. More considerations for licensors:
wiki.creativecommons.org/Considerations_for_licensors
Considerations for the public: By using one of our public
licenses, a licensor grants the public permission to use the
licensed material under specified terms and conditions. If
the licensor's permission is not necessary for any reason--for
example, because of any applicable exception or limitation to
copyright--then that use is not regulated by the license. Our
licenses grant only permissions under copyright and certain
other rights that a licensor has authority to grant. Use of
the licensed material may still be restricted for other
reasons, including because others have copyright or other
rights in the material. A licensor may make special requests,
such as asking that all changes be marked or described.
Although not required by our licenses, you are encouraged to
respect those requests where reasonable. More_considerations
for the public:
wiki.creativecommons.org/Considerations_for_licensees
=======================================================================
Creative Commons Attribution 4.0 International Public License
By exercising the Licensed Rights (defined below), You accept and agree
to be bound by the terms and conditions of this Creative Commons
Attribution 4.0 International Public License ("Public License"). To the
extent this Public License may be interpreted as a contract, You are
granted the Licensed Rights in consideration of Your acceptance of
these terms and conditions, and the Licensor grants You such rights in
consideration of benefits the Licensor receives from making the
Licensed Material available under these terms and conditions.
Section 1 -- Definitions.
a. Adapted Material means material subject to Copyright and Similar
Rights that is derived from or based upon the Licensed Material
and in which the Licensed Material is translated, altered,
arranged, transformed, or otherwise modified in a manner requiring
permission under the Copyright and Similar Rights held by the
Licensor. For purposes of this Public License, where the Licensed
Material is a musical work, performance, or sound recording,
Adapted Material is always produced where the Licensed Material is
synched in timed relation with a moving image.
b. Adapter's License means the license You apply to Your Copyright
and Similar Rights in Your contributions to Adapted Material in
accordance with the terms and conditions of this Public License.
c. Copyright and Similar Rights means copyright and/or similar rights
closely related to copyright including, without limitation,
performance, broadcast, sound recording, and Sui Generis Database
Rights, without regard to how the rights are labeled or
categorized. For purposes of this Public License, the rights
specified in Section 2(b)(1)-(2) are not Copyright and Similar
Rights.
d. Effective Technological Measures means those measures that, in the
absence of proper authority, may not be circumvented under laws
fulfilling obligations under Article 11 of the WIPO Copyright
Treaty adopted on December 20, 1996, and/or similar international
agreements.
e. Exceptions and Limitations means fair use, fair dealing, and/or
any other exception or limitation to Copyright and Similar Rights
that applies to Your use of the Licensed Material.
f. Licensed Material means the artistic or literary work, database,
or other material to which the Licensor applied this Public
License.
g. Licensed Rights means the rights granted to You subject to the
terms and conditions of this Public License, which are limited to
all Copyright and Similar Rights that apply to Your use of the
Licensed Material and that the Licensor has authority to license.
h. Licensor means the individual(s) or entity(ies) granting rights
under this Public License.
i. Share means to provide material to the public by any means or
process that requires permission under the Licensed Rights, such
as reproduction, public display, public performance, distribution,
dissemination, communication, or importation, and to make material
available to the public including in ways that members of the
public may access the material from a place and at a time
individually chosen by them.
j. Sui Generis Database Rights means rights other than copyright
resulting from Directive 96/9/EC of the European Parliament and of
the Council of 11 March 1996 on the legal protection of databases,
as amended and/or succeeded, as well as other essentially
equivalent rights anywhere in the world.
k. You means the individual or entity exercising the Licensed Rights
under this Public License. Your has a corresponding meaning.
Section 2 -- Scope.
a. License grant.
1. Subject to the terms and conditions of this Public License,
the Licensor hereby grants You a worldwide, royalty-free,
non-sublicensable, non-exclusive, irrevocable license to
exercise the Licensed Rights in the Licensed Material to:
a. reproduce and Share the Licensed Material, in whole or
in part; and
b. produce, reproduce, and Share Adapted Material.
2. Exceptions and Limitations. For the avoidance of doubt, where
Exceptions and Limitations apply to Your use, this Public
License does not apply, and You do not need to comply with
its terms and conditions.
3. Term. The term of this Public License is specified in Section
6(a).
4. Media and formats; technical modifications allowed. The
Licensor authorizes You to exercise the Licensed Rights in
all media and formats whether now known or hereafter created,
and to make technical modifications necessary to do so. The
Licensor waives and/or agrees not to assert any right or
authority to forbid You from making technical modifications
necessary to exercise the Licensed Rights, including
technical modifications necessary to circumvent Effective
Technological Measures. For purposes of this Public License,
simply making modifications authorized by this Section 2(a)
(4) never produces Adapted Material.
5. Downstream recipients.
a. Offer from the Licensor -- Licensed Material. Every
recipient of the Licensed Material automatically
receives an offer from the Licensor to exercise the
Licensed Rights under the terms and conditions of this
Public License.
b. No downstream restrictions. You may not offer or impose
any additional or different terms or conditions on, or
apply any Effective Technological Measures to, the
Licensed Material if doing so restricts exercise of the
Licensed Rights by any recipient of the Licensed
Material.
6. No endorsement. Nothing in this Public License constitutes or
may be construed as permission to assert or imply that You
are, or that Your use of the Licensed Material is, connected
with, or sponsored, endorsed, or granted official status by,
the Licensor or others designated to receive attribution as
provided in Section 3(a)(1)(A)(i).
b. Other rights.
1. Moral rights, such as the right of integrity, are not
licensed under this Public License, nor are publicity,
privacy, and/or other similar personality rights; however, to
the extent possible, the Licensor waives and/or agrees not to
assert any such rights held by the Licensor to the limited
extent necessary to allow You to exercise the Licensed
Rights, but not otherwise.
2. Patent and trademark rights are not licensed under this
Public License.
3. To the extent possible, the Licensor waives any right to
collect royalties from You for the exercise of the Licensed
Rights, whether directly or through a collecting society
under any voluntary or waivable statutory or compulsory
licensing scheme. In all other cases the Licensor expressly
reserves any right to collect such royalties.
Section 3 -- License Conditions.
Your exercise of the Licensed Rights is expressly made subject to the
following conditions.
a. Attribution.
1. If You Share the Licensed Material (including in modified
form), You must:
a. retain the following if it is supplied by the Licensor
with the Licensed Material:
i. identification of the creator(s) of the Licensed
Material and any others designated to receive
attribution, in any reasonable manner requested by
the Licensor (including by pseudonym if
designated);
ii. a copyright notice;
iii. a notice that refers to this Public License;
iv. a notice that refers to the disclaimer of
warranties;
v. a URI or hyperlink to the Licensed Material to the
extent reasonably practicable;
b. indicate if You modified the Licensed Material and
retain an indication of any previous modifications; and
c. indicate the Licensed Material is licensed under this
Public License, and include the text of, or the URI or
hyperlink to, this Public License.
2. You may satisfy the conditions in Section 3(a)(1) in any
reasonable manner based on the medium, means, and context in
which You Share the Licensed Material. For example, it may be
reasonable to satisfy the conditions by providing a URI or
hyperlink to a resource that includes the required
information.
3. If requested by the Licensor, You must remove any of the
information required by Section 3(a)(1)(A) to the extent
reasonably practicable.
4. If You Share Adapted Material You produce, the Adapter's
License You apply must not prevent recipients of the Adapted
Material from complying with this Public License.
Section 4 -- Sui Generis Database Rights.
Where the Licensed Rights include Sui Generis Database Rights that
apply to Your use of the Licensed Material:
a. for the avoidance of doubt, Section 2(a)(1) grants You the right
to extract, reuse, reproduce, and Share all or a substantial
portion of the contents of the database;
b. if You include all or a substantial portion of the database
contents in a database in which You have Sui Generis Database
Rights, then the database in which You have Sui Generis Database
Rights (but not its individual contents) is Adapted Material; and
c. You must comply with the conditions in Section 3(a) if You Share
all or a substantial portion of the contents of the database.
For the avoidance of doubt, this Section 4 supplements and does not
replace Your obligations under this Public License where the Licensed
Rights include other Copyright and Similar Rights.
Section 5 -- Disclaimer of Warranties and Limitation of Liability.
a. UNLESS OTHERWISE SEPARATELY UNDERTAKEN BY THE LICENSOR, TO THE
EXTENT POSSIBLE, THE LICENSOR OFFERS THE LICENSED MATERIAL AS-IS
AND AS-AVAILABLE, AND MAKES NO REPRESENTATIONS OR WARRANTIES OF
ANY KIND CONCERNING THE LICENSED MATERIAL, WHETHER EXPRESS,
IMPLIED, STATUTORY, OR OTHER. THIS INCLUDES, WITHOUT LIMITATION,
WARRANTIES OF TITLE, MERCHANTABILITY, FITNESS FOR A PARTICULAR
PURPOSE, NON-INFRINGEMENT, ABSENCE OF LATENT OR OTHER DEFECTS,
ACCURACY, OR THE PRESENCE OR ABSENCE OF ERRORS, WHETHER OR NOT
KNOWN OR DISCOVERABLE. WHERE DISCLAIMERS OF WARRANTIES ARE NOT
ALLOWED IN FULL OR IN PART, THIS DISCLAIMER MAY NOT APPLY TO YOU.
b. TO THE EXTENT POSSIBLE, IN NO EVENT WILL THE LICENSOR BE LIABLE
TO YOU ON ANY LEGAL THEORY (INCLUDING, WITHOUT LIMITATION,
NEGLIGENCE) OR OTHERWISE FOR ANY DIRECT, SPECIAL, INDIRECT,
INCIDENTAL, CONSEQUENTIAL, PUNITIVE, EXEMPLARY, OR OTHER LOSSES,
COSTS, EXPENSES, OR DAMAGES ARISING OUT OF THIS PUBLIC LICENSE OR
USE OF THE LICENSED MATERIAL, EVEN IF THE LICENSOR HAS BEEN
ADVISED OF THE POSSIBILITY OF SUCH LOSSES, COSTS, EXPENSES, OR
DAMAGES. WHERE A LIMITATION OF LIABILITY IS NOT ALLOWED IN FULL OR
IN PART, THIS LIMITATION MAY NOT APPLY TO YOU.
c. The disclaimer of warranties and limitation of liability provided
above shall be interpreted in a manner that, to the extent
possible, most closely approximates an absolute disclaimer and
waiver of all liability.
Section 6 -- Term and Termination.
a. This Public License applies for the term of the Copyright and
Similar Rights licensed here. However, if You fail to comply with
this Public License, then Your rights under this Public License
terminate automatically.
b. Where Your right to use the Licensed Material has terminated under
Section 6(a), it reinstates:
1. automatically as of the date the violation is cured, provided
it is cured within 30 days of Your discovery of the
violation; or
2. upon express reinstatement by the Licensor.
For the avoidance of doubt, this Section 6(b) does not affect any
right the Licensor may have to seek remedies for Your violations
of this Public License.
c. For the avoidance of doubt, the Licensor may also offer the
Licensed Material under separate terms or conditions or stop
distributing the Licensed Material at any time; however, doing so
will not terminate this Public License.
d. Sections 1, 5, 6, 7, and 8 survive termination of this Public
License.
Section 7 -- Other Terms and Conditions.
a. The Licensor shall not be bound by any additional or different
terms or conditions communicated by You unless expressly agreed.
b. Any arrangements, understandings, or agreements regarding the
Licensed Material not stated herein are separate from and
independent of the terms and conditions of this Public License.
Section 8 -- Interpretation.
a. For the avoidance of doubt, this Public License does not, and
shall not be interpreted to, reduce, limit, restrict, or impose
conditions on any use of the Licensed Material that could lawfully
be made without permission under this Public License.
b. To the extent possible, if any provision of this Public License is
deemed unenforceable, it shall be automatically reformed to the
minimum extent necessary to make it enforceable. If the provision
cannot be reformed, it shall be severed from this Public License
without affecting the enforceability of the remaining terms and
conditions.
c. No term or condition of this Public License will be waived and no
failure to comply consented to unless expressly agreed to by the
Licensor.
d. Nothing in this Public License constitutes or may be interpreted
as a limitation upon, or waiver of, any privileges and immunities
that apply to the Licensor or You, including from the legal
processes of any jurisdiction or authority.
=======================================================================
Creative Commons is not a party to its public
licenses. Notwithstanding, Creative Commons may elect to apply one of
its public licenses to material it publishes and in those instances
will be considered the “Licensor.” The text of the Creative Commons
public licenses is dedicated to the public domain under the CC0 Public
Domain Dedication. Except for the limited purpose of indicating that
material is shared under a Creative Commons public license or as
otherwise permitted by the Creative Commons policies published at
creativecommons.org/policies, Creative Commons does not authorize the
use of the trademark "Creative Commons" or any other trademark or logo
of Creative Commons without its prior written consent including,
without limitation, in connection with any unauthorized modifications
to any of its public licenses or any other arrangements,
understandings, or agreements concerning use of licensed material. For
the avoidance of doubt, this paragraph does not form part of the
public licenses.
Creative Commons may be contacted at creativecommons.org.
MIT License
Copyright (c) Microsoft Corporation.
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

290
NOTICE.md Normal file

@@ -0,0 +1,290 @@
NOTICES
This repository incorporates material as listed below or described in the code.
#
## Component. Ray.
Code in tune/[analysis.py, sample.py, trial.py, result.py],
searcher/[suggestion.py, variant_generator.py], and scheduler/trial_scheduler.py is adapted from
https://github.com/ray-project/ray/blob/master/python/ray/tune/
## Open Source License/Copyright Notice.
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/
TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
1. Definitions.
"License" shall mean the terms and conditions for use, reproduction,
and distribution as defined by Sections 1 through 9 of this document.
"Licensor" shall mean the copyright owner or entity authorized by
the copyright owner that is granting the License.
"Legal Entity" shall mean the union of the acting entity and all
other entities that control, are controlled by, or are under common
control with that entity. For the purposes of this definition,
"control" means (i) the power, direct or indirect, to cause the
direction or management of such entity, whether by contract or
otherwise, or (ii) ownership of fifty percent (50%) or more of the
outstanding shares, or (iii) beneficial ownership of such entity.
"You" (or "Your") shall mean an individual or Legal Entity
exercising permissions granted by this License.
"Source" form shall mean the preferred form for making modifications,
including but not limited to software source code, documentation
source, and configuration files.
"Object" form shall mean any form resulting from mechanical
transformation or translation of a Source form, including but
not limited to compiled object code, generated documentation,
and conversions to other media types.
"Work" shall mean the work of authorship, whether in Source or
Object form, made available under the License, as indicated by a
copyright notice that is included in or attached to the work
(an example is provided in the Appendix below).
"Derivative Works" shall mean any work, whether in Source or Object
form, that is based on (or derived from) the Work and for which the
editorial revisions, annotations, elaborations, or other modifications
represent, as a whole, an original work of authorship. For the purposes
of this License, Derivative Works shall not include works that remain
separable from, or merely link (or bind by name) to the interfaces of,
the Work and Derivative Works thereof.
"Contribution" shall mean any work of authorship, including
the original version of the Work and any modifications or additions
to that Work or Derivative Works thereof, that is intentionally
submitted to Licensor for inclusion in the Work by the copyright owner
or by an individual or Legal Entity authorized to submit on behalf of
the copyright owner. For the purposes of this definition, "submitted"
means any form of electronic, verbal, or written communication sent
to the Licensor or its representatives, including but not limited to
communication on electronic mailing lists, source code control systems,
and issue tracking systems that are managed by, or on behalf of, the
Licensor for the purpose of discussing and improving the Work, but
excluding communication that is conspicuously marked or otherwise
designated in writing by the copyright owner as "Not a Contribution."
"Contributor" shall mean Licensor and any individual or Legal Entity
on behalf of whom a Contribution has been received by Licensor and
subsequently incorporated within the Work.
2. Grant of Copyright License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
copyright license to reproduce, prepare Derivative Works of,
publicly display, publicly perform, sublicense, and distribute the
Work and such Derivative Works in Source or Object form.
3. Grant of Patent License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
(except as stated in this section) patent license to make, have made,
use, offer to sell, sell, import, and otherwise transfer the Work,
where such license applies only to those patent claims licensable
by such Contributor that are necessarily infringed by their
Contribution(s) alone or by combination of their Contribution(s)
with the Work to which such Contribution(s) was submitted. If You
institute patent litigation against any entity (including a
cross-claim or counterclaim in a lawsuit) alleging that the Work
or a Contribution incorporated within the Work constitutes direct
or contributory patent infringement, then any patent licenses
granted to You under this License for that Work shall terminate
as of the date such litigation is filed.
4. Redistribution. You may reproduce and distribute copies of the
Work or Derivative Works thereof in any medium, with or without
modifications, and in Source or Object form, provided that You
meet the following conditions:
(a) You must give any other recipients of the Work or
Derivative Works a copy of this License; and
(b) You must cause any modified files to carry prominent notices
stating that You changed the files; and
(c) You must retain, in the Source form of any Derivative Works
that You distribute, all copyright, patent, trademark, and
attribution notices from the Source form of the Work,
excluding those notices that do not pertain to any part of
the Derivative Works; and
(d) If the Work includes a "NOTICE" text file as part of its
distribution, then any Derivative Works that You distribute must
include a readable copy of the attribution notices contained
within such NOTICE file, excluding those notices that do not
pertain to any part of the Derivative Works, in at least one
of the following places: within a NOTICE text file distributed
as part of the Derivative Works; within the Source form or
documentation, if provided along with the Derivative Works; or,
within a display generated by the Derivative Works, if and
wherever such third-party notices normally appear. The contents
of the NOTICE file are for informational purposes only and
do not modify the License. You may add Your own attribution
notices within Derivative Works that You distribute, alongside
or as an addendum to the NOTICE text from the Work, provided
that such additional attribution notices cannot be construed
as modifying the License.
You may add Your own copyright statement to Your modifications and
may provide additional or different license terms and conditions
for use, reproduction, or distribution of Your modifications, or
for any such Derivative Works as a whole, provided Your use,
reproduction, and distribution of the Work otherwise complies with
the conditions stated in this License.
5. Submission of Contributions. Unless You explicitly state otherwise,
any Contribution intentionally submitted for inclusion in the Work
by You to the Licensor shall be under the terms and conditions of
this License, without any additional terms or conditions.
Notwithstanding the above, nothing herein shall supersede or modify
the terms of any separate license agreement you may have executed
with Licensor regarding such Contributions.
6. Trademarks. This License does not grant permission to use the trade
names, trademarks, service marks, or product names of the Licensor,
except as required for reasonable and customary use in describing the
origin of the Work and reproducing the content of the NOTICE file.
7. Disclaimer of Warranty. Unless required by applicable law or
agreed to in writing, Licensor provides the Work (and each
Contributor provides its Contributions) on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
implied, including, without limitation, any warranties or conditions
of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
PARTICULAR PURPOSE. You are solely responsible for determining the
appropriateness of using or redistributing the Work and assume any
risks associated with Your exercise of permissions under this License.
8. Limitation of Liability. In no event and under no legal theory,
whether in tort (including negligence), contract, or otherwise,
unless required by applicable law (such as deliberate and grossly
negligent acts) or agreed to in writing, shall any Contributor be
liable to You for damages, including any direct, indirect, special,
incidental, or consequential damages of any character arising as a
result of this License or out of the use or inability to use the
Work (including but not limited to damages for loss of goodwill,
work stoppage, computer failure or malfunction, or any and all
other commercial damages or losses), even if such Contributor
has been advised of the possibility of such damages.
9. Accepting Warranty or Additional Liability. While redistributing
the Work or Derivative Works thereof, You may choose to offer,
and charge a fee for, acceptance of support, warranty, indemnity,
or other liability obligations and/or rights consistent with this
License. However, in accepting such obligations, You may act only
on Your own behalf and on Your sole responsibility, not on behalf
of any other Contributor, and only if You agree to indemnify,
defend, and hold each Contributor harmless for any liability
incurred by, or claims asserted against, such Contributor by reason
of your accepting any such warranty or additional liability.
END OF TERMS AND CONDITIONS
APPENDIX: How to apply the Apache License to your work.
To apply the Apache License to your work, attach the following
boilerplate notice, with the fields enclosed by brackets "{}"
replaced with your own identifying information. (Don't include
the brackets!) The text should be enclosed in the appropriate
comment syntax for the file format. We also recommend that a
file or class name and description of purpose be included on the
same "printed page" as the copyright notice for easier
identification within third-party archives.
Copyright {yyyy} {name of copyright owner}
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
--------------------------------------------------------------------------------
Code in python/ray/rllib/{evolution_strategies, dqn} adapted from
https://github.com/openai (MIT License)
Copyright (c) 2016 OpenAI (http://openai.com)
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
THE SOFTWARE.
--------------------------------------------------------------------------------
Code in python/ray/rllib/impala/vtrace.py from
https://github.com/deepmind/scalable_agent
Copyright 2018 Google LLC
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
https://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
--------------------------------------------------------------------------------
Code in python/ray/rllib/ars is adapted from https://github.com/modestyachts/ARS
Copyright (c) 2018, ARS contributors (Horia Mania, Aurelia Guy, Benjamin Recht)
All rights reserved.
Redistribution and use of ARS in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:
1. Redistributions of source code must retain the above copyright notice, this
list of conditions and the following disclaimer.
2. Redistributions in binary form must reproduce the above copyright notice,
this list of conditions and the following disclaimer in the documentation and/or
other materials provided with the distribution.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR
ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
(INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON
ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
------------------
Code in python/ray/_private/prometheus_exporter.py is adapted from https://github.com/census-instrumentation/opencensus-python/blob/master/contrib/opencensus-ext-prometheus/opencensus/ext/prometheus/stats_exporter/__init__.py

108
README.md

@@ -4,6 +4,114 @@
This project welcomes contributions and suggestions. Most contributions require you to agree to a
Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us
the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.
[![PyPI version](https://badge.fury.io/py/pyautogen.svg)](https://badge.fury.io/py/pyautogen)
<!-- ![Conda version](https://img.shields.io/conda/vn/conda-forge/flaml) -->
[![Build](https://github.com/microsoft/autogen/actions/workflows/python-package.yml/badge.svg)](https://github.com/microsoft/autogen/actions/workflows/python-package.yml)
![Python Version](https://img.shields.io/badge/3.8%20%7C%203.9%20%7C%203.10-blue)
<!-- [![Downloads](https://pepy.tech/badge/flaml)](https://pepy.tech/project/flaml) -->
[![](https://img.shields.io/discord/1025786666260111483?logo=discord&style=flat)](https://discord.gg/Cppx2vSPVP)
<!-- [![Join the chat at https://gitter.im/FLAMLer/community](https://badges.gitter.im/FLAMLer/community.svg)](https://gitter.im/FLAMLer/community?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge&utm_content=badge) -->
# AutoGen
<!-- <p align="center">
<img src="https://github.com/microsoft/FLAML/blob/main/website/static/img/flaml.svg" width=200>
<br>
</p> -->
<!-- :fire: Heads-up: We're preparing to migrate [autogen](https://microsoft.github.io/FLAML/docs/Use-Cases/Autogen) into a dedicated github repository. Alongside this move, we'll also launch a dedicated Discord server and a website for comprehensive documentation.
:fire: The automated multi-agent chat framework in [autogen](https://microsoft.github.io/FLAML/docs/Use-Cases/Autogen) is in preview from v2.0.0.
:fire: FLAML is highlighted in OpenAI's [cookbook](https://github.com/openai/openai-cookbook#related-resources-from-around-the-web).
:fire: [autogen](https://microsoft.github.io/FLAML/docs/Use-Cases/Autogen) is released with support for ChatGPT and GPT-4, based on [Cost-Effective Hyperparameter Optimization for Large Language Model Generation Inference](https://arxiv.org/abs/2303.04673).
:fire: FLAML supports Code-First AutoML & Tuning Private Preview in [Microsoft Fabric Data Science](https://learn.microsoft.com/en-us/fabric/data-science/). -->
## What is AutoGen
AutoGen is a framework that enables the development of LLM applications using multiple agents that can converse with each other to solve tasks. AutoGen agents are customizable, conversable, and seamlessly allow human participation. They can operate in various modes that employ combinations of LLMs, human inputs, and tools.
![AutoGen Overview](https://github.com/microsoft/autogen/blob/main/website/static/img/autogen_agentchat.png)
* AutoGen enables building next-gen LLM applications based on **multi-agent conversations** with minimal effort. It simplifies the orchestration, automation, and optimization of complex LLM workflows. It maximizes the performance of LLMs and overcomes their weaknesses.
* It supports **diverse conversation patterns** for complex workflows. With customizable and conversable agents, developers can use AutoGen to build a wide range of conversation patterns concerning conversation autonomy,
the number of agents, and agent conversation topology.
* It provides a collection of working systems with different complexities. These systems span a **wide range of applications** across various domains and complexity levels, demonstrating how AutoGen can easily support different conversation patterns.
* AutoGen provides a drop-in replacement of `openai.Completion` or `openai.ChatCompletion` as an **enhanced inference API**. It allows easy performance tuning, utilities like API unification and caching, and advanced usage patterns such as error handling, multi-config inference, and context programming.
AutoGen is powered by collaborative [research studies](/docs/Research) from Microsoft, Penn State University, and University of Washington.
## Installation
AutoGen requires **Python version >= 3.8**. It can be installed from pip:
```bash
pip install pyautogen
```
<!-- Minimal dependencies are installed without extra options. You can install extra options based on the feature you need.
For example, use the following to install the dependencies needed by the [`autogen`](https://microsoft.github.io/FLAML/docs/Use-Cases/Autogen) package.
```bash
pip install "flaml[autogen]"
```
Find more options in [Installation](https://microsoft.github.io/autogen/docs/Installation).
Each of the [`notebook examples`](https://github.com/microsoft/FLAML/tree/main/notebook) may require a specific option to be installed. -->
## Quickstart
* AutoGen enables next-gen LLM applications with a generic multi-agent conversation framework. It offers customizable and conversable agents that integrate LLMs, tools, and humans.
By automating chat among multiple capable agents, one can easily make them collectively perform tasks autonomously or with human feedback, including tasks that require using tools via code. For example,
```python
from autogen import AssistantAgent, UserProxyAgent
assistant = AssistantAgent("assistant")
user_proxy = UserProxyAgent("user_proxy")
user_proxy.initiate_chat(assistant, message="Plot a chart of META and TESLA stock price change YTD.")
# This initiates an automated chat between the two agents to solve the task
```
The figure below shows an example conversation flow with AutoGen.
![Agent Chat Example](https://github.com/microsoft/autogen/blob/main/website/static/img/chat_example.png)
* AutoGen also helps maximize the utility of expensive LLMs such as ChatGPT and GPT-4. It offers a drop-in replacement of `openai.Completion` or `openai.ChatCompletion` with powerful functionalities like tuning, caching, error handling, and templating. For example, you can optimize generations by LLM with your own tuning data, success metrics, and budgets.
```python
import autogen

# `tune_data`, `eval_func`, and `test_instance` are user-provided:
# tuning examples, a scoring function, and one inference input.

# perform tuning
config, analysis = autogen.Completion.tune(
data=tune_data,
metric="success",
mode="max",
eval_func=eval_func,
inference_budget=0.05,
optimization_budget=3,
num_samples=-1,
)
# perform inference for a test instance
response = autogen.Completion.create(context=test_instance, **config)
```
## Documentation
You can find detailed documentation about AutoGen [here](https://microsoft.github.io/autogen/).
In addition, you can find:
- [Research](https://microsoft.github.io/autogen/docs/Research) and [blogposts](https://microsoft.github.io/autogen/blog) around AutoGen.
- [Discord](https://discord.gg/Cppx2vSPVP).
- [Contributing guide](https://microsoft.github.io/autogen/docs/Contribute).
## Contributing
This project welcomes contributions and suggestions. Most contributions require you to agree to a
Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us
the rights to use your contribution. For details, visit <https://cla.opensource.microsoft.com>.
If you are new to GitHub, [here](https://help.github.com/categories/collaborating-with-issues-and-pull-requests/) is a detailed help source on getting involved with development on GitHub.
When you submit a pull request, a CLA bot will automatically determine whether you need to provide
a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions

10
flaml/__init__.py Normal file

@@ -0,0 +1,10 @@
import logging
from flaml.automl import AutoML, logger_formatter
from flaml.tune.searcher import CFO, BlendSearch, FLOW2, BlendSearchTuner, RandomSearch
from flaml.onlineml.autovw import AutoVW
from flaml.version import __version__
# Set the root logger.
logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)

View File

@ -0,0 +1,3 @@
from .oai import *
from .agentchat import *
from .code_utils import DEFAULT_MODEL, FAST_MODEL

View File

@ -0,0 +1,14 @@
from .agent import Agent
from .conversable_agent import ConversableAgent
from .assistant_agent import AssistantAgent
from .user_proxy_agent import UserProxyAgent
from .groupchat import GroupChat, GroupChatManager
__all__ = [
"Agent",
"ConversableAgent",
"AssistantAgent",
"UserProxyAgent",
"GroupChat",
"GroupChatManager",
]

View File

@ -0,0 +1,70 @@
from typing import Dict, List, Optional, Union
class Agent:
"""(In preview) An abstract class for AI agent.
An agent can communicate with other agents and perform actions.
Different agents can differ in what actions they perform in the `receive` method.
"""
def __init__(
self,
name: str,
):
"""
Args:
name (str): name of the agent.
"""
# a dictionary of conversations, default value is list
self._name = name
@property
def name(self):
"""Get the name of the agent."""
return self._name
def send(self, message: Union[Dict, str], recipient: "Agent", request_reply: Optional[bool] = None):
"""(Aabstract method) Send a message to another agent."""
async def a_send(self, message: Union[Dict, str], recipient: "Agent", request_reply: Optional[bool] = None):
"""(Aabstract async method) Send a message to another agent."""
def receive(self, message: Union[Dict, str], sender: "Agent", request_reply: Optional[bool] = None):
"""(Abstract method) Receive a message from another agent."""
async def a_receive(self, message: Union[Dict, str], sender: "Agent", request_reply: Optional[bool] = None):
"""(Abstract async method) Receive a message from another agent."""
def reset(self):
"""(Abstract method) Reset the agent."""
def generate_reply(
self,
messages: Optional[List[Dict]] = None,
sender: Optional["Agent"] = None,
**kwargs,
) -> Union[str, Dict, None]:
"""(Abstract method) Generate a reply based on the received messages.
Args:
messages (list[dict]): a list of messages received.
sender: sender of an Agent instance.
Returns:
str or dict or None: the generated reply. If None, no reply is generated.
"""
async def a_generate_reply(
self,
messages: Optional[List[Dict]] = None,
sender: Optional["Agent"] = None,
**kwargs,
) -> Union[str, Dict, None]:
"""(Abstract async method) Generate a reply based on the received messages.
Args:
messages (list[dict]): a list of messages received.
sender: sender of an Agent instance.
Returns:
str or dict or None: the generated reply. If None, no reply is generated.
"""

View File

@ -0,0 +1,66 @@
from .conversable_agent import ConversableAgent
from typing import Callable, Dict, Optional, Union
class AssistantAgent(ConversableAgent):
"""(In preview) Assistant agent, designed to solve a task with LLM.
AssistantAgent is a subclass of ConversableAgent configured with a default system message.
The default system message is designed to solve a task with LLM,
including suggesting python code blocks and debugging.
`human_input_mode` defaults to "NEVER"
and `code_execution_config` defaults to False.
This agent doesn't execute code by default, and expects the user to execute the code.
"""
DEFAULT_SYSTEM_MESSAGE = """You are a helpful AI assistant.
Solve tasks using your coding and language skills.
In the following cases, suggest python code (in a python coding block) or shell script (in a sh coding block) for the user to execute.
1. When you need to collect info, use the code to output the info you need, for example, browse or search the web, download/read a file, print the content of a webpage or a file, get the current date/time. After sufficient info is printed and the task is ready to be solved based on your language skill, you can solve the task by yourself.
2. When you need to perform some task with code, use the code to perform the task and output the result. Finish the task smartly.
Solve the task step by step if you need to. If a plan is not provided, explain your plan first. Be clear which step uses code, and which step uses your language skill.
When using code, you must indicate the script type in the code block. The user cannot provide any other feedback or perform any other action beyond executing the code you suggest. The user can't modify your code. So do not suggest incomplete code which requires users to modify. Don't use a code block if it's not intended to be executed by the user.
If you want the user to save the code in a file before executing it, put # filename: <filename> inside the code block as the first line. Don't include multiple code blocks in one response. Do not ask users to copy and paste the result. Instead, use 'print' function for the output when relevant. Check the execution result returned by the user.
If the result indicates there is an error, fix the error and output the code again. Suggest the full code instead of partial code or code changes. If the error can't be fixed or if the task is not solved even after the code is executed successfully, analyze the problem, revisit your assumption, collect additional info you need, and think of a different approach to try.
When you find an answer, verify the answer carefully. Include verifiable evidence in your response if possible.
Reply "TERMINATE" in the end when everything is done.
"""
def __init__(
self,
name: str,
system_message: Optional[str] = DEFAULT_SYSTEM_MESSAGE,
llm_config: Optional[Union[Dict, bool]] = None,
is_termination_msg: Optional[Callable[[Dict], bool]] = None,
max_consecutive_auto_reply: Optional[int] = None,
human_input_mode: Optional[str] = "NEVER",
code_execution_config: Optional[Union[Dict, bool]] = False,
**kwargs,
):
"""
Args:
name (str): agent name.
system_message (str): system message for the ChatCompletion inference.
Please override this attribute if you want to reprogram the agent.
llm_config (dict): llm inference configuration.
Please refer to [autogen.Completion.create](/docs/reference/autogen/oai/completion#create)
for available options.
is_termination_msg (function): a function that takes a message in the form of a dictionary
and returns a boolean value indicating if this received message is a termination message.
The dict can contain the following keys: "content", "role", "name", "function_call".
max_consecutive_auto_reply (int): the maximum number of consecutive auto replies.
Defaults to None (no limit provided; the class attribute MAX_CONSECUTIVE_AUTO_REPLY will be used as the limit in this case).
The limit only plays a role when human_input_mode is not "ALWAYS".
**kwargs (dict): Please refer to other kwargs in
[ConversableAgent](conversable_agent#__init__).
"""
super().__init__(
name,
system_message,
is_termination_msg,
max_consecutive_auto_reply,
human_input_mode,
code_execution_config=code_execution_config,
llm_config=llm_config,
**kwargs,
)
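A minimal construction sketch (the `llm_config` values below are placeholders for illustration; see the docstring above for the supported options):

```python
from flaml.autogen.agentchat import AssistantAgent

# Placeholder inference configuration, not a recommendation.
assistant = AssistantAgent(
    name="assistant",
    llm_config={"model": "gpt-4", "temperature": 0},
)
```

Because `human_input_mode` defaults to "NEVER" and `code_execution_config` to False, this agent suggests code but relies on a counterpart (typically a `UserProxyAgent`) to execute it.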

View File

@ -0,0 +1,456 @@
import re
import os
from pydantic import BaseModel, Extra, root_validator
from typing import Any, Callable, Dict, List, Optional, Tuple, Union
from time import sleep
from flaml.autogen.agentchat import Agent, UserProxyAgent
from flaml.autogen.code_utils import UNKNOWN, extract_code, execute_code, infer_lang
from flaml.autogen.math_utils import get_answer
PROMPTS = {
# default
"default": """Let's use Python to solve a math problem.
Query requirements:
You should always use the 'print' function for the output and use fractions/radical forms instead of decimals.
You can use packages like sympy to help you.
You must follow the formats below to write your code:
```python
# your code
```
First state the key idea to solve the problem. You may choose from three ways to solve the problem:
Case 1: If the problem can be solved with Python code directly, please write a program to solve it. You can enumerate all possible arrangements if needed.
Case 2: If the problem is mostly reasoning, you can solve it by yourself directly.
Case 3: If the problem cannot be handled in the above two ways, please follow this process:
1. Solve the problem step by step (do not over-divide the steps).
2. Take out any queries that can be asked through Python (for example, any calculations or equations that can be calculated).
3. Wait for me to give the results.
4. Continue if you think the result is correct. If the result is invalid or unexpected, please correct your query or reasoning.
After all the queries are run and you get the answer, put the answer in \\boxed{}.
Problem:
""",
# select python or wolfram
"two_tools": """Let's use two tools (Python and Wolfram alpha) to solve a math problem.
Query requirements:
You must follow the formats below to write your query:
For Wolfram Alpha:
```wolfram
# one wolfram query
```
For Python:
```python
# your code
```
When using Python, you should always use the 'print' function for the output and use fractions/radical forms instead of decimals. You can use packages like sympy to help you.
When using wolfram, give one query in each code block.
Please follow this process:
1. Solve the problem step by step (do not over-divide the steps).
2. Take out any queries that can be asked through Python or Wolfram Alpha, select the most suitable tool to be used (for example, any calculations or equations that can be calculated).
3. Wait for me to give the results.
4. Continue if you think the result is correct. If the result is invalid or unexpected, please correct your query or reasoning.
After all the queries are run and you get the answer, put the final answer in \\boxed{}.
Problem: """,
# use python step by step
"python": """Let's use Python to solve a math problem.
Query requirements:
You should always use the 'print' function for the output and use fractions/radical forms instead of decimals.
You can use packages like sympy to help you.
You must follow the formats below to write your code:
```python
# your code
```
Please follow this process:
1. Solve the problem step by step (do not over-divide the steps).
2. Take out any queries that can be asked through Python (for example, any calculations or equations that can be calculated).
3. Wait for me to give the results.
4. Continue if you think the result is correct. If the result is invalid or unexpected, please correct your query or reasoning.
After all the queries are run and you get the answer, put the answer in \\boxed{}.
Problem: """,
}
def _is_termination_msg_mathchat(message):
"""Check if a message is a termination message."""
if isinstance(message, dict):
message = message.get("content")
if message is None:
return False
cb = extract_code(message)
contain_code = False
for c in cb:
if c[0] == "python" or c[0] == "wolfram":
contain_code = True
break
return not contain_code and get_answer(message) is not None and get_answer(message) != ""
def _add_print_to_last_line(code):
"""Add print() to the last line of a string."""
# 1. check if there is already a print statement
if "print(" in code:
return code
# 2. extract the last line, enclose it in print() and return the new string
lines = code.splitlines()
last_line = lines[-1]
if "\t" in last_line or "=" in last_line:
return code
if "=" in last_line:
last_line = "print(" + last_line.split(" = ")[0] + ")"
lines.append(last_line)
else:
lines[-1] = "print(" + last_line + ")"
# 3. join the lines back together
return "\n".join(lines)
def _remove_print(code):
"""remove all print statements from a string."""
lines = code.splitlines()
lines = [line for line in lines if not line.startswith("print(")]
return "\n".join(lines)
class MathUserProxyAgent(UserProxyAgent):
"""(Experimental) A MathChat agent that can handle math problems."""
MAX_CONSECUTIVE_AUTO_REPLY = 15 # maximum number of consecutive auto replies (subject to future change)
DEFAULT_REPLY = "Continue. Please keep solving the problem until you need to query. (If you get to the answer, put it in \\boxed{}.)"
def __init__(
self,
name: Optional[str] = "MathChatAgent", # default set to MathChatAgent
is_termination_msg: Optional[
Callable[[Dict], bool]
] = _is_termination_msg_mathchat, # terminate if \boxed{} in message
human_input_mode: Optional[str] = "NEVER", # Fully automated
default_auto_reply: Optional[Union[str, Dict, None]] = DEFAULT_REPLY,
max_invalid_q_per_step=3, # a parameter needed in MathChat
**kwargs,
):
"""
Args:
name (str): name of the agent
is_termination_msg (function): a function that takes a message in the form of a dictionary and returns a boolean value indicating if this received message is a termination message.
The dict can contain the following keys: "content", "role", "name", "function_call".
human_input_mode (str): whether to ask for human inputs every time a message is received.
Possible values are "ALWAYS", "TERMINATE", "NEVER".
(1) When "ALWAYS", the agent prompts for human input every time a message is received.
Under this mode, the conversation stops when the human input is "exit",
or when is_termination_msg is True and there is no human input.
(2) When "TERMINATE", the agent only prompts for human input only when a termination message is received or
the number of auto reply reaches the max_consecutive_auto_reply.
(3) (Default) When "NEVER", the agent will never prompt for human input. Under this mode, the conversation stops
when the number of auto reply reaches the max_consecutive_auto_reply or when is_termination_msg is True.
default_auto_reply (str or dict or None): the default auto reply message when no code execution or llm-based reply is generated.
max_invalid_q_per_step (int): (ADDED) the maximum number of invalid queries per step.
**kwargs (dict): other kwargs in [UserProxyAgent](user_proxy_agent#__init__).
"""
super().__init__(
name=name,
is_termination_msg=is_termination_msg,
human_input_mode=human_input_mode,
default_auto_reply=default_auto_reply,
**kwargs,
)
self.register_reply([Agent, None], MathUserProxyAgent._generate_math_reply, 1)
# fixed var
self._max_invalid_q_per_step = max_invalid_q_per_step
# mutable
self._valid_q_count = 0
self._total_q_count = 0
self._accum_invalid_q_per_step = 0
self._previous_code = ""
self.last_reply = None
def generate_init_message(self, problem, prompt_type="default", customized_prompt=None):
"""Generate a prompt for the assitant agent with the given problem and prompt.
Args:
problem (str): the problem to be solved.
prompt_type (str): the type of the prompt. Possible values are "default", "python", "wolfram".
(1) "default": the prompt that allows the agent to choose between 3 ways to solve a problem:
1. write a python program to solve it directly.
2. solve it directly without python.
3. solve it step by step with python.
(2) "python":
a simplified prompt from the third way of the "default" prompt, that asks the assistant
to solve the problem step by step with python.
(3) "two_tools":
a simplified prompt similar to the "python" prompt, but allows the model to choose between
Python and Wolfram Alpha to solve the problem.
customized_prompt (str): a customized prompt to be used. If it is not None, the prompt_type will be ignored.
Returns:
str: the generated prompt ready to be sent to the assistant agent.
"""
self._reset()
if customized_prompt is not None:
return customized_prompt + problem
return PROMPTS[prompt_type] + problem
def _reset(self):
# super().reset()
self._valid_q_count = 0
self._total_q_count = 0
self._accum_invalid_q_per_step = 0
self._previous_code = ""
self.last_reply = None
def execute_one_python_code(self, pycode):
"""Execute python code blocks.
Previous python code will be saved and executed together with the new code.
the "print" function will also be added to the last line of the code if needed
"""
# Need to replace all "; " with "\n" to avoid syntax error when adding `print` to the last line
pycode = pycode.replace("; ", "\n").replace(";", "\n")
pycode = self._previous_code + _add_print_to_last_line(pycode)
return_code, output, _ = execute_code(pycode, **self._code_execution_config, timeout=5)
is_success = return_code == 0
if not is_success:
# Remove the file information from the error string
pattern = r'File "/[^"]+\.py", line \d+, in .+\n'
if isinstance(output, str):
output = re.sub(pattern, "", output)
output = "Error: " + output
elif output == "":
# Check if there is any print statement
if "print" not in pycode:
output = "No output found. Make sure you print the results."
is_success = False
else:
output = "No output found."
is_success = True
if len(output) > 2000:
output = "Your requested query response is too long. You might have made a mistake. Please revise your reasoning and query."
is_success = False
if is_success:
# remove print and check if it still works
tmp = self._previous_code + "\n" + _remove_print(pycode) + "\n"
rcode, _, _ = execute_code(tmp, **self._code_execution_config)
else:
# only add imports and check if it works
tmp = self._previous_code + "\n"
for line in pycode.split("\n"):
if "import" in line:
tmp += line + "\n"
rcode, _, _ = execute_code(tmp, **self._code_execution_config)
if rcode == 0:
self._previous_code = tmp
return output, is_success
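# Example (illustrative): with self._previous_code == "x = 1\n", calling
#   execute_one_python_code("y = x + 1")
# executes "x = 1\ny = x + 1\nprint(y)" and returns roughly ("2", True);
# successfully executed code is accumulated into self._previous_code for later queries.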
def execute_one_wolfram_query(self, query: str):
"""Run one wolfram query and return the output.
Args:
query: string of the query.
Returns:
output: string with the output of the query.
is_success: boolean indicating whether the query was successful.
"""
# wolfram query handler
wolfram = WolframAlphaAPIWrapper()
output, is_success = wolfram.run(query)
if output == "":
output = "Error: The wolfram query is invalid."
is_success = False
return output, is_success
def _generate_math_reply(
self,
messages: Optional[List[Dict]] = None,
sender: Optional[Agent] = None,
config: Optional[Any] = None,
):
"""Generate an auto reply."""
if messages is None:
messages = self._oai_messages[sender]
message = messages[-1]
message = message.get("content", "")
code_blocks = extract_code(message)
if len(code_blocks) == 1 and code_blocks[0][0] == UNKNOWN:
# no code block is found; lang should be `UNKNOWN`
return True, self._default_auto_reply
is_success, all_success = True, True
reply = ""
for code_block in code_blocks:
lang, code = code_block
if not lang:
lang = infer_lang(code)
if lang == "python":
output, is_success = self.execute_one_python_code(code)
elif lang == "wolfram":
output, is_success = self.execute_one_wolfram_query(code)
else:
output = "Error: Unknown language."
is_success = False
reply += output + "\n"
if not is_success:
all_success = False
self._valid_q_count -= 1  # deduct from the valid query count for the failed query
reply = reply.strip()
if self.last_reply == reply:
return True, reply + "\nYour query or result is the same as the last one; please try a new approach."
self.last_reply = reply
if not all_success:
self._accum_invalid_q_per_step += 1
if self._accum_invalid_q_per_step > self._max_invalid_q_per_step:
self._accum_invalid_q_per_step = 0
reply = "Please revisit the problem statement and your reasoning. If you think this step is correct, solve it yourself and continue the next step. Otherwise, correct this step."
return True, reply
# Modified based on langchain. Langchain is licensed under MIT License:
# The MIT License
# Copyright (c) Harrison Chase
# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
# in the Software without restriction, including without limitation the rights
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
# copies of the Software, and to permit persons to whom the Software is
# furnished to do so, subject to the following conditions:
# The above copyright notice and this permission notice shall be included in
# all copies or substantial portions of the Software.
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
# THE SOFTWARE.
def get_from_dict_or_env(data: Dict[str, Any], key: str, env_key: str, default: Optional[str] = None) -> str:
"""Get a value from a dictionary or an environment variable."""
if key in data and data[key]:
return data[key]
elif env_key in os.environ and os.environ[env_key]:
return os.environ[env_key]
elif default is not None:
return default
else:
raise ValueError(
f"Did not find {key}, please add an environment variable"
f" `{env_key}` which contains it, or pass"
f" `{key}` as a named parameter."
)
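# Example (illustrative): resolving the Wolfram Alpha app id used below.
#   get_from_dict_or_env({"wolfram_alpha_appid": "XXXX"}, "wolfram_alpha_appid", "WOLFRAM_ALPHA_APPID") -> "XXXX"
# With an empty dict it falls back to os.environ["WOLFRAM_ALPHA_APPID"], then to `default`,
# and raises ValueError if none of them is set.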
class WolframAlphaAPIWrapper(BaseModel):
"""Wrapper for Wolfram Alpha.
Docs for using:
1. Go to Wolfram Alpha and sign up for a developer account
2. Create an app and get your APP ID
3. Save your APP ID into WOLFRAM_ALPHA_APPID env variable
4. pip install wolframalpha
"""
wolfram_client: Any #: :meta private:
wolfram_alpha_appid: Optional[str] = None
class Config:
"""Configuration for this pydantic object."""
extra = Extra.forbid
@root_validator(skip_on_failure=True)
def validate_environment(cls, values: Dict) -> Dict:
"""Validate that api key and python package exists in environment."""
wolfram_alpha_appid = get_from_dict_or_env(values, "wolfram_alpha_appid", "WOLFRAM_ALPHA_APPID")
values["wolfram_alpha_appid"] = wolfram_alpha_appid
try:
import wolframalpha
except ImportError:
raise ImportError("wolframalpha is not installed. " "Please install it with `pip install wolframalpha`")
client = wolframalpha.Client(wolfram_alpha_appid)
values["wolfram_client"] = client
return values
def run(self, query: str) -> Tuple[str, bool]:
"""Run query through WolframAlpha and parse result."""
from urllib.error import HTTPError
is_success = False # added
res = None
for _ in range(20):
try:
res = self.wolfram_client.query(query)
break
except HTTPError:
sleep(1)
except Exception:
return (
"Wolfram Alpha wasn't able to answer it. Please try a new query for wolfram or use python.",
is_success,
)
if res is None:
return (
"Wolfram Alpha wasn't able to answer it (may due to web error), you can try again or use python.",
is_success,
)
try:
if not res["@success"]:
return (
"Your Wolfram query is invalid. Please try a new query for wolfram or use python.",
is_success,
)
assumption = next(res.pods).text
answer = ""
for result in res["pod"]:
if result["@title"] == "Solution":
answer = result["subpod"]["plaintext"]
if result["@title"] == "Results" or result["@title"] == "Solutions":
for i, sub in enumerate(result["subpod"]):
answer += f"ans {i}: " + sub["plaintext"] + "\n"
break
if answer == "":
answer = next(res.results).text
except Exception:
return (
"Wolfram Alpha wasn't able to answer it. Please try a new query for wolfram or use python.",
is_success,
)
if answer is None or answer == "":
# We don't want to return the assumption alone if answer is empty
return "No good Wolfram Alpha Result was found", is_success
is_success = True
return f"Assumption: {assumption} \nAnswer: {answer}", is_success

View File

@ -0,0 +1,43 @@
from flaml.autogen.agentchat.agent import Agent
from flaml.autogen.agentchat.assistant_agent import AssistantAgent
from typing import Callable, Dict, Optional, Union, List, Tuple, Any
class RetrieveAssistantAgent(AssistantAgent):
"""(Experimental) Retrieve Assistant agent, designed to solve a task with LLM.
RetrieveAssistantAgent is a subclass of AssistantAgent configured with a default system message.
The default system message is designed to solve a task with LLM,
including suggesting python code blocks and debugging.
`human_input_mode` defaults to "NEVER"
and `code_execution_config` defaults to False.
This agent doesn't execute code by default, and expects the user to execute the code.
"""
def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
self.register_reply(Agent, RetrieveAssistantAgent._generate_retrieve_assistant_reply)
def _generate_retrieve_assistant_reply(
self,
messages: Optional[List[Dict]] = None,
sender: Optional[Agent] = None,
config: Optional[Any] = None,
) -> Tuple[bool, Union[str, Dict, None]]:
if config is None:
config = self
if messages is None:
messages = self._oai_messages[sender]
message = messages[-1]
if "exitcode: 0 (execution succeeded)" in message.get("content", ""):
# Terminate the conversation when the code execution succeeds. Although sometimes even when the
# code execution succeeds, the task is not solved, but it's hard to tell. If the human_input_mode
# of RetrieveUserProxyAgent is "TERMINATE" or "ALWAYS", user can still continue the conversation.
return True, "TERMINATE"
elif (
"UPDATE CONTEXT" in message.get("content", "")[-20:].upper()
or "UPDATE CONTEXT" in message.get("content", "")[:20].upper()
):
return True, "UPDATE CONTEXT"
else:
return False, None

View File

@ -0,0 +1,305 @@
import chromadb
from flaml.autogen.agentchat.agent import Agent
from flaml.autogen.agentchat import UserProxyAgent
from flaml.autogen.retrieve_utils import create_vector_db_from_dir, query_vector_db, num_tokens_from_text
from flaml.autogen.code_utils import extract_code
from typing import Callable, Dict, Optional, Union, List, Tuple, Any
from IPython import get_ipython
try:
from termcolor import colored
except ImportError:
def colored(x, *args, **kwargs):
return x
PROMPT_DEFAULT = """You're a retrieve augmented chatbot. You answer user's questions based on your own knowledge and the
context provided by the user. You should follow the following steps to answer a question:
Step 1, you estimate the user's intent based on the question and context. The intent can be a code generation task or
a question answering task.
Step 2, you reply based on the intent.
If you can't answer the question with or without the current context, you should reply exactly `UPDATE CONTEXT`.
If user's intent is code generation, you must obey the following rules:
Rule 1. You MUST NOT install any packages because all the packages needed are already installed.
Rule 2. You must follow the formats below to write your code:
```language
# your code
```
If user's intent is question answering, you must give as short an answer as possible.
User's question is: {input_question}
Context is: {input_context}
"""
PROMPT_CODE = """You're a retrieve augmented coding assistant. You answer user's questions based on your own knowledge and the
context provided by the user.
If you can't answer the question with or without the current context, you should reply exactly `UPDATE CONTEXT`.
For code generation, you must obey the following rules:
Rule 1. You MUST NOT install any packages because all the packages needed are already installed.
Rule 2. You must follow the formats below to write your code:
```language
# your code
```
User's question is: {input_question}
Context is: {input_context}
"""
PROMPT_QA = """You're a retrieve augmented chatbot. You answer user's questions based on your own knowledge and the
context provided by the user.
If you can't answer the question with or without the current context, you should reply exactly `UPDATE CONTEXT`.
You must give as short an answer as possible.
User's question is: {input_question}
Context is: {input_context}
"""
def _is_termination_msg_retrievechat(message):
"""Check if a message is a termination message."""
if isinstance(message, dict):
message = message.get("content")
if message is None:
return False
cb = extract_code(message)
contain_code = False
for c in cb:
if c[0] == "python":
contain_code = True
break
return not contain_code
class RetrieveUserProxyAgent(UserProxyAgent):
def __init__(
self,
name="RetrieveChatAgent", # default set to RetrieveChatAgent
is_termination_msg: Optional[Callable[[Dict], bool]] = _is_termination_msg_retrievechat,
human_input_mode: Optional[str] = "ALWAYS",
retrieve_config: Optional[Dict] = None, # config for the retrieve agent
**kwargs,
):
"""
Args:
name (str): name of the agent.
human_input_mode (str): whether to ask for human inputs every time a message is received.
Possible values are "ALWAYS", "TERMINATE", "NEVER".
(1) When "ALWAYS", the agent prompts for human input every time a message is received.
Under this mode, the conversation stops when the human input is "exit",
or when is_termination_msg is True and there is no human input.
(2) When "TERMINATE", the agent only prompts for human input only when a termination message is received or
the number of auto reply reaches the max_consecutive_auto_reply.
(3) When "NEVER", the agent will never prompt for human input. Under this mode, the conversation stops
when the number of auto reply reaches the max_consecutive_auto_reply or when is_termination_msg is True.
retrieve_config (dict or None): config for the retrieve agent.
To use default config, set to None. Otherwise, set to a dictionary with the following keys:
- task (Optional, str): the task of the retrieve chat. Possible values are "code", "qa" and "default". System
prompt will be different for different tasks. The default value is `default`, which supports both code and qa.
- client (Optional, chromadb.Client): the chromadb client.
If key not provided, a default client `chromadb.Client()` will be used.
- docs_path (Optional, str): the path to the docs directory. It can also be the path to a single file,
or the url to a single file. If key not provided, a default path `./docs` will be used.
- collection_name (Optional, str): the name of the collection.
If key not provided, a default name `flaml-docs` will be used.
- model (Optional, str): the model to use for the retrieve chat.
If key not provided, a default model `gpt-4` will be used.
- chunk_token_size (Optional, int): the chunk token size for the retrieve chat.
If key not provided, a default size `max_tokens * 0.4` will be used.
- context_max_tokens (Optional, int): the context max token size for the retrieve chat.
If key not provided, a default size `max_tokens * 0.8` will be used.
- chunk_mode (Optional, str): the chunk mode for the retrieve chat. Possible values are
"multi_lines" and "one_line". If key not provided, a default mode `multi_lines` will be used.
- must_break_at_empty_line (Optional, bool): chunk will only break at empty line if True. Default is True.
If chunk_mode is "one_line", this parameter will be ignored.
- embedding_model (Optional, str): the embedding model to use for the retrieve chat.
If key not provided, a default model `all-MiniLM-L6-v2` will be used. All available models
can be found at `https://www.sbert.net/docs/pretrained_models.html`. The default model is a
fast model. If you want to use a high performance model, `all-mpnet-base-v2` is recommended.
- customized_prompt (Optional, str): the customized prompt for the retrieve chat. Default is None.
**kwargs (dict): other kwargs in [UserProxyAgent](user_proxy_agent#__init__).
"""
super().__init__(
name=name,
is_termination_msg=is_termination_msg,
human_input_mode=human_input_mode,
**kwargs,
)
self._retrieve_config = {} if retrieve_config is None else retrieve_config
self._task = self._retrieve_config.get("task", "default")
self._client = self._retrieve_config.get("client", chromadb.Client())
self._docs_path = self._retrieve_config.get("docs_path", "./docs")
self._collection_name = self._retrieve_config.get("collection_name", "flaml-docs")
self._model = self._retrieve_config.get("model", "gpt-4")
self._max_tokens = self.get_max_tokens(self._model)
self._chunk_token_size = int(self._retrieve_config.get("chunk_token_size", self._max_tokens * 0.4))
self._chunk_mode = self._retrieve_config.get("chunk_mode", "multi_lines")
self._must_break_at_empty_line = self._retrieve_config.get("must_break_at_empty_line", True)
self._embedding_model = self._retrieve_config.get("embedding_model", "all-MiniLM-L6-v2")
self.customized_prompt = self._retrieve_config.get("customized_prompt", None)
self._context_max_tokens = self._max_tokens * 0.8
self._collection = False # the collection is not created
self._ipython = get_ipython()
self._doc_idx = -1 # the index of the current used doc
self._results = {} # the results of the current query
self.register_reply(Agent, RetrieveUserProxyAgent._generate_retrieve_user_reply)
@staticmethod
def get_max_tokens(model="gpt-3.5-turbo"):
if "32k" in model:
return 32000
elif "16k" in model:
return 16000
elif "gpt-4" in model:
return 8000
else:
return 4000
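# E.g. (illustrative): get_max_tokens("gpt-4-32k") -> 32000,
#   get_max_tokens("gpt-3.5-turbo-16k") -> 16000, get_max_tokens() -> 4000.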
def _reset(self):
self._doc_idx = -1 # the index of the current used doc
self._results = {} # the results of the current query
def _get_context(self, results):
doc_contents = ""
current_tokens = 0
_doc_idx = self._doc_idx
for idx, doc in enumerate(results["documents"][0]):
if idx <= _doc_idx:
continue
_doc_tokens = num_tokens_from_text(doc)
if _doc_tokens > self._context_max_tokens:
func_print = f"Skip doc_id {results['ids'][0][idx]} as it is too long to fit in the context."
print(colored(func_print, "green"), flush=True)
self._doc_idx = idx
continue
if current_tokens + _doc_tokens > self._context_max_tokens:
break
func_print = f"Adding doc_id {results['ids'][0][idx]} to context."
print(colored(func_print, "green"), flush=True)
current_tokens += _doc_tokens
doc_contents += doc + "\n"
self._doc_idx = idx
return doc_contents
def _generate_message(self, doc_contents, task="default"):
if not doc_contents:
print(colored("No more context, will terminate.", "green"), flush=True)
return "TERMINATE"
if self.customized_prompt:
message = self.customized_prompt + "\nUser's question is: " + self.problem + "\nContext is: " + doc_contents
elif task.upper() == "CODE":
message = PROMPT_CODE.format(input_question=self.problem, input_context=doc_contents)
elif task.upper() == "QA":
message = PROMPT_QA.format(input_question=self.problem, input_context=doc_contents)
elif task.upper() == "DEFAULT":
message = PROMPT_DEFAULT.format(input_question=self.problem, input_context=doc_contents)
else:
raise NotImplementedError(f"task {task} is not implemented.")
return message
def _generate_retrieve_user_reply(
self,
messages: Optional[List[Dict]] = None,
sender: Optional[Agent] = None,
config: Optional[Any] = None,
) -> Tuple[bool, Union[str, Dict, None]]:
if config is None:
config = self
if messages is None:
messages = self._oai_messages[sender]
message = messages[-1]
if (
"UPDATE CONTEXT" in message.get("content", "")[-20:].upper()
or "UPDATE CONTEXT" in message.get("content", "")[:20].upper()
):
print(colored("Updating context and resetting conversation.", "green"), flush=True)
self.clear_history()
sender.clear_history()
doc_contents = self._get_context(self._results)
return True, self._generate_message(doc_contents, task=self._task)
return False, None
def retrieve_docs(self, problem: str, n_results: int = 20, search_string: str = ""):
if not self._collection:
create_vector_db_from_dir(
dir_path=self._docs_path,
max_tokens=self._chunk_token_size,
client=self._client,
collection_name=self._collection_name,
chunk_mode=self._chunk_mode,
must_break_at_empty_line=self._must_break_at_empty_line,
embedding_model=self._embedding_model,
)
self._collection = True
results = query_vector_db(
query_texts=[problem],
n_results=n_results,
search_string=search_string,
client=self._client,
collection_name=self._collection_name,
embedding_model=self._embedding_model,
)
self._results = results
print("doc_ids: ", results["ids"])
def generate_init_message(self, problem: str, n_results: int = 20, search_string: str = ""):
"""Generate an initial message with the given problem and prompt.
Args:
problem (str): the problem to be solved.
n_results (int): the number of results to be retrieved.
search_string (str): only docs containing this string will be retrieved.
Returns:
str: the generated prompt ready to be sent to the assistant agent.
"""
self._reset()
self.retrieve_docs(problem, n_results, search_string)
self.problem = problem
doc_contents = self._get_context(self._results)
message = self._generate_message(doc_contents, self._task)
return message
def run_code(self, code, **kwargs):
lang = kwargs.get("lang", None)
if code.startswith("!") or code.startswith("pip") or lang in ["bash", "shell", "sh"]:
return (
0,
"You MUST NOT install any packages because all the packages needed are already installed.",
None,
)
if self._ipython is None or lang != "python":
return super().run_code(code, **kwargs)
else:
# # capture may not work as expected
# result = self._ipython.run_cell("%%capture --no-display cap\n" + code)
# log = self._ipython.ev("cap.stdout")
# log += self._ipython.ev("cap.stderr")
# if result.result is not None:
# log += str(result.result)
# exitcode = 0 if result.success else 1
# if result.error_before_exec is not None:
# log += f"\n{result.error_before_exec}"
# exitcode = 1
# if result.error_in_exec is not None:
# log += f"\n{result.error_in_exec}"
# exitcode = 1
# return exitcode, log, None
result = self._ipython.run_cell(code)
log = str(result.result)
exitcode = 0 if result.success else 1
if result.error_before_exec is not None:
log += f"\n{result.error_before_exec}"
exitcode = 1
if result.error_in_exec is not None:
log += f"\n{result.error_in_exec}"
exitcode = 1
return exitcode, log, None
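A minimal RetrieveChat sketch combining the two retrieve agents above. The import paths and the `llm_config`/`docs_path` values are assumptions for illustration:

```python
import chromadb

from flaml.autogen.agentchat.contrib.retrieve_assistant_agent import RetrieveAssistantAgent  # path assumed
from flaml.autogen.agentchat.contrib.retrieve_user_proxy_agent import RetrieveUserProxyAgent  # path assumed

assistant = RetrieveAssistantAgent(name="assistant", llm_config={"model": "gpt-4"})  # placeholder config
ragproxy = RetrieveUserProxyAgent(
    name="RetrieveChatAgent",
    human_input_mode="NEVER",
    retrieve_config={"task": "qa", "docs_path": "./docs", "client": chromadb.Client()},
)

# initiate_chat forwards its keyword context to generate_init_message(problem, n_results, search_string),
# which retrieves relevant chunks from the vector db and builds the task-specific prompt.
ragproxy.initiate_chat(assistant, problem="How do I set chunk_token_size for the retrieve chat?")
```

When the assistant replies "UPDATE CONTEXT", `_generate_retrieve_user_reply` swaps in the next batch of retrieved documents and restarts the conversation automatically.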

View File

@ -0,0 +1,998 @@
import asyncio
from collections import defaultdict
import copy
import json
from typing import Any, Callable, Dict, List, Optional, Tuple, Type, Union
from flaml.autogen import oai
from .agent import Agent
from flaml.autogen.code_utils import (
DEFAULT_MODEL,
UNKNOWN,
execute_code,
extract_code,
infer_lang,
)
try:
from termcolor import colored
except ImportError:
def colored(x, *args, **kwargs):
return x
class ConversableAgent(Agent):
"""(In preview) A class for generic conversable agents which can be configured as assistant or user proxy.
After receiving each message, the agent will send a reply to the sender unless the message is a termination message.
For example, AssistantAgent and UserProxyAgent are subclasses of this class,
configured with different default settings.
To modify auto reply, override `generate_reply` method.
To disable/enable human response in every turn, set `human_input_mode` to "NEVER" or "ALWAYS".
To modify the way to get human input, override `get_human_input` method.
To modify the way to execute code blocks, single code block, or function call, override `execute_code_blocks`,
`run_code`, and `execute_function` methods respectively.
To customize the initial message when a conversation starts, override `generate_init_message` method.
"""
DEFAULT_CONFIG = {
"model": DEFAULT_MODEL,
}
MAX_CONSECUTIVE_AUTO_REPLY = 100 # maximum number of consecutive auto replies (subject to future change)
def __init__(
self,
name: str,
system_message: Optional[str] = "You are a helpful AI Assistant.",
is_termination_msg: Optional[Callable[[Dict], bool]] = None,
max_consecutive_auto_reply: Optional[int] = None,
human_input_mode: Optional[str] = "TERMINATE",
function_map: Optional[Dict[str, Callable]] = None,
code_execution_config: Optional[Union[Dict, bool]] = None,
llm_config: Optional[Union[Dict, bool]] = None,
default_auto_reply: Optional[Union[str, Dict, None]] = "",
):
"""
Args:
name (str): name of the agent.
system_message (str): system message for the ChatCompletion inference.
is_termination_msg (function): a function that takes a message in the form of a dictionary
and returns a boolean value indicating if this received message is a termination message.
The dict can contain the following keys: "content", "role", "name", "function_call".
max_consecutive_auto_reply (int): the maximum number of consecutive auto replies.
Defaults to None (no limit provided; the class attribute MAX_CONSECUTIVE_AUTO_REPLY will be used as the limit in this case).
When set to 0, no auto reply will be generated.
human_input_mode (str): whether to ask for human inputs every time a message is received.
Possible values are "ALWAYS", "TERMINATE", "NEVER".
(1) When "ALWAYS", the agent prompts for human input every time a message is received.
Under this mode, the conversation stops when the human input is "exit",
or when is_termination_msg is True and there is no human input.
(2) When "TERMINATE", the agent only prompts for human input only when a termination message is received or
the number of auto reply reaches the max_consecutive_auto_reply.
(3) When "NEVER", the agent will never prompt for human input. Under this mode, the conversation stops
when the number of auto reply reaches the max_consecutive_auto_reply or when is_termination_msg is True.
function_map (dict[str, callable]): Mapping function names (passed to openai) to callable functions.
code_execution_config (dict or False): config for the code execution.
To disable code execution, set to False. Otherwise, set to a dictionary with the following keys:
- work_dir (Optional, str): The working directory for the code execution.
If None, a default working directory will be used.
The default working directory is the "extensions" directory under
"path_to_flaml/autogen".
- use_docker (Optional, list, str or bool): The docker image to use for code execution.
If a list or a str of image name(s) is provided, the code will be executed in a docker container
with the first image successfully pulled.
If None, False or empty, the code will be executed in the current environment.
Default is True, which will be converted into a list.
If the code is executed in the current environment,
the code must be trusted.
- timeout (Optional, int): The maximum execution time in seconds.
- last_n_messages (Experimental, Optional, int): The number of messages to look back for code execution. Default to 1.
llm_config (dict or False): llm inference configuration.
Please refer to [autogen.Completion.create](/docs/reference/autogen/oai/completion#create)
for available options.
To disable llm-based auto reply, set to False.
default_auto_reply (str or dict or None): default auto reply when no code execution or llm-based reply is generated.
"""
super().__init__(name)
# a dictionary of conversations, default value is list
self._oai_messages = defaultdict(list)
self._oai_system_message = [{"content": system_message, "role": "system"}]
self._is_termination_msg = (
is_termination_msg if is_termination_msg is not None else (lambda x: x.get("content") == "TERMINATE")
)
if llm_config is False:
self.llm_config = False
else:
self.llm_config = self.DEFAULT_CONFIG.copy()
if isinstance(llm_config, dict):
self.llm_config.update(llm_config)
self._code_execution_config = {} if code_execution_config is None else code_execution_config
self.human_input_mode = human_input_mode
self._max_consecutive_auto_reply = (
max_consecutive_auto_reply if max_consecutive_auto_reply is not None else self.MAX_CONSECUTIVE_AUTO_REPLY
)
self._consecutive_auto_reply_counter = defaultdict(int)
self._max_consecutive_auto_reply_dict = defaultdict(self.max_consecutive_auto_reply)
self._function_map = {} if function_map is None else function_map
self._default_auto_reply = default_auto_reply
self._reply_func_list = []
self.reply_at_receive = defaultdict(bool)
self.register_reply([Agent, None], ConversableAgent.generate_oai_reply)
self.register_reply([Agent, None], ConversableAgent.generate_code_execution_reply)
self.register_reply([Agent, None], ConversableAgent.generate_function_call_reply)
self.register_reply([Agent, None], ConversableAgent.check_termination_and_human_reply)
def register_reply(
self,
trigger: Union[Type[Agent], str, Agent, Callable[[Agent], bool], List],
reply_func: Callable,
position: Optional[int] = 0,
config: Optional[Any] = None,
reset_config: Optional[Callable] = None,
):
"""Register a reply function.
The reply function will be called when the trigger matches the sender.
The function registered later will be checked earlier by default.
To change the order, set the position to a positive integer.
Args:
trigger (Agent class, str, Agent instance, callable, or list): the trigger.
- If a class is provided, the reply function will be called when the sender is an instance of the class.
- If a string is provided, the reply function will be called when the sender's name matches the string.
- If an agent instance is provided, the reply function will be called when the sender is the agent instance.
- If a callable is provided, the reply function will be called when the callable returns True.
- If a list is provided, the reply function will be called when any of the triggers in the list is activated.
- If None is provided, the reply function will be called only when the sender is None.
Note: Be sure to register `None` as a trigger if you would like to trigger an auto-reply function with non-empty messages and `sender=None`.
reply_func (Callable): the reply function.
The function takes a recipient agent, a list of messages, a sender agent, and a config as input, and returns a tuple of (final, reply): when `final` is True, the reply is used and the remaining reply functions are skipped.
```python
def reply_func(
recipient: ConversableAgent,
messages: Optional[List[Dict]] = None,
sender: Optional[Agent] = None,
config: Optional[Any] = None,
) -> Tuple[bool, Union[str, Dict, None]]:
```
position (int): the position of the reply function in the reply function list.
The function registered later will be checked earlier by default.
To change the order, set the position to a positive integer.
config (Any): the config to be passed to the reply function.
When an agent is reset, the config will be reset to the original value.
reset_config (Callable): the function to reset the config.
The function returns None. Signature: ```def reset_config(config: Any)```
"""
if not isinstance(trigger, (type, str, Agent, Callable, list)):
raise ValueError("trigger must be a class, a string, an agent, a callable or a list.")
self._reply_func_list.insert(
position,
{
"trigger": trigger,
"reply_func": reply_func,
"config": copy.copy(config),
"init_config": config,
"reset_config": reset_config,
},
)
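# Example (illustrative): a custom reply function and its registration.
# Registered reply functions return a (final, reply) tuple; returning (False, None)
# falls through to the next reply function in the list.
#
#     def count_turns(recipient, messages=None, sender=None, config=None):
#         config["n"] = config.get("n", 0) + 1  # `config` is the state passed at registration
#         return False, None
#
#     agent.register_reply([Agent, None], count_turns, position=1, config={})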
@property
def system_message(self):
"""Return the system message."""
return self._oai_system_message[0]["content"]
def update_system_message(self, system_message: str):
"""Update the system message.
Args:
system_message (str): system message for the ChatCompletion inference.
"""
self._oai_system_message[0]["content"] = system_message
def update_max_consecutive_auto_reply(self, value: int, sender: Optional[Agent] = None):
"""Update the maximum number of consecutive auto replies.
Args:
value (int): the maximum number of consecutive auto replies.
sender (Agent): when the sender is provided, only update the max_consecutive_auto_reply for that sender.
"""
if sender is None:
self._max_consecutive_auto_reply = value
for k in self._max_consecutive_auto_reply_dict:
self._max_consecutive_auto_reply_dict[k] = value
else:
self._max_consecutive_auto_reply_dict[sender] = value
def max_consecutive_auto_reply(self, sender: Optional[Agent] = None) -> int:
"""The maximum number of consecutive auto replies."""
return self._max_consecutive_auto_reply if sender is None else self._max_consecutive_auto_reply_dict[sender]
@property
def chat_messages(self) -> Dict[str, List[Dict]]:
"""A dictionary of conversations from name to list of ChatCompletion messages."""
return self._oai_messages
def last_message(self, agent: Optional[Agent] = None) -> Dict:
"""The last message exchanged with the agent.
Args:
agent (Agent): The agent in the conversation.
If None and more than one agent's conversations are found, an error will be raised.
If None and only one conversation is found, the last message of the only conversation will be returned.
Returns:
The last message exchanged with the agent.
"""
if agent is None:
n_conversations = len(self._oai_messages)
if n_conversations == 0:
return None
if n_conversations == 1:
for conversation in self._oai_messages.values():
return conversation[-1]
raise ValueError("More than one conversation is found. Please specify the sender to get the last message.")
return self._oai_messages[agent][-1]
@property
def use_docker(self) -> Union[bool, str, None]:
"""Bool value of whether to use docker to execute the code,
or str value of the docker image name to use, or None when code execution is disabled.
"""
return None if self._code_execution_config is False else self._code_execution_config.get("use_docker")
@staticmethod
def _message_to_dict(message: Union[Dict, str]):
"""Convert a message to a dictionary.
The message can be a string or a dictionary. The string will be put in the "content" field of the new dictionary.
"""
if isinstance(message, str):
return {"content": message}
else:
return message
def _append_oai_message(self, message: Union[Dict, str], role, conversation_id: Agent) -> bool:
"""Append a message to the ChatCompletion conversation.
If the message received is a string, it will be put in the "content" field of the new dictionary.
If the message received is a dictionary but does not have any of the two fields "content" or "function_call",
this message is not a valid ChatCompletion message.
Args:
message (dict or str): message to be appended to the ChatCompletion conversation.
role (str): role of the message, can be "assistant" or "function".
conversation_id (Agent): id of the conversation, should be the recipient or sender.
Returns:
bool: whether the message is appended to the ChatCompletion conversation.
"""
message = self._message_to_dict(message)
# create oai message to be appended to the oai conversation that can be passed to oai directly.
oai_message = {k: message[k] for k in ("content", "function_call", "name", "context") if k in message}
if "content" not in oai_message and "function_call" not in oai_message:
return False
oai_message["role"] = "function" if message.get("role") == "function" else role
self._oai_messages[conversation_id].append(oai_message)
return True
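# Example (illustrative): self._append_oai_message("hi", "assistant", recipient) appends
#   {"content": "hi", "role": "assistant"} to self._oai_messages[recipient] and returns True;
# a dict without "content" or "function_call" is rejected and False is returned.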
def send(
self,
message: Union[Dict, str],
recipient: Agent,
request_reply: Optional[bool] = None,
silent: Optional[bool] = False,
) -> bool:
"""Send a message to another agent.
Args:
message (dict or str): message to be sent.
The message could contain the following fields (either content or function_call must be provided):
- content (str): the content of the message.
- function_call (str): the name of the function to be called.
- name (str): the name of the function to be called.
- role (str): the role of the message, any role that is not "function"
will be modified to "assistant".
- context (dict): the context of the message, which will be passed to
[autogen.Completion.create](../oai/Completion#create).
For example, one agent can send a message A as:
```python
{
"content": lambda context: context["use_tool_msg"],
"context": {
"use_tool_msg": "Use tool X if they are relevant."
}
}
```
Next time, one agent can send a message B with a different "use_tool_msg".
Then the content of message A will be refreshed to the new "use_tool_msg".
So effectively, this provides a way for an agent to send a "link" and modify
the content of the "link" later.
recipient (Agent): the recipient of the message.
request_reply (bool or None): whether to request a reply from the recipient.
silent (bool or None): (Experimental) whether to print the message sent.
Raises:
ValueError: if the message can't be converted into a valid ChatCompletion message.
"""
# When the agent composes and sends the message, the role of the message is "assistant"
# unless it's "function".
valid = self._append_oai_message(message, "assistant", recipient)
if valid:
recipient.receive(message, self, request_reply, silent)
else:
raise ValueError(
"Message can't be converted into a valid ChatCompletion message. Either content or function_call must be provided."
)
async def a_send(
self,
message: Union[Dict, str],
recipient: Agent,
request_reply: Optional[bool] = None,
silent: Optional[bool] = False,
) -> bool:
"""(async) Send a message to another agent.
Args:
message (dict or str): message to be sent.
The message could contain the following fields (either content or function_call must be provided):
- content (str): the content of the message.
- function_call (str): the name of the function to be called.
- name (str): the name of the function to be called.
- role (str): the role of the message, any role that is not "function"
will be modified to "assistant".
- context (dict): the context of the message, which will be passed to
[autogen.Completion.create](../oai/Completion#create).
For example, one agent can send a message A as:
```python
{
"content": lambda context: context["use_tool_msg"],
"context": {
"use_tool_msg": "Use tool X if they are relevant."
}
}
```
Next time, one agent can send a message B with a different "use_tool_msg".
Then the content of message A will be refreshed to the new "use_tool_msg".
So effectively, this provides a way for an agent to send a "link" and modify
the content of the "link" later.
recipient (Agent): the recipient of the message.
request_reply (bool or None): whether to request a reply from the recipient.
silent (bool or None): (Experimental) whether to print the message sent.
Raises:
ValueError: if the message can't be converted into a valid ChatCompletion message.
"""
# When the agent composes and sends the message, the role of the message is "assistant"
# unless it's "function".
valid = self._append_oai_message(message, "assistant", recipient)
if valid:
await recipient.a_receive(message, self, request_reply, silent)
else:
raise ValueError(
"Message can't be converted into a valid ChatCompletion message. Either content or function_call must be provided."
)
def _print_received_message(self, message: Union[Dict, str], sender: Agent):
# print the message received
print(colored(sender.name, "yellow"), "(to", f"{self.name}):\n", flush=True)
if message.get("role") == "function":
func_print = f"***** Response from calling function \"{message['name']}\" *****"
print(colored(func_print, "green"), flush=True)
print(message["content"], flush=True)
print(colored("*" * len(func_print), "green"), flush=True)
else:
content = message.get("content")
if content is not None:
if "context" in message:
content = oai.ChatCompletion.instantiate(
content,
message["context"],
self.llm_config and self.llm_config.get("allow_format_str_template", False),
)
print(content, flush=True)
if "function_call" in message:
func_print = f"***** Suggested function Call: {message['function_call'].get('name', '(No function name found)')} *****"
print(colored(func_print, "green"), flush=True)
print(
"Arguments: \n",
message["function_call"].get("arguments", "(No arguments found)"),
flush=True,
sep="",
)
print(colored("*" * len(func_print), "green"), flush=True)
print("\n", "-" * 80, flush=True, sep="")
def _process_received_message(self, message, sender, silent):
message = self._message_to_dict(message)
# When the agent receives a message, the role of the message is "user". (If 'role' exists and is 'function', it will remain unchanged.)
valid = self._append_oai_message(message, "user", sender)
if not valid:
raise ValueError(
"Received message can't be converted into a valid ChatCompletion message. Either content or function_call must be provided."
)
if not silent:
self._print_received_message(message, sender)
def receive(
self,
message: Union[Dict, str],
sender: Agent,
request_reply: Optional[bool] = None,
silent: Optional[bool] = False,
):
"""Receive a message from another agent.
Once a message is received, this function sends a reply to the sender or stops.
The reply can be generated automatically or entered manually by a human.
Args:
message (dict or str): message from the sender. If the type is dict, it may contain the following reserved fields (either content or function_call need to be provided).
1. "content": content of the message, can be None.
2. "function_call": a dictionary containing the function name and arguments.
3. "role": role of the message, can be "assistant", "user", "function".
This field is only needed to distinguish between "function" or "assistant"/"user".
4. "name": In most cases, this field is not needed. When the role is "function", this field is needed to indicate the function name.
5. "context" (dict): the context of the message, which will be passed to
[autogen.Completion.create](../oai/Completion#create).
sender: sender of an Agent instance.
request_reply (bool or None): whether a reply is requested from the sender.
If None, the value is determined by `self.reply_at_receive[sender]`.
silent (bool or None): (Experimental) whether to print the message received.
Raises:
ValueError: if the message can't be converted into a valid ChatCompletion message.
"""
self._process_received_message(message, sender, silent)
if request_reply is False or request_reply is None and self.reply_at_receive[sender] is False:
return
reply = self.generate_reply(messages=self.chat_messages[sender], sender=sender)
if reply is not None:
self.send(reply, sender, silent=silent)
async def a_receive(
self,
message: Union[Dict, str],
sender: Agent,
request_reply: Optional[bool] = None,
silent: Optional[bool] = False,
):
"""(async) Receive a message from another agent.
Once a message is received, this function sends a reply to the sender or stops.
The reply can be generated automatically or entered manually by a human.
Args:
message (dict or str): message from the sender. If the type is dict, it may contain the following reserved fields (either content or function_call need to be provided).
1. "content": content of the message, can be None.
2. "function_call": a dictionary containing the function name and arguments.
3. "role": role of the message, can be "assistant", "user", "function".
This field is only needed to distinguish between "function" or "assistant"/"user".
4. "name": In most cases, this field is not needed. When the role is "function", this field is needed to indicate the function name.
5. "context" (dict): the context of the message, which will be passed to
[autogen.Completion.create](../oai/Completion#create).
sender: sender of an Agent instance.
request_reply (bool or None): whether a reply is requested from the sender.
If None, the value is determined by `self.reply_at_receive[sender]`.
silent (bool or None): (Experimental) whether to print the message received.
Raises:
ValueError: if the message can't be converted into a valid ChatCompletion message.
"""
self._process_received_message(message, sender, silent)
        if request_reply is False or (request_reply is None and self.reply_at_receive[sender] is False):
return
reply = await self.a_generate_reply(sender=sender)
if reply is not None:
await self.a_send(reply, sender, silent=silent)
def _prepare_chat(self, recipient, clear_history):
self.reset_consecutive_auto_reply_counter(recipient)
recipient.reset_consecutive_auto_reply_counter(self)
self.reply_at_receive[recipient] = recipient.reply_at_receive[self] = True
if clear_history:
self.clear_history(recipient)
recipient.clear_history(self)
def initiate_chat(
self,
recipient: "ConversableAgent",
clear_history: Optional[bool] = True,
silent: Optional[bool] = False,
**context,
):
"""Initiate a chat with the recipient agent.
Reset the consecutive auto reply counter.
If `clear_history` is True, the chat history with the recipient agent will be cleared.
`generate_init_message` is called to generate the initial message for the agent.
Args:
recipient: the recipient agent.
clear_history (bool): whether to clear the chat history with the agent.
silent (bool or None): (Experimental) whether to print the messages for this conversation.
**context: any context information.
"message" needs to be provided if the `generate_init_message` method is not overridden.
"""
self._prepare_chat(recipient, clear_history)
self.send(self.generate_init_message(**context), recipient, silent=silent)
async def a_initiate_chat(
self,
recipient: "ConversableAgent",
clear_history: Optional[bool] = True,
silent: Optional[bool] = False,
**context,
):
"""(async) Initiate a chat with the recipient agent.
Reset the consecutive auto reply counter.
If `clear_history` is True, the chat history with the recipient agent will be cleared.
`generate_init_message` is called to generate the initial message for the agent.
Args:
recipient: the recipient agent.
clear_history (bool): whether to clear the chat history with the agent.
silent (bool or None): (Experimental) whether to print the messages for this conversation.
**context: any context information.
"message" needs to be provided if the `generate_init_message` method is not overridden.
"""
self._prepare_chat(recipient, clear_history)
await self.a_send(self.generate_init_message(**context), recipient, silent=silent)
def reset(self):
"""Reset the agent."""
self.clear_history()
self.reset_consecutive_auto_reply_counter()
self.stop_reply_at_receive()
for reply_func_tuple in self._reply_func_list:
if reply_func_tuple["reset_config"] is not None:
reply_func_tuple["reset_config"](reply_func_tuple["config"])
else:
reply_func_tuple["config"] = copy.copy(reply_func_tuple["init_config"])
def stop_reply_at_receive(self, sender: Optional[Agent] = None):
"""Reset the reply_at_receive of the sender."""
if sender is None:
self.reply_at_receive.clear()
else:
self.reply_at_receive[sender] = False
def reset_consecutive_auto_reply_counter(self, sender: Optional[Agent] = None):
"""Reset the consecutive_auto_reply_counter of the sender."""
if sender is None:
self._consecutive_auto_reply_counter.clear()
else:
self._consecutive_auto_reply_counter[sender] = 0
def clear_history(self, agent: Optional[Agent] = None):
"""Clear the chat history of the agent.
Args:
agent: the agent with whom the chat history to clear. If None, clear the chat history with all agents.
"""
if agent is None:
self._oai_messages.clear()
else:
self._oai_messages[agent].clear()
def generate_oai_reply(
self,
messages: Optional[List[Dict]] = None,
sender: Optional[Agent] = None,
config: Optional[Any] = None,
) -> Tuple[bool, Union[str, Dict, None]]:
"""Generate a reply using autogen.oai."""
llm_config = self.llm_config if config is None else config
if llm_config is False:
return False, None
if messages is None:
messages = self._oai_messages[sender]
# TODO: #1143 handle token limit exceeded error
response = oai.ChatCompletion.create(
context=messages[-1].pop("context", None), messages=self._oai_system_message + messages, **llm_config
)
return True, oai.ChatCompletion.extract_text_or_function_call(response)[0]
def generate_code_execution_reply(
self,
messages: Optional[List[Dict]] = None,
sender: Optional[Agent] = None,
config: Optional[Any] = None,
):
"""Generate a reply using code execution."""
code_execution_config = config if config is not None else self._code_execution_config
if code_execution_config is False:
return False, None
if messages is None:
messages = self._oai_messages[sender]
last_n_messages = code_execution_config.pop("last_n_messages", 1)
for i in range(min(len(messages), last_n_messages)):
message = messages[-(i + 1)]
code_blocks = extract_code(message["content"])
if len(code_blocks) == 1 and code_blocks[0][0] == UNKNOWN:
# no code block is found, lang should be `UNKNOWN`
if i == last_n_messages - 1:
code_execution_config["last_n_messages"] = last_n_messages
return False, None
continue
# code_blocks, _ = find_code(messages, sys_msg=self._oai_system_message, **self.llm_config)
# if len(code_blocks) == 1 and code_blocks[0][0] == UNKNOWN:
# return code_blocks[0][1]
# try to execute the code
exitcode, logs = self.execute_code_blocks(code_blocks)
exitcode2str = "execution succeeded" if exitcode == 0 else "execution failed"
break
code_execution_config["last_n_messages"] = last_n_messages
return True, f"exitcode: {exitcode} ({exitcode2str})\nCode output: {logs}"
def generate_function_call_reply(
self,
messages: Optional[List[Dict]] = None,
sender: Optional[Agent] = None,
config: Optional[Any] = None,
):
"""Generate a reply using function call."""
if config is None:
config = self
if messages is None:
messages = self._oai_messages[sender]
message = messages[-1]
if "function_call" in message:
_, func_return = self.execute_function(message["function_call"])
return True, func_return
return False, None
def check_termination_and_human_reply(
self,
messages: Optional[List[Dict]] = None,
sender: Optional[Agent] = None,
config: Optional[Any] = None,
) -> Tuple[bool, Union[str, Dict, None]]:
"""Check if the conversation should be terminated, and if human reply is provided."""
if config is None:
config = self
if messages is None:
messages = self._oai_messages[sender]
message = messages[-1]
reply = ""
no_human_input_msg = ""
if self.human_input_mode == "ALWAYS":
reply = self.get_human_input(
f"Provide feedback to {sender.name}. Press enter to skip and use auto-reply, or type 'exit' to end the conversation: "
)
no_human_input_msg = "NO HUMAN INPUT RECEIVED." if not reply else ""
# if the human input is empty, and the message is a termination message, then we will terminate the conversation
reply = reply if reply or not self._is_termination_msg(message) else "exit"
else:
if self._consecutive_auto_reply_counter[sender] >= self._max_consecutive_auto_reply_dict[sender]:
if self.human_input_mode == "NEVER":
reply = "exit"
else:
# self.human_input_mode == "TERMINATE":
terminate = self._is_termination_msg(message)
reply = self.get_human_input(
f"Please give feedback to {sender.name}. Press enter or type 'exit' to stop the conversation: "
if terminate
else f"Please give feedback to {sender.name}. Press enter to skip and use auto-reply, or type 'exit' to stop the conversation: "
)
no_human_input_msg = "NO HUMAN INPUT RECEIVED." if not reply else ""
# if the human input is empty, and the message is a termination message, then we will terminate the conversation
reply = reply if reply or not terminate else "exit"
elif self._is_termination_msg(message):
if self.human_input_mode == "NEVER":
reply = "exit"
else:
# self.human_input_mode == "TERMINATE":
reply = self.get_human_input(
f"Please give feedback to {sender.name}. Press enter or type 'exit' to stop the conversation: "
)
no_human_input_msg = "NO HUMAN INPUT RECEIVED." if not reply else ""
# if the human input is empty, and the message is a termination message, then we will terminate the conversation
reply = reply or "exit"
# print the no_human_input_msg
if no_human_input_msg:
print(colored(f"\n>>>>>>>> {no_human_input_msg}", "red"), flush=True)
# stop the conversation
if reply == "exit":
# reset the consecutive_auto_reply_counter
self._consecutive_auto_reply_counter[sender] = 0
return True, None
# send the human reply
if reply or self._max_consecutive_auto_reply_dict[sender] == 0:
# reset the consecutive_auto_reply_counter
self._consecutive_auto_reply_counter[sender] = 0
return True, reply
# increment the consecutive_auto_reply_counter
self._consecutive_auto_reply_counter[sender] += 1
if self.human_input_mode != "NEVER":
print(colored("\n>>>>>>>> USING AUTO REPLY...", "red"), flush=True)
return False, None
def generate_reply(
self,
messages: Optional[List[Dict]] = None,
sender: Optional[Agent] = None,
exclude: Optional[List[Callable]] = None,
) -> Union[str, Dict, None]:
"""Reply based on the conversation history and the sender.
Either messages or sender must be provided.
Register a reply_func with `None` as one trigger for it to be activated when `messages` is non-empty and `sender` is `None`.
Use registered auto reply functions to generate replies.
By default, the following functions are checked in order:
1. check_termination_and_human_reply
2. generate_function_call_reply
3. generate_code_execution_reply
4. generate_oai_reply
Every function returns a tuple (final, reply).
When a function returns final=False, the next function will be checked.
So by default, termination and human reply will be checked first.
If not terminating and human reply is skipped, execute function or code and return the result.
AI replies are generated only when no code execution is performed.
        Args:
            messages: a list of messages in the conversation history.
            sender: sender of an Agent instance.
            exclude: a list of reply functions to exclude.
Returns:
str or dict or None: reply. None if no reply is generated.
"""
assert messages is not None or sender is not None, "Either messages or sender must be provided."
if messages is None:
messages = self._oai_messages[sender]
for reply_func_tuple in self._reply_func_list:
reply_func = reply_func_tuple["reply_func"]
if exclude and reply_func in exclude:
continue
if asyncio.coroutines.iscoroutinefunction(reply_func):
continue
if self._match_trigger(reply_func_tuple["trigger"], sender):
final, reply = reply_func(self, messages=messages, sender=sender, config=reply_func_tuple["config"])
if final:
return reply
return self._default_auto_reply
async def a_generate_reply(
self,
messages: Optional[List[Dict]] = None,
sender: Optional[Agent] = None,
exclude: Optional[List[Callable]] = None,
) -> Union[str, Dict, None]:
"""(async) Reply based on the conversation history and the sender.
Either messages or sender must be provided.
Register a reply_func with `None` as one trigger for it to be activated when `messages` is non-empty and `sender` is `None`.
Use registered auto reply functions to generate replies.
By default, the following functions are checked in order:
1. check_termination_and_human_reply
2. generate_function_call_reply
3. generate_code_execution_reply
4. generate_oai_reply
Every function returns a tuple (final, reply).
When a function returns final=False, the next function will be checked.
So by default, termination and human reply will be checked first.
If not terminating and human reply is skipped, execute function or code and return the result.
AI replies are generated only when no code execution is performed.
        Args:
            messages: a list of messages in the conversation history.
            sender: sender of an Agent instance.
            exclude: a list of reply functions to exclude.
Returns:
str or dict or None: reply. None if no reply is generated.
"""
assert messages is not None or sender is not None, "Either messages or sender must be provided."
if messages is None:
messages = self._oai_messages[sender]
for reply_func_tuple in self._reply_func_list:
reply_func = reply_func_tuple["reply_func"]
if exclude and reply_func in exclude:
continue
if self._match_trigger(reply_func_tuple["trigger"], sender):
if asyncio.coroutines.iscoroutinefunction(reply_func):
final, reply = await reply_func(
self, messages=messages, sender=sender, config=reply_func_tuple["config"]
)
else:
final, reply = reply_func(self, messages=messages, sender=sender, config=reply_func_tuple["config"])
if final:
return reply
return self._default_auto_reply
def _match_trigger(self, trigger, sender):
"""Check if the sender matches the trigger."""
if trigger is None:
return sender is None
elif isinstance(trigger, str):
return trigger == sender.name
elif isinstance(trigger, type):
return isinstance(sender, trigger)
elif isinstance(trigger, Agent):
return trigger == sender
elif isinstance(trigger, Callable):
return trigger(sender)
elif isinstance(trigger, list):
return any(self._match_trigger(t, sender) for t in trigger)
else:
raise ValueError(f"Unsupported trigger type: {type(trigger)}")
def get_human_input(self, prompt: str) -> str:
"""Get human input.
Override this method to customize the way to get human input.
Args:
prompt (str): prompt for the human input.
Returns:
str: human input.
"""
reply = input(prompt)
return reply
def run_code(self, code, **kwargs):
"""Run the code and return the result.
Override this function to modify the way to run the code.
Args:
code (str): the code to be executed.
**kwargs: other keyword arguments.
Returns:
A tuple of (exitcode, logs, image).
exitcode (int): the exit code of the code execution.
logs (str): the logs of the code execution.
image (str or None): the docker image used for the code execution.
"""
return execute_code(code, **kwargs)
def execute_code_blocks(self, code_blocks):
"""Execute the code blocks and return the result."""
logs_all = ""
for i, code_block in enumerate(code_blocks):
lang, code = code_block
if not lang:
lang = infer_lang(code)
print(
colored(
f"\n>>>>>>>> EXECUTING CODE BLOCK {i} (inferred language is {lang})...",
"red",
),
flush=True,
)
if lang in ["bash", "shell", "sh"]:
exitcode, logs, image = self.run_code(code, lang=lang, **self._code_execution_config)
elif lang in ["python", "Python"]:
if code.startswith("# filename: "):
                    filename = code[len("# filename: ") : code.find("\n")].strip()
else:
filename = None
exitcode, logs, image = self.run_code(
code,
lang="python",
filename=filename,
**self._code_execution_config,
)
else:
# In case the language is not supported, we return an error message.
exitcode, logs, image = (
1,
f"unknown language {lang}",
None,
)
# raise NotImplementedError
if image is not None:
self._code_execution_config["use_docker"] = image
logs_all += "\n" + logs
if exitcode != 0:
return exitcode, logs_all
return exitcode, logs_all
@staticmethod
def _format_json_str(jstr):
"""Remove newlines outside of quotes, and handle JSON escape sequences.
1. this function removes the newline in the query outside of quotes otherwise json.loads(s) will fail.
Ex 1:
"{\n"tool": "python",\n"query": "print('hello')\nprint('world')"\n}" -> "{"tool": "python","query": "print('hello')\nprint('world')"}"
Ex 2:
"{\n \"location\": \"Boston, MA\"\n}" -> "{"location": "Boston, MA"}"
2. this function also handles JSON escape sequences inside quotes,
Ex 1:
'{"args": "a\na\na\ta"}' -> '{"args": "a\\na\\na\\ta"}'
"""
result = []
inside_quotes = False
last_char = " "
for char in jstr:
if last_char != "\\" and char == '"':
inside_quotes = not inside_quotes
last_char = char
if not inside_quotes and char == "\n":
continue
if inside_quotes and char == "\n":
char = "\\n"
if inside_quotes and char == "\t":
char = "\\t"
result.append(char)
return "".join(result)
def execute_function(self, func_call):
"""Execute a function call and return the result.
Override this function to modify the way to execute a function call.
Args:
func_call: a dictionary extracted from openai message at key "function_call" with keys "name" and "arguments".
Returns:
A tuple of (is_exec_success, result_dict).
is_exec_success (boolean): whether the execution is successful.
result_dict: a dictionary with keys "name", "role", and "content". Value of "role" is "function".
"""
func_name = func_call.get("name", "")
func = self._function_map.get(func_name, None)
is_exec_success = False
if func is not None:
# Extract arguments from a json-like string and put it into a dict.
input_string = self._format_json_str(func_call.get("arguments", "{}"))
try:
arguments = json.loads(input_string)
except json.JSONDecodeError as e:
arguments = None
content = f"Error: {e}\n You argument should follow json format."
# Try to execute the function
if arguments is not None:
print(
colored(f"\n>>>>>>>> EXECUTING FUNCTION {func_name}...", "magenta"),
flush=True,
)
try:
content = func(**arguments)
is_exec_success = True
except Exception as e:
content = f"Error: {e}"
else:
content = f"Error: Function {func_name} not found."
return is_exec_success, {
"name": func_name,
"role": "function",
"content": str(content),
}
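    # A minimal usage sketch for `execute_function` (illustrative, not part of the
    # original class; assumes `agent` is a ConversableAgent with a registered function):
    #
    #     agent.register_function({"add": lambda a, b: a + b})
    #     ok, msg = agent.execute_function({"name": "add", "arguments": '{"a": 1, "b": 2}'})
    #     # ok is True; msg == {"name": "add", "role": "function", "content": "3"}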
def generate_init_message(self, **context) -> Union[str, Dict]:
"""Generate the initial message for the agent.
Override this function to customize the initial message based on user's request.
        If not overridden, "message" needs to be provided in the context.
"""
return context["message"]
def register_function(self, function_map: Dict[str, Callable]):
"""Register functions to the agent.
Args:
function_map: a dictionary mapping function names to functions.
"""
self._function_map.update(function_map)
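# A minimal two-agent sketch of the API above (illustrative, not part of the
# original module; `config_list` and the sibling AssistantAgent class from this
# package are assumed, and a matching `functions` schema would be added to
# llm_config for the model to actually suggest `get_time`):
#
#     from flaml.autogen.agentchat import AssistantAgent, UserProxyAgent
#     import datetime
#
#     assistant = AssistantAgent("assistant", llm_config={"config_list": config_list})
#     user_proxy = UserProxyAgent("user_proxy", human_input_mode="NEVER", max_consecutive_auto_reply=2)
#     user_proxy.register_function({"get_time": lambda: str(datetime.datetime.now())})
#     user_proxy.initiate_chat(assistant, message="What time is it?")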

133
flaml/autogen/agentchat/groupchat.py Normal file

@ -0,0 +1,133 @@
from dataclasses import dataclass
import sys
from typing import Dict, List, Optional, Union
from .agent import Agent
from .conversable_agent import ConversableAgent
@dataclass
class GroupChat:
"""A group chat class that contains a list of agents and the maximum number of rounds."""
agents: List[Agent]
messages: List[Dict]
max_round: int = 10
admin_name: str = "Admin" # the name of the admin agent
@property
def agent_names(self) -> List[str]:
"""Return the names of the agents in the group chat."""
return [agent.name for agent in self.agents]
def reset(self):
"""Reset the group chat."""
self.messages.clear()
def agent_by_name(self, name: str) -> Agent:
"""Find the next speaker based on the message."""
return self.agents[self.agent_names.index(name)]
def next_agent(self, agent: Agent) -> Agent:
"""Return the next agent in the list."""
return self.agents[(self.agent_names.index(agent.name) + 1) % len(self.agents)]
def select_speaker_msg(self):
"""Return the message for selecting the next speaker."""
return f"""You are in a role play game. The following roles are available:
{self._participant_roles()}.
Read the following conversation.
Then select the next role from {self.agent_names} to play. Only return the role."""
def select_speaker(self, last_speaker: Agent, selector: ConversableAgent):
"""Select the next speaker."""
selector.update_system_message(self.select_speaker_msg())
final, name = selector.generate_oai_reply(
self.messages
+ [
{
"role": "system",
"content": f"Read the above conversation. Then select the next role from {self.agent_names} to play. Only return the role.",
}
]
)
if not final:
# i = self._random.randint(0, len(self._agent_names) - 1) # randomly pick an id
return self.next_agent(last_speaker)
try:
return self.agent_by_name(name)
except ValueError:
return self.next_agent(last_speaker)
def _participant_roles(self):
return "\n".join([f"{agent.name}: {agent.system_message}" for agent in self.agents])
class GroupChatManager(ConversableAgent):
"""(In preview) A chat manager agent that can manage a group chat of multiple agents."""
def __init__(
self,
groupchat: GroupChat,
name: Optional[str] = "chat_manager",
# unlimited consecutive auto reply by default
max_consecutive_auto_reply: Optional[int] = sys.maxsize,
human_input_mode: Optional[str] = "NEVER",
system_message: Optional[str] = "Group chat manager.",
# seed: Optional[int] = 4,
**kwargs,
):
super().__init__(
name=name,
max_consecutive_auto_reply=max_consecutive_auto_reply,
human_input_mode=human_input_mode,
system_message=system_message,
**kwargs,
)
self.register_reply(Agent, GroupChatManager.run_chat, config=groupchat, reset_config=GroupChat.reset)
# self._random = random.Random(seed)
def run_chat(
self,
messages: Optional[List[Dict]] = None,
sender: Optional[Agent] = None,
config: Optional[GroupChat] = None,
) -> Union[str, Dict, None]:
"""Run a group chat."""
if messages is None:
messages = self._oai_messages[sender]
message = messages[-1]
speaker = sender
groupchat = config
for i in range(groupchat.max_round):
# set the name to speaker's name if the role is not function
if message["role"] != "function":
message["name"] = speaker.name
groupchat.messages.append(message)
# broadcast the message to all agents except the speaker
for agent in groupchat.agents:
if agent != speaker:
self.send(message, agent, request_reply=False, silent=True)
if i == groupchat.max_round - 1:
# the last round
break
try:
# select the next speaker
speaker = groupchat.select_speaker(speaker, self)
# let the speaker speak
reply = speaker.generate_reply(sender=self)
except KeyboardInterrupt:
# let the admin agent speak if interrupted
if groupchat.admin_name in groupchat.agent_names:
# admin agent is one of the participants
speaker = groupchat.agent_by_name(groupchat.admin_name)
reply = speaker.generate_reply(sender=self)
else:
# admin agent is not found in the participants
raise
if reply is None:
break
# The speaker sends the message without requesting a reply
speaker.send(reply, self, request_reply=False)
message = self.last_message(speaker)
return True, None
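# A minimal usage sketch for GroupChat/GroupChatManager (illustrative; `user_proxy`,
# `coder`, and `critic` stand for agents built elsewhere, and `config_list` is assumed):
#
#     groupchat = GroupChat(agents=[user_proxy, coder, critic], messages=[], max_round=8)
#     manager = GroupChatManager(groupchat=groupchat, llm_config={"config_list": config_list})
#     user_proxy.initiate_chat(manager, message="Write and review a script to plot y = x**2.")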

82
flaml/autogen/agentchat/user_proxy_agent.py Normal file

@ -0,0 +1,82 @@
from .conversable_agent import ConversableAgent
from typing import Callable, Dict, Optional, Union
class UserProxyAgent(ConversableAgent):
"""(In preview) A proxy agent for the user, that can execute code and provide feedback to the other agents.
    UserProxyAgent is a subclass of ConversableAgent configured with `human_input_mode` set to ALWAYS
    and `llm_config` set to False. By default, the agent will prompt for human input every time a message is received.
    Code execution is enabled by default. LLM-based auto reply is disabled by default.
    To modify auto reply, register a method with [`register_reply`](conversable_agent#register_reply).
To modify the way to get human input, override `get_human_input` method.
To modify the way to execute code blocks, single code block, or function call, override `execute_code_blocks`,
`run_code`, and `execute_function` methods respectively.
To customize the initial message when a conversation starts, override `generate_init_message` method.
"""
def __init__(
self,
name: str,
is_termination_msg: Optional[Callable[[Dict], bool]] = None,
max_consecutive_auto_reply: Optional[int] = None,
human_input_mode: Optional[str] = "ALWAYS",
function_map: Optional[Dict[str, Callable]] = None,
code_execution_config: Optional[Union[Dict, bool]] = None,
default_auto_reply: Optional[Union[str, Dict, None]] = "",
llm_config: Optional[Union[Dict, bool]] = False,
system_message: Optional[str] = "",
):
"""
Args:
name (str): name of the agent.
is_termination_msg (function): a function that takes a message in the form of a dictionary
and returns a boolean value indicating if this received message is a termination message.
The dict can contain the following keys: "content", "role", "name", "function_call".
            max_consecutive_auto_reply (int): the maximum number of consecutive auto replies.
                Default to None, in which case the class attribute MAX_CONSECUTIVE_AUTO_REPLY will be used as the limit.
                The limit only plays a role when human_input_mode is not "ALWAYS".
human_input_mode (str): whether to ask for human inputs every time a message is received.
Possible values are "ALWAYS", "TERMINATE", "NEVER".
(1) When "ALWAYS", the agent prompts for human input every time a message is received.
Under this mode, the conversation stops when the human input is "exit",
or when is_termination_msg is True and there is no human input.
(2) When "TERMINATE", the agent only prompts for human input only when a termination message is received or
the number of auto reply reaches the max_consecutive_auto_reply.
(3) When "NEVER", the agent will never prompt for human input. Under this mode, the conversation stops
when the number of auto reply reaches the max_consecutive_auto_reply or when is_termination_msg is True.
function_map (dict[str, callable]): Mapping function names (passed to openai) to callable functions.
code_execution_config (dict or False): config for the code execution.
To disable code execution, set to False. Otherwise, set to a dictionary with the following keys:
- work_dir (Optional, str): The working directory for the code execution.
If None, a default working directory will be used.
The default working directory is the "extensions" directory under
"path_to_flaml/autogen".
- use_docker (Optional, list, str or bool): The docker image to use for code execution.
If a list or a str of image name(s) is provided, the code will be executed in a docker container
with the first image successfully pulled.
If None, False or empty, the code will be executed in the current environment.
Default is True, which will be converted into a list.
If the code is executed in the current environment,
the code must be trusted.
- timeout (Optional, int): The maximum execution time in seconds.
- last_n_messages (Experimental, Optional, int): The number of messages to look back for code execution. Default to 1.
default_auto_reply (str or dict or None): the default auto reply message when no code execution or llm based reply is generated.
llm_config (dict or False): llm inference configuration.
Please refer to [autogen.Completion.create](/docs/reference/autogen/oai/completion#create)
for available options.
Default to false, which disables llm-based auto reply.
system_message (str): system message for ChatCompletion inference.
Only used when llm_config is not False. Use it to reprogram the agent.
"""
super().__init__(
name,
system_message,
is_termination_msg,
max_consecutive_auto_reply,
human_input_mode,
function_map,
code_execution_config,
llm_config,
default_auto_reply,
)
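# A minimal construction sketch (illustrative): a proxy that never asks for human
# input, executes code blocks from the last message in ./coding, and stops on a
# trailing "TERMINATE":
#
#     user_proxy = UserProxyAgent(
#         name="user_proxy",
#         human_input_mode="NEVER",
#         max_consecutive_auto_reply=5,
#         is_termination_msg=lambda m: (m.get("content") or "").rstrip().endswith("TERMINATE"),
#         code_execution_config={"work_dir": "coding", "use_docker": False},
#     )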

548
flaml/autogen/code_utils.py Normal file

@ -0,0 +1,548 @@
import signal
import subprocess
import sys
import os
import pathlib
from typing import List, Dict, Tuple, Optional, Union, Callable
import re
import time
from hashlib import md5
import logging
from flaml.autogen import oai
try:
import docker
except ImportError:
docker = None
DEFAULT_MODEL = "gpt-4"
FAST_MODEL = "gpt-3.5-turbo"
# Regular expression for finding a code block
CODE_BLOCK_PATTERN = r"```(\w*)\n(.*?)\n```"
WORKING_DIR = os.path.join(os.path.dirname(os.path.realpath(__file__)), "extensions")
UNKNOWN = "unknown"
TIMEOUT_MSG = "Timeout"
DEFAULT_TIMEOUT = 600
def infer_lang(code):
"""infer the language for the code.
TODO: make it robust.
"""
if code.startswith("python ") or code.startswith("pip") or code.startswith("python3 "):
return "sh"
return "python"
def extract_code(text: str, pattern: str = CODE_BLOCK_PATTERN) -> List[Tuple[str, str]]:
"""Extract code from a text.
Args:
text (str): The text to extract code from.
pattern (Optional, str): The regular expression pattern for finding the code block.
Returns:
list: A list of tuples, each containing the language and the code.
"""
# Use a regular expression to find all the code blocks
match = re.findall(pattern, text, flags=re.DOTALL)
# match = re.search(pattern, text, flags=re.DOTALL)
# If a match is found, return the code
# if match:
# return match.group(2), match.group(1)
# If no code block is found, return the whole text
return match if match else [(UNKNOWN, text)]
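# Doctest-style examples for the two helpers above:
#
#     >>> extract_code("```python\nprint('hi')\n```")
#     [('python', "print('hi')")]
#     >>> extract_code("no code here")
#     [('unknown', 'no code here')]
#     >>> infer_lang("pip install flaml")
#     'sh'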
# _FIND_CODE_SYS_MSG = [
# {
# "role": "system",
# "content": """In the following conversation, an assistant suggests code and a user is expected to run it.
# Read the conversation, and then find all the right code blocks for the user to run next in the right order.
# Only return the code blocks that are expected to run.
# Don't include code blocks which have been executed unless the user is requested to run the same block again.
# When the user needs to run multiple blocks in sequence, make sure to output all the blocks to run in a right order.
# If the line beginning with "# filename" is put before a code block, move it into the code block as the first line.
# Make sure to add the right "python" or "sh" identifier if the language identifier is missing for a code block.
# Don't make other changes to the code blocks.
# Don't reply anything else if at least one code block is expected to run.
# If no code block is expected to run, check whether the task has been successfully finished at full satisfaction.
# If not, reply with the reason why the task is not finished.""",
# },
# ]
# _FIND_CODE_CONFIG = {
# "model": FAST_MODEL,
# }
# def find_code(messages: List[Dict], sys_msg=None, **config) -> Tuple[List[Tuple[str, str]], str]:
# """Find code from a list of messages.
# Args:
# messages (str): The list of messages to find code from.
# sys_msg (Optional, str): The system message to prepend to the messages.
# config (Optional, dict): The configuration for the API call.
# Returns:
# list: A list of tuples, each containing the language and the code.
# str: The generated text by llm.
# """
# params = {**_FIND_CODE_CONFIG, **config}
# if sys_msg is None or not sys_msg[0]["content"]:
# sys_msg = _FIND_CODE_SYS_MSG
# response = oai.ChatCompletion.create(messages=sys_msg + messages, **params)
# content = oai.Completion.extract_text(response)[0]
# return extract_code(content), content
def generate_code(pattern: str = CODE_BLOCK_PATTERN, **config) -> Tuple[List[Tuple[str, str]], float]:
"""Generate code.
Args:
pattern (Optional, str): The regular expression pattern for finding the code block.
The default pattern is for finding a code block in a markdown file.
config (Optional, dict): The configuration for the API call.
    Returns:
        list: A list of (language, code) tuples extracted from the generated text.
        float: The cost of the generation.
"""
response = oai.Completion.create(**config)
return extract_code(oai.Completion.extract_text(response)[0], pattern), response["cost"]
_IMPROVE_FUNCTION_CONFIG = {
"prompt": """Improve the function '{func_name}' to achieve the objective '{objective}'.
The current implementation of the function is as follows:
{file_string}""",
"model": DEFAULT_MODEL,
"request_timeout": 600,
}
def improve_function(file_name, func_name, objective, **config):
"""(work in progress) Improve the function to achieve the objective."""
params = {**_IMPROVE_FUNCTION_CONFIG, **config}
# read the entire file into a str
with open(file_name, "r") as f:
file_string = f.read()
response = oai.Completion.create(
{"func_name": func_name, "objective": objective, "file_string": file_string}, **params
)
return oai.Completion.extract_text(response)[0], response["cost"]
_IMPROVE_CODE_CONFIG = {
"prompt": """Analyze the code in the following files and return a list of suggestions for improvement{followup}, to achieve the objective of '{objective}'.
{code}
""",
"model": DEFAULT_MODEL,
"request_timeout": 900,
}
def improve_code(files, objective, suggest_only=True, **config):
"""Improve the code to achieve a given objective.
Args:
files (list): A list of file names containing the source code.
objective (str): The objective to achieve.
suggest_only (bool): Whether to return only the suggestions or the improved code.
config (Optional, dict): The configuration for the API call.
Returns:
str: The improved code if suggest_only=False; a list of suggestions if suggest_only=True (default).
float: The cost of the generation.
"""
code = ""
for file_name in files:
# read the entire file into a string
with open(file_name, "r") as f:
file_string = f.read()
code += f"""{file_name}:
{file_string}
"""
params = {**_IMPROVE_CODE_CONFIG, **config}
followup = "" if suggest_only else " followed by the improved code"
response = oai.Completion.create({"objective": objective, "code": code, "followup": followup}, **params)
return oai.Completion.extract_text(response)[0], response["cost"]
def timeout_handler(signum, frame):
raise TimeoutError("Timed out!")
def _cmd(lang):
if lang.startswith("python") or lang in ["bash", "sh"]:
return lang
if lang == "shell":
return "sh"
raise NotImplementedError(f"{lang} not recognized in code execution")
def execute_code(
code: Optional[str] = None,
timeout: Optional[int] = None,
filename: Optional[str] = None,
work_dir: Optional[str] = None,
use_docker: Optional[Union[List[str], str, bool]] = docker is not None,
lang: Optional[str] = "python",
) -> Tuple[int, str, str]:
"""Execute code in a docker container.
This function is not tested on MacOS.
Args:
code (Optional, str): The code to execute.
If None, the code from the file specified by filename will be executed.
Either code or filename must be provided.
timeout (Optional, int): The maximum execution time in seconds.
If None, a default timeout will be used. The default timeout is 600 seconds. On Windows, the timeout is not enforced when use_docker=False.
filename (Optional, str): The file name to save the code or where the code is stored when `code` is None.
If None, a file with a randomly generated name will be created.
The randomly generated file will be deleted after execution.
The file name must be a relative path. Relative paths are relative to the working directory.
work_dir (Optional, str): The working directory for the code execution.
If None, a default working directory will be used.
The default working directory is the "extensions" directory under
"path_to_flaml/autogen".
use_docker (Optional, list, str or bool): The docker image to use for code execution.
If a list or a str of image name(s) is provided, the code will be executed in a docker container
with the first image successfully pulled.
If None, False or empty, the code will be executed in the current environment.
Default is True, which will be converted into a list.
If the code is executed in the current environment,
the code must be trusted.
lang (Optional, str): The language of the code. Default is "python".
Returns:
int: 0 if the code executes successfully.
str: The error message if the code fails to execute; the stdout otherwise.
image: The docker image name after container run when docker is used.
"""
assert code is not None or filename is not None, "Either code or filename must be provided."
timeout = timeout or DEFAULT_TIMEOUT
original_filename = filename
if filename is None:
code_hash = md5(code.encode()).hexdigest()
        # create a file with an automatically generated name
filename = f"tmp_code_{code_hash}.{'py' if lang.startswith('python') else lang}"
if work_dir is None:
work_dir = WORKING_DIR
filepath = os.path.join(work_dir, filename)
file_dir = os.path.dirname(filepath)
os.makedirs(file_dir, exist_ok=True)
if code is not None:
with open(filepath, "w") as fout:
fout.write(code)
# check if already running in a docker container
in_docker_container = os.path.exists("/.dockerenv")
if not use_docker or in_docker_container:
        # execute in the current environment: docker is disabled or we are already inside a container
cmd = [sys.executable if lang.startswith("python") else _cmd(lang), filename]
if sys.platform == "win32":
logging.warning("SIGALRM is not supported on Windows. No timeout will be enforced.")
result = subprocess.run(
cmd,
cwd=work_dir,
capture_output=True,
)
else:
signal.signal(signal.SIGALRM, timeout_handler)
try:
signal.alarm(timeout)
# run the code in a subprocess in the current docker container in the working directory
result = subprocess.run(
cmd,
cwd=work_dir,
capture_output=True,
)
signal.alarm(0)
except TimeoutError:
if original_filename is None:
os.remove(filepath)
return 1, TIMEOUT_MSG, None
if original_filename is None:
os.remove(filepath)
abs_path = str(pathlib.Path(filepath).absolute())
else:
abs_path = str(pathlib.Path(work_dir).absolute()) + "/"
if result.returncode:
logs = result.stderr.decode("utf-8")
logs = logs.replace(str(abs_path), "")
else:
logs = result.stdout.decode("utf-8")
return result.returncode, logs, None
# create a docker client
client = docker.from_env()
image_list = (
["python:3-alpine", "python:3", "python:3-windowsservercore"]
if use_docker is True
else [use_docker]
if isinstance(use_docker, str)
else use_docker
)
for image in image_list:
# check if the image exists
try:
client.images.get(image)
break
except docker.errors.ImageNotFound:
# pull the image
print("Pulling image", image)
try:
client.images.pull(image)
break
except docker.errors.DockerException:
print("Failed to pull image", image)
# get a randomized str based on current time to wrap the exit code
exit_code_str = f"exitcode{time.time()}"
abs_path = pathlib.Path(work_dir).absolute()
# if sys.platform == "win32":
# abs_path = str(abs_path).replace("\\", "/")
# abs_path = f"/{abs_path[0].lower()}{abs_path[2:]}"
cmd = [
"sh",
"-c",
f"{_cmd(lang)} {filename}; exit_code=$?; echo -n {exit_code_str}; echo -n $exit_code; echo {exit_code_str}",
]
# create a docker container
container = client.containers.run(
image,
command=cmd,
working_dir="/workspace",
detach=True,
# get absolute path to the working directory
volumes={abs_path: {"bind": "/workspace", "mode": "rw"}},
)
start_time = time.time()
while container.status != "exited" and time.time() - start_time < timeout:
# Reload the container object
container.reload()
if container.status != "exited":
container.stop()
container.remove()
if original_filename is None:
os.remove(filepath)
return 1, TIMEOUT_MSG, image
# try:
# container.wait(timeout=timeout)
# except (ReadTimeout, ConnectionError):
# container.stop()
# container.remove()
# if original_filename is None:
# os.remove(filepath)
# return 1, "Timeout"
# get the container logs
logs = container.logs().decode("utf-8").rstrip()
# commit the image
tag = filename.replace("/", "")
container.commit(repository="python", tag=tag)
# remove the container
container.remove()
# check if the code executed successfully
exit_code = container.attrs["State"]["ExitCode"]
if exit_code == 0:
# extract the exit code from the logs
pattern = re.compile(f"{exit_code_str}(\\d+){exit_code_str}")
match = pattern.search(logs)
exit_code = 1 if match is None else int(match.group(1))
# remove the exit code from the logs
logs = logs if match is None else pattern.sub("", logs)
if original_filename is None:
os.remove(filepath)
if exit_code:
logs = logs.replace(f"/workspace/{filename if original_filename is None else ''}", "")
# return the exit code, logs and image
return exit_code, logs, f"python:{tag}"
_GENERATE_ASSERTIONS_CONFIG = {
"prompt": """Given the signature and docstring, write the exactly same number of assertion(s) for the provided example(s) in the docstring, without assertion messages.
func signature:
{definition}
assertions:""",
"model": FAST_MODEL,
"max_tokens": 256,
"stop": "\n\n",
}
def generate_assertions(definition: str, **config) -> Tuple[str, float]:
"""Generate assertions for a function.
Args:
definition (str): The function definition, including the signature and docstr.
config (Optional, dict): The configuration for the API call.
Returns:
str: The generated assertions.
float: The cost of the generation.
"""
params = {**_GENERATE_ASSERTIONS_CONFIG, **config}
response = oai.Completion.create(
{"definition": definition},
**params,
)
assertions = oai.Completion.extract_text(response)[0]
return assertions, response["cost"]
def _remove_check(response):
"""Remove the check function from the response."""
# find the position of the check function
pos = response.find("def check(")
if pos == -1:
return response
return response[:pos]
def eval_function_completions(
responses: List[str],
definition: str,
test: Optional[str] = None,
entry_point: Optional[str] = None,
assertions: Optional[Union[str, Callable[[str], Tuple[str, float]]]] = None,
timeout: Optional[float] = 3,
use_docker: Optional[bool] = True,
) -> Dict:
"""Select a response from a list of responses for the function completion task (using generated assertions), and/or evaluate if the task is successful using a gold test.
Args:
responses (list): The list of responses.
definition (str): The input definition.
test (Optional, str): The test code.
entry_point (Optional, str): The name of the function.
assertions (Optional, str or Callable): The assertion code which serves as a filter of the responses, or an assertion generator.
When provided, only the responses that pass the assertions will be considered for the actual test (if provided).
        timeout (Optional, float): The timeout for executing the code.
        use_docker (Optional, bool): Whether to use docker for the code execution (see `execute_code`).
Returns:
dict: The success metrics.
"""
n = len(responses)
if assertions is None:
# no assertion filter
success_list = []
for i in range(n):
response = _remove_check(responses[i])
code = (
f"{response}\n{test}\ncheck({entry_point})"
if response.startswith("def")
else f"{definition}{response}\n{test}\ncheck({entry_point})"
)
success = execute_code(code, timeout=timeout, use_docker=use_docker)[0] == 0
success_list.append(success)
return {
"expected_success": 1 - pow(1 - sum(success_list) / n, n),
"success": any(s for s in success_list),
}
if callable(assertions) and n > 1:
# assertion generator
assertions, gen_cost = assertions(definition)
else:
gen_cost = 0
if n > 1 or test is None:
for i in range(n):
response = responses[i] = _remove_check(responses[i])
code = (
f"{response}\n{assertions}" if response.startswith("def") else f"{definition}{response}\n{assertions}"
)
succeed_assertions = execute_code(code, timeout=timeout, use_docker=use_docker)[0] == 0
if succeed_assertions:
break
else:
# just test, no need to check assertions
succeed_assertions = False
i, response = 0, responses[0]
if test is None:
# no test code
return {
"index_selected": i,
"succeed_assertions": succeed_assertions,
"gen_cost": gen_cost,
"assertions": assertions,
}
code_test = (
f"{response}\n{test}\ncheck({entry_point})"
if response.startswith("def")
else f"{definition}{response}\n{test}\ncheck({entry_point})"
)
success = execute_code(code_test, timeout=timeout, use_docker=use_docker)[0] == 0
return {
"index_selected": i,
"succeed_assertions": succeed_assertions,
"success": success,
"gen_cost": gen_cost,
"assertions": assertions,
}
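# Example (no LLM call on this path): evaluate one candidate completion against a
# hand-written assertion string.
#
#     >>> definition = 'def add(a, b):\n    """Add two numbers."""\n'
#     >>> result = eval_function_completions(
#     ...     ["    return a + b"], definition, assertions="assert add(1, 2) == 3", use_docker=False
#     ... )
#     >>> result["succeed_assertions"], result["index_selected"]
#     (True, 0)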
_FUNC_COMPLETION_PROMPT = "# Python 3{definition}"
_FUNC_COMPLETION_STOP = ["\nclass", "\ndef", "\nif", "\nprint"]
_IMPLEMENT_CONFIGS = [
{"model": FAST_MODEL, "prompt": _FUNC_COMPLETION_PROMPT, "temperature": 0, "seed": 0},
{"model": FAST_MODEL, "prompt": _FUNC_COMPLETION_PROMPT, "stop": _FUNC_COMPLETION_STOP, "n": 7, "seed": 0},
{"model": DEFAULT_MODEL, "prompt": _FUNC_COMPLETION_PROMPT, "temperature": 0, "seed": 1},
{"model": DEFAULT_MODEL, "prompt": _FUNC_COMPLETION_PROMPT, "stop": _FUNC_COMPLETION_STOP, "n": 2, "seed": 2},
{"model": DEFAULT_MODEL, "prompt": _FUNC_COMPLETION_PROMPT, "stop": _FUNC_COMPLETION_STOP, "n": 1, "seed": 2},
]
class PassAssertionFilter:
def __init__(self, assertions):
self._assertions = assertions
self.cost = 0
self.metrics = self.responses = None
def pass_assertions(self, context, response, **_):
"""Check if the response passes the assertions."""
responses = oai.Completion.extract_text(response)
metrics = eval_function_completions(responses, context["definition"], assertions=self._assertions)
self._assertions = metrics["assertions"]
self.cost += metrics["gen_cost"]
self.metrics = metrics
self.responses = responses
return metrics["succeed_assertions"]
def implement(
definition: str,
configs: Optional[List[Dict]] = None,
assertions: Optional[Union[str, Callable[[str], Tuple[str, float]]]] = generate_assertions,
) -> Tuple[str, float, int]:
"""Implement a function from a definition.
Args:
definition (str): The function definition, including the signature and docstr.
configs (list): The list of configurations for completion.
assertions (Optional, str or Callable): The assertion code which serves as a filter of the responses, or an assertion generator.
Returns:
str: The implementation.
float: The cost of the implementation.
int: The index of the configuration which generates the implementation.
"""
cost = 0
configs = configs or _IMPLEMENT_CONFIGS
if len(configs) > 1 and callable(assertions):
assertions, cost = assertions(definition)
assertion_filter = PassAssertionFilter(assertions)
response = oai.Completion.create(
{"definition": definition}, config_list=configs, filter_func=assertion_filter.pass_assertions
)
cost += assertion_filter.cost + response["cost"]
return assertion_filter.responses[assertion_filter.metrics["index_selected"]], cost, response["config_id"]
# for i, config in enumerate(configs):
# response = oai.Completion.create({"definition": definition}, **config)
# cost += oai.Completion.cost(response)
# responses = oai.Completion.extract_text(response)
# metrics = eval_function_completions(responses, definition, assertions=assertions)
# assertions = metrics["assertions"]
# cost += metrics["gen_cost"]
# if metrics["succeed_assertions"] or i == len(configs) - 1:
# return responses[metrics["index_selected"]], cost, i

345
flaml/autogen/math_utils.py Normal file

@ -0,0 +1,345 @@
from typing import Optional
from flaml.autogen import oai, DEFAULT_MODEL
_MATH_PROMPT = "{problem} Solve the problem carefully. Simplify your answer as much as possible. Put the final answer in \\boxed{{}}."
_MATH_CONFIG = {
"model": DEFAULT_MODEL,
"prompt": _MATH_PROMPT,
}
def solve_problem(problem: str, **config) -> str:
"""(Experimental) Solve the math problem.
Args:
problem (str): The problem statement.
config (Optional, dict): The configuration for the API call.
Returns:
str: The solution to the problem.
"""
params = {**_MATH_CONFIG, **config}
response = oai.Completion.create({"problem": problem}, **params)
results = eval_math_responses(oai.Completion.extract_text(response))
return results.get("voted_answer"), response["cost"]
def remove_boxed(string: str) -> Optional[str]:
"""Source: https://github.com/hendrycks/math
Extract the text within a \\boxed{...} environment.
Example:
    >>> remove_boxed("\\boxed{\\frac{2}{3}}")
\\frac{2}{3}
"""
left = "\\boxed{"
try:
assert string[: len(left)] == left
assert string[-1] == "}"
return string[len(left) : -1]
except Exception:
return None
def last_boxed_only_string(string: str) -> Optional[str]:
"""Source: https://github.com/hendrycks/math
Extract the last \\boxed{...} or \\fbox{...} element from a string.
"""
idx = string.rfind("\\boxed")
if idx < 0:
idx = string.rfind("\\fbox")
if idx < 0:
return None
i = idx
right_brace_idx = None
num_left_braces_open = 0
while i < len(string):
if string[i] == "{":
num_left_braces_open += 1
if string[i] == "}":
num_left_braces_open -= 1
if num_left_braces_open == 0:
right_brace_idx = i
break
i += 1
if right_brace_idx is None:
retval = None
else:
retval = string[idx : right_brace_idx + 1]
return retval
def _fix_fracs(string: str) -> str:
"""Source: https://github.com/hendrycks/math
Reformat fractions.
Examples:
>>> _fix_fracs("\\frac1b")
\frac{1}{b}
>>> _fix_fracs("\\frac12")
\frac{1}{2}
>>> _fix_fracs("\\frac1{72}")
\frac{1}{72}
"""
substrs = string.split("\\frac")
new_str = substrs[0]
if len(substrs) > 1:
substrs = substrs[1:]
for substr in substrs:
new_str += "\\frac"
if substr[0] == "{":
new_str += substr
else:
try:
assert len(substr) >= 2
except Exception:
return string
a = substr[0]
b = substr[1]
if b != "{":
if len(substr) > 2:
post_substr = substr[2:]
new_str += "{" + a + "}{" + b + "}" + post_substr
else:
new_str += "{" + a + "}{" + b + "}"
else:
if len(substr) > 2:
post_substr = substr[2:]
new_str += "{" + a + "}" + b + post_substr
else:
new_str += "{" + a + "}" + b
string = new_str
return string
def _fix_a_slash_b(string: str) -> str:
"""Source: https://github.com/hendrycks/math
Reformat fractions formatted as a/b to \\frac{a}{b}.
Example:
>>> _fix_a_slash_b("2/3")
\frac{2}{3}
"""
if len(string.split("/")) != 2:
return string
a_str = string.split("/")[0]
b_str = string.split("/")[1]
try:
a = int(a_str)
b = int(b_str)
assert string == "{}/{}".format(a, b)
new_string = "\\frac{" + str(a) + "}{" + str(b) + "}"
return new_string
except Exception:
return string
def _remove_right_units(string: str) -> str:
"""Source: https://github.com/hendrycks/math
Remove units (on the right).
"\\text{ " only ever occurs (at least in the val set) when describing units.
"""
if "\\text{ " in string:
splits = string.split("\\text{ ")
assert len(splits) == 2
return splits[0]
else:
return string
def _fix_sqrt(string: str) -> str:
"""Source: https://github.com/hendrycks/math
Reformat square roots.
Example:
>>> _fix_sqrt("\\sqrt3")
\\sqrt{3}
"""
if "\\sqrt" not in string:
return string
splits = string.split("\\sqrt")
new_string = splits[0]
for split in splits[1:]:
if split[0] != "{":
a = split[0]
new_substr = "\\sqrt{" + a + "}" + split[1:]
else:
new_substr = "\\sqrt" + split
new_string += new_substr
return new_string
def _strip_string(string: str) -> str:
"""Source: https://github.com/hendrycks/math
Apply the reformatting helper functions above.
"""
# linebreaks
string = string.replace("\n", "")
# print(string)
# remove inverse spaces
string = string.replace("\\!", "")
# print(string)
# replace \\ with \
string = string.replace("\\\\", "\\")
# print(string)
# replace tfrac and dfrac with frac
string = string.replace("tfrac", "frac")
string = string.replace("dfrac", "frac")
# print(string)
# remove \left and \right
string = string.replace("\\left", "")
string = string.replace("\\right", "")
# print(string)
# Remove circ (degrees)
string = string.replace("^{\\circ}", "")
string = string.replace("^\\circ", "")
# remove dollar signs
string = string.replace("\\$", "")
# remove units (on the right)
string = _remove_right_units(string)
# remove percentage
string = string.replace("\\%", "")
string = string.replace("%", "")
# " 0." equivalent to " ." and "{0." equivalent to "{." Alternatively, add "0" if "." is the start of the string
string = string.replace(" .", " 0.")
string = string.replace("{.", "{0.")
# if empty, return empty string
if len(string) == 0:
return string
if string[0] == ".":
string = "0" + string
# to consider: get rid of e.g. "k = " or "q = " at beginning
if len(string.split("=")) == 2:
if len(string.split("=")[0]) <= 2:
string = string.split("=")[1]
# fix sqrt3 --> sqrt{3}
string = _fix_sqrt(string)
# remove spaces
string = string.replace(" ", "")
# \frac1b or \frac12 --> \frac{1}{b} and \frac{1}{2}, etc.
# Even works with \frac1{72} (but not \frac{72}1).
# Also does a/b --> \\frac{a}{b}
string = _fix_fracs(string)
# manually change 0.5 --> \frac{1}{2}
if string == "0.5":
string = "\\frac{1}{2}"
# NOTE: X/Y changed to \frac{X}{Y} in dataset, but in simple cases fix in case the model output is X/Y
string = _fix_a_slash_b(string)
return string
def get_answer(solution: Optional[str]) -> Optional[str]:
if solution is None:
return None
last_boxed = last_boxed_only_string(solution)
if last_boxed is None:
return None
answer = remove_boxed(last_boxed)
if answer is None:
return None
return answer
def is_equiv(str1: Optional[str], str2: Optional[str]) -> float:
"""Returns (as a float) whether two strings containing math are equivalent up to differences of formatting in
- units
- fractions
- square roots
- superfluous LaTeX.
Source: https://github.com/hendrycks/math
"""
if str1 is None and str2 is None:
print("WARNING: Both None")
return 1.0
if str1 is None or str2 is None:
return 0.0
try:
ss1 = _strip_string(str1)
ss2 = _strip_string(str2)
return float(ss1 == ss2)
except Exception:
return float(str1 == str2)
def is_equiv_chain_of_thought(str1: str, str2: str) -> float:
"""Strips the solution first before calling `is_equiv`."""
ans1 = get_answer(str1)
ans2 = get_answer(str2)
return is_equiv(ans1, ans2)
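# Doctest-style examples: formatting differences are normalized before comparison.
#
#     >>> is_equiv("\\frac{1}{2}", "0.5")
#     1.0
#     >>> is_equiv_chain_of_thought("so the answer is \\boxed{2/3}.", "\\boxed{\\frac{2}{3}}")
#     1.0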
def voting_counts(responses):
    """Count the votes for each distinct answer among the responses."""
    answers = {}
for i in range(len(responses)):
equiv = i
if get_answer(responses[i]) is None:
# ignore None answers
continue
for j in answers:
if is_equiv_chain_of_thought(responses[i], responses[j]):
equiv = j
break
if equiv in answers:
answers[equiv] += 1
else:
answers[equiv] = 1
return answers
def eval_math_responses(responses, solution=None, **args):
"""Select a response for a math problem using voting, and check if the response is correct if the solution is provided.
Args:
responses (list): The list of responses.
solution (str): The canonical solution.
Returns:
dict: The success metrics.
"""
n = len(responses)
if not n:
return {
"expected_success": 0,
"success": False,
"success_vote": 0,
"voted_answer": None,
"votes": 0,
}
success_list = []
if solution is not None:
for i in range(n):
response = responses[i]
succeed = is_equiv_chain_of_thought(response, solution)
success_list.append(succeed)
# voting
answers = voting_counts(responses)
# find the answer with highest votes in answers
answer, votes = max(answers.items(), key=lambda x: x[1], default=(0, 0))
# check if the answer is correct
success_vote = is_equiv_chain_of_thought(responses[answer], solution)
return {
"expected_success": 1 - pow(1 - sum(success_list) / n, n),
"success": any(s for s in success_list),
"success_vote": success_vote,
"voted_answer": responses[answer],
"votes": votes,
}
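# Example: majority voting over three sampled responses with a known solution.
#
#     >>> metrics = eval_math_responses(
#     ...     ["\\boxed{2}", "\\boxed{2}", "\\boxed{3}"], solution="\\boxed{2}"
#     ... )
#     >>> metrics["voted_answer"], metrics["votes"], metrics["success_vote"]
#     ('\\boxed{2}', 2, 1.0)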

18
flaml/autogen/oai/__init__.py Normal file

@ -0,0 +1,18 @@
from flaml.autogen.oai.completion import Completion, ChatCompletion
from flaml.autogen.oai.openai_utils import (
get_config_list,
config_list_gpt4_gpt35,
config_list_openai_aoai,
config_list_from_models,
config_list_from_json,
)
__all__ = [
"Completion",
"ChatCompletion",
"get_config_list",
"config_list_gpt4_gpt35",
"config_list_openai_aoai",
"config_list_from_models",
"config_list_from_json",
]

flaml/autogen/oai/completion.py: file diff suppressed because it is too large

241
flaml/autogen/oai/openai_utils.py Normal file

@ -0,0 +1,241 @@
import os
import json
from typing import List, Optional, Dict, Set, Union
import logging
NON_CACHE_KEY = ["api_key", "api_base", "api_type", "api_version"]
def get_key(config):
"""Get a unique identifier of a configuration.
Args:
config (dict or list): A configuration.
Returns:
tuple: A unique identifier which can be used as a key for a dict.
"""
copied = False
for key in NON_CACHE_KEY:
if key in config:
            config, copied = (config.copy() if not copied else config), True
config.pop(key)
# if isinstance(config, dict):
# return tuple(get_key(x) for x in sorted(config.items()))
# if isinstance(config, list):
# return tuple(get_key(x) for x in config)
# return config
return json.dumps(config, sort_keys=True)
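# Example: credential fields are excluded, so configs differing only in api_key
# map to the same cache key.
#
#     >>> get_key({"model": "gpt-4", "api_key": "sk-xxx"})
#     '{"model": "gpt-4"}'
#     >>> get_key({"model": "gpt-4"})
#     '{"model": "gpt-4"}'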
def get_config_list(
api_keys: List, api_bases: Optional[List] = None, api_type: Optional[str] = None, api_version: Optional[str] = None
) -> List[Dict]:
"""Get a list of configs for openai api calls.
Args:
api_keys (list): The api keys for openai api calls.
api_bases (list, optional): The api bases for openai api calls.
api_type (str, optional): The api type for openai api calls.
api_version (str, optional): The api version for openai api calls.
"""
config_list = []
for i, api_key in enumerate(api_keys):
if not api_key.strip():
continue
config = {"api_key": api_key}
if api_bases:
config["api_base"] = api_bases[i]
if api_type:
config["api_type"] = api_type
if api_version:
config["api_version"] = api_version
config_list.append(config)
return config_list
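# Example:
#
#     >>> get_config_list(["key1", "key2"], api_type="azure", api_version="2023-06-01-preview")
#     [{'api_key': 'key1', 'api_type': 'azure', 'api_version': '2023-06-01-preview'}, {'api_key': 'key2', 'api_type': 'azure', 'api_version': '2023-06-01-preview'}]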
def config_list_openai_aoai(
key_file_path: Optional[str] = ".",
openai_api_key_file: Optional[str] = "key_openai.txt",
aoai_api_key_file: Optional[str] = "key_aoai.txt",
aoai_api_base_file: Optional[str] = "base_aoai.txt",
exclude: Optional[str] = None,
) -> List[Dict]:
"""Get a list of configs for openai + azure openai api calls.
Args:
key_file_path (str, optional): The path to the key files.
openai_api_key_file (str, optional): The file name of the openai api key.
aoai_api_key_file (str, optional): The file name of the azure openai api key.
aoai_api_base_file (str, optional): The file name of the azure openai api base.
exclude (str, optional): The api type to exclude, "openai" or "aoai".
Returns:
list: A list of configs for openai api calls.
"""
if "OPENAI_API_KEY" not in os.environ and exclude != "openai":
try:
with open(f"{key_file_path}/{openai_api_key_file}") as key_file:
os.environ["OPENAI_API_KEY"] = key_file.read().strip()
except FileNotFoundError:
logging.info(
"To use OpenAI API, please set OPENAI_API_KEY in os.environ "
"or create key_openai.txt in the specified path, or specify the api_key in config_list."
)
if "AZURE_OPENAI_API_KEY" not in os.environ and exclude != "aoai":
try:
with open(f"{key_file_path}/{aoai_api_key_file}") as key_file:
os.environ["AZURE_OPENAI_API_KEY"] = key_file.read().strip()
except FileNotFoundError:
logging.info(
"To use Azure OpenAI API, please set AZURE_OPENAI_API_KEY in os.environ "
"or create key_aoai.txt in the specified path, or specify the api_key in config_list."
)
if "AZURE_OPENAI_API_BASE" not in os.environ and exclude != "aoai":
try:
with open(f"{key_file_path}/{aoai_api_base_file}") as key_file:
os.environ["AZURE_OPENAI_API_BASE"] = key_file.read().strip()
except FileNotFoundError:
logging.info(
"To use Azure OpenAI API, please set AZURE_OPENAI_API_BASE in os.environ "
"or create base_aoai.txt in the specified path, or specify the api_base in config_list."
)
aoai_config = (
get_config_list(
            # Assuming Azure OpenAI api keys in os.environ["AZURE_OPENAI_API_KEY"], one per line
api_keys=os.environ.get("AZURE_OPENAI_API_KEY", "").split("\n"),
            # Assuming Azure OpenAI api bases in os.environ["AZURE_OPENAI_API_BASE"], one per line
api_bases=os.environ.get("AZURE_OPENAI_API_BASE", "").split("\n"),
api_type="azure",
api_version="2023-06-01-preview", # change if necessary
)
if exclude != "aoai"
else []
)
openai_config = (
get_config_list(
# Assuming OpenAI API_KEY in os.environ["OPENAI_API_KEY"]
api_keys=os.environ.get("OPENAI_API_KEY", "").split("\n"),
# "api_type": "open_ai",
# "api_base": "https://api.openai.com/v1",
)
if exclude != "openai"
else []
)
config_list = openai_config + aoai_config
return config_list
def config_list_from_models(
key_file_path: Optional[str] = ".",
openai_api_key_file: Optional[str] = "key_openai.txt",
aoai_api_key_file: Optional[str] = "key_aoai.txt",
aoai_api_base_file: Optional[str] = "base_aoai.txt",
exclude: Optional[str] = None,
model_list: Optional[list] = None,
) -> List[Dict]:
"""Get a list of configs for api calls with models in the model list.
Args:
key_file_path (str, optional): The path to the key files.
openai_api_key_file (str, optional): The file name of the openai api key.
aoai_api_key_file (str, optional): The file name of the azure openai api key.
aoai_api_base_file (str, optional): The file name of the azure openai api base.
exclude (str, optional): The api type to exclude, "openai" or "aoai".
model_list (list, optional): The model list.
Returns:
list: A list of configs for openai api calls.
"""
config_list = config_list_openai_aoai(
key_file_path,
openai_api_key_file,
aoai_api_key_file,
aoai_api_base_file,
exclude,
)
if model_list:
config_list = [{**config, "model": model} for model in model_list for config in config_list]
return config_list
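# The model expansion above is a cross product: every model in model_list is
# paired with every base config. A self-contained sketch of that step:
def _example_model_expansion():
    base = [{"api_key": "k1"}, {"api_key": "k2"}]  # placeholder configs
    models = ["gpt-4", "gpt-3.5-turbo"]
    expanded = [{**config, "model": model} for model in models for config in base]
    assert len(expanded) == len(base) * len(models)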
def config_list_gpt4_gpt35(
key_file_path: Optional[str] = ".",
openai_api_key_file: Optional[str] = "key_openai.txt",
aoai_api_key_file: Optional[str] = "key_aoai.txt",
aoai_api_base_file: Optional[str] = "base_aoai.txt",
exclude: Optional[str] = None,
) -> List[Dict]:
"""Get a list of configs for gpt-4 followed by gpt-3.5 api calls.
Args:
key_file_path (str, optional): The path to the key files.
openai_api_key_file (str, optional): The file name of the openai api key.
aoai_api_key_file (str, optional): The file name of the azure openai api key.
aoai_api_base_file (str, optional): The file name of the azure openai api base.
exclude (str, optional): The api type to exclude, "openai" or "aoai".
Returns:
list: A list of configs for openai api calls.
"""
return config_list_from_models(
key_file_path,
openai_api_key_file,
aoai_api_key_file,
aoai_api_base_file,
exclude,
model_list=["gpt-4", "gpt-3.5-turbo"],
)
def filter_config(config_list, filter_dict):
"""Filter the config list by provider and model.
Args:
config_list (list): The config list.
filter_dict (dict, optional): The filter dict with keys corresponding to a field in each config,
and values corresponding to lists of acceptable values for each key.
Returns:
list: The filtered config list.
"""
if filter_dict:
config_list = [
config for config in config_list if all(config.get(key) in value for key, value in filter_dict.items())
]
return config_list
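# A filtering sketch: None in an acceptable-value list matches configs that
# lack the key entirely, mirroring the config_list_from_json docstring below.
def _example_filter_config():
    configs = [
        {"model": "gpt-4", "api_type": "azure"},
        {"model": "gpt-4"},  # no api_type key -> config.get returns None
        {"model": "gpt-3.5-turbo"},
    ]
    kept = filter_config(configs, {"model": ["gpt-4"], "api_type": ["azure", None]})
    assert len(kept) == 2  # the gpt-3.5-turbo config is filtered out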
def config_list_from_json(
env_or_file: str,
file_location: Optional[str] = "",
filter_dict: Optional[Dict[str, Union[List[Union[str, None]], Set[Union[str, None]]]]] = None,
) -> List[Dict]:
"""Get a list of configs from a json parsed from an env variable or a file.
Args:
env_or_file (str): The env variable name or file name.
file_location (str, optional): The file location.
filter_dict (dict, optional): The filter dict with keys corresponding to a field in each config,
and values corresponding to lists of acceptable values for each key.
e.g.,
```python
filter_dict = {
"api_type": ["open_ai", None], # None means a missing key is acceptable
"model": ["gpt-3.5-turbo", "gpt-4"],
}
```
Returns:
list: A list of configs for openai api calls.
"""
json_str = os.environ.get(env_or_file)
if json_str:
config_list = json.loads(json_str)
else:
try:
with open(os.path.join(file_location, env_or_file)) as json_file:
config_list = json.load(json_file)
except FileNotFoundError:
return []
return filter_config(config_list, filter_dict)
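# A usage sketch; "OAI_CONFIG_LIST" is an illustrative env variable (or file)
# name holding a JSON list of configs.
def _example_config_list_from_json():
    return config_list_from_json(
        "OAI_CONFIG_LIST",
        filter_dict={"model": ["gpt-3.5-turbo", "gpt-4"]},
    )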

View File

@ -0,0 +1,242 @@
from typing import List, Union, Dict, Tuple
import os
import requests
from urllib.parse import urlparse
import glob
import tiktoken
import chromadb
from chromadb.api import API
import chromadb.utils.embedding_functions as ef
import logging
logger = logging.getLogger(__name__)
TEXT_FORMATS = ["txt", "json", "csv", "tsv", "md", "html", "htm", "rtf", "rst", "jsonl", "log", "xml", "yaml", "yml"]
def num_tokens_from_text(
text: str, model: str = "gpt-3.5-turbo-0613", return_tokens_per_name_and_message: bool = False
) -> Union[int, Tuple[int, int, int]]:
"""Return the number of tokens used by a text."""
# https://github.com/openai/openai-cookbook/blob/main/examples/How_to_count_tokens_with_tiktoken.ipynb
try:
encoding = tiktoken.encoding_for_model(model)
except KeyError:
logger.debug("Warning: model not found. Using cl100k_base encoding.")
encoding = tiktoken.get_encoding("cl100k_base")
if model in {
"gpt-3.5-turbo-0613",
"gpt-3.5-turbo-16k-0613",
"gpt-4-0314",
"gpt-4-32k-0314",
"gpt-4-0613",
"gpt-4-32k-0613",
}:
tokens_per_message = 3
tokens_per_name = 1
elif model == "gpt-3.5-turbo-0301":
tokens_per_message = 4 # every message follows <|start|>{role/name}\n{content}<|end|>\n
tokens_per_name = -1 # if there's a name, the role is omitted
elif "gpt-3.5-turbo" in model or "gpt-35-turbo" in model:
print("Warning: gpt-3.5-turbo may update over time. Returning num tokens assuming gpt-3.5-turbo-0613.")
return num_tokens_from_text(text, model="gpt-3.5-turbo-0613")
elif "gpt-4" in model:
print("Warning: gpt-4 may update over time. Returning num tokens assuming gpt-4-0613.")
return num_tokens_from_text(text, model="gpt-4-0613")
else:
raise NotImplementedError(
f"""num_tokens_from_text() is not implemented for model {model}. See """
f"""https://github.com/openai/openai-python/blob/main/chatml.md for information on how messages are """
f"""converted to tokens."""
)
if return_tokens_per_name_and_message:
return len(encoding.encode(text)), tokens_per_message, tokens_per_name
else:
return len(encoding.encode(text))
def num_tokens_from_messages(messages: List[Dict], model: str = "gpt-3.5-turbo-0613"):
"""Return the number of tokens used by a list of messages."""
num_tokens = 0
for message in messages:
for key, value in message.items():
_num_tokens, tokens_per_message, tokens_per_name = num_tokens_from_text(
value, model=model, return_tokens_per_name_and_message=True
)
num_tokens += _num_tokens
if key == "name":
num_tokens += tokens_per_name
num_tokens += tokens_per_message
num_tokens += 3 # every reply is primed with <|start|>assistant<|message|>
return num_tokens
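# A counting sketch for a short chat (requires tiktoken); the total includes
# the per-message framing overhead handled above.
def _example_num_tokens_from_messages():
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"},
    ]
    return num_tokens_from_messages(messages, model="gpt-3.5-turbo-0613")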
def split_text_to_chunks(
text: str,
max_tokens: int = 4000,
chunk_mode: str = "multi_lines",
must_break_at_empty_line: bool = True,
overlap: int = 10,
):
"""Split a long text into chunks of max_tokens."""
assert chunk_mode in {"one_line", "multi_lines"}
if chunk_mode == "one_line":
must_break_at_empty_line = False
chunks = []
lines = text.split("\n")
lines_tokens = [num_tokens_from_text(line) for line in lines]
sum_tokens = sum(lines_tokens)
while sum_tokens > max_tokens:
if chunk_mode == "one_line":
estimated_line_cut = 2
else:
estimated_line_cut = int(max_tokens / sum_tokens * len(lines)) + 1
cnt = 0
prev = ""
for cnt in reversed(range(estimated_line_cut)):
if must_break_at_empty_line and lines[cnt].strip() != "":
continue
if sum(lines_tokens[:cnt]) <= max_tokens:
prev = "\n".join(lines[:cnt])
break
if cnt == 0:
logger.warning(
f"max_tokens is too small to fit a single line of text. Breaking this line:\n\t{lines[0][:100]} ..."
)
if not must_break_at_empty_line:
split_len = int(max_tokens / lines_tokens[0] * 0.9 * len(lines[0]))
prev = lines[0][:split_len]
lines[0] = lines[0][split_len:]
lines_tokens[0] = num_tokens_from_text(lines[0])
else:
logger.warning("Failed to split docs with must_break_at_empty_line being True, set to False.")
must_break_at_empty_line = False
        if len(prev) > 10:  # don't add chunks shorter than 10 characters
            chunks.append(prev)
lines = lines[cnt:]
lines_tokens = lines_tokens[cnt:]
sum_tokens = sum(lines_tokens)
text_to_chunk = "\n".join(lines)
    if len(text_to_chunk) > 10:  # don't add chunks shorter than 10 characters
        chunks.append(text_to_chunk)
return chunks
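# A chunking sketch: paragraphs separated by empty lines let the splitter honor
# must_break_at_empty_line while keeping each chunk near the token budget.
def _example_split_text_to_chunks():
    long_text = "\n\n".join("word " * 200 for _ in range(20))
    return split_text_to_chunks(long_text, max_tokens=500)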
def split_files_to_chunks(
files: list, max_tokens: int = 4000, chunk_mode: str = "multi_lines", must_break_at_empty_line: bool = True
):
"""Split a list of files into chunks of max_tokens."""
chunks = []
for file in files:
with open(file, "r") as f:
text = f.read()
chunks += split_text_to_chunks(text, max_tokens, chunk_mode, must_break_at_empty_line)
return chunks
def get_files_from_dir(dir_path: str, types: list = TEXT_FORMATS, recursive: bool = True):
"""Return a list of all the files in a given directory."""
if len(types) == 0:
raise ValueError("types cannot be empty.")
types = [t[1:].lower() if t.startswith(".") else t.lower() for t in set(types)]
types += [t.upper() for t in types]
# If the path is a file, return it
if os.path.isfile(dir_path):
return [dir_path]
# If the path is a url, download it and return the downloaded file
if is_url(dir_path):
return [get_file_from_url(dir_path)]
files = []
if os.path.exists(dir_path):
        for suffix in types:
            if recursive:
                files += glob.glob(os.path.join(dir_path, f"**/*.{suffix}"), recursive=True)
            else:
                files += glob.glob(os.path.join(dir_path, f"*.{suffix}"), recursive=False)
else:
logger.error(f"Directory {dir_path} does not exist.")
raise ValueError(f"Directory {dir_path} does not exist.")
return files
def get_file_from_url(url: str, save_path: str = None):
"""Download a file from a URL."""
    if save_path is None:
        os.makedirs("/tmp/chromadb", exist_ok=True)  # ensure the default directory exists
        save_path = os.path.join("/tmp/chromadb", os.path.basename(url))
with requests.get(url, stream=True) as r:
r.raise_for_status()
with open(save_path, "wb") as f:
for chunk in r.iter_content(chunk_size=8192):
f.write(chunk)
return save_path
def is_url(string: str):
"""Return True if the string is a valid URL."""
try:
result = urlparse(string)
return all([result.scheme, result.netloc])
except ValueError:
return False
def create_vector_db_from_dir(
dir_path: str,
max_tokens: int = 4000,
client: API = None,
db_path: str = "/tmp/chromadb.db",
collection_name: str = "all-my-documents",
get_or_create: bool = False,
chunk_mode: str = "multi_lines",
must_break_at_empty_line: bool = True,
embedding_model: str = "all-MiniLM-L6-v2",
):
"""Create a vector db from all the files in a given directory."""
if client is None:
client = chromadb.PersistentClient(path=db_path)
try:
embedding_function = ef.SentenceTransformerEmbeddingFunction(embedding_model)
collection = client.create_collection(
collection_name,
get_or_create=get_or_create,
embedding_function=embedding_function,
# https://github.com/nmslib/hnswlib#supported-distances
# https://github.com/chroma-core/chroma/blob/566bc80f6c8ee29f7d99b6322654f32183c368c4/chromadb/segment/impl/vector/local_hnsw.py#L184
# https://github.com/nmslib/hnswlib/blob/master/ALGO_PARAMS.md
metadata={"hnsw:space": "ip", "hnsw:construction_ef": 30, "hnsw:M": 32}, # ip, l2, cosine
)
chunks = split_files_to_chunks(get_files_from_dir(dir_path), max_tokens, chunk_mode, must_break_at_empty_line)
# updates existing items, or adds them if they don't yet exist.
collection.upsert(
documents=chunks, # we handle tokenization, embedding, and indexing automatically. You can skip that and add your own embeddings as well
ids=[f"doc_{i}" for i in range(len(chunks))], # unique for each doc
)
except ValueError as e:
logger.warning(f"{e}")
def query_vector_db(
query_texts: List[str],
n_results: int = 10,
client: API = None,
db_path: str = "/tmp/chromadb.db",
collection_name: str = "all-my-documents",
search_string: str = "",
embedding_model: str = "all-MiniLM-L6-v2",
) -> Dict[str, List[str]]:
"""Query a vector db."""
if client is None:
client = chromadb.PersistentClient(path=db_path)
    # The collection's embedding function is always the default one, but we want to use the one
    # used to create the collection. So we compute the embeddings ourselves and pass them to the query function.
collection = client.get_collection(collection_name)
embedding_function = ef.SentenceTransformerEmbeddingFunction(embedding_model)
query_embeddings = embedding_function(query_texts)
# Query/search n most similar results. You can also .get by id
results = collection.query(
query_embeddings=query_embeddings,
n_results=n_results,
where_document={"$contains": search_string} if search_string else None, # optional filter
)
return results
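# An end-to-end sketch (paths and names are placeholders; requires chromadb and
# sentence-transformers): index a directory, then retrieve the top matches.
def _example_build_and_query():
    create_vector_db_from_dir(
        dir_path="./docs",
        max_tokens=1000,
        db_path="/tmp/chromadb_demo.db",
        collection_name="my-docs",
        get_or_create=True,
    )
    results = query_vector_db(
        query_texts=["How do I configure the API?"],
        n_results=3,
        db_path="/tmp/chromadb_demo.db",
        collection_name="my-docs",
    )
    return results["documents"]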

5
flaml/automl/__init__.py Normal file
View File

@ -0,0 +1,5 @@
from flaml.automl.automl import AutoML, size
from flaml.automl.logger import logger_formatter
from flaml.automl.state import SearchState, AutoMLState
__all__ = ["AutoML", "AutoMLState", "SearchState", "logger_formatter", "size"]

2703
flaml/automl/automl.py Normal file

File diff suppressed because it is too large

443
flaml/automl/data.py Normal file
View File

@ -0,0 +1,443 @@
# !
# * Copyright (c) Microsoft Corporation. All rights reserved.
# * Licensed under the MIT License. See LICENSE file in the
# * project root for license information.
import numpy as np
from datetime import datetime
from typing import TYPE_CHECKING, Union
import os
from flaml.automl.training_log import training_log_reader
from flaml.automl.spark import ps, psDataFrame, psSeries, DataFrame, Series, pd
try:
from scipy.sparse import vstack, issparse
except ImportError:
pass
if TYPE_CHECKING:
from flaml.automl.task import Task
TS_TIMESTAMP_COL = "ds"
TS_VALUE_COL = "y"
def load_openml_dataset(dataset_id, data_dir=None, random_state=0, dataset_format="dataframe"):
"""Load dataset from open ML.
If the file is not cached locally, download it from open ML.
Args:
dataset_id: An integer of the dataset id in openml.
data_dir: A string of the path to store and load the data.
random_state: An integer of the random seed for splitting data.
dataset_format: A string specifying the format of returned dataset. Default is 'dataframe'.
Can choose from ['dataframe', 'array'].
If 'dataframe', the returned dataset will be a Pandas DataFrame.
If 'array', the returned dataset will be a NumPy array or a SciPy sparse matrix.
Returns:
X_train: Training data.
X_test: Test data.
y_train: A series or array of labels for training data.
y_test: A series or array of labels for test data.
"""
import openml
import pickle
from sklearn.model_selection import train_test_split
filename = "openml_ds" + str(dataset_id) + ".pkl"
filepath = os.path.join(data_dir, filename)
if os.path.isfile(filepath):
print("load dataset from", filepath)
with open(filepath, "rb") as f:
dataset = pickle.load(f)
else:
print("download dataset from openml")
dataset = openml.datasets.get_dataset(dataset_id)
if not os.path.exists(data_dir):
os.makedirs(data_dir)
with open(filepath, "wb") as f:
pickle.dump(dataset, f, pickle.HIGHEST_PROTOCOL)
print("Dataset name:", dataset.name)
try:
X, y, *__ = dataset.get_data(target=dataset.default_target_attribute, dataset_format=dataset_format)
except ValueError:
from sklearn.datasets import fetch_openml
X, y = fetch_openml(data_id=dataset_id, return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=random_state)
print(
"X_train.shape: {}, y_train.shape: {};\nX_test.shape: {}, y_test.shape: {}".format(
X_train.shape,
y_train.shape,
X_test.shape,
y_test.shape,
)
)
return X_train, X_test, y_train, y_test
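# A usage sketch; 1169 is a placeholder OpenML dataset id and ./openml is a
# placeholder cache directory (requires the openml package).
def _example_load_openml_dataset():
    return load_openml_dataset(dataset_id=1169, data_dir="./openml")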
def load_openml_task(task_id, data_dir):
"""Load task from open ML.
Use the first fold of the task.
If the file is not cached locally, download it from open ML.
Args:
task_id: An integer of the task id in openml.
data_dir: A string of the path to store and load the data.
Returns:
X_train: A dataframe of training data.
X_test: A dataframe of test data.
y_train: A series of labels for training data.
y_test: A series of labels for test data.
"""
import openml
import pickle
task = openml.tasks.get_task(task_id)
filename = "openml_task" + str(task_id) + ".pkl"
filepath = os.path.join(data_dir, filename)
if os.path.isfile(filepath):
print("load dataset from", filepath)
with open(filepath, "rb") as f:
dataset = pickle.load(f)
else:
print("download dataset from openml")
dataset = task.get_dataset()
with open(filepath, "wb") as f:
pickle.dump(dataset, f, pickle.HIGHEST_PROTOCOL)
X, y, _, _ = dataset.get_data(task.target_name)
train_indices, test_indices = task.get_train_test_split_indices(
repeat=0,
fold=0,
sample=0,
)
X_train = X.iloc[train_indices]
y_train = y[train_indices]
X_test = X.iloc[test_indices]
y_test = y[test_indices]
print(
"X_train.shape: {}, y_train.shape: {},\nX_test.shape: {}, y_test.shape: {}".format(
X_train.shape,
y_train.shape,
X_test.shape,
y_test.shape,
)
)
return X_train, X_test, y_train, y_test
def get_output_from_log(filename, time_budget):
"""Get output from log file.
Args:
filename: A string of the log file name.
time_budget: A float of the time budget in seconds.
Returns:
search_time_list: A list of the finished time of each logged iter.
best_error_list: A list of the best validation error after each logged iter.
error_list: A list of the validation error of each logged iter.
config_list: A list of the estimator, sample size and config of each logged iter.
logged_metric_list: A list of the logged metric of each logged iter.
"""
best_config = None
best_learner = None
best_val_loss = float("+inf")
search_time_list = []
config_list = []
best_error_list = []
error_list = []
logged_metric_list = []
best_config_list = []
with training_log_reader(filename) as reader:
for record in reader.records():
time_used = record.wall_clock_time
val_loss = record.validation_loss
config = record.config
learner = record.learner.split("_")[0]
sample_size = record.sample_size
metric = record.logged_metric
if time_used < time_budget and np.isfinite(val_loss):
if val_loss < best_val_loss:
best_val_loss = val_loss
best_config = config
best_learner = learner
best_config_list.append(best_config)
search_time_list.append(time_used)
best_error_list.append(best_val_loss)
logged_metric_list.append(metric)
error_list.append(val_loss)
config_list.append(
{
"Current Learner": learner,
"Current Sample": sample_size,
"Current Hyper-parameters": record.config,
"Best Learner": best_learner,
"Best Hyper-parameters": best_config,
}
)
return (
search_time_list,
best_error_list,
error_list,
config_list,
logged_metric_list,
)
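# A plotting sketch (the log file name is a placeholder; requires matplotlib):
# draw the best validation error found over wall clock time.
def _example_plot_learning_curve():
    import matplotlib.pyplot as plt

    times, best_errors, *_ = get_output_from_log("flaml.log", time_budget=60)
    plt.step(times, best_errors, where="post")
    plt.xlabel("wall clock time (s)")
    plt.ylabel("best validation error")
    plt.show()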
def concat(X1, X2):
"""concatenate two matrices vertically."""
if type(X1) != type(X2):
if isinstance(X2, (psDataFrame, psSeries)):
X1 = ps.from_pandas(pd.DataFrame(X1))
elif isinstance(X1, (psDataFrame, psSeries)):
X2 = ps.from_pandas(pd.DataFrame(X2))
else:
X1 = pd.DataFrame(X1)
X2 = pd.DataFrame(X2)
if isinstance(X1, (DataFrame, Series)):
df = pd.concat([X1, X2], sort=False)
df.reset_index(drop=True, inplace=True)
if isinstance(X1, DataFrame):
cat_columns = X1.select_dtypes(include="category").columns
if len(cat_columns):
df[cat_columns] = df[cat_columns].astype("category")
return df
if isinstance(X1, (psDataFrame, psSeries)):
df = ps.concat([X1, X2], ignore_index=True)
if isinstance(X1, psDataFrame):
cat_columns = X1.select_dtypes(include="category").columns.values.tolist()
if len(cat_columns):
df[cat_columns] = df[cat_columns].astype("category")
return df
if issparse(X1):
return vstack((X1, X2))
else:
return np.concatenate([X1, X2])
def add_time_idx_col(X):
unique_dates = X[TS_TIMESTAMP_COL].drop_duplicates().sort_values(ascending=True)
# assume no missing timestamps
freq = pd.infer_freq(unique_dates)
if freq == "MS":
X["time_idx"] = X[TS_TIMESTAMP_COL].dt.year * 12 + X[TS_TIMESTAMP_COL].dt.month
elif freq == "Y":
X["time_idx"] = X[TS_TIMESTAMP_COL].dt.year
else:
# using time frequency to generate all time stamps and then indexing for time_idx
# full_range = pd.date_range(X[TS_TIMESTAMP_COL].min(), X[TS_TIMESTAMP_COL].max(), freq=freq).to_list()
# X["time_idx"] = [full_range.index(time) for time in X[TS_TIMESTAMP_COL]]
# taking minimum difference in timestamp
timestamps = unique_dates.view("int64")
freq = int(timestamps.diff().mode())
X["time_idx"] = timestamps - timestamps.min() / freq
X["time_idx"] = X["time_idx"].astype("int")
return X
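# A worked sketch of the integer index above: with daily timestamps the minimum
# step is one day in nanoseconds, so the indices come out as 0, 1, 2, ...
def _example_time_idx():
    ts = pd.Series(pd.date_range("2023-01-01", periods=3, freq="D")).astype("int64")
    step = int(ts.diff().mode().iloc[0])
    return (ts - ts.min()) // step  # 0, 1, 2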
class DataTransformer:
"""Transform input training data."""
def fit_transform(self, X: Union[DataFrame, np.ndarray], y, task: Union[str, "Task"]):
"""Fit transformer and process the input training data according to the task type.
Args:
X: A numpy array or a pandas dataframe of training data.
y: A numpy array or a pandas series of labels.
task: An instance of type Task, or a str such as 'classification', 'regression'.
Returns:
X: Processed numpy array or pandas dataframe of training data.
y: Processed numpy array or pandas series of labels.
"""
if isinstance(task, str):
from flaml.automl.task.factory import task_factory
task = task_factory(task, X, y)
if task.is_nlp():
# if the mode is NLP, check the type of input, each column must be either string or
# ids (input ids, token type id, attention mask, etc.)
str_columns = []
for column in X.columns:
if isinstance(X[column].iloc[0], str):
str_columns.append(column)
if len(str_columns) > 0:
X[str_columns] = X[str_columns].astype("string")
self._str_columns = str_columns
elif isinstance(X, DataFrame):
X = X.copy()
n = X.shape[0]
cat_columns, num_columns, datetime_columns = [], [], []
drop = False
if task.is_ts_forecast():
X = X.rename(columns={X.columns[0]: TS_TIMESTAMP_COL})
if task.is_ts_forecastpanel():
if "time_idx" not in X:
X = add_time_idx_col(X)
ds_col = X.pop(TS_TIMESTAMP_COL)
if isinstance(y, Series):
y = y.rename(TS_VALUE_COL)
for column in X.columns:
# sklearn\utils\validation.py needs int/float values
if X[column].dtype.name in ("object", "category"):
if X[column].nunique() == 1 or X[column].nunique(dropna=True) == n - X[column].isnull().sum():
X.drop(columns=column, inplace=True)
drop = True
elif X[column].dtype.name == "category":
current_categories = X[column].cat.categories
if "__NAN__" not in current_categories:
X[column] = X[column].cat.add_categories("__NAN__").fillna("__NAN__")
cat_columns.append(column)
else:
X[column] = X[column].fillna("__NAN__")
cat_columns.append(column)
elif X[column].nunique(dropna=True) < 2:
X.drop(columns=column, inplace=True)
drop = True
else: # datetime or numeric
if X[column].dtype.name == "datetime64[ns]":
tmp_dt = X[column].dt
new_columns_dict = {
f"year_{column}": tmp_dt.year,
f"month_{column}": tmp_dt.month,
f"day_{column}": tmp_dt.day,
f"hour_{column}": tmp_dt.hour,
f"minute_{column}": tmp_dt.minute,
f"second_{column}": tmp_dt.second,
f"dayofweek_{column}": tmp_dt.dayofweek,
f"dayofyear_{column}": tmp_dt.dayofyear,
f"quarter_{column}": tmp_dt.quarter,
}
for key, value in new_columns_dict.items():
if key not in X.columns and value.nunique(dropna=False) >= 2:
X[key] = value
num_columns.append(key)
X[column] = X[column].map(datetime.toordinal)
datetime_columns.append(column)
del tmp_dt
X[column] = X[column].fillna(np.nan)
num_columns.append(column)
X = X[cat_columns + num_columns]
if task.is_ts_forecast():
X.insert(0, TS_TIMESTAMP_COL, ds_col)
if cat_columns:
X[cat_columns] = X[cat_columns].astype("category")
if num_columns:
X_num = X[num_columns]
if np.issubdtype(X_num.columns.dtype, np.integer) and (
drop or min(X_num.columns) != 0 or max(X_num.columns) != X_num.shape[1] - 1
):
X_num.columns = range(X_num.shape[1])
drop = True
else:
drop = False
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
self.transformer = ColumnTransformer(
[
(
"continuous",
SimpleImputer(missing_values=np.nan, strategy="median"),
X_num.columns,
)
]
)
X[num_columns] = self.transformer.fit_transform(X_num)
self._cat_columns, self._num_columns, self._datetime_columns = (
cat_columns,
num_columns,
datetime_columns,
)
self._drop = drop
        if task.is_classification() or (not pd.api.types.is_numeric_dtype(y) and not task.is_nlg()):
if not task.is_token_classification():
from sklearn.preprocessing import LabelEncoder
self.label_transformer = LabelEncoder()
else:
from flaml.automl.nlp.utils import LabelEncoderforTokenClassification
self.label_transformer = LabelEncoderforTokenClassification()
y = self.label_transformer.fit_transform(y)
else:
self.label_transformer = None
self._task = task
return X, y
def transform(self, X: Union[DataFrame, np.array]):
"""Process data using fit transformer.
Args:
X: A numpy array or a pandas dataframe of training data.
Returns:
X: Processed numpy array or pandas dataframe of training data.
"""
X = X.copy()
if self._task.is_nlp():
# if the mode is NLP, check the type of input, each column must be either string or
# ids (input ids, token type id, attention mask, etc.)
if len(self._str_columns) > 0:
X[self._str_columns] = X[self._str_columns].astype("string")
elif isinstance(X, DataFrame):
cat_columns, num_columns, datetime_columns = (
self._cat_columns,
self._num_columns,
self._datetime_columns,
)
if self._task.is_ts_forecast():
X = X.rename(columns={X.columns[0]: TS_TIMESTAMP_COL})
ds_col = X.pop(TS_TIMESTAMP_COL)
for column in datetime_columns:
tmp_dt = X[column].dt
new_columns_dict = {
f"year_{column}": tmp_dt.year,
f"month_{column}": tmp_dt.month,
f"day_{column}": tmp_dt.day,
f"hour_{column}": tmp_dt.hour,
f"minute_{column}": tmp_dt.minute,
f"second_{column}": tmp_dt.second,
f"dayofweek_{column}": tmp_dt.dayofweek,
f"dayofyear_{column}": tmp_dt.dayofyear,
f"quarter_{column}": tmp_dt.quarter,
}
for new_col_name, new_col_value in new_columns_dict.items():
if new_col_name not in X.columns and new_col_name in num_columns:
X[new_col_name] = new_col_value
X[column] = X[column].map(datetime.toordinal)
del tmp_dt
X = X[cat_columns + num_columns].copy()
if self._task.is_ts_forecast():
X.insert(0, TS_TIMESTAMP_COL, ds_col)
for column in cat_columns:
if X[column].dtype.name == "object":
X[column] = X[column].fillna("__NAN__")
elif X[column].dtype.name == "category":
current_categories = X[column].cat.categories
if "__NAN__" not in current_categories:
X[column] = X[column].cat.add_categories("__NAN__").fillna("__NAN__")
if cat_columns:
X[cat_columns] = X[cat_columns].astype("category")
if num_columns:
X_num = X[num_columns].fillna(np.nan)
if self._drop:
X_num.columns = range(X_num.shape[1])
X[num_columns] = self.transformer.transform(X_num)
return X
def group_counts(groups):
_, i, c = np.unique(groups, return_counts=True, return_index=True)
return c[np.argsort(i)]
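# A small sketch: counts are returned in order of first appearance, not in
# sorted-label order.
def _example_group_counts():
    groups = np.array(["b", "b", "a", "a", "a", "c"])
    return group_counts(groups)  # array([2, 3, 1]): b first, then a, then c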

7
flaml/automl/logger.py Normal file
View File

@ -0,0 +1,7 @@
import logging
logger = logging.getLogger(__name__)
logger_formatter = logging.Formatter(
"[%(name)s: %(asctime)s] {%(lineno)d} %(levelname)s - %(message)s", "%m-%d %H:%M:%S"
)
logger.propagate = False

606
flaml/automl/ml.py Normal file
View File

@ -0,0 +1,606 @@
# !
# * Copyright (c) FLAML authors. All rights reserved.
# * Licensed under the MIT License. See LICENSE file in the
# * project root for license information.
import time
from typing import Union, Callable, TypeVar, Optional, Tuple
import logging
import numpy as np
from flaml.automl.data import group_counts
from flaml.automl.task.task import Task
from flaml.automl.model import BaseEstimator, TransformersEstimator
from flaml.automl.spark import psDataFrame, psSeries, ERROR as SPARK_ERROR, Series, DataFrame
try:
from sklearn.metrics import (
mean_squared_error,
r2_score,
roc_auc_score,
accuracy_score,
mean_absolute_error,
log_loss,
average_precision_score,
f1_score,
mean_absolute_percentage_error,
ndcg_score,
)
except ImportError:
pass
if SPARK_ERROR is None:
from flaml.automl.spark.metrics import spark_metric_loss_score
from flaml.automl.time_series import TimeSeriesDataset
logger = logging.getLogger(__name__)
EstimatorSubclass = TypeVar("EstimatorSubclass", bound=BaseEstimator)
sklearn_metric_name_set = {
"r2",
"rmse",
"mae",
"mse",
"accuracy",
"roc_auc",
"roc_auc_ovr",
"roc_auc_ovo",
"roc_auc_weighted",
"roc_auc_ovr_weighted",
"roc_auc_ovo_weighted",
"log_loss",
"mape",
"f1",
"ap",
"ndcg",
"micro_f1",
"macro_f1",
}
huggingface_metric_to_mode = {
"accuracy": "max",
"bertscore": "max",
"bleu": "max",
"bleurt": "max",
"cer": "min",
"chrf": "min",
"code_eval": "max",
"comet": "max",
"competition_math": "max",
"coval": "max",
"cuad": "max",
"f1": "max",
"gleu": "max",
"google_bleu": "max",
"matthews_correlation": "max",
"meteor": "max",
"pearsonr": "max",
"precision": "max",
"recall": "max",
"rouge": "max",
"sacrebleu": "max",
"sari": "max",
"seqeval": "max",
"spearmanr": "max",
"ter": "min",
"wer": "min",
}
huggingface_submetric_to_metric = {"rouge1": "rouge", "rouge2": "rouge"}
def metric_loss_score(
metric_name: str,
y_processed_predict,
y_processed_true,
labels=None,
sample_weight=None,
groups=None,
):
# y_processed_predict and y_processed_true are processed id labels if the original were the token labels
if isinstance(y_processed_predict, (psDataFrame, psSeries)):
return spark_metric_loss_score(
metric_name,
y_processed_predict,
y_processed_true,
sample_weight,
groups,
)
elif is_in_sklearn_metric_name_set(metric_name):
return sklearn_metric_loss_score(
metric_name,
y_processed_predict,
y_processed_true,
labels,
sample_weight,
groups,
)
else:
try:
import datasets
datasets_metric_name = huggingface_submetric_to_metric.get(metric_name, metric_name.split(":")[0])
metric = datasets.load_metric(datasets_metric_name)
metric_mode = huggingface_metric_to_mode[datasets_metric_name]
if metric_name.startswith("seqeval"):
y_processed_true = [[labels[tr] for tr in each_list] for each_list in y_processed_true]
elif metric in ("pearsonr", "spearmanr"):
y_processed_true = (
y_processed_true.to_list() if isinstance(y_processed_true, Series) else list(y_processed_true)
)
score_dict = metric.compute(predictions=y_processed_predict, references=y_processed_true)
if "rouge" in metric_name:
score = score_dict[metric_name].mid.fmeasure
elif metric_name.startswith("seqeval"):
metric_submetric_names = metric_name.split(":")
score = score_dict[metric_submetric_names[1] if len(metric_submetric_names) > 1 else "overall_accuracy"]
else:
score = score_dict[metric_name]
except ImportError:
            raise ValueError(
                metric_name + " is not a built-in sklearn metric and [hf] is not installed. "
                "Currently built-in sklearn metrics are: "
                "r2, rmse, mae, mse, accuracy, roc_auc, roc_auc_ovr, roc_auc_ovo, "
                "log_loss, mape, f1, micro_f1, macro_f1, ap. "
                "If the metric is a huggingface metric, please pip install flaml[hf] "
                "or pass a customized metric function to AutoML.fit(metric=func)."
            )
# If the metric is not found from huggingface dataset metric list (i.e., FileNotFoundError)
# ask the user to provide a custom metric
except FileNotFoundError:
raise ValueError(
metric_name + " is neither an sklearn metric nor a huggingface metric. "
"Currently built-in sklearn metrics are: "
"r2, rmse, mae, mse, accuracy, roc_auc, roc_auc_ovr, roc_auc_ovo,"
"log_loss, mape, f1, micro_f1, macro_f1, ap. "
"Currently built-in huggingface metrics are: "
+ ", ".join(huggingface_metric_to_mode.keys())
+ ". Please pass a customized metric function to AutoML.fit(metric=func)"
)
if metric_mode == "max":
return 1 - score
else:
return score
def is_in_sklearn_metric_name_set(metric_name: str):
return metric_name.startswith("ndcg") or metric_name in sklearn_metric_name_set
def is_min_metric(metric_name: str):
return (
metric_name in ["rmse", "mae", "mse", "log_loss", "mape"]
or huggingface_metric_to_mode.get(metric_name, None) == "min"
)
def sklearn_metric_loss_score(
metric_name: str,
y_predict,
y_true,
labels=None,
sample_weight=None,
groups=None,
):
"""Loss using the specified metric.
Args:
metric_name: A string of the metric name, one of
'r2', 'rmse', 'mae', 'mse', 'accuracy', 'roc_auc', 'roc_auc_ovr',
'roc_auc_ovo', 'roc_auc_weighted', 'roc_auc_ovo_weighted', 'roc_auc_ovr_weighted',
'log_loss', 'mape', 'f1', 'ap', 'ndcg', 'micro_f1', 'macro_f1'.
y_predict: A 1d or 2d numpy array of the predictions which can be
used to calculate the metric. E.g., 2d for log_loss and 1d
for others.
y_true: A 1d numpy array of the true labels.
labels: A list or an array of the unique labels.
sample_weight: A 1d numpy array of the sample weight.
groups: A 1d numpy array of the group labels.
Returns:
score: A float number of the loss, the lower the better.
"""
metric_name = metric_name.lower()
if "r2" == metric_name:
score = 1.0 - r2_score(y_true, y_predict, sample_weight=sample_weight)
elif metric_name == "rmse":
score = np.sqrt(mean_squared_error(y_true, y_predict, sample_weight=sample_weight))
elif metric_name == "mae":
score = mean_absolute_error(y_true, y_predict, sample_weight=sample_weight)
elif metric_name == "mse":
score = mean_squared_error(y_true, y_predict, sample_weight=sample_weight)
elif metric_name == "accuracy":
score = 1.0 - accuracy_score(y_true, y_predict, sample_weight=sample_weight)
elif metric_name == "roc_auc":
score = 1.0 - roc_auc_score(y_true, y_predict, sample_weight=sample_weight)
elif metric_name == "roc_auc_ovr":
score = 1.0 - roc_auc_score(y_true, y_predict, sample_weight=sample_weight, multi_class="ovr")
elif metric_name == "roc_auc_ovo":
score = 1.0 - roc_auc_score(y_true, y_predict, sample_weight=sample_weight, multi_class="ovo")
elif metric_name == "roc_auc_weighted":
score = 1.0 - roc_auc_score(y_true, y_predict, sample_weight=sample_weight, average="weighted")
elif metric_name == "roc_auc_ovo_weighted":
score = 1.0 - roc_auc_score(
y_true,
y_predict,
sample_weight=sample_weight,
average="weighted",
multi_class="ovo",
)
elif metric_name == "roc_auc_ovr_weighted":
score = 1.0 - roc_auc_score(
y_true,
y_predict,
sample_weight=sample_weight,
average="weighted",
multi_class="ovr",
)
elif "log_loss" == metric_name:
score = log_loss(y_true, y_predict, labels=labels, sample_weight=sample_weight)
elif "mape" == metric_name:
try:
score = mean_absolute_percentage_error(y_true, y_predict)
except ValueError:
return np.inf
elif "micro_f1" == metric_name:
score = 1 - f1_score(y_true, y_predict, sample_weight=sample_weight, average="micro")
elif "macro_f1" == metric_name:
score = 1 - f1_score(y_true, y_predict, sample_weight=sample_weight, average="macro")
elif "f1" == metric_name:
score = 1 - f1_score(y_true, y_predict, sample_weight=sample_weight)
elif "ap" == metric_name:
score = 1 - average_precision_score(y_true, y_predict, sample_weight=sample_weight)
elif "ndcg" in metric_name:
if "@" in metric_name:
k = int(metric_name.split("@", 1)[-1])
counts = group_counts(groups)
score = 0
psum = 0
for c in counts:
score -= ndcg_score(
np.asarray([y_true[psum : psum + c]]),
np.asarray([y_predict[psum : psum + c]]),
k=k,
)
psum += c
score /= len(counts)
score += 1
else:
score = 1 - ndcg_score([y_true], [y_predict])
return score
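# A scoring sketch: with a perfect ranking the roc_auc loss (1 - AUC) is 0.
def _example_sklearn_metric_loss_score():
    y_true = np.array([0, 1, 1, 0])
    y_prob = np.array([0.1, 0.8, 0.6, 0.4])  # positive-class probabilities
    return sklearn_metric_loss_score("roc_auc", y_prob, y_true)  # 0.0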
def get_y_pred(estimator, X, eval_metric, task: Task):
if eval_metric in ["roc_auc", "ap", "roc_auc_weighted"] and task.is_binary():
y_pred_classes = estimator.predict_proba(X)
if isinstance(y_pred_classes, (psSeries, psDataFrame)):
y_pred = y_pred_classes
else:
y_pred = y_pred_classes[:, 1] if y_pred_classes.ndim > 1 else y_pred_classes
elif eval_metric in [
"log_loss",
"roc_auc",
"roc_auc_ovr",
"roc_auc_ovo",
"roc_auc_ovo_weighted",
"roc_auc_ovr_weighted",
]:
y_pred = estimator.predict_proba(X)
else:
y_pred = estimator.predict(X)
if isinstance(y_pred, Series) or isinstance(y_pred, DataFrame):
y_pred = y_pred.values
return y_pred
def to_numpy(x):
    if isinstance(x, (Series, DataFrame)):
        x = x.values
    else:
        x = np.asarray(x)
    return x.reshape((-1, 1))
def compute_estimator(
X_train,
y_train,
X_val,
y_val,
weight_val,
groups_val,
budget,
kf,
config_dic: dict,
task: Union[str, Task],
estimator_name: str,
eval_method: str,
eval_metric: Union[str, Callable],
best_val_loss=np.Inf,
n_jobs: Optional[int] = 1, # some estimators of EstimatorSubclass don't accept n_jobs. Should be None in that case.
estimator_class: Optional[EstimatorSubclass] = None,
cv_score_agg_func: Optional[callable] = None,
log_training_metric: Optional[bool] = False,
fit_kwargs: Optional[dict] = None,
free_mem_ratio=0,
):
if fit_kwargs is None:
fit_kwargs = {}
estimator_class = estimator_class or task.estimator_class_from_str(estimator_name)
estimator = estimator_class(
**config_dic,
task=task,
n_jobs=n_jobs,
)
if isinstance(estimator, TransformersEstimator):
# TODO: move the partial function to nlp
fit_kwargs["metric"] = eval_metric
fit_kwargs["X_val"] = X_val
fit_kwargs["y_val"] = y_val
if "holdout" == eval_method:
val_loss, metric_for_logging, train_time, pred_time = get_val_loss(
config_dic,
estimator,
X_train,
y_train,
X_val,
y_val,
weight_val,
groups_val,
eval_metric,
task,
labels=fit_kwargs.get("label_list"), # pass the label list on to compute the evaluation metric
budget=budget,
log_training_metric=log_training_metric,
fit_kwargs=fit_kwargs,
free_mem_ratio=0,
)
else:
val_loss, metric_for_logging, train_time, pred_time = task.evaluate_model_CV(
config_dic,
estimator,
X_train,
y_train,
budget,
kf,
eval_metric,
best_val_loss,
cv_score_agg_func,
log_training_metric=log_training_metric,
fit_kwargs=fit_kwargs,
free_mem_ratio=0,
)
if isinstance(estimator, TransformersEstimator):
del fit_kwargs["metric"], fit_kwargs["X_val"], fit_kwargs["y_val"]
return estimator, val_loss, metric_for_logging, train_time, pred_time
def train_estimator(
config_dic: dict,
X_train,
y_train,
task: str,
estimator_name: str,
n_jobs: Optional[int] = 1, # some estimators of EstimatorSubclass don't accept n_jobs. Should be None in that case.
estimator_class: Optional[EstimatorSubclass] = None,
budget=None,
fit_kwargs: Optional[dict] = None,
eval_metric=None,
free_mem_ratio=0,
) -> Tuple[EstimatorSubclass, float]:
start_time = time.time()
estimator_class = estimator_class or task.estimator_class_from_str(estimator_name)
estimator = estimator_class(
**config_dic,
task=task,
n_jobs=n_jobs,
)
if fit_kwargs is None:
fit_kwargs = {}
if isinstance(estimator, TransformersEstimator):
fit_kwargs["metric"] = eval_metric
if X_train is not None:
train_time = estimator.fit(X_train, y_train, budget=budget, free_mem_ratio=free_mem_ratio, **fit_kwargs)
else:
estimator = estimator.estimator_class(**estimator.params)
train_time = time.time() - start_time
return estimator, train_time
def norm_confusion_matrix(y_true: Union[np.array, Series], y_pred: Union[np.array, Series]):
"""normalized confusion matrix.
Args:
estimator: A multi-class classification estimator.
y_true: A numpy array or a pandas series of true labels.
y_pred: A numpy array or a pandas series of predicted labels.
Returns:
A normalized confusion matrix.
"""
from sklearn.metrics import confusion_matrix
conf_mat = confusion_matrix(y_true, y_pred)
norm_conf_mat = conf_mat.astype("float") / conf_mat.sum(axis=1)[:, np.newaxis]
return norm_conf_mat
def multi_class_curves(
y_true: Union[np.array, Series],
y_pred_proba: Union[np.array, Series],
curve_func: Callable,
):
"""Binarize the data for multi-class tasks and produce ROC or precision-recall curves.
Args:
y_true: A numpy array or a pandas series of true labels.
        y_pred_proba: A numpy array or a pandas dataframe of predicted probabilities.
curve_func: A function to produce a curve (e.g., roc_curve or precision_recall_curve).
Returns:
A tuple of two dictionaries with the same set of keys (class indices).
The first dictionary curve_x stores the x coordinates of each curve, e.g.,
curve_x[0] is an 1D array of the x coordinates of class 0.
The second dictionary curve_y stores the y coordinates of each curve, e.g.,
curve_y[0] is an 1D array of the y coordinates of class 0.
"""
from sklearn.preprocessing import label_binarize
classes = np.unique(y_true)
y_true_binary = label_binarize(y_true, classes=classes)
curve_x, curve_y = {}, {}
for i in range(len(classes)):
curve_x[i], curve_y[i], _ = curve_func(y_true_binary[:, i], y_pred_proba[:, i])
return curve_x, curve_y
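# A usage sketch with placeholder data: pass sklearn's roc_curve to get one
# ROC curve per class, keyed by class index.
def _example_multi_class_curves():
    from sklearn.metrics import roc_curve

    y_true = np.array([0, 1, 2, 1, 0, 2])
    y_pred_proba = np.tile([0.2, 0.5, 0.3], (len(y_true), 1))
    curve_x, curve_y = multi_class_curves(y_true, y_pred_proba, roc_curve)
    return curve_x[0], curve_y[0]  # fpr and tpr for class 0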
def get_val_loss(
config,
estimator,
X_train,
y_train,
X_val,
y_val,
weight_val,
groups_val,
eval_metric,
task,
labels=None,
budget=None,
log_training_metric=False,
fit_kwargs={},
free_mem_ratio=0,
):
start = time.time()
# if groups_val is not None:
# fit_kwargs['groups_val'] = groups_val
# fit_kwargs['X_val'] = X_val
# fit_kwargs['y_val'] = y_val
estimator.fit(X_train, y_train, budget=budget, free_mem_ratio=free_mem_ratio, **fit_kwargs)
val_loss, metric_for_logging, pred_time, _ = _eval_estimator(
config,
estimator,
X_train,
y_train,
X_val,
y_val,
weight_val,
groups_val,
eval_metric,
task,
labels,
log_training_metric,
fit_kwargs,
)
if hasattr(estimator, "intermediate_results"):
metric_for_logging["intermediate_results"] = estimator.intermediate_results
train_time = time.time() - start
return val_loss, metric_for_logging, train_time, pred_time
def default_cv_score_agg_func(val_loss_folds, log_metrics_folds):
metric_to_minimize = sum(val_loss_folds) / len(val_loss_folds)
metrics_to_log = None
for single_fold in log_metrics_folds:
if metrics_to_log is None:
metrics_to_log = single_fold
elif isinstance(metrics_to_log, dict):
metrics_to_log = {k: metrics_to_log[k] + v for k, v in single_fold.items()}
else:
metrics_to_log += single_fold
if metrics_to_log:
n = len(val_loss_folds)
metrics_to_log = (
{k: v / n for k, v in metrics_to_log.items()} if isinstance(metrics_to_log, dict) else metrics_to_log / n
)
return metric_to_minimize, metrics_to_log
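# A sketch of a custom aggregator with the same signature as the default above:
# optimize the worst fold instead of the mean (hypothetically passed to
# AutoML.fit via cv_score_agg_func).
def _example_worst_fold_agg(val_loss_folds, log_metrics_folds):
    return max(val_loss_folds), None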
def _eval_estimator(
config,
estimator,
X_train,
y_train,
X_val,
y_val,
weight_val,
groups_val,
eval_metric,
task,
labels=None,
log_training_metric=False,
fit_kwargs={},
):
if isinstance(eval_metric, str):
pred_start = time.time()
val_pred_y = get_y_pred(estimator, X_val, eval_metric, task)
# TODO: why are integer labels being cast to str in the first place?
if isinstance(val_pred_y, Series) or isinstance(val_pred_y, DataFrame) or isinstance(val_pred_y, np.ndarray):
test = val_pred_y if isinstance(val_pred_y, np.ndarray) else val_pred_y.values
if not np.issubdtype(test.dtype, np.number):
# some NLP models return a list
val_pred_y = val_pred_y.astype(str)
if isinstance(X_val, TimeSeriesDataset):
num_val_rows = len(X_val.test_data)
y_val = X_val.test_data[X_val.target_names].values.astype(val_pred_y.dtype)
y_train = X_val.train_data[X_val.target_names].values.astype(val_pred_y.dtype)
else:
num_val_rows = X_val.shape[0]
pred_time = (time.time() - pred_start) / num_val_rows
val_loss = metric_loss_score(
eval_metric,
y_processed_predict=val_pred_y,
y_processed_true=y_val,
labels=labels,
sample_weight=weight_val,
groups=groups_val,
)
metric_for_logging = {"pred_time": pred_time}
if log_training_metric:
train_pred_y = get_y_pred(estimator, X_train, eval_metric, task)
metric_for_logging["train_loss"] = metric_loss_score(
eval_metric,
train_pred_y,
y_train,
labels,
fit_kwargs.get("sample_weight"),
fit_kwargs.get("groups"),
)
else: # customized metric function
val_loss, metric_for_logging = eval_metric(
X_val,
y_val,
estimator,
labels,
X_train,
y_train,
weight_val,
fit_kwargs.get("sample_weight"),
config,
groups_val,
fit_kwargs.get("groups"),
)
pred_time = metric_for_logging.get("pred_time", 0)
val_pred_y = None
# eval_metric may return val_pred_y but not necessarily. Setting None for now.
return val_loss, metric_for_logging, pred_time, val_pred_y

2036
flaml/automl/model.py Normal file

File diff suppressed because it is too large

View File

@ -0,0 +1,25 @@
# AutoML for NLP
This directory contains utility functions used by AutoNLP. Currently we support four NLP tasks: sequence classification, sequence regression, multiple-choice classification, and summarization.
Please refer to this [link](https://microsoft.github.io/FLAML/docs/Examples/AutoML-NLP) for examples.
# Troubleshooting fine-tuning HPO for pre-trained language models
Frequent updates of the transformers library may cause tuning results to fluctuate. To help users quickly troubleshoot AutoNLP when a tuning failure occurs (e.g., previous results fail to reproduce), we provide the following Jupyter notebook:
* [Troubleshooting HPO for fine-tuning pre-trained language models](https://github.com/microsoft/FLAML/blob/main/notebook/research/acl2021.ipynb)
Our findings on troubleshooting fine-tuning the Electra and RoBERTa model for the GLUE dataset can be seen in the following paper published in ACL 2021:
* [An Empirical Study on Hyperparameter Optimization for Fine-Tuning Pre-trained Language Models](https://arxiv.org/abs/2106.09204). Xueqing Liu, Chi Wang. ACL-IJCNLP 2021.
```bibtex
@inproceedings{liu2021hpo,
title={An Empirical Study on Hyperparameter Optimization for Fine-Tuning Pre-trained Language Models},
author={Xueqing Liu and Chi Wang},
year={2021},
booktitle={ACL-IJCNLP},
}
```

View File

@ -0,0 +1,50 @@
from dataclasses import dataclass
from transformers.data.data_collator import (
DataCollatorWithPadding,
DataCollatorForTokenClassification,
DataCollatorForSeq2Seq,
)
from collections import OrderedDict
from flaml.automl.task.task import (
TOKENCLASSIFICATION,
MULTICHOICECLASSIFICATION,
SUMMARIZATION,
SEQCLASSIFICATION,
SEQREGRESSION,
)
@dataclass
class DataCollatorForMultipleChoiceClassification(DataCollatorWithPadding):
def __call__(self, features):
from itertools import chain
import torch
label_name = "label" if "label" in features[0].keys() else "labels"
labels = [feature.pop(label_name) for feature in features] if label_name in features[0] else None
batch_size = len(features)
num_choices = len(features[0]["input_ids"])
flattened_features = [
[{k: v[i] for k, v in feature.items()} for i in range(num_choices)] for feature in features
]
flattened_features = list(chain(*flattened_features))
batch = super(DataCollatorForMultipleChoiceClassification, self).__call__(flattened_features)
# Un-flatten
batch = {k: v.view(batch_size, num_choices, -1) for k, v in batch.items()}
# Add back labels
if labels:
batch["labels"] = torch.tensor(labels, dtype=torch.int64)
return batch
task_to_datacollator_class = OrderedDict(
[
(TOKENCLASSIFICATION, DataCollatorForTokenClassification),
(MULTICHOICECLASSIFICATION, DataCollatorForMultipleChoiceClassification),
(SUMMARIZATION, DataCollatorForSeq2Seq),
(SEQCLASSIFICATION, DataCollatorWithPadding),
(SEQREGRESSION, DataCollatorWithPadding),
]
)
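# A selection sketch (the model name is illustrative; requires transformers):
# look up the collator class for a task and instantiate it with a tokenizer.
def _example_pick_collator():
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    collator_class = task_to_datacollator_class[SEQCLASSIFICATION]
    return collator_class(tokenizer=tokenizer)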

View File

@ -0,0 +1,90 @@
import os
try:
from transformers import Seq2SeqTrainer
except ImportError:
Seq2SeqTrainer = object
class TrainerForAuto(Seq2SeqTrainer):
def predict(
self,
test_dataset,
ignore_keys=None,
metric_key_prefix=None,
max_length=None,
num_beams=None,
):
if getattr(self, "_is_seq2seq", None):
return super().predict(
test_dataset,
ignore_keys,
metric_key_prefix=metric_key_prefix,
max_length=max_length,
num_beams=num_beams,
)
else:
return super(Seq2SeqTrainer, self).predict(test_dataset, ignore_keys, metric_key_prefix)
def prediction_step(
self,
model,
inputs,
prediction_loss_only,
ignore_keys,
):
if getattr(self, "_is_seq2seq", None):
return super().prediction_step(model, inputs, prediction_loss_only, ignore_keys)
else:
return super(Seq2SeqTrainer, self).prediction_step(model, inputs, prediction_loss_only, ignore_keys)
def log(self, logs) -> None:
if getattr(self, "_is_seq2seq", None):
super().log(logs)
else:
super(Seq2SeqTrainer, self).log(logs)
if not hasattr(self, "intermediate_results"):
self.intermediate_results = {}
epoch_num = logs.get("epoch", None)
if epoch_num:
self.intermediate_results.setdefault(epoch_num, {})
self.intermediate_results[epoch_num].update(logs)
def evaluate(
self,
eval_dataset=None,
ignore_keys=None,
metric_key_prefix="eval",
):
"""Overriding transformers.Trainer.evaluate by saving metrics and checkpoint path."""
from transformers.trainer_utils import PREFIX_CHECKPOINT_DIR
ckpt_dir = os.path.join(self.args.output_dir, f"{PREFIX_CHECKPOINT_DIR}-{self.state.global_step}")
eval_dataset = eval_dataset if eval_dataset is not None else self.eval_dataset
        # For seq2seq tasks (i.e., SUMMARIZATION), generation arguments must be passed to evaluate:
if getattr(self, "_is_seq2seq", None):
metrics = eval_dataset and super().evaluate(
eval_dataset,
ignore_keys,
metric_key_prefix,
max_length=self.args.generation_max_length,
num_beams=self.args.generation_num_beams,
)
else:
metrics = eval_dataset and super(Seq2SeqTrainer, self).evaluate(
eval_dataset,
ignore_keys,
metric_key_prefix,
)
if hasattr(self, "ckpt_to_global_step"):
self.ckpt_to_global_step[ckpt_dir] = self.state.global_step
if metrics:
self.ckpt_to_metric[ckpt_dir] = metrics
else:
self.ckpt_to_global_step = {ckpt_dir: self.state.global_step}
self.ckpt_to_metric = {ckpt_dir: metrics} if metrics else {}
return metrics

View File

@ -0,0 +1,128 @@
import argparse
from dataclasses import dataclass, field
from typing import Optional, List
from flaml.automl.task.task import NLG_TASKS
try:
from transformers import TrainingArguments
except ImportError:
TrainingArguments = object
@dataclass
class TrainingArgumentsForAuto(TrainingArguments):
"""FLAML custom TrainingArguments.
Args:
task (str): the task name for NLP tasks, e.g., seq-classification, token-classification
        output_dir (str): data root directory for outputting the log, etc.
model_path (str, optional, defaults to "facebook/muppet-roberta-base"): A string,
the path of the language model file, either a path from huggingface
model card huggingface.co/models, or a local path for the model.
fp16 (bool, optional, defaults to "False"): A bool, whether to use FP16.
max_seq_length (int, optional, defaults to 128): An integer, the max length of the sequence.
For token classification task, this argument will be ineffective.
pad_to_max_length (bool, optional, defaults to "False"):
whether to pad all samples to model maximum sentence length.
If False, will pad the samples dynamically when batching to the maximum length in the batch.
per_device_eval_batch_size (int, optional, defaults to 1): An integer, the per gpu evaluation batch size.
label_list (List[str], optional, defaults to None): A list of string, the string list of the label names.
When the task is sequence labeling/token classification, there are two formats of the labels:
(1) The token labels, i.e., [B-PER, I-PER, B-LOC]; (2) Id labels. For (2), need to pass the label_list (e.g., [B-PER, I-PER, B-LOC])
to convert the Id to token labels when computing the metric with metric_loss_score.
See the example in [a simple token classification example](/docs/Examples/AutoML-NLP#a-simple-token-classification-example).
"""
task: str = field(default="seq-classification")
output_dir: str = field(default="data/output/", metadata={"help": "data dir"})
model_path: str = field(
default="facebook/muppet-roberta-base",
metadata={
"help": "model path for HPO natural language understanding tasks, default is set to facebook/muppet-roberta-base"
},
)
fp16: bool = field(default=True, metadata={"help": "whether to use the FP16 mode"})
max_seq_length: int = field(default=128, metadata={"help": "max seq length"})
label_all_tokens: bool = field(
default=False,
metadata={
"help": "For NER task, whether to set the extra tokenized labels to the same label (instead of -100)"
},
)
pad_to_max_length: bool = field(
default=False,
metadata={
"help": "Whether to pad all samples to model maximum sentence length. "
"If False, will pad the samples dynamically when batching to the maximum length in the batch. "
},
)
per_device_eval_batch_size: int = field(
default=1,
metadata={"help": "per gpu evaluation batch size"},
)
label_list: Optional[List[str]] = field(default=None, metadata={"help": "The string list of the label names. "})
eval_steps: int = field(default=500, metadata={"help": "Run an evaluation every X steps."})
save_steps: int = field(default=500, metadata={"help": "Save checkpoint every X updates steps."})
logging_steps: int = field(default=500, metadata={"help": "Log every X updates steps."})
@staticmethod
def load_args_from_console():
from dataclasses import fields
arg_parser = argparse.ArgumentParser()
for each_field in fields(TrainingArgumentsForAuto):
arg_parser.add_argument(
"--" + each_field.name,
type=each_field.type,
help=each_field.metadata["help"],
required=each_field.metadata["required"] if "required" in each_field.metadata else False,
choices=each_field.metadata["choices"] if "choices" in each_field.metadata else None,
default=each_field.default,
)
console_args, unknown = arg_parser.parse_known_args()
return console_args
@dataclass
class Seq2SeqTrainingArgumentsForAuto(TrainingArgumentsForAuto):
model_path: str = field(
default="t5-small",
metadata={"help": "model path for HPO natural language generation tasks, default is set to t5-small"},
)
sortish_sampler: bool = field(default=False, metadata={"help": "Whether to use SortishSampler or not."})
predict_with_generate: bool = field(
default=True,
metadata={"help": "Whether to use generate to calculate generative metrics (ROUGE, BLEU)."},
)
generation_max_length: Optional[int] = field(
default=None,
metadata={
"help": "The `max_length` to use on each evaluation loop when `predict_with_generate=True`. Will default "
"to the `max_length` value of the model configuration."
},
)
generation_num_beams: Optional[int] = field(
default=None,
metadata={
"help": "The `num_beams` to use on each evaluation loop when `predict_with_generate=True`. Will default "
"to the `num_beams` value of the model configuration."
},
)
def __post_init__(self):
super().__post_init__()
if self.task in NLG_TASKS:
self.model_path = "t5-small"

View File

@ -0,0 +1,422 @@
from itertools import chain
import numpy as np
from flaml.automl.task.task import (
SUMMARIZATION,
SEQREGRESSION,
SEQCLASSIFICATION,
MULTICHOICECLASSIFICATION,
TOKENCLASSIFICATION,
NLG_TASKS,
)
from flaml.automl.data import pd
def todf(X, Y, column_name):
"""
    Convert Y from any supported format (list, pandas.Series, numpy array) to a DataFrame aligned with X's index.
"""
if Y is not None:
Y = pd.DataFrame(Y, index=X.index)
Y.columns = column_name
return Y
def tokenize_text(X, Y=None, task=None, hf_args=None, tokenizer=None):
label_col_name = None
# label_col_name is the name of the label column Y, label_col_name = ['labels'] for TOKENCLASSIFICATION and SUMMARIZATION,
# label_col_name = ['label'] for other tasks. todf is used by all tasks except for SUMMARIZATION,
# because the outputs of tokenize_seq2seq are already two DataFrames so no conversion needed.
if task in (SEQCLASSIFICATION, SEQREGRESSION):
X_tokenized = tokenize_onedataframe(
X,
tokenizer=tokenizer,
task=task,
hf_args=hf_args,
prefix_str="",
)
Y_tokenized = Y
label_col_name = ["label"]
elif task == TOKENCLASSIFICATION:
X_tokenized, Y_tokenized = tokenize_text_tokclassification(X, Y, tokenizer=tokenizer, hf_args=hf_args)
label_col_name = ["labels"]
elif task in NLG_TASKS:
return tokenize_seq2seq(X, Y, tokenizer=tokenizer, task=task, hf_args=hf_args)
elif task == MULTICHOICECLASSIFICATION:
X_tokenized = tokenize_text_multiplechoice(X, tokenizer=tokenizer, hf_args=hf_args)
label_col_name = ["label"]
Y_tokenized = Y
Y_tokenized = todf(X_tokenized, Y_tokenized, label_col_name)
return X_tokenized, Y_tokenized
def tokenize_seq2seq(X, Y, tokenizer, task=None, hf_args=None):
model_inputs = tokenize_onedataframe(
X,
tokenizer=tokenizer,
task=task,
hf_args=hf_args,
prefix_str="summarize: ",
)
model_outputs = None
if Y is not None:
model_outputs = tokenize_onedataframe(
Y.to_frame(),
tokenizer=tokenizer,
task=task,
hf_args=hf_args,
prefix_str="",
)
model_outputs["labels"] = [
[(each_l if each_l != tokenizer.pad_token_id else -100) for each_l in label]
for label in model_outputs["input_ids"]
]
model_outputs = model_outputs.drop(columns=["attention_mask", "input_ids", "decoder_input_ids"])
return model_inputs, model_outputs
def tokenize_and_align_labels(
examples,
tokenizer,
label_to_id,
b_to_i_label,
hf_args=None,
X_sent_key=None,
Y_sent_key=None,
return_column_name=False,
):
# tokenize_and_align_labels is only called by the token-classification task
tokenized_inputs = tokenizer(
[list(examples[X_sent_key])],
padding="max_length"
if hf_args and hf_args.pad_to_max_length
else False, # to be consistent with https://github.com/huggingface/transformers/blob/main/examples/pytorch/token-classification/run_ner.py#L394
truncation=True,
max_length=hf_args.max_seq_length if hf_args else None,
# We use this argument because the texts in our dataset are lists of words (with a label for each word).
is_split_into_words=True,
)
if Y_sent_key is not None:
previous_word_idx = None
label_ids = []
for word_idx in tokenized_inputs.word_ids(batch_index=0):
if word_idx is None:
label_ids.append(-100)
elif word_idx != previous_word_idx:
label_ids.append(label_to_id[examples[Y_sent_key][word_idx]])
# For the other tokens in a word, we set the label to either the current label or -100, depending on
# the label_all_tokens flag.
else:
# Use the label_all_tokens to control whether to copy the label to all subtokens or to pad the additional tokens as -100
if hf_args.label_all_tokens:
# If the B- word is converted into multiple subtokens, map the additional subtokens to I-
label_ids.append(b_to_i_label[label_to_id[examples[Y_sent_key][word_idx]]])
else:
label_ids.append(-100)
previous_word_idx = word_idx
tokenized_inputs["labels"] = label_ids
tmp_column_names = sorted(tokenized_inputs.keys())
tokenized_input_and_labels = [tokenized_inputs[x] for x in tmp_column_names]
for key_idx, each_key in enumerate(tmp_column_names):
if each_key != "labels":
tokenized_input_and_labels[key_idx] = tokenized_input_and_labels[key_idx][0]
if return_column_name:
return tokenized_input_and_labels, tmp_column_names
else:
return tokenized_input_and_labels
def tokenize_text_tokclassification(X, Y, tokenizer, hf_args=None):
# If the label_all_tokens flag is True, prepare the mapping label_to_id and the list b_to_i_label for converting B- labels to I- labels
label_to_id = {i: i for i in range(len(hf_args.label_list))}
b_to_i_label = []
for idx, label in enumerate(hf_args.label_list):
if label.startswith("B-") and label.replace("B-", "I-") in hf_args.label_list:
b_to_i_label.append(hf_args.label_list.index(label.replace("B-", "I-")))
else:
b_to_i_label.append(idx)
if Y is not None:
X_and_Y = pd.concat([X, Y.to_frame()], axis=1)
X_key = list(X.keys())[0]
Y_key = list(Y.to_frame().keys())[0]
# tokenize_and_align_labels is only called by the token-classification task
_, tokenized_column_names = tokenize_and_align_labels(
X_and_Y.iloc[0],
tokenizer=tokenizer,
hf_args=hf_args,
X_sent_key=X_key,
Y_sent_key=Y_key,
return_column_name=True,
label_to_id=label_to_id,
b_to_i_label=b_to_i_label,
)
X_and_Y_tokenized = X_and_Y.apply(
lambda x: tokenize_and_align_labels(
x,
tokenizer=tokenizer,
hf_args=hf_args,
X_sent_key=X_key,
Y_sent_key=Y_key,
label_to_id=label_to_id,
b_to_i_label=b_to_i_label,
),
axis=1,
result_type="expand",
)
label_idx = tokenized_column_names.index("labels")
other_indices = sorted(set(range(len(tokenized_column_names))).difference({label_idx}))
other_column_names = [tokenized_column_names[x] for x in other_indices]
d = X_and_Y_tokenized.iloc[:, other_indices]
y_tokenized = X_and_Y_tokenized.iloc[:, label_idx]
else:
X_key = list(X.keys())[0]
_, tokenized_column_names = tokenize_and_align_labels(
X.iloc[0],
tokenizer=tokenizer,
hf_args=hf_args,
X_sent_key=X_key,
Y_sent_key=None,
return_column_name=True,
label_to_id=label_to_id,
b_to_i_label=b_to_i_label,
)
d = X.apply(
lambda x: tokenize_and_align_labels(
x,
tokenizer=tokenizer,
hf_args=hf_args,
X_sent_key=X_key,
Y_sent_key=None,
label_to_id=label_to_id,
b_to_i_label=b_to_i_label,
),
axis=1,
result_type="expand",
)
other_column_names = tokenized_column_names
y_tokenized = None
X_tokenized = pd.DataFrame(columns=other_column_names)
X_tokenized[other_column_names] = d
return X_tokenized, y_tokenized
def tokenize_onedataframe(
X,
tokenizer,
task=None,
hf_args=None,
prefix_str=None,
):
with tokenizer.as_target_tokenizer():
_, tokenized_column_names = tokenize_row(
dict(X.iloc[0]),
tokenizer,
prefix=(prefix_str,) if task is SUMMARIZATION else None,
task=task,
hf_args=hf_args,
return_column_name=True,
)
d = X.apply(
lambda x: tokenize_row(
x,
tokenizer,
prefix=(prefix_str,) if task is SUMMARIZATION else None,
task=task,
hf_args=hf_args,
),
axis=1,
result_type="expand",
)
X_tokenized = pd.DataFrame(columns=tokenized_column_names)
X_tokenized[tokenized_column_names] = d
return X_tokenized
def tokenize_row(
this_row,
tokenizer,
prefix=None,
task=None,
hf_args=None,
return_column_name=False,
):
if prefix:
this_row = tuple(["".join(x) for x in zip(prefix, this_row)])
# tokenizer.pad_token = tokenizer.eos_token
tokenized_example = tokenizer(
*tuple(this_row),
padding="max_length" if hf_args and hf_args.pad_to_max_length else False,
max_length=hf_args.max_seq_length if hf_args else None,
truncation=True,
)
if task in NLG_TASKS:
tokenized_example["decoder_input_ids"] = tokenized_example["input_ids"]
tmp_column_names = sorted(tokenized_example.keys())
if return_column_name:
return [tokenized_example[x] for x in tmp_column_names], tmp_column_names
else:
return [tokenized_example[x] for x in tmp_column_names]
def tokenize_text_multiplechoice(X, tokenizer, hf_args=None):
t = X[["sent1", "sent2", "ending0", "ending1", "ending2", "ending3"]]
_, tokenized_column_names = tokenize_swag(
t.iloc[0],
tokenizer=tokenizer,
hf_args=hf_args,
return_column_name=True,
)
d = t.apply(
lambda x: tokenize_swag(x, tokenizer=tokenizer, hf_args=hf_args),
axis=1,
result_type="expand",
)
X_tokenized = pd.DataFrame(columns=tokenized_column_names)
X_tokenized[tokenized_column_names] = d
output = X_tokenized.join(X)
return output
def tokenize_swag(this_row, tokenizer, hf_args=None, return_column_name=False):
first_sentences = [[this_row["sent1"]] * 4]
# repeat the first sentence four times, once per candidate ending
question_headers = this_row["sent2"]
# sent2 is the shared stem of the second sentence
second_sentences = [question_headers + " " + this_row[key] for key in ["ending0", "ending1", "ending2", "ending3"]]
# each second sentence combines the stem with one of the four endings
# flatten first_sentences from a 2-d list into a 1-d list
first_sentences = list(chain(*first_sentences))
tokenized_example = tokenizer(
*tuple([first_sentences, second_sentences]),
truncation=True,
max_length=hf_args.max_seq_length if hf_args else None,
padding="max_length" if hf_args and hf_args.pad_to_max_length else False,
)
tmp_column_names = sorted(tokenized_example.keys())
if return_column_name:
return [tokenized_example[x] for x in tmp_column_names], tmp_column_names
else:
return [tokenized_example[x] for x in tmp_column_names]
def postprocess_prediction_and_true(task, y_pred, tokenizer, hf_args, y_true=None, X=None):
# postprocess the prediction matrix y_pred and the ground truth y_true into a user-readable format, e.g., decode summarization outputs into text
if y_pred is None:
return np.array([0.0] * len(X)), y_true
if task == SEQCLASSIFICATION:
return np.argmax(y_pred, axis=1), y_true
elif task == SEQREGRESSION:
return np.squeeze(y_pred), y_true # predictions.reshape((len(predictions),))
elif task == TOKENCLASSIFICATION:
assert (y_true is not None) or (X is not None), "One of y_true and X must not be None"
# If y_true is not None, use y_true to strip the -100 padding positions from the prediction, and return the postprocessed prediction and y_true.
# If y_true is None, use X to compute y_is_pad (i.e., whether y_true is -100 at each position), strip those positions from the prediction, and return only the postprocessed prediction.
y_predict = pd.Series(np.argmax(y_pred, axis=2).tolist())
if y_true is None:
_, y_is_pad_df = tokenize_text(
X,
y_predict,
task=task,
hf_args=hf_args,
tokenizer=tokenizer,
)
y_is_pad = y_is_pad_df.iloc[:, 0]
else:
y_is_pad = y_true
label_len = len(hf_args.label_list)
zip_pred_ispad = [
[(p, ispd) for (p, ispd) in zip(each_pred, each_is_pad) if ispd != -100]
for (each_pred, each_is_pad) in zip(y_predict, y_is_pad)
]
y_pred_label = [
[hf_args.label_list[p] if 0 <= p < label_len else -1 for (p, ispd) in each_list]
for each_list in zip_pred_ispad
] # To compute precision and recall, y_pred and y_true must be converted to string labels
# (B-PER, I-PER, etc.), so that the category-based precision/recall (i.e., PER, LOC, etc.) scores can be computed
if y_true is not None:
y_true_label = [[tr for (p, tr) in each_list] for each_list in zip_pred_ispad]
else:
y_true_label = None
return y_pred_label, y_true_label
elif task == SUMMARIZATION:
if isinstance(y_pred, tuple):
y_pred = np.argmax(y_pred[0], axis=2)
decoded_preds = tokenizer.batch_decode(y_pred, skip_special_tokens=True)
import nltk
nltk.download("punkt")
decoded_preds = [pred.strip() for pred in decoded_preds]
decoded_preds = ["\n".join(nltk.sent_tokenize(pred)) for pred in decoded_preds]
if y_true is not None:
y_true_labels = np.where(y_true != -100, y_true, tokenizer.pad_token_id)
decoded_y_true_labels = tokenizer.batch_decode(y_true_labels, skip_special_tokens=True)
decoded_y_true_labels = [label.strip() for label in decoded_y_true_labels]
decoded_y_true_labels = ["\n".join(nltk.sent_tokenize(label)) for label in decoded_y_true_labels]
else:
decoded_y_true_labels = None
return decoded_preds, decoded_y_true_labels
elif task == MULTICHOICECLASSIFICATION:
return np.argmax(y_pred, axis=1), y_true
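# Illustrative sketch of the SEQCLASSIFICATION branch (hypothetical logits):
#     y_pred = np.array([[0.2, 0.8], [0.9, 0.1]])
#     postprocess_prediction_and_true(SEQCLASSIFICATION, y_pred, None, None)
#     # -> (array([1, 0]), None): logits are reduced to class ids via argmax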
def load_model(checkpoint_path, task, num_labels=None):
import transformers
transformers.logging.set_verbosity_error()
from transformers import AutoConfig
from flaml.automl.task.task import (
SEQCLASSIFICATION,
SEQREGRESSION,
TOKENCLASSIFICATION,
)
def get_this_model(checkpoint_path, task, model_config):
from transformers import AutoModelForSequenceClassification
from transformers import AutoModelForSeq2SeqLM
from transformers import AutoModelForMultipleChoice
from transformers import AutoModelForTokenClassification
if task in (SEQCLASSIFICATION, SEQREGRESSION):
return AutoModelForSequenceClassification.from_pretrained(
checkpoint_path, config=model_config, ignore_mismatched_sizes=True
)
elif task == TOKENCLASSIFICATION:
return AutoModelForTokenClassification.from_pretrained(checkpoint_path, config=model_config)
elif task in NLG_TASKS:
return AutoModelForSeq2SeqLM.from_pretrained(checkpoint_path, config=model_config)
elif task == MULTICHOICECLASSIFICATION:
return AutoModelForMultipleChoice.from_pretrained(checkpoint_path, config=model_config)
def _set_model_config(checkpoint_path):
if task in (SEQCLASSIFICATION, SEQREGRESSION, TOKENCLASSIFICATION):
model_config = AutoConfig.from_pretrained(
checkpoint_path,
num_labels=model_config_num_labels,
)
return model_config
else:
model_config = AutoConfig.from_pretrained(checkpoint_path)
return model_config
current_config = AutoConfig.from_pretrained(checkpoint_path)
this_vocab_size = current_config.vocab_size
model_config_num_labels = num_labels
new_config = _set_model_config(checkpoint_path)
this_model = get_this_model(checkpoint_path, task, new_config)
this_model.resize_token_embeddings(this_vocab_size)
return this_model
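# Usage sketch (hypothetical checkpoint name; downloads weights from the Hugging Face hub):
#     model = load_model("bert-base-uncased", SEQCLASSIFICATION, num_labels=2)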

108
flaml/automl/nlp/utils.py Normal file
View File

@ -0,0 +1,108 @@
from typing import Dict, Any
import numpy as np
from flaml.automl.task.task import (
SUMMARIZATION,
SEQREGRESSION,
SEQCLASSIFICATION,
MULTICHOICECLASSIFICATION,
TOKENCLASSIFICATION,
)
def load_default_huggingface_metric_for_task(task):
if task == SEQCLASSIFICATION:
return "accuracy"
elif task == SEQREGRESSION:
return "r2"
elif task == SUMMARIZATION:
return "rouge1"
elif task == MULTICHOICECLASSIFICATION:
return "accuracy"
elif task == TOKENCLASSIFICATION:
return "seqeval"
def is_a_list_of_str(this_obj):
return (isinstance(this_obj, list) or isinstance(this_obj, np.ndarray)) and all(
isinstance(x, str) for x in this_obj
)
def _clean_value(value: Any) -> str:
if isinstance(value, float):
return "{:.5}".format(value)
else:
return str(value).replace("/", "_")
def format_vars(resolved_vars: Dict) -> str:
"""Formats the resolved variable dict into a single string."""
out = []
for path, value in sorted(resolved_vars.items()):
if path[0] in ["run", "env", "resources_per_trial"]:
continue # TrialRunner already has these in the experiment_tag
pieces = []
last_string = True
for k in path[::-1]:
if isinstance(k, int):
pieces.append(str(k))
elif last_string:
last_string = False
pieces.append(k)
pieces.reverse()
out.append(_clean_value("_".join(pieces)) + "=" + _clean_value(value))
return ",".join(out)
counter = 0
def date_str():
from datetime import datetime
return datetime.today().strftime("%Y-%m-%d_%H-%M-%S")
def _generate_dirname(experiment_tag, trial_id):
generated_dirname = f"train_{str(trial_id)}_{experiment_tag}"
generated_dirname = generated_dirname[:130]
generated_dirname += f"_{date_str()}"
return generated_dirname.replace("/", "_")
def get_logdir_name(dirname, local_dir):
import os
local_dir = os.path.expanduser(local_dir)
logdir = os.path.join(local_dir, dirname)
return logdir
class Counter:
counter = 0
@staticmethod
def get_trial_fold_name(local_dir, trial_config, trial_id):
Counter.counter += 1
experiment_tag = "{0}_{1}".format(str(Counter.counter), format_vars(trial_config))
logdir = get_logdir_name(_generate_dirname(experiment_tag, trial_id=trial_id), local_dir)
return logdir
class LabelEncoderforTokenClassification:
def fit_transform(self, y):
# if the labels are tokens, convert them to ids
if any(isinstance(id, str) for id in y[0]):
self.label_list = sorted(list(set().union(*y)))
self._tokenlabel_to_id = {self.label_list[id]: id for id in range(len(self.label_list))}
y = y.apply(lambda sent: [self._tokenlabel_to_id[token] for token in sent])
# if the labels are not tokens, they must be ids
else:
assert all(isinstance(id, (int, np.integer)) for id in y[0]), "The labels must either be tokens or ids"
return y
def transform(self, y):
if hasattr(self, "_tokenlabel_to_id"):
y = y.apply(lambda sent: [self._tokenlabel_to_id[token] for token in sent])
return y
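# Usage sketch (illustrative only; y is assumed to be a pandas Series of token-label lists):
#     import pandas as pd
#     enc = LabelEncoderforTokenClassification()
#     enc.fit_transform(pd.Series([["B-PER", "O"], ["O", "B-LOC"]]))
#     # -> Series([[1, 2], [2, 0]]); enc.label_list == ["B-LOC", "B-PER", "O"]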

View File

@ -0,0 +1,32 @@
import os
os.environ["PYARROW_IGNORE_TIMEZONE"] = "1"
try:
import pyspark
import pyspark.pandas as ps
import pyspark.sql.functions as F
import pyspark.sql.types as T
from pyspark.sql import DataFrame as sparkDataFrame
from pyspark.pandas import DataFrame as psDataFrame, Series as psSeries, set_option
from pyspark.util import VersionUtils
except ImportError:
class psDataFrame:
pass
F = T = ps = sparkDataFrame = psSeries = psDataFrame
_spark_major_minor_version = set_option = None
ERROR = ImportError(
"""Please run pip install flaml[spark]
and check [here](https://spark.apache.org/docs/latest/api/python/getting_started/install.html)
for more details about installing Spark."""
)
else:
ERROR = None
_spark_major_minor_version = VersionUtils.majorMinorVersion(pyspark.__version__)
try:
import pandas as pd
from pandas import DataFrame, Series
except ImportError:
DataFrame = Series = pd = None

View File

@ -0,0 +1,97 @@
ParamList_LightGBM_Base = [
"baggingFraction",
"baggingFreq",
"baggingSeed",
"binSampleCount",
"boostFromAverage",
"boostingType",
"catSmooth",
"categoricalSlotIndexes",
"categoricalSlotNames",
"catl2",
"chunkSize",
"dataRandomSeed",
"defaultListenPort",
"deterministic",
"driverListenPort",
"dropRate",
"dropSeed",
"earlyStoppingRound",
"executionMode",
"extraSeed" "featureFraction",
"featureFractionByNode",
"featureFractionSeed",
"featuresCol",
"featuresShapCol",
"fobj" "improvementTolerance",
"initScoreCol",
"isEnableSparse",
"isProvideTrainingMetric",
"labelCol",
"lambdaL1",
"lambdaL2",
"leafPredictionCol",
"learningRate",
"matrixType",
"maxBin",
"maxBinByFeature",
"maxCatThreshold",
"maxCatToOnehot",
"maxDeltaStep",
"maxDepth",
"maxDrop",
"metric",
"microBatchSize",
"minDataInLeaf",
"minDataPerBin",
"minDataPerGroup",
"minGainToSplit",
"minSumHessianInLeaf",
"modelString",
"monotoneConstraints",
"monotoneConstraintsMethod",
"monotonePenalty",
"negBaggingFraction",
"numBatches",
"numIterations",
"numLeaves",
"numTasks",
"numThreads",
"objectiveSeed",
"otherRate",
"parallelism",
"passThroughArgs",
"posBaggingFraction",
"predictDisableShapeCheck",
"predictionCol",
"repartitionByGroupingColumn",
"seed",
"skipDrop",
"slotNames",
"timeout",
"topK",
"topRate",
"uniformDrop",
"useBarrierExecutionMode",
"useMissing",
"useSingleDatasetMode",
"validationIndicatorCol",
"verbosity",
"weightCol",
"xGBoostDartMode",
"zeroAsMissing",
"objective",
]
ParamList_LightGBM_Classifier = ParamList_LightGBM_Base + [
"isUnbalance",
"probabilityCol",
"rawPredictionCol",
"thresholds",
]
ParamList_LightGBM_Regressor = ParamList_LightGBM_Base + ["tweedieVariancePower"]
ParamList_LightGBM_Ranker = ParamList_LightGBM_Base + [
"groupCol",
"evalAt",
"labelGain",
"maxPosition",
]

View File

@ -0,0 +1,212 @@
import numpy as np
from typing import Union
from flaml.automl.spark import psSeries, F
from pyspark.ml.evaluation import (
BinaryClassificationEvaluator,
RegressionEvaluator,
MulticlassClassificationEvaluator,
MultilabelClassificationEvaluator,
RankingEvaluator,
)
def ps_group_counts(groups: Union[psSeries, np.ndarray]) -> list:
if isinstance(groups, np.ndarray):
_, i, c = np.unique(groups, return_counts=True, return_index=True)
else:
i = groups.drop_duplicates().index.values
c = groups.value_counts().sort_index().to_numpy()
return c[np.argsort(i)].tolist()
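# Illustrative example: counts are returned in order of first appearance of each group.
#     ps_group_counts(np.array([0, 0, 1, 1, 1, 2]))  # -> [2, 3, 1]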
def _process_df(df, label_col, prediction_col):
df = df.withColumn(label_col, F.array([df[label_col]]))
df = df.withColumn(prediction_col, F.array([df[prediction_col]]))
return df
def _compute_label_from_probability(df, probability_col, prediction_col):
# array_max finds the maximum value in the 'probability' array
# array_position finds the index of the maximum value in the 'probability' array
max_index_expr = F.expr(f"array_position({probability_col}, array_max({probability_col}))-1")
# Create a new column 'prediction' based on the maximum probability value
df = df.withColumn(prediction_col, max_index_expr.cast("double"))
return df
def spark_metric_loss_score(
metric_name: str,
y_predict: psSeries,
y_true: psSeries,
sample_weight: psSeries = None,
groups: psSeries = None,
) -> float:
"""
Compute the loss score of a metric for spark models.
Args:
metric_name: str | the name of the metric.
y_predict: psSeries | the predicted values.
y_true: psSeries | the true values.
sample_weight: psSeries | the sample weights. Default: None.
groups: psSeries | the group of each row. Default: None.
Returns:
float | the loss score. A lower value indicates a better model.
"""
import warnings
warnings.filterwarnings("ignore")
label_col = "label"
prediction_col = "prediction"
kwargs = {}
y_predict.name = prediction_col
y_true.name = label_col
df = y_predict.to_frame().join(y_true)
if sample_weight is not None:
sample_weight.name = "weight"
df = df.join(sample_weight)
kwargs = {"weightCol": "weight"}
df = df.to_spark()
metric_name = metric_name.lower()
min_mode_metrics = ["log_loss", "rmse", "mse", "mae"]
if metric_name == "rmse":
evaluator = RegressionEvaluator(
metricName="rmse",
labelCol=label_col,
predictionCol=prediction_col,
**kwargs,
)
elif metric_name == "mse":
evaluator = RegressionEvaluator(
metricName="mse",
labelCol=label_col,
predictionCol=prediction_col,
**kwargs,
)
elif metric_name == "mae":
evaluator = RegressionEvaluator(
metricName="mae",
labelCol=label_col,
predictionCol=prediction_col,
**kwargs,
)
elif metric_name == "r2":
evaluator = RegressionEvaluator(
metricName="r2",
labelCol=label_col,
predictionCol=prediction_col,
**kwargs,
)
elif metric_name == "var":
evaluator = RegressionEvaluator(
metricName="var",
labelCol=label_col,
predictionCol=prediction_col,
**kwargs,
)
elif metric_name == "roc_auc":
evaluator = BinaryClassificationEvaluator(
metricName="areaUnderROC",
labelCol=label_col,
rawPredictionCol=prediction_col,
**kwargs,
)
elif metric_name == "pr_auc":
evaluator = BinaryClassificationEvaluator(
metricName="areaUnderPR",
labelCol=label_col,
rawPredictionCol=prediction_col,
**kwargs,
)
elif metric_name == "accuracy":
evaluator = MulticlassClassificationEvaluator(
metricName="accuracy",
labelCol=label_col,
predictionCol=prediction_col,
**kwargs,
)
elif metric_name == "log_loss":
# For log_loss, prediction_col holds probabilities, so first derive the predicted label from them
df = _compute_label_from_probability(df, prediction_col, prediction_col + "_label")
evaluator = MulticlassClassificationEvaluator(
metricName="logLoss",
labelCol=label_col,
predictionCol=prediction_col + "_label",
probabilityCol=prediction_col,
**kwargs,
)
elif metric_name == "f1":
evaluator = MulticlassClassificationEvaluator(
metricName="f1",
labelCol=label_col,
predictionCol=prediction_col,
**kwargs,
)
elif metric_name == "micro_f1":
evaluator = MultilabelClassificationEvaluator(
metricName="microF1Measure",
labelCol=label_col,
predictionCol=prediction_col,
**kwargs,
)
elif metric_name == "macro_f1":
evaluator = MultilabelClassificationEvaluator(
metricName="f1MeasureByLabel",
labelCol=label_col,
predictionCol=prediction_col,
**kwargs,
)
elif metric_name == "ap":
evaluator = RankingEvaluator(
metricName="meanAveragePrecision",
labelCol=label_col,
predictionCol=prediction_col,
)
elif "ndcg" in metric_name:
# TODO: check whether the spark.ml ranker has the same format as the
# SynapseML ranker; may need to adjust the format of df
if "@" in metric_name:
k = int(metric_name.split("@", 1)[-1])
if groups is None:
evaluator = RankingEvaluator(
metricName="ndcgAtK",
labelCol=label_col,
predictionCol=prediction_col,
k=k,
)
df = _process_df(df, label_col, prediction_col)
score = 1 - evaluator.evaluate(df)
else:
counts = ps_group_counts(groups)
score = 0
psum = 0
for c in counts:
y_true_ = y_true[psum : psum + c]
y_predict_ = y_predict[psum : psum + c]
df = y_true_.to_frame().join(y_predict_).to_spark()
df = _process_df(df, label_col, prediction_col)
evaluator = RankingEvaluator(
metricName="ndcgAtK",
labelCol=label_col,
predictionCol=prediction_col,
k=k,
)
score -= evaluator.evaluate(df)
psum += c
score /= len(counts)
score += 1
else:
evaluator = RankingEvaluator(metricName="ndcgAtK", labelCol=label_col, predictionCol=prediction_col)
df = _process_df(df, label_col, prediction_col)
score = 1 - evaluator.evaluate(df)
return score
else:
raise ValueError(f"Unknown metric name: {metric_name} for spark models.")
return evaluator.evaluate(df) if metric_name in min_mode_metrics else 1 - evaluator.evaluate(df)
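# Usage sketch (assumes an active Spark session and pandas-on-Spark inputs):
#     import pyspark.pandas as ps
#     y_true = ps.Series([3.0, -0.5, 2.0])
#     y_pred = ps.Series([2.5, 0.0, 2.0])
#     spark_metric_loss_score("rmse", y_pred, y_true)  # lower is better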

255
flaml/automl/spark/utils.py Normal file
View File

@ -0,0 +1,255 @@
import logging
from typing import Union, List, Optional, Tuple
import numpy as np
from flaml.automl.spark import (
sparkDataFrame,
ps,
F,
T,
psDataFrame,
psSeries,
_spark_major_minor_version,
DataFrame,
Series,
set_option,
)
logger = logging.getLogger(__name__)
logger_formatter = logging.Formatter(
"[%(name)s: %(asctime)s] {%(lineno)d} %(levelname)s - %(message)s", "%m-%d %H:%M:%S"
)
logger.propagate = False
def to_pandas_on_spark(
df: Union[DataFrame, sparkDataFrame, Series, psDataFrame, psSeries],
index_col: Optional[str] = None,
default_index_type: Optional[str] = "distributed-sequence",
) -> Union[psDataFrame, psSeries]:
"""Convert pandas or pyspark dataframe/series to pandas_on_Spark dataframe/series.
Args:
df: pandas.DataFrame/series or pyspark dataframe | The input dataframe/series.
index_col: str, optional | The column name to use as index, default None.
default_index_type: str, optional | The default index type, default "distributed-sequence".
Returns:
pyspark.pandas.DataFrame/Series: The converted pandas-on-Spark dataframe/series.
```python
import pandas as pd
from flaml.automl.spark.utils import to_pandas_on_spark
pdf = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
psdf = to_pandas_on_spark(pdf)
print(psdf)
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
sdf = spark.createDataFrame(pdf)
psdf = to_pandas_on_spark(sdf)
print(psdf)
pds = pd.Series([1, 2, 3])
pss = to_pandas_on_spark(pds)
print(pss)
```
"""
set_option("compute.default_index_type", default_index_type)
if isinstance(df, (DataFrame, Series)):
return ps.from_pandas(df)
elif isinstance(df, sparkDataFrame):
if _spark_major_minor_version[0] == 3 and _spark_major_minor_version[1] < 3:
return df.to_pandas_on_spark(index_col=index_col)
else:
return df.pandas_api(index_col=index_col)
elif isinstance(df, (psDataFrame, psSeries)):
return df
else:
raise TypeError(f"{type(df)} is not one of pandas.DataFrame, pandas.Series and pyspark.sql.DataFrame")
def train_test_split_pyspark(
df: Union[sparkDataFrame, psDataFrame],
stratify_column: Optional[str] = None,
test_fraction: Optional[float] = 0.2,
seed: Optional[int] = 1234,
to_pandas_spark: Optional[bool] = True,
index_col: Optional[str] = "tmp_index_col",
) -> Tuple[Union[sparkDataFrame, psDataFrame], Union[sparkDataFrame, psDataFrame]]:
"""Split a pyspark dataframe into train and test dataframes.
Args:
df: pyspark.sql.DataFrame | The input dataframe.
stratify_column: str | The column name to stratify the split. Default None.
test_fraction: float | The fraction of the test data. Default 0.2.
seed: int | The random seed. Default 1234.
to_pandas_spark: bool | Whether to convert the output to pandas_on_spark. Default True.
index_col: str | The column name to use as index. Default "tmp_index_col".
Returns:
pyspark.sql.DataFrame/pandas_on_spark DataFrame | The train dataframe.
pyspark.sql.DataFrame/pandas_on_spark DataFrame | The test dataframe.
"""
import warnings
warnings.filterwarnings("ignore")
if isinstance(df, psDataFrame):
df = df.to_spark(index_col=index_col)
if stratify_column:
# Test data
test_fraction_dict = (
df.select(stratify_column).distinct().withColumn("fraction", F.lit(test_fraction)).rdd.collectAsMap()
)
df_test = df.stat.sampleBy(stratify_column, test_fraction_dict, seed)
# Train data
df_train = df.subtract(df_test)
else:
df_train, df_test = df.randomSplit([1 - test_fraction, test_fraction], seed)
if to_pandas_spark:
df_train = to_pandas_on_spark(df_train, index_col=index_col)
df_test = to_pandas_on_spark(df_test, index_col=index_col)
df_train.index.name = None
df_test.index.name = None
elif index_col == "tmp_index_col":
df_train = df_train.drop(index_col)
df_test = df_test.drop(index_col)
return [df_train, df_test]
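# Usage sketch (assumes an active SparkSession named spark):
#     sdf = spark.createDataFrame([(1, "a"), (2, "b"), (3, "a"), (4, "b")], ["x", "label"])
#     train, test = train_test_split_pyspark(sdf, stratify_column="label", test_fraction=0.25)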
def unique_pandas_on_spark(psds: Union[psSeries, psDataFrame]) -> Tuple[np.ndarray, np.ndarray]:
"""Get the unique values and counts of a pandas_on_spark series."""
if isinstance(psds, psDataFrame):
psds = psds.iloc[:, 0]
_tmp = psds.value_counts().to_pandas()
label_set = _tmp.index.values
counts = _tmp.values
return label_set, counts
def len_labels(y: Union[psSeries, np.ndarray], return_labels=False) -> Union[int, Optional[np.ndarray]]:
"""Get the number of unique labels in y."""
if not isinstance(y, (psDataFrame, psSeries)):
labels = np.unique(y)
else:
labels = y.unique() if isinstance(y, psSeries) else y.iloc[:, 0].unique()
if return_labels:
return len(labels), labels
return len(labels)
def unique_value_first_index(y: Union[Series, psSeries, np.ndarray]) -> Tuple[np.ndarray, np.ndarray]:
"""Get the unique values and indices of a pandas series,
pandas_on_spark series or numpy array."""
if isinstance(y, psSeries):
y_unique = y.drop_duplicates().sort_index()
label_set = y_unique.values
first_index = y_unique.index.values
else:
label_set, first_index = np.unique(y, return_index=True)
return label_set, first_index
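# Illustrative example with a plain numpy array:
#     unique_value_first_index(np.array(["b", "a", "b"]))
#     # -> (array(['a', 'b']), array([1, 0])): sorted labels with the index of each first occurrence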
def iloc_pandas_on_spark(
psdf: Union[psDataFrame, psSeries, DataFrame, Series],
index: Union[int, slice, list],
index_col: Optional[str] = "tmp_index_col",
) -> Union[psDataFrame, psSeries]:
"""Get the rows of a pandas_on_spark dataframe/series by index."""
import warnings
warnings.filterwarnings("ignore")
if isinstance(psdf, (DataFrame, Series)):
return psdf.iloc[index]
if isinstance(index, (int, slice)):
if isinstance(psdf, psSeries):
return psdf.iloc[index]
else:
return psdf.iloc[index, :]
elif isinstance(index, list):
if isinstance(psdf, psSeries):
sdf = psdf.to_frame().to_spark(index_col=index_col)
else:
if index_col not in psdf.columns:
sdf = psdf.to_spark(index_col=index_col)
else:
sdf = psdf.to_spark()
sdfiloc = sdf.filter(F.col(index_col).isin(index))
psdfiloc = to_pandas_on_spark(sdfiloc)
if isinstance(psdf, psSeries):
psdfiloc = psdfiloc[psdfiloc.columns.drop(index_col)[0]]
elif index_col not in psdf.columns:
psdfiloc = psdfiloc.drop(columns=[index_col])
return psdfiloc
else:
raise TypeError(f"{type(index)} is not one of int, slice and list for pandas_on_spark iloc")
def spark_kFold(
dataset: Union[sparkDataFrame, psDataFrame],
nFolds: int = 3,
foldCol: str = "",
seed: int = 42,
index_col: Optional[str] = "tmp_index_col",
) -> List[Tuple[psDataFrame, psDataFrame]]:
"""Generate k-fold splits for a Spark DataFrame.
Adopted from https://spark.apache.org/docs/latest/api/python/_modules/pyspark/ml/tuning.html#CrossValidator
Args:
dataset: sparkDataFrame / psDataFrame. | The DataFrame to split.
nFolds: int | The number of folds. Default is 3.
foldCol: str | The column name to use for fold numbers. If not specified,
the DataFrame will be randomly split. Default is "".
When foldCol is specified, the same group will not appear in two
different folds (the number of distinct groups has to be at least equal
to the number of folds), and the folds are approximately balanced in the
sense that the number of distinct groups is approximately the same in each fold.
seed: int | The random seed. Default is 42.
index_col: str | The name of the index column. Default is "tmp_index_col".
Returns:
A list of (train, validation) DataFrames.
"""
import warnings
warnings.filterwarnings("ignore")
if isinstance(dataset, psDataFrame):
dataset = dataset.to_spark(index_col=index_col)
datasets = []
if not foldCol:
# Do random k-fold split.
h = 1.0 / nFolds
randCol = f"rand_col_{seed}"
df = dataset.select("*", F.rand(seed).alias(randCol))
for i in range(nFolds):
validateLB = i * h
validateUB = (i + 1) * h
condition = (df[randCol] >= validateLB) & (df[randCol] < validateUB)
validation = to_pandas_on_spark(df.filter(condition), index_col=index_col)
train = to_pandas_on_spark(df.filter(~condition), index_col=index_col)
datasets.append((train.drop(columns=[randCol]), validation.drop(columns=[randCol])))
else:
# Use user-specified fold column
def get_fold_num(foldNum: int) -> int:
return int(foldNum % nFolds)
get_fold_num_udf = F.UserDefinedFunction(get_fold_num, T.IntegerType())
for i in range(nFolds):
training = dataset.filter(get_fold_num_udf(dataset[foldCol]) != F.lit(i))
validation = dataset.filter(get_fold_num_udf(dataset[foldCol]) == F.lit(i))
if training.rdd.getNumPartitions() == 0 or len(training.take(1)) == 0:
raise ValueError("The training data at fold %s is empty." % i)
if validation.rdd.getNumPartitions() == 0 or len(validation.take(1)) == 0:
raise ValueError("The validation data at fold %s is empty." % i)
training = to_pandas_on_spark(training, index_col=index_col)
validation = to_pandas_on_spark(validation, index_col=index_col)
datasets.append((training, validation))
return datasets
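# Usage sketch (assumes sdf is a pyspark DataFrame):
#     for train, validation in spark_kFold(sdf, nFolds=3):
#         ...  # train and validation are pandas-on-Spark DataFrames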

401
flaml/automl/state.py Normal file
View File

@ -0,0 +1,401 @@
import inspect
import copy
import time
from typing import Any, Optional
import numpy as np
from flaml import tune
from flaml.automl.logger import logger
from flaml.automl.ml import compute_estimator, train_estimator
from flaml.automl.time_series.ts_data import TimeSeriesDataset
from flaml.automl.spark import psDataFrame, psSeries, DataFrame, Series
class SearchState:
@property
def search_space(self):
return self._search_space_domain
@property
def estimated_cost4improvement(self):
return max(
self.time_best_found - self.time_best_found_old,
self.total_time_used - self.time_best_found,
)
def valid_starting_point_one_dim(self, value_one_dim, domain_one_dim):
from flaml.tune.space import sample
"""
For each hp in the starting point, check the following 3 conditions:
(1) If the type of the starting point does not match the required type in search space, return false
(2) If the starting point is not in the required search space, return false
(3) If the search space is a value instead of domain, and the value is not equal to the starting point
Notice (2) include the case starting point not in user specified search space custom_hp
"""
if isinstance(domain_one_dim, sample.Domain):
renamed_type = list(inspect.signature(domain_one_dim.is_valid).parameters.values())[0].annotation
type_match = (
renamed_type == Any
or isinstance(value_one_dim, renamed_type)
or isinstance(value_one_dim, int)
and renamed_type is float
)
if not (type_match and domain_one_dim.is_valid(value_one_dim)):
return False
elif value_one_dim != domain_one_dim:
return False
return True
def valid_starting_point(self, starting_point, search_space):
return all(
self.valid_starting_point_one_dim(value, search_space[name].get("domain"))
for name, value in starting_point.items()
if name != "FLAML_sample_size"
)
def __init__(
self,
learner_class,
data,
task,
starting_point=None,
period=None,
custom_hp=None,
max_iter=None,
budget=None,
):
self.init_eci = learner_class.cost_relative2lgbm() if budget >= 0 else 1
self._search_space_domain = {}
self.init_config = None
self.low_cost_partial_config = {}
self.cat_hp_cost = {}
self.ls_ever_converged = False
self.learner_class = learner_class
self._budget = budget
if task.is_ts_forecast():
data_size = data.train_data.shape
search_space = learner_class.search_space(data=data, task=task, pred_horizon=period)
else:
data_size = data.shape
search_space = learner_class.search_space(data_size=data_size, task=task)
self.data_size = data_size
if custom_hp is not None:
search_space.update(custom_hp)
if isinstance(starting_point, dict):
starting_point = AutoMLState.sanitize(starting_point)
if max_iter > 1 and not self.valid_starting_point(starting_point, search_space):
# If the number of iterations is larger than 1, remove invalid point
logger.warning(
"Starting point {} removed because it is outside of the search space".format(starting_point)
)
starting_point = None
elif isinstance(starting_point, list):
starting_point = [AutoMLState.sanitize(x) for x in starting_point]
if max_iter > len(starting_point):
# Only validate the starting points when max_iter exceeds their number; otherwise skip the check
starting_point_len = len(starting_point)
starting_point = [x for x in starting_point if self.valid_starting_point(x, search_space)]
if starting_point_len > len(starting_point):
logger.warning(
"Starting points outside of the search space are removed. "
f"Remaining starting points for {learner_class}: {starting_point}"
)
starting_point = starting_point or None
for name, space in search_space.items():
assert "domain" in space, f"{name}'s domain is missing in the search space spec {space}"
if space["domain"] is None:
# don't search this hp
continue
self._search_space_domain[name] = space["domain"]
if "low_cost_init_value" in space:
self.low_cost_partial_config[name] = space["low_cost_init_value"]
if "cat_hp_cost" in space:
self.cat_hp_cost[name] = space["cat_hp_cost"]
# if a starting point is provided, set the init config to be
# the starting point provided
if isinstance(starting_point, dict) and starting_point.get(name) is not None:
if self.init_config is None:
self.init_config = {}
self.init_config[name] = starting_point[name]
elif (
not isinstance(starting_point, list)
and "init_value" in space
and self.valid_starting_point_one_dim(space["init_value"], space["domain"])
):
if self.init_config is None:
self.init_config = {}
self.init_config[name] = space["init_value"]
if isinstance(starting_point, list):
self.init_config = starting_point
else:
self.init_config = [] if self.init_config is None else [self.init_config]
self._hp_names = list(self._search_space_domain.keys())
self.search_alg = None
self.best_config = None
self.best_result = None
self.best_loss = self.best_loss_old = np.inf
self.total_time_used = 0
self.total_iter = 0
self.base_eci = None
self.time_best_found = self.time_best_found_old = 0
self.time2eval_best = 0
self.time2eval_best_old = 0
self.trained_estimator = None
self.sample_size = None
self.trial_time = 0
def update(self, result, time_used):
if result:
config = result["config"]
if config and "FLAML_sample_size" in config:
self.sample_size = config["FLAML_sample_size"]
else:
self.sample_size = self.data_size[0]
obj = result["val_loss"]
metric_for_logging = result["metric_for_logging"]
time2eval = result["time_total_s"]
trained_estimator = result["trained_estimator"]
del result["trained_estimator"] # free up RAM
n_iter = (
trained_estimator
and hasattr(trained_estimator, "ITER_HP")
and trained_estimator.params.get(trained_estimator.ITER_HP)
)
if n_iter:
if "ml" in config:
config["ml"][trained_estimator.ITER_HP] = n_iter
else:
config[trained_estimator.ITER_HP] = n_iter
else:
obj, time2eval, trained_estimator = np.inf, 0.0, None
metric_for_logging = config = None
self.trial_time = time2eval
self.total_time_used += time_used if self._budget >= 0 else 1
self.total_iter += 1
if self.base_eci is None:
self.base_eci = time_used
if (obj is not None) and (obj < self.best_loss):
self.best_loss_old = self.best_loss if self.best_loss < np.inf else 2 * obj
self.best_loss = obj
self.best_result = result
self.time_best_found_old = self.time_best_found
self.time_best_found = self.total_time_used
self.iter_best_found = self.total_iter
self.best_config = config
self.best_config_sample_size = self.sample_size
self.best_config_train_time = time_used
if time2eval:
self.time2eval_best_old = self.time2eval_best
self.time2eval_best = time2eval
if self.trained_estimator and trained_estimator and self.trained_estimator != trained_estimator:
self.trained_estimator.cleanup()
if trained_estimator:
self.trained_estimator = trained_estimator
elif trained_estimator:
trained_estimator.cleanup()
self.metric_for_logging = metric_for_logging
self.val_loss, self.config = obj, config
def get_hist_config_sig(self, sample_size, config):
config_values = tuple([config[k] for k in self._hp_names if k in config])
config_sig = str(sample_size) + "_" + str(config_values)
return config_sig
def est_retrain_time(self, retrain_sample_size):
assert self.best_config_sample_size is not None, "need to first get best_config_sample_size"
return self.time2eval_best * retrain_sample_size / self.best_config_sample_size
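# Illustrative arithmetic: if the best config was evaluated in 2s on a sample of
# 10000 rows, retraining on 40000 rows is estimated to take 2 * 40000 / 10000 = 8s.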
class AutoMLState:
def prepare_sample_train_data(self, sample_size: int):
sampled_weight = groups = None
if sample_size <= self.data_size[0]:
if isinstance(self.X_train, TimeSeriesDataset):
sampled_X_train = copy.copy(self.X_train)
sampled_X_train.train_data = self.X_train.train_data.iloc[-sample_size:]
sampled_y_train = None
else:
if isinstance(self.X_train, (DataFrame, psDataFrame)):
sampled_X_train = self.X_train.iloc[:sample_size]
else:
sampled_X_train = self.X_train[:sample_size]
if isinstance(self.y_train, (Series, psSeries)):
sampled_y_train = self.y_train.iloc[:sample_size]
else:
sampled_y_train = self.y_train[:sample_size]
weight = self.fit_kwargs.get(
"sample_weight"
) # NOTE: _prepare_sample_train_data is before kwargs is updated to fit_kwargs_by_estimator
if weight is not None:
sampled_weight = (
weight.iloc[:sample_size] if isinstance(weight, (Series, psSeries)) else weight[:sample_size]
)
if self.groups is not None:
groups = (
self.groups.iloc[:sample_size]
if isinstance(self.groups, (Series, psSeries))
else self.groups[:sample_size]
)
else:
sampled_X_train = self.X_train_all
sampled_y_train = self.y_train_all
if (
"sample_weight" in self.fit_kwargs
): # NOTE: _prepare_sample_train_data is before kwargs is updated to fit_kwargs_by_estimator
sampled_weight = self.sample_weight_all
if self.groups is not None:
groups = self.groups_all
return sampled_X_train, sampled_y_train, sampled_weight, groups
@staticmethod
def _compute_with_config_base(
config_w_resource: dict,
state: "AutoMLState",
estimator: str,
is_report: bool = True,
) -> dict:
if "FLAML_sample_size" in config_w_resource:
sample_size = int(config_w_resource["FLAML_sample_size"])
else:
sample_size = state.data_size[0]
this_estimator_kwargs = state.fit_kwargs_by_estimator.get(
estimator
).copy() # NOTE: _compute_with_config_base is after kwargs is updated to fit_kwargs_by_estimator
(
sampled_X_train,
sampled_y_train,
sampled_weight,
groups,
) = state.task.prepare_sample_train_data(state, sample_size)
if sampled_weight is not None:
weight = this_estimator_kwargs["sample_weight"]
this_estimator_kwargs["sample_weight"] = sampled_weight
if groups is not None:
this_estimator_kwargs["groups"] = groups
config = config_w_resource.copy()
if "FLAML_sample_size" in config:
del config["FLAML_sample_size"]
budget = (
None
if state.time_budget < 0
else state.time_budget - state.time_from_start
if sample_size == state.data_size[0]
else (state.time_budget - state.time_from_start) / 2 * sample_size / state.data_size[0]
)
(
trained_estimator,
val_loss,
metric_for_logging,
_,
pred_time,
) = compute_estimator(
sampled_X_train,
sampled_y_train,
state.X_val,
state.y_val,
state.weight_val,
state.groups_val,
state.train_time_limit if budget is None else min(budget, state.train_time_limit or np.inf),
state.kf,
config,
state.task,
estimator,
state.eval_method,
state.metric,
state.best_loss,
state.n_jobs,
state.learner_classes.get(estimator),
state.cv_score_agg_func,
state.log_training_metric,
this_estimator_kwargs,
state.free_mem_ratio,
)
if state.retrain_final and not state.model_history:
trained_estimator.cleanup()
result = {
"pred_time": pred_time,
"wall_clock_time": time.time() - state._start_time_flag,
"metric_for_logging": metric_for_logging,
"val_loss": val_loss,
"trained_estimator": trained_estimator,
}
if sampled_weight is not None:
this_estimator_kwargs["sample_weight"] = weight
if is_report is True:
tune.report(**result)
return result
@classmethod
def sanitize(cls, config: dict) -> dict:
"""Make a config ready for passing to estimator."""
config = config.get("ml", config).copy()
config.pop("FLAML_sample_size", None)
config.pop("learner", None)
config.pop("_choice_", None)
return config
def _train_with_config(
self,
estimator: str,
config_w_resource: dict,
sample_size: Optional[int] = None,
):
if not sample_size:
sample_size = config_w_resource.get("FLAML_sample_size", len(self.y_train_all))
config = AutoMLState.sanitize(config_w_resource)
this_estimator_kwargs = self.fit_kwargs_by_estimator.get(
estimator
).copy() # NOTE: _train_with_config is after kwargs is updated to fit_kwargs_by_estimator
(
sampled_X_train,
sampled_y_train,
sampled_weight,
groups,
) = self.task.prepare_sample_train_data(self, sample_size)
if sampled_weight is not None:
weight = this_estimator_kwargs[
"sample_weight"
] # NOTE: _train_with_config is after kwargs is updated to fit_kwargs_by_estimator
this_estimator_kwargs[
"sample_weight"
] = sampled_weight # NOTE: _train_with_config is after kwargs is updated to fit_kwargs_by_estimator
if groups is not None:
this_estimator_kwargs[
"groups"
] = groups # NOTE: _train_with_config is after kwargs is updated to fit_kwargs_by_estimator
budget = None if self.time_budget < 0 else self.time_budget - self.time_from_start
estimator, train_time = train_estimator(
X_train=sampled_X_train,
y_train=sampled_y_train,
config_dic=config,
task=self.task,
estimator_name=estimator,
n_jobs=self.n_jobs,
estimator_class=self.learner_classes.get(estimator),
budget=budget,
fit_kwargs=this_estimator_kwargs, # NOTE: _train_with_config is after kwargs is updated to fit_kwargs_by_estimator
eval_metric=self.metric if hasattr(self, "metric") else "train_time",
free_mem_ratio=self.free_mem_ratio,
)
if sampled_weight is not None:
this_estimator_kwargs[
"sample_weight"
] = weight # NOTE: _train_with_config is after kwargs is updated to fit_kwargs_by_estimator
return estimator, train_time

View File

@ -0,0 +1 @@
from .task import Task

View File

@ -0,0 +1,19 @@
from typing import Optional, Union
import numpy as np
from flaml.automl.data import DataFrame, Series
from flaml.automl.task.task import Task, TS_FORECAST
def task_factory(
task_name: str,
X_train: Optional[Union[np.ndarray, DataFrame]] = None,
y_train: Optional[Union[np.ndarray, DataFrame, Series]] = None,
) -> Task:
from flaml.automl.task.generic_task import GenericTask
from flaml.automl.task.time_series_task import TimeSeriesTask
if task_name in TS_FORECAST:
return TimeSeriesTask(task_name, X_train, y_train)
else:
return GenericTask(task_name, X_train, y_train)
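# Usage sketch (task names assumed to follow FLAML's conventions):
#     task = task_factory("classification")  # -> GenericTask
#     ts_task = task_factory("ts_forecast")  # -> TimeSeriesTask, assuming "ts_forecast" is among the TS_FORECAST names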

View File

@ -0,0 +1,880 @@
import logging
import time
from typing import List, Optional
import numpy as np
from flaml.automl.data import TS_TIMESTAMP_COL, concat
from flaml.automl.ml import EstimatorSubclass, get_val_loss, default_cv_score_agg_func
from flaml.automl.task.task import (
Task,
get_classification_objective,
TS_FORECAST,
TS_FORECASTPANEL,
)
from flaml.config import RANDOM_SEED
from flaml.automl.spark import ps, psDataFrame, psSeries, pd
from flaml.automl.spark.utils import (
iloc_pandas_on_spark,
spark_kFold,
train_test_split_pyspark,
unique_pandas_on_spark,
unique_value_first_index,
len_labels,
set_option,
)
try:
from scipy.sparse import issparse
except ImportError:
pass
try:
from sklearn.utils import shuffle
from sklearn.model_selection import (
train_test_split,
RepeatedStratifiedKFold,
RepeatedKFold,
GroupKFold,
TimeSeriesSplit,
GroupShuffleSplit,
StratifiedGroupKFold,
)
except ImportError:
pass
logger = logging.getLogger(__name__)
class GenericTask(Task):
@property
def estimators(self):
if self._estimators is None:
# put this into a function to avoid circular dependency
from flaml.automl.model import (
XGBoostSklearnEstimator,
XGBoostLimitDepthEstimator,
RandomForestEstimator,
LGBMEstimator,
LRL1Classifier,
LRL2Classifier,
CatBoostEstimator,
ExtraTreesEstimator,
KNeighborsEstimator,
TransformersEstimator,
TransformersEstimatorModelSelection,
SparkLGBMEstimator,
)
self._estimators = {
"xgboost": XGBoostSklearnEstimator,
"xgb_limitdepth": XGBoostLimitDepthEstimator,
"rf": RandomForestEstimator,
"lgbm": LGBMEstimator,
"lgbm_spark": SparkLGBMEstimator,
"lrl1": LRL1Classifier,
"lrl2": LRL2Classifier,
"catboost": CatBoostEstimator,
"extra_tree": ExtraTreesEstimator,
"kneighbor": KNeighborsEstimator,
"transformer": TransformersEstimator,
"transformer_ms": TransformersEstimatorModelSelection,
}
return self._estimators
def validate_data(
self,
automl,
state,
X_train_all,
y_train_all,
dataframe,
label,
X_val=None,
y_val=None,
groups_val=None,
groups=None,
):
if X_train_all is not None and y_train_all is not None:
assert isinstance(X_train_all, (np.ndarray, pd.DataFrame, psDataFrame)) or issparse(X_train_all), (
"X_train_all must be a numpy array, a pandas dataframe, "
"a Scipy sparse matrix or a pyspark.pandas dataframe."
)
assert isinstance(
y_train_all, (np.ndarray, pd.Series, psSeries)
), "y_train_all must be a numpy array, a pandas series or a pyspark.pandas series."
assert X_train_all.size != 0 and y_train_all.size != 0, "Input data must not be empty."
if isinstance(X_train_all, np.ndarray) and len(X_train_all.shape) == 1:
X_train_all = np.reshape(X_train_all, (X_train_all.size, 1))
if isinstance(y_train_all, np.ndarray):
y_train_all = y_train_all.flatten()
assert X_train_all.shape[0] == y_train_all.shape[0], "# rows in X_train must match length of y_train."
if isinstance(X_train_all, psDataFrame):
X_train_all = X_train_all.spark.cache() # cache data to improve compute speed
y_train_all = y_train_all.to_frame().spark.cache()[y_train_all.name]
logger.debug(f"X_train_all and y_train_all cached, shape of X_train_all: {X_train_all.shape}")
automl._df = isinstance(X_train_all, (pd.DataFrame, psDataFrame))
automl._nrow, automl._ndim = X_train_all.shape
if self.is_ts_forecast():
X_train_all = pd.DataFrame(X_train_all) if isinstance(X_train_all, np.ndarray) else X_train_all
X_train_all, y_train_all = self._validate_ts_data(X_train_all, y_train_all)
X, y = X_train_all, y_train_all
elif dataframe is not None and label is not None:
assert isinstance(
dataframe, (pd.DataFrame, psDataFrame)
), "dataframe must be a pandas DataFrame or a pyspark.pandas DataFrame."
assert (
label in dataframe.columns
), f"The provided label column name `{label}` doesn't exist in the provided dataframe."
if isinstance(dataframe, psDataFrame):
dataframe = dataframe.spark.cache() # cache data to improve compute speed
logger.debug(f"dataframe cached, shape of dataframe: {dataframe.shape}")
automl._df = True
if self.is_ts_forecast():
dataframe = self._validate_ts_data(dataframe)
# TODO: to support pyspark.sql.DataFrame and pure dataframe mode
X = dataframe.drop(columns=label)
automl._nrow, automl._ndim = X.shape
y = dataframe[label]
else:
raise ValueError("either X_train+y_train or dataframe+label are required")
# check the validity of input dimensions for NLP tasks, so need to check _is_nlp_task not estimator
if self.is_nlp():
from flaml.automl.nlp.utils import is_a_list_of_str
is_all_str = True
is_all_list = True
for column in X.columns:
assert X[column].dtype.name in (
"object",
"string",
), "If the task is an NLP task, X can only contain text columns"
for _, each_cell in X[column].items():
if each_cell is not None:
is_str = isinstance(each_cell, str)
is_list_of_int = isinstance(each_cell, list) and all(isinstance(x, int) for x in each_cell)
is_list_of_str = is_a_list_of_str(each_cell)
if self.is_token_classification():
assert is_list_of_str, (
"For the token-classification task, the input column needs to be a list of strings "
"instead of a string, e.g., ['EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.']. "
"For more examples, please refer to test/nlp/test_autohf_tokenclassification.py"
)
else:
assert is_str or is_list_of_int, (
"Each column of the input must either be str (untokenized) "
"or a list of integers (tokenized)"
)
is_all_str &= is_str
is_all_list &= is_list_of_int or is_list_of_str
assert is_all_str or is_all_list, (
"Currently FLAML only supports two modes for NLP: either all columns of X are string (non-tokenized), "
"or all columns of X are integer ids (tokenized)"
)
if isinstance(X, psDataFrame):
# TODO: support pyspark.pandas dataframe in DataTransformer
automl._skip_transform = True
if automl._skip_transform or issparse(X_train_all):
automl._transformer = automl._label_transformer = False
automl._X_train_all, automl._y_train_all = X, y
else:
from flaml.automl.data import DataTransformer
automl._transformer = DataTransformer()
(
automl._X_train_all,
automl._y_train_all,
) = automl._transformer.fit_transform(X, y, self)
automl._label_transformer = automl._transformer.label_transformer
if self.is_token_classification():
if hasattr(automl._label_transformer, "label_list"):
state.fit_kwargs.update({"label_list": automl._label_transformer.label_list})
elif "label_list" not in state.fit_kwargs:
for each_fit_kwargs in state.fit_kwargs_by_estimator.values():
assert "label_list" in each_fit_kwargs, (
"For the token-classification task, you must either (1) pass token labels; or (2) pass id labels and the label list. "
"Please refer to the documentation for more details: https://microsoft.github.io/FLAML/docs/Examples/AutoML-NLP#a-simple-token-classification-example"
)
automl._feature_names_in_ = (
automl._X_train_all.columns.to_list() if hasattr(automl._X_train_all, "columns") else None
)
automl._sample_weight_full = state.fit_kwargs.get(
"sample_weight"
) # NOTE: _validate_data is before kwargs is updated to fit_kwargs_by_estimator
if X_val is not None and y_val is not None:
assert isinstance(X_val, (np.ndarray, pd.DataFrame, psDataFrame)) or issparse(X_val), (
"X_val must be None, a numpy array, a pandas dataframe, "
"a Scipy sparse matrix or a pyspark.pandas dataframe."
)
assert isinstance(y_val, (np.ndarray, pd.Series, psSeries)), (
"y_val must be None, a numpy array, a pandas series " "or a pyspark.pandas series."
)
assert X_val.size != 0 and y_val.size != 0, (
"Validation data are expected to be nonempty. " "Use None for X_val and y_val if no validation data."
)
if isinstance(y_val, np.ndarray):
y_val = y_val.flatten()
assert X_val.shape[0] == y_val.shape[0], "# rows in X_val must match length of y_val."
if automl._transformer:
state.X_val = automl._transformer.transform(X_val)
else:
state.X_val = X_val
# If it's NLG_TASKS, y_val is a pandas series containing the output sequence tokens,
# so we cannot use label_transformer.transform to process it
if automl._label_transformer:
state.y_val = automl._label_transformer.transform(y_val)
else:
state.y_val = y_val
else:
state.X_val = state.y_val = None
if groups is not None and len(groups) != automl._nrow:
# groups is given as group counts
state.groups = np.concatenate([[i] * c for i, c in enumerate(groups)])
assert len(state.groups) == automl._nrow, "the sum of group counts must match the number of examples"
state.groups_val = (
np.concatenate([[i] * c for i, c in enumerate(groups_val)]) if groups_val is not None else None
)
else:
state.groups_val = groups_val
state.groups = groups
automl.data_size_full = len(automl._y_train_all)
@staticmethod
def _split_pyspark(state, X_train_all, y_train_all, split_ratio, stratify=None):
# TODO: optimize this
set_option("compute.ops_on_diff_frames", True)
if not isinstance(y_train_all, (psDataFrame, psSeries)):
raise ValueError("y_train_all must be a pyspark.pandas dataframe or series")
df_all_in_one = X_train_all.join(y_train_all)
stratify_column = y_train_all.name if isinstance(y_train_all, psSeries) else y_train_all.columns[0]
ret_sample_weight = False
if (
"sample_weight" in state.fit_kwargs
): # NOTE: _prepare_data is before kwargs is updated to fit_kwargs_by_estimator
# fit_kwargs["sample_weight"] is an numpy array
ps_sample_weight = ps.DataFrame(
state.fit_kwargs["sample_weight"],
columns=["sample_weight"],
)
df_all_in_one = df_all_in_one.join(ps_sample_weight)
ret_sample_weight = True
df_all_train, df_all_val = train_test_split_pyspark(
df_all_in_one,
None if stratify is None else stratify_column,
test_fraction=split_ratio,
seed=RANDOM_SEED,
)
columns_to_drop = [c for c in df_all_train.columns if c in [stratify_column, "sample_weight"]]
X_train = df_all_train.drop(columns_to_drop)
X_val = df_all_val.drop(columns_to_drop)
y_train = df_all_train[stratify_column]
y_val = df_all_val[stratify_column]
if ret_sample_weight:
return (
X_train,
X_val,
y_train,
y_val,
df_all_train["sample_weight"],
df_all_val["sample_weight"],
)
return X_train, X_val, y_train, y_val
@staticmethod
def _train_test_split(state, X, y, first=None, rest=None, split_ratio=0.2, stratify=None):
condition_type = isinstance(X, (psDataFrame, psSeries))
# NOTE: _prepare_data is before kwargs is updated to fit_kwargs_by_estimator
condition_param = "sample_weight" in state.fit_kwargs
if not condition_type and condition_param:
sample_weight = (
state.fit_kwargs["sample_weight"] if rest is None else state.fit_kwargs["sample_weight"][rest]
)
(
X_train,
X_val,
y_train,
y_val,
weight_train,
weight_val,
) = train_test_split(
X,
y,
sample_weight,
test_size=split_ratio,
stratify=stratify,
random_state=RANDOM_SEED,
)
if first is not None:
weight1 = state.fit_kwargs["sample_weight"][first]
state.weight_val = concat(weight1, weight_val)
state.fit_kwargs["sample_weight"] = concat(weight1, weight_train)
else:
state.weight_val = weight_val
state.fit_kwargs["sample_weight"] = weight_train
elif not condition_type and not condition_param:
X_train, X_val, y_train, y_val = train_test_split(
X,
y,
test_size=split_ratio,
stratify=stratify,
random_state=RANDOM_SEED,
)
elif condition_type and condition_param:
(
X_train,
X_val,
y_train,
y_val,
weight_train,
weight_val,
) = GenericTask._split_pyspark(state, X, y, split_ratio, stratify)
if first is not None:
weight1 = state.fit_kwargs["sample_weight"][first]
state.weight_val = concat(weight1, weight_val)
state.fit_kwargs["sample_weight"] = concat(weight1, weight_train)
else:
state.weight_val = weight_val
state.fit_kwargs["sample_weight"] = weight_train
else:
X_train, X_val, y_train, y_val = GenericTask._split_pyspark(state, X, y, split_ratio, stratify)
return X_train, X_val, y_train, y_val
def prepare_data(
self,
state,
X_train_all,
y_train_all,
auto_augment,
eval_method,
split_type,
split_ratio,
n_splits,
data_is_df,
sample_weight_full,
) -> int:
X_val, y_val = state.X_val, state.y_val
if issparse(X_val):
X_val = X_val.tocsr()
if issparse(X_train_all):
X_train_all = X_train_all.tocsr()
is_spark_dataframe = isinstance(X_train_all, (psDataFrame, psSeries))
self.is_spark_dataframe = is_spark_dataframe
if (
self.is_classification()
and auto_augment
and state.fit_kwargs.get("sample_weight")
is None # NOTE: _prepare_data is before kwargs is updated to fit_kwargs_by_estimator
and split_type in ["stratified", "uniform"]
and not self.is_token_classification()
):
# logger.info(f"label {pd.unique(y_train_all)}")
if is_spark_dataframe:
label_set, counts = unique_pandas_on_spark(y_train_all)
# TODO: optimize this
set_option("compute.ops_on_diff_frames", True)
else:
label_set, counts = np.unique(y_train_all, return_counts=True)
# augment rare classes
rare_threshld = 20
rare = counts < rare_threshld
rare_label, rare_counts = label_set[rare], counts[rare]
for i, label in enumerate(rare_label.tolist()):
count = rare_count = rare_counts[i]
rare_index = y_train_all == label
n = len(y_train_all)
while count < rare_threshold:
if data_is_df:
X_train_all = concat(X_train_all, X_train_all.iloc[:n].loc[rare_index])
else:
X_train_all = concat(X_train_all, X_train_all[:n][rare_index, :])
if isinstance(y_train_all, (pd.Series, psSeries)):
y_train_all = concat(y_train_all, y_train_all.iloc[:n].loc[rare_index])
else:
y_train_all = np.concatenate([y_train_all, y_train_all[:n][rare_index]])
count += rare_count
logger.info(f"class {label} augmented from {rare_count} to {count}")
SHUFFLE_SPLIT_TYPES = ["uniform", "stratified"]
if is_spark_dataframe:
# no need to shuffle pyspark dataframe
pass
elif split_type in SHUFFLE_SPLIT_TYPES:
if sample_weight_full is not None:
X_train_all, y_train_all, state.sample_weight_all = shuffle(
X_train_all,
y_train_all,
sample_weight_full,
random_state=RANDOM_SEED,
)
state.fit_kwargs[
"sample_weight"
] = (
state.sample_weight_all
) # NOTE: _prepare_data is before kwargs is updated to fit_kwargs_by_estimator
if isinstance(state.sample_weight_all, pd.Series):
state.sample_weight_all.reset_index(drop=True, inplace=True)
else:
X_train_all, y_train_all = shuffle(X_train_all, y_train_all, random_state=RANDOM_SEED)
if data_is_df:
X_train_all.reset_index(drop=True, inplace=True)
if isinstance(y_train_all, pd.Series):
y_train_all.reset_index(drop=True, inplace=True)
X_train, y_train = X_train_all, y_train_all
state.groups_all = state.groups
if X_val is None and eval_method == "holdout":
if split_type == "time":
assert not self.is_ts_forecast(), "For a TS forecast task, this code should never be called"
is_sample_weight = "sample_weight" in state.fit_kwargs
if not is_spark_dataframe and is_sample_weight:
(
X_train,
X_val,
y_train,
y_val,
state.fit_kwargs[
"sample_weight"
], # NOTE: _prepare_data is before kwargs is updated to fit_kwargs_by_estimator
state.weight_val,
) = train_test_split(
X_train_all,
y_train_all,
state.fit_kwargs[
"sample_weight"
], # NOTE: _prepare_data is before kwargs is updated to fit_kwargs_by_estimator
test_size=split_ratio,
shuffle=False,
)
elif not is_spark_dataframe and not is_sample_weight:
X_train, X_val, y_train, y_val = train_test_split(
X_train_all,
y_train_all,
test_size=split_ratio,
shuffle=False,
)
elif is_spark_dataframe and is_sample_weight:
(
X_train,
X_val,
y_train,
y_val,
state.fit_kwargs[
"sample_weight"
], # NOTE: _prepare_data is before kwargs is updated to fit_kwargs_by_estimator
state.weight_val,
) = self._split_pyspark(state, X_train_all, y_train_all, split_ratio)
else:
X_train, X_val, y_train, y_val = self._split_pyspark(state, X_train_all, y_train_all, split_ratio)
if split_type == "group":
gss = GroupShuffleSplit(n_splits=1, test_size=split_ratio, random_state=RANDOM_SEED)
for train_idx, val_idx in gss.split(X_train_all, y_train_all, state.groups_all):
if data_is_df:
X_train = X_train_all.iloc[train_idx]
X_val = X_train_all.iloc[val_idx]
else:
X_train, X_val = X_train_all[train_idx], X_train_all[val_idx]
y_train, y_val = y_train_all[train_idx], y_train_all[val_idx]
state.groups = state.groups_all[train_idx]
state.groups_val = state.groups_all[val_idx]
elif self.is_classification():
# for classification, make sure the labels are complete in both
# training and validation data
label_set, first = unique_value_first_index(y_train_all)
rest = []
last = 0
first.sort()
for i in range(len(first)):
rest.extend(range(last, first[i]))
last = first[i] + 1
rest.extend(range(last, len(y_train_all)))
X_first = X_train_all.iloc[first] if data_is_df else X_train_all[first]
X_rest = X_train_all.iloc[rest] if data_is_df else X_train_all[rest]
y_rest = (
y_train_all[rest]
if isinstance(y_train_all, np.ndarray)
else iloc_pandas_on_spark(y_train_all, rest)
if is_spark_dataframe
else y_train_all.iloc[rest]
)
stratify = y_rest if split_type == "stratified" else None
X_train, X_val, y_train, y_val = self._train_test_split(
state, X_rest, y_rest, first, rest, split_ratio, stratify
)
X_train = concat(X_first, X_train)
y_train = concat(label_set, y_train) if data_is_df else np.concatenate([label_set, y_train])
X_val = concat(X_first, X_val)
y_val = concat(label_set, y_val) if data_is_df else np.concatenate([label_set, y_val])
elif self.is_regression():
X_train, X_val, y_train, y_val = self._train_test_split(
state, X_train_all, y_train_all, split_ratio=split_ratio
)
state.data_size = X_train.shape
state.data_size_full = len(y_train_all)
state.X_train, state.y_train = X_train, y_train
state.X_val, state.y_val = X_val, y_val
state.X_train_all = X_train_all
state.y_train_all = y_train_all
y_train_all_size = y_train_all.size
if eval_method == "holdout":
state.kf = None
return
if split_type == "group":
# logger.info("Using GroupKFold")
assert len(state.groups_all) == y_train_all_size, "the length of groups must match the number of examples"
assert (
len_labels(state.groups_all) >= n_splits
), "the number of groups must be equal or larger than n_splits"
state.kf = GroupKFold(n_splits)
elif split_type == "stratified":
# logger.info("Using StratifiedKFold")
assert y_train_all_size >= n_splits, (
f"{n_splits}-fold cross validation" f" requires input data with at least {n_splits} examples."
)
assert y_train_all_size >= 2 * n_splits, (
f"{n_splits}-fold cross validation with metric=r2 "
f"requires input data with at least {n_splits*2} examples."
)
state.kf = RepeatedStratifiedKFold(n_splits=n_splits, n_repeats=1, random_state=RANDOM_SEED)
elif split_type == "time":
# logger.info("Using TimeSeriesSplit")
if self.is_ts_forecast() and not self.is_ts_forecastpanel():
period = state.fit_kwargs[
"period"
] # NOTE: _prepare_data is before kwargs is updated to fit_kwargs_by_estimator
if period * (n_splits + 1) > y_train_all_size:
n_splits = int(y_train_all_size / period - 1)
assert n_splits >= 2, (
f"cross validation for forecasting period={period}"
f" requires input data with at least {3 * period} examples."
)
logger.info(f"Using nsplits={n_splits} due to data size limit.")
state.kf = TimeSeriesSplit(n_splits=n_splits, test_size=period)
elif self.is_ts_forecastpanel():
n_groups = len(X_train.groupby(state.fit_kwargs.get("group_ids")).size())
period = state.fit_kwargs.get("period")
state.kf = TimeSeriesSplit(n_splits=n_splits, test_size=period * n_groups)
else:
state.kf = TimeSeriesSplit(n_splits=n_splits)
# state.kf = TimeSeriesSplit(n_splits=n_splits)
elif isinstance(split_type, str):
# logger.info("Using RepeatedKFold")
state.kf = RepeatedKFold(n_splits=n_splits, n_repeats=1, random_state=RANDOM_SEED)
else:
# logger.info("Using splitter object")
state.kf = split_type
if isinstance(state.kf, (GroupKFold, StratifiedGroupKFold)):
# self._split_type is either "group", a GroupKFold object, or a StratifiedGroupKFold object
state.kf.groups = state.groups_all
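# A hedged sketch of the label-completeness trick used in the classification
# branch of prepare_data above: np.unique(..., return_index=True) yields the
# first row index of each label, so those rows can be forced into the training
# split while the remaining rows ("rest") are split normally.
# unique_value_first_index is assumed to behave like this for numpy inputs.
import numpy as np

_y_labels_demo = np.array(["b", "a", "b", "c", "a", "c", "c"])
_labels_demo, _first_demo = np.unique(_y_labels_demo, return_index=True)
_rest_demo = np.setdiff1d(np.arange(len(_y_labels_demo)), _first_demo)
# the rows in _first_demo go to training, so every class is present there
# even if it is rare.
assert set(_first_demo) == {0, 1, 3}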
def decide_split_type(
self,
split_type,
y_train_all,
fit_kwargs,
groups=None,
) -> str:
assert not self.is_ts_forecast(), "This function should never be called as part of a time-series task."
if self.name == "classification":
self.name = get_classification_objective(len_labels(y_train_all))
if not isinstance(split_type, str):
assert hasattr(split_type, "split") and hasattr(
split_type, "get_n_splits"
), "split_type must be a string or a splitter object with split and get_n_splits methods."
assert (
not isinstance(split_type, GroupKFold) or groups is not None
), "GroupKFold requires groups to be provided."
return split_type
elif self.is_classification():
assert split_type in ["auto", "stratified", "uniform", "time", "group"]
return split_type if split_type != "auto" else groups is None and "stratified" or "group"
elif self.is_regression():
assert split_type in ["auto", "uniform", "time", "group"]
return split_type if split_type != "auto" else "uniform"
elif self.is_rank():
assert groups is not None, "groups must be specified for ranking task."
assert split_type in ["auto", "group"]
return "group"
elif self.is_nlg():
assert split_type in ["auto", "uniform", "time", "group"]
return split_type if split_type != "auto" else "uniform"
def preprocess(self, X, transformer=None):
if isinstance(X, List):
try:
if isinstance(X[0], List):
X = [x for x in zip(*X)]
X = pd.DataFrame(
dict(
[
(transformer._str_columns[idx], X[idx])
if isinstance(X[0], List)
else (transformer._str_columns[idx], [X[idx]])
for idx in range(len(X))
]
)
)
except IndexError:
raise IndexError("Test data contains more columns than training data, exiting")
elif isinstance(X, int):
return X
elif isinstance(X, psDataFrame):
return X
elif issparse(X):
X = X.tocsr()
if self.is_ts_forecast():
X = pd.DataFrame(X)
if transformer:
X = transformer.transform(X)
return X
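# A minimal sketch of the list-input branch of preprocess() above: a single
# example given as a flat list becomes a one-row DataFrame whose columns follow
# the transformer's string-column order. The stand-in transformer below is
# illustrative; only its _str_columns attribute (used above) is mimicked.
import pandas as pd

class _StubTransformer:
    _str_columns = ["f0", "f1", "f2"]

_X_list_demo = ["a", "b", "c"]  # one sample with three string features
_df_demo = pd.DataFrame({_StubTransformer._str_columns[i]: [_X_list_demo[i]] for i in range(len(_X_list_demo))})
assert _df_demo.shape == (1, 3) and list(_df_demo.columns) == ["f0", "f1", "f2"]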
def evaluate_model_CV(
self,
config: dict,
estimator: EstimatorSubclass,
X_train_all,
y_train_all,
budget,
kf,
eval_metric,
best_val_loss,
cv_score_agg_func=None,
log_training_metric=False,
fit_kwargs: Optional[dict] = None,
free_mem_ratio=0,
):
if fit_kwargs is None:
fit_kwargs = {}
if cv_score_agg_func is None:
cv_score_agg_func = default_cv_score_agg_func
start_time = time.time()
val_loss_folds = []
log_metric_folds = []
metric = None
train_time = pred_time = 0
total_fold_num = 0
n = kf.get_n_splits()
rng = np.random.RandomState(2020)
budget_per_train = budget and budget / n
groups = None
if self.is_classification():
_, labels = len_labels(y_train_all, return_labels=True)
else:
labels = fit_kwargs.get("label_list") # pass the label list on to compute the evaluation metric
if "sample_weight" in fit_kwargs:
weight = fit_kwargs["sample_weight"]
weight_val = None
else:
weight = weight_val = None
is_spark_dataframe = isinstance(X_train_all, (psDataFrame, psSeries))
if is_spark_dataframe:
dataframe = X_train_all.join(y_train_all)
if weight is not None:
dataframe = dataframe.join(weight)
if isinstance(kf, (GroupKFold, StratifiedGroupKFold)):
groups = kf.groups
dataframe = dataframe.join(groups)
kf = spark_kFold(dataframe, nFolds=n, foldCol=groups.name if groups is not None else "")
shuffle = False
else:
X_train_split, y_train_split = X_train_all, y_train_all
shuffle = getattr(kf, "shuffle", not self.is_ts_forecast())
if isinstance(kf, RepeatedStratifiedKFold):
kf = kf.split(X_train_split, y_train_split)
elif isinstance(kf, (GroupKFold, StratifiedGroupKFold)):
groups = kf.groups
kf = kf.split(X_train_split, y_train_split, groups)
shuffle = False
elif isinstance(kf, TimeSeriesSplit):
kf = kf.split(X_train_split, y_train_split)
else:
kf = kf.split(X_train_split)
for train_index, val_index in kf:
if shuffle:
train_index = rng.permutation(train_index)
if is_spark_dataframe:
# cache data to increase compute speed
X_train = train_index.spark.cache()
X_val = val_index.spark.cache()
y_train = X_train.pop(y_train_all.name)
y_val = X_val.pop(y_train_all.name)
if weight is not None:
weight_val = X_val.pop(weight.name)
fit_kwargs["sample_weight"] = X_train.pop(weight.name)
groups_val = None
elif isinstance(X_train_all, pd.DataFrame):
X_train = X_train_split.iloc[train_index]
X_val = X_train_split.iloc[val_index]
else:
X_train, X_val = X_train_split[train_index], X_train_split[val_index]
if not is_spark_dataframe:
y_train, y_val = y_train_split[train_index], y_train_split[val_index]
if weight is not None:
fit_kwargs["sample_weight"], weight_val = (
weight[train_index],
weight[val_index],
)
if groups is not None:
fit_kwargs["groups"] = (
groups[train_index] if isinstance(groups, np.ndarray) else groups.iloc[train_index]
)
groups_val = groups[val_index] if isinstance(groups, np.ndarray) else groups.iloc[val_index]
else:
groups_val = None
estimator.cleanup()
val_loss_i, metric_i, train_time_i, pred_time_i = get_val_loss(
config,
estimator,
X_train,
y_train,
X_val,
y_val,
weight_val,
groups_val,
eval_metric,
self,
labels,
budget_per_train,
log_training_metric=log_training_metric,
fit_kwargs=fit_kwargs,
free_mem_ratio=free_mem_ratio,
)
if isinstance(metric_i, dict) and "intermediate_results" in metric_i.keys():
del metric_i["intermediate_results"]
if weight is not None:
fit_kwargs["sample_weight"] = weight
total_fold_num += 1
val_loss_folds.append(val_loss_i)
log_metric_folds.append(metric_i)
train_time += train_time_i
pred_time += pred_time_i
if is_spark_dataframe:
X_train.spark.unpersist() # uncache data to free memory
X_val.spark.unpersist() # uncache data to free memory
if budget and time.time() - start_time >= budget:
break
val_loss, metric = cv_score_agg_func(val_loss_folds, log_metric_folds)
n = total_fold_num
pred_time /= n
return val_loss, metric, train_time, pred_time
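# cv_score_agg_func receives the per-fold losses and per-fold metric dicts and
# must return a single (val_loss, metric) pair, like default_cv_score_agg_func
# above. A hedged sketch of a custom aggregator with the same call signature
# (mean loss, plus the worst fold kept as an extra metric):
def _example_cv_score_agg(val_loss_folds, log_metrics_folds):
    val_loss = sum(val_loss_folds) / len(val_loss_folds)
    metric = {"worst_fold_loss": max(val_loss_folds)}
    return val_loss, metric

assert _example_cv_score_agg([1.0, 3.0], [None, None]) == (2.0, {"worst_fold_loss": 3.0})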
def default_estimator_list(self, estimator_list: List[str], is_spark_dataframe: bool = False) -> List[str]:
if "auto" != estimator_list:
n_estimators = len(estimator_list)
if is_spark_dataframe:
# For spark dataframe, only estimators ending with '_spark' are supported
estimator_list = [est for est in estimator_list if est.endswith("_spark")]
if len(estimator_list) == 0:
raise ValueError(
"Spark dataframes only support estimator names ending with `_spark`. Non-supported "
"estimators are removed. No estimator is left."
)
elif n_estimators != len(estimator_list):
logger.warning(
"Spark dataframes only support estimator names ending with `_spark`. Non-supported "
"estimators are removed."
)
else:
# For non-spark dataframe, only estimators not ending with '_spark' are supported
estimator_list = [est for est in estimator_list if not est.endswith("_spark")]
if len(estimator_list) == 0:
raise ValueError(
"Non-spark dataframes only support estimator names not ending with `_spark`. Non-supported "
"estimators are removed. No estimator is left."
)
elif n_estimators != len(estimator_list):
logger.warning(
"Non-spark dataframes only support estimator names not ending with `_spark`. Non-supported "
"estimators are removed."
)
return estimator_list
if self.is_rank():
estimator_list = ["lgbm", "xgboost", "xgb_limitdepth", "lgbm_spark"]
elif self.is_nlp():
estimator_list = ["transformer"]
elif self.is_ts_forecastpanel():
estimator_list = ["tft"]
else:
try:
import catboost
estimator_list = [
"lgbm",
"rf",
"catboost",
"xgboost",
"extra_tree",
"xgb_limitdepth",
"lgbm_spark",
]
except ImportError:
estimator_list = [
"lgbm",
"rf",
"xgboost",
"extra_tree",
"xgb_limitdepth",
"lgbm_spark",
]
# if self.is_ts_forecast():
# # catboost is removed because it has a `name` parameter, making it incompatible with hcrystalball
# if "catboost" in estimator_list:
# estimator_list.remove("catboost")
# if self.is_ts_forecastregression():
# try:
# import prophet
#
# estimator_list += [
# "prophet",
# "arima",
# "sarimax",
# "holt-winters",
# ]
# except ImportError:
# estimator_list += ["arima", "sarimax", "holt-winters"]
if not self.is_regression():
estimator_list += ["lrl1"]
estimator_list = [
est
for est in estimator_list
if (est.endswith("_spark") if is_spark_dataframe else not est.endswith("_spark"))
]
return estimator_list
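# The `_spark` suffix convention enforced above, run in isolation (names
# illustrative):
_candidates_demo = ["lgbm", "lgbm_spark", "xgboost"]
assert [e for e in _candidates_demo if e.endswith("_spark")] == ["lgbm_spark"]
assert [e for e in _candidates_demo if not e.endswith("_spark")] == ["lgbm", "xgboost"]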
def default_metric(self, metric: str) -> str:
if "auto" != metric:
return metric
if self.is_nlp():
from flaml.automl.nlp.utils import (
load_default_huggingface_metric_for_task,
)
return load_default_huggingface_metric_for_task(self.name)
elif self.is_binary():
return "roc_auc"
elif self.is_multiclass():
return "log_loss"
elif self.is_ts_forecast():
return "mape"
elif self.is_rank():
return "ndcg"
else:
return "r2"
@staticmethod
def prepare_sample_train_data(automlstate, sample_size):
return automlstate.prepare_sample_train_data(sample_size)

347
flaml/automl/task/task.py Normal file
View File

@ -0,0 +1,347 @@
from abc import ABC, abstractmethod
from typing import TYPE_CHECKING, List, Optional, Tuple, Union
import numpy as np
from flaml.automl.data import DataFrame, Series, psDataFrame, psSeries
if TYPE_CHECKING:
import flaml
# TODO: if your task is not specified in here, define your task as an all-capitalized word
SEQCLASSIFICATION = "seq-classification"
MULTICHOICECLASSIFICATION = "multichoice-classification"
TOKENCLASSIFICATION = "token-classification"
SEQREGRESSION = "seq-regression"
TS_FORECASTREGRESSION = (
"forecast",
"ts_forecast",
"ts_forecast_regression",
)
REGRESSION = ("regression", SEQREGRESSION, *TS_FORECASTREGRESSION)
TS_FORECASTCLASSIFICATION = "ts_forecast_classification"
TS_FORECASTPANEL = "ts_forecast_panel"
TS_FORECAST = (
*TS_FORECASTREGRESSION,
TS_FORECASTCLASSIFICATION,
TS_FORECASTPANEL,
)
CLASSIFICATION = (
"binary",
"multiclass",
"classification",
SEQCLASSIFICATION,
MULTICHOICECLASSIFICATION,
TOKENCLASSIFICATION,
TS_FORECASTCLASSIFICATION,
)
RANK = ("rank",)
SUMMARIZATION = "summarization"
NLG_TASKS = (SUMMARIZATION,)
NLU_TASKS = (
SEQREGRESSION,
SEQCLASSIFICATION,
MULTICHOICECLASSIFICATION,
TOKENCLASSIFICATION,
)
NLP_TASKS = (*NLG_TASKS, *NLU_TASKS)
def get_classification_objective(num_labels: int) -> str:
if num_labels == 2:
objective_name = "binary"
else:
objective_name = "multiclass"
return objective_name
class Task(ABC):
"""
Abstract base class for a machine learning task.
Class definitions should implement abstract methods and provide a non-empty dictionary of estimator classes.
A Task may cover multiple machine-learning problems (e.g. classification or regression) or be implemented for a
single one, depending on how general its data validation and model evaluation methods are. The implementation of
a Task may optionally use the training data and labels to determine data- and task-specific details, such as
whether a problem is single-label or multi-label.
FLAML evaluates at runtime how to behave exactly, relying on the task instance to provide implementations of
operations which vary between tasks.
"""
def __init__(
self,
task_name: str,
X_train: Optional[Union[np.ndarray, DataFrame, psDataFrame]] = None,
y_train: Optional[Union[np.ndarray, DataFrame, Series, psSeries]] = None,
):
"""Constructor.
Args:
task_name: String name for this type of task. Used when the Task can be generic and implement a number of
types of sub-task.
X_train: Optional. Some Task types may use the data shape or features to determine details of their usage,
such as in binary vs multilabel classification.
y_train: Optional. Some Task types may use the data shape or features to determine details of their usage,
such as in binary vs multilabel classification.
"""
self.name = task_name
self._estimators = None
def __str__(self) -> str:
"""Name of this task type."""
return self.name
@abstractmethod
def evaluate_model_CV(
self,
config: dict,
estimator: "flaml.automl.ml.BaseEstimator",
X_train_all: Union[np.ndarray, DataFrame, psDataFrame],
y_train_all: Union[np.ndarray, DataFrame, Series, psSeries],
budget: int,
kf,
eval_metric: str,
best_val_loss: float,
log_training_metric: bool = False,
fit_kwargs: Optional[dict] = None,
) -> Tuple[float, float, float, float]:
"""Evaluate the model using cross-validation.
Args:
config: configuration used in the evaluation of the metric.
estimator: Estimator class of the model.
X_train_all: Complete training feature data.
y_train_all: Complete training target data.
budget: Training time budget.
kf: Cross-validation index generator.
eval_metric: Metric name to be used for evaluation.
best_val_loss: Best current validation-set loss.
log_training_metric: Bool defaults False. Enables logging of the training metric.
fit_kwargs: Additional kwargs passed to the estimator's fit method.
Returns:
validation loss, metric value, train time, prediction time
"""
@abstractmethod
def validate_data(
self,
automl: "flaml.automl.automl.AutoML",
state: "flaml.automl.state.AutoMLState",
X_train_all: Union[np.ndarray, DataFrame, psDataFrame, None],
y_train_all: Union[np.ndarray, DataFrame, Series, psSeries, None],
dataframe: Union[DataFrame, None],
label: str,
X_val: Optional[Union[np.ndarray, DataFrame, psDataFrame]] = None,
y_val: Optional[Union[np.ndarray, DataFrame, Series, psSeries]] = None,
groups_val: Optional[List[str]] = None,
groups: Optional[List[str]] = None,
):
"""Validate that the data is suitable for this task type.
Args:
automl: The AutoML instance from which this task has been constructed.
state: The AutoMLState instance for this run.
X_train_all: The complete data set or None if dataframe is supplied.
y_train_all: The complete target set or None if dataframe is supplied.
dataframe: A dataframe containing the complete data set with targets.
label: The name of the target column in dataframe.
X_val: Optional. A data set for validation.
y_val: Optional. A target vector corresponding to X_val for validation.
groups_val: Group labels (with matching length to y_val) or group counts (with sum equal to length of y_val)
for validation data. Need to be consistent with groups.
groups: Group labels (with matching length to y_train) or group counts (with sum equal to length of y_train)
for training data.
Raises:
AssertionError: The data provided is invalid for this task type and configuration.
"""
@abstractmethod
def prepare_data(
self,
state: "flaml.automl.state.AutoMLState",
X_train_all: Union[np.ndarray, DataFrame, psDataFrame],
y_train_all: Union[np.ndarray, DataFrame, Series, psSeries, None],
auto_augment: bool,
eval_method: str,
split_type: str,
split_ratio: float,
n_splits: int,
data_is_df: bool,
sample_weight_full: Optional[List[float]] = None,
):
"""Prepare the data for fitting or inference.
Args:
state: The AutoMLState instance for this run.
X_train_all: The complete data set. Must contain the target if y_train_all is None.
y_train_all: The complete target set, or None if the target is supplied in X_train_all.
auto_augment: If true, task-specific data augmentations will be applied.
eval_method: A string of resampling strategy, one of ['auto', 'cv', 'holdout'].
split_type: str or splitter object, default="auto" | the data split type.
* A valid splitter object is an instance of a derived class of scikit-learn
[KFold](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html#sklearn.model_selection.KFold)
and has ``split`` and ``get_n_splits`` methods with the same signatures.
Set eval_method to "cv" to use the splitter object.
* Valid str options depend on different tasks.
For classification tasks, valid choices are
["auto", 'stratified', 'uniform', 'time', 'group']. "auto" -> stratified.
For regression tasks, valid choices are ["auto", 'uniform', 'time'].
"auto" -> uniform.
For time series forecast tasks, must be "auto" or 'time'.
For ranking task, must be "auto" or 'group'.
split_ratio: A float of the validation data percentage for holdout.
n_splits: An integer of the number of folds for cross-validation.
data_is_df: True if the data was provided as a DataFrame else False.
sample_weight_full: A 1d arraylike of the sample weight.
Raises:
AssertionError: The configuration provided is invalid for this task type and data.
"""
@abstractmethod
def decide_split_type(
self,
split_type: str,
y_train_all: Union[np.ndarray, DataFrame, Series, psSeries, None],
fit_kwargs: dict,
groups: Optional[List[str]] = None,
) -> str:
"""Choose an appropriate data split type for this data and task.
If split_type is 'auto' then this is determined based on the task type and data.
If a specific split_type is requested then the choice is validated to be appropriate.
Args:
split_type: Either 'auto' or a task appropriate split type.
y_train_all: The complete set of targets.
fit_kwargs: Additional kwargs passed to the estimator's fit method.
groups: Optional. Group labels (with matching length to y_train) or group counts (with sum equal to length
of y_train) for training data.
Returns:
The determined appropriate split type.
Raises:
AssertionError: The requested split_type is invalid for this task, configuration and data.
"""
@abstractmethod
def preprocess(
self,
X: Union[np.ndarray, DataFrame, psDataFrame],
transformer: Optional["flaml.automl.data.DataTransformer"] = None,
) -> Union[np.ndarray, DataFrame]:
"""Preprocess the data ready for fitting or inference with this task type.
Args:
X: The data set to process.
transformer: A DataTransformer instance to be used in processing.
Returns:
The preprocessed data set having the same type as the input.
"""
@abstractmethod
def default_estimator_list(
self,
estimator_list: Union[List[str], str] = "auto",
is_spark_dataframe: bool = False,
) -> List[str]:
"""Return the list of default estimators registered for this task type.
If 'auto' is provided then the default list is returned, else the provided list will be validated given this task
type.
Args:
estimator_list: Either 'auto' or a list of estimator names to be validated.
is_spark_dataframe: True if the data is a spark dataframe.
Returns:
A list of valid estimator names for this task type.
"""
@abstractmethod
def default_metric(self, metric: str) -> str:
"""Return the default metric for this task type.
If 'auto' is provided then the default metric for this task will be returned. Otherwise, the provided metric name
is validated for this task type.
Args:
metric: The name of a metric to be used in evaluation of models during fitting or validation.
Returns:
The default metric, or the provided metric if it is valid for this task type.
"""
def is_ts_forecast(self) -> bool:
return self.name in TS_FORECAST
def is_ts_forecastpanel(self) -> bool:
return self.name == TS_FORECASTPANEL
def is_ts_forecastregression(self) -> bool:
return self.name in TS_FORECASTREGRESSION
def is_nlp(self) -> bool:
return self.name in NLP_TASKS
def is_nlg(self) -> bool:
return self.name in NLG_TASKS
def is_classification(self) -> bool:
return self.name in CLASSIFICATION
def is_rank(self) -> bool:
return self.name in RANK
def is_binary(self) -> bool:
return self.name == "binary"
def is_seq_regression(self) -> bool:
return self.name == SEQREGRESSION
def is_seq_classification(self) -> bool:
return self.name == SEQCLASSIFICATION
def is_token_classification(self) -> bool:
return self.name == TOKENCLASSIFICATION
def is_summarization(self) -> bool:
return self.name == SUMMARIZATION
def is_multiclass(self) -> bool:
return "multiclass" in self.name
def is_regression(self) -> bool:
return self.name in REGRESSION
def __eq__(self, other: str) -> bool:
"""For backward compatibility with all the string comparisons to task"""
return self.name == other
def estimator_class_from_str(self, estimator_name: str) -> "flaml.automl.ml.BaseEstimator":
"""Determine the estimator class corresponding to the provided name.
Args:
estimator_name: Name of the desired estimator.
Returns:
The estimator class corresponding to the provided name.
Raises:
ValueError: The provided estimator_name has not been registered for this task type.
"""
if estimator_name in self.estimators:
return self.estimators[estimator_name]
else:
raise ValueError(
f"{estimator_name} is not a built-in learner for this task type, "
f"only {list(self.estimators.keys())} are supported."
"Please use AutoML.add_learner() to add a customized learner."
)
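# Because Task overrides __eq__ to compare against its name (see above), code
# written for the older string-based task API keeps working. A hedged sketch
# with a trivial concrete subclass that stubs out the abstract interface:
class _DemoTask(Task):
    # illustrative stubs only
    def evaluate_model_CV(self, *args, **kwargs): ...
    def validate_data(self, *args, **kwargs): ...
    def prepare_data(self, *args, **kwargs): ...
    def decide_split_type(self, *args, **kwargs): ...
    def preprocess(self, *args, **kwargs): ...
    def default_estimator_list(self, *args, **kwargs): ...
    def default_metric(self, *args, **kwargs): ...

_t = _DemoTask("binary")
assert _t == "binary" and _t.is_binary() and _t.is_classification()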

View File

@ -0,0 +1,523 @@
import logging
import time
from typing import List
import pandas as pd
import numpy as np
from scipy.sparse import issparse
from sklearn.model_selection import (
GroupKFold,
TimeSeriesSplit,
)
from flaml.automl.ml import get_val_loss, default_cv_score_agg_func
from flaml.automl.time_series.ts_data import (
TimeSeriesDataset,
DataTransformerTS,
normalize_ts_data,
)
from flaml.automl.task.task import (
Task,
get_classification_objective,
TS_FORECAST,
TS_FORECASTPANEL,
)
logger = logging.getLogger(__name__)
class TimeSeriesTask(Task):
@property
def estimators(self):
if self._estimators is None:
# put this into a function to avoid circular dependency
from flaml.automl.time_series import (
XGBoost_TS,
XGBoostLimitDepth_TS,
RF_TS,
LGBM_TS,
ExtraTrees_TS,
CatBoost_TS,
Prophet,
Orbit,
ARIMA,
SARIMAX,
TemporalFusionTransformerEstimator,
HoltWinters,
)
self._estimators = {
"xgboost": XGBoost_TS,
"xgb_limitdepth": XGBoostLimitDepth_TS,
"rf": RF_TS,
"lgbm": LGBM_TS,
"extra_tree": ExtraTrees_TS,
"arima": ARIMA,
"sarimax": SARIMAX,
"holt-winters": HoltWinters,
"catboost": CatBoost_TS,
"tft": TemporalFusionTransformerEstimator,
}
try:
from prophet import Prophet as foo  # noqa: F401 - presence check only; the estimator class is imported above
self._estimators["prophet"] = Prophet
except ImportError:
logger.info("Couldn't import Prophet, skipping")
try:
from orbit.models import DLT  # noqa: F401 - presence check only
self._estimators["orbit"] = Orbit
except ImportError:
logger.info("Couldn't import Prophet, skipping")
return self._estimators
# processed
def validate_data(
self,
automl,
state,
X_train_all,
y_train_all,
dataframe,
label,
X_val=None,
y_val=None,
groups_val=None,
groups=None,
):
# first beat the data into a TimeSeriesDataset shape
if isinstance(X_train_all, TimeSeriesDataset):
# in this case, we're most likely being called by another FLAML instance
# so all the preliminary cleaning has already been done
pre_data = X_train_all
val_len = len(pre_data.X_val)
else:
if label is None and dataframe is not None:
raise ValueError("If data is specified via dataframe parameter, you must also specify label")
if isinstance(y_train_all, pd.Series):
label = y_train_all.name
elif isinstance(y_train_all, np.ndarray):
label = "y" # Prophet convention
if isinstance(label, str):
target_names = [label]
else:
target_names = label
if self.time_col is None:
if isinstance(X_train_all, pd.DataFrame):
assert dataframe is None, "One of dataframe and X arguments must be None"
self.time_col = X_train_all.columns[0]
elif dataframe is not None:
assert X_train_all is None, "One of dataframe and X arguments must be None"
self.time_col = dataframe.columns[0]
else:
self.time_col = "ds"
automl._df = True
if X_train_all is not None:
assert y_train_all is not None, "If X_train_all is not None, y_train_all must also be"
assert dataframe is None, "If X_train_all is provided, dataframe must be None"
dataframe = TimeSeriesDataset.to_dataframe(X_train_all, y_train_all, target_names, self.time_col)
elif dataframe is not None:
assert label is not None, "A label or list of labels must be provided."
assert isinstance(dataframe, pd.DataFrame), "dataframe must be a pandas DataFrame"
assert label in dataframe.columns, f"{label} must be a column name in dataframe"
else:
raise ValueError("Must supply either X_train_all and y_train_all, or dataframe and label")
try:
dataframe[self.time_col] = pd.to_datetime(dataframe[self.time_col])
except Exception:
raise ValueError(
f"For '{TS_FORECAST}' task, time column {self.time_col} must contain timestamp values."
)
dataframe = remove_ts_duplicates(dataframe, self.time_col)
if X_val is not None:
assert y_val is not None, "If X_val is not None, y_val must also be"
val_df = TimeSeriesDataset.to_dataframe(X_val, y_val, target_names, self.time_col)
val_len = len(val_df)
else:
val_len = 0
val_df = None
pre_data = TimeSeriesDataset(
train_data=dataframe,
time_col=self.time_col,
target_names=target_names,
test_data=val_df,
)
# TODO: should the transformer be a property of the dataset instead?
automl._transformer = DataTransformerTS(self.time_col, label)
Xt, yt = automl._transformer.fit_transform(pre_data.X_all, pre_data.y_all)
df_t = pd.concat([Xt, yt], axis=1)
data = TimeSeriesDataset(
train_data=df_t,
time_col=pre_data.time_col,
target_names=pre_data.target_names,
).move_validation_boundary(-val_len)
# now setup the properties of all the other relevant objects
# TODO: where are these used? Replace with pointers to data?
automl._X_train_all, automl._y_train_all = Xt, yt
# TODO: where are these used?
automl._nrow, automl._ndim = data.X_train.shape
# make a property instead? Or just fix the call?
automl._label_transformer = automl._transformer.label_transformer
automl._feature_names_in_ = (
automl._X_train_all.columns.to_list() if hasattr(automl._X_train_all, "columns") else None
)
self.time_col = data.time_col
self.target_names = data.target_names
automl._state.X_val = data
automl._state.X_train = data
automl._state.y_train = None
automl._state.y_val = None
if data.test_data is not None and len(data.test_data) > 0:
automl._state.X_train_all = data.move_validation_boundary(len(data.test_data))
else:
automl._state.X_train_all = data
automl._state.y_train_all = None
automl._state.data_size = data.train_data.shape
automl.data_size_full = len(data.all_data)
automl._state.groups = None
automl._sample_weight_full = None
def prepare_data(
self,
state,
X_train_all,
y_train_all,
auto_augment,
eval_method,
split_type,
split_ratio,
n_splits,
data_is_df,
sample_weight_full,
time_col=None,
):
state.kf = None
state.data_size_full = len(y_train_all)
if split_type in ["uniform", "stratified"]:
raise ValueError(f"Split type {split_type} is not valid for time series")
state.groups = None
state.groups_all = None
state.groups_val = None
ts_data = state.X_val
no_test_data = ts_data is None or ts_data.test_data is None or len(ts_data.test_data) == 0
if no_test_data and eval_method == "holdout":
# NOTE: _prepare_data is before kwargs is updated to fit_kwargs_by_estimator
period = state.fit_kwargs["period"]
if self.name == TS_FORECASTPANEL:
# TODO: move this into the TimeSeriesDataset class
X_train_all = ts_data.X_train
y_train_all = ts_data.y_train
X_train_all["time_idx"] -= X_train_all["time_idx"].min()
X_train_all["time_idx"] = X_train_all["time_idx"].astype("int")
ids = state.fit_kwargs["group_ids"].copy()
ids.append(ts_data.time_col)
ids.append("time_idx")
y_train_all = pd.DataFrame(y_train_all)
y_train_all[ids] = X_train_all[ids]
X_train_all = X_train_all.sort_values(ids)
y_train_all = y_train_all.sort_values(ids)
training_cutoff = X_train_all["time_idx"].max() - period
X_train = X_train_all[lambda x: x.time_idx <= training_cutoff]
y_train = y_train_all[lambda x: x.time_idx <= training_cutoff].drop(columns=ids)
X_val = X_train_all[lambda x: x.time_idx > training_cutoff]
y_val = y_train_all[lambda x: x.time_idx > training_cutoff].drop(columns=ids)
train_data = normalize_ts_data(
X_train,
ts_data.target_names,
ts_data.time_col,
y_train,
)
test_data = normalize_ts_data(
X_val,
ts_data.target_names,
ts_data.time_col,
y_val,
)
ts_data = TimeSeriesDataset(
train_data,
ts_data.time_col,
ts_data.target_names,
ts_data.frequency,
test_data,
)
state.X_val = ts_data
state.X_train = ts_data
else:
# if eval_method = holdout, make holdout data
num_samples = ts_data.train_data.shape[0]
assert period < num_samples, f"period={period} must be smaller than #examples={num_samples}"
state.X_val = ts_data.move_validation_boundary(-period)
state.X_train = state.X_val
if eval_method != "holdout":
if self.name != TS_FORECASTPANEL:
period = state.fit_kwargs[
"period"
] # NOTE: _prepare_data is before kwargs is updated to fit_kwargs_by_estimator
step_size = state.fit_kwargs.get("cv_step_size", period)
ts_data = state.X_train
if n_splits * step_size + 2 * period > ts_data.y_train.size:
n_splits = int((ts_data.y_train.size - 2 * period) / step_size)
assert n_splits >= 2, (
f"cross validation for forecasting period={period}"
f" requires input data with at least {2*period + 2*step_size} examples."
)
logger.info(f"Using nsplits={n_splits} due to data size limit.")
state.kf = TimeSeriesSplit(n_splits=n_splits, test_size=period)
state.kf.step_size = step_size
else:
n_groups = ts_data.X_train.groupby(state.fit_kwargs.get("group_ids")).ngroups
period = state.fit_kwargs["period"]
state.kf = TimeSeriesSplit(n_splits=n_splits, test_size=period * n_groups)
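# A worked example of the fold cap above (numbers illustrative): with
# ts_data.y_train.size == 100, period == 20 and cv_step_size == 20, the
# requirement n_splits * step_size + 2 * period <= size reduces n_splits to
# int((100 - 2 * 20) / 20) == 3, which still passes the n_splits >= 2 assert.
_size_demo, _period_demo, _step_demo = 100, 20, 20
assert int((_size_demo - 2 * _period_demo) / _step_demo) == 3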
# TODO: move task detection to Task.__init__!
def decide_split_type(
self,
split_type,
y_train_all,
fit_kwargs,
groups=None,
) -> str:
# TODO: move into task creation!!!
if self.name == "classification":
self.name = get_classification_objective(len(np.unique(y_train_all)))
# TODO: do we need this?
if not isinstance(split_type, str):
assert hasattr(split_type, "split") and hasattr(
split_type, "get_n_splits"
), "split_type must be a string or a splitter object with split and get_n_splits methods."
assert (
not isinstance(split_type, GroupKFold) or groups is not None
), "GroupKFold requires groups to be provided."
return split_type
else:
assert split_type in ["auto", "time"]
assert isinstance(
fit_kwargs.get("period"),
int, # NOTE: _decide_split_type is before kwargs is updated to fit_kwargs_by_estimator
), f"missing a required integer 'period' for '{TS_FORECAST}' task."
if fit_kwargs.get("group_ids"):
# TODO (MARK) This will likely not play well with the task class
self.name = TS_FORECASTPANEL
assert isinstance(
fit_kwargs.get("group_ids"), list
), f"missing a required List[str] 'group_ids' for '{TS_FORECASTPANEL}' task."
return "time"
# TODO: merge with preprocess() below
def _preprocess(self, X, transformer=None):
if isinstance(X, List):
try:
if isinstance(X[0], List):
X = [x for x in zip(*X)]
X = pd.DataFrame(
dict(
[
(transformer._str_columns[idx], X[idx])
if isinstance(X[0], List)
else (transformer._str_columns[idx], [X[idx]])
for idx in range(len(X))
]
)
)
except IndexError:
raise IndexError("Test data contains more columns than training data, exiting")
elif isinstance(X, int):
return X
elif issparse(X):
X = X.tocsr()
if self.is_ts_forecast():
X = pd.DataFrame(X)
if transformer:
X = transformer.transform(X)
return X
def preprocess(self, X, transformer=None):
if isinstance(X, pd.DataFrame) or isinstance(X, np.ndarray) or isinstance(X, pd.Series):
X = X.copy()
X = normalize_ts_data(X, self.target_names, self.time_col)
return self._preprocess(X, transformer)
elif isinstance(X, int):
return X
else:
raise ValueError(f"unknown type of X, {X.__class__}")
def evaluate_model_CV(
self,
config,
estimator,
X_train_all,
y_train_all,
budget,
kf,
eval_metric,
best_val_loss,
cv_score_agg_func=None,
log_training_metric=False,
fit_kwargs=None,
free_mem_ratio=0, # what is this for?
):
if fit_kwargs is None:
fit_kwargs = {}
if cv_score_agg_func is None:
cv_score_agg_func = default_cv_score_agg_func
start_time = time.time()
val_loss_folds = []
log_metric_folds = []
metric = None
train_time = pred_time = 0
total_fold_num = 0
n = kf.get_n_splits()
if self.is_classification():
labels = np.unique(y_train_all)
else:
labels = fit_kwargs.get("label_list") # pass the label list on to compute the evaluation metric
ts_data = X_train_all
budget_per_train = budget and budget / n
for data in ts_data.cv_train_val_sets(kf.n_splits, kf.test_size, kf.step_size):
estimator.cleanup()
val_loss_i, metric_i, train_time_i, pred_time_i = get_val_loss(
config,
estimator,
X_train=data,
y_train=None,
X_val=data,
y_val=None,
eval_metric=eval_metric,
labels=labels,
budget=budget_per_train,
log_training_metric=log_training_metric,
fit_kwargs=fit_kwargs,
task=self,
weight_val=None,
groups_val=None,
free_mem_ratio=free_mem_ratio,
)
if isinstance(metric_i, dict) and "intermediate_results" in metric_i:
del metric_i["intermediate_results"]
total_fold_num += 1
val_loss_folds.append(val_loss_i)
log_metric_folds.append(metric_i)
train_time += train_time_i
pred_time += pred_time_i
if budget and time.time() - start_time >= budget:
break
val_loss, metric = cv_score_agg_func(val_loss_folds, log_metric_folds)
n = total_fold_num
pred_time /= n
return val_loss, metric, train_time, pred_time
def default_estimator_list(self, estimator_list: List[str], is_spark_dataframe: bool) -> List[str]:
assert not is_spark_dataframe, "Spark is not yet supported for time series"
# TODO: why not do this if/then in the calling function?
if "auto" != estimator_list:
return estimator_list
if self.is_ts_forecastpanel():
return ["tft"]
estimator_list = [
"lgbm",
"rf",
"xgboost",
"extra_tree",
"xgb_limitdepth",
]
# Catboost appears to be way slower than the others, don't include it by default
# try:
# import catboost
#
# estimator_list.append("catboost")
# except ImportError:
# pass
if self.is_regression():
estimator_list += ["arima", "sarimax"]
try:
import prophet
estimator_list.append("prophet")
except ImportError:
pass
return estimator_list
def default_metric(self, metric: str) -> str:
assert self.is_ts_forecast(), "If this is not a TS forecasting task, this code should never have been called"
if metric == "auto":
return "mape"
else:
return metric
@staticmethod
def prepare_sample_train_data(automlstate, sample_size):
# we take the tail, rather than the head, for compatibility with time series
shift = sample_size - len(automlstate.X_train.train_data)
sampled_X_train = automlstate.X_train.move_validation_boundary(shift)
return sampled_X_train, None, None, None
def remove_ts_duplicates(
X,
time_col,
):
"""
Assumes the targets are included
@param X:
@param time_col:
@param y:
@return:
"""
duplicates = X.duplicated()
if any(duplicates):
logger.warning("Duplicate timestamp values found in timestamp column. " f"\n{X.loc[duplicates, X][time_col]}")
X = X.drop_duplicates()
logger.warning("Removed duplicate rows based on all columns")
assert (
not X[time_col].duplicated().any()
), "Duplicate timestamp values with different values for other columns."
return X
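# Usage sketch for remove_ts_duplicates (data illustrative): the second row is
# a full duplicate of the first and is dropped, leaving unique timestamps.
import pandas as pd

_df_dup = pd.DataFrame(
    {
        "ds": pd.to_datetime(["2021-01-01", "2021-01-01", "2021-01-02"]),
        "y": [1.0, 1.0, 2.0],
    }
)
assert len(remove_ts_duplicates(_df_dup, time_col="ds")) == 2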

View File

@ -0,0 +1,17 @@
from .ts_model import (
Prophet,
Orbit,
ARIMA,
SARIMAX,
HoltWinters,
LGBM_TS,
XGBoost_TS,
RF_TS,
ExtraTrees_TS,
XGBoostLimitDepth_TS,
CatBoost_TS,
TimeSeriesEstimator,
)
from .tft import TemporalFusionTransformerEstimator
from .ts_data import TimeSeriesDataset

View File

@ -0,0 +1,34 @@
import math
import datetime
from functools import lru_cache
import pandas as pd
def monthly_fourier_features(timestamps: pd.Series, month_fourier_degree: int = 2):
if len(timestamps):
data = pd.DataFrame({"time": timestamps})
month_pos = timestamps.apply(lambda x: position_in_month(datetime.date(x.year, x.month, x.day)))
for d in range(month_fourier_degree):
data[f"cos{d+1}"] = (2 * (d + 1) * math.pi * month_pos).apply(math.cos)
data[f"sin{d + 1}"] = (2 * (d + 1) * math.pi * month_pos).apply(math.sin)
drop_cols = ["time"]
data = data.drop(columns=drop_cols)
return data
else:
columns = []
for d in range(month_fourier_degree):
columns += [f"cos{d+1}", f"sin{d + 1}"]
return pd.DataFrame(columns=columns)
@lru_cache(maxsize=4096)
def position_in_month(d: datetime.date):
prev = datetime.date(d.year, d.month, 1) - datetime.timedelta(days=1)
nxt = datetime.date(
d.year + 1 if d.month == 12 else d.year, 1 if d.month == 12 else d.month + 1, 1
) - datetime.timedelta(days=1)
delta = (d - prev).days / (nxt - prev).days
return delta
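# Usage sketch: for the default degree of 2, each timestamp is mapped to the
# four cyclical columns cos1, sin1, cos2, sin2 encoding its position within
# the month (dates below are illustrative).
import pandas as pd

_ts_demo = pd.Series(pd.to_datetime(["2021-01-16", "2021-02-15"]))
_feat_demo = monthly_fourier_features(_ts_demo, month_fourier_degree=2)
assert list(_feat_demo.columns) == ["cos1", "sin1", "cos2", "sin2"]
assert len(_feat_demo) == 2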

View File

@ -0,0 +1,156 @@
try:
import pandas as pd
from pandas import DataFrame, Series, to_datetime
except ImportError:
class PD:
pass
pd = PD()
pd.DataFrame = None
pd.Series = None
DataFrame = Series = None
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
def make_lag_features(X: pd.DataFrame, y: pd.Series, lags: int):
"""Transform input data X, y into autoregressive form - shift
them appropriately based on horizon and create `lags` columns.
Parameters
----------
X : pandas.DataFrame
Input features.
y : array_like, (1d)
Target vector.
horizon : int
length of X for `predict` method
Returns
-------
pandas.DataFrame
shifted dataframe with `lags` columns
"""
lag_features = []
# make sure we show y's _previous_ value to exclude data leaks
X = X.reset_index(drop=True)
X["lag_" + y.name] = y.shift(1).values
X_lag = X.copy()
for i in range(0, lags):
X_lag.columns = [f"{c}_lag_{i}" for c in X.columns]
lag_features.append(X_lag)
X_lag = X_lag.shift(1)
X_lags = pd.concat(lag_features, axis=1)
X_out = X_lags.dropna().reset_index(drop=True)
assert len(X_out) + lags == len(X)
return X_out
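# Usage sketch for make_lag_features (data illustrative): with lags=2, each
# input column plus the shifted target column lag_y appears twice, suffixed
# _lag_0 and _lag_1, and the first `lags` rows are dropped because their
# lagged values are NaN.
import pandas as pd

_X_lag_demo = pd.DataFrame({"f": [1.0, 2.0, 3.0, 4.0, 5.0]})
_y_lag_demo = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0], name="y")
_out_demo = make_lag_features(_X_lag_demo, _y_lag_demo, lags=2)
assert len(_out_demo) == len(_X_lag_demo) - 2
assert {"f_lag_0", "f_lag_1", "lag_y_lag_0", "lag_y_lag_1"} == set(_out_demo.columns)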
class SklearnWrapper:
def __init__(
self,
model_class: type,
horizon: int,
lags: int,
init_params: dict = None,
fit_params: dict = None,
pca_features: bool = False,
):
init_params = init_params if init_params else {}
self.fit_params = fit_params if fit_params else {}
self.lags = lags
self.horizon = horizon
# TODO: use multiregression where available
self.models = [model_class(**init_params) for _ in range(horizon)]
self.pca_features = pca_features
if self.pca_features:
self.norm = StandardScaler()
self.pca = None
def fit(self, X: pd.DataFrame, y: pd.Series, **kwargs):
self._X = X
self._y = y
fit_params = {**self.fit_params, **kwargs}
X_feat = make_lag_features(X, y, self.lags)
if self.pca_features:
X_trans = self.norm.fit_transform(X_feat)
cum_expl_var = np.cumsum(PCA(svd_solver="full").fit(X_trans).explained_variance_ratio_)
self.pca = PCA(svd_solver="full", n_components=np.argmax(1 - cum_expl_var < 1e-6))
X_trans = self.pca.fit_transform(X_trans)
else:
X_trans = X_feat
for i, model in enumerate(self.models):
offset = i + self.lags
model.fit(X_trans[: len(X) - offset], y[offset:], **fit_params)
return self
def predict(self, X, X_train=None, y_train=None):
if X_train is None:
X_train = self._X
if y_train is None:
y_train = self._y
X_train = X_train.reset_index(drop=True)
X_train[self._y.name] = y_train.values
Xall = pd.concat([X_train, X], axis=0).reset_index(drop=True)
y = Xall.pop(self._y.name)
X_feat = make_lag_features(Xall[: len(X_train) + 1], y[: len(X_train) + 1], self.lags)
if self.pca_features:
X_trans = self.pca.transform(self.norm.transform(X_feat))
else:
X_trans = X_feat
# predict all horizons from the latest features vector
preds = pd.Series([m.predict(X_trans[-1:])[0] for m in self.models])
if len(preds) < len(X):
# recursive call if len(X) > trained horizon
y_train = pd.concat([y_train, preds], axis=0, ignore_index=True)
preds = pd.concat(
[
preds,
self.predict(
X=Xall[len(y_train) :],
X_train=Xall[: len(y_train)],
y_train=y_train,
),
],
axis=0,
ignore_index=True,
)
if len(preds) > len(X):
preds = preds[: len(X)]
preds.index = X.index
# TODO: do we want auto-clipping?
# return self._clip_predictions(preds)
return preds
# TODO: fix
# @staticmethod
# def _adjust_holidays(X):
# """Transform 'holiday' columns to binary feature.
#
# Parameters
# ----------
# X : pandas.DataFrame
# Input features with 'holiday' column.
#
# Returns
# -------
# pandas.DataFrame
# Holiday feature in numeric form
# """
# return X.assign(
# **{col: X[col] != "" for col in X.filter(like="_holiday_").columns}
# )
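# Usage sketch for SklearnWrapper (model class and data illustrative): one
# LinearRegression is fitted per horizon step on the lag features built above,
# and predict() returns one value per requested future row.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

_X_w = pd.DataFrame({"f": np.arange(30, dtype=float)})
_y_w = pd.Series(np.arange(30, dtype=float) * 2.0, name="y")
_wrapper = SklearnWrapper(LinearRegression, horizon=3, lags=2)
_wrapper.fit(_X_w, _y_w)
_preds_w = _wrapper.predict(pd.DataFrame({"f": [30.0, 31.0, 32.0]}))
assert len(_preds_w) == 3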

View File

@ -0,0 +1,183 @@
import time
try:
import pandas as pd
from pandas import DataFrame, Series, to_datetime
except ImportError:
class PD:
pass
pd = PD()
pd.DataFrame = None
pd.Series = None
DataFrame = Series = None
from flaml import tune
from flaml.automl.data import add_time_idx_col
from flaml.automl.time_series.ts_data import TimeSeriesDataset
from flaml.automl.time_series.ts_model import TimeSeriesEstimator
class TemporalFusionTransformerEstimator(TimeSeriesEstimator):
"""The class for tuning Temporal Fusion Transformer"""
@classmethod
def search_space(cls, data, task, pred_horizon, **params):
space = {
"gradient_clip_val": {
"domain": tune.loguniform(lower=0.01, upper=100.0),
"init_value": 0.01,
},
"hidden_size": {
"domain": tune.lograndint(lower=8, upper=512),
"init_value": 16,
},
"hidden_continuous_size": {
"domain": tune.randint(lower=1, upper=65),
"init_value": 8,
},
"attention_head_size": {
"domain": tune.randint(lower=1, upper=5),
"init_value": 4,
},
"dropout": {
"domain": tune.uniform(lower=0.1, upper=0.3),
"init_value": 0.1,
},
"learning_rate": {
"domain": tune.loguniform(lower=0.00001, upper=1.0),
"init_value": 0.001,
},
}
return space
def transform_ds(self, X_train: TimeSeriesDataset, y_train, **kwargs):
self.data = X_train.train_data
max_prediction_length = kwargs["period"]
self.max_encoder_length = kwargs["max_encoder_length"]
training_cutoff = self.data["time_idx"].max() - max_prediction_length
from pytorch_forecasting import TimeSeriesDataSet
from pytorch_forecasting.data import GroupNormalizer
self.group_ids = kwargs["group_ids"].copy()
training = TimeSeriesDataSet(
self.data[lambda x: x.time_idx <= training_cutoff],
time_idx="time_idx",
target=X_train.target_names[0],
group_ids=self.group_ids,
min_encoder_length=kwargs.get(
"min_encoder_length", self.max_encoder_length // 2
), # keep encoder length long (as it is in the validation set)
max_encoder_length=self.max_encoder_length,
min_prediction_length=1,
max_prediction_length=max_prediction_length,
static_categoricals=kwargs.get("static_categoricals", []),
static_reals=kwargs.get("static_reals", []),
time_varying_known_categoricals=kwargs.get("time_varying_known_categoricals", []),
time_varying_known_reals=kwargs.get("time_varying_known_reals", []),
time_varying_unknown_categoricals=kwargs.get("time_varying_unknown_categoricals", []),
time_varying_unknown_reals=kwargs.get("time_varying_unknown_reals", []),
variable_groups=kwargs.get(
"variable_groups", {}
), # group of categorical variables can be treated as one variable
lags=kwargs.get("lags", {}),
target_normalizer=GroupNormalizer(
groups=kwargs["group_ids"], transformation="softplus"
), # use softplus and normalize by group
add_relative_time_idx=True,
add_target_scales=True,
add_encoder_length=True,
)
# create validation set (predict=True) which means to predict the last max_prediction_length points in time
# for each series
validation = TimeSeriesDataSet.from_dataset(training, self.data, predict=True, stop_randomization=True)
# create dataloaders for model
batch_size = kwargs.get("batch_size", 64)
train_dataloader = training.to_dataloader(train=True, batch_size=batch_size, num_workers=0)
val_dataloader = validation.to_dataloader(train=False, batch_size=batch_size * 10, num_workers=0)
return training, train_dataloader, val_dataloader
def fit(self, X_train, y_train, budget=None, **kwargs):
import warnings
import pytorch_lightning as pl
import torch
from pytorch_forecasting import TemporalFusionTransformer
from pytorch_forecasting.metrics import QuantileLoss
from pytorch_lightning.callbacks import EarlyStopping, LearningRateMonitor
from pytorch_lightning.loggers import TensorBoardLogger
# a bit of monkey patching to fix the MacOS test:
# all the log_prediction method appears to do is plot stuff, which breaks the GitHub tests
def log_prediction(*args, **kwargs):
pass
TemporalFusionTransformer.log_prediction = log_prediction
warnings.filterwarnings("ignore")
current_time = time.time()
super().fit(X_train, **kwargs)
training, train_dataloader, val_dataloader = self.transform_ds(X_train, y_train, **kwargs)
params = self.params.copy()
gradient_clip_val = params.pop("gradient_clip_val", None)
params.pop("n_jobs", None)
max_epochs = kwargs.get("max_epochs", 20)
early_stop_callback = EarlyStopping(monitor="val_loss", min_delta=1e-4, patience=10, verbose=False, mode="min")
lr_logger = LearningRateMonitor() # log the learning rate
logger = TensorBoardLogger(kwargs.get("log_dir", "lightning_logs")) # logging results to a tensorboard
default_trainer_kwargs = dict(
gpus=self._kwargs.get("gpu_per_trial", [0]) if torch.cuda.is_available() else None,
max_epochs=max_epochs,
gradient_clip_val=gradient_clip_val,
callbacks=[lr_logger, early_stop_callback],
logger=logger,
)
trainer = pl.Trainer(
**default_trainer_kwargs,
)
tft = TemporalFusionTransformer.from_dataset(
training,
**params,
lstm_layers=2, # 2 is mostly optimal according to documentation
output_size=7, # 7 quantiles by default
loss=QuantileLoss(),
log_interval=10,  # log every 10 batches; a non-zero value is also needed for the learning rate finder
reduce_on_plateau_patience=4,
)
# fit network
trainer.fit(
tft,
train_dataloaders=train_dataloader,
val_dataloaders=val_dataloader,
)
best_model_path = trainer.checkpoint_callback.best_model_path
best_tft = TemporalFusionTransformer.load_from_checkpoint(best_model_path)
train_time = time.time() - current_time
self._model = best_tft
return train_time
def predict(self, X):
ids = self.group_ids.copy()
ids.append(self.time_col)
encoder_data = self.data[lambda x: x.time_idx > x.time_idx.max() - self.max_encoder_length]
# following pytorchforecasting example, make all target values equal to the last data
last_data_cols = self.group_ids.copy()
last_data_cols.append(self.target_names[0])
last_data = self.data[lambda x: x.time_idx == x.time_idx.max()][last_data_cols]
decoder_data = X.X_val if isinstance(X, TimeSeriesDataset) else X
if "time_idx" not in decoder_data:
decoder_data = add_time_idx_col(decoder_data)
decoder_data["time_idx"] += encoder_data["time_idx"].max() + 1 - decoder_data["time_idx"].min()
decoder_data = decoder_data.merge(last_data, how="inner", on=self.group_ids)
decoder_data = decoder_data.sort_values(ids)
new_prediction_data = pd.concat([encoder_data, decoder_data], ignore_index=True)
new_prediction_data["time_idx"] = new_prediction_data["time_idx"].astype("int")
new_raw_predictions = self._model.predict(new_prediction_data)
index = [decoder_data[idx].to_numpy() for idx in ids]
predictions = pd.Series(new_raw_predictions.numpy().ravel(), index=index)
return predictions

View File

@ -0,0 +1,544 @@
import copy
import datetime
import math
from dataclasses import dataclass, field
from typing import List, Optional, Callable, Dict, Generator, Union
import numpy as np
try:
import pandas as pd
from pandas import DataFrame, Series, to_datetime
from scipy.sparse import issparse
from sklearn.preprocessing import LabelEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from .feature import monthly_fourier_features
except ImportError:
class PD:
pass
pd = PD()
pd.DataFrame = None
pd.Series = None
DataFrame = Series = None
@dataclass
class TimeSeriesDataset:
train_data: pd.DataFrame
time_idx: str
time_col: str
target_names: List[str]
frequency: str
test_data: pd.DataFrame
time_varying_known_categoricals: List[str] = field(default_factory=list)
time_varying_known_reals: List[str] = field(default_factory=list)
time_varying_unknown_categoricals: List[str] = field(default_factory=list)
time_varying_unknown_reals: List[str] = field(default_factory=list)
def __init__(
self,
train_data: pd.DataFrame,
time_col: str,
target_names: Union[str, List[str]],
time_idx: str = "time_idx",
test_data: Optional[pd.DataFrame] = None,
):
self.train_data = train_data
self.time_col = time_col
self.time_idx = time_idx
self.target_names = [target_names] if isinstance(target_names, str) else list(target_names)
assert isinstance(self.target_names, list)
assert len(self.target_names)
self.frequency = pd.infer_freq(train_data[time_col].unique())
assert self.frequency is not None, "Only time series of regular frequency are currently supported."
float_cols = list(train_data.select_dtypes(include=["floating"]).columns)
self.time_varying_known_reals = list(set(float_cols) - set(self.target_names))
self.time_varying_known_categoricals = list(
set(train_data.columns) - set(self.time_varying_known_reals) - set(self.target_names) - {time_col}
)
if test_data is not None:
self.test_data = test_data
else:
self.test_data = pd.DataFrame(columns=self.train_data.columns)
def add_test_data(self, X: pd.DataFrame) -> "TimeSeriesDataset":
assert self.time_col in X.columns
train_data = self.all_data[self.all_data[self.time_col] < X[self.time_col].min()]
return TimeSeriesDataset(train_data, self.time_col, self.target_names, self.time_idx, X)
@staticmethod
def to_dataframe(X, y, target_names: List[str], time_col: str):
assert len(X) == len(y), "X_val and y_val must have the same length"
validate_data_basic(X, y)
# coerce them into a dataframe
val_df = normalize_ts_data(X, target_names, time_col, y)
return val_df
@property
def all_data(self):
if len(self.test_data):
return pd.concat([self.train_data, self.test_data], axis=0)
else:
return self.train_data
@property
def regressors(self):
return self.time_varying_known_categoricals + self.time_varying_known_reals
@property
def end_date(self):
test_len = 0 if self.test_data is None else len(self.test_data)
data = self.test_data if test_len else self.train_data
return data.iloc[-1][self.time_col]
def _X(self, df: pd.DataFrame):
features = [col for col in df.columns if col not in self.target_names]
return df[features]
def _y(self, df: pd.DataFrame):
if len(self.target_names) > 1:
return df[self.target_names]
else:
return df[self.target_names[0]]
@property
def X_train(self) -> pd.DataFrame:
return self._X(self.train_data)
@property
def X_val(self) -> pd.DataFrame:
return self._X(self.test_data)
@property
def X_all(self) -> pd.DataFrame:
return pd.concat([self.X_train, self.X_val], axis=0)
@property
def y_train(self) -> pd.DataFrame:
return self._y(self.train_data)
@property
def y_val(self) -> pd.DataFrame:
return self._y(self.test_data)
@property
def y_all(self) -> pd.DataFrame:
return self._y(self.all_data)
def next_scale(self) -> int:
scale_map = {"D": 7, "MS": 12}
return scale_map.get(self.frequency, 8)
def known_features_to_floats(self, train: bool, drop_first: bool = True) -> np.ndarray:
# this is a bit tricky as shapes for train and test data must match, so need to encode together
combined = pd.concat(
[
self.train_data,
self.test_data,
],
ignore_index=True,
)
cat_one_hots = pd.get_dummies(
combined[self.time_varying_known_categoricals],
columns=self.time_varying_known_categoricals,
drop_first=drop_first,
).values.astype(float)
reals = combined[self.time_varying_known_reals].values.astype(float)
both = np.concatenate([reals, cat_one_hots], axis=1)
if train:
return both[: len(self.train_data)]
else:
return both[len(self.train_data) :]
# def unique_dimension_values(self) -> np.ndarray:
#     # this is the same set for train and test data, by construction
#     return self.combine_dims(self.train_data).unique()
def combine_dims(self, df):
# NOTE: assumes a `dimensions` attribute listing the series-identifier columns;
# to_univariate() below relies on this method.
return df.apply(lambda row: tuple(row[d] for d in self.dimensions), axis=1)
def to_univariate(self) -> Dict[str, "TimeSeriesDataset"]:
"""
Convert a multivariate TimeSeriesDataset into a dict of univariate ones,
keyed by the combined dimension values.
@return: dict mapping each dimension tuple to its univariate dataset
"""
train_dims = self.combine_dims(self.train_data)
test_dims = self.combine_dims(self.test_data)
out = {}
for d in train_dims.unique():
out[d] = copy.copy(self)
out[d].train_data = self.train_data[train_dims == d]
out[d].test_data = self.test_data[test_dims == d]
return out
def move_validation_boundary(self, steps: int) -> "TimeSeriesDataset":
out = copy.copy(self)
if steps > 0:
out.train_data = pd.concat([self.train_data, self.test_data[:steps]])
out.test_data = self.test_data[steps:]
elif steps < 0:
out.train_data = self.train_data[:steps]
if len(self.test_data):
out.test_data = pd.concat([self.train_data[steps:], self.test_data])
else:
out.test_data = self.train_data[steps:]
return out
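# Usage sketch (assuming `ds` is a TimeSeriesDataset with at least two test
# rows): positive steps move rows from the test split into the training
# split, negative steps move them back:
# >>> shifted = ds.move_validation_boundary(2)
# >>> len(shifted.train_data) == len(ds.train_data) + 2
# True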
def cv_train_val_sets(
self, n_splits: int, val_length: int, step_size: int
) -> Generator["TimeSeriesDataset", None, None]:
max_index = len(self.train_data) - 1
for i in range(n_splits):
out = copy.copy(self)
val_start = max_index - (n_splits - i - 1) * step_size - val_length
out.train_data = self.train_data[:val_start]
out.test_data = self.train_data[val_start : val_start + val_length]
yield out
def filter(self, filter_fun: Callable) -> "TimeSeriesDataset":
if filter_fun is None:
return self
out = copy.copy(self)
out.train_data = self.train_data[filter_fun]
out.test_data = self.test_data[filter_fun]
return out
def prettify_prediction(self, y_pred: Union[pd.DataFrame, pd.Series, np.ndarray]):
if self.test_data is not None and len(self.test_data):
assert len(y_pred) == len(self.test_data)
if isinstance(y_pred, np.ndarray):
y_pred = pd.DataFrame(data=y_pred, columns=self.target_names, index=self.test_data.index)
elif isinstance(y_pred, pd.Series):
assert len(self.target_names) == 1, "Not enough columns in y_pred"
y_pred.name = self.target_names[0]
y_pred = pd.DataFrame(y_pred)
y_pred.index = self.test_data.index
elif isinstance(y_pred, pd.DataFrame):
y_pred.index = self.test_data.index
if self.time_col not in y_pred.columns:
y_pred[self.time_col] = self.test_data[self.time_col]
else:
if isinstance(y_pred, np.ndarray):
raise ValueError("Can't enrich np.ndarray as self.test_data is None")
elif isinstance(y_pred, pd.Series):
assert len(self.target_names) == 1, "Not enough columns in y_pred"
y_pred = pd.DataFrame({self.target_names[0]: y_pred})
# TODO auto-create the timestamps for the time column instead of throwing
raise NotImplementedError("Need a non-None test_data for this to work, for now")
assert isinstance(y_pred, pd.DataFrame)
assert self.time_col in y_pred.columns
assert all([t in y_pred.columns for t in self.target_names])
return y_pred
def merge_prediction_with_target(self, y_pred: Union[pd.DataFrame, pd.Series, np.ndarray]):
y_pred = self.prettify_prediction(y_pred)
return pd.concat([self.train_data[[self.time_col] + self.target_names], y_pred], axis=0)
def enrich_dataframe(
df: Union[pd.DataFrame, pd.Series],
fourier_degree: int,
remove_constants: bool = False,
fourier_time: bool = True,
) -> pd.DataFrame:
if isinstance(df, pd.Series):
df = pd.DataFrame(df)
new_cols = []
for col in df.columns:
if df[col].dtype.name == "datetime64[ns]":
extras = monthly_fourier_features(df[col], fourier_degree)
extras.columns = [f"{col}_{c}" for c in extras.columns]
extras.index = df.index
new_cols.append(extras)
date_feat = date_feature_dict_fourier(df[col]) if fourier_time else date_feature_dict(df[col])
if remove_constants:
re_date_feat = {k: v for k, v in date_feat.items() if v.nunique(dropna=False) >= 2}
else:
re_date_feat = date_feat
date_feat = pd.DataFrame(re_date_feat, index=df.index)
new_cols.append(date_feat)
return pd.concat([df] + new_cols, axis=1, verify_integrity=True)
def enrich_dataset(
X: TimeSeriesDataset,
fourier_degree: int = 0,
remove_constants: bool = False,
fourier_time: bool = True,
) -> TimeSeriesDataset:
new_train = enrich_dataframe(X.train_data, fourier_degree, remove_constants, fourier_time)
new_test = (
None if X.test_data is None else enrich_dataframe(X.test_data, fourier_degree, remove_constants, fourier_time)
)
return TimeSeriesDataset(
train_data=new_train,
time_col=X.time_col,
target_names=X.target_names,
time_idx=X.time_idx,
test_data=new_test,
)
def date_feature_dict(timestamps: pd.Series) -> dict:
tmp_dt = timestamps.dt
column = timestamps.name
pre_columns_dict = {
# f"{column}_year": tmp_dt.year, # not stationary
f"{column}_month": tmp_dt.month,
# f"{column}_day": tmp_dt.day,# taken care of with monthly fourier features
f"{column}_hour": tmp_dt.hour,
f"{column}_minute": tmp_dt.minute,
f"{column}_second": tmp_dt.second,
f"{column}_dayofweek": tmp_dt.dayofweek,
f"{column}_dayofyear": tmp_dt.dayofyear,
f"{column}_quarter": tmp_dt.quarter,
}
new_columns_dict = {}
for k, v in pre_columns_dict.items():
new_columns_dict.update(fourier_series(v, k))
return new_columns_dict
def date_feature_dict_fourier(timestamps: pd.Series) -> dict:
tmp_dt = timestamps.dt
column = timestamps.name
pre_columns_dict = {
# f"{column}_year": tmp_dt.year, # not stationary
f"{column}_month": tmp_dt.month / 12.0,
# f"{column}_day": tmp_dt.day,# taken care of with monthly fourier features
f"{column}_hour": tmp_dt.hour / 24.0,
f"{column}_minute": tmp_dt.minute / 60.0,
f"{column}_second": tmp_dt.second / 60.0,
f"{column}_dayofweek": tmp_dt.dayofweek / 7.0,
f"{column}_dayofyear": tmp_dt.dayofyear / 366.0,
f"{column}_quarter": tmp_dt.quarter / 4.0,
}
new_columns_dict = {}
for k, v in pre_columns_dict.items():
new_columns_dict.update(fourier_series(v, k))
return new_columns_dict
def fourier_series(feature: pd.Series, name: str):
"""
Assume feature goes from 0 to 1 cyclically, transform that into Fourier
@param feature: input feature
@return: sin(2pi*feature), cos(2pi*feature)
"""
return {
name + "_sin": np.sin(2 * math.pi * feature),
name + "_cos": np.cos(2 * math.pi * feature),
}
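# Minimal usage sketch: encode month-of-year, normalized to [0, 1], as a
# smooth cyclic pair so that December and January end up close together:
# >>> months = pd.Series(range(12), name="month") / 12.0
# >>> feats = fourier_series(months, "month")
# >>> sorted(feats)
# ['month_cos', 'month_sin']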
class DataTransformerTS:
"""Transform input time series training data."""
def __init__(self, time_col: str, label: Union[str, List[str]], time_idx: str = "time_idx"):
self.time_col = time_col
self.time_idx = time_idx
self.label = label
self.cat_columns = []
self.num_columns = []
self.datetime_columns = []
self.drop_columns = []
@property
def _drop(self):
return len(self.drop_columns)
def fit(self, X: Union[DataFrame, np.ndarray], y):
"""Fit the transformer on training data; call transform() to obtain the processed data.
Args:
X: A pandas dataframe of training data (asserted below).
y: A numpy array or a pandas series/dataframe of labels.
"""
assert isinstance(X, DataFrame)
X = X.copy()
n = X.shape[0]
assert len(self.num_columns) == 0, "Trying to call fit() twice, something is wrong"
for column in X.columns:
# sklearn/utils/validation.py needs int/float values
if X[column].dtype.name in ("object", "category"):
if (
# drop columns where all values are the same
X[column].nunique() == 1
# this drops UID-type cols
or X[column].nunique(dropna=True) == n - X[column].isnull().sum()
):
self.drop_columns.append(column)
elif column != self.time_idx:
self.cat_columns.append(column)
elif X[column].nunique(dropna=True) < 2:
self.drop_columns.append(column)
elif X[column].dtype.name == "datetime64[ns]":
pass # these will be processed at model level,
# so they can also be done in the predict method
else:
self.num_columns.append(column)
if self.num_columns:
self.transformer = ColumnTransformer(
[
(
"continuous",
SimpleImputer(missing_values=np.nan, strategy="median"),
self.num_columns,
)
]
)
self.transformer.fit(X[self.num_columns])
else:
self.transformer = None
# TODO: revisit for multivariate series, and recast for a single df input anyway
if isinstance(y, Series):
y = y.rename(self.label)
if isinstance(y, pd.DataFrame):
ycol = y[y.columns[0]]
elif isinstance(y, pd.Series):
ycol = y
else:
raise ValueError("y must be either a pd.Series or a pd.DataFrame at this stage")
if not pd.api.types.is_numeric_dtype(ycol):
self.label_transformer = LabelEncoder()
self.label_transformer.fit(ycol)
else:
self.label_transformer = None
def transform(self, X: Union[DataFrame, np.ndarray], y=None):
# TODO: revisit for multivariate series, and recast for a single df input anyway
if self.label_transformer is not None and y is not None:
if isinstance(y, pd.DataFrame):
ycol = y[y.columns[0]]
elif isinstance(y, pd.Series):
ycol = y
else:
raise ValueError("y must be either a pd.Series or a pd.DataFrame at this stage")
y_tr = self.label_transformer.transform(ycol)
y.iloc[:] = y_tr.reshape(y.shape)
X.drop(columns=self.drop_columns, inplace=True)
for col in self.cat_columns:
if X[col].dtype.name == "category":
if "__NAN__" not in X[col].cat.categories:
X[col] = X[col].cat.add_categories("__NAN__").fillna("__NAN__")
else:
X[col] = X[col].fillna("__NAN__")
X[col] = X[col].astype("category")
for column in self.num_columns:
X[column] = X[column].fillna(np.nan)
if self.transformer is not None:
X[self.num_columns] = self.transformer.transform(X[self.num_columns])
if y is None:
return X
return X, y
def fit_transform(self, X: Union[DataFrame, np.ndarray], y):
self.fit(X, y)
return self.transform(X, y)
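# Usage sketch (assuming `df` is a pandas DataFrame with a datetime column
# "ds" and a target column "y"): fit on training data, then reuse the same
# fitted imputers/encoders on any later frame via transform():
# >>> tr = DataTransformerTS(time_col="ds", label="y")
# >>> X_tr, y_tr = tr.fit_transform(df.drop(columns=["y"]), df["y"])
# >>> X_new = tr.transform(new_df)  # same column treatment as in fit()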
def create_forward_frame(
frequency: str,
steps: int,
test_end_date: datetime.datetime,
time_col: str,
):
# pd.Timedelta does not support calendar offsets such as "MS"; use a frequency offset
start_date = test_end_date + pd.tseries.frequencies.to_offset(frequency)
times = pd.date_range(
start=start_date,
periods=steps,
freq=frequency,
)
return pd.DataFrame({time_col: times})
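# Minimal usage sketch (hypothetical values): three daily timestamps starting
# right after the end of the test data, ready to be enriched for predict():
# >>> create_forward_frame("D", 3, pd.Timestamp("2023-01-31"), "ds")
#           ds
# 0 2023-02-01
# 1 2023-02-02
# 2 2023-02-03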
def normalize_ts_data(X_train_all, target_names, time_col, y_train_all=None):
if isinstance(X_train_all, TimeSeriesDataset):
return X_train_all
if issparse(X_train_all):
X_train_all = X_train_all.tocsr()
if isinstance(X_train_all, np.ndarray) and len(X_train_all.shape) == 1:
X_train_all = np.reshape(X_train_all, (X_train_all.size, 1))
if isinstance(X_train_all, np.ndarray):
X_train_all = pd.DataFrame(
X_train_all,
columns=[time_col] + [f"x{i}" for i in range(X_train_all.shape[1] - 1)],
)
if y_train_all is None:
return X_train_all
else:
if isinstance(y_train_all, np.ndarray):
# TODO: will need to revisit this when doing multivariate y
y_train_all = pd.DataFrame(
y_train_all.reshape(len(X_train_all), -1),
columns=target_names,
index=X_train_all.index,
)
elif isinstance(y_train_all, pd.Series):
y_train_all = pd.DataFrame(y_train_all)
y_train_all.index = X_train_all.index
dataframe = pd.concat([X_train_all, y_train_all], axis=1)
return dataframe
def validate_data_basic(X_train_all, y_train_all):
assert isinstance(X_train_all, np.ndarray) or issparse(X_train_all) or isinstance(X_train_all, pd.DataFrame), (
"X_train_all must be a numpy array, a pandas dataframe, " "or Scipy sparse matrix."
)
assert (
isinstance(y_train_all, np.ndarray)
or isinstance(y_train_all, pd.Series)
or isinstance(y_train_all, pd.DataFrame)
), "y_train_all must be a numpy array or a pandas series or DataFrame."
assert X_train_all.size != 0 and y_train_all.size != 0, "Input data must not be empty, use None if no data"
assert X_train_all.shape[0] == y_train_all.shape[0], "# rows in X_train must match length of y_train."
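# End-to-end sketch (hypothetical column names, assuming a regular daily series):
# >>> df = pd.DataFrame({"ds": pd.date_range("2023-01-01", periods=100, freq="D"),
# ...                    "y": np.arange(100.0)})
# >>> ds = TimeSeriesDataset(train_data=df, time_col="ds", target_names="y")
# >>> ds.frequency   # 'D'
# >>> ds.regressors  # [] -- no extra known covariates in this toy frame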

View File

@ -0,0 +1,760 @@
import time
import logging
import os
from datetime import datetime
import math
from typing import List, Optional, Union
try:
import pandas as pd
from pandas import DataFrame, Series, to_datetime
except ImportError:
class PD:
pass
pd = PD()
pd.DataFrame = None
pd.Series = None
DataFrame = Series = None
import numpy as np
from flaml import tune
from flaml.model import (
suppress_stdout_stderr,
SKLearnEstimator,
logger,
LGBMEstimator,
XGBoostSklearnEstimator,
RandomForestEstimator,
ExtraTreesEstimator,
XGBoostLimitDepthEstimator,
CatBoostEstimator,
)
from flaml.data import TS_TIMESTAMP_COL, TS_VALUE_COL
from flaml.automl.time_series.ts_data import (
TimeSeriesDataset,
enrich_dataset,
enrich_dataframe,
normalize_ts_data,
create_forward_frame,
)
from flaml.automl.task import Task
class TimeSeriesEstimator(SKLearnEstimator):
def __init__(self, task="ts_forecast", n_jobs=1, **params):
super().__init__(task, **params)
self.time_col: Optional[str] = None
self.target_names: Optional[Union[str, List[str]]] = None
self.frequency: Optional[str] = None
self.end_date: Optional[datetime] = None
self.regressors: Optional[List[str]] = None
def enrich(
self,
X: Union[int, TimeSeriesDataset, DataFrame],
remove_constants: bool = False,
):
X = normalize_ts_data(X, None, self.time_col, None)
if isinstance(X, int):
X = create_forward_frame(self.frequency, X, self.end_date, self.time_col)
fourier_degree = self.params.get("monthly_fourier_degree", 4)
if isinstance(X, TimeSeriesDataset):
return enrich_dataset(
X,
fourier_degree,
remove_constants=remove_constants,
fourier_time=self.params.get("fourier_time_features"),
)
return enrich_dataframe(
X,
fourier_degree,
remove_constants=remove_constants,
fourier_time=self.params.get("fourier_time_features"),
)
@classmethod
def search_space(cls, data: TimeSeriesDataset, task: Task, pred_horizon: int):
space = cls._search_space(data=data, task=task, pred_horizon=pred_horizon)
space.update(cls.top_search_space())
return space
@staticmethod
def adjust_scale(scale: int, data_len: int, pred_horizon: int):
points = data_len - pred_horizon
max_lags = math.floor(points / scale)
while scale > 2:
if max_lags >= 2:
break
scale = math.ceil(scale / 1.7)
max_lags = math.floor(points / scale)
assert scale >= 2 and max_lags >= 2, f"Too few points ({data_len}) for prediction horizon {pred_horizon}"
return scale, max_lags
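# Worked example for adjust_scale(): with data_len=30, pred_horizon=5 and an
# initial scale of 12, points = 25 and max_lags = floor(25 / 12) = 2, so the
# loop exits immediately and (12, 2) is returned; with scale=24 instead,
# max_lags would be 1, so scale shrinks to ceil(24 / 1.7) = 15, then to 9,
# until at least two full lags fit into the available points.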
@classmethod
def top_search_space(cls):
return {
"monthly_fourier_degree": {
"domain": tune.randint(lower=0, upper=8),
"init_value": 4,
"low_cost_init_value": 2,
},
"fourier_time_features": {
"domain": tune.randint(lower=0, upper=2), # tune.choice([True, False]),
"init_value": 1,
"low_cost_init_value": 0,
},
"pca_features": { # disable for now, will deal with occasional svd fail later
"domain": tune.choice([False]),
"init_value": False,
"low_cost_init_value": False,
},
}
@classmethod
def top_level_params(cls):
return ["monthly_fourier_degree"]
def _join(self, X_train, y_train):
assert TS_TIMESTAMP_COL in X_train, (
"Dataframe for training ts_forecast model must have column"
f' "{TS_TIMESTAMP_COL}" with the dates in X_train.'
)
y_train = DataFrame(y_train, columns=[TS_VALUE_COL])
train_df = X_train.join(y_train)
return train_df
def fit(self, X_train: TimeSeriesDataset, y_train=None, budget=None, **kwargs):
# TODO purge y_train
self.time_col = X_train.time_col
self.target_names = X_train.target_names
self.X_train = X_train
self.frequency = self.X_train.frequency
self.end_date = self.X_train.end_date
def score(self, X_val: DataFrame, y_val: Series, **kwargs):
from sklearn.metrics import r2_score
from ..ml import metric_loss_score
y_pred = self.predict(X_val, **kwargs)
if isinstance(X_val, TimeSeriesDataset):
y_val = X_val.test_data[X_val.target_names[0]]
self._metric = kwargs.get("metric", None)
if self._metric:
return metric_loss_score(self._metric, y_pred, y_val)
else:
return r2_score(y_val, y_pred)  # sklearn's r2_score expects (y_true, y_pred)
class Orbit(TimeSeriesEstimator):
def fit(self, X_train: TimeSeriesDataset, y_train=None, budget=None, **kwargs):
# This may be needed for PyStan (which Orbit depends on) to run
os.environ["KMP_DUPLICATE_LIB_OK"] = "True"
from orbit.models import DLT
# y_train is ignored, just need it for signature compatibility with other classes
super().fit(X_train, y_train, budget=budget, **kwargs)
current_time = time.time()
self.logger = logging.getLogger("orbit")
self.logger.setLevel(logging.WARNING)
model_class = self.params.get("model_class", DLT)
self._model = model_class(
response_col=X_train.target_names[0],
date_col=X_train.time_col,
regressor_col=X_train.regressors,
# TODO: infer seasonality from frequency
**self.params,
)
with suppress_stdout_stderr():
self._model.fit(df=X_train.train_data.copy())
train_time = time.time() - current_time
return train_time
def predict(self, X: Union[TimeSeriesDataset, DataFrame], **kwargs):
if isinstance(X, int):
X = create_forward_frame(
self.frequency,
X,
self.end_date,
self.time_col,
)
elif isinstance(X, TimeSeriesDataset):
data = X
X = data.test_data[[self.time_col] + X.regressors]
if self._model is not None:
forecast = self._model.predict(X, **kwargs)
out = (
DataFrame(
forecast[
[
self.time_col,
"prediction",
"prediction_5",
"prediction_95",
]
]
)
.reset_index(drop=True)
.rename(
columns={
"prediction": self.target_names[0],
}
)
)
return out
else:
self.logger.warning("Estimator is not fit yet. Please run fit() before predict().")
return None
@classmethod
def _search_space(cls, **params):
# TODO: fill in a proper search space
space = {}
return space
class Prophet(TimeSeriesEstimator):
"""The class for tuning Prophet."""
@classmethod
def _search_space(cls, **params):
space = {
"changepoint_prior_scale": {
"domain": tune.loguniform(lower=0.001, upper=0.05),
"init_value": 0.05,
"low_cost_init_value": 0.001,
},
"seasonality_prior_scale": {
"domain": tune.loguniform(lower=0.01, upper=10),
"init_value": 10,
},
"holidays_prior_scale": {
"domain": tune.loguniform(lower=0.01, upper=10),
"init_value": 10,
},
"seasonality_mode": {
"domain": tune.choice(["additive", "multiplicative"]),
"init_value": "multiplicative",
},
}
return space
def fit(self, X_train, y_train=None, budget=None, **kwargs):
from prophet import Prophet
X_train = self.enrich(X_train)
super().fit(X_train, y_train, budget=budget, **kwargs)
current_time = time.time()
if isinstance(X_train, TimeSeriesDataset):
data = X_train
target_col = data.target_names[0]
time_col = data.time_col
regressors = data.regressors
# this class only supports univariate regression
train_df = data.train_data[regressors + [target_col, time_col]]
train_df = train_df.rename(columns={target_col: "y", time_col: "ds"})
else:
train_df = self._join(X_train, y_train)
regressors = list(train_df.columns)
regressors.remove(TS_TIMESTAMP_COL)
regressors.remove(TS_VALUE_COL)
train_df = self._preprocess(train_df)
logging.getLogger("prophet").setLevel(logging.WARNING)
nice_params = {k: v for k, v in self.params.items() if k in self._search_space()}
model = Prophet(**nice_params)
for regressor in regressors:
model.add_regressor(regressor)
with suppress_stdout_stderr():
model.fit(train_df)
train_time = time.time() - current_time
self._model = model
return train_time
def predict(self, X, **kwargs):
X = self.enrich(X)
if isinstance(X, int):
raise ValueError(
"predict() with steps is only supported for arima/sarimax."
" For Prophet, pass a dataframe with the first column containing"
" the timestamp values."
)
if isinstance(X, TimeSeriesDataset):
data = X
X = data.test_data[data.regressors + [data.time_col]]
X = X.rename(columns={self.time_col: "ds"})
if self._model is not None:
X = self._preprocess(X)
forecast = self._model.predict(X, **kwargs)
out = forecast["yhat"]
out.name = self.target_names[0]
return out
else:
logger.warning("Estimator is not fit yet. Please run fit() before predict().")
return np.ones(X.shape[0])
class StatsModelsEstimator(TimeSeriesEstimator):
def predict(self, X, **kwargs) -> pd.Series:
X = self.enrich(X)
if self._model is None or self._model is False:
return np.ones(X if isinstance(X, int) else X.shape[0])
if isinstance(X, int):
return self._model.forecast(steps=X)
if isinstance(X, TimeSeriesDataset):
data = X
X = data.test_data[data.regressors + [data.time_col]]
else:
X = X[self.regressors + [self.time_col]]
if isinstance(X, DataFrame):
start = X[self.time_col].iloc[0]
end = X[self.time_col].iloc[-1]
if len(self.regressors):
exog = self._preprocess(X[self.regressors])
forecast = self._model.predict(start=start, end=end, exog=exog.values, **kwargs)
else:
forecast = self._model.predict(start=start, end=end, **kwargs)
else:
raise ValueError(
"X needs to be either a pandas Dataframe with dates as the first column"
" or an int number of periods for predict()."
)
forecast.name = self.target_names[0]
return forecast
class ARIMA(StatsModelsEstimator):
"""The class for tuning ARIMA."""
def __init__(self, **kwargs):
super().__init__(**kwargs)
if not all(p in self.params for p in ["p", "d", "q"]):
print("arima params at init time:")
print(self.params)
raise ValueError("ARIMA initialized without required params p, d, q")
@classmethod
def _search_space(cls, data: TimeSeriesDataset, task: Task, pred_horizon: int, **params):
scale, _ = cls.adjust_scale(data.next_scale(), len(data.train_data), pred_horizon)
space = {
"p": {
"domain": tune.qrandint(lower=0, upper=2 * scale, q=1),
"init_value": scale,
"low_cost_init_value": 0,
},
"d": {
"domain": tune.qrandint(lower=0, upper=6, q=1),
"init_value": 1,
"low_cost_init_value": 0,
},
"q": {
"domain": tune.qrandint(lower=0, upper=2 * scale, q=1),
"init_value": scale,
"low_cost_init_value": 0,
},
}
return space
def _join(self, X_train, y_train):
train_df = super()._join(X_train, y_train)
train_df.index = to_datetime(train_df[TS_TIMESTAMP_COL])
train_df = train_df.drop(TS_TIMESTAMP_COL, axis=1)
return train_df
def fit(self, X_train, y_train=None, budget=None, **kwargs):
import warnings
super().fit(X_train, y_train, budget=budget, **kwargs)
X_train = self.enrich(X_train, remove_constants=True)
warnings.filterwarnings("ignore")
from statsmodels.tsa.arima.model import ARIMA as ARIMA_estimator
current_time = time.time()
if isinstance(X_train, TimeSeriesDataset):
data = X_train
# this class only supports univariate regression
target_col = data.target_names[0] if isinstance(data.target_names, list) else data.target_names
self.regressors = data.regressors
train_df = data.train_data[self.regressors + [target_col]]
train_df.index = to_datetime(data.train_data[data.time_col])
self.time_col = data.time_col
self.target_names = target_col
else:
target_col = TS_VALUE_COL
train_df = self._join(X_train, y_train)
self.regressors = list(train_df)
self.regressors.remove(TS_VALUE_COL)
train_df = self._preprocess(train_df)
if len(self.regressors):
model = ARIMA_estimator(
train_df[[target_col]],
exog=train_df[self.regressors],
order=(self.params["p"], self.params["d"], self.params["q"]),
enforce_stationarity=False,
enforce_invertibility=False,
)
else:
model = ARIMA_estimator(
train_df,
order=(self.params["p"], self.params["d"], self.params["q"]),
enforce_stationarity=False,
enforce_invertibility=False,
)
with suppress_stdout_stderr():
model = model.fit()
train_time = time.time() - current_time
self._model = model
return train_time
class SARIMAX(StatsModelsEstimator):
"""The class for tuning SARIMA."""
@classmethod
def _search_space(cls, data: TimeSeriesDataset, task: Task, pred_horizon: int, **params):
scale, max_lags = cls.adjust_scale(data.next_scale(), len(data.train_data), pred_horizon)
# TODO: instead, downscale the dataset and take next_scale from that for P and Q
scales = [
s for s in [scale, 2 * scale, 3 * scale, 4 * scale] if s * max_lags <= len(data.train_data) - pred_horizon
]
space = {
"p": {
"domain": tune.qrandint(lower=0, upper=scale - 1, q=1),
"init_value": scale - 1,
"low_cost_init_value": 0,
},
"d": {
"domain": tune.qrandint(lower=0, upper=6, q=1),
"init_value": 0,
"low_cost_init_value": 0,
},
"q": {
"domain": tune.qrandint(lower=0, upper=scale - 1, q=1),
"init_value": scale - 1,
"low_cost_init_value": 0,
},
"P": {
"domain": tune.qrandint(lower=0, upper=min(10, max_lags), q=1),
"init_value": 3,
"low_cost_init_value": 0,
},
"D": {
"domain": tune.qrandint(lower=0, upper=6, q=1),
"init_value": 0,
"low_cost_init_value": 0,
},
"Q": {
"domain": tune.qrandint(lower=0, upper=min(10, max_lags), q=1),
"init_value": 3,
"low_cost_init_value": 0,
},
"s": {
"domain": tune.choice(scales),
"init_value": scale,
},
}
return space
def fit(self, X_train, y_train=None, budget=None, **kwargs):
import warnings
super().fit(X_train, y_train, budget=budget, **kwargs)
X_train = self.enrich(X_train)
warnings.filterwarnings("ignore")
from statsmodels.tsa.statespace.sarimax import SARIMAX as SARIMAX_estimator
current_time = time.time()
if isinstance(X_train, TimeSeriesDataset):
data = X_train
target_col = data.target_names[0]
self.regressors = data.regressors
# this class only supports univariate regression
train_df = data.train_data[self.regressors + [target_col]]
train_df.index = to_datetime(data.train_data[data.time_col])
else:
target_col = TS_VALUE_COL
train_df = self._join(X_train, y_train)
self.regressors = list(train_df)
self.regressors.remove(TS_VALUE_COL)
train_df = self._preprocess(train_df)
# regressors = list(train_df)
# regressors.remove(target_col)
if self.regressors:
model = SARIMAX_estimator(
train_df[[target_col]],
exog=train_df[self.regressors],
order=(self.params["p"], self.params["d"], self.params["q"]),
seasonal_order=(
self.params["P"],
self.params["D"],
self.params["Q"],
self.params["s"],
),
enforce_stationarity=False,
enforce_invertibility=False,
)
else:
model = SARIMAX_estimator(
train_df,
order=(self.params["p"], self.params["d"], self.params["q"]),
seasonal_order=(
self.params["P"],
self.params["D"],
self.params["Q"],
self.params["s"],
),
enforce_stationarity=False,
enforce_invertibility=False,
)
with suppress_stdout_stderr():
model = model.fit()
train_time = time.time() - current_time
self._model = model
return train_time
class HoltWinters(StatsModelsEstimator):
"""
The class for tuning Holt Winters model, aka 'Triple Exponential Smoothing'.
"""
@classmethod
def _search_space(cls, data: TimeSeriesDataset, task: Task, pred_horizon: int, **params):
space = {
"damped_trend": {"domain": tune.choice([True, False]), "init_value": False},
"trend": {"domain": tune.choice(["add", "mul", None]), "init_value": "add"},
"seasonal": {
"domain": tune.choice(["add", "mul", None]),
"init_value": "add",
},
"use_boxcox": {"domain": tune.choice([False, True]), "init_value": False},
"seasonal_periods": { # statsmodels casts this to None if "seasonal" is None
"domain": tune.choice([7, 12, 4, 52, 6]), # weekly, yearly, quarterly, weekly w yearly data
"init_value": 7,
},
}
return space
def fit(self, X_train, y_train, budget=None, free_mem_ratio=0, **kwargs):
import warnings
warnings.filterwarnings("ignore")
from statsmodels.tsa.holtwinters import (
ExponentialSmoothing as HWExponentialSmoothing,
)
current_time = time.time()
super().fit(X_train, y_train, budget=budget, **kwargs)
X_train = self.enrich(X_train)
self.regressors = []
if isinstance(X_train, TimeSeriesDataset):
data = X_train
target_col = data.target_names[0]
regressors = data.regressors
# this class only supports univariate regression
train_df = data.train_data[self.regressors + [target_col]]
train_df.index = to_datetime(data.train_data[data.time_col])
else:
target_col = TS_VALUE_COL
train_df = self._join(X_train, y_train)
regressors = list(train_df)
regressors.remove(TS_VALUE_COL)
if regressors:
logger.warning("Regressors are ignored for Holt-Winters ETS models.")
train_df = self._preprocess(train_df)
# Override incompatible parameters
if (
train_df.shape[0] < 2 * self.params["seasonal_periods"]
):  # too few points would prevent heuristic initialization from working properly
self.params["seasonal"] = None
if (
self.params["seasonal"] == "mul" and (train_df[target_col] == 0).sum() > 0
):  # cannot have multiplicative seasonality when the target contains zeros
self.params["seasonal"] = "add"
if self.params["trend"] == "mul" and (train_df[target_col] == 0).sum() > 0:
self.params["trend"] = "add"
if not self.params["seasonal"] or self.params["trend"] not in ["mul", "add"]:
self.params["damped_trend"] = False
model = HWExponentialSmoothing(
train_df[[target_col]],
damped_trend=self.params["damped_trend"],
seasonal=self.params["seasonal"],
trend=self.params["trend"],
)
with suppress_stdout_stderr():
model = model.fit()
train_time = time.time() - current_time
self._model = model
return train_time
class TS_SKLearn(TimeSeriesEstimator):
"""The class for tuning SKLearn Regressors for time-series forecasting"""
base_class = SKLearnEstimator
@classmethod
def _search_space(cls, data: TimeSeriesDataset, task: Task, pred_horizon: int, **params):
data_size = data.train_data.shape
space = cls.base_class.search_space(data_size=data_size, task=task, **params)
scale, _ = cls.adjust_scale(data.next_scale(), len(data.train_data), pred_horizon)
max_lags = max(3 * scale, int(np.sqrt(data_size[0])))
max_lags = min(max_lags, data_size[0] - pred_horizon - 1)
space.update(
{
"lags": {
"domain": tune.randint(lower=1, upper=max_lags),
"init_value": min(max_lags, scale),
},
}
)
return space
def __init__(self, task="ts_forecast", **params):
# TODO: pass task objects throughout
super().__init__(task, **params)
self._model = None
self.ts_task = task
def fit(self, X_train, y_train=None, budget=None, **kwargs):
super().fit(X_train, y_train, budget=budget, **kwargs)
X_train = self.enrich(X_train)
current_time = time.time()
if isinstance(X_train, TimeSeriesDataset):
data = X_train
X_train = data.train_data[data.regressors + [data.time_col]]
self.regressors = data.regressors
# this class only supports univariate regression
y_train = data.y_train
self.time_col = data.time_col
self.target_names = data.target_names
elif isinstance(X_train, DataFrame):
self.time_col = X_train.columns.tolist()[0]
# X_train = self.transform_X(X_train)
self.regressors = X_train.columns.tolist()[1:]
else:
raise ValueError("Unknown X type")
X_train = self._preprocess(X_train)
est_params = {k: v for k, v in self.params.items() if k not in self.top_search_space().keys()}
from flaml.automl.time_series.sklearn import SklearnWrapper
horizon = kwargs.pop("period")
lags = est_params.pop("lags")
est_params["task"] = self._task
self._model = SklearnWrapper(
self.base_class,
horizon=horizon,
lags=lags,
init_params=est_params,
pca_features=self.params.get("pca_features", False),
)
self._model.fit(X_train[self.regressors], y_train)
train_time = time.time() - current_time
return train_time
def predict(self, X, **kwargs):
X = self.enrich(X)
if isinstance(X, TimeSeriesDataset):
data = X
X = data.test_data
if self._model is not None:
X = X[self.regressors]
# X = self.transform_X(X)
X = self._preprocess(X)
forecast = self._model.predict(X)
if isinstance(forecast, Series):
forecast.name = self.target_names[0]
return forecast
else:
logger.warning("Estimator is not fit yet. Please run fit() before predict().")
return np.ones(X.shape[0])
class LGBM_TS(TS_SKLearn):
"""The class for tuning LGBM Regressor for time-series forecasting"""
base_class = LGBMEstimator
class XGBoost_TS(TS_SKLearn):
"""The class for tuning XGBoost Regressor for time-series forecasting"""
base_class = XGBoostSklearnEstimator
class RF_TS(TS_SKLearn):
"""The class for tuning Random Forest Regressor for time-series forecasting"""
base_class = RandomForestEstimator
class ExtraTrees_TS(TS_SKLearn):
"""The class for tuning Extra Trees Regressor for time-series forecasting"""
base_class = ExtraTreesEstimator
class XGBoostLimitDepth_TS(TS_SKLearn):
"""The class for tuning XGBoost Regressor with unlimited depth for time-series forecasting"""
base_class = XGBoostLimitDepthEstimator
# catboost regressor is invalid because it has a `name` parameter, making it incompatible with hcrystalball
class CatBoost_TS(TS_SKLearn):
base_class = CatBoostEstimator

View File

@ -0,0 +1,179 @@
"""!
* Copyright (c) Microsoft Corporation. All rights reserved.
* Licensed under the MIT License.
"""
import json
from typing import IO
from contextlib import contextmanager
import logging
logger = logging.getLogger("flaml.automl")
class TrainingLogRecord(object):
def __init__(
self,
record_id: int,
iter_per_learner: int,
logged_metric: float,
trial_time: float,
wall_clock_time: float,
validation_loss: float,
config: dict,
learner: str,
sample_size: int,
):
self.record_id = record_id
self.iter_per_learner = iter_per_learner
self.logged_metric = logged_metric
self.trial_time = trial_time
self.wall_clock_time = wall_clock_time
self.validation_loss = float(validation_loss)
self.config = config
self.learner = learner
self.sample_size = sample_size
def dump(self, fp: IO[str]):
d = vars(self)
return json.dump(d, fp)
@classmethod
def load(cls, json_str: str):
d = json.loads(json_str)
return cls(**d)
def __str__(self):
return json.dumps(vars(self))
class TrainingLogCheckPoint(TrainingLogRecord):
def __init__(self, curr_best_record_id: int):
self.curr_best_record_id = curr_best_record_id
class TrainingLogWriter(object):
def __init__(self, output_filename: str):
self.output_filename = output_filename
self.file = None
self.current_best_loss_record_id = None
self.current_best_loss = float("+inf")
self.current_sample_size = None
self.current_record_id = 0
def open(self):
self.file = open(self.output_filename, "w")
def append_open(self):
self.file = open(self.output_filename, "a")
def append(
self,
it_counter: int,
train_loss: float,
trial_time: float,
wall_clock_time: float,
validation_loss,
config,
learner,
sample_size,
):
if self.file is None:
raise IOError("Call open() to open the output file first.")
if validation_loss is None:
raise ValueError("TEST LOSS NONE ERROR!!!")
record = TrainingLogRecord(
self.current_record_id,
it_counter,
train_loss,
trial_time,
wall_clock_time,
validation_loss,
config,
learner,
sample_size,
)
if validation_loss < self.current_best_loss or (
validation_loss == self.current_best_loss
and self.current_sample_size is not None
and sample_size > self.current_sample_size
):
self.current_best_loss = validation_loss
self.current_sample_size = sample_size
self.current_best_loss_record_id = self.current_record_id
self.current_record_id += 1
record.dump(self.file)
self.file.write("\n")
self.file.flush()
def checkpoint(self):
if self.file is None:
raise IOError("Call open() to open the output file first.")
if self.current_best_loss_record_id is None:
logger.warning("flaml.training_log: checkpoint() called before any record is written, skipped.")
return
record = TrainingLogCheckPoint(self.current_best_loss_record_id)
record.dump(self.file)
self.file.write("\n")
self.file.flush()
def close(self):
if self.file is not None:
self.file.close()
self.file = None # for pickle
class TrainingLogReader(object):
def __init__(self, filename: str):
self.filename = filename
self.file = None
def open(self):
self.file = open(self.filename)
def records(self):
if self.file is None:
raise IOError("Call open() before reading log file.")
for line in self.file:
data = json.loads(line)
if len(data) == 1:
# Skip checkpoints.
continue
yield TrainingLogRecord(**data)
def close(self):
if self.file is not None:
self.file.close()
self.file = None # for pickle
def get_record(self, record_id) -> TrainingLogRecord:
if self.file is None:
raise IOError("Call open() before reading log file.")
for rec in self.records():
if rec.record_id == record_id:
return rec
raise ValueError(f"Cannot find record with id {record_id}.")
@contextmanager
def training_log_writer(filename: str, append: bool = False):
try:
w = TrainingLogWriter(filename)
if not append:
w.open()
else:
w.append_open()
yield w
finally:
w.close()
@contextmanager
def training_log_reader(filename: str):
try:
r = TrainingLogReader(filename)
r.open()
yield r
finally:
r.close()
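# Usage sketch (hypothetical file name): write records during a run, then
# replay them afterwards; checkpoint lines are skipped by records():
# >>> with training_log_writer("run.log") as w:
# ...     w.append(0, 0.5, 1.2, 1.2, 0.4, {"lr": 0.1}, "lgbm", 1000)
# ...     w.checkpoint()
# >>> with training_log_reader("run.log") as r:
# ...     losses = [rec.validation_loss for rec in r.records()]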

15
flaml/config.py Normal file
View File

@ -0,0 +1,15 @@
"""!
* Copyright (c) Microsoft Corporation. All rights reserved.
* Licensed under the MIT License.
"""
N_SPLITS = 5
RANDOM_SEED = 1
SPLIT_RATIO = 0.1
MEM_THRES = 4 * (1024**3)
SMALL_LARGE_THRES = 10000000
MIN_SAMPLE_TRAIN = 10000
CV_HOLDOUT_THRESHOLD = 100000
SAMPLE_MULTIPLY_FACTOR = 4
SEARCH_THREAD_EPS = 1.0
PENALTY = 1e10 # penalty term for constraints

9
flaml/data.py Normal file
View File

@ -0,0 +1,9 @@
import warnings
from flaml.automl.data import *
warnings.warn(
"Importing from `flaml.data` is deprecated. Please use `flaml.automl.data`.",
DeprecationWarning,
)

184
flaml/default/README.md Normal file
View File

@ -0,0 +1,184 @@
# FLAML-Zero: Zero-shot AutoML
## Zero-shot AutoML
There are several ways to use zero-shot AutoML, i.e., train a model with the data-dependent default configuration.
0. Use estimators in `flaml.default.estimator`.
```python
from flaml.default import LGBMRegressor
estimator = LGBMRegressor()
estimator.fit(X_train, y_train)
estimator.predict(X_test)
```
1. Use `AutoML.fit()`, setting `starting_points="data"` and `max_iter=0`.
```python
from flaml import AutoML
from sklearn.datasets import load_iris

X_train, y_train = load_iris(return_X_y=True, as_frame=True)
automl = AutoML()
automl_settings = {
"time_budget": 2,
"task": "classification",
"log_file_name": "test/iris.log",
"starting_points": "data",
"max_iter": 0,
}
automl.fit(X_train, y_train, **automl_settings)
```
2. Use `flaml.default.preprocess_and_suggest_hyperparams`.
```python
from flaml.default import preprocess_and_suggest_hyperparams
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
hyperparams, estimator_class, X_transformed, y_transformed, feature_transformer, label_transformer = preprocess_and_suggest_hyperparams(
"classification", X_train, y_train, "lgbm"
)
model = estimator_class(**hyperparams) # estimator_class is LGBMClassifier
model.fit(X_transformed, y_train) # LGBMClassifier can handle raw labels
X_test = feature_transformer.transform(X_test) # preprocess test data
y_pred = model.predict(X_test)
```
If you want to use your own meta-learned defaults, specify the path containing the meta-learned defaults. For example,
```python
from flaml import AutoML
from sklearn.datasets import load_iris

X_train, y_train = load_iris(return_X_y=True, as_frame=True)
automl = AutoML()
automl_settings = {
"time_budget": 2,
"task": "classification",
"log_file_name": "test/iris.log",
"starting_points": "data:test/default",
"estimator_list": ["lgbm", "xgb_limitdepth", "rf"]
"max_iter": 0,
}
automl.fit(X_train, y_train, **automl_settings)
```
Since this is a multiclass task, it will look for the following files under `test/default/`:
- `all/multiclass.json`.
- `{learner_name}/multiclass.json` for every learner_name in the estimator_list.
Read the next subsection to understand how to generate these files if you would like to meta-learn the defaults yourself.
To perform hyperparameter search starting with the data-dependent defaults, remove `max_iter=0`.
## Perform Meta Learning
FLAML provides a package `flaml.default` to learn defaults customized for your own tasks/learners/metrics.
### Prepare a collection of training tasks
Collect a diverse set of training tasks. For each task, extract its meta features and save them in a .csv file. For example, test/default/all/metafeatures.csv:
```
Dataset,NumberOfInstances,NumberOfFeatures,NumberOfClasses,PercentageOfNumericFeatures
2dplanes,36691,10,0,1.0
adult,43957,14,2,0.42857142857142855
Airlines,485444,7,2,0.42857142857142855
Albert,382716,78,2,0.3333333333333333
Amazon_employee_access,29492,9,2,0.0
bng_breastTumor,104976,9,0,0.1111111111111111
bng_pbc,900000,18,0,0.5555555555555556
car,1555,6,4,0.0
connect-4,60801,42,3,0.0
dilbert,9000,2000,5,1.0
Dionis,374569,60,355,1.0
poker,922509,10,0,1.0
```
The first column is the dataset name, and the latter four are meta features.
### Prepare the candidate configurations
You can extract the best configurations for each task in your collection of training tasks by running flaml on each of them with a long enough budget. Save the best configuration in a .json file under `{location_for_defaults}/{learner_name}/{task_name}.json`. For example,
```python
X_train, y_train = load_iris(return_X_y=True, as_frame=True)
automl.fit(X_train, y_train, estimator_list=["lgbm"], **settings)
automl.save_best_config("test/default/lgbm/iris.json")
```
### Evaluate each candidate configuration on each task
Save the evaluation results in a .csv file. For example, save the evaluation results for lgbm under `test/default/lgbm/results.csv`:
```
task,fold,type,result,params
2dplanes,0,regression,0.946366,{'_modeljson': 'lgbm/2dplanes.json'}
2dplanes,0,regression,0.907774,{'_modeljson': 'lgbm/adult.json'}
2dplanes,0,regression,0.901643,{'_modeljson': 'lgbm/Airlines.json'}
2dplanes,0,regression,0.915098,{'_modeljson': 'lgbm/Albert.json'}
2dplanes,0,regression,0.302328,{'_modeljson': 'lgbm/Amazon_employee_access.json'}
2dplanes,0,regression,0.94523,{'_modeljson': 'lgbm/bng_breastTumor.json'}
2dplanes,0,regression,0.945698,{'_modeljson': 'lgbm/bng_pbc.json'}
2dplanes,0,regression,0.946194,{'_modeljson': 'lgbm/car.json'}
2dplanes,0,regression,0.945549,{'_modeljson': 'lgbm/connect-4.json'}
2dplanes,0,regression,0.946232,{'_modeljson': 'lgbm/default.json'}
2dplanes,0,regression,0.945594,{'_modeljson': 'lgbm/dilbert.json'}
2dplanes,0,regression,0.836996,{'_modeljson': 'lgbm/Dionis.json'}
2dplanes,0,regression,0.917152,{'_modeljson': 'lgbm/poker.json'}
adult,0,binary,0.927203,{'_modeljson': 'lgbm/2dplanes.json'}
adult,0,binary,0.932072,{'_modeljson': 'lgbm/adult.json'}
adult,0,binary,0.926563,{'_modeljson': 'lgbm/Airlines.json'}
adult,0,binary,0.928604,{'_modeljson': 'lgbm/Albert.json'}
adult,0,binary,0.911171,{'_modeljson': 'lgbm/Amazon_employee_access.json'}
adult,0,binary,0.930645,{'_modeljson': 'lgbm/bng_breastTumor.json'}
adult,0,binary,0.928603,{'_modeljson': 'lgbm/bng_pbc.json'}
adult,0,binary,0.915825,{'_modeljson': 'lgbm/car.json'}
adult,0,binary,0.919499,{'_modeljson': 'lgbm/connect-4.json'}
adult,0,binary,0.930109,{'_modeljson': 'lgbm/default.json'}
adult,0,binary,0.932453,{'_modeljson': 'lgbm/dilbert.json'}
adult,0,binary,0.921959,{'_modeljson': 'lgbm/Dionis.json'}
adult,0,binary,0.910763,{'_modeljson': 'lgbm/poker.json'}
...
```
The `type` column indicates the type of the task, such as regression, binary or multiclass.
The `result` column stores the evaluation result, assuming larger is better. The `params` column indicates which json config is used. For example, 'lgbm/2dplanes.json' indicates that the best lgbm configuration extracted from 2dplanes is used.
### Learn data-dependent defaults
To recap, the inputs required for meta-learning are:
1. Metafeatures: e.g., `{location}/all/metafeatures.csv`.
1. Configurations: `{location}/{learner_name}/{task_name}.json`.
1. Evaluation results: `{location}/{learner_name}/results.csv`.
For example, if the input location is "test/default" and the learners are lgbm, xgb_limitdepth and rf, the following command learns data-dependent defaults for binary classification tasks.
```bash
python portfolio.py --output test/default --input test/default --metafeatures test/default/all/metafeatures.csv --task binary --estimator lgbm xgb_limitdepth rf
```
It will produce the following files as output:
- test/default/lgbm/binary.json: the learned defaults for lgbm.
- test/default/xgb_limitdepth/binary.json: the learned defaults for xgb_limitdepth.
- test/default/rf/binary.json: the learned defaults for rf.
- test/default/all/binary.json: the learned defaults for lgbm, xgb_limitdepth and rf together.
Change "binary" into "multiclass" or "regression" for the other tasks.
## Reference
For more technical details, please check our research paper.
* [Mining Robust Default Configurations for Resource-constrained AutoML](https://arxiv.org/abs/2202.09927). Moe Kayali, Chi Wang. arXiv preprint arXiv:2202.09927 (2022).
```bibtex
@article{Kayali2022default,
title={Mining Robust Default Configurations for Resource-constrained AutoML},
author={Moe Kayali and Chi Wang},
year={2022},
journal={arXiv preprint arXiv:2202.09927},
}
```

18
flaml/default/__init__.py Normal file
View File

@ -0,0 +1,18 @@
from .suggest import (
suggest_config,
suggest_learner,
suggest_hyperparams,
preprocess_and_suggest_hyperparams,
meta_feature,
)
from .estimator import (
flamlize_estimator,
LGBMClassifier,
LGBMRegressor,
XGBClassifier,
XGBRegressor,
RandomForestClassifier,
RandomForestRegressor,
ExtraTreesClassifier,
ExtraTreesRegressor,
)

View File

@ -0,0 +1,946 @@
{
"version": "1.0.2",
"meta_feature_names": [
"NumberOfInstances","NumberOfFeatures","NumberOfClasses","PercentageOfNumericFeatures"
],
"portfolio": [
{
"class": "lgbm",
"hyperparameters": {
"n_estimators": 2541,
"num_leaves": 1667,
"min_child_samples": 29,
"learning_rate": 0.0016660662914022302,
"log_max_bin": 8,
"colsample_bytree": 0.5157078343718623,
"reg_alpha": 0.045792841240713165,
"reg_lambda": 0.0012362651138125363,
"FLAML_sample_size": 436899
}
},
{
"class": "lgbm",
"hyperparameters": {
"n_estimators": 141,
"num_leaves": 139,
"min_child_samples": 8,
"learning_rate": 0.04824748268727149,
"log_max_bin": 9,
"colsample_bytree": 0.5261441571042451,
"reg_alpha": 0.002896920833899335,
"reg_lambda": 0.024463247502165594
}
},
{
"class": "lgbm",
"hyperparameters": {
"n_estimators": 31204,
"num_leaves": 4,
"min_child_samples": 3,
"learning_rate": 0.009033979476164342,
"log_max_bin": 10,
"colsample_bytree": 0.5393339924944204,
"reg_alpha": 15.800090067239827,
"reg_lambda": 34.82471227276953
}
},
{
"class": "lgbm",
"hyperparameters": {
"n_estimators": 362,
"num_leaves": 1208,
"min_child_samples": 8,
"learning_rate": 0.02070742242160566,
"log_max_bin": 4,
"colsample_bytree": 0.37915528071680865,
"reg_alpha": 0.002982599447751338,
"reg_lambda": 1.136605174453919,
"FLAML_sample_size": 337147
}
},
{
"class": "lgbm",
"hyperparameters": {}
},
{
"class": "xgboost",
"hyperparameters": {
"n_estimators": 319,
"max_leaves": 1312,
"min_child_weight": 0.001,
"learning_rate": 0.01872379806270421,
"subsample": 0.6890079660561895,
"colsample_bylevel": 0.7551225121854014,
"colsample_bytree": 0.7860755604500558,
"reg_alpha": 0.17028752704343114,
"reg_lambda": 1.4375743264564231
}
},
{
"class": "xgboost",
"hyperparameters": {
"n_estimators": 7902,
"max_leaves": 49,
"min_child_weight": 0.038063497848955595,
"learning_rate": 0.0009765625,
"subsample": 0.9357800695141445,
"colsample_bylevel": 0.47031312177249246,
"colsample_bytree": 0.9053386579586192,
"reg_alpha": 1.5286102593845932,
"reg_lambda": 18.96811296717419
}
},
{
"class": "xgboost",
"hyperparameters": {
"n_estimators": 13499,
"max_leaves": 60,
"min_child_weight": 0.008494221584011285,
"learning_rate": 0.006955765856675575,
"subsample": 0.5965241023754743,
"colsample_bylevel": 0.590641168068946,
"colsample_bytree": 1.0,
"reg_alpha": 0.2522240954379289,
"reg_lambda": 5.351809144038808
}
},
{
"class": "xgboost",
"hyperparameters": {
"n_estimators": 591,
"max_leaves": 16651,
"min_child_weight": 0.03356567864689129,
"learning_rate": 0.002595066436678338,
"subsample": 0.9114132805513452,
"colsample_bylevel": 0.9503441844594458,
"colsample_bytree": 0.5703338448066768,
"reg_alpha": 0.010405212349127894,
"reg_lambda": 0.05352660657433639
}
},
{
"class": "xgb_limitdepth",
"hyperparameters": {
"n_estimators": 877,
"max_depth": 11,
"min_child_weight": 0.6205465771093738,
"learning_rate": 0.013622118381700795,
"subsample": 0.566692814245426,
"colsample_bylevel": 0.8865741642101924,
"colsample_bytree": 1.0,
"reg_alpha": 0.01386336444764391,
"reg_lambda": 3.113947886074155
}
},
{
"class": "xgb_limitdepth",
"hyperparameters": {
"n_estimators": 5457,
"max_depth": 6,
"min_child_weight": 0.19978269031877885,
"learning_rate": 0.003906732665632749,
"subsample": 0.8207785234496902,
"colsample_bylevel": 0.8438751931476698,
"colsample_bytree": 0.42202862997585794,
"reg_alpha": 0.017372558844968737,
"reg_lambda": 0.03977802121721031
}
},
{
"class": "xgb_limitdepth",
"hyperparameters": {
"n_estimators": 3526,
"max_depth": 13,
"min_child_weight": 0.0994486725676356,
"learning_rate": 0.0009765625,
"subsample": 0.46123759274652554,
"colsample_bylevel": 1.0,
"colsample_bytree": 0.4498813776397717,
"reg_alpha": 0.002599398546499414,
"reg_lambda": 0.028336396854402753
}
},
{
"class": "xgb_limitdepth",
"hyperparameters": {}
},
{
"class": "rf",
"hyperparameters": {
"n_estimators": 501,
"max_features": 0.24484242524861066,
"max_leaves": 1156,
"criterion": "entropy"
}
},
{
"class": "rf",
"hyperparameters": {
"n_estimators": 356,
"max_features": 0.1,
"max_leaves": 102,
"criterion": "gini"
}
},
{
"class": "rf",
"hyperparameters": {
"n_estimators": 1000,
"max_features": 0.1779692423238241,
"max_leaves": 7499,
"criterion": "gini"
}
},
{
"class": "rf",
"hyperparameters": {}
},
{
"class": "extra_tree",
"hyperparameters": {
"n_estimators": 1080,
"max_features": 1.0,
"max_leaves": 590,
"criterion": "entropy"
}
},
{
"class": "extra_tree",
"hyperparameters": {
"n_estimators": 2047,
"max_features": 0.46132798093546956,
"max_leaves": 12856,
"criterion": "gini"
}
},
{
"class": "extra_tree",
"hyperparameters": {
"n_estimators": 408,
"max_features": 0.3629795757973625,
"max_leaves": 81,
"criterion": "entropy"
}
},
{
"class": "extra_tree",
"hyperparameters": {
"n_estimators": 553,
"max_features": 0.9592132391435095,
"max_leaves": 1127,
"criterion": "entropy"
}
},
{
"class": "extra_tree",
"hyperparameters": {}
}
],
"preprocessing": {
"center": [
18000.0,
28.0,
2.0,
0.7565217391304347
],
"scale": [
42124.0,
130.0,
1.0,
0.5714285714285715
]
},
"neighbors": [
{
"features": [
1.196467571930491,
1.0923076923076922,
0.0,
0.4260869565217391
],
"choice": [
5,
18,
19,
4,
8,
3,
9,
7,
10,
6,
21,
2,
20,
17,
13,
16,
15,
1,
14,
12,
0,
11
]
},
{
"features": [
11.096856898680088,
-0.16153846153846155,
0.0,
-0.5739130434782609
],
"choice": [
0,
5,
7,
9,
11,
8,
1,
18,
15,
12,
3,
2,
10,
20,
4,
6,
13,
17,
14,
19,
16,
21
]
},
{
"features": [
8.658152122305575,
0.38461538461538464,
0.0,
-0.7405797101449274
],
"choice": [
7,
9,
2,
5,
10,
1,
0,
3,
12,
4,
6,
11,
8,
18,
15,
13,
20,
16,
17,
21,
14,
19
]
},
{
"features": [
0.27281359794891274,
-0.14615384615384616,
0.0,
-1.3239130434782607
],
"choice": [
8,
11,
0,
5,
1,
15,
13,
16,
10,
9,
20,
7,
17,
12,
4,
3,
21,
18,
6,
14,
19,
2
]
},
{
"features": [
-0.4125676573924604,
-0.1076923076923077,
0.0,
-0.5739130434782609
],
"choice": [
19,
15,
11,
17,
8,
14,
13,
16,
3,
18,
21,
6,
9,
10,
20,
5,
7,
1,
0,
12,
2,
4
]
},
{
"features": [
0.6409647706770487,
1.5538461538461539,
0.0,
0.0
],
"choice": [
2,
14,
10,
19,
6,
0,
1,
4,
11,
3,
5,
17,
9,
13,
12,
20,
7,
15,
18,
8,
16,
21
]
},
{
"features": [
2.3515573069983855,
0.16923076923076924,
0.0,
0.4260869565217391
],
"choice": [
7,
9,
10,
5,
2,
0,
3,
1,
12,
4,
6,
11,
18,
8,
15,
13,
16,
21,
20,
17,
14,
19
]
},
{
"features": [
0.6162045389801538,
-0.1076923076923077,
0.0,
-0.5739130434782609
],
"choice": [
10,
12,
1,
4,
11,
6,
9,
0,
2,
5,
3,
7,
8,
13,
20,
17,
15,
14,
16,
19,
18,
21
]
},
{
"features": [
0.5386240622922799,
-0.09230769230769231,
0.0,
-0.5582880434782608
],
"choice": [
1,
0,
5,
11,
10,
9,
6,
4,
3,
20,
17,
18,
13,
15,
16,
8,
7,
2,
12,
21,
19,
14
]
},
{
"features": [
-0.41133320672300827,
-0.18461538461538463,
0.0,
0.4260869565217391
],
"choice": [
14,
9,
7,
10,
15,
13,
3,
6,
16,
5,
19,
2,
12,
18,
4,
21,
20,
0,
11,
17,
1,
8
]
},
{
"features": [
-0.31155635742094767,
12.36923076923077,
0.0,
0.3865087169129372
],
"choice": [
7,
2,
6,
10,
3,
0,
9,
20,
5,
1,
18,
11,
8,
17,
4,
13,
15,
12,
14,
16,
19,
21
]
},
{
"features": [
-0.40594435476213087,
-0.06153846153846154,
0.0,
-0.7114130434782607
],
"choice": [
9,
5,
6,
1,
0,
13,
15,
7,
19,
4,
16,
3,
10,
12,
11,
18,
14,
8,
17,
20,
21,
2
]
},
{
"features": [
0.0,
32.83076923076923,
0.0,
0.4260869565217391
],
"choice": [
20,
17,
0,
1,
18,
3,
13,
9,
10,
5,
11,
15,
2,
4,
12,
16,
14,
19,
21
]
},
{
"features": [
1.6675766783781218,
0.0,
0.0,
0.4260869565217391
],
"choice": [
7,
9,
5,
0,
1,
10,
6,
11,
4,
2,
12,
3,
8,
15,
13,
18,
16,
20,
17,
21,
14,
19
]
},
{
"features": [
-0.36356946158959264,
0.8923076923076924,
0.0,
-1.2266908212560386
],
"choice": [
8,
15,
3,
13,
16,
11,
4,
0,
20,
6,
14,
5,
1,
21,
17,
9,
10,
18,
19,
7,
12,
2
]
},
{
"features": [
-0.38225239768303104,
-0.05384615384615385,
0.0,
0.4260869565217391
],
"choice": [
16,
13,
15,
18,
17,
14,
20,
8,
10,
9,
3,
7,
19,
21,
11,
1,
5,
0,
6,
4,
2,
12
]
},
{
"features": [
-0.3590352293229513,
0.06153846153846154,
0.0,
-1.3239130434782607
],
"choice": [
7,
9,
10,
4,
5,
17,
19,
20,
12,
18,
6,
13,
16,
0,
1,
3,
15,
21,
14,
11,
8,
2
]
},
{
"features": [
0.3090399772101415,
0.6923076923076923,
0.0,
-0.003997789240972687
],
"choice": [
7,
9,
10,
1,
12,
5,
3,
4,
0,
11,
20,
8,
17,
13,
6,
15,
16,
21,
18,
2,
14,
19
]
},
{
"features": [
-0.3118649700883107,
-0.17692307692307693,
0.0,
0.4260869565217391
],
"choice": [
20,
18,
21,
17,
7,
9,
15,
13,
1,
16,
4,
12,
5,
0,
10,
14,
6,
11,
8,
3,
2,
19
]
},
{
"features": [
0.0,
32.83076923076923,
0.0,
0.4260869565217391
],
"choice": [
9,
10,
0,
5,
1,
12,
3,
4,
2,
21,
11,
16,
18,
20,
15,
8,
17,
13,
14,
19
]
},
{
"features": [
-0.3178473079479632,
-0.06153846153846154,
0.0,
0.4260869565217391
],
"choice": [
18,
17,
20,
1,
5,
21,
0,
8,
4,
3,
10,
12,
9,
13,
11,
6,
16,
15,
7,
19,
14,
2
]
}
],
"configsource": [
"lgbm/Airlines",
"lgbm/riccardo",
"lgbm/fried",
"lgbm/Dionis",
"lgbm/default",
"xgboost/fabert",
"xgboost/bng_lowbwt",
"xgboost/pol",
"xgboost/Amazon_employee_access",
"xgb_limitdepth/Jannis",
"xgb_limitdepth/adult",
"xgb_limitdepth/Amazon_employee_access",
"xgb_limitdepth/default",
"rf/Amazon_employee_access",
"rf/kc1",
"rf/Helena",
"rf/default",
"extra_tree/segment",
"extra_tree/Helena",
"extra_tree/kr-vs-kp",
"extra_tree/bank-marketing",
"extra_tree/default"
]
}

File diff suppressed because it is too large

View File

@ -0,0 +1,885 @@
{
"version": "1.0.2",
"meta_feature_names": [
"NumberOfInstances","NumberOfFeatures","NumberOfClasses","PercentageOfNumericFeatures"
],
"portfolio": [
{
"class": "lgbm",
"hyperparameters": {
"n_estimators": 4797,
"num_leaves": 122,
"min_child_samples": 2,
"learning_rate": 0.022635758411078528,
"log_max_bin": 9,
"colsample_bytree": 0.7019911744574896,
"reg_alpha": 0.004252223402511765,
"reg_lambda": 0.11288241427227624
}
},
{
"class": "lgbm",
"hyperparameters": {
"n_estimators": 1009,
"num_leaves": 42,
"min_child_samples": 12,
"learning_rate": 0.02167229637171611,
"log_max_bin": 7,
"colsample_bytree": 0.7385038460573171,
"reg_alpha": 0.003607184551842614,
"reg_lambda": 12.08340803550741
}
},
{
"class": "lgbm",
"hyperparameters": {
"n_estimators": 32767,
"num_leaves": 372,
"min_child_samples": 4,
"learning_rate": 0.03517259015200922,
"log_max_bin": 5,
"colsample_bytree": 1.0,
"reg_alpha": 0.02271142170225636,
"reg_lambda": 0.001963791798843179,
"FLAML_sample_size": 830258
}
},
{
"class": "lgbm",
"hyperparameters": {}
},
{
"class": "xgboost",
"hyperparameters": {
"n_estimators": 6357,
"max_leaves": 206,
"min_child_weight": 1.9495322566288034,
"learning_rate": 0.0068766724195393905,
"subsample": 0.9451618245005704,
"colsample_bylevel": 0.9030482524943064,
"colsample_bytree": 0.9278972006416252,
"reg_alpha": 0.01857648400903689,
"reg_lambda": 6.021166480604588,
"FLAML_sample_size": 344444
}
},
{
"class": "xgboost",
"hyperparameters": {
"n_estimators": 23045,
"max_leaves": 247,
"min_child_weight": 0.004319397499079841,
"learning_rate": 0.0032914413473281215,
"subsample": 0.7334190564433234,
"colsample_bylevel": 1.0,
"colsample_bytree": 1.0,
"reg_alpha": 0.03514226467919635,
"reg_lambda": 1.2679661021665851
}
},
{
"class": "xgboost",
"hyperparameters": {
"n_estimators": 1899,
"max_leaves": 59,
"min_child_weight": 0.013389019900720164,
"learning_rate": 0.0028943401472847964,
"subsample": 0.7808944208233943,
"colsample_bylevel": 1.0,
"colsample_bytree": 0.9999355357362375,
"reg_alpha": 0.7905117773932884,
"reg_lambda": 2.916897119216104
}
},
{
"class": "xgboost",
"hyperparameters": {
"n_estimators": 5611,
"max_leaves": 61,
"min_child_weight": 0.01070518287797225,
"learning_rate": 0.005485127037677848,
"subsample": 0.4713518256961299,
"colsample_bylevel": 0.9777437906530106,
"colsample_bytree": 0.9519335125615331,
"reg_alpha": 0.03621564207188963,
"reg_lambda": 1.8045765669466283
}
},
{
"class": "xgb_limitdepth",
"hyperparameters": {
"n_estimators": 4923,
"max_depth": 12,
"min_child_weight": 0.7625732991776795,
"learning_rate": 0.009239549681857523,
"subsample": 0.8193164619615052,
"colsample_bylevel": 0.7785754297307862,
"colsample_bytree": 0.788491073979525,
"reg_alpha": 0.002282749364196872,
"reg_lambda": 131.2194560716441
}
},
{
"class": "xgb_limitdepth",
"hyperparameters": {
"n_estimators": 2111,
"max_depth": 9,
"min_child_weight": 3.405822241186395,
"learning_rate": 0.005804247705198151,
"subsample": 0.37848422782052427,
"colsample_bylevel": 0.8228350674288559,
"colsample_bytree": 0.8813475713109656,
"reg_alpha": 0.009761356063132219,
"reg_lambda": 13.187783936727843,
"FLAML_sample_size": 810000
}
},
{
"class": "xgb_limitdepth",
"hyperparameters": {
"n_estimators": 1499,
"max_depth": 11,
"min_child_weight": 0.07563529776156448,
"learning_rate": 0.039042609221240955,
"subsample": 0.7832981935783824,
"colsample_bylevel": 1.0,
"colsample_bytree": 1.0,
"reg_alpha": 0.0009765625,
"reg_lambda": 23.513066752844153
}
},
{
"class": "xgb_limitdepth",
"hyperparameters": {
"n_estimators": 19722,
"max_depth": 11,
"min_child_weight": 6.46800727978204,
"learning_rate": 0.0010837437950202355,
"subsample": 0.49509562408032115,
"colsample_bylevel": 1.0,
"colsample_bytree": 0.8826299329274134,
"reg_alpha": 0.23887161121959208,
"reg_lambda": 15.163773888208217
}
},
{
"class": "xgb_limitdepth",
"hyperparameters": {
"n_estimators": 544,
"max_depth": 12,
"min_child_weight": 79.32555867011995,
"learning_rate": 0.010128107120014433,
"subsample": 0.9799974977817297,
"colsample_bylevel": 0.881815418056542,
"colsample_bytree": 0.9718556912196423,
"reg_alpha": 72.63148950428749,
"reg_lambda": 1.4601415712058006
}
},
{
"class": "xgb_limitdepth",
"hyperparameters": {}
},
{
"class": "rf",
"hyperparameters": {
"n_estimators": 960,
"max_features": 0.694616932858775,
"max_leaves": 8937
}
},
{
"class": "rf",
"hyperparameters": {
"n_estimators": 2047,
"max_features": 1.0,
"max_leaves": 32767,
"FLAML_sample_size": 830258
}
},
{
"class": "rf",
"hyperparameters": {
"n_estimators": 2047,
"max_features": 0.6683903035731483,
"max_leaves": 591,
"criterion": "entropy"
}
},
{
"class": "rf",
"hyperparameters": {}
},
{
"class": "extra_tree",
"hyperparameters": {
"n_estimators": 1233,
"max_features": 1.0,
"max_leaves": 6452
}
},
{
"class": "extra_tree",
"hyperparameters": {}
},
{
"class": "extra_tree",
"hyperparameters": {
"n_estimators": 346,
"max_features": 1.0,
"max_leaves": 1007,
"criterion": "entropy"
}
},
{
"class": "extra_tree",
"hyperparameters": {
"n_estimators": 2047,
"max_features": 0.5106397565689275,
"max_leaves": 32767,
"FLAML_sample_size": 319382
}
}
],
"preprocessing": {
"center": [
36691.0,
10.0,
0.0,
0.85
],
"scale": [
463680.0,
8.5,
1.0,
0.48611111111111116
]
},
"neighbors": [
{
"features": [
0.0,
0.0,
0.0,
0.3085714285714286
],
"choice": [
3,
6,
12,
1,
16,
20,
7,
13,
9,
8,
4,
11,
0,
14,
18,
15,
5,
17,
10,
21,
2,
19
]
},
{
"features": [
0.6972675120772946,
10.588235294117647,
0.0,
0.3085714285714286
],
"choice": [
19,
18,
21,
20
]
},
{
"features": [
-0.05244133885438233,
3.5294117647058822,
0.0,
0.3085714285714286
],
"choice": [
1,
0,
3,
14,
17,
15,
16,
10,
8,
18,
2,
19,
20,
4,
21,
13,
9,
5,
7,
11,
6,
12
]
},
{
"features": [
1.8618637853692201,
-0.11764705882352941,
0.0,
-0.3771428571428571
],
"choice": [
12,
7,
4,
9,
13,
8,
1,
6,
3,
5,
16,
10,
0,
18,
14,
20,
15,
17,
19,
2,
21
]
},
{
"features": [
0.1472675120772947,
-0.11764705882352941,
0.0,
-1.52
],
"choice": [
1,
12,
9,
3,
7,
6,
11,
13,
16,
20,
8,
4,
18,
0,
10,
14,
21,
5,
15,
17,
2,
19
]
},
{
"features": [
-0.045171238785369223,
-0.11764705882352941,
0.0,
-0.3771428571428571
],
"choice": [
12,
6,
1,
3,
16,
9,
20,
15,
14,
11,
7,
21,
18,
17,
4,
8,
19,
5,
13,
0,
10,
2
]
},
{
"features": [
1.8618637853692201,
9.411764705882353,
0.0,
0.3085714285714286
],
"choice": [
19,
18,
21,
20
]
},
{
"features": [
-0.018758626639061422,
-0.11764705882352941,
0.0,
-1.2914285714285714
],
"choice": [
6,
3,
12,
9,
1,
16,
20,
13,
7,
11,
8,
18,
4,
14,
10,
15,
0,
17,
21,
5,
19,
2
]
},
{
"features": [
1.8618637853692201,
0.9411764705882353,
0.0,
-0.6057142857142855
],
"choice": [
0,
5,
4,
8,
10,
12,
7,
9,
1,
2,
13,
3,
6,
14,
19,
17,
21,
18,
16,
20
]
},
{
"features": [
1.8618637853692201,
0.0,
0.0,
-1.5428571428571427
],
"choice": [
9,
7,
1,
4,
6,
3,
12,
13,
0,
8,
10,
5,
14,
16,
20,
18,
21,
15,
2,
17,
19
]
},
{
"features": [
0.2647105762594893,
0.0,
0.0,
0.3085714285714286
],
"choice": [
12,
6,
1,
3,
13,
7,
16,
9,
20,
0,
8,
4,
11,
14,
18,
5,
10,
15,
17,
21,
2,
19
]
},
{
"features": [
-0.058378623188405795,
0.23529411764705882,
0.0,
-0.3771428571428571
],
"choice": [
0,
3,
1,
2
]
},
{
"features": [
0.0,
0.0,
0.0,
0.3085714285714286
],
"choice": [
7,
9,
1,
11,
8,
0,
4,
5,
6,
3,
10,
2,
13,
12,
19,
18,
21,
15,
14,
17,
20,
16
]
},
{
"features": [
-0.03490769496204279,
0.7058823529411765,
0.0,
0.3085714285714286
],
"choice": [
7,
11,
5,
4,
9,
1,
8,
3,
6,
0,
10,
2,
17,
12,
15,
14,
16,
13,
19,
18,
21,
20
]
},
{
"features": [
-0.03490769496204279,
-0.23529411764705882,
0.0,
0.3085714285714286
],
"choice": [
6,
4,
8,
5,
7,
9,
11,
10,
3,
1,
18,
12,
21,
19,
0,
14,
16,
20,
15,
13,
17,
2
]
},
{
"features": [
-0.03906789164941339,
-0.23529411764705882,
0.0,
0.3085714285714286
],
"choice": [
0,
4,
7,
5,
11,
1,
8,
10,
9,
6,
12,
3,
13,
14,
15,
17,
16,
2,
21,
18,
19,
20
]
},
{
"features": [
0.0,
0.0,
0.0,
-0.3085714285714286
],
"choice": [
18,
19,
20,
10,
15,
17,
5,
11,
14,
4,
7,
9,
21,
8,
3,
6,
13,
1,
16,
12,
0,
2
]
},
{
"features": [
1.050207039337474,
0.9411764705882353,
0.0,
-0.7199999999999999
],
"choice": [
17,
15,
14,
16
]
},
{
"features": [
0.686201690821256,
-0.11764705882352941,
0.0,
-1.0628571428571427
],
"choice": [
15,
17,
14,
19,
16,
18,
21,
20
]
},
{
"features": [
1.9104080400276053,
0.0,
0.0,
0.3085714285714286
],
"choice": [
10,
2,
5,
8,
0,
4,
19,
7,
9,
13,
17,
15,
18,
21,
1,
14,
12,
20,
6,
3,
16
]
},
{
"features": [
-0.050015096618357485,
4.470588235294118,
0.0,
0.3085714285714286
],
"choice": [
8,
10,
4,
7,
5,
11,
18,
6,
20,
19,
9,
14,
16,
21,
0,
3,
15,
17,
1,
2,
13,
12
]
},
{
"features": [
-0.04660973084886128,
-0.8235294117647058,
0.0,
-1.0628571428571427
],
"choice": [
11,
13,
10,
8,
9,
20,
12,
18,
19,
21
]
}
],
"configsource": [
"lgbm/houses",
"lgbm/house_8L",
"lgbm/poker",
"lgbm/default",
"xgboost/Albert",
"xgboost/mv",
"xgboost/bng_echomonths",
"xgboost/house_16H",
"xgb_limitdepth/higgs",
"xgb_limitdepth/bng_pharynx",
"xgb_limitdepth/connect-4",
"xgb_limitdepth/house_16H",
"xgb_limitdepth/bng_echomonths",
"xgb_limitdepth/default",
"rf/houses",
"rf/poker",
"rf/bank-marketing",
"rf/default",
"extra_tree/house_16H",
"extra_tree/default",
"extra_tree/dilbert",
"extra_tree/particulate-matter"
]
}
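For orientation, portfolio files like the one above are consumed by the nearest-neighbor config predictor in flaml/default/suggest.py, further down in this diff. Below is a minimal sketch of that lookup, with hypothetical meta-feature values, loading the lgbm regression portfolio file added in this commit:

```python
import json
import numpy as np

# Load one of the portfolio files added in this commit.
with open("flaml/default/lgbm/regression.json") as f:
    predictor = json.load(f)

# Hypothetical raw meta features, ordered as in "meta_feature_names":
# [NumberOfInstances, NumberOfFeatures, NumberOfClasses, PercentageOfNumericFeatures]
raw = np.array([50000.0, 12.0, 0.0, 0.75])

# Normalize with the stored robust-scaling parameters.
prep = predictor["preprocessing"]
feature = (raw - np.array(prep["center"])) / np.array(prep["scale"])

# 1-nearest-neighbor lookup over the stored tasks (L2 distance).
neighbors = predictor["neighbors"]
dists = [np.linalg.norm(feature - np.array(n["features"])) for n in neighbors]
nearest = neighbors[int(np.argmin(dists))]

# The neighbor's "choice" list indexes into "portfolio", best config first.
configs = [predictor["portfolio"][i] for i in nearest["choice"]]
print(configs[0]["class"], configs[0]["hyperparameters"])
```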

184
flaml/default/estimator.py Normal file
View File

@ -0,0 +1,184 @@
from functools import wraps
from flaml.automl.task.task import CLASSIFICATION
from .suggest import preprocess_and_suggest_hyperparams
DEFAULT_LOCATION = "default_location"
def flamlize_estimator(super_class, name: str, task: str, alternatives=None):
"""Enhance an estimator class with flaml's data-dependent default hyperparameter settings.
Example:
```python
import sklearn.ensemble as ensemble
RandomForestRegressor = flamlize_estimator(
ensemble.RandomForestRegressor, "rf", "regression"
)
```
Args:
super_class: a scikit-learn compatible estimator class.
name: a str of the estimator's name.
task: a str of the task type.
alternatives: (Optional) a list of alternative estimator names. For example,
```[("max_depth", 0, "xgboost")]``` means if the "max_depth" is set to 0
in the constructor, then look for the learned defaults for estimator "xgboost".
"""
class EstimatorClass(super_class):
"""**Enhanced with flaml's data-dependent default hyperparameter settings.**"""
@wraps(super_class.__init__)
def __init__(self, **params):
if DEFAULT_LOCATION in params:
self._default_location = params.pop(DEFAULT_LOCATION)
else:
self._default_location = None
self._params = params
super().__init__(**params)
# @classmethod
# @wraps(super_class._get_param_names)
# def _get_param_names(cls):
# return super_class._get_param_names() if hasattr(super_class, "_get_param_names") else []
def suggest_hyperparams(self, X, y):
"""Suggest hyperparameters.
Example:
```python
from flaml.default import LGBMRegressor
estimator = LGBMRegressor()
hyperparams, estimator_name, X_transformed, y_transformed = estimator.suggest_hyperparams(X_train, y_train)
print(hyperparams)
```
Args:
X: A dataframe of training data in shape n*m.
y: A series of labels in shape n*1.
Returns:
hyperparams: A dict of the hyperparameter configurations.
estimator_name: A str of the underlying estimator name, e.g., 'xgb_limitdepth'.
X_transformed: the preprocessed X.
y_transformed: the preprocessed y.
"""
estimator_name = name
if alternatives:
for alternative in alternatives:
if self._params.get(alternative[0]) == alternative[1]:
estimator_name = alternative[2]
break
estimator_name = (
"choose_xgb"
if (estimator_name == "xgb_limitdepth" and "max_depth" not in self._params)
else estimator_name
)
(
hyperparams,
estimator_class,
X_transformed,
y_transformed,
self._feature_transformer,
self._label_transformer,
) = preprocess_and_suggest_hyperparams(task, X, y, estimator_name, self._default_location)
assert estimator_class == super_class
hyperparams.update(self._params)
return hyperparams, estimator_name, X_transformed, y_transformed
@wraps(super_class.fit)
def fit(self, X, y, *args, **params):
hyperparams, estimator_name, X, y_transformed = self.suggest_hyperparams(X, y)
self.set_params(**hyperparams)
if self._label_transformer and estimator_name in [
"rf",
"extra_tree",
"xgboost",
"xgb_limitdepth",
"choose_xgb",
]:
# rf and et have trouble handling boolean labels; xgboost requires integer labels
fitted = super().fit(X, y_transformed, *args, **params)
# if hasattr(self, "_classes"):
# self._classes = self._label_transformer.classes_
# else:
self.classes_ = self._label_transformer.classes_
if "xgb" not in estimator_name:
# rf and et would do inverse transform automatically; xgb doesn't
self._label_transformer = None
else:
# lgbm doesn't need label transformation except for non-str/num labels
try:
fitted = super().fit(X, y, *args, **params)
self._label_transformer = None
except ValueError:
# Unknown label type: 'unknown'
fitted = super().fit(X, y_transformed, *args, **params)
self._classes = self._label_transformer.classes_
return fitted
@wraps(super_class.predict)
def predict(self, X, *args, **params):
if name != "lgbm" or task not in CLASSIFICATION:
X = self._feature_transformer.transform(X)
y_pred = super().predict(X, *args, **params)
if self._label_transformer and y_pred.ndim == 1:
y_pred = self._label_transformer.inverse_transform(y_pred)
return y_pred
if hasattr(super_class, "predict_proba"):
@wraps(super_class.predict_proba)
def predict_proba(self, X, *args, **params):
X_test = self._feature_transformer.transform(X)
y_pred = super().predict_proba(X_test, *args, **params)
return y_pred
EstimatorClass.__doc__ += " " + super_class.__doc__
EstimatorClass.__name__ = super_class.__name__
return EstimatorClass
try:
import sklearn.ensemble as ensemble
except ImportError:
RandomForestClassifier = RandomForestRegressor = ExtraTreesClassifier = ExtraTreesRegressor = ImportError(
"Using flaml.default.* requires scikit-learn."
)
else:
RandomForestRegressor = flamlize_estimator(ensemble.RandomForestRegressor, "rf", "regression")
RandomForestClassifier = flamlize_estimator(ensemble.RandomForestClassifier, "rf", "classification")
ExtraTreesRegressor = flamlize_estimator(ensemble.ExtraTreesRegressor, "extra_tree", "regression")
ExtraTreesClassifier = flamlize_estimator(ensemble.ExtraTreesClassifier, "extra_tree", "classification")
try:
import lightgbm
except ImportError:
LGBMRegressor = LGBMClassifier = ImportError("Using flaml.default.LGBM* requires lightgbm.")
else:
LGBMRegressor = flamlize_estimator(lightgbm.LGBMRegressor, "lgbm", "regression")
LGBMClassifier = flamlize_estimator(lightgbm.LGBMClassifier, "lgbm", "classification")
try:
import xgboost
except ImportError:
XGBClassifier = XGBRegressor = ImportError("Using flaml.default.XGB* requires xgboost.")
else:
XGBRegressor = flamlize_estimator(
xgboost.XGBRegressor,
"xgb_limitdepth",
"regression",
[("max_depth", 0, "xgboost")],
)
XGBClassifier = flamlize_estimator(
xgboost.XGBClassifier,
"xgb_limitdepth",
"classification",
[("max_depth", 0, "xgboost")],
)
# if hasattr(xgboost.XGBRegressor, "_get_param_names"):
# XGBRegressor._get_param_names = xgboost.XGBRegressor._get_param_names
# XGBClassifier._get_param_names = xgboost.XGBClassifier._get_param_names
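A short usage sketch for the flamlized estimators defined above, on synthetic data (requires scikit-learn and lightgbm; sizes and values are illustrative only):

```python
import pandas as pd
from sklearn.datasets import make_regression

from flaml.default import LGBMRegressor

# Synthetic regression data, for illustration only.
X, y = make_regression(n_samples=10000, n_features=10, random_state=0)
X, y = pd.DataFrame(X), pd.Series(y)

estimator = LGBMRegressor()
# Peek at the data-dependent defaults without training:
hyperparams, name, X_t, y_t = estimator.suggest_hyperparams(X, y)
print(name, hyperparams)

# Or fit directly; fit() looks up and applies the suggested defaults.
estimator.fit(X, y)
print(estimator.predict(X.head()))
```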

View File

@ -0,0 +1,361 @@
{
"version": "1.0.2",
"meta_feature_names": [
"NumberOfInstances","NumberOfFeatures","NumberOfClasses","PercentageOfNumericFeatures"
],
"portfolio": [
{
"class": "extra_tree",
"hyperparameters": {
"n_estimators": 1080,
"max_features": 1.0,
"max_leaves": 590,
"criterion": "entropy"
}
},
{
"class": "extra_tree",
"hyperparameters": {
"n_estimators": 2047,
"max_features": 0.46132798093546956,
"max_leaves": 12856,
"criterion": "gini"
}
},
{
"class": "extra_tree",
"hyperparameters": {
"n_estimators": 408,
"max_features": 0.3629795757973625,
"max_leaves": 81,
"criterion": "entropy"
}
},
{
"class": "extra_tree",
"hyperparameters": {
"n_estimators": 553,
"max_features": 0.9592132391435095,
"max_leaves": 1127,
"criterion": "entropy"
}
},
{
"class": "extra_tree",
"hyperparameters": {}
}
],
"preprocessing": {
"center": [
18000.0,
28.0,
2.0,
0.7565217391304347
],
"scale": [
42124.0,
130.0,
1.0,
0.5714285714285715
]
},
"neighbors": [
{
"features": [
1.196467571930491,
1.0923076923076922,
0.0,
0.4260869565217391
],
"choice": [
1,
2,
4
]
},
{
"features": [
11.096856898680088,
-0.16153846153846155,
0.0,
-0.5739130434782609
],
"choice": [
1,
3,
0,
2,
4
]
},
{
"features": [
8.658152122305575,
0.38461538461538464,
0.0,
-0.7405797101449274
],
"choice": [
1,
3,
0,
4
]
},
{
"features": [
0.27281359794891274,
-0.14615384615384616,
0.0,
-1.3239130434782607
],
"choice": [
3,
0,
4
]
},
{
"features": [
-0.4125676573924604,
-0.1076923076923077,
0.0,
-0.5739130434782609
],
"choice": [
2,
0,
1,
4
]
},
{
"features": [
0.6409647706770487,
1.5538461538461539,
0.0,
0.0
],
"choice": [
2,
0,
3,
1,
4
]
},
{
"features": [
2.3515573069983855,
0.16923076923076924,
0.0,
0.4260869565217391
],
"choice": [
1,
4
]
},
{
"features": [
0.6162045389801538,
-0.1076923076923077,
0.0,
-0.5739130434782609
],
"choice": [
3,
0,
2,
1,
4
]
},
{
"features": [
0.5386240622922799,
-0.09230769230769231,
0.0,
-0.5582880434782608
],
"choice": [
3,
0,
1,
4
]
},
{
"features": [
-0.41133320672300827,
-0.18461538461538463,
0.0,
0.4260869565217391
],
"choice": [
2,
1,
4
]
},
{
"features": [
-0.31155635742094767,
12.36923076923077,
0.0,
0.3865087169129372
],
"choice": [
3,
1,
0,
2,
4
]
},
{
"features": [
-0.40594435476213087,
-0.06153846153846154,
0.0,
-0.7114130434782607
],
"choice": [
2,
1,
0,
3,
4
]
},
{
"features": [
0.0,
32.83076923076923,
0.0,
0.4260869565217391
],
"choice": [
3,
0,
1,
2,
4
]
},
{
"features": [
1.6675766783781218,
0.0,
0.0,
0.4260869565217391
],
"choice": [
1,
3,
0,
4
]
},
{
"features": [
-0.36356946158959264,
0.8923076923076924,
0.0,
-1.2266908212560386
],
"choice": [
3,
4
]
},
{
"features": [
-0.38225239768303104,
-0.05384615384615385,
0.0,
0.4260869565217391
],
"choice": [
1,
0,
3,
2,
4
]
},
{
"features": [
-0.3590352293229513,
0.06153846153846154,
0.0,
-1.3239130434782607
],
"choice": [
0,
2,
3,
1,
4
]
},
{
"features": [
0.3090399772101415,
0.6923076923076923,
0.0,
-0.003997789240972687
],
"choice": [
3,
0,
4
]
},
{
"features": [
-0.3118649700883107,
-0.17692307692307693,
0.0,
0.4260869565217391
],
"choice": [
3,
1,
4
]
},
{
"features": [
0.0,
32.83076923076923,
0.0,
0.4260869565217391
],
"choice": [
4
]
},
{
"features": [
-0.3178473079479632,
-0.06153846153846154,
0.0,
0.4260869565217391
],
"choice": [
1,
0,
3,
4
]
}
],
"configsource": [
"segment",
"Helena",
"kr-vs-kp",
"bank-marketing",
"default"
]
}

View File

@ -0,0 +1,310 @@
{
"version": "1.0.2",
"meta_feature_names": [
"NumberOfInstances","NumberOfFeatures","NumberOfClasses","PercentageOfNumericFeatures"
],
"portfolio": [
{
"class": "extra_tree",
"hyperparameters": {
"n_estimators": 1074,
"max_features": 0.6008299059364026,
"max_leaves": 9287
}
},
{
"class": "extra_tree",
"hyperparameters": {
"n_estimators": 833,
"max_features": 0.055027081530106846,
"max_leaves": 1361,
"criterion": "gini"
}
},
{
"class": "extra_tree",
"hyperparameters": {
"n_estimators": 2047,
"max_features": 0.9560062760906606,
"max_leaves": 32767,
"criterion": "entropy",
"FLAML_sample_size": 470620
}
},
{
"class": "extra_tree",
"hyperparameters": {
"n_estimators": 812,
"max_features": 1.0,
"max_leaves": 1474,
"criterion": "entropy"
}
},
{
"class": "extra_tree",
"hyperparameters": {
"n_estimators": 2047,
"max_features": 1.0,
"max_leaves": 18344
}
},
{
"class": "extra_tree",
"hyperparameters": {}
}
],
"preprocessing": {
"center": [
24668.5,
54.0,
7.0,
1.0
],
"scale": [
57198.0,
770.5,
6.0,
1.0
]
},
"neighbors": [
{
"features": [
8.710820308402392,
0.0,
0.0,
-0.8148148148148149
],
"choice": [
2,
4,
5
]
},
{
"features": [
0.6701545508584216,
0.9474367293964958,
0.5,
0.0
],
"choice": [
2,
0,
4,
3,
5
]
},
{
"features": [
0.5945575020105598,
-0.03504218040233614,
15.5,
0.0
],
"choice": [
4,
0,
3,
2,
1,
5
]
},
{
"features": [
0.8862285394594217,
0.0,
-0.5,
0.0
],
"choice": [
2,
4,
0,
3,
5
]
},
{
"features": [
-0.2739344033008147,
9.2744970798183,
0.5,
0.0
],
"choice": [
0,
1,
3,
5
]
},
{
"features": [
0.48133676002657433,
-0.058403634003893576,
0.0,
0.0
],
"choice": [
3,
2,
4,
0,
5
]
},
{
"features": [
0.4862145529563971,
0.16353017521090202,
0.5,
0.0
],
"choice": [
2,
4,
0,
3,
5
]
},
{
"features": [
-0.40409629707332423,
-0.06229720960415315,
-0.5,
-1.0
],
"choice": [
4,
2,
0,
5
]
},
{
"features": [
-0.41428896115248787,
1.0408825438027256,
0.3333333333333333,
0.0
],
"choice": [
1,
5
]
},
{
"features": [
0.6317091506696039,
-0.015574302401038288,
-0.6666666666666666,
-1.0
],
"choice": [
0,
2,
3,
5
]
},
{
"features": [
-0.2739344033008147,
2.5256327060350423,
-0.3333333333333333,
0.0
],
"choice": [
3,
2,
4,
0,
1,
5
]
},
{
"features": [
-0.30168012867582783,
0.9682024659312135,
0.0,
0.0
],
"choice": [
1,
5
]
},
{
"features": [
0.2739344033008147,
-0.06229720960415315,
-0.6666666666666666,
0.0
],
"choice": [
3,
0,
1,
5
]
},
{
"features": [
-0.39981293052204625,
0.21025308241401688,
0.5,
0.0
],
"choice": [
4,
2,
3,
0,
5
]
},
{
"features": [
-0.3949351375922235,
-0.04931862426995458,
0.0,
0.0
],
"choice": [
3,
2,
4,
0,
5
]
},
{
"features": [
-0.41797790132522117,
-0.04672290720311486,
-0.5,
0.0
],
"choice": [
4,
3,
2,
0,
5
]
}
],
"configsource": [
"houses",
"fabert",
"Covertype",
"Amazon_employee_access",
"fried",
"default"
]
}

View File

@ -0,0 +1,312 @@
{
"version": "1.0.2",
"meta_feature_names": [
"NumberOfInstances","NumberOfFeatures","NumberOfClasses","PercentageOfNumericFeatures"
],
"portfolio": [
{
"class": "extra_tree",
"hyperparameters": {
"n_estimators": 1233,
"max_features": 1.0,
"max_leaves": 6452
}
},
{
"class": "extra_tree",
"hyperparameters": {}
},
{
"class": "extra_tree",
"hyperparameters": {
"n_estimators": 346,
"max_features": 1.0,
"max_leaves": 1007,
"criterion": "entropy"
}
},
{
"class": "extra_tree",
"hyperparameters": {
"n_estimators": 2047,
"max_features": 0.5106397565689275,
"max_leaves": 32767,
"FLAML_sample_size": 319382
}
}
],
"preprocessing": {
"center": [
36691.0,
10.0,
0.0,
1.0
],
"scale": [
474977.25,
7.5,
1.0,
0.5
]
},
"neighbors": [
{
"features": [
0.0,
0.0,
0.0,
0.0
],
"choice": [
2,
0,
3,
1
]
},
{
"features": [
0.6806831274550518,
12.0,
0.0,
0.0
],
"choice": [
1
]
},
{
"features": [
-0.05119403087200492,
4.0,
0.0,
0.0
],
"choice": [
0,
1
]
},
{
"features": [
1.817579684079606,
-0.13333333333333333,
0.0,
-0.6666666666666667
],
"choice": [
0,
3,
2,
1
]
},
{
"features": [
0.14376478031316237,
-0.13333333333333333,
0.0,
-1.7777777777777777
],
"choice": [
2,
0,
3,
1
]
},
{
"features": [
-0.044096848849076456,
-0.13333333333333333,
0.0,
-0.6666666666666667
],
"choice": [
2,
3,
0,
1
]
},
{
"features": [
1.817579684079606,
10.666666666666666,
0.0,
0.0
],
"choice": [
1
]
},
{
"features": [
-0.01831245601763032,
-0.13333333333333333,
0.0,
-1.5555555555555556
],
"choice": [
2,
0,
3,
1
]
},
{
"features": [
1.817579684079606,
1.0666666666666667,
0.0,
-0.8888888888888888
],
"choice": [
1
]
},
{
"features": [
1.817579684079606,
0.0,
0.0,
-1.8
],
"choice": [
2,
0,
3,
1
]
},
{
"features": [
0.2584144819567674,
0.0,
0.0,
0.0
],
"choice": [
2,
0,
3,
1
]
},
{
"features": [
0.0,
0.0,
0.0,
0.0
],
"choice": [
1
]
},
{
"features": [
-0.034077421602824134,
0.8,
0.0,
0.0
],
"choice": [
1
]
},
{
"features": [
-0.034077421602824134,
-0.26666666666666666,
0.0,
0.0
],
"choice": [
0,
3,
1
]
},
{
"features": [
-0.038138668746766295,
-0.26666666666666666,
0.0,
0.0
],
"choice": [
3,
0,
1
]
},
{
"features": [
0.0,
0.0,
0.0,
-0.6000000000000001
],
"choice": [
0,
1
]
},
{
"features": [
0.6698805048031248,
-0.13333333333333333,
0.0,
-1.3333333333333335
],
"choice": [
3,
1
]
},
{
"features": [
1.8649693222149062,
0.0,
0.0,
0.0
],
"choice": [
1
]
},
{
"features": [
-0.0488254963790371,
5.066666666666666,
0.0,
0.0
],
"choice": [
0,
2,
1
]
},
{
"features": [
-0.04550112663290715,
-0.9333333333333333,
0.0,
-1.3333333333333335
],
"choice": [
2,
0,
1
]
}
],
"configsource": [
"house_16H",
"default",
"dilbert",
"particulate-matter"
]
}

90
flaml/default/greedy.py Normal file
View File

@ -0,0 +1,90 @@
import numpy as np
import pandas as pd
from sklearn.preprocessing import RobustScaler
from sklearn.metrics import pairwise_distances
def _augment(row):
max, avg, id = row.max(), row.mean(), row.index[0]
return row.apply(lambda x: (x, max, avg, id))
def construct_portfolio(regret_matrix, meta_features, regret_bound):
"""The portfolio construction algorithm.
[Reference](https://arxiv.org/abs/2202.09927).
Args:
regret_matrix: A dataframe of the regret matrix.
meta_features: None or a dataframe of the metafeatures matrix.
When set to None, the algorithm uses the pure greedy strategy.
Otherwise, the algorithm uses the greedy strategy with feedback
from the nearest neighbor predictor.
regret_bound: A float of the regret bound.
Returns:
A list of configuration names.
"""
configs = []
all_configs = set(regret_matrix.index.tolist())
tasks = regret_matrix.columns
# pre-processing
if meta_features is not None:
scaler = RobustScaler()
meta_features = meta_features.loc[tasks]
meta_features.loc[:, :] = scaler.fit_transform(meta_features)
nearest_task = {}
for t in tasks:
other_meta_features = meta_features.drop(t)
dist = pd.DataFrame(
pairwise_distances(
meta_features.loc[t].to_numpy().reshape(1, -1),
other_meta_features,
metric="l2",
),
columns=other_meta_features.index,
)
nearest_task[t] = dist.idxmin(axis=1)
regret_matrix = regret_matrix.apply(_augment, axis=1)
print(regret_matrix)
def loss(configs):
"""Loss of config set `configs`, according to nearest neighbor config predictor."""
if meta_features is not None:
r = []
best_config_per_task = regret_matrix.loc[configs, :].min()
for t in tasks:
config = best_config_per_task[nearest_task[t]].iloc[0][-1]
r.append(regret_matrix[t][config][0])
else:
r = regret_matrix.loc[configs].min()
excessive_regret = (np.array(r) - regret_bound).clip(min=0).sum()
avg_regret = np.array(r).mean()
return excessive_regret, avg_regret
prev = np.inf
i = 0
eps = 1e-5
while True:
candidates = [configs + [d] for d in all_configs.difference(configs)]
losses, avg_regret = tuple(zip(*(loss(x) for x in candidates)))
sorted_losses = np.sort(losses)
if sorted_losses[1] - sorted_losses[0] < eps:
minloss = np.nanmin(losses)
print(f"tie detected at loss = {sorted_losses[0]}, using alternative metric.")
tied = np.flatnonzero(losses - minloss < eps)
losses = [(avg_regret[i], i) for i in tied]
minloss, ind = min(losses)
if minloss > prev - eps:
print(f"May be overfitting at k = {i + 1}, current = {minloss:.5f}, " f"prev = {prev:.5f}. Stopping.")
break
configs = candidates[ind]
prev = minloss
else:
configs = candidates[np.nanargmin(losses)]
i += 1
if sorted_losses[0] <= eps:
print(f"Reached target regret bound of {regret_bound}! k = {i}. Declining to pick further!")
break
return configs
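To make the greedy loop concrete, here is a toy run with a hypothetical 3-config-by-2-task regret matrix and no meta features (pure greedy strategy):

```python
import pandas as pd

from flaml.default.greedy import construct_portfolio

# Hypothetical regrets: rows are candidate configs, columns are tasks.
regret = pd.DataFrame(
    {"taskA": [0.0, 0.02, 0.05], "taskB": [0.03, 0.0, 0.04]},
    index=["cfg1", "cfg2", "default"],
)

# With meta_features=None the pure greedy strategy is used: repeatedly add the
# config that minimizes the total regret exceeding the bound, and stop once
# every task is covered within regret_bound.
portfolio = construct_portfolio(regret, None, 0.01)
print(portfolio)  # expected for this toy matrix: ['cfg2', 'cfg1']
```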

View File

@ -0,0 +1,370 @@
{
"version": "1.0.2",
"meta_feature_names": [
"NumberOfInstances","NumberOfFeatures","NumberOfClasses","PercentageOfNumericFeatures"
],
"portfolio": [
{
"class": "lgbm",
"hyperparameters": {
"n_estimators": 2541,
"num_leaves": 1667,
"min_child_samples": 29,
"learning_rate": 0.0016660662914022302,
"log_max_bin": 8,
"colsample_bytree": 0.5157078343718623,
"reg_alpha": 0.045792841240713165,
"reg_lambda": 0.0012362651138125363,
"FLAML_sample_size": 436899
}
},
{
"class": "lgbm",
"hyperparameters": {
"n_estimators": 141,
"num_leaves": 139,
"min_child_samples": 8,
"learning_rate": 0.04824748268727149,
"log_max_bin": 9,
"colsample_bytree": 0.5261441571042451,
"reg_alpha": 0.002896920833899335,
"reg_lambda": 0.024463247502165594
}
},
{
"class": "lgbm",
"hyperparameters": {
"n_estimators": 31204,
"num_leaves": 4,
"min_child_samples": 3,
"learning_rate": 0.009033979476164342,
"log_max_bin": 10,
"colsample_bytree": 0.5393339924944204,
"reg_alpha": 15.800090067239827,
"reg_lambda": 34.82471227276953
}
},
{
"class": "lgbm",
"hyperparameters": {
"n_estimators": 362,
"num_leaves": 1208,
"min_child_samples": 8,
"learning_rate": 0.02070742242160566,
"log_max_bin": 4,
"colsample_bytree": 0.37915528071680865,
"reg_alpha": 0.002982599447751338,
"reg_lambda": 1.136605174453919,
"FLAML_sample_size": 337147
}
},
{
"class": "lgbm",
"hyperparameters": {}
}
],
"preprocessing": {
"center": [
18000.0,
28.0,
2.0,
0.7565217391304347
],
"scale": [
42124.0,
130.0,
1.0,
0.5714285714285715
]
},
"neighbors": [
{
"features": [
1.196467571930491,
1.0923076923076922,
0.0,
0.4260869565217391
],
"choice": [
4
]
},
{
"features": [
11.096856898680088,
-0.16153846153846155,
0.0,
-0.5739130434782609
],
"choice": [
0,
1,
3,
2,
4
]
},
{
"features": [
8.658152122305575,
0.38461538461538464,
0.0,
-0.7405797101449274
],
"choice": [
2,
1,
0,
3,
4
]
},
{
"features": [
0.27281359794891274,
-0.14615384615384616,
0.0,
-1.3239130434782607
],
"choice": [
0,
1,
4
]
},
{
"features": [
-0.4125676573924604,
-0.1076923076923077,
0.0,
-0.5739130434782609
],
"choice": [
3,
1,
0,
2,
4
]
},
{
"features": [
0.6409647706770487,
1.5538461538461539,
0.0,
0.0
],
"choice": [
2,
0,
1,
4
]
},
{
"features": [
2.3515573069983855,
0.16923076923076924,
0.0,
0.4260869565217391
],
"choice": [
2,
0,
3,
1,
4
]
},
{
"features": [
0.6162045389801538,
-0.1076923076923077,
0.0,
-0.5739130434782609
],
"choice": [
1,
4
]
},
{
"features": [
0.5386240622922799,
-0.09230769230769231,
0.0,
-0.5582880434782608
],
"choice": [
1,
0,
4
]
},
{
"features": [
-0.41133320672300827,
-0.18461538461538463,
0.0,
0.4260869565217391
],
"choice": [
3,
2,
4
]
},
{
"features": [
-0.31155635742094767,
12.36923076923077,
0.0,
0.3865087169129372
],
"choice": [
2,
3,
0,
1,
4
]
},
{
"features": [
-0.40594435476213087,
-0.06153846153846154,
0.0,
-0.7114130434782607
],
"choice": [
1,
0,
4
]
},
{
"features": [
0.0,
32.83076923076923,
0.0,
0.4260869565217391
],
"choice": [
0,
1,
3,
2,
4
]
},
{
"features": [
1.6675766783781218,
0.0,
0.0,
0.4260869565217391
],
"choice": [
0,
1,
4
]
},
{
"features": [
-0.36356946158959264,
0.8923076923076924,
0.0,
-1.2266908212560386
],
"choice": [
3,
4
]
},
{
"features": [
-0.38225239768303104,
-0.05384615384615385,
0.0,
0.4260869565217391
],
"choice": [
3,
1,
0,
4
]
},
{
"features": [
-0.3590352293229513,
0.06153846153846154,
0.0,
-1.3239130434782607
],
"choice": [
4
]
},
{
"features": [
0.3090399772101415,
0.6923076923076923,
0.0,
-0.003997789240972687
],
"choice": [
1,
3,
4
]
},
{
"features": [
-0.3118649700883107,
-0.17692307692307693,
0.0,
0.4260869565217391
],
"choice": [
1,
4
]
},
{
"features": [
0.0,
32.83076923076923,
0.0,
0.4260869565217391
],
"choice": [
0,
1,
3,
4
]
},
{
"features": [
-0.3178473079479632,
-0.06153846153846154,
0.0,
0.4260869565217391
],
"choice": [
1,
0,
4
]
}
],
"configsource": [
"Airlines",
"riccardo",
"fried",
"Dionis",
"default"
]
}

View File

@ -0,0 +1,416 @@
{
"version": "1.0.2",
"meta_feature_names": [
"NumberOfInstances","NumberOfFeatures","NumberOfClasses","PercentageOfNumericFeatures"
],
"portfolio": [
{
"class": "lgbm",
"hyperparameters": {
"n_estimators": 134,
"num_leaves": 225,
"min_child_samples": 21,
"learning_rate": 0.10182098014295998,
"log_max_bin": 5,
"colsample_bytree": 0.6103565306428956,
"reg_alpha": 0.0009765625,
"reg_lambda": 40.413729576022625
}
},
{
"class": "lgbm",
"hyperparameters": {
"n_estimators": 3726,
"num_leaves": 155,
"min_child_samples": 4,
"learning_rate": 0.040941607728296484,
"log_max_bin": 5,
"colsample_bytree": 0.5326256194627191,
"reg_alpha": 0.7408711930398492,
"reg_lambda": 0.5467731065349226
}
},
{
"class": "lgbm",
"hyperparameters": {
"n_estimators": 573,
"num_leaves": 16,
"min_child_samples": 52,
"learning_rate": 0.2422782244991656,
"log_max_bin": 7,
"colsample_bytree": 1.0,
"reg_alpha": 0.03433194930183514,
"reg_lambda": 0.03870494540146326
}
},
{
"class": "lgbm",
"hyperparameters": {
"n_estimators": 2931,
"num_leaves": 106,
"min_child_samples": 49,
"learning_rate": 0.007146230961642236,
"log_max_bin": 7,
"colsample_bytree": 0.46947896116006055,
"reg_alpha": 0.37428758811879526,
"reg_lambda": 23.639977131692564
}
},
{
"class": "lgbm",
"hyperparameters": {
"n_estimators": 241,
"num_leaves": 58,
"min_child_samples": 2,
"learning_rate": 0.022730855281657265,
"log_max_bin": 5,
"colsample_bytree": 0.5620897082415793,
"reg_alpha": 0.0031614554887399314,
"reg_lambda": 0.02175056245188971
}
},
{
"class": "lgbm",
"hyperparameters": {
"n_estimators": 8353,
"num_leaves": 371,
"min_child_samples": 71,
"learning_rate": 0.017965875630873252,
"log_max_bin": 10,
"colsample_bytree": 0.9002082433803926,
"reg_alpha": 0.4864366003694002,
"reg_lambda": 0.024138585745106363,
"FLAML_sample_size": 470619
}
},
{
"class": "lgbm",
"hyperparameters": {
"n_estimators": 320,
"num_leaves": 24,
"min_child_samples": 53,
"learning_rate": 0.019316895546068795,
"log_max_bin": 6,
"colsample_bytree": 0.3955693254372702,
"reg_alpha": 0.0013785083170001627,
"reg_lambda": 0.04644365636517757
}
},
{
"class": "lgbm",
"hyperparameters": {
"n_estimators": 733,
"num_leaves": 11,
"min_child_samples": 94,
"learning_rate": 0.06276798296942972,
"log_max_bin": 6,
"colsample_bytree": 0.6341928918435795,
"reg_alpha": 0.5811038918218691,
"reg_lambda": 43.304997517523944
}
},
{
"class": "lgbm",
"hyperparameters": {}
}
],
"preprocessing": {
"center": [
40337.0,
54.0,
7.0,
1.0
],
"scale": [
58722.0,
766.0,
6.0,
1.0
]
},
"neighbors": [
{
"features": [
8.217925138789552,
0.0,
0.0,
-0.8148148148148149
],
"choice": [
5,
1,
0,
3,
2,
7,
4,
8
]
},
{
"features": [
5.691767991553421,
0.007832898172323759,
58.0,
0.0
],
"choice": [
0,
2,
4,
7,
6,
8
]
},
{
"features": [
0.385937127482034,
0.9530026109660574,
0.5,
0.0
],
"choice": [
3,
7,
0,
4,
1,
8
]
},
{
"features": [
0.3123020333094922,
-0.03524804177545692,
15.5,
0.0
],
"choice": [
3,
0,
7,
6,
1,
4,
5,
2,
8
]
},
{
"features": [
0.5964033922550321,
0.0,
-0.5,
0.0
],
"choice": [
3,
0,
7,
4,
8
]
},
{
"features": [
-0.5336500800381458,
9.328981723237598,
0.5,
0.0
],
"choice": [
3,
0,
4,
1,
2,
7,
6,
8
]
},
{
"features": [
0.20201968597799802,
-0.0587467362924282,
0.0,
0.0
],
"choice": [
4,
6,
1,
7,
5,
3,
0,
2,
8
]
},
{
"features": [
0.20677088655018563,
0.16449086161879894,
0.5,
0.0
],
"choice": [
3,
0,
1,
5,
7,
4,
8
]
},
{
"features": [
-0.6604339089268076,
-0.06266318537859007,
-0.5,
-1.0
],
"choice": [
8
]
},
{
"features": [
-0.6703620448894793,
1.0469973890339426,
0.3333333333333333,
0.0
],
"choice": [
4,
1,
8
]
},
{
"features": [
0.34848949286468445,
-0.015665796344647518,
-0.6666666666666666,
-1.0
],
"choice": [
1,
5,
2,
3,
0,
8
]
},
{
"features": [
-0.5336500800381458,
2.5404699738903394,
-0.3333333333333333,
0.0
],
"choice": [
2,
8
]
},
{
"features": [
-0.5606757263036,
0.9738903394255874,
0.0,
0.0
],
"choice": [
4,
1,
8
]
},
{
"features": [
0.0,
-0.06266318537859007,
-0.6666666666666666,
0.0
],
"choice": [
2,
1,
5,
8
]
},
{
"features": [
-0.6562617077075031,
0.21148825065274152,
0.5,
0.0
],
"choice": [
2,
6,
7,
5,
3,
1,
4,
8
]
},
{
"features": [
-0.6515105071353156,
-0.04960835509138381,
0.0,
0.0
],
"choice": [
6,
1,
3,
7,
5,
4,
0,
2,
8
]
},
{
"features": [
-0.6739552467559007,
-0.04699738903394256,
-0.5,
0.0
],
"choice": [
6,
7,
3,
1,
0,
4,
5,
8
]
}
],
"configsource": [
"Helena",
"connect-4",
"jungle_chess_2pcs_raw_endgame_complete",
"Jannis",
"fabert",
"Covertype",
"segment",
"APSFailure",
"default"
]
}

View File

@ -0,0 +1,281 @@
{
"version": "1.0.2",
"meta_feature_names": [
"NumberOfInstances","NumberOfFeatures","NumberOfClasses","PercentageOfNumericFeatures"
],
"portfolio": [
{
"class": "lgbm",
"hyperparameters": {
"n_estimators": 4797,
"num_leaves": 122,
"min_child_samples": 2,
"learning_rate": 0.022635758411078528,
"log_max_bin": 9,
"colsample_bytree": 0.7019911744574896,
"reg_alpha": 0.004252223402511765,
"reg_lambda": 0.11288241427227624
}
},
{
"class": "lgbm",
"hyperparameters": {
"n_estimators": 1009,
"num_leaves": 42,
"min_child_samples": 12,
"learning_rate": 0.02167229637171611,
"log_max_bin": 7,
"colsample_bytree": 0.7385038460573171,
"reg_alpha": 0.003607184551842614,
"reg_lambda": 12.08340803550741
}
},
{
"class": "lgbm",
"hyperparameters": {
"n_estimators": 32767,
"num_leaves": 372,
"min_child_samples": 4,
"learning_rate": 0.03517259015200922,
"log_max_bin": 5,
"colsample_bytree": 1.0,
"reg_alpha": 0.02271142170225636,
"reg_lambda": 0.001963791798843179,
"FLAML_sample_size": 830258
}
},
{
"class": "lgbm",
"hyperparameters": {}
}
],
"preprocessing": {
"center": [
36691.0,
10.0,
0.0,
1.0
],
"scale": [
140856.0,
3.0,
1.0,
0.33333333333333337
]
},
"neighbors": [
{
"features": [
0.0,
0.0,
0.0,
0.0
],
"choice": [
3
]
},
{
"features": [
-0.17263020389617767,
10.0,
0.0,
0.0
],
"choice": [
1,
0,
3
]
},
{
"features": [
6.129018288180837,
-0.3333333333333333,
0.0,
-1.0
],
"choice": [
1,
3
]
},
{
"features": [
0.48478588061566424,
-0.3333333333333333,
0.0,
-2.666666666666666
],
"choice": [
1,
3
]
},
{
"features": [
-0.14869796103822344,
-0.3333333333333333,
0.0,
-1.0
],
"choice": [
1,
3
]
},
{
"features": [
-0.06175100812176975,
-0.3333333333333333,
0.0,
-2.333333333333333
],
"choice": [
3
]
},
{
"features": [
6.129018288180837,
2.6666666666666665,
0.0,
-1.333333333333333
],
"choice": [
0,
1,
2,
3
]
},
{
"features": [
6.129018288180837,
0.0,
0.0,
-2.6999999999999997
],
"choice": [
1,
3
]
},
{
"features": [
0.8713934798659624,
0.0,
0.0,
0.0
],
"choice": [
1,
3
]
},
{
"features": [
-0.19217498722099166,
0.6666666666666666,
0.0,
-1.0
],
"choice": [
0,
3
]
},
{
"features": [
0.0,
0.0,
0.0,
0.0
],
"choice": [
1,
0,
3
]
},
{
"features": [
-0.11491168285341058,
2.0,
0.0,
0.0
],
"choice": [
1,
3
]
},
{
"features": [
-0.11491168285341058,
-0.6666666666666666,
0.0,
0.0
],
"choice": [
3
]
},
{
"features": [
-0.1286065201340376,
-0.6666666666666666,
0.0,
0.0
],
"choice": [
0,
1,
3
]
},
{
"features": [
0.0,
0.0,
0.0,
-0.9
],
"choice": [
3
]
},
{
"features": [
6.288819787584483,
0.0,
0.0,
0.0
],
"choice": [
2,
0,
1,
3
]
},
{
"features": [
-0.16464332367808257,
12.666666666666666,
0.0,
0.0
],
"choice": [
0,
3
]
}
],
"configsource": [
"houses",
"house_8L",
"poker",
"default"
]
}

222
flaml/default/portfolio.py Normal file
View File

@ -0,0 +1,222 @@
import pandas as pd
import numpy as np
import argparse
from pathlib import Path
import json
from sklearn.preprocessing import RobustScaler
from flaml.default import greedy
from flaml.default.regret import load_result, build_regret
from flaml.version import __version__
regret_bound = 0.01
def config_predictor_tuple(tasks, configs, meta_features, regret_matrix):
"""Config predictor represented in tuple.
The returned tuple consists of (meta_features_norm, preferences, proc).
Returns:
meta_features_norm: A dataframe of normalized meta features, one row per task.
preferences: A dataframe of configuration indices sorted by their performance per task (column).
proc: A dict of the preprocessing parameters, i.e., the RobustScaler's "center" and "scale" lists used to normalize the meta features.
"""
# pre-processing
scaler = RobustScaler()
meta_features_norm = meta_features.loc[tasks] # this makes a copy
meta_features_norm.loc[:, :] = scaler.fit_transform(meta_features_norm)
proc = {
"center": scaler.center_.tolist(),
"scale": scaler.scale_.tolist(),
}
# best model for each dataset in training
# choices = regret_matrix[tasks].loc[configs].reset_index(drop=True).idxmin()
# break ties using the order in configs
regret = (
regret_matrix[tasks]
.loc[configs]
.reset_index(drop=True)
.apply(lambda row: row.apply(lambda x: (x, row.name)), axis=1)
)
print(regret)
preferences = pd.DataFrame(np.argsort(regret, axis=0), columns=regret.columns)
print(preferences)
return (meta_features_norm, preferences, proc)
def build_portfolio(meta_features, regret, strategy):
"""Build a portfolio from meta features and regret matrix.
Args:
meta_features: A dataframe of the metafeatures matrix.
regret: A dataframe of the regret matrix.
strategy: A str of the strategy, one of ("greedy", "greedy-feedback").
"""
assert strategy in ("greedy", "greedy-feedback")
if strategy == "greedy":
portfolio = greedy.construct_portfolio(regret, None, regret_bound)
elif strategy == "greedy-feedback":
portfolio = greedy.construct_portfolio(regret, meta_features, regret_bound)
if "default" not in portfolio and "default" in regret.index:
portfolio += ["default"]
return portfolio
def load_json(filename):
"""Returns the contents of json file filename."""
with open(filename, "r") as f:
return json.load(f)
def _filter(preference, regret):
"""Remove choices after default or have NaN regret."""
try:
last = regret.index.get_loc("default") # len(preference) - 1
preference = preference[: preference[preference == last].index[0] + 1]
except KeyError: # no "default"
pass
finally:
regret = regret.reset_index(drop=True)
preference = preference[regret[preference].notna().to_numpy()]
# regret = regret[preference].reset_index(drop=True)
# dup = regret[regret.duplicated()]
# if not dup.empty:
# # break ties using the order in configs
# unique = dup.drop_duplicates()
# for u in unique:
# subset = regret == u
# preference[subset].sort_values(inplace=True)
# # raise ValueError(preference)
return preference.tolist()
def serialize(configs, regret, meta_features, output_file, config_path):
"""Store to disk all information FLAML-metalearn needs at runtime.
configs: names of model configs
regret: regret matrix
meta_features: task metafeatures
output_file: filename
config_path: path containing config json files
"""
output_file = Path(output_file)
# delete if exists
try:
output_file.unlink()
except FileNotFoundError:
pass
meta_features_norm, preferences, proc = config_predictor_tuple(regret.columns, configs, meta_features, regret)
portfolio = [load_json(config_path.joinpath(m + ".json")) for m in configs]
regret = regret.loc[configs]
meta_predictor = {
"version": __version__,
"meta_feature_names": list(meta_features.columns),
"portfolio": portfolio,
"preprocessing": proc,
"neighbors": [
{"features": x.tolist(), "choice": _filter(preferences[y], regret[y])}
for x, y in zip(meta_features_norm.to_records(index=False), preferences.columns)
],
"configsource": list(configs),
}
with open(output_file, "w+") as f:
json.dump(meta_predictor, f, indent=4)
return meta_predictor
# def analyze(regret_matrix, meta_predictor):
# tasks = regret_matrix.columns
# neighbors = meta_predictor["neighbors"]
# from sklearn.neighbors import NearestNeighbors
# nn = NearestNeighbors(n_neighbors=1)
# for i, task in enumerate(neighbors):
# other_tasks = [j for j in range(len(neighbors)) if j != i]
# # find the nn and the regret
# nn.fit([neighbors[j]["features"] for j in other_tasks])
# dist, ind = nn.kneighbors(
# np.array(task["features"]).reshape(1, -1), return_distance=True
# )
# ind = other_tasks[int(ind.item())]
# choice = int(neighbors[ind]["choice"][0])
# r = regret_matrix.iloc[choice, i]
# if r > regret_bound:
# label = "outlier"
# else:
# label = "normal"
# print(tasks[i], label, tasks[ind], "dist", dist, "regret", r)
# # find the best model and the regret
# regrets = regret_matrix.iloc[other_tasks, i]
# best = regrets.min()
# if best > regret_bound:
# print(tasks[i], "best_regret", best, "task", regrets.idxmin())
def main():
parser = argparse.ArgumentParser(description="Build a portfolio.")
parser.add_argument("--strategy", help="One of {greedy, greedy-feedback}", default="greedy")
parser.add_argument("--input", help="Input path")
parser.add_argument("--metafeatures", help="CSV of task metafeatures")
parser.add_argument("--exclude", help="One task name to exclude (for LOO purposes)")
parser.add_argument("--output", help="Location to write portfolio JSON")
parser.add_argument("--task", help="Task to merge portfolios", default="binary")
parser.add_argument(
"--estimator",
help="Estimators to merge portfolios",
default=["lgbm", "xgboost"],
nargs="+",
)
args = parser.parse_args()
meta_features = pd.read_csv(args.metafeatures, index_col=0).groupby(level=0).first()
if args.exclude:
meta_features.drop(args.exclude, inplace=True)
baseline_best = None
all_results = None
for estimator in args.estimator:
# produce regret
all, baseline = load_result(f"{args.input}/{estimator}/results.csv", args.task, "result")
regret = build_regret(all, baseline)
regret = regret.replace(np.inf, np.nan).dropna(axis=1, how="all")
if args.exclude:
regret = regret.loc[[i for i in regret.index if args.exclude not in i]]
regret = regret[[c for c in regret.columns if args.exclude not in c]]
print(f"Regret matrix complete: {100 * regret.count().sum() / regret.shape[0] / regret.shape[1]}%")
print(f"Num models considered: {regret.shape[0]}")
configs = build_portfolio(meta_features, regret, args.strategy)
meta_predictor = serialize(
configs,
regret,
meta_features,
f"{args.output}/{estimator}/{args.task}.json",
Path(f"{args.input}/{estimator}"),
)
configsource = meta_predictor["configsource"]
all = all.loc[configsource]
all.rename({x: f"{estimator}/{x}" for x in regret.index.values}, inplace=True)
baseline_best = baseline if baseline_best is None else pd.DataFrame({0: baseline_best, 1: baseline}).max(1)
all_results = all if all_results is None else pd.concat([all_results, all])
# analyze(regret, meta_predictor)
regrets = build_regret(all_results, baseline_best)
if len(args.estimator) > 1:
meta_predictor = serialize(
regrets.index,
regrets,
meta_features,
f"{args.output}/all/{args.task}.json",
Path(args.input),
)
if __name__ == "__main__":
# execute only if run as a script
main()
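`main()` is driven from the command line via the argparse flags above. The subtler helper is `_filter`, which truncates each neighbor's preference list at the "default" config and drops configs with NaN regret; a toy check (values hypothetical):

```python
import pandas as pd

from flaml.default.portfolio import _filter

# Regret of four configs on one task; "default" sits at row position 3.
regret = pd.Series([0.0, float("nan"), 0.02, 0.01], index=["a", "b", "c", "default"])
# Row positions sorted by preference; 3 ("default") is ranked before 2 ("c").
preference = pd.Series([0, 1, 3, 2])

print(_filter(preference, regret))
# expected: [0, 3] -- "c" is ranked after "default" and is cut off, and "b"
# is dropped because its regret is NaN.
```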

42
flaml/default/regret.py Normal file
View File

@ -0,0 +1,42 @@
import argparse
from os import path
import pandas as pd
def build_regret(all, baseline):
all = all[all.columns.intersection(baseline.index)]
return baseline - all
def write_regret(regret, filename):
regret.to_csv(filename)
def load_result(filename, task_type, metric):
df = pd.read_csv(filename)
df = df.loc[
(df[metric].notnull()) & (df.type == task_type),
["task", "fold", "params", metric],
]
df["params"] = df["params"].apply(lambda x: path.splitext(path.basename(eval(x)["_modeljson"]))[0])
baseline = df.loc[df["task"] == df["params"], ["task", metric]].groupby("task").mean()[metric]
df = df.pivot_table(index="params", columns="task", values=metric)
return df, baseline
def main():
parser = argparse.ArgumentParser(description="Build a regret matrix.")
parser.add_argument("--result_csv", help="File of experiment results")
parser.add_argument("--task_type", help="Type of task")
parser.add_argument("--metric", help="Metric for calculating regret", default="result")
parser.add_argument("--output", help="Location to write regret CSV to")
args = parser.parse_args()
all, baseline = load_result(args.result_csv, args.task_type, args.metric)
regret = build_regret(all, baseline)
write_regret(regret, args.output)
if __name__ == "__main__":
# execute only if run as a script
main()
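A toy illustration of the regret construction (scores hypothetical): `build_regret` aligns the score matrix with the per-task baseline and returns `baseline - all`, so positive entries mean a config underperformed that task's own tuned default:

```python
import pandas as pd

from flaml.default.regret import build_regret

# Scores per (config, task); higher is better for the "result" metric.
all_scores = pd.DataFrame(
    {"taskA": [0.91, 0.88], "taskB": [0.70, 0.74]},
    index=["cfg1", "cfg2"],
)
# Each task's score under its own tuned default config.
baseline = pd.Series({"taskA": 0.90, "taskB": 0.72})

print(build_regret(all_scores, baseline))
# expected (up to float rounding):
#        taskA  taskB
# cfg1   -0.01   0.02
# cfg2    0.02  -0.02
```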

View File

@ -0,0 +1,333 @@
{
"version": "1.0.2",
"meta_feature_names": [
"NumberOfInstances","NumberOfFeatures","NumberOfClasses","PercentageOfNumericFeatures"
],
"portfolio": [
{
"class": "rf",
"hyperparameters": {
"n_estimators": 501,
"max_features": 0.24484242524861066,
"max_leaves": 1156,
"criterion": "entropy"
}
},
{
"class": "rf",
"hyperparameters": {
"n_estimators": 356,
"max_features": 0.1,
"max_leaves": 102,
"criterion": "gini"
}
},
{
"class": "rf",
"hyperparameters": {
"n_estimators": 1000,
"max_features": 0.1779692423238241,
"max_leaves": 7499,
"criterion": "gini"
}
},
{
"class": "rf",
"hyperparameters": {}
}
],
"preprocessing": {
"center": [
18000.0,
28.0,
2.0,
0.7565217391304347
],
"scale": [
42124.0,
130.0,
1.0,
0.5714285714285715
]
},
"neighbors": [
{
"features": [
1.196467571930491,
1.0923076923076922,
0.0,
0.4260869565217391
],
"choice": [
0,
3
]
},
{
"features": [
11.096856898680088,
-0.16153846153846155,
0.0,
-0.5739130434782609
],
"choice": [
2,
0,
1,
3
]
},
{
"features": [
8.658152122305575,
0.38461538461538464,
0.0,
-0.7405797101449274
],
"choice": [
2,
0,
3
]
},
{
"features": [
0.27281359794891274,
-0.14615384615384616,
0.0,
-1.3239130434782607
],
"choice": [
2,
0,
3
]
},
{
"features": [
-0.4125676573924604,
-0.1076923076923077,
0.0,
-0.5739130434782609
],
"choice": [
2,
1,
0,
3
]
},
{
"features": [
0.6409647706770487,
1.5538461538461539,
0.0,
0.0
],
"choice": [
1,
0,
2,
3
]
},
{
"features": [
2.3515573069983855,
0.16923076923076924,
0.0,
0.4260869565217391
],
"choice": [
2,
0,
3
]
},
{
"features": [
0.6162045389801538,
-0.1076923076923077,
0.0,
-0.5739130434782609
],
"choice": [
0,
2,
1,
3
]
},
{
"features": [
0.5386240622922799,
-0.09230769230769231,
0.0,
-0.5582880434782608
],
"choice": [
0,
2,
3
]
},
{
"features": [
-0.41133320672300827,
-0.18461538461538463,
0.0,
0.4260869565217391
],
"choice": [
1,
2,
0,
3
]
},
{
"features": [
-0.31155635742094767,
12.36923076923077,
0.0,
0.3865087169129372
],
"choice": [
0,
2,
1,
3
]
},
{
"features": [
-0.40594435476213087,
-0.06153846153846154,
0.0,
-0.7114130434782607
],
"choice": [
0,
2,
3
]
},
{
"features": [
0.0,
32.83076923076923,
0.0,
0.4260869565217391
],
"choice": [
0,
2,
3
]
},
{
"features": [
1.6675766783781218,
0.0,
0.0,
0.4260869565217391
],
"choice": [
2,
0,
3
]
},
{
"features": [
-0.36356946158959264,
0.8923076923076924,
0.0,
-1.2266908212560386
],
"choice": [
2,
0,
3
]
},
{
"features": [
-0.38225239768303104,
-0.05384615384615385,
0.0,
0.4260869565217391
],
"choice": [
3
]
},
{
"features": [
-0.3590352293229513,
0.06153846153846154,
0.0,
-1.3239130434782607
],
"choice": [
0,
3
]
},
{
"features": [
0.3090399772101415,
0.6923076923076923,
0.0,
-0.003997789240972687
],
"choice": [
0,
2,
3
]
},
{
"features": [
-0.3118649700883107,
-0.17692307692307693,
0.0,
0.4260869565217391
],
"choice": [
2,
0,
3
]
},
{
"features": [
0.0,
32.83076923076923,
0.0,
0.4260869565217391
],
"choice": [
3
]
},
{
"features": [
-0.3178473079479632,
-0.06153846153846154,
0.0,
0.4260869565217391
],
"choice": [
0,
3
]
}
],
"configsource": [
"Amazon_employee_access",
"kc1",
"Helena",
"default"
]
}

View File

@ -0,0 +1,328 @@
{
"version": "1.0.2",
"meta_feature_names": [
"NumberOfInstances","NumberOfFeatures","NumberOfClasses","PercentageOfNumericFeatures"
],
"portfolio": [
{
"class": "rf",
"hyperparameters": {
"n_estimators": 1000,
"max_features": 0.1779692423238241,
"max_leaves": 7499,
"criterion": "gini"
}
},
{
"class": "rf",
"hyperparameters": {
"n_estimators": 400,
"max_features": 0.8961466398827462,
"max_leaves": 25095,
"criterion": "entropy",
"FLAML_sample_size": 470620
}
},
{
"class": "rf",
"hyperparameters": {
"n_estimators": 470,
"max_features": 0.12698484669953783,
"max_leaves": 31499,
"criterion": "entropy"
}
},
{
"class": "rf",
"hyperparameters": {
"n_estimators": 632,
"max_features": 1.0,
"max_leaves": 1360,
"criterion": "entropy"
}
},
{
"class": "rf",
"hyperparameters": {
"n_estimators": 1713,
"max_features": 0.40966311008832224,
"max_leaves": 10210,
"criterion": "entropy",
"FLAML_sample_size": 105352
}
},
{
"class": "rf",
"hyperparameters": {}
}
],
"preprocessing": {
"center": [
40337.0,
54.0,
7.0,
1.0
],
"scale": [
58722.0,
766.0,
6.0,
1.0
]
},
"neighbors": [
{
"features": [
8.217925138789552,
0.0,
0.0,
-0.8148148148148149
],
"choice": [
1,
4,
5
]
},
{
"features": [
5.691767991553421,
0.007832898172323759,
58.0,
0.0
],
"choice": [
0,
2,
5
]
},
{
"features": [
0.385937127482034,
0.9530026109660574,
0.5,
0.0
],
"choice": [
4,
2,
1,
3,
0,
5
]
},
{
"features": [
0.3123020333094922,
-0.03524804177545692,
15.5,
0.0
],
"choice": [
0,
3,
2,
1,
5
]
},
{
"features": [
0.5964033922550321,
0.0,
-0.5,
0.0
],
"choice": [
4,
1,
3,
0,
2,
5
]
},
{
"features": [
-0.5336500800381458,
9.328981723237598,
0.5,
0.0
],
"choice": [
0,
2,
5
]
},
{
"features": [
0.20201968597799802,
-0.0587467362924282,
0.0,
0.0
],
"choice": [
1,
4,
5
]
},
{
"features": [
0.20677088655018563,
0.16449086161879894,
0.5,
0.0
],
"choice": [
4,
1,
2,
0,
3,
5
]
},
{
"features": [
-0.6604339089268076,
-0.06266318537859007,
-0.5,
-1.0
],
"choice": [
3,
1,
5
]
},
{
"features": [
-0.6703620448894793,
1.0469973890339426,
0.3333333333333333,
0.0
],
"choice": [
0,
5
]
},
{
"features": [
0.34848949286468445,
-0.015665796344647518,
-0.6666666666666666,
-1.0
],
"choice": [
4,
2,
0,
5
]
},
{
"features": [
-0.5336500800381458,
2.5404699738903394,
-0.3333333333333333,
0.0
],
"choice": [
4,
3,
1,
2,
0,
5
]
},
{
"features": [
-0.5606757263036,
0.9738903394255874,
0.0,
0.0
],
"choice": [
2,
4,
0,
3,
1,
5
]
},
{
"features": [
0.0,
-0.06266318537859007,
-0.6666666666666666,
0.0
],
"choice": [
3,
1,
4,
0,
5
]
},
{
"features": [
-0.6562617077075031,
0.21148825065274152,
0.5,
0.0
],
"choice": [
4,
0,
3,
1,
2,
5
]
},
{
"features": [
-0.6515105071353156,
-0.04960835509138381,
0.0,
0.0
],
"choice": [
1,
4,
3,
5
]
},
{
"features": [
-0.6739552467559007,
-0.04699738903394256,
-0.5,
0.0
],
"choice": [
3,
1,
4,
5
]
}
],
"configsource": [
"Helena",
"Covertype",
"Fashion-MNIST",
"jungle_chess_2pcs_raw_endgame_complete",
"MiniBooNE",
"default"
]
}

View File

@ -0,0 +1,293 @@
{
"version": "1.0.2",
"meta_feature_names": [
"NumberOfInstances","NumberOfFeatures","NumberOfClasses","PercentageOfNumericFeatures"
],
"portfolio": [
{
"class": "rf",
"hyperparameters": {
"n_estimators": 960,
"max_features": 0.694616932858775,
"max_leaves": 8937
}
},
{
"class": "rf",
"hyperparameters": {
"n_estimators": 2047,
"max_features": 1.0,
"max_leaves": 32767,
"FLAML_sample_size": 830258
}
},
{
"class": "rf",
"hyperparameters": {
"n_estimators": 2047,
"max_features": 0.6683903035731483,
"max_leaves": 591,
"criterion": "entropy"
}
},
{
"class": "rf",
"hyperparameters": {}
}
],
"preprocessing": {
"center": [
36691.0,
10.0,
0.0,
0.85
],
"scale": [
460950.5,
5.5,
1.0,
0.48611111111111116
]
},
"neighbors": [
{
"features": [
0.0,
0.0,
0.0,
0.3085714285714286
],
"choice": [
2,
0,
1,
3
]
},
{
"features": [
-0.052751868150701646,
5.454545454545454,
0.0,
0.3085714285714286
],
"choice": [
0,
3
]
},
{
"features": [
1.8728887375108607,
-0.18181818181818182,
0.0,
-0.3771428571428571
],
"choice": [
2,
0,
1,
3
]
},
{
"features": [
0.14813955077605948,
-0.18181818181818182,
0.0,
-1.52
],
"choice": [
2,
0,
1,
3
]
},
{
"features": [
-0.04543871847410948,
-0.18181818181818182,
0.0,
-0.3771428571428571
],
"choice": [
2,
1,
0,
3
]
},
{
"features": [
-0.018869705098486712,
-0.18181818181818182,
0.0,
-1.2914285714285714
],
"choice": [
2,
0,
1,
3
]
},
{
"features": [
1.8728887375108607,
1.4545454545454546,
0.0,
-0.6057142857142855
],
"choice": [
0,
3
]
},
{
"features": [
1.8728887375108607,
0.0,
0.0,
-1.5428571428571427
],
"choice": [
0,
2,
1,
3
]
},
{
"features": [
0.266278049378404,
0.0,
0.0,
0.3085714285714286
],
"choice": [
2,
0,
1,
3
]
},
{
"features": [
0.0,
0.0,
0.0,
0.3085714285714286
],
"choice": [
1,
0,
3
]
},
{
"features": [
-0.035114399485411125,
1.0909090909090908,
0.0,
0.3085714285714286
],
"choice": [
3
]
},
{
"features": [
-0.035114399485411125,
-0.36363636363636365,
0.0,
0.3085714285714286
],
"choice": [
0,
2,
1,
3
]
},
{
"features": [
-0.03929923061152987,
-0.36363636363636365,
0.0,
0.3085714285714286
],
"choice": [
0,
1,
3
]
},
{
"features": [
0.0,
0.0,
0.0,
-0.3085714285714286
],
"choice": [
1,
3
]
},
{
"features": [
1.056425798431719,
1.4545454545454546,
0.0,
-0.7199999999999999
],
"choice": [
3
]
},
{
"features": [
0.6902650067631991,
-0.18181818181818182,
0.0,
-1.0628571428571427
],
"choice": [
1,
3
]
},
{
"features": [
1.92172044503694,
0.0,
0.0,
0.3085714285714286
],
"choice": [
3
]
},
{
"features": [
-0.050311259018050745,
6.909090909090909,
0.0,
0.3085714285714286
],
"choice": [
0,
2,
1,
3
]
}
],
"configsource": [
"houses",
"poker",
"bank-marketing",
"default"
]
}

261
flaml/default/suggest.py Normal file
View File

@ -0,0 +1,261 @@
import numpy as np
import logging
import pathlib
import json
from flaml.automl.data import DataTransformer
from flaml.automl.task.task import CLASSIFICATION, get_classification_objective
from flaml.automl.task.generic_task import len_labels
from flaml.automl.task.factory import task_factory
from flaml.version import __version__
try:
from sklearn.neighbors import NearestNeighbors
except ImportError:
pass
LOCATION = pathlib.Path(__file__).parent.resolve()
logger = logging.getLogger(__name__)
CONFIG_PREDICTORS = {}
def meta_feature(task, X_train, y_train, meta_feature_names):
this_feature = []
n_row = X_train.shape[0]
n_feat = X_train.shape[1]
is_classification = task in CLASSIFICATION
for each_feature_name in meta_feature_names:
if each_feature_name == "NumberOfInstances":
this_feature.append(n_row)
elif each_feature_name == "NumberOfFeatures":
this_feature.append(n_feat)
elif each_feature_name == "NumberOfClasses":
this_feature.append(len_labels(y_train) if is_classification else 0)
elif each_feature_name == "PercentageOfNumericFeatures":
try:
# this feature is only supported for dataframe
this_feature.append(
X_train.select_dtypes(include=[np.number, "float", "int", "long"]).shape[1] / n_feat
)
except AttributeError:
# 'numpy.ndarray' object has no attribute 'select_dtypes'
this_feature.append(1) # all features are numeric
else:
raise ValueError("Feature {} not implemented. ".format(each_feature_name))
return this_feature
def load_config_predictor(estimator_name, task, location=None):
task = str(task)
key = f"{location}/{estimator_name}/{task}"
predictor = CONFIG_PREDICTORS.get(key)
if predictor:
return predictor
task = "multiclass" if task == "multi" else task # TODO: multi -> multiclass?
try:
location = location or LOCATION
with open(f"{location}/{estimator_name}/{task}.json", "r") as f:
CONFIG_PREDICTORS[key] = predictor = json.load(f)
except FileNotFoundError:
raise FileNotFoundError(f"Portfolio has not been built for {estimator_name} on {task} task.")
return predictor
def suggest_config(
task,
X,
y,
estimator_or_predictor,
location=None,
k=None,
meta_feature_fn=meta_feature,
):
"""Suggest a list of configs for the given task and training data.
The returned configs can be used as starting points for AutoML.fit().
`FLAML_sample_size` is removed from the configs.
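Example (illustrative; assumes `X_train` and `y_train` are a pandas DataFrame/Series):
```python
configs = suggest_config("regression", X_train, y_train, "lgbm", k=3)
for config in configs:
    print(config["class"], config["hyperparameters"])
```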
"""
from packaging.version import parse as version_parse
task = get_classification_objective(len_labels(y)) if task == "classification" and y is not None else task
predictor = (
load_config_predictor(estimator_or_predictor, task, location)
if isinstance(estimator_or_predictor, str)
else estimator_or_predictor
)
older_version = "1.0.2"
# TODO: update older_version when the newer code can no longer handle the older version json file
assert version_parse(__version__) >= version_parse(predictor["version"]) >= version_parse(older_version)
prep = predictor["preprocessing"]
feature = meta_feature_fn(task, X_train=X, y_train=y, meta_feature_names=predictor["meta_feature_names"])
feature = (np.array(feature) - np.array(prep["center"])) / np.array(prep["scale"])
neighbors = predictor["neighbors"]
nn = NearestNeighbors(n_neighbors=1)
nn.fit([x["features"] for x in neighbors])
dist, ind = nn.kneighbors(feature.reshape(1, -1), return_distance=True)
logger.info(f"metafeature distance: {dist.item()}")
ind = int(ind.item())
choice = neighbors[ind]["choice"] if k is None else neighbors[ind]["choice"][:k]
configs = [predictor["portfolio"][x] for x in choice]
for config in configs:
if "hyperparameters" in config:
hyperparams = config["hyperparameters"]
if hyperparams and "FLAML_sample_size" in hyperparams:
hyperparams.pop("FLAML_sample_size")
return configs
def suggest_learner(task, X, y, estimator_or_predictor="all", estimator_list=None, location=None):
"""Suggest best learner within estimator_list."""
configs = suggest_config(task, X, y, estimator_or_predictor, location)
if not estimator_list:
return configs[0]["class"]
for c in configs:
if c["class"] in estimator_list:
return c["class"]
return estimator_list[0]
def suggest_hyperparams(task, X, y, estimator_or_predictor, location=None):
"""Suggest hyperparameter configurations and an estimator class.
The configurations can be used to initialize the estimator class like lightgbm.LGBMRegressor.
Example:
```python
hyperparams, estimator_class = suggest_hyperparams("regression", X_train, y_train, "lgbm")
model = estimator_class(**hyperparams) # estimator_class is LGBMRegressor
model.fit(X_train, y_train)
```
Args:
task: A string of the task type, e.g.,
'classification', 'regression', 'ts_forecast', 'rank',
'seq-classification', 'seq-regression'.
X: A dataframe of training data in shape n*m.
For 'ts_forecast' task, the first column of X_train
must be the timestamp column (datetime type). Other
columns in the dataframe are assumed to be exogenous
variables (categorical or numeric).
y: A series of labels in shape n*1.
estimator_or_predictor: A str of the learner name or a dict of the learned config predictor.
If a dict, it contains:
- "version": a str of the version number.
- "preprocessing": a dictionary containing:
* "center": a list of meta feature value offsets for normalization.
* "scale": a list of meta feature scales to normalize each dimension.
- "neighbors": a list of dictionaries. Each dictionary contains:
* "features": a list of the normalized meta features for a neighbor.
* "choice": an integer of the configuration id in the portfolio.
- "portfolio": a list of dictionaries, each corresponding to a configuration:
* "class": a str of the learner name.
* "hyperparameters": a dict of the config. The key "FLAML_sample_size" will be ignored.
location: (Optional) A str of the location containing the mined portfolio file.
Only valid when estimator_or_predictor is a str; by default the location is flaml/default.
Returns:
hyperparams: A dict of the hyperparameter configurations.
estimator_class: A class of the underlying estimator, e.g., lightgbm.LGBMClassifier.
"""
config = suggest_config(task, X, y, estimator_or_predictor, location=location, k=1)[0]
estimator = config["class"]
task = task_factory(task)
model_class = task.estimator_class_from_str(estimator)
hyperparams = config["hyperparameters"]
model = model_class(task=task.name, **hyperparams)
estimator_class = model.estimator_class
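# `hyperparams and model.params`: if the portfolio supplied an empty config, keep the
# empty dict; otherwise use the params fully resolved by the model wrapper.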
hyperparams = hyperparams and model.params
return hyperparams, estimator_class
class AutoMLTransformer:
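"""Chains a fitted DataTransformer with the model's own preprocessing so that raw test features can be transformed in one call."""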
def __init__(self, model, data_transformer):
self._model = model
self._dt = data_transformer
def transform(self, X):
return self._model._preprocess(self._dt.transform(X))
def preprocess_and_suggest_hyperparams(
task,
X,
y,
estimator_or_predictor,
location=None,
):
"""Preprocess the data and suggest hyperparameters.
Example:
```python
hyperparams, estimator_class, X, y, feature_transformer, label_transformer = \
preprocess_and_suggest_hyperparams("classification", X_train, y_train, "xgb_limitdepth")
model = estimator_class(**hyperparams) # estimator_class is XGBClassifier
model.fit(X, y)
X_test = feature_transformer.transform(X_test)
y_pred = label_transformer.inverse_transform(pd.Series(model.predict(X_test).astype(int)))
```
Args:
task: A string of the task type, e.g.,
'classification', 'regression', 'ts_forecast', 'rank',
'seq-classification', 'seq-regression'.
X: A dataframe of training data in shape n*m.
For 'ts_forecast' task, the first column of X_train
must be the timestamp column (datetime type). Other
columns in the dataframe are assumed to be exogenous
variables (categorical or numeric).
y: A series of labels in shape n*1.
estimator_or_predictor: A str of the learner name or a dict of the learned config predictor.
"choose_xgb" means choosing between xgb_limitdepth and xgboost.
If a dict, it contains:
- "version": a str of the version number.
- "preprocessing": a dictionary containing:
* "center": a list of meta feature value offsets for normalization.
* "scale": a list of meta feature scales to normalize each dimension.
- "neighbors": a list of dictionaries. Each dictionary contains:
* "features": a list of the normalized meta features for a neighbor.
* "choice": a integer of the configuration id in the portfolio.
- "portfolio": a list of dictionaries, each corresponding to a configuration:
* "class": a str of the learner name.
* "hyperparameters": a dict of the config. They key "FLAML_sample_size" will be ignored.
location: (Optional) A str of the location containing the mined portfolio file.
Only valid when estimator_or_predictor is a str; by default the location is flaml/default.
Returns:
hyperparams: A dict of the hyperparameter configurations.
estimator_class: A class of the underlying estimator, e.g., lightgbm.LGBMClassifier.
X: the preprocessed X.
y: the preprocessed y.
feature_transformer: a data transformer that can be applied to X_test.
label_transformer: a label transformer that can be applied to y_test.
"""
dt = DataTransformer()
X, y = dt.fit_transform(X, y, task)
if "choose_xgb" == estimator_or_predictor:
# choose between xgb_limitdepth and xgboost
estimator_or_predictor = suggest_learner(
task,
X,
y,
estimator_list=["xgb_limitdepth", "xgboost"],
location=location,
)
config = suggest_config(task, X, y, estimator_or_predictor, location=location, k=1)[0]
estimator = config["class"]
model_class = task_factory(task).estimator_class_from_str(estimator)
hyperparams = config["hyperparameters"]
model = model_class(task=task, **hyperparams)
if model.estimator_class is None:
return hyperparams, model_class, X, y, None, None
else:
estimator_class = model.estimator_class
X = model._preprocess(X)
hyperparams = hyperparams and model.params
transformer = AutoMLTransformer(model, dt)
return hyperparams, estimator_class, X, y, transformer, dt.label_transformer

View File

@ -0,0 +1,329 @@
{
"version": "1.0.2",
"meta_feature_names": [
"NumberOfInstances","NumberOfFeatures","NumberOfClasses","PercentageOfNumericFeatures"
],
"portfolio": [
{
"class": "xgb_limitdepth",
"hyperparameters": {
"n_estimators": 877,
"max_depth": 11,
"min_child_weight": 0.6205465771093738,
"learning_rate": 0.013622118381700795,
"subsample": 0.566692814245426,
"colsample_bylevel": 0.8865741642101924,
"colsample_bytree": 1.0,
"reg_alpha": 0.01386336444764391,
"reg_lambda": 3.113947886074155
}
},
{
"class": "xgb_limitdepth",
"hyperparameters": {
"n_estimators": 5457,
"max_depth": 6,
"min_child_weight": 0.19978269031877885,
"learning_rate": 0.003906732665632749,
"subsample": 0.8207785234496902,
"colsample_bylevel": 0.8438751931476698,
"colsample_bytree": 0.42202862997585794,
"reg_alpha": 0.017372558844968737,
"reg_lambda": 0.03977802121721031
}
},
{
"class": "xgb_limitdepth",
"hyperparameters": {
"n_estimators": 3526,
"max_depth": 13,
"min_child_weight": 0.0994486725676356,
"learning_rate": 0.0009765625,
"subsample": 0.46123759274652554,
"colsample_bylevel": 1.0,
"colsample_bytree": 0.4498813776397717,
"reg_alpha": 0.002599398546499414,
"reg_lambda": 0.028336396854402753
}
},
{
"class": "xgb_limitdepth",
"hyperparameters": {}
}
],
"preprocessing": {
"center": [
18000.0,
21.0,
2.0,
0.7565217391304347
],
"scale": [
39542.5,
143.0,
1.0,
0.5714285714285715
]
},
"neighbors": [
{
"features": [
1.2745779857115762,
1.0419580419580419,
0.0,
0.4260869565217391
],
"choice": [
0,
1,
3
]
},
{
"features": [
11.821306189542897,
-0.0979020979020979,
0.0,
-0.5739130434782609
],
"choice": [
0,
2,
3
]
},
{
"features": [
0.290624012138838,
-0.08391608391608392,
0.0,
-1.3239130434782607
],
"choice": [
2,
1,
0,
3
]
},
{
"features": [
-0.4395018018587596,
-0.04895104895104895,
0.0,
-0.5739130434782609
],
"choice": [
2,
0,
1,
3
]
},
{
"features": [
0.68280963520263,
1.4615384615384615,
0.0,
0.0
],
"choice": [
1,
2,
0,
3
]
},
{
"features": [
0.65643295188721,
-0.04895104895104895,
0.0,
-0.5739130434782609
],
"choice": [
1,
3
]
},
{
"features": [
0.5737876967819435,
-0.03496503496503497,
0.0,
-0.5582880434782608
],
"choice": [
2,
1,
0,
3
]
},
{
"features": [
-0.4381867610798508,
-0.11888111888111888,
0.0,
0.4260869565217391
],
"choice": [
0,
1,
3
]
},
{
"features": [
-0.3318960611999747,
11.293706293706293,
0.0,
0.3865087169129372
],
"choice": [
1,
0,
2,
3
]
},
{
"features": [
-0.432446102294999,
-0.006993006993006993,
0.0,
-0.7114130434782607
],
"choice": [
0,
1,
3
]
},
{
"features": [
0.0,
29.895104895104897,
0.0,
0.4260869565217391
],
"choice": [
0,
1,
2,
3
]
},
{
"features": [
1.7764430675855092,
0.04895104895104895,
0.0,
0.4260869565217391
],
"choice": [
0,
1,
2,
3
]
},
{
"features": [
-0.3873047986343807,
0.8601398601398601,
0.0,
-1.2266908212560386
],
"choice": [
2,
0,
1,
3
]
},
{
"features": [
-0.40720743503824997,
0.0,
0.0,
0.4260869565217391
],
"choice": [
1,
0,
2,
3
]
},
{
"features": [
-0.38247455269646585,
0.1048951048951049,
0.0,
-1.3239130434782607
],
"choice": [
0,
1,
3
]
},
{
"features": [
0.32921540115066067,
0.6783216783216783,
0.0,
-0.003997789240972687
],
"choice": [
0,
1,
3
]
},
{
"features": [
-0.3322248213947019,
-0.11188811188811189,
0.0,
0.4260869565217391
],
"choice": [
0,
3
]
},
{
"features": [
0.0,
29.895104895104897,
0.0,
0.4260869565217391
],
"choice": [
0,
1,
3
]
},
{
"features": [
-0.3385977113232598,
-0.006993006993006993,
0.0,
0.4260869565217391
],
"choice": [
1,
3
]
}
],
"configsource": [
"Jannis",
"adult",
"Amazon_employee_access",
"default"
]
}

View File

@ -0,0 +1,357 @@
{
"version": "1.0.2",
"meta_feature_names": [
"NumberOfInstances","NumberOfFeatures","NumberOfClasses","PercentageOfNumericFeatures"
],
"portfolio": [
{
"class": "xgb_limitdepth",
"hyperparameters": {
"n_estimators": 1191,
"max_depth": 13,
"min_child_weight": 6.4007885677724605,
"learning_rate": 0.037622775650237326,
"subsample": 1.0,
"colsample_bylevel": 0.3697773165627811,
"colsample_bytree": 0.813871237069598,
"reg_alpha": 0.0009765625,
"reg_lambda": 1.075702708240612
}
},
{
"class": "xgb_limitdepth",
"hyperparameters": {
"n_estimators": 1499,
"max_depth": 11,
"min_child_weight": 0.07563529776156448,
"learning_rate": 0.039042609221240955,
"subsample": 0.7832981935783824,
"colsample_bylevel": 1.0,
"colsample_bytree": 1.0,
"reg_alpha": 0.0009765625,
"reg_lambda": 23.513066752844153
}
},
{
"class": "xgb_limitdepth",
"hyperparameters": {
"n_estimators": 313,
"max_depth": 7,
"min_child_weight": 30.424259012001368,
"learning_rate": 0.08466828646360688,
"subsample": 0.9897083979469301,
"colsample_bylevel": 0.6769490906308069,
"colsample_bytree": 1.0,
"reg_alpha": 0.0014544085935366477,
"reg_lambda": 34.09911172306857
}
},
{
"class": "xgb_limitdepth",
"hyperparameters": {
"n_estimators": 566,
"max_depth": 13,
"min_child_weight": 0.013176186839973599,
"learning_rate": 0.09285619488896565,
"subsample": 0.5897287493640815,
"colsample_bylevel": 0.923664288991597,
"colsample_bytree": 0.8244714790646485,
"reg_alpha": 0.023484974838756726,
"reg_lambda": 0.5690298249126402,
"FLAML_sample_size": 470620
}
},
{
"class": "xgb_limitdepth",
"hyperparameters": {}
},
{
"class": "xgb_limitdepth",
"hyperparameters": {
"n_estimators": 971,
"max_depth": 8,
"min_child_weight": 0.0044052948947322645,
"learning_rate": 0.15171239415469703,
"subsample": 0.8340342805529243,
"colsample_bylevel": 0.9489310919814007,
"colsample_bytree": 0.022724724669028674,
"reg_alpha": 0.0009765625,
"reg_lambda": 0.0025897714798936954
}
},
{
"class": "xgb_limitdepth",
"hyperparameters": {
"n_estimators": 464,
"max_depth": 2,
"min_child_weight": 0.0068282719220722,
"learning_rate": 0.07962498837600937,
"subsample": 0.47139986510869014,
"colsample_bylevel": 0.4814471959023239,
"colsample_bytree": 0.6050207253592859,
"reg_alpha": 0.0010290828959872173,
"reg_lambda": 0.0103104214002687
}
},
{
"class": "xgb_limitdepth",
"hyperparameters": {
"n_estimators": 1799,
"max_depth": 3,
"min_child_weight": 0.0010034151843327725,
"learning_rate": 0.03453775119035777,
"subsample": 0.31322065037892344,
"colsample_bylevel": 1.0,
"colsample_bytree": 0.2219038021462818,
"reg_alpha": 0.03885163786709896,
"reg_lambda": 1.1077175359756786
}
}
],
"preprocessing": {
"center": [
24668.5,
54.0,
7.0,
1.0
],
"scale": [
57198.0,
770.5,
6.0,
1.0
]
},
"neighbors": [
{
"features": [
8.710820308402392,
0.0,
0.0,
-0.8148148148148149
],
"choice": [
0,
3,
4
]
},
{
"features": [
0.6701545508584216,
0.9474367293964958,
0.5,
0.0
],
"choice": [
0,
2,
7,
4
]
},
{
"features": [
0.5945575020105598,
-0.03504218040233614,
15.5,
0.0
],
"choice": [
0,
2,
7,
6,
3,
4
]
},
{
"features": [
0.8862285394594217,
0.0,
-0.5,
0.0
],
"choice": [
2,
4
]
},
{
"features": [
-0.2739344033008147,
9.2744970798183,
0.5,
0.0
],
"choice": [
0,
2,
7,
6,
4
]
},
{
"features": [
0.48133676002657433,
-0.058403634003893576,
0.0,
0.0
],
"choice": [
1,
4
]
},
{
"features": [
0.4862145529563971,
0.16353017521090202,
0.5,
0.0
],
"choice": [
0,
1,
4
]
},
{
"features": [
-0.40409629707332423,
-0.06229720960415315,
-0.5,
-1.0
],
"choice": [
4
]
},
{
"features": [
-0.41428896115248787,
1.0408825438027256,
0.3333333333333333,
0.0
],
"choice": [
5,
3,
1,
7,
6,
4
]
},
{
"features": [
0.6317091506696039,
-0.015574302401038288,
-0.6666666666666666,
-1.0
],
"choice": [
1,
0,
3,
4
]
},
{
"features": [
-0.2739344033008147,
2.5256327060350423,
-0.3333333333333333,
0.0
],
"choice": [
0,
5,
3,
7,
4
]
},
{
"features": [
-0.30168012867582783,
0.9682024659312135,
0.0,
0.0
],
"choice": [
1,
3,
4
]
},
{
"features": [
0.2739344033008147,
-0.06229720960415315,
-0.6666666666666666,
0.0
],
"choice": [
4
]
},
{
"features": [
-0.39981293052204625,
0.21025308241401688,
0.5,
0.0
],
"choice": [
7,
4
]
},
{
"features": [
-0.3949351375922235,
-0.04931862426995458,
0.0,
0.0
],
"choice": [
6,
0,
7,
1,
3,
4
]
},
{
"features": [
-0.41797790132522117,
-0.04672290720311486,
-0.5,
0.0
],
"choice": [
6,
1,
7,
2,
0,
3,
4
]
}
],
"configsource": [
"guillermo",
"connect-4",
"Helena",
"Covertype",
"default",
"cnae-9",
"vehicle",
"mfeat-factors"
]
}

View File

@ -0,0 +1,350 @@
{
"version": "1.0.2",
"meta_feature_names": [
"NumberOfInstances","NumberOfFeatures","NumberOfClasses","PercentageOfNumericFeatures"
],
"portfolio": [
{
"class": "xgb_limitdepth",
"hyperparameters": {
"n_estimators": 4923,
"max_depth": 12,
"min_child_weight": 0.7625732991776795,
"learning_rate": 0.009239549681857523,
"subsample": 0.8193164619615052,
"colsample_bylevel": 0.7785754297307862,
"colsample_bytree": 0.788491073979525,
"reg_alpha": 0.002282749364196872,
"reg_lambda": 131.2194560716441
}
},
{
"class": "xgb_limitdepth",
"hyperparameters": {
"n_estimators": 2111,
"max_depth": 9,
"min_child_weight": 3.405822241186395,
"learning_rate": 0.005804247705198151,
"subsample": 0.37848422782052427,
"colsample_bylevel": 0.8228350674288559,
"colsample_bytree": 0.8813475713109656,
"reg_alpha": 0.009761356063132219,
"reg_lambda": 13.187783936727843,
"FLAML_sample_size": 810000
}
},
{
"class": "xgb_limitdepth",
"hyperparameters": {
"n_estimators": 1499,
"max_depth": 11,
"min_child_weight": 0.07563529776156448,
"learning_rate": 0.039042609221240955,
"subsample": 0.7832981935783824,
"colsample_bylevel": 1.0,
"colsample_bytree": 1.0,
"reg_alpha": 0.0009765625,
"reg_lambda": 23.513066752844153
}
},
{
"class": "xgb_limitdepth",
"hyperparameters": {
"n_estimators": 19722,
"max_depth": 11,
"min_child_weight": 6.46800727978204,
"learning_rate": 0.0010837437950202355,
"subsample": 0.49509562408032115,
"colsample_bylevel": 1.0,
"colsample_bytree": 0.8826299329274134,
"reg_alpha": 0.23887161121959208,
"reg_lambda": 15.163773888208217
}
},
{
"class": "xgb_limitdepth",
"hyperparameters": {
"n_estimators": 544,
"max_depth": 12,
"min_child_weight": 79.32555867011995,
"learning_rate": 0.010128107120014433,
"subsample": 0.9799974977817297,
"colsample_bylevel": 0.881815418056542,
"colsample_bytree": 0.9718556912196423,
"reg_alpha": 72.63148950428749,
"reg_lambda": 1.4601415712058006
}
},
{
"class": "xgb_limitdepth",
"hyperparameters": {}
}
],
"preprocessing": {
"center": [
36691.0,
10.0,
0.0,
1.0
],
"scale": [
140856.0,
1.0,
1.0,
0.4444444444444444
]
},
"neighbors": [
{
"features": [
0.0,
0.0,
0.0,
0.0
],
"choice": [
4,
5
]
},
{
"features": [
-0.17263020389617767,
30.0,
0.0,
0.0
],
"choice": [
2,
0,
5
]
},
{
"features": [
6.129018288180837,
-1.0,
0.0,
-0.7500000000000001
],
"choice": [
1,
0,
2,
4,
5
]
},
{
"features": [
0.48478588061566424,
-1.0,
0.0,
-2.0
],
"choice": [
4,
1,
3,
5
]
},
{
"features": [
-0.14869796103822344,
-1.0,
0.0,
-0.7500000000000001
],
"choice": [
4,
1,
3,
0,
5
]
},
{
"features": [
-0.06175100812176975,
-1.0,
0.0,
-1.7500000000000002
],
"choice": [
4,
1,
5
]
},
{
"features": [
6.129018288180837,
8.0,
0.0,
-1.0
],
"choice": [
0,
2,
1,
4,
5
]
},
{
"features": [
6.129018288180837,
0.0,
0.0,
-2.0250000000000004
],
"choice": [
1,
0,
2,
4,
5
]
},
{
"features": [
0.8713934798659624,
0.0,
0.0,
0.0
],
"choice": [
4,
5
]
},
{
"features": [
0.0,
0.0,
0.0,
0.0
],
"choice": [
1,
3,
0,
2,
5
]
},
{
"features": [
-0.11491168285341058,
6.0,
0.0,
0.0
],
"choice": [
3,
1,
0,
2,
4,
5
]
},
{
"features": [
-0.11491168285341058,
-2.0,
0.0,
0.0
],
"choice": [
0,
1,
3,
2,
4,
5
]
},
{
"features": [
-0.1286065201340376,
-2.0,
0.0,
0.0
],
"choice": [
3,
0,
2,
1,
4,
5
]
},
{
"features": [
0.0,
0.0,
0.0,
-0.6750000000000002
],
"choice": [
2,
3,
1,
0,
5
]
},
{
"features": [
6.288819787584483,
0.0,
0.0,
0.0
],
"choice": [
2,
0,
1,
5
]
},
{
"features": [
-0.16464332367808257,
38.0,
0.0,
0.0
],
"choice": [
0,
2,
3,
1,
5
]
},
{
"features": [
-0.15343329357641847,
-7.0,
0.0,
-1.5000000000000002
],
"choice": [
3,
5
]
}
],
"configsource": [
"higgs",
"bng_pharynx",
"connect-4",
"house_16H",
"bng_echomonths",
"default"
]
}

View File

@ -0,0 +1,375 @@
{
"version": "1.0.2",
"meta_feature_names": [
"NumberOfInstances","NumberOfFeatures","NumberOfClasses","PercentageOfNumericFeatures"
],
"portfolio": [
{
"class": "xgboost",
"hyperparameters": {
"n_estimators": 319,
"max_leaves": 1312,
"min_child_weight": 0.001,
"learning_rate": 0.01872379806270421,
"subsample": 0.6890079660561895,
"colsample_bylevel": 0.7551225121854014,
"colsample_bytree": 0.7860755604500558,
"reg_alpha": 0.17028752704343114,
"reg_lambda": 1.4375743264564231
}
},
{
"class": "xgboost",
"hyperparameters": {
"n_estimators": 7902,
"max_leaves": 49,
"min_child_weight": 0.038063497848955595,
"learning_rate": 0.0009765625,
"subsample": 0.9357800695141445,
"colsample_bylevel": 0.47031312177249246,
"colsample_bytree": 0.9053386579586192,
"reg_alpha": 1.5286102593845932,
"reg_lambda": 18.96811296717419
}
},
{
"class": "xgboost",
"hyperparameters": {
"n_estimators": 13499,
"max_leaves": 60,
"min_child_weight": 0.008494221584011285,
"learning_rate": 0.006955765856675575,
"subsample": 0.5965241023754743,
"colsample_bylevel": 0.590641168068946,
"colsample_bytree": 1.0,
"reg_alpha": 0.2522240954379289,
"reg_lambda": 5.351809144038808
}
},
{
"class": "xgboost",
"hyperparameters": {
"n_estimators": 591,
"max_leaves": 16651,
"min_child_weight": 0.03356567864689129,
"learning_rate": 0.002595066436678338,
"subsample": 0.9114132805513452,
"colsample_bylevel": 0.9503441844594458,
"colsample_bytree": 0.5703338448066768,
"reg_alpha": 0.010405212349127894,
"reg_lambda": 0.05352660657433639
}
}
],
"preprocessing": {
"center": [
18000.0,
28.0,
2.0,
0.7565217391304347
],
"scale": [
42124.0,
130.0,
1.0,
0.5714285714285715
]
},
"neighbors": [
{
"features": [
1.196467571930491,
1.0923076923076922,
0.0,
0.4260869565217391
],
"choice": [
0,
3,
2,
1
]
},
{
"features": [
11.096856898680088,
-0.16153846153846155,
0.0,
-0.5739130434782609
],
"choice": [
0,
2,
3,
1
]
},
{
"features": [
8.658152122305575,
0.38461538461538464,
0.0,
-0.7405797101449274
],
"choice": [
2,
0,
1,
3
]
},
{
"features": [
0.27281359794891274,
-0.14615384615384616,
0.0,
-1.3239130434782607
],
"choice": [
3,
0,
2,
1
]
},
{
"features": [
-0.4125676573924604,
-0.1076923076923077,
0.0,
-0.5739130434782609
],
"choice": [
3,
1,
0,
2
]
},
{
"features": [
0.6409647706770487,
1.5538461538461539,
0.0,
0.0
],
"choice": [
1,
0,
2,
3
]
},
{
"features": [
2.3515573069983855,
0.16923076923076924,
0.0,
0.4260869565217391
],
"choice": [
2,
0,
1,
3
]
},
{
"features": [
0.6162045389801538,
-0.1076923076923077,
0.0,
-0.5739130434782609
],
"choice": [
1,
0,
2,
3
]
},
{
"features": [
0.5386240622922799,
-0.09230769230769231,
0.0,
-0.5582880434782608
],
"choice": [
0,
1,
3,
2
]
},
{
"features": [
-0.41133320672300827,
-0.18461538461538463,
0.0,
0.4260869565217391
],
"choice": [
2,
1,
0,
3
]
},
{
"features": [
-0.31155635742094767,
12.36923076923077,
0.0,
0.3865087169129372
],
"choice": [
2,
1,
0,
3
]
},
{
"features": [
-0.40594435476213087,
-0.06153846153846154,
0.0,
-0.7114130434782607
],
"choice": [
0,
1,
2,
3
]
},
{
"features": [
0.0,
32.83076923076923,
0.0,
0.4260869565217391
],
"choice": [
0
]
},
{
"features": [
1.6675766783781218,
0.0,
0.0,
0.4260869565217391
],
"choice": [
2,
0,
1,
3
]
},
{
"features": [
-0.36356946158959264,
0.8923076923076924,
0.0,
-1.2266908212560386
],
"choice": [
3,
1,
0,
2
]
},
{
"features": [
-0.38225239768303104,
-0.05384615384615385,
0.0,
0.4260869565217391
],
"choice": [
3,
2,
0,
1
]
},
{
"features": [
-0.3590352293229513,
0.06153846153846154,
0.0,
-1.3239130434782607
],
"choice": [
2,
0,
1,
3
]
},
{
"features": [
0.3090399772101415,
0.6923076923076923,
0.0,
-0.003997789240972687
],
"choice": [
2,
0,
3,
1
]
},
{
"features": [
-0.3118649700883107,
-0.17692307692307693,
0.0,
0.4260869565217391
],
"choice": [
2,
0,
1,
3
]
},
{
"features": [
0.0,
32.83076923076923,
0.0,
0.4260869565217391
],
"choice": [
0,
3
]
},
{
"features": [
-0.3178473079479632,
-0.06153846153846154,
0.0,
0.4260869565217391
],
"choice": [
0,
3,
1,
2
]
}
],
"configsource": [
"fabert",
"bng_lowbwt",
"pol",
"Amazon_employee_access"
]
}

View File

@ -0,0 +1,512 @@
{
"version": "1.0.2",
"meta_feature_names": [
"NumberOfInstances","NumberOfFeatures","NumberOfClasses","PercentageOfNumericFeatures"
],
"portfolio": [
{
"class": "xgboost",
"hyperparameters": {
"n_estimators": 392,
"max_leaves": 46,
"min_child_weight": 0.20655273911443411,
"learning_rate": 0.08039123467849849,
"subsample": 0.6482821473906787,
"colsample_bylevel": 0.5448604029329934,
"colsample_bytree": 0.4211786481671673,
"reg_alpha": 0.029040644754759502,
"reg_lambda": 4.60220206538413
}
},
{
"class": "xgboost",
"hyperparameters": {
"n_estimators": 6357,
"max_leaves": 206,
"min_child_weight": 1.9495322566288034,
"learning_rate": 0.0068766724195393905,
"subsample": 0.9451618245005704,
"colsample_bylevel": 0.9030482524943064,
"colsample_bytree": 0.9278972006416252,
"reg_alpha": 0.01857648400903689,
"reg_lambda": 6.021166480604588,
"FLAML_sample_size": 344444
}
},
{
"class": "xgboost",
"hyperparameters": {
"n_estimators": 1067,
"max_leaves": 55,
"min_child_weight": 1.578700876556201,
"learning_rate": 0.01882776721912098,
"subsample": 0.6486829588043383,
"colsample_bylevel": 1.0,
"colsample_bytree": 0.6470978147570122,
"reg_alpha": 0.2623396481373557,
"reg_lambda": 12.320026567378322
}
},
{
"class": "xgboost",
"hyperparameters": {
"n_estimators": 765,
"max_leaves": 6,
"min_child_weight": 0.001,
"learning_rate": 1.0,
"subsample": 0.9833803894285497,
"colsample_bylevel": 1.0,
"colsample_bytree": 1.0,
"reg_alpha": 0.0012553728257619922,
"reg_lambda": 0.03280542610559108
}
},
{
"class": "xgboost",
"hyperparameters": {
"n_estimators": 2866,
"max_leaves": 2954,
"min_child_weight": 0.003652484923138387,
"learning_rate": 0.006320484540131336,
"subsample": 0.45886345839532916,
"colsample_bylevel": 0.4143419565729296,
"colsample_bytree": 0.9117641224108227,
"reg_alpha": 0.2873746517375349,
"reg_lambda": 17.04964039639045
}
},
{
"class": "xgboost",
"hyperparameters": {
"n_estimators": 512,
"max_leaves": 3194,
"min_child_weight": 0.004561511536080627,
"learning_rate": 0.05288849444758447,
"subsample": 0.8653058105000044,
"colsample_bylevel": 0.8833689901424637,
"colsample_bytree": 0.9505209943737727,
"reg_alpha": 0.0037017878164852017,
"reg_lambda": 2.1872397928745113,
"FLAML_sample_size": 470620
}
},
{
"class": "xgboost",
"hyperparameters": {
"n_estimators": 335,
"max_leaves": 37,
"min_child_weight": 0.0013851539632487603,
"learning_rate": 0.2593737370075479,
"subsample": 0.9810091528571387,
"colsample_bylevel": 0.9484250613084422,
"colsample_bytree": 0.192606132199437,
"reg_alpha": 0.10585986776049093,
"reg_lambda": 0.017684465384509407
}
},
{
"class": "xgboost",
"hyperparameters": {
"n_estimators": 8315,
"max_leaves": 4,
"min_child_weight": 0.7673654415794792,
"learning_rate": 0.002432260930606481,
"subsample": 0.8476000618302348,
"colsample_bylevel": 0.8815698870579244,
"colsample_bytree": 0.7057137578225323,
"reg_alpha": 0.0016838090603716895,
"reg_lambda": 0.28815989841009226
}
},
{
"class": "xgboost",
"hyperparameters": {
"n_estimators": 319,
"max_leaves": 1312,
"min_child_weight": 0.001,
"learning_rate": 0.01872379806270421,
"subsample": 0.6890079660561895,
"colsample_bylevel": 0.7551225121854014,
"colsample_bytree": 0.7860755604500558,
"reg_alpha": 0.17028752704343114,
"reg_lambda": 1.4375743264564231
}
},
{
"class": "xgboost",
"hyperparameters": {
"n_estimators": 5739,
"max_leaves": 5,
"min_child_weight": 0.1359602026207002,
"learning_rate": 0.14496176867613397,
"subsample": 0.864897070662231,
"colsample_bylevel": 0.01,
"colsample_bytree": 0.9394057513384305,
"reg_alpha": 0.001103317921178771,
"reg_lambda": 0.1655504349283218
}
},
{
"class": "xgboost",
"hyperparameters": {
"n_estimators": 3369,
"max_leaves": 23,
"min_child_weight": 0.006136645605168392,
"learning_rate": 0.05726537983358939,
"subsample": 1.0,
"colsample_bylevel": 1.0,
"colsample_bytree": 1.0,
"reg_alpha": 0.40981311572427176,
"reg_lambda": 4.343877111132155
}
}
],
"preprocessing": {
"center": [
24668.5,
54.0,
7.0,
1.0
],
"scale": [
57198.0,
770.5,
6.0,
1.0
]
},
"neighbors": [
{
"features": [
8.710820308402392,
0.0,
0.0,
-0.8148148148148149
],
"choice": [
5,
4,
1,
8,
10,
2,
0,
6,
9,
7,
3
]
},
{
"features": [
0.6701545508584216,
0.9474367293964958,
0.5,
0.0
],
"choice": [
0,
2,
3,
6,
10,
8,
9
]
},
{
"features": [
0.5945575020105598,
-0.03504218040233614,
15.5,
0.0
],
"choice": [
0,
2,
3,
7,
8,
5,
10,
9,
6
]
},
{
"features": [
0.8862285394594217,
0.0,
-0.5,
0.0
],
"choice": [
2,
8,
0,
4,
10,
1,
9,
6,
7,
5,
3
]
},
{
"features": [
-0.2739344033008147,
9.2744970798183,
0.5,
0.0
],
"choice": [
0,
3,
6
]
},
{
"features": [
0.48133676002657433,
-0.058403634003893576,
0.0,
0.0
],
"choice": [
10,
3,
0,
5,
1,
7,
6,
2,
4,
9,
8
]
},
{
"features": [
0.4862145529563971,
0.16353017521090202,
0.5,
0.0
],
"choice": [
1,
0,
2,
3,
10,
8,
6,
5,
9,
7
]
},
{
"features": [
-0.40409629707332423,
-0.06229720960415315,
-0.5,
-1.0
],
"choice": [
3,
9,
5,
10,
1,
7,
2,
8,
4,
6,
0
]
},
{
"features": [
-0.41428896115248787,
1.0408825438027256,
0.3333333333333333,
0.0
],
"choice": [
6,
9,
0,
5,
10,
4,
8,
7,
1,
2,
3
]
},
{
"features": [
0.6317091506696039,
-0.015574302401038288,
-0.6666666666666666,
-1.0
],
"choice": [
1,
10,
4,
5,
8,
6,
2,
0,
3,
9,
7
]
},
{
"features": [
-0.2739344033008147,
2.5256327060350423,
-0.3333333333333333,
0.0
],
"choice": [
0,
2,
3,
9,
6,
10,
5,
8,
7
]
},
{
"features": [
-0.30168012867582783,
0.9682024659312135,
0.0,
0.0
],
"choice": [
8,
4,
0,
2,
10,
1,
5,
6,
9,
7,
3
]
},
{
"features": [
0.2739344033008147,
-0.06229720960415315,
-0.6666666666666666,
0.0
],
"choice": [
10,
3,
9,
1,
4,
2,
8,
5,
0,
7,
6
]
},
{
"features": [
-0.39981293052204625,
0.21025308241401688,
0.5,
0.0
],
"choice": [
0,
9,
1,
7,
5,
10,
6,
2,
4,
8,
3
]
},
{
"features": [
-0.3949351375922235,
-0.04931862426995458,
0.0,
0.0
],
"choice": [
0,
2,
1,
7,
8,
4,
5,
6,
10,
9,
3
]
},
{
"features": [
-0.41797790132522117,
-0.04672290720311486,
-0.5,
0.0
],
"choice": [
7,
4,
8,
2,
0,
5,
10,
1,
6,
9,
3
]
}
],
"configsource": [
"segment",
"Albert",
"Helena",
"car",
"house_8L",
"Covertype",
"cnae-9",
"KDDCup09_appetency",
"fabert",
"dilbert",
"jungle_chess_2pcs_raw_endgame_complete"
]
}

View File

@ -0,0 +1,311 @@
{
"version": "1.0.2",
"meta_feature_names": [
"NumberOfInstances","NumberOfFeatures","NumberOfClasses","PercentageOfNumericFeatures"
],
"portfolio": [
{
"class": "xgboost",
"hyperparameters": {
"n_estimators": 6357,
"max_leaves": 206,
"min_child_weight": 1.9495322566288034,
"learning_rate": 0.0068766724195393905,
"subsample": 0.9451618245005704,
"colsample_bylevel": 0.9030482524943064,
"colsample_bytree": 0.9278972006416252,
"reg_alpha": 0.01857648400903689,
"reg_lambda": 6.021166480604588,
"FLAML_sample_size": 344444
}
},
{
"class": "xgboost",
"hyperparameters": {
"n_estimators": 23045,
"max_leaves": 247,
"min_child_weight": 0.004319397499079841,
"learning_rate": 0.0032914413473281215,
"subsample": 0.7334190564433234,
"colsample_bylevel": 1.0,
"colsample_bytree": 1.0,
"reg_alpha": 0.03514226467919635,
"reg_lambda": 1.2679661021665851
}
},
{
"class": "xgboost",
"hyperparameters": {
"n_estimators": 1899,
"max_leaves": 59,
"min_child_weight": 0.013389019900720164,
"learning_rate": 0.0028943401472847964,
"subsample": 0.7808944208233943,
"colsample_bylevel": 1.0,
"colsample_bytree": 0.9999355357362375,
"reg_alpha": 0.7905117773932884,
"reg_lambda": 2.916897119216104
}
},
{
"class": "xgboost",
"hyperparameters": {
"n_estimators": 5611,
"max_leaves": 61,
"min_child_weight": 0.01070518287797225,
"learning_rate": 0.005485127037677848,
"subsample": 0.4713518256961299,
"colsample_bylevel": 0.9777437906530106,
"colsample_bytree": 0.9519335125615331,
"reg_alpha": 0.03621564207188963,
"reg_lambda": 1.8045765669466283
}
}
],
"preprocessing": {
"center": [
36691.0,
10.0,
0.0,
1.0
],
"scale": [
324551.25,
2.5,
1.0,
0.36111111111111116
]
},
"neighbors": [
{
"features": [
0.0,
0.0,
0.0,
0.0
],
"choice": [
2,
3,
0,
1
]
},
{
"features": [
-0.07492191140844474,
12.0,
0.0,
0.0
],
"choice": [
0,
1,
3,
2
]
},
{
"features": [
2.6600082421497375,
-0.4,
0.0,
-0.923076923076923
],
"choice": [
3,
0,
2,
1
]
},
{
"features": [
0.21039820367353385,
-0.4,
0.0,
-2.4615384615384612
],
"choice": [
3,
2,
0,
1
]
},
{
"features": [
-0.06453526215043079,
-0.4,
0.0,
-0.923076923076923
],
"choice": [
2,
3,
0,
1
]
},
{
"features": [
-0.026800081651203008,
-0.4,
0.0,
-2.1538461538461537
],
"choice": [
2,
3,
0,
1
]
},
{
"features": [
2.6600082421497375,
3.2,
0.0,
-1.2307692307692306
],
"choice": [
1,
0,
3,
2
]
},
{
"features": [
2.6600082421497375,
0.0,
0.0,
-2.492307692307692
],
"choice": [
3,
0,
2,
1
]
},
{
"features": [
0.3781868040871819,
0.0,
0.0,
0.0
],
"choice": [
2,
3,
0,
1
]
},
{
"features": [
0.0,
0.0,
0.0,
0.0
],
"choice": [
3,
0,
1,
2
]
},
{
"features": [
-0.04987193856132121,
2.4,
0.0,
0.0
],
"choice": [
3,
1,
0,
2
]
},
{
"features": [
-0.04987193856132121,
-0.8,
0.0,
0.0
],
"choice": [
2,
0,
1,
3
]
},
{
"features": [
-0.0558155299047531,
-0.8,
0.0,
0.0
],
"choice": [
0,
3,
1,
2
]
},
{
"features": [
0.0,
0.0,
0.0,
-0.8307692307692308
],
"choice": [
1,
0,
3,
2
]
},
{
"features": [
2.729362465866331,
0.0,
0.0,
0.0
],
"choice": [
1,
0,
3,
2
]
},
{
"features": [
-0.07145558675247746,
15.2,
0.0,
0.0
],
"choice": [
0,
3,
1,
2
]
}
],
"configsource": [
"Albert",
"mv",
"bng_echomonths",
"house_16H"
]
}

9
flaml/ml.py Normal file
View File

@ -0,0 +1,9 @@
import warnings
from flaml.automl.ml import *
warnings.warn(
"Importing from `flaml.ml` is deprecated. Please use `flaml.automl.ml`.",
DeprecationWarning,
)

9
flaml/model.py Normal file
View File

@ -0,0 +1,9 @@
import warnings
from flaml.automl.model import *
warnings.warn(
"Importing from `flaml.model` is deprecated. Please use `flaml.automl.model`.",
DeprecationWarning,
)

47
flaml/onlineml/README.md Normal file
View File

@ -0,0 +1,47 @@
# ChaCha for Online AutoML
FLAML includes *ChaCha*, an automatic hyperparameter tuning solution for online machine learning. Online machine learning has the following properties: (1) data comes in sequential order; and (2) the performance of the machine learning model is evaluated online, i.e., at every iteration. *ChaCha* performs online AutoML respecting these properties of online learning, while also respecting the following constraints: (1) only a small constant number of 'live' models are allowed to perform online learning at the same time; and (2) no model persistence or offline training is allowed, which means that once we decide to replace a 'live' model with a new one, the replaced model can no longer be retrieved.
For more technical details about *ChaCha*, please check our paper.
* [ChaCha for Online AutoML](https://www.microsoft.com/en-us/research/publication/chacha-for-online-automl/). Qingyun Wu, Chi Wang, John Langford, Paul Mineiro and Marco Rossi. ICML 2021.
```
@inproceedings{wu2021chacha,
title={ChaCha for online AutoML},
author={Qingyun Wu and Chi Wang and John Langford and Paul Mineiro and Marco Rossi},
year={2021},
booktitle={ICML},
}
```
## `AutoVW`
`flaml.AutoVW` is a realization of the *ChaCha* AutoML method with online learners from the open-source online machine learning library [Vowpal Wabbit](https://vowpalwabbit.org/). It can be used to tune both conventional numerical and categorical hyperparameters, such as the learning rate, and hyperparameters for featurization choices, such as namespace (a namespace is a group of features) interactions in Vowpal Wabbit.
An example of online namespace interactions tuning in VW:
```python
# require: pip install flaml[vw]
from flaml import AutoVW
'''create an AutoVW instance for tuning namespace interactions'''
autovw = AutoVW(max_live_model_num=5, search_space={'interactions': AutoVW.AUTOMATIC})
```
An example of online tuning of both namespace interactions and learning rate in VW:
```python
# require: pip install flaml[vw]
from flaml import AutoVW
from flaml.tune import loguniform
''' create an AutoVW instance for tuning namespace interactions and learning rate'''
# set up the search space and init config
search_space_nilr = {'interactions': AutoVW.AUTOMATIC, 'learning_rate': loguniform(lower=2e-10, upper=1.0)}
init_config_nilr = {'interactions': set(), 'learning_rate': 0.5}
# create an AutoVW instance
autovw = AutoVW(max_live_model_num=5, search_space=search_space_nilr, init_config=init_config_nilr)
```
A user can use the resulting AutoVW instance `autovw` in a similar way to a vanilla Vowpal Wabbit instance, i.e., `pyvw.vw`, to perform online learning by iteratively calling its `predict(data_example)` and `learn(data_example)` methods on each data example; a minimal loop is sketched below.
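The following sketch assumes `data_examples` is an iterable of data examples in VW format (a placeholder name, not part of the API):
```python
# require: pip install flaml[vw]
from flaml import AutoVW

autovw = AutoVW(max_live_model_num=5, search_space={'interactions': AutoVW.AUTOMATIC})
for data_example in data_examples:  # data_examples: assumed iterable of VW-format examples
    y_pred = autovw.predict(data_example)  # predict with the currently best 'live' model
    autovw.learn(data_example)  # update all 'live' models with this example
```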
For more examples, please check out
[AutoVW notebook](https://github.com/microsoft/FLAML/blob/main/notebook/autovw.ipynb).

View File

@ -0,0 +1,2 @@
from .trial import VowpalWabbitTrial
from .trial_runner import OnlineTrialRunner

214
flaml/onlineml/autovw.py Normal file
View File

@ -0,0 +1,214 @@
from typing import Optional, Union
import logging
from flaml.tune import (
Trial,
Categorical,
Float,
PolynomialExpansionSet,
polynomial_expansion_set,
)
from flaml.onlineml import OnlineTrialRunner
from flaml.tune.scheduler import ChaChaScheduler
from flaml.tune.searcher import ChampionFrontierSearcher
from flaml.onlineml.trial import get_ns_feature_dim_from_vw_example
logger = logging.getLogger(__name__)
class AutoVW:
"""Class for the AutoVW algorithm."""
WARMSTART_NUM = 100
AUTOMATIC = "_auto"
VW_INTERACTION_ARG_NAME = "interactions"
def __init__(
self,
max_live_model_num: int,
search_space: dict,
init_config: Optional[dict] = {},
min_resource_lease: Optional[Union[str, float]] = "auto",
automl_runner_args: Optional[dict] = {},
scheduler_args: Optional[dict] = {},
model_select_policy: Optional[str] = "threshold_loss_ucb",
metric: Optional[str] = "mae_clipped",
random_seed: Optional[int] = None,
model_selection_mode: Optional[str] = "min",
cb_coef: Optional[float] = None,
):
"""Constructor.
Args:
max_live_model_num: An int to specify the maximum number of
'live' models, which, in other words, is the maximum number
of models allowed to be updated in each learning iteration.
search_space: A dictionary of the search space. This search space
includes both hyperparameters we want to tune and fixed
hyperparameters. In the latter case, the value is a fixed value.
init_config: A dictionary of a partial or full initial config,
e.g. {'interactions': set(), 'learning_rate': 0.5}
min_resource_lease: string or float | The minimum resource lease
assigned to a particular model/trial. If set as 'auto', it will
be calculated automatically.
automl_runner_args: A dictionary of configuration for the OnlineTrialRunner.
If set to {}, default values will be used, which is equivalent to using
the following configs.
Example:
```python
automl_runner_args = {
"champion_test_policy": 'loss_ucb', # the statistic test for a better champion
"remove_worse": False, # whether to do worse than test
}
```
scheduler_args: A dictionary of configuration for the scheduler.
If set to {}, default values will be used, which is equivalent to using the
following config.
Example:
```python
scheduler_args = {
"keep_challenger_metric": 'ucb', # what metric to use when deciding the top performing challengers
"keep_challenger_ratio": 0.5, # denotes the ratio of top performing challengers to keep live
"keep_champion": True, # specifcies whether to keep the champion always running
}
```
model_select_policy: A string in ['threshold_loss_ucb',
'threshold_loss_lcb', 'threshold_loss_avg', 'loss_ucb', 'loss_lcb',
'loss_avg'] to specify how to select one model to do prediction from
the live model pool. Default value is 'threshold_loss_ucb'.
metric: A string in ['mae_clipped', 'mae', 'mse', 'absolute_clipped',
'absolute', 'squared'] to specify the name of the loss function used
for calculating the progressive validation loss in ChaCha.
random_seed: An integer of the random seed used in the searcher
(more specifically, this is the random seed for the ConfigOracle).
model_selection_mode: A string in ['min', 'max'] to specify the objective as
minimization or maximization.
cb_coef: A float coefficient (optional) used in the sample complexity bound.
"""
self._max_live_model_num = max_live_model_num
self._search_space = search_space
self._init_config = init_config
self._online_trial_args = {
"metric": metric,
"min_resource_lease": min_resource_lease,
"cb_coef": cb_coef,
}
self._automl_runner_args = automl_runner_args
self._scheduler_args = scheduler_args
self._model_select_policy = model_select_policy
self._model_selection_mode = model_selection_mode
self._random_seed = random_seed
self._trial_runner = None
self._best_trial = None
# code for debugging purpose
self._prediction_trial_id = None
self._iter = 0
def _setup_trial_runner(self, vw_example):
"""Set up the _trial_runner based on one vw_example."""
# setup the default search space for the namespace interaction hyperparameter
search_space = self._search_space.copy()
for k, v in self._search_space.items():
if k == self.VW_INTERACTION_ARG_NAME and v == self.AUTOMATIC:
raw_namespaces = self.get_ns_feature_dim_from_vw_example(vw_example).keys()
search_space[k] = polynomial_expansion_set(init_monomials=set(raw_namespaces))
# setup the init config based on the input _init_config and search space
init_config = self._init_config.copy()
for k, v in search_space.items():
if k not in init_config.keys():
if isinstance(v, PolynomialExpansionSet):
init_config[k] = set()
elif not isinstance(v, Categorical) and not isinstance(v, Float):
init_config[k] = v
searcher_args = {
"init_config": init_config,
"space": search_space,
"random_seed": self._random_seed,
"online_trial_args": self._online_trial_args,
}
logger.info("original search_space %s", self._search_space)
logger.info("original init_config %s", self._init_config)
logger.info("searcher_args %s", searcher_args)
logger.info("scheduler_args %s", self._scheduler_args)
logger.info("automl_runner_args %s", self._automl_runner_args)
searcher = ChampionFrontierSearcher(**searcher_args)
scheduler = ChaChaScheduler(**self._scheduler_args)
self._trial_runner = OnlineTrialRunner(
max_live_model_num=self._max_live_model_num,
searcher=searcher,
scheduler=scheduler,
**self._automl_runner_args
)
def predict(self, data_sample):
"""Predict on the input data sample.
Args:
data_sample: one data example in vw format.
"""
if self._trial_runner is None:
self._setup_trial_runner(data_sample)
self._best_trial = self._select_best_trial()
self._y_predict = self._best_trial.predict(data_sample)
# code for debugging purpose
if self._prediction_trial_id is None or self._prediction_trial_id != self._best_trial.trial_id:
self._prediction_trial_id = self._best_trial.trial_id
logger.info(
"prediction trial id changed to %s at iter %s, resource used: %s",
self._prediction_trial_id,
self._iter,
self._best_trial.result.resource_used,
)
return self._y_predict
def learn(self, data_sample):
"""Perform one online learning step with the given data sample.
Args:
data_sample: one data example in vw format. It will be used to
update the vw model.
"""
self._iter += 1
self._trial_runner.step(data_sample, (self._y_predict, self._best_trial))
def _select_best_trial(self):
"""Select a best trial from the running trials according to the _model_select_policy."""
best_score = float("+inf") if self._model_selection_mode == "min" else float("-inf")
new_best_trial = None
for trial in self._trial_runner.running_trials:
if trial.result is not None and (
"threshold" not in self._model_select_policy or trial.result.resource_used >= self.WARMSTART_NUM
):
score = trial.result.get_score(self._model_select_policy)
if ("min" == self._model_selection_mode and score < best_score) or (
"max" == self._model_selection_mode and score > best_score
):
best_score = score
new_best_trial = trial
if new_best_trial is not None:
logger.debug("best_trial resource used: %s", new_best_trial.result.resource_used)
return new_best_trial
else:
# This branch will be triggered when the resource consumption of all trials is smaller
# than the WARMSTART_NUM threshold. In this case, we will select the _best_trial
# selected in the previous iteration.
if self._best_trial is not None and self._best_trial.status == Trial.RUNNING:
logger.debug("old best trial %s", self._best_trial.trial_id)
return self._best_trial
else:
# This branch is triggered in the first iteration, or when the trial selected
# in the previous iteration has since been paused by the scheduler
# (i.e., self._best_trial.status != Trial.RUNNING).
logger.debug(
"using champion trial: %s",
self._trial_runner.champion_trial.trial_id,
)
return self._trial_runner.champion_trial
@staticmethod
def get_ns_feature_dim_from_vw_example(vw_example) -> dict:
"""Get a dictionary of feature dimensionality for each namespace singleton."""
return get_ns_feature_dim_from_vw_example(vw_example)

415
flaml/onlineml/trial.py Normal file
View File

@ -0,0 +1,415 @@
import numpy as np
import logging
import time
import math
import copy
import collections
from typing import Optional, Union
from flaml.tune import Trial
try:
from sklearn.metrics import mean_squared_error, mean_absolute_error
except ImportError:
pass
logger = logging.getLogger(__name__)
def get_ns_feature_dim_from_vw_example(vw_example) -> dict:
"""Get a dictionary of feature dimensionality for each namespace singleton."""
# *************************A NOTE about the input vw_example***********
# Assumption: the vw_example takes one of the following formats,
# depending on whether the example includes the feature names.
# format 1: `y |ns1 feature1:feature_value1 feature2:feature_value2 |ns2
# feature3:feature_value3 feature4:feature_value4`
# format 2: `y | ns1 feature_value1 feature_value2 |
# ns2 feature_value3 feature_value4`
# The output in both cases is `{'ns1': 2, 'ns2': 2}`.
# For more information about the input format of a vw example, please refer to
# https://github.com/VowpalWabbit/vowpal_wabbit/wiki/Input-format.
ns_feature_dim = {}
data = vw_example.split("|")
for i in range(1, len(data)):
if ":" in data[i]:
ns_w_feature = data[i].split(" ")
ns = ns_w_feature[0]
feature = ns_w_feature[1:]
feature_dim = len(feature)
else:
data_split = data[i].split(" ")
ns = data_split[0]
feature_dim = len(data_split) - 1
if len(data_split[-1]) == 0:
feature_dim -= 1
ns_feature_dim[ns] = feature_dim
logger.debug("name space feature dimension %s", ns_feature_dim)
return ns_feature_dim
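# A worked example (hypothetical input with feature names, so the ':' branch applies):
# get_ns_feature_dim_from_vw_example("0.5 |a x:1 y:2|b z:3")  # -> {'a': 2, 'b': 1}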
class OnlineResult:
"""Class for managing the result statistics of a trial."""
prob_delta = 0.1
LOSS_MIN = 0.0
LOSS_MAX = np.inf
CB_COEF = 0.05 # 0.001 for mse
def __init__(
self,
result_type_name: str,
cb_coef: Optional[float] = None,
init_loss: Optional[float] = 0.0,
init_cb: Optional[float] = 100.0,
mode: Optional[str] = "min",
sliding_window_size: Optional[int] = 100,
):
"""Constructor.
Args:
result_type_name: A string to specify the name of the result type.
cb_coef: A float to specify the coefficient on the confidence bound.
init_loss: A float to specify the initial loss.
init_cb: A float to specify the initial confidence bound.
mode: A string in ['min', 'max'] to specify the objective as
minimization or maximization.
sliding_window_size: An int to specify the size of the sliding window
(for experimental purpose).
"""
self._result_type_name = result_type_name # for example 'mse' or 'mae'
self._mode = mode
self._init_loss = init_loss
# statistics needed for alg
self.observation_count = 0
self.resource_used = 0.0
self._loss_avg = 0.0
self._loss_cb = init_cb # a large number (TODO: this can be changed)
self._cb_coef = cb_coef if cb_coef is not None else self.CB_COEF
# optional statistics
self._sliding_window_size = sliding_window_size
self._loss_queue = collections.deque(maxlen=self._sliding_window_size)
def update_result(
self,
new_loss,
new_resource_used,
data_dimension,
bound_of_range=1.0,
new_observation_count=1.0,
):
"""Update result statistics."""
self.resource_used += new_resource_used
# keep the running average instead of the sum of losses to avoid overflow
self._loss_avg = self._loss_avg * (
self.observation_count / (self.observation_count + new_observation_count)
) + new_loss / (self.observation_count + new_observation_count)
self.observation_count += new_observation_count
self._loss_cb = self._update_loss_cb(bound_of_range, data_dimension)
self._loss_queue.append(new_loss)
def _update_loss_cb(self, bound_of_range, data_dim, bound_name="sample_complexity_bound"):
"""Calculate the coefficient of the confidence bound."""
if bound_name == "sample_complexity_bound":
# set the coefficient in the loss bound
if "mae" in self.result_type_name:
coef = self._cb_coef * bound_of_range
else:
coef = 0.001 * bound_of_range
comp_F = math.sqrt(data_dim)
n = self.observation_count
return coef * comp_F * math.sqrt((np.log10(n / OnlineResult.prob_delta)) / n)
else:
raise NotImplementedError
@property
def result_type_name(self):
return self._result_type_name
@property
def loss_avg(self):
return self._loss_avg if self.observation_count != 0 else self._init_loss
@property
def loss_cb(self):
return self._loss_cb
@property
def loss_lcb(self):
return max(self._loss_avg - self._loss_cb, OnlineResult.LOSS_MIN)
@property
def loss_ucb(self):
return min(self._loss_avg + self._loss_cb, OnlineResult.LOSS_MAX)
@property
def loss_avg_recent(self):
return sum(self._loss_queue) / len(self._loss_queue) if len(self._loss_queue) != 0 else self._init_loss
def get_score(self, score_name, cb_ratio=1):
if "lcb" in score_name:
return max(self._loss_avg - cb_ratio * self._loss_cb, OnlineResult.LOSS_MIN)
elif "ucb" in score_name:
return min(self._loss_avg + cb_ratio * self._loss_cb, OnlineResult.LOSS_MAX)
elif "avg" in score_name:
return self._loss_avg
else:
raise NotImplementedError
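# A worked example of the score accessors (hypothetical values): with _loss_avg == 0.2
# and _loss_cb == 0.05, get_score("loss_ucb") returns 0.25, get_score("loss_lcb")
# returns 0.15, and get_score("loss_avg") returns 0.2.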
class BaseOnlineTrial(Trial):
"""Class for the online trial."""
def __init__(
self,
config: dict,
min_resource_lease: float,
is_champion: Optional[bool] = False,
is_checked_under_current_champion: Optional[bool] = True,
custom_trial_name: Optional[str] = "mae",
trial_id: Optional[str] = None,
):
"""Constructor.
Args:
config: The configuration dictionary.
min_resource_lease: A float specifying the minimum resource lease.
is_champion: A bool variable indicating whether the trial is champion.
is_checked_under_current_champion: A bool indicating whether the trial
has been used under the current champion.
custom_trial_name: A string of a custom trial name.
trial_id: A string for the trial id.
"""
# ****basic variables
self.config = config
self.trial_id = trial_id
self.status = Trial.PENDING
self.start_time = time.time()
self.custom_trial_name = custom_trial_name
# ***resource budget related variable
self._min_resource_lease = min_resource_lease
self._resource_lease = copy.copy(self._min_resource_lease)
# ***champion related variables
self._is_champion = is_champion
# self._is_checked_under_current_champion is supposed to always be True when the trial is first created
self._is_checked_under_current_champion = is_checked_under_current_champion
@property
def is_champion(self):
return self._is_champion
@property
def is_checked_under_current_champion(self):
return self._is_checked_under_current_champion
@property
def resource_lease(self):
return self._resource_lease
def set_checked_under_current_champion(self, checked_under_current_champion: bool):
# This is needed because sometimes
# we want to know whether a trial has been paused since a new champion was promoted.
# We want to try to pause those running trials (even though they have not yet reached
# the next scheduling checkpoint according to resource used and resource lease),
# because a better trial is likely to be among the new challengers generated by the new
# champion, so we want to try them as soon as possible.
# If we wait until we reach the next scheduling point, we may waste a lot of resource
# (depending on the current resource lease) on the old trials (note that new
# trials cannot be scheduled to run until a slot opens).
# Intuitively speaking, we want to free up a slot as soon as possible once
# a new champion is promoted, so that we are able to try newly generated challengers.
self._is_checked_under_current_champion = checked_under_current_champion
def set_resource_lease(self, resource: float):
"""Sets the resource lease accordingly."""
self._resource_lease = resource
def set_status(self, status):
"""Sets the status of the trial and record the start time."""
self.status = status
if status == Trial.RUNNING:
if self.start_time is None:
self.start_time = time.time()
class VowpalWabbitTrial(BaseOnlineTrial):
"""The class for Vowpal Wabbit online trials."""
# NOTE: 1. About namespaces in vw:
# - Wiki in vw:
# https://github.com/VowpalWabbit/vowpal_wabbit/wiki/Namespaces
# - Namespace vs features:
# https://stackoverflow.com/questions/28586225/in-vowpal-wabbit-what-is-the-difference-between-a-namespace-and-feature
# About result:
# 1. training related results (need to be updated in the trainable class)
# 2. result about resources lease (need to be updated externally)
cost_unit = 1.0
interactions_config_key = "interactions"
MIN_RES_CONST = 5
def __init__(
self,
config: dict,
min_resource_lease: float,
metric: str = "mae",
is_champion: Optional[bool] = False,
is_checked_under_current_champion: Optional[bool] = True,
custom_trial_name: Optional[str] = "vw_mae_clipped",
trial_id: Optional[str] = None,
cb_coef: Optional[float] = None,
):
"""Constructor.
Args:
config (dict): the config of the trial (note that some hyperparameter
values, e.g., the namespace interactions, are sets).
min_resource_lease (float): the minimum resource lease.
metric (str): the loss metric.
is_champion (bool): indicates whether the trial is the current champion or not.
is_checked_under_current_champion (bool): indicates whether this trial has
been checked under the current champion.
trial_id (str): id of the trial (if None, it will be generated in the constructor).
"""
try:
from vowpalwabbit import pyvw
except ImportError:
raise ImportError("To use AutoVW, please run pip install flaml[vw] to install vowpalwabbit")
# attributes
self.trial_id = self._config_to_id(config) if trial_id is None else trial_id
logger.info("Create trial with trial_id: %s", self.trial_id)
super().__init__(
config,
min_resource_lease,
is_champion,
is_checked_under_current_champion,
custom_trial_name,
self.trial_id,
)
self.model = None # model is None until the config is scheduled to run
self.result = None
self.trainable_class = pyvw.vw
# variables that are needed during online training
self._metric = metric
self._y_min_observed = None
self._y_max_observed = None
# application dependent variables
self._dim = None
self._cb_coef = cb_coef
@staticmethod
def _config_to_id(config):
"""Generate an id for the provided config."""
# sort config keys
sorted_k_list = sorted(list(config.keys()))
config_id_full = ""
for key in sorted_k_list:
v = config[key]
config_id = "|"
if isinstance(v, set):
value_list = sorted(v)
config_id += "_".join([str(k) for k in value_list])
else:
config_id += str(v)
config_id_full = config_id_full + config_id
return config_id_full
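# A worked example (hypothetical config): sets are sorted and joined with '_', other
# values are stringified, and each key contributes a '|'-prefixed segment:
# _config_to_id({'interactions': {'ab', 'cd'}, 'learning_rate': 0.5})  # -> '|ab_cd|0.5'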
def _initialize_vw_model(self, vw_example):
"""Initialize a vw model using the trainable_class"""
self._vw_config = self.config.copy()
ns_interactions = self.config.get(VowpalWabbitTrial.interactions_config_key, None)
# ensure the feature interaction config is a list (required by VW)
if ns_interactions is not None:
self._vw_config[VowpalWabbitTrial.interactions_config_key] = list(ns_interactions)
# get the dimensionality of the feature according to the namespace configuration
namespace_feature_dim = get_ns_feature_dim_from_vw_example(vw_example)
self._dim = self._get_dim_from_ns(namespace_feature_dim, ns_interactions)
# construct an instance of vw model using the input config and fixed config
self.model = self.trainable_class(**self._vw_config)
self.result = OnlineResult(
self._metric,
cb_coef=self._cb_coef,
init_loss=0.0,
init_cb=100.0,
)
def train_eval_model_online(self, data_sample, y_pred):
"""Train and evaluate model online."""
# extract info needed the first time we see the data
if self._resource_lease == "auto" or self._resource_lease is None:
assert self._dim is not None
self._resource_lease = self._dim * self.MIN_RES_CONST
y = self._get_y_from_vw_example(data_sample)
self._update_y_range(y)
if self.model is None:
# initialize self.model and self.result
self._initialize_vw_model(data_sample)
# do one step of learning
self.model.learn(data_sample)
# update training related results accordingly
new_loss = self._get_loss(y, y_pred, self._metric, self._y_min_observed, self._y_max_observed)
# update sample size, sum of loss, and cost
data_sample_size = 1
bound_of_range = self._y_max_observed - self._y_min_observed
if bound_of_range == 0:
bound_of_range = 1.0
self.result.update_result(
new_loss,
VowpalWabbitTrial.cost_unit * data_sample_size,
self._dim,
bound_of_range,
)
def predict(self, x):
"""Predict using the model."""
if self.model is None:
# initialize self.model and self.result
self._initialize_vw_model(x)
return self.model.predict(x)
def _get_loss(self, y_true, y_pred, loss_func_name, y_min_observed, y_max_observed):
"""Get instantaneous loss from y_true and y_pred, and loss_func_name
For mae_clip, we clip y_pred in the observed range of y
"""
if "mse" in loss_func_name or "squared" in loss_func_name:
loss_func = mean_squared_error
elif "mae" in loss_func_name or "absolute" in loss_func_name:
loss_func = mean_absolute_error
if y_min_observed is not None and y_max_observed is not None and "clip" in loss_func_name:
# clip y_pred in the observed range of y
y_pred = min(y_max_observed, max(y_pred, y_min_observed))
else:
raise NotImplementedError
return loss_func([y_true], [y_pred])
def _update_y_range(self, y):
"""Maintain running observed minimum and maximum target value."""
if self._y_min_observed is None or y < self._y_min_observed:
self._y_min_observed = y
if self._y_max_observed is None or y > self._y_max_observed:
self._y_max_observed = y
@staticmethod
def _get_dim_from_ns(namespace_feature_dim: dict, namespace_interactions: Union[set, list]):
"""Get the dimensionality of the corresponding feature of input namespace set."""
total_dim = sum(namespace_feature_dim.values())
if namespace_interactions:
for f in namespace_interactions:
ns_dim = 1.0
for c in f:
ns_dim *= namespace_feature_dim[c]
total_dim += ns_dim
return total_dim
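# A worked example (hypothetical dims): with namespace_feature_dim={'a': 2, 'b': 3}
# and namespace_interactions={'ab'}, the total is 2 + 3 + 2*3 = 11.0.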
def clean_up_model(self):
self.model = None
self.result = None
@staticmethod
def _get_y_from_vw_example(vw_example):
"""Get y from a vw_example. this works for regression datasets."""
return float(vw_example.split("|")[0])

View File

@ -0,0 +1,534 @@
import numpy as np
import math
from flaml.tune import Trial
from flaml.tune.scheduler import TrialScheduler
import logging
logger = logging.getLogger(__name__)
class OnlineTrialRunner:
"""Class for the OnlineTrialRunner."""
# ************NOTE about the status of a trial***************
# Trial.PENDING: All trials are set to be pending when first added into the OnlineTrialRunner, until
#   they are selected to run. By this definition, a trial with status Trial.PENDING is a challenger
#   trial added to the OnlineTrialRunner but not yet selected to run.
#   It denotes the start of a trial's lifespan in the OnlineTrialRunner.
# Trial.RUNNING: It indicates that this trial is one of the concurrently running trials.
#   The max number of Trial.RUNNING trials is running_budget.
#   The status of a trial will be set to Trial.RUNNING the next time it is selected to run.
#   A trial's status may change as follows:
#   Trial.PENDING -> Trial.RUNNING
#   Trial.PAUSED -> Trial.RUNNING
# Trial.PAUSED: The status of a trial is set to Trial.PAUSED once it is removed from the running trials.
#   Trial.RUNNING -> Trial.PAUSED
# Trial.TERMINATED: Set the status of a trial to Trial.TERMINATED when it should never be selected again.
#   It denotes the real end of a trial's lifespan.
# Status change routine of a trial:
#   Trial.PENDING -> (Trial.RUNNING -> Trial.PAUSED -> Trial.RUNNING -> ...) -> Trial.TERMINATED (optional)
RANDOM_SEED = 123456
WARMSTART_NUM = 100
def __init__(
self, max_live_model_num: int, searcher=None, scheduler=None, champion_test_policy="loss_ucb", **kwargs
):
"""Constructor.
Args:
max_live_model_num: The maximum number of 'live'/running models allowed.
searcher: A class for generating Trial objects progressively.
The ConfigOracle is implemented in the searcher.
scheduler: A class for managing the 'live' trials and allocating the
resources for the trials.
champion_test_policy: A string to specify what test policy to test for
champion. Currently can choose from ['loss_ucb', 'loss_avg', 'loss_lcb', None].
"""
# ************A NOTE about the input searcher and scheduler******
# Required methods of the searcher:
# - next_trial()
# Generate the next trial to add.
# - set_search_properties(metric: Optional[str], mode: Optional[str],
# config: Optional[dict], setting: Optional[dict])
# Generate new challengers based on the current champion and update the challenger list
# - on_trial_result(trial_id: str, result: Dict)
# Report results to the searcher.
# Required methods of the scheduler:
# - on_trial_add(trial_runner, trial: Trial)
# It adds candidate trials to the scheduler. It is called inside of the add_trial
# function in the TrialRunner.
# - on_trial_remove(trial_runner, trial: Trial)
# Remove terminated trials from the scheduler.
# - on_trial_result(trial_runner, trial: Trial, result: Dict)
# Report results to the scheduler.
# - choose_trial_to_run(trial_runner) -> Optional[Trial]
# Among them, on_trial_result and choose_trial_to_run are the most important methods
# *****************************************************************
# OnlineTrialRunner setting
self._searcher = searcher
self._scheduler = scheduler
self._champion_test_policy = champion_test_policy
self._max_live_model_num = max_live_model_num
self._remove_worse = kwargs.get("remove_worse", True)
self._bound_trial_num = kwargs.get("bound_trial_num", False)
self._no_model_persistence = True
# stores all the trials added to the OnlineTrialRunner
# i.e., include the champion and all the challengers
self._trials = []
self._champion_trial = None
self._best_challenger_trial = None
self._first_challenger_pool_size = None
self._random_state = np.random.RandomState(self.RANDOM_SEED)
self._running_trials = set()
# initially schedule up to max_live_model_num of live models and
# set the first trial as the champion (which is done inside self.step())
self._total_steps = 0
logger.info("init step %s", self._max_live_model_num)
# TODO: add more comments
self.step()
assert self._champion_trial is not None
@property
def champion_trial(self) -> Trial:
"""The champion trial."""
return self._champion_trial
@property
def running_trials(self):
"""The running/'live' trials."""
return self._running_trials
def step(self, data_sample=None, prediction_trial_tuple=None):
"""Schedule one trial to run each time it is called.
Args:
data_sample: One data example.
prediction_trial_tuple: A tuple containing
(prediction_made, prediction_trial).
"""
# TODO: Will remove prediction_trial_tuple.
# NOTE: This function consists of the following several parts:
# * Update model:
# 0. Update running trials using observations received.
# * Tests for Champion:
# 1. Test for champion (BetterThan test, and WorseThan test)
# 1.1 BetterThan test
# 1.2 WorseThan test: a trial may be removed if the WorseThan test is triggered
# * Online Scheduling:
# 2. Report results to the searcher and scheduler (the scheduler will return a decision about
# the status of the running trials).
# 3. Pause or stop a trial according to the scheduler's decision.
# Add a trial into the OnlineTrialRunner if there are open slots.
# ***********Update running trials with observation*******************
if data_sample is not None:
self._total_steps += 1
prediction_made, prediction_trial = (
prediction_trial_tuple[0],
prediction_trial_tuple[1],
)
# assert prediction_trial.status == Trial.RUNNING
trials_to_pause = []
for trial in list(self._running_trials):
if trial != prediction_trial:
y_predicted = trial.predict(data_sample)
else:
y_predicted = prediction_made
trial.train_eval_model_online(data_sample, y_predicted)
logger.debug(
"running trial at iter %s %s %s %s %s %s",
self._total_steps,
trial.trial_id,
trial.result.loss_avg,
trial.result.loss_cb,
trial.result.resource_used,
trial.resource_lease,
)
# report result to the searcher
self._searcher.on_trial_result(trial.trial_id, trial.result)
# report result to the scheduler and the scheduler makes a decision about
# the running status of the trial
decision = self._scheduler.on_trial_result(self, trial, trial.result)
# set the status of the trial according to the decision made by the scheduler
logger.debug(
"trial decision %s %s at step %s",
decision,
trial.trial_id,
self._total_steps,
)
if decision == TrialScheduler.STOP:
self.stop_trial(trial)
elif decision == TrialScheduler.PAUSE:
trials_to_pause.append(trial)
else:
self.run_trial(trial)
# ***********Statistical test of champion*************************************
self._champion_test()
# Pause the trial after the tests because the tests involve resetting the trial's result
for trial in trials_to_pause:
self.pause_trial(trial)
# ***********Add and schedule new trials to run if there are open slots****
# Add trial if needed: add challengers into consideration through _add_trial_from_searcher()
# if there are available slots
for _ in range(self._max_live_model_num - len(self._running_trials)):
self._add_trial_from_searcher()
# Scheduling: schedule up to max_live_model_num number of trials to run
# (set the status as Trial.RUNNING)
while self._max_live_model_num > len(self._running_trials):
trial_to_run = self._scheduler.choose_trial_to_run(self)
if trial_to_run is not None:
self.run_trial(trial_to_run)
else:
break
def get_top_running_trials(self, top_ratio=None, top_metric="ucb") -> list:
"""Get a list of trial ids, whose performance is among the top running trials."""
running_valid_trials = [trial for trial in self._running_trials if trial.result is not None]
if not running_valid_trials:
return []
if top_ratio is None:
top_number = 0
elif isinstance(top_ratio, float):
top_number = math.ceil(len(running_valid_trials) * top_ratio)
elif isinstance(top_ratio, str) and "best" in top_ratio:
top_number = 1
else:
raise NotImplementedError
if "ucb" in top_metric:
test_attribute = "loss_ucb"
elif "avg" in top_metric:
test_attribute = "loss_avg"
elif "lcb" in top_metric:
test_attribute = "loss_lcb"
else:
raise NotImplementedError
top_running_valid_trials = []
logger.info("Running trial ids %s", [trial.trial_id for trial in running_valid_trials])
self._random_state.shuffle(running_valid_trials)
results = [trial.result.get_score(test_attribute) for trial in running_valid_trials]
# sorted result (small to large) index
sorted_index = np.argsort(np.array(results))
for i in range(min(top_number, len(running_valid_trials))):
top_running_valid_trials.append(running_valid_trials[sorted_index[i]])
logger.info("Top running ids %s", [trial.trial_id for trial in top_running_valid_trials])
return top_running_valid_trials
def _add_trial_from_searcher(self):
"""Add a new trial to this TrialRunner.
NOTE:
The new trial is acquired from the input search algorithm, i.e. self._searcher.
A 'new' trial means the trial is not in self._trials.
"""
# (optionally) upper bound the number of trials in the OnlineTrialRunner
if self._bound_trial_num and self._first_challenger_pool_size is not None:
active_trial_size = len([t for t in self._trials if t.status != Trial.TERMINATED])
trial_num_upper_bound = (
int(round((np.log10(self._total_steps) + 1) * self._first_challenger_pool_size))
if self._first_challenger_pool_size
else np.inf
)
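# the bound on the number of active trials grows logarithmically with the total steps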
if active_trial_size > trial_num_upper_bound:
logger.info(
"Not adding new trials: %s exceeds trial limit %s.",
active_trial_size,
trial_num_upper_bound,
)
return None
# output one trial from the trial pool (new challenger pool) maintained in the searcher
# Assumption on the searcher: when all frontiers (i.e., all the challengers generated
# based on the current champion) of the current champion are added, calling next_trial()
# will return None
trial = self._searcher.next_trial()
if trial is not None:
self.add_trial(trial) # dup checked in add_trial
# the champion_trial is initially None, so we need to set it up the first time
# a valid trial is added.
# Assumption on self._searcher: the first trial generated is the champion trial
if self._champion_trial is None:
logger.info("Initial set up of the champion trial %s", trial.config)
self._set_champion(trial)
else:
self._all_new_challengers_added = True
if self._first_challenger_pool_size is None:
self._first_challenger_pool_size = len(self._trials)
def _champion_test(self):
"""Perform tests again the latest champion, including bette_than tests and worse_than tests"""
# for BetterThan test, we only need to compare the best challenger with the champion
self._get_best_challenger()
if self._best_challenger_trial is not None:
assert self._best_challenger_trial.trial_id != self._champion_trial.trial_id
# test whether a new champion is found and set the trial properties accordingly
is_new_champion_found = self._better_than_champion_test(self._best_challenger_trial)
if is_new_champion_found:
self._set_champion(new_champion_trial=self._best_challenger_trial)
# performs _worse_than_champion_test, which is an optional component in ChaCha
if self._remove_worse:
to_stop = []
for trial_to_test in self._trials:
if trial_to_test.status != Trial.TERMINATED:
worse_than_champion = self._worse_than_champion_test(
self._champion_trial, trial_to_test, self.WARMSTART_NUM
)
if worse_than_champion:
to_stop.append(trial_to_test)
# we want to ensure that at least max_live_model_num non-terminated trials remain
max_to_stop_num = len([t for t in self._trials if t.status != Trial.TERMINATED]) - self._max_live_model_num
for i in range(min(max_to_stop_num, len(to_stop))):
self.stop_trial(to_stop[i])
def _get_best_challenger(self):
"""Get the 'best' (in terms of the champion_test_policy) challenger under consideration."""
if self._champion_test_policy is None:
return
if "ucb" in self._champion_test_policy:
test_attribute = "loss_ucb"
elif "avg" in self._champion_test_policy:
test_attribute = "loss_avg"
else:
raise NotImplementedError
active_trials = [
trial
for trial in self._trials
if (
trial.status != Trial.TERMINATED
and trial.trial_id != self._champion_trial.trial_id
and trial.result is not None
)
]
if active_trials:
self._random_state.shuffle(active_trials)
results = [trial.result.get_score(test_attribute) for trial in active_trials]
best_index = np.argmin(results)
self._best_challenger_trial = active_trials[best_index]
def _set_champion(self, new_champion_trial):
"""Set the status of the existing trials once a new champion is found."""
assert new_champion_trial is not None
is_init_update = False
if self._champion_trial is None:
is_init_update = True
self.run_trial(new_champion_trial)
# set the checked_under_current_champion status of the trials
for trial in self._trials:
if trial.trial_id == new_champion_trial.trial_id:
trial.set_checked_under_current_champion(True)
else:
trial.set_checked_under_current_champion(False)
self._champion_trial = new_champion_trial
self._all_new_challengers_added = False
logger.info("Set the champion as %s", self._champion_trial.trial_id)
if not is_init_update:
self._champion_update_times += 1
# calling set_search_properties of searcher will trigger
# new challenger generation. we do not do this for init champion
# as this step is already done when first constructing the searcher
self._searcher.set_search_properties(setting={self._searcher.CHAMPION_TRIAL_NAME: self._champion_trial})
else:
self._champion_update_times = 0
def get_trials(self) -> list:
"""Return the list of trials managed by this TrialRunner."""
return self._trials
def add_trial(self, new_trial):
"""Add a new trial to this TrialRunner.
Trials may be added at any time.
Args:
new_trial (Trial): Trial to queue.
"""
# Only add the new trial when it does not exist (according to the trial_id, which is
# the signature of the trial) in self._trials.
for trial in self._trials:
if trial.trial_id == new_trial.trial_id:
trial.set_checked_under_current_champion(True)
return
logger.info(
"adding trial at iter %s, %s %s",
self._total_steps,
new_trial.trial_id,
len(self._trials),
)
self._trials.append(new_trial)
self._scheduler.on_trial_add(self, new_trial)
def stop_trial(self, trial):
"""Stop a trial: set the status of a trial to be
Trial.TERMINATED and perform other subsequent operations.
"""
if trial.status in [Trial.ERROR, Trial.TERMINATED]:
return
else:
logger.info(
"Terminating trial %s, with trial result %s",
trial.trial_id,
trial.result,
)
trial.set_status(Trial.TERMINATED)
# clean up model and result
trial.clean_up_model()
self._scheduler.on_trial_remove(self, trial)
self._searcher.on_trial_complete(trial.trial_id)
self._running_trials.remove(trial)
def pause_trial(self, trial):
"""Pause a trial: set the status of a trial to be Trial.PAUSED
and perform other subsequent operations.
"""
if trial.status in [Trial.ERROR, Trial.TERMINATED]:
return
else:
logger.info(
"Pausing trial %s, with trial loss_avg: %s, loss_cb: %s, loss_ucb: %s,\
resource_lease: %s",
trial.trial_id,
trial.result.loss_avg,
trial.result.loss_cb,
trial.result.loss_avg + trial.result.loss_cb,
trial.resource_lease,
)
trial.set_status(Trial.PAUSED)
# clean up model and result if no model persistence
if self._no_model_persistence:
trial.clean_up_model()
self._running_trials.remove(trial)
def run_trial(self, trial):
"""Run a trial: set the status of a trial to be Trial.RUNNING
and perform other subsequent operations.
"""
if trial.status in [Trial.ERROR, Trial.TERMINATED]:
return
else:
trial.set_status(Trial.RUNNING)
self._running_trials.add(trial)
def _better_than_champion_test(self, trial_to_test):
"""Test whether there is a config in the existing trials that
is better than the current champion config.
Returns:
A bool indicating whether a new champion is found.
"""
if trial_to_test.result is not None and self._champion_trial.result is not None:
if "ucb" in self._champion_test_policy:
return self._test_lcb_ucb(self._champion_trial, trial_to_test, self.WARMSTART_NUM)
elif "avg" in self._champion_test_policy:
return self._test_avg_loss(self._champion_trial, trial_to_test, self.WARMSTART_NUM)
elif "martingale" in self._champion_test_policy:
return self._test_martingale(self._champion_trial, trial_to_test)
else:
raise NotImplementedError
else:
return False
@staticmethod
def _worse_than_champion_test(champion_trial, trial, warmstart_num=1) -> bool:
"""Test whether the input trial is worse than the champion_trial"""
if trial.result is not None and trial.result.resource_used >= warmstart_num:
if trial.result.loss_lcb > champion_trial.result.loss_ucb:
logger.info(
"=========trial %s is worse than champion %s=====",
trial.trial_id,
champion_trial.trial_id,
)
logger.info("trial %s %s %s", trial.config, trial.result, trial.resource_lease)
logger.info(
"trial loss_avg:%s, trial loss_cb %s",
trial.result.loss_avg,
trial.result.loss_cb,
)
logger.info(
"champion loss_avg:%s, champion loss_cb %s",
champion_trial.result.loss_avg,
champion_trial.result.loss_cb,
)
logger.info("champion %s", champion_trial.config)
logger.info(
"trial loss_avg_recent:%s, trial loss_cb %s",
trial.result.loss_avg_recent,
trial.result.loss_cb,
)
logger.info(
"champion loss_avg_recent:%s, champion loss_cb %s",
champion_trial.result.loss_avg_recent,
champion_trial.result.loss_cb,
)
return True
return False
@staticmethod
def _test_lcb_ucb(champion_trial, trial, warmstart_num=1) -> bool:
"""Comare the challenger(i.e., trial)'s loss upper bound with
champion_trial's loss lower bound - cb
"""
assert trial.trial_id != champion_trial.trial_id
if trial.result.resource_used >= warmstart_num:
if trial.result.loss_ucb < champion_trial.result.loss_lcb - champion_trial.result.loss_cb:
logger.info("======new champion condition satisfied: using lcb vs ucb=====")
logger.info(
"new champion trial %s %s %s",
trial.trial_id,
trial.result.resource_used,
trial.resource_lease,
)
logger.info(
"new champion trial loss_avg:%s, trial loss_cb %s",
trial.result.loss_avg,
trial.result.loss_cb,
)
logger.info(
"old champion trial %s %s %s",
champion_trial.trial_id,
champion_trial.result.resource_used,
champion_trial.resource_lease,
)
logger.info(
"old champion loss avg %s, loss cb %s",
champion_trial.result.loss_avg,
champion_trial.result.loss_cb,
)
return True
return False
@staticmethod
def _test_avg_loss(champion_trial, trial, warmstart_num=1) -> bool:
"""Comare the challenger(i.e., trial)'s average loss with the
champion_trial's average loss
"""
assert trial.trial_id != champion_trial.trial_id
if trial.result.resource_used >= warmstart_num:
if trial.result.loss_avg < champion_trial.result.loss_avg:
logger.info("=====new champion condition satisfied using avg loss=====")
logger.info("trial %s", trial.config)
logger.info(
"trial loss_avg:%s, trial loss_cb %s",
trial.result.loss_avg,
trial.result.loss_cb,
)
logger.info(
"champion loss_avg:%s, champion loss_cb %s",
champion_trial.result.loss_avg,
champion_trial.result.loss_cb,
)
logger.info("champion %s", champion_trial.config)
return True
return False
@staticmethod
def _test_martingale(champion_trial, trial):
"""Comare the challenger and champion using confidence sequence based
test martingale
Not implementated yet
"""
NotImplementedError
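# A minimal usage sketch (assumed names: `my_searcher` and `my_scheduler` are objects
# implementing the searcher/scheduler interfaces documented in the constructor above):
#
#   runner = OnlineTrialRunner(max_live_model_num=5, searcher=my_searcher, scheduler=my_scheduler)
#   for x in data_stream:
#       champion = runner.champion_trial
#       y_hat = champion.predict(x)
#       runner.step(data_sample=x, prediction_trial_tuple=(y_hat, champion))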

217
flaml/tune/README.md Normal file
View File

@ -0,0 +1,217 @@
# Economical Hyperparameter Optimization
`flaml.tune` is a module for economical hyperparameter tuning. It frees users from manually tuning many hyperparameters of a piece of software, such as a machine learning training procedure.
It can be used standalone, or together with ray tune or nni. Please find detailed guidelines and use cases about this module in our [documentation website](https://microsoft.github.io/FLAML/docs/Use-Cases/Tune-User-Defined-Function).
Below are some quick examples.
* Example for sequential tuning (recommended when compute resources are limited and each trial can consume all of them):
```python
# require: pip install flaml[blendsearch]
from flaml import tune
import time
def evaluate_config(config):
'''evaluate a hyperparameter configuration'''
# we use a toy example with 2 hyperparameters
metric = (round(config['x'])-85000)**2 - config['x']/config['y']
# usually the evaluation takes a non-negligible cost
# and the cost could be related to certain hyperparameters
# in this example, we assume it's proportional to x
time.sleep(config['x']/100000)
# use tune.report to report the metric to optimize
tune.report(metric=metric)
analysis = tune.run(
evaluate_config, # the function to evaluate a config
config={
'x': tune.lograndint(lower=1, upper=100000),
'y': tune.randint(lower=1, upper=100000)
}, # the search space
low_cost_partial_config={'x':1}, # an initial (partial) config with low cost
metric='metric', # the name of the metric used for optimization
mode='min', # the optimization mode, 'min' or 'max'
num_samples=-1, # the maximal number of configs to try, -1 means infinite
time_budget_s=60, # the time budget in seconds
local_dir='logs/', # the local directory to store logs
# verbose=0, # verbosity
# use_ray=True, # uncomment when performing parallel tuning using ray
)
print(analysis.best_trial.last_result) # the best trial's result
print(analysis.best_config) # the best config
```
* Example for using ray tune's API:
```python
# require: pip install flaml[blendsearch,ray]
from ray import tune as raytune
from flaml import CFO, BlendSearch
import time
def evaluate_config(config):
'''evaluate a hyperparameter configuration'''
# we use a toy example with 2 hyperparameters
metric = (round(config['x'])-85000)**2 - config['x']/config['y']
# usually the evaluation takes a non-negligible cost
# and the cost could be related to certain hyperparameters
# in this example, we assume it's proportional to x
time.sleep(config['x']/100000)
# use raytune.report to report the metric to optimize
raytune.report(metric=metric)
# provide a time budget (in seconds) for the tuning process
time_budget_s = 60
# provide the search space
config_search_space = {
'x': raytune.lograndint(lower=1, upper=100000),
'y': raytune.randint(lower=1, upper=100000)
}
# provide the low cost partial config
low_cost_partial_config={'x':1}
# set up CFO
cfo = CFO(low_cost_partial_config=low_cost_partial_config)
# set up BlendSearch
blendsearch = BlendSearch(
metric="metric", mode="min",
space=config_search_space,
low_cost_partial_config=low_cost_partial_config,
time_budget_s=time_budget_s
)
# NOTE: when using BlendSearch as a search_alg in ray tune, you need to
# configure the 'time_budget_s' for BlendSearch accordingly such that
# BlendSearch is aware of the time budget. This step is not needed when
# BlendSearch is used as the search_alg in flaml.tune as it is done
# automatically in flaml.
analysis = raytune.run(
evaluate_config, # the function to evaluate a config
config=config_search_space,
metric='metric', # the name of the metric used for optimization
mode='min', # the optimization mode, 'min' or 'max'
num_samples=-1, # the maximal number of configs to try, -1 means infinite
time_budget_s=time_budget_s, # the time budget in seconds
local_dir='logs/', # the local directory to store logs
search_alg=blendsearch # or cfo
)
print(analysis.best_trial.last_result) # the best trial's result
print(analysis.best_config) # the best config
```
* Example for using NNI: An example of using BlendSearch with NNI can be seen in [test](https://github.com/microsoft/FLAML/tree/main/test/nni). CFO can be used as well in a similar manner. To run the example, first make sure you have [NNI](https://nni.readthedocs.io/en/stable/) installed, then run:
```shell
$ nnictl create --config ./config.yml
```
* For more examples, please check out
[notebooks](https://github.com/microsoft/FLAML/tree/main/notebook/).
`flaml` offers two HPO methods: CFO and BlendSearch.
`flaml.tune` uses BlendSearch by default.
## CFO: Frugal Optimization for Cost-related Hyperparameters
<p align="center">
<img src="https://github.com/microsoft/FLAML/blob/main/website/docs/Use-Cases/images/CFO.png" width=200>
<br>
</p>
CFO uses the randomized direct search method FLOW<sup>2</sup> with adaptive stepsize and random restart.
It requires a low-cost initial point as input if such a point exists.
The search begins with the low-cost initial point and gradually moves to
the high-cost region if needed. The local search method has a provable convergence
rate and bounded cost.
About FLOW<sup>2</sup>: FLOW<sup>2</sup> is a simple yet effective randomized direct search method.
It is an iterative optimization method that can optimize black-box functions.
FLOW<sup>2</sup> only requires pairwise comparisons between function values to perform iterative updates. Compared to existing HPO methods, FLOW<sup>2</sup> has the following appealing properties:
1. It is applicable to general black-box functions with a good convergence rate in terms of loss.
1. It provides theoretical guarantees on the total evaluation cost incurred.
The GIFs attached below demonstrate an example search trajectory of FLOW<sup>2</sup> shown in the loss and evaluation cost (i.e., the training time) space, respectively. From the demonstration, we can see that (1) FLOW<sup>2</sup> can quickly move toward the low-loss region, showing a good convergence property, and (2) FLOW<sup>2</sup> tends to avoid exploring the high-cost region until necessary.
<p align="center">
<img align="center", src="https://github.com/microsoft/FLAML/blob/website/docs/Use-Cases/images/heatmap_loss_cfo_12s.gif" width=360> <img align="center", src="https://github.com/microsoft/FLAML/blob/main/website/docs/Use-Cases/images/heatmap_cost_cfo_12s.gif" width=360>
<br>
<figcaption>Figure 1. FLOW<sup>2</sup> in tuning the # of leaves and the # of trees for XGBoost. The two background heatmaps show the loss and cost distribution of all configurations. The black dots are the points evaluated in FLOW<sup>2</sup>. Black dots connected by lines are points that yield better loss performance when evaluated.</figcaption>
</p>
Example:
```python
from flaml import CFO
tune.run(...
search_alg = CFO(low_cost_partial_config=low_cost_partial_config),
)
```
Recommended scenario: there exist cost-related hyperparameters and a low-cost
initial point is known before optimization.
If the search space is complex and CFO gets trapped in local optima, consider
using BlendSearch.
## BlendSearch: Economical Hyperparameter Optimization With Blended Search Strategy
<p align="center">
<img src="https://github.com/microsoft/FLAML/blob/main/website/docs/Use-Cases/images/BlendSearch.png" width=200>
<br>
</p>
BlendSearch combines local search with global search. It leverages the frugality
of CFO and the space exploration ability of global search methods such as
Bayesian optimization. Like CFO, BlendSearch requires a low-cost initial point
as input if such point exists, and starts the search from there. Different from
CFO, BlendSearch will not wait for the local search to fully converge before
trying new start points. The new start points are suggested by the global search
method and filtered based on their distance to the existing points in the
cost-related dimensions. BlendSearch still gradually increases the trial cost.
It prioritizes among the global search thread and multiple local search threads
based on optimism in face of uncertainty.
Example:
```python
# require: pip install flaml[blendsearch]
from flaml import BlendSearch
tune.run(...
search_alg = BlendSearch(low_cost_partial_config=low_cost_partial_config),
)
```
* Recommended scenario: cost-related hyperparameters exist, a low-cost
initial point is known, and the search space is complex such that local search
is prone to getting stuck at local optima.
* Suggestion about using a larger search space in BlendSearch:
In hyperparameter optimization, a larger search space is desirable because it is more likely to include the optimal configuration (or one of the optimal configurations) in hindsight. However, the performance (especially the anytime performance) of most existing HPO methods degrades when the cost of the configurations in the search space varies widely. Thus, hand-crafted small search spaces (with relatively homogeneous cost) are often used in practice for these methods, which is subjective and error-prone. BlendSearch combines the benefits of local search and global search, which enables a smart (economical) way of deciding where to explore in the search space even when it is larger than necessary. This allows users to specify a larger search space in BlendSearch, which is often easier and a better practice than narrowing down the search space by hand.
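As an illustration, below is a minimal sketch using the same toy objective as the earlier examples; the widened bounds are arbitrary, chosen only to make the space much larger than strictly needed:
```python
# require: pip install flaml[blendsearch]
from flaml import tune
import time
def evaluate_config(config):
    '''evaluate a hyperparameter configuration (same toy objective as above)'''
    metric = (round(config['x'])-85000)**2 - config['x']/config['y']
    # evaluation cost grows with x, as in the earlier examples
    time.sleep(config['x']/100000)
    tune.report(metric=metric)
# a deliberately wide search space; BlendSearch (the default in flaml.tune)
# still starts from the cheap region thanks to low_cost_partial_config
analysis = tune.run(
    evaluate_config,
    config={
        'x': tune.lograndint(lower=1, upper=10**8),
        'y': tune.randint(lower=1, upper=10**6),
    },
    low_cost_partial_config={'x': 1},  # anchors the search at a low-cost start
    metric='metric',
    mode='min',
    num_samples=-1,
    time_budget_s=60,
    local_dir='logs/',
)
print(analysis.best_config)
```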
For more technical details, please check our papers.
* [Frugal Optimization for Cost-related Hyperparameters](https://arxiv.org/abs/2005.01571). Qingyun Wu, Chi Wang, Silu Huang. AAAI 2021.
```bibtex
@inproceedings{wu2021cfo,
title={Frugal Optimization for Cost-related Hyperparameters},
author={Qingyun Wu and Chi Wang and Silu Huang},
year={2021},
booktitle={AAAI'21},
}
```
* [Economical Hyperparameter Optimization With Blended Search Strategy](https://www.microsoft.com/en-us/research/publication/economical-hyperparameter-optimization-with-blended-search-strategy/). Chi Wang, Qingyun Wu, Silu Huang, Amin Saied. ICLR 2021.
```bibtex
@inproceedings{wang2021blendsearch,
title={Economical Hyperparameter Optimization With Blended Search Strategy},
author={Chi Wang and Qingyun Wu and Silu Huang and Amin Saied},
year={2021},
booktitle={ICLR'21},
}
```

Some files were not shown because too many files have changed in this diff.