autogen/notebook/integrate_sklearn.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Copyright (c) 2021. All rights reserved.\n",
    "\n",
    "Contributed by: @bnriiitb\n",
    "\n",
    "Licensed under the MIT License."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Using AutoML in Sklearn Pipeline\n",
    "\n",
    "This tutorial shows how FLAML's AutoML can be used as the final estimator step in a scikit-learn pipeline."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "## 1. Introduction\n",
    "\n",
    "### 1.1 FLAML - Fast and Lightweight AutoML\n",
    "\n",
    "FLAML is a Python library (https://github.com/microsoft/FLAML) designed to automatically produce accurate machine learning models with low computational cost. It is fast and economical. Its simple and lightweight design makes it easy to use and extend, for example by adding new learners.\n",
    "\n",
    "FLAML can \n",
    "- serve as an economical AutoML engine,\n",
    "- be used as a fast hyperparameter tuning tool, or \n",
    "- be embedded in self-tuning software that requires low latency and low resource usage in repetitive\n",
    "   tuning tasks.\n",
    "\n",
    "In this notebook, we use a real-data example (binary classification) to showcase how to use the FLAML library.\n",
    "\n",
    "FLAML requires `Python>=3.7`. To run this notebook example, please install flaml with the `[automl]` option (this option was introduced in version 2; in version 1 it is installed by default):\n",
    "```bash\n",
    "pip install flaml[automl]\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "metadata": {},
   "outputs": [],
   "source": [
    "%pip install flaml[automl] openml"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 1.2 Why are pipelines a silver bullet?\n",
    "\n",
    "In a typical machine learning workflow we have to apply all the transformations at least twice. \n",
    "1. During Training\n",
    "2. During Inference\n",
    "\n",
    "Scikit-learn pipelines provide an easy-to-use interface to automate ML workflows by allowing several transformers to be chained together. \n",
    "\n",
    "The key benefits of using pipelines:\n",
    "* Make ML workflows highly readable, enabling fast development and easy review\n",
    "* Help to build sequential and parallel processes\n",
    "* Allow hyperparameter tuning across the estimators\n",
    "* Make it easier to share and collaborate with multiple users (bug fixes, enhancements, etc.)\n",
    "* Enforce the implementation and order of steps"
   ]
  },
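  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The training and inference steps above can be sketched with a small, self-contained pipeline. This is an illustration only: the synthetic data, the `LogisticRegression` final step, and the names `demo_pipeline`, `X_fit` and `X_new` are assumptions made for this sketch and are not used elsewhere in the notebook.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "from sklearn.impute import SimpleImputer\n",
    "from sklearn.linear_model import LogisticRegression\n",
    "from sklearn.pipeline import Pipeline\n",
    "from sklearn.preprocessing import StandardScaler\n",
    "\n",
    "# Synthetic data (an assumption for illustration): two features, a few missing values.\n",
    "rng = np.random.default_rng(0)\n",
    "X = rng.normal(size=(200, 2))\n",
    "y = (X[:, 0] + X[:, 1] > 0).astype(int)\n",
    "X[rng.random(X.shape) < 0.05] = np.nan  # inject missing values after computing labels\n",
    "\n",
    "X_fit, X_new = X[:150], X[150:]\n",
    "y_fit = y[:150]\n",
    "\n",
    "demo_pipeline = Pipeline([\n",
    "    ('imputer', SimpleImputer()),\n",
    "    ('standardizer', StandardScaler()),\n",
    "    ('classifier', LogisticRegression()),\n",
    "])\n",
    "\n",
    "# Training: each transformer is fitted and applied, then the classifier is fitted.\n",
    "demo_pipeline.fit(X_fit, y_fit)\n",
    "\n",
    "# Inference: the already-fitted transformers are re-applied before predicting.\n",
    "predictions = demo_pipeline.predict(X_new)\n",
    "```\n",
    "\n",
    "Because the imputer and scaler live inside the pipeline, the statistics learned during `fit` are reused at prediction time without any extra bookkeeping."
   ]
  },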
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Because FLAML's AutoML module can be used as a step in a scikit-learn pipeline, we get all the benefits of pipelines and can write extremely clean, reusable code."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. Classification Example\n",
    "### Load data and preprocess\n",
    "\n",
    "Download [Airlines dataset](https://www.openml.org/d/1169) from OpenML. The task is to predict whether a given flight will be delayed, given the information of the scheduled departure."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "download dataset from openml\n",
      "Dataset name: airlines\n",
      "X_train.shape: (404537, 7), y_train.shape: (404537,);\n",
      "X_test.shape: (134846, 7), y_test.shape: (134846,)\n"
     ]
    }
   ],
   "source": [
    "from flaml.data import load_openml_dataset\n",
    "X_train, X_test, y_train, y_test = load_openml_dataset(\n",
    "    dataset_id=1169, data_dir='./', random_state=1234, dataset_format='array')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([  12., 2648.,    4.,   15.,    4.,  450.,   67.], dtype=float32)"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "X_train[0]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3. Create a Pipeline"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<style>#sk-container-id-1 {color: black;background-color: white;}#sk-container-id-1 pre{padding: 0;}#sk-container-id-1 div.sk-toggleable {background-color: white;}#sk-container-id-1 label.sk-toggleable__label {cursor: pointer;display: block;width: 100%;margin-bottom: 0;padding: 0.3em;box-sizing: border-box;text-align: center;}#sk-container-id-1 label.sk-toggleable__label-arrow:before {content: \"▸\";float: left;margin-right: 0.25em;color: #696969;}#sk-container-id-1 label.sk-toggleable__label-arrow:hover:before {color: black;}#sk-container-id-1 div.sk-estimator:hover label.sk-toggleable__label-arrow:before {color: black;}#sk-container-id-1 div.sk-toggleable__content {max-height: 0;max-width: 0;overflow: hidden;text-align: left;background-color: #f0f8ff;}#sk-container-id-1 div.sk-toggleable__content pre {margin: 0.2em;color: black;border-radius: 0.25em;background-color: #f0f8ff;}#sk-container-id-1 input.sk-toggleable__control:checked~div.sk-toggleable__content {max-height: 200px;max-width: 100%;overflow: auto;}#sk-container-id-1 input.sk-toggleable__control:checked~label.sk-toggleable__label-arrow:before {content: \"▾\";}#sk-container-id-1 div.sk-estimator input.sk-toggleable__control:checked~label.sk-toggleable__label {background-color: #d4ebff;}#sk-container-id-1 div.sk-label input.sk-toggleable__control:checked~label.sk-toggleable__label {background-color: #d4ebff;}#sk-container-id-1 input.sk-hidden--visually {border: 0;clip: rect(1px 1px 1px 1px);clip: rect(1px, 1px, 1px, 1px);height: 1px;margin: -1px;overflow: hidden;padding: 0;position: absolute;width: 1px;}#sk-container-id-1 div.sk-estimator {font-family: monospace;background-color: #f0f8ff;border: 1px dotted black;border-radius: 0.25em;box-sizing: border-box;margin-bottom: 0.5em;}#sk-container-id-1 div.sk-estimator:hover {background-color: #d4ebff;}#sk-container-id-1 div.sk-parallel-item::after {content: \"\";width: 100%;border-bottom: 1px solid gray;flex-grow: 1;}#sk-container-id-1 
div.sk-label:hover label.sk-toggleable__label {background-color: #d4ebff;}#sk-container-id-1 div.sk-serial::before {content: \"\";position: absolute;border-left: 1px solid gray;box-sizing: border-box;top: 0;bottom: 0;left: 50%;z-index: 0;}#sk-container-id-1 div.sk-serial {display: flex;flex-direction: column;align-items: center;background-color: white;padding-right: 0.2em;padding-left: 0.2em;position: relative;}#sk-container-id-1 div.sk-item {position: relative;z-index: 1;}#sk-container-id-1 div.sk-parallel {display: flex;align-items: stretch;justify-content: center;background-color: white;position: relative;}#sk-container-id-1 div.sk-item::before, #sk-container-id-1 div.sk-parallel-item::before {content: \"\";position: absolute;border-left: 1px solid gray;box-sizing: border-box;top: 0;bottom: 0;left: 50%;z-index: -1;}#sk-container-id-1 div.sk-parallel-item {display: flex;flex-direction: column;z-index: 1;position: relative;background-color: white;}#sk-container-id-1 div.sk-parallel-item:first-child::after {align-self: flex-end;width: 50%;}#sk-container-id-1 div.sk-parallel-item:last-child::after {align-self: flex-start;width: 50%;}#sk-container-id-1 div.sk-parallel-item:only-child::after {width: 0;}#sk-container-id-1 div.sk-dashed-wrapped {border: 1px dashed gray;margin: 0 0.4em 0.5em 0.4em;box-sizing: border-box;padding-bottom: 0.4em;background-color: white;}#sk-container-id-1 div.sk-label label {font-family: monospace;font-weight: bold;display: inline-block;line-height: 1.2em;}#sk-container-id-1 div.sk-label-container {text-align: center;}#sk-container-id-1 div.sk-container {/* jupyter's `normalize.less` sets `[hidden] { display: none; }` but bootstrap.min.css set `[hidden] { display: none !important; }` so we also need the `!important` here to be able to override the default hidden behavior on the sphinx rendered scikit-learn.org. 
See: https://github.com/scikit-learn/scikit-learn/issues/21755 */display: inline-block !important;position: relative;}#sk-container-id-1 div.sk-text-repr-fallback {display: none;}</style><div id=\"sk-container-id-1\" class=\"sk-top-container\"><div class=\"sk-text-repr-fallback\"><pre>Pipeline(steps=[(&#x27;imputuer&#x27;, SimpleImputer()),\n",
       "                (&#x27;standardizer&#x27;, StandardScaler()),\n",
       "                (&#x27;automl&#x27;,\n",
       "                 AutoML(append_log=False, auto_augment=True, custom_hp={},\n",
       "                        early_stop=False, ensemble=False, estimator_list=&#x27;auto&#x27;,\n",
       "                        eval_method=&#x27;auto&#x27;, fit_kwargs_by_estimator={},\n",
       "                        hpo_method=&#x27;auto&#x27;, keep_search_state=False,\n",
       "                        learner_selector=&#x27;sample&#x27;, log_file_name=&#x27;&#x27;,\n",
       "                        log_training_metric=False, log_type=&#x27;better&#x27;,\n",
       "                        max_iter=None, mem_thres=4294967296, metric=&#x27;auto&#x27;,\n",
       "                        metric_constraints=[], min_sample_size=10000,\n",
       "                        model_history=False, n_concurrent_trials=1, n_jobs=-1,\n",
       "                        n_splits=5, pred_time_limit=inf, retrain_full=True,\n",
       "                        sample=True, split_ratio=0.1, split_type=&#x27;auto&#x27;,\n",
       "                        starting_points=&#x27;static&#x27;, task=&#x27;classification&#x27;, ...))])</pre><b>In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook. <br />On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.</b></div><div class=\"sk-container\" hidden><div class=\"sk-item sk-dashed-wrapped\"><div class=\"sk-label-container\"><div class=\"sk-label sk-toggleable\"><input class=\"sk-toggleable__control sk-hidden--visually\" id=\"sk-estimator-id-1\" type=\"checkbox\" ><label for=\"sk-estimator-id-1\" class=\"sk-toggleable__label sk-toggleable__label-arrow\">Pipeline</label><div class=\"sk-toggleable__content\"><pre>Pipeline(steps=[(&#x27;imputuer&#x27;, SimpleImputer()),\n",
       "                (&#x27;standardizer&#x27;, StandardScaler()),\n",
       "                (&#x27;automl&#x27;,\n",
       "                 AutoML(append_log=False, auto_augment=True, custom_hp={},\n",
       "                        early_stop=False, ensemble=False, estimator_list=&#x27;auto&#x27;,\n",
       "                        eval_method=&#x27;auto&#x27;, fit_kwargs_by_estimator={},\n",
       "                        hpo_method=&#x27;auto&#x27;, keep_search_state=False,\n",
       "                        learner_selector=&#x27;sample&#x27;, log_file_name=&#x27;&#x27;,\n",
       "                        log_training_metric=False, log_type=&#x27;better&#x27;,\n",
       "                        max_iter=None, mem_thres=4294967296, metric=&#x27;auto&#x27;,\n",
       "                        metric_constraints=[], min_sample_size=10000,\n",
       "                        model_history=False, n_concurrent_trials=1, n_jobs=-1,\n",
       "                        n_splits=5, pred_time_limit=inf, retrain_full=True,\n",
       "                        sample=True, split_ratio=0.1, split_type=&#x27;auto&#x27;,\n",
       "                        starting_points=&#x27;static&#x27;, task=&#x27;classification&#x27;, ...))])</pre></div></div></div><div class=\"sk-serial\"><div class=\"sk-item\"><div class=\"sk-estimator sk-toggleable\"><input class=\"sk-toggleable__control sk-hidden--visually\" id=\"sk-estimator-id-2\" type=\"checkbox\" ><label for=\"sk-estimator-id-2\" class=\"sk-toggleable__label sk-toggleable__label-arrow\">SimpleImputer</label><div class=\"sk-toggleable__content\"><pre>SimpleImputer()</pre></div></div></div><div class=\"sk-item\"><div class=\"sk-estimator sk-toggleable\"><input class=\"sk-toggleable__control sk-hidden--visually\" id=\"sk-estimator-id-3\" type=\"checkbox\" ><label for=\"sk-estimator-id-3\" class=\"sk-toggleable__label sk-toggleable__label-arrow\">StandardScaler</label><div class=\"sk-toggleable__content\"><pre>StandardScaler()</pre></div></div></div><div class=\"sk-item\"><div class=\"sk-estimator sk-toggleable\"><input class=\"sk-toggleable__control sk-hidden--visually\" id=\"sk-estimator-id-4\" type=\"checkbox\" ><label for=\"sk-estimator-id-4\" class=\"sk-toggleable__label sk-toggleable__label-arrow\">AutoML</label><div class=\"sk-toggleable__content\"><pre>AutoML(append_log=False, auto_augment=True, custom_hp={}, early_stop=False,\n",
       "       ensemble=False, estimator_list=&#x27;auto&#x27;, eval_method=&#x27;auto&#x27;,\n",
       "       fit_kwargs_by_estimator={}, hpo_method=&#x27;auto&#x27;, keep_search_state=False,\n",
       "       learner_selector=&#x27;sample&#x27;, log_file_name=&#x27;&#x27;, log_training_metric=False,\n",
       "       log_type=&#x27;better&#x27;, max_iter=None, mem_thres=4294967296, metric=&#x27;auto&#x27;,\n",
       "       metric_constraints=[], min_sample_size=10000, model_history=False,\n",
       "       n_concurrent_trials=1, n_jobs=-1, n_splits=5, pred_time_limit=inf,\n",
       "       retrain_full=True, sample=True, split_ratio=0.1, split_type=&#x27;auto&#x27;,\n",
       "       starting_points=&#x27;static&#x27;, task=&#x27;classification&#x27;, ...)</pre></div></div></div></div></div></div></div>"
      ],
      "text/plain": [
       "Pipeline(steps=[('imputuer', SimpleImputer()),\n",
       "                ('standardizer', StandardScaler()),\n",
       "                ('automl',\n",
       "                 AutoML(append_log=False, auto_augment=True, custom_hp={},\n",
       "                        early_stop=False, ensemble=False, estimator_list='auto',\n",
       "                        eval_method='auto', fit_kwargs_by_estimator={},\n",
       "                        hpo_method='auto', keep_search_state=False,\n",
       "                        learner_selector='sample', log_file_name='',\n",
       "                        log_training_metric=False, log_type='better',\n",
       "                        max_iter=None, mem_thres=4294967296, metric='auto',\n",
       "                        metric_constraints=[], min_sample_size=10000,\n",
       "                        model_history=False, n_concurrent_trials=1, n_jobs=-1,\n",
       "                        n_splits=5, pred_time_limit=inf, retrain_full=True,\n",
       "                        sample=True, split_ratio=0.1, split_type='auto',\n",
       "                        starting_points='static', task='classification', ...))])"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from sklearn import set_config\n",
    "from sklearn.pipeline import Pipeline\n",
    "from sklearn.impute import SimpleImputer\n",
    "from sklearn.preprocessing import StandardScaler\n",
    "from flaml import AutoML\n",
    "\n",
    "set_config(display='diagram')\n",
    "\n",
    "imputer = SimpleImputer()\n",
    "standardizer = StandardScaler()\n",
    "automl = AutoML()\n",
    "\n",
    "automl_pipeline = Pipeline([\n",
    "    (\"imputuer\", imputer),\n",
    "    (\"standardizer\", standardizer),\n",
    "    (\"automl\", automl)\n",
    "])\n",
    "automl_pipeline"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Run FLAML\n",
    "In the FLAML automl run configuration, users can specify the task type, time budget, error metric, learner list, whether to subsample, resampling strategy type, and so on. All these arguments have default values which will be used if users do not provide them. For example, the default ML learners of FLAML are `['lgbm', 'xgboost', 'catboost', 'rf', 'extra_tree', 'lrl1']`. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "automl_settings = {\n",
    "    \"time_budget\": 60,  # total running time in seconds\n",
    "    \"metric\": 'accuracy',  # primary metrics can be chosen from: ['accuracy','roc_auc', 'roc_auc_ovr', 'roc_auc_ovo', 'f1','log_loss','mae','mse','r2']\n",
    "    \"task\": 'classification',  # task type\n",
    "    \"estimator_list\": ['xgboost','catboost','lgbm'],\n",
    "    \"log_file_name\": 'airlines_experiment.log',  # flaml log file\n",
    "}\n",
    "pipeline_settings = {f\"automl__{key}\": value for key, value in automl_settings.items()}"
   ]
  },
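  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `automl__` prefix follows scikit-learn's `<step name>__<parameter>` convention for keyword arguments given to `Pipeline.fit`. The sketch below shows how such prefixed keys reach the named step, assuming scikit-learn's default configuration (metadata routing disabled). `RecordingEstimator` is a hypothetical stand-in for `AutoML`, written only for this illustration: it stores whatever keyword arguments its `fit` receives.\n",
    "\n",
    "```python\n",
    "from sklearn.base import BaseEstimator\n",
    "from sklearn.pipeline import Pipeline\n",
    "from sklearn.preprocessing import StandardScaler\n",
    "\n",
    "\n",
    "class RecordingEstimator(BaseEstimator):\n",
    "    # Hypothetical stand-in for AutoML: it only stores the keyword arguments given to fit.\n",
    "    def fit(self, X, y=None, **fit_kwargs):\n",
    "        self.fit_kwargs_ = fit_kwargs\n",
    "        return self\n",
    "\n",
    "\n",
    "settings = {'time_budget': 60, 'task': 'classification'}\n",
    "prefixed = {f'recorder__{key}': value for key, value in settings.items()}\n",
    "\n",
    "demo = Pipeline([\n",
    "    ('standardizer', StandardScaler()),\n",
    "    ('recorder', RecordingEstimator()),\n",
    "])\n",
    "\n",
    "# Keys of the form 'recorder__<name>' are passed to the fit method of the\n",
    "# step called 'recorder', with the prefix removed.\n",
    "demo.fit([[0.0], [1.0], [2.0]], [0, 1, 0], **prefixed)\n",
    "\n",
    "print(demo.named_steps['recorder'].fit_kwargs_)\n",
    "```\n",
    "\n",
    "Building the prefixed dictionary once keeps the AutoML settings readable on their own while still letting them be handed to the pipeline in a single call."
   ]
  },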
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[flaml.automl: 06-22 08:01:43] {2390} INFO - task = classification\n",
      "[flaml.automl: 06-22 08:01:43] {2392} INFO - Data split method: stratified\n",
      "[flaml.automl: 06-22 08:01:43] {2396} INFO - Evaluation method: holdout\n",
      "[flaml.automl: 06-22 08:01:44] {2465} INFO - Minimizing error metric: 1-accuracy\n",
      "[flaml.automl: 06-22 08:01:44] {2605} INFO - List of ML learners in AutoML Run: ['xgboost', 'catboost', 'lgbm']\n",
      "[flaml.automl: 06-22 08:01:44] {2897} INFO - iteration 0, current learner xgboost\n",
      "[flaml.automl: 06-22 08:01:44] {3025} INFO - Estimated sufficient time budget=105341s. Estimated necessary time budget=116s.\n",
      "[flaml.automl: 06-22 08:01:44] {3072} INFO -  at 0.7s,\testimator xgboost's best error=0.3755,\tbest estimator xgboost's best error=0.3755\n",
      "[flaml.automl: 06-22 08:01:44] {2897} INFO - iteration 1, current learner lgbm\n",
      "[flaml.automl: 06-22 08:01:44] {3072} INFO -  at 0.9s,\testimator lgbm's best error=0.3814,\tbest estimator xgboost's best error=0.3755\n",
      "[flaml.automl: 06-22 08:01:44] {2897} INFO - iteration 2, current learner xgboost\n",
      "[flaml.automl: 06-22 08:01:45] {3072} INFO -  at 1.3s,\testimator xgboost's best error=0.3755,\tbest estimator xgboost's best error=0.3755\n",
      "[flaml.automl: 06-22 08:01:45] {2897} INFO - iteration 3, current learner lgbm\n",
      "[flaml.automl: 06-22 08:01:45] {3072} INFO -  at 1.5s,\testimator lgbm's best error=0.3814,\tbest estimator xgboost's best error=0.3755\n",
      "[flaml.automl: 06-22 08:01:45] {2897} INFO - iteration 4, current learner xgboost\n",
      "[flaml.automl: 06-22 08:01:45] {3072} INFO -  at 1.8s,\testimator xgboost's best error=0.3755,\tbest estimator xgboost's best error=0.3755\n",
      "[flaml.automl: 06-22 08:01:45] {2897} INFO - iteration 5, current learner lgbm\n",
      "[flaml.automl: 06-22 08:01:45] {3072} INFO -  at 2.0s,\testimator lgbm's best error=0.3755,\tbest estimator xgboost's best error=0.3755\n",
      "[flaml.automl: 06-22 08:01:45] {2897} INFO - iteration 6, current learner xgboost\n",
      "[flaml.automl: 06-22 08:01:46] {3072} INFO -  at 2.3s,\testimator xgboost's best error=0.3724,\tbest estimator xgboost's best error=0.3724\n",
      "[flaml.automl: 06-22 08:01:46] {2897} INFO - iteration 7, current learner xgboost\n",
      "[flaml.automl: 06-22 08:01:46] {3072} INFO -  at 2.6s,\testimator xgboost's best error=0.3724,\tbest estimator xgboost's best error=0.3724\n",
      "[flaml.automl: 06-22 08:01:46] {2897} INFO - iteration 8, current learner xgboost\n",
      "[flaml.automl: 06-22 08:01:47] {3072} INFO -  at 3.1s,\testimator xgboost's best error=0.3657,\tbest estimator xgboost's best error=0.3657\n",
      "[flaml.automl: 06-22 08:01:47] {2897} INFO - iteration 9, current learner xgboost\n",
      "[flaml.automl: 06-22 08:01:47] {3072} INFO -  at 3.6s,\testimator xgboost's best error=0.3657,\tbest estimator xgboost's best error=0.3657\n",
      "[flaml.automl: 06-22 08:01:47] {2897} INFO - iteration 10, current learner xgboost\n",
      "[flaml.automl: 06-22 08:01:48] {3072} INFO -  at 4.8s,\testimator xgboost's best error=0.3592,\tbest estimator xgboost's best error=0.3592\n",
      "[flaml.automl: 06-22 08:01:48] {2897} INFO - iteration 11, current learner xgboost\n",
      "[flaml.automl: 06-22 08:01:50] {3072} INFO -  at 6.8s,\testimator xgboost's best error=0.3580,\tbest estimator xgboost's best error=0.3580\n",
      "[flaml.automl: 06-22 08:01:50] {2897} INFO - iteration 12, current learner xgboost\n",
      "[flaml.automl: 06-22 08:01:51] {3072} INFO -  at 8.1s,\testimator xgboost's best error=0.3580,\tbest estimator xgboost's best error=0.3580\n",
      "[flaml.automl: 06-22 08:01:51] {2897} INFO - iteration 13, current learner lgbm\n",
      "[flaml.automl: 06-22 08:01:52] {3072} INFO -  at 8.4s,\testimator lgbm's best error=0.3644,\tbest estimator xgboost's best error=0.3580\n",
      "[flaml.automl: 06-22 08:01:52] {2897} INFO - iteration 14, current learner lgbm\n",
      "[flaml.automl: 06-22 08:01:52] {3072} INFO -  at 8.7s,\testimator lgbm's best error=0.3644,\tbest estimator xgboost's best error=0.3580\n",
      "[flaml.automl: 06-22 08:01:52] {2897} INFO - iteration 15, current learner lgbm\n",
      "[flaml.automl: 06-22 08:01:53] {3072} INFO -  at 9.3s,\testimator lgbm's best error=0.3644,\tbest estimator xgboost's best error=0.3580\n",
      "[flaml.automl: 06-22 08:01:53] {2897} INFO - iteration 16, current learner xgboost\n",
      "[flaml.automl: 06-22 08:01:56] {3072} INFO -  at 12.1s,\testimator xgboost's best error=0.3559,\tbest estimator xgboost's best error=0.3559\n",
      "[flaml.automl: 06-22 08:01:56] {2897} INFO - iteration 17, current learner lgbm\n",
      "[flaml.automl: 06-22 08:01:56] {3072} INFO -  at 12.6s,\testimator lgbm's best error=0.3604,\tbest estimator xgboost's best error=0.3559\n",
      "[flaml.automl: 06-22 08:01:56] {2897} INFO - iteration 18, current learner catboost\n",
      "[flaml.automl: 06-22 08:01:56] {3072} INFO -  at 13.0s,\testimator catboost's best error=0.3615,\tbest estimator xgboost's best error=0.3559\n",
      "[flaml.automl: 06-22 08:01:56] {2897} INFO - iteration 19, current learner catboost\n",
      "[flaml.automl: 06-22 08:01:57] {3072} INFO -  at 13.7s,\testimator catboost's best error=0.3615,\tbest estimator xgboost's best error=0.3559\n",
      "[flaml.automl: 06-22 08:01:57] {2897} INFO - iteration 20, current learner catboost\n",
      "[flaml.automl: 06-22 08:01:57] {3072} INFO -  at 13.9s,\testimator catboost's best error=0.3615,\tbest estimator xgboost's best error=0.3559\n",
      "[flaml.automl: 06-22 08:01:57] {2897} INFO - iteration 21, current learner xgboost\n",
      "[flaml.automl: 06-22 08:01:59] {3072} INFO -  at 15.7s,\testimator xgboost's best error=0.3559,\tbest estimator xgboost's best error=0.3559\n",
      "[flaml.automl: 06-22 08:01:59] {2897} INFO - iteration 22, current learner catboost\n",
      "[flaml.automl: 06-22 08:02:00] {3072} INFO -  at 16.5s,\testimator catboost's best error=0.3489,\tbest estimator catboost's best error=0.3489\n",
      "[flaml.automl: 06-22 08:02:00] {2897} INFO - iteration 23, current learner catboost\n",
      "[flaml.automl: 06-22 08:02:02] {3072} INFO -  at 18.9s,\testimator catboost's best error=0.3489,\tbest estimator catboost's best error=0.3489\n",
      "[flaml.automl: 06-22 08:02:02] {2897} INFO - iteration 24, current learner lgbm\n",
      "[flaml.automl: 06-22 08:02:03] {3072} INFO -  at 19.2s,\testimator lgbm's best error=0.3604,\tbest estimator catboost's best error=0.3489\n",
      "[flaml.automl: 06-22 08:02:03] {2897} INFO - iteration 25, current learner catboost\n",
      "[flaml.automl: 06-22 08:02:03] {3072} INFO -  at 20.0s,\testimator catboost's best error=0.3472,\tbest estimator catboost's best error=0.3472\n",
      "[flaml.automl: 06-22 08:02:03] {2897} INFO - iteration 26, current learner catboost\n",
      "[flaml.automl: 06-22 08:02:06] {3072} INFO -  at 22.2s,\testimator catboost's best error=0.3472,\tbest estimator catboost's best error=0.3472\n",
      "[flaml.automl: 06-22 08:02:06] {2897} INFO - iteration 27, current learner lgbm\n",
      "[flaml.automl: 06-22 08:02:06] {3072} INFO -  at 22.6s,\testimator lgbm's best error=0.3604,\tbest estimator catboost's best error=0.3472\n",
      "[flaml.automl: 06-22 08:02:06] {2897} INFO - iteration 28, current learner lgbm\n",
      "[flaml.automl: 06-22 08:02:06] {3072} INFO -  at 22.9s,\testimator lgbm's best error=0.3604,\tbest estimator catboost's best error=0.3472\n",
      "[flaml.automl: 06-22 08:02:06] {2897} INFO - iteration 29, current learner catboost\n",
      "[flaml.automl: 06-22 08:02:07] {3072} INFO -  at 23.6s,\testimator catboost's best error=0.3472,\tbest estimator catboost's best error=0.3472\n",
      "[flaml.automl: 06-22 08:02:07] {2897} INFO - iteration 30, current learner xgboost\n",
      "[flaml.automl: 06-22 08:02:09] {3072} INFO -  at 25.4s,\testimator xgboost's best error=0.3548,\tbest estimator catboost's best error=0.3472\n",
      "[flaml.automl: 06-22 08:02:09] {2897} INFO - iteration 31, current learner catboost\n",
      "[flaml.automl: 06-22 08:02:16] {3072} INFO -  at 32.3s,\testimator catboost's best error=0.3388,\tbest estimator catboost's best error=0.3388\n",
      "[flaml.automl: 06-22 08:02:16] {2897} INFO - iteration 32, current learner lgbm\n",
      "[flaml.automl: 06-22 08:02:16] {3072} INFO -  at 32.7s,\testimator lgbm's best error=0.3604,\tbest estimator catboost's best error=0.3388\n",
      "[flaml.automl: 06-22 08:02:16] {2897} INFO - iteration 33, current learner catboost\n",
      "[flaml.automl: 06-22 08:02:22] {3072} INFO -  at 38.5s,\testimator catboost's best error=0.3388,\tbest estimator catboost's best error=0.3388\n",
      "[flaml.automl: 06-22 08:02:22] {2897} INFO - iteration 34, current learner catboost\n",
      "[flaml.automl: 06-22 08:02:43] {3072} INFO -  at 59.6s,\testimator catboost's best error=0.3388,\tbest estimator catboost's best error=0.3388\n",
      "[flaml.automl: 06-22 08:02:46] {3336} INFO - retrain catboost for 2.8s\n",
      "[flaml.automl: 06-22 08:02:46] {3343} INFO - retrained model: <catboost.core.CatBoostClassifier object at 0x7fbeeb3859d0>\n",
      "[flaml.automl: 06-22 08:02:46] {2636} INFO - fit succeeded\n",
      "[flaml.automl: 06-22 08:02:46] {2637} INFO - Time taken to find the best model: 32.311296463012695\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<style>#sk-container-id-2 {color: black;background-color: white;}#sk-container-id-2 pre{padding: 0;}#sk-container-id-2 div.sk-toggleable {background-color: white;}#sk-container-id-2 label.sk-toggleable__label {cursor: pointer;display: block;width: 100%;margin-bottom: 0;padding: 0.3em;box-sizing: border-box;text-align: center;}#sk-container-id-2 label.sk-toggleable__label-arrow:before {content: \"▸\";float: left;margin-right: 0.25em;color: #696969;}#sk-container-id-2 label.sk-toggleable__label-arrow:hover:before {color: black;}#sk-container-id-2 div.sk-estimator:hover label.sk-toggleable__label-arrow:before {color: black;}#sk-container-id-2 div.sk-toggleable__content {max-height: 0;max-width: 0;overflow: hidden;text-align: left;background-color: #f0f8ff;}#sk-container-id-2 div.sk-toggleable__content pre {margin: 0.2em;color: black;border-radius: 0.25em;background-color: #f0f8ff;}#sk-container-id-2 input.sk-toggleable__control:checked~div.sk-toggleable__content {max-height: 200px;max-width: 100%;overflow: auto;}#sk-container-id-2 input.sk-toggleable__control:checked~label.sk-toggleable__label-arrow:before {content: \"▾\";}#sk-container-id-2 div.sk-estimator input.sk-toggleable__control:checked~label.sk-toggleable__label {background-color: #d4ebff;}#sk-container-id-2 div.sk-label input.sk-toggleable__control:checked~label.sk-toggleable__label {background-color: #d4ebff;}#sk-container-id-2 input.sk-hidden--visually {border: 0;clip: rect(1px 1px 1px 1px);clip: rect(1px, 1px, 1px, 1px);height: 1px;margin: -1px;overflow: hidden;padding: 0;position: absolute;width: 1px;}#sk-container-id-2 div.sk-estimator {font-family: monospace;background-color: #f0f8ff;border: 1px dotted black;border-radius: 0.25em;box-sizing: border-box;margin-bottom: 0.5em;}#sk-container-id-2 div.sk-estimator:hover {background-color: #d4ebff;}#sk-container-id-2 div.sk-parallel-item::after {content: \"\";width: 100%;border-bottom: 1px solid gray;flex-grow: 1;}#sk-container-id-2 
div.sk-label:hover label.sk-toggleable__label {background-color: #d4ebff;}#sk-container-id-2 div.sk-serial::before {content: \"\";position: absolute;border-left: 1px solid gray;box-sizing: border-box;top: 0;bottom: 0;left: 50%;z-index: 0;}#sk-container-id-2 div.sk-serial {display: flex;flex-direction: column;align-items: center;background-color: white;padding-right: 0.2em;padding-left: 0.2em;position: relative;}#sk-container-id-2 div.sk-item {position: relative;z-index: 1;}#sk-container-id-2 div.sk-parallel {display: flex;align-items: stretch;justify-content: center;background-color: white;position: relative;}#sk-container-id-2 div.sk-item::before, #sk-container-id-2 div.sk-parallel-item::before {content: \"\";position: absolute;border-left: 1px solid gray;box-sizing: border-box;top: 0;bottom: 0;left: 50%;z-index: -1;}#sk-container-id-2 div.sk-parallel-item {display: flex;flex-direction: column;z-index: 1;position: relative;background-color: white;}#sk-container-id-2 div.sk-parallel-item:first-child::after {align-self: flex-end;width: 50%;}#sk-container-id-2 div.sk-parallel-item:last-child::after {align-self: flex-start;width: 50%;}#sk-container-id-2 div.sk-parallel-item:only-child::after {width: 0;}#sk-container-id-2 div.sk-dashed-wrapped {border: 1px dashed gray;margin: 0 0.4em 0.5em 0.4em;box-sizing: border-box;padding-bottom: 0.4em;background-color: white;}#sk-container-id-2 div.sk-label label {font-family: monospace;font-weight: bold;display: inline-block;line-height: 1.2em;}#sk-container-id-2 div.sk-label-container {text-align: center;}#sk-container-id-2 div.sk-container {/* jupyter's `normalize.less` sets `[hidden] { display: none; }` but bootstrap.min.css set `[hidden] { display: none !important; }` so we also need the `!important` here to be able to override the default hidden behavior on the sphinx rendered scikit-learn.org. 
See: https://github.com/scikit-learn/scikit-learn/issues/21755 */display: inline-block !important;position: relative;}#sk-container-id-2 div.sk-text-repr-fallback {display: none;}</style><div id=\"sk-container-id-2\" class=\"sk-top-container\"><div class=\"sk-text-repr-fallback\"><pre>Pipeline(steps=[(&#x27;imputuer&#x27;, SimpleImputer()),\n",
       "                (&#x27;standardizer&#x27;, StandardScaler()),\n",
       "                (&#x27;automl&#x27;,\n",
       "                 AutoML(append_log=False, auto_augment=True, custom_hp={},\n",
       "                        early_stop=False, ensemble=False, estimator_list=&#x27;auto&#x27;,\n",
       "                        eval_method=&#x27;auto&#x27;, fit_kwargs_by_estimator={},\n",
       "                        hpo_method=&#x27;auto&#x27;, keep_search_state=False,\n",
       "                        learner_selector=&#x27;sample&#x27;, log_file_name=&#x27;&#x27;,\n",
       "                        log_training_metric=False, log_type=&#x27;better&#x27;,\n",
       "                        max_iter=None, mem_thres=4294967296, metric=&#x27;auto&#x27;,\n",
       "                        metric_constraints=[], min_sample_size=10000,\n",
       "                        model_history=False, n_concurrent_trials=1, n_jobs=-1,\n",
       "                        n_splits=5, pred_time_limit=inf, retrain_full=True,\n",
       "                        sample=True, split_ratio=0.1, split_type=&#x27;auto&#x27;,\n",
       "                        starting_points=&#x27;static&#x27;, task=&#x27;classification&#x27;, ...))])</pre><b>In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook. <br />On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.</b></div><div class=\"sk-container\" hidden><div class=\"sk-item sk-dashed-wrapped\"><div class=\"sk-label-container\"><div class=\"sk-label sk-toggleable\"><input class=\"sk-toggleable__control sk-hidden--visually\" id=\"sk-estimator-id-5\" type=\"checkbox\" ><label for=\"sk-estimator-id-5\" class=\"sk-toggleable__label sk-toggleable__label-arrow\">Pipeline</label><div class=\"sk-toggleable__content\"><pre>Pipeline(steps=[(&#x27;imputuer&#x27;, SimpleImputer()),\n",
       "                (&#x27;standardizer&#x27;, StandardScaler()),\n",
       "                (&#x27;automl&#x27;,\n",
       "                 AutoML(append_log=False, auto_augment=True, custom_hp={},\n",
       "                        early_stop=False, ensemble=False, estimator_list=&#x27;auto&#x27;,\n",
       "                        eval_method=&#x27;auto&#x27;, fit_kwargs_by_estimator={},\n",
       "                        hpo_method=&#x27;auto&#x27;, keep_search_state=False,\n",
       "                        learner_selector=&#x27;sample&#x27;, log_file_name=&#x27;&#x27;,\n",
       "                        log_training_metric=False, log_type=&#x27;better&#x27;,\n",
       "                        max_iter=None, mem_thres=4294967296, metric=&#x27;auto&#x27;,\n",
       "                        metric_constraints=[], min_sample_size=10000,\n",
       "                        model_history=False, n_concurrent_trials=1, n_jobs=-1,\n",
       "                        n_splits=5, pred_time_limit=inf, retrain_full=True,\n",
       "                        sample=True, split_ratio=0.1, split_type=&#x27;auto&#x27;,\n",
       "                        starting_points=&#x27;static&#x27;, task=&#x27;classification&#x27;, ...))])</pre></div></div></div><div class=\"sk-serial\"><div class=\"sk-item\"><div class=\"sk-estimator sk-toggleable\"><input class=\"sk-toggleable__control sk-hidden--visually\" id=\"sk-estimator-id-6\" type=\"checkbox\" ><label for=\"sk-estimator-id-6\" class=\"sk-toggleable__label sk-toggleable__label-arrow\">SimpleImputer</label><div class=\"sk-toggleable__content\"><pre>SimpleImputer()</pre></div></div></div><div class=\"sk-item\"><div class=\"sk-estimator sk-toggleable\"><input class=\"sk-toggleable__control sk-hidden--visually\" id=\"sk-estimator-id-7\" type=\"checkbox\" ><label for=\"sk-estimator-id-7\" class=\"sk-toggleable__label sk-toggleable__label-arrow\">StandardScaler</label><div class=\"sk-toggleable__content\"><pre>StandardScaler()</pre></div></div></div><div class=\"sk-item\"><div class=\"sk-estimator sk-toggleable\"><input class=\"sk-toggleable__control sk-hidden--visually\" id=\"sk-estimator-id-8\" type=\"checkbox\" ><label for=\"sk-estimator-id-8\" class=\"sk-toggleable__label sk-toggleable__label-arrow\">AutoML</label><div class=\"sk-toggleable__content\"><pre>AutoML(append_log=False, auto_augment=True, custom_hp={}, early_stop=False,\n",
       "       ensemble=False, estimator_list=&#x27;auto&#x27;, eval_method=&#x27;auto&#x27;,\n",
       "       fit_kwargs_by_estimator={}, hpo_method=&#x27;auto&#x27;, keep_search_state=False,\n",
       "       learner_selector=&#x27;sample&#x27;, log_file_name=&#x27;&#x27;, log_training_metric=False,\n",
       "       log_type=&#x27;better&#x27;, max_iter=None, mem_thres=4294967296, metric=&#x27;auto&#x27;,\n",
       "       metric_constraints=[], min_sample_size=10000, model_history=False,\n",
       "       n_concurrent_trials=1, n_jobs=-1, n_splits=5, pred_time_limit=inf,\n",
       "       retrain_full=True, sample=True, split_ratio=0.1, split_type=&#x27;auto&#x27;,\n",
       "       starting_points=&#x27;static&#x27;, task=&#x27;classification&#x27;, ...)</pre></div></div></div></div></div></div></div>"
      ],
      "text/plain": [
       "Pipeline(steps=[('imputuer', SimpleImputer()),\n",
       "                ('standardizer', StandardScaler()),\n",
       "                ('automl',\n",
       "                 AutoML(append_log=False, auto_augment=True, custom_hp={},\n",
       "                        early_stop=False, ensemble=False, estimator_list='auto',\n",
       "                        eval_method='auto', fit_kwargs_by_estimator={},\n",
       "                        hpo_method='auto', keep_search_state=False,\n",
       "                        learner_selector='sample', log_file_name='',\n",
       "                        log_training_metric=False, log_type='better',\n",
       "                        max_iter=None, mem_thres=4294967296, metric='auto',\n",
       "                        metric_constraints=[], min_sample_size=10000,\n",
       "                        model_history=False, n_concurrent_trials=1, n_jobs=-1,\n",
       "                        n_splits=5, pred_time_limit=inf, retrain_full=True,\n",
       "                        sample=True, split_ratio=0.1, split_type='auto',\n",
       "                        starting_points='static', task='classification', ...))])"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "automl_pipeline.fit(X_train, y_train, **pipeline_settings)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Best ML learner: xgboost\n",
      "Best hyperparameter config: {'n_estimators': 63, 'max_leaves': 1797, 'min_child_weight': 0.07275175679381725, 'learning_rate': 0.06234183309508761, 'subsample': 0.9814772488195874, 'colsample_bylevel': 0.810466508891351, 'colsample_bytree': 0.8005378817953572, 'reg_alpha': 0.5768305704485758, 'reg_lambda': 6.867180836557797, 'FLAML_sample_size': 364083}\n",
      "Best accuracy on validation data: 0.6721\n",
      "Training duration of best run: 15.45 s\n"
     ]
    }
   ],
   "source": [
    "# Get the fitted AutoML step from the pipeline by its step name\n",
    "automl = automl_pipeline.named_steps['automl']\n",
    "\n",
    "# Get the best config and best learner\n",
    "print('Best ML learner:', automl.best_estimator)\n",
    "print('Best hyperparameter config:', automl.best_config)\n",
    "print('Best accuracy on validation data: {0:.4g}'.format(1 - automl.best_loss))\n",
    "print('Training duration of best run: {0:.4g} s'.format(automl.best_config_train_time))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<flaml.model.XGBoostSklearnEstimator at 0x7f03a5eada00>"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "automl.model"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4. Persist the model as a binary file"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Persist the fitted AutoML object as a pickle file\n",
    "import pickle\n",
    "with open('automl.pkl', 'wb') as f:\n",
    "    pickle.dump(automl, f, pickle.HIGHEST_PROTOCOL)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Predicted labels [0 1 1 ... 0 1 0]\n",
      "True labels [0 0 0 ... 1 0 1]\n",
      "Predicted probas  [0.3764987  0.6126277  0.699604   0.27359942 0.25294745]\n"
     ]
    }
   ],
   "source": [
    "# Perform inference on the testing dataset\n",
    "y_pred = automl_pipeline.predict(X_test)\n",
    "print('Predicted labels', y_pred)\n",
    "print('True labels', y_test)\n",
    "y_pred_proba = automl_pipeline.predict_proba(X_test)[:, 1]\n",
    "print('Predicted probas ',y_pred_proba[:5])"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3.9.12 64-bit",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.12"
  },
  "vscode": {
   "interpreter": {
    "hash": "949777d72b0d2535278d3dc13498b2535136f6dfe0678499012e853ee9abcab1"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}