autogen/notebook/integrate_spark.ipynb
							{"cells":[{"attachments":{},"cell_type":"markdown","metadata":{"slideshow":{"slide_type":"slide"}},"source":["Copyright (c) Microsoft Corporation. All rights reserved. \n","\n","Licensed under the MIT License.\n","\n","# Run FLAML Parallel tuning with Spark\n","\n","\n","## 1. Introduction\n","\n","FLAML is a Python library (https://github.com/microsoft/FLAML) designed to automatically produce accurate machine learning models \n","with low computational cost. It is fast and economical. The simple and lightweight design makes it easy \n","to use and extend, such as adding new learners. FLAML can \n","- serve as an economical AutoML engine,\n","- be used as a fast hyperparameter tuning tool, or \n","- be embedded in self-tuning software that requires low latency & resource in repetitive\n","   tuning tasks.\n","\n","In this notebook, we demonstrate how to run FLAML parallel tuning using Spark as the backend.\n","\n","FLAML requires `Python>=3.7`. To run this notebook example, please install flaml with the following options:\n","```bash\n","pip install flaml[automl,spark,blendsearch]\n","```\n","*Spark support is added in v1.1.0*"]},{"cell_type":"code","execution_count":null,"metadata":{"cellStatus":"{\"Li Jiang\":{\"queued_time\":\"2022-12-07T08:16:51.6335768Z\",\"session_start_time\":null,\"execution_start_time\":\"2022-12-07T08:17:21.9028602Z\",\"execution_finish_time\":\"2022-12-07T08:18:52.3646576Z\",\"state\":\"finished\",\"livy_statement_state\":\"available\"}}"},"outputs":[],"source":["# %pip install flaml[automl,spark,blendsearch] matplotlib openml"]},{"attachments":{},"cell_type":"markdown","metadata":{"slideshow":{"slide_type":"slide"}},"source":["## 2. Regression Example\n","### Load data and preprocess\n","\n","Download [houses dataset](https://www.openml.org/d/537) from OpenML. 
The task is to predict the median house price in a region based on the region's demographic composition and the state of its housing market."]},{"cell_type":"code","execution_count":null,"metadata":{"cellStatus":"{\"Li Jiang\":{\"queued_time\":\"2022-12-07T08:20:53.4783943Z\",\"session_start_time\":null,\"execution_start_time\":\"2022-12-07T08:20:55.7666047Z\",\"execution_finish_time\":\"2022-12-07T08:21:10.9050139Z\",\"state\":\"finished\",\"livy_statement_state\":\"available\"}}","slideshow":{"slide_type":"subslide"},"tags":[]},"outputs":[],"source":["from flaml.data import load_openml_dataset\n","\n","try:\n","    X_train, X_test, y_train, y_test = load_openml_dataset(dataset_id=537, data_dir='./')\n","except Exception:  # fall back to scikit-learn's California housing data if the OpenML download fails\n","    from sklearn.datasets import fetch_california_housing\n","    from sklearn.model_selection import train_test_split\n","\n","    X, y = fetch_california_housing(return_X_y=True)\n","    X_train, X_test, y_train, y_test = train_test_split(X, y)\n"]},{"attachments":{},"cell_type":"markdown","metadata":{"slideshow":{"slide_type":"slide"}},"source":["### Run FLAML\n","In the FLAML automl run configuration, users can specify the task type, time budget, error metric, learner list, whether to subsample, resampling strategy type, and so on. All these arguments have default values which will be used if users do not provide them. 
\n","\n","Notice that here `use_spark` is set to `True` in order to use Spark as the parallel training backend."]},{"cell_type":"code","execution_count":null,"metadata":{"cellStatus":"{\"Li Jiang\":{\"queued_time\":\"2022-12-07T08:20:53.7001471Z\",\"session_start_time\":null,\"execution_start_time\":\"2022-12-07T08:21:10.9846131Z\",\"execution_finish_time\":\"2022-12-07T08:21:11.3604062Z\",\"state\":\"finished\",\"livy_statement_state\":\"available\"}}","slideshow":{"slide_type":"slide"},"tags":[]},"outputs":[],"source":["''' import AutoML class from flaml package '''\n","from flaml import AutoML\n","automl = AutoML()"]},{"cell_type":"code","execution_count":null,"metadata":{"cellStatus":"{\"Li Jiang\":{\"queued_time\":\"2022-12-07T08:20:53.8983341Z\",\"session_start_time\":null,\"execution_start_time\":\"2022-12-07T08:21:11.4417491Z\",\"execution_finish_time\":\"2022-12-07T08:21:11.8242955Z\",\"state\":\"finished\",\"livy_statement_state\":\"available\"}}","slideshow":{"slide_type":"slide"}},"outputs":[],"source":["settings = {\n","    \"time_budget\": 30,  # total running time in seconds\n","    \"metric\": 'r2',  # primary metrics for regression can be chosen from: ['mae','mse','r2','rmse','mape']\n","    \"estimator_list\": ['lgbm'],  # list of ML learners; we tune lightgbm in this example\n","    \"task\": 'regression',  # task type    \n","    \"log_file_name\": 'houses_experiment.log',  # flaml log file\n","    \"seed\": 7654321,    # random seed\n","    \"use_spark\": True,  # whether to use Spark for distributed training\n","    \"n_concurrent_trials\": 2,  # the maximum number of concurrent trials\n","}"]},{"cell_type":"code","execution_count":null,"metadata":{"cellStatus":"{\"Li 
Jiang\":{\"queued_time\":\"2022-12-07T08:20:54.3953298Z\",\"session_start_time\":null,\"execution_start_time\":\"2022-12-07T08:21:11.9003975Z\",\"execution_finish_time\":\"2022-12-07T08:27:58.525709Z\",\"state\":\"finished\",\"livy_statement_state\":\"available\"}}","slideshow":{"slide_type":"slide"},"tags":[]},"outputs":[],"source":["'''The main flaml automl API'''\n","automl.fit(X_train=X_train, y_train=y_train, **settings)"]},{"attachments":{},"cell_type":"markdown","metadata":{"slideshow":{"slide_type":"slide"}},"source":["### Best model and metric"]},{"cell_type":"code","execution_count":null,"metadata":{"cellStatus":"{\"Li Jiang\":{\"queued_time\":\"2022-12-07T08:20:54.789647Z\",\"session_start_time\":null,\"execution_start_time\":\"2022-12-07T08:27:58.6014435Z\",\"execution_finish_time\":\"2022-12-07T08:27:58.9745212Z\",\"state\":\"finished\",\"livy_statement_state\":\"available\"}}","slideshow":{"slide_type":"slide"},"tags":[]},"outputs":[],"source":["''' retrieve best config'''\n","print('Best hyperparameter config:', automl.best_config)\n","print('Best r2 on validation data: {0:.4g}'.format(1-automl.best_loss))\n","print('Training duration of best run: {0:.4g} s'.format(automl.best_config_train_time))"]},{"cell_type":"code","execution_count":null,"metadata":{"cellStatus":"{\"Li Jiang\":{\"queued_time\":\"2022-12-07T08:20:54.9962623Z\",\"session_start_time\":null,\"execution_start_time\":\"2022-12-07T08:27:59.0491242Z\",\"execution_finish_time\":\"2022-12-07T08:27:59.4076477Z\",\"state\":\"finished\",\"livy_statement_state\":\"available\"}}","slideshow":{"slide_type":"slide"}},"outputs":[],"source":["automl.model.estimator"]},{"cell_type":"code","execution_count":null,"metadata":{"cellStatus":"{\"Li 
Jiang\":{\"queued_time\":\"2022-12-07T08:20:55.2539877Z\",\"session_start_time\":null,\"execution_start_time\":\"2022-12-07T08:27:59.5247209Z\",\"execution_finish_time\":\"2022-12-07T08:28:00.4849272Z\",\"state\":\"finished\",\"livy_statement_state\":\"available\"}}"},"outputs":[],"source":["import matplotlib.pyplot as plt\n","plt.barh(automl.feature_names_in_, automl.feature_importances_)"]},{"cell_type":"code","execution_count":null,"metadata":{"cellStatus":"{\"Li Jiang\":{\"queued_time\":\"2022-12-07T08:20:55.5182783Z\",\"session_start_time\":null,\"execution_start_time\":\"2022-12-07T08:28:00.5644015Z\",\"execution_finish_time\":\"2022-12-07T08:28:01.5531147Z\",\"state\":\"finished\",\"livy_statement_state\":\"available\"}}","slideshow":{"slide_type":"slide"}},"outputs":[],"source":["''' pickle and save the automl object '''\n","import pickle\n","with open('automl.pkl', 'wb') as f:\n","    pickle.dump(automl, f, pickle.HIGHEST_PROTOCOL)"]},{"cell_type":"code","execution_count":null,"metadata":{"cellStatus":"{\"Li Jiang\":{\"queued_time\":\"2022-12-07T08:20:55.803107Z\",\"session_start_time\":null,\"execution_start_time\":\"2022-12-07T08:28:01.6350567Z\",\"execution_finish_time\":\"2022-12-07T08:28:02.5774117Z\",\"state\":\"finished\",\"livy_statement_state\":\"available\"}}","slideshow":{"slide_type":"slide"},"tags":[]},"outputs":[],"source":["''' compute predictions of testing dataset ''' \n","y_pred = automl.predict(X_test)\n","print('Predicted labels', y_pred)\n","print('True labels', y_test)"]},{"cell_type":"code","execution_count":null,"metadata":{"cellStatus":"{\"Li Jiang\":{\"queued_time\":\"2022-12-07T08:20:56.0585537Z\",\"session_start_time\":null,\"execution_start_time\":\"2022-12-07T08:28:02.6537337Z\",\"execution_finish_time\":\"2022-12-07T08:28:03.0177805Z\",\"state\":\"finished\",\"livy_statement_state\":\"available\"}}","slideshow":{"slide_type":"slide"},"tags":[]},"outputs":[],"source":["''' compute different metric values on testing 
dataset'''\n","from flaml.ml import sklearn_metric_loss_score\n","print('r2', '=', 1 - sklearn_metric_loss_score('r2', y_pred, y_test))\n","print('mse', '=', sklearn_metric_loss_score('mse', y_pred, y_test))\n","print('mae', '=', sklearn_metric_loss_score('mae', y_pred, y_test))"]},{"cell_type":"code","execution_count":null,"metadata":{"cellStatus":"{\"Li Jiang\":{\"queued_time\":\"2022-12-07T08:20:56.2226463Z\",\"session_start_time\":null,\"execution_start_time\":\"2022-12-07T08:28:03.1150781Z\",\"execution_finish_time\":\"2022-12-07T08:28:03.4858362Z\",\"state\":\"finished\",\"livy_statement_state\":\"available\"}}","slideshow":{"slide_type":"subslide"},"tags":[]},"outputs":[],"source":["from flaml.data import get_output_from_log\n","time_history, best_valid_loss_history, valid_loss_history, config_history, metric_history = \\\n","    get_output_from_log(filename=settings['log_file_name'], time_budget=60)\n","\n","for config in config_history:\n","    print(config)"]},{"cell_type":"code","execution_count":null,"metadata":{"cellStatus":"{\"Li Jiang\":{\"queued_time\":\"2022-12-07T08:20:56.4020235Z\",\"session_start_time\":null,\"execution_start_time\":\"2022-12-07T08:28:03.5811012Z\",\"execution_finish_time\":\"2022-12-07T08:28:04.5493292Z\",\"state\":\"finished\",\"livy_statement_state\":\"available\"}}","slideshow":{"slide_type":"slide"}},"outputs":[],"source":["import numpy as np\n","\n","plt.title('Learning Curve')\n","plt.xlabel('Wall Clock Time (s)')\n","plt.ylabel('Validation r2')\n","plt.scatter(time_history, 1 - np.array(valid_loss_history))\n","plt.step(time_history, 1 - np.array(best_valid_loss_history), where='post')\n","plt.show()"]},{"attachments":{},"cell_type":"markdown","metadata":{},"source":["## 3. Add a customized LightGBM learner in FLAML\n","The native API of LightGBM allows one to specify a custom objective function in the model constructor. You can easily enable it by adding a customized LightGBM learner in FLAML. 
The following example shows how to add such a customized LightGBM learner with a custom objective function for parallel tuning with Spark.\n","\n","This differs slightly from adding a customized learner for sequential training. In sequential training, the customized learner can be defined in a notebook cell. With Spark, however, the learner must be imported from a file so that the Spark executors can use it. This can be done with the `broadcast_code` function in `flaml.tune.spark.utils`."]},{"attachments":{},"cell_type":"markdown","metadata":{},"source":["### Create a customized LightGBM learner with a custom objective function"]},{"cell_type":"code","execution_count":null,"metadata":{"cellStatus":"{\"Li Jiang\":{\"queued_time\":\"2022-12-07T09:09:49.540914Z\",\"session_start_time\":null,\"execution_start_time\":\"2022-12-07T09:09:49.6259637Z\",\"execution_finish_time\":\"2022-12-07T09:09:50.5841239Z\",\"state\":\"finished\",\"livy_statement_state\":\"available\"}}"},"outputs":[],"source":["custom_code = \"\"\"\n","import numpy as np \n","from flaml.model import LGBMEstimator\n","from flaml import tune\n","\n","\n","''' define your customized objective function '''\n","def my_loss_obj(y_true, y_pred):\n","    c = 0.5\n","    residual = y_pred - y_true\n","    grad = c * residual /(np.abs(residual) + c)\n","    hess = c ** 2 / (np.abs(residual) + c) ** 2\n","    # rmse grad and hess\n","    grad_rmse = residual\n","    hess_rmse = 1.0\n","    \n","    # mae grad and hess\n","    grad_mae = np.array(residual)\n","    grad_mae[grad_mae > 0] = 1.\n","    grad_mae[grad_mae <= 0] = -1.\n","    hess_mae = 1.0\n","\n","    coef = [0.4, 0.3, 0.3]\n","    return coef[0] * grad + coef[1] * grad_rmse + coef[2] * grad_mae, \\\n","        coef[0] * hess + coef[1] * hess_rmse + coef[2] * hess_mae\n","\n","\n","''' create a customized LightGBM learner class with your objective function '''\n","class MyLGBM(LGBMEstimator):\n","    '''LGBMEstimator with my_loss_obj 
as the objective function\n","    '''\n","\n","    def __init__(self, **config):\n","        super().__init__(objective=my_loss_obj, **config)\n","\"\"\"\n","\n","from flaml.tune.spark.utils import broadcast_code\n","custom_learner_path = broadcast_code(custom_code=custom_code)\n","print(custom_learner_path)\n","from flaml.tune.spark.mylearner import MyLGBM"]},{"attachments":{},"cell_type":"markdown","metadata":{},"source":["### Add the customized learner in FLAML"]},{"cell_type":"code","execution_count":null,"metadata":{"cellStatus":"{\"Li Jiang\":{\"queued_time\":\"2022-12-07T09:14:16.2449566Z\",\"session_start_time\":null,\"execution_start_time\":\"2022-12-07T09:14:16.3227204Z\",\"execution_finish_time\":\"2022-12-07T09:16:49.7573919Z\",\"state\":\"finished\",\"livy_statement_state\":\"available\"}}","tags":[]},"outputs":[],"source":["automl = AutoML()\n","automl.add_learner(learner_name='my_lgbm', learner_class=MyLGBM)\n","settings = {\n","    \"time_budget\": 30,  # total running time in seconds\n","    \"metric\": 'r2',  # primary metrics for regression can be chosen from: ['mae','mse','r2']\n","    \"estimator_list\": ['my_lgbm',],  # list of ML learners; we tune the customized LightGBM learner in this example\n","    \"task\": 'regression',  # task type    \n","    \"log_file_name\": 'houses_experiment_my_lgbm.log',  # flaml log file\n","    \"n_concurrent_trials\": 2,  # the maximum number of concurrent trials\n","    \"use_spark\": True,  # whether to use Spark for distributed training\n","}\n","automl.fit(X_train=X_train, y_train=y_train, **settings)"]},{"cell_type":"code","execution_count":null,"metadata":{"cellStatus":"{\"Li Jiang\":{\"queued_time\":\"2022-12-07T09:17:06.0159529Z\",\"session_start_time\":null,\"execution_start_time\":\"2022-12-07T09:17:06.1042554Z\",\"execution_finish_time\":\"2022-12-07T09:17:06.467989Z\",\"state\":\"finished\",\"livy_statement_state\":\"available\"}}","tags":[]},"outputs":[],"source":["print('Best hyperparameter config:', automl.best_config)\n","print('Best r2 on validation data: 
{0:.4g}'.format(1-automl.best_loss))\n","print('Training duration of best run: {0:.4g} s'.format(automl.best_config_train_time))\n","\n","y_pred = automl.predict(X_test)\n","print('Predicted labels', y_pred)\n","print('True labels', y_test)\n","\n","from flaml.ml import sklearn_metric_loss_score\n","print('r2', '=', 1 - sklearn_metric_loss_score('r2', y_pred, y_test))\n","print('mse', '=', sklearn_metric_loss_score('mse', y_pred, y_test))\n","print('mae', '=', sklearn_metric_loss_score('mae', y_pred, y_test))"]},{"cell_type":"code","execution_count":null,"metadata":{"jupyter":{"outputs_hidden":false,"source_hidden":false},"nteract":{"transient":{"deleting":false}}},"outputs":[],"source":[]}],"metadata":{"kernel_info":{"name":"synapse_pyspark"},"kernelspec":{"display_name":"Python 3.8.13 ('syml-py38')","language":"python","name":"python3"},"language_info":{"codemirror_mode":{"name":"ipython","version":3},"file_extension":".py","mimetype":"text/x-python","name":"python","nbconvert_exporter":"python","pygments_lexer":"ipython3","version":"3.8.13 (default, Oct 21 2022, 23:50:54) \n[GCC 11.2.0]"},"notebook_environment":{},"save_output":true,"spark_compute":{"compute_id":"/trident/default","session_options":{"conf":{"spark.livy.synapse.ipythonInterpreter.enabled":"true"},"enableDebugMode":false,"keepAliveTimeout":30}},"synapse_widget":{"state":{},"version":"0.1"},"trident":{"lakehouse":{}},"vscode":{"interpreter":{"hash":"e3d9487e2ef008ade0db1bc293d3206d35cb2b6081faff9f66b40b257b7398f7"}}},"nbformat":4,"nbformat_minor":0}