Machine Learning and Data Mining (Module 2)

Matteo Francia
DISI — University of Bologna
m.francia@unibo.it

A telemetry system for precision agriculture

The impacts of climate change unfold so rapidly that what held true last year may no longer hold today.

  • We need to act quickly and make decisions based on evidence, not just history.

F1 control room

The “Agro.Big.Data.Science” platform

ABDS: A modular platform for collecting, analyzing, and visualizing heterogeneous data from the field and the post-harvest phase.

Cimice.Net

The brown marmorated stink bug (Halyomorpha halys) is an insect pest species causing economic damage to several agricultural commodities

  • We want to build a data-driven approach to support the application of Integrated Pest Management strategies
  • We want to monitor the spread of H. halys
  • And learn the most important environmental factors

A three-year-long project

Brown marmorated stink bug

Timeline

Cimice.Net

Goal: help farmers protect crops

  • A network of monitoring traps has been deployed in Emilia-Romagna
  • Monitoring 145, 168, and 101 farms in 2020, 2021, and 2022, respectively (monitoring continued in 2023 and 2024)

Trap

Collaborative Agro SEnsing

The acquisition of data concerning the installation and monitoring of traps has been aided by CASE (Collaborative Agro SEnsing)

  • Dynamic questionnaire application for on-field data crowdsourcing in the agricultural domain
  • Facilitates and standardizes communication between on-field operators, who have first-hand visuals of a given field/orchard, and the technicians, who need a 360-degree view of all fields

Database

Multidimensional cube (DFM)

Live!

The Stinkbug dataset

Loading the data

Code
import pandas as pd
df = pd.read_csv("./datasets/cimice/captures-raw.csv", sep=',')
df
GID TIMESTAMP CROP_ID ADULTS SMALL_INSTARS LARGE_INSTARS TEMPERATURE_AVG TEMPERATURE_MAX TEMPERATURE_MIN HUMIDITY_AVG HUMIDITY_MAX HUMIDITY_MIN PRECIPITATIONS WINDSPEED_AVG WINDSPEED_MAX TOTAL_CAPTURES
0 149 2020-04-27 41 0.0 0.0 0.0 16.966669 22.683333 10.550002 63.500000 84.833333 41.500000 2.199999 1.500000 3.200000 0.0
1 144 2020-04-27 31 0.0 0.0 0.0 15.785717 22.885719 8.042862 64.571429 90.142857 39.000000 4.000001 1.657143 3.157143 0.0
2 215 2020-04-27 47 5.0 0.0 0.0 17.037496 22.787504 11.149999 53.000000 72.375000 36.250000 2.599999 2.625000 4.737500 5.0
3 222 2020-04-27 19 1.0 0.0 0.0 16.942861 23.457143 9.414279 59.857143 90.571429 38.142857 3.299999 2.857143 5.471429 1.0
4 217 2020-04-27 19 9.0 0.0 0.0 17.028571 22.885710 10.371432 58.428571 84.428571 39.571429 5.000000 2.657143 5.000000 9.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
10944 810 2022-10-17 19 10.0 0.0 0.0 17.687498 24.512502 12.149995 76.125000 NaN NaN 0.200000 NaN NaN 10.0
10945 845 2022-10-17 39 26.0 0.0 0.0 18.025003 23.975000 13.274995 76.500000 NaN NaN 0.000000 NaN NaN 26.0
10946 888 2022-10-17 11 1.0 0.0 0.0 16.900003 23.875002 10.900003 77.500000 NaN NaN 0.400000 NaN NaN 1.0
10947 885 2022-10-17 36 16.0 0.0 0.0 17.599991 23.614282 12.300003 76.857143 NaN NaN 0.299999 NaN NaN 16.0
10948 815 2022-10-17 41 3.0 0.0 0.0 17.175001 23.887502 11.062498 81.750000 NaN NaN 0.600001 NaN NaN 3.0

10949 rows × 16 columns

… and then?

How many of you apply GenAI (e.g., LLMs such as ChatGPT) to data science?

(Dis)Agree? (Welsh 2023)

“Programming will be obsolete. I believe the conventional idea of ‘writing a program’ is headed for extinction, and indeed, for all but very specialized applications, most software, as we know it, will be replaced by AI systems that are trained rather than programmed. In situations where one needs a ‘simple’ program (after all, not everything should require a model of hundreds of billions of parameters running on a cluster of GPUs), those programs will, themselves, be generated by an AI rather than coded by hand.”

(Waldo and Boussard 2025)

Despite their powerful capabilities, however, generative pre-trained transformers (GPTs) have the tendency to “hallucinate” responses. A hallucination occurs when an LLM-based GPT generates a response that is seemingly realistic yet is nonfactual, nonsensical, or inconsistent with the given prompt.

How many of you use AutoML for training Machine Learning pipelines?

AutoML in brief

Repetitive tasks in ML

Check the crash course from Google

AutoML in brief

Example of Bayesian Optimization
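
To make the idea concrete, below is a minimal sketch of Bayesian Optimization (not taken from FLAML or the original notebook): a Gaussian-process surrogate is fitted to the hyperparameter evaluations seen so far, and an expected-improvement acquisition function picks the next value to try. The toy objective, the search range, the budget, and the dense candidate grid standing in for a proper acquisition optimizer are all illustrative assumptions.

Code
# Minimal Bayesian Optimization loop: Gaussian-process surrogate + expected improvement
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def objective(x):
    # Hypothetical validation error as a function of a single hyperparameter x
    return (x - 2.0) ** 2 + 0.1 * np.sin(5 * x)

rng = np.random.default_rng(42)
X_obs = rng.uniform(0, 5, size=(3, 1))                 # a few initial random evaluations
y_obs = np.array([objective(x[0]) for x in X_obs])
grid = np.linspace(0, 5, 200).reshape(-1, 1)           # candidate hyperparameter values

for _ in range(10):                                    # 10 further evaluations after the initial ones
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True).fit(X_obs, y_obs)
    mu, sigma = gp.predict(grid, return_std=True)      # surrogate mean and uncertainty
    imp = y_obs.min() - mu                             # we are minimizing the error
    z = imp / np.maximum(sigma, 1e-9)
    ei = imp * norm.cdf(z) + sigma * norm.pdf(z)       # expected improvement
    x_next = grid[np.argmax(ei)]                       # most promising point to evaluate next
    X_obs = np.vstack([X_obs, x_next])
    y_obs = np.append(y_obs, objective(x_next[0]))

print("Best hyperparameter:", X_obs[np.argmin(y_obs)][0], "error:", y_obs.min().round(3))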

(Bie et al. 2022)

"AutoML falls squarely into the first form of automation, mechanization […] as it can be seen as yet another level of abstraction over a series of automation stages.

  1. First, there is the well-known use of programming for automation.
  2. Second, machine learning automatically generates hypotheses and predictive models, which typically take the form of algorithms (for example, in the case of a decision tree or a neural network); therefore, machine learning methods can be seen as meta-algorithms that automate programming tasks, and hence “automate automation.”
  3. And third, automated machine learning makes use of algorithms that select and configure machine learning algorithms—that is, of meta-meta-algorithms that can be understood as automating the automation of automation.”

(Bie et al. 2022)

“The impressive performance levels reached by AutoML systems are evident in the results from recent competitions.

Notably, Auto-sklearn significantly outperformed human experts in the human track of the 2015/2016 ChaLearn AutoML Challenge.”

Do we still need data scientists?

GenAI and AutoML

This code was written by ChatGPT 4o and uses AutoML to train several models given a specific budget

Code
from sklearn.metrics import root_mean_squared_error
from sklearn.model_selection import train_test_split
from flaml import AutoML

X = df.drop(columns=['ADULTS'])
y = df['ADULTS']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

automl = AutoML()
automl_settings = {
    "time_budget": 30,  # Time budget in seconds
    "max_iter": 30,  # Max number of iterations
    "metric": 'rmse',  # Evaluation metric
    "task": 'regression',  # Task type
}
automl.fit(X_train, y_train, **automl_settings)
y_pred = automl.predict(X_test)
rmse = root_mean_squared_error(y_test, y_pred)
[flaml.automl.logger: 04-08 20:44:01] {1728} INFO - task = regression
[flaml.automl.logger: 04-08 20:44:01] {1739} INFO - Evaluation method: holdout
[flaml.automl.logger: 04-08 20:44:00] {1838} INFO - Minimizing error metric: rmse
[flaml.automl.logger: 04-08 20:44:00] {1955} INFO - List of ML learners in AutoML Run: ['lgbm', 'rf', 'xgboost', 'extra_tree', 'xgb_limitdepth', 'sgd']
[flaml.automl.logger: 04-08 20:44:00] {2258} INFO - iteration 0, current learner lgbm
[flaml.automl.logger: 04-08 20:44:00] {2393} INFO - Estimated sufficient time budget=0s. Estimated necessary time budget=0s.
[flaml.automl.logger: 04-08 20:44:00] {2442} INFO -  at -0.5s,  estimator lgbm's best error=12.0178,    best estimator lgbm's best error=12.0178
[flaml.automl.logger: 04-08 20:44:00] {2258} INFO - iteration 1, current learner rf
[flaml.automl.logger: 04-08 20:44:00] {2442} INFO -  at -0.4s,  estimator rf's best error=7.9172,   best estimator rf's best error=7.9172
[flaml.automl.logger: 04-08 20:44:00] {2258} INFO - iteration 2, current learner xgboost
[flaml.automl.logger: 04-08 20:44:00] {2442} INFO -  at -0.1s,  estimator xgboost's best error=12.1446, best estimator rf's best error=7.9172
[flaml.automl.logger: 04-08 20:44:00] {2258} INFO - iteration 3, current learner extra_tree
[flaml.automl.logger: 04-08 20:44:00] {2442} INFO -  at -0.0s,  estimator extra_tree's best error=10.9591,  best estimator rf's best error=7.9172
[flaml.automl.logger: 04-08 20:44:00] {2258} INFO - iteration 4, current learner xgb_limitdepth
[flaml.automl.logger: 04-08 20:44:01] {2442} INFO -  at 0.5s,   estimator xgb_limitdepth's best error=1.8870,   best estimator xgb_limitdepth's best error=1.8870
[flaml.automl.logger: 04-08 20:44:01] {2258} INFO - iteration 5, current learner sgd
[flaml.automl.logger: 04-08 20:44:03] {2442} INFO -  at 2.5s,   estimator sgd's best error=14.5275, best estimator xgb_limitdepth's best error=1.8870
[flaml.automl.logger: 04-08 20:44:03] {2258} INFO - iteration 6, current learner lgbm
[flaml.automl.logger: 04-08 20:44:03] {2442} INFO -  at 2.8s,   estimator lgbm's best error=12.0178,    best estimator xgb_limitdepth's best error=1.8870
[flaml.automl.logger: 04-08 20:44:03] {2258} INFO - iteration 7, current learner extra_tree
[flaml.automl.logger: 04-08 20:44:03] {2442} INFO -  at 2.9s,   estimator extra_tree's best error=8.3080,   best estimator xgb_limitdepth's best error=1.8870
[flaml.automl.logger: 04-08 20:44:03] {2258} INFO - iteration 8, current learner lgbm
[flaml.automl.logger: 04-08 20:44:04] {2442} INFO -  at 3.1s,   estimator lgbm's best error=8.1328, best estimator xgb_limitdepth's best error=1.8870
[flaml.automl.logger: 04-08 20:44:04] {2258} INFO - iteration 9, current learner rf
[flaml.automl.logger: 04-08 20:44:04] {2442} INFO -  at 3.3s,   estimator rf's best error=5.3341,   best estimator xgb_limitdepth's best error=1.8870
[flaml.automl.logger: 04-08 20:44:04] {2258} INFO - iteration 10, current learner rf
[flaml.automl.logger: 04-08 20:44:04] {2442} INFO -  at 3.5s,   estimator rf's best error=5.3341,   best estimator xgb_limitdepth's best error=1.8870
[flaml.automl.logger: 04-08 20:44:04] {2258} INFO - iteration 11, current learner rf
[flaml.automl.logger: 04-08 20:44:04] {2442} INFO -  at 3.8s,   estimator rf's best error=3.5688,   best estimator xgb_limitdepth's best error=1.8870
[flaml.automl.logger: 04-08 20:44:04] {2258} INFO - iteration 12, current learner rf
[flaml.automl.logger: 04-08 20:44:05] {2442} INFO -  at 4.1s,   estimator rf's best error=2.8167,   best estimator xgb_limitdepth's best error=1.8870
[flaml.automl.logger: 04-08 20:44:05] {2258} INFO - iteration 13, current learner xgboost
[flaml.automl.logger: 04-08 20:44:05] {2442} INFO -  at 4.3s,   estimator xgboost's best error=12.1446, best estimator xgb_limitdepth's best error=1.8870
[flaml.automl.logger: 04-08 20:44:05] {2258} INFO - iteration 14, current learner xgboost
[flaml.automl.logger: 04-08 20:44:05] {2442} INFO -  at 4.5s,   estimator xgboost's best error=8.1855,  best estimator xgb_limitdepth's best error=1.8870
[flaml.automl.logger: 04-08 20:44:05] {2258} INFO - iteration 15, current learner extra_tree
[flaml.automl.logger: 04-08 20:44:05] {2442} INFO -  at 4.6s,   estimator extra_tree's best error=8.3080,   best estimator xgb_limitdepth's best error=1.8870
[flaml.automl.logger: 04-08 20:44:05] {2258} INFO - iteration 16, current learner xgboost
[flaml.automl.logger: 04-08 20:44:05] {2442} INFO -  at 4.8s,   estimator xgboost's best error=5.7739,  best estimator xgb_limitdepth's best error=1.8870
[flaml.automl.logger: 04-08 20:44:05] {2258} INFO - iteration 17, current learner xgb_limitdepth
[flaml.automl.logger: 04-08 20:44:06] {2442} INFO -  at 5.2s,   estimator xgb_limitdepth's best error=1.8870,   best estimator xgb_limitdepth's best error=1.8870
[flaml.automl.logger: 04-08 20:44:06] {2258} INFO - iteration 18, current learner rf
[flaml.automl.logger: 04-08 20:44:06] {2442} INFO -  at 5.5s,   estimator rf's best error=2.8167,   best estimator xgb_limitdepth's best error=1.8870
[flaml.automl.logger: 04-08 20:44:06] {2258} INFO - iteration 19, current learner xgb_limitdepth
[flaml.automl.logger: 04-08 20:44:06] {2442} INFO -  at 5.9s,   estimator xgb_limitdepth's best error=1.8870,   best estimator xgb_limitdepth's best error=1.8870
[flaml.automl.logger: 04-08 20:44:06] {2258} INFO - iteration 20, current learner lgbm
[flaml.automl.logger: 04-08 20:44:07] {2442} INFO -  at 6.1s,   estimator lgbm's best error=4.8052, best estimator xgb_limitdepth's best error=1.8870
[flaml.automl.logger: 04-08 20:44:07] {2258} INFO - iteration 21, current learner extra_tree
[flaml.automl.logger: 04-08 20:44:07] {2442} INFO -  at 6.3s,   estimator extra_tree's best error=5.3649,   best estimator xgb_limitdepth's best error=1.8870
[flaml.automl.logger: 04-08 20:44:07] {2258} INFO - iteration 22, current learner xgboost
[flaml.automl.logger: 04-08 20:44:07] {2442} INFO -  at 6.4s,   estimator xgboost's best error=5.7739,  best estimator xgb_limitdepth's best error=1.8870
[flaml.automl.logger: 04-08 20:44:07] {2258} INFO - iteration 23, current learner extra_tree
[flaml.automl.logger: 04-08 20:44:07] {2442} INFO -  at 6.6s,   estimator extra_tree's best error=3.5723,   best estimator xgb_limitdepth's best error=1.8870
[flaml.automl.logger: 04-08 20:44:07] {2258} INFO - iteration 24, current learner lgbm
[flaml.automl.logger: 04-08 20:44:07] {2442} INFO -  at 6.7s,   estimator lgbm's best error=4.8052, best estimator xgb_limitdepth's best error=1.8870
[flaml.automl.logger: 04-08 20:44:07] {2258} INFO - iteration 25, current learner rf
[flaml.automl.logger: 04-08 20:44:08] {2442} INFO -  at 7.0s,   estimator rf's best error=2.8167,   best estimator xgb_limitdepth's best error=1.8870
[flaml.automl.logger: 04-08 20:44:08] {2258} INFO - iteration 26, current learner xgb_limitdepth
[flaml.automl.logger: 04-08 20:44:08] {2442} INFO -  at 7.4s,   estimator xgb_limitdepth's best error=1.8870,   best estimator xgb_limitdepth's best error=1.8870
[flaml.automl.logger: 04-08 20:44:08] {2258} INFO - iteration 27, current learner sgd
[flaml.automl.logger: 04-08 20:44:10] {2442} INFO -  at 9.8s,   estimator sgd's best error=13.3065, best estimator xgb_limitdepth's best error=1.8870
[flaml.automl.logger: 04-08 20:44:10] {2258} INFO - iteration 28, current learner xgboost
[flaml.automl.logger: 04-08 20:44:11] {2442} INFO -  at 10.1s,  estimator xgboost's best error=5.7739,  best estimator xgb_limitdepth's best error=1.8870
[flaml.automl.logger: 04-08 20:44:11] {2258} INFO - iteration 29, current learner extra_tree
[flaml.automl.logger: 04-08 20:44:11] {2442} INFO -  at 10.3s,  estimator extra_tree's best error=3.5723,   best estimator xgb_limitdepth's best error=1.8870
[flaml.automl.logger: 04-08 20:44:11] {2685} INFO - retrain xgb_limitdepth for 0.3s
[flaml.automl.logger: 04-08 20:44:11] {2688} INFO - retrained model: XGBRegressor(base_score=None, booster=None, callbacks=[], colsample_bylevel=1.0,
             colsample_bynode=None, colsample_bytree=1.0, device=None,
             early_stopping_rounds=None, enable_categorical=False,
             eval_metric=None, feature_types=None, gamma=None, grow_policy=None,
             importance_type=None, interaction_constraints=None,
             learning_rate=np.float64(0.29999999999999993), max_bin=None,
             max_cat_threshold=None, max_cat_to_onehot=None,
             max_delta_step=None, max_depth=6, max_leaves=None,
             min_child_weight=np.float64(0.9999999999999993), missing=nan,
             monotone_constraints=None, multi_strategy=None, n_estimators=10,
             n_jobs=-1, num_parallel_tree=None, random_state=None, ...)
[flaml.automl.logger: 04-08 20:44:11] {1985} INFO - fit succeeded
[flaml.automl.logger: 04-08 20:44:11] {1986} INFO - Time taken to find the best model: 0.4630732536315918

Code
print("RMSE using AutoML:", round(rmse, 2))
RMSE using AutoML: 2.19

Is this error high/low?

RMSE vs NRMSE

\(RMSE = \sqrt{\frac{1}{n}\sum (y_{true} - y_{pred})^2}\)

The squared differences have squared units, but the square root brings it back to the same unit as the target variable

\(NRMSE = \frac{RMSE}{avg(y_{true})}\)

Code
import matplotlib.pyplot as plt
mean_df = df.groupby('TIMESTAMP')['ADULTS'].mean()
mean_df.plot(kind='bar', figsize=(4, 3), color='blue')
plt.xticks([])  # Hide the x-axis ticks
mean = mean_df.mean()
plt.axhline(y=mean, color='red', linestyle='--', label='Average ADULTS')
plt.legend()

print(f"RMSE {rmse:.3}, Mean {mean:.3}, NRMSE {rmse / mean:.2%}")
RMSE 2.19, Mean 8.92, NRMSE 24.52%

Let’s try to better understand the data

Code
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10949 entries, 0 to 10948
Data columns (total 16 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   GID              10949 non-null  int64  
 1   TIMESTAMP        10949 non-null  object 
 2   CROP_ID          10949 non-null  int64  
 3   ADULTS           10949 non-null  float64
 4   SMALL_INSTARS    10949 non-null  float64
 5   LARGE_INSTARS    10949 non-null  float64
 6   TEMPERATURE_AVG  10949 non-null  float64
 7   TEMPERATURE_MAX  10949 non-null  float64
 8   TEMPERATURE_MIN  10949 non-null  float64
 9   HUMIDITY_AVG     10949 non-null  float64
 10  HUMIDITY_MAX     8180 non-null   float64
 11  HUMIDITY_MIN     8180 non-null   float64
 12  PRECIPITATIONS   10949 non-null  float64
 13  WINDSPEED_AVG    8180 non-null   float64
 14  WINDSPEED_MAX    8180 non-null   float64
 15  TOTAL_CAPTURES   10949 non-null  float64
dtypes: float64(13), int64(2), object(1)
memory usage: 1.3+ MB

Let’s try simpler models

Code
from sklearn.linear_model import LinearRegression

X = df.drop(columns=['ADULTS'])  # Drop the target variable 'ADULTS' from the feature set
X['TIMESTAMP'] = pd.to_datetime(X['TIMESTAMP']).astype(int) / 10**9  # Convert the 'TIMESTAMP' column to datetime and then to a Unix timestamp (seconds)
X.fillna(X.mean(), inplace=True)  # Fill missing values with the mean of each column
y = df['ADULTS']  # Define the target variable
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)  # Split the dataset into training and testing sets (80% train, 20% test)
log_reg = LinearRegression()  # Initialize a Linear Regression model (linear, not logistic, despite the variable name)
log_reg.fit(X_train, y_train)
y_pred_log_reg = log_reg.predict(X_test)  # Predict the target variable on the test set
# Calculate Root Mean Squared Error (RMSE)
rmse_log_reg = root_mean_squared_error(y_test, y_pred_log_reg)
print("RMSE using Linear Regression:", round(rmse_log_reg, 2))
RMSE using Linear Regression: 0.0

Why?

Let’s take a look at the features. Do they ring a bell?

Code
df.columns
Index(['GID', 'TIMESTAMP', 'CROP_ID', 'ADULTS', 'SMALL_INSTARS',
       'LARGE_INSTARS', 'TEMPERATURE_AVG', 'TEMPERATURE_MAX',
       'TEMPERATURE_MIN', 'HUMIDITY_AVG', 'HUMIDITY_MAX', 'HUMIDITY_MIN',
       'PRECIPITATIONS', 'WINDSPEED_AVG', 'WINDSPEED_MAX', 'TOTAL_CAPTURES'],
      dtype='object')

Code
log_reg.coef_  # weights (coefficients) of the Linear Regression model
array([-1.36733869e-15,  9.57400348e-19,  3.29282180e-16, -1.00000000e+00,
       -1.00000000e+00, -3.97477621e-15,  2.59706955e-15,  1.71515845e-15,
        7.82904180e-16,  1.03568861e-15, -1.04128233e-15, -1.15471256e-16,
       -1.29189124e-15, -3.68890616e-17,  1.00000000e+00])

LinearRegression correctly captures this pattern; AutoML does not

ADULTS = - SMALL_INSTARS - LARGE_INSTARS + TOTAL_CAPTURES

Code
equation = " ".join(f"{'' if weight < 0 else '+'}{round(weight, 2)}*{col}" for weight, col in zip(log_reg.coef_, X_train.columns) if abs(weight) > 0.01)
print(f"ADULTS = {equation}")
ADULTS = -1.0*SMALL_INSTARS -1.0*LARGE_INSTARS +1.0*TOTAL_CAPTURES

Let’s drop TOTAL_CAPTURES

This was a trick: usually we do not have this kind of derived feature in a dataset
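
Before dropping the column, a quick sanity check (not part of the original code) can confirm the suspected identity directly on the raw data; the tolerance threshold is an assumption.

Code
# Hypothetical check that TOTAL_CAPTURES is a derived column:
# it should equal ADULTS + SMALL_INSTARS + LARGE_INSTARS on every row
residual = df['TOTAL_CAPTURES'] - (df['ADULTS'] + df['SMALL_INSTARS'] + df['LARGE_INSTARS'])
print("Fraction of rows satisfying the identity:", (residual.abs() < 1e-9).mean())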

Code
X = df.drop(columns=['ADULTS', 'TOTAL_CAPTURES'])
y = df['ADULTS']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
automl = AutoML()
automl_settings = {
    "time_budget": 30,  # Time budget in seconds
    "max_iter": 30,  # Max number of iterations
    "metric": 'rmse',  # Evaluation metric
    "task": 'regression',  # Task type
}
automl.fit(X_train, y_train, **automl_settings)
y_pred = automl.predict(X_test)
rmse = root_mean_squared_error(y_test, y_pred)
[flaml.automl.logger: 04-08 20:44:12] {1728} INFO - task = regression
[flaml.automl.logger: 04-08 20:44:12] {1739} INFO - Evaluation method: holdout
[flaml.automl.logger: 04-08 20:44:12] {1838} INFO - Minimizing error metric: rmse
[flaml.automl.logger: 04-08 20:44:12] {1955} INFO - List of ML learners in AutoML Run: ['lgbm', 'rf', 'xgboost', 'extra_tree', 'xgb_limitdepth', 'sgd']
[flaml.automl.logger: 04-08 20:44:12] {2258} INFO - iteration 0, current learner lgbm
[flaml.automl.logger: 04-08 20:44:12] {2393} INFO - Estimated sufficient time budget=582s. Estimated necessary time budget=4s.
[flaml.automl.logger: 04-08 20:44:12] {2442} INFO -  at 0.2s,   estimator lgbm's best error=14.1248,    best estimator lgbm's best error=14.1248
[flaml.automl.logger: 04-08 20:44:12] {2258} INFO - iteration 1, current learner lgbm
[flaml.automl.logger: 04-08 20:44:13] {2442} INFO -  at 0.2s,   estimator lgbm's best error=14.1248,    best estimator lgbm's best error=14.1248
[flaml.automl.logger: 04-08 20:44:13] {2258} INFO - iteration 2, current learner sgd
[flaml.automl.logger: 04-08 20:44:14] {2442} INFO -  at 2.1s,   estimator sgd's best error=16.9466, best estimator lgbm's best error=14.1248
[flaml.automl.logger: 04-08 20:44:14] {2258} INFO - iteration 3, current learner lgbm
[flaml.automl.logger: 04-08 20:44:15] {2442} INFO -  at 2.3s,   estimator lgbm's best error=12.7827,    best estimator lgbm's best error=12.7827
[flaml.automl.logger: 04-08 20:44:15] {2258} INFO - iteration 4, current learner xgboost
[flaml.automl.logger: 04-08 20:44:15] {2442} INFO -  at 2.4s,   estimator xgboost's best error=14.8380, best estimator lgbm's best error=12.7827
[flaml.automl.logger: 04-08 20:44:15] {2258} INFO - iteration 5, current learner lgbm
[flaml.automl.logger: 04-08 20:44:15] {2442} INFO -  at 2.4s,   estimator lgbm's best error=12.2781,    best estimator lgbm's best error=12.2781
[flaml.automl.logger: 04-08 20:44:15] {2258} INFO - iteration 6, current learner lgbm
[flaml.automl.logger: 04-08 20:44:15] {2442} INFO -  at 2.5s,   estimator lgbm's best error=12.2781,    best estimator lgbm's best error=12.2781
[flaml.automl.logger: 04-08 20:44:15] {2258} INFO - iteration 7, current learner lgbm
[flaml.automl.logger: 04-08 20:44:15] {2442} INFO -  at 2.5s,   estimator lgbm's best error=12.0187,    best estimator lgbm's best error=12.0187
[flaml.automl.logger: 04-08 20:44:15] {2258} INFO - iteration 8, current learner lgbm
[flaml.automl.logger: 04-08 20:44:15] {2442} INFO -  at 2.6s,   estimator lgbm's best error=12.0187,    best estimator lgbm's best error=12.0187
[flaml.automl.logger: 04-08 20:44:15] {2258} INFO - iteration 9, current learner lgbm
[flaml.automl.logger: 04-08 20:44:15] {2442} INFO -  at 2.7s,   estimator lgbm's best error=12.0187,    best estimator lgbm's best error=12.0187
[flaml.automl.logger: 04-08 20:44:15] {2258} INFO - iteration 10, current learner lgbm
[flaml.automl.logger: 04-08 20:44:15] {2442} INFO -  at 2.8s,   estimator lgbm's best error=11.6593,    best estimator lgbm's best error=11.6593
[flaml.automl.logger: 04-08 20:44:15] {2258} INFO - iteration 11, current learner xgboost
[flaml.automl.logger: 04-08 20:44:15] {2442} INFO -  at 2.9s,   estimator xgboost's best error=14.8380, best estimator lgbm's best error=11.6593
[flaml.automl.logger: 04-08 20:44:15] {2258} INFO - iteration 12, current learner xgboost
[flaml.automl.logger: 04-08 20:44:15] {2442} INFO -  at 3.0s,   estimator xgboost's best error=13.9034, best estimator lgbm's best error=11.6593
[flaml.automl.logger: 04-08 20:44:15] {2258} INFO - iteration 13, current learner extra_tree
[flaml.automl.logger: 04-08 20:44:15] {2442} INFO -  at 3.1s,   estimator extra_tree's best error=14.7095,  best estimator lgbm's best error=11.6593
[flaml.automl.logger: 04-08 20:44:15] {2258} INFO - iteration 14, current learner extra_tree
[flaml.automl.logger: 04-08 20:44:15] {2442} INFO -  at 3.2s,   estimator extra_tree's best error=14.1609,  best estimator lgbm's best error=11.6593
[flaml.automl.logger: 04-08 20:44:15] {2258} INFO - iteration 15, current learner rf
[flaml.automl.logger: 04-08 20:44:16] {2442} INFO -  at 3.3s,   estimator rf's best error=13.8960,  best estimator lgbm's best error=11.6593
[flaml.automl.logger: 04-08 20:44:16] {2258} INFO - iteration 16, current learner rf
[flaml.automl.logger: 04-08 20:44:16] {2442} INFO -  at 3.5s,   estimator rf's best error=12.9380,  best estimator lgbm's best error=11.6593
[flaml.automl.logger: 04-08 20:44:16] {2258} INFO - iteration 17, current learner lgbm
[flaml.automl.logger: 04-08 20:44:16] {2442} INFO -  at 3.9s,   estimator lgbm's best error=11.6593,    best estimator lgbm's best error=11.6593
[flaml.automl.logger: 04-08 20:44:16] {2258} INFO - iteration 18, current learner extra_tree
[flaml.automl.logger: 04-08 20:44:16] {2442} INFO -  at 4.0s,   estimator extra_tree's best error=14.1609,  best estimator lgbm's best error=11.6593
[flaml.automl.logger: 04-08 20:44:16] {2258} INFO - iteration 19, current learner lgbm
[flaml.automl.logger: 04-08 20:44:21] {2442} INFO -  at 8.4s,   estimator lgbm's best error=11.2939,    best estimator lgbm's best error=11.2939
[flaml.automl.logger: 04-08 20:44:21] {2258} INFO - iteration 20, current learner xgboost
[flaml.automl.logger: 04-08 20:44:21] {2442} INFO -  at 8.5s,   estimator xgboost's best error=13.0185, best estimator lgbm's best error=11.2939
[flaml.automl.logger: 04-08 20:44:21] {2258} INFO - iteration 21, current learner xgboost
[flaml.automl.logger: 04-08 20:44:21] {2442} INFO -  at 8.7s,   estimator xgboost's best error=13.0185, best estimator lgbm's best error=11.2939
[flaml.automl.logger: 04-08 20:44:21] {2258} INFO - iteration 22, current learner xgboost
[flaml.automl.logger: 04-08 20:44:21] {2442} INFO -  at 8.8s,   estimator xgboost's best error=13.0185, best estimator lgbm's best error=11.2939
[flaml.automl.logger: 04-08 20:44:21] {2258} INFO - iteration 23, current learner rf
[flaml.automl.logger: 04-08 20:44:21] {2442} INFO -  at 9.0s,   estimator rf's best error=12.9380,  best estimator lgbm's best error=11.2939
[flaml.automl.logger: 04-08 20:44:21] {2258} INFO - iteration 24, current learner rf
[flaml.automl.logger: 04-08 20:44:22] {2442} INFO -  at 9.3s,   estimator rf's best error=12.5801,  best estimator lgbm's best error=11.2939
[flaml.automl.logger: 04-08 20:44:22] {2258} INFO - iteration 25, current learner sgd
[flaml.automl.logger: 04-08 20:44:22] {2442} INFO -  at 9.4s,   estimator sgd's best error=16.9466, best estimator lgbm's best error=11.2939
[flaml.automl.logger: 04-08 20:44:22] {2258} INFO - iteration 26, current learner sgd
[flaml.automl.logger: 04-08 20:44:22] {2442} INFO -  at 9.5s,   estimator sgd's best error=16.0682, best estimator lgbm's best error=11.2939
[flaml.automl.logger: 04-08 20:44:22] {2258} INFO - iteration 27, current learner xgboost
[flaml.automl.logger: 04-08 20:44:22] {2442} INFO -  at 9.7s,   estimator xgboost's best error=12.5459, best estimator lgbm's best error=11.2939
[flaml.automl.logger: 04-08 20:44:22] {2258} INFO - iteration 28, current learner extra_tree
[flaml.automl.logger: 04-08 20:44:22] {2442} INFO -  at 9.9s,   estimator extra_tree's best error=12.7479,  best estimator lgbm's best error=11.2939
[flaml.automl.logger: 04-08 20:44:22] {2258} INFO - iteration 29, current learner sgd
[flaml.automl.logger: 04-08 20:44:24] {2442} INFO -  at 12.0s,  estimator sgd's best error=16.0682, best estimator lgbm's best error=11.2939
[flaml.automl.logger: 04-08 20:44:29] {2685} INFO - retrain lgbm for 4.7s
[flaml.automl.logger: 04-08 20:44:29] {2688} INFO - retrained model: LGBMRegressor(colsample_bytree=np.float64(0.6649148062238498),
              learning_rate=np.float64(0.17402065726724145), max_bin=255,
              min_child_samples=3, n_estimators=99, n_jobs=-1, num_leaves=15,
              reg_alpha=0.0009765625,
              reg_lambda=np.float64(0.006761362450996487), verbose=-1)
[flaml.automl.logger: 04-08 20:44:29] {1985} INFO - fit succeeded
[flaml.automl.logger: 04-08 20:44:29] {1986} INFO - Time taken to find the best model: 8.385986566543579

Code
print("RMSE using AutoML:", round(rmse, 2))
RMSE using AutoML: 11.13
Code
automl.best_estimator
'lgbm'
Code
automl.best_config
{'n_estimators': 99,
 'num_leaves': 15,
 'min_child_samples': 3,
 'learning_rate': np.float64(0.17402065726724145),
 'log_max_bin': 8,
 'colsample_bytree': np.float64(0.6649148062238498),
 'reg_alpha': 0.0009765625,
 'reg_lambda': np.float64(0.006761362450996487)}

Can we do better?

From the literature, we know that the spread of the bugs depends on:

  1. Cumulative degree days: \(\sum_{d \in [\text{May-Sep}]} \max(0, T^{avg}_d - 12.2)\) (Forresi et al. 2024), as sketched below
  2. Time. How?
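
As an illustration of the degree-day formula, here is a minimal sketch computed on a hypothetical daily series of average temperatures (the dates and the synthetic temperatures are assumptions); the preprocessing step later in the notebook approximates the same quantity from the weekly trap observations.

Code
# Cumulative degree days over May-September for a hypothetical daily temperature series
import numpy as np
import pandas as pd

days = pd.date_range("2020-05-01", "2020-09-30", freq="D")  # d in [May-Sep]
t_avg = pd.Series(20 + 8 * np.sin(np.linspace(0, np.pi, len(days))), index=days)  # synthetic daily T_avg
degree_days = (t_avg - 12.2).clip(lower=0)  # max(0, T_avg_d - 12.2)
print("Cumulative degree days:", round(degree_days.sum(), 1))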

Captures over time

Code
fig, ax = plt.subplots(1, 1, figsize=(4, 3))

for c in ["ADULTS", "LARGE_INSTARS", "SMALL_INSTARS"]:
    df.groupby("TIMESTAMP")[c].mean().plot(x="TIMESTAMP", y=c, ax=ax)

plt.xticks(rotation=90)
fig.legend()
fig.tight_layout()

Predicting Adults with past (previous) Adults

Code
fig, ax = plt.subplots(1, 1, figsize=(4, 3))

for c in [0, 1, 4]:
    df.groupby('TIMESTAMP')['ADULTS'].mean().shift(c).fillna(0).plot(x="ADULTS", y=c, ax=ax, label=f"ADULTS-{c}")

plt.xticks(rotation=90)
fig.legend()
fig.tight_layout()

Preprocessing

Code
df = pd.read_csv("./datasets/cimice/captures-raw.csv", sep=',')  # Load the dataset from a CSV file

df['TIMESTAMP'] = pd.to_datetime(df['TIMESTAMP'])  # Convert 'TIMESTAMP' to datetime format for time-based analysis
df.sort_values(by=['GID', 'TIMESTAMP'], inplace=True)  # Sort dataset by 'GID' (Group ID) and 'TIMESTAMP' to maintain chronological order

df['WEEK'] = df['TIMESTAMP'].dt.isocalendar().week  # Extract the week number from the timestamp for seasonal analysis

df['DEGREE_DAYS'] = df['TEMPERATURE_AVG'].apply(lambda x: max(0, x - 12.2))  # Calculate degree days: only consider temperatures above 12.2°C
df['DAYS_DIFF'] = df.groupby('GID')['TIMESTAMP'].diff().dt.days.fillna(14)  # Compute the time difference (in days) between consecutive observations per 'GID'
df['CUM_DEGREE_DAYS'] = df.groupby('GID').apply(lambda group: (group['DEGREE_DAYS'] * group['DAYS_DIFF']).cumsum()).reset_index(level=0, drop=True)  # Compute cumulative degree days for each 'GID' (sum of DEGREE_DAYS * DAYS_DIFF)

df['ADULTS-1'] = df.groupby('GID')['ADULTS'].shift(1).fillna(0)  # Create a lag feature: previous 'ADULTS' count for each 'GID', fill missing values with 0

df['TIMESTAMP'] = df['TIMESTAMP'].astype(int) / 10**9  # Convert 'TIMESTAMP' to Unix timestamp (seconds) for numerical processing

df.fillna(df.groupby('WEEK').transform('mean'), inplace=True)  # Fill missing values using the mean of each column, grouped by 'WEEK'

df.drop(columns=['TOTAL_CAPTURES'], inplace=True)
X = df.drop(columns=['ADULTS'])
y = df['ADULTS']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

df.head()
GID TIMESTAMP CROP_ID ADULTS SMALL_INSTARS LARGE_INSTARS TEMPERATURE_AVG TEMPERATURE_MAX TEMPERATURE_MIN HUMIDITY_AVG HUMIDITY_MAX HUMIDITY_MIN PRECIPITATIONS WINDSPEED_AVG WINDSPEED_MAX WEEK DEGREE_DAYS DAYS_DIFF CUM_DEGREE_DAYS ADULTS-1
7 24 1.587946e+09 47 2.0 0.0 0.0 16.871423 23.171424 9.642855 61.142857 84.571429 41.285714 3.199999 2.114286 3.885714 18 4.671423 14.0 65.399927 0.0
43 24 1.588550e+09 47 1.0 0.0 0.0 16.700002 22.900001 9.314286 62.714286 86.857143 37.714286 0.000000 2.014286 3.728571 19 4.500002 7.0 96.899939 2.0
194 24 1.589155e+09 47 1.0 0.0 0.0 18.485712 24.899997 12.028571 65.857143 87.571429 42.714286 3.199997 1.885714 3.514286 20 6.285712 7.0 140.899921 1.0
358 24 1.589760e+09 47 2.0 0.0 0.0 19.362497 24.737505 13.624998 67.875000 89.375000 46.250000 2.000004 1.762500 3.237500 21 7.162497 7.0 191.037399 1.0
453 24 1.590365e+09 47 0.0 0.0 0.0 19.699996 25.883335 11.683333 55.333333 84.500000 32.500000 2.600000 2.033333 3.816667 22 7.499996 7.0 243.537371 2.0

Average temperature vs cumulative degree days

Code
import matplotlib.pyplot as plt
# Calculate weekly averages for temperature and cumulative degree days
weekly_avg_temp = df.groupby('WEEK')['TEMPERATURE_AVG'].mean()
weekly_avg_cum_degree_days = df.groupby('WEEK')['CUM_DEGREE_DAYS'].mean()
fig, axes = plt.subplots(1, 2, figsize=(8, 4))  # Create subplots
axes[0].plot(weekly_avg_temp.index, weekly_avg_temp.values, marker='o', color='blue')  # Plot average temperature by week
axes[0].set_title('Average Temperature by Week')
axes[0].set_xlabel('Week')
axes[0].set_ylabel('Temperature (°C)')
axes[1].plot(weekly_avg_cum_degree_days.index, weekly_avg_cum_degree_days.values, marker='o', color='green')  # Plot average cumulative degree days by week
axes[1].set_title('Average Cumulative Degree Days by Week')
axes[1].set_xlabel('Week')
axes[1].set_ylabel('Cumulative Degree Days')
plt.tight_layout()

Data understanding

Code
numerical_columns = df.select_dtypes(include=['float64', 'int64']).columns  # Select numerical columns
df[numerical_columns].hist(bins=50, figsize=(16, 9), color='blue', edgecolor='black')  # Plot histograms for numerical columns
plt.suptitle('Distribution of Numerical Features', fontsize=16)
plt.tight_layout()

Data understanding

Code
import seaborn as sns
import numpy as np
correlation_matrix = df.corr()  # Compute the correlation matrix
filtered_correlation_matrix = correlation_matrix[correlation_matrix.abs() >= 0.25]  # Mask correlations with absolute values below 0.25
filtered_correlation_matrix[np.eye(filtered_correlation_matrix.shape[0], dtype=bool)] = np.nan  # Mask the diagonal
plt.figure(figsize=(10, 10))
sns.heatmap(filtered_correlation_matrix, annot=True, fmt=".2f", cmap="coolwarm", cbar=True)
plt.title("Pairwise Correlations (|correlation| >= 0.25)")
plt.tight_layout()

Plotting the correlations

Code
plt.figure(figsize=(10, 10))
sns.pairplot(df[["ADULTS" , "LARGE_INSTARS", "WEEK", "CUM_DEGREE_DAYS", "ADULTS-1"]])
<Figure size 1000x1000 with 0 Axes>

Let’s apply simple models

Code
from sklearn.linear_model import LinearRegression
log_reg = LinearRegression()
log_reg.fit(X_train, y_train)
y_pred_log_reg = log_reg.predict(X_test)
rmse_log_reg = root_mean_squared_error(y_test, y_pred_log_reg)
print("RMSE using Logistic Regression:", round(rmse_log_reg, 2))
RMSE using Logistic Regression: 9.88
Code
from sklearn.ensemble import RandomForestRegressor
rf_regressor = RandomForestRegressor(random_state=42)
rf_regressor.fit(X_train, y_train)
y_pred_rf = rf_regressor.predict(X_test)
rmse_rf = root_mean_squared_error(y_test, y_pred_rf)
print("RMSE using Random Forest Regressor:", round(rmse_rf, 2))
RMSE using Random Forest Regressor: 9.57

Let’s do hyper-parameter tuning on RF

Code
from sklearn.model_selection import RandomizedSearchCV
rf = RandomForestRegressor(random_state=42)  # Define the base model
param_dist = {  # Define the hyperparameter grid
    'n_estimators': [50, 100, 200, 300, 400],
    'max_depth': [10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': ['sqrt', 'log2']
}
random_search = RandomizedSearchCV(estimator=rf, param_distributions=param_dist, n_iter=20, cv=5, scoring='neg_root_mean_squared_error', verbose=2, n_jobs=-1, random_state=42)  # Use RandomizedSearchCV for rough tuning
random_search.fit(X_train, y_train)  # Best parameters from Randomized Search
best_model = random_search.best_estimator_
y_pred_rf = best_model.predict(X_test)  # Make predictions
rmse_rf = root_mean_squared_error(y_test, y_pred_rf)  # Compute RMSE
Fitting 5 folds for each of 20 candidates, totalling 100 fits
[CV] END max_depth=10, max_features=sqrt, min_samples_leaf=4, min_samples_split=2, n_estimators=50; total time=   1.8s
[CV] END max_depth=10, max_features=sqrt, min_samples_leaf=4, min_samples_split=2, n_estimators=50; total time=   1.7s
[CV] END max_depth=10, max_features=sqrt, min_samples_leaf=4, min_samples_split=2, n_estimators=50; total time=   1.8s
[CV] END max_depth=10, max_features=sqrt, min_samples_leaf=4, min_samples_split=2, n_estimators=50; total time=   2.6s
[CV] END max_depth=10, max_features=sqrt, min_samples_leaf=4, min_samples_split=2, n_estimators=50; total time=   2.7s
[CV] END max_depth=20, max_features=sqrt, min_samples_leaf=2, min_samples_split=10, n_estimators=100; total time=   4.7s
[CV] END max_depth=20, max_features=sqrt, min_samples_leaf=2, min_samples_split=10, n_estimators=100; total time=   4.9s
[CV] END max_depth=20, max_features=sqrt, min_samples_leaf=2, min_samples_split=10, n_estimators=100; total time=   4.9s
[CV] END max_depth=20, max_features=sqrt, min_samples_leaf=2, min_samples_split=10, n_estimators=100; total time=   7.4s
[CV] END max_depth=20, max_features=sqrt, min_samples_leaf=2, min_samples_split=10, n_estimators=100; total time=   7.7s
[CV] END max_depth=10, max_features=log2, min_samples_leaf=4, min_samples_split=2, n_estimators=400; total time=  13.7s
[CV] END max_depth=20, max_features=sqrt, min_samples_leaf=4, min_samples_split=5, n_estimators=200; total time=   9.2s
[CV] END max_depth=10, max_features=log2, min_samples_leaf=4, min_samples_split=2, n_estimators=400; total time=  14.1s
[CV] END max_depth=30, max_features=sqrt, min_samples_leaf=2, min_samples_split=2, n_estimators=100; total time=   6.0s
[CV] END max_depth=20, max_features=sqrt, min_samples_leaf=4, min_samples_split=5, n_estimators=200; total time=   9.5s
[CV] END max_depth=20, max_features=sqrt, min_samples_leaf=4, min_samples_split=5, n_estimators=200; total time=   9.5s
[CV] END max_depth=30, max_features=sqrt, min_samples_leaf=2, min_samples_split=2, n_estimators=100; total time=   5.9s
[CV] END max_depth=10, max_features=log2, min_samples_leaf=4, min_samples_split=2, n_estimators=400; total time=  13.7s
[CV] END max_depth=20, max_features=sqrt, min_samples_leaf=4, min_samples_split=5, n_estimators=200; total time=  11.8s
[CV] END max_depth=20, max_features=sqrt, min_samples_leaf=4, min_samples_split=5, n_estimators=200; total time=  11.7s
[CV] END max_depth=10, max_features=log2, min_samples_leaf=4, min_samples_split=2, n_estimators=400; total time=  13.9s
[CV] END max_depth=10, max_features=log2, min_samples_leaf=4, min_samples_split=2, n_estimators=400; total time=  14.5s
[CV] END max_depth=30, max_features=sqrt, min_samples_leaf=2, min_samples_split=2, n_estimators=100; total time=   5.3s
[CV] END max_depth=30, max_features=sqrt, min_samples_leaf=2, min_samples_split=2, n_estimators=100; total time=   5.3s
[CV] END max_depth=30, max_features=sqrt, min_samples_leaf=2, min_samples_split=2, n_estimators=100; total time=   6.0s
[CV] END max_depth=10, max_features=log2, min_samples_leaf=1, min_samples_split=2, n_estimators=50; total time=   1.0s
[CV] END max_depth=10, max_features=log2, min_samples_leaf=1, min_samples_split=2, n_estimators=50; total time=   1.0s
[CV] END max_depth=10, max_features=log2, min_samples_leaf=1, min_samples_split=2, n_estimators=50; total time=   2.0s
[CV] END max_depth=10, max_features=log2, min_samples_leaf=1, min_samples_split=2, n_estimators=50; total time=   1.9s
[CV] END max_depth=10, max_features=log2, min_samples_leaf=1, min_samples_split=2, n_estimators=50; total time=   1.9s
[CV] END max_depth=20, max_features=log2, min_samples_leaf=1, min_samples_split=2, n_estimators=200; total time=  11.8s
[CV] END max_depth=20, max_features=log2, min_samples_leaf=1, min_samples_split=2, n_estimators=200; total time=  11.9s
[CV] END max_depth=20, max_features=log2, min_samples_leaf=1, min_samples_split=2, n_estimators=200; total time=  11.9s
[CV] END max_depth=20, max_features=log2, min_samples_leaf=1, min_samples_split=2, n_estimators=200; total time=  17.9s
[CV] END max_depth=20, max_features=log2, min_samples_leaf=1, min_samples_split=2, n_estimators=200; total time=  18.7s
[CV] END max_depth=30, max_features=sqrt, min_samples_leaf=2, min_samples_split=10, n_estimators=400; total time=  21.5s
[CV] END max_depth=30, max_features=sqrt, min_samples_leaf=2, min_samples_split=10, n_estimators=400; total time=  21.4s
[CV] END max_depth=20, max_features=log2, min_samples_leaf=2, min_samples_split=5, n_estimators=300; total time=  17.6s
[CV] END max_depth=30, max_features=sqrt, min_samples_leaf=2, min_samples_split=10, n_estimators=400; total time=  24.4s
[CV] END max_depth=30, max_features=log2, min_samples_leaf=2, min_samples_split=5, n_estimators=200; total time=  11.7s
[CV] END max_depth=20, max_features=log2, min_samples_leaf=2, min_samples_split=5, n_estimators=300; total time=  18.9s
[CV] END max_depth=20, max_features=log2, min_samples_leaf=2, min_samples_split=5, n_estimators=300; total time=  19.1s
[CV] END max_depth=30, max_features=sqrt, min_samples_leaf=2, min_samples_split=10, n_estimators=400; total time=  30.4s
[CV] END max_depth=20, max_features=log2, min_samples_leaf=2, min_samples_split=5, n_estimators=300; total time=  20.2s
[CV] END max_depth=30, max_features=log2, min_samples_leaf=2, min_samples_split=5, n_estimators=200; total time=  11.4s
[CV] END max_depth=20, max_features=log2, min_samples_leaf=2, min_samples_split=5, n_estimators=300; total time=  25.3s
[CV] END max_depth=30, max_features=sqrt, min_samples_leaf=2, min_samples_split=10, n_estimators=400; total time=  33.6s
[CV] END max_depth=30, max_features=log2, min_samples_leaf=2, min_samples_split=5, n_estimators=200; total time=  15.5s
[CV] END max_depth=30, max_features=log2, min_samples_leaf=2, min_samples_split=5, n_estimators=200; total time=  10.5s
[CV] END max_depth=30, max_features=log2, min_samples_leaf=2, min_samples_split=5, n_estimators=200; total time=  17.8s
[CV] END max_depth=30, max_features=log2, min_samples_leaf=4, min_samples_split=10, n_estimators=300; total time=  14.5s
[CV] END max_depth=30, max_features=log2, min_samples_leaf=4, min_samples_split=10, n_estimators=300; total time=  16.1s
[CV] END max_depth=30, max_features=sqrt, min_samples_leaf=1, min_samples_split=2, n_estimators=300; total time=  19.7s
[CV] END max_depth=30, max_features=log2, min_samples_leaf=4, min_samples_split=10, n_estimators=300; total time=  15.8s
[CV] END max_depth=30, max_features=sqrt, min_samples_leaf=1, min_samples_split=2, n_estimators=300; total time=  19.9s
[CV] END max_depth=30, max_features=sqrt, min_samples_leaf=1, min_samples_split=2, n_estimators=300; total time=  20.4s
[CV] END max_depth=30, max_features=sqrt, min_samples_leaf=1, min_samples_split=2, n_estimators=300; total time=  20.6s
[CV] END max_depth=30, max_features=log2, min_samples_leaf=1, min_samples_split=2, n_estimators=200; total time=  14.4s
[CV] END max_depth=30, max_features=sqrt, min_samples_leaf=1, min_samples_split=2, n_estimators=300; total time=  29.3s
[CV] END max_depth=10, max_features=log2, min_samples_leaf=4, min_samples_split=5, n_estimators=200; total time=   8.4s
[CV] END max_depth=30, max_features=log2, min_samples_leaf=4, min_samples_split=10, n_estimators=300; total time=  25.0s
[CV] END max_depth=10, max_features=log2, min_samples_leaf=4, min_samples_split=5, n_estimators=200; total time=   8.2s
[CV] END max_depth=30, max_features=log2, min_samples_leaf=4, min_samples_split=10, n_estimators=300; total time=  25.5s
[CV] END max_depth=10, max_features=log2, min_samples_leaf=4, min_samples_split=5, n_estimators=200; total time=   8.5s
[CV] END max_depth=10, max_features=log2, min_samples_leaf=4, min_samples_split=5, n_estimators=200; total time=   8.1s
[CV] END max_depth=20, max_features=log2, min_samples_leaf=4, min_samples_split=2, n_estimators=50; total time=   2.4s
[CV] END max_depth=20, max_features=log2, min_samples_leaf=4, min_samples_split=2, n_estimators=50; total time=   2.3s
[CV] END max_depth=10, max_features=log2, min_samples_leaf=4, min_samples_split=5, n_estimators=200; total time=  10.6s
[CV] END max_depth=20, max_features=log2, min_samples_leaf=4, min_samples_split=2, n_estimators=50; total time=   2.6s
[CV] END max_depth=20, max_features=log2, min_samples_leaf=4, min_samples_split=2, n_estimators=50; total time=   2.5s
[CV] END max_depth=30, max_features=log2, min_samples_leaf=1, min_samples_split=2, n_estimators=200; total time=  14.5s
[CV] END max_depth=20, max_features=log2, min_samples_leaf=4, min_samples_split=2, n_estimators=50; total time=   2.9s
[CV] END max_depth=30, max_features=log2, min_samples_leaf=1, min_samples_split=2, n_estimators=200; total time=  14.6s
[CV] END max_depth=30, max_features=log2, min_samples_leaf=1, min_samples_split=2, n_estimators=200; total time=  22.0s
[CV] END max_depth=30, max_features=log2, min_samples_leaf=1, min_samples_split=2, n_estimators=200; total time=  13.7s
[CV] END max_depth=30, max_features=log2, min_samples_leaf=1, min_samples_split=2, n_estimators=100; total time=   5.3s
[CV] END max_depth=20, max_features=log2, min_samples_leaf=1, min_samples_split=10, n_estimators=100; total time=   4.3s
[CV] END max_depth=30, max_features=log2, min_samples_leaf=1, min_samples_split=2, n_estimators=100; total time=   6.4s
[CV] END max_depth=20, max_features=log2, min_samples_leaf=1, min_samples_split=10, n_estimators=100; total time=   5.0s
[CV] END max_depth=30, max_features=log2, min_samples_leaf=1, min_samples_split=2, n_estimators=100; total time=   5.5s
[CV] END max_depth=30, max_features=log2, min_samples_leaf=1, min_samples_split=2, n_estimators=100; total time=   5.4s
[CV] END max_depth=30, max_features=log2, min_samples_leaf=1, min_samples_split=2, n_estimators=100; total time=   6.3s
[CV] END max_depth=20, max_features=log2, min_samples_leaf=1, min_samples_split=10, n_estimators=100; total time=   5.4s
[CV] END max_depth=20, max_features=log2, min_samples_leaf=1, min_samples_split=10, n_estimators=100; total time=   5.4s
[CV] END max_depth=10, max_features=log2, min_samples_leaf=2, min_samples_split=2, n_estimators=50; total time=   2.0s
[CV] END max_depth=20, max_features=log2, min_samples_leaf=1, min_samples_split=10, n_estimators=100; total time=   8.8s
[CV] END max_depth=10, max_features=log2, min_samples_leaf=2, min_samples_split=2, n_estimators=50; total time=   2.1s
[CV] END max_depth=10, max_features=log2, min_samples_leaf=2, min_samples_split=2, n_estimators=50; total time=   2.2s
[CV] END max_depth=10, max_features=log2, min_samples_leaf=2, min_samples_split=2, n_estimators=50; total time=   3.0s
[CV] END max_depth=10, max_features=log2, min_samples_leaf=2, min_samples_split=2, n_estimators=50; total time=   2.2s
[CV] END max_depth=30, max_features=sqrt, min_samples_leaf=1, min_samples_split=10, n_estimators=400; total time=  22.3s
[CV] END max_depth=30, max_features=sqrt, min_samples_leaf=1, min_samples_split=10, n_estimators=400; total time=  22.5s
[CV] END max_depth=30, max_features=sqrt, min_samples_leaf=1, min_samples_split=10, n_estimators=400; total time=  22.1s
[CV] END max_depth=30, max_features=sqrt, min_samples_leaf=1, min_samples_split=10, n_estimators=400; total time=  22.1s
[CV] END max_depth=30, max_features=sqrt, min_samples_leaf=1, min_samples_split=10, n_estimators=400; total time=  26.4s
[CV] END max_depth=20, max_features=sqrt, min_samples_leaf=1, min_samples_split=10, n_estimators=400; total time=  21.1s
[CV] END max_depth=20, max_features=sqrt, min_samples_leaf=1, min_samples_split=10, n_estimators=400; total time=  21.0s
[CV] END max_depth=20, max_features=sqrt, min_samples_leaf=1, min_samples_split=10, n_estimators=400; total time=  24.3s
[CV] END max_depth=20, max_features=sqrt, min_samples_leaf=1, min_samples_split=10, n_estimators=400; total time=  19.8s
[CV] END max_depth=20, max_features=sqrt, min_samples_leaf=1, min_samples_split=10, n_estimators=400; total time=  23.2s

Code
random_search.best_params_
{'n_estimators': 200,
 'min_samples_split': 2,
 'min_samples_leaf': 1,
 'max_features': 'log2',
 'max_depth': 30}
Code
random_search.best_estimator_
RandomForestRegressor(max_depth=30, max_features='log2', n_estimators=200,
                      random_state=42)
Code
print("Optimized RMSE using Random Forest Regressor:", round(rmse_rf, 2))
Optimized RMSE using Random Forest Regressor: 9.49

Let’s try more complex models

Code
import xgboost as xgb
xgb_regressor = xgb.XGBRegressor(objective='reg:squarederror', random_state=42)  # Define base XGBoost model
param_dist = {  # Define hyperparameter grid
    'n_estimators': [100, 200, 300, 500],
    'learning_rate': [0.01, 0.05, 0.1, 0.2],
    'max_depth': [3, 5, 7, 10],
    'min_child_weight': [1, 3, 5, 7],
    'subsample': [0.6, 0.8, 1.0],
    'colsample_bytree': [0.6, 0.8, 1.0],
    'gamma': [0, 0.1, 0.2, 0.5]
}
# Perform Randomized Search for broad tuning
random_search = RandomizedSearchCV(
    estimator=xgb_regressor, param_distributions=param_dist,
    n_iter=30, cv=5, scoring='neg_root_mean_squared_error',
    verbose=2, n_jobs=-1, random_state=42
)
random_search.fit(X_train, y_train)  # Get best parameters from Randomized Search
best_xgb = random_search.best_estimator_
y_pred_xgb = best_xgb.predict(X_test)  # Make predictions
rmse_xgb = root_mean_squared_error(y_test, y_pred_xgb)  # Compute RMSE
Fitting 5 folds for each of 30 candidates, totalling 150 fits
[CV] END colsample_bytree=0.8, gamma=0.5, learning_rate=0.01, max_depth=3, min_child_weight=3, n_estimators=100, subsample=1.0; total time=   0.6s
[CV] END colsample_bytree=0.8, gamma=0.5, learning_rate=0.01, max_depth=3, min_child_weight=3, n_estimators=100, subsample=1.0; total time=   0.7s
[CV] END colsample_bytree=0.8, gamma=0.5, learning_rate=0.01, max_depth=3, min_child_weight=3, n_estimators=100, subsample=1.0; total time=   0.6s
[CV] END colsample_bytree=0.8, gamma=0.5, learning_rate=0.01, max_depth=3, min_child_weight=3, n_estimators=100, subsample=1.0; total time=   0.6s
[CV] END colsample_bytree=0.8, gamma=0.5, learning_rate=0.01, max_depth=3, min_child_weight=3, n_estimators=100, subsample=1.0; total time=   0.6s
[CV] END colsample_bytree=0.6, gamma=0.1, learning_rate=0.01, max_depth=5, min_child_weight=7, n_estimators=300, subsample=1.0; total time=   2.3s
[CV] END colsample_bytree=0.6, gamma=0.1, learning_rate=0.01, max_depth=5, min_child_weight=7, n_estimators=300, subsample=1.0; total time=   2.3s
[CV] END colsample_bytree=0.6, gamma=0.1, learning_rate=0.01, max_depth=5, min_child_weight=7, n_estimators=300, subsample=1.0; total time=   2.7s
[CV] END colsample_bytree=0.8, gamma=0.2, learning_rate=0.2, max_depth=3, min_child_weight=1, n_estimators=300, subsample=0.8; total time=   1.6s
[CV] END colsample_bytree=0.8, gamma=0.2, learning_rate=0.2, max_depth=3, min_child_weight=1, n_estimators=300, subsample=0.8; total time=   1.4s
[CV] END colsample_bytree=0.8, gamma=0.2, learning_rate=0.2, max_depth=3, min_child_weight=1, n_estimators=300, subsample=0.8; total time=   1.4s
[CV] END colsample_bytree=0.8, gamma=0.2, learning_rate=0.2, max_depth=3, min_child_weight=1, n_estimators=300, subsample=0.8; total time=   1.5s
[CV] END colsample_bytree=0.8, gamma=0.2, learning_rate=0.2, max_depth=3, min_child_weight=1, n_estimators=300, subsample=0.8; total time=   1.6s
[CV] END colsample_bytree=0.6, gamma=0.1, learning_rate=0.01, max_depth=5, min_child_weight=7, n_estimators=300, subsample=1.0; total time=   4.9s
[CV] END colsample_bytree=0.6, gamma=0.1, learning_rate=0.01, max_depth=5, min_child_weight=7, n_estimators=300, subsample=1.0; total time=   5.0s
[CV] END colsample_bytree=1.0, gamma=0, learning_rate=0.01, max_depth=7, min_child_weight=5, n_estimators=100, subsample=0.8; total time=   5.4s
[CV] END colsample_bytree=1.0, gamma=0, learning_rate=0.01, max_depth=7, min_child_weight=5, n_estimators=100, subsample=0.8; total time=   5.7s
[CV] END colsample_bytree=1.0, gamma=0, learning_rate=0.01, max_depth=7, min_child_weight=5, n_estimators=100, subsample=0.8; total time=   2.5s
[CV] END colsample_bytree=1.0, gamma=0, learning_rate=0.01, max_depth=7, min_child_weight=5, n_estimators=100, subsample=0.8; total time=   2.3s
[CV] END colsample_bytree=1.0, gamma=0, learning_rate=0.01, max_depth=7, min_child_weight=5, n_estimators=100, subsample=0.8; total time=   4.3s
[CV] END colsample_bytree=0.6, gamma=0, learning_rate=0.1, max_depth=5, min_child_weight=5, n_estimators=500, subsample=0.8; total time=   2.5s
[CV] END colsample_bytree=0.6, gamma=0, learning_rate=0.1, max_depth=5, min_child_weight=5, n_estimators=500, subsample=0.8; total time=   3.8s
[CV] END colsample_bytree=1.0, gamma=0.1, learning_rate=0.05, max_depth=10, min_child_weight=3, n_estimators=500, subsample=0.8; total time=  20.1s
[CV] END colsample_bytree=1.0, gamma=0.1, learning_rate=0.05, max_depth=10, min_child_weight=3, n_estimators=500, subsample=0.8; total time=  20.9s
[CV] END colsample_bytree=0.8, gamma=0.5, learning_rate=0.05, max_depth=10, min_child_weight=3, n_estimators=500, subsample=0.8; total time=  17.5s
[CV] END colsample_bytree=0.8, gamma=0.5, learning_rate=0.05, max_depth=10, min_child_weight=3, n_estimators=500, subsample=0.8; total time=  18.2s
[CV] END colsample_bytree=0.6, gamma=0, learning_rate=0.1, max_depth=5, min_child_weight=5, n_estimators=500, subsample=0.8; total time=   3.8s
[CV] END colsample_bytree=1.0, gamma=0.1, learning_rate=0.05, max_depth=10, min_child_weight=3, n_estimators=500, subsample=0.8; total time=  22.7s
[CV] END colsample_bytree=0.8, gamma=0.5, learning_rate=0.05, max_depth=10, min_child_weight=3, n_estimators=500, subsample=0.8; total time=  19.1s
[CV] END colsample_bytree=0.8, gamma=0.1, learning_rate=0.2, max_depth=3, min_child_weight=1, n_estimators=500, subsample=0.8; total time=   2.3s
[CV] END colsample_bytree=0.6, gamma=0, learning_rate=0.1, max_depth=5, min_child_weight=5, n_estimators=500, subsample=0.8; total time=   4.0s
[CV] END colsample_bytree=0.8, gamma=0.1, learning_rate=0.2, max_depth=3, min_child_weight=1, n_estimators=500, subsample=0.8; total time=   2.3s
[CV] END colsample_bytree=0.6, gamma=0, learning_rate=0.1, max_depth=5, min_child_weight=5, n_estimators=500, subsample=0.8; total time=   3.4s
[CV] END colsample_bytree=0.8, gamma=0.1, learning_rate=0.2, max_depth=3, min_child_weight=1, n_estimators=500, subsample=0.8; total time=   2.3s
[CV] END colsample_bytree=0.8, gamma=0.1, learning_rate=0.2, max_depth=3, min_child_weight=1, n_estimators=500, subsample=0.8; total time=   2.3s
[CV] END colsample_bytree=1.0, gamma=0.1, learning_rate=0.05, max_depth=10, min_child_weight=3, n_estimators=500, subsample=0.8; total time=  24.1s
[CV] END colsample_bytree=0.8, gamma=0.5, learning_rate=0.05, max_depth=3, min_child_weight=1, n_estimators=500, subsample=0.8; total time=   2.3s
[CV] END colsample_bytree=0.8, gamma=0.1, learning_rate=0.2, max_depth=3, min_child_weight=1, n_estimators=500, subsample=0.8; total time=   2.4s
[CV] END colsample_bytree=0.8, gamma=0.5, learning_rate=0.05, max_depth=10, min_child_weight=3, n_estimators=500, subsample=0.8; total time=  20.8s
[CV] END colsample_bytree=0.8, gamma=0.5, learning_rate=0.05, max_depth=3, min_child_weight=1, n_estimators=500, subsample=0.8; total time=   2.3s
[CV] END colsample_bytree=0.8, gamma=0.5, learning_rate=0.05, max_depth=3, min_child_weight=1, n_estimators=500, subsample=0.8; total time=   2.2s
[CV] END colsample_bytree=0.8, gamma=0.5, learning_rate=0.05, max_depth=3, min_child_weight=1, n_estimators=500, subsample=0.8; total time=   2.3s
[CV] END colsample_bytree=1.0, gamma=0.2, learning_rate=0.2, max_depth=5, min_child_weight=3, n_estimators=300, subsample=0.6; total time=   2.9s
[CV] END colsample_bytree=1.0, gamma=0.2, learning_rate=0.2, max_depth=5, min_child_weight=3, n_estimators=300, subsample=0.6; total time=   2.8s
[CV] END colsample_bytree=0.8, gamma=0.5, learning_rate=0.05, max_depth=3, min_child_weight=1, n_estimators=500, subsample=0.8; total time=   3.8s
[CV] END colsample_bytree=0.6, gamma=0.1, learning_rate=0.01, max_depth=3, min_child_weight=1, n_estimators=100, subsample=0.8; total time=   0.7s
[CV] END colsample_bytree=0.8, gamma=0.5, learning_rate=0.05, max_depth=10, min_child_weight=3, n_estimators=500, subsample=0.8; total time=  24.0s
[CV] END colsample_bytree=0.6, gamma=0.1, learning_rate=0.01, max_depth=3, min_child_weight=1, n_estimators=100, subsample=0.8; total time=   0.9s
[CV] END colsample_bytree=0.6, gamma=0.1, learning_rate=0.01, max_depth=3, min_child_weight=1, n_estimators=100, subsample=0.8; total time=   0.6s
[CV] END colsample_bytree=1.0, gamma=0.2, learning_rate=0.2, max_depth=5, min_child_weight=3, n_estimators=300, subsample=0.6; total time=   3.0s
[CV] END colsample_bytree=1.0, gamma=0.2, learning_rate=0.2, max_depth=5, min_child_weight=3, n_estimators=300, subsample=0.6; total time=   3.0s
[CV] END colsample_bytree=1.0, gamma=0.2, learning_rate=0.2, max_depth=5, min_child_weight=3, n_estimators=300, subsample=0.6; total time=   4.6s
[CV] END colsample_bytree=1.0, gamma=0.1, learning_rate=0.05, max_depth=10, min_child_weight=3, n_estimators=500, subsample=0.8; total time=  28.5s
[CV] END colsample_bytree=0.6, gamma=0.1, learning_rate=0.01, max_depth=3, min_child_weight=1, n_estimators=100, subsample=0.8; total time=   0.6s
[CV] END colsample_bytree=1.0, gamma=0.1, learning_rate=0.01, max_depth=3, min_child_weight=7, n_estimators=100, subsample=0.8; total time=   0.6s
[CV] END colsample_bytree=0.6, gamma=0.1, learning_rate=0.01, max_depth=3, min_child_weight=1, n_estimators=100, subsample=0.8; total time=   0.9s
[CV] END colsample_bytree=1.0, gamma=0.1, learning_rate=0.01, max_depth=3, min_child_weight=7, n_estimators=100, subsample=0.8; total time=   0.8s
[CV] END colsample_bytree=1.0, gamma=0.1, learning_rate=0.01, max_depth=3, min_child_weight=7, n_estimators=100, subsample=0.8; total time=   0.9s
[CV] END colsample_bytree=1.0, gamma=0.1, learning_rate=0.01, max_depth=3, min_child_weight=7, n_estimators=100, subsample=0.8; total time=   0.6s
[CV] END colsample_bytree=1.0, gamma=0.1, learning_rate=0.01, max_depth=3, min_child_weight=7, n_estimators=100, subsample=0.8; total time=   1.0s
[CV] END colsample_bytree=0.8, gamma=0.2, learning_rate=0.2, max_depth=7, min_child_weight=5, n_estimators=300, subsample=0.8; total time=   5.1s
[CV] END colsample_bytree=0.6, gamma=0.5, learning_rate=0.01, max_depth=7, min_child_weight=5, n_estimators=500, subsample=0.6; total time=   6.9s
[CV] END colsample_bytree=0.6, gamma=0.5, learning_rate=0.01, max_depth=7, min_child_weight=5, n_estimators=500, subsample=0.6; total time=   7.3s
[CV] END colsample_bytree=0.8, gamma=0.2, learning_rate=0.2, max_depth=7, min_child_weight=5, n_estimators=300, subsample=0.8; total time=   6.6s
[CV] END colsample_bytree=0.6, gamma=0.5, learning_rate=0.01, max_depth=7, min_child_weight=5, n_estimators=500, subsample=0.6; total time=   6.9s
[CV] END colsample_bytree=0.6, gamma=0.2, learning_rate=0.01, max_depth=10, min_child_weight=1, n_estimators=200, subsample=1.0; total time=  12.1s
[CV] END colsample_bytree=0.6, gamma=0.2, learning_rate=0.01, max_depth=10, min_child_weight=1, n_estimators=200, subsample=1.0; total time=  12.2s
[CV] END colsample_bytree=0.6, gamma=0.2, learning_rate=0.01, max_depth=10, min_child_weight=1, n_estimators=200, subsample=1.0; total time=  11.9s
[CV] END colsample_bytree=0.6, gamma=0.5, learning_rate=0.01, max_depth=7, min_child_weight=5, n_estimators=500, subsample=0.6; total time=   9.6s
[CV] END colsample_bytree=0.8, gamma=0.2, learning_rate=0.2, max_depth=7, min_child_weight=5, n_estimators=300, subsample=0.8; total time=   4.3s
[CV] END colsample_bytree=0.6, gamma=0.5, learning_rate=0.01, max_depth=7, min_child_weight=5, n_estimators=500, subsample=0.6; total time=   9.9s
[CV] END colsample_bytree=0.6, gamma=0.2, learning_rate=0.01, max_depth=10, min_child_weight=1, n_estimators=200, subsample=1.0; total time=  12.0s
[CV] END colsample_bytree=1.0, gamma=0, learning_rate=0.05, max_depth=5, min_child_weight=7, n_estimators=100, subsample=0.6; total time=   0.9s
[CV] END colsample_bytree=1.0, gamma=0, learning_rate=0.05, max_depth=5, min_child_weight=7, n_estimators=100, subsample=0.6; total time=   1.0s
[CV] END colsample_bytree=1.0, gamma=0, learning_rate=0.05, max_depth=5, min_child_weight=7, n_estimators=100, subsample=0.6; total time=   0.8s
[CV] END colsample_bytree=1.0, gamma=0, learning_rate=0.05, max_depth=5, min_child_weight=7, n_estimators=100, subsample=0.6; total time=   1.0s
[CV] END colsample_bytree=0.8, gamma=0.2, learning_rate=0.1, max_depth=5, min_child_weight=1, n_estimators=500, subsample=1.0; total time=   3.9s
[CV] END colsample_bytree=0.8, gamma=0.2, learning_rate=0.2, max_depth=7, min_child_weight=5, n_estimators=300, subsample=0.8; total time=   4.1s
[CV] END colsample_bytree=0.8, gamma=0.2, learning_rate=0.2, max_depth=7, min_child_weight=5, n_estimators=300, subsample=0.8; total time=   4.3s
[CV] END colsample_bytree=1.0, gamma=0, learning_rate=0.05, max_depth=5, min_child_weight=7, n_estimators=100, subsample=0.6; total time=   0.9s
[CV] END colsample_bytree=0.8, gamma=0.2, learning_rate=0.1, max_depth=5, min_child_weight=1, n_estimators=500, subsample=1.0; total time=   5.0s
[CV] END colsample_bytree=0.6, gamma=0.1, learning_rate=0.1, max_depth=3, min_child_weight=5, n_estimators=300, subsample=1.0; total time=   1.2s
[CV] END colsample_bytree=0.6, gamma=0.1, learning_rate=0.1, max_depth=3, min_child_weight=5, n_estimators=300, subsample=1.0; total time=   1.4s
[CV] END colsample_bytree=0.8, gamma=0.2, learning_rate=0.1, max_depth=5, min_child_weight=1, n_estimators=500, subsample=1.0; total time=   3.6s
[CV] END colsample_bytree=0.6, gamma=0.1, learning_rate=0.1, max_depth=3, min_child_weight=5, n_estimators=300, subsample=1.0; total time=   1.3s
[CV] END colsample_bytree=0.6, gamma=0.1, learning_rate=0.1, max_depth=3, min_child_weight=5, n_estimators=300, subsample=1.0; total time=   1.3s
[CV] END colsample_bytree=0.6, gamma=0.1, learning_rate=0.1, max_depth=3, min_child_weight=5, n_estimators=300, subsample=1.0; total time=   1.4s
[CV] END colsample_bytree=0.8, gamma=0.2, learning_rate=0.1, max_depth=5, min_child_weight=1, n_estimators=500, subsample=1.0; total time=   3.7s
[CV] END colsample_bytree=0.8, gamma=0.2, learning_rate=0.1, max_depth=5, min_child_weight=1, n_estimators=500, subsample=1.0; total time=   3.9s
[CV] END colsample_bytree=0.6, gamma=0.2, learning_rate=0.01, max_depth=10, min_child_weight=1, n_estimators=200, subsample=1.0; total time=  16.7s
[CV] END colsample_bytree=0.8, gamma=0, learning_rate=0.05, max_depth=7, min_child_weight=5, n_estimators=100, subsample=0.8; total time=   1.6s
[CV] END colsample_bytree=0.8, gamma=0, learning_rate=0.05, max_depth=7, min_child_weight=5, n_estimators=100, subsample=0.8; total time=   1.6s
[CV] END colsample_bytree=1.0, gamma=0, learning_rate=0.05, max_depth=5, min_child_weight=3, n_estimators=100, subsample=0.6; total time=   1.2s
[CV] END colsample_bytree=0.8, gamma=0, learning_rate=0.05, max_depth=7, min_child_weight=5, n_estimators=100, subsample=0.8; total time=   1.8s
[CV] END colsample_bytree=1.0, gamma=0, learning_rate=0.05, max_depth=5, min_child_weight=3, n_estimators=100, subsample=0.6; total time=   1.0s
[CV] END colsample_bytree=1.0, gamma=0, learning_rate=0.05, max_depth=5, min_child_weight=3, n_estimators=100, subsample=0.6; total time=   1.1s
[CV] END colsample_bytree=1.0, gamma=0, learning_rate=0.05, max_depth=5, min_child_weight=3, n_estimators=100, subsample=0.6; total time=   1.9s
[CV] END colsample_bytree=0.8, gamma=0, learning_rate=0.05, max_depth=7, min_child_weight=5, n_estimators=100, subsample=0.8; total time=   2.9s
[CV] END colsample_bytree=1.0, gamma=0, learning_rate=0.05, max_depth=5, min_child_weight=3, n_estimators=100, subsample=0.6; total time=   1.2s
[CV] END colsample_bytree=1.0, gamma=0.5, learning_rate=0.05, max_depth=3, min_child_weight=5, n_estimators=100, subsample=1.0; total time=   0.8s
[CV] END colsample_bytree=0.8, gamma=0.1, learning_rate=0.2, max_depth=7, min_child_weight=7, n_estimators=300, subsample=0.8; total time=   3.7s
[CV] END colsample_bytree=0.8, gamma=0, learning_rate=0.05, max_depth=7, min_child_weight=5, n_estimators=100, subsample=0.8; total time=   2.9s
[CV] END colsample_bytree=0.8, gamma=0.1, learning_rate=0.2, max_depth=7, min_child_weight=7, n_estimators=300, subsample=0.8; total time=   4.3s
[CV] END colsample_bytree=1.0, gamma=0.5, learning_rate=0.05, max_depth=3, min_child_weight=5, n_estimators=100, subsample=1.0; total time=   0.5s
[CV] END colsample_bytree=1.0, gamma=0.5, learning_rate=0.05, max_depth=3, min_child_weight=5, n_estimators=100, subsample=1.0; total time=   0.6s
[CV] END colsample_bytree=0.8, gamma=0.1, learning_rate=0.2, max_depth=7, min_child_weight=7, n_estimators=300, subsample=0.8; total time=   3.6s
[CV] END colsample_bytree=1.0, gamma=0.5, learning_rate=0.05, max_depth=3, min_child_weight=5, n_estimators=100, subsample=1.0; total time=   0.9s
[CV] END colsample_bytree=1.0, gamma=0.5, learning_rate=0.05, max_depth=3, min_child_weight=5, n_estimators=100, subsample=1.0; total time=   0.8s
[CV] END colsample_bytree=0.6, gamma=0.5, learning_rate=0.05, max_depth=5, min_child_weight=3, n_estimators=100, subsample=1.0; total time=   0.8s
[CV] END colsample_bytree=0.6, gamma=0.5, learning_rate=0.05, max_depth=5, min_child_weight=3, n_estimators=100, subsample=1.0; total time=   1.0s
[CV] END colsample_bytree=0.6, gamma=0.5, learning_rate=0.05, max_depth=5, min_child_weight=3, n_estimators=100, subsample=1.0; total time=   1.0s
[CV] END colsample_bytree=0.6, gamma=0.5, learning_rate=0.05, max_depth=5, min_child_weight=3, n_estimators=100, subsample=1.0; total time=   1.0s
[CV] END colsample_bytree=0.6, gamma=0.5, learning_rate=0.05, max_depth=5, min_child_weight=3, n_estimators=100, subsample=1.0; total time=   1.0s
[CV] END colsample_bytree=0.8, gamma=0.1, learning_rate=0.2, max_depth=7, min_child_weight=7, n_estimators=300, subsample=0.8; total time=   4.4s
[CV] END colsample_bytree=0.8, gamma=0.1, learning_rate=0.2, max_depth=7, min_child_weight=7, n_estimators=300, subsample=0.8; total time=   5.5s
[CV] END colsample_bytree=0.6, gamma=0.2, learning_rate=0.1, max_depth=7, min_child_weight=5, n_estimators=300, subsample=0.8; total time=   2.4s
[CV] END colsample_bytree=0.6, gamma=0.2, learning_rate=0.1, max_depth=7, min_child_weight=5, n_estimators=300, subsample=0.8; total time=   2.5s
[CV] END colsample_bytree=0.6, gamma=0.2, learning_rate=0.1, max_depth=7, min_child_weight=5, n_estimators=300, subsample=0.8; total time=   2.6s
[CV] END colsample_bytree=0.6, gamma=0.2, learning_rate=0.1, max_depth=7, min_child_weight=5, n_estimators=300, subsample=0.8; total time=   2.6s
[CV] END colsample_bytree=0.6, gamma=0.5, learning_rate=0.1, max_depth=5, min_child_weight=1, n_estimators=500, subsample=1.0; total time=   2.6s
[CV] END colsample_bytree=0.6, gamma=0.2, learning_rate=0.1, max_depth=7, min_child_weight=5, n_estimators=300, subsample=0.8; total time=   3.0s
[CV] END colsample_bytree=1.0, gamma=0.2, learning_rate=0.01, max_depth=10, min_child_weight=5, n_estimators=100, subsample=0.8; total time=   4.0s
[CV] END colsample_bytree=1.0, gamma=0.2, learning_rate=0.01, max_depth=10, min_child_weight=5, n_estimators=100, subsample=0.8; total time=   4.3s
[CV] END colsample_bytree=1.0, gamma=0.2, learning_rate=0.01, max_depth=10, min_child_weight=5, n_estimators=100, subsample=0.8; total time=   4.7s
[CV] END colsample_bytree=1.0, gamma=0.2, learning_rate=0.01, max_depth=10, min_child_weight=5, n_estimators=100, subsample=0.8; total time=   4.4s
[CV] END colsample_bytree=0.6, gamma=0.5, learning_rate=0.1, max_depth=5, min_child_weight=1, n_estimators=500, subsample=1.0; total time=   3.0s
[CV] END colsample_bytree=1.0, gamma=0.2, learning_rate=0.01, max_depth=10, min_child_weight=5, n_estimators=100, subsample=0.8; total time=   4.7s
[CV] END colsample_bytree=0.6, gamma=0.5, learning_rate=0.1, max_depth=5, min_child_weight=1, n_estimators=500, subsample=1.0; total time=   3.2s
[CV] END colsample_bytree=0.6, gamma=0.5, learning_rate=0.1, max_depth=5, min_child_weight=1, n_estimators=500, subsample=1.0; total time=   3.4s
[CV] END colsample_bytree=0.6, gamma=0.5, learning_rate=0.1, max_depth=5, min_child_weight=1, n_estimators=500, subsample=1.0; total time=   3.6s
[CV] END colsample_bytree=1.0, gamma=0.5, learning_rate=0.2, max_depth=7, min_child_weight=7, n_estimators=500, subsample=1.0; total time=   4.7s
[CV] END colsample_bytree=1.0, gamma=0.5, learning_rate=0.2, max_depth=7, min_child_weight=7, n_estimators=500, subsample=1.0; total time=   4.5s
[CV] END colsample_bytree=1.0, gamma=0.5, learning_rate=0.2, max_depth=7, min_child_weight=7, n_estimators=500, subsample=1.0; total time=   5.0s
[CV] END colsample_bytree=0.6, gamma=0.5, learning_rate=0.1, max_depth=3, min_child_weight=7, n_estimators=500, subsample=0.8; total time=   2.1s
[CV] END colsample_bytree=1.0, gamma=0.5, learning_rate=0.2, max_depth=7, min_child_weight=7, n_estimators=500, subsample=1.0; total time=   4.5s
[CV] END colsample_bytree=0.6, gamma=0.5, learning_rate=0.1, max_depth=3, min_child_weight=7, n_estimators=500, subsample=0.8; total time=   2.1s
[CV] END colsample_bytree=1.0, gamma=0.5, learning_rate=0.2, max_depth=7, min_child_weight=7, n_estimators=500, subsample=1.0; total time=   5.2s
[CV] END colsample_bytree=0.6, gamma=0.5, learning_rate=0.1, max_depth=3, min_child_weight=7, n_estimators=500, subsample=0.8; total time=   2.1s
[CV] END colsample_bytree=0.6, gamma=0.5, learning_rate=0.1, max_depth=3, min_child_weight=7, n_estimators=500, subsample=0.8; total time=   2.0s
[CV] END colsample_bytree=0.6, gamma=0.5, learning_rate=0.1, max_depth=3, min_child_weight=7, n_estimators=500, subsample=0.8; total time=   2.0s
[CV] END colsample_bytree=0.6, gamma=0.5, learning_rate=0.2, max_depth=7, min_child_weight=5, n_estimators=200, subsample=1.0; total time=   2.1s
[CV] END colsample_bytree=0.6, gamma=0.5, learning_rate=0.2, max_depth=7, min_child_weight=5, n_estimators=200, subsample=1.0; total time=   2.2s
[CV] END colsample_bytree=0.6, gamma=0.5, learning_rate=0.2, max_depth=7, min_child_weight=5, n_estimators=200, subsample=1.0; total time=   2.1s
[CV] END colsample_bytree=0.6, gamma=0.5, learning_rate=0.2, max_depth=7, min_child_weight=5, n_estimators=200, subsample=1.0; total time=   2.2s
[CV] END colsample_bytree=0.6, gamma=0.5, learning_rate=0.2, max_depth=7, min_child_weight=5, n_estimators=200, subsample=1.0; total time=   1.9s
[CV] END colsample_bytree=0.6, gamma=0, learning_rate=0.01, max_depth=10, min_child_weight=7, n_estimators=500, subsample=0.6; total time=   9.5s
[CV] END colsample_bytree=0.6, gamma=0, learning_rate=0.01, max_depth=10, min_child_weight=7, n_estimators=500, subsample=0.6; total time=  10.0s
[CV] END colsample_bytree=0.6, gamma=0, learning_rate=0.01, max_depth=10, min_child_weight=7, n_estimators=500, subsample=0.6; total time=   9.8s
[CV] END colsample_bytree=0.6, gamma=0, learning_rate=0.01, max_depth=10, min_child_weight=7, n_estimators=500, subsample=0.6; total time=  10.1s
[CV] END colsample_bytree=0.6, gamma=0, learning_rate=0.01, max_depth=10, min_child_weight=7, n_estimators=500, subsample=0.6; total time=   9.3s

Code
random_search.best_params_
{'subsample': 0.6,
 'n_estimators': 500,
 'min_child_weight': 5,
 'max_depth': 7,
 'learning_rate': 0.01,
 'gamma': 0.5,
 'colsample_bytree': 0.6}
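
The verbose log above reports fit times but not scores; the cross-validated score of the winning configuration, and of every sampled configuration, can be read from the fitted search object. A minimal sketch, assuming the random_search object (and whatever scoring it was configured with) from the previous cells:

Code
# Sketch: inspect the cross-validated results of the randomized search
print(random_search.best_score_)                      # CV score of the best configuration
cv_results = pd.DataFrame(random_search.cv_results_)  # one row per sampled configuration
cv_results[['params', 'mean_test_score', 'rank_test_score']].sort_values('rank_test_score').head()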
Code
random_search.best_estimator_
XGBRegressor(base_score=None, booster=None, callbacks=None,
             colsample_bylevel=None, colsample_bynode=None,
             colsample_bytree=0.6, device=None, early_stopping_rounds=None,
             enable_categorical=False, eval_metric=None, feature_types=None,
             gamma=0.5, grow_policy=None, importance_type=None,
             interaction_constraints=None, learning_rate=0.01, max_bin=None,
             max_cat_threshold=None, max_cat_to_onehot=None,
             max_delta_step=None, max_depth=7, max_leaves=None,
             min_child_weight=5, missing=nan, monotone_constraints=None,
             multi_strategy=None, n_estimators=500, n_jobs=None,
             num_parallel_tree=None, random_state=42, ...)
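The score reported below comes from evaluating this tuned model on the held-out split; a minimal sketch of how best_xgb and rmse_xgb (used in the following cells) can be obtained, assuming the X_train/X_test/y_train/y_test splits defined earlier:

Code
from sklearn.metrics import root_mean_squared_error  # likely already imported earlier in the notebook

# Sketch: with refit=True (the default), best_estimator_ is already retrained
# on the full training split by RandomizedSearchCV, so no extra fit is needed
best_xgb = random_search.best_estimator_
y_pred_xgb = best_xgb.predict(X_test)
rmse_xgb = root_mean_squared_error(y_test, y_pred_xgb)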
Code
print("Optimized RMSE using XGBoost:", round(rmse_xgb, 2))
Optimized RMSE using XGBoost: 9.48

Plotting feature importance

Code
feature_importance = best_xgb.feature_importances_  # per-feature importance scores
feature_importance_df = pd.DataFrame({
    'Feature': X_train.columns,
    'Importance': feature_importance
}).sort_values(by='Importance', ascending=False)

plt.figure(figsize=(8, 4))
plt.barh(feature_importance_df['Feature'], feature_importance_df['Importance'])
plt.xlabel('Importance')
plt.ylabel('Feature')
plt.title('Feature Importance from XGBoost')
plt.gca().invert_yaxis()  # most important feature on top
plt.tight_layout()
plt.show()
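
feature_importances_ exposes a single summary metric per feature; the underlying booster can report complementary views (split counts, average gain, average cover). A minimal sketch, assuming the fitted best_xgb from above:

Code
# Sketch: compare the importance views exposed by the underlying booster
booster = best_xgb.get_booster()
for imp_type in ('weight', 'gain', 'cover'):
    scores = booster.get_score(importance_type=imp_type)
    top5 = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:5]
    print(imp_type, top5)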

Let’s retry AutoML

Code
automl = AutoML()
automl_settings = {
    "time_budget": 30,  # Time budget in seconds
    "max_iter": 30,  # Max number of iterations
    "metric": 'rmse',  # Evaluation metric
    # "estimator_list": ["xgboost"],
    "task": 'regression',  # Task type
}
automl.fit(X_train, y_train, **automl_settings)
y_pred = automl.predict(X_test)
rmse = root_mean_squared_error(y_test, y_pred)
[flaml.automl.logger: 04-08 20:50:07] {1728} INFO - task = regression
[flaml.automl.logger: 04-08 20:50:07] {1739} INFO - Evaluation method: holdout
[flaml.automl.logger: 04-08 20:50:07] {1838} INFO - Minimizing error metric: rmse
[flaml.automl.logger: 04-08 20:50:07] {1955} INFO - List of ML learners in AutoML Run: ['lgbm', 'rf', 'xgboost', 'extra_tree', 'xgb_limitdepth', 'sgd']
[flaml.automl.logger: 04-08 20:50:07] {2258} INFO - iteration 0, current learner lgbm
[flaml.automl.logger: 04-08 20:50:07] {2393} INFO - Estimated sufficient time budget=561s. Estimated necessary time budget=4s.
[flaml.automl.logger: 04-08 20:50:07] {2442} INFO -  at 0.2s,   estimator lgbm's best error=13.7019,    best estimator lgbm's best error=13.7019
[flaml.automl.logger: 04-08 20:50:07] {2258} INFO - iteration 1, current learner lgbm
[flaml.automl.logger: 04-08 20:50:07] {2442} INFO -  at 0.3s,   estimator lgbm's best error=13.7019,    best estimator lgbm's best error=13.7019
[flaml.automl.logger: 04-08 20:50:07] {2258} INFO - iteration 2, current learner lgbm
[flaml.automl.logger: 04-08 20:50:07] {2442} INFO -  at 0.3s,   estimator lgbm's best error=11.1891,    best estimator lgbm's best error=11.1891
[flaml.automl.logger: 04-08 20:50:07] {2258} INFO - iteration 3, current learner sgd
[flaml.automl.logger: 04-08 20:50:07] {2442} INFO -  at 0.4s,   estimator sgd's best error=18.1963, best estimator lgbm's best error=11.1891
[flaml.automl.logger: 04-08 20:50:07] {2258} INFO - iteration 4, current learner lgbm
[flaml.automl.logger: 04-08 20:50:07] {2442} INFO -  at 0.4s,   estimator lgbm's best error=10.0279,    best estimator lgbm's best error=10.0279
[flaml.automl.logger: 04-08 20:50:07] {2258} INFO - iteration 5, current learner lgbm
[flaml.automl.logger: 04-08 20:50:07] {2442} INFO -  at 0.5s,   estimator lgbm's best error=10.0279,    best estimator lgbm's best error=10.0279
[flaml.automl.logger: 04-08 20:50:07] {2258} INFO - iteration 6, current learner lgbm
[flaml.automl.logger: 04-08 20:50:07] {2442} INFO -  at 0.5s,   estimator lgbm's best error=9.9213, best estimator lgbm's best error=9.9213
[flaml.automl.logger: 04-08 20:50:07] {2258} INFO - iteration 7, current learner lgbm
[flaml.automl.logger: 04-08 20:50:07] {2442} INFO -  at 0.6s,   estimator lgbm's best error=9.9213, best estimator lgbm's best error=9.9213
[flaml.automl.logger: 04-08 20:50:07] {2258} INFO - iteration 8, current learner lgbm
[flaml.automl.logger: 04-08 20:50:07] {2442} INFO -  at 0.6s,   estimator lgbm's best error=9.9213, best estimator lgbm's best error=9.9213
[flaml.automl.logger: 04-08 20:50:07] {2258} INFO - iteration 9, current learner sgd
[flaml.automl.logger: 04-08 20:50:07] {2442} INFO -  at 0.7s,   estimator sgd's best error=18.1868, best estimator lgbm's best error=9.9213
[flaml.automl.logger: 04-08 20:50:07] {2258} INFO - iteration 10, current learner xgboost
[flaml.automl.logger: 04-08 20:50:07] {2442} INFO -  at 0.8s,   estimator xgboost's best error=13.7583, best estimator lgbm's best error=9.9213
[flaml.automl.logger: 04-08 20:50:07] {2258} INFO - iteration 11, current learner lgbm
[flaml.automl.logger: 04-08 20:50:07] {2442} INFO -  at 0.9s,   estimator lgbm's best error=9.6392, best estimator lgbm's best error=9.6392
[flaml.automl.logger: 04-08 20:50:07] {2258} INFO - iteration 12, current learner extra_tree
[flaml.automl.logger: 04-08 20:50:07] {2442} INFO -  at 1.0s,   estimator extra_tree's best error=11.7083,  best estimator lgbm's best error=9.6392
[flaml.automl.logger: 04-08 20:50:07] {2258} INFO - iteration 13, current learner xgboost
[flaml.automl.logger: 04-08 20:50:08] {2442} INFO -  at 1.1s,   estimator xgboost's best error=13.7583, best estimator lgbm's best error=9.6392
[flaml.automl.logger: 04-08 20:50:08] {2258} INFO - iteration 14, current learner xgboost
[flaml.automl.logger: 04-08 20:50:08] {2442} INFO -  at 1.2s,   estimator xgboost's best error=11.2826, best estimator lgbm's best error=9.6392
[flaml.automl.logger: 04-08 20:50:08] {2258} INFO - iteration 15, current learner extra_tree
[flaml.automl.logger: 04-08 20:50:08] {2442} INFO -  at 1.3s,   estimator extra_tree's best error=10.0393,  best estimator lgbm's best error=9.6392
[flaml.automl.logger: 04-08 20:50:08] {2258} INFO - iteration 16, current learner lgbm
[flaml.automl.logger: 04-08 20:50:08] {2442} INFO -  at 1.4s,   estimator lgbm's best error=9.6392, best estimator lgbm's best error=9.6392
[flaml.automl.logger: 04-08 20:50:08] {2258} INFO - iteration 17, current learner extra_tree
[flaml.automl.logger: 04-08 20:50:08] {2442} INFO -  at 1.5s,   estimator extra_tree's best error=10.0393,  best estimator lgbm's best error=9.6392
[flaml.automl.logger: 04-08 20:50:08] {2258} INFO - iteration 18, current learner lgbm
[flaml.automl.logger: 04-08 20:50:08] {2442} INFO -  at 1.8s,   estimator lgbm's best error=9.6392, best estimator lgbm's best error=9.6392
[flaml.automl.logger: 04-08 20:50:08] {2258} INFO - iteration 19, current learner extra_tree
[flaml.automl.logger: 04-08 20:50:08] {2442} INFO -  at 1.9s,   estimator extra_tree's best error=9.8639,   best estimator lgbm's best error=9.6392
[flaml.automl.logger: 04-08 20:50:08] {2258} INFO - iteration 20, current learner rf
[flaml.automl.logger: 04-08 20:50:08] {2442} INFO -  at 2.1s,   estimator rf's best error=11.0011,  best estimator lgbm's best error=9.6392
[flaml.automl.logger: 04-08 20:50:08] {2258} INFO - iteration 21, current learner rf
[flaml.automl.logger: 04-08 20:50:09] {2442} INFO -  at 2.3s,   estimator rf's best error=9.9127,   best estimator lgbm's best error=9.6392
[flaml.automl.logger: 04-08 20:50:09] {2258} INFO - iteration 22, current learner xgboost
[flaml.automl.logger: 04-08 20:50:09] {2442} INFO -  at 2.4s,   estimator xgboost's best error=11.1308, best estimator lgbm's best error=9.6392
[flaml.automl.logger: 04-08 20:50:09] {2258} INFO - iteration 23, current learner lgbm
[flaml.automl.logger: 04-08 20:50:09] {2442} INFO -  at 2.5s,   estimator lgbm's best error=9.6392, best estimator lgbm's best error=9.6392
[flaml.automl.logger: 04-08 20:50:09] {2258} INFO - iteration 24, current learner lgbm
[flaml.automl.logger: 04-08 20:50:09] {2442} INFO -  at 2.6s,   estimator lgbm's best error=9.6392, best estimator lgbm's best error=9.6392
[flaml.automl.logger: 04-08 20:50:09] {2258} INFO - iteration 25, current learner xgboost
[flaml.automl.logger: 04-08 20:50:09] {2442} INFO -  at 2.7s,   estimator xgboost's best error=10.9476, best estimator lgbm's best error=9.6392
[flaml.automl.logger: 04-08 20:50:09] {2258} INFO - iteration 26, current learner extra_tree
[flaml.automl.logger: 04-08 20:50:09] {2442} INFO -  at 2.8s,   estimator extra_tree's best error=9.7118,   best estimator lgbm's best error=9.6392
[flaml.automl.logger: 04-08 20:50:09] {2258} INFO - iteration 27, current learner rf
[flaml.automl.logger: 04-08 20:50:09] {2442} INFO -  at 3.0s,   estimator rf's best error=9.9127,   best estimator lgbm's best error=9.6392
[flaml.automl.logger: 04-08 20:50:09] {2258} INFO - iteration 28, current learner extra_tree
[flaml.automl.logger: 04-08 20:50:10] {2442} INFO -  at 3.2s,   estimator extra_tree's best error=9.7118,   best estimator lgbm's best error=9.6392
[flaml.automl.logger: 04-08 20:50:10] {2258} INFO - iteration 29, current learner extra_tree
[flaml.automl.logger: 04-08 20:50:10] {2442} INFO -  at 3.3s,   estimator extra_tree's best error=9.7118,   best estimator lgbm's best error=9.6392
[flaml.automl.logger: 04-08 20:50:10] {2685} INFO - retrain lgbm for 0.1s
[flaml.automl.logger: 04-08 20:50:10] {2688} INFO - retrained model: LGBMRegressor(colsample_bytree=np.float64(0.7610534336273627),
              learning_rate=np.float64(0.41929025492645006), max_bin=255,
              min_child_samples=4, n_estimators=34, n_jobs=-1, num_leaves=4,
              reg_alpha=0.0009765625,
              reg_lambda=np.float64(0.009280655005879927), verbose=-1)
[flaml.automl.logger: 04-08 20:50:10] {1985} INFO - fit succeeded
[flaml.automl.logger: 04-08 20:50:10] {1986} INFO - Time taken to find the best model: 0.9153485298156738

Code
automl.best_estimator
'lgbm'
Code
automl.best_config
{'n_estimators': 34,
 'num_leaves': 4,
 'min_child_samples': 4,
 'learning_rate': np.float64(0.41929025492645006),
 'log_max_bin': 8,
 'colsample_bytree': np.float64(0.7610534336273627),
 'reg_alpha': 0.0009765625,
 'reg_lambda': np.float64(0.009280655005879927)}
Code
print("RMSE using AutoML:", round(rmse, 2))
RMSE using AutoML: 9.77
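
As with XGBoost, the model selected by FLAML can be extracted for inspection; a minimal sketch, assuming the automl object fitted above (automl.model.estimator exposes the underlying scikit-learn-style estimator, here a LGBMRegressor):

Code
# Sketch: pull out the underlying LightGBM model selected by FLAML
best_lgbm = automl.model.estimator
importances = pd.Series(best_lgbm.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False))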

Questions

  • Can this model generalize to other countries (e.g., Australia)?

References

Bie, Tijl De, Luc De Raedt, José Hernández-Orallo, Holger H. Hoos, Padhraic Smyth, and Christopher K. I. Williams. 2022. “Automating Data Science.” Commun. ACM 65 (3): 76–87.
Forresi, Chiara, Enrico Gallinucci, Matteo Golfarelli, Lara Maistrello, Michele Preti, and Giacomo Vaccari. 2024. “A Data Platform for Real-Time Monitoring and Analysis of the Brown Marmorated Stink Bug in Northern Italy.” Ecol. Informatics 82: 102713. https://doi.org/10.1016/J.ECOINF.2024.102713.
Waldo, Jim, and Soline Boussard. 2025. “GPTs and Hallucination.” Commun. ACM 68 (1): 40–45.
Welsh, Matt. 2023. “The End of Programming.” Commun. ACM 66 (1): 34–35.