Intuition
Optimization is the process of fine-tuning the hyperparameters in our experiment to optimize towards a particular objective. It can be a computationally involved process depending on the number of parameters, search space and model architectures. Hyperparameters don't just include the model's parameters but they also include parameters (choices) from preprocessing, splitting, etc. When we look at all the different parameters that can be tuned, it quickly becomes a very large search space. However, just because something is a hyperparameter doesn't mean we need to tune it.
It's absolutely alright to fix some hyperparameters (ex. lower=True during preprocessing) and remove them from the current tuning subset. Just be sure to note which parameters you are fixing and your reasoning for doing so.
You can start by tuning a small but influential subset of hyperparameters that you believe will yield the best results.
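To make the size of the search space concrete, here is a minimal sketch that counts the combinations in a small discrete grid and shows how fixing a couple of choices shrinks it. The candidate values below are hypothetical and chosen only for illustration; they are not the lesson's actual search space.

```python
import math

# Hypothetical candidate values for illustration only.
search_space = {
    "lower": [True, False],        # preprocessing choice
    "char_level": [True, False],   # tokenization choice
    "embedding_dim": [128, 256, 512],
    "num_filters": [128, 256, 512],
    "hidden_dim": [128, 256, 512],
    "dropout_p": [0.3, 0.5, 0.8],
    "lr": [5e-5, 1e-4, 5e-4],
}


def grid_size(space):
    """Number of distinct combinations in a discrete search space."""
    return math.prod(len(values) for values in space.values())


# Fix two choices (and note the reasoning somewhere) so they leave the tuning subset.
fixed = {"lower": True, "char_level": True}
tunable = {name: values for name, values in search_space.items() if name not in fixed}

full_size = grid_size(search_space)
tuned_size = grid_size(tunable)
print(f"all hyperparameters: {full_size} combinations")
print(f"tuning subset only:  {tuned_size} combinations")
```

Even with only two or three candidate values each, seven hyperparameters produce hundreds of combinations, which is why fixing the less influential ones first is usually worthwhile.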
We want to optimize our hyperparameters so that we can understand how each of them affects our objective. By running many trials across a reasonable search space, we can determine near-ideal values for our different parameters. It's also a great opportunity to determine whether smaller parameter values yield similar performance to larger ones (efficiency).
Tools
There are many options for hyperparameter tuning (Optuna, Ray Tune, Hyperopt, etc.). We'll be using Optuna for its simplicity, popularity and efficiency, though the alternatives are comparable on all three. The choice really comes down to familiarity and whether a library has a specific implementation readily tested and available.
Application
There are many factors to consider when performing hyperparameter optimization and luckily Optuna allows us to implement them with ease. We'll be conducting a small study where we'll tune a set of arguments (we'll do a much more thorough study of the parameter space when we move our code to Python scripts). Here's the process for the study:
1. Define an objective (metric) and identify the direction to optimize.
2. [OPTIONAL] Choose a sampler for determining parameters for subsequent trials (the default is a tree-based sampler).
3. [OPTIONAL] Choose a pruner to end unpromising trials early.
4. Define the parameters to tune in each trial and the distribution of values to sample.
There are many more options (multiple objectives, storage options, etc.) to explore but this basic set up will allow us to optimize quite well.
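The steps above can be sketched as a plain Python loop. This is not how Optuna is implemented: it swaps in a simple random sampler instead of the tree-based default, skips pruning entirely, and uses a made-up objective surface. The names `objective`, `sample_params` and `run_study` are hypothetical helpers for this sketch only.

```python
import random


def objective(params):
    """Toy metric to maximize; a made-up surface that peaks at lr=0.3, dropout_p=0.5."""
    return 1.0 - (params["lr"] - 0.3) ** 2 - (params["dropout_p"] - 0.5) ** 2


def sample_params(rng):
    """Sampler: draw each parameter from its distribution (uniform here)."""
    return {
        "lr": rng.uniform(0.0, 1.0),
        "dropout_p": rng.uniform(0.3, 0.8),
    }


def run_study(n_trials, seed=0):
    """Run n_trials trials and keep the best one (direction: maximize)."""
    rng = random.Random(seed)
    best = None
    for number in range(n_trials):
        params = sample_params(rng)
        value = objective(params)
        if best is None or value > best["value"]:
            best = {"number": number, "value": value, "params": params}
    return best


best = run_study(n_trials=50)
print(f"Best is trial {best['number']} with value {best['value']:.4f}")
print(f"Best hyperparameters: {best['params']}")
```

A library such as Optuna replaces the random sampler with one that learns from earlier trials and adds pruning, storage and callbacks around the same basic loop.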
We're going to modify our Trainer object to be able to prune unpromising trials based on the trial's validation loss.
    class Trainer(object):
        ...
        def train(self, ...):
            ...
            # Pruning based on the intermediate value
            self.trial.report(val_loss, epoch)
            if self.trial.should_prune():
                raise optuna.TrialPruned()
            ...

    # Trainer (modified for experiment tracking)
    class Trainer(object):
        def __init__(self, model, device, loss_fn=None,
                     optimizer=None, scheduler=None, trial=None):
            # Set params
            self.model = model
            self.device = device
            self.loss_fn = loss_fn
            self.optimizer = optimizer
            self.scheduler = scheduler
            self.trial = trial

        def train_step(self, dataloader):
            """Train step."""
            # Set model to train mode
            self.model.train()
            loss = 0.0

            # Iterate over train batches
            for i, batch in enumerate(dataloader):
                # Step
                batch = [item.to(self.device) for item in batch]
                inputs, targets = batch[:-1], batch[-1]
                self.optimizer.zero_grad()  # Reset gradients
                z = self.model(inputs)  # Forward pass
                J = self.loss_fn(z, targets)  # Define loss
                J.backward()  # Backward pass
                self.optimizer.step()  # Update weights

                # Cumulative Metrics
                loss += (J.detach().item() - loss) / (i + 1)

            return loss

        def eval_step(self, dataloader):
            """Validation or test step."""
            # Set model to eval mode
            self.model.eval()
            loss = 0.0
            y_trues, y_probs = [], []

            # Iterate over val batches
            with torch.inference_mode():
                for i, batch in enumerate(dataloader):
                    # Step
                    batch = [item.to(self.device) for item in batch]  # Set device
                    inputs, y_true = batch[:-1], batch[-1]
                    z = self.model(inputs)  # Forward pass
                    J = self.loss_fn(z, y_true).item()

                    # Cumulative Metrics
                    loss += (J - loss) / (i + 1)

                    # Store outputs
                    y_prob = torch.sigmoid(z).cpu().numpy()
                    y_probs.extend(y_prob)
                    y_trues.extend(y_true.cpu().numpy())

            return loss, np.vstack(y_trues), np.vstack(y_probs)

        def predict_step(self, dataloader):
            """Prediction step."""
            # Set model to eval mode
            self.model.eval()
            y_probs = []

            # Iterate over val batches
            with torch.inference_mode():
                for i, batch in enumerate(dataloader):
                    # Forward pass w/ inputs
                    inputs, targets = batch[:-1], batch[-1]
                    z = self.model(inputs)

                    # Store outputs
                    y_prob = torch.sigmoid(z).cpu().numpy()
                    y_probs.extend(y_prob)

            return np.vstack(y_probs)

        def train(self, num_epochs, patience, train_dataloader, val_dataloader):
            best_val_loss = np.inf
            for epoch in range(num_epochs):
                # Steps
                train_loss = self.train_step(dataloader=train_dataloader)
                val_loss, _, _ = self.eval_step(dataloader=val_dataloader)
                self.scheduler.step(val_loss)

                # Early stopping
                if val_loss < best_val_loss:
                    best_val_loss = val_loss
                    best_model = self.model
                    _patience = patience  # reset _patience
                else:
                    _patience -= 1
                if not _patience:  # 0
                    print("Stopping early!")
                    break

                # Logging
                print(
                    f"Epoch: {epoch+1} | "
                    f"train_loss: {train_loss:.5f}, "
                    f"val_loss: {val_loss:.5f}, "
                    f"lr: {self.optimizer.param_groups[0]['lr']:.2E}, "
                    f"_patience: {_patience}"
                )

                # Pruning based on the intermediate value
                self.trial.report(val_loss, epoch)
                if self.trial.should_prune():
                    raise optuna.TrialPruned()

            return best_model, best_val_loss
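The "Cumulative Metrics" update in `train_step` and `eval_step`, `loss += (J - loss) / (i + 1)`, keeps a running mean of the per-batch losses without storing them. A minimal sketch with made-up batch losses shows that it matches the ordinary average (the `running_mean` helper is a hypothetical name for this illustration):

```python
def running_mean(values):
    """Incrementally average values using the same update rule as the Trainer."""
    mean = 0.0
    for i, value in enumerate(values):
        mean += (value - mean) / (i + 1)
    return mean


# Made-up per-batch losses for illustration.
batch_losses = [0.9, 0.7, 0.4, 0.2]

incremental = running_mean(batch_losses)
direct = sum(batch_losses) / len(batch_losses)
print(f"incremental: {incremental:.6f}, direct: {direct:.6f}")
```

Note that this weights every batch equally, so a smaller final batch counts as much as a full one.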
We'll also modify our train_cnn function to include information about the trial.
    def train_cnn(params, df, trial=None):
        """Train a CNN using specific arguments."""
        # Set seeds
        set_seeds()

        # Get data splits
        preprocessed_df = df.copy()
        preprocessed_df.text = preprocessed_df.text.apply(preprocess, lower=True)
        X_train, X_val, X_test, y_train, y_val, y_test, label_encoder = get_data_splits(preprocessed_df)
        X_test_raw = X_test
        num_classes = len(label_encoder)

        # Set device
        cuda = True
        device = torch.device("cuda" if (torch.cuda.is_available() and cuda) else "cpu")
        torch.set_default_tensor_type("torch.FloatTensor")
        if device.type == "cuda":
            torch.set_default_tensor_type("torch.cuda.FloatTensor")

        # Tokenize
        tokenizer = Tokenizer(char_level=params.char_level)
        tokenizer.fit_on_texts(texts=X_train)
        vocab_size = len(tokenizer)

        # Convert texts to sequences of indices
        X_train = np.array(tokenizer.texts_to_sequences(X_train))
        X_val = np.array(tokenizer.texts_to_sequences(X_val))
        X_test = np.array(tokenizer.texts_to_sequences(X_test))

        # Class weights
        counts = np.bincount([label_encoder.class_to_index[class_] for class_ in all_tags])
        class_weights = {i: 1.0 / count for i, count in enumerate(counts)}

        # Create datasets
        train_dataset = CNNTextDataset(
            X=X_train, y=y_train, max_filter_size=max(params.filter_sizes))
        val_dataset = CNNTextDataset(
            X=X_val, y=y_val, max_filter_size=max(params.filter_sizes))
        test_dataset = CNNTextDataset(
            X=X_test, y=y_test, max_filter_size=max(params.filter_sizes))

        # Create dataloaders
        train_dataloader = train_dataset.create_dataloader(batch_size=params.batch_size)
        val_dataloader = val_dataset.create_dataloader(batch_size=params.batch_size)
        test_dataloader = test_dataset.create_dataloader(batch_size=params.batch_size)

        # Initialize model
        model = CNN(
            embedding_dim=params.embedding_dim, vocab_size=vocab_size,
            num_filters=params.num_filters, filter_sizes=params.filter_sizes,
            hidden_dim=params.hidden_dim, dropout_p=params.dropout_p,
            num_classes=num_classes)
        model = model.to(device)

        # Define loss
        class_weights_tensor = torch.Tensor(np.array(list(class_weights.values())))
        loss_fn = nn.BCEWithLogitsLoss(weight=class_weights_tensor)

        # Define optimizer & scheduler
        optimizer = torch.optim.Adam(model.parameters(), lr=params.lr)
        scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
            optimizer, mode="min", factor=0.1, patience=5)

        # Trainer module
        trainer = Trainer(
            model=model, device=device, loss_fn=loss_fn,
            optimizer=optimizer, scheduler=scheduler, trial=trial)

        # Train
        best_model, best_val_loss = trainer.train(
            params.num_epochs, params.patience, train_dataloader, val_dataloader)

        # Best threshold for f1
        train_loss, y_true, y_prob = trainer.eval_step(dataloader=train_dataloader)
        precisions, recalls, thresholds = precision_recall_curve(y_true.ravel(), y_prob.ravel())
        threshold = find_best_threshold(y_true.ravel(), y_prob.ravel())

        # Determine predictions using threshold
        test_loss, y_true, y_prob = trainer.eval_step(dataloader=test_dataloader)
        y_pred = np.array([np.where(prob >= threshold, 1, 0) for prob in y_prob])

        # Evaluate
        performance = get_metrics(
            y_true=y_test, y_pred=y_pred, classes=label_encoder.classes)

        return {
            "params": params,
            "tokenizer": tokenizer,
            "label_encoder": label_encoder,
            "model": best_model,
            "performance": performance,
            "best_val_loss": best_val_loss,
            "threshold": threshold,
        }
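The `find_best_threshold` helper used in `train_cnn` is not shown in this lesson. As a rough illustration of the idea, here is one possible way to pick a decision threshold that maximizes f1 by scanning a grid of candidates. The `f1_score` and `best_threshold` functions and the toy labels and probabilities below are assumptions made for this sketch, not the lesson's implementation.

```python
def f1_score(y_true, y_pred):
    """Binary f1 from flat lists of 0/1 labels and predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(y_true, y_prob, candidates):
    """Return the (threshold, f1) pair with the highest f1 among the candidates."""
    best_t, best_f1 = None, -1.0
    for t in candidates:
        y_pred = [1 if p >= t else 0 for p in y_prob]
        score = f1_score(y_true, y_pred)
        if score > best_f1:  # ties keep the lowest threshold
            best_t, best_f1 = t, score
    return best_t, best_f1


# Toy data for illustration only.
y_true = [0, 0, 1, 1, 0, 1]
y_prob = [0.12, 0.37, 0.42, 0.81, 0.32, 0.27]
candidates = [i / 20 for i in range(1, 20)]  # 0.05, 0.10, ..., 0.95

threshold, score = best_threshold(y_true, y_prob, candidates)
print(f"threshold: {threshold:.2f}, f1: {score:.3f}")
```

Whatever the actual helper does, the chosen threshold is a learned value: it is determined from data during training and has to be stored alongside the model for inference.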
Objective
We need to define an objective function that consumes a trial and a set of arguments and produces the metric to optimize (f1 in our case).
    def objective(trial, params):
        """Objective function for optimization trials."""
        # Parameters (to tune)
        # Note: Optuna 3.x deprecates suggest_uniform / suggest_loguniform
        # in favor of suggest_float(..., log=True for the log-uniform case).
        params.embedding_dim = trial.suggest_int("embedding_dim", 128, 512)
        params.num_filters = trial.suggest_int("num_filters", 128, 512)
        params.hidden_dim = trial.suggest_int("hidden_dim", 128, 512)
        params.dropout_p = trial.suggest_uniform("dropout_p", 0.3, 0.8)
        params.lr = trial.suggest_loguniform("lr", 5e-5, 5e-4)

        # Train & evaluate
        artifacts = train_cnn(params=params, df=df, trial=trial)

        # Set additional attributes
        trial.set_user_attr("precision", artifacts["performance"]["precision"])
        trial.set_user_attr("recall", artifacts["performance"]["recall"])
        trial.set_user_attr("f1", artifacts["performance"]["f1"])
        trial.set_user_attr("threshold", artifacts["threshold"])

        return artifacts["performance"]["f1"]
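The learning rate is sampled on a log scale while dropout is sampled uniformly. A minimal sketch of what log-uniform sampling means, using only the standard library: draw uniformly between the logs of the bounds, then exponentiate. The `sample_loguniform` helper is a hypothetical name for this illustration and is not Optuna's sampler.

```python
import math
import random


def sample_loguniform(rng, low, high):
    """Draw a value whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(0)
low, high = 5e-5, 5e-4  # same bounds as the lr range above
lrs = [sample_loguniform(rng, low, high) for _ in range(1000)]

# On a log scale the midpoint is the geometric mean, not the arithmetic mean.
geometric_mid = math.sqrt(low * high)
below = sum(lr < geometric_mid for lr in lrs)
print(f"geometric midpoint: {geometric_mid:.2e}")
print(f"samples below it: {below} of {len(lrs)}")
```

Roughly half of the samples fall below the geometric midpoint, so each order of magnitude in the range gets comparable coverage. This is usually what we want for scale-like parameters such as learning rates.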
Study
We're ready to kick off our study with our MLflowCallback so we can track all of the different trials.
    from optuna.integration.mlflow import MLflowCallback

    # Optimize
    NUM_TRIALS = 50  # small sample for now
    pruner = optuna.pruners.MedianPruner(n_startup_trials=5, n_warmup_steps=5)
    study = optuna.create_study(study_name="optimization", direction="maximize", pruner=pruner)
    mlflow_callback = MLflowCallback(
        tracking_uri=mlflow.get_tracking_uri(), metric_name="f1")
    study.optimize(lambda trial: objective(trial, params),
                   n_trials=NUM_TRIALS,
                   callbacks=[mlflow_callback])
A new study created in memory with name: optimization
Epoch: 1 | train_loss: 0.00645, val_loss: 0.00314, lr: 3.48E-04, _patience: 10
...
Epoch: 23 | train_loss: 0.00029, val_loss: 0.00175, lr: 3.48E-05, _patience: 1
Stopping early!
Trial 0 finished with value: 0.5999225606985846 and parameters: {'embedding_dim': 508, 'num_filters': 359, 'hidden_dim': 262, 'dropout_p': 0.6008497926241321, 'lr': 0.0003484755175747328}. Best is trial 0 with value: 0.5999225606985846.
INFO: 'optimization' does not exist. Creating a new experiment
...
Trial 10 pruned.
...
Epoch: 25 | train_loss: 0.00029, val_loss: 0.00156, lr: 2.73E-05, _patience: 2
Epoch: 26 | train_loss: 0.00028, val_loss: 0.00152, lr: 2.73E-05, _patience: 1
Stopping early!
Trial 49 finished with value: 0.6220047640997922 and parameters: {'embedding_dim': 485, 'num_filters': 420, 'hidden_dim': 477, 'dropout_p': 0.7984462152799114, 'lr': 0.0002619841505205434}. Best is trial 46 with value: 0.63900047716579.
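Some trials in the log above end with "pruned" instead of a final value. As a rough mental model, a median pruning rule stops a trial when its intermediate value at a given step is worse than the median of what earlier trials reported at that same step, after a number of startup trials and warmup steps. The sketch below is a simplified illustration under that assumption, not Optuna's implementation (which, for example, uses the trial's best intermediate value so far). The `should_prune` helper and the loss histories are hypothetical.

```python
from statistics import median


def should_prune(value, step, previous_trials, direction="minimize",
                 n_startup_trials=5, n_warmup_steps=5):
    """Simplified median rule over previous trials' values at the same step."""
    if len(previous_trials) < n_startup_trials:
        return False  # not enough history to compare against
    if step < n_warmup_steps:
        return False  # give every trial a few steps before judging it
    peers = [history[step] for history in previous_trials if step in history]
    if not peers:
        return False
    if direction == "minimize":
        return value > median(peers)  # higher is worse
    return value < median(peers)      # lower is worse


# Hypothetical intermediate values (step -> value) from five finished trials.
previous_trials = [
    {5: 0.0030, 6: 0.0028},
    {5: 0.0025, 6: 0.0024},
    {5: 0.0040, 6: 0.0039},
    {5: 0.0020, 6: 0.0019},
    {5: 0.0035, 6: 0.0033},
]

print(should_prune(0.0045, step=5, previous_trials=previous_trials))  # worse than median
print(should_prune(0.0022, step=5, previous_trials=previous_trials))  # better than median
print(should_prune(0.0045, step=2, previous_trials=previous_trials))  # still warming up
```

Pruners judge "worse" relative to the study's direction, so the value passed to `trial.report` should point the same way as the objective. This is worth double-checking whenever a loss is reported in a study that maximizes a score.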
You can compare all (or a subset) of the trials in our experiment.
We can then view the results through various lenses (contours, parallel coordinates, etc.).
    # All trials
    trials_df = study.trials_dataframe()
    trials_df = trials_df.sort_values(["value"], ascending=False)  # sort by metric
    trials_df.head()
| | number | value | datetime_start | datetime_complete | duration | params_dropout_p | params_embedding_dim | params_hidden_dim | params_lr | params_num_filters | user_attrs_f1 | user_attrs_precision | user_attrs_recall | user_attrs_threshold | state |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 46 | 46 | 0.639000 | 2021-01-26 21:29:09.435991 | 2021-01-26 21:30:20.637867 | 0 days 00:01:11.201876 | 0.670784 | 335 | 458 | 0.000298 | 477 | 0.639000 | 0.852947 | 0.540094 | 0.221352 | COMPLETE |
| 32 | 32 | 0.638382 | 2021-01-26 21:08:27.456865 | 2021-01-26 21:09:54.151386 | 0 days 00:01:26.694521 | 0.485060 | 322 | 329 | 0.000143 | 458 | 0.638382 | 0.860706 | 0.535624 | 0.285308 | COMPLETE |
| 33 | 33 | 0.638135 | 2021-01-26 21:09:54.182560 | 2021-01-26 21:11:14.038009 | 0 days 00:01:19.855449 | 0.567419 | 323 | 405 | 0.000163 | 482 | 0.638135 | 0.872309 | 0.537566 | 0.298093 | COMPLETE |
| 39 | 39 | 0.637652 | 2021-01-26 21:18:37.735567 | 2021-01-26 21:20:01.271413 | 0 days 00:01:23.535846 | 0.689044 | 391 | 401 | 0.000496 | 512 | 0.637652 | 0.852757 | 0.536279 | 0.258009 | COMPLETE |
| 34 | 34 | 0.634339 | 2021-01-26 21:11:14.068099 | 2021-01-26 21:12:33.645090 | 0 days 00:01:19.576991 | 0.592627 | 371 | 379 | 0.000213 | 486 | 0.634339 | 0.863092 | 0.531822 | 0.263524 | COMPLETE |
    # Best trial
    print(f"Best value (f1): {study.best_trial.value}")
    print(f"Best hyperparameters: {study.best_trial.params}")
Best value (f1): 0.6953118802537894
Best hyperparameters: {'embedding_dim': 234, 'num_filters': 383, 'hidden_dim': 265, 'dropout_p': 0.4174131267446717, 'lr': 0.0004392663090337615}
Don't forget to save learned parameters (ex. decision threshold) during training which you'll need later for inference.
    # Save best parameters
    params = {**params.__dict__, **study.best_trial.params}
    params["threshold"] = study.best_trial.user_attrs["threshold"]
    print(json.dumps(params, indent=2, cls=NumpyEncoder))
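The dictionary merge above lets the tuned values override the defaults, and the threshold is stored next to them so everything needed for inference lives in one place. Here is a minimal sketch of that merge plus a JSON round trip, assuming `params` behaves like an `argparse.Namespace`. The default values, tuned values and threshold below are made up for illustration.

```python
import json
from argparse import Namespace

# Hypothetical defaults and tuned values for illustration only.
params = Namespace(embedding_dim=128, lr=2e-4, batch_size=64)
best_trial_params = {"embedding_dim": 256, "lr": 3e-4}
best_trial_threshold = 0.25

# Later keys win, so tuned values override defaults; untouched defaults remain.
merged = {**vars(params), **best_trial_params}
merged["threshold"] = best_trial_threshold

# Serialize for storage, then restore as a Namespace for later use.
serialized = json.dumps(merged, indent=2)
restored = Namespace(**json.loads(serialized))

print(serialized)
print(restored.embedding_dim, restored.batch_size, restored.threshold)
```

The custom `NumpyEncoder` in the lesson's snippet is only needed when the dictionary holds NumPy types that the standard `json` encoder cannot serialize; plain Python numbers round-trip as shown.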
... and now we're finally ready to move from working in Jupyter notebooks to Python scripts. We'll be revisiting everything we did so far, but this time with proper software engineering principles such as object-oriented programming (OOP), styling, testing, etc. → https://madewithml.com/#mlops
You'll most likely be using the CLI application to optimize and train your models. If you don't have access to GPUs (personal machine, AWS, GCP, etc.), check out the optimize.ipynb notebook for how to train on Google Colab and transfer the entire MLflow experiment to your local machine. We essentially run optimization, then train the best model so that we can download and transfer its artifacts.
To cite this lesson, please use:
    @article{madewithml,
        author       = {Goku Mohandas},
        title        = { Optimization - Made With ML },
        howpublished = {\url{https://madewithml.com/}},
        year         = {2021}
    }