Supervised Learning¶
Training on interstate traffic data to make predictions about the future¶
In this example, we will use two datasets that contain data on interstate traffic volumes and features that may contribute to changes in traffic volume. Metro_Interstate_Traffic_Volume_Cleaned.csv
, was generated in our Cleaning Data example and is the cleaned data we will use to build our supervised learning models. Metro_Interstate_Traffic_Volume_Predict.csv
, contains fictional “forecast” data that we will use to simulate making traffic volume predictions using our supervised machine
learning model.
In this example we will learn about:
Getting Started¶
[2]:
import nimble
bucket = 'https://storage.googleapis.com/nimble/datasets/'
traffic = nimble.data(bucket + 'Metro_Interstate_Traffic_Volume_Cleaned.csv',
returnType="Matrix")
forecast = nimble.data(bucket + 'Metro_Interstate_Traffic_Volume_Predict.csv',
returnType="Matrix")
Test five different machine learning algorithms¶
We’ll divide our traffic
data into training and testing sets. The test set (used to measure the out-of-sample performance) will contain 25% of our data and the remaining 75% will be used to train each machine learning algorithm.
[3]:
testFraction = 0.25
yFeature = 'traffic_volume'
trainX, trainY, testX, testY = traffic.trainAndTestSets(testFraction, yFeature)
For this example, we will use algorithms from the Sci-kit Learn package so it must be installed in the current environment. To check if Nimble has access to Sci-kit Learn in your environment, you can use nimble.showAvailablePackages
. Additionally, we can see a list of all of the learners available to Nimble by using nimble.showLearnerNames
. Uncomment the lines below if you would like to see the available packages and learners in your environment.
[4]:
# nimble.showAvailablePackages()
# nimble.showLearnerNames()
Nimble’s training functions only need the package name and learner name to be identified. There is no need to recall, for example, that LinearRegression
is in sklearn.linear_model
or KNeighborsRegressor
is in sklearn.neighbors
, all Nimble requires are the strings ‘sklearn.LinearRegression’ and ‘sklearn.KNeighborsRegressor’, respectively. Using nimble.trainAndTest
, we will quickly test the performance of five different regression algorithms (initially, we’ll use default
arguments to keep things simple). We can then analyze the performance by comparing each learning algorithm’s root mean square error.
[5]:
learners = ['sklearn.LinearRegression', 'sklearn.Ridge', 'sklearn.Lasso',
'sklearn.KNeighborsRegressor', 'sklearn.HistGradientBoostingRegressor']
rootMeanSquareError = nimble.calculate.rootMeanSquareError
for learner in learners:
performance = nimble.trainAndTest(learner, rootMeanSquareError, trainX,
trainY, testX, testY)
print(learner, 'error:', performance)
1.5.0
sklearn.LinearRegression error: 1797.1961560950451
sklearn.Ridge error: 1797.1709026127874
sklearn.Lasso error: 1797.209927459736
sklearn.KNeighborsRegressor error: 810.308794917623
sklearn.HistGradientBoostingRegressor error: 325.1225415964457
'sklearn.KNeighborsRegressor'
and 'sklearn.HistGradientBoostingRegressor'
had better out-of-the-box performance with this data than the linear regression based learners, so let’s focus on optimizing those two.
Improve performance by tuning hyperparameters¶
The default arguments are unlikely to yield the best performance, so now we will adjust some parameter values for our two best learners. These adjustments can be made through arguments
as a python dict
or as keyword arguments. If we need more information about a learner’s parameters, we can use nimble.learnerParameters
and nimble.learnerParameterDefaults
. Let’s try it for KNeighborsRegressor.
[6]:
nimble.showLearnerParameters('sklearn.KNeighborsRegressor')
nimble.showLearnerParameterDefaults('sklearn.KNeighborsRegressor')
algorithm
leaf_size
metric
metric_params
n_jobs
n_neighbors
p
weights
algorithm 'auto'
leaf_size 30
metric 'minkowski'
metric_params None
n_jobs None
n_neighbors 5
p 2
weights 'uniform'
Furthermore, we can test multiple values for the same parameter by using the nimble.Tune
object. The presence of nimble.Tune
will trigger hyperparameter tuning. By default, this tunes the arguments consecutively (optimizing one argument at a time while holding the others constant) and uses 5-fold cross-validation. This can be modified by providing a Tuning
object to the tuning
parameter. The tuning will find the argument combination with the best average performanceFunction
result and return the TrainedLearner
using the best arguments.
For KNeighborsRegressor, we will use nimble.Tune
to try 3, 5, and 7 for the number of nearest neighbors and for HistGradientBoostingRegressor
we will try different learning rate values. The Tuning
object defines both a method for selecting each argument set and how each argument set will be validated. Below, we will use the default “consecutive” method but instead of the default “cross validation”, we will hold out a random 20% of our training data for validation. For details on all
tuning options, see the Tuning
documentation.
[7]:
tuning = nimble.Tuning(validation=0.2, performanceFunction=rootMeanSquareError)
# some interfaces have alias options for the package name
# below we use the alias 'skl' for the 'sklearn' package.
knnTrained = nimble.train('skl.KNeighborsRegressor', trainX, trainY,
arguments={'n_neighbors': nimble.Tune([3, 5, 7])},
tuning=tuning)
hgbrTrained = nimble.train('skl.HistGradientBoostingRegressor', trainX, trainY,
learning_rate=nimble.Tune([0.1, 0.5, 1]),
tuning=tuning)
The nimble.train
function returns a TrainedLearner
. With a TrainedLearner
we can apply
(make predictions on a test set), test
(measure the performance on a test set with known labels) and it provides many other additional methods and attributes. In this case, beacuse hyperparameter tuning occured, TrainedLearner.tuning
provides access to the tuning results. Let’s check the best score and argument combination for knnTrained
.
862.086327909546 {'n_neighbors': 3}
As such, the knnTrained
object we have access to was trained with n_neighbors=3
since it had the best performance. Similarly hgbrTrained
was trained with the best of our three possible learning rates. For hgbrTrained
we will try checking the allResults
and allArguments
properties, which are sorted from best to worst performance and show the results for each of the tested argument sets.
[9]:
306.96200904305067 {'learning_rate': 0.5}
337.45159775008915 {'learning_rate': 0.1}
358.7523695812055 {'learning_rate': 1}
knnTrained
found n_neighbors
of 3 to be the best choice, but even so the best performance was not that great. However, hgbrTrained
seems promising, with a learning_rate
of 0.5 outperforming the default of 0.1. As a final check, let’s see how it performs on our testing (out-of-sample) data.
[10]:
hgbPerf = hgbrTrained.test(rootMeanSquareError, testX, testY)
print('sklearn.HistGradientBoostingRegressor', 'learning_rate=0.5', 'error', hgbPerf)
sklearn.HistGradientBoostingRegressor learning_rate=0.5 error 295.7659151552714
Applying our learner¶
We see a further improvement in the performance as compared to our original nimble.trainAndTest
calls, so the HistGradientBoostingRegressor
with a learning rate of 0.5 is our best model. Now we will apply our hgbrTrained
trained learner to our forecast
dataset to predict traffic volumes for a future day.
[11]:
predictedTraffic = hgbrTrained.apply(forecast)
predictedTraffic.features.setNames('volume', oldIdentifiers=0)
Before printing, we will append the hour
feature from forecasts
to get a better visual of the traffic throughout the day.
[12]:
predictedTraffic.features.append(forecast.features['hour'])
predictedTraffic.show('Traffic Volume Predictions')
Traffic Volume Predictions
24pt x 2ft
volume hour
┌─────────────────
0 │ 724.973 0.000
1 │ 465.648 1.000
2 │ 368.940 2.000
3 │ 470.118 3.000
4 │ 745.702 4.000
5 │ 2435.529 5.000
6 │ 5115.131 6.000
7 │ 5704.463 7.000
8 │ 5283.069 8.000
9 │ 4602.258 9.000
10 │ 4350.797 10.000
11 │ 4469.605 11.000
12 │ 4862.187 12.000
13 │ 5100.790 13.000
14 │ 5436.456 14.000
15 │ 5809.010 15.000
16 │ 6379.484 16.000
17 │ 5880.763 17.000
18 │ 4558.882 18.000
19 │ 3362.865 19.000
20 │ 2932.029 20.000
21 │ 2371.768 21.000
22 │ 2325.762 22.000
23 │ 1508.499 23.000
Based on our forecasted data, our learner is predicting heavier traffic volumes between 6 am and 6 pm, trailing off into the evening. The peak congestion is expected around the 7 am hour for the morning commute and the 4 pm hour for the afternoon commute.
Reference:
Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
Link to original dataset: https://archive.ics.uci.edu/ml/datasets/Metro+Interstate+Traffic+Volume