Cleaning Data¶
Preparing interstate traffic data for machine learning¶
The goal here is to address several issues present in the Metro_Interstate_Traffic_Volume.csv
dataset to make it suitable for supervised machine learning (i.e., predicting one of the columns in the data using other columns). Each data point in this dataset represents a moment in time where 9 features are recorded. The features are variables which may affect the traffic volume on an interstate highway as well as the current traffic volume. A new data point is recorded each time a change
occurs in one of the weather forecast features.
In this example we will learn about fetching a dataset, examining it, and cleaning both numeric and non-numeric features so that it is ready for machine learning.
Getting started¶
We first use nimble.fetchFiles
to retrieve our dataset. This returns a list of paths to the files in the dataset, downloading them from the web if they are not already available locally. Nimble has a built-in shorthand for datasets in the UCI repository, which we use below.
[2]:
from datetime import datetime
import nimble
paths = nimble.fetchFiles('uci::Metro Interstate Traffic Volume')
print(paths)
['/home/runner/nimbleData/archive.ics.uci.edu/static/public/492/Metro_Interstate_Traffic_Volume.csv.gz']
The first argument for nimble.data
(source
) can be a regular Python data object, a path to a file, an open file, or a URL pointing to the location on a website where the data resides. Since our fetched dataset only has one file, we use the single path returned by nimble.fetchFiles
to load the data. For clarity of the outputs further on, we cut off a portion of the data; all operations still work successfully on the whole dataset.
[3]:
traffic = nimble.data(paths[0], name='Metro Interstate Traffic Volume',
                      returnType="Matrix")
traffic = traffic[2557:, :]
The show
method provides more flexibility for the printed output than print
or repr
. Given its parameters, it prints a description, the name
and shape
of the object, and the object data (truncated if necessary). The points and features parameters control the number (or range) of rows and columns of data shown in the printed table. The lineLimit and lineWidthLimit parameters can cap how many lines are displayed and how many characters may appear in a line.
In the absence of any of those parameters, the displayed data is determined dynamically based on the terminal size. To preview our data, we will limit the output to only 10 of the points.
[4]:
traffic.show("Raw traffic data", points=10)
Raw traffic data
45647pt x 9ft
holiday temp rain_1h snow_1h ── weather_description date_time traffic_volume
┌───────────────────────────────────────────────────────────────────────────────────────────────────────
0 │ 264.580 0.000 0.000 ── mist 2012-12-31 22:00:00 2293
1 │ 264.720 0.000 0.000 ── few clouds 2012-12-31 23:00:00 1596
2 │ New Years Day 263.490 0.000 0.000 ── broken clouds 2013-01-01 00:00:00 1439
3 │ 263.780 0.000 0.000 ── scattered clouds 2013-01-01 01:00:00 1502
4 │ 264.160 0.000 0.000 ── heavy snow 2013-01-01 02:00:00 933
│ │ │ │ │ │ ── │ │ │
45643 │ 282.760 0.000 0.000 ── overcast clouds 2018-09-30 20:00:00 2781
45644 │ 282.730 0.000 0.000 ── proximity thunders─ 2018-09-30 21:00:00 2159
45645 │ 282.090 0.000 0.000 ── overcast clouds 2018-09-30 22:00:00 1450
45646 │ 282.120 0.000 0.000 ── overcast clouds 2018-09-30 23:00:00 954
The machine learning algorithms we plan to use require numeric data and can be sensitive to outliers. Our data contains 45,647 points and 9 features, but some points and features will require cleaning before these machine learning algorithms can be applied. The default features.report
contains 10 statistics, but we will limit it to the following four to help identify non-numeric data, missing data, and outliers.
[5]:
stats = ['mode', 'minimum', 'maximum', 'count']
report = traffic.features.report(stats)
report.show("Feature statistics")
Feature statistics
9pt x 5ft
index mode minimum maximum count
┌─────────────────────────────────────────────────────
holiday │ 0 Labor Day 57
temp │ 1 274.150 0.000 310.070 45647
rain_1h │ 2 0.000 0.000 9831.300 45647
snow_1h │ 3 0.000 0.000 0.510 45647
clouds_all │ 4 90 0.000 100.000 45647
weather_main │ 5 Clouds 45647
weather_description │ 6 sky is clear 45647
date_time │ 7 2013-04-18 22:00:00 45647
traffic_volume │ 8 353 0.000 7280.000 45647
Statistics could not be calculated for all features, indicating that some are non-numeric. The 0 Kelvin minimum in temp
and the 9831.3 mm maximum in rain_1h
are also likely recording errors, so we will clean some numeric features as well.
Cleaning numeric data¶
Let’s extract (i.e., separate from the rest of the data) any rows with the value 0 in temp
or 9831.3 in rain_1h
since those values seem very unlikely to be accurate, then we can reevaluate the statistics without them. Because extract modifies the original base object, traffic will lose the points identified by badPointIdentifier.
[6]:
def badPointIdentifier(pt):
    return pt['temp'] == 0 or pt['rain_1h'] == 9831.3

extracted = traffic.points.extract(badPointIdentifier)
fixedReport = traffic.features.report(stats)
fixedReport[['temp', 'rain_1h'], :].show("Modified feature report")
print('Number of points with errors:', len(extracted.points))
Modified feature report
2pt x 5ft
index mode minimum maximum count
┌────────────────────────────────────────
temp │ 1 274.150 243.390 310.070 45636
rain_1h │ 2 0.000 0.000 55.630 45636
Number of points with errors: 11
After extracting those values, our features.report
statistics look much more reasonable for those features. Since the values in those “bad” data points were implausible, we can assume that the 11 extracted points contain recording errors, so we will ignore extracted
and continue with the 45,636 points still remaining in traffic
.
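The extraction step above can also be sketched with plain Python on toy rows (a hypothetical illustration of the predicate logic, not Nimble’s API): each point is tested against the predicate, and matching points are separated from the kept data.

```python
# Toy sketch of the extraction predicate on plain dicts (not Nimble objects)
points = [
    {'temp': 288.00, 'rain_1h': 0.00},     # plausible
    {'temp': 0.00,   'rain_1h': 0.00},     # 0 Kelvin: implausible
    {'temp': 290.10, 'rain_1h': 9831.30},  # 9831.3 mm of rain: implausible
]

def badPointIdentifier(pt):
    return pt['temp'] == 0 or pt['rain_1h'] == 9831.3

extracted = [p for p in points if badPointIdentifier(p)]
kept = [p for p in points if not badPointIdentifier(p)]
print(len(extracted), len(kept))  # 2 1
```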
Cleaning non-numeric data¶
The values in the date_time
feature are strings, so we will parse each string to generate five new numeric features (‘year’, ‘month’, ‘day’, ‘hour’, ‘weekday’) to replace this feature.
[7]:
def dateTimeSplitter(value):
    dt = datetime.strptime(value, "%Y-%m-%d %H:%M:%S")
    return [dt.year, dt.month, dt.day, dt.hour, dt.weekday()]

traffic.features.splitByParsing('date_time', dateTimeSplitter,
                                ['year', 'month', 'day', 'hour', 'weekday'])
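As a quick sketch of what dateTimeSplitter produces, here is the same parse applied to one timestamp string from the data, using only the standard library:

```python
from datetime import datetime

value = "2012-12-31 22:00:00"
dt = datetime.strptime(value, "%Y-%m-%d %H:%M:%S")
parts = [dt.year, dt.month, dt.day, dt.hour, dt.weekday()]
print(parts)  # [2012, 12, 31, 22, 0] -- weekday() counts Monday as 0
```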
Now let’s take a look at our data again after splitting a single feature of text into 5 numeric features.
[8]:
traffic.show('New parsed features in traffic data', points=10)
New parsed features in traffic data
45636pt x 13ft
holiday temp rain_1h snow_1h clouds_all ── month day hour weekday traffic_volume
┌────────────────────────────────────────────────────────────────────────────────────────────────────
0 │ 264.580 0.000 0.000 40 ── 12 31 22 0 2293
1 │ 264.720 0.000 0.000 20 ── 12 31 23 0 1596
2 │ New Years Day 263.490 0.000 0.000 58 ── 1 1 0 1 1439
3 │ 263.780 0.000 0.000 40 ── 1 1 1 1 1502
4 │ 264.160 0.000 0.000 75 ── 1 1 2 1 933
│ │ │ │ │ │ │ ── │ │ │ │ │
45632 │ 282.760 0.000 0.000 90 ── 9 30 20 6 2781
45633 │ 282.730 0.000 0.000 90 ── 9 30 21 6 2159
45634 │ 282.090 0.000 0.000 90 ── 9 30 22 6 1450
45635 │ 282.120 0.000 0.000 90 ── 9 30 23 6 954
The holiday
feature is extremely idiosyncratic, so it will require a complex function to transform. Since this case is unique to this dataset, we won’t dig into the details, but Nimble can definitely handle complex cases like this one. The purpose of the function is to create a binary feature that identifies the points in the data that occur on a holiday.
[9]:
holidayIndex = traffic.features.getIndex('holiday')
currentHoliday = {'date': None}

def holidayToBinary(point):
    newPt = list(point)
    dateTuple = (point['year'], point['month'], point['day'])
    if isinstance(point['holiday'], str):
        currentHoliday['date'] = dateTuple
    if currentHoliday['date'] == dateTuple:
        newPt[holidayIndex] = True
    else:
        newPt[holidayIndex] = False
    return newPt

dateInfoFeatures = ['holiday', 'year', 'month', 'day', 'hour']
traffic.points.transform(holidayToBinary)
sample = traffic[:, dateInfoFeatures]
sample.show('Data sample with converted holiday feature', points=10)
Data sample with converted holiday feature
45636pt x 5ft
holiday year month day hour
┌────────────────────────────────
0 │ False 2012 12 31 22
1 │ False 2012 12 31 23
2 │ True 2013 1 1 0
3 │ True 2013 1 1 1
4 │ True 2013 1 1 2
│ │ │ │ │ │ │
45632 │ False 2018 9 30 20
45633 │ False 2018 9 30 21
45634 │ False 2018 9 30 22
45635 │ False 2018 9 30 23
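The key idea in holidayToBinary is that the holiday name appears only on the first recording of each holiday date, so the closure remembers that date and flags every later point sharing it. A standard-library sketch of that propagation on hypothetical toy rows:

```python
# Each row carries the holiday name only at the date's first recording
rows = [
    {'holiday': None,            'date': (2012, 12, 31)},
    {'holiday': 'New Years Day', 'date': (2013, 1, 1)},
    {'holiday': None,            'date': (2013, 1, 1)},
    {'holiday': None,            'date': (2013, 1, 2)},
]
current = {'date': None}  # mutable closure state shared across rows
flags = []
for row in rows:
    if isinstance(row['holiday'], str):
        current['date'] = row['date']  # remember the holiday's date
    flags.append(current['date'] == row['date'])
print(flags)  # [False, True, True, False]
```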
We have two features related to categorizing the weather conditions. We saw in our first look at the data that the weather_description
feature is more detailed than the weather_main
feature. “Clouds” in the weather_main
feature could be “scattered clouds”, “broken clouds” or “overcast clouds” in weather_description
. Since these features are very similar, we will use only one of them. The weather_main
feature provides a good general idea of the current weather so let’s delete
weather_description
from our data.
[10]:
traffic.features.delete('weather_description')
To make the string values in weather_main
suitable for machine learning, we will represent each of the 11 unique values in this column as 11 new binary features. This technique is typically called one-hot encoding because, for every point in the dataset, only one of these new features will be equal to 1 (hot) while the others will be 0 (cold).
[11]:
newCols = traffic.replaceFeatureWithBinaryFeatures('weather_main')
sampleFts = ['weather_main=Clouds', 'weather_main=Clear', 'weather_main=Mist']
traffic[:, sampleFts].show('Sample of binary weather features',
                           points=10)
Sample of binary weather features
45636pt x 3ft
weather_main=Clouds weather_main=Clear weather_main=Mist
┌───────────────────────────────────────────────────────────
0 │ 0.000 0.000 1.000
1 │ 1.000 0.000 0.000
2 │ 1.000 0.000 0.000
3 │ 1.000 0.000 0.000
4 │ 0.000 0.000 0.000
│ │ │ │ │
45632 │ 1.000 0.000 0.000
45633 │ 0.000 0.000 0.000
45634 │ 1.000 0.000 0.000
45635 │ 1.000 0.000 0.000
Now that we have removed any bad points and transformed all of our data to numeric values, our dataset is ready for machine learning. We will be using this data to predict the traffic_volume
feature from the other features.
[12]:
traffic.show('Cleaned traffic data', points=10)
Cleaned traffic data
45636pt x 22ft
holiday temp rain_1h snow_1h clouds_all weather_mai─ ── month day hour weekday traffic_vol─
┌──────────────────────────────────────────────────────────────────────────────────────────────────────────
0 │ False 264.580 0.000 0.000 40 1.000 ── 12 31 22 0 2293
1 │ False 264.720 0.000 0.000 20 0.000 ── 12 31 23 0 1596
2 │ True 263.490 0.000 0.000 58 0.000 ── 1 1 0 1 1439
3 │ True 263.780 0.000 0.000 40 0.000 ── 1 1 1 1 1502
4 │ True 264.160 0.000 0.000 75 0.000 ── 1 1 2 1 933
│ │ │ │ │ │ │ │ ── │ │ │ │ │
45632 │ False 282.760 0.000 0.000 90 0.000 ── 9 30 20 6 2781
45633 │ False 282.730 0.000 0.000 90 0.000 ── 9 30 21 6 2159
45634 │ False 282.090 0.000 0.000 90 0.000 ── 9 30 22 6 1450
45635 │ False 282.120 0.000 0.000 90 0.000 ── 9 30 23 6 954
Writing to a file¶
We’d like to be able to load the cleaned data for our Supervised Learning example any time we want, so we will write it to a new CSV file.
[13]:
traffic.save('Metro_Interstate_Traffic_Volume_Cleaned.csv')
References:
Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
Link: https://archive.ics.uci.edu/ml/datasets/Metro+Interstate+Traffic+Volume