Cleaning Data

Preparing interstate traffic data for machine learning

The goal here is to address several issues present in the Metro_Interstate_Traffic_Volume.csv dataset to make it suitable for supervised machine learning (i.e., predicting one of the columns in the data using other columns). Each data point in this dataset represents a moment in time where 9 features are recorded. The features are variables which may affect the traffic volume on an interstate highway as well as the current traffic volume. A new data point is recorded each time a change occurs in one of the weather forecast features.

In this example we will learn about:

Getting started

We first use nimble.fetchFiles to retrieve our dataset. This will return a list of paths to the files in the dataset, downloading them from the web if not already available locally. Nimble has a built in shorthand for datasets in the UCI repository that we use below.

[2]:
from datetime import datetime
import nimble

paths = nimble.fetchFiles('uci::Metro Interstate Traffic Volume')
print(paths)
['/home/runner/nimbleData/archive.ics.uci.edu/static/public/492/Metro_Interstate_Traffic_Volume.csv.gz']

The first argument for nimble.data (source) can be a regular python data object, a path to a file, an open file, or URL pointing to the location on a website where the data resides. Since our fetched dataset only has one file, we use the single path returned by nimble.fetchFile to load the data. For clarity of the outputs further on we’re cutting off a portion of the data; but all operations still work successfully on the whole dataset.

[3]:
traffic = nimble.data(paths[0], name='Metro Interstate Traffic Volume',
                      returnType="Matrix")
traffic = traffic[2557:, :]

The show method provides more flexibility for the printed output than using print or repr. It prints a description, the name and shape of the object and the object data (truncating if necessary) given the parameters. The points and features parameters control the number (or range) of rows and columns of data shown in the printed table. Also, the lineLimit and lineWidthLimit parameters can place caps on the how many lines are displayed and how many characters may be in a line. In the absence of any of those parameters, the displayed data is determined dynamically based on the terminal size. To do a simple preview our data, we will limit the output to only 10 of the points.

[4]:
traffic.show("Raw traffic data", points=10)
Raw traffic data
45647pt x 9ft
           holiday       temp   rain_1h  snow_1h  ──  weather_description       date_time       traffic_volume
      ┌───────────────────────────────────────────────────────────────────────────────────────────────────────
    0 │                264.580   0.000    0.000   ──                 mist  2012-12-31 22:00:00       2293
    1 │                264.720   0.000    0.000   ──           few clouds  2012-12-31 23:00:00       1596
    2 │ New Years Day  263.490   0.000    0.000   ──        broken clouds  2013-01-01 00:00:00       1439
    3 │                263.780   0.000    0.000   ──     scattered clouds  2013-01-01 01:00:00       1502
    4 │                264.160   0.000    0.000   ──           heavy snow  2013-01-01 02:00:00        933
    │ │       │           │        │        │     ──             │                  │                 │
45643 │                282.760   0.000    0.000   ──      overcast clouds  2018-09-30 20:00:00       2781
45644 │                282.730   0.000    0.000   ──  proximity thunders─  2018-09-30 21:00:00       2159
45645 │                282.090   0.000    0.000   ──      overcast clouds  2018-09-30 22:00:00       1450
45646 │                282.120   0.000    0.000   ──      overcast clouds  2018-09-30 23:00:00        954

The machine learning algorithms we plan to use require numeric data and can be sensitive to outliers. Our data contains 48,204 points and 9 features, but some points and features will require cleaning before these machine learning algorithms can be applied to the data. The default features.report contains 10 statistics, but we will limit it to the following four to help identify non-numeric data, missing data and outliers.

[5]:
stats = ['mode', 'minimum', 'maximum', 'count']
report = traffic.features.report(stats)
report.show("Feature statistics")
Feature statistics
9pt x 5ft
                      index          mode         minimum  maximum   count
                    ┌─────────────────────────────────────────────────────
            holiday │   0              Labor Day                        57
               temp │   1                274.150   0.000    310.070  45647
            rain_1h │   2                  0.000   0.000   9831.300  45647
            snow_1h │   3                  0.000   0.000      0.510  45647
         clouds_all │   4                     90   0.000    100.000  45647
       weather_main │   5                 Clouds                     45647
weather_description │   6           sky is clear                     45647
          date_time │   7    2013-04-18 22:00:00                     45647
     traffic_volume │   8                    353   0.000   7280.000  45647

Statistics could not be calculated for all features, indicating some are non-numeric. 0 Kelvin in temp and 9831.3 mm in rain_1h are also possible recording errors so we will also perform cleaning on some numeric features.

Cleaning numeric data

Let’s extract (i.e., separate from the rest of the data) any rows with the value 0 in temp or 9831.3 in rain_1h since they seem very unlikely to be accurate, then we can reevaluate the statistics without those values. Because ‘extract’ modifies the original base object, traffic will lose the points we find with ‘badPointIdentifier’.

[6]:
def badPointIdentifier(pt):
    return pt['temp'] == 0 or pt['rain_1h'] == 9831.3

extracted = traffic.points.extract(badPointIdentifier)

fixedReport = traffic.features.report(stats)
fixedReport[['temp', 'rain_1h'], :].show("Modified feature report")
print('Number of points with errors:', len(extracted.points))
Modified feature report
2pt x 5ft
          index    mode   minimum  maximum  count
        ┌────────────────────────────────────────
   temp │   1    274.150  243.390  310.070  45636
rain_1h │   2      0.000    0.000   55.630  45636

Number of points with errors: 11

After extracting those values, our features.report statistics look much more reasonable for those features. Since the values for those “bad” data points were implausible, we can assume that the 11 extracted points contain recording errors so we will ignore extracted and continue with the 48,193 points still remaining in traffic.

Cleaning non-numeric data

The values in the date_time feature are strings, so we will parse each string to generate five new numeric features (‘year’, ‘month’, ‘day’, ‘hour’, ‘weekday’) to replace this feature.

[7]:
def dateTimeSplitter(value):
    dt = datetime.strptime(value, "%Y-%m-%d %H:%M:%S")
    return [dt.year, dt.month, dt.day, dt.hour, dt.weekday()]

traffic.features.splitByParsing('date_time', dateTimeSplitter,
                                ['year', 'month', 'day', 'hour', 'weekday'])

Now let’s take a look at our data again after splitting a single feature of text into 5 numeric features.

[8]:
traffic.show('New parsed features in traffic data', points=10)
New parsed features in traffic data
45636pt x 13ft
           holiday       temp   rain_1h  snow_1h  clouds_all  ──  month  day  hour  weekday  traffic_volume
      ┌────────────────────────────────────────────────────────────────────────────────────────────────────
    0 │                264.580   0.000    0.000       40      ──    12    31   22      0          2293
    1 │                264.720   0.000    0.000       20      ──    12    31   23      0          1596
    2 │ New Years Day  263.490   0.000    0.000       58      ──     1     1    0      1          1439
    3 │                263.780   0.000    0.000       40      ──     1     1    1      1          1502
    4 │                264.160   0.000    0.000       75      ──     1     1    2      1           933
    │ │       │           │        │        │         │       ──     │    │    │       │           │
45632 │                282.760   0.000    0.000       90      ──     9    30   20      6          2781
45633 │                282.730   0.000    0.000       90      ──     9    30   21      6          2159
45634 │                282.090   0.000    0.000       90      ──     9    30   22      6          1450
45635 │                282.120   0.000    0.000       90      ──     9    30   23      6           954

The holiday feature is extremely idiosyncratic, so it will require a complex function to transform. Since this case this unique to this dataset, we won’t dig into the details but Nimble can definitely handle complex cases like this one. The purpose of the function is to create a binary feature that identifies the points in the data that occur on a holiday.

[9]:
holidayIndex = traffic.features.getIndex('holiday')
currentHoliday = {'date': None}
def holidayToBinary(point):
    newPt = list(point)
    dateTuple = (point['year'], point['month'], point['day'])
    if isinstance(point['holiday'], str):
        currentHoliday['date'] = dateTuple
    if currentHoliday['date'] == dateTuple:
        newPt[holidayIndex] = True
    else:
        newPt[holidayIndex] = False

    return newPt

dateInfoFeatures = ['holiday', 'year', 'month', 'day', 'hour']
traffic.points.transform(holidayToBinary)
sample = traffic[:, dateInfoFeatures]
sample.show('Data sample with converted holiday feature', points=10)
Data sample with converted holiday feature
45636pt x 5ft
        holiday  year  month  day  hour
      ┌────────────────────────────────
    0 │  False   2012    12    31   22
    1 │  False   2012    12    31   23
    2 │   True   2013     1     1    0
    3 │   True   2013     1     1    1
    4 │   True   2013     1     1    2
    │ │    │      │       │    │    │
45632 │  False   2018     9    30   20
45633 │  False   2018     9    30   21
45634 │  False   2018     9    30   22
45635 │  False   2018     9    30   23

We have two features related to categorizing the weather conditions. We saw in our first look at the data that the weather_description feature is more detailed than the weather_main feature. “Clouds” in the weather_main feature could be “scattered clouds”, “broken clouds” or “overcast clouds” in weather_description. Since these features are very similar, we will use only one of them. The weather_main feature provides a good general idea of the current weather so let’s delete weather_description from our data.

[10]:
traffic.features.delete('weather_description')

To make the string values in weather_main suitable for machine learning, we will represent each of the 11 unique values contained in this column as 11 new binary features. This technique is typically called one-hot encoding because for every point in the dataset only one of these new features will be equal to 1 (hot) while others will be 0 (cold).

[11]:
newCols = traffic.replaceFeatureWithBinaryFeatures('weather_main')
sampleFts = ['weather_main=Clouds', 'weather_main=Clear', 'weather_main=Mist']
traffic[:, sampleFts].show('Sample of binary weather features',
                                           points=10)
Sample of binary weather features
45636pt x 3ft
        weather_main=Clouds  weather_main=Clear  weather_main=Mist
      ┌───────────────────────────────────────────────────────────
    0 │        0.000               0.000               1.000
    1 │        1.000               0.000               0.000
    2 │        1.000               0.000               0.000
    3 │        1.000               0.000               0.000
    4 │        0.000               0.000               0.000
    │ │          │                   │                   │
45632 │        1.000               0.000               0.000
45633 │        0.000               0.000               0.000
45634 │        1.000               0.000               0.000
45635 │        1.000               0.000               0.000

Now that we have removed any bad points and transformed all of our data to numeric values, our dataset is ready for machine learning. We will be using this data to predict the traffic_volume feature from the other features.

[12]:
traffic.show('Cleaned traffic data', points=10)
Cleaned traffic data
45636pt x 22ft
        holiday    temp   rain_1h  snow_1h  clouds_all  weather_mai─  ──  month  day  hour  weekday  traffic_vol─
      ┌──────────────────────────────────────────────────────────────────────────────────────────────────────────
    0 │  False   264.580   0.000    0.000       40         1.000      ──    12    31   22      0         2293
    1 │  False   264.720   0.000    0.000       20         0.000      ──    12    31   23      0         1596
    2 │   True   263.490   0.000    0.000       58         0.000      ──     1     1    0      1         1439
    3 │   True   263.780   0.000    0.000       40         0.000      ──     1     1    1      1         1502
    4 │   True   264.160   0.000    0.000       75         0.000      ──     1     1    2      1          933
    │ │    │        │        │        │         │            │        ──     │    │    │       │          │
45632 │  False   282.760   0.000    0.000       90         0.000      ──     9    30   20      6         2781
45633 │  False   282.730   0.000    0.000       90         0.000      ──     9    30   21      6         2159
45634 │  False   282.090   0.000    0.000       90         0.000      ──     9    30   22      6         1450
45635 │  False   282.120   0.000    0.000       90         0.000      ──     9    30   23      6          954

Writing to a file

We’d like to be able to load the cleaned data for our Supervised Learning example any time we want, so we will write it to a new csv file.

[13]:
traffic.save('Metro_Interstate_Traffic_Volume_Cleaned.csv')

References:

Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.

Link: https://archive.ics.uci.edu/ml/datasets/Metro+Interstate+Traffic+Volume