Merging and Tidying Data

Creating a tidy (cleaned up) data object from multiple data objects.

In this example, we have temperature data contained in 8 files. Each row in every file records 12 hours of temperatures for each day, but files vary by weather station, time period, and extreme (i.e., the file contains either minimum or maximum temperatures). Each file covers the same date range so we want to create a single object with all of our data. Once combined, we would like to restructure our data using Hadley Wickham’s Tidy Data principles. Wickham defines “data tidying” as “structuring datasets to facilitate analysis” and Nimble provides functions that reorganize points and features to create data that is “tidy”. Our goal is to create one tidy data object containing all of the data from our 8 original files.

In this example we will learn about:

Getting Started

[2]:
import nimble

bucket = 'https://storage.googleapis.com/nimble/datasets/tidy/'
downtownTempMinAM = nimble.data(bucket + 'downtown_am_min.csv', returnType="Matrix")
downtownTempMaxAM = nimble.data(bucket + 'downtown_am_max.csv', returnType="Matrix")
downtownTempMinPM = nimble.data(bucket + 'downtown_pm_min.csv', returnType="Matrix")
downtownTempMaxPM = nimble.data(bucket + 'downtown_pm_max.csv', returnType="Matrix")
airportTempMinAM = nimble.data(bucket + 'airport_am_min.csv', returnType="Matrix")
airportTempMaxAM = nimble.data(bucket + 'airport_am_max.csv', returnType="Matrix")
airportTempMinPM = nimble.data(bucket + 'airport_pm_min.csv', returnType="Matrix")
airportTempMaxPM = nimble.data(bucket + 'airport_pm_max.csv', returnType="Matrix")

To begin, we create 8 objects from 8 different files. The variable names and object names for each object represent the weather station location (downtown or airport), the temperature extreme recorded (Min or Max) and the time of day (AM or PM). All of our files have the same header row and cover the same date range. Let’s look at one of our objects to see these headers and understand the current format of our data.

[3]:
downtownTempMinAM.show('Example of data file structure', points=7)
Example of data file structure
712pt x 13ft
         date      hr0     hr1     hr2     hr3     hr4     hr5     hr6     hr7     hr8     hr9     hr10    hr11
    ┌──────────────────────────────────────────────────────────────────────────────────────────────────────────
  0 │ 2011-01-01   2.840   2.019           2.839   2.838   2.840   2.020   1.199   2.839   6.119   8.579  7.760
  1 │ 2011-01-03   2.020   1.200                  -0.440  -0.440  -1.260  -1.262  -1.260  -0.442   0.380  1.199
  2 │ 2011-01-04  -0.440  -0.441  -1.261          -1.262  -2.081  -2.083  -2.080  -1.260  -0.440  -0.442  2.019
  │ │     │         │       │       │        │      │       │       │       │       │       │       │       │
709 │ 2012-12-29   2.840   2.840   2.839   2.839   2.840   2.839   3.660   3.658   3.659   3.659   3.660  1.200
710 │ 2012-12-30   3.659   3.659   3.659   3.660   3.660   3.659   2.840   2.840   2.840   2.840   3.660  3.659
711 │ 2012-12-31   0.380   0.379  -0.440  -0.441  -1.261  -0.441  -0.440  -0.440  -1.260   0.379   1.199  2.020

Combining the data

First, we can reduce our number of objects to 4 by combining AM and PM temperatures of objects at the same weather station (downtown or airport) and with the same extreme (min or max). The feature names for AM and PM are currently the same, so we will need to modify the feature names in the PM objects so that they denote the hour according to a 24 hour clock.

[4]:
ftsPM = ['date', 'hr12', 'hr13', 'hr14', 'hr15', 'hr16', 'hr17',
         'hr18', 'hr19', 'hr20', 'hr21', 'hr22', 'hr23']

for obj in [downtownTempMinPM, downtownTempMaxPM, airportTempMinPM, airportTempMaxPM]:
    obj.features.setNames(ftsPM)

Now that we’ve differentiated our features, we can use a merge operation to combine the data. We want to join these objects on the ‘date’ feature (i.e., we are combining data with the same date) and use point='union' (that is, we want all the points from both files) so that we keep all possible dates, even if a date is missing for the AM or PM data.

[5]:
downtownTempMinAM.merge(downtownTempMinPM, onFeature='date', point='union')
downtownTempMaxAM.merge(downtownTempMaxPM, onFeature='date', point='union')
airportTempMinAM.merge(airportTempMinPM, onFeature='date', point='union')
airportTempMaxAM.merge(airportTempMaxPM, onFeature='date', point='union')

downtownTempMinAM.show('Downtown data merged on date', points=7)
Downtown data merged on date
717pt x 25ft
         date      hr0    hr1    hr2     hr3     hr4     hr5    ──   hr18    hr19    hr20   hr21   hr22    hr23
    ┌───────────────────────────────────────────────────────────────────────────────────────────────────────────
  0 │ 2011-01-01  2.840  2.019           2.839   2.838   2.840  ──  10.219  10.219  9.399  9.399   9.399  11.859
  1 │ 2011-01-02                                                ──   7.759   6.118  5.300  3.659   2.839   2.020
  2 │ 2011-01-03  2.020  1.200                  -0.440  -0.440  ──   2.840   1.199  1.200  0.379  -1.261   0.379
  │ │     │         │      │       │       │      │       │     ──     │       │      │      │      │        │
714 │ 2012-12-29  2.840  2.840   2.839   2.839   2.840   2.839  ──   5.297   5.299  5.299  4.479   3.658   3.659
715 │ 2012-12-30  3.659  3.659   3.659   3.660   3.660   3.659  ──   2.839   6.939  2.019  1.199   1.200   1.200
716 │ 2012-12-31  0.380  0.379  -0.440  -0.441  -1.261  -0.441  ──   3.660   3.659  3.659  3.660   3.660   3.659

Next, we can reduce our number of objects from 4 to 2 by combining the objects with different extremes (min vs. max) for the same location. Before combining, we will want to add an “extreme” feature to each object based on whether it contains min vs. max data. Without this step, we would not be able to differentiate between minimum and maximum temperature points in the combined objects. Once our new feature is added, we can points.append our objects from the same weather station.

[6]:
for obj in [downtownTempMinAM, downtownTempMaxAM, airportTempMinAM, airportTempMaxAM]:
    extreme = 'min' if 'min' in obj.path else 'max'
    ftData = [[extreme] for _ in range(len(obj.points))]
    newFt = nimble.data(ftData, featureNames=['extreme'])
    # New feature will be added at index position 1 (after "date" feature)
    obj.features.insert(1, newFt)

downtownTempMinAM.points.append(downtownTempMaxAM)
airportTempMinAM.points.append(airportTempMaxAM)

downtownTempMinAM.show('Downtown combined extreme data', points=7)
Downtown combined extreme data
1430pt x 26ft
          date     extreme   hr0    hr1    hr2     hr3     hr4    ──   hr18    hr19    hr20   hr21   hr22    hr23
     ┌────────────────────────────────────────────────────────────────────────────────────────────────────────────
   0 │ 2011-01-01    min    2.840  2.019           2.839   2.838  ──  10.219  10.219  9.399  9.399   9.399  11.859
   1 │ 2011-01-02    min                                          ──   7.759   6.118  5.300  3.659   2.839   2.020
   2 │ 2011-01-03    min    2.020  1.200                  -0.440  ──   2.840   1.199  1.200  0.379  -1.261   0.379
   │ │     │          │       │      │       │       │      │     ──     │       │      │      │      │        │
1427 │ 2012-12-29    max    2.840  2.840   2.841   2.841   2.840  ──   5.303   5.301  5.301  4.481   3.662   3.661
1428 │ 2012-12-30    max    3.661  3.661   3.661   3.660   3.660  ──   2.841   6.941  2.021  1.201   1.200   1.200
1429 │ 2012-12-31    max    0.380  0.381  -0.440  -0.439  -1.259  ──   3.660   3.661  3.661  3.660   3.660   3.661

Finally, we can combine our two objects into one by combining our two weather stations (downtown vs. airport). Just like in the last step, we need to create a new ‘station’ feature for each object based on which weather station location (downtown vs. airport) recorded the data.

[7]:
for obj in [downtownTempMinAM, airportTempMinAM]:
    station = 'downtown' if 'downtown' in obj.path else 'airport'
    stationData = [[station] for _ in range(len(obj.points))]
    newFt = nimble.data(stationData, featureNames=['station'])
    obj.features.insert(1, newFt)

downtownTempMinAM.points.append(airportTempMinAM)

Since all of these operations have been in-place, our downtownTempMinAM object now contains all of our data from the 8 files. This variable name could be confusing so, for clarity, let’s assign this object to a new variable name, tempData. Let’s also sort our data by date, so that we can double check that each date has a minimum and maximum temperature recording for each weather station. Taking a look at our data will also help us start exploring how we can begin to tidy it.

[8]:
tempData = downtownTempMinAM
tempData.name = 'combinedTemperatureData'
tempData.points.sort('date')

tempData.show('Fully merged (untidy) data', points=11)
Fully merged (untidy) data
"combinedTemperatureData" 2840pt x 27ft
          date     station   extreme   hr0    hr1    hr2    ──   hr18    hr19    hr20    hr21    hr22    hr23
     ┌────────────────────────────────────────────────────────────────────────────────────────────────────────
   0 │ 2011-01-01  downtown    min    2.840  2.019          ──  10.219  10.219   9.399   9.399   9.399  11.859
   1 │ 2011-01-01  downtown    max    2.840  2.021   2.022  ──  10.221  10.221   9.401   9.401   9.401  11.861
   2 │ 2011-01-01   airport    min    6.138  5.299   5.300  ──  13.579  13.579  12.789  12.789  12.769  15.259
   3 │ 2011-01-01   airport    max    6.142  5.301   5.300  ──  13.581  13.581  12.791  12.791  12.771  15.261
   4 │ 2011-01-02  downtown    min                          ──   7.759   6.118   5.300   3.659   2.839   2.020
   │ │     │          │         │       │      │       │    ──     │       │       │       │       │       │
2835 │ 2012-12-30   airport    max    3.700  3.691   3.721  ──   2.850   6.971   2.070   1.221   1.220   1.191
2836 │ 2012-12-31  downtown    min    0.380  0.379  -0.440  ──   3.660   3.659   3.659   3.660   3.660   3.659
2837 │ 2012-12-31  downtown    max    0.380  0.381  -0.440  ──   3.660   3.661   3.661   3.660   3.660   3.661
2838 │ 2012-12-31   airport    min    0.389  0.419  -0.401  ──   3.659   3.659   3.678   3.679   3.670   3.649
2839 │ 2012-12-31   airport    max    0.391  0.421  -0.399  ──   3.661   3.661   3.682   3.681   3.670   3.651

Tidying the data

Our dataset is combined but not in the format we want. To structure our data for analysis, we would like each point to be a single observation of the variables in our data. According to Hadley Wickham’s Tidy Data principles, our dataset is not tidy for two reasons. First, 24 observations are made every day (one each hour). Points should represent observations so each day should be represented by 24 points. Second, our minimum and maximum temperatures are variables for the same observation. Variables should be represented as features.

As an example, our current (truncated) data for 2001-01-01 at the downtown station can be seen below. We see it is structured using two points.

   date    station  extreme  hr0   hr1   hr2   --  hr22   hr23

2011-01-01 downtown   min   2.840 2.019        --  9.399  11.859
2011-01-01 downtown   max   2.840 2.021 2.022  --  9.401  11.861

To tidy that same data, we modify the structure to include one point for each hour and identify the minimum and maximum temperatures in the min and max features.

   date    station   min    max   hour

2011-01-01 downtown 2.840  2.840  hr0
2011-01-01 downtown 2.019  2.840  hr1
    |         |       |     |      |
2011-01-01 downtown 9.399  11.859 hr22
2011-01-01 downtown 3.649  11.861 hr23

Tidying our data will take two steps. First, we need each point to represent a single hour of time. So we will take our 24 hour features (hr0, hr1, …, hr23) and collapse them to represent this same data using 24 points (one point for each feature that is collapsed). The collapsed features become two new features: one named hour storing each feature’s name as a value and one named temp storing the temperature recorded during that hour.

[9]:
hourFts = ['hr' + str(i) for i in range(24)]
tempData.points.splitByCollapsingFeatures(featuresToCollapse=hourFts,
                                          featureForNames='hour',
                                          featureForValues='temp')
tempData.points.sort(['date', 'hour'])
tempData.show('Split points by collapsing the hour features', points=11)
Split points by collapsing the hour features
"combinedTemperatureData" 68160pt x 5ft
           date     station   extreme  hour   temp
      ┌────────────────────────────────────────────
    0 │ 2011-01-01  downtown    min    hr0    2.840
    1 │ 2011-01-01  downtown    max    hr0    2.840
    2 │ 2011-01-01   airport    min    hr0    6.138
    3 │ 2011-01-01   airport    max    hr0    6.142
    4 │ 2011-01-01  downtown    min    hr1    2.019
    │ │     │          │         │      │      │
68155 │ 2012-12-31   airport    max    hr8   -1.219
68156 │ 2012-12-31  downtown    min    hr9    0.379
68157 │ 2012-12-31  downtown    max    hr9    0.381
68158 │ 2012-12-31   airport    min    hr9    0.418
68159 │ 2012-12-31   airport    max    hr9    0.422

This is looking closer now that each point refers to a single hour of time. However, we still have separate points storing our maximum and minimum temperatures. This is not obvious in the output above, let’s make a couple of modifications to see this more clearly. First, we can clean our hour feature by transforming the former feature name strings into integers. Then, we will sort our data so that show will clearly display point pairs that need to be combined for our data to be tidy.

[10]:
tempData.features.transform(lambda ft: [int(hourStr[2:]) for hourStr in ft],
                            features=['hour'])
tempData.points.sort(['date', 'hour'])
tempData.show('Date and hour sorted', points=11)
Date and hour sorted
"combinedTemperatureData" 68160pt x 5ft
           date     station   extreme  hour   temp
      ┌───────────────────────────────────────────
    0 │ 2011-01-01  downtown    min      0   2.840
    1 │ 2011-01-01  downtown    max      0   2.840
    2 │ 2011-01-01   airport    min      0   6.138
    3 │ 2011-01-01   airport    max      0   6.142
    4 │ 2011-01-01  downtown    min      1   2.019
    │ │     │          │         │      │      │
68155 │ 2012-12-31   airport    max     22   3.670
68156 │ 2012-12-31  downtown    min     23   3.659
68157 │ 2012-12-31  downtown    max     23   3.661
68158 │ 2012-12-31   airport    min     23   3.649
68159 │ 2012-12-31   airport    max     23   3.651

We see above that hr0 on 2011-01-01 for the downtown station, for example, is still represented by two points. This is because each point identifies either the minimum or maximum temperature. Our second step is to combine these two point pairs by expanding the features to include features for the minimum and maximum temperatures. Our extreme feature contains the values (min and max) that will become our new feature names and the temp feature contains the values that fill the new min and max features.

[11]:
tempData.points.combineByExpandingFeatures(featureWithFeatureNames='extreme',
                                           featuresWithValues='temp')
tempData.show('Combined points by expanding extreme feature', points=11)
Combined points by expanding extreme feature
"combinedTemperatureData" 35064pt x 5ft
           date     station    min    max   hour
      ┌─────────────────────────────────────────
    0 │ 2011-01-01  downtown  2.840  2.840    0
    1 │ 2011-01-01   airport  6.138  6.142    0
    2 │ 2011-01-01  downtown  2.019  2.021    1
    3 │ 2011-01-01   airport  5.299  5.301    1
    4 │ 2011-01-01  downtown         2.022    2
    │ │     │          │        │      │     │
35059 │ 2012-12-31   airport  3.679  3.681   21
35060 │ 2012-12-31  downtown  3.660  3.660   22
35061 │ 2012-12-31   airport  3.670  3.670   22
35062 │ 2012-12-31  downtown  3.659  3.661   23
35063 │ 2012-12-31   airport  3.649  3.651   23

Our object is now organized how we wanted with a tidy structure. There is one more tidying function in Nimble as well. It is designed to separate a feature containing multiple pieces of information into multiple features. We can demonstrate its functionality by applying it to our ‘date’ feature to create year, month and day features.

[12]:
tempData.features.splitByParsing('date', lambda val: val.split('-'),
                                 ['year', 'month', 'day'])
tempData.show('Split features by parsing the date feature', points=11)
Split features by parsing the date feature
"combinedTemperatureData" 35064pt x 7ft
        year  month  day  station    min    max   hour
      ┌───────────────────────────────────────────────
    0 │ 2011    01    01  downtown  2.840  2.840    0
    1 │ 2011    01    01   airport  6.138  6.142    0
    2 │ 2011    01    01  downtown  2.019  2.021    1
    3 │ 2011    01    01   airport  5.299  5.301    1
    4 │ 2011    01    01  downtown         2.022    2
    │ │  │      │     │      │        │      │     │
35059 │ 2012    12    31   airport  3.679  3.681   21
35060 │ 2012    12    31  downtown  3.660  3.660   22
35061 │ 2012    12    31   airport  3.670  3.670   22
35062 │ 2012    12    31  downtown  3.659  3.661   23
35063 │ 2012    12    31   airport  3.649  3.651   23

Reference:

Wickham, H. (2014). Tidy Data. Journal of Statistical Software, 59(10), 1 - 23. doi:http://dx.doi.org/10.18637/jss.v059.i10