Merging and Tidying Data¶
Creating a tidy (cleaned up) data object from multiple data objects.¶
In this example, we have temperature data contained in 8 files. Each row in every file records 12 hours of temperatures for each day, but files vary by weather station, time period, and extreme (i.e., the file contains either minimum or maximum temperatures). Each file covers the same date range so we want to create a single object with all of our data. Once combined, we would like to restructure our data using Hadley Wickham’s Tidy Data principles. Wickham defines “data tidying” as “structuring datasets to facilitate analysis” and Nimble provides functions that reorganize points and features to create data that is “tidy”. Our goal is to create one tidy data object containing all of the data from our 8 original files.
In this example we will learn about:
Getting Started¶
[2]:
import nimble
bucket = 'https://storage.googleapis.com/nimble/datasets/tidy/'
downtownTempMinAM = nimble.data(bucket + 'downtown_am_min.csv', returnType="Matrix")
downtownTempMaxAM = nimble.data(bucket + 'downtown_am_max.csv', returnType="Matrix")
downtownTempMinPM = nimble.data(bucket + 'downtown_pm_min.csv', returnType="Matrix")
downtownTempMaxPM = nimble.data(bucket + 'downtown_pm_max.csv', returnType="Matrix")
airportTempMinAM = nimble.data(bucket + 'airport_am_min.csv', returnType="Matrix")
airportTempMaxAM = nimble.data(bucket + 'airport_am_max.csv', returnType="Matrix")
airportTempMinPM = nimble.data(bucket + 'airport_pm_min.csv', returnType="Matrix")
airportTempMaxPM = nimble.data(bucket + 'airport_pm_max.csv', returnType="Matrix")
To begin, we create 8 objects from 8 different files. The variable names and object names for each object represent the weather station location (downtown or airport), the temperature extreme recorded (Min or Max) and the time of day (AM or PM). All of our files have the same header row and cover the same date range. Let’s look at one of our objects to see these headers and understand the current format of our data.
[3]:
downtownTempMinAM.show('Example of data file structure', points=7)
Example of data file structure
712pt x 13ft
date hr0 hr1 hr2 hr3 hr4 hr5 hr6 hr7 hr8 hr9 hr10 hr11
┌──────────────────────────────────────────────────────────────────────────────────────────────────────────
0 │ 2011-01-01 2.840 2.019 2.839 2.838 2.840 2.020 1.199 2.839 6.119 8.579 7.760
1 │ 2011-01-03 2.020 1.200 -0.440 -0.440 -1.260 -1.262 -1.260 -0.442 0.380 1.199
2 │ 2011-01-04 -0.440 -0.441 -1.261 -1.262 -2.081 -2.083 -2.080 -1.260 -0.440 -0.442 2.019
│ │ │ │ │ │ │ │ │ │ │ │ │ │ │
709 │ 2012-12-29 2.840 2.840 2.839 2.839 2.840 2.839 3.660 3.658 3.659 3.659 3.660 1.200
710 │ 2012-12-30 3.659 3.659 3.659 3.660 3.660 3.659 2.840 2.840 2.840 2.840 3.660 3.659
711 │ 2012-12-31 0.380 0.379 -0.440 -0.441 -1.261 -0.441 -0.440 -0.440 -1.260 0.379 1.199 2.020
Combining the data¶
First, we can reduce our number of objects to 4 by combining AM and PM temperatures of objects at the same weather station (downtown or airport) and with the same extreme (min or max). The feature names for AM and PM are currently the same, so we will need to modify the feature names in the PM objects so that they denote the hour according to a 24 hour clock.
[4]:
ftsPM = ['date', 'hr12', 'hr13', 'hr14', 'hr15', 'hr16', 'hr17',
'hr18', 'hr19', 'hr20', 'hr21', 'hr22', 'hr23']
for obj in [downtownTempMinPM, downtownTempMaxPM, airportTempMinPM, airportTempMaxPM]:
obj.features.setNames(ftsPM)
Now that we’ve differentiated our features, we can use a merge
operation to combine the data. We want to join these objects on the ‘date’ feature (i.e., we are combining data with the same date) and use point='union'
(that is, we want all the points from both files) so that we keep all possible dates, even if a date is missing for the AM or PM data.
[5]:
downtownTempMinAM.merge(downtownTempMinPM, onFeature='date', point='union')
downtownTempMaxAM.merge(downtownTempMaxPM, onFeature='date', point='union')
airportTempMinAM.merge(airportTempMinPM, onFeature='date', point='union')
airportTempMaxAM.merge(airportTempMaxPM, onFeature='date', point='union')
downtownTempMinAM.show('Downtown data merged on date', points=7)
Downtown data merged on date
717pt x 25ft
date hr0 hr1 hr2 hr3 hr4 hr5 ── hr18 hr19 hr20 hr21 hr22 hr23
┌───────────────────────────────────────────────────────────────────────────────────────────────────────────
0 │ 2011-01-01 2.840 2.019 2.839 2.838 2.840 ── 10.219 10.219 9.399 9.399 9.399 11.859
1 │ 2011-01-02 ── 7.759 6.118 5.300 3.659 2.839 2.020
2 │ 2011-01-03 2.020 1.200 -0.440 -0.440 ── 2.840 1.199 1.200 0.379 -1.261 0.379
│ │ │ │ │ │ │ │ │ ── │ │ │ │ │ │
714 │ 2012-12-29 2.840 2.840 2.839 2.839 2.840 2.839 ── 5.297 5.299 5.299 4.479 3.658 3.659
715 │ 2012-12-30 3.659 3.659 3.659 3.660 3.660 3.659 ── 2.839 6.939 2.019 1.199 1.200 1.200
716 │ 2012-12-31 0.380 0.379 -0.440 -0.441 -1.261 -0.441 ── 3.660 3.659 3.659 3.660 3.660 3.659
Next, we can reduce our number of objects from 4 to 2 by combining the objects with different extremes (min vs. max) for the same location. Before combining, we will want to add an “extreme” feature to each object based on whether it contains min vs. max data. Without this step, we would not be able to differentiate between minimum and maximum temperature points in the combined objects. Once our new feature is added, we can points.append
our objects from the same weather station.
[6]:
for obj in [downtownTempMinAM, downtownTempMaxAM, airportTempMinAM, airportTempMaxAM]:
extreme = 'min' if 'min' in obj.path else 'max'
ftData = [[extreme] for _ in range(len(obj.points))]
newFt = nimble.data(ftData, featureNames=['extreme'])
# New feature will be added at index position 1 (after "date" feature)
obj.features.insert(1, newFt)
downtownTempMinAM.points.append(downtownTempMaxAM)
airportTempMinAM.points.append(airportTempMaxAM)
downtownTempMinAM.show('Downtown combined extreme data', points=7)
Downtown combined extreme data
1430pt x 26ft
date extreme hr0 hr1 hr2 hr3 hr4 ── hr18 hr19 hr20 hr21 hr22 hr23
┌────────────────────────────────────────────────────────────────────────────────────────────────────────────
0 │ 2011-01-01 min 2.840 2.019 2.839 2.838 ── 10.219 10.219 9.399 9.399 9.399 11.859
1 │ 2011-01-02 min ── 7.759 6.118 5.300 3.659 2.839 2.020
2 │ 2011-01-03 min 2.020 1.200 -0.440 ── 2.840 1.199 1.200 0.379 -1.261 0.379
│ │ │ │ │ │ │ │ │ ── │ │ │ │ │ │
1427 │ 2012-12-29 max 2.840 2.840 2.841 2.841 2.840 ── 5.303 5.301 5.301 4.481 3.662 3.661
1428 │ 2012-12-30 max 3.661 3.661 3.661 3.660 3.660 ── 2.841 6.941 2.021 1.201 1.200 1.200
1429 │ 2012-12-31 max 0.380 0.381 -0.440 -0.439 -1.259 ── 3.660 3.661 3.661 3.660 3.660 3.661
Finally, we can combine our two objects into one by combining our two weather stations (downtown vs. airport). Just like in the last step, we need to create a new ‘station’ feature for each object based on which weather station location (downtown vs. airport) recorded the data.
[7]:
for obj in [downtownTempMinAM, airportTempMinAM]:
station = 'downtown' if 'downtown' in obj.path else 'airport'
stationData = [[station] for _ in range(len(obj.points))]
newFt = nimble.data(stationData, featureNames=['station'])
obj.features.insert(1, newFt)
downtownTempMinAM.points.append(airportTempMinAM)
Since all of these operations have been in-place, our downtownTempMinAM
object now contains all of our data from the 8 files. This variable name could be confusing so, for clarity, let’s assign this object to a new variable name, tempData
. Let’s also sort our data by date, so that we can double check that each date has a minimum and maximum temperature recording for each weather station. Taking a look at our data will also help us start exploring how we can begin to tidy it.
[8]:
tempData = downtownTempMinAM
tempData.name = 'combinedTemperatureData'
tempData.points.sort('date')
tempData.show('Fully merged (untidy) data', points=11)
Fully merged (untidy) data
"combinedTemperatureData" 2840pt x 27ft
date station extreme hr0 hr1 hr2 ── hr18 hr19 hr20 hr21 hr22 hr23
┌────────────────────────────────────────────────────────────────────────────────────────────────────────
0 │ 2011-01-01 downtown min 2.840 2.019 ── 10.219 10.219 9.399 9.399 9.399 11.859
1 │ 2011-01-01 downtown max 2.840 2.021 2.022 ── 10.221 10.221 9.401 9.401 9.401 11.861
2 │ 2011-01-01 airport min 6.138 5.299 5.300 ── 13.579 13.579 12.789 12.789 12.769 15.259
3 │ 2011-01-01 airport max 6.142 5.301 5.300 ── 13.581 13.581 12.791 12.791 12.771 15.261
4 │ 2011-01-02 downtown min ── 7.759 6.118 5.300 3.659 2.839 2.020
│ │ │ │ │ │ │ │ ── │ │ │ │ │ │
2835 │ 2012-12-30 airport max 3.700 3.691 3.721 ── 2.850 6.971 2.070 1.221 1.220 1.191
2836 │ 2012-12-31 downtown min 0.380 0.379 -0.440 ── 3.660 3.659 3.659 3.660 3.660 3.659
2837 │ 2012-12-31 downtown max 0.380 0.381 -0.440 ── 3.660 3.661 3.661 3.660 3.660 3.661
2838 │ 2012-12-31 airport min 0.389 0.419 -0.401 ── 3.659 3.659 3.678 3.679 3.670 3.649
2839 │ 2012-12-31 airport max 0.391 0.421 -0.399 ── 3.661 3.661 3.682 3.681 3.670 3.651
Tidying the data¶
Our dataset is combined but not in the format we want. To structure our data for analysis, we would like each point to be a single observation of the variables in our data. According to Hadley Wickham’s Tidy Data principles, our dataset is not tidy for two reasons. First, 24 observations are made every day (one each hour). Points should represent observations so each day should be represented by 24 points. Second, our minimum and maximum temperatures are variables for the same observation. Variables should be represented as features.
As an example, our current (truncated) data for 2001-01-01
at the downtown
station can be seen below. We see it is structured using two points.
date station extreme hr0 hr1 hr2 -- hr22 hr23
2011-01-01 downtown min 2.840 2.019 -- 9.399 11.859
2011-01-01 downtown max 2.840 2.021 2.022 -- 9.401 11.861
To tidy that same data, we modify the structure to include one point for each hour and identify the minimum and maximum temperatures in the min
and max
features.
date station min max hour
2011-01-01 downtown 2.840 2.840 hr0
2011-01-01 downtown 2.019 2.840 hr1
| | | | |
2011-01-01 downtown 9.399 11.859 hr22
2011-01-01 downtown 3.649 11.861 hr23
Tidying our data will take two steps. First, we need each point to represent a single hour of time. So we will take our 24 hour features (hr0, hr1, …, hr23) and collapse them to represent this same data using 24 points (one point for each feature that is collapsed). The collapsed features become two new features: one named hour
storing each feature’s name as a value and one named temp
storing the temperature recorded during that hour.
[9]:
hourFts = ['hr' + str(i) for i in range(24)]
tempData.points.splitByCollapsingFeatures(featuresToCollapse=hourFts,
featureForNames='hour',
featureForValues='temp')
tempData.points.sort(['date', 'hour'])
tempData.show('Split points by collapsing the hour features', points=11)
Split points by collapsing the hour features
"combinedTemperatureData" 68160pt x 5ft
date station extreme hour temp
┌────────────────────────────────────────────
0 │ 2011-01-01 downtown min hr0 2.840
1 │ 2011-01-01 downtown max hr0 2.840
2 │ 2011-01-01 airport min hr0 6.138
3 │ 2011-01-01 airport max hr0 6.142
4 │ 2011-01-01 downtown min hr1 2.019
│ │ │ │ │ │ │
68155 │ 2012-12-31 airport max hr8 -1.219
68156 │ 2012-12-31 downtown min hr9 0.379
68157 │ 2012-12-31 downtown max hr9 0.381
68158 │ 2012-12-31 airport min hr9 0.418
68159 │ 2012-12-31 airport max hr9 0.422
This is looking closer now that each point refers to a single hour of time. However, we still have separate points storing our maximum and minimum temperatures. This is not obvious in the output above, let’s make a couple of modifications to see this more clearly. First, we can clean our hour
feature by transforming the former feature name strings into integers. Then, we will sort our data so that show
will clearly display point pairs that need to be combined for our data to be tidy.
[10]:
tempData.features.transform(lambda ft: [int(hourStr[2:]) for hourStr in ft],
features=['hour'])
tempData.points.sort(['date', 'hour'])
tempData.show('Date and hour sorted', points=11)
Date and hour sorted
"combinedTemperatureData" 68160pt x 5ft
date station extreme hour temp
┌───────────────────────────────────────────
0 │ 2011-01-01 downtown min 0 2.840
1 │ 2011-01-01 downtown max 0 2.840
2 │ 2011-01-01 airport min 0 6.138
3 │ 2011-01-01 airport max 0 6.142
4 │ 2011-01-01 downtown min 1 2.019
│ │ │ │ │ │ │
68155 │ 2012-12-31 airport max 22 3.670
68156 │ 2012-12-31 downtown min 23 3.659
68157 │ 2012-12-31 downtown max 23 3.661
68158 │ 2012-12-31 airport min 23 3.649
68159 │ 2012-12-31 airport max 23 3.651
We see above that hr0
on 2011-01-01
for the downtown
station, for example, is still represented by two points. This is because each point identifies either the minimum or maximum temperature. Our second step is to combine these two point pairs by expanding the features to include features for the minimum and maximum temperatures. Our extreme
feature contains the values (min and max) that will become our new feature names and the temp
feature contains the values that fill the
new min
and max
features.
[11]:
tempData.points.combineByExpandingFeatures(featureWithFeatureNames='extreme',
featuresWithValues='temp')
tempData.show('Combined points by expanding extreme feature', points=11)
Combined points by expanding extreme feature
"combinedTemperatureData" 35064pt x 5ft
date station min max hour
┌─────────────────────────────────────────
0 │ 2011-01-01 downtown 2.840 2.840 0
1 │ 2011-01-01 airport 6.138 6.142 0
2 │ 2011-01-01 downtown 2.019 2.021 1
3 │ 2011-01-01 airport 5.299 5.301 1
4 │ 2011-01-01 downtown 2.022 2
│ │ │ │ │ │ │
35059 │ 2012-12-31 airport 3.679 3.681 21
35060 │ 2012-12-31 downtown 3.660 3.660 22
35061 │ 2012-12-31 airport 3.670 3.670 22
35062 │ 2012-12-31 downtown 3.659 3.661 23
35063 │ 2012-12-31 airport 3.649 3.651 23
Our object is now organized how we wanted with a tidy structure. There is one more tidying function in Nimble as well. It is designed to separate a feature containing multiple pieces of information into multiple features. We can demonstrate its functionality by applying it to our ‘date’ feature to create year
, month
and day
features.
[12]:
tempData.features.splitByParsing('date', lambda val: val.split('-'),
['year', 'month', 'day'])
tempData.show('Split features by parsing the date feature', points=11)
Split features by parsing the date feature
"combinedTemperatureData" 35064pt x 7ft
year month day station min max hour
┌───────────────────────────────────────────────
0 │ 2011 01 01 downtown 2.840 2.840 0
1 │ 2011 01 01 airport 6.138 6.142 0
2 │ 2011 01 01 downtown 2.019 2.021 1
3 │ 2011 01 01 airport 5.299 5.301 1
4 │ 2011 01 01 downtown 2.022 2
│ │ │ │ │ │ │ │ │
35059 │ 2012 12 31 airport 3.679 3.681 21
35060 │ 2012 12 31 downtown 3.660 3.660 22
35061 │ 2012 12 31 airport 3.670 3.670 22
35062 │ 2012 12 31 downtown 3.659 3.661 23
35063 │ 2012 12 31 airport 3.649 3.651 23
Reference:
Wickham, H. (2014). Tidy Data. Journal of Statistical Software, 59(10), 1 - 23. doi:http://dx.doi.org/10.18637/jss.v059.i10