Base.groupByFeature
- Base.groupByFeature(by, calculate=None, countUniqueValueOnly=False, *, useLog=None)
Group data object by one or more features. This results in a dictionary where the keys are the unique values of the target feature(s) and the values are the nimble base objects that correspond to the group.
- Parameters:
by (int, str, list) – The feature or features to group by:
int - the index of the feature to group by
str - the name of the feature to group by
list - indices or names of features to group by
calculate (str, None) – The name of the statistical function to apply to each group. If None, the default, no calculation will be applied.
countUniqueValueOnly (bool) – If True, return only the number of points in each group instead of the grouped objects. Default is False.
useLog (bool, None) – Local control for whether to send object creation to the logger. If None (default), use the value as specified in the “logger” “enabledByDefault” configuration option. If True, send to the logger regardless of the global option. If False, do NOT send to the logger, regardless of the global option.
- Returns:
dict – The keys are the unique values (or combinations of values) of the feature(s) grouped by. When countUniqueValueOnly is False, the value at each key is a nimble object containing the ungrouped features of the points within that group. When countUniqueValueOnly is True, the value at each key is the number of points within that group.
Examples
>>> lst = [['ACC', 'Clemson', 15, 0],
...        ['SEC', 'Alabama', 14, 1],
...        ['Big 10', 'Ohio State', 13, 1],
...        ['Big 12', 'Oklahoma', 12, 2],
...        ['Independent', 'Notre Dame', 12, 1],
...        ['SEC', 'LSU', 10, 3],
...        ['SEC', 'Florida', 10, 3],
...        ['SEC', 'Georgia', 11, 3]]
>>> ftNames = ['conference', 'team', 'wins', 'losses']
>>> top10 = nimble.data(lst, featureNames=ftNames)
>>> groupByLosses = top10.groupByFeature('losses')
>>> list(groupByLosses.keys())
[0, 1, 2, 3]
>>> groupByLosses[1]
<DataFrame 3pt x 3ft
     conference     team    wins
   ┌──────────────────────────────
 0 │     SEC       Alabama   14
 1 │    Big 10   Ohio State  13
 2 │ Independent Notre Dame  12
>
>>> groupByLosses[3]
<DataFrame 3pt x 3ft
     conference  team   wins
   ┌──────────────────────────
 0 │    SEC       LSU    10
 1 │    SEC     Florida  10
 2 │    SEC     Georgia  11
>
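To make the shape of the returned dictionary concrete, the following is a minimal plain-Python sketch of the grouping semantics described under Returns. It does not use nimble and is not nimble's implementation; the helper name group_rows_by_feature and its count_only flag are hypothetical stand-ins for groupByFeature and countUniqueValueOnly.

```python
from collections import defaultdict


def group_rows_by_feature(rows, feature_names, by, count_only=False):
    """Hypothetical helper (not part of nimble): group rows by one named feature."""
    idx = feature_names.index(by)
    groups = defaultdict(list)
    for row in rows:
        # The grouping feature becomes the key and is dropped from the stored row.
        groups[row[idx]].append(row[:idx] + row[idx + 1:])
    if count_only:
        # Mirrors the documented countUniqueValueOnly=True behaviour: sizes only.
        return {key: len(members) for key, members in groups.items()}
    return dict(groups)


rows = [['ACC', 'Clemson', 15, 0],
        ['SEC', 'Alabama', 14, 1],
        ['Big 10', 'Ohio State', 13, 1],
        ['Big 12', 'Oklahoma', 12, 2],
        ['Independent', 'Notre Dame', 12, 1],
        ['SEC', 'LSU', 10, 3],
        ['SEC', 'Florida', 10, 3],
        ['SEC', 'Georgia', 11, 3]]
names = ['conference', 'team', 'wins', 'losses']

by_losses = group_rows_by_feature(rows, names, 'losses')
print(list(by_losses.keys()))  # [0, 1, 2, 3]
print(by_losses[1])            # the three one-loss teams, without the 'losses' column

counts = group_rows_by_feature(rows, names, 'losses', count_only=True)
print(counts)                  # {0: 1, 1: 3, 2: 1, 3: 3}
```

As in the nimble example, each group holds only the features that were not grouped by, and the keys appear in order of first occurrence.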
Using the calculate parameter, we can find the maximum of the other features within each group.
>>> top10.groupByFeature('conference', calculate='max')
{'ACC': <DataFrame 1pt x 3ft
       team  wins  losses
     ┌─────────────────────
 max │      15.000 0.000
>, 'SEC': <DataFrame 1pt x 3ft
       team  wins  losses
     ┌─────────────────────
 max │      14.000 3.000
>, 'Big 10': <DataFrame 1pt x 3ft
       team  wins  losses
     ┌─────────────────────
 max │      13.000 1.000
>, 'Big 12': <DataFrame 1pt x 3ft
       team  wins  losses
     ┌─────────────────────
 max │      12.000 2.000
>, 'Independent': <DataFrame 1pt x 3ft
       team  wins  losses
     ┌─────────────────────
 max │      12.000 1.000
>}
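The per-group calculation can be pictured with a plain-Python sketch as well. This is an illustration under stated assumptions, not nimble's code: the helper max_per_group is hypothetical, and leaving non-numeric features as None is an assumption made here to mimic the blank team column in the output above.

```python
def max_per_group(rows, feature_names, by):
    """Hypothetical helper (not part of nimble): per-group maximum of numeric features."""
    idx = feature_names.index(by)
    kept_names = [name for name in feature_names if name != by]

    # First group the rows, dropping the grouping feature from each row.
    grouped = {}
    for row in rows:
        grouped.setdefault(row[idx], []).append(row[:idx] + row[idx + 1:])

    # Then reduce each group to a single summary record.
    result = {}
    for key, members in grouped.items():
        summary = {}
        for col, name in enumerate(kept_names):
            values = [member[col] for member in members]
            numeric = all(isinstance(v, (int, float)) and not isinstance(v, bool)
                          for v in values)
            # Assumption: features that are not numeric are left as None.
            summary[name] = float(max(values)) if numeric else None
        result[key] = summary
    return result


rows = [['ACC', 'Clemson', 15, 0],
        ['SEC', 'Alabama', 14, 1],
        ['Big 10', 'Ohio State', 13, 1],
        ['Big 12', 'Oklahoma', 12, 2],
        ['Independent', 'Notre Dame', 12, 1],
        ['SEC', 'LSU', 10, 3],
        ['SEC', 'Florida', 10, 3],
        ['SEC', 'Georgia', 11, 3]]
names = ['conference', 'team', 'wins', 'losses']

maxima = max_per_group(rows, names, 'conference')
print(maxima['SEC'])  # {'team': None, 'wins': 14.0, 'losses': 3.0}
```

Note that the maxima are taken feature by feature, so the SEC summary combines Alabama's 14 wins with the 3 losses of LSU, Florida, and Georgia rather than describing any single team.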
Adding a new point to the data object with a missing value in a target feature will result in a new group with the key 'NaN'.
>>> lst.append(['', 'Auburn', 9, 4])
>>> top10 = nimble.data(lst, featureNames=ftNames)
>>> top10.groupByFeature('conference')
{'ACC': <DataFrame 1pt x 3ft
      team   wins losses
   ┌──────────────────────
 0 │ Clemson  15    0
>, 'SEC': <DataFrame 4pt x 3ft
      team   wins losses
   ┌──────────────────────
 0 │ Alabama  14    1
 1 │   LSU    10    3
 2 │ Florida  10    3
 3 │ Georgia  11    3
>, 'Big 10': <DataFrame 1pt x 3ft
       team    wins losses
   ┌─────────────────────────
 0 │ Ohio State  13    1
>, 'Big 12': <DataFrame 1pt x 3ft
      team    wins losses
   ┌───────────────────────
 0 │ Oklahoma  12    2
>, 'Independent': <DataFrame 1pt x 3ft
       team    wins losses
   ┌─────────────────────────
 0 │ Notre Dame  12    1
>, 'NaN': <DataFrame 1pt x 3ft
      team  wins losses
   ┌─────────────────────
 0 │ Auburn  9     4
>}
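The handling of missing values can be approximated in plain Python as follows. This sketch is only an illustration: the group_key helper is hypothetical, and treating only None and the empty string as missing is an assumption made here, since the text above does not list which values nimble considers missing.

```python
def group_key(value):
    """Hypothetical helper: map a missing grouping value to the 'NaN' key."""
    # Assumption: only None and '' count as missing in this sketch.
    return 'NaN' if value is None or value == '' else value


rows = [['ACC', 'Clemson', 15, 0],
        ['SEC', 'Alabama', 14, 1],
        ['', 'Auburn', 9, 4]]

groups = {}
for conference, *rest in rows:
    # Points with a missing conference are collected under a single 'NaN' group.
    groups.setdefault(group_key(conference), []).append(rest)

print(groups)
# {'ACC': [['Clemson', 15, 0]], 'SEC': [['Alabama', 14, 1]], 'NaN': [['Auburn', 9, 4]]}
```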
Keywords: split, organize, categorize, groupby, variable, dimension, attribute, predictor