Base.groupByFeature

Base.groupByFeature(by, calculate=None, countUniqueValueOnly=False, *, useLog=None)

Group the points of this object by the values of one or more features. The result is a dictionary whose keys are the unique values of the target feature(s) and whose values are the nimble Base objects holding the points that belong to each group.

Parameters:
  • by (int, str or list) –

    • int - the index of the feature to group by

    • str - the name of the feature to group by

    • list - indices or names of features to group by

  • calculate (str, None) – The name of the statistical function to apply to each group. If None, the default, no calculation will be applied.

  • countUniqueValueOnly (bool) – If True, return only the number of points in each group instead of the grouped data. Default is False.

  • useLog (bool, None) – Local control for whether to send object creation to the logger. If None (default), use the value as specified in the “logger” “enabledByDefault” configuration option. If True, send to the logger regardless of the global option. If False, do NOT send to the logger, regardless of the global option.

Returns:

dict – Each unique feature (or group of features) to group by as keys. When countUniqueValueOnly is False, the value at each key is a nimble object containing the ungrouped features of points within that group. When countUniqueValueOnly is True, the values are the number of points within that group.
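
The shape of the returned dictionary can be modeled in plain Python. The sketch below is not nimble's implementation: group_by_feature, rows and count_only are hypothetical names used only for illustration, lists of rows stand in for nimble objects, and the use of a tuple as the key when several features are given is an assumption made here.

```python
from collections import defaultdict

def group_by_feature(rows, feature_names, by, count_only=False):
    """Plain-Python model of grouping rows by one or more named features."""
    by_names = [by] if isinstance(by, str) else list(by)
    by_idx = [feature_names.index(name) for name in by_names]
    # Features that are not grouped on stay in each group's rows.
    keep_idx = [i for i in range(len(feature_names)) if i not in by_idx]

    groups = defaultdict(list)
    for row in rows:
        values = tuple(row[i] for i in by_idx)
        # One feature: the bare value is the key. Several: a tuple (assumed).
        key = values[0] if len(values) == 1 else values
        groups[key].append([row[i] for i in keep_idx])

    if count_only:
        # Mirrors countUniqueValueOnly=True: only the group sizes are kept.
        return {key: len(members) for key, members in groups.items()}
    return dict(groups)

rows = [['ACC', 'Clemson', 15, 0],
        ['SEC', 'Alabama', 14, 1],
        ['SEC', 'LSU', 10, 3]]
names = ['conference', 'team', 'wins', 'losses']

grouped = group_by_feature(rows, names, 'conference')
counts = group_by_feature(rows, names, 'conference', count_only=True)
print(grouped['SEC'])  # [['Alabama', 14, 1], ['LSU', 10, 3]]
print(counts)          # {'ACC': 1, 'SEC': 2}
```

In both modes the keys are the same; only the values change from the remaining features of each group to a plain count.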

Examples

>>> lst = [['ACC', 'Clemson', 15, 0],
...        ['SEC', 'Alabama', 14, 1],
...        ['Big 10', 'Ohio State', 13, 1],
...        ['Big 12', 'Oklahoma', 12, 2],
...        ['Independent', 'Notre Dame', 12, 1],
...        ['SEC', 'LSU', 10, 3],
...        ['SEC', 'Florida', 10, 3],
...        ['SEC', 'Georgia', 11, 3]]
>>> ftNames = ['conference', 'team', 'wins', 'losses']
>>> top10 = nimble.data(lst, featureNames=ftNames)
>>> groupByLosses = top10.groupByFeature('losses')
>>> list(groupByLosses.keys())
[0, 1, 2, 3]
>>> groupByLosses[1]
<DataFrame 3pt x 3ft
      conference     team     wins
   ┌──────────────────────────────
 0 │         SEC     Alabama   14
 1 │      Big 10  Ohio State   13
 2 │ Independent  Notre Dame   12
>
>>> groupByLosses[3]
<DataFrame 3pt x 3ft
     conference    team   wins
   ┌──────────────────────────
 0 │    SEC          LSU   10
 1 │    SEC      Florida   10
 2 │    SEC      Georgia   11
>

Using the calculate parameter we can find the maximum of other features within each group.

>>> top10.groupByFeature('conference', calculate='max')
{'ACC': <DataFrame 1pt x 3ft
       team   wins   losses
     ┌─────────────────────
 max │       15.000  0.000
>, 'SEC': <DataFrame 1pt x 3ft
       team   wins   losses
     ┌─────────────────────
 max │       14.000  3.000
>, 'Big 10': <DataFrame 1pt x 3ft
       team   wins   losses
     ┌─────────────────────
 max │       13.000  1.000
>, 'Big 12': <DataFrame 1pt x 3ft
       team   wins   losses
     ┌─────────────────────
 max │       12.000  2.000
>, 'Independent': <DataFrame 1pt x 3ft
       team   wins   losses
     ┌─────────────────────
 max │       12.000  1.000
>}
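
To make the per-group calculation concrete, here is a plain-Python sketch of what a 'max' summary computes for each group. It is an illustration under assumptions, not nimble's code: max_per_group is a hypothetical helper, and leaving non-numeric features such as team as None is this sketch's way of mirroring the empty team column in the output above.

```python
from numbers import Number

def max_per_group(rows, feature_names, by):
    """Return {group key: {feature name: max value or None}}."""
    by_idx = feature_names.index(by)
    other_idx = [i for i in range(len(feature_names)) if i != by_idx]

    # First pass: collect the rows of each group.
    groups = {}
    for row in rows:
        groups.setdefault(row[by_idx], []).append(row)

    # Second pass: reduce every remaining feature to one value per group.
    result = {}
    for key, members in groups.items():
        summary = {}
        for i in other_idx:
            column = [member[i] for member in members]
            numeric = all(isinstance(value, Number) for value in column)
            # Non-numeric features get no maximum in this sketch.
            summary[feature_names[i]] = float(max(column)) if numeric else None
        result[key] = summary
    return result

rows = [['ACC', 'Clemson', 15, 0],
        ['SEC', 'Alabama', 14, 1],
        ['SEC', 'LSU', 10, 3]]
names = ['conference', 'team', 'wins', 'losses']

maxima = max_per_group(rows, names, 'conference')
print(maxima['SEC'])  # {'team': None, 'wins': 14.0, 'losses': 3.0}
```

Each group collapses to a single summary point, which is why every object in the output above is 1pt x 3ft.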

Adding a new point to the data object with a missing value in a target feature will result in a new group with the key 'NaN'.

>>> lst.append(['', 'Auburn', 9, 4])
>>> top10 = nimble.data(lst, featureNames=ftNames)
>>> top10.groupByFeature('conference')
{'ACC': <DataFrame 1pt x 3ft
       team   wins  losses
   ┌──────────────────────
 0 │ Clemson   15     0
>, 'SEC': <DataFrame 4pt x 3ft
       team   wins  losses
   ┌──────────────────────
 0 │ Alabama   14     1
 1 │     LSU   10     3
 2 │ Florida   10     3
 3 │ Georgia   11     3
>, 'Big 10': <DataFrame 1pt x 3ft
        team     wins  losses
   ┌─────────────────────────
 0 │ Ohio State   13     1
>, 'Big 12': <DataFrame 1pt x 3ft
       team    wins  losses
   ┌───────────────────────
 0 │ Oklahoma   12     2
>, 'Independent': <DataFrame 1pt x 3ft
        team     wins  losses
   ┌─────────────────────────
 0 │ Notre Dame   12     1
>, 'NaN': <DataFrame 1pt x 3ft
      team   wins  losses
   ┌─────────────────────
 0 │ Auburn   9      4
>}
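
The handling of missing values can be sketched in plain Python as a key-normalization step. This is a hedged illustration only: group_key is a hypothetical helper, and treating None, the empty string and float NaN as missing is an assumption of this sketch rather than a statement of what nimble checks.

```python
import math

def group_key(value):
    """Map any value treated as missing (assumed set) to the key 'NaN'."""
    is_nan = isinstance(value, float) and math.isnan(value)
    if value is None or value == '' or is_nan:
        return 'NaN'
    return value

rows = [['SEC', 'Alabama', 14, 1],
        ['', 'Auburn', 9, 4],
        [None, 'Army', 11, 2]]

groups = {}
for conference, *rest in rows:
    # All missing conferences land in the same 'NaN' group.
    groups.setdefault(group_key(conference), []).append(rest)

print(groups['NaN'])  # [['Auburn', 9, 4], ['Army', 11, 2]]
```

Normalizing to one string key matters because float NaN values do not compare equal to each other, so without this step they could not reliably share a group.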

Keywords: split, organize, categorize, groupby, variable, dimension, attribute, predictor