nimble.data

nimble.data(source, pointNames='automatic', featureNames='automatic', returnType=None, name=None, convertToType=None, keepPoints='all', keepFeatures='all', treatAsMissing=(float('nan'), numpy.nan, None, '', 'None', 'nan', 'NULL', 'NA'), replaceMissingWith=numpy.nan, rowsArePoints=True, ignoreNonNumericalFeatures=False, inputSeparator='automatic', copyData=True, *, useLog=None)

Function to instantiate one of the Nimble data container types.

All Nimble data objects offer the same methods, but each provides a unique implementation based on how it stores data on the backend. This creates consistency regardless of returnType, but efficiencies can be gained by choosing the returnType best suited to the source data. For example, highly sparse data would benefit most from choosing returnType='Sparse'. Nimble data objects are also consistent in that they all allow for point and feature names. Additional parameters allow some data preprocessing to be performed during object creation.

Parameters:
  • source (object, str) –

    The source of the data to be loaded into the returned object.

    • in-python data object - list, dictionary, numpy array or matrix, scipy sparse matrix, pandas dataframe or series.

    • open file-like object

    • str - A path or URL to the data file.

  • pointNames ('automatic', bool, list, dict) –

    Specifies the source for point names in the returned object.

    • 'automatic' - the default; detection of point names in the data is attempted, but only when loading from a file. If no names are found, or the data is not being loaded from a file, default names are assigned.

    • bool - True indicates that point names are embedded in the data within the first column. A value of False indicates that names are not embedded and that default names should be used.

    • list, dict - all points in the data must be assigned a name and the names for each point must be unique. As a list, the index of the name will define the point index. As a dict, the value mapped to each name will define the point index.

  • featureNames ('automatic', bool, list, dict) –

    Specifies the source for feature names in the returned object.

    • 'automatic' - the default; detection of feature names in the data is attempted, but only when loading from a file. If no names are found, or the data is not being loaded from a file, default names are assigned.

    • bool - True indicates that feature names are embedded in the data within the first row. A value of False indicates that names are not embedded and that default names should be used.

    • list, dict - all features in the data must be assigned a name and the names for each feature must be unique. As a list, the index of the name will define the feature index. As a dict, the value mapped to each name will define the feature index.

  • returnType (str, None) – Indicates which Nimble data object to return. Options are the case-sensitive strings "List", "Matrix", "Sparse", and "DataFrame". If None, Nimble will detect the most appropriate type from the data and/or the packages available in the environment.

  • name (str, None) – A string describing the data that will display when printing or logging the returned object. This value is also set as the name attribute of the returned object.

  • convertToType (type, dict, list, None) – A one-time conversion of features to the provided type or types. By default, object types within source are not modified, except for features detected to be numeric in a data file. Setting this parameter to a single type will convert all of the data to that type. For feature-by-feature type setting, a dict or list can be used. Dicts map feature identifiers (names and indices) to conversion types; any feature not included in the dict will remain as-is. A list must provide a type or None for each feature. Note: type setting only applies during the creation process; object methods will modify types afterward if necessary. See the examples below.

  • keepPoints ('all', list) – Allows the user to select which points will be kept in the returned object; those not selected will be discarded. By default, the value 'all' indicates that all possible points in the data will be kept. Alternatively, the user may provide a list containing either names or indices (or a mix) of the points to keep. The order of this list will determine the order of points in the resultant object. When reading data from a file, the selection is done at read time, limiting the amount of data read into memory. See the examples below.

  • keepFeatures ('all', list) – Allows the user to select which features will be kept in the returned object; those not selected will be discarded. By default, the value 'all' indicates that all possible features in the data will be kept. Alternatively, the user may provide a list containing either names or indices (or a mix) of the features to keep. The order of this list will determine the order of features in the resultant object. When reading data from a file, the selection is done at read time, limiting the amount of data read into memory. Additionally, ignoreNonNumericalFeatures takes precedence and, when set to True, will remove selected features if they contain non-numeric values.

  • treatAsMissing (list) – Values that will be treated as missing values in the data. These values will be replaced with the value of replaceMissingWith. By default this list is [float('nan'), numpy.nan, None, '', 'None', 'nan', 'NULL', 'NA']. Set to None or [] to disable replacing missing values.

  • replaceMissingWith – A single value with which to replace any value in treatAsMissing. By default this value is np.nan.

  • rowsArePoints (bool) – For source objects with a shape attribute, shape[0] indicates the rows. Otherwise, rows are defined as the objects returned when iterating through source. When rows are one-dimensional, each row is processed as a point if True, otherwise rows are processed as features. Two-dimensional rows will also be treated as one-dimensional if this parameter is False and each row has a feature-vector shape (e.g. a list of 5 x 1 numpy arrays). This parameter must be True for rows of any higher dimension. See the examples below.

  • ignoreNonNumericalFeatures (bool) – This only applies when ``source`` is a file. When True, features containing non-numeric data are not loaded into the final object. If point or feature selection is occurring, only the values within the selected points and features are considered when determining whether to apply this operation.

  • inputSeparator (str) – This only applies when ``source`` is a delimited file. The character used to separate fields in the input file. By default, a value of 'automatic' will attempt to determine the appropriate separator. Otherwise, a single-character string giving the separator used in the file can be passed. See the file example below.

  • copyData (bool) – This only applies when ``source`` is an in-python data type. When True (the default) the backend data container is guaranteed to be a different object than source because a copy is made before processing the data. When False, the initial copy is not performed so it is possible (NOT guaranteed) that the source data object is used as the backend data container for the returned object. In that case, any modifications to either object would affect the other object.

  • useLog (bool, None) – Local control for whether to send object creation to the logger. If None (default), use the value as specified in the “logger” “enabledByDefault” configuration option. If True, send to the logger regardless of the global option. If False, do NOT send to the logger, regardless of the global option.

Returns:

nimble.core.data.Base – A subclass of the Base object corresponding to the returnType.

Examples

>>> data = [[1, 2, 3], [4, 5, 6]]
>>> asList = nimble.data(data, returnType="List", name='simple')
>>> asList
<List "simple" 2pt x 3ft
     0  1  2
   ┌────────
 0 │ 1  2  3
 1 │ 4  5  6
>
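
Converting all values to a single type at creation time. This is a minimal sketch of the convertToType parameter; the printed result is omitted here.

>>> data = [[1, 2, 3], [4, 5, 6]]
>>> asFloats = nimble.data(data, convertToType=float)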

Loading data from a file.

>>> with open('simpleData.csv', 'w') as cd:
...     out = cd.write('1,2,3\n4,5,6')
>>> fromFile = nimble.data('simpleData.csv')
>>> fromFile 
<Matrix 2pt x 3ft
     0  1  2
   ┌────────
 0 │ 1  2  3
 1 │ 4  5  6
>
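
Loading a delimited file with an explicit separator while skipping non-numeric features. This is a minimal sketch combining the inputSeparator and ignoreNonNumericalFeatures parameters; the file name is illustrative and the printed result is omitted here.

>>> with open('mixedData.csv', 'w') as md:
...     out = md.write('1;a;3\n4;b;6')
>>> numericOnly = nimble.data('mixedData.csv', inputSeparator=';',
...                           ignoreNonNumericalFeatures=True)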

Adding point and feature names.

>>> data = [['a', 'b', 'c'], [0, 0, 1], [1, 0, 0]]
>>> asSparse = nimble.data(data, pointNames=['1', '2'],
...                        featureNames=True, returnType="Sparse")
>>> asSparse
<Sparse 2pt x 3ft
     a  b  c
   ┌────────
 1 │ 0  0  1
 2 │ 1  0  0
>
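
Keeping only selected points and features. This is a minimal sketch of keepPoints and keepFeatures; the order of each list determines the order in the returned object, and the printed result is omitted here.

>>> data = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
>>> subset = nimble.data(data, featureNames=['a', 'b', 'c'],
...                      keepPoints=[2, 0], keepFeatures=['c', 'a'])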

Replacing missing values.

>>> data = [[1, 'Missing', 3], [4, 'Missing', 6]]
>>> ftNames = {'a': 0, 'b': 1, 'c': 2}
>>> asDataFrame = nimble.data(data, featureNames=ftNames,
...                           returnType="DataFrame",
...                           treatAsMissing=["Missing", 3],
...                           replaceMissingWith=-1)
>>> asDataFrame
<DataFrame 2pt x 3ft
     a  b   c
   ┌──────────
 0 │ 1  -1  -1
 1 │ 4  -1   6
>
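
Treating each row of the source as a feature rather than a point. This is a minimal sketch of rowsArePoints; the printed result is omitted here.

>>> data = [[1, 4], [2, 5], [3, 6]]
>>> byFeature = nimble.data(data, rowsArePoints=False)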

Keywords: create, make, construct, new, matrix, read, load, open, file, in-python data object, list, dictionary, numpy array, pandas dataframe, scipy sparse, csv, mtx, hdf5, h5, url, pickle