How to work with Datasets

A clear example of how to work with datasets.
import numpy as np
import pandas as pd
from mltreelib.data import Data
n_size = 1000000
rnd = np.random.RandomState(1234)
dummy_data = pd.DataFrame({
    'numericfull':rnd.randint(1,500,size=n_size),
    'unitint':rnd.randint(1,25,size=n_size),
    'floatfull':rnd.random_sample(size=n_size),
    'floatsmall':np.round(rnd.random_sample(size=n_size)+rnd.randint(1,25,size=n_size),2),
    'categoryobj':rnd.choice(['a','b','c','d'],size=n_size),
    'stringobj':rnd.choice(["{:c}".format(k) for k in range(97, 123)],size=n_size)})
dummy_data.head()
   numericfull  unitint  floatfull  floatsmall categoryobj stringobj
0          304        1   0.651859       11.42           a         f
1          212        1   0.906869       23.28           d         v
2          295       23   0.933262       21.79           d         t
3           54       19   0.919103        9.24           d         s
4          205        9   0.262066       16.69           a         l

Pass the DataFrame to Data and let it do its magic:

dataset = Data(df=dummy_data)
dataset
Dataset(df=Shape((1000000, 6)), reduce_datatype=True, encode_category=None, add_intercept=False, na_treatment=allow, copy=False, digits=None, n_category=None, split_ratio=None)

To access the raw processed data:

dataset.data[:5]
array([(304,  1, 0.65185905, 11.42, 'a', 'f'),
       (212,  1, 0.90686905, 23.28, 'd', 'v'),
       (295, 23, 0.9332624 , 21.79, 'd', 't'),
       ( 54, 19, 0.9191031 ,  9.24, 'd', 's'),
       (205,  9, 0.2620663 , 16.69, 'a', 'l')],
      dtype=[('numericfull', '<u2'), ('unitint', 'u1'), ('floatfull', '<f4'), ('floatsmall', '<f4'), ('categoryobj', 'O'), ('stringobj', 'O')])

Note: This is a NumPy structured array, not a simple NumPy array or pandas DataFrame.
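Unlike a 2-D NumPy array, a structured array is indexed by field name and by record. A minimal, self-contained sketch of this (using a hand-built array with the same field layout, not the Data class itself):

```python
import numpy as np

# Hand-built structured array mimicking the field layout above
arr = np.array([(304, 1, 0.65, 'a'), (212, 1, 0.91, 'd')],
               dtype=[('numericfull', '<u2'), ('unitint', 'u1'),
                      ('floatfull', '<f4'), ('categoryobj', 'O')])

print(arr['numericfull'])   # access a whole field (column) by name
print(arr[0])               # access one record (row)
print(arr.dtype.names)      # tuple of all field names
```

Field access returns a view, so column-wise operations stay cheap even though the records are stored row by row.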

The size reduction is as follows:

print('Pandas Data Frame        : ',np.round(dummy_data.memory_usage(deep=True).sum()*1e-6,2),'MB')
print('Dataset Structured Array : ',np.round(dataset.data.nbytes*1e-6,2),'MB')
Pandas Data Frame        :  148.0 MB
Dataset Structured Array :  27.0 MB
print(dummy_data.info(memory_usage='deep'))
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 6 columns):
 #   Column       Non-Null Count    Dtype  
---  ------       --------------    -----  
 0   numericfull  1000000 non-null  int64  
 1   unitint      1000000 non-null  int64  
 2   floatfull    1000000 non-null  float64
 3   floatsmall   1000000 non-null  float64
 4   categoryobj  1000000 non-null  object 
 5   stringobj    1000000 non-null  object 
dtypes: float64(2), int64(2), object(2)
memory usage: 141.1 MB
None
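The int64/float64 columns above are what drive the DataFrame's large footprint. The reduce_datatype behaviour can be approximated in plain pandas by downcasting each numeric column to the smallest dtype that holds its range (a sketch of the idea, not the library's actual implementation):

```python
import numpy as np
import pandas as pd

rnd = np.random.RandomState(1234)
df = pd.DataFrame({
    'numericfull': rnd.randint(1, 500, size=1000, dtype=np.int64),  # fits in uint16
    'unitint': rnd.randint(1, 25, size=1000, dtype=np.int64),       # fits in uint8
})

# Downcast each column to the smallest unsigned integer dtype that fits
small = df.apply(pd.to_numeric, downcast='unsigned')
print(small.dtypes.tolist())                 # [dtype('uint16'), dtype('uint8')]
print(df.memory_usage(index=False).sum())    # 16000 bytes
print(small.memory_usage(index=False).sum()) # 3000 bytes
```

Going from two int64 columns (8 bytes each per row) to uint16 + uint8 (3 bytes per row) is the same effect seen in the '<u2' and 'u1' fields of dataset.data.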

Further reduction in data size

We can reduce the data even further by using the digits parameter, which rounds floating-point columns to the given precision:

dataset = Data(df=dummy_data, digits=2)
print('Pandas Data Frame        : ',np.round(dummy_data.memory_usage(deep=True).sum()*1e-6,2),'MB')
print('Dataset Structured Array : ',np.round(dataset.data.nbytes*1e-6,2),'MB')
Pandas Data Frame        :  148.0 MB
Dataset Structured Array :  27.0 MB
dataset.data[:5]
array([(304,  1, 0.65, 11.42, 'a', 'f'), (212,  1, 0.91, 23.28, 'd', 'v'),
       (295, 23, 0.93, 21.79, 'd', 't'), ( 54, 19, 0.92,  9.24, 'd', 's'),
       (205,  9, 0.26, 16.69, 'a', 'l')],
      dtype=[('numericfull', '<u2'), ('unitint', 'u1'), ('floatfull', '<f4'), ('floatsmall', '<f4'), ('categoryobj', 'O'), ('stringobj', 'O')])
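If you need pandas functionality again, a structured array converts straight back to a DataFrame, with field names becoming column names and the reduced dtypes preserved (illustrated with a hand-built array rather than dataset.data, so the sketch runs standalone):

```python
import numpy as np
import pandas as pd

arr = np.array([(304, 0.65, 'a'), (212, 0.91, 'd')],
               dtype=[('numericfull', '<u2'), ('floatfull', '<f4'),
                      ('categoryobj', 'O')])

# pandas accepts a structured array directly
df = pd.DataFrame(arr)
print(df.columns.tolist())  # ['numericfull', 'floatfull', 'categoryobj']
print(df.dtypes.tolist())   # [dtype('uint16'), dtype('float32'), dtype('O')]
```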