How to work with Datasets

A clear example of how to work with datasets. Start by building a dummy DataFrame with a mix of integer, float, and object columns:

```python
import numpy as np
import pandas as pd

from mltreelib.data import Data

# build a dummy frame with integer, float and object (string) columns
n_size = 1000000
rnd = np.random.RandomState(1234)
dummy_data = pd.DataFrame({
    'numericfull': rnd.randint(1, 500, size=n_size),
    'unitint': rnd.randint(1, 25, size=n_size),
    'floatfull': rnd.random_sample(size=n_size),
    'floatsmall': np.round(rnd.random_sample(size=n_size) + rnd.randint(1, 25, size=n_size), 2),
    'categoryobj': rnd.choice(['a', 'b', 'c', 'd'], size=n_size),
    'stringobj': rnd.choice(["{:c}".format(k) for k in range(97, 123)], size=n_size)})
dummy_data.head()
```
|   | numericfull | unitint | floatfull | floatsmall | categoryobj | stringobj |
|---|---|---|---|---|---|---|
| 0 | 304 | 1 | 0.651859 | 11.42 | a | f |
| 1 | 212 | 1 | 0.906869 | 23.28 | d | v |
| 2 | 295 | 23 | 0.933262 | 21.79 | d | t |
| 3 | 54 | 19 | 0.919103 | 9.24 | d | s |
| 4 | 205 | 9 | 0.262066 | 16.69 | a | l |
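For reference, pandas holds these columns with its default 64-bit and object dtypes; these are the dtypes that get shrunk below:

```python
# Default pandas dtypes before any processing: int64, float64 and object.
dummy_data.dtypes
```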
Pass it to Dataset

Pass the DataFrame to the Data class and let it do its magic:

```python
dataset = Data(df=dummy_data)
dataset
```

```
Dataset(df=Shape((1000000, 6)), reduce_datatype=True, encode_category=None, add_intercept=False, na_treatment=allow, copy=False, digits=None, n_category=None, split_ratio=None)
```
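The repr lists the processing options the object was created with. As a hedged illustration only, and assuming the repr fields correspond one-to-one to constructor keyword arguments (this page only confirms `df` and `digits`), the defaults could be spelled out explicitly:

```python
# Assumption: the keyword names below are taken from the printed repr above;
# only df= and digits= are demonstrated elsewhere on this page.
dataset = Data(
    df=dummy_data,
    reduce_datatype=True,   # downcast columns to the smallest safe dtypes
    encode_category=None,   # leave category/string columns as Python objects
    add_intercept=False,    # do not add a constant column
    na_treatment='allow',   # shown as na_treatment=allow in the repr
    copy=False,
    digits=None,            # no rounding of float columns
    n_category=None,
    split_ratio=None,
)
```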
To access the raw processed data

```python
dataset.data[:5]
```

```
array([(304, 1, 0.65185905, 11.42, 'a', 'f'),
       (212, 1, 0.90686905, 23.28, 'd', 'v'),
       (295, 23, 0.9332624 , 21.79, 'd', 't'),
       ( 54, 19, 0.9191031 ,  9.24, 'd', 's'),
       (205, 9, 0.2620663 , 16.69, 'a', 'l')],
      dtype=[('numericfull', '<u2'), ('unitint', 'u1'), ('floatfull', '<f4'), ('floatsmall', '<f4'), ('categoryobj', 'O'), ('stringobj', 'O')])
```

Note: This is a NumPy structured array, not a simple NumPy array or pandas DataFrame.
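Because `dataset.data` is a structured array, individual columns are addressed by field name rather than with pandas-style indexing. This is plain NumPy, nothing library-specific:

```python
# Field names and per-column access on the structured array.
print(dataset.data.dtype.names)         # ('numericfull', 'unitint', ..., 'stringobj')
print(dataset.data['floatsmall'][:5])   # one column, first five rows

# Copy a field out as a standalone contiguous ndarray when needed.
floatsmall = dataset.data['floatsmall'].copy()
```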
Size reduction is as follows:

```python
print('Pandas Data Frame : ', np.round(dummy_data.memory_usage(deep=True).sum() * 1e-6, 2), 'MB')
print('Dataset Structured Array : ', np.round(dataset.data.nbytes * 1e-6, 2), 'MB')
```

```
Pandas Data Frame :  148.0 MB
Dataset Structured Array :  27.0 MB
```
```python
print(dummy_data.info(memory_usage='deep'))
```

```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 6 columns):
 #   Column       Non-Null Count    Dtype
---  ------       --------------    -----
 0   numericfull  1000000 non-null  int64
 1   unitint      1000000 non-null  int64
 2   floatfull    1000000 non-null  float64
 3   floatsmall   1000000 non-null  float64
 4   categoryobj  1000000 non-null  object
 5   stringobj    1000000 non-null  object
dtypes: float64(2), int64(2), object(2)
memory usage: 141.1 MB
None
```
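Where does the 27 MB come from? The structured dtype packs each row into 2 + 1 + 4 + 4 bytes of numeric data plus two 8-byte object pointers (on a 64-bit build), i.e. 27 bytes per row for a million rows. A quick check with plain NumPy/pandas:

```python
# 27 bytes per row: u2 + u1 + f4 + f4 + two 8-byte object pointers.
print(dataset.data.dtype.itemsize)                       # 27
print(dataset.data.dtype.itemsize * len(dataset.data))   # 27000000 bytes ~ 27 MB

# The pandas frame spends far more, mostly on the two object columns.
print(dummy_data.memory_usage(deep=True))
```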
Further reduction in data size

The data can be reduced even further using the following parameters:

```python
dataset = Data(df=dummy_data, digits=2)
print('Pandas Data Frame : ', np.round(dummy_data.memory_usage(deep=True).sum() * 1e-6, 2), 'MB')
print('Dataset Structured Array : ', np.round(dataset.data.nbytes * 1e-6, 2), 'MB')
```

```
Pandas Data Frame :  148.0 MB
Dataset Structured Array :  27.0 MB
```

```python
dataset.data[:5]
```

```
array([(304, 1, 0.65, 11.42, 'a', 'f'), (212, 1, 0.91, 23.28, 'd', 'v'),
       (295, 23, 0.93, 21.79, 'd', 't'), ( 54, 19, 0.92,  9.24, 'd', 's'),
       (205, 9, 0.26, 16.69, 'a', 'l')],
      dtype=[('numericfull', '<u2'), ('unitint', 'u1'), ('floatfull', '<f4'), ('floatsmall', '<f4'), ('categoryobj', 'O'), ('stringobj', 'O')])
```
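Note that the reported footprint stays at 27 MB, since the float columns are still stored as float32; what `digits=2` changes is the stored precision. Under the assumption that `digits` simply rounds the numeric columns before the dtype reduction, this can be sanity-checked against a plain `np.round`:

```python
# Assumption: digits=2 rounds float columns to two decimals before storage.
rounded = np.round(dummy_data['floatfull'].to_numpy(), 2).astype(np.float32)
print(np.allclose(dataset.data['floatfull'], rounded, atol=1e-6))  # True under that assumption
```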