How to :: Clean data in python

This page is dedicated to tutorials to show common cleaning operations using python, the pandas library and the Toucan’s functions that we designed for you.

Generic cleaning

Use the Toucan clean_dataframe util function

This function that we provide you allows to clean and homogenise columns names, optimise some dtypes for memory (category vs. object, int vs. float), and rename columns.

You can find the documentation here.

Input data (``input_data`` dataframe):

Country Product Category product_name date_2 $_VALUE évolution
France Category A product A1 Fr 2017-12-31 1.0 0.01
France Category A product A2 Fr 2017-12-31 2.0 0.02
France Category B product B1 Fr 2017-12-31 3.0 0.03
France Category B product B1 Fr 2017-12-31 4.0 0.04
France Category C product C1 Fr 2017-12-31 5.0 0.05
France Category C product C1 Fr 2017-12-31 6.0 0.06
Spain Category A product A1 Sp 2017-12-31 7.0 0.07
Spain Category A product A1 Sp 2017-12-31 8.0 0.08
Spain Category B product B2 Sp 2017-12-31 9.0 0.09
Spain Category B product B2 Sp 2017-12-31 10.0 0.1
Spain Category C product C2 Sp 2017-12-31 11.0 0.11
Spain Category C product C2 Sp 2017-12-31 12.0 0.12

Let us have a look at input_data dtypes:

print(input_data.dtypes)
column dtype
Country object
Product Category object
product_name object
date datetime64[ns]
$ VALUE float64
évolution N-1 float64

Now we would like to clean several things:

  • Cleaning and homogenisation of columns names: this is managed by the ``is_slugify`` parameter (True by default). If you do not want the columns names to be slugified, set this parameter to False.
  • Conversion of relevant text columns into ``category`` dtype: for categorical data (taking a limited, and usually fixed number of possible values), it allows to save significant memory for large datasets with columns with low cardinality (the amount of unique values is lower than 50% of the count of these values). The memory usage of a category dtype is proportional to the number of categories plus the length of the data; in contrast, an object dtype is a constant times the length of the data. To determine whether an object column has to be converted into category, you can set the maximum number of unique values to be found in the column below which the column can be considered as category. You can use the threshold parameter (default to 50) for that matter.
  • Conversion of relevant columns from float to int dtype. This is managed automatically by the function if relevant and will save memory
  • Renaming of columns: you can use a dictionary as the rename_cols parameter, with original names as key and replacement name as value.

So let us clean our data:

from toucan_data_sdk.utils.generic import clean_dataframe

clean_data = clean_dataframe(input_data, is_slugify=True, threshold=4, rename_cols={'product_name': 'product-name'})

Now let us have a look at the columns names and dtypes to see the result:

print(clean_data.dtypes)
column dtype
country category
product-category category
product-name object
date datetime64[ns]
value int64
evolution-n-1 float64

Focus on memory usage:

Let us compare the memory usage of input_data and clean_data:

print(input_data.memory_usage())
column memory_usage
Index 80
Country 96
Product Category 96
product_name 96
date 96
$ VALUE 96
évolution N-1 96
print(clean_data.memory_usage())
column memory_usage
Index 80
country 108
product-category 116
product-name 96
date 96
value 96
evolution-n-1 96

You can see that for this very small dataset, categorical data uses more memory because of the pandas implementation (which is not the subject of this tutorial).

But now let us build a bigger dataframe to get a sense of the optimisation allowed by categorical data:

test_input = input_data.copy()
test_output = output_0.copy()
for n in range(10):
    test_input = test_input.append(test_input).reset_index(drop=True)
    test_output = test_output.append(test_output).reset_index(drop=True)
print(test_input.memory_usage())
column memory_usage
Index 80
Country 98304
Product Category 98304
product_name 98304
date 98304
$ VALUE 98304
évolution N-1 98304
print(test_output.memory_usage())
column memory_usage
Index 80
country 12384
product-category 12384
product-name 98304
date 98304
value 98304
evolution-n-1 98304

No comment…

Cleaning text

Coming soon…

Cleaning dates

Coming soon…