How to :: create a date column in your data source from your file name

Note

This will happen in the augment.py file.

What’s that?

Let’s say you receive a new data file every day, containing the latest day of data that needs to be added to your existing domain/data source. For some reason, the file you receive has no date column.

In this scenario, you are using the match: true option in the etl_config.cson file. See the doc here

Here is a tutorial on how to build that date column from the file name in the augment.py file.

Configuration

⚠️ This should be added in the def augment(dfs) section

import pandas as pd  # at the top of augment.py, if not already imported


def augment(dfs):
    # Pop the domain built from all the files matching your naming convention,
    # e.g. maintenance-20180222
    df_my_domain = dfs.pop('my_domain-{date}')
    # Rename the '__match__' column, which holds the file name, to 'DATE_FICHIER'
    df_my_domain = df_my_domain.rename(columns={'__match__': 'DATE_FICHIER'})
    # Keep only the date part of 'DATE_FICHIER' by splitting the file name at the '-'
    # (change '-' if your file names use another separator)
    df_my_domain['DATE_FICHIER'] = df_my_domain.DATE_FICHIER.str.split('-', expand=True)[1]
    # Convert the date string from the file name into a datetime
    # Warning: adapt the slice and format= to your file names;
    # [:8] with '%Y%m%d' fits 20180222, [:6] with '%y%m%d' would fit 180222
    df_my_domain['DATE_FICHIER'] = pd.to_datetime(df_my_domain.DATE_FICHIER.str[:8], format='%Y%m%d')
    # Store the result under the final domain name
    dfs['my_domain'] = df_my_domain
    return dfs
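
To check the transformation outside of the ETL, the steps above can be tried on a small hand-made DataFrame. This is only a sketch under assumptions: the helper name add_file_date and the sample data are hypothetical, the '__match__' column is assumed to hold file names such as maintenance-20180222 (four-digit year), and the separator is assumed to be '-'.

```python
import pandas as pd


def add_file_date(df, separator='-', date_format='%Y%m%d', date_length=8):
    """Hypothetical helper: turn '__match__' (assumed to hold the file name) into a datetime column."""
    df = df.rename(columns={'__match__': 'DATE_FICHIER'})
    # 'maintenance-20180222' -> '20180222' (second piece after the split, first 8 characters)
    date_part = df['DATE_FICHIER'].str.split(separator, expand=True)[1].str[:date_length]
    df['DATE_FICHIER'] = pd.to_datetime(date_part, format=date_format)
    return df


# Hypothetical sample standing in for two daily files loaded with match: true
sample = pd.DataFrame({
    'machine': ['A', 'B'],
    '__match__': ['maintenance-20180222', 'maintenance-20180223'],
})
result = add_file_date(sample)
print(result)
```

If your file names use a two-digit year (for example maintenance-180222), pass date_format='%y%m%d' and date_length=6 instead.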

Tada! You’re all set :)