Adjust data to your needs

Several tools are available to adapt data to your charts. The simplest way is to link the right domain to your chart. You can also use data manipulations in MongoDB, or built-in functions created by Toucan Toco called postprocesses.

Tutorial: Product Corporation

You can download the CSV file for our tutorial:

data-product-corporation.csv, assigned to the domain: data-product.

Note

You can use comments in your query to explain the different steps you’re doing.

Simply use this syntax: $comment: "This is a comment explaining what I'm doing in my query"
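
For example, here is a minimal sketch of a comment inside a query (the domain and filter values are illustrative):

query:
  domain: 'data-product'
  year: 2018
  $comment: "Keep only the rows for the current year"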

comments

YouPrep™

YouPrep™ is a graphical tool that lets you transform your datasets without having to write queries manually.

YouPrep™ will help you with most of the data transformations you need (filtering, computation, aggregation, text and date operations, reshaping, combining several datasets).

For information, we built YouPrep™ as an open-source project that can be extended by the community. Today, we translate user interactions into the Mongo query language. Tomorrow, we want to be able to translate them into other languages, ranging from our custom postprocess language (see below on this page) to SQL. The project is called WeaverBird.

For a detailed documentation on how to use the user interface, please see this dedicated page.

Switch to code mode

If you feel stuck on YouPrep™, you can switch to code mode for more advanced transformations:

Be careful, it is a one-way journey: you cannot go back from code mode to YouPrep™ (at least not at the moment). But you will not lose what you’ve already done with YouPrep™! When clicking on “Switch to code mode”, you can start from the query already generated by YouPrep™:

Dynamic queries

In your data query you can use variables instead of raw values. Those variables come from the requesters that you define in your stories or on your home (for views and dates). As you select a value in a requester, the corresponding variable takes the selected value. It’s very powerful to filter your data and make your queries lighter and more dynamic.

In YouPrep™, when you are configuring a step, some fields will display the variable button on hover, indicating that you can use a variable:

Variables button

Variables dropdown

In the screenshot above, you can see the list of all available variables:

  • App variables are those derived from your global app requesters, i.e. views and dates selectors (those requesters are always available in all your stories)
  • User variables are those derived from the current user viewing the app. It can be useful to use those variables when you need to filter your queries based on the user’s group, id, etc.
  • Story variables are those derived from the requesters that you define in a given story

Advanced variables

At the bottom of the variables dropdown you will find an “Advanced variable” button. This will allow you to define custom, more complex variables in code mode.

Advanced variable editor

In code mode the syntax looks like <variableScope>.<variableName>.

Here are the three variable scopes that you can use:

  • For a global variable (a view or a date), the variable scope will be appRequesters
  • For a story variable, the variable scope will be requestersManager
  • For a user variable, the variable scope will be user

For the variable names:

  • For a global variable, it will be either report (for views) or date (for dates)
  • For a story variable, it will be the name you gave to the requester when you configured it
  • For a user variable, it will be the name of the user attribute. By default, you can access the Toucan Toco attributes groups (whose value is the list of the Toucan user groups the user belongs to) and username (the user login email). But you may also expose some custom attributes derived from your identity provider (see the examples just below).
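
For example, here is how full variable paths can look in code mode (the story requester name country is a hypothetical example):

appRequesters.report       # the view selected in the global views requester
appRequesters.date         # the date selected in the global dates requester
requestersManager.country  # a story requester that you named 'country'
user.username              # the login email of the current user
user.groups                # the list of Toucan user groups the user belongs to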

Complete Object Mode

When defining a requester (on the home or in your stories), you have the option to use all the columns of your underlying dataset. In such a case, to access any of the columns in your queries, the syntax looks like <variableScope>.<variableName>.<someField>

So for example if you use views with this underlying dataset:

city country region
Paris France Europe
Amsterdam Netherlands Europe
Boston USA North America
Toronto Canada North America

You can make your queries dynamic depending on either the city (appRequesters.report.city), the country (appRequesters.report.country) or the region (appRequesters.report.region).

Ternary syntax: conditionals

You can use a powerful syntax allowing you to make a variable value depend on another. For example, let’s say that you want your variable to take the value 1 if some requester value is ‘FILTER ME’, else the variable should take the value 0.

Then you can write:

requestersManager.yourVariable == 'FILTER ME' ? 1 : 0

The syntax is <condition to verify> ? <value if true> : <value if false>

You can use any other JavaScript comparison operator like !=, >, <, >= or <=
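
For example, with a numeric story requester (the requester name threshold is a hypothetical example):

requestersManager.threshold >= 100 ? 'high' : 'low'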

Do not filter in some conditions: the use of ‘__VOID__’

In some (or many) cases, you may not want to apply some filtering based on the value of a variable. The most frequent use case is when you have “ALL” values in your requesters (“All countries”, “All products”…). The underlying data would look like this:

country
All countries
France
Netherlands
USA
Canada

Then in your queries, let’s say that you need to first filter your data before aggregating it. For any country selected, you want to apply a filter to keep only the country rows before aggregating. But when “All countries” is selected, you do not want to apply any filter as you want to aggregate all the countries rows.

So how to do that? In Toucan, you can use a special value, which is ‘__VOID__’. When a variable takes this value, it will just not be taken into account. So in our example it’s very useful, because it allows us to express exactly what we need with a ternary syntax (see section just above):

appRequesters.report === 'All countries' ? '__VOID__' : appRequesters.report

Meaning: if the value selected is “All countries”, then the variable should be ‘__VOID__’ and no filter will apply, else the variable should take the value selected (any country).

Variables in code mode

Note that if you do not use YouPrep™ and you want to use variables in code mode, you must add the variable delimiters <%= and %> in your queries or chart configuration. It looks like this: <%= appRequesters.report %>
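
For example, here is a sketch of a dynamic Mongo $match in code mode (the story requester name country is a hypothetical example):

query: [
  $match:
    domain: 'my_domain'
    country: '<%= requestersManager.country %>'
]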

See this detailed documentation.

How to play with data requests in code mode

While editing a chart, you can select Data request editor (next to Chart parameters) to display the data request editor.

Data request editor

It is possible to do a little postprocess in the data request. This will allow you to do some massaging on the data: you can clean, filter, aggregate, transform and compute on your data (see the dedicated sections below).

These functions can be found inside the utils/postprocess folder in toucan-data-sdk.

👉 You can use postprocess inside the data request editor at the same level as the query block. Here is an example for a roll_up process:

query:
  domain:'my_domain'
postprocess:[
  roll_up:
    levels : [0,1,2]
    groupby_vars: ['colA', 'colB']
    var_name: 'my_var_name'
]

👉 A postprocess can contain multiple steps; make sure to separate them with a ,

postprocess:[
  roll_up:
    levels : [0,1,2]
    groupby_vars: ['colA', 'colB']
    var_name: 'my_var_name'
,
  argmax:
    column: 'year'
]

Postprocess: Clean

add_missing_row

Add missing rows based on reference values.

postprocess:[
  add_missing_row:
    id_cols: ['Col_name_1', 'Col_name_2']
    reference_col: 'Reference_col_name'
    complete_index: ['A', 'B', 'C']
    method: 'between'
    cols_to_keep: ['Col_name_3']
]

mandatory :

  • id_cols (list of str): names of the columns used to create each group
  • reference_col (str): name of the column used to identify missing rows

optional :

  • complete_index (list or dict): a list of values used to add missing rows, e.g. ['A', 'B', 'C']. It can also be a dict to declare a date range. By default, all values of reference_col are used.
  • method (str): by default all missing rows are added. The possible values are :
    • "between" : add missing rows having their value between the min and max values of each group,
    • "between_and_after" : add missing rows having their value bigger than the min value of each group,
    • "between_and_before" : add missing rows having their value smaller than the max value of each group.
  • cols_to_keep (list of str): names of other columns to keep, linked to the reference_col.
Example 1

Input:

id_cols reference_col
2016 A
2016 B
2016 C
2017 A
2017 C
data:
  query:
    domain: "my_domain"
  postprocess:[
    add_missing_row:
      id_cols: ['id_cols']
      reference_col: 'reference_col'
  ]

Output:

id_cols reference_col
2016 A
2016 B
2016 C
2017 A
2017 B
2017 C
Example 2

Input:

id_cols reference_col
2016 B
2017 C
data:
  query:
    domain: "my_domain"
  postprocess:[
    add_missing_row:
      id_cols: ['id_cols']
      complete_index:['A', 'B', 'C']
      reference_col: 'reference_col'
  ]

Output:

id_cols reference_col
2016 A
2016 B
2016 C
2017 A
2017 B
2017 C

cast

Change the type of a column

postprocess: [
  cast:
    column: 'my_column'
    type: 'int'
]

mandatory :

  • column (str): name of the column to convert
  • type (str): output type. It can be :
    • "int" : integer type
    • "float" : general number type
    • "str" : text type

optional :

  • new_column (str): name of the output column. By default, the column argument is modified in place.
Example

Input :

Column 1 Column 2 Column 3
‘one’ ‘2014’ 30.0
‘two’ 2015.0 ‘1’
3.1 2016 450
postprocess: [
  cast:
    column: 'Column 1'
    type: 'str'
,
  cast:
    column: 'Column 2'
    type: 'int'
,
  cast:
    column: 'Column 3'
    type: 'float'
]

Output:

Column 1 Column 2 Column 3
‘one’ 2014 30.0
‘two’ 2015 1.0
‘3.1’ 2016 450.0

convert_datetime_to_str

Convert a datetime object into a string, according to the format of your choice.
postprocess: [
  convert_datetime_to_str:
    column: 'my_date_column'
    format: 'my_format'
]

mandatory :

  • column (str): name of the datetime column to convert
  • format (str): the date format of the output strings (e.g. '%Y-%m')

optional :

  • new_column (str): name of the output column. By default, the column argument is modified in place.
Example

Input :

label value
France 2017-03-22 15:16:00
Europe 2016-03-22 15:16:00
postprocess: [
  convert_datetime_to_str:
    column: 'value'
    format: '%Y-%m'
]

Output :

label value
France 2017-03
Europe 2016-03

change_date_format

Change a date column from one format to another

postprocess: [
  change_date_format:
    column: 'my_date_column'
    output_format: '%Y-%m-%d'
]

mandatory :

  • column (str): name of the date column to convert
  • output_format (str): the date format of the output (e.g. '%Y-%m')

optional :

  • input_format (str): the date format of the input column, if it needs to be specified
  • new_column (str): name of the output column. By default, the column argument is modified in place.

Example 1

Input :

label date
France 2017-03-22
Europe 2016-03-22
postprocess: [
  change_date_format:
    column: 'date'
    input_format: '%Y-%m-%d'
    output_format: '%Y-%m'
]

Output :

label date
France 2017-03
Europe 2016-03
Example 2

Input :

label date
France 2017-03-22
Europe 2016-02-22
postprocess: [
  change_date_format:
    column: 'date'
    new_column: 'month'
    input_format: '%Y-%m-%d'
    output_format: '%B'
]

Output :

label date month
France 2017-03-22 March
Europe 2016-02-22 February

fillna

Will fill in missing values with a specific value.

data:
  query:
    domain: "my_domain"
  postprocess:[
    fillna:
      column: 'my_value'
      value: 0
  ]

mandatory :

  • column (str): name of the column in which missing values should be replaced
  • value : value replacing missing values
Example 1

Input:

variable wave year my_value
toto wave 1 2014 300
toto wave 1 2015  
toto wave 1 2016 450
data:
  query:
    domain: "my_domain"
  postprocess:[
    fillna:
      column: 'my_value'
      value: 0
  ]

Output:

variable wave year my_value
toto wave 1 2014 300
toto wave 1 2015 0
toto wave 1 2016 450
Example 2

If column doesn’t exist in the dataframe, it will be added and filled with value:

Input:

variable wave year my_value
toto wave 1 2014 300
toto wave 1 2015  
toto wave 1 2016 450
data:
  query:
    domain: "my_domain"
  postprocess:[
    fillna:
      column: 'other_column'
      value: 50
  ]

Output:

variable wave year my_value other_column
toto wave 1 2014 300 50
toto wave 1 2015   50
toto wave 1 2016 450 50

rename

Change the label of a value or a column within your data source.

postprocess: [
  rename:
    values:
      'value_to_modify':
        'en': 'new_value'
        'fr': 'new_value'
    columns:
      'column_to_rename':
        'en': 'new_value'
        'fr': 'new_value'
]

⚠️ You have to rename in both ‘en’ and ‘fr’, otherwise it will not work.

Note: The rename postprocess accepts a fallback_locale parameter that will be used as a default locale if locale is not found in translations.

Example

Input :

label value
France 100
Europe wo France 500
postprocess: [
  rename:
    values:
      'Europe wo France':
        'en': 'Europe excl. France'
        'fr': 'Europe excl. France'
    columns:
      'value':
        'en': 'revenue'
        'fr': 'revenue'
]

Output :

label revenue
France 100
Europe excl. France 500
Example 2

Input :

label value
France 100
Europe wo France 500
postprocess: [
  rename:
    values:
      'Europe wo France':
        'en': 'Europe excl. France'
        'fr': 'Europe sauf. France'
    columns:
      'value':
        'en': 'revenue'
        'fr': 'revenus'
    locale: 'it'
    fallback_locale: 'en'
]

Output :

label revenue
France 100
Europe excl. France 500

replace

Change the label of a value or a column within your data source (similar to rename, but without the notion of locale).

postprocess: [
  replace:
    column: "column_to_modify"
    new_column: "new_column"
    to_replace:
      "old_value": "new_value"
      "old_value_2": "new_value_2"
]

mandatory :

  • column (str): name of the column to modify.
  • to_replace (dict): a dict whose keys are the old values and whose values are their replacements.

optional :

  • new_column (str): name of the output column. By default, the column argument is modified in place.
Example 1

Input :

article rating
book 1
puzzle 3
food 5

We want to split the ratings in three categories: “good”, “average” and “poor”.

postprocess: [
  replace:
    column: "rating"
    new_column: "rating_category"  # create a new column with replaced data
    to_replace:
      1: "poor"
      2: "poor"
      3: "average"
      4: "good"
      5: "good"
]

Output :

article rating rating_category
book 1 poor
puzzle 3 average
food 5 good
Example 2

Input :

article comment
book I didnt recieve the book
puzzle Couldnt acheive it, fuuuck
food Fucking yummy!

We fix some common orthographic mistakes using regular expressions (regex):

postprocess: [
  replace:
    column: "comment"
    to_replace:
      "recieve": "receive"
      "acheive": "achieve"
      "([fF])u+ck": "\g<1>*ck"  # censoring :o
    regex: true
]

Output :

article comment
book I didnt receive the book
puzzle Couldnt achieve it, f*ck
food F*cking yummy!

Postprocess: Filter

argmax and argmin

Keep only the row(s) containing the maximal (or minimal) value of a column.

👉 E.g. query the dataframe rows with the latest (or oldest) date.

postprocess:[
  argmax:
    column: 'my_column'
]

mandatory :

  • column (str): name of the column in which to keep the maximal (or minimal) value
Example 1

Input:

variable wave year value
toto wave 1 2014 300
toto wave 1 2015 250
toto wave 1 2016 450
postprocess:[
  argmax:
    column: 'year'
]

Output:

variable wave year value
toto wave 1 2016 450
Example 2

Input:

variable wave year value
toto wave 1 2014 300
toto wave 1 2015 250
toto wave 1 2016 450
postprocess:[
  argmin:
    column: 'year'
]

Output:

variable wave year value
toto wave 1 2014 300

drop_duplicates

Drop duplicate rows.
postprocess: [
  drop_duplicates:
    columns: ['col1', 'col2']
]

mandatory :

  • columns : columns to consider when identifying duplicates (set to null to use all the columns)
Example

Input:

name country year
toto France 2014
titi England 2015
toto France 2014
toto France 2016
postprocess: [
  drop_duplicates:
    columns: null
]

Output:

name country year
toto France 2014
titi England 2015
toto France 2016

postprocess: [
  drop_duplicates:
    columns: ['name', 'country']
]

Output:

name country year
toto France 2014
titi England 2015

query

Filter a dataset under a condition

👉 Similar to $match step in Mongo

postprocess:[
    query: 'value > 1'
]
Example

Input:

variable wave year value
toto wave 1 2014 300
toto wave 1 2015 250
toto wave 1 2015 100
toto wave 1 2016 450
postprocess:[
    query: 'value > 350'
]

Output:

variable wave year value
toto wave 1 2016 450
postprocess:[
    query: 'year == 2015'
]

Output:

variable wave year value
toto wave 1 2015 250
toto wave 1 2015 100

if_else

Transform a part of your dataset according to certain conditions: a classic if… then… else… option

if_else:
  if: <query>
  then: <string, number, boolean or new postprocess>
  else: <string, number, boolean or new postprocess>
  new_column: <the new column name>

mandatory :

  • if (str): string representing the query to filter the data (same as the one used in the postprocess ‘query’). See the pandas doc for more information
  • then:
    • str, int, float: a row value to use for the new column
    • dict or list of dicts: postprocess functions, to have more logic (see examples)
  • new_column (str): name of the column containing the result.

optional :

  • else: same as then but for the non-filtered part. If not set, the non-filtered part won’t have any values for the new column
Example

Input :

country city clean the rating
France Paris -1 3
Germany Munich 4 5
France Nice 3 4
Hell HellCity -10 0
postprocess: [
  if_else:
    if: '`the rating` % 2 == 0'
    then:
      postprocess: 'formula'
      formula: '("the rating" + clean) / 2'
    else: 'Hey!'
    new_column: 'new'
]

Output:

country city clean the rating new
France Paris -1 3 Hey!
Germany Munich 4 5 Hey!
France Nice 3 4 3.5
Hell HellCity -10 0 -5.0

top

Get the top or flop N results based on a column value. You can also perform the selection within groups.

postprocess:[
  top:
      value: "column_name"
      limit: 10
      order: 'asc'
]

mandatory :

  • value (str): name of the column on which you will rank the results.
  • limit (int): Number to specify the N results you want to retrieve from the sorted values.
    • Use a positive number x to retrieve the first x results.
    • Use a negative number -x to retrieve the last x results.

optional :

  • order (str): "asc" or "desc" to sort by ascending or descending order. By default : "asc".
  • group (str, list of str): name(s) of columns on which you want to perform the group operation.
Example 1

Input :

variable Category value
lili 1 50
lili 1 20
toto 1 100
toto 1 200
toto 1 300
lala 1 100
lala 1 150
lala 1 250
lala 2 350
lala 2 450
postprocess:[
  top:
    value: 'value'
    limit: 4
    order: 'asc'
]

Output:

variable Category value
lala 1 250
toto 1 300
lala 2 350
lala 2 450
Example 2

Input :

variable Category value
lili 1 50
lili 1 20
toto 1 100
toto 1 200
toto 1 300
lala 1 100
lala 1 150
lala 1 250
lala 2 350
lala 2 450
postprocess:[
  top:
      group: ["Category"]
      value: 'value'
      limit: 2
      order: 'asc'
]

Output:

variable Category value
lala 1 250
toto 1 300
lala 2 350
lala 2 450

top_group

Get the results based on a column value and a function that aggregates the input, keeping only the N top or flop groups. The result is composed of all the lines corresponding to the top groups.

postprocess: [
  top_group:
    "group": ["Category"]
    "value": 'value'
    "aggregate_by": ["variable"]
    "limit": 2
    "order": "desc"
]

mandatory :

  • value (str): name of the column on which you will rank the results.
  • limit (int): Number to specify the N results you want to retrieve from the sorted values.
    • Use a positive number x to retrieve the first x results.
    • Use a negative number -x to retrieve the last x results.
  • aggregate_by (list of str): name(s) of the columns you want to aggregate

optional :

  • order (str): "asc" or "desc" to sort by ascending or descending order. By default : "asc".
  • group (str, list of str): name(s) of columns on which you want to perform the group operation.
  • function : the aggregation function to use over the group columns
Example

Input :

variable Category value
lili 1 50
lili 1 20
toto 1 100
toto 1 200
toto 1 300
lala 1 100
lala 1 150
lala 1 250
lala 2 350
lala 2 450
postprocess: [
  top_group:
    "group": ["Category"]
    "value": 'value'
    "aggregate_by": ["variable"]
    "limit": 2
    "order": "desc"
]

Output :

variable Category value
toto 1 100
toto 1 200
toto 1 300
lala 1 100
lala 1 150
lala 1 250
lala 2 350
lala 2 450

filter_by_date

Filter a dataset according to a date range specification

postprocess:[
  filter_by_date:
      date_col: "my_date_column"
      start: "2018-01-01"
      stop: "2018-01-04"
      date_format: "%Y-%m-%d"
]

mandatory :

  • date_col: the name of the dataframe’s column to filter on

optional :

  • date_format: expected date format in column date_col (default is %Y-%m-%d)
  • start: if specified, lower bound (included) of the date range
  • stop: if specified, upper bound (excluded) of the date range
  • atdate: if specified, the exact date we’re filtering on

Even though start, stop and atdate are individually not required, you must use one of the following combinations to specify the date range to filter on:

  • atdate: keep all rows matching this date exactly,
  • start: keep all rows matching this date onwards (bound included),
  • stop: keep all rows matching dates before this one (bound excluded),
  • start and stop: keep all rows between start (included) and stop (excluded).

Besides standard date formats, you can also:

  • use the following symbolic names (TODAY, YESTERDAY and TOMORROW),
  • use offsets with either (datestr) + OFFSET or (datestr) - OFFSET syntax. In this case, OFFSET must be understandable by pandas.Timedelta (cf. http://pandas.pydata.org/pandas-docs/stable/timedeltas.html) and datestr must be wrapped with parentheses (e.g. (2018-01-01) + 2d),
  • use a combination of both symbolic names and offsets (e.g. (TODAY) + 2d).
from ‘2018-01-31’ onwards

Input :

date value
2018-01-28 0
2018-01-28 1
2018-01-29 2
2018-01-30 3
2018-01-31 4
2018-02-01 5
2018-02-02 6
postprocess:[
  filter_by_date:
    date_col: 'date'
    date_fmt: '%Y-%m-%d'
    start: '2018-01-31'
]

Output:

date value
2018-01-31 4
2018-02-01 5
2018-02-02 6
from ‘2018-01-29’ + 2days onwards

Input :

date value
2018-01-28 0
2018-01-28 1
2018-01-29 2
2018-01-30 3
2018-01-31 4
2018-02-01 5
2018-02-02 6
postprocess:[
  filter_by_date:
    date_col: 'date'
    date_fmt: '%Y-%m-%d'
    start: '(2018-01-29) + 2days'
]

Output:

date value
2018-01-31 4
2018-02-01 5
2018-02-02 6
from ‘2018-01-29’ until ‘2018-02-01’

Input :

date value
2018-01-28 0
2018-01-28 1
2018-01-29 2
2018-01-30 3
2018-01-31 4
2018-02-01 5
2018-02-02 6
postprocess:[
  filter_by_date:
    date_col: 'date'
    date_fmt: '%Y-%m-%d'
    start: '2018-01-29'
    stop: '2018-02-01'
]

Output:

date value
2018-01-29 2
2018-01-30 3
2018-01-31 4
everything until ‘2018-02-01’

Input :

date value
2018-01-28 0
2018-01-28 1
2018-01-29 2
2018-01-30 3
2018-01-31 4
2018-02-01 5
2018-02-02 6
postprocess:[
  filter_by_date:
    date_col: 'date'
    date_fmt: '%Y-%m-%d'
    stop: '2018-02-01'
]

Output:

date value
2018-01-28 0
2018-01-28 1
2018-01-29 2
2018-01-30 3
2018-01-31 4
only ‘2018-01-28’

Input :

date value
2018-01-28 0
2018-01-28 1
2018-01-29 2
2018-01-30 3
2018-01-31 4
2018-02-01 5
2018-02-02 6
postprocess:[
  filter_by_date:
    date_col: 'date'
    date_fmt: '%Y-%m-%d'
    atdate: '2018-01-28'
]

Output:

date value
2018-01-28 0
2018-01-28 1
from today onwards (assume today is 2018-01-29)

Input :

date value
2018-01-28 0
2018-01-28 1
2018-01-29 2
2018-01-30 3
2018-01-31 4
2018-02-01 5
2018-02-02 6
postprocess:[
  filter_by_date:
    date_col: 'date'
    date_fmt: '%Y-%m-%d'
    start: 'TODAY'
]

Output:

date value
2018-01-29 2
2018-01-30 3
2018-01-31 4
2018-02-01 5
2018-02-02 6
from today-2days until today+2days (assume today is 2018-01-29)

Input :

date value
2018-01-28 0
2018-01-28 1
2018-01-29 2
2018-01-30 3
2018-01-31 4
2018-02-01 5
2018-02-02 6
postprocess:[
  filter_by_date:
    date_col: 'date'
    date_fmt: '%Y-%m-%d'
    start: '(TODAY) - 2d'
    stop: '(TODAY) + 2d'
]

Output:

date value
2018-01-28 0
2018-01-28 1
2018-01-29 2
2018-01-30 3

Postprocess: Aggregate

groupby

Aggregate values by groups.

postprocess: [
  groupby:
    group_cols: ['colname_1', 'colname_2']
    aggregations:
      'value_1': 'sum'
      'value_2': 'mean'
]

mandatory :

  • group_cols (list of str): name(s) of columns used to group data
  • aggregations (dict) : a dictionary whose keys are the value columns to aggregate and whose values are the aggregation functions to use. Available aggregation functions:
    • 'sum'
    • 'mean'
    • 'median'
    • 'prod' (product)
    • 'std' (standard deviation)
    • 'var' (variance)
Example 1

Input:

ENTITY YEAR VALUE_1 VALUE_2
A 2017 10 3
A 2017 20 1
A 2018 10 5
A 2018 30 4
B 2017 60 4
B 2017 40 3
B 2018 50 7
B 2018 60 6
postprocess: [
  groupby:
    group_cols: ['ENTITY', 'YEAR']
    aggregations:
      'VALUE_1': 'sum'
      'VALUE_2': 'mean'
]

Output:

ENTITY YEAR VALUE_1 VALUE_2
A 2017 30 2.0
A 2018 40 4.5
B 2017 100 3.5
B 2018 110 6.5

roll_up

Creates aggregates following a given hierarchy.

postprocess:[
  roll_up:
    levels: ["col_name_1", "col_name_2"]
    groupby_vars: ["value_1", "value_2"]
    extra_groupby_cols: # (list(str)) optional: other columns used to group in each level.
    var_name: "output_var"
    value_name:  "output_value"
    agg_func: "sum"
    drop_levels: ["level_1", "level_2"]
]

mandatory :

  • levels (list of str): names of the columns composing the hierarchy (from the top to the bottom level).
  • groupby_vars (list of str): names of the columns with values to aggregate.

optional :

  • extra_groupby_cols (list of str): other columns used to group in each level.
  • var_name (str) : name of the result variable column. By default, “type”.
  • value_name (str): name of the result value column. By default, “value”.
  • agg_func (str): name of the aggregation operation. By default, “sum”.
  • drop_levels (list of str): the names of the levels that you may want to discard from the output.
Example

Input :

Region City Population
Idf Panam 200
Idf Antony 50
Nord Lille 20
postprocess:[
  roll_up:
    levels: ["Region", "City"]
    groupby_vars: "Population"
]

Output:

Region City Population value type
Idf Panam 200 Panam City
Idf Antony 50 Antony City
Nord Lille 20 Lille City
Idf NaN 250 Idf Region
Nord NaN 20 Nord Region

combine_columns_aggregation

Aggregates data based on all possible combinations of values found in specified columns.

postprocess:[
  combine_columns_aggregation:
    id_cols: ['Col_name_1', "Col_name_2"]
    cols_for_combination:
      'Col_name_3': 'All_3'
      'Col_name_4': 'All_4'
    agg_func: 'sum'
]

mandatory :

  • id_cols (list of str): name of the columns used to create each group
  • cols_for_combination (dict): name of the input columns for combination as key and their output label as value

optional :

  • agg_func (string or dict or list): Use for aggregating the data. By default "sum"
Example

Initial data:

Year City Category Value
2017 Paris A 5
  Nantes A 4
  Paris B 7
2016 Nantes A 10
postprocess:[
  combine_columns_aggregation:
    id_cols: ['Year']
    cols_for_combination:
      'City': 'All Cities'
      'Category': 'All'
]

Output:

Year City Category Value
2017 Paris A 5
  Nantes A 4
  Paris B 7
  All Cities A 9
  All Cities B 7
  Nantes All 4
  Paris All 12
  All Cities All 16
2016 Nantes A 10
  All Cities A 10
  Nantes All 10
  All Cities All 10

compute_ffill_by_group

Gives the result of a groupby with a ffill, without the performance issue. The ffill method propagates the last valid value forward to the next values.

postprocess:[
  compute_ffill_by_group:
    id_cols: ["id_cols"]
    reference_cols: ["reference_cols"]
    value_col: ["value_col"]
]

mandatory :

  • id_cols (str or list of str): name(s) of the columns used to create each group.
  • reference_cols (str or list of str): name(s) of the columns used to sort.
  • value_col (str or list of str): name(s) of the columns to fill.
Example

Initial data:

id_cols reference_cols value_col
A 1 2
A 2 5
A 3 null
B 1 null
B 2 7
postprocess:[
  compute_ffill_by_group:
    id_cols: "id_cols"
    reference_cols: "reference_cols"
    value_col: "value_col"
]

Output:

id_cols reference_cols value_col
A 1 2
A 2 5
A 3 5
B 1 null
B 2 7

Postprocess: Transform Structure

melt

A melt will transform a dataset by creating a column “variable” and a column “value”. This function is useful to transform a dataframe into a format where one or more columns are identifier variables, while all other columns, considered measured variables (value_vars), are “unpivoted” to the row axis, leaving just two non-identifier columns, “variable” and “value”.

postprocess: [
  melt:
    id: ['my_column_1', 'my_column_2']
    value: ['info_1', 'info_2', 'info_3']
    dropna: true
]

mandatory :

  • id (list of str): names of the columns that must be kept in column.
  • value (list of str): names of the columns that will be transformed in long format (in rows).

optional :

  • dropna (boolean): It allows you to drop missing values. By default : true
Example 1

Input :

my_label my_value my_column_1 my_column_2 info_1 info_2 info_3
toto 10 S45 Lalaland 10 20 None
postprocess: [
  melt:
    id: ['my_label', 'my_value', 'my_column_1', 'my_column_2']
    value: ['info_1', 'info_2', 'info_3']
    dropna: true
]

Output :

my_label my_value my_column_1 my_column_2 variable value
toto 10 S45 Lalaland info_1 10
toto 10 S45 Lalaland info_2 20
Example 2

Input :

my_label my_value my_column_1 my_column_2 info_1 info_2 info_3
toto 10 S45 Lalaland 10 20 None
postprocess: [
  melt:
    id: ['my_label', 'my_value', 'my_column_1', 'my_column_2']
    value: ['info_1', 'info_2', 'info_3']
]

Output :

my_label my_value my_column_1 my_column_2 variable value
toto 10 S45 Lalaland info_1 10
toto 10 S45 Lalaland info_2 20
toto 10 S45 Lalaland info_3 None

pivot

Reverse operation of melting. Useful for configuring evolutions.

postprocess:[
    pivot:
        index: ['variable','wave']
        column: 'year'
        value: 'value'
]

mandatory :

  • index (list): names of index columns.
  • column (str): column name to pivot on
  • value (str): column name containing the value to fill the pivoted df
Example

Input :

variable wave year value
toto wave 1 2014 300
toto wave 1 2015 250
toto wave 1 2016 450
postprocess:[
    pivot:
        index: ['variable','wave']
        column: 'year'
        value: 'value'
]

Output:

variable wave 2014 2015 2016
toto wave 1 300 250 450

pivot_by_group

Pivot a dataframe by group of variables.

postprocess:[
  pivot_by_group :
    id_cols: ['type']
    variable: 'variable'
    value: 'montant'
    new_columns: ['value', 'variation']
    groups:
      'Group 1' : ['var1', 'var1_evol']
      'Group 2' : ['var2', 'var2_evol']
  ]

mandatory :

  • variable (str): name of the column used to create the groups.
  • value (str): name of the column containing the value to fill the pivoted df.
  • new_columns (list of str): names of the new columns.
  • groups (dict): names of the groups with their corresponding variables. ⚠️ The list of variables must be in the same order as new_columns

optional :

  • id_cols (list of str) : names of other columns to keep, default None.
Example

Input

type variable montant
A var1 5
A var1_evol 0.3
A var2 6
A var2_evol 0.2
postprocess:[
  pivot_by_group :
    id_cols: ['type']
    variable: 'variable'
    value: 'montant'
    new_columns: ['value', 'variation']
    groups:
      'Group 1' : ['var1', 'var1_evol']
      'Group 2' : ['var2', 'var2_evol']
  ]

Output

type variable value variation
A Group 1 5 0.3
A Group 2 6 0.2

sort

Sort the data by the value in specified columns.

postprocess: [
  sort:
    columns: ["my_column1", "my_column2"]
    order: 'asc'
]

mandatory :

  • columns (str or list of str): name(s) of the columns on which you want to sort.

optional :

  • order (str): "asc" or "desc" to sort by ascending or descending order. By default : "asc".
Example

Input:

variable value
A 220
B 200
C 300
D 100
postprocess:[
  sort:
      columns: ['value']
      order: 'asc'
]

Output :

variable value
D 100
B 200
A 220
C 300

two_values_melt

Transforms one or multiple columns into rows. Unlike the (pandas) melt function, two value columns can be returned by the function. For example, an evolution column and a price column.

postprocess:[
  two_values_melt:
    first_value_vars: ["avg", "total"]
    second_value_vars: ["evo_avg", "evo_total"]
    var_name: "type"
    value_name: "value"
]

mandatory :

  • first_value_vars (list of str): names of the columns corresponding to the first returned value column
  • second_value_vars (list of str): names of the columns corresponding to the second returned value column
  • var_name (str): name of the column containing the variable names
  • value_name (str): prefix of the two value columns (<value_name>_first / <value_name>_second)
Example

Input :

Region avg total evo_avg evo_total
A 50 100 1 4
B 40 250 2 5
postprocess:[
  two_values_melt:
    first_value_vars: ["avg", "total"]
    second_value_vars: ["evo_avg", "evo_total"]
    var_name: "type"
    value_name: "value"
]

Output:

Region type value_first value_second
A avg 50 1
A total 100 4
B avg 40 2
B total 250 5

waterfall

This postprocess step simplifies the data preparation for waterfall chart configurations.

postprocess:[
  waterfall:
    upperGroup:
      id: '<col>' # name of the column containing upperGroups unique IDs
      label: '<col>'  # name of the column containing the text displayed under each upperGroup bar (not required, defaults to id)
      groupsOrder: '<col>'  # name of the column used for ordering upperGroups (not required)
    insideGroup:
      id: '<col>'
      label: '<col>'
      groupsOrder: '<col>'
    date: '<col>' # name of the column containing the id of the period of each line
    value: '<col>' # name of the column containing the value of each line
    start:
      label: '<value>' # text displayed under the first column
      id: '<value>' # value that identifies lines of the first period in the date column
    end:
      label: '<value>'
      id: '<value>'
    breakdown: [list of filter labels]
]
Example

Input:

product_id played date ord category_id category_name
super clap 12 t1 1 clap Clap
clap clap 1 t1 10 clap Clap
tac 1 t1 1 snare Snare
super clap 10 t2 1 clap Clap
tac 100 t2 1 snare Snare
bom 1 t2 1 tom Tom
postprocess:[
  waterfall:
    upperGroup:
      id: 'category_id'
      label: 'category_name'
    insideGroup:
      id: 'product_id'
      groupsOrder: 'ord'
    date: 'date'
    value: 'played'
    start:
      label: 'Trimestre 1'
      id: 't1'
    end:
      label: 'Trimester 2'
      id: 't2'
]

Output:

value label variation groups type order
14 Trimestre 1 NaN NaN NaN NaN
-3 Clap -0.230769 clap parent NaN
-2 super clap -0.166667 clap child 1
-1 clap clap -1 clap child 10
99 Snare 99 snare parent NaN
99 tac 99 snare child 1
1 Tom inf tom parent NaN
1 bom inf tom child 1
111 Trimester 2 NaN NaN NaN NaN

Postprocess: Compute

add_aggregation_columns

Add aggregation columns

postprocess: [
  add_aggregation_columns:
    group_cols: ['colname_1', 'colname_2']
    aggregations:
      new_col1:
        value_1: 'sum'
      new_col2:
        value_2: 'mean'
]

mandatory :

  • group_cols (list of str): name(s) of columns used to group data
  • aggregations (dict of dict): dictionary mapping each new column to its corresponding aggregation dictionary. An aggregation dictionary has one key - the column you want to aggregate - and one value - the aggregation function, amongst sum, max, min, mean, median, prod (product), std (standard deviation), var (variance)…
Example

Input:

ENTITY YEAR VALUE_1 VALUE_2
A 2017 10 3
A 2017 20 1
A 2018 10 5
A 2018 30 4
B 2017 60 4
B 2017 40 3
B 2018 50 7
B 2018 60 6
postprocess: [
  add_aggregation_columns:
    group_cols: ['ENTITY', 'YEAR']
    aggregations:
      sum_value1:
        VALUE_1: 'sum'  # sum of `VALUE_1` put in `sum_value1` column
      max_value1:
        VALUE_1: 'max'  # max of `VALUE_1` put in `max_value1` column
      mean_value2:
        VALUE_2: 'mean'  # mean of `VALUE_2` put in `mean_value2` column
]

Output:

ENTITY YEAR VALUE_1 VALUE_2 sum_value1 max_value1 mean_value2
A 2017 10 3 30 20 2.0
A 2017 20 1 30 20 2.0
A 2018 10 5 40 30 4.5
A 2018 30 4 40 30 4.5
B 2017 60 4 100 60 3.5
B 2017 40 3 100 60 3.5
B 2018 50 7 110 60 6.5
B 2018 60 6 110 60 6.5

compute_cumsum

Return a cumsum by group.

postprocess:[
  compute_cumsum:
    id_cols: # (list): names of the columns used to create each group.
    reference_cols: # (list): names of the columns used to order the cumsum.
    value_cols: # (list): names of the columns to cumsum.
    new_value_cols: # (list): names of the new columns with the cumsum.
    cols_to_keep: # (list) optional: other columns to keep in the dataframe. This option can be used if there is only one row per group identified by the columns id_cols and reference_cols
]

mandatory :

  • id_cols (list of str): names of the columns used to create each group.
  • reference_cols (list of str): names of the columns used to order the cumsum.
  • value_cols (list of str): names of the columns to cumsum.
  • new_value_cols (list of str): names of the new columns with the cumsum.

optional :

  • cols_to_keep (list) : other columns to keep in the dataframe. This option can be used if there is only one row per group identified by the columns id_cols and reference_cols

Tips

For a YearToDate aggregation, the DATE column is the reference column and the YEAR column is an id column.

Example

Input :

Year Date Value
2017 01/01/2017 1
  02/01/2017 1
  03/01/2017 1
2016 01/01/2017 1
  02/01/2017 2

Configuration:

postprocess:[
  compute_cumsum:
    id_cols: ["Year"]
    reference_cols: ["Date"]
    value_cols: ["Value"]
    new_value_cols: ["Cumsum"]
]

Output:

Year Date Value Cumsum
2017 01/01/2017 1 1
  02/01/2017 1 2
  03/01/2017 1 3
2016 01/01/2017 1 1
  02/01/2017 2 3

absolute_values

Compute the absolute value of each value of a column

postprocess:[
  absolute_values:
    column: 'my_numbers'
    new_column: 'my_positive_numbers'
]

mandatory :

  • column (str): name of the column

optional :

  • new_column (str): name of the result column with absolute values. If not set, column will be replaced.
Example

Input :

Values
1.357
1.24
-0.589
-2
postprocess:[
  absolute_values:
    column: 'Values'
    new_column: 'Positive'
]

Output:

Values Positive
1.357 1.357
1.24 1.24
-0.589 0.589
-2 2

round_values

Round each value of a column

postprocess:[
  round_values:
    column: 'my_numbers'
    decimals: 2
    new_column: 'my_numbers_rounded'
]

mandatory :

  • column (str): name of the column
  • decimals (int): number of decimals

optional :

  • new_column (str): name of the result column with rounded values. If not set, column will be replaced.
Example

Input :

Values
1.357
1.24
-0.589
-2
postprocess:[
  round_values:
    column: 'Values'
    decimals: 1
    new_column: 'Rounded'
]

Output:

Values Rounded
1.357 1.3
1.24 1.2
-0.589 -0.6
-2 -2

formula

Execute basic math operations on columns

postprocess:[
  formula:
    new_column: 'diff'
    formula: 'MyFirstColumn-MySecondColumn'
]

mandatory :

  • new_column (str): name of the output column
  • formula (str): operation on columns. Use the names of the columns and these special characters:
    • "+" for addition
    • "-" for subtraction
    • "*" for multiplication
    • "/" for division
Example 1

Input :

variable valueA valueB valueC
toto 20 100 10
toto 30 200 10
toto 10 300 10
postprocess:[
  formula:
    new_column: 'valueD'
    formula: '(valueB + valueA ) / valueC'
]

Output:

variable valueA valueB valueC valueD
toto 20 100 10 12
toto 30 200 10 23
toto 10 300 10 31
Example 2

Input :

variable valueA valueB valueC
toto 20 100 10
toto 30 200 10
toto 10 300 10
postprocess:[
  formula:
    new_column: 'valueD'
    formula: 'valueB + valueA  / valueC'
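    # without parentheses, the division has precedence: the result is valueB + (valueA / valueC)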
]

Output:

variable valueA valueB valueC valueD
toto 20 100 10 102
toto 30 200 10 203
toto 10 300 10 301
Example 3

You can also use numbers in formulas:

Input :

variable valueA valueB
toto 20 100
toto 30 200
toto 10 300
postprocess:[
  formula:
    new_column: 'valueA'
    formula: 'valueA/10'
]

Output:

variable valueA valueB
toto 2 100
toto 3 200
toto 1 300

compute_evolution_by_criteria

This function answers the question: how has a value changed compared to a specific value?

postprocess:[
  compute_evolution_by_criteria:
    id_cols: # (list(str)): names of the columns used to create each group.
    value_col: # (str): name of the column containing the value to compare
    compare_to: # (str): the query identifying a specific set of values for comparison.
    method: # (str) optional: either abs for absolute values or pct for the evolution in percentage of the previous value
    offseted_suffix: # (str) optional: suffix of the offseted column. By default, _offseted.
    evolution_col_name: # (str) optional: name given to the evolution column. By default, evolution_computed.
    raise_duplicate_error: # (bool) optional: raise an error - if set to True - or a warning - if set to False - when the dataframe has duplicated values with the given id_cols.
    format: 'df' # Do not change it !!!
]
Example

Input :

id_cols value_col month
A 100 1
  250 12
B 300 1
  200 12

Configuration:

postprocess:[
  compute_evolution_by_criteria:
    id_cols: "id_cols"
    value_col: "value_col"
    compare_to: "month==12"
]

Output:

id_cols value_col month value_offseted evolution_computed
A 100 1 250 -150
  250 12 250 0
B 300 1 200 100
  200 12 200 0

compute_evolution_by_frequency

This function answers the question: how has a value changed on a weekly, monthly, or yearly basis?

postprocess:[
  compute_evolution_by_frequency:
    id_cols: # (list(str)): names of the columns used to create each group.
    date_col: # (str): name of the column containing the date.
    value_col: # (str): name of the column containing the value to compare
    freq: # (int/pd.DateOffset/pd.Series/dict): the frequency at which we calculate evolutions (anything that can be added to the pd.Series present in date_col). By default, 1. For example, as a dict:
      years: 1
      days: 1
      months: 1
    method: # (str) optional: either abs for absolute values or pct for the evolution in percentage of previous value
    offseted_suffix: #(str) optional: suffix of the offseted column. By default, _offseted.
    evolution_col_name: # (str) optional: name given to the evolution column. By default, evolution_computed.
    missing_date_as_zero: # (boolean) optional: add missing date with zero value.
    raise_duplicate_error: # (bool) optional: raise an error - if set to True - or a warning - if set to False- when the dataframe has duplicated values with the given id_cols.
    format: 'df' # Do not change it !!!
]

Tips

When you use missing_date_as_zero=False and format: 'df', since both the original dataframe index and the evolution series index match, the result of the function can be added as a new column on the original dataframe.

Example

Input :

id_cols value_col date_col
A 20 2010
  7 2011
B 200 2010
  220 2011
C 100 2011

Configuration:

postprocess:[
  compute_evolution_by_frequency:
    id_cols: "id_cols"
    date_col: "date_col"
    value_col: "value_col"
]

Output:

id_cols value_col date_col evolution
A 20 2010 null
  7 2011 -13
B 200 2010 null
  220 2011 20
C 100 2011 null

percentage

Creates a new column containing the percentage of a column's value within each group defined by the groupby logic

postprocess:[
  percentage:
    new_column: 'number_percentage'
    column: 'number'
    group_cols: ['sport']
]

mandatory :

  • column: name of the column you need the percentage of

optional :

  • group_cols (list): names of columns for the groupby logic
  • new_column (str): name of the output column. By default, the column argument is modified in place.
Example

Input:

gender sport number
male bicycle 17
female basketball 17
male basketball 3
female football 7
female running 30
male running 20
male football 21
female bicycle 17
postprocess:[
  percentage:
    new_column: 'number_percentage'
    column: 'number'
    group_cols: ['sport']
]

Output:

gender sport number number_percentage
male bicycle 17 50.0
female basketball 17 85.0
male basketball 3 15.0
female football 7 25.0
female running 30 60.0
male running 20 40.0
male football 21 75.0
female bicycle 17 50.0

rank

Create rank column(s) based on numeric values to be ranked. Independent ranking can be performed for different groups.

postprocess:[
  rank :
    value_cols: ['VALUE_1', 'VALUE_2']
    group_cols: ['ENTITY', 'YEAR']
    rank_cols_name: ['RANK_1', 'RANK_2']
  ]

mandatory :

  • value_cols (str or list(str)): name(s) of the columns to rank

optional :

  • group_cols (str or list(str)): name(s) of the column(s) used to create each group inside which independent ranking needs to be applied
  • rank_cols_name (str or list(str)): names of the added ranking columns. If not filled in, the ranking columns will be named after the value_cols with a ‘_rank’ suffix

  • method (str): method to use when encountering equal values:
    • 'min' (default): lowest rank in group
    • 'max': highest rank in group
    • 'average': average rank of group
    • 'first': ranks assigned in order the values appear in the series
    • 'dense': like ‘min’, but rank always increases by 1 between groups
  • ascending (bool): true (default) or false whether the rank should be determined based on ascending or descending order respectively
Example 1

Input :

ENTITY YEAR VALUE_1 VALUE_2
A 2017 10 3
A 2017 20 1
A 2018 10 5
A 2018 30 4
B 2017 60 4
B 2017 40 3
B 2018 50 7
B 2018 50 6
postprocess:[
  rank :
    value_cols: 'VALUE_1'
  ]

Output :

ENTITY YEAR VALUE_1 VALUE_2 VALUE_1_rank
A 2017 10 3 1
A 2017 20 1 3
A 2018 10 5 1
A 2018 30 4 4
B 2017 60 4 8
B 2017 40 3 5
B 2018 50 7 6
B 2018 50 6 6
Example 2

Input :

ENTITY YEAR VALUE_1 VALUE_2
A 2017 10 3
A 2017 20 1
A 2018 10 5
A 2018 30 4
B 2017 60 4
B 2017 40 3
B 2018 50 7
B 2018 50 6
postprocess:[
  rank :
    value_cols: 'VALUE_1'
    group_cols: 'YEAR'
    method: 'average'
  ]

Output :

ENTITY YEAR VALUE_1 VALUE_2 VALUE_1_rank
A 2017 10 3 1.0
A 2017 20 1 2.0
A 2018 10 5 1.0
A 2018 30 4 2.0
B 2017 60 4 4.0
B 2017 40 3 3.0
B 2018 50 7 3.5
B 2018 50 6 3.5
Example 3

Input :

ENTITY YEAR VALUE_1 VALUE_2
A 2017 10 3
A 2017 20 1
A 2018 10 5
A 2018 30 4
B 2017 60 4
B 2017 40 3
B 2018 50 7
B 2018 50 6
postprocess:[
  rank :
    value_cols: ['VALUE_1', 'VALUE_2']
    group_cols: ['ENTITY', 'YEAR']
    rank_cols_name: ['RANK_1', 'RANK_2']
  ]

Output :

ENTITY YEAR VALUE_1 VALUE_2 RANK_1 RANK_2
A 2017 10 3 1 2
A 2017 20 1 2 1
A 2018 10 5 1 2
A 2018 30 4 2 1
B 2017 60 4 2 2
B 2017 40 3 1 1
B 2018 50 7 1 2
B 2018 50 6 1 1

categories_from_dates

Create a new column of categories based on a date column. This function will gather into categories dates from the date column based on range steps.

For instance, the dates: [2018-01-02, 2018-01-06, 2018-01-15, 2018-01-16, 2018-01-20] with the steps: [2018-01-07, 2018-01-18] will give 3 categories:

  • First category: [2018-01-02, 2018-01-06]
  • Second category: [2018-01-15, 2018-01-16]
  • Third category: [2018-01-20]

mandatory :

  • date_col (str): name of the date column
  • new_column (str): name of the column for created categories
  • range_steps (list of str): a list of valid dates. Each date should be a string matching date_format and parsable by strptime, but some offset can also be added using (datestr) + OFFSET or (datestr) - OFFSET syntax. When using this syntax, OFFSET should be understandable by pandas.Timedelta (see http://pandas.pydata.org/pandas-docs/stable/timedeltas.html), and the w, week, month and year offset keywords are also accepted. Additionally, the following symbolic names are supported: TODAY, YESTERDAY, TOMORROW. NOTE: datestr MUST be wrapped with parentheses.

optional :

  • category_names (list of str): names of the categories to be created. This list must have a length equal to the length of range_steps plus one
  • date_format (str): expected date format in column date_col (see available formats)
Example

Input

date value
2018-01-28 1
2018-01-28 2
2018-01-29 3
2018-01-30 1
2018-02-01 6
2019-01-05 7
2019-01-06 7
2019-01-07 1
2019-01-04 1
postprocess:[
  categories_from_dates:
    date_col: 'date'
    new_column: 'categories'
    range_steps: ['(TODAY)-10days', '(TODAY)+1days']
    category_names: ['Old', 'Recent', 'Futur']
]

Output

Assuming today is 2019-01-06

date value categories
2018-01-28 1 Old
2018-01-28 2 Old
2018-01-29 3 Old
2018-01-30 1 Old
2018-02-01 6 Old
2019-01-05 7 Recent
2019-01-06 7 Futur
2019-01-07 1 Futur
2019-01-04 1 Recent

Postprocess: Text manipulation

lower, upper, capitalize, title, swapcase

Change case of a column

  • lower: Converts all characters of column to lowercase.
  • upper: Converts all characters of column to uppercase.
  • capitalize: Converts the first character to uppercase and the remaining to lowercase, for each line of column.
  • title: Converts the first character to uppercase and the remaining to lowercase, for each word of each line of column.
  • swapcase: Converts uppercase to lowercase and lowercase to uppercase, for each word of column.
postprocess:[
  lower :
    column: 'my_column'
    new_column: 'the_new_column'
  ]

mandatory :

  • column (str): name of the column to change case

optional :

  • new_column (str): name of the output column. By default the column arguments is modified.
Example : lower

Input :

entity year label
A 2017 France
A 2017 UK
A 2018 SPAIN
B 2017 ItAly
B 2018 germany
postprocess:[
  lower :
    column: 'label'
  ]

Output :

entity year label
A 2017 france
A 2017 uk
A 2018 spain
B 2017 italy
B 2018 germany
Example : upper

Input :

entity year label
A 2017 France
A 2017 UK
A 2018 SPAIN
B 2017 ItAly
B 2018 germany
postprocess:[
  upper :
    column: 'label'
  ]

Output :

entity year label
A 2017 FRANCE
A 2017 UK
A 2018 SPAIN
B 2017 ITALY
B 2018 GERMANY
Example : capitalize

Input :

entity year label
A 2017 France
A 2017 United KINGDOM
A 2018 SPAIN
B 2017 ItAly
B 2018 germany
postprocess:[
  capitalize :
    column: 'label'
  ]

Output :

entity year label
A 2017 France
A 2017 United kingdom
A 2018 Spain
B 2017 Italy
B 2018 Germany
Example : title

Input :

entity year label
A 2017 France
A 2017 United KINGDOM
A 2018 SPAIN
B 2017 ItAly
B 2018 germany
postprocess:[
  title :
    column: 'label'
  ]

Output :

entity year label
A 2017 France
A 2017 United Kingdom
A 2018 Spain
B 2017 Italy
B 2018 Germany
Example : swapcase

Input :

entity year label
A 2017 France
A 2017 United KINGDOM
A 2018 SPAIN
B 2017 ItAly
B 2018 germany
postprocess:[
  swapcase :
    column: 'label'
  ]

Output :

entity year label
A 2017 fRANCE
A 2017 uNITED kingdom
A 2018 sPAIN
B 2017 iTaLY
B 2018 GERMANY

concat

Concatenate text columns

postprocess:[
  concat:
    columns: ['country', 'year']
    new_column: 'the_new_column'
  ]

mandatory:

  • columns (list of str): names of the columns to concat
  • new_column (str): name of the output column.

optional:

  • sep (str): the separator
Example 1

Input :

entity year label
A 2017 France
A 2017 UK
A 2018 Spain
B 2017 Italy
B 2018 Germany
postprocess:[
  concat:
    columns: ['label', 'year']
    new_column: 'new_label'
  ]

Output :

entity year label new_label
A 2017 France France2017
A 2017 UK UK2017
A 2018 Spain Spain2018
B 2017 Italy Italy2017
B 2018 Germany Germany2018
Example 2

Input :

entity year label
A 2017 France
A 2017 UK
A 2018 Spain
B 2017 Italy
B 2018 Germany
postprocess:[
  concat:
    columns: ['label', 'year']
    new_column: 'new_label'
    sep:"-"
  ]

Output :

entity year label new_label
A 2017 France France-2017
A 2017 UK UK-2017
A 2018 Spain Spain-2018
B 2017 Italy Italy-2017
B 2018 Germany Germany-2018

isalpha, isnumeric

Type detection: check whether all characters in each string in a column are alphabetic/digits

postprocess:[
  isalpha:
    column: 'my_column'
    new_column: 'the_new_column'
  ]

mandatory:

  • column (str): name of the column to check

optional:

  • new_column (str): name of the output column. By default the column arguments is modified.
isalpha

Input :

entity year label
A 2017 France
A1 2017 UK
A2 2018 Spain
B 2017 Italy
B1 2018 Germany
postprocess:[
  isalpha :
    column: 'entity'
    new_column:'entity_check'
  ]

Output :

entity year label entity_check
A 2017 France True
A1 2017 UK False
A2 2018 Spain False
B 2017 Italy True
B1 2018 Germany False
isnumeric

Input :

entity year label
A 2017 France
A1 2017 UK
A2 2018a Spain
B 2017 Italy
B1 2018 Germany
postprocess:[
  isnumeric :
    column: 'year'
    new_column:'year_check'
  ]

Output :

entity year label year_check
A 2017 France True
A1 2017 UK True
A2 2018a Spain False
B 2017 Italy True
B1 2018 Germany True

(r)split

Split a column of strings in multiple columns

split and rsplit do the same thing except that rsplit splits starting from the end (from right to left).

postprocess:[
  split:
    column: 'my_column'
    sep: '-'
  ]

mandatory :

  • column (str): name of the column to split

optional :

  • new_columns (list of str): names of the output columns. By default the columns are named column_1, …, column_n
  • sep (str): the character used to split (by default the space character ' ')
  • limit (int): the maximum number of splits (by default there is no limit)
Basic example without `new_columns`

Input :

date
2018-01-02
2018-03-04
postprocess:[
  split :
    column: 'date'
    sep: '-'
  ]

Output :

date date_1 date_2 date_3
2018-01-02 2018 01 02
2018-03-04 2018 03 04
Basic example with `new_columns`

Input :

date
2018-01-02
2018-03-04
postprocess:[
  split :
    column: 'date'
    sep: '-'
    new_columns: ['year', 'month', 'day']
  ]

Output :

date year month day
2018-01-02 2018 01 02
2018-03-04 2018 03 04
Difference between `split` and `rsplit` with `limit`

Input :

date
2018-01-02
2018-03-04
postprocess:[
  split :
    column: 'date'
    sep: '-'
    limit: 1  # only one split will be done starting from the left
  ]

Output :

date date_1 date_2
2018-01-02 2018 01-02
2018-03-04 2018 03-04
postprocess:[
  rsplit :
    column: 'date'
    sep: '-'
    limit: 1  # only one split will be done starting from the right
  ]

Output :

date date_1 date_2
2018-01-02 2018-01 02
2018-03-04 2018-03 04

strip

Remove leading and trailing characters (such as spaces) from strings in a column

postprocess:[
  strip:
    column: 'my_column'
    new_column: 'the_new_column'
    to_strip:'-'
  ]

mandatory :

  • column (str): name of the column to strip
  • to_strip (str): set of leading and trailing characters to remove

optional :

  • new_column (str): name of the output column. By default the column arguments is modified.
Example

Input :

entity year label
A 2017 #France
A1 2017 UK##
A2 2018 Sp#ain
B 2017 Italy###
B1 2018 ##Germ#any
postprocess:[
  strip :
    column: 'label'
    to_strip: '#'
  ]

Output :

entity year label
A 2017 France
A1 2017 UK
A2 2018 Sp#ain
B 2017 Italy
B1 2018 Germ#any

Advanced usage: MongoDB aggregation pipeline

MongoDB offers advanced features for data processing. See the MongoDB documentation for a comprehensive introduction. The following exposes some common cases.

Note

The only difference with the Mongo aggregation framework is with the $sort operator, when it is used for sorting by multiple fields. This feature should be used when you want to keep the relative order of the values between sorts (i.e. a stable sort). Instead of passing it an object, it should receive an array of objects, like so:

chartOptions:
  data:
    query:[
      $match:
        domain: 'sugarcane_toi'
        Core: 1
        Plant: {$ne: 'BU'}
    ,
      $sort: [
        {'Date': -1}
        {'Country': 1}
      ]
    ]

Mongo Pipeline: the MongoDB aggregation pipeline lets you manipulate your data sources through a sequence of steps, such as $match (filtering), $sort (ordering), $group (aggregating) and $project (reshaping), detailed below.

The steps can be used in the order you want.

To use the MongoDB aggregation framework, query is not an object but an array. Here is an example of a complex MongoDB query:

chartOptions:
  data:
    query:[
      $match:
        domain: 'sugarcane_toi'
        Core: 1
        Plant: {$ne: 'BU'}
    ,
      $sort: { 'Date': -1}
    ,
      $group:
        _id: { KPI_short:'$KPI_short', Plant: '$Plant', DayType:'$DayType'}
        Actual: {$first: '$Actual'}
        Plan: {$first: '$Plan'}
        precision: {$first: '$precision'}
        unit: {$first: '$unit'}
    ,
      $project:
        Actual: 1
        Plan: 1
        unit: 1
        precision: 1
        Plant: '$_id.Plant'
        KPI_short: '$_id.KPI_short'
        DayType: '$_id.DayType'
        evolution: { $subtract: ['$Plan', '$Actual'] }
    ]

Step $match

In the $match step, you can filter the data by specifying the values you want to keep or exclude. Several operators help with that filtering: $ne, $in, $nin, $gt and $lt.

  • $ne: exclude one value
  • $nin: exclude multiple values
  • $in: include multiple values
  • $gt: keep values strictly greater than the given one (see the numeric example below)
  • $lt: keep values strictly lower than the given one

Example

country value
France 100
France 500
Allemagne 500
France 500
France 500
Allemagne 500
Maroc 200
data:
  query:[
    $match:
      domain: 'my_domain'
      country: $in: ['France', 'Maroc']
  ]
country value
France 100
France 500
France 500
France 500
Maroc 200

Example with all filtering options:

data:
  query:[
    $match:
      domain: 'my_domain'
      column_1: 'value_in_column_1'
      column_2: $ne: "value_in_column_2"
      column_3: $nin: ["value_1", "value_2"]
      column_4: $in: ["value_1", "value_2"]
  ]
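
The $gt and $lt operators work the same way for numeric filtering. A minimal sketch, assuming a numeric value column in a hypothetical my_domain dataset, that keeps only the rows where value is strictly greater than 100 and strictly lower than 500:

data:
  query:[
    $match:
      domain: 'my_domain'
      value: {$gt: 100, $lt: 500}
  ]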

Step $project

The $project step lets you do calculations on the columns, include fields, remove fields, add new fields, and reset the values of existing fields.

You can also use the $addFields operator instead of $project. This operator is equivalent to a $project step that explicitly keeps all existing fields of the dataset and adds the new fields.
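
For instance, a minimal sketch (assuming hypothetical columns value_1 and value_2 in my_domain) that keeps every existing column and simply adds a computed total column:

data:
  query:[
    $match:
      domain: 'my_domain'
  ,
    $addFields:
      total: $add: ["$value_1", "$value_2"]
  ]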

Inclusion of a field

To keep a column, flag it with the boolean 1; columns that are not listed are dropped.

My dataset:

country category value_1 value_2
France Femme 600 200
France Homme 1000 100
Allemagne Femme 500 500
Allemagne Homme 500 400
Maroc Homme 200 150
data:
  query:[
    $match:
      domain: 'my_domain'
  ,
    $project:
      country: 1
      value_1: 1
  ]

My output is:

country value_1
France 600
France 1000
Allemagne 500
Allemagne 500
Maroc 200

Calculation on metrics:

  • $add: ["$column_1", "$column_2"]: creates a new column with the sum column_1 + column_2
  • $subtract: ["$column_1", "$column_2"]: creates a new column with the difference column_1 - column_2
  • $multiply: ["$column_1", "$column_2"]: creates a new column with the product column_1 * column_2
  • $divide: ["$column_1", "$column_2"]: creates a new column with the quotient column_1 / column_2

You can use columns but also plain numbers. For example: $divide: ["$column_1", 100]

My dataset is:

country category value_1 value_2
France Femme 600 200
France Homme 1000 100
Allemagne Femme 500 500
Allemagne Homme 500 400
Maroc Homme 200 150
data:
  query:[
    $match:
      domain: 'my_domain'
  ,
    $project:
      country: 1
      category: 1
      value_3: $subtract: ["$value_1", "$value_2"]
  ]

My output is:

country category value_3
France Femme 400
France Homme 900
Allemagne Femme 0
Allemagne Homme 100
Maroc Homme 50
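
These operators can be nested to build richer metrics. A minimal sketch reusing the same dataset, which computes value_2 as a percentage of value_1 (the ratio column name is illustrative):

data:
  query:[
    $match:
      domain: 'my_domain'
  ,
    $project:
      country: 1
      category: 1
      ratio: $multiply: [{$divide: ["$value_2", "$value_1"]}, 100]
  ]

My output is (values rounded for readability):

country category ratio
France Femme 33.33
France Homme 10
Allemagne Femme 100
Allemagne Homme 80
Maroc Homme 75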

Add new fields or reset existing fields

To set a field value directly to a numeric or boolean literal, as opposed to an expression that resolves to a literal, use the $literal operator. You can also create a new field with the $concat operator, which concatenates strings.

Example. My dataset is:

country category value_1 value_2
France Femme 600 200
France Homme 1000 100
Allemagne Femme 500 500
Allemagne Homme 500 400
Maroc Homme 200 150
data:
  query:[
    $match:
      domain: 'my_domain'
  ,
    $project:
      country: 1
      category: 1
      test: $literal: "This is a test"
      numeric: $literal: 10
      label: $concat: [ "$country", " - ", "$category"]
  ]

My output is:

country category test numeric label
France Femme This is a test 10 France - Femme
France Homme This is a test 10 France - Homme
Allemagne Femme This is a test 10 Allemagne - Femme
Allemagne Homme This is a test 10 Allemagne - Homme
Maroc Homme This is a test 10 Maroc - Homme

Step $group

The $group step groups data on one or multiple axes. This step always contains:

  • _id: the axes of the grouping
  • the aggregation of the other columns

Be careful: the output of the grouping step nests the grouping axes under _id (e.g. _id.country). Always use a $project step afterwards to rename these columns.

Example. My dataset is:

country value
France 100
France 500
Allemagne 500
France 500
France 500
Allemagne 500
Maroc 200
data:
  query:[
    $match:
      domain: 'my_domain'
  ,
    $group:
      _id:
        "country": "$country"
      value: $sum: "$value"
  ]

My output is:

_id.country value
France 1600
Allemagne 1000
Maroc 200

Example with multiple axes

My dataset is:

country category value
France Femme 100
France Homme 500
Allemagne Femme 500
France Femme 500
France Homme 500
Allemagne Homme 500
Maroc Homme 200
data:
  query:[
    $match:
      domain: 'my_domain'
  ,
    $group:
      _id:
        "country": "$country"
        "category": "$category"
      value: $sum: "$value"
  ]

My output is:

_id.country _id.category value
France Femme 600
France Homme 1000
Allemagne Femme 500
Allemagne Homme 500
Maroc Homme 200

The functions you can use to aggregate the metrics are:

  • $sum: to sum the values on the grouping axis
  • $avg: to compute the average of the values on the grouping axis
  • $first: to keep the first value on the grouping axis
  • $last: to keep the last value on the grouping axis
  • $max: to keep the maximum value on the grouping axis
  • $min: to keep the minimum value on the grouping axis

For example, grouping on two axes after a $match filter:
data:
  query:[
    $match:
      domain: 'my_domain'
      column_4: $in: ["value_1", "value_2"]
  ,
    $group:
      _id:
        "axis_1": "$axis_1"
        "axis_2": "$axis_2"
      value: $sum: "$value"
  ]
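
To compute an average instead of a sum on the country dataset shown earlier, here is a minimal sketch that also uses a $project step to rename the _id.country column back to country:

data:
  query:[
    $match:
      domain: 'my_domain'
  ,
    $group:
      _id:
        "country": "$country"
      value: $avg: "$value"
  ,
    $project:
      country: '$_id.country'
      value: 1
  ]

My output is:

country value
France 400
Allemagne 500
Maroc 200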

Step $sort

The $sort operator sorts your dataset on one or multiple fields (numeric or alphabetic) and returns it in sorted order. The order can have one of the following values:

  • 1 to specify ascending order.
  • -1 to specify descending order.

Example 1: sorting in alphabetical order. My dataset is:

country category value_1 value_2
France Femme 600 200
France Homme 1000 100
Allemagne Femme 500 500
Allemagne Homme 500 400
Maroc Homme 200 150
data:
  query:[
    $match:
      domain: 'my_domain'
  ,
    $sort:
      "country": 1
  ]

My output is:

country category value_1 value_2
Allemagne Femme 500 500
Allemagne Homme 500 400
France Femme 600 200
France Homme 1000 100
Maroc Homme 200 150

Example 2: sorting in descending order. My dataset is:

country category value_1 value_2
France Femme 600 200
France Homme 1000 100
Allemagne Femme 500 500
Allemagne Homme 500 400
Maroc Homme 200 150
data:
  query:[
    $match:
      domain: 'my_domain'
  ,
    $sort:
      "value_1": -1
  ]

My output is:

country category value_1 value_2
France Homme 1000 100
France Femme 600 200
Allemagne Femme 500 500
Allemagne Homme 500 400
Maroc Homme 200 150

Step $limit

The $limit operator limits the number of lines of your dataset.

Example. My dataset is:

country category value_1 value_2
France Femme 600 200
France Homme 1000 100
Allemagne Femme 500 500
Allemagne Homme 500 400
Maroc Homme 200 150
data:
  query:[
    $match:
      domain: 'my_domain'
  ,
    $limit: 2
  ]

My output is:

country category value_1 value_2
France Femme 600 200
France Homme 1000 100
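
$limit is most useful right after a $sort step, typically to keep the top N rows. A minimal sketch on the same dataset, keeping the two rows with the highest value_1:

data:
  query:[
    $match:
      domain: 'my_domain'
  ,
    $sort:
      "value_1": -1
  ,
    $limit: 2
  ]

My output is:

country category value_1 value_2
France Homme 1000 100
France Femme 600 200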

Step $skip

The $skip operator skips the first N lines of your dataset.

Example. My dataset is:

country category value_1 value_2
France Femme 600 200
France Homme 1000 100
Allemagne Femme 500 500
Allemagne Homme 500 400
Maroc Homme 200 150
data:
  query:[
    $match:
      domain: 'my_domain'
  ,
    $skip: 2
  ]

My output is:

country category value_1 value_2
Allemagne Femme 500 500
Allemagne Homme 500 400
Maroc Homme 200 150
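
$skip combines naturally with $limit, for instance to extract a "page" of rows. A minimal sketch that skips the first two lines, then keeps the next two:

data:
  query:[
    $match:
      domain: 'my_domain'
  ,
    $skip: 2
  ,
    $limit: 2
  ]

My output is:

country category value_1 value_2
Allemagne Femme 500 500
Allemagne Homme 500 400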

Multiple queries

You can concatenate the results of multiple MongoDB queries with the multiple_queries option.

Example

domain: "france"

category value value_n_1
Femme 600 200
Homme 1000 100

domain: "UK"

category value value_last_year
Femme 500 600
Homme 1100 500
Unknown 100 100
data:
  query:[
    [
      $match:
        domain: 'france'
    ,
      $addFields:
        country: $literal: "France"
    ]
    [
      $match:
        domain: 'UK'
    ,
      $project:
        category: 1
        value: 1
        value_n_1: "$value_last_year"
        country: $literal: "UK"
    ]
  ]
  multiple_queries: true
My output is:

country category value value_n_1
France Femme 600 200
France Homme 1000 100
UK Femme 500 600
UK Homme 1100 500
UK Unknown 100 100