Data sources

You can use diverse data sources to populate Toucan Toco apps.

Basics

A data source must be registered in etl_config.cson, in the DATA_SOURCES block.

Example

Here is a DATA_SOURCES example with three types of data sources (Excel files, a Google Spreadsheet and a MySQL database accessed through a data provider):

DATA_SOURCES: [
  domain: "market-breakdown"
  type: "excel"
  file: "data-market-breakdown.xls"
  sheetname: "Data"
,
  domain: "dashboards"
  type: "excel"
  file: "dashboards.xls"
  sheetname: "Data"
,
  domain: 'gsheets'
  name: 'google_spreadsheet_provider'
  spreadsheet_id: '1O5yDtjOZCCvtRCxnw_MwQQ2aT3TW9J3i_WvOu0zXVXk'
  sheetname: 'Sheet1'
,
  domain: 'db_test'
  type: 'data_provider'
  name: 'Some MySQL provider'
  table: 'City'
]

File-based data sources

File-based data sources are updated directly by dropping files on the upload interface. They are stored on Toucan Toco’s servers. Two formats are allowed: CSV (.csv or any other delimiter-separated values file) and Excel (.xls or .xlsx). A basic DATA_SOURCES block looks like the following:

DATA_SOURCES: [
  domain: 'data1'
  type: 'csv'
  file: 'data1.csv'
  separator: ','
,
  domain: 'data2'
  type: 'excel'
  file: 'data2.xls'
  sheetname: 'sheet1'
]

CSV files

To declare a .csv data source, specify the type (type: 'csv'), the file name (file: 'my_filename.csv') and the separator used in the CSV file (separator: ',', usually ',', ';' or a tabulation \t). Add a domain field to identify and use the data source (domain: 'my_csv_domain'):

DATA_SOURCES: [
  domain: 'my_csv_domain1'
  type: 'csv'
  file: 'my_filename1.csv'
  separator: ';'
,
  domain: 'my_csv_domain2'
  type: 'csv'
  file: 'my_filename2.csv'
  separator: ','
]

Excel files

To declare an .xls or .xlsx data source, specify the type (type: 'excel'), the file name (file: 'my_filename.xls' or file: 'my_filename.xlsx'), and a domain field to identify and use the data source (domain: 'my_excel_domain'). Since Excel files may have multiple sheets, there are three possibilities:

  1. Declare a data source and a domain per file:
DATA_SOURCES: [
  domain: 'my_excel_domain1'
  type: 'excel'
  file: 'my_filename.xls'
  sheetname: 'sheetname1'
]
  2. Declare a data source and a domain per sheet by identifying the sheet with the sheetname field:
DATA_SOURCES: [
  domain: 'my_excel_domain1'
  type: 'excel'
  file: 'my_filename.xls'
  sheetname: 'sheetname1'
,
  domain: 'my_excel_domain2'
  type: 'excel'
  file: 'my_filename.xls'
  sheetname: 'sheetname2'
]

  3. Declare a data source and a domain corresponding to the concatenation of all the sheets by setting the sheetname field to null:

DATA_SOURCES: [
  domain: 'my_excel_domain'
  type: 'excel'
  file: 'my_filename.xls'
  sheetname: null
]

This option extracts a domain with the concatenated data and a column __sheet__ containing the sheet name. For instance, with the data source declared above, if my_filename.xls contains two sheets, sheet1 and sheet2, the domain my_excel_domain will contain the concatenated data and a column __sheet__ with two possible values, ‘sheet1’ and ‘sheet2’.
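
For illustration, if sheet1 and sheet2 each contain a single column label (the column name and values below are hypothetical), the extracted domain my_excel_domain would look like:

    label   __sheet__
0   a       sheet1
1   b       sheet1
2   c       sheet2
3   d       sheet2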

XML files

To declare .xml data sources, specify the type (type: 'xml'), a file name (file: 'my_filename.xml'), and a domain field to identify and use the data source (domain: 'my_xml_data').

If your data is nested in the DOM and you need to access it or modify it before it is imported, you can use the filter field to write a jq filter (e.g. filter: '.records.record'). jq is the standard we use to preprocess nested data sources into tabular data. See this part of the Toucan Toco documentation for a detailed introduction to jq. In the specific case of XML data, we first convert it into a nested structure (attributes are prefixed with @, as in the example below), and then apply your jq filter.

Example data:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<records>
  <record id="1">
    <title>Black album</title>
  </record>
  <record id="2">
    <title>Graffiti bridge</title>
  </record>
</records>

Datasource declaration:

DATA_SOURCES: [
   domain: 'XML Fixture'
   type: 'xml'
   file: 'fixture.xml'
   filter: '.records.record'
]

Resulting dataframe:

    @id    title
0   1       Black album
1   2       Graffiti bridge

Common options

Encoding format

By default, Toucan will try to read files as utf-8. Any other encoding can be read, but it has to be specified in your etl_config. Make sure the encoding has been identified before loading data. Here is a list of common encodings: https://en.wikipedia.org/wiki/Character_encoding#Common_character_encodings

For example:

DATA_SOURCES: [
  domain: 'my_excel_domain1'
  type: 'excel'
  file: 'my_filename.xls'
  sheetname: 'sheetname1'
  encoding: 'latin-1'
]

Column format

Sometimes the data is not extracted the way you want. Let’s say you have the file below:

    date    title
0   2017    Black album
1   2016    Graffiti bridge

By default, the date column elements will be interpreted as integers. If you want this column to store text instead, use the dtype option, which is available for csv and excel files. Here is how you would use it:

DATA_SOURCES: [
  domain: 'my_excel_domain1'
  type: 'excel'
  file: 'my_filename.xls'
  sheetname: 'sheetname1'
  dtype:
    date: 'str'
]

For more details about this method, please refer to pandas documentation.

Skipping rows

DATA_SOURCES: [
  domain: 'my_excel_domain1'
  type: 'excel'
  file: 'my_filename.xls'
  sheetname: 'sheetname1'
  skip_rows: 1
]

Skip n rows at the beginning of the file; useful when your files come with decorative headers or blank lines at the top.

Using patterns to match multiple data sources

You can use a single etl_config block (of type csv or excel) with a pattern to match multiple data sources.

Concatenate matched data sources

Set the optional field match to true to activate this option. The file field will then be interpreted as a regular expression, and every file whose name matches it will be extracted. ⚠️ It may sometimes be necessary to add a $ at the end of the pattern to mark the end of the file name.

If the matched data sources are all assigned the same domain, the extracted data will be concatenated. The following block will extract files with names such as “results-01-2016.csv” and “results-02-2016.csv”, and extract a domain “results” with the concatenated data.

etl_config:
  DATA_SOURCES: [
    domain: "results"
    type: "csv"
    file: "results-\\d\\d-\\d\\d\\d\\d.csv"
    match: true
  ]

Add a __match__ value

You can also use the matched groups to create a __match__ value for each matched data source. The following block will:

  • match files with names: “results-01-2016-spain.csv”, “results-02-2016-france.csv”,
  • extract one domain “results-{month}-{year}-{country}” with the concatenated data and a column __match__ containing two possible values “results-01-2016-spain” and “results-02-2016-france”.
etl_config:
  DATA_SOURCES: [
    domain: "results-{month}-{year}-{country}"
    type: "csv"
    file: "results-(?P<month>\\d\\d)-(?P<year>\\d\\d\\d\\d)-(?P<country>\\w+).csv"
    match: true
  ]

The syntax (?P<name>...) is known as “group name”. You can learn about regular expressions syntax in the Python regular expressions guide.

Notes

  • You can test your regular expressions against test strings with an online regex tester.
  • In the etl_config you have to escape backslashes (“\”) with an extra backslash (“\\”), but not when you test your regular expressions.
  • To limit the match to an exact match you have to use the syntax “^results-\d\d-\d\d\d\d.csv$”, otherwise be aware that the pattern “results-\d\d-\d\d\d\d.csv” will also match strings such as “results-01-2016.csv.copy”.

Reading files from remote locations

Sometimes your data files are uploaded to remote servers, and it is simpler to keep them there. In this case, data sources can have a url key instead of a file key.

Everything else works exactly the same, but we will read the file over a network connection according to the URL scheme.

DATA_SOURCES: [
  domain: 'db_test'
  type: 'csv'
  url: 'sftp://<username>:<password>@ftp.example.com/my_folder/my_data.csv'
]

We support ftp (as well as sftp or ftps), http (and https), S3 and a long list of other schemes (‘mms’, ‘hdl’, ‘telnet’, ‘rsync’, ‘gopher’, ‘prospero’, ‘shttp’, ‘ws’, ‘https’, ‘http’, ‘sftp’, ‘rtsp’, ‘nfs’, ‘rtspu’, ‘svn’, ‘git’, ‘sip’, ‘snews’, ‘tel’, ‘nntp’, ‘wais’, ‘svn+ssh’, ‘ftp’, ‘ftps’, ‘file’, ‘sips’, ‘git+ssh’, ‘imap’, ‘wss’).

This feature makes it very easy to read data from remote storage services. For example, this is how you can read a CSV file directly from Dropbox.

DATA_SOURCES: [
  domain: 'my_remote_data'
  type: 'csv'
  url: 'https://www.dropbox.com/s/9yu9ekfjk8kmjlm/fake_data.csv?dl=1'
]

It also works with the secured Toucan Toco FTP server:

DATA_SOURCES: [
    domain: 'db_test'
    type: 'csv'
    url: 'ftps://<login>:<password>@ftps.toucantoco.com:990/my_db.csv'
    separator: ";"
]

FTP How to

When your data files are too big to be transferred via the studio data upload interface, we offer the option to store them on an FTP server.

The FTP server can either be hosted on the Toucan Toco side (ask our support team to set it up), or your client can set one up and we will connect to it.

Use FileZilla to get a glimpse of the files available on your FTP server.

To get the URL of the file you want to use, open FileZilla and connect to your FTP server. Once you see the list of available files, right-click on the one you want and select “Copy URL(s) to clipboard”. Don’t forget to add your password to the generated URL.
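
For example, a copied URL typically looks like the following (the folder and file names are illustrative); insert your password after the colon:

ftps://<login>:<password>@ftps.toucantoco.com:990/my_folder/my_data.csv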


External data providers

With our different connectors, you can use diverse data sources to populate Toucan Toco apps.

Data providers are server-side, which means they require a connection from Toucan Toco’s server to the data source.

Basics

Declaration

Each external data provider must be declared in the DATA_PROVIDERS block of etl_config.cson. The type and a unique identifier name must be specified. Additional options fields must be added depending on the type (e.g. username and password). See below for the options available for each type.

DATA_PROVIDERS: [
  name: "My provider"
  type: "providerType"
  options: options
]
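
As an illustration only, a MySQL provider declaration might look like the sketch below; the option keys shown (host, user, password, db) are assumptions and depend on the connector type, so refer to the connectors documentation for the exact fields:

DATA_PROVIDERS: [
  name: "Some MySQL provider"
  type: "MySQL"
  options:
    host: "<host>"
    user: "<username>"
    password: "<password>"
    db: "<database>"
]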


See connectors for detailed options of each provider type.

Usage

A data provider can be used:

  • to extract data to Toucan Toco database (DATA_SOURCES blocks),
  • or directly in the front_config.cson, to populate slides.

1. Extract data from a data provider

When creating a DATA_SOURCES block, you can specify the special type data_provider and the name of the data provider you want to use, as declared in DATA_PROVIDERS.

DATA_SOURCES: [
  domain: 'db_test'
  type: 'data_provider'
  name: 'My provider'
  [... other options ...]
]

2. Query a data provider directly

In front_config.cson, instead of querying the Toucan Toco database, you can add the provider option to direct the query to a data provider. In this mode, the Toucan Toco data pipeline is completely bypassed.

You need to provide the name of the data provider you want to use, as declared in DATA_PROVIDERS.

chartOptions:
  data:
    provider: "My provider"
    query: { [... query ...] }
    ...
  ...

Note

We don’t recommend direct querying of external data providers for the following reasons:

  • performance relies heavily on your external database
  • post-processing of this data is limited to what can be done in the front_config, as this data will never pass through our ETL pipeline
  • permissions configured in Toucan Toco database no longer apply, which means a user will be able to run any request that Toucan’s server is authorized to (so be aware of the privileges granted to this user)

Available data providers

  • MySQL
  • MSSQL
  • MSSQL on Azure
  • SSAS
  • SAP Hana
  • Google Analytics
  • Google Big Query
  • Google Spreadsheet
  • Postgres
  • MongoDB
  • HttpAPI

Below are some configuration examples.

Usual databases

Configuration:

In the etl_config.cson (MySQL example):

DATA_SOURCES: [
  domain: 'db_test'
  type: 'data_provider'
  name: 'Some MySQL provider'
  table: 'City'
]

The first three fields are always the same. Depending on the data provider, the next field(s) can be different.

  • domain: same as usual
  • type: ‘data_provider’ (always)
  • name: name of the data provider (same as the one inside the DATA_PROVIDERS list)

The MySQL data provider uses either table or query (see the sketch after this list):

  • table: Name of the table to dump.
  • query: Query to run on the database.
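
For instance, here is a sketch of the query variant, reusing the 'Some MySQL provider' declared above (the domain name and the SQL statement are illustrative):

DATA_SOURCES: [
  domain: 'db_test_query'
  type: 'data_provider'
  name: 'Some MySQL provider'
  query: 'SELECT Name, Population FROM City WHERE Population > 100000'
]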

SSAS data provider:

  • mdx_query: the MDX query used to retrieve the data and load it into the DataFrame.
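
To give an idea of the shape of such an entry, here is a hedged sketch (the provider name, cube, measure and dimension names are placeholders, not taken from a real project):

DATA_SOURCES: [
  domain: 'sales_by_year'
  type: 'data_provider'
  name: 'Some SSAS provider'
  mdx_query: 'SELECT { [Measures].[Sales Amount] } ON COLUMNS, { [Date].[Calendar Year].Members } ON ROWS FROM [Sales]'
]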

Result:

The table or the query result is loaded into the database, and you can access it as you are used to for csv or excel files.

Google Big Query

You can both connect to and extract data from Google Big Query.

Google Big Query organizes information in projects, which contain datasets, which in turn contain tables. Tables can be queried.

Prerequisites:

For a Toucan Toco app to connect to Google Big Query, you need a JSON private key for a Google Service Account that has been authorized to query certain datasets in a project.

Instructions for the client:

  • Go to the Google API console and select a project
  • Go to the page Google “BigQuery API” > “Credentials” > “Service account key” and create a new Google Service account:
    • Service account: “New service account”
    • Service account name: “toucantoco”
    • Key type: “JSON”
  • Save the automatically downloaded JSON private key file.
  • Open the JSON private key file and copy the Google Service Account email from the client_email field.
  • Go to the Google Big Query console and share each of the datasets you want us to access with that client_email. You should give us read-only access.
  • Send us the JSON private key file.

Instructions for Toucan Toco to learn the names of the datasets and tables, and to try queries:

  • Ask the client to share the datasets with your personal Google email.
  • Create a Google Big Query account, a new project and then switch to the client project.
  • Test a query from the Big Query console.

Data Provider configuration:

Register the data provider in the DATA_PROVIDERS section of etl_config.cson. Choose a name for the data provider. Set the data provider type to GoogleBigQuery. Set the project_id to the one inside the private key file. Save the private key file in app/parameters.

Example:

DATA_PROVIDERS: [
    name: "<google_bigquery_provider>"
    type: "GoogleBigQuery"
    options:
        project_id: "<project_id>"
        authentication_file: "<private-key-file.json>"
]

Connector configuration:

Configure the connector in the chartOptions section of front_config.cson.

Example:

chartOptions:
    data:
        provider: "<google_bigquery_provider>"
        query: """SELECT * FROM project_id.dataset.table;"""

Extractor configuration:

Configure the extractor in the DATA_SOURCES section of etl_config.cson. Set the extractor type to google_bigquery.

Example:

DATA_SOURCES: [
    domain: "domain_name"
    type: "google_bigquery"
    name: "<google_bigquery_provider>"
    query: """SELECT * FROM project_id.dataset.table;"""
]

Query:

Big Query uses SQL-like queries.

Basic query form

"SELECT * FROM project:dataset.table;"

If no project name is specified in the query, the default project_id of the data provider will be used:

"SELECT * FROM dataset.table;"

If the project name includes a dash, the FROM clause must be enclosed in square brackets:

"SELECT * FROM [my-dashed-project:dataset.table];"

See: https://cloud.google.com/bigquery/query-reference

Billing note

The project_id in the data provider is the one linked to the private key file, while the project_id in the query is the one you want to query. They don’t have to be the same, but keep in mind that the billing for the queries is charged to the Google Service Account, so ideally the client should be the one who creates the service account.

SAP Hana

Basic implementation of a SAP Hana connector (Hana is a proprietary in-memory database) using pyhdb. We created a trial account on hanatrial.ondemand.com to test it. Follow these steps:

  1. Follow the instructions of this link to download the SDK: http://scn.sap.com/community/developer-center/hana/blog/2015/04/24/try-hanatrial-using-python-or-nodejs
  2. Unzip the SDK and go to neo-java-web-sdk-1.105.2/tools
  3. Run ./neo.sh open-db-tunnel -a <account> -h <host> -u <user>  -i <id> to open a tunnel
  4. When asked, type the password associated with the user
  5. When the connection is opened, update the password in the ETL config with the one provided in the terminal
  6. Use it!

Regarding step 3, here is some more information:

  • -a, --account: account name
  • -h, --host: host name
  • -u, --user: e-mail, SAP ID, or user name
  • -i, --id: ASE database, Hana database or schema ID

Here is an example:

neo open-db-tunnel -a myaccount -h hanatrial.ondemand.com -u mymail@example.com -i myschema

MongoDB

Example data source configuration (requires the appropriate data provider to be set up):

DATA_SOURCES: [
    domain: "domain_name"
    type: "MongoDB"
    name: "my_mongodb_provider"
    collection: "users"  # mongodb collection name
    query:  # mongodb query object
        age: {"$gt": 18}
]

Google Spreadsheet

Configuration

Here we assume that you have already configured a Google Spreadsheet provider named gsheet_provider. If not, please follow the Google Spreadsheet connector’s documentation.

In the etl_config.cson:

DATA_SOURCES: [
  domain: 'gsheets'
  name: 'gsheet_provider'
  spreadsheet_id: 'gs_id'
  sheetname: 'Sheet1'
]

  • domain: same as usual
  • name: ‘gsheet_provider’ (replace by the name you gave to your provider)
  • spreadsheet_id: id of the spreadsheet, which can be found in the URL between /d/ and /edit (e.g. https://docs.google.com/spreadsheets/d/<spreadsheet_id>/edit?pref=2&pli=1#gid=0),
  • sheetname: name of the sheet; if it is not found or if sheetname is an empty string, the extractor will return the first sheet (perhaps the only one).

Share the spreadsheet

Unless the spreadsheet is public, you will have to manually share it.

Open the Google spreadsheet in your web browser. In the File menu, there is a Share option. Click on it and enter the email address of your service account. If you are on Toucan Toco’s cloud, it is: toucanserviceaccount@testproj-204816.iam.gserviceaccount.com

Without sharing with this email, your app won’t be able to access the spreadsheet.

Result

The spreadsheet is loaded into the database, and you access it as you are used to for csv or excel files.

HTTP API

Work with data from any HTTP REST style API in your application.

This type of data source combines the features of Python’s requests to get data from any API and jq for flexible transformations of the responses.

First define an entry in DATA_PROVIDERS with your target API’s basic information, such as the baseroute and authentication details.

Now you can configure DATA_SOURCES entries that use this provider to get what REST APIs call “resources” (one entry in DATA_SOURCES = one API route and HTTP method).

For example:

DATA_SOURCES: [
  domain: "books"
  type: "HttpAPI"
  name: "Some API provider"
  method: "GET"
  url: "records/1.0/search/"
  params:
    dataset: 'les-1000-titres-les-plus-reserves-dans-les-bibliotheques-de-pret'
    sort: 'nombre_de_reservations'
    facet: 'auteur'
  filter: ".records[].fields"
]

The JSON response looks like this:

{
  "nhits": 1000,
  "parameters": { ... },
  "records": [
    {
      "datasetid": "les-1000-titres-les-plus-reserves-dans-les-bibliotheques-de-pret",
      "recordid": "4b950c1ac5459379633d74ed2ef7f1c7f5cc3a10",
      "fields": {
        "nombre_de_reservations": 1094,
        "url_de_la_fiche_de_l_oeuvre": "https://bibliotheques.paris.fr/Default/doc/SYRACUSE/1009613",
        "url_de_la_fiche_de_l_auteur": "https://bibliotheques.paris.fr/Default/doc/SYRACUSE/1009613",
        "support": "indéterminé",
        "auteur": "Enders, Giulia",
        "titre": "Le charme discret de l'intestin [Texte imprimé] : tout sur un organe mal aimé"
      },
      "record_timestamp": "2017-01-26T11:17:33+00:00"
    },
    {
      "datasetid":"les-1000-titres-les-plus-reserves-dans-les-bibliotheques-de-pret",
      "recordid":"3df76bd20ab5dc902d0c8e5219dbefe9319c5eef",
      "fields":{
        "nombre_de_reservations":746,
        "url_de_la_fiche_de_l_oeuvre":"https://bibliotheques.paris.fr/Default/doc/SYRACUSE/1016593",
        "url_de_la_fiche_de_l_auteur":"https://bibliotheques.paris.fr/Default/doc/SYRACUSE/1016593",
        "support":"Bande dessinée pour adulte",
        "auteur":"Sattouf, Riad",
        "titre":"L'Arabe du futur [Texte imprimé]. 2. Une jeunesse au Moyen-Orient, 1984-1985"
      },
      "record_timestamp":"2017-01-26T11:17:33+00:00"
    },
    ...
  ]
}

We apply the filter .records[].fields, which means that for every entry in the records property it will extract all the properties of the fields object. So we end up with a table of results looking like this (some columns are skipped in this example, but you see the point):

nombre_de_reservations  auteur           skipped columns…
1094                    Enders, Giulia
746                     Sattouf, Riad

The complete spec of an HttpAPI entry in DATA_SOURCES is as follows:

Required:

  • domain : name of the domain
  • type : HttpAPI
  • name : name of the entry in DATA_PROVIDERS
  • url : url of the resource (REST API route)

Optional:

  • method : http method, defaults to GET (a combined example using the optional fields is sketched after this list)
  • headers : object describing the headers of the request
  • params : url parameters
  • json : object serialized as JSON body of the request
  • data : raw string used as the body of the request
  • filter : jq filter (cf. jq doc) to apply on the response
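
As an illustration of the optional fields, here is a hedged sketch of a POST entry (the route, header and body values are hypothetical, not a real API):

DATA_SOURCES: [
  domain: "books_search"
  type: "HttpAPI"
  name: "Some API provider"
  method: "POST"
  url: "records/search/"
  headers:
    Accept: "application/json"
  json:
    dataset: "my-dataset"
    rows: 50
  filter: ".records[].fields"
]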

Note: the reason to have a filter option is to allow you to take any API response and transform it into something that fits into a column-based data frame. jq is designed to be concise and easy to use for simple tasks, but if you dig a little deeper you’ll find a featureful functional programming language hiding underneath.

There is a good tutorial on jq, a complete manual and a cookbook for more advanced uses. You’ll find most Stack Overflow entries useful as well. And finally there is an online playground to test your filters.

Let’s take a concrete example from one of our projects with a more complex filter. We are getting the objectives of a KPI, and this is the JSON we get:

[
  {
    "container": "objectives",
    "contents": [
      {
        "year": "2017",
        "january_objective": "10",
        "february_objective": "12",
        ...
      },
      {
        "year": "2016",
        "january_objective": "4",
        "february_objective": "6",
        ...
      },
    ]
  }
]

As a result we want a dataset that looks like this:

date      type       value
Jan-2017  objective  10
Feb-2017  objective  12
Jan-2016  objective  4
Feb-2016  objective  6

So let’s go step by step. First we need to select the objectives container and iterate over its contents. This is how we achieve that:

.[] | select(.container == "objectives").contents[]

Next we want to transform the list of years with monthly objectives as properties…

[
  {"year": "2017", "january_ojective": "10", "february_objective": "12"},
  {"year": "2016", "january_ojective": "5", "february_objective": "6"}
]

… into a list of monthly objectives:

[
  {"date": "Jan-2017", "value": "10", "type": "objective"},
  {"date": "Feb-2017", "value": "12", "type": "objective"},
  {"date": "Jan-2016", "value": "4", "type": "objective"},
  {"date": "Feb-2016", "value": "6", "type": "objective"}
]

Let’s look first at the year property: we want to delete it but retain its value and combine it with the monthly properties, so

"year": "2017", "january_objective": ...

will become:

"2017_january_objective": ...

Our filter can first keep the value of year in a variable, and then delete the property:

.year as $year | del(.year)
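
For example, applied to a single year object, this keeps "2017" in $year and drops the property:

{"year": "2017", "january_objective": "10"} | .year as $year | del(.year)

returns:

{"january_objective": "10"}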

In order to combine keys and values we use the to_entries filter which has this effect:

{"january_objective": "12"}

becomes:

{"key": "january_objective", "value": "12"}

We can then transform the value of key, using string interpolation like this:

to_entries | map(.key |= "\($year)_\(.)")

This combines the value of key (denoted as .) with the value of the variable $year and assigns the result to key (the |= operator). As a result, key is now a unique combination of year and month, which means we can combine all the yearly objectives into one single array using the add filter.
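
For instance, with $year bound to "2017", the single entry shown above becomes:

{"key": "2017_january_objective", "value": "12"}

and add then concatenates the per-year arrays of such entries into one flat list (in jq, [[1, 2], [3, 4]] | add yields [1, 2, 3, 4]).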

This is our filter so far:

.[] | select(.container == "objectives").contents
    | map(
        .year as $year
      | del(.year)
      | to_entries
      | map(.key |= "\($year)_\(.)")
    )
    | add

And the result is:

[
  { "key": "2017_january_objective", "value": "10" },
  { "key": "2017_february_objective", "value": "12" },
  { "key": "2016_january_objective", "value": "4" },
  { "key": "2016_february_objective", "value": "6" }
]

At this point we have a list of monthly objectives; it would be good to rename the properties (column names) and format the date properly. We will just map each object in the final list into another object with the right property names and date formatting. The filter should look like this:

map({
  date: ( .key
         | sub("objective"; "01")
         | strptime("%Y_%b_%d")
         | strftime("%b-%Y")
        ),
  value: (.value | tonumber),
  type: "objective"
})
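
To see what the date expression does, here is a single key traced through each step (a sketch; the comments show the intermediate values, and strptime's %b has to accept the full lowercase month name here, which the usual C library implementation does):

"2017_january_objective"
  | sub("objective"; "01")   # "2017_january_01"
  | strptime("%Y_%b_%d")     # broken-down date
  | strftime("%b-%Y")        # "Jan-2017"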

And the result is:

[
  {"date": "Jan-2017", "value": 12, "type": "objective"},
  {"date": "Feb-2017", "value": 12, "type": "objective"},
  {"date": "Jan-2016", "value": 5, "type": "objective"},
  {"date": "Feb-2016", "value": 5, "type": "objective"}
]

This fits nicely into the tabular format we were expecting. The complete filter looks like this:

.[] | select(.container == "objectives").contents
    | map(
        .year as $year
      | del(.year)
      | to_entries
      | map(.key |= "\($year)_\(.)")
    )
    | add
    | map({
        date: (
           .key
         | sub("objective"; "01")
         | strptime("%Y_%b_%d")
         | strftime("%b-%Y")
        ),
        value: (.value | tonumber),
        type: "objective"
    })