Validate your datasources

Add some data validation steps to detect errors in your datasources sooner.

When a new data file is dropped through the studio or extracted during preprocess, it can be validated. You can check the data types of the columns, the number of rows of a data set, the presence of some required columns, that there is no duplicated rows, etc.

The list of validation rules for data files are defined inside the etl_config.cson file under the validation key.

Example

domain: "market-breakdown"
type: "excel"
file: "data-market-breakdown.xls"
sheetname: "Data"
validation:[
  type: "data_type"
  expected:
    "breakdown": "string"
    "pays": "string"
    "part": "number"
    "clients": "number"
,
  type: "pattern"
  expected: "[0-9]+"
  params:
    columns: ['metric']
]

Rows

Check the number of expected rows.

Keys:

  • type: ‘rows’
  • expected: (number) number of expected rows
type: 'rows'
expected: 6

Columns

Ensure that a given list of columns is a subset of the dataset’s columns

Keys:

  • type: ‘columns’
  • expected: (list(str)) columns you expected to find
type: 'columns'
expected: ['metric', 'target']

Unique values

Ensure that the list of unique values of a given column corresponds exactly to a list of expected values.

Keys:

  • type: ‘unique_values’
  • expected: (list) unique values
  • params:
  • column: (string) column name
type: 'unique_values'
expected: ["France", "Germany", "Spain"]
params:
    column: 'target'

No duplicates

Duplicated rows can be assessed based on all the columns or only a subset of columns.

Keys:

  • type: ‘no_duplicates’
  • params:
  • columns: (list or string) list of columns to use or ‘all’
type: 'no_duplicates'
params:
    columns: ['metric']

Value

Check the value of a column (one value only). If the query returned more than one row, only the first one will be used.

Keys:

  • type: ‘value’
  • expected: (string or number) expected value
  • params:
  • column: (string) in which to check the value
type: 'value'
expected: 30
params:
    column: 'target'

Data type

Check column data types. Three possible types: number, string or category.

Keys:

  • type: ‘data_type’
  • expected: : <’string’, ‘number’ or ‘category’>
type: 'data_type'
expected:
    'metric': 'number'
    'target': 'number'

Pattern

Check if string values correspond to a defined pattern

Keys:

  • type: ‘pattern’
  • expected: pattern/regex as a string
  • params: object with a columns key: the list of columns to check.
type: 'pattern'
expected: '[0-9]+'
params:
    columns: ['metric']
]

Not null

Check if some columns don’t have null value.

Keys:

  • type: ‘not_null’
  • params: object with a columns key: list of columns.
type: 'not_null'
params:
    columns : ['zone']

Tutorial : Product Corporation

You can download the CSV file for our tutorial.

data-product-corporation.csv

  • Add validation for your datasource in etl_config.cson

    DATA_SOURCES: [
     {
       domain: 'data-product-corpo'
       file: 'data-product-corporation.csv'
       skip_rows: 0
       separator: ','
       encoding: 'utf-8'
       type: 'csv'
       validation:[
      type: 'rows'
      expected: 46
       ,
      type: 'columns'
      expected: ['brand', 'country', 'product_line']
       ]
     }
    ]
    
  • Drag and drop your new etl_config.cson in the CONFIG FILES page

  • Go to your ‘DATA SOURCES’ page and drop your datasource.

  • Validation should be ok. If not, the file is not uploaded.