Validate your datasources¶
Add some data validation steps to detect errors in your datasources sooner.
When a new data file is dropped through the studio or extracted during preprocess, it can be validated. You can check the data types of the columns, the number of rows of a dataset, the presence of required columns, that there are no duplicated rows, etc.
The list of validation rules for data files is defined inside the etl_config.cson file, under the validation key.
Example¶
domain: "market-breakdown"
type: "excel"
file: "data-market-breakdown.xls"
sheetname: "Data"
validation:[
type: "data_type"
expected:
"breakdown": "string"
"pays": "string"
"part": "number"
"clients": "number"
,
type: "pattern"
expected: "[0-9]+"
params:
columns: ['metric']
]
Rows¶
Check that the dataset contains the expected number of rows.
Keys:
- type: 'rows'
- expected: (number) expected number of rows
    type: 'rows'
    expected: 6
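For illustration, the check roughly amounts to the following pandas snippet (a sketch with made-up data, not the actual implementation):

    import pandas as pd

    df = pd.DataFrame({'metric': range(6)})  # hypothetical dataset with 6 rows
    assert len(df) == 6  # 'rows' rule: the row count must match exactly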
Columns¶
Ensure that a given list of columns is a subset of the dataset’s columns.
Keys:
- type: 'columns'
- expected: (list(str)) columns you expect to find
    type: 'columns'
    expected: ['metric', 'target']
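Conceptually, this is a subset test on the column names, as in this pandas sketch (sample data is made up):

    import pandas as pd

    df = pd.DataFrame({'metric': [1], 'target': [2], 'extra': [3]})
    # 'columns' rule: every expected column must exist; extra columns are allowed
    assert {'metric', 'target'}.issubset(df.columns)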
Unique values¶
Ensure that the list of unique values of a given column corresponds exactly to a list of expected values.
Keys:
- type: 'unique_values'
- expected: (list) expected unique values
- params:
  - column: (string) column name
    type: 'unique_values'
    expected: ["France", "Germany", "Spain"]
    params:
      column: 'target'
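In pandas terms, the rule compares sets of values, roughly like this (illustrative sketch, made-up data):

    import pandas as pd

    df = pd.DataFrame({'target': ['France', 'Germany', 'Spain', 'France']})
    # 'unique_values' rule: the distinct values must match the expected list
    # exactly, no more and no less (repetitions within the column are fine)
    assert set(df['target'].unique()) == {'France', 'Germany', 'Spain'}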
No duplicates¶
Duplicated rows can be assessed based on all the columns or only a subset of columns.
Keys:
- type: 'no_duplicates'
- params:
  - columns: (list or string) list of columns to use, or 'all'
    type: 'no_duplicates'
    params:
      columns: ['metric']
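A rough pandas equivalent of this rule on a subset of columns (a sketch with made-up data, not the actual implementation):

    import pandas as pd

    df = pd.DataFrame({'metric': [1, 2, 3], 'zone': ['A', 'A', 'B']})
    # 'no_duplicates' rule on a subset: values of 'metric' must not repeat,
    # even though 'zone' contains duplicates
    assert not df.duplicated(subset=['metric']).any()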
Value¶
Check the value of a column (a single value). If the dataset contains more than one row, only the first one is used.
Keys:
- type: 'value'
- expected: (string or number) expected value
- params:
  - column: (string) column in which to check the value
    type: 'value'
    expected: 30
    params:
      column: 'target'
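The "first row only" behaviour can be pictured with this pandas sketch (made-up data):

    import pandas as pd

    df = pd.DataFrame({'target': [30, 45]})
    # 'value' rule: only the first row of 'target' is compared to the expected value
    assert df['target'].iloc[0] == 30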
Data type¶
Check column data types. Four possible types: 'string', 'number', 'date', or 'category'.
Keys:
- type: 'data_type'
- expected: mapping of column names to types ('string', 'number', 'date', or 'category')
    type: 'data_type'
    expected:
      'metric': 'number'
      'target': 'number'
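As an illustration, checking that both columns hold numbers could look like this in pandas (a sketch; the exact type mapping used by the validation is an assumption here):

    import pandas as pd
    from pandas.api.types import is_numeric_dtype

    df = pd.DataFrame({'metric': [1, 2], 'target': [3.5, 4.0]})
    # 'data_type' rule: each listed column must have the declared type
    assert is_numeric_dtype(df['metric'])
    assert is_numeric_dtype(df['target'])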
Pattern¶
Check that string values match a defined pattern (regular expression).
Keys:
- type: 'pattern'
- expected: pattern/regex as a string
- params:
  - columns: (list) columns to check
    type: 'pattern'
    expected: '[0-9]+'
    params:
      columns: ['metric']
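The check behaves roughly like a per-value regex match, as in this pandas sketch (made-up data; whether the regex must match the full value or only a prefix is an assumption):

    import pandas as pd

    df = pd.DataFrame({'metric': ['123', '42']})
    # 'pattern' rule: every value in 'metric' must match the regex
    assert df['metric'].str.fullmatch('[0-9]+').all()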
Not null¶
Check that the given columns contain no null values.
Keys:
- type: 'not_null'
- params:
  - columns: (list) columns to check
    type: 'not_null'
    params:
      columns: ['zone']
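In pandas terms this is simply a missing-value check (illustrative sketch with made-up data):

    import pandas as pd

    df = pd.DataFrame({'zone': ['North', 'South']})
    # 'not_null' rule: 'zone' must not contain any null/NaN value
    assert df['zone'].notnull().all()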
Tutorial: Product Corporation¶
You can download the CSV file for our tutorial.
Add validation for your datasource in etl_config.cson:

    DATA_SOURCES: [
      {
        domain: 'data-product-corpo'
        file: 'data-product-corporation.csv'
        skip_rows: 0
        separator: ','
        encoding: 'utf-8'
        type: 'csv'
        validation: [
          type: 'rows'
          expected: 46
        ,
          type: 'columns'
          expected: ['brand', 'country', 'product_line']
        ]
      }
    ]

Drag and drop your new etl_config.cson in the CONFIG FILES page.

Go to your ‘DATA SOURCES’ page and drop your datasource.
Validation should pass. If it fails, the file is not uploaded.