How to :: Create a Data Model¶
What’s the point of using a Data Model ?¶
Creating a data model is the cornerstone of our Design methodology¶
The data model is the cornerstone of Toucan methodology. We use it to make our clients own their data, and ensure the quality of the data displayed inside their apps.
To know more about the many benefits of running your project using a data model, check out our amazing doc about Why :: choosing a data model approach.
To help your client creating a data model, it’s way better if :
- You need someone to talk to : check that there are persons in charge of the data in the project team, and make sure to involve them at the very beginning of the project.
- You need someone to work with : check that huresources are available for industrialisation. These people will be commited to generate data extracts and iterate with you on the data format.
How to get your client ready for your Data Model?¶
- What you want as a conceptor 👍 : the data in the most appropriate format to fit the dataset expected by the viz you’ve built in the design phase.
- What you don’t want as a conceptor 👎: deal with additional work integrating the data, by making additional operations to your data before the data model
- What your client don’t want 👎 : provide too many data tables and spend too much time on generating data extracts
You need to find an in-between :
- Get one data table per story
- Get a unique data table for your entire application
Bear in mind that it can be difficult for your client to find its own data, do not offer definitive solutions.
Check with your client :
- how many databases do they have? what’s inside their database? what does the raw data look like?
- will they be able to do the calculations we’re asking for? what’s the cost for them to format the data the way we need it?
The question you need to answer at the end is : can you help your client by adapting the data model you’ve provided ? Can you simplify it? Make it easier for them to generate. Keep in mind that your objective is to have date in your app 😁
Also, to help your client filling the data model, explain precisely that its the result of the app design : the views hierarchy inside your app, the temporality (“current”, “YTD”, “12 rolling months”). Make sure it’s clear that generating these extracts have to be anticipated and ready for go live estimated date, that all (or at least most of) calculations will have to be done on their end.
The 3 steps to build your data model¶
1st - Define the architecture of your application¶
What we mean by “architecture” is the back-bone of your app. That is to say the different views you get in your application, the hierarchical links between them, and the temporality wanted. You’ll have to modelize :
- a view referential
- a temporal format
The view referential¶
It will depend on the hierarchical structure you have defined in your application. Is it..?
A flat hierarchy : every views are at the same level
📝Brand performances, where each brand represents a view
A simple hierarchy : each view is the “child” of another, which is its “parent”
📝Mall analysis, where each shop represents a view (childs), aggregated in shop categories (parents)
A complex hierarchy : a “child” view can have several “parents” that are not necesarily linked between themselves
📝A Real Estate Portfolio, where each asset is a view (child), and belongs both to Territories (parent 1) and Asset Managers (parent 2), which are not hierarchicaly linked
Depending on the level of hierarchy you have, you’ll adapt the content of your view referential. The more hierarchy you have, the more detailed your referential has to be.
For a simple hierarchy, you’ll have to define¶
Entity_name : this is the unique label/name for each view Entity_group : this is the group your view belongs to (you can have several entity belonging to a same group) Entity_parent : this is the parent of your view (several view can share the same parent)
📝 Let’s take an example : your application is aiming to track shop performances within a mall, depending on their categories (group), and on which level of the mall they are located (parent).
|shop 1||FMCG||Level 1|
|shop 2||FMCG||Level 1|
|shop 3||Healthcare||Level 1|
|shop 4||FMCG||Level 2|
|shop 5||Audio||Level 2|
For a complex hierarchy, you’ll have to define¶
The same columns as above : Entity_name, Entity_group, Entity_parent Parent_group : this is group your parent belongs to (several parents can belongs to the same group)
📝 Example : you follow your real estate portfolio, by asset, which belongs to different Asset Manager, Territory, Direction and Company. You have a double hierarchy :
- Asset > Asset Manager > Direction > Company
- Asset > Territory > Direction > Company
The temporal format¶
Usually, you want to analyze your data with different temporality.
First, you can have different period type : YTD ou Current Period for example
Then, you can get different date type : Quarter, Month, Day, Week for example
After setting all that, you can pick a specific date you want to look at : Q3, April, 22nd of May 2019, or Week 35
📝For example, check the time requester on top of the dashboard : you can set up period type and associated period.
What you want to get in your data is :
- A date column - which should contain the first day of the associated period (to order your date easily, don’t forget to format it as YYYY/MM/DD)
- A date type column - where you get the detailed type of the date you’re looking at
- A date label - the date that will be displayed on the screen, it must be lisible and user friendly
- A period type column - where you get the detailed type of the period (YTD, MTD, Current, etc)
It will looks like :
|2018/03/01||1st of March||Week||W18||Current|
|2018/03/01||1st of March||Week||W18||To Date|
|2018/03/01||1st of March||Month||March||Current|
|2018/03/01||1st of March||Month||March||To Date|
|2018/03/30||30th of March||Quarter||Q1||Current|
|2018/03/30||30th of March||Quarter||Q1||To Date|
2nd - Factorize your datasets¶
One golden rule : Always start from the design of your app
1st : make a list¶
How? List, story by story, the following elements:
- KPIs / metrics
- Dimensions / decomposition
- What you see for which View
- What you see for which Temporality
2nd : factorize¶
Once every KPIs are listed, try to group them. They can get a common structure/way to be read.
📝For example, your KPIs “Operating Income”
- are made of different categories : submetrics, “Rent” and “Service charges”
- have the same reading keys : value_type, “Actual” and “Budget”
- are analyzed on the same period of time : period_type, “Current” and “To Date” Conclusion : You can put them in ONE unique dataset, as below :
|Operating Income||Rent||Actual||To Date||3|
|Operating Income||Rent||Budget||To Date||1|
|Operating Income||Service charges||Actual||Current||7|
|Operating Income||Service charges||Actual||To Date||9|
|Operating Income||Service charges||Budget||Current||5|
|Operating Income||Service charges||Budget||To Date||1|
3rd - Make your data model understandable¶
To build a powerful data model, you need it to be fully understandable by the team/people that will use it. To do so, include :
- An exhaustive summmary, teasing its content with words :)
- The data formatting best practices, that your client can follow when generating extracts
Include an amazing summary¶
That’s the front page of your data model. It should be understandable in a glance. This is where you list:
- The name of each dataset contained inside the DM
- The name of each column of each dataset
- The type of value expected by in each column (string, float, date, etc)
- The value expected by in these column (for example name of fixed categories, submetrics, period_type)
- The associated stories (name and ID) so your client can link a dataset to what’s actually designed in the application
- A comment field, to detail as much as possible what’s expected in a column, the calculation rules to apply if relevant, etc
Let your client know about the data formatting best practices¶
Plan a call with your data interlocutor, to remind her/him to :
- The name of the column has to be meaningful : we rather get a column named “evolution” instead of “value_2”
- The name of the column doesn’t contain any space, special characters, accent : use “value_type” instead of “value type”
- Apply the same naming convention to all the datasets so it makes sense globally
- Precise how the
nullvalue should be displayed (empty grey cells, “n.a.”, “N/A”, “null”…)
- Do not jump lines inside your datasets
- Use one excel tab per dataset