The Why?
Modeling of Raw data that comes in from business operations is essential to extract any business insights and provide business intelligence. Capturing data for a business has grown exponentially over the years and It is paramount that a good data modeling techniques be followed.
The book The Data Warehouse Toolkit by Ralph Kimball is considered bible and you are required to study if you are joining Shopify, in fact you will be asked if you have read that book. They are really serious about it :)
Any Data Engineer at Facebook or Amazon would ask you to at least read first few chapters to get good insights on why we need data modeling. Shopify blog clearly explain the benefits in data modeling, go check it out here.
Well, now that we have understood the importance of data modeling. Let’s dive in and understand the high level concepts.
Core Idea
As per Kimball, we need to perform below steps in data modeling
- Declare the grain Must be defined by business users - lowest level granular on the raw data.
- Identify dimensions Dimensions are like non quantifiable things like who, what, where, when ,why and how
- Identify facts Almost always numerical items
Uber Eats example
Let us say that, we wanted to model data coming from operations of Uber Eats. What is Uber Eats ? if you are not aware, it’s a food delivery app that lets customers order food from different restaurants.
Let’s establish critical entities from the business process, the way orders comes into the system is, when a customer places food order from a restaurant that he/she, then a driver picks up the package to deliver to the customer address.
- Customer browses through restaurants and their menus
- Places order
- Order is sent to Restaurant
- As restaurant done preparing food, a driver assigned for each order.
Star and snowflake schemas
Now, a little detour to understand schema design, There are primarily 2 schemas in the data modeling world that people talk about.
They are
- Star Schema - Simplified approach, many use cases fall into Star Schema
- Snowflake Schema - Extended on Star Schema with more dimensions for a complex use case
Star and snowflake schema designs are mechanisms to separate facts and dimensions into separate tables. Snowflake schemas further separate the different levels of a hierarchy into separate tables. In either schema design, each table is related to another table with a primary key/foreign key relationship. Primary key/foreign key relationships are used in relational databases to define many-to-one relationships between tables.
In my experience the most simpler way to start off is with Star Schema with one Fact table in the middle that is connecting to all dimension tables. As the dimensions become more complex, that may be broken down into further dimensions resulting in creation of Snowflake Schema.
Let us use simplified approach as we try to build a Star Schema for Uber Eats example.
Congratulations, now you have designed your first data model that captures business operations into set of tables that can be easily queried against for business intelligence.
Basic BI queries
Now that we have the model in place, assuming each data element is getting populated correctly by your ETL jobs. Let us try to answer some basic questions that user might have.
Q1. Find daily orders
Q2. Find the top 3 restaurants by number of orders
Q3. Find customers who placed multiple orders on the same day
There are more concepts to learn in Data modeling such as cardinality, Slowly changing dimensions, 3rd Normalization form , which we will cover in later posts.
Hope this will introduce you to the concepts of Data Modeling. The above is an example you might be asked during DE interviews.
Happy Learning.