Data modeling techniques and methodologies are a common discussion topic when working on BI or data warehouse projects. There are multiple data modeling practices, and different reporting tools support different ways to handle data. If you have past experience working in the data area, chances are you have come across Kimball dimensional modeling, often referred to as star schema modeling. This article will help you master dimensional modeling in Power BI.
In this article, I will briefly go through the core concepts of dimensional modeling, as well as how they apply to Power BI and tabular models.
What is dimensional modeling?
Dimensional modeling is a combination of methodologies and modeling techniques aimed towards building data models for business intelligence, data analysis and reporting purposes.
Kimball modeling, created by Ralph Kimball, has been around for decades and has evolved into one of the best-practice ways, if not the best, to build analytical and reporting-friendly data models.
Dimensional modeling is a data warehouse concept. You often find this kind of data model in a data warehouse or data mart. However, data warehouses often store data using other modeling techniques, such as Data Vault or Inmon 3NF. In those cases, dimensional models are typically found in data marts, where they are used as a presentation layer for reporting and analytical purposes.
It is worth mentioning, however, that some reporting tools offer support for building their own data models. As a result, dimensional modeling techniques can be used to model data directly in a reporting layer.
A dimensional model, at its core, separates measurable data from descriptive data. In other words, it separates data that can somehow be summarized from data that holds the textual representation of what is measured.
Two main types of tables store the data: fact tables and dimension tables. Fact tables store the metrics, which can be things like sold amount or number of sold articles.
Dimension tables store the descriptive values and attributes, for example customer name, article name or the year and month of a sales order. Descriptive values are typically the ones you use when grouping and filtering data.
As a result, fact tables are usually narrow and tall (fewer columns and many rows), while dimension tables are usually wide and short (more columns and fewer rows).
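To make fact and dimension tables concrete, here is a minimal sketch of a one-fact, one-dimension star schema using Python's built-in sqlite3 module. All table and column names (fact_sales, dim_customer, customer_key) and values are illustrative assumptions, not taken from any specific system.

```python
import sqlite3

# Minimal, hypothetical star schema: one fact table and one dimension table.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE dim_customer (          -- wide and short: descriptive attributes
        customer_key INTEGER PRIMARY KEY,
        customer_name TEXT,
        country TEXT
    );
    CREATE TABLE fact_sales (            -- narrow and tall: one row per sale
        customer_key INTEGER,            -- points to dim_customer
        quantity INTEGER,
        amount REAL
    );
    INSERT INTO dim_customer VALUES (1, 'Contoso', 'Sweden'), (2, 'Northwind', 'Norway');
    INSERT INTO fact_sales VALUES (1, 2, 100.0), (1, 1, 50.0), (2, 5, 300.0);
""")

# Metrics come from the fact table; grouping and filtering use the dimension.
rows = con.execute("""
    SELECT d.country, SUM(f.amount) AS total_amount
    FROM fact_sales AS f
    JOIN dim_customer AS d ON d.customer_key = f.customer_key
    GROUP BY d.country
    ORDER BY d.country
""").fetchall()
print(rows)  # [('Norway', 300.0), ('Sweden', 150.0)]
```

Note how the fact table holds only keys and measures, while every attribute you would group or filter on lives in the dimension.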
The basics and purpose of dimensional modeling
There are multiple ways to model data, depending on the use case. For example, an ERP or operational system has completely different needs than an analytical system. A data model of an operational system, such as an ERP, is most likely modelled in a highly normalized way.
Operational data models
Database normalization involves structuring relational data to fit certain needs. For example, data models in an ERP system aim to support the operational day-to-day functionality in the best possible way. Typical characteristics of such data models are:
- Enforcement of data integrity, e.g. primary and foreign key relationships.
- Efficient data storage and avoidance of redundant data. Data models of an operational system can often consist of thousands of tables.
- Support for multiple DML operations (like select, update, insert, delete), often on single rows or small sets of rows, with performance tuned towards such operations.
The typical use case for an operational system is to insert and update single rows, or a small number of rows, in order to support the daily operational workflow, such as creating a sales order, registering a new customer or placing a purchase order.
These characteristics are not beneficial from an analytical or reporting perspective. Even getting data out of an operational system can prove to be a challenging task. The data models are often huge, and the extraction possibilities are sometimes limited and slow.
Dimensional data models
The purpose of a dimensional model is to support data analysis scenarios. It is a highly denormalized data model. Some areas that define a dimensional model are:
- It has a simplified data model compared to operational sources
- Increased performance for querying large amounts of data, often hundreds of millions or billions of rows
- Can include data from multiple (operational) sources
- Data includes business logic and metrics for analysing business events
- One star schema = one business process along with its corresponding metrics and attributes
A dimensional model has the following characteristics:
- Combines data from multiple tables into fewer tables
- Little to no enforcement of data integrity
- Has data redundancy
- Requires fewer joins to get data
- Allows fast querying (read operations) but is not optimal for updates
- Is often built as insert-only
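As a sketch of what "combining data from multiple tables into fewer tables" means in practice, here is a hypothetical denormalization of a normalized customer/city/country chain into one wide, redundant dimension table. All names and values are made up for illustration.

```python
# Hypothetical normalized source: customers reference a city, cities reference a country.
countries = {10: "Sweden", 20: "Norway"}
cities = {100: {"name": "Stockholm", "country_id": 10},
          200: {"name": "Oslo", "country_id": 20}}
customers = [{"id": 1, "name": "Contoso", "city_id": 100},
             {"id": 2, "name": "Northwind", "city_id": 200}]

# Denormalize: flatten the lookup chain into one wide dimension table.
dim_customer = []
for c in customers:
    city = cities[c["city_id"]]
    dim_customer.append({
        "customer_key": c["id"],
        "customer_name": c["name"],
        "city": city["name"],
        "country": countries[city["country_id"]],  # redundant by design
    })

print(dim_customer[0])
# {'customer_key': 1, 'customer_name': 'Contoso', 'city': 'Stockholm', 'country': 'Sweden'}
```

Three normalized tables become one, at the cost of repeating the country value on every customer row. Querying now needs no joins at all.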
Defining business metrics
A dimensional model has a business-process oriented structure with defined business metrics.
Defining a business metric can prove to be very challenging, although some metrics are easy to follow and understand. What exactly do I mean by that? Let me try to give you an understanding of the usual challenges when defining these metrics.
A business usually wants to analyze and measure sales. It’s a typical process within a company.
So how do you define a sales event that you want to analyze? Defining it in general terms, like sales orders made by customers, sounds easy, but it doesn’t tell you exactly what it should actually include. Consider the following:
- Should all sales orders be included? Perhaps only invoiced orders?
- Are there different types of sales orders? How should they be represented?
- What happens if the customer returns the sold products or cancels the order? Will they get a refund?
- What happens if the customer doesn’t pay or pays late?
- Should sales be measured on order date, invoice date or payment date? Perhaps all of them?
- Should sales orders from internal customers (like employees) be included?
Similarly, the same type of challenging questions arise when trying to define other core metrics.
For example, what is a customer, or how many customers do we have? Perhaps someone would answer by saying it’s the number of persons or companies registered as customers in the ERP. Well, things are usually not that easy. To demonstrate, let me give some examples:
- Should a customer who made a purchase 10 years ago be considered a customer today? Where is the cutoff date? Perhaps you need to split customers into different states of activity?
- What if a customer made only one purchase and then returned the products?
- What if a customer placed an order that is not yet invoiced or paid?
- Should customers that got a quote be counted as customers?
- How should you handle duplicates? Or let’s say you have a customer with 100 subsidiaries. Does that count as 100 customers or one?
These are just a few examples of the challenges of defining business metrics in the world of analytics and business intelligence. Identifying and defining metrics within an organization is an important process. This can sometimes be a complex task, and sometimes not even 100% clear to the business.
Operational systems store the raw data, but they do not typically store the business logic needed to define business metrics from a data analysis perspective.
It’s preferable to build this logic outside the operational systems, e.g. in a data warehouse solution or in reporting systems that support data modeling. Some examples of why:
- Data can come from multiple data sources. Perhaps you want to analyze and combine data coming from both a CRM and an ERP system.
- Need for data historization. Operational data models rarely keep historical values in a structured way.
- Business logic can be hard or even impossible to apply in the operational system. E.g. querying, filtering or adding logic on top of data. These kind of data operations can be very limited.
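For the historization point above, one common pattern (not prescribed by this article) is a Type 2 slowly changing dimension: instead of updating an attribute in place, the current row is closed and a new version is inserted, so history is preserved. A minimal sketch with made-up column names:

```python
from datetime import date

OPEN_END = date(9999, 12, 31)  # convention for the currently valid row

# Hypothetical Type 2 slowly changing dimension, one row per attribute version.
dim_customer = [
    {"customer_key": 1, "customer_id": "C1", "segment": "Retail",
     "valid_from": date(2020, 1, 1), "valid_to": OPEN_END},
]

def apply_change(dim, customer_id, new_segment, change_date):
    """Close the open row for customer_id and append the new version."""
    for row in dim:
        if row["customer_id"] == customer_id and row["valid_to"] == OPEN_END:
            row["valid_to"] = change_date
    dim.append({"customer_key": len(dim) + 1, "customer_id": customer_id,
                "segment": new_segment, "valid_from": change_date,
                "valid_to": OPEN_END})

apply_change(dim_customer, "C1", "Enterprise", date(2023, 6, 1))
print(len(dim_customer))  # 2 rows: the old "Retail" version is kept, closed in time
```

The dimension is insert-only from this point of view: old attribute values are never lost, which operational systems rarely guarantee.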
It’s okay if you find this too abstract. To simplify, let me put it like this.
The process of denormalizing data and building a dimensional model can be seen as throwing your data up in the air and then picking it up bit by bit, in a new structure.
This includes, but is not limited to, building new tables, new columns, new relationships, new logic, new names and defining metrics.
Do I always need to model my data? No, there are absolutely valid reasons for reporting against a normalized data model. Many times you might want to create operational reports on top of your ERP or CRM system. These kinds of reports might only focus on the current state of a small subset of data.
This is a perfectly fine scenario and a very common thing to do. There are often times when you do not need the benefits of a data warehouse or a denormalized data model.
Dimensional Star schema in Power BI
Power BI is a model-based tool. This means that there is a dataset / data model beneath the reporting layer. Therefore, Power BI allows for importing star schemas and keeping the model as is, including relationships between tables. Moreover, Power BI supports importing or creating multiple star schemas and multiple fact tables in the same dataset, called a fact constellation schema or galaxy schema. This essentially means having multiple fact tables that connect (or can connect) to the same dimensions.
Power BI not only supports the use of dimensional models or star schemas; it’s a best practice for building datasets. A general recommendation is to make sure that your data model is in good shape and modelled as well as possible. This will not only make it easier to write your DAX measures in Power BI, it will also generally improve your model’s performance.
The source data passes through this layer in different ways depending on the type of model. Importing and storing it in the Power BI data model (Import mode) is a common practice. In fact, it is the best practice if you want to get the most out of your data model, unless you have specific scenarios that call for other modes, like DirectQuery. DirectQuery (DQ) sends queries from Power BI directly to the data source when a user interacts with the report, instead of storing the data inside the Power BI data model.
It’s also possible to reuse Power BI datasets across multiple reports. By reusing existing data models and logic in multiple reports across the organisation, you can avoid building the same critical logic in multiple places. Furthermore, this also enables Power BI to act as a semantic layer, even at enterprise scale.
Another reason why you should always use a dimensional model in Power BI is DAX Auto-Exist. Auto-Exist is a built-in optimization technique that kicks in whenever you apply filters on more than one column of the same table. When this happens, the DAX engine creates a filter of the unique combinations of values based on the selection. The side effect of this technique, however, is that it can sometimes cause incorrect values in your calculations (measures).
This will only happen when you apply multiple filters on the same table that you are doing calculations on. Therefore, this effect will most likely not occur if you have a proper star schema in Power BI, because the filters are set on the dimensions and the calculations are done on the fact tables.
There is a great article explaining Auto-Exists here, written by Alberto Ferrari on SQLBI.
Dimensional model performance
Performance from a reporting standpoint can vary between reporting tools. In some cases, and from a purely performance viewpoint, a single large table is the best choice (or the only choice) offered by a reporting tool. However, these large tables introduce other challenges and drawbacks (I’m covering some of them further down in this article).
One of the first things that comes to mind is being able to manage attribute values, like updating existing values or adding new columns. In the world of Kimball, this is handled by storing descriptive values in separate dimension tables. Doing these kinds of operations over large flat tables that include transactional metrics can be messy. Imagine having hundreds of millions of rows and needing to update text attributes stored in the table.
Compared to a dimensional model, a single table will decrease the readability of the data. Furthermore, it will most likely increase the size of the underlying dataset. Having a denormalized dimensional model is a good way to get the best performance while also increasing readability, lowering model size and making it easier to manage your model.
When it comes to performance benchmarks of Power BI and tabular models, there’s an excellent article released by SQLBI, covering performance comparisons between star schemas and flat tables.
The article covers multiple benchmarks across different scenarios. The result is clear: a proper dimensional model will, in most cases, outperform a single table. The few cases where a single table performs better by no means outweigh the many benefits of the dimensional model. From a Power BI perspective, it’s clear that a dimensional model is the best practice, for multiple reasons.
Multiple fact table support
Creating reports and analyzing data typically involves multiple types of metrics from multiple business processes, as well as attributes across multiple dimensions. For example, you might want to compare things like sales against budget. This data typically comes from different tables.
Not all reporting tools offer support for building or importing multiple tables or data models, let alone having multiple fact tables with relationships to conformed dimensions.
In these scenarios, in order to work with multiple fact tables, you typically need to do some data transformation. A typical and reasonable solution is to join multiple fact (or other) tables into one large table. However, this can introduce other challenges, for example if the fact tables have different grains of data.
One of the side effects of joining tables with different grains is that you end up with duplicated values in your tables. As a result, you need to handle situations where values will most likely be summarized incorrectly.
For example, let’s say you have a transactional sales table where each row represents a sales transaction, and you need to join it with a monthly budget table. In this case, the budget values will get duplicated across every sales row within each month. Hence, you will get values that can’t simply be summarized.
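The double-counting effect can be sketched in a few lines of Python (the numbers are made up): joining a monthly budget onto transaction-grain sales duplicates the budget value for every transaction in that month, so a plain sum over the joined table is wrong.

```python
# Hypothetical data: sales at transaction grain, budget at month grain.
sales = [
    {"month": "2024-01", "amount": 100},
    {"month": "2024-01", "amount": 250},
    {"month": "2024-02", "amount": 400},
]
budget = {"2024-01": 300, "2024-02": 500}

# Join the monthly budget onto each transaction row.
joined = [{**s, "budget": budget[s["month"]]} for s in sales]

true_budget = sum(budget.values())               # 800
naive_budget = sum(r["budget"] for r in joined)  # 1100: January's budget counted twice
print(true_budget, naive_budget)
```

Keeping the two fact tables separate, each at its own grain, avoids having to work around this with special aggregation logic.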
As mentioned earlier in this article, Power BI is a model based tool. Power BI also has support for using multiple fact tables in your model. As a result, if you have a scenario as described above, you do not need to do any data manipulations like joining tables. You can simply import the tables as is and build relationships between the fact and dimension tables. Power BI also supports having relationships between multiple fact tables to the same dimension table, which brings us to the next topic, conformed dimensions.
A conformed dimension is a dimension that is related to multiple fact tables, where all dimensional attributes have the same meaning for every fact table they relate to. This is an important core concept of dimensional modeling, as it allows for reusability of business dimensions and attributes across multiple star schemas. Building dimension tables often includes data transformations and data cleaning, which in this case can also be reused.
Conformed dimensions are a useful concept when dealing with multiple fact tables, and Power BI supports them as well. They allow grouping and filtering on attributes from a single dimension table that filters multiple fact tables in a model.
Why is this useful? First of all, consider having multiple large flat tables in a model instead of a star schema. This makes it harder to connect metrics to each other through common filter attributes without adding relationships between the tables. However, adding relationships between fact tables is considered bad practice, both in the world of Kimball and in Power BI. Adding this kind of relationship in Power BI forces you to use bi-directional relationships, which can have a negative impact on performance.
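To illustrate the idea, here is a hypothetical conformed date dimension shared by a sales fact and a budget fact, again sketched with Python's built-in sqlite3 module: a single filter on the dimension slices both facts, with no relationship between the fact tables themselves. All names and values are illustrative.

```python
import sqlite3

# Galaxy schema sketch: two fact tables sharing one conformed date dimension.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE dim_date (date_key TEXT PRIMARY KEY, year INTEGER);
    CREATE TABLE fact_sales (date_key TEXT, amount REAL);
    CREATE TABLE fact_budget (date_key TEXT, amount REAL);
    INSERT INTO dim_date VALUES ('2023-12', 2023), ('2024-01', 2024);
    INSERT INTO fact_sales VALUES ('2023-12', 900.0), ('2024-01', 150.0);
    INSERT INTO fact_budget VALUES ('2024-01', 200.0);
""")

# One filter attribute (year) on the dimension slices both fact tables.
sales_2024 = con.execute("""
    SELECT SUM(f.amount) FROM fact_sales AS f
    JOIN dim_date AS d ON d.date_key = f.date_key WHERE d.year = 2024
""").fetchone()[0]
budget_2024 = con.execute("""
    SELECT SUM(f.amount) FROM fact_budget AS f
    JOIN dim_date AS d ON d.date_key = f.date_key WHERE d.year = 2024
""").fetchone()[0]
print(sales_2024, budget_2024)  # 150.0 200.0
```

Each fact keeps its own grain, and sales vs. budget comparisons only ever meet through the shared dimension.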
Dimensional model – for the sake of simplicity
As I’ve gone through in this article, having a proper dimensional model offers several benefits.
It not only simplifies report creation, provides a logical structure for metrics and improves performance when querying large amounts of data; it also helps with understanding and navigating the data.
Simplicity is one of the purposes of the denormalized star schema structure. The model holds a simplified representation of metrics and data, and its structure is based around a specific business process. As a result, it simplifies navigating the data, maintaining the model and making changes to the data model, both from the perspective of a report creator and of a business user.
In a world with an emerging need for self-service BI, having a structured and simplified data model is crucial. It enables self-service BI users to find data and build custom reporting on top of the model. Compare this to a normalized data structure with data spread out across perhaps hundreds or thousands of tables, or to data from multiple business domains crammed into a single flat table, which makes the data harder to understand.
Lastly, having a structured, simple and performant data model will increase buy-in in your organization. It’s an enabler for getting people to start using data, finding insights and becoming data driven.
Is Kimball still relevant?
Kimball dimensional modeling has been around for decades. It was created at a time when storage was expensive and computing was limited. In the modern data and analytics world, we face completely different limitations. Storage is cheap, computing can scale like never before using cloud solutions, and new technology has emerged around data warehousing, like column-oriented database solutions and MPP architectures that spread the computing across multiple nodes.
With this in mind, some of the original incentives for building a denormalized dimensional model are not as important as they once were. For example, having wide and long tables in a columnar storage database does not affect performance the way it does in a traditional row-based database. The same goes for Power BI, as it also uses columnar storage.
However, apart from some performance aspects of data storage, there are still many valid reasons to use dimensional modeling. For example, being able to manage your data operations in a structured way, or lowering the bar for people in your organization, technical or not, to use and understand your data! Also, from a Power BI perspective, it’s clear that a dimensional star schema offers the best performance.
Thank you for reading
I hope you enjoyed this article as much as I did writing it. If you did, please share it with others or comment below.