Power BI data types in relationships – Does it matter?

If you have a background in traditional database development, you might have learned that using integers instead of strings has performance benefits in, for example, indexes and joins. You might have brought this knowledge with you into the world of Power BI or Tabular development. Perhaps you even enforce integer columns in your Power BI relationships in order to increase performance.

But is this really important in the world of Power BI? How do Power BI data types affect the performance of a model? In this post I will go through how this works. I will also show you the results of tests and benchmarks between four different data models using different types of data in their relationship columns.

What about Power BI?

Power BI, and more specifically the Vertipaq engine, works a bit differently than traditional databases. Vertipaq, the in-memory columnar database engine used by Power BI and other Microsoft products, uses multiple techniques when compressing and storing data. Because of how it handles, compresses and stores its data, you could argue that Power BI is, in some ways, data type independent. This makes it interesting to look at how specific Power BI data types behave in certain scenarios, for example relationship columns.
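To see why the engine can be fairly data type independent, consider a minimal sketch of dictionary encoding (not Vertipaq's actual internals, and the key values are made up): each distinct value is replaced by an integer index into a dictionary, so the index stream stored for the column looks the same whether the source values are integers or text.

```python
# A minimal sketch (not Vertipaq internals) of dictionary encoding.
def dictionary_encode(column):
    """Replace each value with an index into a dictionary of distinct values."""
    dictionary = {}
    indexes = []
    for value in column:
        if value not in dictionary:
            dictionary[value] = len(dictionary)
        indexes.append(dictionary[value])
    return dictionary, indexes

int_keys = [101, 102, 101, 103, 102, 101]
text_keys = ["101", "102", "101", "103", "102", "101"]

_, int_indexes = dictionary_encode(int_keys)
_, text_indexes = dictionary_encode(text_keys)

# The index streams are identical regardless of the original data type;
# only the dictionary holds the actual (typed) values.
print(int_indexes)   # [0, 1, 0, 2, 1, 0]
print(text_indexes)  # [0, 1, 0, 2, 1, 0]
```

Only the dictionary itself differs in size between the two columns, which is a theme that will come back in the benchmark results below.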

Do you want to learn more about how Power BI and Vertipaq compress and store data? Read about it in my article about compression techniques in Power BI.

There’s not that much information out there about Power BI data type performance in relationships.
When I started my Power BI journey, I learned that integers would always be superior in Power BI relationships, and that using string columns in relationships could reduce the performance of a Power BI model significantly.

I know many others who have been taught the same. As a matter of fact, I have also taught this to other people in the past. But this is not strange at all. Even if there’s a lack of documentation on the topic, people tend to pick up whatever is available on the web. I’ve seen claims spread from time to time that integer data types would improve the performance of a data model.
Actually, even Microsoft does it, for example in this official Power BI blog post about BPA (Best Practice Analyzer) rules in Power BI, which states that using integers in Power BI relationship columns is a best practice.

An image showing an example from the official Power BI blog post: "Relationship columns should be of integer data type"

What do people think?

Looking at the Power BI community, there seems to be a common understanding that Whole numbers / integers offer the best performance. This is by no means representative of the whole user base, but it gives a small insight into the common understanding of the topic.

In the Twitter poll, 65% of respondents answered that Whole number, rather than Date or Text, would result in the best performance. The corresponding number from a LinkedIn poll was 88%.

Result of a poll on Twitter.
Result of a poll on LinkedIn.

Some background

Recently, I tried to find information or benchmarks showing the actual performance impact of using different Power BI data types in relationships, for example integer vs. text columns.

After some initial research I found this amazing article by SQLBI comparing the integer and date data types: https://www.sqlbi.com/articles/choosing-between-date-or-integer-to-represent-dates-in-power-bi-and-tabular/

In the article, Marco and Alberto go through all the information needed to understand the difference between the two data types, both from a technical point of view and by testing the performance using DAX Studio. The conclusion of the article is that there is no real difference in performance between the two data types.

However, I still felt like I was missing one piece of information after reading the article. I’ve seen quite a few comments and posts comparing integer and text data types, but not many numbers. Even if I knew the principles were the same, I wanted the numbers. Therefore, I decided to do it myself and compare metrics and performance between some commonly used types of data.

The test

I created four data models in Power BI Desktop that were identical except for one thing: the column used in the relationship between the fact and dimension tables. I tested two ranges of integer values (Whole number), integer values stored as Text, and GUIDs (Text).

I used a modified version of the Contoso Data Generator from SQLBI. You can find the original version here.

The test was based on a small model containing two dimensions and one fact table. One dimension was a date table, which didn’t affect the result of the test. The fact table had around 10 million rows. The original version had only integer surrogate keys. I imported the data into four different data models in Power BI Desktop. All data were identical except for the key column used in the relationship between the fact and dimension table.

I used the following DAX query to test the performance against all four data models separately.

DEFINE
    MEASURE Sales[Sales] =
        CALCULATE (
            SUMX (
                Sales,
                Sales[Net Price] * Sales[Quantity]
            )
        )
EVALUATE
SUMMARIZECOLUMNS (
    'Date'[Year],
    'Date'[Month],
    'Product'[Product Name],
    "Sales", [Sales]
)


I used DAX Studio to benchmark the query performance and Vertipaq Analyzer in DAX Studio to analyze the dataset.

Do you want to know more about how to use DAX Studio with Power BI? You can read about it in my article here.

Test model #1 – Whole number

The key columns contained a sequence of values starting from 1.

Benchmark result

Benchmark result from DAX Studio

Average cold query execution time: 1955,4 milliseconds

Average warm query execution time: 1650 milliseconds

Relationship columns stats and size

Results from Vertipaq Analyzer in DAX Studio

Column Sales[ProductKey] is 14,93 MB and column Product[ProductKey] is 0,1 MB.

The ProductKey column is stored as an integer, which would normally be encoded with VALUE encoding. However, because it’s used in a relationship, the encoding type is replaced with HASH encoding.
You can also see that the dictionary size and cardinality are the same across the two columns. However, the data size differs due to the different number of rows in the tables.
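This pattern can be sketched in a few lines of hypothetical code (the key values are made up): the fact and dimension key columns contain the same distinct values, so their dictionaries match, while the data (index) stream grows with the number of rows.

```python
# Sketch: same distinct values => same dictionary; data size scales with rows.
def encode(column):
    """Dictionary-encode a column: return (sorted dictionary, index stream)."""
    dictionary = sorted(set(column))
    lookup = {value: index for index, value in enumerate(dictionary)}
    return dictionary, [lookup[value] for value in column]

dim_keys = [1, 2, 3, 4]               # one row per product
fact_keys = [1, 1, 2, 4, 3, 1, 2, 2]  # many fact rows per product

dim_dictionary, dim_data = encode(dim_keys)
fact_dictionary, fact_data = encode(fact_keys)

print(dim_dictionary == fact_dictionary)  # True: same cardinality, same dictionary
print(len(fact_data) > len(dim_data))     # True: data size grows with row count
```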

Test model #2 – Whole number

The key columns contained a sequence of values starting from 10000000001.
This data model is an exact copy of data model #1, with the exception that the surrogate key column has higher integer values.

Benchmark result

Benchmark result from DAX Studio

Average cold query execution time: 1949,8 milliseconds

Average warm query execution time: 1776,2 milliseconds

Relationship columns stats and size

Results from Vertipaq Analyzer in DAX Studio

There’s one interesting thing to notice here: the size of the dictionary is around twice the size of the previous model’s, even though both models store the column as whole numbers (Int64) in Power BI. The only difference is the magnitude of the integers, which affects the storage required per value and therefore also the size of the dictionary.
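A rough illustration of this effect (not Vertipaq's actual storage format): with the same cardinality, larger values need more bits per dictionary entry. The key ranges below mirror the styles of models #1 and #2, but are shortened for the example.

```python
# Rough sketch: larger integer values => wider dictionary entries.
def min_bits(values):
    """Bits needed to represent the largest value in the column."""
    return max(value.bit_length() for value in values)

small_keys = range(1, 10_001)                       # model #1 style keys
large_keys = range(10_000_000_001, 10_000_010_001)  # model #2 style keys

# Same cardinality (10 000 distinct values) in both cases,
# but each entry in the second dictionary needs far more bits.
print(min_bits(small_keys))  # 14
print(min_bits(large_keys))  # 34
```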

Test model #3 – Integer as Text data type

This data model is an exact copy of data model #1, with the exception that the key column is stored as Text instead of Whole number.

Benchmark result

Benchmark result from DAX Studio

Average cold query execution time: 1986,80 ms

Average warm query execution time: 1742,2 ms

Relationship columns stats and size:

Results from Vertipaq Analyzer in DAX Studio

Test model #4 – GUIDs (Text)

This data model contains GUIDs (globally unique identifiers), which are commonly found in various source systems. A GUID is a 128-bit value, usually stored as a text string of 32 hexadecimal digits (36 characters including hyphens), which is statistically unique.
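For reference, this is what such a key looks like; Python's standard `uuid` module generates random GUIDs of the same shape as the ones used in this model.

```python
import uuid

# Generate a random GUID, like the text keys used in test model #4.
guid = uuid.uuid4()
text_key = str(guid)

print(text_key)       # e.g. '1b9d6bcd-bbfd-4b2d-9b5d-ab8dfbbd4bed'
print(len(text_key))  # 36 characters: 32 hex digits plus 4 hyphens
print(guid.int.bit_length() <= 128)  # True: a GUID is a 128-bit value
```

Each dictionary entry for such a column stores a 36-character string, which is why the dictionary is so much larger than for the integer keys, even at the same cardinality.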

Benchmark result

Benchmark result from DAX Studio

Average cold query execution time: 1971,4 ms

Average warm query execution time: 1673,4 ms

Relationship columns stats and size:

Results from Vertipaq Analyzer in DAX Studio

Summary

Storage comparison

Storage summary table from all test

Let’s take a closer look at the numbers.
We can see that the two integer columns (models #1 and #2) produce the smallest column size in the fact tables, but not in the dimension table. However, it is worth mentioning that the size of the fact table column changed with every refresh; sometimes, model #2 produced the biggest size. This is probably the result of a different sort order causing different RLE encodings for each refresh, hence the varying total size of the column. Therefore, we can’t draw any conclusions based on these numbers. The dimension table, however, has only one segment and therefore returns the same size every time. The dictionary also always remained the same between refreshes, as expected.
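A minimal run-length encoding (RLE) sketch shows why sort order matters here: run lengths depend entirely on how the rows are ordered, so the same data can compress to a different size after each refresh. The data values are made up.

```python
# Minimal RLE sketch: compress a list into [value, run_length] pairs.
def rle(values):
    runs = []
    for value in values:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1  # extend the current run
        else:
            runs.append([value, 1])  # start a new run
    return runs

data = [1, 1, 1, 2, 2, 3, 1, 1]
print(rle(data))          # [[1, 3], [2, 2], [3, 1], [1, 2]]
print(rle(sorted(data)))  # [[1, 5], [2, 2], [3, 1]] -- fewer, longer runs
```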

There’s a rather big difference in dictionary size between model #1 and #2, even though both columns are integers with the same cardinality (number of unique values). The dictionary in model #2 is around 100% bigger than in model #1. The reason for this is that the column in model #2 requires more storage for each value in the dictionary, as the values are larger than in model #1.

Model #3 has the smallest dictionary size of all tested models. It stores the same values as model #1, but as the Text data type. Because of this, the dimension table in model #3 has the smallest column size of all models. The fact table size is slightly bigger, but it’s hard to draw any conclusions from that because the size changes with every refresh.

Query times comparison

Query execution summary table from all test

Looking at these numbers, it’s hard to see any difference in query time in either cold or warm query executions. The numbers are very close across all models, no matter whether the data type is Whole number or Text.

Conclusion

Looking at the results from these tests, we can conclude that different Power BI data types do not seem to have an impact on performance.

In general, there was no major difference in either storage or execution times across the different models. However, there was one pattern I could see across the several benchmarks I executed.

Columns with a larger dictionary and column size tended to run slightly longer, by around a couple of percent. This is expected behavior: a larger dictionary and column size increase the memory footprint, which also affects query execution times. It is therefore not the data type that has an impact on performance, but the actual values stored in the column. The larger the stored values (in bytes), the larger the dictionary, which results in a larger memory footprint.

Remember, as I mentioned earlier in the article, all columns used in a relationship are encoded using HASH encoding. It is therefore the size of the dictionary that has the biggest impact on performance in this case. Don’t forget to read about encoding methods, dictionaries and other useful information about Vertipaq storage techniques in my article here.

Thank you for reading

That is all for now. I hope that you enjoyed this post and that you learned something from it.
If you did, please leave a comment below.

