Power BI data types in relationships – Does it matter?

If you have a background in traditional database development, you might have learned that using integer keys instead of, for example, string keys has performance benefits in indexes and joins. You might have brought this knowledge with you into the world of Power BI or tabular development. Perhaps you are even enforcing the use of integer columns in your Power BI relationships in order to increase performance.

But is this really important in the world of Power BI? How do Power BI data types affect the performance of a model? In this post I will go through how this works. I will also show you the results of tests and benchmarks across four different data models using different types of data in their relationship columns.

What about Power BI?

Power BI, and more specifically the Vertipaq engine, works a bit differently than traditional databases. Vertipaq, the in-memory columnar database engine used by Power BI and other Microsoft products, uses multiple techniques when compressing and storing data. Because of how it handles, compresses and stores its data, you could argue that Power BI is, in some ways, data type independent. This makes it interesting to look at how specific Power BI data types behave in certain scenarios, for example relationship columns.

Do you want to learn more about how Power BI and Vertipaq compress and store data? Read about it in my article about compression techniques in Power BI.

There’s not that much information out there about Power BI data type performance in relationships.
When I started my Power BI journey, I learned that integer would always be superior in Power BI relationships and that using string columns in relationships could reduce the performance of a Power BI model significantly.

I know many others who have been taught the same. As a matter of fact, I have taught this to other people in the past myself. But this is not strange at all. Even with the lack of documentation about the topic, people tend to pick up whatever is available on the web, and the claim that integer data types improve the performance of a data model has been spread from time to time.
Even Microsoft does it, for example in this official Power BI blog post about Best Practice Analyzer (BPA) rules in Power BI, which states that using integers in Power BI relationship columns is a best practice.

An example from the official Power BI blog post showing the rule “Relationship columns should be of integer data type”

What do people think?

Looking at the Power BI community, there seems to be a common understanding that Whole numbers (integers) offer the best performance. This is by no means representative of the whole user base, but it gives a small insight into the common understanding of the topic.

In a Twitter poll asking whether Whole number, Date or Text resulted in the best performance, 65% of respondents answered Whole number. In a LinkedIn poll, the number was 88%.

Result of a poll on Twitter.
Result of a poll on LinkedIn.

Some background

Recently I tried to find information or benchmarks showing the actual performance impact of using different Power BI data types in relationships, for example integer vs. text columns.

After some initial research I found this amazing article by SQLBI comparing the integer and date data types: https://www.sqlbi.com/articles/choosing-between-date-or-integer-to-represent-dates-in-power-bi-and-tabular/

In the article, Marco and Alberto go through everything needed to understand the difference between the two data types, both from a technical point of view and by testing the performance using DAX Studio. The conclusion of the article is that there is no real difference in performance between the two data types.

However, I still felt like I was missing one piece of information after reading the article. I’ve seen quite a few comments and posts comparing integer and text data types, but not many numbers. Even if I knew that the principles were the same, I wanted the numbers. Therefore, I decided to do it myself and compare metrics and performance between some commonly used types of data.

The test

I created four identical data models in Power BI Desktop with only one difference: the column used in the relationship between the fact and dimension tables. I tested using integer values (Whole numbers), integer values stored as Text, and GUIDs (Text).

I used a modified version of the Contoso Data Generator from SQLBI. You can find the original version here.

The test was based on a small model containing two dimensions and one fact table. One dimension was a date table, which didn’t affect the result of the test. The fact table had around 10 million rows. The original version had only integer surrogate keys. I imported the data into four different data models in Power BI Desktop. All data were identical except for the key column used in the relationship between the fact and dimension table.
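Before running the benchmarks, it is worth verifying that the models really are comparable. Here is a minimal sanity check you could run in DAX Studio; it is a sketch, assuming a fact table named Sales with the key column ProductKey, as in the models below:

EVALUATE
ROW (
    "Fact rows", COUNTROWS ( Sales ),
    "Distinct keys", DISTINCTCOUNT ( Sales[ProductKey] )
)

Both numbers should be identical across all four models; only the data type of the key column differs.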

I used the following DAX query to test the performance against all four data models separately.

DEFINE
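    -- Total sales: Net Price * Quantity, summed row by row over the fact table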
    MEASURE Sales[Sales] =
        CALCULATE (
            SUMX (
                Sales,
                Sales[Net Price] * Sales[Quantity]
            )
        )

EVALUATE
SUMMARIZECOLUMNS (
    'Date'[Year],
    'Date'[Month],
    'Product'[Product Name],
    "Sales", [Sales]
)


I used DAX Studio to benchmark the query performance and Vertipaq Analyzer in DAX Studio to analyze the dataset.

Do you want to know more about how to use DAX Studio with Power BI? You can read about it in my article here.

Test model #1 – Whole number

The key columns contained a sequence of values starting from 1.

Benchmark result

Benchmark result from DAX Studio

Average cold query execution time: 1955.4 milliseconds

Average warm query execution time: 1650 milliseconds

Relationship columns stats and size

Results from Vertipaq Analyzer in DAX Studio

The Sales[ProductKey] column size is 14.93 MB and the Product[ProductKey] column size is 0.1 MB.

The ProductKey column is stored as an integer. This would normally be encoded with VALUE encoding. However, because it’s used in a relationship, the encoding type is replaced with HASH encoding.
You can also see that the dictionary size and cardinality are the same across the two columns. However, the data size differs due to the different number of rows in the tables.
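If you want to verify the encoding and dictionary size yourself, you can query the model metadata directly. Below is a sketch, assuming a recent Power BI or Analysis Services build where the DAX INFO functions are available (the same information is exposed through the $SYSTEM.DISCOVER_STORAGE_TABLE_COLUMNS DMV in DAX Studio):

-- Lists encoding and dictionary size per column
-- COLUMN_ENCODING: 1 = HASH, 2 = VALUE
EVALUATE
SELECTCOLUMNS (
    INFO.STORAGETABLECOLUMNS (),
    "Table", [TABLE_ID],
    "Column", [COLUMN_ID],
    "Encoding", [COLUMN_ENCODING],
    "Dictionary size (bytes)", [DICTIONARY_SIZE]
)

Relationship columns should show up with encoding 1 (HASH), regardless of their data type.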

Test model #2 – Whole number (large values)

The key columns contained a sequence of values starting from 10,000,000,001.
This data model is an exact copy of data model #1, with the exception that the surrogate key column holds higher integer values.

Benchmark result

Benchmark result from DAX Studio

Average cold query execution time: 1949.8 milliseconds

Average warm query execution time: 1776.2 milliseconds

Relationship columns stats and size

Results from Vertipaq Analyzer in DAX Studio

There’s one interesting thing to notice here. The size of the dictionary is around twice the size of the previous model, even though both models store the column as Whole number (Int64) in Power BI. The only difference is the magnitude of the integers: larger values require more storage per dictionary entry, which also increases the size of the dictionary.
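To see the difference in value magnitude that drives the dictionary size, here is a quick sketch you could run against models #1 and #2 (same assumed table and column names as above):

EVALUATE
ROW (
    "Min key", MIN ( Sales[ProductKey] ),
    "Max key", MAX ( Sales[ProductKey] )
)

In model #1 the keys start at 1, while in model #2 they start at 10,000,000,001, so each dictionary entry needs more bytes of storage.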

Test model #3 – Integer as Text data type

This data model is an exact copy of data model #1, with the exception that the key column is stored as Text instead of Whole number.

Benchmark result

Benchmark result from DAX Studio

Average cold query execution time: 1986.80 ms

Average warm query execution time: 1742.2 ms

Relationship columns stats and size

Results from Vertipaq Analyzer in DAX Studio

Test model #4 – GUIDs (Text)

This data model contains GUIDs (globally unique identifiers), which are commonly found in various source systems. A GUID is a 128-bit value, usually written as a text string of 32 hexadecimal digits (36 characters including hyphens), which is statistically unique.
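Because hash-encoded dictionaries store the actual key strings, the length of each key matters for the dictionary size. Here is a quick sketch to inspect cardinality and key length, again assuming the GUID key is stored as text in Sales[ProductKey]:

EVALUATE
ROW (
    "Distinct keys", DISTINCTCOUNT ( Sales[ProductKey] ),
    "Key length (characters)", MAXX ( VALUES ( Sales[ProductKey] ), LEN ( Sales[ProductKey] ) )
)

With GUIDs, every key is 36 characters long, which is why the dictionary grows compared to shorter integer-as-text keys.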

Benchmark result

Benchmark result from DAX Studio

Average cold query execution time: 1971.4 ms

Average warm query execution time: 1673.4 ms

Relationship columns stats and size

Results from Vertipaq Analyzer in DAX Studio

Summary

Storage comparison

Storage summary table from all tests

Let’s take a closer look at the numbers.
We can see that the two integer columns (models #1 and #2) produce the smallest column size in the fact tables, but not in the dimension table. However, it is worth mentioning that the size of the fact table column changed with every refresh; sometimes model #2 produced the biggest size. This is probably the result of a different sort order causing different RLE encodings for each refresh, hence the different total size of the column. Therefore, we can’t draw any conclusions from these numbers. The dimension table, however, has only one segment and therefore returns the same size every time. The dictionary size also remained the same between refreshes, as expected.

There’s a rather big difference in dictionary size between models #1 and #2, even though both are integers with the same cardinality (number of unique values). The dictionary in model #2 is around 100% bigger than in model #1. The reason is that the column in model #2 requires more storage for each value in the dictionary, as the values are larger than in model #1.

Model #3 has the smallest dictionary size of all tested models. It stores the same values as model #1, but as the Text data type. Because of this, the dimension table in model #3 has the smallest column size of all models. The fact table column size is slightly bigger, but it’s hard to draw any conclusions from that because the size changes with every refresh.

Query times comparison

Query execution summary table from all tests

Looking at these numbers, it’s hard to see any difference in query time in either cold or warm query executions. The numbers are very close to each other across all models, no matter if the data type is Whole number or Text.

Conclusion

Looking at the results from these tests, we can conclude that different Power BI data types do not seem to have an impact on performance.

In general, there was no major difference in either storage or execution times across the different models. However, there is one pattern I could see throughout this test and the several benchmarks executed.

Columns with a larger dictionary and column size tended to run slightly longer; by slightly longer I mean a difference of around a couple of percent. This is, however, expected behavior. A larger dictionary and column size increase the memory footprint and thereby also affect query execution times. It is therefore not the data type itself that has an impact on performance, but the actual values stored in the column. The larger the stored values (in bytes), the larger the dictionary, which results in a larger memory footprint.

Remember, as I mentioned earlier in the article, all columns used in a relationship are encoded using HASH encoding. It is therefore the size of the dictionary that has the biggest impact on performance in this case. Don’t forget to read about encoding methods, dictionaries and other useful information about Vertipaq storage techniques in my article here.

Thank you for reading

That is all for now. I hope that you enjoyed this post and that you learned something from it.
If you did, please leave a comment below.

15 thoughts on “Power BI data types in relationships – Does it matter?”

  1. I have figured out why there were no comments. The “post comment” button is invisible until you hover over it. You might want to look into that! (I’m on MS Edge)

  2. I can’t believe no one has replied to this great article yet! Thanks a lot for taking the time to investigate this.

    Now that you have kind of concluded that it doesn’t matter much, my only question remaining is: will my data refresh take longer using hashes than ints?

    Again, thanks for this insight, it made me more confident that using hashes as the primary and foreign keys is not so bad at all.

    • Thank you, I’m glad you enjoyed the article!
      I haven’t tested or compared the refresh. But depending on the data volumes, I would expect it to have some kind of impact, even if small, considering that the data needs to be 1) imported from the source and 2) loaded into memory.
      It would be interesting to try though!

  3. I wish I had read this before I spent countless hours, actually more like days, replacing all GUID relationships with numeric keys in our Dynamics CRM Power BI models … I have an ETL that generates new whole number index keys, replaces all the GUIDs with those index keys in the master table and then does the same in all the related tables.

    Besides wasting a lot of my time on this initial design, this is also quite a bit of ETL computing power that I’m wasting daily during the refreshes …

    • I noticed I hadn’t replied to this comment. First of all, thank you for reading and enjoying the article. I am glad that you learned something from it.
      Even though you put in all that work, it might not be worthless. For larger models, GUID values could bloat your dictionary and impact the overall performance of your semantic model.
      As usual, it depends. This article was written mostly to help people learn and understand the behavior and the technology, instead of just assuming things that might not be true.

  4. I just tried this out myself and I cannot reproduce your results. My GUID string columns are all about 50-100% larger than my INT64 columns. Now granted, I used tables with a cardinality of 50k to 200k rows.

    But for each of the 5 tables where I loaded both the GUID and the Key in the same table, the GUID was each time almost double the size.

    • It’s expected that the size of a column containing GUIDs (strings) is larger than an integer column, due to the dictionary size. However, as you can see in the article, I didn’t get a 50-100% increase in size.
      The only time I got that kind of difference was when comparing integers stored as strings vs. GUIDs stored as strings.
      Similarly, a large integer value would increase the size accordingly compared to a small integer value.

  5. We got about 1/3 faster queries in our large data model when we changed the relationships to use INTEGER instead of STRING. We made the change in dimension tables with a huge dimension (4 million rows).

    – ~1.2 billion rows in fact table
    – ~4.3 million rows in dimension
    – 1:M relation
    – Everything is HASH encoded

    For us it seems that it depends, but mostly with big dimensions and when you are filtering through the dimension table. I don’t know the threshold, but in this case at least it applied, for example for the product dimension. The effect is biggest when you are filtering the query through the dimension, with product category for example. The query times seem to always be lower for an INTEGER relationship column than for a STRING column. There are so many things that can affect this one; we, for example, are calculating through the dimension table, basically just CALCULATE(SUM('dimension'[unit_cost]), fact). In most cases it does not seem to matter.

    • The conclusion is that the size of the data has an effect on execution times, not so much the data type. Columns with a larger data size in their dictionaries seem to decrease performance slightly, perhaps even more in larger models like yours.
      This means that using a large integer could in some cases be slower than using small (short) text values.

      • That is not the case: in our case the Integer relationship has a larger dictionary (81 million bytes) than the string one (73 million bytes), so the data type is not the cause. The only difference is that one is a string, and the other is the Amazon FNV_HASH() function changing it to an Integer.

        My guess is that something might be happening at the CPU level, with the L1 and L2 caches during the calculations, and that the way STRING values are packed and decoded compared to numbers might factor in as well. But to be honest, I have no idea.

        BR Samu

  6. Hi Max,

    Thank you for this article. Really helpful.

    I assume that this is related only to Import mode.

    Can you confirm, please?

    Kind regards,
    Aneta

    • Yes. The article describes this from the point of view of Vertipaq, which is the storage engine used in Import mode.
      If you use DirectQuery sources, you will instead depend on the underlying data source’s database engine.

  7. Hi,
    This test lacks a lot of things to actually draw the conclusions you’re stating.

    Two things come to mind:
    1. What happens if we scale the data (10M records is nothing)? How about 100M, 500M, 1B?
    2. What happens if you hint VALUE encoding for the integers instead?

    • Thank you for your comment.

      Columns used in a relationship will always be hash encoded, even if you try to hint them as value encoded. I’ve mentioned this in a post about Vertipaq compression.
      Since Vertipaq uses hash encoding and stores only the dictionary index, it doesn’t matter what the original column values are when it comes to query performance.

      However, as I mention above in the article, increasing the size of the dictionary (which would also be the case in a larger model) would affect performance.
      So, for example, using a large string as a key in your relationship would probably bloat your dictionary and increase the memory footprint.
      This would most likely have a negative impact on the overall performance of your model. But it is not caused by a specific data type by itself.

