Vertipaq is the actual database engine used in Power BI. It is also the engine used in Analysis Services Tabular and Excel Power Pivot.
One of the reasons behind Vertipaq’s great performance is due to the different compression and encoding techniques that it applies to the data. The overall purpose of the different compression techniques used is to reduce the RAM footprint of your data model as much as possible. It is not uncommon that data models are reduced down to 10% or even less of it’s original size. Of course, the rate of compression depends on a number of parameters.
In this article I will go through the different compression techniques used in Power BI. Understanding these concepts will give you one more tool to potentially increase the performance of your Power BI data model.
There are three types of encoding techniques used:
- Value encoding
- Dictionary encoding (also called Hash encoding)
- RLE encoding
Value encoding is a compression technique only applied to integer datatypes. By finding mathematical relationships between values it can highly reduce the number of bits stored in a column.
Lets say that you have a column containing the number of website traffic per day. The numbers range from approximately 1700 and 2200. The biggest value 2200 takes around 12 bits of storage.
By reducing all values in the column by the lowest number (1700) you get a range of values between 0 and 500 instead of 1700-2200. In this case the highest number(500) only takes up 9 bits of storage.
An example of this can be seen in figure 1.
This gives a storage saving of 3 bits per row. This might not seem like a huge saving in storage in this particular example. However, keep in mind that depending on the distribution of data in a column the effect can be even larger.
This is of course only one example of a mathematical technique used in value encoding.
Basically, Vertipaq tries to find patterns in the data where it can apply various mathematical calculations to reduce the storage cost. Similar as the example above. Applying this over potentially millions or even billions of rows across multiple tables and columns can have a significant effect on your models size in memory.
These values also needs to be converted back to the original value when using the actual data from the column. As you might understand, this technique can drastically improve the storage (RAM footprint) and therefore also performance in your model.
There is one important aspect of value encoding. As mentioned, this type of compression is only applied to integer datatypes. However, there is an exception. Any column used in a relationship will never be subject for value encoding. Even if the column datatype is integer.
You can try to apply encoding hints to your model. However, the model will automatically change those columns back to dictionary(Hash) encoding after the next refresh.
I have not yet found any documentation about this behavior. It is however noted in various places across the community. For example in this article, written by Marco Russo & Alberto Ferrari from SQLBI about comparison between integer and date datatypes in a relationship.
Dictionary encoding, also known as hash encoding, uses a separate dictionary that holds all unique values in a column and assigns a numeric index to each value. The original values are then replaced by the new numerical index values.
Figure 2 shows an example of the top 15 most Olympic gold medalists by nation. As you can see there are in total 7 distinct nations in the list.
Once the dictionary is created. Vertipaq will start replacing the original values with the corresponding ID value. The original sample data (left table) totals to a size of 146 bytes (Power BI stores string as 2 bytes per character).
The example dictionary ID requires 3 bit for each ID number. The dictionary encoded data would have a total size of 15(rows) x 3(bits) = 45 bits. That equals to around 5,6 bytes of data.
Lets compare the original data with the encoded data. The dictionary would consume around 57 bytes for the string values + 3 x 7 = 21 bits for the index values. 21 bits equals around 2,6 bytes. The total size of the dictionary in this example would be almost 59,6 bytes.
The storage size for the dictionary + value encoded data results in 5,6 + 59,6 = 65,2 bytes. This can be compared to the original 146 bytes. Imagine this being applied over large datasets with multiple large string columns. The storage savings could be huge!
The benefit of this technique will of course depend of the cardinality of the column or the number of distinct values. The success of value encoding is driven by this factor. The compression will be more effective if there are fewer unique values (lower cardinality).
Value encoding works with any datatype and functions the same way. The main difference between different types of data is mostly related to the size of the dictionary.
Run Length Encoding (RLE)
RLE encoding is an additional encoding method used on top of Value encoding or Dictionary encoding. This method tries to save space by identifying and replacing blocks of repeated values in a column.
An example can be seen in Figure 3. The value AAAAA in Column A is repeated across 568 rows, between row #1 and #568. The value BBBBB is repeated 754 times between row #700 and #1453.
The effect of RLE encoding can be seen on the right table.
568 + 754 = 1322 rows of data are reduced to only two rows.
One thing to keep in mind is that Vertipaq will not store the data as displayed by the example in figure 3.
The effect from both Value and Dictionary encoding is that all columns will be stored as integers only. This means that RLE encoding will be applied on integer values in all cases. It will not apply RLE encoding on the original data imported to the model.
Different datatypes – same outcome
We have now looked at the different techniques used for compressing and storing data in memory. You can see that different datatypes will not effect of the outcome of the compression. Encoded data will be converted to integer values always.
One important thing to consider. Columns with Dictionary encoding applied will also have a dictionary which stores the original distinct values. These are not stored as integers and will therefore take up more space. The dictionary size will depend on the type of data (for example length of strings) as well as the cardinality of the column.
The effect of the different encodings has an interesting effect. Power BI can be seen as more or less data type independent when it comes to relationship columns. You will most likely not see a huge benefit in performance when having integer values in relationship over other datatypes.
However, this depends on the size of the dictionary. Remember, dictionary encoding will always be applied to columns used in a relationship. The size of the dictionary will depend on the original data in the column. A bigger dictionary will have a negative effect on your models size in memory which could potentially lead to decreased performance overall.
Thank you for reading
That is all for this time. I hope that you learned something from this post. What do you think about this topic? Please tell by leaving a comment below.
Credit to SQLBI on inspiration for this blog post