Optimizing Power BI: A Practical Guide to DirectQuery Performance – Part 3

Welcome back to part 3 in a series of articles where you will learn how to improve and optimize the performance of your Power BI DirectQuery semantic models.
In part 1 we covered the basic aspects of DirectQuery semantic models in Power BI, such as the importance of star schema modeling in Power BI and how to troubleshoot slow running queries in Power BI.
In part 2 we expanded on those ideas and walked through how to practically improve your DirectQuery semantic model performance. We also discussed the importance of moving data transformations upstream and limiting complex DAX and Power Query logic in your semantic model.

In this part, we will cover some other advanced topic in Power BI that can impact performance of DirectQuery semantic models.
We will introduce concepts such as:

  • Multiple data islands / source groups and cross source group queries.
  • DAX fusion and how you can leverage it.
  • How does query parallelism work?
  • How to change join types for Power BI generated queries to your underlying data source.

Combining data from different data islands

Data in your semantic model can come from both DirectQuery and Import data sources. Data originating from separate storage engines or different DirectQuery data sources are seen as different data islands (source groups). A Power BI semantic model can consist of multiple data islands, which also makes your semantic model a composite model.

Limited and Regular relationships in Power BI DirectQuery

Relationships in Power BI are defined as either regular relationships or limited relationships. Regular relationships are created whenever Power BI can determine the “one” side of a relationship. Meaning the column involved in a relationship that contains unique values. For example, if you have dimension and fact tables in your model, the dimension tables are usually on the one side of a relationship and the fact tables on the many side.
A one-to-many relationship between two tables within the same data island is always classified as a regular relationship.

A limited relationship can be defined in two scenarios. Either because a relationship within the same source group is defined as a many-to-many relationship. Or a relationship between two tables spanns different source groups. In this case it doesn’t matter if the relationship is defined as one-to-many relationship, Power BI will still define it as a limited relationship.

One-to-many intra-source queries will always provide the best performance, while cross-source queries will perform the worst. From a performance perspective, it’s, therefore, recommended that you strive to keep your relationships as regular relationships and not as limited

Limited relationships spanning multiple source groups

Whenever there is a relationship between different source groups, a limited relationship is created. A limited relationship can be identified in the graphical interface of Power BI with parenthesis like marks on both side of the relationship. A limited relationship is also created when there is a many-to-many relationship between two tables.

An image showing a limited relationship in Power BI Desktop
Limited relationship in Power BI Desktop

A query spanning multiple source groups is called an inter/cross source group query. The effect of this is that any joins resulting of relationships between multiple tables will not be considered or executed by the storage engine. The effect of this is that all joins need to be handled by the formula engine. In the end this will result in less performing queries compared to inter source group queries. Because inter source group queries can take advantage of joins during query execution at your data source, therefore reducing the data returned to the formula engine.

1 million row limit in Power BI DirectQuery

Furthermore, this can also cause potential issues with large tables, as there is a limitation for DirectQuery queries of 1 million rows. Let’s say, for example, that we have two tables, Table A and Table B.
Table A is in Import mode and Table B is in DirectQuery mode. The tables are joined with a relationship and data from both tables are displayed in a visual.
Table A is filtered; however, the join with the filtered rows will never traverse to Table B because it is located in another source group. Therefore, a query to retrieve data for Table B will be executed without the join, potentially returning more rows than needed to the formula engine. If the data retrieved are larger than 1 million rows, it will throw an error.

An image showing an error in Power BI desktop displaying "The resultset of a query to external data source has exceeded the maximum allowed size of '1000000' rows."
1 million row limitation in DirectQuery

Blank rows in limited relationships

The conclusion is that a semantic model containing multiple data islands and therefore having limited relationship can have a negative impact on the performance of your semantic model. For example:

In any case where an inter/cross-source query is executed, the individual queries must be sent to each individual storage engine before returning the semantic model. Only after that the formula engine can perform joins and append the data. Not only can this cause more data movement, it’s important to understand that the formula engine is a single-threaded process only utilizing one single core.

It’s important to understand the behavior of how your semantic model executes queries depending on how you model your data. This can have a significant impact on the performance of your model if you are not careful. This is especially true when using DirectQuery mode while Power BI’s vertipaq engine will be more forgiving using import mode.

DAX Fusion

DAX Fusion is an enhancement in the DAX engine aimed to improve query performance of DAX queries. Fusion is the technique of merging multiple queries together resulting in fewer storage engine queries sent to the underlying storage. Have you followed along on part 1 and part 2 of these series? If so, you should know by know that limiting the number of underlying queries sent to your data source can be of great importance when optimizing your DirectQuery datasets.

There are two types of fusion techniques, horizontal fusion and vertical fusion. Vertical fusion has been around in Power BI since Power BI’s release while horizontal fusion was introduced in september 2022.

Vertical fusion

WHen it comes to Vertical fusion aims to merge DAX queries in a single visual that uses similar data and combine them into a single storage engine query. For example, using multiple measures with simple calculations over different columns over the same table would be a case where vertical fusion could kick in. This could be e.g. one measure doing a SUM() over one column and another measure doing an AVERAGE() over another column. If both measures operate over the same table these could be a candidate for merging the underlying storage engine queries into a single query.

Let’s take a look at an example.

In this following table visual, we have three different measures. MAX, AVG and COUNTROWS

A table visual in Power BI displaying column Brand and measures MAX, AVG and COUNTROWS
A table visual in Power BI displaying column Brand and measures MAX, AVG and COUNTROWS

The DAX for these three measures looks like this. As you can see, they are all simple calculations on top of different columns but within the same table.

MAX = MAX('Sales'[Quantity])

AVG = AVERAGE('Sales'[Quantity])

COUNTROWS = COUNTROWS('Sales')

By exporting the underlying DAX query (not sure how to do this? Check out part one in this series on how to find and troubleshoot DAX queries generated by visuals) we can take a look at the underlying storage engine queries using Server timings in DAX Studio.

In this case, only 1 SQL query has been generated and sent to the underlying data source. As you can see in the preview of the SQL query, the generated query already contains some of the aggregations used in the DAX measures, such as COUNT_BIG() and SUM().

An image showing Server timings in DAX Studio and the results of a DAX query
Server timings in DAX Studio

Now, let’s try and break the fusion. Let’s do that by changing one of the measures to include a filter.
In this scenario, I want to see AVG quantity on orders that has quantity > 1

AVG = CALCULATE(AVERAGE('Sales'[Quantity]),'Sales'[Quantity] > 1)

The values in the visual changes, as expected.

A table visual in Power BI displaying column Brand and measures MAX, AVG and COUNTROWS
A table visual in Power BI displaying column Brand and measures MAX, AVG and COUNTROWS.
Values has changed after modification to the measure.

Let’s take a look at the server timings in DAX Studio.

An image showing Server timings in DAX Studio and the results of a DAX query.
Server timings in DAX Studio

As you can see, we now have 2 SQL queries generated and sent to the underlying data source. The storage engine had to run for 74ms compared to the previous 24ms.
The total execution time increased from 32 ms to 44 ms, an increase of 37.5%.
These queries was also benchmarked by running 5 cold and 5 warm executions. The result was consistent with the ones showed here.

By changing one DAX measure and adding a filter to the CALCULATE function, we altered the outcome of DAX Fusion, forcing the engine to send an additional underlying SQL query to the data source

Horizontal fusion

Horizontal fusion aims to merge DAX queries that uses the same data but with different filter conditions. If you have multiple measures in a visual using the same column but with different filter contexts, those would be a candidate for horizontal fusion.

Let’s take a look at an example. In this case we have two measures, Sales 2019 and Sales 2020. These measures each returns the sales amount for one single year, as you can see by the DAX filter at the end of the CALCULATE function.

Sales 2019 =
CALCULATE (
    SUMX (
        'Sales DQ',
        'Sales DQ'[Net Price] * 'Sales DQ'[Quantity]
    ),
    'Date DQ'[Year] = 2019
)

Sales 2020 =
CALCULATE (
    SUMX (
        'Sales DQ',
        'Sales DQ'[Net Price] * 'Sales DQ'[Quantity]
    ),
    'Date DQ'[Year] = 2020
)

By using the same table visual, we can examine the query in server timings in DAX Studio. We can see that only one SQL query is sent to the underlying data source. By looking further at the SQL code, I can also see that the SQL code generated groups the net price * quantity by both brand and year (Not visible in the image). This means that the engine has successfully interpreted the two DAX measures and merged it into a single SQL query at an optimal granular level.

An image showing Server timings in DAX Studio and the results of a DAX query
Server timings in DAX Studio

Now, let’s add a third measure called Sales and check the server timings again. This measure returns the sales amount without any filter over the year column.

Sales =
CALCULATE (
    SUMX (
        'Sales DQ',
        'Sales DQ'[Net Price] * 'Sales DQ'[Quantity]
    )
)


As you can see by looking at the server timings result, this caused a second SQL query to be generated. Resulting in two queries being generated and sent to the underlying data source. The reason for this is likely due to mixing different filter logic within the three measures, causing Fusion to break. In this case fusion could only merge the measures Sales 2019 and Sales 2020 into a single SQL query, but not measure Sales, which had to be sent as a separate SQL query.

An image showing Server timings in DAX Studio and the results of a DAX query
Server timings in DAX Studio

This is just an example of how horizontal fusion works. If you want to learn more about Horizontal fusion, SQLBI, as always, has a great in-depth article about the feature.

DAX Fusion and Power BI DirectQuery

As you probably understand, DAX fusion is a powerful technology that can have a huge impact on the performance of your semantic model, especially when using DirectQuery data sources. You should always aim to build your models so that DAX fusion can be utilized. Understanding DAX fusion behavior is an important skill that can help you greatly improve the performance of any Power BI semantic model.

Without fusion, these kinds of queries would be sent as individual queries, greatly slowing down your semantic model’s performance. As I’ve mentioned in this series, having a proper modeled semantic model is of great importance. The same goes for following the maxim of doing your data transformations as far upstream as possible. DAX Fusion is one of many reasons why these things matter and why it can have a great impact on the performance of your semantic model. Avoiding the best practices around data modeling might mean that you need to introduce complex calculations into your DAX measures. This could potentially cause DAX fusion to not kick in.

Parallelism for storage engine queries

There are multiple types of queries that can be sent from a Power BI semantic model to the underlying data source.
When using import mode, queries will be sent whenever a data refresh operation is triggered on your semantic model .
For DirectQuery semantic models, storage engine queries are sent to retrieve data for your visuals. Storage engine queries are executed for each visual and for a single visual there can be multiple storage engine queries executed.

SE queries parallelism for a single visual

Whenever a DAX query is generated from a Power BI visual, it will trigger one or more storage engine queries. These queries can be sent in parallel to your underlying data source, greatly improving the performance of any DAX query in your semantic model.
Unfortunately, if you are using Power BI Pro, all queries will be sent sequentially. Therefore, it’s important to understand the type of queries being sent from your reports.
For example, a single matrix can send multiple different data source queries to get the results for the all row and column values, values (measure), and different totals. If you are using Power BI Pro, these queries will always be sent sequentially for a single visual.
For Power BI Premium and Embedded scenarios (with 8 or more vCores), there is a possibility to change this setting to make queries run in parallel instead of sequentially.

Note. This feature is currently (as of november 2023) in preview and needs to be activated to be used. Once it is shipped as general availability (GA) this setting will most likely be set to automatic to let Power BI decide for itself the best amount degree of parallelization. To active this feature, you need to change the TOM object Model.MaxParallelismPerQuery to which degree of parallelism you want.

Maximum number of data source connections

In Power BI, you can define the maximum number of connections for a single data source. This setting defines how many underlying queries that can be sent simultaneous to each of your underlying data sources. This setting should not be mixed up with the max parallelism per query setting described above. This means that this setting will not have any effect on parallelism for queries generated from a single visual in Power BI. However, it will have an effect for multiple queries generated from multiple different visuals in you Power BI reports.

You can change this setting in two ways. Either in Power BI Desktop by going to Options -> CURRENT FILE -> DirectQuery and change the maximum connections per data source setting.

An image showing the options for maximum connections per data source in Power BI Desktop
Options for Maximum connections per data source in Power BI Desktop


Or you can change it in you semantic model by changing the TOM property Model.DataSourceDefaultMaxConnections. This can be done in for example Tabular Editor.

An image showing the setting Data Source Default Max Connections in Tabular editor
Data Source Default Max Connections property in Tabular Editor

Power Query parallelism in Power BI Desktop

When working in Power BI desktop, you have some control over the behavior of query parallelization.
You can set the maximum number of simultaneous evaluations as well as the maximum memory used per simultaneous evaluation.

The setting for the maximum number of simultaneous evaluations defines how many Power Query evaluations can run simultaneously on your machine. By default, the number is equal to the number of logical CPU cores available on your machine.

It’s important to note that these settings are only applied in Power BI Desktop and not in the Power BI Service.

AN image showing options for parallel loading of tables in Power BI Desktop
Parallell loading of tables option in Power BI desktop

Changing join type for DirectQuery sources

When working with DirectQuery semantic models, each interaction generates a query to the underlying data source. As a default behavior, If a query includes columns from multiple tables, then those queries are generated with outer joins.
An outer join means that all results are returned from table A with matching rows from table B. This is not always ideal and could potentially cause performance issues in your model, depending on what your data looks like. Let’s take a look at the options available.

When querying multiple tables that has a relationship in Power BI, the joins generated by the DirectQuery storage engine are outer joins by default. An outer join means that all rows are returned from table A and matching rows from table B.

There is a way to change this behavior in Power BI that switches the join type from a left outer join to an inner join instead. An inner join returns the result from two tables only when there are matching rows between all tables. Using an inner join in SQL could enhance performance during query execution.

Referential integrity

The reason why left outer joins are used by default is because Power BI can’t assume referential integrity in your data model.
What does that mean? For a model to have referential integrity, it means that any value in the many side of a relationship needs to also be present in the one side. For example, if you have dimension key values in your fact table that does not exists in your dimension tables, or if null values exists in the column, then the relationship does not have referential integrity.

If a relationship between two tables does not have referential integrity, using inner joins in a query can return incorrect data. If you have values in your fact table with non matching values in your dimension table, an inner join will not include those rows in the result. In that case you would have missing data from your fact table, returning potentially wrong results in your reports.

However, let’s say that you have full control over your data model and can guarantee that your model has referential integrity. In that case you can change the behavior of the joins generated by the DirectQuery Storage engine. This can be done in the relationship view in Power BI.

An image showing the Edit relationship dialogue in Power BI Desktop, with Assume referential integrity highlighted
Assume referential integrity in Power BI Desktop

Let me show you with an example.
This is a table visual showing Sales by Year. Sales values comes from one table and the year column from another.

An image showing a table visual in Power BI Desktop
Table visual with Sales over Year

The relationship between the tables a one-to-many relationship. Assume referential integrity is set to off in this relationship.

An image showing a relationship between two tables in Power BI Desktop
A relationship between two tables in Power BI Desktop

Let’s take a look at the query generated from this visual. As you can see, the query retrieves data from table [dbo].[Sales] and [dbo].[Date]. The tables are joined by a LEFT OUTER JOIN. This means that all rows from [dbo].[Sales] are returned, as well as matching rows from [dbo].[Date].

An image of a SQL query generated by Power BI showing left outer join between tables Sales and Date
Query showing left outer join between tables Sales and Date

If we instead enable referential integrity for that specific relationship, the query changes and uses an inner join.

An image of a SQL query generated by Power BI showing an inner join between tables Sales and Date
Query showing left outer join between tables Sales and Date

When enabling Assume referential integrity the query will be executed using inner joins instead of outer joins.
At the time of writing this article, the following requirements are needed to enable the setting in Power BI.

  • Data in the From column in the relationship is never Null or blank
  • For each value in the From column, there’s a corresponding value in the To column

A word of caution

Power BI validates parts of your data model when the feature is enabled. This, however, is by no means bulletproof. The validation only occurs at the time of editing the relationship. This means that any subsequent change in the data breaking the referential integrity will not be caught. There’s also a risk that the validation does not discover all issues in a very large semantic model.

In the end, it is up to you to ensure that your data model meets the requirements for enabling this setting. Otherwise you might end up with a data model that returns wrong results to your end users. Also, make sure that you have your data modeled according to Power BI best practices.

Summary

This article has covered advanced techniques for improving your Power BI semantic model’s performance. We built on the topics discussed in Part 1 and Part 2 of our series to explore more complex aspects of Power BI DirectQuery optimization. Even though many topics can be applied to any Power BI solution.

We discussed combining data from different sources and how it can affect performance as well as learning about DAX Fusion. We also looked at query parallelism and how to change join types for DirectQuery sources.

By applying and learning these topics, you can improve your Power BI semantic models and ensure they run smoothly and efficiently. Now it’s up to you to apply these insights in your own projects and continue exploring ways to enhance your data modeling skills.

In the next part of this series we will discuss aggregation tables and how you can use them in Power BI DirectQuery models to speed up your reports!

Thank you for reading

If you enjoyed this article or learned something from it, please share it with others or comment below.
What is your experience using DirectQuery? Do you have any tips? Please share them below!


Don’t forget to follow for the next part in this series!

Leave a Comment