-
Notifications
You must be signed in to change notification settings - Fork 5.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add CTE Materialization documentation #22682
Conversation
cc: @feilong-liu @mlyublena @tdcmeehan @rschlussel for any suggestions |
5d43a16
to
ce3f455
Compare
CTE Materialization | ||
================================= | ||
|
||
CTEs (Common Table Expressions) are defined as subqueries that appear in a WITH clause provided by the user. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Common Table Expressions (CTEs)
================================= | ||
|
||
CTEs (Common Table Expressions) are defined as subqueries that appear in a WITH clause provided by the user. | ||
Their repeated usage within a query can lead in unnecessary redundant computations, excessive data retrieval, and high resource consumption |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lead to
period after consumption
CTEs (Common Table Expressions) are defined as subqueries that appear in a WITH clause provided by the user. | ||
Their repeated usage within a query can lead in unnecessary redundant computations, excessive data retrieval, and high resource consumption | ||
|
||
To address this, Presto supports CTE Materialization. This feature allows intermediate CTEs to be materialized and effectively stored |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"materialized and effectively stored
and reused within" --> materialized, stored, and reused within
or maybe just "reused"
and reused within the scope of the same query. | ||
|
||
This improves performance and is especially beneficial when the same CTE is utilized multiple times in a query. | ||
These CTEs will be stored in temporary tables that are bucketed based on the first projection column |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
period
This improves performance and is especially beneficial when the same CTE is utilized multiple times in a query. | ||
These CTEs will be stored in temporary tables that are bucketed based on the first projection column | ||
|
||
To use this feature the connector used by the query must support the creation of temporary tables. Currently, only the :doc:`/connector/hive` connector offers this capability. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
comma after feature
How to use CTE Materialization | ||
-------------- | ||
|
||
The following configurations and session properties are provided to enable CTE materialization and fine-tune its settings. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
delete "are provided to"
* **Allowed values:** ``ALL``, ``NONE``, ``HEURISTIC`, ``HEURISTIC_COMPLEX_QUERIES_ONLY`` | ||
* **Default value:** ``NONE`` | ||
|
||
Specifies the strategy to use for materializing Common Table Expressions (CTEs) in queries. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
delete "to use"
* **Default value:** ``NONE`` | ||
|
||
Specifies the strategy to use for materializing Common Table Expressions (CTEs) in queries. | ||
``NONE`` indicates that no CTEs will be materialized. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
delete "indicates that"
|
||
Specifies the strategy to use for materializing Common Table Expressions (CTEs) in queries. | ||
``NONE`` indicates that no CTEs will be materialized. | ||
``ALL`` indicates that all CTEs in the query will be materialized. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
delete "indicates that"
* **Minimum value:** ``0`` | ||
* **Default value:** ``4`` | ||
|
||
Used with CTE Materialization Strategy is set to ``HEURISTIC` or ``HEURISTIC_COMPLEX_QUERIES_ONLY``. CTES are only materialized if they are used greater than or equal to this number |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand this sentence. Please rewrite
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe something like:
Used with CTE Materialization Strategy is set to ``HEURISTIC` or ``HEURISTIC_COMPLEX_QUERIES_ONLY``. CTES are only materialized if they are used greater than or equal to this number | |
When ``cte-materialization-strategy`` is set to ``HEURISTIC`` or ``HEURISTIC_COMPLEX_QUERIES_ONLY``, then CTEs will be materialized if they are appear in a query at least``cte-heuristic-replication-threshold`` number of times |
* **Minimum value:** ``0`` | ||
* **Default value:** ``4`` | ||
|
||
Used with CTE Materialization Strategy is set to ``HEURISTIC` or ``HEURISTIC_COMPLEX_QUERIES_ONLY``. CTES are only materialized if they are used greater than or equal to this number |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe something like:
Used with CTE Materialization Strategy is set to ``HEURISTIC` or ``HEURISTIC_COMPLEX_QUERIES_ONLY``. CTES are only materialized if they are used greater than or equal to this number | |
When ``cte-materialization-strategy`` is set to ``HEURISTIC`` or ``HEURISTIC_COMPLEX_QUERIES_ONLY``, then CTEs will be materialized if they are appear in a query at least``cte-heuristic-replication-threshold`` number of times |
To address this, Presto supports CTE Materialization. This feature allows intermediate CTEs to be materialized and effectively stored | ||
and reused within the scope of the same query. | ||
|
||
This improves performance and is especially beneficial when the same CTE is utilized multiple times in a query. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This improves performance and is especially beneficial when the same CTE is utilized multiple times in a query. | |
Materializing CTEs can improve performance when the same CTE is used multiple times in a query by reducing recomputation of the CTE. However, there is also a cost to writing to and reading from disk, so the optimization may not be beneficial for very simple CTEs or CTEs that are not used many times in a query. |
and reused within the scope of the same query. | ||
|
||
This improves performance and is especially beneficial when the same CTE is utilized multiple times in a query. | ||
These CTEs will be stored in temporary tables that are bucketed based on the first projection column |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These CTEs will be stored in temporary tables that are bucketed based on the first projection column | |
Materialized CTEs are stored in temporary tables that are bucketed based on the first projection column. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work, thanks for the doc! I made a lot of suggestions that are mostly repetitive. I had a few comments that I ask for your thoughts.
Also, I'm planning to reformat the Properties Reference page to remove the indents for content under each header.
Instead of cluttering this review with even more comments, I'll ask you here to remove the indents for the content of the property entries under the headings.
Here's a screenshot of my local build where I removed the indents for cte-materialization-strategy
, showing the format change in the build - removing the gray vertical line - from that entry, but leaving the entry for cte-heuristic-replication-threshold
to show what I'm asking you to do. Let me know if you have any questions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @jaystarshot for this doc. Have some comments.
Specifies the strategy to use for materializing Common Table Expressions (CTEs) in queries. | ||
``NONE`` indicates that no CTEs will be materialized. | ||
``ALL`` indicates that all CTEs in the query will be materialized. | ||
``HEURISTIC` greedily materializes the earliest parent CTE which is repeated >= cte_heuristic_replication_threshold times |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jaystarshot : Do you intend to have a blog article or some doc giving more examples (queries and plans) demonstrating the behavior of these config ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes that is a good idea, i can add it as a blog or https://github.com/prestodb/presto/wiki
|
||
The number of partitions to be used for materializing Common Table Expressions (CTEs) in queries. | ||
This setting determines how many buckets or writers should be used when materializing the CTEs, potentially affecting the performance of queries involving CTE materialization. | ||
A higher number of partitions might improve parallelism but also increases overhead in terms of memory and network communication. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Its strange that a property called partitioning actually affects the buckets. But please emphasize in the explanation that these are buckets and not partitions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This actually comes from exchange materialization where we use the hash partition count to count the number of buckets
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for explaining! I thought this was strange, but the explanation makes sense.
This setting specifies which catalog should be used for CTE materialization and for determining how to partition the materialization of CTEs in queries. | ||
This can also be specified on a per-query basis using the ``cte_partitioning_provider_catalog`` session property. | ||
|
||
``cte-filter-and-projection-pushdown-enabled`` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would be great to have an example of this as well. Do you have something in a design doc ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#21638 There is a good example on this ticket
* **Allowed values:** ``HIVE_COMPATIBLE``, ``PRESTO_NATIVE`` | ||
* **Default value:** ``PRESTO_NATIVE`` | ||
|
||
This setting specifies the Hash function type for cte materialization |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do each of these settings mean ? Can you elaborate ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cb69e89
Sample commit which added them.
Its used
switch (bucketFunctionType) { |
I think it determines which hashing functions to use when inserting records into the buckets.
``NONE`` indicates that no CTEs will be materialized. | ||
``ALL`` indicates that all CTEs in the query will be materialized. | ||
``HEURISTIC` greedily materializes the earliest parent CTE which is repeated >= cte_heuristic_replication_threshold times | ||
``HEURISTIC_COMPLEX_QUERIES_ONLY` greedily materializes the earliest parent CTE which meets the heuristic criteria and has a join or aggregate |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We may consider adding other operations, for example window operator which is also expensive for this strategy. But of course, this will be a separate PR from this one.
|
||
* **Type:** ``string`` | ||
* **Allowed values:** ``PAGEFILE``, ``ORC``, ``DWRF``, ``ALPHA``, ``PARQUET``, ``AVRO``, ``RCBINARY``, ``RCTEXT``, ``SEQUENCEFILE``, ``JSON``, ``TEXTFILE``, ``CSV`` | ||
* **Default value:** ``ORC`` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wondering if we want to change the default value if the recommendation is another one (again on a separate PR)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah we can I will do it in a follow up PR. This is also used by exchange materialization and I did not want to break any default flow incase anyone was using the default value. Also prestissimo doesn't support PAGEFILE for now
b284122
to
55ed042
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice! A couple of nits for formatting issues, but from a docs standpoint it looks good. Happy to take another look when you're done.
Thanks for the review |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the quick fix! In addition to the one comment about the value descriptions in cte-materialization-strategy, I have a request that would be a repetitive set of changes and is simpler to make here.
Suggest removing the indent throughout to remove the vertical stripe on the left. I'm trying to get to this throughout the Presto doc and it's taking time, but your new doc following this behavior would help.
Screenshots of the first one and a half entries (before, and after - the after also includes my suggested formatting for the first item's values) from my local environment to show what I mean.
(The second entry is unchanged in after, only cte-materialization-strategy acts as an example.)
6ecdd80
to
37daaaa
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work! Only a few nits.
|
||
The number of partitions to be used for materializing Common Table Expressions (CTEs) in queries. | ||
This setting determines how many buckets or writers should be used when materializing the CTEs, potentially affecting the performance of queries involving CTE materialization. | ||
A higher number of partitions might improve parallelism but also increases overhead in terms of memory and network communication. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for explaining! I thought this was strange, but the explanation makes sense.
45d061c
to
e04ddb4
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! (docs)
Pulled updated branch, new local build of docs to review, everything looks good. Thanks!
Description
Fixes #22648
Motivation and Context
Impact
Test Plan
Contributor checklist
Release Notes
Please follow release notes guidelines and fill in the release notes below.
If release note is NOT required, use: