-
-
Notifications
You must be signed in to change notification settings - Fork 2.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PoC: simple, more flexible log aggregation API for Archivers/RecordBuilders #20798
base: 5.x-dev
Are you sure you want to change the base?
Conversation
…via RecordBuilders
…d since archiving is a recursive process
…before this change)
…rd needed is created by a RecordBuilder
…ta is present within them. if some are not present, only archive those in a new partial archive.
…rent archive request
…iver class instances will not be created
…meric archive table, otherwise blob archive table
…typo causing this to fail
…yet mark RecordBuilder as api
…ures as necessary)
If you don't want this PR to be closed automatically in 28 days then you need to assign the label 'Do not close'. |
a2ce9c7
to
7bbdbff
Compare
7bbdbff
to
3272571
Compare
…ategory data is correct, and a bug fix)
…is done for segments + when auto-joining, allow log tables to specify multiple columns to join on and prioritize getWaysToJoinToOtherLogTables() over the presence of an idvisit column
… if they need to, and add fetchRow() method to LogAggregationQuery
FYI @tsteur this can be looked at now |
@diosmosis can you please rebase/merge in latest 5.x and look into the failing phpcs test? Thanks |
@michalkleiner this shouldn't be merged as is, I assume someone will have to make a decision whether it should go into core first. Hence no "Needs Review" label. |
If you don't want this PR to be closed automatically in 28 days then you need to assign the label 'Do not close'. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the general approach here of decoupling the aggregation from the query elements. More abstraction during the query build will allow better optimizations in the future. For example, window function support for ranking queries which would be easier to delegate to the DB adapter if less of the query is being written directly in code.
I can't see any issues with the code or structure here and since it's a PoC I won't get too detailed.
Perhaps it could be a good idea to test the new LogAggregationQuery
class with some of the more complicated existing archiving queries such as those in the Funnels plugin or pageview conversions to confirm that edge cases like sub-queries can be handled?
I'm not sure what the next step is for this PR, but if there's any specific feedback that you'd find useful then let me know and I can raise it with the rest of the team, otherwise I look forward to seeing the final version of this merged once it's ready 👍
Thanks @diosmosis, I've taken a look at those companion PRs and judging by the complexity of those queries it looks like there should be more than enough flexibility to handle pageview conversions and funnels too 🙂 |
Description:
This PR is a follow up to the RecordBuilder change. Requested by @tsteur it is an exploration into providing a simpler, more understandable API for performing log aggregation during archiving. It should not be merged as is (nor is it actually merge-able).
Description of the API vs. existing code
The current API is essentially manually constructing SQL and building DataTables from the result set. Developers must manually provide selects, from metadata, where conditions, group bys, etc. Even for simpler cases, this can be confusing to read:
Log aggregation is generally never simple, and usually involves the use of other complex bits of code like RankingQuery (which truncates an entire result set in mysql to a predefined number of rows, the row limit usually numbering in the thousands).
The current attempt at dealing w/ this complexity is the monolithic LogAggregator class which contains several methods for common log aggregation use cases, like querying the visits table for general visits metrics. For core plugins these methods can be used in many places, but for anything more interesting (like every premium plugin), they are inadequate. In fact, the queries for premium plugins are often far more complicated, and thus even harder to understand.
The API introduced in this PR surrounds a new class:
LogAggregationQuery
. Dimensions and metrics are added one by one and the SQL query is then generated from that information. (Note: It's possible to addfrom
entries w/ specific joins, but the aim is to eventually avoid having to think in relational terms in the API for this class.) Queries return generators that iterate over every row and auto-close the cursor, and can be supplied to new DataTable methods that automatically iterate over them and sum the data. Example usage:Proof of concept uses can be found here for the Goals plugin and as PRs in the MediaAnalytics and CrashAnalytics plugins (including the use of RankingQuery).
Potential future directions if the API is used
This change is not an end unto itself, but a stepping stone to other potentially more valuable changes:
Remove the LogAggregator class
With LogAggregationQuery class, the (more) hardcoded queries made in LogAggregator are no longer necessary. The class could be deprecated and removed.
Using only ArchivedMetric/Dimension classes to build queries
The original aim of this PoC was to only allow metadata classes like ArchivedMetric/Dimension in methods like
LogAggregation::addDimension
. Unfortunately, very few plugins use them and useArchivedMetric
in a way that would be optimal for generic log aggregation queries, so the PoC also includes (and depends on) methods that specify SQL specifically.A future change, however, could introduce more ArchivedMetric/Dimension classes (along with fixing any inefficiencies in existing uses). This would further decouple LogAggregationQuery with relational database concepts, relying solely on concepts in the higher domain of analytics.
This is another stepping stone to what could be a very valuable end result:
Potential path for supporting different backends (for log aggregation at least)
With the above change, and perhaps some other changes added to
LogTable
entities, log aggregation can be done within plugins without having to write SQL. It follows that non-mysql or other non-relational backends could be supported by providing differentLogAggregationQuery
implementations. While this would not be all the work required for Matomo supporting other backends, it's a significant step in that direction.Debugging commands that can help visualize the structure and flow of archiving
Since SQL queries with
LogAggregationQuery
are created from chunks, like dimension sql, metric sql, etc, it could be possible to create some debugging commands to help developers see how log aggregation for a specific plugin works. The basic process in a RecordBuilder is:A simple visualization of how the query result set looks and how the record row looks can be automatically generated from a LogAggregationQuery instance. Which could aid in understanding a plugin's archiving code for someone who doesn't yet understand.
Review