
[BUG] Adding column to a satellite causes error on bigquery #227

Closed
tenajima opened this issue Feb 15, 2024 · 6 comments
Labels
bigquery Issues specific to Google BigQuery bug Something isn't working

Comments


tenajima commented Feb 15, 2024

Hello,

Is your feature request related to a problem? Please describe.
I tried to add a new column to the satellite. I added it, expecting that the existing records would also be re-imported into the satellite by including the new column in the hashdiff.
However, an error occurred in the part where automate-dv compiles the code.
I am using BigQuery.

Describe the solution you'd like
How about reverting the change from window_cols to source_cols that was made in the latest_records CTE section of sat.sql in this commit?

Describe alternatives you've considered
When adding a new column, create a new satellite instead of adding it to the existing satellite.

Additional context
I wanted to know the background of the above commit, but the link to the development repository in CONTRIBUTING.md was broken and I couldn't view it. Could you please update it?
If you accept the proposed solution, I will create a Pull Request.

Thank you for your attention to this matter.

AB#5350

@tenajima tenajima added the feature This is requesting a new feature label Feb 15, 2024
DVAlexHiggs (Member) commented Feb 15, 2024

Hello and thank you for this request. I've changed it to a bug report because this isn't something that should be happening.

Generally speaking, when modifying the columns of a raw vault structure in production, you shouldn't modify the table directly and in place. Instead, create a new version of the model and transition downstream dependent tables to that new version, using dbt model versions or a suitable manual alternative.

With that said, during development, ad-hoc changes are necessary to iterate quickly, and regardless this should not be causing an error.

Please can you provide further details of the error (error messages etc.) to help us find the root cause of the issue?

Regarding the commit you referenced: it was one commit in a larger satellite re-work to support intra-day loading and introduce more robust satellite logic for handling problem data and incorrect data loading approaches. This was released in 0.10.0, so there's further context in those release notes.

Our development repository is currently private and we cannot accept contributions via the public repository. This is an ongoing issue which we are seeking to resolve and improve for the community.

@DVAlexHiggs DVAlexHiggs added bug Something isn't working and removed feature This is requesting a new feature labels Feb 15, 2024
@DVAlexHiggs DVAlexHiggs changed the title [FEATURE] Add new column to sat with no error [BUG] Add new column to sat with no error Feb 15, 2024
@DVAlexHiggs DVAlexHiggs changed the title [BUG] Add new column to sat with no error [BUG] Adding column to a satellite causes error on bigquery Feb 15, 2024
@DVAlexHiggs DVAlexHiggs added the bigquery Issues specific to Google BigQuery label Feb 15, 2024
tenajima (Author) commented Feb 15, 2024

Hello,
Thank you for your response. Here is the additional information that you requested.
The error message is as follows:
Name {new_column} not found inside current_records at [30:165]

The compiled query is as follows (note that column names and other identifiers have been replaced with dummy names):
/* {"app": "dbt", "dbt_version": "1.7.7", "profile_name": "user", "target_name": "prod", "node_id": "model.project_name.sat_table_name"} */

create or replace table `project_name`.`dbt_raw_vault`.`sat_table_name__dbt_tmp`
      
OPTIONS(
  description="""""",
  expiration_timestamp=TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL 12 hour)
)
as (

-- Generated by AutomateDV (formerly known as dbtvault)

WITH source_data AS (
    SELECT a.primary_key, a.hashdiff, a.attribute_1, a.attribute_2, ... , a.load_timestamp, a.record_source
    FROM `project_name`.`dbt_stage`.`stg_table_name` AS a
    WHERE a.primary_key IS NOT NULL
),

latest_records AS (
    SELECT current_records.primary_key, current_records.hashdiff, current_records.attribute_1, current_records.attribute_2, ..., current_records.load_timestamp, current_records.record_source,
        RANK() OVER (
           PARTITION BY current_records.primary_key
           ORDER BY current_records.load_timestamp DESC
        ) AS rank_num
    FROM `project_name`.`dbt_raw_vault`.`sat_table_name` AS current_records
        JOIN (
            SELECT DISTINCT source_data.primary_key
            FROM source_data
        ) AS source_records
            ON source_records.primary_key = current_records.primary_key
    QUALIFY rank_num = 1
),

first_record_in_set AS (
    SELECT
    sd.primary_key, sd.hashdiff, sd.attribute_1, sd.attribute_2, ..., sd.load_timestamp, sd.record_source,
    RANK() OVER (
            PARTITION BY sd.primary_key
            ORDER BY sd.load_timestamp ASC
        ) as asc_rank
    FROM source_data as sd
    QUALIFY asc_rank = 1
),

unique_source_records AS (
    SELECT DISTINCT
        sd.primary_key, sd.hashdiff, sd.attribute_1, sd.attribute_2, ..., sd.load_timestamp, sd.record_source
    FROM source_data as sd
    QUALIFY sd.hashdiff != LAG(sd.hashdiff) OVER (
        PARTITION BY sd.primary_key
        ORDER BY sd.load_timestamp ASC)
),

records_to_insert AS (
    SELECT frin.primary_key, frin.hashdiff AS hashdiff, frin.attribute_1, frin.attribute_2, ..., frin.load_timestamp, frin.record_source
    FROM first_record_in_set AS frin
    LEFT JOIN latest_records lr
        ON lr.primary_key = frin.primary_key
        AND lr.hashdiff = frin.hashdiff
        WHERE lr.hashdiff IS NULL
    UNION DISTINCT
    SELECT usr.primary_key, usr.hashdiff, usr.attribute_1, usr.attribute_2, ..., usr.load_timestamp, usr.record_source
    FROM unique_source_records as usr
)

SELECT * FROM records_to_insert
    );

I hope this information helps with troubleshooting. Looking forward to your feedback.
Best regards

DVAlexHiggs (Member) commented Feb 15, 2024

OK, that clarifies things, thank you. The issue arises because the satellite logic requires a 'look-ahead' to the records currently in the satellite. As the new column is not in the current version of the satellite (it didn't exist before), the query errors because it cannot find the column.

Generally, in development we would tend to just use the dbt --full-refresh flag, which would re-create the table with the new column, allowing future loads to reference the new column correctly.
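For illustration only (the model name is a placeholder for the affected satellite), a development-time full refresh looks like:

```shell
# Re-create the satellite table from scratch so the new column exists in the
# physical table; subsequent incremental loads can then reference it.
# 'sat_table_name' is a placeholder for the affected model.
dbt run --select sat_table_name --full-refresh
```

Note that a full refresh drops and rebuilds the table, so it is only appropriate where losing the currently loaded history is acceptable.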

It looks like your table is being deployed to production, so I would suggest avoiding a --full-refresh there. Even if this error were not occurring, I would not advise adding columns in an ad-hoc manner like this; instead, implement appropriate version control as mentioned in my first reply.

That said, we should probably 'plug' this issue. Something like dbt's built-in on_schema_change could be used here to handle newly-discovered columns, perhaps back-filling the past values with NULLs.
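As a sketch of that idea (this illustrates dbt's built-in incremental-model option, not a confirmed AutomateDV fix; the model path is a placeholder and AutomateDV's macros would still supply the satellite body):

```sql
-- Hypothetical satellite model file, e.g. models/raw_vault/sat_table_name.sql.
-- 'on_schema_change' is dbt's built-in option for incremental models:
--   'append_new_columns' adds newly-discovered columns to the target table,
--     leaving existing rows NULL for those columns;
--   'sync_all_columns' additionally removes columns that were dropped
--     from the source.
{{ config(
    materialized='incremental',
    on_schema_change='append_new_columns'
) }}
```

Neither option recomputes historical hashdiffs, so existing records would keep NULL for the new column unless the history is re-loaded.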

This wouldn't be a trivial solution to implement, however. We must weigh the time invested in developing it against the likelihood of the issue, and consider whether it makes sense to handle it in the AutomateDV code or whether it should be the user's responsibility.

I would be interested in your thoughts on this!

tenajima (Author) commented

Hi,

Thanks for clarifying the problem and providing detailed feedback. As you pointed out, our team uses on_schema_change: sync_all_columns, anticipating that new columns would be added over time. This error led us to investigate the issue further.

The column added this time had been omitted from the satellite definition, despite being defined in raw_stg. While the ideal scenario might call for adding a new satellite, frequently creating small satellites in operation can lead to messy management.

At the same time, it's difficult to maintain strict checks ensuring that all columns defined in raw_stg are correctly added to the satellite. In such omission cases, intuitively adding the column to the satellite, computing a new hashdiff, and initiating a re-import makes sense to us. We agree that NULL should back-fill the past values in these cases.

Thank you again for your ideas!


mrcool4 commented Apr 8, 2024

Hi @DVAlexHiggs, we are facing this exact issue. Any chances of incorporating this feature in the future releases?

@DVAlexHiggs DVAlexHiggs added bigquery Issues specific to Google BigQuery and removed bigquery Issues specific to Google BigQuery labels May 15, 2024
DVAlexHiggs (Member) commented

Fixed and released in v0.11.0. Thank you for your patience on this one!

If the issue persists, please re-open this issue.


3 participants