Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Feature] Support Semi-Structured Data Columns in Unit Test dict fixture data #10161

Open
3 tasks done
ronco opened this issue May 16, 2024 · 5 comments
Open
3 tasks done
Labels
enhancement New feature or request triage unit tests Issues related to built-in dbt unit testing functionality user docs [docs.getdbt.com] Needs better documentation

Comments

@ronco
Copy link

ronco commented May 16, 2024

Is this your first time submitting a feature request?

  • I have read the expectations for open source contributors
  • I have searched the existing issues, and I could not find an existing issue for this feature
  • I am requesting a straightforward extension of existing dbt functionality, rather than a Big Idea better suited to a discussion

Describe the feature

When writing unit tests that include semi-structured data columns (SUPER in Redshift for example), the fixture data can only be staged via SQL. This feature calls for understanding structured data in dict fixtures.

An example fixture definition would be something like this:

- input: ref('table_with_super')
  format: dict
  rows:
    - { structured_column: {foo: bar}, int_column: 123 }

Describe alternatives you've considered

This functionality can be achieved using SQL fixtures like so:
SELECT json_parse('{"foo":"bar"}')::SUPER as structured_column
This is quite cumbersome as you have to specify all significant columns in your sql.

Who will this benefit?

Anyone making use of semi-structured data columns.

Are you interested in contributing this feature?

I am open to contributing if this is of interest.

Anything else?

No response

@ronco ronco added enhancement New feature or request triage labels May 16, 2024
@dbeatty10 dbeatty10 added the unit tests Issues related to built-in dbt unit testing functionality label May 16, 2024
@graciegoheen
Copy link
Contributor

Hi! I believe we do support json structured types in unit test definitions using format: dict (see our docs here).

You'll need to use quotes though like so:

- input: ref('table_with_super')
  format: dict
  rows:
    - { structured_column: '{"foo": "bar"}', int_column: 123 }

@ronco
Copy link
Author

ronco commented May 21, 2024

Hey @graciegoheen , thanks for getting back to me. I've given your suggestion a try and I'm still running into difficulty. When I have this input:

          - { created_at: '2024-05-17 21:05:00', extra_info: '{"amount_usd": 50}' }

it generates this SQL for the fixture data: cast('{"amount_usd": 50}' as super)
When I try to query that fixture I get NULL:

with input_data as (select    cast('{"amount_usd": 50}' as super) as extra_info)
select extra_info.amount_usd from input_data

amount_usd|
----------+
          |

It seems the straight casting of the string to super isn't decomposing the data into structured values. However, if I switch the SQL to use json_parse I get the expected value:

with input_data as (select    cast(json_parse('{"amount_usd": 50}') as super) as extra_info)
select extra_info.amount_usd from input_data

amount_usd|
----------+
50        |

Is there a setting I'm missing for the Redshift adapter perhaps to make this work? Any other suggestions?

Thanks!

@dbeatty10 dbeatty10 self-assigned this May 22, 2024
@dbeatty10
Copy link
Contributor

Hey @ronco -- @graciegoheen and I discussed this a little bit yesterday.

You are spot-on with the following:

It seems the straight casting of the string to super isn't decomposing the data into structured values

This is because dbt unit testing doesn't do any unnesting / decomposition of the SUPER object -- it just leaves it as the entire object.

In order to do a unit test for one of the nested values, then you'll need to supply expected values for all of the nested values as well. See below for a full example.

If this doesn't work for you, could you share an example of your YAML for your unit test and your SQL models and we'll take another look?

More detail

It looks like the key insight in dbt-redshift is that:

  • You must specify all fields in a Redshift SUPER in a unit test. You cannot use only a subset of fields in a SUPER.

This is similar to a caveat in dbt-bigquery:

You must specify all fields in a BigQuery STRUCT in a unit test. You cannot use only a subset of fields in a STRUCT.

Example

models/stg_data_types.sql

select
    1 as int_field,
    2.0 as float_field, 
    1 as numeric_field,
    'my_string' as str_field,
    'my,cool''string' as str_escaped_field,
    true as bool_field,
    cast('2020-01-02' as date) as date_field,
    cast('2013-11-03 00:00:00-0' as timestamp without time zone) as timestamp_field,
    cast('2013-11-03 00:00:00-0' as timestamp with time zone) as timestamptz_field,
    cast(json_parse('{"bar": "baz", "balance": 7.77, "active": false}') as super) as json_field

models/fct_data_types.sql

select * from {{ ref("stg_data_types") }}
unit_tests:
  - name: test_my_data_types
    model: fct_data_types
    given:
      - input: ref('stg_data_types')
        rows:
         - int_field: 1
           float_field: 2.0
           numeric_field: 1
           str_field: my_string
           str_escaped_field: "my,cool'string"
           bool_field: true
           date_field: 2020-01-02
           timestamp_field: 2013-11-03 00:00:00-0
           timestamptz_field: 2013-11-03 00:00:00-0
           json_field: '{"bar": "baz", "balance": 7.77, "active": false}'
    expect:
      rows:
         - int_field: 1
           float_field: 2.0
           numeric_field: 1
           str_field: my_string
           str_escaped_field: "my,cool'string"
           bool_field: true
           date_field: 2020-01-02
           timestamp_field: 2013-11-03 00:00:00-0
           timestamptz_field: 2013-11-03 00:00:00-0
           json_field: '{"bar": "baz", "balance": 7.77, "active": false}'

Note how the last line in the unit test includes bar, balance, and active, even though you might only care about testing the value of balance.

@dbeatty10 dbeatty10 removed their assignment May 22, 2024
@ronco
Copy link
Author

ronco commented May 23, 2024

Hey @dbeatty10 , thanks for all this guidance. I am a little confused on how to pull this off. We're using our SUPER field for flexible data storage. In other words, there is no underlying defined STRUCT, the content varies from row to row. So I'm not sure what values I would need to add.

As requested, here are sample files illustrating the issue:
stg_data_types.sql:

select
    1 as int_field,
    2.0 as float_field, 
    1 as numeric_field,
    'my_string' as str_field,
    'my,cool''string' as str_escaped_field,
    true as bool_field,
    cast('2020-01-02' as date) as date_field,
    cast('2013-11-03 00:00:00-0' as timestamp without time zone) as timestamp_field,
    cast('2013-11-03 00:00:00-0' as timestamp with time zone) as timestamptz_field,
    cast('' as super) as json_field

and fct_data_types.sql:

select str_field,
sum(json_field.balance) as sum_balance
from {{ ref("stg_data_types") }}
group by 1

and finally schema.yml:

unit_tests:
  - name: test_my_data_types
    model: fct_data_types
    given:
      - input: ref('stg_data_types')
        rows:
         - int_field: 1
           float_field: 2.0
           numeric_field: 1
           str_field: my_string
           str_escaped_field: "my,cool'string"
           bool_field: true
           date_field: 2020-01-02
           timestamp_field: 2013-11-03 00:00:00-0
           timestamptz_field: 2013-11-03 00:00:00-0
           json_field: '{"bar": "baz", "balance": 7.77, "active": false}'
    expect:
      rows:
         - str_field: 'my_string'
           sum_balance: 7.77

When I run that test I get the following output:

17:06:28  1 of 1 START unit_test fct_data_types::test_my_data_types ...................... [RUN]
17:06:30  1 of 1 FAIL 1 fct_data_types::test_my_data_types ............................... [FAIL 1 in 2.58s]
17:06:31
17:06:31  Finished running 1 unit test in 0 hours 0 minutes and 6.89 seconds (6.89s).
17:06:31
17:06:31  Completed with 1 error and 0 warnings:
17:06:31
17:06:31  Failure in unit_test test_my_data_types (models/dbt-test/schema.yml)
17:06:31

actual differs from expected:

@@,str_field,sum_balance
→ ,my_string,7.77→NULL

I think the problem is the cast is treating the json_field as a plain string, rather than a structured field I can extract data from. Let me know if there's anything else I can provide to get to the bottom of this.

Thanks!

@graciegoheen graciegoheen added the user docs [docs.getdbt.com] Needs better documentation label May 30, 2024
@ronco
Copy link
Author

ronco commented May 31, 2024

Any further thoughts here? Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request triage unit tests Issues related to built-in dbt unit testing functionality user docs [docs.getdbt.com] Needs better documentation
Projects
None yet
Development

No branches or pull requests

3 participants