[Feature] Support Semi-Structured Data Columns in Unit Test dict fixture data #10161

ronco · 2024-05-16T19:53:24Z

Is this your first time submitting a feature request?

I have read the expectations for open source contributors
I have searched the existing issues, and I could not find an existing issue for this feature
I am requesting a straightforward extension of existing dbt functionality, rather than a Big Idea better suited to a discussion

Describe the feature

When writing unit tests that include semi-structured data columns (SUPER in Redshift for example), the fixture data can only be staged via SQL. This feature calls for understanding structured data in dict fixtures.

An example fixture definition would be something like this:

- input: ref('table_with_super')
  format: dict
  rows:
    - { structured_column: {foo: bar}, int_column: 123 }

Describe alternatives you've considered

This functionality can be achieved using SQL fixtures like so:
SELECT json_parse('{"foo":"bar"}')::SUPER as structured_column
This is quite cumbersome as you have to specify all significant columns in your sql.

Who will this benefit?

Anyone making use of semi-structured data columns.

Are you interested in contributing this feature?

I am open to contributing if this is of interest.

Anything else?

No response

The text was updated successfully, but these errors were encountered:

graciegoheen · 2024-05-21T17:30:21Z

Hi! I believe we do support json structured types in unit test definitions using format: dict (see our docs here).

You'll need to use quotes though like so:

- input: ref('table_with_super')
  format: dict
  rows:
    - { structured_column: '{"foo": "bar"}', int_column: 123 }

ronco · 2024-05-21T20:18:09Z

Hey @graciegoheen , thanks for getting back to me. I've given your suggestion a try and I'm still running into difficulty. When I have this input:

          - { created_at: '2024-05-17 21:05:00', extra_info: '{"amount_usd": 50}' }

it generates this SQL for the fixture data: cast('{"amount_usd": 50}' as super)
When I try to query that fixture I get NULL:

with input_data as (select    cast('{"amount_usd": 50}' as super) as extra_info)
select extra_info.amount_usd from input_data

amount_usd|
----------+
          |

It seems the straight casting of the string to super isn't decomposing the data into structured values. However, if I switch the SQL to use json_parse I get the expected value:

with input_data as (select    cast(json_parse('{"amount_usd": 50}') as super) as extra_info)
select extra_info.amount_usd from input_data

amount_usd|
----------+
50        |

Is there a setting I'm missing for the Redshift adapter perhaps to make this work? Any other suggestions?

Thanks!

dbeatty10 · 2024-05-22T15:08:34Z

Hey @ronco -- @graciegoheen and I discussed this a little bit yesterday.

You are spot-on with the following:

It seems the straight casting of the string to super isn't decomposing the data into structured values

This is because dbt unit testing doesn't do any unnesting / decomposition of the SUPER object -- it just leaves it as the entire object.

In order to do a unit test for one of the nested values, then you'll need to supply expected values for all of the nested values as well. See below for a full example.

If this doesn't work for you, could you share an example of your YAML for your unit test and your SQL models and we'll take another look?

More detail

It looks like the key insight in dbt-redshift is that:

You must specify all fields in a Redshift SUPER in a unit test. You cannot use only a subset of fields in a SUPER.

This is similar to a caveat in dbt-bigquery:

You must specify all fields in a BigQuery STRUCT in a unit test. You cannot use only a subset of fields in a STRUCT.

Example

models/stg_data_types.sql

select
    1 as int_field,
    2.0 as float_field, 
    1 as numeric_field,
    'my_string' as str_field,
    'my,cool''string' as str_escaped_field,
    true as bool_field,
    cast('2020-01-02' as date) as date_field,
    cast('2013-11-03 00:00:00-0' as timestamp without time zone) as timestamp_field,
    cast('2013-11-03 00:00:00-0' as timestamp with time zone) as timestamptz_field,
    cast(json_parse('{"bar": "baz", "balance": 7.77, "active": false}') as super) as json_field

models/fct_data_types.sql

select * from {{ ref("stg_data_types") }}

unit_tests:
  - name: test_my_data_types
    model: fct_data_types
    given:
      - input: ref('stg_data_types')
        rows:
         - int_field: 1
           float_field: 2.0
           numeric_field: 1
           str_field: my_string
           str_escaped_field: "my,cool'string"
           bool_field: true
           date_field: 2020-01-02
           timestamp_field: 2013-11-03 00:00:00-0
           timestamptz_field: 2013-11-03 00:00:00-0
           json_field: '{"bar": "baz", "balance": 7.77, "active": false}'
    expect:
      rows:
         - int_field: 1
           float_field: 2.0
           numeric_field: 1
           str_field: my_string
           str_escaped_field: "my,cool'string"
           bool_field: true
           date_field: 2020-01-02
           timestamp_field: 2013-11-03 00:00:00-0
           timestamptz_field: 2013-11-03 00:00:00-0
           json_field: '{"bar": "baz", "balance": 7.77, "active": false}'

Note how the last line in the unit test includes bar, balance, and active, even though you might only care about testing the value of balance.

ronco · 2024-05-23T17:08:31Z

Hey @dbeatty10 , thanks for all this guidance. I am a little confused on how to pull this off. We're using our SUPER field for flexible data storage. In other words, there is no underlying defined STRUCT, the content varies from row to row. So I'm not sure what values I would need to add.

As requested, here are sample files illustrating the issue:
stg_data_types.sql:

select
    1 as int_field,
    2.0 as float_field, 
    1 as numeric_field,
    'my_string' as str_field,
    'my,cool''string' as str_escaped_field,
    true as bool_field,
    cast('2020-01-02' as date) as date_field,
    cast('2013-11-03 00:00:00-0' as timestamp without time zone) as timestamp_field,
    cast('2013-11-03 00:00:00-0' as timestamp with time zone) as timestamptz_field,
    cast('' as super) as json_field

and fct_data_types.sql:

select str_field,
sum(json_field.balance) as sum_balance
from {{ ref("stg_data_types") }}
group by 1

and finally schema.yml:

unit_tests:
  - name: test_my_data_types
    model: fct_data_types
    given:
      - input: ref('stg_data_types')
        rows:
         - int_field: 1
           float_field: 2.0
           numeric_field: 1
           str_field: my_string
           str_escaped_field: "my,cool'string"
           bool_field: true
           date_field: 2020-01-02
           timestamp_field: 2013-11-03 00:00:00-0
           timestamptz_field: 2013-11-03 00:00:00-0
           json_field: '{"bar": "baz", "balance": 7.77, "active": false}'
    expect:
      rows:
         - str_field: 'my_string'
           sum_balance: 7.77

When I run that test I get the following output:

17:06:28  1 of 1 START unit_test fct_data_types::test_my_data_types ...................... [RUN]
17:06:30  1 of 1 FAIL 1 fct_data_types::test_my_data_types ............................... [FAIL 1 in 2.58s]
17:06:31
17:06:31  Finished running 1 unit test in 0 hours 0 minutes and 6.89 seconds (6.89s).
17:06:31
17:06:31  Completed with 1 error and 0 warnings:
17:06:31
17:06:31  Failure in unit_test test_my_data_types (models/dbt-test/schema.yml)
17:06:31

actual differs from expected:

@@,str_field,sum_balance
→ ,my_string,7.77→NULL

I think the problem is the cast is treating the json_field as a plain string, rather than a structured field I can extract data from. Let me know if there's anything else I can provide to get to the bottom of this.

Thanks!

ronco · 2024-05-31T15:24:02Z

Any further thoughts here? Thanks!

ronco added enhancement New feature or request triage labels May 16, 2024

dbeatty10 added the unit tests Issues related to built-in dbt unit testing functionality label May 16, 2024

graciegoheen added awaiting_response and removed triage labels May 21, 2024

github-actions bot added triage and removed awaiting_response labels May 21, 2024

dbeatty10 self-assigned this May 22, 2024

dbeatty10 added awaiting_response and removed triage labels May 22, 2024

dbeatty10 removed their assignment May 22, 2024

github-actions bot added triage and removed awaiting_response labels May 23, 2024

graciegoheen added the user docs [docs.getdbt.com] Needs better documentation label May 30, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Feature] Support Semi-Structured Data Columns in Unit Test dict fixture data #10161

[Feature] Support Semi-Structured Data Columns in Unit Test dict fixture data #10161

ronco commented May 16, 2024 •

edited by joellabes

graciegoheen commented May 21, 2024

ronco commented May 21, 2024

dbeatty10 commented May 22, 2024

ronco commented May 23, 2024

ronco commented May 31, 2024

[Feature] Support Semi-Structured Data Columns in Unit Test dict fixture data #10161

[Feature] Support Semi-Structured Data Columns in Unit Test dict fixture data #10161

Comments

ronco commented May 16, 2024 • edited by joellabes

Is this your first time submitting a feature request?

Describe the feature

Describe alternatives you've considered

Who will this benefit?

Are you interested in contributing this feature?

Anything else?

graciegoheen commented May 21, 2024

ronco commented May 21, 2024

dbeatty10 commented May 22, 2024

More detail

Example

ronco commented May 23, 2024

ronco commented May 31, 2024

ronco commented May 16, 2024 •

edited by joellabes