SQL Lab hard fails with "'utf-8' codec can't decode byte 0xe6 in position 0: invalid continuation byte" when reading certain binary types #28372

Open
3 tasks done
whitleykeith opened this issue May 7, 2024 · 4 comments

Comments

@whitleykeith

whitleykeith commented May 7, 2024

Bug description

See #28001 for context

I attempted to validate the fix from #28266 but I'm still getting the same error as described in the issue.

How to reproduce the bug

  1. Have a Trino table with a VARBINARY column
  2. Insert binary data that is not valid UTF-8
  3. Try to query it from SQL Lab

Screenshots/recordings

No response

Superset version

467e612-dev

Python version

3.10

Node version

I don't know

Browser

Chrome

Additional context

Stack trace:

superset   File "/usr/local/lib/python3.10/site-packages/flask/app.py", line 1484, in full_dispatch_request
superset     rv = self.dispatch_request()
superset   File "/usr/local/lib/python3.10/site-packages/flask/app.py", line 1469, in dispatch_request
superset     return self.ensure_sync(self.view_functions[rule.endpoint])(**view_args)
superset   File "/usr/local/lib/python3.10/site-packages/flask_appbuilder/security/decorators.py", line 95, in wraps
superset     return f(self, *args, **kwargs)
superset   File "/app/superset/views/base_api.py", line 127, in wraps
superset     raise ex
superset   File "/app/superset/views/base_api.py", line 121, in wraps
superset     duration, response = time_function(f, self, *args, **kwargs)
superset   File "/app/superset/utils/core.py", line 1491, in time_function
superset     response = func(*args, **kwargs)
superset   File "/usr/local/lib/python3.10/site-packages/flask_appbuilder/api/__init__.py", line 182, in wraps
superset     return f(self, *args, **kwargs)
superset   File "/app/superset/utils/log.py", line 263, in wrapper
superset     value = f(*args, **kwargs)
superset   File "/app/superset/sqllab/api.py", line 346, in get_results
superset     payload = json.dumps(
superset   File "/usr/local/lib/python3.10/site-packages/simplejson/__init__.py", line 395, in dumps
superset     **kw).encode(obj)
superset   File "/usr/local/lib/python3.10/site-packages/simplejson/encoder.py", line 298, in encode
superset     chunks = self.iterencode(o)
superset   File "/usr/local/lib/python3.10/site-packages/simplejson/encoder.py", line 379, in iterencode
superset     return _iterencode(o, 0)
superset UnicodeDecodeError: 'utf-8' codec can't decode byte 0xdb in position 0: invalid continuation byte

Also, sometimes the stack trace and error message can be slightly different. I can see this stack trace too:

File "/app/superset/sqllab/api.py", line 342, in get_results
  result = SqlExecutionResultsCommand(key=key, rows=rows).run()
File "/app/superset/commands/sql_lab/results.py", line 106, in run
 payload = utils.zlib_decompress(
File "/app/superset/utils/core.py", line 984, in zlib_decompress
  return decompressed.decode("utf-8") if decode else decompressed
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x87 in position 0: invalid start byte
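
The second trace can be reproduced outside Superset with a minimal sketch. It assumes the decompression helper looks roughly like the single line shown in the trace; `zlib_decompress_sketch` and the sample bytes are hypothetical stand-ins, not Superset's actual code or data.

```python
import zlib


def zlib_decompress_sketch(blob: bytes, decode: bool = True):
    """Hypothetical stand-in modelled on the line shown in the trace."""
    decompressed = zlib.decompress(blob)
    return decompressed.decode("utf-8") if decode else decompressed


# 0x87 is a UTF-8 continuation byte, so it cannot start a character.
raw = b"\x87\x01\x02"
blob = zlib.compress(raw)

try:
    zlib_decompress_sketch(blob)
    error_message = None
except UnicodeDecodeError as ex:
    error_message = str(ex)

# Skipping the decode step returns the original bytes untouched.
roundtrip = zlib_decompress_sketch(blob, decode=False)
print(error_message)
```

In this sketch the failure comes from the unconditional UTF-8 decode after decompression, not from zlib itself, so passing `decode=False` avoids it.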

Checklist

  • I have searched Superset docs and Slack and didn't find a solution to my problem.
  • I have searched the GitHub issue tracker and didn't find a similar bug report.
  • I have checked Superset's logs for errors and if I found a relevant Python stacktrace, I included it here as text in the "additional context" section.

dosubot bot commented May 7, 2024

To address the UnicodeDecodeError you're encountering when querying binary, non-UTF-8 encodable data from a VARBINARY column in Trino using SQL Lab in Superset, consider implementing a custom JSON serializer that can handle binary data by encoding it in a format like base64. This will allow binary data to be serialized into a string format that JSON can handle without errors.

Here are the steps you can take:

  1. Implement a Custom Serializer: Create a custom JSON serializer for handling binary data, ensuring it's encoded in a format such as base64 before serialization.

  2. Modify JSON Serialization Calls: Update the json.dumps calls in sqllab/api.py to utilize this custom serializer, particularly where JSON serialization of execution results and data serialization occurs.

  3. Adjust Data Decompression Logic: In utils/core.py, ensure the decompression logic can handle binary data correctly, which may involve type checks before decompression and appropriate handling for binary data.

  4. Thorough Testing: Test the querying of binary data from VARBINARY columns extensively to confirm the absence of the UnicodeDecodeError.

  5. Documentation: Update project documentation to reflect the handling of binary data in SQL Lab queries.

This approach should help you resolve the encoding issues you're facing with binary data in Superset's SQL Lab.


@cccs-tom
Contributor

cccs-tom commented May 7, 2024

+1
We're running into this as well. It also affects the Table viz (and possibly others), not just SQL Lab.
@rusackas Could this have anything to do with the stringify stuff you were trying to remove?

@rusackas
Member

rusackas commented May 7, 2024

I don't think so, but it's possible. I was trying to allow UTF-8 strings through without them being escaped. It's possible this has to do with something being escaped incorrectly, which may or may not be the same chunk o' code.

@AutumnSun1996

It seems that simplejson handles bytes itself and never calls the 'default' function for them:

import simplejson


def convert_bytes(val):
    if isinstance(val, bytes):
        return 'converted-' + repr(val)
    return val


print(simplejson.__version__)
# 3.17.3

print(simplejson.dumps({'a': b'b'}, default=convert_bytes))
# {"a": "b"}

print(simplejson.dumps({'a': b'\x00\x85'}, default=convert_bytes))
# raises UnicodeDecodeError

So bytes should be handled before they go into the dumps function.
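
One way to do that is to walk the result and convert every bytes value before serializing. The sketch below is only an illustration, not Superset's fix: `bytes_to_text`, `sanitize`, and the `base64:` prefix are hypothetical. It uses the standard-library `json` module so it runs without extra dependencies; because the bytes are gone by then, the same pre-processing should work in front of `simplejson.dumps`.

```python
import base64
import json


def bytes_to_text(value: bytes) -> str:
    """Return UTF-8 text when possible, else a base64 string (hypothetical format)."""
    try:
        return value.decode("utf-8")
    except UnicodeDecodeError:
        return "base64:" + base64.b64encode(value).decode("ascii")


def sanitize(obj):
    """Recursively replace bytes inside dicts, lists, and tuples with strings."""
    if isinstance(obj, (bytes, bytearray)):
        return bytes_to_text(bytes(obj))
    if isinstance(obj, dict):
        return {key: sanitize(val) for key, val in obj.items()}
    if isinstance(obj, (list, tuple)):
        return [sanitize(val) for val in obj]
    return obj


payload = {"a": b"b", "rows": [b"\x00\x85", 1, None]}

# No bytes reach the encoder, so no implicit UTF-8 decode can fail.
encoded = json.dumps(sanitize(payload))
print(encoded)
```

Falling back to base64 keeps the value lossless. A hex string or a placeholder would work as well if the UI only needs something displayable.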
