Issue: Jina reader fails to parse URLs containing Chinese characters #23

Closed
WongLoki opened this issue Apr 18, 2024 · 2 comments

Comments

@WongLoki

Issue: Jina reader fails to parse URLs containing Chinese characters

Description:

We have encountered an issue where the Jina reader fails to parse URLs that contain Chinese characters. This issue is causing our application to throw errors and prevents us from properly extracting content from certain websites.

Steps to Reproduce:

  1. Make a request to the Jina reader API with a URL containing Chinese characters, such as https://zh.wikipedia.org/wiki/%E5%91%A8%E6%9D%B0%E5%80%AB.
  2. Observe that the Jina reader fails to parse the URL and returns an error.
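The reproduction steps above can be sketched in Python. This is a minimal sketch, assuming the reader is called by prefixing the target URL with https://r.jina.ai/ (the form used in the curl command later in this thread). The helper names reader_url and fetch_via_reader are hypothetical, and the network call is only defined here, not executed.

```python
import urllib.request

# Assumption: the reader is reached by prepending this prefix to the target URL,
# as in the curl command shown later in this thread.
READER_PREFIX = "https://r.jina.ai/"


def reader_url(target: str) -> str:
    """Build the reader request URL for a target page (hypothetical helper)."""
    return READER_PREFIX + target


def fetch_via_reader(target: str, timeout: float = 30.0) -> str:
    """Fetch the reader output for `target`.

    urllib raises urllib.error.HTTPError when the server answers with a
    4xx/5xx status, which is how a failure like the one reported would surface.
    """
    with urllib.request.urlopen(reader_url(target), timeout=timeout) as resp:
        return resp.read().decode("utf-8")


TARGET = "https://zh.wikipedia.org/wiki/%E5%91%A8%E6%9D%B0%E5%80%AB"
print(reader_url(TARGET))
```

Calling fetch_via_reader(TARGET) performs the actual request; whether it succeeds depends on the state of the service at the time.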

Expected Behavior:

We expect the Jina reader to properly handle URLs containing Chinese characters and successfully parse the corresponding web pages. The reader should be able to decode the URL, retrieve the web page content, and return it as expected.

Actual Behavior:

When a URL containing Chinese characters is passed to the Jina reader, it fails to parse the URL and throws an error. The error message typically indicates that the reader is unable to read properties of undefined, specifically the 'parentNode' property.

Example Error Message:

Failed to fetch https://zh.wikipedia.org/wiki/%E5%91%A8%E6%9D%B0%E5%80%AB: {"code":500,"status":50000,"message":"Cannot read properties of undefined (reading 'parentNode')","name":"TypeError"}
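For logging or triage, the JSON payload in a message of this shape can be pulled apart as follows. This is only a sketch: it assumes the message always has the form "Failed to fetch <url>: <json>" as in the example above, and parse_reader_error is a hypothetical helper name.

```python
import json

ERROR_LINE = (
    "Failed to fetch https://zh.wikipedia.org/wiki/%E5%91%A8%E6%9D%B0%E5%80%AB: "
    '{"code":500,"status":50000,"message":"Cannot read properties of undefined '
    "(reading 'parentNode')\",\"name\":\"TypeError\"}"
)


def parse_reader_error(line: str) -> dict:
    """Extract the JSON object that follows the URL in the error line.

    Assumption: the first '{' in the line starts the JSON payload, which
    holds for the example above because the URL contains no braces.
    """
    start = line.index("{")
    return json.loads(line[start:])


error = parse_reader_error(ERROR_LINE)
print(error["code"], error["name"], error["message"])
```

The payload reports an HTTP 500 with a TypeError raised on the server side, not a URL validation error.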

Impact:

This issue prevents our application from properly extracting content from websites that have URLs containing Chinese characters. It limits the functionality of our application and affects the user experience when dealing with such websites.

Potential Causes:

  • The Jina reader may not be properly decoding the URL before making the request, leading to an invalid URL being passed to the underlying parsing logic.
  • The parsing logic within the Jina reader may not be handling URLs with Chinese characters correctly, resulting in the "Cannot read properties of undefined" error.
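The first hypothesis can be partly checked on the client side by confirming that the example URL is validly percent-encoded UTF-8 and round-trips cleanly. The sketch below uses only Python's standard library; it checks the URL itself and makes no claim about how the reader decodes it internally.

```python
from urllib.parse import quote, unquote, urlsplit

URL = "https://zh.wikipedia.org/wiki/%E5%91%A8%E6%9D%B0%E5%80%AB"

# urlsplit leaves the path percent-encoded.
path = urlsplit(URL).path

# unquote decodes %XX escapes as UTF-8 by default.
decoded = unquote(path)

# quote re-encodes non-ASCII characters and keeps "/" unescaped by default.
reencoded = quote(decoded)

print(decoded)
print(reencoded == path)
```

If the decoded path shows the expected characters and re-encoding reproduces the original path, the URL sent to the reader is well formed, which points the investigation toward server-side handling.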

Workaround:

As a temporary workaround, we have implemented a filtering mechanism in our application to skip URLs that contain Chinese characters. However, this is not an ideal solution as it limits the functionality and coverage of our application.
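A filter of this kind could look like the sketch below. It is an illustration, not the exact code used in our application: contains_cjk is a hypothetical helper, and it only checks the CJK Unified Ideographs block (U+4E00 to U+9FFF) after percent-decoding, so it would miss characters from the extension blocks.

```python
from urllib.parse import unquote


def contains_cjk(url: str) -> bool:
    """Return True if the URL contains a CJK Unified Ideograph, raw or percent-encoded."""
    decoded = unquote(url)  # turns %E5%91%A8 etc. back into characters
    return any("\u4e00" <= ch <= "\u9fff" for ch in decoded)


urls = [
    "https://zh.wikipedia.org/wiki/%E5%91%A8%E6%9D%B0%E5%80%AB",
    "https://example.com/page",
]

# Keep only the URLs that the workaround would still send to the reader.
to_fetch = [u for u in urls if not contains_cjk(u)]
print(to_fetch)
```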

Request:

We kindly request the Jina team to investigate this issue and provide a fix that allows the Jina reader to properly handle URLs containing Chinese characters. It would be greatly appreciated if you could provide an update on the progress and an estimated timeline for the resolution.

Additional Information:

  • We are currently using the official Jina reader API, not the open-source service.
  • We are in the process of setting up our own service, but we are unsure if the open-source service also has this issue.

Please let us know if you require any further information or if there are any specific details you need to investigate and resolve this issue.

Thank you for your attention to this matter. We look forward to your response and resolution.

Best regards,
Loki.W

@yangcheng

This seems to be working for me on a Mac, both in the browser and with curl:

➜  ~ curl https://r.jina.ai/https://zh.wikipedia.org/wiki/%E5%91%A8%E6%9D%B0%E5%80%AB
Title: 周杰倫

URL Source: https://zh.wikipedia.org/wiki/%E5%91%A8%E6%9D%B0%E5%80%AB

Published Time: 2004-10-14T15:33:12Z

Markdown Content:
周杰倫

[![](https://upload.wikimedia.org/wikipedia/commons/thumb/9/95/%E5%92%AA%E5%92%95%E9%9F%B3%E4%B9%90%E7%9B%9B%E5%85%B8_%2825%29.jpg/220px-%E5%92%AA%E5%92%95%E9%9F%B3%E4%B9%90%E7%9B%9B%E5%85%B8_%2825%29.jpg)](https://zh.wikipedia.org/wiki/File:%E5%92%AA%E5%92%95%E9%9F%B3%E4%B9%90%E7%9B%9B%E5%85%B8_(25).jpg)

2017年的周杰倫


@nomagick
Member

nomagick commented May 8, 2024

This was a different bug, unrelated to Chinese characters.
It was fixed in cbc13ec.

nomagick closed this as completed on May 8, 2024