Getting frequent 504s with medium sized PDFs and ocr_only and hi_res strategies #393

mennthor · 2024-03-18T16:38:33Z

Describe the bug
Sending a single PDF (this one: https://arxiv.org/abs/2310.12931, embedded text, 39 pages) to the self-hosted API with either the ocr_only or hi_res strategy results in a 504.
The server has 32GB RAM, 8 CPU cores and a CUDA-enabled GPU; resource usage stays below 20% CPU and 5% RAM while processing the PDF.
The Unstructured API version is v0.0.61.

The server responds with 504 after some 20 to 30s, and the client caller (partition_via_api) keeps retrying for some time.
In the server logs I can see that each new request is properly worked on, printing '[...] unstructured INFO Processing entire page OCR [...]', and the server does not crash.
However, the client detaches and abandons the request, so when the server finishes processing, the result is just discarded.

On the client side I get

INFO: Response status code: 504 Retry attempt #1. Sleeping 1.4 seconds before retry.
INFO: <html>
<head><title>504 Gateway Time-out</title></head>
<body>
<center><h1>504 Gateway Time-out</h1></center>
<hr><center>nginx</center>
</body>
</html>

leaving the process running on the server and triggering a new one which gets detached in a similar fashion.

I tested the same call on the officially hosted API and the result was there after some 5m:30s.

Question: Is there some setting I can change on the server side to avoid this? It obviously works on the officially hosted service, so the answer should be yes. Could you point me in the right direction?

To Reproduce
Calling the API like this:

from unstructured.partition.api import partition_via_api
API_BASE_URL = "....."

result = partition_via_api(
    "files/2310.12931.pdf",
    strategy="ocr_only",
    api_url=f"{API_BASE_URL}/general/v0/general/",
)

Filetype: PDF (see above for exact file)
Any additional API parameters: The server runs with UNSTRUCTURED_MEMORY_FREE_MINIMUM_MB="4096"

Environment:

Using the hosted API or self hosting?
- using the self-hosted API
How are you calling the API? (Langchain, SDKs, cUrl, etc.)
- Calling via Unstructured API SDK partition_via_api

Additional context
Update: The same happens when calling the API with cURL like so:

curl -X "POST" "$API_BASE_URL/general/v0/general" \
-H 'accept: application/json' \
-H 'Content-Type: multipart/form-data' \
-F files=@files/2310.12931.pdf \
-F 'strategy=ocr_only'

minus the retry part, because cURL simply stops after a single attempt.
On the server side the behaviour stays the same, the process keeps running until properly finished and is then discarded because the client detached.

mennthor · 2024-04-18T08:43:48Z

Hi :)
does anyone have an idea what might be going wrong here?

awalker4 · 2024-05-04T18:27:51Z

Hi, sorry for the delay here. I think the first place to check would be your nginx config, if you have access to it. There's usually some server timeout value that can be increased. In general, our advice for hi_res or OCR PDFs has been to split up the file and send smaller sets of pages in parallel, since these long-lived requests can cause all sorts of trouble, as you're seeing. Our client code supports split_pdf_page=True, which should also work in partition_via_api. More details are in the python-client readme. Let me know if this works!
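
For anyone splitting manually instead of using split_pdf_page, the split-and-send-in-parallel idea can be sketched roughly as below. This is only an illustration and not the client's actual implementation: page_ranges, partition_in_parallel and the send_range callable are hypothetical names. send_range stands in for whatever code extracts a range of pages into a smaller PDF and posts it to the API.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, Tuple


def page_ranges(n_pages: int, pages_per_request: int) -> List[Tuple[int, int]]:
    """Return inclusive, 1-based (first, last) page ranges covering the document."""
    if n_pages < 0 or pages_per_request < 1:
        raise ValueError("n_pages must be >= 0 and pages_per_request >= 1")
    return [
        (start, min(start + pages_per_request - 1, n_pages))
        for start in range(1, n_pages + 1, pages_per_request)
    ]


def partition_in_parallel(
    n_pages: int,
    pages_per_request: int,
    send_range: Callable[[int, int], Sequence],
    max_workers: int = 4,
) -> list:
    """Send each page range as its own request and merge results in page order.

    send_range is a hypothetical callable supplied by the caller: it should
    extract pages first..last into a smaller PDF, post that PDF to the API,
    and return the elements for those pages.
    """
    ranges = page_ranges(n_pages, pages_per_request)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields results in submission order, so page order is preserved.
        per_range = list(pool.map(lambda r: send_range(*r), ranges))
    return [element for chunk in per_range for element in chunk]
```

Each request then covers only a few pages, so it has a better chance of finishing before the proxy timeout, and a failed range can be retried without redoing the whole document.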

mennthor · 2024-05-06T05:56:28Z

Thx :)
I'll try it and give feedback

mennthor · 2024-05-21T11:03:09Z

Just a short update.
I'm having trouble using the split option, but only because of certificate errors (our network uses custom certificates).
I tried both the linked version and manually splitting and sending via a concurrent.futures thread pool, but neither attempt works.
This definitely has nothing to do with Unstructured; I'll try to figure it out so I can properly test the suggested PDF splitting.

Bryson14 · 2024-05-24T13:39:42Z

I'm having the same issues. 504 means your proxy server is timing out the HTTP request because the unstructured server hasn't responded. You can change the idle timeout period for the proxy server, or you can give the unstructured container more CPU power. I'm looking into using a GPU docker container because the logs show "lib/python3.10/site-packages/torch/cuda/__init__.py:619: UserWarning: Can't initialize NVML". But I'll have to figure that out.
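
To get a feel for the trade-off between raising the proxy timeout and splitting the file, here is a back-of-the-envelope sketch using the numbers reported earlier in this thread (about 5m:30s for 39 pages on the hosted API, and a proxy cut-off after 20 to 30s). It assumes processing time scales roughly linearly with page count and that a self-hosted server is about as fast as the hosted one; neither is confirmed here. The function name and the safety factor are made up for illustration.

```python
import math


def max_pages_per_request(
    total_seconds: float,
    total_pages: int,
    proxy_timeout_seconds: float,
    safety_factor: float = 0.5,
) -> int:
    """Estimate how many pages fit in one request without hitting the proxy timeout.

    Assumes a constant per-page processing time (a simplification) and uses
    only a fraction of the timeout (safety_factor) as the time budget.
    """
    seconds_per_page = total_seconds / total_pages
    budget = proxy_timeout_seconds * safety_factor
    # Always send at least one page, even if the estimate says none would fit.
    return max(1, math.floor(budget / seconds_per_page))


# Numbers taken from this thread: 330 s for 39 pages, proxy timeout around 20 s.
print(max_pages_per_request(330, 39, 20))   # 1
# The same document behind a proxy with a 600 s timeout (hypothetical value).
print(max_pages_per_request(330, 39, 600))  # 35
```

Under these assumptions, a 20s timeout leaves room for roughly one page per request, so with ocr_only or hi_res either the proxy timeout has to go up substantially or the PDF has to be split very finely.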
