Clarify orig_elements documentation #2929

Open
Marcell-Balint opened this issue Apr 25, 2024 · 4 comments
Labels
documentation Improvements or additions to documentation enhancement New feature or request

Comments

@Marcell-Balint

The current docs do not specify that the elements in orig_elements are not written to the JSON file as JSON objects.
It would be clearer if you gave an example of the serialization behavior.

Thanks in advance!

@Marcell-Balint Marcell-Balint added the enhancement New feature or request label Apr 25, 2024
@kaleyroy

kaleyroy commented Apr 27, 2024

I ran into the same issue.
Currently orig_elements is output as Base64, not as JSON. I don't know how to deserialize it to JSON or text.
I am using the partition API via Docker for local development.

@kaleyroy

I figured it out.

from unstructured.staging.base import elements_from_base64_gzipped_json

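# res_elements is the list of element dicts returned by the partition API;
# index 4 is simply the chunk inspected here.
base64_elements_str = res_elements[4]["metadata"]["orig_elements"]
# Decode the Base64 string back into a list of in-memory elements.
eles = elements_from_base64_gzipped_json(base64_elements_str)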

Found in this PR


@Marcell-Balint
Author

Marcell-Balint commented Apr 27, 2024

@kaleyroy you could also try this solution from an earlier conversation:
#2887 (comment)

As far as I understand data pipelines, I don't see much reason to dump the objects into JSON files. I would instead recommend processing the data while everything is loaded in memory and only dumping the final output to JSON.
Note that you can modify elements and chunks, so you don't even have to create a separate structure for your new data. You can extend or modify the text of your current chunks / elements.

@scanny scanny added the documentation Improvements or additions to documentation label Apr 28, 2024
@scanny
Collaborator

scanny commented Apr 28, 2024

Good call, we'll clarify this in the documentation.

In the meantime, if you want to "rehydrate" elements in JSON form into in-memory objects, all you need to do is:

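from unstructured.documents.elements import Element
from unstructured.staging.base import elements_from_json

# path_to_json_file is the path of the JSON file you saved earlier.
elements: list[Element] = elements_from_json(path_to_json_file)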

The list[Element] you get will be just like the one originally produced by partitioning, or chunking, or whatever state it was in when you saved it.

This includes the .metadata.orig_elements value of each chunk being expanded into in-memory elements. The orig_elements are only compressed and Base64-encoded when serialized, which is when Element.to_dict() is called on an element.
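
To make the serialization behavior concrete, here is a minimal standard-library sketch of the kind of round trip described above. It does not use unstructured itself. The helper names `encode_orig_elements` and `decode_orig_elements` are hypothetical, the sample dicts are made up, and the exact scheme (JSON, then zlib compression, then Base64) is an assumption based on the name `elements_from_base64_gzipped_json`, not something confirmed in this thread.

```python
import base64
import json
import zlib


def encode_orig_elements(element_dicts: list[dict]) -> str:
    """Turn a list of element dicts into one compact Base64 string.

    Assumed scheme: JSON -> zlib compression -> Base64 text.
    """
    json_bytes = json.dumps(element_dicts, sort_keys=True).encode("utf-8")
    compressed = zlib.compress(json_bytes)  # assumption: zlib/DEFLATE
    return base64.b64encode(compressed).decode("utf-8")


def decode_orig_elements(b64_str: str) -> list[dict]:
    """Reverse the steps above: Base64 -> decompress -> JSON."""
    compressed = base64.b64decode(b64_str)
    return json.loads(zlib.decompress(compressed).decode("utf-8"))


# Hypothetical element dicts standing in for a chunk's original elements.
sample = [
    {"type": "Title", "text": "Introduction"},
    {"type": "NarrativeText", "text": "First paragraph."},
]

encoded = encode_orig_elements(sample)   # opaque string, as seen in the JSON file
decoded = decode_orig_elements(encoded)  # readable dicts again
```

This is why the `orig_elements` field in a JSON file looks like an opaque string rather than a nested list of objects. In real code, prefer the library helpers shown earlier in this thread, since they also rebuild proper Element objects.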
