Clarify `orig_elements` documentation #2929

Marcell-Balint · 2024-04-25T07:26:22Z

The current docs do not mention that orig_elements are not written to the JSON file as plain JSON objects.
It would be clearer if you gave an example of the serialization behavior.

Thanks in advance!

kaleyroy · 2024-04-27T11:17:19Z

I ran into the same issue.
Currently orig_elements is output as Base64, not as JSON. I don't know how to deserialize it to JSON or text.
I am using the partition_api via Docker in local development.

kaleyroy · 2024-04-27T14:12:40Z

I figured it out.

from unstructured.staging.base import elements_from_base64_gzipped_json

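# res_elements is presumably the list of element dicts returned by the partition API
base64_elements_str = res_elements[4]["metadata"]["orig_elements"]
# Decode the Base64 string back into a list of in-memory Element objects
eles = elements_from_base64_gzipped_json(base64_elements_str)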

Found in this PR

Marcell-Balint · 2024-04-27T20:29:38Z

@kaleyroy you could also try this solution from an earlier conversation:
#2887 (comment)

As far as I understand data pipelines, I don't see much reason to dump the objects into JSON files. I would rather recommend processing the data while everything is loaded in memory and only dumping the final output to JSON.
Note that you can modify elements and chunks, so you don't even have to create a separate structure for your new data. You can extend or modify the text of your current chunks / elements.

scanny · 2024-04-28T18:24:02Z

Good call, we'll clarify this in the documentation.

In the meantime, if you want to "rehydrate" elements in JSON form into in-memory objects, all you need to do is:

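from unstructured.documents.elements import Element
from unstructured.staging.base import elements_from_json

# path_to_json_file: path to a JSON file of previously saved elements
elements: list[Element] = elements_from_json(path_to_json_file)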

The list[Element] you get back will match the one you saved, whether that was the direct output of partitioning, the output of chunking, or any later state.

This includes the .metadata.orig_elements value of each chunk, which is expanded back into in-memory elements. The orig_elements are only compressed and Base64-encoded when serialized, which happens when Element.to_dict() is called on an element.
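
For anyone who wants to see roughly what that serialized string contains without going through the library, here is a minimal sketch using only the Python standard library. It assumes the value is JSON that has been compressed and then Base64-encoded, as the name elements_from_base64_gzipped_json suggests. The helper names encode_orig_elements and decode_orig_elements are hypothetical, the element dicts are made-up sample data, and the exact compression settings the library uses are an assumption, so prefer the library function in real code.

```python
import base64
import json
import zlib


def encode_orig_elements(element_dicts: list[dict]) -> str:
    """Hypothetical helper: JSON -> compressed bytes -> Base64 string."""
    json_bytes = json.dumps(element_dicts).encode("utf-8")
    compressed = zlib.compress(json_bytes)  # assumption: zlib-format compression
    return base64.b64encode(compressed).decode("ascii")


def decode_orig_elements(b64_str: str) -> list[dict]:
    """Hypothetical helper: Base64 string -> compressed bytes -> JSON dicts."""
    compressed = base64.b64decode(b64_str)
    # wbits=47 lets zlib auto-detect either a zlib or a gzip header
    json_bytes = zlib.decompress(compressed, wbits=47)
    return json.loads(json_bytes.decode("utf-8"))


# Made-up sample data standing in for the elements behind one chunk
sample = [
    {"type": "Title", "text": "Hello"},
    {"type": "NarrativeText", "text": "World"},
]
encoded = encode_orig_elements(sample)
decoded = decode_orig_elements(encoded)
```

This only recovers plain dicts. To get real Element objects back, the library's own elements_from_base64_gzipped_json shown earlier in this thread is the safer route.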

Marcell-Balint added the enhancement New feature or request label Apr 25, 2024

scanny added the documentation Improvements or additions to documentation label Apr 28, 2024
