feat/partition_metadata #2933

Open
Falven opened this issue Apr 25, 2024 · 1 comment
Labels
enhancement New feature or request html

Comments


Falven commented Apr 25, 2024

Is your feature request related to a problem? Please describe.
I need to extract additional metadata from HTML documents. Specifically, I would like to extract favicons and head > title elements.

Describe the solution you'd like
A flexible way to define additional metadata to extract per document type. Text file types could use regex (which already seems to be supported), HTML could use selectors, and so on.
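To make the idea concrete, here is a minimal sketch of a per-document-type extractor registry. Everything in it is hypothetical and not part of the library: the `EXTRACTORS` mapping, `regex_extractor`, `extract_html_head`, and `extract_metadata` are names made up for illustration. It also assumes the standard library's `html.parser` in place of real CSS selectors, so that the sketch has no dependencies.

```python
import re
from html.parser import HTMLParser
from typing import Callable, Dict

# Hypothetical type: an extractor maps raw document text to extra metadata.
Extractor = Callable[[str], Dict[str, str]]


def regex_extractor(patterns: Dict[str, str]) -> Extractor:
    """Build an extractor that stores group 1 of each pattern's first match."""
    compiled = {name: re.compile(p, re.MULTILINE) for name, p in patterns.items()}

    def extract(text: str) -> Dict[str, str]:
        found = {}
        for name, pattern in compiled.items():
            match = pattern.search(text)
            if match:
                found[name] = match.group(1)
        return found

    return extract


class _HeadParser(HTMLParser):
    """Collects the <title> text and the first favicon href."""

    def __init__(self) -> None:
        super().__init__()
        self.metadata: Dict[str, str] = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "link" and "icon" in (attr.get("rel") or "").lower().split():
            href = attr.get("href")
            if href:
                # Keep only the first icon link that appears.
                self.metadata.setdefault("favicon", href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.metadata["title"] = self.metadata.get("title", "") + data


def extract_html_head(html: str) -> Dict[str, str]:
    parser = _HeadParser()
    parser.feed(html)
    parser.close()
    return {key: value.strip() for key, value in parser.metadata.items()}


# Hypothetical registry keyed by content type; the regex pattern is an example only.
EXTRACTORS: Dict[str, Extractor] = {
    "text/plain": regex_extractor({"invoice_id": r"Invoice #(\d+)"}),
    "text/html": extract_html_head,
}


def extract_metadata(content: str, content_type: str) -> Dict[str, str]:
    """Run the extractor registered for content_type, if there is one."""
    extractor = EXTRACTORS.get(content_type)
    return extractor(content) if extractor else {}
```

With a registry like this, callers would only register one function per document type, and unknown types would simply yield no extra metadata. Note that the sketch does not check that `<title>` sits inside `<head>`, which a selector such as head > title would.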

Describe alternatives you've considered
Doing it after partitioning and before indexing, but that is neither elegant nor efficient.

Additional context
Even using LLMs to extract metadata, as orchestration frameworks support, would be great.

@Falven Falven added the enhancement New feature or request label Apr 25, 2024
@scanny scanny added the html label Apr 25, 2024
Contributor

adieuadieu commented Apr 29, 2024

I've also wanted this: the title, but also meta tags such as keywords and description, and the og tags. Currently I fetch the URL myself, parse these out with beautifulsoup, then pass the response text to partition for the rest. But it would be nicer if partition_html could return these things in a more structured way. Especially for the title, it would be nice if it came back as, e.g., a PageTitle (or, I guess, HTMLHeadTitle?) element type, or something like that.
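For anyone needing the same workaround in the meantime, here is a rough sketch of the "parse the head yourself" step. It is not the exact code described above: it assumes the standard library's `html.parser` instead of beautifulsoup to stay dependency-free, and `HeadMetaParser` and `parse_head_metadata` are made-up names. Fetching the URL is left out.

```python
from html.parser import HTMLParser
from typing import Dict, List, Optional


class HeadMetaParser(HTMLParser):
    """Collects the title, keywords/description meta tags, and og:* tags."""

    def __init__(self) -> None:
        super().__init__()
        self.title: Optional[str] = None
        self.meta: Dict[str, str] = {}
        self._title_parts: List[str] = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "title" and self.title is None:
            self._in_title = True
        elif tag == "meta":
            # Open Graph tags use `property`; keywords/description use `name`.
            key = attr.get("property") or attr.get("name")
            content = attr.get("content")
            if key and content is not None:
                key = key.lower()
                if key in ("keywords", "description") or key.startswith("og:"):
                    # Keep the first value seen for each key.
                    self.meta.setdefault(key, content)

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self.title = "".join(self._title_parts).strip()

    def handle_data(self, data):
        if self._in_title:
            self._title_parts.append(data)


def parse_head_metadata(html: str) -> Dict[str, str]:
    """Return the title and selected meta tags found in an HTML string."""
    parser = HeadMetaParser()
    parser.feed(html)
    parser.close()
    result = dict(parser.meta)
    if parser.title:
        result["title"] = parser.title
    return result
```

The same HTML string can then be handed to `partition_html(text=html)` for the body elements, and the returned dictionary kept alongside them until something more structured exists.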
