[BUG] Introduce parser supplier support in FileSystemDocumentLoader #1031

Merged
langchain4j merged 2 commits into langchain4j:main from feature/parser-supplier on May 6, 2024

Conversation

KaisNeffati
Contributor

Issue

#1026

Change

General checklist

  • There are no breaking changes
  • I have added unit and integration tests for my change
  • I have manually run all the unit and integration tests in the module I have added/changed, and they are all green
  • I have manually run all the unit and integration tests in the core and main modules, and they are all green

Checklist for adding new model integration

  • I have added my new module in the BOM

Checklist for adding new embedding store integration

  • I have added a {NameOfIntegration}EmbeddingStoreIT that extends from either EmbeddingStoreIT or EmbeddingStoreWithFilteringIT
  • I have added my new module in the BOM

Checklist for changing existing embedding store integration

  • I have manually verified that the {NameOfIntegration}EmbeddingStore works correctly with the data persisted using the latest released version of LangChain4j

langchain4j added the "P1 Highest priority" label on Apr 29, 2024
@langchain4j
Owner

@KaisNeffati thanks a lot!

I am not sure the proposed solution will solve the issue, as:

  • Current users already use existing methods of FileSystemDocumentLoader and they will have to change their code to use new methods with suppliers in order to fix the issue
  • New users will not necessarily find and use the "correct" new methods with suppliers, so the wrong behavior might still persist for them and be unobvious
  • All other parsers are stateless, so having a stateful ApacheTikaDocumentParser is counter-intuitive

Instead, I would propose to make ApacheTikaDocumentParser stateless:

  • Parser, ContentHandler and the other Tika components should be created in the parse method instead of the constructor
  • A new constructor should be added that accepts suppliers of Parser, ContentHandler and the other Tika components
  • The existing non-default constructor should be marked as @Deprecated, with a comment to use the constructor with suppliers instead if the user wants to reuse this parser for multiple files
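
A minimal sketch of the shape this proposal describes, not the actual LangChain4j or Tika code. `BufferingHandler` and `StatelessParserSketch` are hypothetical stand-ins (the handler plays the role of a stateful Tika component such as ContentHandler), and the real constructor signatures may differ:

```java
import java.util.function.Supplier;

// Hypothetical stand-in for a stateful component: it accumulates
// everything that is written to it.
final class BufferingHandler {
    private final StringBuilder buffer = new StringBuilder();

    void write(String text) {
        buffer.append(text);
    }

    String content() {
        return buffer.toString();
    }
}

// Hypothetical parser that keeps no per-document state of its own.
final class StatelessParserSketch {
    private final Supplier<BufferingHandler> handlerSupplier;

    // Default constructor: a fresh handler is created for every parse call.
    StatelessParserSketch() {
        this(BufferingHandler::new);
    }

    // New constructor: the caller provides a supplier, not an instance.
    StatelessParserSketch(Supplier<BufferingHandler> handlerSupplier) {
        this.handlerSupplier = handlerSupplier;
    }

    // Existing instance-based constructor, kept for compatibility.
    // The same handler is reused on every call, so its content accumulates.
    @Deprecated
    StatelessParserSketch(BufferingHandler sharedHandler) {
        this(() -> sharedHandler);
    }

    String parse(String rawInput) {
        // The stateful component is obtained here, not in the constructor.
        BufferingHandler handler = handlerSupplier.get();
        handler.write(rawInput.trim());
        return handler.content();
    }
}
```

With the default or supplier-based constructor, two consecutive parse calls on one parser instance return independent results. With the deprecated instance-based constructor, the second result also contains the first input, which is the kind of surprise the deprecation comment would warn about.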

WDYT? Thanks!

@KaisNeffati KaisNeffati force-pushed the feature/parser-supplier branch 2 times, most recently from 7c56749 to aa1269d Compare April 29, 2024 20:17
@KaisNeffati
Contributor Author

Yeah, that's a better approach! I'll update the PR accordingly.

Well spotted!

Owner

@langchain4j langchain4j left a comment

@KaisNeffati thank you, good job! Just a few minor comments

Owner

@langchain4j langchain4j left a comment

@KaisNeffati thank you! Good job!

@langchain4j langchain4j merged commit f34c543 into langchain4j:main May 6, 2024
1 check passed