Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Add id_generator functionality to the document cleaner. #7683

Closed
wants to merge 9 commits into from

Conversation

CarlosFerLo
Copy link
Contributor

Related Issues

Proposed Changes:

I added a new parameter to the DocymentCleaner called id_generator, it expects a Callable[[Document, Document], str] and it is used to generate the new id for the cleaned documents. The first argument is the old document and the second is the new document.

I have added two id generators DEFFAULT_ID_GENERATOR and KEEP_ID where the first one keeps the id from the new document unchanged and the second one returns the id of the original document. The keep id function is to relate to the original issue.

How did you test it?

Added a few unit tests.

Notes for the reviewer

I am not sure if the existence of the DEFFAULT_ID_GENERATOR and KEEP_ID is what you have in mind, if you don't like it we may just delete them.

Checklist

  • I have read the contributors guidelines and the code of conduct
  • I have updated the related issue with new insights and changes ✅
  • I added unit tests and updated the docstrings ✅
  • I've used one of the conventional commit types for my PR title: fix:, feat:, build:, chore:, ci:, docs:, style:, refactor:, perf:, test:. ✅
  • I documented my code ✅
  • I ran pre-commit hooks and fixed any issue ✅

@github-actions github-actions bot added topic:tests type:documentation Improvements on the docs labels May 12, 2024
@CarlosFerLo CarlosFerLo marked this pull request as ready for review May 12, 2024 16:54
@CarlosFerLo CarlosFerLo requested review from a team as code owners May 12, 2024 16:54
@CarlosFerLo CarlosFerLo requested review from dfokina and silvanocerza and removed request for a team May 12, 2024 16:54
@coveralls
Copy link
Collaborator

coveralls commented May 12, 2024

Pull Request Test Coverage Report for Build 9083377539

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • 1 unchanged line in 1 file lost coverage.
  • Overall coverage increased (+0.02%) to 90.458%

Files with Coverage Reduction New Missed Lines %
components/preprocessors/document_cleaner.py 1 99.02%
Totals Coverage Status
Change from base Build 9081766626: 0.02%
Covered Lines: 6560
Relevant Lines: 7252

💛 - Coveralls

@vblagoje
Copy link
Member

@silvanocerza see Github issue for previous discussion on how we got here. I can also take back over PR review because I discussed this with @CarlosFerLo Perhaps have a look as well for general comment and thumbs up/down.

@vblagoje vblagoje self-requested a review May 13, 2024 07:57
@vblagoje
Copy link
Member

@CarlosFerLo this is much better and almost there. A few minor things needed to push this one through the finish line:

  1. Once we introduce these non-standard types for (de)serialization like Callable we need to implement to_dict and from_dict functions in DocumentCleaner class. We already have methods serialize_callable and deserialize_callable that you can use here, super simple, have a look at any of the generators and how it does this exact same thing for streaming_callback field.

  2. Thanks for reno note, we don't use highlights unless it is a major feature, this one is not, so let's just drop the highlights section and also issues section, just leave enhancements.

  3. Add a few unit tests to make sure to_dict and from_dict are implemented well.

And that should be it, let's integrate this one at that point. Thanks @CarlosFerLo

@CarlosFerLo
Copy link
Contributor Author

@vblagoje got it, working on it right now. I thought highlights were just a summary of everything, needed in all reno. Should I include issue numbers now on or not?

@vblagoje
Copy link
Member

@vblagoje got it, working on it right now. I thought highlights were just a summary of everything, needed in all reno. Should I include issue numbers now on or not?

No problem @CarlosFerLo , yeah just remove the issue reference, leave the enhancement portion only 🙏

@CarlosFerLo
Copy link
Contributor Author

@vblagoje I think that everything has been solved. :)

@vblagoje
Copy link
Member

@CarlosFerLo one last nitpick and we can integrate it 🙏

Copy link
Contributor

@silvanocerza silvanocerza left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the wrong approach in my opinion.

The best and simpler approach from #7618 is the way to go. This adds too much unnecessary complexity.

@CarlosFerLo
Copy link
Contributor Author

This is the wrong approach in my opinion.

The best and simpler approach from #7618 is the way to go. This adds too much unnecessary complexity.

@vblagoje you decide, although I believe this new implementation could have more applications, this extra complexity is somewhat too much for the little use it has. But it is your decision to make this merge or not. :)

@CarlosFerLo CarlosFerLo requested a review from vblagoje May 15, 2024 20:47
@CarlosFerLo
Copy link
Contributor Author

@silvanocerza @vblagoje should I reopen #7618 PR?

@silvanocerza
Copy link
Contributor

Yeah, go ahead.

@CarlosFerLo
Copy link
Contributor Author

Cannot reopen previous PR as it's based on the same branch as this one. Later I will create a new branch with the content of the other PR to be able to reopen the pull request.

@vblagoje
Copy link
Member

Hey @CarlosFerLo sure, apologies for back and forth, @silvanocerza is an expert in these details, let's go with his recommendations.

@CarlosFerLo
Copy link
Contributor Author

@vblagoje do not worry, I like to learn how to implement different things, and at first I found your way more promissing.

@vblagoje
Copy link
Member

@vblagoje do not worry, I like to learn how to implement different things, and at first I found your way more promising.

They are both good, it's nuances. My colleague comes with a lot more Python background and I trust his judgement in these types of decisions.

@CarlosFerLo
Copy link
Contributor Author

See the new PR here: #7703

@vblagoje
Copy link
Member

Superseded by #7703 closing this one

@vblagoje vblagoje closed this May 16, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
topic:tests type:documentation Improvements on the docs
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Preprocessing should allow keeping Document ids unchanged
4 participants