Implement Vision Grid Transformer for Document Layout Analysis #100

Open
gregbugaj opened this issue Jan 6, 2024 · 2 comments
gregbugaj (Collaborator) commented Jan 6, 2024


AlibabaResearch recently published a new model, VGT, which sets a new benchmark on the task of Document Layout Analysis (DLA).

From the paper's introduction: "To fully leverage multi-modal information and exploit pre-training techniques to learn better representation for DLA, in this paper, we present VGT, a two-stream Vision Grid Transformer, in which Grid Transformer (GiT) is proposed and pre-trained for 2D token-level and segment-level semantic understanding."
https://arxiv.org/abs/2308.14978
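
For a rough sense of the two-stream idea, here is a minimal PyTorch sketch, assuming only what the quote above states: one stream over the page image, one over a 2D grid of OCR token embeddings, fused per location. The module choices, dimensions, and the name `TwoStreamLayoutModel` are placeholders, not the paper's architecture:

```python
import torch
import torch.nn as nn

class TwoStreamLayoutModel(nn.Module):
    """Illustrative two-stream model: a vision stream over the page image and
    a grid stream over a 2D map of token embeddings, fused per grid cell.
    All modules and sizes are stand-ins, not VGT's actual design."""

    def __init__(self, vocab_size=30522, embed_dim=64, fused_dim=128, num_classes=5):
        super().__init__()
        # Vision stream: a tiny conv net standing in for the image backbone.
        self.vision = nn.Sequential(
            nn.Conv2d(3, embed_dim, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv2d(embed_dim, embed_dim, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
        )
        # Grid stream (GiT stand-in): embed OCR token ids arranged on a 2D grid,
        # then mix with a conv so each cell sees its neighbours.
        self.token_embed = nn.Embedding(vocab_size, embed_dim)
        self.grid = nn.Conv2d(embed_dim, embed_dim, kernel_size=3, padding=1)
        # Fuse both streams and predict a layout class per grid cell.
        self.head = nn.Sequential(
            nn.Conv2d(2 * embed_dim, fused_dim, kernel_size=1),
            nn.ReLU(),
            nn.Conv2d(fused_dim, num_classes, kernel_size=1),
        )

    def forward(self, image, token_grid):
        # image: (B, 3, H, W); token_grid: (B, H/4, W/4) int64 token ids,
        # i.e. OCR tokens rasterised to the vision feature resolution.
        v = self.vision(image)                                           # (B, E, H/4, W/4)
        g = self.grid(self.token_embed(token_grid).permute(0, 3, 1, 2))  # (B, E, H/4, W/4)
        return self.head(torch.cat([v, g], dim=1))                       # (B, C, H/4, W/4)

model = TwoStreamLayoutModel()
image = torch.randn(2, 3, 64, 64)
tokens = torch.randint(0, 30522, (2, 16, 16))
print(model(image, tokens).shape)  # torch.Size([2, 5, 16, 16])
```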

Effect on LLM usage - VGT can dissect a page into different portions (headers, subheaders, titles, etc.), which can then be OCRed and passed to an LLM for retrieval-augmented generation (RAG); see the sketch below.
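
As a rough sketch of that downstream pipeline (every name here is hypothetical; `ocr_region` stands in for a real OCR call, and the hard-coded region list stands in for the layout model's output):

```python
from dataclasses import dataclass

@dataclass
class Region:
    label: str   # layout class predicted by the model, e.g. "title", "paragraph"
    bbox: tuple  # (x0, y0, x1, y1) in page coordinates
    text: str = ""

def ocr_region(page_image, bbox) -> str:
    """Hypothetical stand-in for a real OCR call (crop the bbox, run a
    text recogniser); returns dummy text here so the sketch is runnable."""
    return f"<text of {bbox}>"

def regions_to_rag_chunks(page_image, regions):
    """Turn layout regions into ordered, labelled chunks for a RAG index.
    Reading order here is a naive top-to-bottom sort; real documents
    (multi-column pages, tables) need a smarter ordering step."""
    ordered = sorted(regions, key=lambda r: (r.bbox[1], r.bbox[0]))
    chunks = []
    for r in ordered:
        r.text = ocr_region(page_image, r.bbox)
        # Keep the layout label as metadata so the retriever can, e.g.,
        # boost titles or skip page headers and footers.
        chunks.append({"label": r.label, "bbox": r.bbox, "text": r.text})
    return chunks

page = None  # placeholder for a page image
regions = [
    Region("title", (50, 40, 550, 90)),
    Region("paragraph", (50, 120, 550, 400)),
]
for chunk in regions_to_rag_chunks(page, regions):
    print(chunk["label"], chunk["text"])
```

Keeping the layout label alongside each chunk is the main point: it lets the RAG layer treat a title differently from body text instead of indexing the page as one undifferentiated blob.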

shuaills commented Feb 8, 2024

I'm working on a similar project and am excited to see that you have already started. I'm curious about your progress, and I'd be glad to help if needed.

gregbugaj (Collaborator, Author) commented

That would be great. I have started looking at Advanced Literate Machinery.

I was not able to obtain the weights to test the model, but it does look very good.
