Implement Vision Grid Transformer for Document Layout Analysis #100

Open
gregbugaj opened this issue Jan 6, 2024 · 2 comments
gregbugaj (Collaborator) commented Jan 6, 2024


AlibabaResearch recently published a new model, VGT, which sets a new benchmark on the task of Document Layout Analysis (DLA).

From the paper's introduction: "To fully leverage multi-modal information and exploit pre-training techniques to learn better representation for DLA, in this paper, we present VGT, a two-stream Vision Grid Transformer, in which Grid Transformer (GiT) is proposed and pre-trained for 2D token-level and segment-level semantic understanding."
https://arxiv.org/abs/2308.14978
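
For a rough sense of the two-stream idea, here is a minimal PyTorch sketch, assuming only what the quote above states: one stream over the page image, one over a 2D grid of OCR token embeddings, fused per location. The module choices, dimensions, and the name `TwoStreamLayoutModel` are placeholders, not the paper's architecture:

```python
import torch
import torch.nn as nn

class TwoStreamLayoutModel(nn.Module):
    """Illustrative two-stream model: a vision stream over the page image and
    a grid stream over a 2D map of token embeddings, fused per grid cell.
    All modules and sizes are stand-ins, not VGT's actual design."""

    def __init__(self, vocab_size=30522, embed_dim=64, fused_dim=128, num_classes=5):
        super().__init__()
        # Vision stream: a tiny conv net standing in for the image backbone.
        self.vision = nn.Sequential(
            nn.Conv2d(3, embed_dim, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv2d(embed_dim, embed_dim, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
        )
        # Grid stream (GiT stand-in): embed OCR token ids arranged on a 2D grid,
        # then mix with a conv so each cell sees its neighbours.
        self.token_embed = nn.Embedding(vocab_size, embed_dim)
        self.grid = nn.Conv2d(embed_dim, embed_dim, kernel_size=3, padding=1)
        # Fuse both streams and predict a layout class per grid cell.
        self.head = nn.Sequential(
            nn.Conv2d(2 * embed_dim, fused_dim, kernel_size=1),
            nn.ReLU(),
            nn.Conv2d(fused_dim, num_classes, kernel_size=1),
        )

    def forward(self, image, token_grid):
        # image: (B, 3, H, W); token_grid: (B, H/4, W/4) int64 token ids,
        # i.e. OCR tokens rasterised to the vision feature resolution.
        v = self.vision(image)                                           # (B, E, H/4, W/4)
        g = self.grid(self.token_embed(token_grid).permute(0, 3, 1, 2))  # (B, E, H/4, W/4)
        return self.head(torch.cat([v, g], dim=1))                       # (B, C, H/4, W/4)

model = TwoStreamLayoutModel()
image = torch.randn(2, 3, 64, 64)
tokens = torch.randint(0, 30522, (2, 16, 16))
print(model(image, tokens).shape)  # torch.Size([2, 5, 16, 16])
```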

Effect on LLM usage - VGT can dissect a page into different portions (headers, subheaders, titles, etc.), which can then be OCRed and passed to an LLM for retrieval-augmented generation (RAG); see the sketch below.
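
As a rough sketch of that downstream pipeline (every name here is hypothetical; `ocr_region` stands in for a real OCR call, and the hard-coded region list stands in for the layout model's output):

```python
from dataclasses import dataclass

@dataclass
class Region:
    label: str   # layout class predicted by the model, e.g. "title", "paragraph"
    bbox: tuple  # (x0, y0, x1, y1) in page coordinates
    text: str = ""

def ocr_region(page_image, bbox) -> str:
    """Hypothetical stand-in for a real OCR call (crop the bbox, run a
    text recogniser); returns dummy text here so the sketch is runnable."""
    return f"<text of {bbox}>"

def regions_to_rag_chunks(page_image, regions):
    """Turn layout regions into ordered, labelled chunks for a RAG index.
    Reading order here is a naive top-to-bottom sort; real documents
    (multi-column pages, tables) need a smarter ordering step."""
    ordered = sorted(regions, key=lambda r: (r.bbox[1], r.bbox[0]))
    chunks = []
    for r in ordered:
        r.text = ocr_region(page_image, r.bbox)
        # Keep the layout label as metadata so the retriever can, e.g.,
        # boost titles or skip page headers and footers.
        chunks.append({"label": r.label, "bbox": r.bbox, "text": r.text})
    return chunks

page = None  # placeholder for a page image
regions = [
    Region("title", (50, 40, 550, 90)),
    Region("paragraph", (50, 120, 550, 400)),
]
for chunk in regions_to_rag_chunks(page, regions):
    print(chunk["label"], chunk["text"])
```

Keeping the layout label alongside each chunk is the main point: it lets the RAG layer treat a title differently from body text instead of indexing the page as one undifferentiated blob.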

shuaills commented Feb 8, 2024

I'm working on a similar project and am excited to see that you have already started. I'm curious about your progress, and I'd be glad to help if needed.

gregbugaj (Collaborator, Author) commented

That would be great. I have started looking at Advanced Literate Machinery.

I was not able to obtain the weights to test the model, but it does look very good.
