Dataset Card for PubLayNet

Dataset Description

Homepage: https://developer.ibm.com/exchanges/data/all/publaynet/
Repository: https://github.com/shunk031/huggingface-datasets_PubLayNet
Paper (Preprint): https://arxiv.org/abs/1908.07836
Paper (ICDAR2019): https://ieeexplore.ieee.org/document/8977963

Dataset Summary

PubLayNet is a dataset for document layout analysis. It contains page images of research papers and articles, together with annotations for the layout elements on each page, such as "text", "list", and "figure". The annotations were generated by automatically matching the XML representations and the content of over 1 million PDF articles that are publicly available on PubMed Central.

Supported Tasks and Leaderboards

[More Information Needed]

Languages

[More Information Needed]

Dataset Structure

Data Instances

import datasets as ds

dataset = ds.load_dataset(
    path="shunk031/PubLayNet",
    decode_rle=True,  # Set to True to decode run-length encoding (RLE) into binary masks.
)
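
The `decode_rle` option above refers to run-length encoding of segmentation masks. The loader's own decoding code is not shown in this card, so the following is only a minimal sketch of the general idea. It assumes the COCO-style uncompressed RLE convention (run lengths alternate between background and foreground, starting with background, and pixels are stored in column-major order). The function name `decode_uncompressed_rle` is hypothetical and is not part of this repository.

```python
def decode_uncompressed_rle(counts, height, width):
    """Decode a list of run lengths into a binary mask (list of rows).

    Assumption: COCO-style uncompressed RLE, i.e. runs alternate
    0, 1, 0, 1, ... starting with background, in column-major order.
    """
    flat = []
    value = 0  # the first run is always background
    for run_length in counts:
        flat.extend([value] * run_length)
        value = 1 - value  # switch between background and foreground

    if len(flat) != height * width:
        raise ValueError("run lengths do not sum to height * width")

    # Column-major layout: pixel (row, col) sits at flat[col * height + row].
    return [
        [flat[col * height + row] for col in range(width)]
        for row in range(height)
    ]


# Hypothetical 3x3 example: 2 background, 3 foreground, 4 background pixels.
mask = decode_uncompressed_rle([2, 3, 4], height=3, width=3)
for row in mask:
    print(row)
```

In practice a loader would more likely delegate this step to an existing library than reimplement it; the sketch is only meant to show what "converted to binary mask" means for a single annotation.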

Data Fields

[More Information Needed]

Data Splits

[More Information Needed]

Dataset Creation

Curation Rationale

[More Information Needed]

Source Data

[More Information Needed]

Initial Data Collection and Normalization

[More Information Needed]

Who are the source language producers?

[More Information Needed]

Annotations

[More Information Needed]

Annotation process

[More Information Needed]

Who are the annotators?

[More Information Needed]

Personal and Sensitive Information

[More Information Needed]

Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

[More Information Needed]

Other Known Limitations

[More Information Needed]

Additional Information

Dataset Curators

[More Information Needed]

Licensing Information

CDLA-Permissive

Citation Information

@inproceedings{zhong2019publaynet,
  title={PubLayNet: largest dataset ever for document layout analysis},
  author={Zhong, Xu and Tang, Jianbin and Yepes, Antonio Jimeno},
  booktitle={2019 International Conference on Document Analysis and Recognition (ICDAR)},
  pages={1015--1022},
  year={2019},
  organization={IEEE}
}

Contributions

Thanks to ibm-aur-nlp/PubLayNet for creating this dataset.
