Annotations on a Budget: Leveraging Geo-Data Similarity to Balance Model Performance and Annotation Cost

This work proposes methods for identifying which data to annotate, so that model performance is balanced against annotation cost.

Vision-language models work poorly on data from underrepresented countries. This is primarily because the same topic (an object or an action, e.g., "toothbrush") can look very different from one country to another. However, collecting diverse global data is very expensive. To make the most of a limited annotation budget, we propose to: (1) annotate the images that are visually different from those in high-resource datasets such as LAION or ImageNet; (2) supplement data from low-resource countries with data from visually similar countries.
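
The two ideas above can be illustrated with a minimal sketch. It assumes each image is already represented by an embedding vector and compares mean embeddings with cosine similarity. This README does not specify the embedding model, the similarity measure, or any threshold used in the paper, so the function names, the `threshold` parameter, and the toy country labels below are all hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def mean_vector(vectors: List[Vector]) -> List[float]:
    """Element-wise mean of a non-empty list of vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def needs_annotation(topic_vectors: List[Vector],
                     reference_vectors: List[Vector],
                     threshold: float) -> bool:
    """Idea (1): flag a (topic, country) group whose images look unlike the
    high-resource reference images. `threshold` is an assumed cut-off."""
    similarity = cosine_similarity(mean_vector(topic_vectors),
                                   mean_vector(reference_vectors))
    return similarity < threshold


def rank_similar_countries(target: str,
                           embeddings: Dict[str, List[Vector]]) -> List[Tuple[str, float]]:
    """Idea (2): for one topic, rank the other countries by how visually
    similar their images are to the target country's images."""
    target_mean = mean_vector(embeddings[target])
    scores = [(country, cosine_similarity(target_mean, mean_vector(vectors)))
              for country, vectors in embeddings.items() if country != target]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy 2-d vectors for a single topic; real image embeddings are much larger.
toy = {"A": [[1.0, 0.0], [0.8, 0.2]], "B": [[0.9, 0.1]], "C": [[0.0, 1.0]]}
print(rank_similar_countries("A", toy))
```

In this toy example, country "B" ranks above "C" as a candidate source of supplementary data for "A".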

We hope our work contributes to building more inclusive and affordable vision-language models and datasets to help democratize AI globally.

For more information, read our LREC-COLING 2024 paper:

Annotations on a Budget: Leveraging Geo-Data Similarity to Balance Model Performance and Annotation Cost

By Oana Ignat, Longju Bai, Joan Nwatu, and Rada Mihalcea.

This repository includes the obtained results.

Obtained Results

The data before and after pre-processing, together with the topic mapping, are in data/data_pre-processing.csv
The removed (topic, country) pairs, i.e., those with fewer than 10 images, are in data/data_removed.csv
The RQ1 answer, i.e., all the (topic, country) pairs that are consistently dissimilar to the high-resource data, is in data/output_RQ1.csv
The RQ2 answer, i.e., all the (topic, country) pairs together with their most similar countries, is in data/output_RQ2.csv
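
As a rough illustration of the filtering rule and of how the result files might be inspected, here is a short sketch. It is not the authors' pre-processing code: `split_sparse_pairs` and `read_rows` are hypothetical helpers, and the column names of the CSV files are not documented in this README, so the reader assumes only that each file starts with a header row.

```python
import csv
from collections import Counter
from typing import Dict, Iterable, List, Tuple

Pair = Tuple[str, str]  # (topic, country)


def split_sparse_pairs(image_pairs: Iterable[Pair],
                       min_images: int = 10) -> Tuple[Dict[Pair, int], Dict[Pair, int]]:
    """Count images per (topic, country) pair and separate the pairs with
    at least `min_images` images from those with fewer."""
    counts = Counter(image_pairs)
    kept = {pair: n for pair, n in counts.items() if n >= min_images}
    removed = {pair: n for pair, n in counts.items() if n < min_images}
    return kept, removed


def read_rows(path: str) -> List[Dict[str, str]]:
    """Read a CSV file into a list of dicts keyed by its header row
    (assumes the file has a header; no column names are assumed)."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


# One entry per image; the country labels here are placeholders.
pairs = [("toothbrush", "X")] * 12 + [("toothbrush", "Y")] * 3
kept, removed = split_sparse_pairs(pairs)
print(kept)
print(removed)
```

To explore a result file, call `read_rows("data/output_RQ2.csv")` from the repository root and look at the keys of the first row to discover the actual column names.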

Citation

@inproceedings{ignat-etal-2024-annotations-budget,
    title = "Annotations on a Budget: Leveraging Geo-Data Similarity to Balance Model Performance and Annotation Cost",
    author = "Ignat, Oana  and
      Bai, Longju  and
      Nwatu, Joan C.  and
      Mihalcea, Rada",
    editor = "Calzolari, Nicoletta  and
      Kan, Min-Yen  and
      Hoste, Veronique  and
      Lenci, Alessandro  and
      Sakti, Sakriani  and
      Xue, Nianwen",
    booktitle = "Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)",
    month = may,
    year = "2024",
    address = "Torino, Italy",
    publisher = "ELRA and ICCL",
    url = "https://aclanthology.org/2024.lrec-main.112",
    pages = "1239--1259",
    abstract = "Current foundation models have shown impressive performance across various tasks. However, several studies have revealed that these models are not effective for everyone due to the imbalanced geographical and economic representation of the data used in the training process. Most of this data comes from Western countries, leading to poor results for underrepresented countries. To address this issue, more data needs to be collected from these countries, but the cost of annotation can be a significant bottleneck. In this paper, we propose methods to identify the data to be annotated to balance model performance and annotation costs. Our approach first involves finding the countries with images of topics (objects and actions) most visually distinct from those already in the training datasets used by current large vision-language foundation models. Next, we identify countries with higher visual similarity for these topics and show that using data from these countries to supplement the training data improves model performance and reduces annotation costs. The resulting lists of countries and corresponding topics are made available at https://github.com/MichiganNLP/visual{\_}diversity{\_}budget.",
}
