GVCCI: Lifelong Learning of Visual Grounding
for Language-Guided Robotic Manipulation

Junghyun Kim,   Gi-Cheon Kang*,   Jaein Kim*,   Suyeon Shin,   Byoung-Tak Zhang

The 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2023)

Overview



Citation

If you use this code or data in your research, please consider citing:

@article{kim2023gvcci,
  title={Gvcci: Lifelong learning of visual grounding for language-guided robotic manipulation},
  author={Kim, Junghyun and Kang, Gi-Cheon and Kim, Jaein and Shin, Suyeon and Zhang, Byoung-Tak},
  journal={arXiv preprint arXiv:2307.05963},
  year={2023}
}

Environment Setup

Python 3.7+, PyTorch v1.9.1+, CUDA 11+ and CuDNN 7+, Anaconda/Miniconda (recommended)

  1. Install Anaconda or Miniconda from here.
  2. Clone this repository and create an environment:
git clone https://www.github.com/JHKim-snu/GVCCI
conda create -n gvcci python=3.8
conda activate gvcci
  3. Install all dependencies:
pip install torch==1.9.1+cu111 torchvision==0.10.1+cu111 torchaudio==0.9.1 -f https://download.pytorch.org/whl/torch_stable.html
pip install -r requirements.txt
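Optionally, you can check that the installation sees your GPU before moving on. This is a quick sanity check we add here, not part of the original setup:

# sanity_check.py -- optional, quick check that PyTorch and CUDA are wired up
import torch

print("torch version:", torch.__version__)          # expected: 1.9.1+cu111
print("CUDA available:", torch.cuda.is_available())
print("GPU count:", torch.cuda.device_count())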

VGPI Dataset

VGPI (Visual Grounding on Pick-and-place Instruction) is a visual grounding dataset collected from two distinct robotic environments, ENV1 and ENV2. While the training set consists only of raw images of the environment, each sample in the test set consists of:

  1. images containing multiple objects
  2. natural language pick-and-place instructions
  3. bbox coordinates of the corresponding objects

We also provide the data generated by GVCCI.

ENV1

| Name | Content | Examples | Size | Link |
|------|---------|----------|------|------|
| ENV1_train.zip | ENV1 training set images | 540 | 57.9 MBytes | Download |
| ENV1_generated_samples.pth | Samples generated by GVCCI | 97,448 | 7 MBytes | Download |
| Test-H.tsv | Test-H | 212 | 28.4 MBytes | Download |
| Test-R.tsv | Test-R | 180 | 22.6 MBytes | Download |

ENV2

| Name | Content | Examples | Size | Link |
|------|---------|----------|------|------|
| ENV2_train.zip | ENV2 training set images | 135 | 47.3 MBytes | Download |
| ENV2_generated_samples.pth | Samples generated by GVCCI | 19,511 | 1.4 MBytes | Download |
| Test-E.zip | Test-E images | 30 | 10.1 MBytes | Download |
| Test-E.pth | Test-E instructions | 68 | 7 KBytes | Download |

Each line in Test-H and Test-R represents a sample with the unique-id, image-id, pick instruction, bbox coordinates, and image base64 string separated by tabs, as shown below.

103	0034	pick the bottle on the right side of the yellow can	175.64337349397593,45.79014989293362,195.3831325301205,85.2591006423983	iVBORw0KGgoAAAANSUhE...
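For reference, here is a minimal sketch of reading such a line in Python; the field order follows the example above, and the file path is an assumption based on the directory layout shown below.

import base64, io
from PIL import Image

with open("data/test/Test-H.tsv") as f:
    for line in f:
        uid, image_id, instruction, bbox, img_b64 = line.rstrip("\n").split("\t")
        x1, y1, x2, y2 = map(float, bbox.split(","))               # bbox coordinates
        image = Image.open(io.BytesIO(base64.b64decode(img_b64)))  # assuming standard base64 encoding
        print(uid, instruction, (x1, y1, x2, y2), image.size)
        break  # inspect only the first sample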

Each element in ENV1_generated_samples.pth, ENV2_generated_samples.pth, and Test-E.pth consists of an image file name (for Test-E.pth, the image is in Test-E.zip), bbox coordinates, and a pick instruction, as shown below.

['0001.png', '',[104.94623655913979,  88.6021505376344,  196.12903225806454,  170.32258064516128], 'pick the green cup in behind', '']
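These .pth files can be loaded directly with torch.load; a minimal sketch, with field positions taken from the example above and a path that assumes the data layout shown below:

import torch

samples = torch.load("data/test/Test-E.pth")     # a list of samples
img_name, _, bbox, instruction, _ = samples[0]   # empty fields as in the example above
print(img_name, bbox, instruction)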

Place the data in the ./data folder. We expect the data to follow the directory structure below:

├── data         
│   ├── train       
│   │   ├── ENV1_train
│   │   │   ├── 0000.png      
│   │   │   └── ...      
│   │   ├── ENV2_train   
│   │   │   ├── 0000.png      
│   │   │   └── ...      
│   ├── test  
│   │   ├── Test-H.tsv  
│   │   ├── Test-R.tsv  
│   │   ├── Test-E.pth  
│   │   ├── Test-E  
│   │   │   ├── 0000.png
│   │   │   └── ...      
└── 

Visual Feature Extraction

Once you receive images from anywhere (robot, web, etc.), you first need to extract the visual features of the objects in them (category, attribute, location) in order to generate instructions. For visual feature extraction, we leverage the pretrained classifiers and object detector from Faster R-CNN and Bottom-Up Attention. The code is adapted from this repository.

We strongly recommend using a separate environment for visual feature extraction. Please follow the Prerequisites here.

Extract the visual features with the following script:

cd visual_feature_extraction
python make_image_list.py
OMP_NUM_THREADS=4 CUDA_VISIBLE_DEVICES=0,1,2,3 python extract.py --load_dir ./output_caffe152/ --image_dir ../data/train/ENV1_train/ --out_path ../instruction_generation/data/detection_results/ENV1/r152_attr_detection_results --image_list_file ./ENV1_train_train_imagelist_split0.txt --vg_dataset ENV1_train --cuda --split_ind 0

The extracted visual features will be saved as follows:

├── instruction_generation        
│   ├── data        
│   │   ├── detected_results
│   │   │   ├── ENV1_train   
│   │   │   │   ├── r101_object_detection_results
│   │   │   │   │   ├── ENV1_train_train_pseudo_split0_detection_results.pth
│   │   │   │   ├── r152_attr_detection_results      
│   │   │   │   │   ├── ENV1_train_train_pseudo_split0_attr_detection_results.pth

The result is a dictionary whose keys are image file names and whose values are lists of each object's features.
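A minimal sketch of inspecting such a result file (the exact per-object feature layout may differ from this illustration):

import torch

# replace with the .pth file produced by the extraction command above
results = torch.load("ENV1_train_train_pseudo_split0_attr_detection_results.pth")
image_name, objects = next(iter(results.items()))  # image file name -> list of object features
print(image_name, len(objects))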


Instruction Generation

Now you are ready to generate instructions based on the extracted features. Instructions can be generated with the following script:

cd ../instruction_generation
bash scripts/generate_pseudo_data.sh

The generated data will be saved in .pth format as a list of samples. Each sample is a list that consists of:

  1. image file name
  2. object location
  3. instruction

You can visualize the generated samples through visualize_samples.ipynb.
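If you prefer not to use the notebook, here is a minimal sketch that draws one generated sample with Pillow; the field positions follow the earlier five-element example, and the paths are assumptions based on the steps above.

import torch
from PIL import Image, ImageDraw

samples = torch.load("data/pseudo_samples/ENV1_train/ENV1_train.pth")
sample = samples[0]
# adjust the indices if your file uses the three-field (name, bbox, instruction) layout
img_name, bbox, instruction = sample[0], sample[2], sample[3]
image = Image.open(f"../data/train/ENV1_train/{img_name}")
ImageDraw.Draw(image).rectangle(bbox, outline="red", width=3)
image.save("generated_sample_vis.png")
print(instruction)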

Here is an example of the visualization.




Visual Grounding

Since you now have generated triplets of image, location, and instruction, you can train any visual grounding model you want. Here, we provide sample training and evaluation code for OFA. The source code is from the OFA GitHub repository.

To train the OFA model, you first need to convert the .pth file into .tsv format:

cd ../visual_grounding/OFA
python pth2tsv.py --pathfrom ../../instruction_generation/data/pseudo_samples/ENV1_train/ENV1_train.pth --pathto ../../data/train --name ENV1_train
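For reference, the conversion essentially writes one tab-separated line per sample in the format described in the dataset section (unique-id, image-id, instruction, bbox, base64 image). The sketch below illustrates that idea only; it is not the repo's pth2tsv.py.

import base64
import torch

samples = torch.load("../../instruction_generation/data/pseudo_samples/ENV1_train/ENV1_train.pth")
with open("../../data/train/ENV1_train.tsv", "w") as out:
    for uid, sample in enumerate(samples):
        img_name, bbox, instruction = sample[0], sample[2], sample[3]
        with open(f"../../data/train/ENV1_train/{img_name}", "rb") as img_f:
            img_b64 = base64.b64encode(img_f.read()).decode()
        bbox_str = ",".join(str(c) for c in bbox)
        out.write(f"{uid}\t{img_name}\t{instruction}\t{bbox_str}\t{img_b64}\n")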

From here, you can either follow the original OFA github repository or follow the instructions below.

To train on top of the pretrained OFA model, download the checkpoint provided here (the "Finetuned checkpoint for REFCOCO" file) and place it in GVCCI/data/OFA_checkpoints/. Then, the following script will train on top of the pretrained OFA model:

cd run_scripts
nohup sh train_refcoco.sh

Evaluate the trained model:

cd ..
python evaluation.py --modelpath YOUR_MODEL_PATH_HERE
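Visual grounding accuracy is commonly reported as the fraction of predictions whose IoU with the ground-truth box is at least 0.5. The sketch below shows that check; it is an illustration, not necessarily the exact metric evaluation.py computes.

def iou(box_a, box_b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-8)

pred_box, gt_box = [10.0, 10.0, 50.0, 50.0], [12.0, 8.0, 48.0, 52.0]  # example boxes
print("hit at IoU 0.5:", iou(pred_box, gt_box) >= 0.5)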

You can also visualize the model's output with visualize.ipynb.

The pre-trained checkpoints of GVCCI can be found below.

GVCCI checkpoints

| ENV1(8) | ENV1(33) | ENV1(135) | ENV1(540) | ENV2(8) | ENV2(33) | ENV2(135) |
|---------|----------|-----------|-----------|---------|----------|-----------|
| Download | Download | Download | Download | Download | Download | Download |

Language-Guided Robotic Manipulation

A robot arm is required for language-guided robotic manipulation; the robot we used is a Kinova Gen3. Run the following code on your remote server:

python LGRM_server.py

The code for the robot will be provided soon.


Experimental Results

Offline Experiments (Localization)



Online Experiments (LGRM)



Acknowledgements

This repo is built on Bottom-Up Attention, Pseudo-Q, OFA, and MDETR.
