GVCCI: Lifelong Learning of Visual Grounding
for Language-Guided Robotic Manipulation

Junghyun Kim,   Gi-Cheon Kang*,   Jaein Kim*,   Suyeon Shin,   Byoung-Tak Zhang

The 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2023)

Overview



Citation

If you use this code or data in your research, please consider citing:

@article{kim2023gvcci,
  title={Gvcci: Lifelong learning of visual grounding for language-guided robotic manipulation},
  author={Kim, Junghyun and Kang, Gi-Cheon and Kim, Jaein and Shin, Suyeon and Zhang, Byoung-Tak},
  journal={arXiv preprint arXiv:2307.05963},
  year={2023}
}

Environment Setup

Python 3.7+, PyTorch v1.9.1+, CUDA 11+ and CuDNN 7+, Anaconda/Miniconda (recommended)

  1. Install Anaconda or Miniconda from here.
  2. Clone this repository and create an environment:
git clone https://www.github.com/JHKim-snu/GVCCI
conda create -n gvcci python=3.8
conda activate gvcci
  3. Install all dependencies:
pip install torch==1.9.1+cu111 torchvision==0.10.1+cu111 torchaudio==0.9.1 -f https://download.pytorch.org/whl/torch_stable.html
pip install -r requirements.txt
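Optionally, you can check that the installation sees your GPU before moving on. This is a quick sanity check we add here, not part of the original setup:

# sanity_check.py -- optional, quick check that PyTorch and CUDA are wired up
import torch

print("torch version:", torch.__version__)          # expected: 1.9.1+cu111
print("CUDA available:", torch.cuda.is_available())
print("GPU count:", torch.cuda.device_count())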

VGPI Dataset

VGPI (Visual Grounding on Pick-and-place Instruction) is a visual grounding dataset collected from two distinct robotic environments, ENV1 and ENV2. While the training set consists only of raw images of the environment, each sample in the test set consists of:

  1. images containing multiple objects
  2. natural language pick-and-place instructions
  3. bbox coordinates of the corresponding objects

We also provide the data generated by GVCCI.

ENV1

| Name | Content | Examples | Size | Link |
|------|---------|----------|------|------|
| ENV1_train.zip | ENV1 training set images | 540 | 57.9 MBytes | Download |
| ENV1_generated_samples.pth | Samples generated by GVCCI | 97,448 | 7 MBytes | Download |
| Test-H.tsv | Test-H | 212 | 28.4 MBytes | Download |
| Test-R.tsv | Test-R | 180 | 22.6 MBytes | Download |

ENV2

| Name | Content | Examples | Size | Link |
|------|---------|----------|------|------|
| ENV2_train.zip | ENV2 training set images | 135 | 47.3 MBytes | Download |
| ENV2_generated_samples.pth | Samples generated by GVCCI | 19,511 | 1.4 MBytes | Download |
| Test-E.zip | Test-E images | 30 | 10.1 MBytes | Download |
| Test-E.pth | Test-E instructions | 68 | 7 KBytes | Download |

Each line in Test-H and Test-R represents a sample with the unique-id, image-id, pick instruction, bbox coordinates, and image base64 string separated by tabs, as shown below.

103	0034	pick the bottle on the right side of the yellow can	175.64337349397593,45.79014989293362,195.3831325301205,85.2591006423983	iVBORw0KGgoAAAANSUhE...
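For reference, here is a minimal sketch of reading such a line in Python; the field order follows the example above, and the file path is an assumption based on the directory layout shown below.

import base64, io
from PIL import Image

with open("data/test/Test-H.tsv") as f:
    for line in f:
        uid, image_id, instruction, bbox, img_b64 = line.rstrip("\n").split("\t")
        x1, y1, x2, y2 = map(float, bbox.split(","))               # bbox coordinates
        image = Image.open(io.BytesIO(base64.b64decode(img_b64)))  # assuming standard base64 encoding
        print(uid, instruction, (x1, y1, x2, y2), image.size)
        break  # inspect only the first sample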

Each element in ENV1_generated_samples.pth, ENV2_generated_samples.pth, and Test-E.pth consists of an image file name (for Test-E.pth, the image is in Test-E.zip), bbox coordinates, and a pick instruction, as shown below.

['0001.png', '',[104.94623655913979,  88.6021505376344,  196.12903225806454,  170.32258064516128], 'pick the green cup in behind', '']
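These .pth files can be loaded directly with torch.load; a minimal sketch, with field positions taken from the example above and a path that assumes the data layout shown below:

import torch

samples = torch.load("data/test/Test-E.pth")     # a list of samples
img_name, _, bbox, instruction, _ = samples[0]   # empty fields as in the example above
print(img_name, bbox, instruction)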

Place the data in the ./data folder. We expect the data to follow the directory structure below:

├── data         
│   ├── train       
│   │   ├── ENV1_train
│   │   │   ├── 0000.png      
│   │   │   └── ...      
│   │   ├── ENV2_train   
│   │   │   ├── 0000.png      
│   │   │   └── ...      
│   ├── test  
│   │   ├── Test-H.tsv  
│   │   ├── Test-R.tsv  
│   │   ├── Test-E.pth  
│   │   ├── Test-E  
│   │   │   ├── 0000.png
│   │   │   └── ...      
└── 

Visual Feature Extraction

Once you receive images from anywhere (robot, web, etc.), you first need to extract the visual features of the objects in them (category, attribute, location) in order to generate instructions. For visual feature extraction, we leverage the pretrained classifiers and object detector from Faster R-CNN and Bottom-Up Attention. The code is adapted from this repository.

We strongly recommend using a separate environment for visual feature extraction. Please follow the Prerequisites here.

Extract the visual features with the following script:

cd visual_feature_extraction
python make_image_list.py
OMP_NUM_THREADS=4 CUDA_VISIBLE_DEVICES=0,1,2,3 python extract.py --load_dir ./output_caffe152/ --image_dir ../data/train/ENV1_train/ --out_path ../instruction_generation/data/detection_results/ENV1/r152_attr_detection_results --image_list_file ./ENV1_train_train_imagelist_split0.txt --vg_dataset ENV1_train --cuda --split_ind 0

The extracted visual features will be saved as follows:

├── instruction_generation        
│   ├── data        
│   │   ├── detected_results
│   │   │   ├── ENV1_train   
│   │   │   │   ├── r101_object_detection_results
│   │   │   │   │   ├── ENV1_train_train_pseudo_split0_detection_results.pth
│   │   │   │   ├── r152_attr_detection_results      
│   │   │   │   │   ├── ENV1_train_train_pseudo_split0_attr_detection_results.pth

The result is a dictionary whose keys are image file names and whose values are lists of each object's features.
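A minimal sketch of inspecting such a result file (the exact per-object feature layout may differ from this illustration):

import torch

# replace with the .pth file produced by the extraction command above
results = torch.load("ENV1_train_train_pseudo_split0_attr_detection_results.pth")
image_name, objects = next(iter(results.items()))  # image file name -> list of object features
print(image_name, len(objects))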


Instruction Generation

Now you are ready to generate instructions based on the extracted features. Instructions can be generated with the following script:

cd ../instruction_generation
bash scripts/generate_pseudo_data.sh

The generated data will be saved in .pth format as a list of samples. Each sample is a list that consists of:

  1. image file name
  2. object location
  3. instruction

You can visualize the generated samples through visualize_samples.ipynb.
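If you prefer not to use the notebook, here is a minimal sketch that draws one generated sample with Pillow; the field positions follow the earlier five-element example, and the paths are assumptions based on the steps above.

import torch
from PIL import Image, ImageDraw

samples = torch.load("data/pseudo_samples/ENV1_train/ENV1_train.pth")
sample = samples[0]
# adjust the indices if your file uses the three-field (name, bbox, instruction) layout
img_name, bbox, instruction = sample[0], sample[2], sample[3]
image = Image.open(f"../data/train/ENV1_train/{img_name}")
ImageDraw.Draw(image).rectangle(bbox, outline="red", width=3)
image.save("generated_sample_vis.png")
print(instruction)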

Here is an example of the visualization.




Visual Grounding

Since you now have generated triplets of image, location, and instruction, you can train any visual grounding model you want. Here, we provide sample training and evaluation code for OFA. The source code is from the OFA GitHub repository.

To train the OFA model, you first need to convert the .pth file into .tsv format:

cd ../visual_grounding/OFA
python pth2tsv.py --pathfrom ../../instruction_generation/data/pseudo_samples/ENV1_train/ENV1_train.pth --pathto ../../data/train --name ENV1_train
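For reference, the conversion essentially writes one tab-separated line per sample in the format described in the dataset section (unique-id, image-id, instruction, bbox, base64 image). The sketch below illustrates that idea only; it is not the repo's pth2tsv.py.

import base64
import torch

samples = torch.load("../../instruction_generation/data/pseudo_samples/ENV1_train/ENV1_train.pth")
with open("../../data/train/ENV1_train.tsv", "w") as out:
    for uid, sample in enumerate(samples):
        img_name, bbox, instruction = sample[0], sample[2], sample[3]
        with open(f"../../data/train/ENV1_train/{img_name}", "rb") as img_f:
            img_b64 = base64.b64encode(img_f.read()).decode()
        bbox_str = ",".join(str(c) for c in bbox)
        out.write(f"{uid}\t{img_name}\t{instruction}\t{bbox_str}\t{img_b64}\n")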

From here, you can either follow the original OFA github repository or follow the instructions below.

To train on top of the pretrained OFA model, download the checkpoint provided here (the "Finetuned checkpoint for REFCOCO" file) and place it in GVCCI/data/OFA_checkpoints/. Then, the following script will train on top of the pretrained OFA model:

cd run_scripts
nohup sh train_refcoco.sh

Evaluate the trained model:

cd ..
python evaluation.py --modelpath YOUR_MODEL_PATH_HERE
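Visual grounding accuracy is commonly reported as the fraction of predictions whose IoU with the ground-truth box is at least 0.5. The sketch below shows that check; it is an illustration, not necessarily the exact metric evaluation.py computes.

def iou(box_a, box_b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-8)

pred_box, gt_box = [10.0, 10.0, 50.0, 50.0], [12.0, 8.0, 48.0, 52.0]  # example boxes
print("hit at IoU 0.5:", iou(pred_box, gt_box) >= 0.5)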

You can also visualize the model's output with visualize.ipynb.

The pre-trained checkpoints of GVCCI can be found below.

GVCCI checkpoints

| ENV1(8) | ENV1(33) | ENV1(135) | ENV1(540) | ENV2(8) | ENV2(33) | ENV2(135) |
|---------|----------|-----------|-----------|---------|----------|-----------|
| Download | Download | Download | Download | Download | Download | Download |

Language-Guided Robotic Manipulation

A robot arm is required for language-guided robotic manipulation; the robot we used is a Kinova Gen3. Run the following code on your remote server:

python LGRM_server.py

The code for the robot will be provided soon.


Experimental Results

Offline Experiments (Localization)



Online Experiments (LGRM)



Acknowledgements

This repo is built on Bottom-Up Attention, Pseudo-Q, OFA, and MDETR.
