
Under review. [IROS 2024] PGA: Personalizing Grasping Agents with Single Human-Robot Interaction


Overview





Citation

If you use this code or data in your research, please consider citing:

@article{kim2023pga,
  title={PGA: Personalizing Grasping Agents with Single Human-Robot Interaction},
  author={Kim, Junghyun and Kang, Gi-Cheon and Kim, Jaein and Yang, Seoyun and Jung, Minjoon and Zhang, Byoung-Tak},
  journal={arXiv preprint arXiv:2310.12547},
  year={2023}
}


Table of Contents

  • Environment Setup
  • GraspMine Dataset
  • Reminiscence Construction
  • Object Information Acquisition
  • Propagation through Reminiscence
  • Personalized Object Grounding Model
  • Personalized Object Grasping
  • Experimental Results
  • Acknowledgements

Environment Setup

Prerequisites: Python 3.7+, PyTorch 1.9.1+, CUDA 11+ with cuDNN 7+, and Anaconda/Miniconda (recommended)

  1. Install Anaconda or Miniconda from here.
  2. Clone this repository and create an environment:
git clone https://www.github.com/JHKim-snu/PGA
conda create -n pga python=3.8
conda activate pga
  3. Install all dependencies:
pip install torch==1.9.1+cu111 torchvision==0.10.1+cu111 torchaudio==0.9.1 -f https://download.pytorch.org/whl/torch_stable.html
pip install -r requirements.txt
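
After installing the dependencies, you can optionally confirm that PyTorch and CUDA are visible with a quick sanity check (not part of the official setup):

python -c "import torch; print(torch.__version__, torch.cuda.is_available())"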


GraspMine Dataset

GraspMine is an LCRG (Language-Conditioned Robotic Grasping) dataset collected to validate the grasping agent's personalization capability. GraspMine aims to locate and grasp personal objects given a personal indicator, e.g., "my sleeping pills." GraspMine is built upon 96 personal objects and more than 100 everyday objects.


Training Set

Each sample in the training set includes:

  1. An image containing a personal object.
  2. A natural language description.
| Name | Content | Examples | Size | Link |
| --- | --- | --- | --- | --- |
| HRI.zip | Images from human-robot interaction | 96 | 37.4 MB | Download |
| HRI.json | Personal object descriptions (annotations). Keys are the image_ids in HRI.zip, and values consist of [{general indicator}, {personal indicator}] | 96 | 8 KB | Download |
| HRI.tsv | Preprocessed data for HRI. Each row consists of an image, a personal indicator, and the location of the object | 96 | 50.3 MB | Download |

Each element in HRI.json is as shown below.

"0.png": ["White bottle in front","my sleeping pills"]

Each line in HRI.tsv consists of a unique_id, an image_id (do not use this), a personal indicator, bounding box coordinates, and the image encoded as a base64 string, as shown below.

0	38.png	the flowers for my bedroom	252.41,314.63,351.07,418.89	iVBORw0KGgoAAA....
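
For reference, the sketch below shows one way such a line could be parsed; it assumes Pillow is installed and that the last field is a base64-encoded image, following the field order described above (this is not an official loader).

import base64
import io
import json

from PIL import Image

# HRI.json: {"0.png": ["White bottle in front", "my sleeping pills"], ...}
with open("HRI.json") as f:
    hri = json.load(f)

# HRI.tsv: unique_id, image_id, personal indicator, bbox, base64-encoded image.
with open("HRI.tsv") as f:
    for line in f:
        unique_id, image_id, indicator, bbox, img_str = line.rstrip("\n").split("\t")
        x1, y1, x2, y2 = map(float, bbox.split(","))
        image = Image.open(io.BytesIO(base64.b64decode(img_str)))
        print(unique_id, indicator, (x1, y1, x2, y2), image.size)
        break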

Reminiscence

The Reminiscence consists of 400 raw images of the environment. These raw images can be used during the learning process, but their annotations CANNOT be used in GraspMine.

| Name | Content | Examples | Size | Link |
| --- | --- | --- | --- | --- |
| Reminiscence.zip | Unlabeled images of the Reminiscence | 400 | 129.4 MB | Download |
| Reminiscence_nodes.zip | Cropped object images of the Reminiscence. Every object detected by the object detector is saved as a cropped image | 8270 | 61 MB | Download |
| R_object_features.json | Visual features of the cropped images, extracted with DINO | 8270 | 124 MB | Download |
| Reminiscence_annotations.xlsx | Annotations of the Reminiscence nodes. Each personal indicator is annotated with the {image_id}_{object_id} used in Reminiscence_nodes.zip | 8270 | 4.4 MB | Download |
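
For reference, R_object_features.json could be loaded as sketched below; the assumed key format ({image_id}_{object_id}, mirroring the annotation description above) and value layout are assumptions, not a documented schema.

import json

import numpy as np

# Assumed structure: {"{image_id}_{object_id}": [512 feature values], ...}
with open("R_object_features.json") as f:
    features = {k: np.asarray(v, dtype=np.float32) for k, v in json.load(f).items()}

print(f"loaded DINO features for {len(features)} cropped objects")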

Test Set

Each sample in the test set includes:

  1. Images containing multiple objects.
  2. A natural language personal indicator.
  3. Associated object coordinates.
| Name | Content | Examples | Size | Link | Description |
| --- | --- | --- | --- | --- | --- |
| heterogeneous.zip | Images of the Heterogeneous split | 60 | 19.1 MB | Download | Scenes with randomly selected objects |
| homogeneous.zip | Images of the Homogeneous split | 60 | 18.6 MB | Download | Scenes with similar-looking objects of the same category |
| cluttered.zip | Images of the Cluttered split | 106 | 36.6 MB | Download | Highly cluttered scenes, sourced from the IM-Dial dataset |
| heterogeneous.pth | Annotations for Heterogeneous images | 120 | 12 KB | Download | |
| homogeneous.pth | Annotations for Homogeneous images | 120 | 12 KB | Download | |
| cluttered.pth | Annotations for Cluttered images | 106 | 32 KB | Download | |
| paraphrased.pth | Paraphrased annotations for all splits | 346 | 49 KB | Download | Each personal indicator paraphrased by annotators |

The format of each entry in heterogeneous.pth, homogeneous.pth, cluttered.pth, and paraphrased.pth will be documented here soon. In the meantime, you can inspect the files yourself by downloading them from the links above.
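
Until then, a quick way to inspect these files is to load them with torch.load, as sketched below; nothing about their internal layout is assumed here.

import torch

# Inspect the raw annotation object stored in one of the split files.
annotations = torch.load("heterogeneous.pth", map_location="cpu")
print(type(annotations))
if hasattr(annotations, "__len__"):
    print(len(annotations), "entries")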


Reminiscence Construction

For Reminiscence Construction, we leverage the pretrained classifiers and object detector from Bottom-Up Attention to detect every object in the scene. The code originates from this repository and has been modified for our use.

We strongly recommend using a separate environment for visual feature extraction. Please follow the Prerequisites here.

You can detect the objects on your own, but we also provide the detection results (i.e., cropped object images) in Reminiscence_nodes.zip.



Object Information Acquisition

You need a physical robot to run this part.

To initiate an interaction with the robot, position the personal object in front of it and provide both the general and personal indicators, e.g., "the object in front is my sleeping pills". Using the general indicator (the object placed in front), our system employs GVCCI to locate the object. This process automatically records crucial labels from the interaction, including:

  • an initial image
  • images of the robot-object interaction
  • the personal indicator
  • the object bounding box coordinates

Download the GVCCI model, ENV2(135).

OMP_NUM_THREADS=4 CUDA_VISIBLE_DEVICES=0,1,2,3 python OIA_interaction.py

Upon completion, the following data is automatically saved:

  1. Image Files: A set of {object_num}_{interaction_num}.png files is generated, uniquely capturing each interaction.

  2. Dictionary: A dictionary is created, with keys of the form {object_num} and values that are lists of (general_indicator, personal_indicator) pairs.

OMP_NUM_THREADS=4 CUDA_VISIBLE_DEVICES=0,1,2,3 python OIA_postprocess.py --gvcci_path YOUR_GVCCI_PATH --save_path PATH_TO_SAVE --hri_path YOUR_HRI.json_PATH --cropped_img_path PATH_TO_SAVE_IMGS --raw_img_path YOUR_HRI.zip_PATH --xlsx_path YOUR_Reminiscence_annotations.xlsx.PATH

If you do not have a robot to perform the interaction, the results can alternatively be downloaded from here; the file contains the following information.

img_id: [personal indicator, bounding box coordinates]
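
Assuming the downloaded results are stored as a JSON dictionary in this form, they could be read as sketched below; the file name OIA_results.json and the exact value layout are illustrative assumptions.

import json

# Assumed structure: {"0.png": ["my sleeping pills", [x1, y1, x2, y2]], ...}
with open("OIA_results.json") as f:
    oia = json.load(f)

for img_id, (indicator, bbox) in oia.items():
    print(img_id, indicator, bbox)
    break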


Propagation through Reminiscence

Utilizing the information obtained from Object Information Acquisition, unlabelled images from the Reminiscence dataset are pseudo-labeled using the Propagation through Reminiscence. To execute this, run the following script:

CUDA_VISIBLE_DEVICES=0 python label_propagation.py --model 'vanilla' --thresh 0.55 --iter 3 --save_nodes True --sample_n 400 --ignore_interaction True --seed 777
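
To illustrate the idea behind this step (and not the repository's exact implementation), the sketch below propagates labels from labeled object nodes to unlabeled ones whose DINO features are sufficiently similar, repeating for a few iterations; the node fields follow the table below, and everything else is illustrative.

import numpy as np

def propagate(nodes, thresh=0.55, iters=3):
    """Illustrative threshold-based label propagation over object nodes.

    Each node is a dict with 'visual feature' (np.ndarray), 'label', and
    'labelled' (bool), mirroring the fields listed in the table below.
    """
    feats = np.stack([n["visual feature"] for n in nodes])
    feats = feats / np.linalg.norm(feats, axis=1, keepdims=True)  # unit-normalize
    for _ in range(iters):
        for i, src in enumerate(nodes):
            if not src["labelled"]:
                continue
            sims = feats @ feats[i]  # cosine similarity to the labeled node
            for j, tgt in enumerate(nodes):
                if not tgt["labelled"] and sims[j] > thresh:
                    tgt["label"], tgt["labelled"] = src["label"], True  # pseudo-label
    return nodes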

A .pth file will be saved containing a list in which each element represents an object node. Each object node is a dictionary with the following fields:

| Item | Content |
| --- | --- |
| visual feature | 512-dimensional feature vector extracted with DINO |
| category | category of the object |
| label | personal indicator |
| img_id | Reminiscence image id |
| obj_id | object id |
| known | whether the node comes from OIA (True or False) |
| labelled | whether the node is labeled, including pseudo-labels (True or False) |
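
The saved node file can then be inspected as sketched below (assuming it is a plain list of such dictionaries saved with torch.save; the file name is illustrative).

import torch

# Load the propagated object nodes and count (pseudo-)labeled ones.
nodes = torch.load("propagated_nodes.pth", map_location="cpu")
n_labelled = sum(1 for n in nodes if n["labelled"])
print(f"{n_labelled}/{len(nodes)} nodes carry a label or pseudo-label")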


Personalized Object Grounding Model

Our Personalized Object Grounding Model is based on OFA, a state-of-the-art vision-and-language foundation model.

Process Data

You first need to post-process the training and test data for the grounding model. Running the following scripts produces datasets in .tsv format.

python postprocess_all.py
python postprocess_size.py

Training

With the processed data, where each sample comprises an image, a personal indicator, and object coordinates, you can train the grounding model with the following script:

cd run_scripts
nohup sh train.sh

The pre-trained checkpoints of PGA can be found below.

Baseline checkpoints

| OFA | GVCCI | Direct | PassivePGA | PGA | Supervised |
| --- | --- | --- | --- | --- | --- |
| Download | Download | Download | Download | Download | Download |

PGA checkpoints

| 0 | 25 | 100 | 400 |
| --- | --- | --- | --- |
| Download | Download | Download | Download |

Visualization

If you have the pretrained grounding model, you can visualize the prediction results with the following script:

python visualization.py
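
For a quick standalone check, independent of visualization.py, a predicted box can also be drawn with Pillow as sketched below; the function and file names are illustrative.

from PIL import Image, ImageDraw

def draw_prediction(image_path, box, out_path="prediction.png"):
    """Draw a predicted (x1, y1, x2, y2) box on the image and save it."""
    image = Image.open(image_path).convert("RGB")
    ImageDraw.Draw(image).rectangle(box, outline="red", width=3)
    image.save(out_path)

# Example: draw_prediction("38.png", (252, 315, 351, 419))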

Evaluation

You can evaluate your model on the test sets with the following script:

python evaluation.py

For a demonstration (visualization of the inference results), run the following script:

python evaluation.py --demo
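
Grounding accuracy in this line of work is typically the fraction of predictions whose IoU with the ground-truth box exceeds 0.5; a minimal IoU helper is sketched below (the exact criterion used in evaluation.py may differ).

def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

# A prediction is typically counted as correct when iou(pred, gt) > 0.5.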


Personalized Object Grasping

If you want to reproduce the images of the test set and try your own model, run the following script:

python online_experiment_server.py

The corresponding code for the robot, vg_client.py, is provided alongside.

If you just want to run a demonstration with an image received from the robot and your own query, run:

python online_experiment_server.py --demo
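
The actual protocol between online_experiment_server.py and vg_client.py is not documented here; purely as an illustration, the sketch below shows one simple way a robot-side client could send a length-prefixed image to a grounding server over TCP. All names, the port, and the message format are assumptions.

import socket
import struct

def send_image(image_bytes, host="127.0.0.1", port=9999):
    """Send a length-prefixed image payload and return the server's raw reply (illustrative only)."""
    with socket.create_connection((host, port)) as sock:
        sock.sendall(struct.pack(">I", len(image_bytes)) + image_bytes)
        reply = sock.recv(4096)  # e.g., predicted box coordinates, depending on the server
    return reply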


Experimental Results

We assessed the Personalized Grasping Agent (PGA) on our proposed dataset, GraspMine, benchmarking it against various baselines. The offline experiment measured PGA's efficacy in Personalized Object Grounding, i.e., how well PGA identifies an object given its natural language indicator. Meanwhile, the online experiment probed its real-world performance in personalized Language-Conditioned Robotic Grasping (LCRG) using a robot arm.

Offline Experiments (Localization)



Online Experiments (Real-world)



Please refer to our paper for a more detailed explanation.

Acknowledgements

This repo is built upon OFA, a vision-and-language foundation model. Thank you.