
CFC

Code and created datasets for our ACL 2022 paper: Contextual Fine-to-Coarse Distillation for Coarse-grained Response Selection in Open-Domain Conversations.

News

  • Reddit and Twitter datasets released! (2022.06.20)

TODO

  • The code will be released in the near future.

Dataset

We created two datasets for contextual matching in the paper, Reddit and Twitter. Their statistics are shown in the first table below; the second table describes the released files.

| Dataset | Train set | Dev set | Test set (MC) | Test set (SC) | Database |
|---------|-----------|---------|---------------|---------------|----------|
| Reddit  | 300K      | 20K     | 20K           | 20K           | 10M      |
| Twitter | 20K       | 2K      | 2K            | -             | 1M       |
| File | Role | Explanation |
|------|------|-------------|
| database.json | Database | Each instance contains three fields: `ctx` (the context), `rsp` (the response), and `rid` (the ID of the response). |
| train.json | Train set | Each instance contains a response and the list of contexts corresponding to that response. |
| dev.json | Dev set | Same format as the training set. |
| test_mc.json | MC test set | Same format as database.json. Each response in the MC test set has multiple contexts, which ensures that other contexts in the database also correspond to the response. |
| test_sc.json | SC test set | Same format as database.json. Each response in the SC test set has only one context, i.e., no context in the database corresponds to the response. |
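
As a quick sanity check on the format described above, the snippet below loads database.json and train.json and prints one instance. It is a minimal sketch that assumes each file is a single JSON array of objects with the fields listed in the table (`ctx`, `rsp`, `rid`); if the files are stored as JSON Lines instead, replace `json.load` with per-line `json.loads` calls.

```python
import json

# Load the response database: assumed to be a JSON array of
# {"ctx": ..., "rsp": ..., "rid": ...} objects, per the table above.
with open("database.json", encoding="utf-8") as f:
    database = json.load(f)

print(len(database), "context-response pairs in the database")
sample = database[0]
print("rid:", sample["rid"])
print("ctx:", sample["ctx"])
print("rsp:", sample["rsp"])

# The training set pairs each response with a list of contexts.
with open("train.json", encoding="utf-8") as f:
    train = json.load(f)
print(len(train), "training instances")
```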

Build Dataset

In consideration of copyright issues, we provide scripts to build the data rather than distributing it directly. Run the commands below in order; a hypothetical sketch of the MC/SC split follows the list.

To download the raw Reddit dataset train.tsv, follow https://github.com/microsoft/DialoGPT and run python demo.py --data full. The raw Twitter dataset is available at https://github.com/Marsan-Ma-zz/chat_corpus.

  • build context-response pairs
python build_data.py
  • build training set
python build_trainset.py
  • build test set
python build_testset.py
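
For intuition about the MC/SC distinction, here is a minimal, hypothetical sketch of how such a split could be derived from context-response pairs; it is an illustration of the definitions above, not the actual logic of build_testset.py. The input file name pairs.json is an assumption standing in for the output of build_data.py.

```python
import json
import random
from collections import defaultdict

# Group context-response pairs by response ID ("pairs.json" is a
# hypothetical stand-in for the output of build_data.py).
by_rid = defaultdict(list)
with open("pairs.json", encoding="utf-8") as f:
    for pair in json.load(f):
        by_rid[pair["rid"]].append(pair)

random.seed(0)
test_mc, test_sc, database = [], [], []
for rid, pairs in by_rid.items():
    if len(pairs) > 1:
        # MC: hold out one context for testing; the remaining pairs stay
        # in the database, so other contexts for this response still exist.
        random.shuffle(pairs)
        test_mc.append(pairs[0])
        database.extend(pairs[1:])
    else:
        # SC: the response's only context goes to the test set, so no
        # context in the database corresponds to this response.
        test_sc.append(pairs[0])

print(len(test_mc), "MC instances,", len(test_sc), "SC instances")
```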

How to Cite

If you extend or use this work, please cite the paper where it was introduced:

@inproceedings{chen-etal-2022-contextual,
    title = "Contextual Fine-to-Coarse Distillation for Coarse-grained Response Selection in Open-Domain Conversations",
    author = "Chen, Wei and Gong, Yeyun and Xu, Can and Hu, Huang and Yao, Bolun and Wei, Zhongyu and Fan, Zhihao and Hu, Xiaowu and Zhou, Bartuer and Cheng, Biao and Jiang, Daxin and Duan, Nan",
    booktitle = "Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = may,
    year = "2022",
    address = "Dublin, Ireland",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.acl-long.334",
    doi = "10.18653/v1/2022.acl-long.334",
    pages = "4865--4877",
    abstract = "We study the problem of coarse-grained response selection in retrieval-based dialogue systems. The problem is equally important with fine-grained response selection, but is less explored in existing literature. In this paper, we propose a Contextual Fine-to-Coarse (CFC) distilled model for coarse-grained response selection in open-domain conversations. In our CFC model, dense representations of query, candidate contexts and responses is learned based on the multi-tower architecture using contextual matching, and richer knowledge learned from the one-tower architecture (fine-grained) is distilled into the multi-tower architecture (coarse-grained) to enhance the performance of the retriever. To evaluate the performance of the proposed model, we construct two new datasets based on the Reddit comments dump and Twitter corpus. Extensive experimental results on the two datasets show that the proposed method achieves huge improvement over all evaluation metrics compared with traditional baseline methods.",
}
