• Description:

DocNLI is a large-scale dataset for document-level natural language inference (NLI). It is derived from a broad range of NLP problems and covers multiple genres of text. Premises are always document-length, whereas hypotheses range from single sentences to passages of hundreds of words. In contrast to some existing sentence-level NLI datasets, DocNLI contains few artifacts.

| Split        | Examples |
|--------------|----------|
| 'test'       | 267,086  |
| 'train'      | 942,314  |
| 'validation' | 234,258  |
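
As a quick sanity check on the split sizes listed above, the following sketch computes the total number of examples and each split's share of it. The variable names are illustrative only and are not part of any dataset API.

```python
# Split sizes copied from the split table above.
SPLIT_SIZES = {"test": 267_086, "train": 942_314, "validation": 234_258}

# Total number of examples across all three splits.
total = sum(SPLIT_SIZES.values())

# Share of the whole dataset held by each split.
fractions = {name: count / total for name, count in SPLIT_SIZES.items()}

print(f"total examples: {total:,}")
for name, share in fractions.items():
    print(f"{name:<10} {share:.1%}")
```

The training split holds roughly two thirds of the data, with the remainder divided between test and validation.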
  • Feature structure:
    FeaturesDict({
        'hypothesis': Text(shape=(), dtype=string),
        'label': ClassLabel(shape=(), dtype=int64, num_classes=2),
        'premise': Text(shape=(), dtype=string),
    })
  • Feature documentation:
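| Feature    | Class      | Shape | Dtype  | Description |
|------------|------------|-------|--------|-------------|
| hypothesis | Text       |       | string |             |
| label      | ClassLabel |       | int64  |             |
| premise    | Text       |       | string |             |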
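
The feature structure above can be mirrored in code as a small typed record. The sketch below is a minimal illustration, not an official API: `DocNLIExample`, `parse_example`, and `iter_split` are hypothetical names, and `iter_split` assumes the dataset is registered in TensorFlow Datasets under the name `doc_nli`, which this page does not state. The meaning of each label id is not documented here, so the code only checks that the label falls within the two declared classes.

```python
from dataclasses import dataclass
from typing import Any, Iterator, Mapping

NUM_CLASSES = 2  # from ClassLabel(num_classes=2) in the feature structure


@dataclass(frozen=True)
class DocNLIExample:
    """One premise/hypothesis pair with its integer class label."""
    hypothesis: str
    label: int
    premise: str


def _to_text(value: Any) -> str:
    # String features commonly arrive as bytes when read as NumPy values.
    if isinstance(value, bytes):
        return value.decode("utf-8")
    if isinstance(value, str):
        return value
    raise TypeError(f"expected str or bytes, got {type(value).__name__}")


def parse_example(record: Mapping[str, Any]) -> DocNLIExample:
    """Validate a raw record against the three documented features."""
    label = int(record["label"])
    if not 0 <= label < NUM_CLASSES:
        raise ValueError(f"label {label} is outside [0, {NUM_CLASSES})")
    return DocNLIExample(
        hypothesis=_to_text(record["hypothesis"]),
        label=label,
        premise=_to_text(record["premise"]),
    )


def iter_split(split: str = "validation") -> Iterator[DocNLIExample]:
    # Assumption: the dataset name in TensorFlow Datasets is "doc_nli".
    # Imported lazily so the parsing helpers work without TFDS installed.
    import tensorflow_datasets as tfds

    for record in tfds.as_numpy(tfds.load("doc_nli", split=split)):
        yield parse_example(record)
```

Calling `parse_example` on a plain dictionary is enough to try the validation logic without downloading anything; `iter_split` triggers the actual dataset download and preparation on first use.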
  • Citation:
    title = "DocNLI: A Large-scale Dataset for Document-level Natural Language Inference",
    author = "Wenpeng Yin and Dragomir Radev and Caiming Xiong",
    booktitle = "Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021",
    month = aug,
    year = "2021",
    address = "Bangkok, Thailand",
    publisher = "Association for Computational Linguistics",