salient_span_wikipedia

  • Description:

Wikipedia sentences with labeled salient spans.

@article{guu2020realm,
    title={REALM: Retrieval-Augmented Language Model Pre-Training},
    author={Kelvin Guu and Kenton Lee and Zora Tung and Panupong Pasupat and Ming-Wei Chang},
    year={2020},
    journal = {arXiv e-prints},
    archivePrefix = {arXiv},
    eprint={2002.08909},
}

salient_span_wikipedia/sentences (default config)

  • Config description: Examples are individual sentences containing entities.

  • Dataset size: 20.57 GiB

  • Splits:

Split Examples
'train' 82,291,706
  • Feature structure:
FeaturesDict({
    'spans': Sequence({
        'limit': int32,
        'start': int32,
        'type': string,
    }),
    'text': Text(shape=(), dtype=string),
    'title': Text(shape=(), dtype=string),
})
  • Feature documentation:
Feature      Class         Shape  Dtype   Description
             FeaturesDict
spans        Sequence
spans/limit  Tensor               int32
spans/start  Tensor               int32
spans/type   Tensor               string
text         Text                 string
title        Text                 string
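
The sketch below illustrates how the `spans` feature might be read for one `sentences` example. It rests on assumptions not stated above: `start` is an inclusive and `limit` an exclusive byte offset into the UTF-8 encoded `text`, and an example arrives as a dict of parallel lists (the usual layout for a `Sequence` of dicts, e.g. from `tfds.as_numpy(tfds.load("salient_span_wikipedia/sentences", split="train"))`). The sample sentence, its offsets, the span type labels, the `[MASK]` token, and the helper names are all made up for illustration.

```python
def _as_bytes(value):
    """Return value as UTF-8 bytes (TFDS string features are bytes in NumPy form)."""
    return value if isinstance(value, bytes) else value.encode("utf-8")


def extract_spans(example):
    """List (type, surface text) for every salient span in one example.

    Assumes start is inclusive and limit is exclusive, both byte offsets.
    """
    text = _as_bytes(example["text"])
    spans = example["spans"]
    result = []
    for start, limit, kind in zip(spans["start"], spans["limit"], spans["type"]):
        surface = text[start:limit].decode("utf-8")
        result.append((_as_bytes(kind).decode("utf-8"), surface))
    return result


def mask_span(example, index, mask_token="[MASK]"):
    """Replace one span with mask_token; return (masked text, hidden span)."""
    text = _as_bytes(example["text"])
    start = example["spans"]["start"][index]
    limit = example["spans"]["limit"][index]
    masked = text[:start] + mask_token.encode("utf-8") + text[limit:]
    return masked.decode("utf-8"), text[start:limit].decode("utf-8")


# Hand-written example in the assumed layout (not taken from the dataset).
example = {
    "title": b"Barack Obama",
    "text": b"Barack Obama was born in 1961.",
    "spans": {
        "start": [0, 25],
        "limit": [12, 29],
        "type": [b"PERSON", b"DATE"],  # hypothetical labels
    },
}

print(extract_spans(example))
print(mask_span(example, 1))
```

Slicing the byte string rather than the decoded text keeps the offsets valid for non-ASCII sentences, provided the byte-offset assumption holds.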

salient_span_wikipedia/documents

  • Config description: Examples are full documents.

  • Dataset size: 16.52 GiB

  • Splits:

Split Examples
'train' 13,353,718
  • Feature structure:
FeaturesDict({
    'sentences': Sequence({
        'limit': int32,
        'start': int32,
    }),
    'spans': Sequence({
        'limit': int32,
        'start': int32,
        'type': string,
    }),
    'text': Text(shape=(), dtype=string),
    'title': Text(shape=(), dtype=string),
})
  • Feature documentation:
Feature          Class         Shape  Dtype   Description
                 FeaturesDict
sentences        Sequence
sentences/limit  Tensor               int32
sentences/start  Tensor               int32
spans            Sequence
spans/limit      Tensor               int32
spans/start      Tensor               int32
spans/type       Tensor               string
text             Text                 string
title            Text                 string
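
For the `documents` config, the `sentences` boundaries and the `spans` can be combined to recover per-sentence annotations. The following is a minimal sketch under assumptions not confirmed above: both `sentences` and `spans` use inclusive `start` and exclusive `limit` byte offsets into the UTF-8 encoded document `text`, and each `Sequence` is delivered as a dict of parallel lists. The sample document, its offsets, the type labels, and the function name `split_sentences` are hypothetical.

```python
def _to_bytes(value):
    """Return value as UTF-8 bytes."""
    return value if isinstance(value, bytes) else value.encode("utf-8")


def split_sentences(doc):
    """Return [(sentence text, [(type, span text), ...]), ...] for one document.

    A span is assigned to a sentence when it lies entirely inside the
    sentence's [start, limit) range.
    """
    text = _to_bytes(doc["text"])
    sentences = doc["sentences"]
    spans = doc["spans"]
    result = []
    for s_start, s_limit in zip(sentences["start"], sentences["limit"]):
        inside = []
        for start, limit, kind in zip(spans["start"], spans["limit"], spans["type"]):
            if s_start <= start and limit <= s_limit:
                inside.append(
                    (_to_bytes(kind).decode("utf-8"), text[start:limit].decode("utf-8"))
                )
        result.append((text[s_start:s_limit].decode("utf-8"), inside))
    return result


# Hand-written document in the assumed layout (not taken from the dataset).
doc = {
    "title": b"Paris",
    "text": b"Paris is in France. It hosted the 1900 Olympics.",
    "sentences": {"start": [0, 20], "limit": [19, 48]},
    "spans": {
        "start": [0, 12, 34],
        "limit": [5, 18, 38],
        "type": [b"LOCATION", b"LOCATION", b"DATE"],  # hypothetical labels
    },
}

for sentence, found in split_sentences(doc):
    print(sentence, found)
```

The nested loop is quadratic in the worst case; for long documents a single merged pass over both offset lists would be cheaper, assuming they are sorted by `start`.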