big_patent

  • Description:

BIGPATENT consists of 1.3 million records of U.S. patent documents along with human-written abstractive summaries. Each U.S. patent application is filed under a Cooperative Patent Classification (CPC) code. There are nine such classification categories:

  • A (Human Necessities),
  • B (Performing Operations; Transporting),
  • C (Chemistry; Metallurgy),
  • D (Textiles; Paper),
  • E (Fixed Constructions),
  • F (Mechanical Engineering; Lighting; Heating; Weapons; Blasting),
  • G (Physics),
  • H (Electricity), and
  • Y (General tagging of new or cross-sectional technology)

There are two features:

FeaturesDict({
    'abstract': Text(shape=(), dtype=string),
    'description': Text(shape=(), dtype=string),
})
  • Feature documentation:

Feature      Class         Shape  Dtype   Description
             FeaturesDict
abstract     Text                 string
description  Text                 string
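
As a rough illustration of how the two features might be read, the sketch below loads one config through TensorFlow Datasets and prints a short preview. The helper names `config_name` and `preview` are hypothetical and not part of the library; the sketch assumes the `tensorflow-datasets` package is installed and that the dataset can be downloaded and prepared on first use.

```python
# Sketch only: CPC_CODES, config_name and preview are illustrative helpers,
# not part of TensorFlow Datasets.
CPC_CODES = ("a", "b", "c", "d", "e", "f", "g", "h", "y")


def config_name(cpc_code=None):
    """Return the dataset name for one CPC category, or the default 'all' config."""
    if cpc_code is None:
        return "big_patent/all"
    code = cpc_code.lower()
    if code not in CPC_CODES:
        raise ValueError(f"unknown CPC code: {cpc_code!r}")
    return f"big_patent/{code}"


def preview(cpc_code="d", split="validation", n=2):
    """Print the start of the first n abstracts and the length of each description.

    Calling this downloads and prepares the chosen config on first use.
    """
    # Imported lazily so the helpers above work without the package installed.
    import tensorflow_datasets as tfds

    ds = tfds.load(config_name(cpc_code), split=split)
    for example in tfds.as_numpy(ds.take(n)):
        # Both features are scalar strings, returned as UTF-8 encoded bytes.
        abstract = example["abstract"].decode("utf-8")
        description = example["description"].decode("utf-8")
        print(abstract[:200])
        print(f"description length: {len(description)} characters")
```

Calling `preview()` with its defaults uses the `d` config, which is the smallest one listed on this page and therefore the cheapest to try first.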
  • Citation:

@misc{sharma2019bigpatent,
    title={BIGPATENT: A Large-Scale Dataset for Abstractive and Coherent Summarization},
    author={Eva Sharma and Chen Li and Lu Wang},
    year={2019},
    eprint={1906.03741},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}

big_patent/all (default config)

  • Config description: Patents under all categories.

  • Dataset size: 35.17 GiB

  • Splits:

Split Examples
'test' 67,072
'train' 1,207,222
'validation' 67,068
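
The split sizes above can be checked against the "1.3 million records" figure in the description. The short calculation below uses only the counts from this table and shows the share of each split.

```python
# Split sizes copied from the big_patent/all table above.
splits = {"train": 1_207_222, "validation": 67_068, "test": 67_072}

total = sum(splits.values())
fractions = {name: count / total for name, count in splits.items()}

print(f"total examples: {total:,}")  # total examples: 1,341,362
for name, fraction in fractions.items():
    print(f"{name}: {fraction:.1%}")
```

The total of 1,341,362 examples agrees with the rounded 1.3 million, and the split is roughly 90% train, 5% validation and 5% test.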

big_patent/a

  • Config description: Patents under Cooperative Patent Classification (CPC) category A: Human Necessities

  • Dataset size: 5.16 GiB

  • Splits:

Split Examples
'test' 9,675
'train' 174,134
'validation' 9,674

big_patent/b

  • Config description: Patents under Cooperative Patent Classification (CPC) category B: Performing Operations; Transporting

  • Dataset size: 4.06 GiB

  • Splits:

Split Examples
'test' 8,974
'train' 161,520
'validation' 8,973

big_patent/c

  • Config description: Patents under Cooperative Patent Classification (CPC) category C: Chemistry; Metallurgy

  • Dataset size: 3.63 GiB

  • Splits:

Split Examples
'test' 5,614
'train' 101,042
'validation' 5,613

big_patent/d

  • Config description: Patents under Cooperative Patent Classification (CPC) category D: Textiles; Paper

  • Dataset size: 255.56 MiB

  • Splits:

Split Examples
'test' 565
'train' 10,164
'validation' 565

big_patent/e

  • Config description: Patents under Cooperative Patent Classification (CPC) category E: Fixed Constructions

  • Dataset size: 871.40 MiB

  • Splits:

Split Examples
'test' 1,914
'train' 34,443
'validation' 1,914

big_patent/f

  • Config description: Patents under Cooperative Patent Classification (CPC) category F: Mechanical Engineering; Lighting; Heating; Weapons; Blasting

  • Dataset size: 2.06 GiB

  • Splits:

Split Examples
'test' 4,754
'train' 85,568
'validation' 4,754

big_patent/g

  • Config description: Patents under Cooperative Patent Classification (CPC) category G: Physics

  • Dataset size: 8.19 GiB

  • Splits:

Split Examples
'test' 14,386
'train' 258,935
'validation' 14,385

big_patent/h

  • Config description: Patents under Cooperative Patent Classification (CPC) category H: Electricity

  • Dataset size: 7.50 GiB

  • Splits:

Split Examples
'test' 14,279
'train' 257,019
'validation' 14,279

big_patent/y

  • Config description: Patents under Cooperative Patent Classification (CPC) category Y: General tagging of new or cross-sectional technology

  • Dataset size: 3.46 GiB

  • Splits:

Split Examples
'test' 6,911
'train' 124,397
'validation' 6,911
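
As a consistency check on the tables above, the sketch below sums the per-category split sizes and compares them with the `big_patent/all` config. All numbers are copied from this page; nothing is downloaded.

```python
# Split sizes copied from the per-config tables on this page.
per_category = {
    "a": {"test": 9_675, "train": 174_134, "validation": 9_674},
    "b": {"test": 8_974, "train": 161_520, "validation": 8_973},
    "c": {"test": 5_614, "train": 101_042, "validation": 5_613},
    "d": {"test": 565, "train": 10_164, "validation": 565},
    "e": {"test": 1_914, "train": 34_443, "validation": 1_914},
    "f": {"test": 4_754, "train": 85_568, "validation": 4_754},
    "g": {"test": 14_386, "train": 258_935, "validation": 14_385},
    "h": {"test": 14_279, "train": 257_019, "validation": 14_279},
    "y": {"test": 6_911, "train": 124_397, "validation": 6_911},
}
all_config = {"test": 67_072, "train": 1_207_222, "validation": 67_068}

# Add up each split across the nine category configs.
summed = {
    split: sum(counts[split] for counts in per_category.values())
    for split in all_config
}

for split, expected in all_config.items():
    status = "matches" if summed[split] == expected else "differs from"
    print(f"{split}: category sum {summed[split]:,} {status} big_patent/all ({expected:,})")
```

All three splits match, which is consistent with each record appearing in exactly one category config, although the counts alone do not prove it.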