
wikihow

Description:

WikiHow is a new large-scale dataset using the online WikiHow (http://www.wikihow.com/) knowledge base.

There are two features:
- text: the WikiHow answer text.
- headline: the bold lines, used as the summary.

There are two separate versions:
- all: consisting of the concatenation of all paragraphs as the articles and the bold lines as the reference summaries.
- sep: consisting of each paragraph and its summary.
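To make the relationship between the two versions concrete, here is a toy sketch that groups paragraph-level ("sep"-style) records by title and concatenates them into article-level ("all"-style) records. The records, the `merge_to_all` helper, and the newline separator are illustrative assumptions; this is not the authors' pipeline.

```python
from collections import defaultdict

# Toy "sep"-style records: one paragraph and its bold-line summary each.
# Field names mirror the dataset features; the contents are made up.
sep_records = [
    {"title": "How to Brew Tea", "headline": "Boil water.",
     "text": "Fill a kettle and heat it."},
    {"title": "How to Brew Tea", "headline": "Steep the leaves.",
     "text": "Pour the water over the leaves."},
    {"title": "How to Plant a Tree", "headline": "Dig a hole.",
     "text": "Make it twice as wide as the root ball."},
]


def merge_to_all(records):
    """Group paragraph-level records by title and concatenate them.

    Hypothetical helper: the newline separator is an assumption.
    """
    grouped = defaultdict(lambda: {"headline": [], "text": []})
    for rec in records:
        grouped[rec["title"]]["headline"].append(rec["headline"])
        grouped[rec["title"]]["text"].append(rec["text"])
    return [
        {
            "title": title,
            "headline": "\n".join(parts["headline"]),
            "text": "\n".join(parts["text"]),
        }
        for title, parts in grouped.items()
    ]


all_records = merge_to_all(sep_records)
```

Each resulting record pairs a whole article with the concatenation of its bold lines, which is the shape described for the "all" version.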

Download "wikihowAll.csv" and "wikihowSep.csv" from https://github.com/mahnazkoupaee/WikiHow-Dataset and place them in the manual folder (see https://www.tensorflow.org/datasets/api_docs/python/tfds/download/DownloadConfig).

Train/validation/test splits are provided by the authors. Preprocessing is applied to remove short articles (abstract length < 0.75 article length) and clean up extra commas.
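The preprocessing rule can be sketched as follows. This reads the parenthetical as the condition an example must satisfy to be kept, which is an assumption because the wording is ambiguous. The helper names and the specific comma pattern (a stray comma directly after a period) are also assumptions, not the dataset builder's exact code.

```python
def clean_commas(s):
    """Remove a stray comma that directly follows a period.

    Assumed pattern; the actual cleanup rules may differ.
    """
    return s.replace(".,", ".")


def keep_example(abstract, article, ratio=0.75):
    """Return True when the abstract is shorter than ratio * article length.

    Assumption: examples that fail this check are the ones dropped.
    Lengths are measured in characters.
    """
    return len(abstract) < ratio * len(article)


# Toy (abstract, article) pairs, made up for illustration.
rows = [
    ("Boil water.,", "Fill a kettle with fresh water and heat it until it boils."),
    ("A summary nearly as long as the text.", "A text barely longer than its summary."),
]

kept = [
    (clean_commas(abstract), clean_commas(article))
    for abstract, article in rows
    if keep_example(abstract, article)
]
```

With these toy rows, only the first pair passes the length check, and its trailing ".," is cleaned to ".".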

Additional Documentation: Explore on Papers With Code
Homepage: https://github.com/mahnazkoupaee/WikiHow-Dataset
Source code: tfds.summarization.Wikihow
Versions:
- 1.2.0 (default): No release notes.
Download size: 5.21 MiB
Manual download instructions: This dataset requires you to download the source data manually into download_config.manual_dir (defaults to ~/tensorflow_datasets/downloads/manual/):
Links to the files can be found at https://github.com/mahnazkoupaee/WikiHow-Dataset. Please download both wikihowAll.csv and wikihowSep.csv.
Auto-cached (documentation): No
Supervised keys (See as_supervised doc): ('text', 'headline')
Figure (tfds.show_examples): Not supported.
Citation:

@misc{koupaee2018wikihow,
    title={WikiHow: A Large Scale Text Summarization Dataset},
    author={Mahnaz Koupaee and William Yang Wang},
    year={2018},
    eprint={1810.09305},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
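The manual download step and the supervised keys listed above can be combined into a small loading sketch. The helper names `missing_manual_files` and `load_wikihow` are hypothetical. The sketch assumes `tensorflow_datasets` is installed and imports it lazily, so the file check works without it. It passes the manual folder through `tfds.download.DownloadConfig(manual_dir=...)` and `download_and_prepare_kwargs`.

```python
from pathlib import Path

REQUIRED_FILES = ("wikihowAll.csv", "wikihowSep.csv")
DEFAULT_MANUAL_DIR = "~/tensorflow_datasets/downloads/manual/"


def missing_manual_files(manual_dir=DEFAULT_MANUAL_DIR):
    """Return the required CSV names that are not present in manual_dir."""
    root = Path(manual_dir).expanduser()
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]


def load_wikihow(config="all", split="train",
                 manual_dir=DEFAULT_MANUAL_DIR, as_supervised=False):
    """Load wikihow/<config> after checking the manually downloaded files."""
    missing = missing_manual_files(manual_dir)
    if missing:
        raise FileNotFoundError(
            f"Missing in {manual_dir}: {', '.join(missing)}"
        )

    # Imported lazily so the file check above has no TFDS dependency.
    import tensorflow_datasets as tfds

    download_config = tfds.download.DownloadConfig(
        manual_dir=str(Path(manual_dir).expanduser())
    )
    return tfds.load(
        f"wikihow/{config}",
        split=split,
        as_supervised=as_supervised,
        download_and_prepare_kwargs={"download_config": download_config},
    )
```

Calling `load_wikihow("sep", split="validation", as_supervised=True)` would then yield `(text, headline)` pairs, matching the supervised keys, provided both CSV files are in place.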

wikihow/all (default config)

Config description: Use the concatenation of all paragraphs as the articles and the bold lines as the reference summaries
Dataset size: 531.56 MiB
Splits:

Split	Examples
`'test'`	5,577
`'train'`	157,252
`'validation'`	5,599

Feature structure:

FeaturesDict({
    'headline': Text(shape=(), dtype=string),
    'text': Text(shape=(), dtype=string),
    'title': Text(shape=(), dtype=string),
})

Feature documentation:

Feature	Class	Dtype
	FeaturesDict
headline	Text	string
text	Text	string
title	Text	string
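Since every feature is a scalar string, examples read through `tfds.as_numpy` arrive as dictionaries of `bytes` values. The sketch below decodes them to `str`. The helper names are hypothetical, and it assumes the dataset has already been prepared and that `tensorflow_datasets` is installed (imported lazily inside the generator).

```python
def decode_example(example):
    """Decode bytes values to str; leave any other values unchanged."""
    return {
        key: value.decode("utf-8") if isinstance(value, bytes) else value
        for key, value in example.items()
    }


def iter_decoded(config="all", split="validation"):
    """Yield decoded examples from an already prepared wikihow config."""
    # Imported lazily so decode_example can be used without TFDS.
    import tensorflow_datasets as tfds

    ds = tfds.load(f"wikihow/{config}", split=split)
    for example in tfds.as_numpy(ds):
        yield decode_example(example)


# Stand-in for one wikihow/all example; the contents are made up.
sample = {"headline": b"Boil water.", "text": b"Fill a kettle.", "title": b"How to Brew Tea"}
decoded = decode_example(sample)
```

The same helper works for wikihow/sep, whose examples carry the additional `overview` and `sectionLabel` keys.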


wikihow/sep

Config description: Use each paragraph and its summary.
Dataset size: 1.07 GiB
Splits:

Split	Examples
`'test'`	37,800
`'train'`	1,060,732
`'validation'`	37,932

Feature structure:

FeaturesDict({
    'headline': Text(shape=(), dtype=string),
    'overview': Text(shape=(), dtype=string),
    'sectionLabel': Text(shape=(), dtype=string),
    'text': Text(shape=(), dtype=string),
    'title': Text(shape=(), dtype=string),
})

Feature documentation:

Feature	Class	Dtype
	FeaturesDict
headline	Text	string
overview	Text	string
sectionLabel	Text	string
text	Text	string
title	Text	string

