reddit
This corpus contains preprocessed posts from the Reddit dataset. The dataset
consists of 3,848,330 posts with an average length of 270 words for content, and
28 words for the summary.
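The averages quoted above could be re-estimated from loaded examples with a sketch like the one below. It assumes words are counted by whitespace splitting, which the source does not specify, so results may differ from the reported 270 and 28. The function name `mean_word_counts` and the sample records are hypothetical.

```python
from typing import Iterable, Mapping, Tuple


def mean_word_counts(examples: Iterable[Mapping[str, str]]) -> Tuple[float, float]:
    """Return (mean content length, mean summary length) in whitespace-separated words."""
    n = 0
    content_words = 0
    summary_words = 0
    for ex in examples:
        # Assumption: a "word" is any whitespace-delimited token.
        content_words += len(ex["content"].split())
        summary_words += len(ex["summary"].split())
        n += 1
    if n == 0:
        raise ValueError("no examples given")
    return content_words / n, summary_words / n


# Hypothetical records for illustration only, not taken from the corpus.
sample = [
    {"content": "a long post about something", "summary": "short version"},
    {"content": "another post", "summary": "tl dr"},
]
print(mean_word_counts(sample))
```

Because the function accepts any iterable of dict-like records, it can be run on a small sample before committing to a pass over all 3,848,330 posts.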
All features are strings: author, body, normalizedBody, content, summary,
subreddit, subreddit_id, and id. The content field is used as the document and
the summary field as its summary.
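A minimal loading sketch follows, assuming `tensorflow_datasets` is installed and there is enough disk space for the prepared dataset. The helper names `to_pair` and `load_reddit_pairs` are hypothetical. `to_pair` only restates the document/summary roles described above, while `load_reddit_pairs` relies on `tfds.load` with `as_supervised=True`, which for this dataset is expected to yield `(content, summary)` pairs.

```python
from typing import Mapping, Tuple


def to_pair(example: Mapping[str, str]) -> Tuple[str, str]:
    """Map a feature dict to (document, summary), mirroring the roles described above."""
    return example["content"], example["summary"]


def load_reddit_pairs():
    """Load the 'train' split as (content, summary) pairs.

    The import is deferred because calling this triggers a multi-GiB
    download and preparation step on first use.
    """
    import tensorflow_datasets as tfds  # assumed to be installed

    return tfds.load("reddit", split="train", as_supervised=True)


# Hypothetical record for illustration only, not taken from the corpus.
doc, summ = to_pair({"content": "long post text", "summary": "short gist"})
print(doc, "->", summ)
```

Dropping `as_supervised=True` would instead return the full feature dictionary for each example, which is useful when the subreddit or author fields are needed.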
| Split   | Examples  |
|---------|-----------|
| 'train' | 3,848,330 |
FeaturesDict({
'author': string,
'body': string,
'content': string,
'id': string,
'normalizedBody': string,
'subreddit': string,
'subreddit_id': string,
'summary': string,
})
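The feature structure above can be mirrored in a small sanity check that a decoded record carries all eight string fields. This is a sketch in plain Python; `REDDIT_FEATURES` and `invalid_fields` are hypothetical names, not part of TFDS. `bytes` values are accepted alongside `str` because string tensors surface as `bytes` once converted to NumPy.

```python
from typing import Any, List, Mapping

# Field names copied from the FeaturesDict above; every feature is a string.
REDDIT_FEATURES = (
    "author", "body", "content", "id",
    "normalizedBody", "subreddit", "subreddit_id", "summary",
)


def invalid_fields(record: Mapping[str, Any]) -> List[str]:
    """Return the names of features that are absent or not str/bytes."""
    return [
        name for name in REDDIT_FEATURES
        if not isinstance(record.get(name), (str, bytes))
    ]


# Hypothetical record for illustration only: 'summary' is deliberately missing.
record = {name: "x" for name in REDDIT_FEATURES if name != "summary"}
print(invalid_fields(record))
```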
| Feature        | Class        | Shape | Dtype  | Description |
|----------------|--------------|-------|--------|-------------|
|                | FeaturesDict |       |        |             |
| author         | Tensor       |       | string |             |
| body           | Tensor       |       | string |             |
| content        | Tensor       |       | string |             |
| id             | Tensor       |       | string |             |
| normalizedBody | Tensor       |       | string |             |
| subreddit      | Tensor       |       | string |             |
| subreddit_id   | Tensor       |       | string |             |
| summary        | Tensor       |       | string |             |
@inproceedings{volske-etal-2017-tl,
title = "{TL};{DR}: Mining {R}eddit to Learn Automatic Summarization",
author = {V{\"o}lske, Michael and
Potthast, Martin and
Syed, Shahbaz and
Stein, Benno},
booktitle = "Proceedings of the Workshop on New Frontiers in Summarization",
month = sep,
year = "2017",
address = "Copenhagen, Denmark",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/W17-4508",
doi = "10.18653/v1/W17-4508",
pages = "59--63",
abstract = "Recent advances in automatic text summarization have used deep neural networks to generate high-quality abstractive summaries, but the performance of these models strongly depends on large amounts of suitable training data. We propose a new method for mining social media for author-provided summaries, taking advantage of the common practice of appending a {``}TL;DR{''} to long posts. A case study using a large Reddit crawl yields the Webis-TLDR-17 dataset, complementing existing corpora primarily from the news genre. Our technique is likely applicable to other social media sites and general web crawls.",
}
Last updated 2022-12-20 UTC.
Homepage: https://github.com/webis-de/webis-tldr-17-corpus
Additional documentation: Explore on Papers With Code (https://paperswithcode.com/dataset/reddit)
Source code: tfds.datasets.reddit.Builder (https://github.com/tensorflow/datasets/tree/master/tensorflow_datasets/datasets/reddit/reddit_dataset_builder.py)
Versions: 1.0.0 (default), no release notes.
Download size: 2.93 GiB
Dataset size: 18.09 GiB
Auto-cached: No
Supervised keys (see the as_supervised argument of tfds.load): ('content', 'summary')
Figure (tfds.show_examples): Not supported.