tff.simulation.datasets.stackoverflow.load_data
Loads the federated Stack Overflow dataset.
tff.simulation.datasets.stackoverflow.load_data(
cache_dir=None
)
Downloads and caches the dataset locally. If previously downloaded, tries to
load the dataset from cache.
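For example, a minimal usage sketch (not part of the API reference; the cache path shown is an arbitrary placeholder):

import tensorflow_federated as tff

# Download the Stack Overflow dataset, or reuse a previous download
# found under the given directory. '/tmp/stackoverflow_cache' is an
# arbitrary example path.
train, held_out, test = tff.simulation.datasets.stackoverflow.load_data(
    cache_dir='/tmp/stackoverflow_cache'
)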
This dataset is derived from the Stack Overflow Data hosted by kaggle.com and
available to query through Kernels using the BigQuery API:
https://www.kaggle.com/stackoverflow/stackoverflow. The Stack Overflow Data
is licensed under the Creative Commons Attribution-ShareAlike 3.0 Unported
License. To view a copy of this license, visit
http://creativecommons.org/licenses/by-sa/3.0/ or send a letter to
Creative Commons, PO Box 1866, Mountain View, CA 94042, USA.
The data consists of the body text of all questions and answers. The bodies
were parsed into sentences, and any user with fewer than 100 sentences was
expunged from the data. Minimal preprocessing was performed as follows:
- Lowercasing the text,
- Unescaping HTML symbols,
- Removing non-ASCII symbols,
- Separating punctuation into individual tokens (except apostrophes and hyphens),
- Removing extraneous whitespace,
- Replacing URLs with a special token.
In addition, the following metadata is available:
- Creation date
- Question title
- Question tags
- Question score
- Type ('question' or 'answer')
The data is divided into three sets:
- Train: Data before 2018-01-01 UTC, excluding data from the held-out users. 342,477 unique users with 135,818,730 examples.
- Held-out: All examples from users with user_id % 10 == 0 (all dates). 38,758 unique users with 16,491,230 examples.
- Test: All examples after 2018-01-01 UTC, excluding data from the held-out users. 204,088 unique users with 16,586,035 examples.
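As a sketch of how these splits surface in code (assuming the returned tuple and user counts described above; client_ids lists each split's unique users):

import tensorflow_federated as tff

train, held_out, test = tff.simulation.datasets.stackoverflow.load_data()

# Each split is a tff.simulation.datasets.ClientData.
print(len(train.client_ids))     # 342,477 train users
print(len(held_out.client_ids))  # 38,758 held-out users
print(len(test.client_ids))      # 204,088 test users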
The tf.data.Datasets returned by
tff.simulation.datasets.ClientData.create_tf_dataset_for_client will yield
collections.OrderedDict objects at each iteration, with the following keys
and values, in lexicographic order by key:
- 'creation_date': a tf.Tensor with dtype=tf.string and shape [] containing the date/time of the question or answer in UTC format.
- 'score': a tf.Tensor with dtype=tf.int64 and shape [] containing the score of the question.
- 'tags': a tf.Tensor with dtype=tf.string and shape [] containing the tags of the question, separated by '|' characters.
- 'title': a tf.Tensor with dtype=tf.string and shape [] containing the title of the question.
- 'tokens': a tf.Tensor with dtype=tf.string and shape [] containing the tokens of the question/answer, separated by space (' ') characters.
- 'type': a tf.Tensor with dtype=tf.string and shape [] containing either the string 'question' or 'answer'.
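For illustration, a minimal sketch (assuming eager execution and the splits loaded as above) of building one client's dataset and reading these keys:

import tensorflow_federated as tff

train, _, _ = tff.simulation.datasets.stackoverflow.load_data()

# Build the tf.data.Dataset for an arbitrary client and inspect one example.
client_id = train.client_ids[0]
client_ds = train.create_tf_dataset_for_client(client_id)
for example in client_ds.take(1):
    # `example` is a collections.OrderedDict with the keys described above.
    print(example['type'].numpy())    # b'question' or b'answer'
    print(example['tokens'].numpy())  # space-separated tokens of the body text
    print(example['tags'].numpy())    # '|'-separated question tags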
Args
cache_dir: (Optional) directory to cache the downloaded file. If None, caches in Keras' default cache directory.
Returns
Tuple of (train, held_out, test) where the tuple elements are tff.simulation.datasets.ClientData objects.