# speech_commands

- **Description**:
An audio dataset of spoken words designed to help train and evaluate keyword
spotting systems. Its primary goal is to provide a way to build and test small
models that detect when a single word is spoken, from a set of ten target words,
with as few false positives as possible from background noise or unrelated
speech. Note that in the train and validation sets, the label "unknown" is much
more prevalent than the labels of the target words or background noise. One
difference from the release version is the handling of silent segments. While in
the test set the silence segments are regular one-second files, in the training
set they are provided as long recordings in the "background_noise" folder. Here
we split this background noise into one-second clips, and we also keep one of
the files for the validation set.
- **Homepage**:
  <https://arxiv.org/abs/1804.03209>

- **Source code**:
  [`tfds.datasets.speech_commands.Builder`](https://github.com/tensorflow/datasets/tree/master/tensorflow_datasets/datasets/speech_commands/speech_commands_dataset_builder.py)

- **Versions**:

  - **`0.0.3`** (default): Fix audio data type with dtype=tf.int16.

- **Download size**: `2.37 GiB`

- **Dataset size**: `8.17 GiB`

- **Splits**:

| Split          | Examples |
|----------------|----------|
| `'test'`       | 4,890    |
| `'train'`      | 85,511   |
| `'validation'` | 10,102   |
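The split names in the table above are exactly what gets passed to `tfds.load`.
As a minimal sketch (assuming TensorFlow Datasets is installed; the example is
illustrative and not part of the catalog entry itself), loading all three splits
and checking their sizes might look like this:

    import tensorflow_datasets as tfds

    # Load the three catalog splits along with the dataset metadata.
    (train_ds, val_ds, test_ds), info = tfds.load(
        'speech_commands',
        split=['train', 'validation', 'test'],
        with_info=True,
    )

    # Should report 85,511 / 10,102 / 4,890 examples respectively.
    for name in ('train', 'validation', 'test'):
        print(name, info.splits[name].num_examples)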
- **Feature structure**:

    FeaturesDict({
        'audio': Audio(shape=(None,), dtype=int16),
        'label': ClassLabel(shape=(), dtype=int64, num_classes=12),
    })
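Each example is therefore a dictionary with a variable-length `int16` waveform
and an integer class id. A short sketch of inspecting a single example (the
32768 normalization constant is an assumption based on the `int16` sample range,
not something the catalog states):

    import tensorflow as tf
    import tensorflow_datasets as tfds

    ds, info = tfds.load('speech_commands', split='validation', with_info=True)

    for example in ds.take(1):
        audio = example['audio']   # int16 waveform, shape (None,)
        label = example['label']   # int64 class id in [0, 12)
        # Scale raw int16 samples to floats in [-1.0, 1.0) for typical audio models.
        waveform = tf.cast(audio, tf.float32) / 32768.0
        # Map the integer id back to its human-readable class name.
        print(waveform.shape, info.features['label'].int2str(int(label)))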
- **Feature documentation**:

| Feature | Class        | Shape   | Dtype | Description |
|---------|--------------|---------|-------|-------------|
|         | FeaturesDict |         |       |             |
| audio   | Audio        | (None,) | int16 |             |
| label   | ClassLabel   |         | int64 |             |

- **Supervised keys** (See
  [`as_supervised` doc](https://www.tensorflow.org/datasets/api_docs/python/tfds/load#args)):
  `('audio', 'label')`
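Because the supervised keys yield `(audio, label)` tuples with `as_supervised=True`
and the waveforms have shape `(None,)`, batches need padding. A hedged sketch of a
simple input pipeline (batch size, shuffle buffer, and the int16 scale factor are
illustrative assumptions, not recommendations from the dataset authors):

    import tensorflow as tf
    import tensorflow_datasets as tfds

    train_ds = tfds.load('speech_commands', split='train', as_supervised=True)

    def to_float(audio, label):
        # Convert int16 samples to float32 in [-1.0, 1.0); the scale factor
        # assumes the full int16 range.
        return tf.cast(audio, tf.float32) / 32768.0, label

    train_ds = (
        train_ds
        .map(to_float, num_parallel_calls=tf.data.AUTOTUNE)
        .shuffle(10_000)
        # Pad every clip in a batch to the length of the longest one; labels are scalars.
        .padded_batch(32, padded_shapes=([None], []))
        .prefetch(tf.data.AUTOTUNE)
    )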
- **Citation**:

    @article{speechcommandsv2,
      author = {{Warden}, P.},
      title = "{Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition}",
      journal = {ArXiv e-prints},
      archivePrefix = "arXiv",
      eprint = {1804.03209},
      primaryClass = "cs.CL",
      keywords = {Computer Science - Computation and Language, Computer Science - Human-Computer Interaction},
      year = 2018,
      month = apr,
      url = {https://arxiv.org/abs/1804.03209},
    }