TensorFlow Datasets

TFDS provides a collection of ready-to-use datasets for use with TensorFlow, JAX, and other machine learning frameworks.

It handles downloading and preparing the data deterministically and constructing a tf.data.Dataset (or np.array).


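Installation

TFDS exists in two packages:

  • pip install tensorflow-datasets: the stable version, released every few months.
  • pip install tfds-nightly: released every day, contains the latest versions of the datasets.

This colab uses tfds-nightly: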

pip install -q tfds-nightly tensorflow matplotlib
import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf

import tensorflow_datasets as tfds

Find available datasets

All dataset builders are subclasses of tfds.core.DatasetBuilder. To get the list of available builders, use tfds.list_builders() or look at our catalog.

tfds.list_builders()
['abstract_reasoning',
 'accentdb',
 'aeslc',
 'aflw2k3d',
 'ag_news_subset',
 'ai2_arc',
 'ai2_arc_with_ir',
 'amazon_us_reviews',
 'anli',
 'answer_equivalence',
 'arc',
 'asqa',
 'asset',
 'assin2',
 'bair_robot_pushing_small',
 'bccd',
 'beans',
 'bee_dataset',
 'beir',
 'big_patent',
 'bigearthnet',
 'billsum',
 'binarized_mnist',
 'binary_alpha_digits',
 'ble_wind_field',
 'blimp',
 'booksum',
 'bool_q',
 'c4',
 'caltech101',
 'caltech_birds2010',
 'caltech_birds2011',
 'cardiotox',
 'cars196',
 'cassava',
 'cats_vs_dogs',
 'celeb_a',
 'celeb_a_hq',
 'cfq',
 'cherry_blossoms',
 'chexpert',
 'cifar10',
 'cifar100',
 'cifar10_1',
 'cifar10_corrupted',
 'citrus_leaves',
 'cityscapes',
 'civil_comments',
 'clevr',
 'clic',
 'clinc_oos',
 'cmaterdb',
 'cnn_dailymail',
 'coco',
 'coco_captions',
 'coil100',
 'colorectal_histology',
 'colorectal_histology_large',
 'common_voice',
 'coqa',
 'cos_e',
 'cosmos_qa',
 'covid19',
 'covid19sum',
 'crema_d',
 'criteo',
 'cs_restaurants',
 'curated_breast_imaging_ddsm',
 'cycle_gan',
 'd4rl_adroit_door',
 'd4rl_adroit_hammer',
 'd4rl_adroit_pen',
 'd4rl_adroit_relocate',
 'd4rl_antmaze',
 'd4rl_mujoco_ant',
 'd4rl_mujoco_halfcheetah',
 'd4rl_mujoco_hopper',
 'd4rl_mujoco_walker2d',
 'dart',
 'davis',
 'deep1b',
 'deep_weeds',
 'definite_pronoun_resolution',
 'dementiabank',
 'diabetic_retinopathy_detection',
 'diamonds',
 'div2k',
 'dmlab',
 'doc_nli',
 'dolphin_number_word',
 'domainnet',
 'downsampled_imagenet',
 'drop',
 'dsprites',
 'dtd',
 'duke_ultrasound',
 'e2e_cleaned',
 'efron_morris75',
 'emnist',
 'eraser_multi_rc',
 'esnli',
 'eurosat',
 'fashion_mnist',
 'flic',
 'flores',
 'food101',
 'forest_fires',
 'fuss',
 'gap',
 'geirhos_conflict_stimuli',
 'gem',
 'genomics_ood',
 'german_credit_numeric',
 'gigaword',
 'glove100_angular',
 'glue',
 'goemotions',
 'gov_report',
 'gpt3',
 'gref',
 'groove',
 'grounded_scan',
 'gsm8k',
 'gtzan',
 'gtzan_music_speech',
 'hellaswag',
 'higgs',
 'hillstrom',
 'horses_or_humans',
 'howell',
 'i_naturalist2017',
 'i_naturalist2018',
 'imagenet2012',
 'imagenet2012_corrupted',
 'imagenet2012_fewshot',
 'imagenet2012_multilabel',
 'imagenet2012_real',
 'imagenet2012_subset',
 'imagenet_a',
 'imagenet_lt',
 'imagenet_r',
 'imagenet_resized',
 'imagenet_sketch',
 'imagenet_v2',
 'imagenette',
 'imagewang',
 'imdb_reviews',
 'irc_disentanglement',
 'iris',
 'istella',
 'kddcup99',
 'kitti',
 'kmnist',
 'lambada',
 'lfw',
 'librispeech',
 'librispeech_lm',
 'libritts',
 'ljspeech',
 'lm1b',
 'locomotion',
 'lost_and_found',
 'lsun',
 'lvis',
 'malaria',
 'math_dataset',
 'math_qa',
 'mctaco',
 'media_sum',
 'mlqa',
 'mnist',
 'mnist_corrupted',
 'movie_lens',
 'movie_rationales',
 'movielens',
 'moving_mnist',
 'mrqa',
 'mslr_web',
 'mt_opt',
 'multi_news',
 'multi_nli',
 'multi_nli_mismatch',
 'natural_questions',
 'natural_questions_open',
 'newsroom',
 'nsynth',
 'nyu_depth_v2',
 'ogbg_molpcba',
 'omniglot',
 'open_images_challenge2019_detection',
 'open_images_v4',
 'openbookqa',
 'opinion_abstracts',
 'opinosis',
 'opus',
 'oxford_flowers102',
 'oxford_iiit_pet',
 'para_crawl',
 'pass',
 'patch_camelyon',
 'paws_wiki',
 'paws_x_wiki',
 'penguins',
 'pet_finder',
 'pg19',
 'piqa',
 'places365_small',
 'plant_leaves',
 'plant_village',
 'plantae_k',
 'protein_net',
 'qa4mre',
 'qasc',
 'quac',
 'quality',
 'quickdraw_bitmap',
 'race',
 'radon',
 'reddit',
 'reddit_disentanglement',
 'reddit_tifu',
 'ref_coco',
 'resisc45',
 'rlu_atari',
 'rlu_atari_checkpoints',
 'rlu_atari_checkpoints_ordered',
 'rlu_control_suite',
 'rlu_dmlab_explore_object_rewards_few',
 'rlu_dmlab_explore_object_rewards_many',
 'rlu_dmlab_rooms_select_nonmatching_object',
 'rlu_dmlab_rooms_watermaze',
 'rlu_dmlab_seekavoid_arena01',
 'rlu_locomotion',
 'rlu_rwrl',
 'robomimic_ph',
 'robonet',
 'robosuite_panda_pick_place_can',
 'rock_paper_scissors',
 'rock_you',
 's3o4d',
 'salient_span_wikipedia',
 'samsum',
 'savee',
 'scan',
 'scene_parse150',
 'schema_guided_dialogue',
 'sci_tail',
 'scicite',
 'scientific_papers',
 'scrolls',
 'sentiment140',
 'shapes3d',
 'sift1m',
 'simpte',
 'siscore',
 'smallnorb',
 'smartwatch_gestures',
 'snli',
 'so2sat',
 'speech_commands',
 'spoken_digit',
 'squad',
 'squad_question_generation',
 'stanford_dogs',
 'stanford_online_products',
 'star_cfq',
 'starcraft_video',
 'stl10',
 'story_cloze',
 'summscreen',
 'sun397',
 'super_glue',
 'svhn_cropped',
 'symmetric_solids',
 'tao',
 'ted_hrlr_translate',
 'ted_multi_translate',
 'tedlium',
 'tf_flowers',
 'the300w_lp',
 'tiny_shakespeare',
 'titanic',
 'trec',
 'trivia_qa',
 'tydi_qa',
 'uc_merced',
 'ucf101',
 'unified_qa',
 'vctk',
 'visual_domain_decathlon',
 'voc',
 'voxceleb',
 'voxforge',
 'waymo_open_dataset',
 'web_graph',
 'web_nlg',
 'web_questions',
 'wider_face',
 'wiki40b',
 'wiki_auto',
 'wiki_bio',
 'wiki_dialog',
 'wiki_table_questions',
 'wiki_table_text',
 'wikiann',
 'wikihow',
 'wikipedia',
 'wikipedia_toxicity_subtypes',
 'wine_quality',
 'winogrande',
 'wit',
 'wit_kaggle',
 'wmt13_translate',
 'wmt14_translate',
 'wmt15_translate',
 'wmt16_translate',
 'wmt17_translate',
 'wmt18_translate',
 'wmt19_translate',
 'wmt_t2t_translate',
 'wmt_translate',
 'wordnet',
 'wsc273',
 'xnli',
 'xquad',
 'xsum',
 'xtreme_pawsx',
 'xtreme_s',
 'xtreme_xnli',
 'yelp_polarity_reviews',
 'yes_no',
 'youtube_vis',
 'huggingface:acronym_identification',
 'huggingface:ade_corpus_v2',
 'huggingface:adversarial_qa',
 'huggingface:aeslc',
 'huggingface:afrikaans_ner_corpus',
 'huggingface:ag_news',
 'huggingface:ai2_arc',
 'huggingface:air_dialogue',
 'huggingface:ajgt_twitter_ar',
 'huggingface:allegro_reviews',
 'huggingface:allocine',
 'huggingface:alt',
 'huggingface:amazon_polarity',
 'huggingface:amazon_reviews_multi',
 'huggingface:amazon_us_reviews',
 'huggingface:ambig_qa',
 'huggingface:americas_nli',
 'huggingface:ami',
 'huggingface:amttl',
 'huggingface:anli',
 'huggingface:app_reviews',
 'huggingface:aqua_rat',
 'huggingface:aquamuse',
 'huggingface:ar_cov19',
 'huggingface:ar_res_reviews',
 'huggingface:ar_sarcasm',
 'huggingface:arabic_billion_words',
 'huggingface:arabic_pos_dialect',
 'huggingface:arabic_speech_corpus',
 'huggingface:arcd',
 'huggingface:arsentd_lev',
 'huggingface:art',
 'huggingface:arxiv_dataset',
 'huggingface:ascent_kb',
 'huggingface:aslg_pc12',
 'huggingface:asnq',
 'huggingface:asset',
 'huggingface:assin',
 'huggingface:assin2',
 'huggingface:atomic',
 'huggingface:autshumato',
 'huggingface:babi_qa',
 'huggingface:banking77',
 'huggingface:bbaw_egyptian',
 'huggingface:bbc_hindi_nli',
 'huggingface:bc2gm_corpus',
 'huggingface:beans',
 'huggingface:best2009',
 'huggingface:bianet',
 'huggingface:bible_para',
 'huggingface:big_patent',
 'huggingface:billsum',
 'huggingface:bing_coronavirus_query_set',
 'huggingface:biomrc',
 'huggingface:biosses',
 'huggingface:blbooks',
 'huggingface:blbooksgenre',
 'huggingface:blended_skill_talk',
 'huggingface:blimp',
 'huggingface:blog_authorship_corpus',
 'huggingface:bn_hate_speech',
 'huggingface:bnl_newspapers',
 'huggingface:bookcorpus',
 'huggingface:bookcorpusopen',
 'huggingface:boolq',
 'huggingface:bprec',
 'huggingface:break_data',
 'huggingface:brwac',
 'huggingface:bsd_ja_en',
 'huggingface:bswac',
 'huggingface:c3',
 'huggingface:c4',
 'huggingface:cail2018',
 'huggingface:caner',
 'huggingface:capes',
 'huggingface:casino',
 'huggingface:catalonia_independence',
 'huggingface:cats_vs_dogs',
 'huggingface:cawac',
 'huggingface:cbt',
 'huggingface:cc100',
 'huggingface:cc_news',
 'huggingface:ccaligned_multilingual',
 'huggingface:cdsc',
 'huggingface:cdt',
 'huggingface:cedr',
 'huggingface:cfq',
 'huggingface:chr_en',
 'huggingface:cifar10',
 'huggingface:cifar100',
 'huggingface:circa',
 'huggingface:civil_comments',
 'huggingface:clickbait_news_bg',
 'huggingface:climate_fever',
 'huggingface:clinc_oos',
 'huggingface:clue',
 'huggingface:cmrc2018',
 'huggingface:cmu_hinglish_dog',
 'huggingface:cnn_dailymail',
 'huggingface:coached_conv_pref',
 'huggingface:coarse_discourse',
 'huggingface:codah',
 'huggingface:code_search_net',
 'huggingface:code_x_glue_cc_clone_detection_big_clone_bench',
 'huggingface:code_x_glue_cc_clone_detection_poj104',
 'huggingface:code_x_glue_cc_cloze_testing_all',
 'huggingface:code_x_glue_cc_cloze_testing_maxmin',
 'huggingface:code_x_glue_cc_code_completion_line',
 'huggingface:code_x_glue_cc_code_completion_token',
 'huggingface:code_x_glue_cc_code_refinement',
 'huggingface:code_x_glue_cc_code_to_code_trans',
 'huggingface:code_x_glue_cc_defect_detection',
 'huggingface:code_x_glue_ct_code_to_text',
 'huggingface:code_x_glue_tc_nl_code_search_adv',
 'huggingface:code_x_glue_tc_text_to_code',
 'huggingface:code_x_glue_tt_text_to_text',
 'huggingface:com_qa',
 'huggingface:common_gen',
 'huggingface:common_language',
 'huggingface:common_voice',
 'huggingface:commonsense_qa',
 'huggingface:competition_math',
 'huggingface:compguesswhat',
 'huggingface:conceptnet5',
 'huggingface:conll2000',
 'huggingface:conll2002',
 'huggingface:conll2003',
 'huggingface:conll2012_ontonotesv5',
 'huggingface:conllpp',
 'huggingface:consumer-finance-complaints',
 'huggingface:conv_ai',
 'huggingface:conv_ai_2',
 'huggingface:conv_ai_3',
 'huggingface:conv_questions',
 'huggingface:coqa',
 'huggingface:cord19',
 'huggingface:cornell_movie_dialog',
 'huggingface:cos_e',
 'huggingface:cosmos_qa',
 'huggingface:counter',
 'huggingface:covid_qa_castorini',
 'huggingface:covid_qa_deepset',
 'huggingface:covid_qa_ucsd',
 'huggingface:covid_tweets_japanese',
 'huggingface:covost2',
 'huggingface:cppe-5',
 'huggingface:craigslist_bargains',
 'huggingface:crawl_domain',
 'huggingface:crd3',
 'huggingface:crime_and_punish',
 'huggingface:crows_pairs',
 'huggingface:cryptonite',
 'huggingface:cs_restaurants',
 'huggingface:cuad',
 'huggingface:curiosity_dialogs',
 'huggingface:daily_dialog',
 'huggingface:dane',
 'huggingface:danish_political_comments',
 'huggingface:dart',
 'huggingface:datacommons_factcheck',
 'huggingface:dbpedia_14',
 'huggingface:dbrd',
 'huggingface:deal_or_no_dialog',
 'huggingface:definite_pronoun_resolution',
 'huggingface:dengue_filipino',
 'huggingface:dialog_re',
 'huggingface:diplomacy_detection',
 'huggingface:disaster_response_messages',
 'huggingface:discofuse',
 'huggingface:discovery',
 'huggingface:disfl_qa',
 'huggingface:doc2dial',
 'huggingface:docred',
 'huggingface:doqa',
 'huggingface:dream',
 'huggingface:drop',
 'huggingface:duorc',
 'huggingface:dutch_social',
 'huggingface:dyk',
 'huggingface:e2e_nlg',
 'huggingface:e2e_nlg_cleaned',
 'huggingface:ecb',
 'huggingface:ecthr_cases',
 'huggingface:eduge',
 'huggingface:ehealth_kd',
 'huggingface:eitb_parcc',
 'huggingface:electricity_load_diagrams',
 'huggingface:eli5',
 'huggingface:eli5_category',
 'huggingface:elkarhizketak',
 'huggingface:emea',
 'huggingface:emo',
 'huggingface:emotion',
 'huggingface:emotone_ar',
 'huggingface:empathetic_dialogues',
 'huggingface:enriched_web_nlg',
 'huggingface:eraser_multi_rc',
 'huggingface:esnli',
 'huggingface:eth_py150_open',
 'huggingface:ethos',
 'huggingface:eu_regulatory_ir',
 'huggingface:eurlex',
 'huggingface:euronews',
 'huggingface:europa_eac_tm',
 'huggingface:europa_ecdc_tm',
 'huggingface:europarl_bilingual',
 'huggingface:event2Mind',
 'huggingface:evidence_infer_treatment',
 'huggingface:exams',
 'huggingface:factckbr',
 'huggingface:fake_news_english',
 'huggingface:fake_news_filipino',
 'huggingface:farsi_news',
 'huggingface:fashion_mnist',
 'huggingface:fever',
 'huggingface:few_rel',
 'huggingface:financial_phrasebank',
 'huggingface:finer',
 'huggingface:flores',
 'huggingface:flue',
 'huggingface:food101',
 'huggingface:fquad',
 'huggingface:freebase_qa',
 'huggingface:gap',
 'huggingface:gem',
 'huggingface:generated_reviews_enth',
 'huggingface:generics_kb',
 'huggingface:german_legal_entity_recognition',
 'huggingface:germaner',
 'huggingface:germeval_14',
 'huggingface:giga_fren',
 'huggingface:gigaword',
 'huggingface:glucose',
 'huggingface:glue',
 'huggingface:gnad10',
 'huggingface:go_emotions',
 'huggingface:gooaq',
 'huggingface:google_wellformed_query',
 'huggingface:grail_qa',
 'huggingface:great_code',
 'huggingface:greek_legal_code',
 'huggingface:guardian_authorship',
 'huggingface:gutenberg_time',
 'huggingface:hans',
 'huggingface:hansards',
 'huggingface:hard',
 'huggingface:harem',
 'huggingface:has_part',
 'huggingface:hate_offensive',
 'huggingface:hate_speech18',
 'huggingface:hate_speech_filipino',
 'huggingface:hate_speech_offensive',
 'huggingface:hate_speech_pl',
 'huggingface:hate_speech_portuguese',
 'huggingface:hatexplain',
 'huggingface:hausa_voa_ner',
 'huggingface:hausa_voa_topics',
 'huggingface:hda_nli_hindi',
 'huggingface:head_qa',
 'huggingface:health_fact',
 'huggingface:hebrew_projectbenyehuda',
 'huggingface:hebrew_sentiment',
 'huggingface:hebrew_this_world',
 'huggingface:hellaswag',
 'huggingface:hendrycks_test',
 'huggingface:hind_encorp',
 'huggingface:hindi_discourse',
 'huggingface:hippocorpus',
 'huggingface:hkcancor',
 'huggingface:hlgd',
 'huggingface:hope_edi',
 'huggingface:hotpot_qa',
 'huggingface:hover',
 'huggingface:hrenwac_para',
 'huggingface:hrwac',
 'huggingface:humicroedit',
 'huggingface:hybrid_qa',
 'huggingface:hyperpartisan_news_detection',
 'huggingface:iapp_wiki_qa_squad',
 'huggingface:id_clickbait',
 'huggingface:id_liputan6',
 'huggingface:id_nergrit_corpus',
 'huggingface:id_newspapers_2018',
 'huggingface:id_panl_bppt',
 'huggingface:id_puisi',
 'huggingface:igbo_english_machine_translation',
 'huggingface:igbo_monolingual',
 'huggingface:igbo_ner',
 'huggingface:ilist',
 'huggingface:imdb',
 'huggingface:imdb_urdu_reviews',
 'huggingface:imppres',
 'huggingface:indic_glue',
 'huggingface:indonli',
 'huggingface:indonlu',
 'huggingface:inquisitive_qg',
 'huggingface:interpress_news_category_tr',
 'huggingface:interpress_news_category_tr_lite',
 'huggingface:irc_disentangle',
 'huggingface:isixhosa_ner_corpus',
 'huggingface:isizulu_ner_corpus',
 'huggingface:iwslt2017',
 'huggingface:jeopardy',
 'huggingface:jfleg',
 'huggingface:jigsaw_toxicity_pred',
 'huggingface:jigsaw_unintended_bias',
 'huggingface:jnlpba',
 'huggingface:journalists_questions',
 'huggingface:kan_hope',
 'huggingface:kannada_news',
 'huggingface:kd_conv',
 'huggingface:kde4',
 'huggingface:kelm',
 'huggingface:kilt_tasks',
 'huggingface:kilt_wikipedia',
 'huggingface:kinnews_kirnews',
 'huggingface:klue',
 'huggingface:kor_3i4k',
 'huggingface:kor_hate',
 'huggingface:kor_ner',
 'huggingface:kor_nli',
 'huggingface:kor_nlu',
 'huggingface:kor_qpair',
 'huggingface:kor_sae',
 'huggingface:kor_sarcasm',
 'huggingface:labr',
 'huggingface:lama',
 'huggingface:lambada',
 'huggingface:large_spanish_corpus',
 'huggingface:laroseda',
 'huggingface:lc_quad',
 'huggingface:lener_br',
 'huggingface:lex_glue',
 'huggingface:liar',
 'huggingface:librispeech_asr',
 'huggingface:librispeech_lm',
 'huggingface:limit',
 'huggingface:lince',
 'huggingface:linnaeus',
 'huggingface:liveqa',
 'huggingface:lj_speech',
 'huggingface:lm1b',
 'huggingface:lst20',
 'huggingface:m_lama',
 'huggingface:mac_morpho',
 'huggingface:makhzan',
 'huggingface:masakhaner',
 'huggingface:math_dataset',
 'huggingface:math_qa',
 'huggingface:matinf',
 'huggingface:mbpp',
 'huggingface:mc4',
 'huggingface:mc_taco',
 'huggingface:md_gender_bias',
 'huggingface:mdd',
 'huggingface:med_hop',
 'huggingface:medal',
 'huggingface:medical_dialog',
 'huggingface:medical_questions_pairs',
 'huggingface:menyo20k_mt',
 'huggingface:meta_woz',
 'huggingface:metooma',
 'huggingface:metrec',
 'huggingface:miam',
 'huggingface:mkb',
 'huggingface:mkqa',
 'huggingface:mlqa',
 'huggingface:mlsum',
 'huggingface:mnist',
 'huggingface:mocha',
 'huggingface:monash_tsf',
 'huggingface:moroco',
 'huggingface:movie_rationales',
 'huggingface:mrqa',
 'huggingface:ms_marco',
 'huggingface:ms_terms',
 'huggingface:msr_genomics_kbcomp',
 'huggingface:msr_sqa',
 'huggingface:msr_text_compression',
 'huggingface:msr_zhen_translation_parity',
 'huggingface:msra_ner',
 'huggingface:mt_eng_vietnamese',
 'huggingface:muchocine',
 'huggingface:multi_booked',
 'huggingface:multi_eurlex',
 'huggingface:multi_news',
 'huggingface:multi_nli',
 'huggingface:multi_nli_mismatch',
 'huggingface:multi_para_crawl',
 'huggingface:multi_re_qa',
 'huggingface:multi_woz_v22',
 'huggingface:multi_x_science_sum',
 'huggingface:multidoc2dial',
 'huggingface:multilingual_librispeech',
 'huggingface:mutual_friends',
 'huggingface:mwsc',
 'huggingface:myanmar_news',
 'huggingface:narrativeqa',
 'huggingface:narrativeqa_manual',
 'huggingface:natural_questions',
 'huggingface:ncbi_disease',
 'huggingface:nchlt',
 'huggingface:ncslgr',
 'huggingface:nell',
 'huggingface:neural_code_search',
 'huggingface:news_commentary',
 'huggingface:newsgroup',
 'huggingface:newsph',
 'huggingface:newsph_nli',
 'huggingface:newspop',
 'huggingface:newsqa',
 'huggingface:newsroom',
 'huggingface:nkjp-ner',
 'huggingface:nli_tr',
 'huggingface:nlu_evaluation_data',
 'huggingface:norec',
 'huggingface:norne',
 'huggingface:norwegian_ner',
 'huggingface:nq_open',
 'huggingface:nsmc',
 'huggingface:numer_sense',
 'huggingface:numeric_fused_head',
 'huggingface:oclar',
 'huggingface:offcombr',
 'huggingface:offenseval2020_tr',
 'huggingface:offenseval_dravidian',
 'huggingface:ofis_publik',
 'huggingface:ohsumed',
 'huggingface:ollie',
 'huggingface:omp',
 'huggingface:onestop_english',
 'huggingface:onestop_qa',
 'huggingface:open_subtitles',
 'huggingface:openai_humaneval',
 'huggingface:openbookqa',
 'huggingface:openslr',
 'huggingface:openwebtext',
 'huggingface:opinosis',
 'huggingface:opus100',
 'huggingface:opus_books',
 'huggingface:opus_dgt',
 'huggingface:opus_dogc',
 'huggingface:opus_elhuyar',
 'huggingface:opus_euconst',
 'huggingface:opus_finlex',
 'huggingface:opus_fiskmo',
 'huggingface:opus_gnome',
 'huggingface:opus_infopankki',
 'huggingface:opus_memat',
 'huggingface:opus_montenegrinsubs',
 'huggingface:opus_openoffice',
 'huggingface:opus_paracrawl',
 'huggingface:opus_rf',
 'huggingface:opus_tedtalks',
 'huggingface:opus_ubuntu',
 'huggingface:opus_wikipedia',
 'huggingface:opus_xhosanavy',
 'huggingface:orange_sum',
 'huggingface:oscar',
 'huggingface:para_crawl',
 'huggingface:para_pat',
 'huggingface:parsinlu_reading_comprehension',
 'huggingface:pass',
 'huggingface:paws',
 'huggingface:paws-x',
 'huggingface:pec',
 'huggingface:peer_read',
 'huggingface:peoples_daily_ner',
 'huggingface:per_sent',
 'huggingface:persian_ner',
 'huggingface:pg19',
 'huggingface:php',
 'huggingface:piaf',
 'huggingface:pib',
 'huggingface:piqa',
 'huggingface:pn_summary',
 'huggingface:poem_sentiment',
 'huggingface:polemo2',
 'huggingface:poleval2019_cyberbullying',
 'huggingface:poleval2019_mt',
 'huggingface:polsum',
 'huggingface:polyglot_ner',
 'huggingface:prachathai67k',
 'huggingface:pragmeval',
 'huggingface:proto_qa',
 'huggingface:psc',
 'huggingface:ptb_text_only',
 'huggingface:pubmed',
 'huggingface:pubmed_qa',
 'huggingface:py_ast',
 'huggingface:qa4mre',
 'huggingface:qa_srl',
 'huggingface:qa_zre',
 'huggingface:qangaroo',
 'huggingface:qanta',
 'huggingface:qasc',
 'huggingface:qasper',
 'huggingface:qed',
 'huggingface:qed_amara',
 'huggingface:quac',
 'huggingface:quail',
 'huggingface:quarel',
 'huggingface:quartz',
 'huggingface:quora',
 'huggingface:quoref',
 'huggingface:race',
 'huggingface:re_dial',
 'huggingface:reasoning_bg',
 'huggingface:recipe_nlg',
 'huggingface:reclor',
 'huggingface:red_caps',
 'huggingface:reddit',
 'huggingface:reddit_tifu',
 'huggingface:refresd',
 'huggingface:reuters21578',
 'huggingface:riddle_sense',
 'huggingface:ro_sent',
 'huggingface:ro_sts',
 'huggingface:ro_sts_parallel',
 'huggingface:roman_urdu',
 'huggingface:ronec',
 'huggingface:ropes',
 'huggingface:rotten_tomatoes',
 'huggingface:russian_super_glue',
 'huggingface:s2orc',
 'huggingface:samsum',
 'huggingface:sanskrit_classic',
 'huggingface:saudinewsnet',
 'huggingface:sberquad',
 'huggingface:scan',
 'huggingface:scb_mt_enth_2020',
 'huggingface:scene_parse_150',
 'huggingface:schema_guided_dstc8',
 'huggingface:scicite',
 'huggingface:scielo',
 'huggingface:scientific_papers',
 'huggingface:scifact',
 'huggingface:sciq',
 'huggingface:scitail',
 'huggingface:scitldr',
 'huggingface:search_qa',
 'huggingface:sede',
 'huggingface:selqa',
 'huggingface:sem_eval_2010_task_8',
 'huggingface:sem_eval_2014_task_1',
 'huggingface:sem_eval_2018_task_1',
 'huggingface:sem_eval_2020_task_11',
 'huggingface:sent_comp',
 'huggingface:senti_lex',
 'huggingface:senti_ws',
 'huggingface:sentiment140',
 'huggingface:sepedi_ner',
 'huggingface:sesotho_ner_corpus',
 'huggingface:setimes',
 'huggingface:setswana_ner_corpus',
 'huggingface:sharc',
 'huggingface:sharc_modified',
 'huggingface:sick',
 'huggingface:silicone',
 'huggingface:simple_questions_v2',
 'huggingface:siswati_ner_corpus',
 'huggingface:smartdata',
 'huggingface:sms_spam',
 'huggingface:snips_built_in_intents',
 'huggingface:snli',
 'huggingface:snow_simplified_japanese_corpus',
 'huggingface:so_stacksample',
 'huggingface:social_bias_frames',
 'huggingface:social_i_qa',
 'huggingface:sofc_materials_articles',
 'huggingface:sogou_news',
 'huggingface:spanish_billion_words',
 'huggingface:spc',
 'huggingface:species_800',
 'huggingface:speech_commands',
 'huggingface:spider',
 'huggingface:squad',
 'huggingface:squad_adversarial',
 'huggingface:squad_es',
 'huggingface:squad_it',
 'huggingface:squad_kor_v1',
 'huggingface:squad_kor_v2',
 'huggingface:squad_v1_pt',
 'huggingface:squad_v2',
 'huggingface:squadshifts',
 'huggingface:srwac',
 'huggingface:sst',
 'huggingface:stereoset',
 'huggingface:story_cloze',
 'huggingface:stsb_mt_sv',
 'huggingface:stsb_multi_mt',
 'huggingface:style_change_detection',
 'huggingface:subjqa',
 'huggingface:super_glue',
 'huggingface:superb',
 'huggingface:svhn',
 'huggingface:swag',
 'huggingface:swahili',
 'huggingface:swahili_news',
 'huggingface:swda',
 'huggingface:swedish_medical_ner',
 'huggingface:swedish_ner_corpus',
 'huggingface:swedish_reviews',
 'huggingface:swiss_judgment_prediction',
 'huggingface:tab_fact',
 'huggingface:tamilmixsentiment',
 'huggingface:tanzil',
 'huggingface:tapaco',
 'huggingface:tashkeela',
 'huggingface:taskmaster1',
 'huggingface:taskmaster2',
 'huggingface:taskmaster3',
 'huggingface:tatoeba',
 'huggingface:ted_hrlr',
 'huggingface:ted_iwlst2013',
 'huggingface:ted_multi',
 'huggingface:ted_talks_iwslt',
 'huggingface:telugu_books',
 'huggingface:telugu_news',
 'huggingface:tep_en_fa_para',
 'huggingface:text2log',
 'huggingface:thai_toxicity_tweet',
 'huggingface:thainer',
 'huggingface:thaiqa_squad',
 'huggingface:thaisum',
 'huggingface:the_pile',
 'huggingface:the_pile_books3',
 'huggingface:the_pile_openwebtext2',
 'huggingface:the_pile_stack_exchange',
 'huggingface:tilde_model',
 'huggingface:time_dial',
 'huggingface:times_of_india_news_headlines',
 'huggingface:timit_asr',
 'huggingface:tiny_shakespeare',
 'huggingface:tlc',
 'huggingface:tmu_gfm_dataset',
 'huggingface:told-br',
 'huggingface:totto',
 'huggingface:trec',
 'huggingface:trivia_qa',
 'huggingface:tsac',
 'huggingface:ttc4900',
 'huggingface:tunizi',
 'huggingface:tuple_ie',
 'huggingface:turk',
 'huggingface:turkic_xwmt',
 'huggingface:turkish_movie_sentiment',
 'huggingface:turkish_ner',
 'huggingface:turkish_product_reviews',
 'huggingface:turkish_shrinked_ner',
 'huggingface:turku_ner_corpus',
 'huggingface:tweet_eval',
 'huggingface:tweet_qa',
 'huggingface:tweets_ar_en_parallel',
 'huggingface:tweets_hate_speech_detection',
 'huggingface:twi_text_c3',
 'huggingface:twi_wordsim353',
 'huggingface:tydiqa',
 'huggingface:ubuntu_dialogs_corpus',
 'huggingface:udhr',
 'huggingface:um005',
 'huggingface:un_ga',
 'huggingface:un_multi',
 'huggingface:un_pc',
 'huggingface:universal_dependencies',
 'huggingface:universal_morphologies',
 'huggingface:urdu_fake_news',
 'huggingface:urdu_sentiment_corpus',
 ...]

Load a dataset

tfds.load

The easiest way of loading a dataset is tfds.load. It will:

  1. Download the data and save it as tfrecord files.
  2. Load the tfrecord files and create the tf.data.Dataset.

ds = tfds.load('mnist', split='train', shuffle_files=True)
assert isinstance(ds, tf.data.Dataset)
print(ds)
<PrefetchDataset element_spec={'image': TensorSpec(shape=(28, 28, 1), dtype=tf.uint8, name=None), 'label': TensorSpec(shape=(), dtype=tf.int64, name=None)}>

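Some common arguments:

  • split=: which split to read (e.g. 'train', ['train', 'test'], 'train[80%:]', …). See our split API guide.
  • shuffle_files=: controls whether to shuffle the files between each epoch (TFDS stores big datasets in multiple smaller files).
  • data_dir=: location where the dataset is saved (defaults to ~/tensorflow_datasets/).
  • with_info=True: returns the tfds.core.DatasetInfo containing the dataset metadata.
  • download=False: disables download.

tfds.builder

tfds.load is a thin wrapper around tfds.core.DatasetBuilder. You can get the same output using the tfds.core.DatasetBuilder API: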

builder = tfds.builder('mnist')
# 1. Create the tfrecord files (no-op if already exists)
builder.download_and_prepare()
# 2. Load the `tf.data.Dataset`
ds = builder.as_dataset(split='train', shuffle_files=True)
print(ds)
<PrefetchDataset element_spec={'image': TensorSpec(shape=(28, 28, 1), dtype=tf.uint8, name=None), 'label': TensorSpec(shape=(), dtype=tf.int64, name=None)}>

tfds build CLI

If you want to generate a specific dataset, you can use the tfds command line. For example:

tfds build mnist

See the documentation for the available flags.

Iterate over a dataset

As dict

By default, the tf.data.Dataset object contains a dict of tf.Tensors:

ds = tfds.load('mnist', split='train')
ds = ds.take(1)  # Only take a single example

for example in ds:  # example is `{'image': tf.Tensor, 'label': tf.Tensor}`
  print(list(example.keys()))
  image = example["image"]
  label = example["label"]
  print(image.shape, label)
['image', 'label']
(28, 28, 1) tf.Tensor(4, shape=(), dtype=int64)
2022-06-04 01:42:18.953468: W tensorflow/core/kernels/data/cache_dataset_ops.cc:856] The calling iterator did not fully read the dataset being cached. In order to avoid unexpected truncation of the dataset, the partially cached contents of the dataset  will be discarded. This can happen if you have an input pipeline similar to `dataset.cache().take(k).repeat()`. You should use `dataset.take(k).cache().repeat()` instead.

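To find out the dict key names and structure, look at the dataset documentation in our catalog. For example: the mnist documentation.

As tuple (as_supervised=True)

By using as_supervised=True, you can get a tuple (features, label) instead for supervised datasets.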

ds = tfds.load('mnist', split='train', as_supervised=True)
ds = ds.take(1)

for image, label in ds:  # example is (image, label)
  print(image.shape, label)
(28, 28, 1) tf.Tensor(4, shape=(), dtype=int64)
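As a rough mental model, as_supervised=True can be thought of as picking the two features named by the dataset's supervised_keys (('image', 'label') for mnist, as reported in its DatasetInfo) out of each dict example. The sketch below illustrates this on plain Python dicts. to_supervised is a hypothetical helper written for this illustration, not part of the TFDS API, and the placeholder string stands in for a real tensor.

```python
# Illustrative only: mimics the dict -> tuple conversion on plain Python data.
# `to_supervised` is a hypothetical helper, not a TFDS function.

def to_supervised(example, supervised_keys=("image", "label")):
    """Return an (input, target) tuple picked from a dict example."""
    input_key, target_key = supervised_keys
    return example[input_key], example[target_key]


# A stand-in for one mnist example (real examples hold tf.Tensor values).
example = {"image": "<28x28x1 uint8 image>", "label": 4}

image, label = to_supervised(example)
print(label)  # 4
```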

As numpy (tfds.as_numpy)

Use tfds.as_numpy to convert a tf.data.Dataset into an iterable of NumPy values:

ds = tfds.load('mnist', split='train', as_supervised=True)
ds = ds.take(1)

for image, label in tfds.as_numpy(ds):
  print(type(image), type(label), label)
<class 'numpy.ndarray'> <class 'numpy.int64'> 4

As batched tf.Tensor (batch_size=-1)

By using batch_size=-1, you can load the full dataset in a single batch.

This can be combined with as_supervised=True and tfds.as_numpy to get the data as (np.array, np.array):

image, label = tfds.as_numpy(tfds.load(
    'mnist',
    split='test',
    batch_size=-1,
    as_supervised=True,
))

print(type(image), image.shape)
<class 'numpy.ndarray'> (10000, 28, 28, 1)

Be careful: your dataset must fit in memory, and all examples must have the same shape.
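A quick back-of-the-envelope estimate can help decide whether a split is small enough for batch_size=-1. The sketch below assumes a dense array with a fixed per-example shape and uses the mnist test split figures shown above (10000 images of shape (28, 28, 1), dtype uint8). estimated_bytes is a hypothetical helper, and the result ignores labels and any framework overhead.

```python
import math


def estimated_bytes(num_examples, example_shape, bytes_per_element):
    """Approximate size of a dense array holding every example of a split."""
    return num_examples * math.prod(example_shape) * bytes_per_element


# mnist test split: 10000 images of shape (28, 28, 1), uint8 = 1 byte per element.
image_bytes = estimated_bytes(10000, (28, 28, 1), 1)
print(f"{image_bytes / 2**20:.2f} MiB")  # 7.48 MiB
```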

Benchmark your datasets

Benchmarking a dataset is a simple tfds.benchmark call on any iterable (e.g. tf.data.Dataset, tfds.as_numpy, …).

ds = tfds.load('mnist', split='train')
ds = ds.batch(32).prefetch(1)

tfds.benchmark(ds, batch_size=32)
tfds.benchmark(ds, batch_size=32)  # Second epoch much faster due to auto-caching
************ Summary ************

Examples/sec (First included) 25550.77 ex/sec (total: 60000 ex, 2.35 sec)
Examples/sec (First only) 118.76 ex/sec (total: 32 ex, 0.27 sec)
Examples/sec (First excluded) 28847.20 ex/sec (total: 59968 ex, 2.08 sec)

************ Summary ************

Examples/sec (First included) 222130.99 ex/sec (total: 60000 ex, 0.27 sec)
Examples/sec (First only) 1808.32 ex/sec (total: 32 ex, 0.02 sec)
Examples/sec (First excluded) 237577.05 ex/sec (total: 59968 ex, 0.25 sec)
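
  • Do not forget to normalize the results per batch size with the batch_size= kwarg.
  • In the summary, the first warmup batch is reported separately from the others to capture the extra setup time of tf.data.Dataset (e.g. buffer initialization, …).
  • Notice how the second iteration is much faster, due to TFDS auto-caching.
  • tfds.benchmark returns a tfds.core.BenchmarkResult, which can be inspected for further analysis.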
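To make the three summary lines concrete, here is a minimal sketch of the kind of measurement they describe: one pass over an iterable, with the first batch timed separately from the rest. It is a plain-Python illustration under that assumption, not the tfds.benchmark implementation. simple_benchmark and its result keys are hypothetical names.

```python
import time


def simple_benchmark(iterable, batch_size=1):
    """Time one pass over `iterable`, reporting the first batch separately."""
    start = time.perf_counter()
    first_end = None
    num_batches = 0
    for _ in iterable:
        if first_end is None:
            first_end = time.perf_counter()  # the first batch includes setup cost
        num_batches += 1
    end = time.perf_counter()
    if first_end is None:
        raise ValueError("iterable is empty")

    def rate(batches, seconds):
        # Guard against a zero duration on very fast iterables.
        return batches * batch_size / seconds if seconds > 0 else float("inf")

    return {
        "num_examples": num_batches * batch_size,
        "first_included_ex_per_sec": rate(num_batches, end - start),
        "first_only_ex_per_sec": rate(1, first_end - start),
        "first_excluded_ex_per_sec": rate(num_batches - 1, end - first_end),
    }


# 1875 batches of 32 examples, matching the 60000-example mnist train split.
stats = simple_benchmark(range(1875), batch_size=32)
print(stats["num_examples"])  # 60000
```

Multiplying the batch count by batch_size is what turns a batches-per-second figure into examples per second, which is why the batch_size= argument matters when the dataset is batched.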

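Build end-to-end pipeline

To go further, you can look at:

Visualization

tfds.as_dataframe

tf.data.Dataset objects can be converted to pandas.DataFrame with tfds.as_dataframe to be visualized on Colab.

  • Add the tfds.core.DatasetInfo as the second argument of tfds.as_dataframe to visualize images, audio, text, videos, …
  • Use ds.take(x) to display only the first x examples. pandas.DataFrame will load the full dataset in memory and can be very expensive to display.
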
ds, info = tfds.load('mnist', split='train', with_info=True)

tfds.as_dataframe(ds.take(4), info)

tfds.show_examples

tfds.show_examples returns a matplotlib.figure.Figure (only image datasets are supported for now):

ds, info = tfds.load('mnist', split='train', with_info=True)

fig = tfds.show_examples(ds, info)


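Access the dataset metadata

All builders include a tfds.core.DatasetInfo object containing the dataset metadata.

It can be accessed through either the tfds.load API or the tfds.core.DatasetBuilder API:

# Option 1: the tfds.load API
ds, info = tfds.load('mnist', with_info=True)
# Option 2: the tfds.core.DatasetBuilder API
builder = tfds.builder('mnist')
info = builder.info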

The dataset info contains additional information about the dataset (version, citation, homepage, description, …).

print(info)
tfds.core.DatasetInfo(
    name='mnist',
    full_name='mnist/3.0.1',
    description="""
    The MNIST database of handwritten digits.
    """,
    homepage='http://yann.lecun.com/exdb/mnist/',
    data_path='gs://tensorflow-datasets/datasets/mnist/3.0.1',
    file_format=tfrecord,
    download_size=11.06 MiB,
    dataset_size=21.00 MiB,
    features=FeaturesDict({
        'image': Image(shape=(28, 28, 1), dtype=tf.uint8),
        'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=10),
    }),
    supervised_keys=('image', 'label'),
    disable_shuffling=False,
    splits={
        'test': <SplitInfo num_examples=10000, num_shards=1>,
        'train': <SplitInfo num_examples=60000, num_shards=1>,
    },
    citation="""@article{lecun2010mnist,
      title={MNIST handwritten digit database},
      author={LeCun, Yann and Cortes, Corinna and Burges, CJ},
      journal={ATT Labs [Online]. Available: http://yann.lecun.com/exdb/mnist},
      volume={2},
      year={2010}
    }""",
)

Features metadata (label names, image shape, …)

Access the tfds.features.FeaturesDict:

info.features
FeaturesDict({
    'image': Image(shape=(28, 28, 1), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=10),
})

Number of classes, label names:

print(info.features["label"].num_classes)
print(info.features["label"].names)
print(info.features["label"].int2str(7))  # Human readable version (8 -> 'cat')
print(info.features["label"].str2int('7'))
10
['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']
7
7
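The output above can look odd at first: for mnist the label names are the digit strings themselves, so int2str(7) gives '7' and str2int('7') gives 7. The toy class below sketches the underlying idea of a two-way mapping between integer ids and names. SimpleClassLabel is a hypothetical stand-in for illustration, not the TFDS ClassLabel implementation.

```python
class SimpleClassLabel:
    """Toy two-way mapping between integer ids and label names."""

    def __init__(self, names):
        self.names = list(names)
        # Reverse lookup table: name -> integer id.
        self._index = {name: i for i, name in enumerate(self.names)}

    @property
    def num_classes(self):
        return len(self.names)

    def int2str(self, value):
        return self.names[value]

    def str2int(self, name):
        return self._index[name]


# mnist-style labels: the names are the digits '0' through '9'.
label = SimpleClassLabel(str(d) for d in range(10))
print(label.num_classes, label.int2str(7), label.str2int("7"))  # 10 7 7
```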

Shapes, dtypes:

print(info.features.shape)
print(info.features.dtype)
print(info.features['image'].shape)
print(info.features['image'].dtype)
{'image': (28, 28, 1), 'label': ()}
{'image': tf.uint8, 'label': tf.int64}
(28, 28, 1)
<dtype: 'uint8'>

Split metadata (e.g. split names, number of examples, …)

Access the tfds.core.SplitDict:

print(info.splits)
{'test': <SplitInfo num_examples=10000, num_shards=1>, 'train': <SplitInfo num_examples=60000, num_shards=1>}

Available splits:

print(list(info.splits.keys()))
['test', 'train']

Get info on an individual split:

print(info.splits['train'].num_examples)
print(info.splits['train'].filenames)
print(info.splits['train'].num_shards)
60000
['mnist-train.tfrecord-00000-of-00001']
1

It also works with the subsplit API:

print(info.splits['train[15%:75%]'].num_examples)
print(info.splits['train[15%:75%]'].file_instructions)
36000
[FileInstruction(filename='gs://tensorflow-datasets/datasets/mnist/3.0.1/mnist-train.tfrecord-00000-of-00001', skip=9000, take=36000, num_examples=36000)]
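The skip=9000 and take=36000 values in the FileInstruction above follow from simple percentage arithmetic on the 60000-example train split. The sketch below reproduces them. percent_slice is a hypothetical helper, and it assumes each boundary is num_examples * pct / 100 rounded to the nearest integer. TFDS's exact rounding rules may differ for splits where the boundaries are not whole numbers.

```python
def percent_slice(num_examples, start_pct, end_pct):
    """Translate a 'split[start%:end%]' slice into (skip, take) counts.

    Assumption: boundaries are num_examples * pct / 100, rounded to the
    nearest integer. This is an illustration, not the TFDS implementation.
    """
    start = round(num_examples * start_pct / 100)
    end = round(num_examples * end_pct / 100)
    return start, end - start


# 'train[15%:75%]' on the 60000-example mnist train split.
skip, take = percent_slice(60000, 15, 75)
print(skip, take)  # 9000 36000
```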

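Troubleshooting

Manual download (if download fails)

If the download fails for some reason (e.g. offline, …), you can always download the data manually and place it in the manual_dir (defaults to ~/tensorflow_datasets/download/manual/).

To find out which URLs to download, look at:

Fixing NonMatchingChecksumError

TFDS ensures determinism by validating the checksums of the downloaded URLs. If NonMatchingChecksumError is raised, it might indicate:

  • The website may be down (e.g. 503 status code). Please check the URL.
  • For Google Drive URLs, try again later, as Drive sometimes rejects downloads when too many people access the same URL. See the related bug.
  • The original dataset files may have been updated. In this case, the TFDS dataset builder should be updated. Please open a new GitHub issue or pull request:
    • Register the new checksums with tfds build --register_checksums
    • If needed, update the dataset generation code.
    • Update the dataset VERSION
    • Update the dataset RELEASE_NOTES: what caused the checksums to change? Did some examples change?
    • Make sure the dataset can still be built.
    • Send us a pull request

Note: you can also inspect the downloaded files in ~/tensorflow_datasets/download/.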
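When debugging a checksum mismatch by hand, it can help to hash a downloaded file yourself and compare it with the recorded value. The sketch below shows one way to do that with the Python standard library, assuming the recorded checksum is a SHA-256 hex digest. sha256_of_file and verify_download are hypothetical helpers, not TFDS functions, and the temporary file stands in for a real download.

```python
import hashlib
import os
import tempfile


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large downloads need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_sha256):
    """Raise ValueError if the file's checksum differs from the recorded one."""
    actual = sha256_of_file(path)
    if actual != expected_sha256:
        raise ValueError(f"Checksum mismatch for {path}: got {actual}")
    return actual


# Demo on a temporary file standing in for a downloaded archive.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"example payload")
    demo_path = tmp.name

recorded = hashlib.sha256(b"example payload").hexdigest()
checksum_ok = verify_download(demo_path, recorded) == recorded
os.remove(demo_path)
print(checksum_ok)  # True
```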

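Citation

If you're using tensorflow-datasets for a paper, please include the following citation, in addition to any citation specific to the datasets used (which can be found in the dataset catalog).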

@misc{TFDS,
  title = { {TensorFlow Datasets}, A collection of ready-to-use datasets},
  howpublished = {\url{https://tensorflow.google.cn/datasets}},
}