wiki_auto

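Description:

WikiAuto provides a set of aligned sentences from English Wikipedia and Simple English Wikipedia as a resource for training sentence simplification systems. The authors first crowd-sourced a set of manual alignments between sentences in a subset of Simple English Wikipedia and their corresponding versions in English Wikipedia (this corresponds to the `manual` config), then trained a neural CRF system to predict these alignments. The trained model was then applied to the other articles in Simple English Wikipedia that have an English Wikipedia counterpart, creating a larger corpus of aligned sentences (corresponding to the `auto`, `auto_acl`, `auto_full_no_split`, and `auto_full_with_split` configs here).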

Homepage: https://github.com/chaojiang06/wiki-auto
Source code: tfds.text_simplification.wiki_auto.WikiAuto
Versions:
- 1.0.0 (default): Initial release.
Supervised keys (see as_supervised doc): None
Figure (tfds.show_examples): Not supported.
Citation:

@inproceedings{acl/JiangMLZX20,
  author    = {Chao Jiang and
               Mounica Maddela and
               Wuwei Lan and
               Yang Zhong and
               Wei Xu},
  editor    = {Dan Jurafsky and
               Joyce Chai and
               Natalie Schluter and
               Joel R. Tetreault},
  title     = {Neural {CRF} Model for Sentence Alignment in Text Simplification},
  booktitle = {Proceedings of the 58th Annual Meeting of the Association for Computational
               Linguistics, {ACL} 2020, Online, July 5-10, 2020},
  pages     = {7943--7960},
  publisher = {Association for Computational Linguistics},
  year      = {2020},
  url       = {https://www.aclweb.org/anthology/2020.acl-main.709/}
}
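
The configs named in the description are selected by appending the config name to the dataset name. The following is a minimal sketch, assuming `tensorflow_datasets` is installed. The helper names `full_name` and `load_config` are hypothetical, and the TFDS import is deferred so that the name helper works without it.

```python
DATASET = "wiki_auto"
CONFIGS = (
    "manual",
    "auto_acl",
    "auto_full_no_split",
    "auto_full_with_split",
    "auto",
)


def full_name(config: str) -> str:
    """Build the 'dataset/config' name that tfds.load accepts."""
    if config not in CONFIGS:
        raise ValueError(f"unknown wiki_auto config: {config!r}")
    return f"{DATASET}/{config}"


def load_config(config: str, split: str):
    """Load one split of one config (downloads the data on first use)."""
    import tensorflow_datasets as tfds  # deferred: only needed when loading

    # with_info=True makes tfds.load return (dataset, DatasetInfo).
    return tfds.load(full_name(config), split=split, with_info=True)
```

A call such as `ds, info = load_config("manual", "dev")` would then yield feature dictionaries. Because the supervised keys are `None`, `as_supervised=True` is not an option for this dataset.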

wiki_auto/manual (default config)

Config description: A set of 10,000 Wikipedia sentence pairs aligned by crowd workers.
Download size: 53.47 MiB
Dataset size: 76.87 MiB
Auto-cached (documentation): Yes
Splits:

Split	Examples
`'dev'`	73,249
`'test'`	118,074

Feature structure:

FeaturesDict({
    'GLEU-score': float64,
    'alignment_label': ClassLabel(shape=(), dtype=int64, num_classes=3),
    'normal_sentence': Text(shape=(), dtype=string),
    'normal_sentence_id': Text(shape=(), dtype=string),
    'simple_sentence': Text(shape=(), dtype=string),
    'simple_sentence_id': Text(shape=(), dtype=string),
})
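
The `manual` config carries an `alignment_label` and a `GLEU-score` for each pair, so a natural first step is to keep only pairs with one label and a score above a threshold. The following is a minimal sketch over plain Python dicts, for example the items yielded by `tfds.as_numpy`. The helper name `keep_examples` and the threshold are hypothetical. This page does not say what the three label ids mean, so check `info.features["alignment_label"].names` before choosing one.

```python
from typing import Any, Dict, Iterable, Iterator


def keep_examples(
    examples: Iterable[Dict[str, Any]],
    label_id: int,
    min_gleu: float = 0.0,
) -> Iterator[Dict[str, Any]]:
    """Yield examples with the given alignment_label and GLEU-score >= min_gleu."""
    for ex in examples:
        same_label = int(ex["alignment_label"]) == label_id
        good_score = float(ex["GLEU-score"]) >= min_gleu
        if same_label and good_score:
            yield ex
```

The function is lazy, so it can be chained over a full split without holding every example in memory.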

Feature documentation:

Feature	Class	Dtype
	FeaturesDict
GLEU-score	Tensor	float64
alignment_label	ClassLabel	int64
normal_sentence	Text	string
normal_sentence_id	Text	string
simple_sentence	Text	string
simple_sentence_id	Text	string

wiki_auto/auto_acl

Config description: Sentence pairs aligned to train the ACL2020 system.
Download size: 112.60 MiB
Dataset size: 138.83 MiB
Auto-cached (documentation): Only when shuffle_files=False (full)
Splits:

Split	Examples
`'full'`	488,332

Feature structure:

FeaturesDict({
    'normal_sentence': Text(shape=(), dtype=string),
    'simple_sentence': Text(shape=(), dtype=string),
})
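
With only `normal_sentence` and `simple_sentence` present, each example maps directly to a (source, target) pair for a simplification model. The following is a minimal sketch, assuming the examples arrive as dicts whose text values are `bytes` (as TFDS returns them in NumPy form) or `str`. The helper names `as_text` and `simplification_pairs` are hypothetical.

```python
from typing import Any, Dict, Iterable, Iterator, Tuple


def as_text(value: Any) -> str:
    """Decode bytes from a Text feature; pass str values through."""
    return value.decode("utf-8") if isinstance(value, bytes) else str(value)


def simplification_pairs(
    examples: Iterable[Dict[str, Any]],
) -> Iterator[Tuple[str, str]]:
    """Yield (normal_sentence, simple_sentence) as (source, target) strings."""
    for ex in examples:
        yield as_text(ex["normal_sentence"]), as_text(ex["simple_sentence"])
```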

Feature documentation:

Feature	Class	Dtype
	FeaturesDict
normal_sentence	Text	string
simple_sentence	Text	string

wiki_auto/auto_full_no_split

Config description: All automatically aligned sentence pairs, without sentence splitting.
Download size: 135.02 MiB
Dataset size: 166.78 MiB
Auto-cached (documentation): Only when shuffle_files=False (full)
Splits:

Split	Examples
`'full'`	591,994

Feature structure:

FeaturesDict({
    'normal_sentence': Text(shape=(), dtype=string),
    'simple_sentence': Text(shape=(), dtype=string),
})

Feature documentation:

Feature	Class	Dtype
	FeaturesDict
normal_sentence	Text	string
simple_sentence	Text	string

wiki_auto/auto_full_with_split

Config description: All automatically aligned sentence pairs, with sentence splitting.
Download size: 115.09 MiB
Dataset size: 141.20 MiB
Auto-cached (documentation): Only when shuffle_files=False (full)
Splits:

Split	Examples
`'full'`	483,801

Feature structure:

FeaturesDict({
    'normal_sentence': Text(shape=(), dtype=string),
    'simple_sentence': Text(shape=(), dtype=string),
})

Feature documentation:

Feature	Class	Dtype
	FeaturesDict
normal_sentence	Text	string
simple_sentence	Text	string

wiki_auto/auto

Config description: A large set of automatically aligned sentence pairs.
Download size: 2.01 GiB
Dataset size: 1.76 GiB
Auto-cached (documentation): No
Splits:

Split	Examples
`'part_1'`	125,059
`'part_2'`	13,036
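
The `auto` config is published as two splits, and TFDS split strings can be concatenated with `+`. The following is a small sketch of building such a string. The helper name `combined_split` is hypothetical.

```python
AUTO_SPLITS = ("part_1", "part_2")


def combined_split(*names: str) -> str:
    """Join split names with '+', the TFDS syntax for concatenating splits."""
    if not names:
        raise ValueError("at least one split name is required")
    return "+".join(names)


# Pass the result as the `split` argument of tfds.load.
split_arg = combined_split(*AUTO_SPLITS)
```

Here `split_arg` is `"part_1+part_2"`, which would load both parts as a single dataset when used as `tfds.load("wiki_auto/auto", split=split_arg)`.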

Feature structure:

FeaturesDict({
    'example_id': Text(shape=(), dtype=string),
    'normal': FeaturesDict({
        'normal_article_content': Sequence({
            'normal_sentence': Text(shape=(), dtype=string),
            'normal_sentence_id': Text(shape=(), dtype=string),
        }),
        'normal_article_id': int32,
        'normal_article_title': Text(shape=(), dtype=string),
        'normal_article_url': Text(shape=(), dtype=string),
    }),
    'paragraph_alignment': Sequence({
        'normal_paragraph_id': Text(shape=(), dtype=string),
        'simple_paragraph_id': Text(shape=(), dtype=string),
    }),
    'sentence_alignment': Sequence({
        'normal_sentence_id': Text(shape=(), dtype=string),
        'simple_sentence_id': Text(shape=(), dtype=string),
    }),
    'simple': FeaturesDict({
        'simple_article_content': Sequence({
            'simple_sentence': Text(shape=(), dtype=string),
            'simple_sentence_id': Text(shape=(), dtype=string),
        }),
        'simple_article_id': int32,
        'simple_article_title': Text(shape=(), dtype=string),
        'simple_article_url': Text(shape=(), dtype=string),
    }),
})
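
Unlike the other configs, `auto` stores whole article pairs plus id-based alignments, so sentence pairs have to be reconstructed by joining `sentence_alignment` against the two `*_article_content` sequences. The following is a minimal sketch under two assumptions. First, each `Sequence` of a dict is exposed as a dict of parallel lists, which is how TFDS presents such features. Second, the ids in `sentence_alignment` match the `*_sentence_id` values in the article content, which is inferred from the shared field names and not stated on this page. The helper names are hypothetical.

```python
from typing import Any, Dict, List, Tuple


def as_text(value: Any) -> str:
    """Decode bytes from a Text feature; pass str values through."""
    return value.decode("utf-8") if isinstance(value, bytes) else str(value)


def aligned_pairs(example: Dict[str, Any]) -> List[Tuple[str, str]]:
    """Return (normal_sentence, simple_sentence) pairs for one article pair."""
    normal = example["normal"]["normal_article_content"]
    simple = example["simple"]["simple_article_content"]

    # Index each article's sentences by sentence id.
    normal_by_id = {
        as_text(i): as_text(s)
        for i, s in zip(normal["normal_sentence_id"], normal["normal_sentence"])
    }
    simple_by_id = {
        as_text(i): as_text(s)
        for i, s in zip(simple["simple_sentence_id"], simple["simple_sentence"])
    }

    alignment = example["sentence_alignment"]
    pairs = []
    for n_id, s_id in zip(
        alignment["normal_sentence_id"], alignment["simple_sentence_id"]
    ):
        n_id, s_id = as_text(n_id), as_text(s_id)
        # Skip alignments whose ids are missing from either article.
        if n_id in normal_by_id and s_id in simple_by_id:
            pairs.append((normal_by_id[n_id], simple_by_id[s_id]))
    return pairs
```

Silently skipping unmatched ids is a design choice for this sketch. Raising an error instead would surface any mismatch between the alignment and the article content.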

Feature documentation:

Feature	Class	Dtype
	FeaturesDict
example_id	Text	string
normal	FeaturesDict
normal/normal_article_content	Sequence
normal/normal_article_content/normal_sentence	Text	string
normal/normal_article_content/normal_sentence_id	Text	string
normal/normal_article_id	Tensor	int32
normal/normal_article_title	Text	string
normal/normal_article_url	Text	string
paragraph_alignment	Sequence
paragraph_alignment/normal_paragraph_id	Text	string
paragraph_alignment/simple_paragraph_id	Text	string
sentence_alignment	Sequence
sentence_alignment/normal_sentence_id	Text	string
sentence_alignment/simple_sentence_id	Text	string
simple	FeaturesDict
simple/simple_article_content	Sequence
simple/simple_article_content/simple_sentence	Text	string
simple/simple_article_content/simple_sentence_id	Text	string
simple/simple_article_id	Tensor	int32
simple/simple_article_title	Text	string
simple/simple_article_url	Text	string
