References:
all_languages
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/all_languages')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 1926192 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
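The schema above suggests that rows sharing a `paraphrase_set_id` belong to the same paraphrase set. The following is a minimal sketch of regrouping rows on that key, assuming that reading of the field is correct. The sample records, their IDs and sentences, and the helper name `group_paraphrases` are hypothetical and are not taken from the corpus.

```python
from collections import defaultdict

# Hypothetical records shaped like the feature schema above; the IDs and
# sentences are invented for illustration and do not come from the corpus.
records = [
    {"paraphrase_set_id": "1", "sentence_id": "10", "paraphrase": "Hello.",
     "lists": [], "tags": [], "language": "en"},
    {"paraphrase_set_id": "1", "sentence_id": "11", "paraphrase": "Hi.",
     "lists": [], "tags": [], "language": "en"},
    {"paraphrase_set_id": "2", "sentence_id": "12", "paraphrase": "Thanks.",
     "lists": [], "tags": [], "language": "en"},
]


def group_paraphrases(rows):
    """Map each paraphrase_set_id to the list of sentences that share it."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["paraphrase_set_id"]].append(row["paraphrase"])
    return dict(groups)


paraphrase_sets = group_paraphrases(records)
```

The same function should work on any iterable of dictionaries with these keys, whichever loader produced them.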
af
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/af')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 307 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
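The description says that a graph of sentences and equivalence links is traversed to extract paraphrase sets, without spelling out the traversal. As a rough illustration only, the sketch below assumes that a candidate set corresponds to a connected component of that graph. It does not reproduce the filtering and pruning steps, and the sentence IDs, links, and function name are hypothetical.

```python
from collections import defaultdict


def connected_components(nodes, links):
    """Return the connected components of an undirected graph as sorted lists."""
    adjacency = defaultdict(set)
    for a, b in links:
        adjacency[a].add(b)
        adjacency[b].add(a)

    seen = set()
    components = []
    for start in nodes:
        if start in seen:
            continue
        # Depth-first traversal from an unvisited node collects one component.
        seen.add(start)
        stack = [start]
        component = []
        while stack:
            node = stack.pop()
            component.append(node)
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        components.append(sorted(component))
    return components


# Hypothetical sentence IDs and "means the same thing" links.
sentences = ["s1", "s2", "s3", "s4"]
links = [("s1", "s2"), ("s2", "s3")]

components = connected_components(sentences, links)
# Keep only components with at least two sentences as candidate paraphrase sets.
candidate_sets = [c for c in components if len(c) > 1]
```

Here `s4` has no links, so it forms a singleton component and is dropped from the candidate sets.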
ar
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/ar')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 6446 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
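The per-language sections on this page differ only in the config suffix of the dataset name. The sketch below builds that name from a suffix. The helper `tapaco_name` and the `CONFIGS` list are hypothetical conveniences, and the list covers only the configs shown in this section, not all 73 languages.

```python
# Config suffixes that appear in this section of the page (not the full set).
CONFIGS = [
    "all_languages", "af", "ar", "az", "be", "ber", "bg", "bn", "br", "ca",
    "cbk", "cmn", "cs", "da", "de", "el", "en", "eo", "es", "et",
]


def tapaco_name(config):
    """Build the dataset name string used in the load commands on this page."""
    if config not in CONFIGS:
        raise ValueError(f"unknown config for this sketch: {config!r}")
    return f"huggingface:tapaco/{config}"


name = tapaco_name("af")
```

The resulting string can be passed to `tfds.load` in the same way as in the commands shown for each config.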
az
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/az')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 624 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
be
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/be')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 1512 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
ber
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/ber')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 67484 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
bg
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/bg')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 6324 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
bn
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/bn')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 1440 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
br
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/br')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 2536 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
ca
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/ca')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 518 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
cbk
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/cbk')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 262 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
cmn
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/cmn')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 12549 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
cs
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/cs')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 6659 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
da
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/da')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 11220 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
de
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/de')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 125091 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
el
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/el')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 10072 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
en
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/en')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 158053 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
eo
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/eo')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 207105 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
es
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/es')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 85064 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
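The description above says the corpus is built by linking sentences that mean the same thing in a graph and then traversing that graph to extract paraphrase sets. One way to picture this step is as a connected-components computation followed by a split by language. The sketch below uses union-find for that purpose. It is an illustration under stated assumptions, not the authors' pipeline: the input format, the function name `paraphrase_sets`, and the sample sentences are all hypothetical, and the filtering and pruning steps mentioned in the description are omitted.

```python
from collections import defaultdict


def paraphrase_sets(sentences, links):
    """Group same-language sentences that fall in one connected component.

    sentences: dict mapping sentence id -> (language, text)   (assumed format)
    links:     iterable of (id_a, id_b) equivalence links      (assumed format)
    Returns a sorted list of (language, sorted texts) for sets of size >= 2.
    """
    parent = {sid: sid for sid in sentences}

    def find(x):
        # Follow parent pointers to the root, halving the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Merge the components of every linked pair.
    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Within each component, bucket sentences by language.
    by_component = defaultdict(lambda: defaultdict(list))
    for sid, (lang, text) in sentences.items():
        by_component[find(sid)][lang].append(text)

    result = []
    for langs in by_component.values():
        for lang, texts in langs.items():
            if len(texts) >= 2:  # a single sentence has no paraphrase
                result.append((lang, sorted(texts)))
    return sorted(result)


# Invented example: two English sentences linked through one Spanish sentence.
sentences = {
    1: ("en", "Thank you."),
    2: ("es", "Gracias."),
    3: ("en", "Thanks."),
    4: ("es", "Hola."),
    5: ("en", "Hello."),
}
links = [(1, 2), (2, 3), (4, 5)]
sets_found = paraphrase_sets(sentences, links)
```

In this toy input, the two English sentences end up in the same component because both are linked to the same Spanish sentence, while the second component contains only one sentence per language and therefore yields no paraphrase set.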
et
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/et')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 241 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
eu
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/eu')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 573 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
fi
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/fi')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 31753 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
fr
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/fr')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 116733 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
gl
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/gl')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 351 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
gos
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/gos')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 279 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
he
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/he')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 68350 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
hi
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/hi')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 1913 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
hr
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/hr')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 505 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
hu
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/hu')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 67964 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
hy
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/hy')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 603 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
ia
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/ia')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 2548 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
id
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/id')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 1602 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
ie
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/ie')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 488 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
io
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/io')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 480 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
is
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/is')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 1641 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
it
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/it')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 198919 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
ja
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/ja')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 44267 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
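The feature schema above describes one flat record per sentence, so paraphrase sets have to be rebuilt by grouping on `paraphrase_set_id`. The following is a minimal sketch in plain Python, assuming records arrive as dictionaries with the field names from the schema. The sample records and the helper `group_paraphrases` are hypothetical and are not taken from the dataset or from any library.

```python
from collections import defaultdict

# Hypothetical records shaped like the feature schema above.
# The values are invented for illustration only.
records = [
    {"paraphrase_set_id": "1", "sentence_id": "10", "paraphrase": "first wording",
     "lists": [], "tags": [], "language": "ja"},
    {"paraphrase_set_id": "1", "sentence_id": "11", "paraphrase": "second wording",
     "lists": [], "tags": [], "language": "ja"},
    {"paraphrase_set_id": "2", "sentence_id": "12", "paraphrase": "another sentence",
     "lists": [], "tags": [], "language": "ja"},
]


def group_paraphrases(rows):
    """Map each paraphrase_set_id to the list of its paraphrase strings."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["paraphrase_set_id"]].append(row["paraphrase"])
    return dict(groups)


paraphrase_sets = group_paraphrases(records)
for set_id, sentences in paraphrase_sets.items():
    print(set_id, sentences)
```

Records loaded through `tfds.load` may expose string fields as bytes, in which case they would need decoding before grouping.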
jbo
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/jbo')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 2704 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
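The description says the corpus is built by populating a graph with sentences and equivalence links and then traversing it to extract paraphrase sets. The sketch below illustrates that general idea with a union-find over toy data. It is not the corpus authors' implementation: the function `extract_paraphrase_sets`, the toy sentence IDs, and the assumption that links may connect sentences across languages are all hypothetical, and the filtering and pruning steps mentioned in the description are omitted.

```python
from collections import defaultdict


def extract_paraphrase_sets(sentences, links):
    """Group same-language sentences that fall in one connected component.

    sentences: dict mapping sentence_id -> language code
    links: iterable of (sentence_id_a, sentence_id_b) equivalence links
    """
    parent = {sid: sid for sid in sentences}

    def find(sid):
        # Follow parent pointers to the component root, halving the path.
        while parent[sid] != sid:
            parent[sid] = parent[parent[sid]]
            sid = parent[sid]
        return sid

    # Merge the components of every linked pair.
    for a, b in links:
        parent[find(a)] = find(b)

    # Within each component, bucket sentences by language.
    clusters = defaultdict(list)
    for sid, lang in sentences.items():
        clusters[(find(sid), lang)].append(sid)

    # A paraphrase set needs at least two sentences in the same language.
    return sorted(sorted(group) for group in clusters.values() if len(group) > 1)


# Toy data (assumed): s1 and s2 are linked only through s3.
toy_sentences = {"s1": "en", "s2": "en", "s3": "fr", "s4": "en", "s5": "fr"}
toy_links = [("s1", "s3"), ("s3", "s2"), ("s4", "s5")]
result = extract_paraphrase_sets(toy_sentences, toy_links)
print(result)
```

In this toy graph, `s1` and `s2` end up in one set because both link to `s3`, while `s4` and `s5` form no set because they are in different languages.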
kab
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/kab')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 15944 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
ko
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/ko')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 503 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
kw
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/kw')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 1328 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
la
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/la')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 6889 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
lfn
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/lfn')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 2313 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
lt
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/lt')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 8042 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
mk
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/mk')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 14678 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
mr
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/mr')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 16413 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
nb
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/nb')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 1094 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
nds
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/nds')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 2633 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
nl
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/nl')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 23561 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
orv
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/orv')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 471 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
ota
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/ota')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 486 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
pes
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/pes')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 4285 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
pl
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/pl')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 22391 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
pt
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/pt')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 78430 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
rn
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/rn')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 648 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
ro
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/ro')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 2092 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
ru
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/ru')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 251263 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
sl
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/sl')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 706 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
sr
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/sr')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 8175 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
sv
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/sv')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 7005 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
tk
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/tk')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 1165 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
tl
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/tl')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 1017 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
tlh
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/tlh')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 2804 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
toki
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/toki')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 3738 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
tr
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/tr')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 142088 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
tt
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/tt')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 2398 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
ug
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/ug')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 1183 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
uk
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/uk')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 54431 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
ur
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/ur')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 252 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
vi
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/vi')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 962 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
vo
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/vo')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 328 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
war
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/war')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 327 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
wuu
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/wuu')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 408 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
yue
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/yue')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 561 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}