타파코

참고자료:

모든_언어

TFDS에 이 데이터세트를 로드하려면 다음 명령어를 사용하세요.

ds = tfds.load('huggingface:tapaco/all_languages')
  • 설명 :
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences meaning the same thing. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200  250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • 라이센스 : 크리에이티브 커먼즈 저작자표시 2.0 일반
  • 버전 : 1.0.0
  • 분할 :
나뉘다
'train' 1926192
  • 특징 :
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

아프

TFDS에 이 데이터세트를 로드하려면 다음 명령어를 사용하세요.

ds = tfds.load('huggingface:tapaco/af')
  • 설명 :
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences meaning the same thing. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200  250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • 라이센스 : 크리에이티브 커먼즈 저작자표시 2.0 일반
  • 버전 : 1.0.0
  • 분할 :
나뉘다
'train' 307
  • 특징 :
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

아르

TFDS에 이 데이터세트를 로드하려면 다음 명령어를 사용하세요.

ds = tfds.load('huggingface:tapaco/ar')
  • 설명 :
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences meaning the same thing. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200  250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • 라이센스 : 크리에이티브 커먼즈 저작자표시 2.0 일반
  • 버전 : 1.0.0
  • 분할 :
나뉘다
'train' 6446
  • 특징 :
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

아즈

TFDS에 이 데이터세트를 로드하려면 다음 명령어를 사용하세요.

ds = tfds.load('huggingface:tapaco/az')
  • 설명 :
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences meaning the same thing. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200  250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • 라이센스 : 크리에이티브 커먼즈 저작자표시 2.0 일반
  • 버전 : 1.0.0
  • 분할 :
나뉘다
'train' 624
  • 특징 :
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

BE

TFDS에 이 데이터세트를 로드하려면 다음 명령어를 사용하세요.

ds = tfds.load('huggingface:tapaco/be')
  • 설명 :
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences meaning the same thing. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200  250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • 라이센스 : 크리에이티브 커먼즈 저작자표시 2.0 일반
  • 버전 : 1.0.0
  • 분할 :
나뉘다
'train' 1512
  • 특징 :
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

베르

TFDS에 이 데이터세트를 로드하려면 다음 명령어를 사용하세요.

ds = tfds.load('huggingface:tapaco/ber')
  • 설명 :
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences meaning the same thing. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200  250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • 라이센스 : 크리에이티브 커먼즈 저작자표시 2.0 일반
  • 버전 : 1.0.0
  • 분할 :
나뉘다
'train' 67484
  • 특징 :
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

bg

TFDS에 이 데이터세트를 로드하려면 다음 명령어를 사용하세요.

ds = tfds.load('huggingface:tapaco/bg')
  • 설명 :
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences meaning the same thing. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200  250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • 라이센스 : 크리에이티브 커먼즈 저작자표시 2.0 일반
  • 버전 : 1.0.0
  • 분할 :
나뉘다
'train' 6324
  • 특징 :
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

TFDS에 이 데이터세트를 로드하려면 다음 명령어를 사용하세요.

ds = tfds.load('huggingface:tapaco/bn')
  • 설명 :
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences meaning the same thing. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200  250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • 라이센스 : 크리에이티브 커먼즈 저작자표시 2.0 일반
  • 버전 : 1.0.0
  • 분할 :
나뉘다
'train' 1440
  • 특징 :
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

br

TFDS에 이 데이터세트를 로드하려면 다음 명령어를 사용하세요.

ds = tfds.load('huggingface:tapaco/br')
  • 설명 :
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences meaning the same thing. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200  250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • 라이센스 : 크리에이티브 커먼즈 저작자표시 2.0 일반
  • 버전 : 1.0.0
  • 분할 :
나뉘다
'train' 2536
  • 특징 :
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

캘리포니아

TFDS에 이 데이터세트를 로드하려면 다음 명령어를 사용하세요.

ds = tfds.load('huggingface:tapaco/ca')
  • 설명 :
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences meaning the same thing. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200  250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • 라이센스 : 크리에이티브 커먼즈 저작자표시 2.0 일반
  • 버전 : 1.0.0
  • 분할 :
나뉘다
'train' 518
  • 특징 :
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

CBK

TFDS에 이 데이터세트를 로드하려면 다음 명령어를 사용하세요.

ds = tfds.load('huggingface:tapaco/cbk')
  • 설명 :
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences meaning the same thing. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200  250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • 라이센스 : 크리에이티브 커먼즈 저작자표시 2.0 일반
  • 버전 : 1.0.0
  • 분할 :
나뉘다
'train' 262
  • 특징 :
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

cmn

TFDS에 이 데이터세트를 로드하려면 다음 명령어를 사용하세요.

ds = tfds.load('huggingface:tapaco/cmn')
  • 설명 :
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences meaning the same thing. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200  250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • 라이센스 : 크리에이티브 커먼즈 저작자표시 2.0 일반
  • 버전 : 1.0.0
  • 분할 :
나뉘다
'train' 12549
  • 특징 :
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

CS

TFDS에 이 데이터세트를 로드하려면 다음 명령어를 사용하세요.

ds = tfds.load('huggingface:tapaco/cs')
  • 설명 :
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences meaning the same thing. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200  250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • 라이센스 : 크리에이티브 커먼즈 저작자표시 2.0 일반
  • 버전 : 1.0.0
  • 분할 :
나뉘다
'train' 6659
  • 특징 :
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

TFDS에 이 데이터세트를 로드하려면 다음 명령어를 사용하세요.

ds = tfds.load('huggingface:tapaco/da')
  • 설명 :
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences meaning the same thing. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200  250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • 라이센스 : 크리에이티브 커먼즈 저작자표시 2.0 일반
  • 버전 : 1.0.0
  • 분할 :
나뉘다
'train' 11220
  • 특징 :
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

TFDS에 이 데이터세트를 로드하려면 다음 명령어를 사용하세요.

ds = tfds.load('huggingface:tapaco/de')
  • 설명 :
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences meaning the same thing. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200  250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • 라이센스 : 크리에이티브 커먼즈 저작자표시 2.0 일반
  • 버전 : 1.0.0
  • 분할 :
나뉘다
'train' 125091
  • 특징 :
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

엘자

TFDS에 이 데이터세트를 로드하려면 다음 명령어를 사용하세요.

ds = tfds.load('huggingface:tapaco/el')
  • 설명 :
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences meaning the same thing. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200  250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • 라이센스 : 크리에이티브 커먼즈 저작자표시 2.0 일반
  • 버전 : 1.0.0
  • 분할 :
나뉘다
'train' 10072
  • 특징 :
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

ko

TFDS에 이 데이터세트를 로드하려면 다음 명령어를 사용하세요.

ds = tfds.load('huggingface:tapaco/en')
  • 설명 :
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences meaning the same thing. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200  250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • 라이센스 : 크리에이티브 커먼즈 저작자표시 2.0 일반
  • 버전 : 1.0.0
  • 분할 :
나뉘다
'train' 158053
  • 특징 :
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

TFDS에 이 데이터세트를 로드하려면 다음 명령어를 사용하세요.

ds = tfds.load('huggingface:tapaco/eo')
  • 설명 :
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences meaning the same thing. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200  250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • 라이센스 : 크리에이티브 커먼즈 저작자표시 2.0 일반
  • 버전 : 1.0.0
  • 분할 :
나뉘다
'train' 207105
  • 특징 :
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

TFDS에 이 데이터세트를 로드하려면 다음 명령어를 사용하세요.

ds = tfds.load('huggingface:tapaco/es')
  • 설명 :
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences meaning the same thing. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200  250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • 라이센스 : 크리에이티브 커먼즈 저작자표시 2.0 일반
  • 버전 : 1.0.0
  • 분할 :
나뉘다
'train' 85064
  • 특징 :
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

TFDS에 이 데이터세트를 로드하려면 다음 명령어를 사용하세요.

ds = tfds.load('huggingface:tapaco/et')
  • 설명 :
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences meaning the same thing. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200  250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • 라이센스 : 크리에이티브 커먼즈 저작자표시 2.0 일반
  • 버전 : 1.0.0
  • 분할 :
나뉘다
'train' 241
  • 특징 :
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

TFDS에 이 데이터세트를 로드하려면 다음 명령어를 사용하세요.

ds = tfds.load('huggingface:tapaco/eu')
  • 설명 :
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences meaning the same thing. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200  250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • 라이센스 : 크리에이티브 커먼즈 저작자표시 2.0 일반
  • 버전 : 1.0.0
  • 분할 :
나뉘다
'train' 573
  • 특징 :
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

fi

TFDS에 이 데이터세트를 로드하려면 다음 명령어를 사용하세요.

ds = tfds.load('huggingface:tapaco/fi')
  • 설명 :
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences meaning the same thing. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200  250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • 라이센스 : 크리에이티브 커먼즈 저작자표시 2.0 일반
  • 버전 : 1.0.0
  • 분할 :
나뉘다
'train' 31753
  • 특징 :
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

정말로

TFDS에 이 데이터세트를 로드하려면 다음 명령어를 사용하세요.

ds = tfds.load('huggingface:tapaco/fr')
  • 설명 :
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences meaning the same thing. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200  250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • 라이센스 : 크리에이티브 커먼즈 저작자표시 2.0 일반
  • 버전 : 1.0.0
  • 분할 :
나뉘다
'train' 116733
  • 특징 :
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

TFDS에 이 데이터세트를 로드하려면 다음 명령어를 사용하세요.

ds = tfds.load('huggingface:tapaco/gl')
  • 설명 :
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences meaning the same thing. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200  250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • 라이센스 : 크리에이티브 커먼즈 저작자표시 2.0 일반
  • 버전 : 1.0.0
  • 분할 :
나뉘다
'train' 351
  • 특징 :
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

고스

TFDS에 이 데이터세트를 로드하려면 다음 명령어를 사용하세요.

ds = tfds.load('huggingface:tapaco/gos')
  • 설명 :
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences meaning the same thing. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200  250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • 라이센스 : 크리에이티브 커먼즈 저작자표시 2.0 일반
  • 버전 : 1.0.0
  • 분할 :
나뉘다
'train' 279
  • 특징 :
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

TFDS에 이 데이터세트를 로드하려면 다음 명령어를 사용하세요.

ds = tfds.load('huggingface:tapaco/he')
  • 설명 :
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences meaning the same thing. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200  250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • 라이센스 : 크리에이티브 커먼즈 저작자표시 2.0 일반
  • 버전 : 1.0.0
  • 분할 :
나뉘다
'train' 68350
  • 특징 :
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

안녕

TFDS에 이 데이터세트를 로드하려면 다음 명령어를 사용하세요.

ds = tfds.load('huggingface:tapaco/hi')
  • 설명 :
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences meaning the same thing. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200  250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • 라이센스 : 크리에이티브 커먼즈 저작자표시 2.0 일반
  • 버전 : 1.0.0
  • 분할 :
나뉘다
'train' 1913년
  • 특징 :
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

시간

TFDS에 이 데이터세트를 로드하려면 다음 명령어를 사용하세요.

ds = tfds.load('huggingface:tapaco/hr')
  • 설명 :
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences meaning the same thing. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200  250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • 라이센스 : 크리에이티브 커먼즈 저작자표시 2.0 일반
  • 버전 : 1.0.0
  • 분할 :
나뉘다
'train' 505
  • 특징 :
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

TFDS에 이 데이터세트를 로드하려면 다음 명령어를 사용하세요.

ds = tfds.load('huggingface:tapaco/hu')
  • 설명 :
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences meaning the same thing. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200  250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • 라이센스 : 크리에이티브 커먼즈 저작자표시 2.0 일반
  • 버전 : 1.0.0
  • 분할 :
나뉘다
'train' 67964
  • 특징 :
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

안녕

TFDS에 이 데이터세트를 로드하려면 다음 명령어를 사용하세요.

ds = tfds.load('huggingface:tapaco/hy')
  • 설명 :
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences meaning the same thing. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200  250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • 라이센스 : 크리에이티브 커먼즈 저작자표시 2.0 일반
  • 버전 : 1.0.0
  • 분할 :
나뉘다
'train' 603
  • 특징 :
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

아이아

TFDS에 이 데이터세트를 로드하려면 다음 명령어를 사용하세요.

ds = tfds.load('huggingface:tapaco/ia')
  • 설명 :
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences meaning the same thing. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200  250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • 라이센스 : 크리에이티브 커먼즈 저작자표시 2.0 일반
  • 버전 : 1.0.0
  • 분할 :
나뉘다
'train' 2548
  • 특징 :
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

ID

TFDS에 이 데이터세트를 로드하려면 다음 명령어를 사용하세요.

ds = tfds.load('huggingface:tapaco/id')
  • 설명 :
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences meaning the same thing. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200  250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • 라이센스 : 크리에이티브 커먼즈 저작자표시 2.0 일반
  • 버전 : 1.0.0
  • 분할 :
나뉘다
'train' 1602
  • 특징 :
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

TFDS에 이 데이터세트를 로드하려면 다음 명령어를 사용하세요.

ds = tfds.load('huggingface:tapaco/ie')
  • 설명 :
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences meaning the same thing. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200  250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • 라이센스 : 크리에이티브 커먼즈 저작자표시 2.0 일반
  • 버전 : 1.0.0
  • 분할 :
나뉘다
'train' 488
  • 특징 :
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

이오

TFDS에 이 데이터세트를 로드하려면 다음 명령어를 사용하세요.

ds = tfds.load('huggingface:tapaco/io')
  • 설명 :
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences meaning the same thing. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200  250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • 라이센스 : 크리에이티브 커먼즈 저작자표시 2.0 일반
  • 버전 : 1.0.0
  • 분할 :
나뉘다
'train' 480
  • 특징 :
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

~이다

TFDS에 이 데이터세트를 로드하려면 다음 명령어를 사용하세요.

ds = tfds.load('huggingface:tapaco/is')
  • 설명 :
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences meaning the same thing. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200  250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • 라이센스 : 크리에이티브 커먼즈 저작자표시 2.0 일반
  • 버전 : 1.0.0
  • 분할 :
나뉘다
'train' 1641년
  • 특징 :
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

그것

TFDS에 이 데이터세트를 로드하려면 다음 명령어를 사용하세요.

ds = tfds.load('huggingface:tapaco/it')
  • 설명 :
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences meaning the same thing. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200  250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • 라이센스 : 크리에이티브 커먼즈 저작자표시 2.0 일반
  • 버전 : 1.0.0
  • 분할 :
나뉘다
'train' 198919
  • 특징 :
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

TFDS에 이 데이터세트를 로드하려면 다음 명령어를 사용하세요.

ds = tfds.load('huggingface:tapaco/ja')
  • 설명 :
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences meaning the same thing. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200  250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • 라이센스 : 크리에이티브 커먼즈 저작자표시 2.0 일반
  • 버전 : 1.0.0
  • 분할 :
나뉘다
'train' 44267
  • 특징 :
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

제이보

TFDS에 이 데이터세트를 로드하려면 다음 명령어를 사용하세요.

ds = tfds.load('huggingface:tapaco/jbo')
  • 설명 :
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences meaning the same thing. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200  250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • 라이센스 : 크리에이티브 커먼즈 저작자표시 2.0 일반
  • 버전 : 1.0.0
  • 분할 :
나뉘다
'train' 2704
  • 특징 :
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

카브

TFDS에 이 데이터세트를 로드하려면 다음 명령어를 사용하세요.

ds = tfds.load('huggingface:tapaco/kab')
  • 설명 :
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences meaning the same thing. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200  250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • 라이센스 : 크리에이티브 커먼즈 저작자표시 2.0 일반
  • 버전 : 1.0.0
  • 분할 :
나뉘다
'train' 15944
  • 특징 :
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

TFDS에 이 데이터세트를 로드하려면 다음 명령어를 사용하세요.

ds = tfds.load('huggingface:tapaco/ko')
  • 설명 :
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences meaning the same thing. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200  250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • 라이센스 : 크리에이티브 커먼즈 저작자표시 2.0 일반
  • 버전 : 1.0.0
  • 분할 :
나뉘다
'train' 503
  • 특징 :
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

kw

TFDS에 이 데이터세트를 로드하려면 다음 명령어를 사용하세요.

ds = tfds.load('huggingface:tapaco/kw')
  • 설명 :
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences meaning the same thing. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200  250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • 라이센스 : 크리에이티브 커먼즈 저작자표시 2.0 일반
  • 버전 : 1.0.0
  • 분할 :
나뉘다
'train' 1328
  • 특징 :
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

TFDS에 이 데이터세트를 로드하려면 다음 명령어를 사용하세요.

ds = tfds.load('huggingface:tapaco/la')
  • 설명 :
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences meaning the same thing. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200  250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • 라이센스 : 크리에이티브 커먼즈 저작자표시 2.0 일반
  • 버전 : 1.0.0
  • 분할 :
나뉘다
'train' 6889
  • 특징 :
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

lfn

TFDS에 이 데이터세트를 로드하려면 다음 명령어를 사용하세요.

ds = tfds.load('huggingface:tapaco/lfn')
  • 설명 :
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences meaning the same thing. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200  250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • 라이센스 : 크리에이티브 커먼즈 저작자표시 2.0 일반
  • 버전 : 1.0.0
  • 분할 :
나뉘다
'train' 2313
  • 특징 :
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

lt

TFDS에 이 데이터세트를 로드하려면 다음 명령어를 사용하세요.

ds = tfds.load('huggingface:tapaco/lt')
  • 설명 :
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences meaning the same thing. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200  250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • 라이센스 : 크리에이티브 커먼즈 저작자표시 2.0 일반
  • 버전 : 1.0.0
  • 분할 :
나뉘다
'train' 8042
  • 특징 :
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

MK

TFDS에 이 데이터세트를 로드하려면 다음 명령어를 사용하세요.

ds = tfds.load('huggingface:tapaco/mk')
  • 설명 :
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences meaning the same thing. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200  250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • 라이센스 : 크리에이티브 커먼즈 저작자표시 2.0 일반
  • 버전 : 1.0.0
  • 분할 :
나뉘다
'train' 14678
  • 특징 :
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

~ 씨

TFDS에 이 데이터세트를 로드하려면 다음 명령어를 사용하세요.

ds = tfds.load('huggingface:tapaco/mr')
  • 설명 :
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences meaning the same thing. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200  250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • 라이센스 : 크리에이티브 커먼즈 저작자표시 2.0 일반
  • 버전 : 1.0.0
  • 분할 :
나뉘다
'train' 16413
  • 특징 :
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

주의

TFDS에 이 데이터세트를 로드하려면 다음 명령어를 사용하세요.

ds = tfds.load('huggingface:tapaco/nb')
  • 설명 :
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences meaning the same thing. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200  250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • 라이센스 : 크리에이티브 커먼즈 저작자표시 2.0 일반
  • 버전 : 1.0.0
  • 분할 :
나뉘다
'train' 1094
  • 특징 :
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

nds

TFDS에 이 데이터세트를 로드하려면 다음 명령어를 사용하세요.

ds = tfds.load('huggingface:tapaco/nds')
  • 설명 :
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences meaning the same thing. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200  250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • 라이센스 : 크리에이티브 커먼즈 저작자표시 2.0 일반
  • 버전 : 1.0.0
  • 분할 :
나뉘다
'train' 2633
  • 특징 :
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

NL

TFDS에 이 데이터세트를 로드하려면 다음 명령어를 사용하세요.

ds = tfds.load('huggingface:tapaco/nl')
  • 설명 :
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences meaning the same thing. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200  250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • 라이센스 : 크리에이티브 커먼즈 저작자표시 2.0 일반
  • 버전 : 1.0.0
  • 분할 :
나뉘다
'train' 23561
  • 특징 :
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

오르브

TFDS에 이 데이터세트를 로드하려면 다음 명령어를 사용하세요.

ds = tfds.load('huggingface:tapaco/orv')
  • 설명 :
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences meaning the same thing. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200  250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • 라이센스 : 크리에이티브 커먼즈 저작자표시 2.0 일반
  • 버전 : 1.0.0
  • 분할 :
나뉘다
'train' 471
  • 특징 :
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

오타

TFDS에 이 데이터세트를 로드하려면 다음 명령어를 사용하세요.

ds = tfds.load('huggingface:tapaco/ota')
  • 설명 :
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences meaning the same thing. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200  250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • 라이센스 : 크리에이티브 커먼즈 저작자표시 2.0 일반
  • 버전 : 1.0.0
  • 분할 :
나뉘다
'train' 486
  • 특징 :
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

페스

TFDS에 이 데이터세트를 로드하려면 다음 명령어를 사용하세요.

ds = tfds.load('huggingface:tapaco/pes')
  • 설명 :
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences meaning the same thing. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200  250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • 라이센스 : 크리에이티브 커먼즈 저작자표시 2.0 일반
  • 버전 : 1.0.0
  • 분할 :
나뉘다
'train' 4285
  • 특징 :
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

pl

TFDS에 이 데이터세트를 로드하려면 다음 명령어를 사용하세요.

ds = tfds.load('huggingface:tapaco/pl')
  • 설명 :
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences meaning the same thing. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200  250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • 라이센스 : 크리에이티브 커먼즈 저작자표시 2.0 일반
  • 버전 : 1.0.0
  • 분할 :
나뉘다
'train' 22391
  • 특징 :
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

태평양 표준시

TFDS에 이 데이터세트를 로드하려면 다음 명령어를 사용하세요.

ds = tfds.load('huggingface:tapaco/pt')
  • 설명 :
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences meaning the same thing. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200  250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • 라이센스 : 크리에이티브 커먼즈 저작자표시 2.0 일반
  • 버전 : 1.0.0
  • 분할 :
나뉘다
'train' 78430
  • 특징 :
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

rn

TFDS에 이 데이터세트를 로드하려면 다음 명령어를 사용하세요.

ds = tfds.load('huggingface:tapaco/rn')
  • 설명 :
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences meaning the same thing. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200  250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • 라이센스 : 크리에이티브 커먼즈 저작자표시 2.0 일반
  • 버전 : 1.0.0
  • 분할 :
나뉘다
'train' 648
  • 특징 :
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

TFDS에 이 데이터세트를 로드하려면 다음 명령어를 사용하세요.

ds = tfds.load('huggingface:tapaco/ro')
  • 설명 :
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences meaning the same thing. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200  250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • 라이센스 : 크리에이티브 커먼즈 저작자표시 2.0 일반
  • 버전 : 1.0.0
  • 분할 :
나뉘다
'train' 2092
  • 특징 :
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

TFDS에 이 데이터세트를 로드하려면 다음 명령어를 사용하세요.

ds = tfds.load('huggingface:tapaco/ru')
  • 설명 :
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences meaning the same thing. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200  250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • 라이센스 : 크리에이티브 커먼즈 저작자표시 2.0 일반
  • 버전 : 1.0.0
  • 분할 :
나뉘다
'train' 251263
  • 특징 :
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

sl

TFDS에 이 데이터세트를 로드하려면 다음 명령어를 사용하세요.

ds = tfds.load('huggingface:tapaco/sl')
  • 설명 :
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences meaning the same thing. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200  250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • 라이센스 : 크리에이티브 커먼즈 저작자표시 2.0 일반
  • 버전 : 1.0.0
  • 분할 :
나뉘다
'train' 706
  • 특징 :
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

아저씨

TFDS에 이 데이터세트를 로드하려면 다음 명령어를 사용하세요.

ds = tfds.load('huggingface:tapaco/sr')
  • 설명 :
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences meaning the same thing. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200  250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • 라이센스 : 크리에이티브 커먼즈 저작자표시 2.0 일반
  • 버전 : 1.0.0
  • 분할 :
나뉘다
'train' 8175
  • 특징 :
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

세인트

TFDS에 이 데이터세트를 로드하려면 다음 명령어를 사용하세요.

ds = tfds.load('huggingface:tapaco/sv')
  • 설명 :
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences meaning the same thing. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200  250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • 라이센스 : 크리에이티브 커먼즈 저작자표시 2.0 일반
  • 버전 : 1.0.0
  • 분할 :
나뉘다
'train' 7005
  • 특징 :
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

ㅋㅋㅋ

TFDS에 이 데이터세트를 로드하려면 다음 명령어를 사용하세요.

ds = tfds.load('huggingface:tapaco/tk')
  • 설명 :
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences meaning the same thing. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200  250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • 라이센스 : 크리에이티브 커먼즈 저작자표시 2.0 일반
  • 버전 : 1.0.0
  • 분할 :
나뉘다
'train' 1165
  • 특징 :
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

tl

TFDS에 이 데이터세트를 로드하려면 다음 명령어를 사용하세요.

ds = tfds.load('huggingface:tapaco/tl')
  • 설명 :
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences meaning the same thing. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200  250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • 라이센스 : 크리에이티브 커먼즈 저작자표시 2.0 일반
  • 버전 : 1.0.0
  • 분할 :
나뉘다
'train' 1017
  • 특징 :
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

ㅋㅋㅋ

TFDS에 이 데이터세트를 로드하려면 다음 명령어를 사용하세요.

ds = tfds.load('huggingface:tapaco/tlh')
  • 설명 :
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences meaning the same thing. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200  250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • 라이센스 : 크리에이티브 커먼즈 저작자표시 2.0 일반
  • 버전 : 1.0.0
  • 분할 :
나뉘다
'train' 2804
  • 특징 :
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

따오기

TFDS에 이 데이터세트를 로드하려면 다음 명령어를 사용하세요.

ds = tfds.load('huggingface:tapaco/toki')
  • 설명 :
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences meaning the same thing. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200  250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • 라이센스 : 크리에이티브 커먼즈 저작자표시 2.0 일반
  • 버전 : 1.0.0
  • 분할 :
나뉘다
'train' 3738
  • 특징 :
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

tr

TFDS에 이 데이터세트를 로드하려면 다음 명령어를 사용하세요.

ds = tfds.load('huggingface:tapaco/tr')
  • 설명 :
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences meaning the same thing. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200  250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • 라이센스 : 크리에이티브 커먼즈 저작자표시 2.0 일반
  • 버전 : 1.0.0
  • 분할 :
나뉘다
'train' 142088
  • 특징 :
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

ㅜㅜ

TFDS에 이 데이터세트를 로드하려면 다음 명령어를 사용하세요.

ds = tfds.load('huggingface:tapaco/tt')
  • 설명 :
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences meaning the same thing. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200  250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • 라이센스 : 크리에이티브 커먼즈 저작자표시 2.0 일반
  • 버전 : 1.0.0
  • 분할 :
나뉘다
'train' 2398
  • 특징 :
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

TFDS에 이 데이터세트를 로드하려면 다음 명령어를 사용하세요.

ds = tfds.load('huggingface:tapaco/ug')
  • 설명 :
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences meaning the same thing. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200  250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • 라이센스 : 크리에이티브 커먼즈 저작자표시 2.0 일반
  • 버전 : 1.0.0
  • 분할 :
나뉘다
'train' 1183
  • 특징 :
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

영국

TFDS에 이 데이터세트를 로드하려면 다음 명령어를 사용하세요.

ds = tfds.load('huggingface:tapaco/uk')
  • 설명 :
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences meaning the same thing. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200  250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • 라이센스 : 크리에이티브 커먼즈 저작자표시 2.0 일반
  • 버전 : 1.0.0
  • 분할 :
나뉘다
'train' 54431
  • 특징 :
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

당신의

TFDS에 이 데이터세트를 로드하려면 다음 명령어를 사용하세요.

ds = tfds.load('huggingface:tapaco/ur')
  • 설명 :
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences meaning the same thing. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200  250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • 라이센스 : 크리에이티브 커먼즈 저작자표시 2.0 일반
  • 버전 : 1.0.0
  • 분할 :
나뉘다
'train' 252
  • 특징 :
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

vi

TFDS에 이 데이터세트를 로드하려면 다음 명령어를 사용하세요.

ds = tfds.load('huggingface:tapaco/vi')
  • 설명 :
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences meaning the same thing. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200  250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • 라이센스 : 크리에이티브 커먼즈 저작자표시 2.0 일반
  • 버전 : 1.0.0
  • 분할 :
나뉘다
'train' 962
  • 특징 :
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

TFDS에 이 데이터세트를 로드하려면 다음 명령어를 사용하세요.

ds = tfds.load('huggingface:tapaco/vo')
  • 설명 :
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences meaning the same thing. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200  250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • 라이센스 : 크리에이티브 커먼즈 저작자표시 2.0 일반
  • 버전 : 1.0.0
  • 분할 :
나뉘다
'train' 328
  • 특징 :
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

전쟁

TFDS에 이 데이터세트를 로드하려면 다음 명령어를 사용하세요.

ds = tfds.load('huggingface:tapaco/war')
  • 설명 :
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences meaning the same thing. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200  250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • 라이센스 : 크리에이티브 커먼즈 저작자표시 2.0 일반
  • 버전 : 1.0.0
  • 분할 :
나뉘다
'train' 327
  • 특징 :
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

우우

TFDS에 이 데이터세트를 로드하려면 다음 명령어를 사용하세요.

ds = tfds.load('huggingface:tapaco/wuu')
  • 설명 :
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences meaning the same thing. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200  250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • 라이센스 : 크리에이티브 커먼즈 저작자표시 2.0 일반
  • 버전 : 1.0.0
  • 분할 :
나뉘다
'train' 408
  • 특징 :
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

유에

TFDS에 이 데이터세트를 로드하려면 다음 명령어를 사용하세요.

ds = tfds.load('huggingface:tapaco/yue')
  • 설명 :
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences meaning the same thing. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200  250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • 라이센스 : 크리에이티브 커먼즈 저작자표시 2.0 일반
  • 버전 : 1.0.0
  • 분할 :
나뉘다
'train' 561
  • 특징 :
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}