tapaco

References:

all_languages

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:tapaco/all_languages')
  • Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • License: Creative Commons Attribution 2.0 Generic
  • Version: 1.0.0
  • Splits:
Split Examples
'train' 1926192
  • Features:
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
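
As a minimal sketch of how records with the schema above might be consumed, the snippet below groups sentences by `paraphrase_set_id` and enumerates sentence pairs within each set. It assumes the rows have already been converted to plain Python dictionaries of strings. The sample rows, the helper names `group_paraphrases` and `paraphrase_pairs`, and the choice to key groups by language as well as set id are illustrative assumptions, not part of the dataset or of TFDS.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical rows that follow the feature schema above (not real corpus data).
records = [
    {"paraphrase_set_id": "1", "sentence_id": "10", "paraphrase": "Hello.",
     "lists": [], "tags": [], "language": "en"},
    {"paraphrase_set_id": "1", "sentence_id": "11", "paraphrase": "Hi.",
     "lists": [], "tags": [], "language": "en"},
    {"paraphrase_set_id": "1", "sentence_id": "12", "paraphrase": "Hey.",
     "lists": [], "tags": [], "language": "en"},
    {"paraphrase_set_id": "2", "sentence_id": "20", "paraphrase": "Danke.",
     "lists": [], "tags": [], "language": "de"},
]


def group_paraphrases(rows):
    """Map (language, paraphrase_set_id) to the sentences in that set."""
    groups = defaultdict(list)
    for row in rows:
        key = (row["language"], row["paraphrase_set_id"])
        groups[key].append(row["paraphrase"])
    return dict(groups)


def paraphrase_pairs(rows):
    """Return every unordered sentence pair within each paraphrase set."""
    pairs = []
    for sentences in group_paraphrases(rows).values():
        pairs.extend(combinations(sentences, 2))
    return pairs
```

A set with n sentences yields n(n-1)/2 pairs, so sets containing a single sentence contribute none.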

af

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:tapaco/af')
  • Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • License: Creative Commons Attribution 2.0 Generic
  • Version: 1.0.0
  • Splits:
Split Examples
'train' 307
  • Features:
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
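
The description above says that paraphrase sets are extracted by traversing a graph of sentences connected by equivalence links. The following is a rough sketch of that idea only, under loud assumptions: it treats links as undirected, takes connected components with a breadth-first traversal, and regards sentences of the same language within one component as paraphrases. The function name `paraphrase_sets`, the input layout, and the toy data are hypothetical, and the filtering and pruning steps mentioned in the description are not reproduced, so this should not be read as the corpus authors' actual procedure.

```python
from collections import defaultdict, deque


def paraphrase_sets(sentences, links):
    """Group same-language sentences that are connected by equivalence links.

    sentences: dict mapping sentence id -> (language, text)
    links: iterable of (id_a, id_b) pairs, treated as undirected edges
    """
    neighbours = defaultdict(set)
    for a, b in links:
        neighbours[a].add(b)
        neighbours[b].add(a)

    seen = set()
    result = []
    for start in sentences:
        if start in seen:
            continue
        # Breadth-first traversal collects one connected component.
        component = []
        queue = deque([start])
        seen.add(start)
        while queue:
            node = queue.popleft()
            component.append(node)
            for nxt in neighbours[node]:
                if nxt in sentences and nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        # Assumption: same-language sentences in a component are paraphrases.
        by_language = defaultdict(list)
        for node in component:
            language, text = sentences[node]
            by_language[language].append(text)
        for language, texts in by_language.items():
            if len(texts) > 1:
                result.append((language, sorted(texts)))
    return result


# Hypothetical toy graph: two English sentences linked through a French one.
sentences = {
    1: ("en", "Hello."),
    2: ("fr", "Bonjour."),
    3: ("en", "Good day."),
    4: ("en", "Thanks."),
}
links = [(1, 2), (2, 3)]
```

With this toy input, the two English sentences end up in one set because both are linked to the same French sentence, while the unlinked sentence forms no set.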

ar

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:tapaco/ar')
  • Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • License: Creative Commons Attribution 2.0 Generic
  • Version: 1.0.0
  • Splits:
Split Examples
'train' 6446
  • Features:
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

az

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:tapaco/az')
  • Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • License: Creative Commons Attribution 2.0 Generic
  • Version: 1.0.0
  • Splits:
Split Examples
'train' 624
  • Features:
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

be

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:tapaco/be')
  • Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • License: Creative Commons Attribution 2.0 Generic
  • Version: 1.0.0
  • Splits:
Split Examples
'train' 1512
  • Features:
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

ber

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:tapaco/ber')
  • Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • License: Creative Commons Attribution 2.0 Generic
  • Version: 1.0.0
  • Splits:
Split Examples
'train' 67484
  • Features:
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

bg

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:tapaco/bg')
  • Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • License: Creative Commons Attribution 2.0 Generic
  • Version: 1.0.0
  • Splits:
Split Examples
'train' 6324
  • Features:
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

bn

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:tapaco/bn')
  • Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • License: Creative Commons Attribution 2.0 Generic
  • Version: 1.0.0
  • Splits:
Split Examples
'train' 1440
  • Features:
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

br

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:tapaco/br')
  • Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • License: Creative Commons Attribution 2.0 Generic
  • Version: 1.0.0
  • Splits:
Split Examples
'train' 2536
  • Features:
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

ca

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:tapaco/ca')
  • Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • License: Creative Commons Attribution 2.0 Generic
  • Version: 1.0.0
  • Splits:
Split Examples
'train' 518
  • Features:
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

cbk

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:tapaco/cbk')
  • Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • License: Creative Commons Attribution 2.0 Generic
  • Version: 1.0.0
  • Splits:
Split Examples
'train' 262
  • Features:
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

cmn

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:tapaco/cmn')
  • Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • License: Creative Commons Attribution 2.0 Generic
  • Version: 1.0.0
  • Splits:
Split Examples
'train' 12549
  • Features:
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

cs

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:tapaco/cs')
  • Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • License: Creative Commons Attribution 2.0 Generic
  • Version: 1.0.0
  • Splits:
Split Examples
'train' 6659
  • Features:
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

da

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:tapaco/da')
  • Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • License: Creative Commons Attribution 2.0 Generic
  • Version: 1.0.0
  • Splits:
Split Examples
'train' 11220
  • Features:
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

de

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:tapaco/de')
  • Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • License: Creative Commons Attribution 2.0 Generic
  • Version: 1.0.0
  • Splits:
Split Examples
'train' 125091
  • Features:
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

el

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:tapaco/el')
  • Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • License : Creative Commons Attribution 2.0 Generic
  • Version : 1.0.0
  • Splits :
Split Examples
'train' 10072
  • Features :
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
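
The description says the corpus is built by populating a graph with sentences and equivalence links, then traversing it to extract paraphrase sets. The sketch below shows one plausible reading of that step: find connected components with a breadth-first traversal, then group the sentences in each component by language. It is an illustration under stated assumptions, not the authors' implementation. The function name, the input shapes, and the sample data are hypothetical, and the filtering and pruning steps mentioned in the description are omitted.

```python
from collections import defaultdict, deque


def extract_paraphrase_sets(sentences, links):
    """Return (language, sorted sentence ids) for each same-language group.

    sentences: dict mapping sentence id -> language code (hypothetical shape)
    links: iterable of (id_a, id_b) equivalence links between known sentences
    """
    adjacency = defaultdict(set)
    for a, b in links:
        adjacency[a].add(b)
        adjacency[b].add(a)

    seen = set()
    result = []
    for start in sentences:
        if start in seen:
            continue
        # Breadth-first traversal collects one connected component.
        seen.add(start)
        queue = deque([start])
        component = []
        while queue:
            node = queue.popleft()
            component.append(node)
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        # Sentences of the same language within a component are treated
        # as paraphrases of each other; singletons are dropped.
        by_language = defaultdict(list)
        for node in component:
            by_language[sentences[node]].append(node)
        for language, ids in by_language.items():
            if len(ids) > 1:
                result.append((language, sorted(ids)))
    return result


# Made-up example: sentences 1 and 2 (English) are both linked to sentence 3
# (French), which is also linked to sentence 4 (French). Sentence 5 is isolated.
example_sentences = {1: "en", 2: "en", 3: "fr", 4: "fr", 5: "en"}
example_links = [(1, 3), (3, 2), (3, 4)]
example_sets = extract_paraphrase_sets(example_sentences, example_links)
```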

en

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:tapaco/en')
  • Description :
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • License : Creative Commons Attribution 2.0 Generic
  • Version : 1.0.0
  • Splits :
Split Examples
'train' 158053
  • Features :
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

eo

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:tapaco/eo')
  • Description :
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • License : Creative Commons Attribution 2.0 Generic
  • Version : 1.0.0
  • Splits :
Split Examples
'train' 207105
  • Features :
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

es

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:tapaco/es')
  • Description :
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • License : Creative Commons Attribution 2.0 Generic
  • Version : 1.0.0
  • Splits :
Split Examples
'train' 85064
  • Features :
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

et

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:tapaco/et')
  • Description :
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • License : Creative Commons Attribution 2.0 Generic
  • Version : 1.0.0
  • Splits :
Split Examples
'train' 241
  • Features :
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

eu

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:tapaco/eu')
  • Description :
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • License : Creative Commons Attribution 2.0 Generic
  • Version : 1.0.0
  • Splits :
Split Examples
'train' 573
  • Features :
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

fi

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:tapaco/fi')
  • Description :
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • License : Creative Commons Attribution 2.0 Generic
  • Version : 1.0.0
  • Splits :
Split Examples
'train' 31753
  • Features :
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

fr

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:tapaco/fr')
  • Description :
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • License : Creative Commons Attribution 2.0 Generic
  • Version : 1.0.0
  • Splits :
Split Examples
'train' 116733
  • Features :
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

gl

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:tapaco/gl')
  • Description :
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • License : Creative Commons Attribution 2.0 Generic
  • Version : 1.0.0
  • Splits :
Split Examples
'train' 351
  • Features :
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

gos

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:tapaco/gos')
  • Description :
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • License : Creative Commons Attribution 2.0 Generic
  • Version : 1.0.0
  • Splits :
Split Examples
'train' 279
  • Features :
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

he

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:tapaco/he')
  • Description :
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • License : Creative Commons Attribution 2.0 Generic
  • Version : 1.0.0
  • Splits :
Split Examples
'train' 68350
  • Features :
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

hi

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:tapaco/hi')
  • Description :
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • License : Creative Commons Attribution 2.0 Generic
  • Version : 1.0.0
  • Splits :
Split Examples
'train' 1913
  • Features :
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

hr

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:tapaco/hr')
  • Description :
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • License : Creative Commons Attribution 2.0 Generic
  • Version : 1.0.0
  • Splits :
Split Examples
'train' 505
  • Features :
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

hu

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:tapaco/hu')
  • Description :
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • License : Creative Commons Attribution 2.0 Generic
  • Version : 1.0.0
  • Splits :
Split Examples
'train' 67964
  • Features :
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

hy

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:tapaco/hy')
  • Description :
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • License : Creative Commons Attribution 2.0 Generic
  • Version : 1.0.0
  • Splits :
Split Examples
'train' 603
  • Features :
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

ia

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:tapaco/ia')
  • Description :
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • License : Creative Commons Attribution 2.0 Generic
  • Version : 1.0.0
  • Splits :
Split Examples
'train' 2548
  • Features :
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
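
Every record carries a `paraphrase_set_id` next to its `paraphrase` text, so paraphrases can be collected by grouping on that field. The sketch below assumes records are available as plain Python dicts with the keys listed in the feature schema. The helper `group_by_paraphrase_set` and the sample records are hypothetical and do not come from the corpus.

```python
from collections import defaultdict


def group_by_paraphrase_set(records):
    """Map each paraphrase_set_id to the list of its paraphrase strings."""
    groups = defaultdict(list)
    for record in records:
        groups[record["paraphrase_set_id"]].append(record["paraphrase"])
    return dict(groups)


# Hypothetical records that follow the feature schema; the values are
# placeholders, not real corpus entries.
sample_records = [
    {"paraphrase_set_id": "1", "sentence_id": "10", "paraphrase": "sentence A",
     "lists": [], "tags": [], "language": "ia"},
    {"paraphrase_set_id": "1", "sentence_id": "11", "paraphrase": "sentence B",
     "lists": [], "tags": [], "language": "ia"},
    {"paraphrase_set_id": "2", "sentence_id": "12", "paraphrase": "sentence C",
     "lists": [], "tags": [], "language": "ia"},
]

grouped = group_by_paraphrase_set(sample_records)
```

Sets with two or more members can then be expanded into sentence pairs for training or evaluation.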

id

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:tapaco/id')
  • Description :
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • License : Creative Commons Attribution 2.0 Generic
  • Version : 1.0.0
  • Splits :
Split Examples
'train' 1602
  • Features :
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

ie

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:tapaco/ie')
  • Description :
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • License : Creative Commons Attribution 2.0 Generic
  • Version : 1.0.0
  • Splits :
Split Examples
'train' 488
  • Features :
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

io

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:tapaco/io')
  • Description :
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • License : Creative Commons Attribution 2.0 Generic
  • Version : 1.0.0
  • Splits :
Split Examples
'train' 480
  • Features :
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

is

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:tapaco/is')
  • Description :
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • License : Creative Commons Attribution 2.0 Generic
  • Version : 1.0.0
  • Splits :
Split Examples
'train' 1641
  • Features :
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

it

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:tapaco/it')
  • Description :
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • License : Creative Commons Attribution 2.0 Generic
  • Version : 1.0.0
  • Splits :
Split Examples
'train' 198919
  • Features :
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

ja

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:tapaco/ja')
  • Description :
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • License : Creative Commons Attribution 2.0 Generic
  • Version : 1.0.0
  • Splits :
Split Examples
'train' 44267
  • Features :
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

jbo

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:tapaco/jbo')
  • Description :
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • License : Creative Commons Attribution 2.0 Generic
  • Version : 1.0.0
  • Splits :
Split Examples
'train' 2704
  • Features :
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

kab

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:tapaco/kab')
  • Description :
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • License : Creative Commons Attribution 2.0 Generic
  • Version : 1.0.0
  • Splits :
Split Examples
'train' 15944
  • Features :
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

ko

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:tapaco/ko')
  • Description :
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • License : Creative Commons Attribution 2.0 Generic
  • Version : 1.0.0
  • Splits :
Split Examples
'train' 503
  • Features :
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

kw

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:tapaco/kw')
  • Description :
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • License : Creative Commons Attribution 2.0 Generic
  • Version : 1.0.0
  • Splits :
Split Examples
'train' 1328
  • Features :
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

la

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:tapaco/la')
  • Description :
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • License : Creative Commons Attribution 2.0 Generic
  • Version : 1.0.0
  • Splits :
Split Examples
'train' 6889
  • Features :
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

lfn

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:tapaco/lfn')
  • Description :
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • License : Creative Commons Attribution 2.0 Generic
  • Version : 1.0.0
  • Splits :
Split Examples
'train' 2313
  • Features :
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

lt

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:tapaco/lt')
  • Description :
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • License : Creative Commons Attribution 2.0 Generic
  • Version : 1.0.0
  • Splits :
Split Examples
'train' 8042
  • Features :
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

mk

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:tapaco/mk')
  • Description :
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • License : Creative Commons Attribution 2.0 Generic
  • Version : 1.0.0
  • Splits :
Split Examples
'train' 14678
  • Features :
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

mr

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:tapaco/mr')
  • Description :
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • License: Creative Commons Attribution 2.0 Generic
  • Version: 1.0.0
  • Splits:
Split Examples
'train' 16413
  • Features:
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
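
Each record described by the features above holds a single sentence, and records that share a `paraphrase_set_id` are paraphrases of one another. As a minimal sketch, assuming the records have already been materialized as plain Python dicts keyed by the feature names (the helper name `group_paraphrases` and the sample records are hypothetical, not real corpus data or part of TFDS), the paraphrase sets could be rebuilt like this:

```python
from collections import defaultdict


def group_paraphrases(records):
    """Map each paraphrase_set_id to the list of sentences in that set."""
    sets = defaultdict(list)
    for rec in records:
        sets[rec["paraphrase_set_id"]].append(rec["paraphrase"])
    return dict(sets)


# Hypothetical records shaped like the feature spec (placeholder text only).
sample_records = [
    {"paraphrase_set_id": "1", "sentence_id": "10", "paraphrase": "sentence A",
     "lists": [], "tags": [], "language": "mr"},
    {"paraphrase_set_id": "1", "sentence_id": "11", "paraphrase": "sentence B",
     "lists": [], "tags": [], "language": "mr"},
    {"paraphrase_set_id": "2", "sentence_id": "12", "paraphrase": "sentence C",
     "lists": [], "tags": [], "language": "mr"},
]

groups = group_paraphrases(sample_records)
```

Grouping on the string ID keeps the sketch independent of how the records were loaded; records are appended in input order, so each set preserves the order in which its sentences were read.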

nb

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:tapaco/nb')
  • Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • License: Creative Commons Attribution 2.0 Generic
  • Version: 1.0.0
  • Splits:
Split Examples
'train' 1094
  • Features:
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

nds

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:tapaco/nds')
  • Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • License: Creative Commons Attribution 2.0 Generic
  • Version: 1.0.0
  • Splits:
Split Examples
'train' 2633
  • Features:
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
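
The feature specification above declares four plain string fields and two variable-length sequences of strings (`lists` and `tags`). As a rough sketch, assuming a record is a plain Python dict and that `"string"` maps to Python `str` (the names `SPEC` and `matches_spec` are hypothetical helpers, not part of TFDS or Hugging Face Datasets), a record could be checked against that specification as follows:

```python
# Feature spec transcribed from the JSON above (JSON null becomes None).
STRING = {"dtype": "string", "id": None, "_type": "Value"}
STRING_LIST = {"feature": STRING, "length": -1, "id": None, "_type": "Sequence"}
SPEC = {
    "paraphrase_set_id": STRING,
    "sentence_id": STRING,
    "paraphrase": STRING,
    "lists": STRING_LIST,
    "tags": STRING_LIST,
    "language": STRING,
}


def matches_spec(record, spec=SPEC):
    """Return True if record has exactly the spec's keys with matching types."""
    if set(record) != set(spec):
        return False
    for name, field in spec.items():
        value = record[name]
        if field["_type"] == "Value":
            if not isinstance(value, str):
                return False
        elif field["_type"] == "Sequence":
            if not isinstance(value, list):
                return False
            if not all(isinstance(item, str) for item in value):
                return False
        else:
            return False  # feature kinds outside this sketch are rejected
    return True


# Hypothetical record with placeholder values (not real corpus data).
good = {"paraphrase_set_id": "1", "sentence_id": "10", "paraphrase": "sentence A",
        "lists": [], "tags": ["tag"], "language": "nds"}
bad = dict(good, tags="tag")  # tags must be a list of strings
```

Here `matches_spec(good)` holds while `matches_spec(bad)` does not, because `tags` is declared as a sequence rather than a single string.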

nl

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:tapaco/nl')
  • Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • License: Creative Commons Attribution 2.0 Generic
  • Version: 1.0.0
  • Splits:
Split Examples
'train' 23561
  • Features:
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

orv

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:tapaco/orv')
  • Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • License: Creative Commons Attribution 2.0 Generic
  • Version: 1.0.0
  • Splits:
Split Examples
'train' 471
  • Features:
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

ota

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:tapaco/ota')
  • Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • License: Creative Commons Attribution 2.0 Generic
  • Version: 1.0.0
  • Splits:
Split Examples
'train' 486
  • Features:
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

pes

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:tapaco/pes')
  • Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • License: Creative Commons Attribution 2.0 Generic
  • Version: 1.0.0
  • Splits:
Split Examples
'train' 4285
  • Features:
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

pl

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:tapaco/pl')
  • Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • License: Creative Commons Attribution 2.0 Generic
  • Version: 1.0.0
  • Splits:
Split Examples
'train' 22391
  • Features:
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

pt

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:tapaco/pt')
  • Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • License: Creative Commons Attribution 2.0 Generic
  • Version: 1.0.0
  • Splits:
Split Examples
'train' 78430
  • Features:
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

rn

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:tapaco/rn')
  • Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • License: Creative Commons Attribution 2.0 Generic
  • Version: 1.0.0
  • Splits:
Split Examples
'train' 648
  • Features:
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

ro

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:tapaco/ro')
  • Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • License: Creative Commons Attribution 2.0 Generic
  • Version: 1.0.0
  • Splits:
Split Examples
'train' 2092
  • Features:
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

ru

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:tapaco/ru')
  • Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • License: Creative Commons Attribution 2.0 Generic
  • Version: 1.0.0
  • Splits:
Split Examples
'train' 251263
  • Features:
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
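
Because the corpus is organized as sets of paraphrases, one possible way to use it (an assumption about downstream use, not something the dataset card prescribes) is to expand each set into unordered sentence pairs. The sketch below assumes the sets are already available as a dict from set ID to a list of sentences; the helper name `paraphrase_pairs` and the placeholder sentences are hypothetical:

```python
from itertools import combinations


def paraphrase_pairs(sets):
    """Return every unordered sentence pair found within each paraphrase set."""
    pairs = []
    for sentences in sets.values():
        # combinations(..., 2) yields each pair once, in input order.
        pairs.extend(combinations(sentences, 2))
    return pairs


# Placeholder sets (not real corpus data): one set of three, one singleton.
example_sets = {"1": ["a", "b", "c"], "2": ["d"]}
pairs = paraphrase_pairs(example_sets)
```

A set of n sentences contributes n(n-1)/2 pairs, so large sets dominate the result; singleton sets contribute none.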

sl

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:tapaco/sl')
  • Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • License: Creative Commons Attribution 2.0 Generic
  • Version: 1.0.0
  • Splits:
Split Examples
'train' 706
  • Features:
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

sr

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:tapaco/sr')
  • Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • License: Creative Commons Attribution 2.0 Generic
  • Version: 1.0.0
  • Splits:
Split Examples
'train' 8175
  • Features:
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

sv

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:tapaco/sv')
  • Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • License: Creative Commons Attribution 2.0 Generic
  • Version: 1.0.0
  • Splits:
Split Examples
'train' 7005
  • Features:
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

tk

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:tapaco/tk')
  • Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • License : Creative Commons Attribution 2.0 Generic
  • Version : 1.0.0
  • Splits :
Split Examples
'train' 1165
  • Features :
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

tl

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:tapaco/tl')
  • Description :
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • License : Creative Commons Attribution 2.0 Generic
  • Version : 1.0.0
  • Splits :
Split Examples
'train' 1017
  • Features :
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

tlh

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:tapaco/tlh')
  • Description :
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • License : Creative Commons Attribution 2.0 Generic
  • Version : 1.0.0
  • Splits :
Split Examples
'train' 2804
  • Features :
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

toki

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:tapaco/toki')
  • Description :
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • License : Creative Commons Attribution 2.0 Generic
  • Version : 1.0.0
  • Splits :
Split Examples
'train' 3738
  • Features :
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

tr

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:tapaco/tr')
  • Description :
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • License : Creative Commons Attribution 2.0 Generic
  • Version : 1.0.0
  • Splits :
Split Examples
'train' 142088
  • Features :
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
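
Because each config is a flat list of sentences tagged with a paraphrase_set_id, sentence pairs have to be derived from the sets. The sketch below shows one way to enumerate them, assuming the sets are already available as a dict from set id to a list of sentences. The function name paraphrase_pairs and the example input are hypothetical illustrations, not corpus data.

```python
from itertools import combinations


def paraphrase_pairs(paraphrase_sets):
    """Yield every unordered sentence pair within each paraphrase set."""
    for set_id, sentences in paraphrase_sets.items():
        for first, second in combinations(sentences, 2):
            yield set_id, first, second


# Made-up input: set id -> sentences sharing that id.
example_sets = {"1": ["a", "b", "c"], "2": ["d"]}
pairs = list(paraphrase_pairs(example_sets))
```

A set of n sentences yields n(n-1)/2 pairs and a singleton set yields none, so the number of pairs can grow quickly for large sets.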

tt

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:tapaco/tt')
  • Description :
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • License : Creative Commons Attribution 2.0 Generic
  • Version : 1.0.0
  • Splits :
Split Examples
'train' 2398
  • Features :
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

ug

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:tapaco/ug')
  • Description :
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • License : Creative Commons Attribution 2.0 Generic
  • Version : 1.0.0
  • Splits :
Split Examples
'train' 1183
  • Features :
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

uk

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:tapaco/uk')
  • Description :
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • License : Creative Commons Attribution 2.0 Generic
  • Version : 1.0.0
  • Splits :
Split Examples
'train' 54431
  • Features :
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

ur

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:tapaco/ur')
  • Description :
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • License : Creative Commons Attribution 2.0 Generic
  • Version : 1.0.0
  • Splits :
Split Examples
'train' 252
  • Features :
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

vi

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:tapaco/vi')
  • Description :
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • License : Creative Commons Attribution 2.0 Generic
  • Version : 1.0.0
  • Splits :
Split Examples
'train' 962
  • Features :
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

vo

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:tapaco/vo')
  • Description :
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • License : Creative Commons Attribution 2.0 Generic
  • Version : 1.0.0
  • Splits :
Split Examples
'train' 328
  • Features :
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

war

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:tapaco/war')
  • Description :
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • License : Creative Commons Attribution 2.0 Generic
  • Version : 1.0.0
  • Splits :
Split Examples
'train' 327
  • Features :
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

wuu

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:tapaco/wuu')
  • Description :
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • License : Creative Commons Attribution 2.0 Generic
  • Version : 1.0.0
  • Splits :
Split Examples
'train' 408
  • Features :
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

yue

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:tapaco/yue')
  • Description :
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • License : Creative Commons Attribution 2.0 Generic
  • Version : 1.0.0
  • Splits :
Split Examples
'train' 561
  • Features :
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
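
The per-language configs listed above differ widely in size, from a few hundred to over a hundred thousand train examples. As a rough sketch, the split sizes can be tabulated to pick configs large enough for a given purpose and to build the corresponding TFDS names. The helper names tfds_name and configs_with_at_least are hypothetical; the counts are copied from the 'train' rows of the tk through yue listings.

```python
# Train split sizes copied from the per-config listings (tk through yue).
TRAIN_EXAMPLES = {
    "tk": 1165, "tl": 1017, "tlh": 2804, "toki": 3738, "tr": 142088,
    "tt": 2398, "ug": 1183, "uk": 54431, "ur": 252, "vi": 962,
    "vo": 328, "war": 327, "wuu": 408, "yue": 561,
}


def tfds_name(config):
    """Build the dataset name in the form used by the load commands."""
    return f"huggingface:tapaco/{config}"


def configs_with_at_least(min_examples):
    """Return config names, sorted, with at least min_examples train rows."""
    return sorted(c for c, n in TRAIN_EXAMPLES.items() if n >= min_examples)


large = [tfds_name(c) for c in configs_with_at_least(50_000)]
```

Each resulting name can be passed to tfds.load in the same way as the commands shown for the individual configs.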