- Description:
GEM is a benchmark environment for Natural Language Generation with a focus on its Evaluation, both through human annotations and automated Metrics.
GEM aims to: (1) measure NLG progress across 13 datasets spanning many NLG tasks and languages; (2) provide an in-depth analysis of data and models, presented via data statements and challenge sets; (3) develop standards for the evaluation of generated text using both automated and human metrics.
More information can be found at https://gem-benchmark.com.
- Homepage: https://gem-benchmark.com
- Source code: tfds.text.gem.Gem
- Versions:
  - 1.0.0: Initial version
  - 1.0.1: Update bad links filter for MLSum
  - 1.1.0 (default): Release of the Challenge Sets
- Supervised keys (see the as_supervised doc): None
- Figure (tfds.show_examples): Not supported.
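The following is a minimal sketch of how one GEM config might be loaded through TFDS. `tfds.load` and the `name/config:version` naming scheme come from TFDS itself; the helper names `gem_dataset_name` and `load_gem_split` are hypothetical and exist only for illustration. The import is done lazily so the name helper can be used without TensorFlow installed.

```python
from typing import Optional


def gem_dataset_name(config: str, version: Optional[str] = None) -> str:
    """Build a TFDS dataset name such as 'gem/common_gen:1.1.0'."""
    name = f"gem/{config}"
    return f"{name}:{version}" if version else name


def load_gem_split(config: str, split: str = "validation",
                   version: Optional[str] = None):
    """Load one split of a GEM config as a tf.data.Dataset.

    tensorflow_datasets is imported here, not at module level, so that
    gem_dataset_name() stays usable without TensorFlow.
    """
    import tensorflow_datasets as tfds

    return tfds.load(gem_dataset_name(config, version), split=split)


# Example usage (downloads and prepares the data on first call):
# ds = load_gem_split("common_gen", split="validation", version="1.1.0")
# for example in ds.take(1):
#     print(example["gem_id"], example["target"])
```

Omitting the version lets TFDS pick the default one (1.1.0 according to the version list above).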
gem / common_gen (default config)
Config description: CommonGen is a constrained text generation task, associated with a benchmark dataset, to explicitly test machines for the ability of generative commonsense reasoning. Given a set of common concepts, the task is to generate a coherent sentence describing an everyday scenario using these concepts.
Download size: 1.84 MiB
Dataset size: 16.84 MiB
Auto-cached (documentation): Yes
Splits:
Split | Examples |
---|---|
'challenge_test_scramble' | 500 |
'challenge_train_sample' | 500 |
'challenge_validation_sample' | 500 |
'test' | 1,497 |
'train' | 67,389 |
'validation' | 993 |
- Features:
FeaturesDict({
'concept_set_id': tf.int32,
'concepts': Sequence(tf.string),
'gem_id': tf.string,
'gem_parent_id': tf.string,
'references': Sequence(tf.string),
'target': tf.string,
})
- Examples (tfds.as_dataframe):
- Citation:
@inproceedings{lin2020commongen,
title = "CommonGen: A Constrained Text Generation Challenge for Generative Commonsense Reasoning",
author = "Lin, Bill Yuchen and
Zhou, Wangchunshu and
Shen, Ming and
Zhou, Pei and
Bhagavatula, Chandra and
Choi, Yejin and
Ren, Xiang",
booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2020",
month = nov,
year = "2020",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/2020.findings-emnlp.165",
pages = "1823--1840",
}
@article{gehrmann2021gem,
author = {Sebastian Gehrmann and
Tosin P. Adewumi and
Karmanya Aggarwal and
Pawan Sasanka Ammanamanchi and
Aremu Anuoluwapo and
Antoine Bosselut and
Khyathi Raghavi Chandu and
Miruna{-}Adriana Clinciu and
Dipanjan Das and
Kaustubh D. Dhole and
Wanyu Du and
Esin Durmus and
Ondrej Dusek and
Chris Emezue and
Varun Gangal and
Cristina Garbacea and
Tatsunori Hashimoto and
Yufang Hou and
Yacine Jernite and
Harsh Jhamtani and
Yangfeng Ji and
Shailza Jolly and
Dhruv Kumar and
Faisal Ladhak and
Aman Madaan and
Mounica Maddela and
Khyati Mahajan and
Saad Mahamood and
Bodhisattwa Prasad Majumder and
Pedro Henrique Martins and
Angelina McMillan{-}Major and
Simon Mille and
Emiel van Miltenburg and
Moin Nadeem and
Shashi Narayan and
Vitaly Nikolaev and
Rubungo Andre Niyongabo and
Salomey Osei and
Ankur P. Parikh and
Laura Perez{-}Beltrachini and
Niranjan Ramesh Rao and
Vikas Raunak and
Juan Diego Rodriguez and
Sashank Santhanam and
Jo{\~{a} }o Sedoc and
Thibault Sellam and
Samira Shaikh and
Anastasia Shimorina and
Marco Antonio Sobrevilla Cabezudo and
Hendrik Strobelt and
Nishant Subramani and
Wei Xu and
Diyi Yang and
Akhila Yerukola and
Jiawei Zhou},
title = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
Metrics},
journal = {CoRR},
volume = {abs/2102.01672},
year = {2021},
url = {https://arxiv.org/abs/2102.01672},
archivePrefix = {arXiv},
eprint = {2102.01672}
}
Note that each GEM dataset has its own citation. Please see the source for
the correct citation for each contained dataset.
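To illustrate the common_gen task described above, here is a rough sketch that turns a `concepts` list into a model input and checks whether a generated sentence mentions every concept. The prompt prefix and both function names are assumptions made for this example, and the coverage check is a naive prefix match that ignores irregular inflection.

```python
from typing import Iterable, List


def concepts_to_prompt(concepts: Iterable[str]) -> str:
    """Join a concept set into one input string (prefix is hypothetical)."""
    return "generate a sentence with: " + ", ".join(concepts)


def missing_concepts(concepts: Iterable[str], sentence: str) -> List[str]:
    """Return the concepts that no word in the sentence starts with."""
    words = sentence.lower().split()
    return [
        concept
        for concept in concepts
        if not any(word.startswith(concept.lower()) for word in words)
    ]
```

A lemmatizer would give a more faithful coverage check; the prefix match is only meant to show the shape of the constraint.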
gem / cs_restaurants
Config description: The task is generating responses in the context of a (hypothetical) dialogue system that provides information about restaurants. The input is a basic intent/dialogue act type and a list of slots (attributes) and their values. The output is a natural language sentence.
Download size: 1.46 MiB
Dataset size: 2.71 MiB
Auto-cached (documentation): Yes
Splits:
Split | Examples |
---|---|
'challenge_test_scramble' | 500 |
'challenge_train_sample' | 500 |
'challenge_validation_sample' | 500 |
'test' | 842 |
'train' | 3,569 |
'validation' | 781 |
- Features:
FeaturesDict({
'dialog_act': tf.string,
'dialog_act_delexicalized': tf.string,
'gem_id': tf.string,
'gem_parent_id': tf.string,
'references': Sequence(tf.string),
'target': tf.string,
'target_delexicalized': tf.string,
})
- Examples (tfds.as_dataframe):
- Citation:
@inproceedings{cs_restaurants,
address = {Tokyo, Japan},
title = {Neural {Generation} for {Czech}: {Data} and {Baselines} },
shorttitle = {Neural {Generation} for {Czech} },
url = {https://www.aclweb.org/anthology/W19-8670/},
urldate = {2019-10-18},
booktitle = {Proceedings of the 12th {International} {Conference} on {Natural} {Language} {Generation} ({INLG} 2019)},
author = {Dušek, Ondřej and Jurčíček, Filip},
month = oct,
year = {2019},
pages = {563--574}
}
@article{gehrmann2021gem,
author = {Sebastian Gehrmann and
Tosin P. Adewumi and
Karmanya Aggarwal and
Pawan Sasanka Ammanamanchi and
Aremu Anuoluwapo and
Antoine Bosselut and
Khyathi Raghavi Chandu and
Miruna{-}Adriana Clinciu and
Dipanjan Das and
Kaustubh D. Dhole and
Wanyu Du and
Esin Durmus and
Ondrej Dusek and
Chris Emezue and
Varun Gangal and
Cristina Garbacea and
Tatsunori Hashimoto and
Yufang Hou and
Yacine Jernite and
Harsh Jhamtani and
Yangfeng Ji and
Shailza Jolly and
Dhruv Kumar and
Faisal Ladhak and
Aman Madaan and
Mounica Maddela and
Khyati Mahajan and
Saad Mahamood and
Bodhisattwa Prasad Majumder and
Pedro Henrique Martins and
Angelina McMillan{-}Major and
Simon Mille and
Emiel van Miltenburg and
Moin Nadeem and
Shashi Narayan and
Vitaly Nikolaev and
Rubungo Andre Niyongabo and
Salomey Osei and
Ankur P. Parikh and
Laura Perez{-}Beltrachini and
Niranjan Ramesh Rao and
Vikas Raunak and
Juan Diego Rodriguez and
Sashank Santhanam and
Jo{\~{a} }o Sedoc and
Thibault Sellam and
Samira Shaikh and
Anastasia Shimorina and
Marco Antonio Sobrevilla Cabezudo and
Hendrik Strobelt and
Nishant Subramani and
Wei Xu and
Diyi Yang and
Akhila Yerukola and
Jiawei Zhou},
title = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
Metrics},
journal = {CoRR},
volume = {abs/2102.01672},
year = {2021},
url = {https://arxiv.org/abs/2102.01672},
archivePrefix = {arXiv},
eprint = {2102.01672}
}
Note that each GEM dataset has its own citation. Please see the source for
the correct citation for each contained dataset.
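The cs_restaurants features above pair each lexicalized field with a delexicalized one. As a rough sketch of what delexicalization means, the functions below swap slot values for placeholders and back. The `X-<slot>` placeholder format, the function names, and the English sample string in the usage note are assumptions for illustration only; real Czech data inflects slot values, so plain string replacement like this would not be sufficient there.

```python
from typing import Dict


def delexicalize(text: str, slots: Dict[str, str]) -> str:
    """Replace each slot value in the text with an 'X-<slot>' placeholder."""
    # Longer values first, so a short value never clobbers part of a longer one.
    for slot, value in sorted(slots.items(), key=lambda kv: -len(kv[1])):
        text = text.replace(value, f"X-{slot}")
    return text


def relexicalize(text: str, slots: Dict[str, str]) -> str:
    """Inverse of delexicalize: put the concrete values back."""
    # Longer slot names first, so 'X-name' never matches inside 'X-name_long'.
    for slot in sorted(slots, key=len, reverse=True):
        text = text.replace(f"X-{slot}", slots[slot])
    return text
```

For example, with the made-up slots `{"name": "Ananta", "area": "Smichov"}`, the string "Ananta is in Smichov" becomes "X-name is in X-area".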
gem / dart
Config description: DART is a large, open-domain structured data record to text generation corpus with high-quality sentence annotations, with each input being a set of entity-relation triples following a tree-structured ontology.
Download size: 28.01 MiB
Dataset size: 33.78 MiB
Auto-cached (documentation): Yes
Splits:
Split | Examples |
---|---|
'test' | 6,959 |
'train' | 62,659 |
'validation' | 2,768 |
- Features:
FeaturesDict({
'dart_id': tf.int32,
'gem_id': tf.string,
'gem_parent_id': tf.string,
'references': Sequence(tf.string),
'subtree_was_extended': tf.bool,
'target': tf.string,
'target_sources': Sequence(tf.string),
'tripleset': Sequence(tf.string),
})
- Examples (tfds.as_dataframe):
- Citation:
@article{radev2020dart,
title={Dart: Open-domain structured data record to text generation},
author={Radev, Dragomir and Zhang, Rui and Rau, Amrit and Sivaprasad, Abhinand and Hsieh, Chiachun and Rajani, Nazneen Fatema and Tang, Xiangru and Vyas, Aadit and Verma, Neha and Krishna, Pranav and others},
journal={arXiv preprint arXiv:2007.02871},
year={2020}
}
@article{gehrmann2021gem,
author = {Sebastian Gehrmann and
Tosin P. Adewumi and
Karmanya Aggarwal and
Pawan Sasanka Ammanamanchi and
Aremu Anuoluwapo and
Antoine Bosselut and
Khyathi Raghavi Chandu and
Miruna{-}Adriana Clinciu and
Dipanjan Das and
Kaustubh D. Dhole and
Wanyu Du and
Esin Durmus and
Ondrej Dusek and
Chris Emezue and
Varun Gangal and
Cristina Garbacea and
Tatsunori Hashimoto and
Yufang Hou and
Yacine Jernite and
Harsh Jhamtani and
Yangfeng Ji and
Shailza Jolly and
Dhruv Kumar and
Faisal Ladhak and
Aman Madaan and
Mounica Maddela and
Khyati Mahajan and
Saad Mahamood and
Bodhisattwa Prasad Majumder and
Pedro Henrique Martins and
Angelina McMillan{-}Major and
Simon Mille and
Emiel van Miltenburg and
Moin Nadeem and
Shashi Narayan and
Vitaly Nikolaev and
Rubungo Andre Niyongabo and
Salomey Osei and
Ankur P. Parikh and
Laura Perez{-}Beltrachini and
Niranjan Ramesh Rao and
Vikas Raunak and
Juan Diego Rodriguez and
Sashank Santhanam and
Jo{\~{a} }o Sedoc and
Thibault Sellam and
Samira Shaikh and
Anastasia Shimorina and
Marco Antonio Sobrevilla Cabezudo and
Hendrik Strobelt and
Nishant Subramani and
Wei Xu and
Diyi Yang and
Akhila Yerukola and
Jiawei Zhou},
title = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
Metrics},
journal = {CoRR},
volume = {abs/2102.01672},
year = {2021},
url = {https://arxiv.org/abs/2102.01672},
archivePrefix = {arXiv},
eprint = {2102.01672}
}
Note that each GEM dataset has its own citation. Please see the source for
the correct citation for each contained dataset.
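Sequence-to-sequence models typically need the DART triple set flattened into a single string. The sketch below shows one possible linearization. It assumes the triples are already available as (subject, relation, object) tuples, which may not match how `tripleset` is stored in this config, and the `<S>`, `<R>`, `<O>` markers are hypothetical, not part of the dataset.

```python
from typing import List, Sequence, Tuple

Triple = Tuple[str, str, str]


def linearize_triples(triples: Sequence[Triple]) -> str:
    """Flatten (subject, relation, object) triples into one tagged string."""
    parts: List[str] = []
    for subj, rel, obj in triples:
        parts.append(f"<S> {subj} <R> {rel} <O> {obj}")
    return " ".join(parts)
```

The resulting string can be paired with the `target` field as a source/target training pair.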
gem / e2e_nlg
Config description: The E2E dataset is designed for a limited-domain data-to-text task: generation of restaurant descriptions/recommendations based on up to 8 different attributes (name, area, price range, etc.).
Download size: 13.99 MiB
Dataset size: 16.92 MiB
Auto-cached (documentation): Yes
Splits:
Split | Examples |
---|---|
'challenge_test_scramble' | 500 |
'challenge_train_sample' | 500 |
'challenge_validation_sample' | 500 |
'test' | 4,693 |
'train' | 33,525 |
'validation' | 4,299 |
- Features:
FeaturesDict({
'gem_id': tf.string,
'gem_parent_id': tf.string,
'meaning_representation': tf.string,
'references': Sequence(tf.string),
'target': tf.string,
})
- Examples (tfds.as_dataframe):
- Citation:
@inproceedings{e2e_cleaned,
address = {Tokyo, Japan},
title = {Semantic {Noise} {Matters} for {Neural} {Natural} {Language} {Generation} },
url = {https://www.aclweb.org/anthology/W19-8652/},
booktitle = {Proceedings of the 12th {International} {Conference} on {Natural} {Language} {Generation} ({INLG} 2019)},
author = {Dušek, Ondřej and Howcroft, David M and Rieser, Verena},
year = {2019},
pages = {421--426},
}
@article{gehrmann2021gem,
author = {Sebastian Gehrmann and
Tosin P. Adewumi and
Karmanya Aggarwal and
Pawan Sasanka Ammanamanchi and
Aremu Anuoluwapo and
Antoine Bosselut and
Khyathi Raghavi Chandu and
Miruna{-}Adriana Clinciu and
Dipanjan Das and
Kaustubh D. Dhole and
Wanyu Du and
Esin Durmus and
Ondrej Dusek and
Chris Emezue and
Varun Gangal and
Cristina Garbacea and
Tatsunori Hashimoto and
Yufang Hou and
Yacine Jernite and
Harsh Jhamtani and
Yangfeng Ji and
Shailza Jolly and
Dhruv Kumar and
Faisal Ladhak and
Aman Madaan and
Mounica Maddela and
Khyati Mahajan and
Saad Mahamood and
Bodhisattwa Prasad Majumder and
Pedro Henrique Martins and
Angelina McMillan{-}Major and
Simon Mille and
Emiel van Miltenburg and
Moin Nadeem and
Shashi Narayan and
Vitaly Nikolaev and
Rubungo Andre Niyongabo and
Salomey Osei and
Ankur P. Parikh and
Laura Perez{-}Beltrachini and
Niranjan Ramesh Rao and
Vikas Raunak and
Juan Diego Rodriguez and
Sashank Santhanam and
Jo{\~{a} }o Sedoc and
Thibault Sellam and
Samira Shaikh and
Anastasia Shimorina and
Marco Antonio Sobrevilla Cabezudo and
Hendrik Strobelt and
Nishant Subramani and
Wei Xu and
Diyi Yang and
Akhila Yerukola and
Jiawei Zhou},
title = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
Metrics},
journal = {CoRR},
volume = {abs/2102.01672},
year = {2021},
url = {https://arxiv.org/abs/2102.01672},
archivePrefix = {arXiv},
eprint = {2102.01672}
}
Note that each GEM dataset has its own citation. Please see the source for
the correct citation for each contained dataset.
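The `meaning_representation` feature above is a single string. As a sketch, assuming it follows the `slot[value], slot[value]` layout commonly used for E2E data (an assumption that this page does not confirm), it could be parsed into a dictionary as follows. The function name and the sample string in the usage note are made up for illustration.

```python
import re
from typing import Dict

# One "slot[value]" pair: the slot name, then the bracketed value.
_SLOT_RE = re.compile(r"\s*([^\[\],]+?)\s*\[([^\]]*)\]")


def parse_meaning_representation(mr: str) -> Dict[str, str]:
    """Parse 'slot[value], slot[value]' into a dict that keeps slot order."""
    return {slot: value for slot, value in _SLOT_RE.findall(mr)}
```

Calling it on `"name[The Eagle], eatType[coffee shop]"` yields a two-entry dictionary keyed by `name` and `eatType`.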
gem / mlsum_de
Config description: MLSum is a large-scale multilingual summarization dataset. It is built from online news outlets, and this split focuses on German.
Download size: 345.98 MiB
Dataset size: 963.60 MiB
Auto-cached (documentation): No
Splits:
Split | Examples |
---|---|
'challenge_test_covid' | 5,058 |
'challenge_train_sample' | 500 |
'challenge_validation_sample' | 500 |
'test' | 10,695 |
'train' | 220,748 |
'validation' | 11,392 |
- Features:
FeaturesDict({
'date': tf.string,
'gem_id': tf.string,
'gem_parent_id': tf.string,
'references': Sequence(tf.string),
'target': tf.string,
'text': tf.string,
'title': tf.string,
'topic': tf.string,
'url': tf.string,
})
- Examples (tfds.as_dataframe):
- Citation:
@inproceedings{scialom-etal-2020-mlsum,
title = "{MLSUM}: The Multilingual Summarization Corpus",
author = {Scialom, Thomas and Dray, Paul-Alexis and Lamprier, Sylvain and Piwowarski, Benjamin and Staiano, Jacopo},
booktitle = {Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
year = {2020}
}
@article{gehrmann2021gem,
author = {Sebastian Gehrmann and
Tosin P. Adewumi and
Karmanya Aggarwal and
Pawan Sasanka Ammanamanchi and
Aremu Anuoluwapo and
Antoine Bosselut and
Khyathi Raghavi Chandu and
Miruna{-}Adriana Clinciu and
Dipanjan Das and
Kaustubh D. Dhole and
Wanyu Du and
Esin Durmus and
Ondrej Dusek and
Chris Emezue and
Varun Gangal and
Cristina Garbacea and
Tatsunori Hashimoto and
Yufang Hou and
Yacine Jernite and
Harsh Jhamtani and
Yangfeng Ji and
Shailza Jolly and
Dhruv Kumar and
Faisal Ladhak and
Aman Madaan and
Mounica Maddela and
Khyati Mahajan and
Saad Mahamood and
Bodhisattwa Prasad Majumder and
Pedro Henrique Martins and
Angelina McMillan{-}Major and
Simon Mille and
Emiel van Miltenburg and
Moin Nadeem and
Shashi Narayan and
Vitaly Nikolaev and
Rubungo Andre Niyongabo and
Salomey Osei and
Ankur P. Parikh and
Laura Perez{-}Beltrachini and
Niranjan Ramesh Rao and
Vikas Raunak and
Juan Diego Rodriguez and
Sashank Santhanam and
Jo{\~{a} }o Sedoc and
Thibault Sellam and
Samira Shaikh and
Anastasia Shimorina and
Marco Antonio Sobrevilla Cabezudo and
Hendrik Strobelt and
Nishant Subramani and
Wei Xu and
Diyi Yang and
Akhila Yerukola and
Jiawei Zhou},
title = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
Metrics},
journal = {CoRR},
volume = {abs/2102.01672},
year = {2021},
url = {https://arxiv.org/abs/2102.01672},
archivePrefix = {arXiv},
eprint = {2102.01672}
}
Note that each GEM dataset has its own citation. Please see the source for
the correct citation for each contained dataset.
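Summaries in this config can be compared against the `references` field with an automatic metric. The sketch below is a deliberately simplified unigram-overlap F1 on whitespace tokens; it is a stand-in for illustration, not the metric suite GEM actually uses, and both function names are hypothetical.

```python
from collections import Counter
from typing import List


def unigram_f1(candidate: str, reference: str) -> float:
    """F1 over lowercased whitespace tokens, counting repeated words."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # multiset intersection
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def best_f1(candidate: str, references: List[str]) -> float:
    """Score a candidate against its closest reference."""
    return max((unigram_f1(candidate, ref) for ref in references), default=0.0)
```

Taking the maximum over references is one common way to handle multi-reference data; averaging is another.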
gem / mlsum_es
Config description: MLSum is a large-scale multilingual summarization dataset. It is built from online news outlets, and this split focuses on Spanish.
Download size: 501.27 MiB
Dataset size: 1.29 GiB
Auto-cached (documentation): No
Splits:
Split | Examples |
---|---|
'challenge_test_covid' | 1,938 |
'challenge_train_sample' | 500 |
'challenge_validation_sample' | 500 |
'test' | 13,366 |
'train' | 259,888 |
'validation' | 9,977 |
- Features:
FeaturesDict({
'date': tf.string,
'gem_id': tf.string,
'gem_parent_id': tf.string,
'references': Sequence(tf.string),
'target': tf.string,
'text': tf.string,
'title': tf.string,
'topic': tf.string,
'url': tf.string,
})
- Examples (tfds.as_dataframe):
- Citation:
@inproceedings{scialom-etal-2020-mlsum,
title = "{MLSUM}: The Multilingual Summarization Corpus",
author = {Scialom, Thomas and Dray, Paul-Alexis and Lamprier, Sylvain and Piwowarski, Benjamin and Staiano, Jacopo},
booktitle = {Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
year = {2020}
}
@article{gehrmann2021gem,
author = {Sebastian Gehrmann and
Tosin P. Adewumi and
Karmanya Aggarwal and
Pawan Sasanka Ammanamanchi and
Aremu Anuoluwapo and
Antoine Bosselut and
Khyathi Raghavi Chandu and
Miruna{-}Adriana Clinciu and
Dipanjan Das and
Kaustubh D. Dhole and
Wanyu Du and
Esin Durmus and
Ondrej Dusek and
Chris Emezue and
Varun Gangal and
Cristina Garbacea and
Tatsunori Hashimoto and
Yufang Hou and
Yacine Jernite and
Harsh Jhamtani and
Yangfeng Ji and
Shailza Jolly and
Dhruv Kumar and
Faisal Ladhak and
Aman Madaan and
Mounica Maddela and
Khyati Mahajan and
Saad Mahamood and
Bodhisattwa Prasad Majumder and
Pedro Henrique Martins and
Angelina McMillan{-}Major and
Simon Mille and
Emiel van Miltenburg and
Moin Nadeem and
Shashi Narayan and
Vitaly Nikolaev and
Rubungo Andre Niyongabo and
Salomey Osei and
Ankur P. Parikh and
Laura Perez{-}Beltrachini and
Niranjan Ramesh Rao and
Vikas Raunak and
Juan Diego Rodriguez and
Sashank Santhanam and
Jo{\~{a} }o Sedoc and
Thibault Sellam and
Samira Shaikh and
Anastasia Shimorina and
Marco Antonio Sobrevilla Cabezudo and
Hendrik Strobelt and
Nishant Subramani and
Wei Xu and
Diyi Yang and
Akhila Yerukola and
Jiawei Zhou},
title = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
Metrics},
journal = {CoRR},
volume = {abs/2102.01672},
year = {2021},
url = {https://arxiv.org/abs/2102.01672},
archivePrefix = {arXiv},
eprint = {2102.01672}
}
Note that each GEM dataset has its own citation. Please see the source for
the correct citation for each contained dataset.
gem / schema_guided_dialog
Config description: The Schema-Guided Dialogue (SGD) dataset contains 18K multi-domain task-oriented dialogues between a human and a virtual assistant, covering 17 domains ranging from banks and events to media, calendar, travel, and weather.
Download size: 17.00 MiB
Dataset size: 201.19 MiB
Auto-cached (documentation): Yes (challenge_test_backtranslation, challenge_test_bfp02, challenge_test_bfp05, challenge_test_nopunc, challenge_test_scramble, challenge_train_sample, challenge_validation_sample, test, validation), Only when shuffle_files=False (train)
Splits:
Split | Examples |
---|---|
'challenge_test_backtranslation' | 500 |
'challenge_test_bfp02' | 500 |
'challenge_test_bfp05' | 500 |
'challenge_test_nopunc' | 500 |
'challenge_test_scramble' | 500 |
'challenge_train_sample' | 500 |
'challenge_validation_sample' | 500 |
'test' | 10,000 |
'train' | 164,982 |
'validation' | 10,000 |
- Features:
FeaturesDict({
'context': Sequence(tf.string),
'dialog_acts': Sequence({
'act': ClassLabel(shape=(), dtype=tf.int64, num_classes=18),
'slot': tf.string,
'values': Sequence(tf.string),
}),
'dialog_id': tf.string,
'gem_id': tf.string,
'gem_parent_id': tf.string,
'prompt': tf.string,
'references': Sequence(tf.string),
'service': tf.string,
'target': tf.string,
'turn_id': tf.int32,
})
- Examples (tfds.as_dataframe):
- Citation:
@article{rastogi2019towards,
title={Towards Scalable Multi-domain Conversational Agents: The Schema-Guided Dialogue Dataset},
author={Rastogi, Abhinav and Zang, Xiaoxue and Sunkara, Srinivas and Gupta, Raghav and Khaitan, Pranav},
journal={arXiv preprint arXiv:1909.05855},
year={2019}
}
@article{gehrmann2021gem,
author = {Sebastian Gehrmann and
Tosin P. Adewumi and
Karmanya Aggarwal and
Pawan Sasanka Ammanamanchi and
Aremu Anuoluwapo and
Antoine Bosselut and
Khyathi Raghavi Chandu and
Miruna{-}Adriana Clinciu and
Dipanjan Das and
Kaustubh D. Dhole and
Wanyu Du and
Esin Durmus and
Ondrej Dusek and
Chris Emezue and
Varun Gangal and
Cristina Garbacea and
Tatsunori Hashimoto and
Yufang Hou and
Yacine Jernite and
Harsh Jhamtani and
Yangfeng Ji and
Shailza Jolly and
Dhruv Kumar and
Faisal Ladhak and
Aman Madaan and
Mounica Maddela and
Khyati Mahajan and
Saad Mahamood and
Bodhisattwa Prasad Majumder and
Pedro Henrique Martins and
Angelina McMillan{-}Major and
Simon Mille and
Emiel van Miltenburg and
Moin Nadeem and
Shashi Narayan and
Vitaly Nikolaev and
Rubungo Andre Niyongabo and
Salomey Osei and
Ankur P. Parikh and
Laura Perez{-}Beltrachini and
Niranjan Ramesh Rao and
Vikas Raunak and
Juan Diego Rodriguez and
Sashank Santhanam and
Jo{\~{a} }o Sedoc and
Thibault Sellam and
Samira Shaikh and
Anastasia Shimorina and
Marco Antonio Sobrevilla Cabezudo and
Hendrik Strobelt and
Nishant Subramani and
Wei Xu and
Diyi Yang and
Akhila Yerukola and
Jiawei Zhou},
title = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
Metrics},
journal = {CoRR},
volume = {abs/2102.01672},
year = {2021},
url = {https://arxiv.org/abs/2102.01672},
archivePrefix = {arXiv},
eprint = {2102.01672}
}
Note that each GEM dataset has its own citation. Please see the source for
the correct citation for each contained dataset.
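Each schema_guided_dialog example carries a list of `dialog_acts`, each with an `act`, a `slot`, and a list of `values`. The sketch below mirrors that structure in a small dataclass and flattens it into a text prompt. The act names used in the usage note are hypothetical (this page lists 18 classes but not their names), and the output format is just one possible choice.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DialogAct:
    act: str  # label name decoded from the ClassLabel id (names assumed)
    slot: str = ""
    values: List[str] = field(default_factory=list)


def linearize_acts(acts: List[DialogAct]) -> str:
    """Render acts as 'ACT(slot=value | value)' items joined by spaces."""
    parts = []
    for item in acts:
        if item.slot and item.values:
            joined = " | ".join(item.values)
            arg = f"{item.slot}={joined}"
        else:
            arg = item.slot  # slot without values, or empty for bare acts
        parts.append(f"{item.act}({arg})")
    return " ".join(parts)
```

For instance, a request for a city followed by an inform act on cuisine would render as `REQUEST(city) INFORM(cuisine=Italian)`.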
gem / totto
Config description: ToTTo is a Table-to-Text NLG task. The task is as follows: given a Wikipedia table with row names, column names, and table cells, with a subset of cells highlighted, generate a natural language description of the highlighted part of the table.
Download size: 180.75 MiB
Dataset size: 645.86 MiB
Auto-cached (documentation): No
Splits:
Split | Examples |
---|---|
'challenge_test_scramble' | 500 |
'challenge_train_sample' | 500 |
'challenge_validation_sample' | 500 |
'test' | 7,700 |
'train' | 121,153 |
'validation' | 7,700 |
- Features:
FeaturesDict({
'example_id': tf.string,
'gem_id': tf.string,
'gem_parent_id': tf.string,
'highlighted_cells': Sequence(Sequence(tf.int32)),
'overlap_subset': tf.string,
'references': Sequence(tf.string),
'sentence_annotations': Sequence({
'final_sentence': tf.string,
'original_sentence': tf.string,
'sentence_after_ambiguity': tf.string,
'sentence_after_deletion': tf.string,
}),
'table': Sequence(Sequence({
'column_span': tf.int32,
'is_header': tf.bool,
'row_span': tf.int32,
'value': tf.string,
})),
'table_page_title': tf.string,
'table_section_text': tf.string,
'table_section_title': tf.string,
'table_webpage_url': tf.string,
'target': tf.string,
'totto_id': tf.int32,
})
- Examples (tfds.as_dataframe):
- Citation:
@inproceedings{parikh2020totto,
title={ToTTo: A Controlled Table-To-Text Generation Dataset},
author={Parikh, Ankur and Wang, Xuezhi and Gehrmann, Sebastian and Faruqui, Manaal and Dhingra, Bhuwan and Yang, Diyi and Das, Dipanjan},
booktitle={Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
pages={1173--1186},
year={2020}
}
@article{gehrmann2021gem,
author = {Sebastian Gehrmann and
Tosin P. Adewumi and
Karmanya Aggarwal and
Pawan Sasanka Ammanamanchi and
Aremu Anuoluwapo and
Antoine Bosselut and
Khyathi Raghavi Chandu and
Miruna{-}Adriana Clinciu and
Dipanjan Das and
Kaustubh D. Dhole and
Wanyu Du and
Esin Durmus and
Ondrej Dusek and
Chris Emezue and
Varun Gangal and
Cristina Garbacea and
Tatsunori Hashimoto and
Yufang Hou and
Yacine Jernite and
Harsh Jhamtani and
Yangfeng Ji and
Shailza Jolly and
Dhruv Kumar and
Faisal Ladhak and
Aman Madaan and
Mounica Maddela and
Khyati Mahajan and
Saad Mahamood and
Bodhisattwa Prasad Majumder and
Pedro Henrique Martins and
Angelina McMillan{-}Major and
Simon Mille and
Emiel van Miltenburg and
Moin Nadeem and
Shashi Narayan and
Vitaly Nikolaev and
Rubungo Andre Niyongabo and
Salomey Osei and
Ankur P. Parikh and
Laura Perez{-}Beltrachini and
Niranjan Ramesh Rao and
Vikas Raunak and
Juan Diego Rodriguez and
Sashank Santhanam and
Jo{\~{a} }o Sedoc and
Thibault Sellam and
Samira Shaikh and
Anastasia Shimorina and
Marco Antonio Sobrevilla Cabezudo and
Hendrik Strobelt and
Nishant Subramani and
Wei Xu and
Diyi Yang and
Akhila Yerukola and
Jiawei Zhou},
title = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
Metrics},
journal = {CoRR},
volume = {abs/2102.01672},
year = {2021},
url = {https://arxiv.org/abs/2102.01672},
archivePrefix = {arXiv},
eprint = {2102.01672}
}
Note that each GEM dataset has its own citation. Please see the source for
the correct citation for each contained dataset.
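The ToTTo features above store the table as rows of cells and the highlighted region as a list of integer pairs. As a sketch, assuming each pair in `highlighted_cells` is a `[row_index, column_index]` position into `table`, the highlighted values can be collected as follows. The function name and the toy table in the usage note are made up for illustration.

```python
from typing import Dict, List, Sequence

Cell = Dict[str, object]  # keys as in the feature spec: value, is_header, ...


def highlighted_values(table: Sequence[Sequence[Cell]],
                       highlighted_cells: Sequence[Sequence[int]]) -> List[str]:
    """Return the 'value' of every highlighted cell, in the given order."""
    values = []
    for row_idx, col_idx in highlighted_cells:
        values.append(str(table[row_idx][col_idx]["value"]))
    return values
```

With a two-row toy table whose second row holds "A. Example" and "12", highlighting `[[1, 0], [1, 1]]` returns those two strings.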
gem / web_nlg_en
Config description: WebNLG is a bilingual dataset (English, Russian) of parallel DBpedia triple sets and short texts that cover about 450 different DBpedia properties. The WebNLG data was originally created to promote the development of RDF verbalisers able to generate short text and to handle micro-planning.
Download size: 12.57 MiB
Dataset size: 19.91 MiB
Auto-cached (documentation): Yes
Splits:
Split | Examples |
---|---|
'challenge_test_numbers' | 500 |
'challenge_test_scramble' | 500 |
'challenge_train_sample' | 502 |
'challenge_validation_sample' | 499 |
'test' | 1,779 |
'train' | 35,426 |
'validation' | 1,667 |
- Features:
FeaturesDict({
'category': tf.string,
'gem_id': tf.string,
'gem_parent_id': tf.string,
'input': Sequence(tf.string),
'references': Sequence(tf.string),
'target': tf.string,
'webnlg_id': tf.string,
})
- Examples (tfds.as_dataframe):
- Citation:
@inproceedings{gardent2017creating,
author = "Gardent, Claire
and Shimorina, Anastasia
and Narayan, Shashi
and Perez-Beltrachini, Laura",
title = "Creating Training Corpora for NLG Micro-Planners",
booktitle = "Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
year = "2017",
publisher = "Association for Computational Linguistics",
pages = "179--188",
location = "Vancouver, Canada",
doi = "10.18653/v1/P17-1017",
url = "http://www.aclweb.org/anthology/P17-1017"
}
@article{gehrmann2021gem,
author = {Sebastian Gehrmann and
Tosin P. Adewumi and
Karmanya Aggarwal and
Pawan Sasanka Ammanamanchi and
Aremu Anuoluwapo and
Antoine Bosselut and
Khyathi Raghavi Chandu and
Miruna{-}Adriana Clinciu and
Dipanjan Das and
Kaustubh D. Dhole and
Wanyu Du and
Esin Durmus and
Ondrej Dusek and
Chris Emezue and
Varun Gangal and
Cristina Garbacea and
Tatsunori Hashimoto and
Yufang Hou and
Yacine Jernite and
Harsh Jhamtani and
Yangfeng Ji and
Shailza Jolly and
Dhruv Kumar and
Faisal Ladhak and
Aman Madaan and
Mounica Maddela and
Khyati Mahajan and
Saad Mahamood and
Bodhisattwa Prasad Majumder and
Pedro Henrique Martins and
Angelina McMillan{-}Major and
Simon Mille and
Emiel van Miltenburg and
Moin Nadeem and
Shashi Narayan and
Vitaly Nikolaev and
Rubungo Andre Niyongabo and
Salomey Osei and
Ankur P. Parikh and
Laura Perez{-}Beltrachini and
Niranjan Ramesh Rao and
Vikas Raunak and
Juan Diego Rodriguez and
Sashank Santhanam and
Jo{\~{a} }o Sedoc and
Thibault Sellam and
Samira Shaikh and
Anastasia Shimorina and
Marco Antonio Sobrevilla Cabezudo and
Hendrik Strobelt and
Nishant Subramani and
Wei Xu and
Diyi Yang and
Akhila Yerukola and
Jiawei Zhou},
title = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
Metrics},
journal = {CoRR},
volume = {abs/2102.01672},
year = {2021},
url = {https://arxiv.org/abs/2102.01672},
archivePrefix = {arXiv},
eprint = {2102.01672}
}
Note that each GEM dataset has its own citation. Please see the source for
the correct citation for each contained dataset.
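The WebNLG `input` feature is a sequence of strings, one per triple. The sketch below assumes each string has the form `subject | predicate | object`, which is an assumption about the serialization and is not stated on this page, and groups the triples by subject as a simple content-planning step. All names and sample strings are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def parse_triple(line: str) -> Tuple[str, str, str]:
    """Split 'subject | predicate | object' into its three stripped parts."""
    parts = [part.strip() for part in line.split("|", 2)]
    if len(parts) != 3:
        raise ValueError(f"expected 3 fields, got {len(parts)}: {line!r}")
    return parts[0], parts[1], parts[2]


def group_by_subject(lines: Sequence[str]) -> Dict[str, List[Tuple[str, str]]]:
    """Map each subject to its (predicate, object) pairs, in input order."""
    grouped: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
    for line in lines:
        subj, pred, obj = parse_triple(line)
        grouped[subj].append((pred, obj))
    return dict(grouped)
```

Grouping by subject makes it easier to verbalize all facts about one entity in a single sentence.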
gem / web_nlg_ru
Config description: WebNLG is a bilingual dataset (English, Russian) of parallel DBpedia triple sets and short texts that cover about 450 different DBpedia properties. The WebNLG data was originally created to promote the development of RDF verbalisers able to generate short text and to handle micro-planning.
Download size: 7.49 MiB
Dataset size: 11.30 MiB
Auto-cached (documentation): Yes
Splits:
Split | Examples |
---|---|
'challenge_test_scramble' | 500 |
'challenge_train_sample' | 501 |
'challenge_validation_sample' | 500 |
'test' | 1,102 |
'train' | 14,630 |
'validation' | 790 |
- Features:
FeaturesDict({
'category': tf.string,
'gem_id': tf.string,
'gem_parent_id': tf.string,
'input': Sequence(tf.string),
'references': Sequence(tf.string),
'target': tf.string,
'webnlg_id': tf.string,
})
- Examples (tfds.as_dataframe):
- Citation:
@inproceedings{gardent2017creating,
author = "Gardent, Claire
and Shimorina, Anastasia
and Narayan, Shashi
and Perez-Beltrachini, Laura",
title = "Creating Training Corpora for NLG Micro-Planners",
booktitle = "Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
year = "2017",
publisher = "Association for Computational Linguistics",
pages = "179--188",
location = "Vancouver, Canada",
doi = "10.18653/v1/P17-1017",
url = "http://www.aclweb.org/anthology/P17-1017"
}
@article{gehrmann2021gem,
author = {Sebastian Gehrmann and
Tosin P. Adewumi and
Karmanya Aggarwal and
Pawan Sasanka Ammanamanchi and
Aremu Anuoluwapo and
Antoine Bosselut and
Khyathi Raghavi Chandu and
Miruna{-}Adriana Clinciu and
Dipanjan Das and
Kaustubh D. Dhole and
Wanyu Du and
Esin Durmus and
Ondrej Dusek and
Chris Emezue and
Varun Gangal and
Cristina Garbacea and
Tatsunori Hashimoto and
Yufang Hou and
Yacine Jernite and
Harsh Jhamtani and
Yangfeng Ji and
Shailza Jolly and
Dhruv Kumar and
Faisal Ladhak and
Aman Madaan and
Mounica Maddela and
Khyati Mahajan and
Saad Mahamood and
Bodhisattwa Prasad Majumder and
Pedro Henrique Martins and
Angelina McMillan{-}Major and
Simon Mille and
Emiel van Miltenburg and
Moin Nadeem and
Shashi Narayan and
Vitaly Nikolaev and
Rubungo Andre Niyongabo and
Salomey Osei and
Ankur P. Parikh and
Laura Perez{-}Beltrachini and
Niranjan Ramesh Rao and
Vikas Raunak and
Juan Diego Rodriguez and
Sashank Santhanam and
Jo{\~{a} }o Sedoc and
Thibault Sellam and
Samira Shaikh and
Anastasia Shimorina and
Marco Antonio Sobrevilla Cabezudo and
Hendrik Strobelt and
Nishant Subramani and
Wei Xu and
Diyi Yang and
Akhila Yerukola and
Jiawei Zhou},
title = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
Metrics},
journal = {CoRR},
volume = {abs/2102.01672},
year = {2021},
url = {https://arxiv.org/abs/2102.01672},
archivePrefix = {arXiv},
eprint = {2102.01672}
}
Note that each GEM dataset has its own citation. Please see the source for
the correct citation for each contained dataset.
gem / wiki_auto_asset_turk
Config description: WikiAuto provides a set of aligned sentences from English Wikipedia and Simple English Wikipedia as a resource to train sentence simplification systems. ASSET and TURK are high-quality simplification datasets used for testing.
Download size: 121.01 MiB
Dataset size: 202.40 MiB
Auto-cached (documentation): Yes (challenge_test_asset_backtranslation, challenge_test_asset_bfp02, challenge_test_asset_bfp05, challenge_test_asset_nopunc, challenge_test_turk_backtranslation, challenge_test_turk_bfp02, challenge_test_turk_bfp05, challenge_test_turk_nopunc, challenge_train_sample, challenge_validation_sample, test_asset, test_turk, validation), Only when shuffle_files=False (train)
Splits:
Split | Examples |
---|---|
'challenge_test_asset_backtranslation' | 359 |
'challenge_test_asset_bfp02' | 359 |
'challenge_test_asset_bfp05' | 359 |
'challenge_test_asset_nopunc' | 359 |
'challenge_test_turk_backtranslation' | 359 |
'challenge_test_turk_bfp02' | 359 |
'challenge_test_turk_bfp05' | 359 |
'challenge_test_turk_nopunc' | 359 |
'challenge_train_sample' | 500 |
'challenge_validation_sample' | 500 |
'test_asset' | 359 |
'test_turk' | 359 |
'train' | 483,801 |
'validation' | 20,000 |
- Features:
FeaturesDict({
'gem_id': tf.string,
'gem_parent_id': tf.string,
'references': Sequence(tf.string),
'source': tf.string,
'target': tf.string,
})
- Examples (tfds.as_dataframe):
- Citation:
@inproceedings{jiang-etal-2020-neural,
title = "Neural {CRF} Model for Sentence Alignment in Text Simplification",
author = "Jiang, Chao and
Maddela, Mounica and
Lan, Wuwei and
Zhong, Yang and
Xu, Wei",
booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics",
month = jul,
year = "2020",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/2020.acl-main.709",
doi = "10.18653/v1/2020.acl-main.709",
pages = "7943--7960",
}
@article{gehrmann2021gem,
author = {Sebastian Gehrmann and
Tosin P. Adewumi and
Karmanya Aggarwal and
Pawan Sasanka Ammanamanchi and
Aremu Anuoluwapo and
Antoine Bosselut and
Khyathi Raghavi Chandu and
Miruna{-}Adriana Clinciu and
Dipanjan Das and
Kaustubh D. Dhole and
Wanyu Du and
Esin Durmus and
Ondrej Dusek and
Chris Emezue and
Varun Gangal and
Cristina Garbacea and
Tatsunori Hashimoto and
Yufang Hou and
Yacine Jernite and
Harsh Jhamtani and
Yangfeng Ji and
Shailza Jolly and
Dhruv Kumar and
Faisal Ladhak and
Aman Madaan and
Mounica Maddela and
Khyati Mahajan and
Saad Mahamood and
Bodhisattwa Prasad Majumder and
Pedro Henrique Martins and
Angelina McMillan{-}Major and
Simon Mille and
Emiel van Miltenburg and
Moin Nadeem and
Shashi Narayan and
Vitaly Nikolaev and
Rubungo Andre Niyongabo and
Salomey Osei and
Ankur P. Parikh and
Laura Perez{-}Beltrachini and
Niranjan Ramesh Rao and
Vikas Raunak and
Juan Diego Rodriguez and
Sashank Santhanam and
Jo{\~{a}}o Sedoc and
Thibault Sellam and
Samira Shaikh and
Anastasia Shimorina and
Marco Antonio Sobrevilla Cabezudo and
Hendrik Strobelt and
Nishant Subramani and
Wei Xu and
Diyi Yang and
Akhila Yerukola and
Jiawei Zhou},
title = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
Metrics},
journal = {CoRR},
volume = {abs/2102.01672},
year = {2021},
url = {https://arxiv.org/abs/2102.01672},
archivePrefix = {arXiv},
eprint = {2102.01672}
}
Note that each GEM dataset has its own citation. Please see the source to see
the correct citation for each contained dataset.
gem / xsum
Config description: The dataset is for the task of abstractive summarization in its extreme form: summarizing a document in a single sentence.
Download size:
246.31 MiB
Dataset size:
78.89 MiB
Auto-cached (documentation): Yes
Splits:
Split | Examples |
---|---|
'challenge_test_backtranslation' | 500 |
'challenge_test_bfp_02' | 500 |
'challenge_test_bfp_05' | 500 |
'challenge_test_covid' | 401 |
'challenge_test_nopunc' | 500 |
'challenge_train_sample' | 500 |
'challenge_validation_sample' | 500 |
'test' | 1,166 |
'train' | 23,206 |
'validation' | 1,117 |
- Features:
FeaturesDict({
'document': tf.string,
'gem_id': tf.string,
'gem_parent_id': tf.string,
'references': Sequence(tf.string),
'target': tf.string,
'xsum_id': tf.string,
})
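Every field in this FeaturesDict is a tf.string, so examples read through tfds.as_numpy arrive as bytes. The sketch below decodes one example into plain Python strings. The function names are hypothetical, and the loading step assumes tensorflow_datasets is installed and that the config is addressed as gem/xsum.

```python
def decode_example(example):
    """Convert one xsum example with bytes values into plain str values."""

    def to_text(value):
        # tf.string features surface as bytes in NumPy form; pass str through.
        return value.decode("utf-8") if isinstance(value, bytes) else value

    return {
        "gem_id": to_text(example["gem_id"]),
        "document": to_text(example["document"]),
        "target": to_text(example["target"]),
        "references": [to_text(r) for r in example["references"]],
    }


def iter_xsum(split="validation"):
    """Yield decoded examples from one split (downloads data on first use)."""
    # Imported lazily so decode_example can be used without TFDS installed.
    import tensorflow_datasets as tfds

    for example in tfds.as_numpy(tfds.load("gem/xsum", split=split)):
        yield decode_example(example)
```

The single-sentence summary is in target; references holds the same kind of text as a sequence, which is convenient for metrics that accept several references.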
- Citation:
@inproceedings{Narayan2018dont,
author = "Shashi Narayan and Shay B. Cohen and Mirella Lapata",
title = "Don't Give Me the Details, Just the Summary! {T}opic-Aware Convolutional Neural Networks for Extreme Summarization",
booktitle = "Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing ",
year = "2018",
address = "Brussels, Belgium",
}
gem / wiki_lingua_arabic_ar
Config description: WikiLingua is a large-scale, multilingual dataset for the evaluation of cross-lingual abstractive summarization systems.
Download size:
56.25 MiB
Dataset size:
291.42 MiB
Auto-cached (documentation): No
Splits:
Split | Examples |
---|---|
'test' | 5,841 |
'train' | 20,441 |
'validation' | 2,919 |
- Features:
FeaturesDict({
'gem_id': tf.string,
'gem_parent_id': tf.string,
'references': Sequence(tf.string),
'source': tf.string,
'source_aligned': Translation({
'ar': Text(shape=(), dtype=tf.string),
'en': Text(shape=(), dtype=tf.string),
}),
'target': tf.string,
'target_aligned': Translation({
'ar': Text(shape=(), dtype=tf.string),
'en': Text(shape=(), dtype=tf.string),
}),
})
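source_aligned and target_aligned are Translation features, which TFDS exposes as dictionaries keyed by language code (here ar and en). The sketch below pulls the parallel texts out of one example. The function name and the sample dictionary are hypothetical, and examples read through tfds.as_numpy would carry bytes rather than str values.

```python
def aligned_texts(example, lang):
    """Return the aligned source and target text of one example for `lang`."""
    return {
        "source": example["source_aligned"][lang],
        "target": example["target_aligned"][lang],
    }


# Made-up example shaped like the FeaturesDict above; the texts are placeholders.
sample = {
    "gem_id": "made-up-id-0",
    "gem_parent_id": "made-up-id-0",
    "source": "article text (ar)",
    "target": "summary text (ar)",
    "references": ["summary text (ar)"],
    "source_aligned": {"ar": "article text (ar)", "en": "article text (en)"},
    "target_aligned": {"ar": "summary text (ar)", "en": "summary text (en)"},
}

if __name__ == "__main__":
    print(aligned_texts(sample, "en"))
```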
- Citation:
@inproceedings{ladhak-wiki-2020,
title={WikiLingua: A New Benchmark Dataset for Multilingual Abstractive Summarization},
author={Faisal Ladhak and Esin Durmus and Claire Cardie and Kathleen McKeown},
booktitle={Findings of EMNLP, 2020},
year={2020}
}
gem / wiki_lingua_chinese_zh
Config description: WikiLingua is a large-scale, multilingual dataset for the evaluation of cross-lingual abstractive summarization systems.
Download size:
31.38 MiB
Dataset size:
122.06 MiB
Auto-cached (documentation): Yes
Splits:
Split | Examples |
---|---|
'test' | 3,775 |
'train' | 13,211 |
'validation' | 1,886 |
- Features:
FeaturesDict({
'gem_id': tf.string,
'gem_parent_id': tf.string,
'references': Sequence(tf.string),
'source': tf.string,
'source_aligned': Translation({
'en': Text(shape=(), dtype=tf.string),
'zh': Text(shape=(), dtype=tf.string),
}),
'target': tf.string,
'target_aligned': Translation({
'en': Text(shape=(), dtype=tf.string),
'zh': Text(shape=(), dtype=tf.string),
}),
})
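The split sizes listed for this config work out to roughly a 70/10/20 division between train, validation and test. A small worked check in Python, using only the counts from the splits table above (the function name is illustrative):

```python
# Example counts for gem/wiki_lingua_chinese_zh, copied from the splits table.
SPLIT_SIZES = {"train": 13211, "validation": 1886, "test": 3775}


def split_shares(sizes):
    """Return each split's share of all examples, in percent (one decimal)."""
    total = sum(sizes.values())
    return {name: round(100 * count / total, 1) for name, count in sizes.items()}


if __name__ == "__main__":
    print(sum(SPLIT_SIZES.values()))  # 18872
    print(split_shares(SPLIT_SIZES))  # {'train': 70.0, 'validation': 10.0, 'test': 20.0}
```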
- Citation:
@inproceedings{ladhak-wiki-2020,
title={WikiLingua: A New Benchmark Dataset for Multilingual Abstractive Summarization},
author={Faisal Ladhak and Esin Durmus and Claire Cardie and Kathleen McKeown},
booktitle={Findings of EMNLP, 2020},
year={2020}
}
gem / wiki_lingua_czech_cs
Config description: WikiLingua is a large-scale, multilingual dataset for the evaluation of cross-lingual abstractive summarization systems.
Download size:
13.84 MiB
Dataset size:
58.05 MiB
Auto-cached (documentation): Yes
Splits:
Split | Examples |
---|---|
'test' | 1,438 |
'train' | 5,033 |
'validation' | 718 |
- Features:
FeaturesDict({
'gem_id': tf.string,
'gem_parent_id': tf.string,
'references': Sequence(tf.string),
'source': tf.string,
'source_aligned': Translation({
'cs': Text(shape=(), dtype=tf.string),
'en': Text(shape=(), dtype=tf.string),
}),
'target': tf.string,
'target_aligned': Translation({
'cs': Text(shape=(), dtype=tf.string),
'en': Text(shape=(), dtype=tf.string),
}),
})
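The WikiLingua configs share one naming scheme: wiki_lingua_, then the language name, then its two-letter code. The sketch below builds the full config name from a language name. The mapping covers only the configs listed in this part of the page, and the helper name is hypothetical.

```python
# Language name -> code, as they appear in the config names on this page.
WIKI_LINGUA_LANGUAGES = {
    "arabic": "ar",
    "chinese": "zh",
    "czech": "cs",
    "dutch": "nl",
    "english": "en",
    "french": "fr",
    "german": "de",
    "hindi": "hi",
    "indonesian": "id",
    "italian": "it",
    "japanese": "ja",
    "korean": "ko",
}


def wiki_lingua_config(language):
    """Return the dataset/config string for one WikiLingua language."""
    code = WIKI_LINGUA_LANGUAGES[language]  # KeyError for languages not listed
    return f"gem/wiki_lingua_{language}_{code}"


if __name__ == "__main__":
    print(wiki_lingua_config("czech"))  # gem/wiki_lingua_czech_cs
```

The returned string is in the form expected as the first argument of tfds.load.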
- Citation:
@inproceedings{ladhak-wiki-2020,
title={WikiLingua: A New Benchmark Dataset for Multilingual Abstractive Summarization},
author={Faisal Ladhak and Esin Durmus and Claire Cardie and Kathleen McKeown},
booktitle={Findings of EMNLP, 2020},
year={2020}
}
gem / wiki_lingua_dutch_nl
Config description: WikiLingua is a large-scale, multilingual dataset for the evaluation of cross-lingual abstractive summarization systems.
Download size:
53.88 MiB
Dataset size:
237.97 MiB
Auto-cached (documentation): Yes (test, validation), only when shuffle_files=False (train)
Splits:
Split | Examples |
---|---|
'test' | 6,248 |
'train' | 21,866 |
'validation' | 3,123 |
- Features:
FeaturesDict({
'gem_id': tf.string,
'gem_parent_id': tf.string,
'references': Sequence(tf.string),
'source': tf.string,
'source_aligned': Translation({
'en': Text(shape=(), dtype=tf.string),
'nl': Text(shape=(), dtype=tf.string),
}),
'target': tf.string,
'target_aligned': Translation({
'en': Text(shape=(), dtype=tf.string),
'nl': Text(shape=(), dtype=tf.string),
}),
})
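The Auto-cached entries on this page are consistent with the rule described in the TFDS performance guide: a dataset is cached automatically when its total size is known and below 250 MiB, and either shuffle_files is disabled or only a single shard is read. The sketch below encodes that rule. The 250 MiB threshold and the function name are assumptions taken from that guide rather than from this page.

```python
# Assumed threshold, as described in the TFDS performance guide.
AUTO_CACHE_LIMIT_MIB = 250.0


def may_auto_cache(dataset_size_mib, shuffle_files=False, single_shard=False):
    """Estimate whether TFDS would auto-cache a dataset of the given size."""
    small_enough = dataset_size_mib < AUTO_CACHE_LIMIT_MIB
    stable_order = (not shuffle_files) or single_shard
    return small_enough and stable_order


if __name__ == "__main__":
    # Dataset sizes copied from the config entries on this page.
    print(may_auto_cache(237.97))                      # dutch, shuffle_files=False
    print(may_auto_cache(237.97, shuffle_files=True))  # dutch train, shuffled
    print(may_auto_cache(291.42))                      # arabic, above the limit
```

Under this reading, the smaller test and validation splits fit in a single shard and stay cacheable, while the multi-shard train split is cached only with shuffle_files=False.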
- Citation:
@inproceedings{ladhak-wiki-2020,
title={WikiLingua: A New Benchmark Dataset for Multilingual Abstractive Summarization},
author={Faisal Ladhak and Esin Durmus and Claire Cardie and Kathleen McKeown},
booktitle={Findings of EMNLP, 2020},
year={2020}
}
gem / wiki_lingua_english_en
Config description: WikiLingua is a large-scale, multilingual dataset for the evaluation of cross-lingual abstractive summarization systems.
Download size:
112.56 MiB
Dataset size:
657.51 MiB
Auto-cached (documentation): No
Splits:
Split | Examples |
---|---|
'test' | 28,614 |
'train' | 99,020 |
'validation' | 13,823 |
- Features:
FeaturesDict({
'gem_id': tf.string,
'gem_parent_id': tf.string,
'references': Sequence(tf.string),
'source': tf.string,
'source_aligned': Translation({
'en': Text(shape=(), dtype=tf.string),
}),
'target': tf.string,
'target_aligned': Translation({
'en': Text(shape=(), dtype=tf.string),
}),
})
- Citation:
@inproceedings{ladhak-wiki-2020,
title={WikiLingua: A New Benchmark Dataset for Multilingual Abstractive Summarization},
author={Faisal Ladhak and Esin Durmus and Claire Cardie and Kathleen McKeown},
booktitle={Findings of EMNLP, 2020},
year={2020}
}
gem / wiki_lingua_french_fr
Config description: WikiLingua is a large-scale, multilingual dataset for the evaluation of cross-lingual abstractive summarization systems.
Download size:
113.26 MiB
Dataset size:
522.28 MiB
Auto-cached (documentation): No
Splits:
Split | Examples |
---|---|
'test' | 12,731 |
'train' | 44,556 |
'validation' | 6,364 |
- Features:
FeaturesDict({
'gem_id': tf.string,
'gem_parent_id': tf.string,
'references': Sequence(tf.string),
'source': tf.string,
'source_aligned': Translation({
'en': Text(shape=(), dtype=tf.string),
'fr': Text(shape=(), dtype=tf.string),
}),
'target': tf.string,
'target_aligned': Translation({
'en': Text(shape=(), dtype=tf.string),
'fr': Text(shape=(), dtype=tf.string),
}),
})
- Citation:
@inproceedings{ladhak-wiki-2020,
title={WikiLingua: A New Benchmark Dataset for Multilingual Abstractive Summarization},
author={Faisal Ladhak and Esin Durmus and Claire Cardie and Kathleen McKeown},
booktitle={Findings of EMNLP, 2020},
year={2020}
}
gem / wiki_lingua_german_de
Config description: WikiLingua is a large-scale, multilingual dataset for the evaluation of cross-lingual abstractive summarization systems.
Download size:
102.65 MiB
Dataset size:
452.46 MiB
Auto-cached (documentation): No
Splits:
Split | Examples |
---|---|
'test' | 11,669 |
'train' | 40,839 |
'validation' | 5,833 |
- Features:
FeaturesDict({
'gem_id': tf.string,
'gem_parent_id': tf.string,
'references': Sequence(tf.string),
'source': tf.string,
'source_aligned': Translation({
'de': Text(shape=(), dtype=tf.string),
'en': Text(shape=(), dtype=tf.string),
}),
'target': tf.string,
'target_aligned': Translation({
'de': Text(shape=(), dtype=tf.string),
'en': Text(shape=(), dtype=tf.string),
}),
})
- Citation:
@inproceedings{ladhak-wiki-2020,
title={WikiLingua: A New Benchmark Dataset for Multilingual Abstractive Summarization},
author={Faisal Ladhak and Esin Durmus and Claire Cardie and Kathleen McKeown},
booktitle={Findings of EMNLP, 2020},
year={2020}
}
gem / wiki_lingua_hindi_hi
Config description: WikiLingua is a large-scale, multilingual dataset for the evaluation of cross-lingual abstractive summarization systems.
Download size:
20.07 MiB
Dataset size:
138.06 MiB
Auto-cached (documentation): Yes
Splits:
Split | Examples |
---|---|
'test' | 1,984 |
'train' | 6,942 |
'validation' | 991 |
- Features:
FeaturesDict({
'gem_id': tf.string,
'gem_parent_id': tf.string,
'references': Sequence(tf.string),
'source': tf.string,
'source_aligned': Translation({
'en': Text(shape=(), dtype=tf.string),
'hi': Text(shape=(), dtype=tf.string),
}),
'target': tf.string,
'target_aligned': Translation({
'en': Text(shape=(), dtype=tf.string),
'hi': Text(shape=(), dtype=tf.string),
}),
})
- Citation:
@inproceedings{ladhak-wiki-2020,
title={WikiLingua: A New Benchmark Dataset for Multilingual Abstractive Summarization},
author={Faisal Ladhak and Esin Durmus and Claire Cardie and Kathleen McKeown},
booktitle={Findings of EMNLP, 2020},
year={2020}
}
gem / wiki_lingua_indonesian_id
Config description: WikiLingua is a large-scale, multilingual dataset for the evaluation of cross-lingual abstractive summarization systems.
Download size:
80.08 MiB
Dataset size:
370.63 MiB
Auto-cached (documentation): No
Splits:
Split | Examples |
---|---|
'test' | 9,497 |
'train' | 33,237 |
'validation' | 4,747 |
- Features:
FeaturesDict({
'gem_id': tf.string,
'gem_parent_id': tf.string,
'references': Sequence(tf.string),
'source': tf.string,
'source_aligned': Translation({
'en': Text(shape=(), dtype=tf.string),
'id': Text(shape=(), dtype=tf.string),
}),
'target': tf.string,
'target_aligned': Translation({
'en': Text(shape=(), dtype=tf.string),
'id': Text(shape=(), dtype=tf.string),
}),
})
- Citation:
@inproceedings{ladhak-wiki-2020,
title={WikiLingua: A New Benchmark Dataset for Multilingual Abstractive Summarization},
author={Faisal Ladhak and Esin Durmus and Claire Cardie and Kathleen McKeown},
booktitle={Findings of EMNLP, 2020},
year={2020}
}
gem / wiki_lingua_italian_it
Config description: WikiLingua is a large-scale, multilingual dataset for the evaluation of cross-lingual abstractive summarization systems.
Download size:
84.80 MiB
Dataset size:
374.40 MiB
Auto-cached (documentation): No
Splits:
Split | Examples |
---|---|
'test' | 10,189 |
'train' | 35,661 |
'validation' | 5,093 |
- Features:
FeaturesDict({
'gem_id': tf.string,
'gem_parent_id': tf.string,
'references': Sequence(tf.string),
'source': tf.string,
'source_aligned': Translation({
'en': Text(shape=(), dtype=tf.string),
'it': Text(shape=(), dtype=tf.string),
}),
'target': tf.string,
'target_aligned': Translation({
'en': Text(shape=(), dtype=tf.string),
'it': Text(shape=(), dtype=tf.string),
}),
})
- Citation:
@inproceedings{ladhak-wiki-2020,
title={WikiLingua: A New Benchmark Dataset for Multilingual Abstractive Summarization},
author={Faisal Ladhak and Esin Durmus and Claire Cardie and Kathleen McKeown},
booktitle={Findings of EMNLP, 2020},
year={2020}
}
gem / wiki_lingua_japanese_ja
Config description: WikiLingua is a large-scale, multilingual dataset for the evaluation of cross-lingual abstractive summarization systems.
Download size:
21.75 MiB
Dataset size:
103.19 MiB
Auto-cached (documentation): Yes
Splits:
Split | Examples |
---|---|
'test' | 2,530 |
'train' | 8,853 |
'validation' | 1,264 |
- Features:
FeaturesDict({
'gem_id': tf.string,
'gem_parent_id': tf.string,
'references': Sequence(tf.string),
'source': tf.string,
'source_aligned': Translation({
'en': Text(shape=(), dtype=tf.string),
'ja': Text(shape=(), dtype=tf.string),
}),
'target': tf.string,
'target_aligned': Translation({
'en': Text(shape=(), dtype=tf.string),
'ja': Text(shape=(), dtype=tf.string),
}),
})
- Citation:
@inproceedings{ladhak-wiki-2020,
title={WikiLingua: A New Benchmark Dataset for Multilingual Abstractive Summarization},
author={Faisal Ladhak and Esin Durmus and Claire Cardie and Kathleen McKeown},
booktitle={Findings of EMNLP, 2020},
year={2020}
}
@article{gehrmann2021gem,
author = {Sebastian Gehrmann and
Tosin P. Adewumi and
Karmanya Aggarwal and
Pawan Sasanka Ammanamanchi and
Aremu Anuoluwapo and
Antoine Bosselut and
Khyathi Raghavi Chandu and
Miruna{-}Adriana Clinciu and
Dipanjan Das and
Kaustubh D. Dhole and
Wanyu Du and
Esin Durmus and
Ondrej Dusek and
Chris Emezue and
Varun Gangal and
Cristina Garbacea and
Tatsunori Hashimoto and
Yufang Hou and
Yacine Jernite and
Harsh Jhamtani and
Yangfeng Ji and
Shailza Jolly and
Dhruv Kumar and
Faisal Ladhak and
Aman Madaan and
Mounica Maddela and
Khyati Mahajan and
Saad Mahamood and
Bodhisattwa Prasad Majumder and
Pedro Henrique Martins and
Angelina McMillan{-}Major and
Simon Mille and
Emiel van Miltenburg and
Moin Nadeem and
Shashi Narayan and
Vitaly Nikolaev and
Rubungo Andre Niyongabo and
Salomey Osei and
Ankur P. Parikh and
Laura Perez{-}Beltrachini and
Niranjan Ramesh Rao and
Vikas Raunak and
Juan Diego Rodriguez and
Sashank Santhanam and
Jo{\~{a} }o Sedoc and
Thibault Sellam and
Samira Shaikh and
Anastasia Shimorina and
Marco Antonio Sobrevilla Cabezudo and
Hendrik Strobelt and
Nishant Subramani and
Wei Xu and
Diyi Yang and
Akhila Yerukola and
Jiawei Zhou},
title = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
Metrics},
journal = {CoRR},
volume = {abs/2102.01672},
year = {2021},
url = {https://arxiv.org/abs/2102.01672},
archivePrefix = {arXiv},
eprint = {2102.01672}
}
Note that each GEM dataset has its own citation. Please see the source to see
the correct citation for each contained dataset.
gem / wiki_lingua_korean_ko
Config description: WikiLingua is a large-scale, multilingual dataset for evaluating cross-lingual abstractive summarization systems.
Download size:
22.26 MiB
Dataset size:
102.35 MiB
Auto-cached (documentation): Yes
Splits:
Split | Examples |
---|---|
'test' | 2,436 |
'train' | 8,524 |
'validation' | 1,216 |
- Features:
FeaturesDict({
'gem_id': tf.string,
'gem_parent_id': tf.string,
'references': Sequence(tf.string),
'source': tf.string,
'source_aligned': Translation({
'en': Text(shape=(), dtype=tf.string),
'ko': Text(shape=(), dtype=tf.string),
}),
'target': tf.string,
'target_aligned': Translation({
'en': Text(shape=(), dtype=tf.string),
'ko': Text(shape=(), dtype=tf.string),
}),
})
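As a rough illustration of how the feature structure above might be read, the sketch below decodes one record in the form it would take after `tfds.as_numpy`, where string features arrive as `bytes` and the `Translation` features as dictionaries keyed by language code. The helper names `decode_text`, `summarize_record`, and `load_records`, and the sample record, are made up for this example. Only `tfds.load` and `tfds.as_numpy` are TFDS calls, and they are imported lazily so the decoding part runs without TFDS installed.

```python
from typing import Any, Dict, Iterator


def decode_text(value: Any) -> str:
    """Return a Python str for a feature that may arrive as UTF-8 bytes."""
    return value.decode("utf-8") if isinstance(value, bytes) else str(value)


def summarize_record(record: Dict[str, Any], lang: str = "ko") -> Dict[str, Any]:
    """Pull the main text fields out of one record (hypothetical helper)."""
    return {
        "gem_id": decode_text(record["gem_id"]),
        "source": decode_text(record["source"]),
        "target": decode_text(record["target"]),
        "references": [decode_text(r) for r in record["references"]],
        "source_en": decode_text(record["source_aligned"]["en"]),
        "source_lang": decode_text(record["source_aligned"][lang]),
    }


def load_records(split: str = "validation") -> Iterator[Dict[str, Any]]:
    """Yield decoded records; needs tensorflow_datasets and a data download."""
    import tensorflow_datasets as tfds  # imported lazily on purpose

    dataset = tfds.load("gem/wiki_lingua_korean_ko", split=split)
    for record in tfds.as_numpy(dataset):
        yield summarize_record(record)


# Made-up record with the same keys as the FeaturesDict above.
sample = {
    "gem_id": b"example-id-0",
    "gem_parent_id": b"example-id-0",
    "references": [b"A short summary."],
    "source": "예시 문장".encode("utf-8"),
    "source_aligned": {"en": b"Example sentence", "ko": "예시 문장".encode("utf-8")},
    "target": b"A short summary.",
    "target_aligned": {"en": b"A short summary.", "ko": "짧은 요약".encode("utf-8")},
}

summary = summarize_record(sample)
print(summary["source_lang"], "->", summary["target"])
```

The same helper should work for the other WikiLingua configs by passing the matching language code (for example `lang="pt"`), since the feature layout differs only in that key.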
- Examples (tfds.as_dataframe):
- Citation:
@inproceedings{ladhak-wiki-2020,
title={WikiLingua: A New Benchmark Dataset for Multilingual Abstractive Summarization},
author={Faisal Ladhak and Esin Durmus and Claire Cardie and Kathleen McKeown},
booktitle={Findings of EMNLP, 2020},
year={2020}
}
@article{gehrmann2021gem,
author = {Sebastian Gehrmann and
Tosin P. Adewumi and
Karmanya Aggarwal and
Pawan Sasanka Ammanamanchi and
Aremu Anuoluwapo and
Antoine Bosselut and
Khyathi Raghavi Chandu and
Miruna{-}Adriana Clinciu and
Dipanjan Das and
Kaustubh D. Dhole and
Wanyu Du and
Esin Durmus and
Ondrej Dusek and
Chris Emezue and
Varun Gangal and
Cristina Garbacea and
Tatsunori Hashimoto and
Yufang Hou and
Yacine Jernite and
Harsh Jhamtani and
Yangfeng Ji and
Shailza Jolly and
Dhruv Kumar and
Faisal Ladhak and
Aman Madaan and
Mounica Maddela and
Khyati Mahajan and
Saad Mahamood and
Bodhisattwa Prasad Majumder and
Pedro Henrique Martins and
Angelina McMillan{-}Major and
Simon Mille and
Emiel van Miltenburg and
Moin Nadeem and
Shashi Narayan and
Vitaly Nikolaev and
Rubungo Andre Niyongabo and
Salomey Osei and
Ankur P. Parikh and
Laura Perez{-}Beltrachini and
Niranjan Ramesh Rao and
Vikas Raunak and
Juan Diego Rodriguez and
Sashank Santhanam and
Jo{\~{a}}o Sedoc and
Thibault Sellam and
Samira Shaikh and
Anastasia Shimorina and
Marco Antonio Sobrevilla Cabezudo and
Hendrik Strobelt and
Nishant Subramani and
Wei Xu and
Diyi Yang and
Akhila Yerukola and
Jiawei Zhou},
title = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
Metrics},
journal = {CoRR},
volume = {abs/2102.01672},
year = {2021},
url = {https://arxiv.org/abs/2102.01672},
archivePrefix = {arXiv},
eprint = {2102.01672}
}
Note that each GEM dataset has its own citation. Please see the source to see
the correct citation for each contained dataset.
gem / wiki_lingua_portuguese_pt
Config description: WikiLingua is a large-scale, multilingual dataset for evaluating cross-lingual abstractive summarization systems.
Download size:
131.17 MiB
Dataset size:
570.46 MiB
Auto-cached (documentation): No
Splits:
Split | Examples |
---|---|
'test' | 16,331 |
'train' | 57,159 |
'validation' | 8,165 |
- Features:
FeaturesDict({
'gem_id': tf.string,
'gem_parent_id': tf.string,
'references': Sequence(tf.string),
'source': tf.string,
'source_aligned': Translation({
'en': Text(shape=(), dtype=tf.string),
'pt': Text(shape=(), dtype=tf.string),
}),
'target': tf.string,
'target_aligned': Translation({
'en': Text(shape=(), dtype=tf.string),
'pt': Text(shape=(), dtype=tf.string),
}),
})
- Examples (tfds.as_dataframe):
- Citation:
@inproceedings{ladhak-wiki-2020,
title={WikiLingua: A New Benchmark Dataset for Multilingual Abstractive Summarization},
author={Faisal Ladhak and Esin Durmus and Claire Cardie and Kathleen McKeown},
booktitle={Findings of EMNLP, 2020},
year={2020}
}
@article{gehrmann2021gem,
author = {Sebastian Gehrmann and
Tosin P. Adewumi and
Karmanya Aggarwal and
Pawan Sasanka Ammanamanchi and
Aremu Anuoluwapo and
Antoine Bosselut and
Khyathi Raghavi Chandu and
Miruna{-}Adriana Clinciu and
Dipanjan Das and
Kaustubh D. Dhole and
Wanyu Du and
Esin Durmus and
Ondrej Dusek and
Chris Emezue and
Varun Gangal and
Cristina Garbacea and
Tatsunori Hashimoto and
Yufang Hou and
Yacine Jernite and
Harsh Jhamtani and
Yangfeng Ji and
Shailza Jolly and
Dhruv Kumar and
Faisal Ladhak and
Aman Madaan and
Mounica Maddela and
Khyati Mahajan and
Saad Mahamood and
Bodhisattwa Prasad Majumder and
Pedro Henrique Martins and
Angelina McMillan{-}Major and
Simon Mille and
Emiel van Miltenburg and
Moin Nadeem and
Shashi Narayan and
Vitaly Nikolaev and
Rubungo Andre Niyongabo and
Salomey Osei and
Ankur P. Parikh and
Laura Perez{-}Beltrachini and
Niranjan Ramesh Rao and
Vikas Raunak and
Juan Diego Rodriguez and
Sashank Santhanam and
Jo{\~{a}}o Sedoc and
Thibault Sellam and
Samira Shaikh and
Anastasia Shimorina and
Marco Antonio Sobrevilla Cabezudo and
Hendrik Strobelt and
Nishant Subramani and
Wei Xu and
Diyi Yang and
Akhila Yerukola and
Jiawei Zhou},
title = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
Metrics},
journal = {CoRR},
volume = {abs/2102.01672},
year = {2021},
url = {https://arxiv.org/abs/2102.01672},
archivePrefix = {arXiv},
eprint = {2102.01672}
}
Note that each GEM dataset has its own citation. Please see the source to see
the correct citation for each contained dataset.
gem / wiki_lingua_russian_ru
Config description: WikiLingua is a large-scale, multilingual dataset for evaluating cross-lingual abstractive summarization systems.
Download size:
101.36 MiB
Dataset size:
564.69 MiB
Auto-cached (documentation): No
Splits:
Split | Examples |
---|---|
'test' | 10,580 |
'train' | 37,028 |
'validation' | 5,288 |
- Features:
FeaturesDict({
'gem_id': tf.string,
'gem_parent_id': tf.string,
'references': Sequence(tf.string),
'source': tf.string,
'source_aligned': Translation({
'en': Text(shape=(), dtype=tf.string),
'ru': Text(shape=(), dtype=tf.string),
}),
'target': tf.string,
'target_aligned': Translation({
'en': Text(shape=(), dtype=tf.string),
'ru': Text(shape=(), dtype=tf.string),
}),
})
- Examples (tfds.as_dataframe):
- Citation:
@inproceedings{ladhak-wiki-2020,
title={WikiLingua: A New Benchmark Dataset for Multilingual Abstractive Summarization},
author={Faisal Ladhak and Esin Durmus and Claire Cardie and Kathleen McKeown},
booktitle={Findings of EMNLP, 2020},
year={2020}
}
@article{gehrmann2021gem,
author = {Sebastian Gehrmann and
Tosin P. Adewumi and
Karmanya Aggarwal and
Pawan Sasanka Ammanamanchi and
Aremu Anuoluwapo and
Antoine Bosselut and
Khyathi Raghavi Chandu and
Miruna{-}Adriana Clinciu and
Dipanjan Das and
Kaustubh D. Dhole and
Wanyu Du and
Esin Durmus and
Ondrej Dusek and
Chris Emezue and
Varun Gangal and
Cristina Garbacea and
Tatsunori Hashimoto and
Yufang Hou and
Yacine Jernite and
Harsh Jhamtani and
Yangfeng Ji and
Shailza Jolly and
Dhruv Kumar and
Faisal Ladhak and
Aman Madaan and
Mounica Maddela and
Khyati Mahajan and
Saad Mahamood and
Bodhisattwa Prasad Majumder and
Pedro Henrique Martins and
Angelina McMillan{-}Major and
Simon Mille and
Emiel van Miltenburg and
Moin Nadeem and
Shashi Narayan and
Vitaly Nikolaev and
Rubungo Andre Niyongabo and
Salomey Osei and
Ankur P. Parikh and
Laura Perez{-}Beltrachini and
Niranjan Ramesh Rao and
Vikas Raunak and
Juan Diego Rodriguez and
Sashank Santhanam and
Jo{\~{a}}o Sedoc and
Thibault Sellam and
Samira Shaikh and
Anastasia Shimorina and
Marco Antonio Sobrevilla Cabezudo and
Hendrik Strobelt and
Nishant Subramani and
Wei Xu and
Diyi Yang and
Akhila Yerukola and
Jiawei Zhou},
title = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
Metrics},
journal = {CoRR},
volume = {abs/2102.01672},
year = {2021},
url = {https://arxiv.org/abs/2102.01672},
archivePrefix = {arXiv},
eprint = {2102.01672}
}
Note that each GEM dataset has its own citation. Please see the source to see
the correct citation for each contained dataset.
gem / wiki_lingua_spanish_es
Config description: WikiLingua is a large-scale, multilingual dataset for evaluating cross-lingual abstractive summarization systems.
Download size:
189.06 MiB
Dataset size:
849.75 MiB
Auto-cached (documentation): No
Splits:
Split | Examples |
---|---|
'test' | 22,632 |
'train' | 79,212 |
'validation' | 11,316 |
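The split sizes in the table above can be turned into proportions with a few lines of arithmetic. The sketch below uses only the three counts listed for this config; the variable names are chosen for illustration. It suggests the data is divided roughly 70% train, 10% validation, and 20% test, though the page itself does not state how the splits were drawn.

```python
# Example counts copied from the splits table for gem/wiki_lingua_spanish_es.
splits = {"test": 22_632, "train": 79_212, "validation": 11_316}

total = sum(splits.values())
fractions = {name: count / total for name, count in splits.items()}

print(f"total examples: {total}")
for name, fraction in fractions.items():
    print(f"{name}: {fraction:.1%}")
```

The same check on the other WikiLingua configs gives very similar proportions, which also confirms that the counts are whole numbers in the thousands rather than decimals.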
- Features:
FeaturesDict({
'gem_id': tf.string,
'gem_parent_id': tf.string,
'references': Sequence(tf.string),
'source': tf.string,
'source_aligned': Translation({
'en': Text(shape=(), dtype=tf.string),
'es': Text(shape=(), dtype=tf.string),
}),
'target': tf.string,
'target_aligned': Translation({
'en': Text(shape=(), dtype=tf.string),
'es': Text(shape=(), dtype=tf.string),
}),
})
- Examples (tfds.as_dataframe):
- Citation:
@inproceedings{ladhak-wiki-2020,
title={WikiLingua: A New Benchmark Dataset for Multilingual Abstractive Summarization},
author={Faisal Ladhak and Esin Durmus and Claire Cardie and Kathleen McKeown},
booktitle={Findings of EMNLP, 2020},
year={2020}
}
@article{gehrmann2021gem,
author = {Sebastian Gehrmann and
Tosin P. Adewumi and
Karmanya Aggarwal and
Pawan Sasanka Ammanamanchi and
Aremu Anuoluwapo and
Antoine Bosselut and
Khyathi Raghavi Chandu and
Miruna{-}Adriana Clinciu and
Dipanjan Das and
Kaustubh D. Dhole and
Wanyu Du and
Esin Durmus and
Ondrej Dusek and
Chris Emezue and
Varun Gangal and
Cristina Garbacea and
Tatsunori Hashimoto and
Yufang Hou and
Yacine Jernite and
Harsh Jhamtani and
Yangfeng Ji and
Shailza Jolly and
Dhruv Kumar and
Faisal Ladhak and
Aman Madaan and
Mounica Maddela and
Khyati Mahajan and
Saad Mahamood and
Bodhisattwa Prasad Majumder and
Pedro Henrique Martins and
Angelina McMillan{-}Major and
Simon Mille and
Emiel van Miltenburg and
Moin Nadeem and
Shashi Narayan and
Vitaly Nikolaev and
Rubungo Andre Niyongabo and
Salomey Osei and
Ankur P. Parikh and
Laura Perez{-}Beltrachini and
Niranjan Ramesh Rao and
Vikas Raunak and
Juan Diego Rodriguez and
Sashank Santhanam and
Jo{\~{a}}o Sedoc and
Thibault Sellam and
Samira Shaikh and
Anastasia Shimorina and
Marco Antonio Sobrevilla Cabezudo and
Hendrik Strobelt and
Nishant Subramani and
Wei Xu and
Diyi Yang and
Akhila Yerukola and
Jiawei Zhou},
title = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
Metrics},
journal = {CoRR},
volume = {abs/2102.01672},
year = {2021},
url = {https://arxiv.org/abs/2102.01672},
archivePrefix = {arXiv},
eprint = {2102.01672}
}
Note that each GEM dataset has its own citation. Please see the source to see
the correct citation for each contained dataset.
gem / wiki_lingua_thai_th
Config description: WikiLingua is a large-scale, multilingual dataset for evaluating cross-lingual abstractive summarization systems.
Download size:
28.60 MiB
Dataset size:
193.77 MiB
Auto-cached (documentation): Yes (test, validation), Only when shuffle_files=False (train)
Splits:
Split | Examples |
---|---|
'test' | 2,950 |
'train' | 10,325 |
'validation' | 1,475 |
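The auto-caching note for this config says the train split is cached only when `shuffle_files=False`. A minimal sketch of how a caller might make that choice explicit is shown below. The helper names `load_kwargs` and `load_split` are hypothetical, and `tfds.load` is imported lazily so the argument-building part can be run without TFDS installed. Since `shuffle_files` already defaults to `False` in `tfds.load`, the helper mainly documents the intent.

```python
from typing import Any, Dict

CONFIG_NAME = "gem/wiki_lingua_thai_th"


def load_kwargs(split: str) -> Dict[str, Any]:
    """Build keyword arguments for tfds.load (hypothetical helper)."""
    kwargs: Dict[str, Any] = {"name": CONFIG_NAME, "split": split}
    if split == "train":
        # Per the catalog note, the train split is auto-cached only
        # when files are not shuffled.
        kwargs["shuffle_files"] = False
    return kwargs


def load_split(split: str):
    """Load one split; needs tensorflow_datasets and a data download."""
    import tensorflow_datasets as tfds  # imported lazily on purpose

    return tfds.load(**load_kwargs(split))


print(load_kwargs("train"))
print(load_kwargs("test"))
```

If shuffling the training data is still wanted, one option is to shuffle examples afterwards with `tf.data.Dataset.shuffle` rather than shuffling input files.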
- Features:
FeaturesDict({
'gem_id': tf.string,
'gem_parent_id': tf.string,
'references': Sequence(tf.string),
'source': tf.string,
'source_aligned': Translation({
'en': Text(shape=(), dtype=tf.string),
'th': Text(shape=(), dtype=tf.string),
}),
'target': tf.string,
'target_aligned': Translation({
'en': Text(shape=(), dtype=tf.string),
'th': Text(shape=(), dtype=tf.string),
}),
})
- Examples (tfds.as_dataframe):
- Citation:
@inproceedings{ladhak-wiki-2020,
title={WikiLingua: A New Benchmark Dataset for Multilingual Abstractive Summarization},
author={Faisal Ladhak and Esin Durmus and Claire Cardie and Kathleen McKeown},
booktitle={Findings of EMNLP, 2020},
year={2020}
}
@article{gehrmann2021gem,
author = {Sebastian Gehrmann and
Tosin P. Adewumi and
Karmanya Aggarwal and
Pawan Sasanka Ammanamanchi and
Aremu Anuoluwapo and
Antoine Bosselut and
Khyathi Raghavi Chandu and
Miruna{-}Adriana Clinciu and
Dipanjan Das and
Kaustubh D. Dhole and
Wanyu Du and
Esin Durmus and
Ondrej Dusek and
Chris Emezue and
Varun Gangal and
Cristina Garbacea and
Tatsunori Hashimoto and
Yufang Hou and
Yacine Jernite and
Harsh Jhamtani and
Yangfeng Ji and
Shailza Jolly and
Dhruv Kumar and
Faisal Ladhak and
Aman Madaan and
Mounica Maddela and
Khyati Mahajan and
Saad Mahamood and
Bodhisattwa Prasad Majumder and
Pedro Henrique Martins and
Angelina McMillan{-}Major and
Simon Mille and
Emiel van Miltenburg and
Moin Nadeem and
Shashi Narayan and
Vitaly Nikolaev and
Rubungo Andre Niyongabo and
Salomey Osei and
Ankur P. Parikh and
Laura Perez{-}Beltrachini and
Niranjan Ramesh Rao and
Vikas Raunak and
Juan Diego Rodriguez and
Sashank Santhanam and
Jo{\~{a}}o Sedoc and
Thibault Sellam and
Samira Shaikh and
Anastasia Shimorina and
Marco Antonio Sobrevilla Cabezudo and
Hendrik Strobelt and
Nishant Subramani and
Wei Xu and
Diyi Yang and
Akhila Yerukola and
Jiawei Zhou},
title = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
Metrics},
journal = {CoRR},
volume = {abs/2102.01672},
year = {2021},
url = {https://arxiv.org/abs/2102.01672},
archivePrefix = {arXiv},
eprint = {2102.01672}
}
Note that each GEM dataset has its own citation. Please see the source to see
the correct citation for each contained dataset.
gem / wiki_lingua_turkish_tr
Config description: WikiLingua is a large-scale, multilingual dataset for evaluating cross-lingual abstractive summarization systems.
Download size:
6.73 MiB
Dataset size:
30.75 MiB
Auto-cached (documentation): Yes
Splits:
Split | Examples |
---|---|
'test' | 900 |
'train' | 3,148 |
'validation' | 449 |
- Features:
FeaturesDict({
'gem_id': tf.string,
'gem_parent_id': tf.string,
'references': Sequence(tf.string),
'source': tf.string,
'source_aligned': Translation({
'en': Text(shape=(), dtype=tf.string),
'tr': Text(shape=(), dtype=tf.string),
}),
'target': tf.string,
'target_aligned': Translation({
'en': Text(shape=(), dtype=tf.string),
'tr': Text(shape=(), dtype=tf.string),
}),
})
- Examples (tfds.as_dataframe):
- Citation:
@inproceedings{ladhak-wiki-2020,
title={WikiLingua: A New Benchmark Dataset for Multilingual Abstractive Summarization},
author={Faisal Ladhak and Esin Durmus and Claire Cardie and Kathleen McKeown},
booktitle={Findings of EMNLP, 2020},
year={2020}
}
@article{gehrmann2021gem,
author = {Sebastian Gehrmann and
Tosin P. Adewumi and
Karmanya Aggarwal and
Pawan Sasanka Ammanamanchi and
Aremu Anuoluwapo and
Antoine Bosselut and
Khyathi Raghavi Chandu and
Miruna{-}Adriana Clinciu and
Dipanjan Das and
Kaustubh D. Dhole and
Wanyu Du and
Esin Durmus and
Ondrej Dusek and
Chris Emezue and
Varun Gangal and
Cristina Garbacea and
Tatsunori Hashimoto and
Yufang Hou and
Yacine Jernite and
Harsh Jhamtani and
Yangfeng Ji and
Shailza Jolly and
Dhruv Kumar and
Faisal Ladhak and
Aman Madaan and
Mounica Maddela and
Khyati Mahajan and
Saad Mahamood and
Bodhisattwa Prasad Majumder and
Pedro Henrique Martins and
Angelina McMillan{-}Major and
Simon Mille and
Emiel van Miltenburg and
Moin Nadeem and
Shashi Narayan and
Vitaly Nikolaev and
Rubungo Andre Niyongabo and
Salomey Osei and
Ankur P. Parikh and
Laura Perez{-}Beltrachini and
Niranjan Ramesh Rao and
Vikas Raunak and
Juan Diego Rodriguez and
Sashank Santhanam and
Jo{\~{a}}o Sedoc and
Thibault Sellam and
Samira Shaikh and
Anastasia Shimorina and
Marco Antonio Sobrevilla Cabezudo and
Hendrik Strobelt and
Nishant Subramani and
Wei Xu and
Diyi Yang and
Akhila Yerukola and
Jiawei Zhou},
title = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
Metrics},
journal = {CoRR},
volume = {abs/2102.01672},
year = {2021},
url = {https://arxiv.org/abs/2102.01672},
archivePrefix = {arXiv},
eprint = {2102.01672}
}
Note that each GEM dataset has its own citation. Please see the source to see
the correct citation for each contained dataset.
gem / wiki_lingua_vietnamese_vi
Config description: WikiLingua is a large-scale, multilingual dataset for evaluating cross-lingual abstractive summarization systems.
Download size:
36.27 MiB
Dataset size:
179.77 MiB
Auto-cached (documentation): Yes
Splits:
Split | Examples |
---|---|
'test' | 3,917 |
'train' | 13,707 |
'validation' | 1,957 |
- Features:
FeaturesDict({
'gem_id': tf.string,
'gem_parent_id': tf.string,
'references': Sequence(tf.string),
'source': tf.string,
'source_aligned': Translation({
'en': Text(shape=(), dtype=tf.string),
'vi': Text(shape=(), dtype=tf.string),
}),
'target': tf.string,
'target_aligned': Translation({
'en': Text(shape=(), dtype=tf.string),
'vi': Text(shape=(), dtype=tf.string),
}),
})
- Examples (tfds.as_dataframe):
- Citation:
@inproceedings{ladhak-wiki-2020,
title={WikiLingua: A New Benchmark Dataset for Multilingual Abstractive Summarization},
author={Faisal Ladhak and Esin Durmus and Claire Cardie and Kathleen McKeown},
booktitle={Findings of EMNLP, 2020},
year={2020}
}
@article{gehrmann2021gem,
author = {Sebastian Gehrmann and
Tosin P. Adewumi and
Karmanya Aggarwal and
Pawan Sasanka Ammanamanchi and
Aremu Anuoluwapo and
Antoine Bosselut and
Khyathi Raghavi Chandu and
Miruna{-}Adriana Clinciu and
Dipanjan Das and
Kaustubh D. Dhole and
Wanyu Du and
Esin Durmus and
Ondrej Dusek and
Chris Emezue and
Varun Gangal and
Cristina Garbacea and
Tatsunori Hashimoto and
Yufang Hou and
Yacine Jernite and
Harsh Jhamtani and
Yangfeng Ji and
Shailza Jolly and
Dhruv Kumar and
Faisal Ladhak and
Aman Madaan and
Mounica Maddela and
Khyati Mahajan and
Saad Mahamood and
Bodhisattwa Prasad Majumder and
Pedro Henrique Martins and
Angelina McMillan{-}Major and
Simon Mille and
Emiel van Miltenburg and
Moin Nadeem and
Shashi Narayan and
Vitaly Nikolaev and
Rubungo Andre Niyongabo and
Salomey Osei and
Ankur P. Parikh and
Laura Perez{-}Beltrachini and
Niranjan Ramesh Rao and
Vikas Raunak and
Juan Diego Rodriguez and
Sashank Santhanam and
Jo{\~{a}}o Sedoc and
Thibault Sellam and
Samira Shaikh and
Anastasia Shimorina and
Marco Antonio Sobrevilla Cabezudo and
Hendrik Strobelt and
Nishant Subramani and
Wei Xu and
Diyi Yang and
Akhila Yerukola and
Jiawei Zhou},
title = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
Metrics},
journal = {CoRR},
volume = {abs/2102.01672},
year = {2021},
url = {https://arxiv.org/abs/2102.01672},
archivePrefix = {arXiv},
eprint = {2102.01672}
}
Note that each GEM dataset has its own citation. Please see the source to see
the correct citation for each contained dataset.