- توضیحات :
GEM یک محیط معیار برای تولید زبان طبیعی با تمرکز بر ارزیابی آن، هم از طریق حاشیه نویسی های انسانی و هم از طریق معیارهای خودکار است.
هدف GEM این است: (1) پیشرفت NLG را در 13 مجموعه داده که بسیاری از وظایف و زبان های NLG را در بر می گیرد، اندازه گیری کند. (2) یک تجزیه و تحلیل عمیق از داده ها و مدل های ارائه شده از طریق بیانیه های داده و مجموعه های چالش ارائه می کند. (3) استانداردهایی را برای ارزیابی متن تولید شده با استفاده از معیارهای خودکار و انسانی ایجاد کنید.
اطلاعات بیشتر را می توانید در https://gem-benchmark.com بیابید .
اسناد اضافی : کاوش در کاغذها با کد
صفحه اصلی : https://gem-benchmark.com
کد منبع :
tfds.text.gem.Gem
نسخه ها :
-
1.0.0
: نسخه اولیه -
1.0.1
: به روز رسانی فیلتر لینک های بد برای MLSum -
1.1.0
(پیشفرض): انتشار مجموعههای چالش
-
کلیدهای نظارت شده (به
as_supervised
doc مراجعه کنید):None
شکل ( tfds.show_examples ): پشتیبانی نمی شود.
gem/common_gen (پیکربندی پیشفرض)
توضیحات پیکربندی : CommonGen یک وظیفه تولید متن محدود است که با مجموعه داده های معیار مرتبط است تا به طور صریح ماشین ها را برای توانایی استدلال مولد عقل سلیم آزمایش کند. با توجه به مجموعه ای از مفاهیم رایج؛ وظیفه تولید یک جمله منسجم برای توصیف یک سناریوی روزمره با استفاده از این مفاهیم است.
حجم دانلود :
1.84 MiB
حجم مجموعه داده :
16.84 MiB
ذخیره خودکار ( اسناد ): بله
تقسیم ها :
شکاف | مثال ها |
---|---|
'challenge_test_scramble' | 500 |
'challenge_train_sample' | 500 |
'challenge_validation_sample' | 500 |
'test' | 1,497 |
'train' | 67,389 |
'validation' | 993 |
- ساختار ویژگی :
FeaturesDict({
'concept_set_id': int32,
'concepts': Sequence(string),
'gem_id': string,
'gem_parent_id': string,
'references': Sequence(string),
'target': string,
})
- مستندات ویژگی :
ویژگی | کلاس | شکل | نوع D | شرح |
---|---|---|---|---|
FeaturesDict | ||||
concept_set_id | تانسور | int32 | ||
مفاهیم | دنباله (تنسور) | (هیچ یک،) | رشته | |
gem_id | تانسور | رشته | ||
gem_parent_id | تانسور | رشته | ||
منابع | دنباله (تنسور) | (هیچ یک،) | رشته | |
هدف | تانسور | رشته |
- مثالها ( tfds.as_dataframe ):
- نقل قول :
@inproceedings{lin2020commongen,
title = "CommonGen: A Constrained Text Generation Challenge for Generative Commonsense Reasoning",
author = "Lin, Bill Yuchen and
Zhou, Wangchunshu and
Shen, Ming and
Zhou, Pei and
Bhagavatula, Chandra and
Choi, Yejin and
Ren, Xiang",
booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2020",
month = nov,
year = "2020",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/2020.findings-emnlp.165",
pages = "1823--1840",
}
@article{gehrmann2021gem,
author = {Sebastian Gehrmann and
Tosin P. Adewumi and
Karmanya Aggarwal and
Pawan Sasanka Ammanamanchi and
Aremu Anuoluwapo and
Antoine Bosselut and
Khyathi Raghavi Chandu and
Miruna{-}Adriana Clinciu and
Dipanjan Das and
Kaustubh D. Dhole and
Wanyu Du and
Esin Durmus and
Ondrej Dusek and
Chris Emezue and
Varun Gangal and
Cristina Garbacea and
Tatsunori Hashimoto and
Yufang Hou and
Yacine Jernite and
Harsh Jhamtani and
Yangfeng Ji and
Shailza Jolly and
Dhruv Kumar and
Faisal Ladhak and
Aman Madaan and
Mounica Maddela and
Khyati Mahajan and
Saad Mahamood and
Bodhisattwa Prasad Majumder and
Pedro Henrique Martins and
Angelina McMillan{-}Major and
Simon Mille and
Emiel van Miltenburg and
Moin Nadeem and
Shashi Narayan and
Vitaly Nikolaev and
Rubungo Andre Niyongabo and
Salomey Osei and
Ankur P. Parikh and
Laura Perez{-}Beltrachini and
Niranjan Ramesh Rao and
Vikas Raunak and
Juan Diego Rodriguez and
Sashank Santhanam and
Jo{\~{a} }o Sedoc and
Thibault Sellam and
Samira Shaikh and
Anastasia Shimorina and
Marco Antonio Sobrevilla Cabezudo and
Hendrik Strobelt and
Nishant Subramani and
Wei Xu and
Diyi Yang and
Akhila Yerukola and
Jiawei Zhou},
title = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
Metrics},
journal = {CoRR},
volume = {abs/2102.01672},
year = {2021},
url = {https://arxiv.org/abs/2102.01672},
archivePrefix = {arXiv},
eprint = {2102.01672}
}
Note that each GEM dataset has its own citation. Please see the source to see
the correct citation for each contained dataset."
gem/cs_restaurants
شرح پیکربندی : وظیفه ایجاد پاسخ در زمینه یک سیستم گفتگوی (فرضی) است که اطلاعاتی در مورد رستوران ها ارائه می دهد. ورودی یک نوع عمل هدف/گفتگوی اساسی و فهرستی از اسلات ها (ویژگی ها) و مقادیر آنهاست. خروجی یک جمله زبان طبیعی است.
حجم دانلود :
1.46 MiB
حجم مجموعه داده :
2.71 MiB
ذخیره خودکار ( اسناد ): بله
تقسیم ها :
شکاف | مثال ها |
---|---|
'challenge_test_scramble' | 500 |
'challenge_train_sample' | 500 |
'challenge_validation_sample' | 500 |
'test' | 842 |
'train' | 3,569 |
'validation' | 781 |
- ساختار ویژگی :
FeaturesDict({
'dialog_act': string,
'dialog_act_delexicalized': string,
'gem_id': string,
'gem_parent_id': string,
'references': Sequence(string),
'target': string,
'target_delexicalized': string,
})
- مستندات ویژگی :
ویژگی | کلاس | شکل | نوع D | شرح |
---|---|---|---|---|
FeaturesDict | ||||
دیالوگ_عمل | تانسور | رشته | ||
dialog_act_delexicalized | تانسور | رشته | ||
gem_id | تانسور | رشته | ||
gem_parent_id | تانسور | رشته | ||
منابع | دنباله (تنسور) | (هیچ یک،) | رشته | |
هدف | تانسور | رشته | ||
target_delexicalized | تانسور | رشته |
- مثالها ( tfds.as_dataframe ):
- نقل قول :
@inproceedings{cs_restaurants,
address = {Tokyo, Japan},
title = {Neural {Generation} for {Czech}: {Data} and {Baselines} },
shorttitle = {Neural {Generation} for {Czech} },
url = {https://www.aclweb.org/anthology/W19-8670/},
urldate = {2019-10-18},
booktitle = {Proceedings of the 12th {International} {Conference} on {Natural} {Language} {Generation} ({INLG} 2019)},
author = {Dušek, Ondřej and Jurčíček, Filip},
month = oct,
year = {2019},
pages = {563--574}
}
@article{gehrmann2021gem,
author = {Sebastian Gehrmann and
Tosin P. Adewumi and
Karmanya Aggarwal and
Pawan Sasanka Ammanamanchi and
Aremu Anuoluwapo and
Antoine Bosselut and
Khyathi Raghavi Chandu and
Miruna{-}Adriana Clinciu and
Dipanjan Das and
Kaustubh D. Dhole and
Wanyu Du and
Esin Durmus and
Ondrej Dusek and
Chris Emezue and
Varun Gangal and
Cristina Garbacea and
Tatsunori Hashimoto and
Yufang Hou and
Yacine Jernite and
Harsh Jhamtani and
Yangfeng Ji and
Shailza Jolly and
Dhruv Kumar and
Faisal Ladhak and
Aman Madaan and
Mounica Maddela and
Khyati Mahajan and
Saad Mahamood and
Bodhisattwa Prasad Majumder and
Pedro Henrique Martins and
Angelina McMillan{-}Major and
Simon Mille and
Emiel van Miltenburg and
Moin Nadeem and
Shashi Narayan and
Vitaly Nikolaev and
Rubungo Andre Niyongabo and
Salomey Osei and
Ankur P. Parikh and
Laura Perez{-}Beltrachini and
Niranjan Ramesh Rao and
Vikas Raunak and
Juan Diego Rodriguez and
Sashank Santhanam and
Jo{\~{a} }o Sedoc and
Thibault Sellam and
Samira Shaikh and
Anastasia Shimorina and
Marco Antonio Sobrevilla Cabezudo and
Hendrik Strobelt and
Nishant Subramani and
Wei Xu and
Diyi Yang and
Akhila Yerukola and
Jiawei Zhou},
title = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
Metrics},
journal = {CoRR},
volume = {abs/2102.01672},
year = {2021},
url = {https://arxiv.org/abs/2102.01672},
archivePrefix = {arXiv},
eprint = {2102.01672}
}
Note that each GEM dataset has its own citation. Please see the source to see
the correct citation for each contained dataset."
گوهر/دارت
توضیحات پیکربندی : DART یک مجموعه بزرگ و با دامنه باز ساختار دادهای است که ضبط به متن تولید میکند با حاشیهنویسی جملات با کیفیت بالا که هر ورودی مجموعهای از سهگانههای رابطه موجودیت است که به دنبال یک هستیشناسی با ساختار درختی هستند.
حجم دانلود :
28.01 MiB
حجم مجموعه داده :
33.78 MiB
ذخیره خودکار ( اسناد ): بله
تقسیم ها :
شکاف | مثال ها |
---|---|
'test' | 6959 |
'train' | 62659 |
'validation' | 2768 |
- ساختار ویژگی :
FeaturesDict({
'dart_id': int32,
'gem_id': string,
'gem_parent_id': string,
'references': Sequence(string),
'subtree_was_extended': bool,
'target': string,
'target_sources': Sequence(string),
'tripleset': Sequence(string),
})
- مستندات ویژگی :
ویژگی | کلاس | شکل | نوع D | شرح |
---|---|---|---|---|
FeaturesDict | ||||
dart_id | تانسور | int32 | ||
gem_id | تانسور | رشته | ||
gem_parent_id | تانسور | رشته | ||
منابع | دنباله (تنسور) | (هیچ یک،) | رشته | |
زیردرخت_توسعه_شد | تانسور | بوول | ||
هدف | تانسور | رشته | ||
target_sources | دنباله (تنسور) | (هیچ یک،) | رشته | |
سه گانه | دنباله (تنسور) | (هیچ یک،) | رشته |
- مثالها ( tfds.as_dataframe ):
- نقل قول :
@article{radev2020dart,
title=Dart: Open-domain structured data record to text generation,
author={Radev, Dragomir and Zhang, Rui and Rau, Amrit and Sivaprasad, Abhinand and Hsieh, Chiachun and Rajani, Nazneen Fatema and Tang, Xiangru and Vyas, Aadit and Verma, Neha and Krishna, Pranav and others},
journal={arXiv preprint arXiv:2007.02871},
year={2020}
}
}
@article{gehrmann2021gem,
author = {Sebastian Gehrmann and
Tosin P. Adewumi and
Karmanya Aggarwal and
Pawan Sasanka Ammanamanchi and
Aremu Anuoluwapo and
Antoine Bosselut and
Khyathi Raghavi Chandu and
Miruna{-}Adriana Clinciu and
Dipanjan Das and
Kaustubh D. Dhole and
Wanyu Du and
Esin Durmus and
Ondrej Dusek and
Chris Emezue and
Varun Gangal and
Cristina Garbacea and
Tatsunori Hashimoto and
Yufang Hou and
Yacine Jernite and
Harsh Jhamtani and
Yangfeng Ji and
Shailza Jolly and
Dhruv Kumar and
Faisal Ladhak and
Aman Madaan and
Mounica Maddela and
Khyati Mahajan and
Saad Mahamood and
Bodhisattwa Prasad Majumder and
Pedro Henrique Martins and
Angelina McMillan{-}Major and
Simon Mille and
Emiel van Miltenburg and
Moin Nadeem and
Shashi Narayan and
Vitaly Nikolaev and
Rubungo Andre Niyongabo and
Salomey Osei and
Ankur P. Parikh and
Laura Perez{-}Beltrachini and
Niranjan Ramesh Rao and
Vikas Raunak and
Juan Diego Rodriguez and
Sashank Santhanam and
Jo{\~{a} }o Sedoc and
Thibault Sellam and
Samira Shaikh and
Anastasia Shimorina and
Marco Antonio Sobrevilla Cabezudo and
Hendrik Strobelt and
Nishant Subramani and
Wei Xu and
Diyi Yang and
Akhila Yerukola and
Jiawei Zhou},
title = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
Metrics},
journal = {CoRR},
volume = {abs/2102.01672},
year = {2021},
url = {https://arxiv.org/abs/2102.01672},
archivePrefix = {arXiv},
eprint = {2102.01672}
}
Note that each GEM dataset has its own citation. Please see the source to see
the correct citation for each contained dataset."
gem/e2e_nlg
توضیحات پیکربندی : مجموعه داده E2E برای یک کار داده به متن با دامنه محدود طراحی شده است - تولید توضیحات/توصیه های رستوران بر اساس حداکثر 8 ویژگی مختلف (نام، منطقه، محدوده قیمت و غیره)
حجم دانلود :
13.99 MiB
حجم مجموعه داده :
16.92 MiB
ذخیره خودکار ( اسناد ): بله
تقسیم ها :
شکاف | مثال ها |
---|---|
'challenge_test_scramble' | 500 |
'challenge_train_sample' | 500 |
'challenge_validation_sample' | 500 |
'test' | 4693 |
'train' | 33,525 |
'validation' | 4299 |
- ساختار ویژگی :
FeaturesDict({
'gem_id': string,
'gem_parent_id': string,
'meaning_representation': string,
'references': Sequence(string),
'target': string,
})
- مستندات ویژگی :
ویژگی | کلاس | شکل | نوع D | شرح |
---|---|---|---|---|
FeaturesDict | ||||
gem_id | تانسور | رشته | ||
gem_parent_id | تانسور | رشته | ||
معنی_نمایش | تانسور | رشته | ||
منابع | دنباله (تنسور) | (هیچ یک،) | رشته | |
هدف | تانسور | رشته |
- مثالها ( tfds.as_dataframe ):
- نقل قول :
@inproceedings{e2e_cleaned,
address = {Tokyo, Japan},
title = {Semantic {Noise} {Matters} for {Neural} {Natural} {Language} {Generation} },
url = {https://www.aclweb.org/anthology/W19-8652/},
booktitle = {Proceedings of the 12th {International} {Conference} on {Natural} {Language} {Generation} ({INLG} 2019)},
author = {Dušek, Ondřej and Howcroft, David M and Rieser, Verena},
year = {2019},
pages = {421--426},
}
@article{gehrmann2021gem,
author = {Sebastian Gehrmann and
Tosin P. Adewumi and
Karmanya Aggarwal and
Pawan Sasanka Ammanamanchi and
Aremu Anuoluwapo and
Antoine Bosselut and
Khyathi Raghavi Chandu and
Miruna{-}Adriana Clinciu and
Dipanjan Das and
Kaustubh D. Dhole and
Wanyu Du and
Esin Durmus and
Ondrej Dusek and
Chris Emezue and
Varun Gangal and
Cristina Garbacea and
Tatsunori Hashimoto and
Yufang Hou and
Yacine Jernite and
Harsh Jhamtani and
Yangfeng Ji and
Shailza Jolly and
Dhruv Kumar and
Faisal Ladhak and
Aman Madaan and
Mounica Maddela and
Khyati Mahajan and
Saad Mahamood and
Bodhisattwa Prasad Majumder and
Pedro Henrique Martins and
Angelina McMillan{-}Major and
Simon Mille and
Emiel van Miltenburg and
Moin Nadeem and
Shashi Narayan and
Vitaly Nikolaev and
Rubungo Andre Niyongabo and
Salomey Osei and
Ankur P. Parikh and
Laura Perez{-}Beltrachini and
Niranjan Ramesh Rao and
Vikas Raunak and
Juan Diego Rodriguez and
Sashank Santhanam and
Jo{\~{a} }o Sedoc and
Thibault Sellam and
Samira Shaikh and
Anastasia Shimorina and
Marco Antonio Sobrevilla Cabezudo and
Hendrik Strobelt and
Nishant Subramani and
Wei Xu and
Diyi Yang and
Akhila Yerukola and
Jiawei Zhou},
title = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
Metrics},
journal = {CoRR},
volume = {abs/2102.01672},
year = {2021},
url = {https://arxiv.org/abs/2102.01672},
archivePrefix = {arXiv},
eprint = {2102.01672}
}
Note that each GEM dataset has its own citation. Please see the source to see
the correct citation for each contained dataset."
gem/mlsum_de
توضیحات پیکربندی : MLSum یک مجموعه داده خلاصه چند زبانه در مقیاس بزرگ است. این از رسانه های خبری آنلاین ساخته شده است، این تقسیم بر آلمانی تمرکز دارد.
حجم دانلود :
345.98 MiB
حجم مجموعه داده :
963.60 MiB
ذخیره خودکار ( اسناد ): خیر
تقسیم ها :
شکاف | مثال ها |
---|---|
'challenge_test_covid' | 5,058 |
'challenge_train_sample' | 500 |
'challenge_validation_sample' | 500 |
'test' | 10695 |
'train' | 220,748 |
'validation' | 11,392 |
- ساختار ویژگی :
FeaturesDict({
'date': string,
'gem_id': string,
'gem_parent_id': string,
'references': Sequence(string),
'target': string,
'text': string,
'title': string,
'topic': string,
'url': string,
})
- مستندات ویژگی :
ویژگی | کلاس | شکل | نوع D | شرح |
---|---|---|---|---|
FeaturesDict | ||||
تاریخ | تانسور | رشته | ||
gem_id | تانسور | رشته | ||
gem_parent_id | تانسور | رشته | ||
منابع | دنباله (تنسور) | (هیچ یک،) | رشته | |
هدف | تانسور | رشته | ||
متن | تانسور | رشته | ||
عنوان | تانسور | رشته | ||
موضوع | تانسور | رشته | ||
آدرس اینترنتی | تانسور | رشته |
- مثالها ( tfds.as_dataframe ):
- نقل قول :
@inproceedings{scialom-etal-2020-mlsum,
title = "{MLSUM}: The Multilingual Summarization Corpus",
author = {Scialom, Thomas and Dray, Paul-Alexis and Lamprier, Sylvain and Piwowarski, Benjamin and Staiano, Jacopo},
booktitle = {Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
year = {2020}
}
@article{gehrmann2021gem,
author = {Sebastian Gehrmann and
Tosin P. Adewumi and
Karmanya Aggarwal and
Pawan Sasanka Ammanamanchi and
Aremu Anuoluwapo and
Antoine Bosselut and
Khyathi Raghavi Chandu and
Miruna{-}Adriana Clinciu and
Dipanjan Das and
Kaustubh D. Dhole and
Wanyu Du and
Esin Durmus and
Ondrej Dusek and
Chris Emezue and
Varun Gangal and
Cristina Garbacea and
Tatsunori Hashimoto and
Yufang Hou and
Yacine Jernite and
Harsh Jhamtani and
Yangfeng Ji and
Shailza Jolly and
Dhruv Kumar and
Faisal Ladhak and
Aman Madaan and
Mounica Maddela and
Khyati Mahajan and
Saad Mahamood and
Bodhisattwa Prasad Majumder and
Pedro Henrique Martins and
Angelina McMillan{-}Major and
Simon Mille and
Emiel van Miltenburg and
Moin Nadeem and
Shashi Narayan and
Vitaly Nikolaev and
Rubungo Andre Niyongabo and
Salomey Osei and
Ankur P. Parikh and
Laura Perez{-}Beltrachini and
Niranjan Ramesh Rao and
Vikas Raunak and
Juan Diego Rodriguez and
Sashank Santhanam and
Jo{\~{a} }o Sedoc and
Thibault Sellam and
Samira Shaikh and
Anastasia Shimorina and
Marco Antonio Sobrevilla Cabezudo and
Hendrik Strobelt and
Nishant Subramani and
Wei Xu and
Diyi Yang and
Akhila Yerukola and
Jiawei Zhou},
title = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
Metrics},
journal = {CoRR},
volume = {abs/2102.01672},
year = {2021},
url = {https://arxiv.org/abs/2102.01672},
archivePrefix = {arXiv},
eprint = {2102.01672}
}
Note that each GEM dataset has its own citation. Please see the source to see
the correct citation for each contained dataset."
gem/mlsum_es
توضیحات پیکربندی : MLSum یک مجموعه داده خلاصه چند زبانه در مقیاس بزرگ است. این از رسانه های خبری آنلاین ساخته شده است، این تقسیم بر اسپانیایی تمرکز دارد.
حجم دانلود :
501.27 MiB
حجم مجموعه داده :
1.29 GiB
ذخیره خودکار ( اسناد ): خیر
تقسیم ها :
شکاف | مثال ها |
---|---|
'challenge_test_covid' | 1,938 |
'challenge_train_sample' | 500 |
'challenge_validation_sample' | 500 |
'test' | 13,366 |
'train' | 259,888 |
'validation' | 9,977 |
- ساختار ویژگی :
FeaturesDict({
'date': string,
'gem_id': string,
'gem_parent_id': string,
'references': Sequence(string),
'target': string,
'text': string,
'title': string,
'topic': string,
'url': string,
})
- مستندات ویژگی :
ویژگی | کلاس | شکل | نوع D | شرح |
---|---|---|---|---|
FeaturesDict | ||||
تاریخ | تانسور | رشته | ||
gem_id | تانسور | رشته | ||
gem_parent_id | تانسور | رشته | ||
منابع | دنباله (تنسور) | (هیچ یک،) | رشته | |
هدف | تانسور | رشته | ||
متن | تانسور | رشته | ||
عنوان | تانسور | رشته | ||
موضوع | تانسور | رشته | ||
آدرس اینترنتی | تانسور | رشته |
- مثالها ( tfds.as_dataframe ):
- نقل قول :
@inproceedings{scialom-etal-2020-mlsum,
title = "{MLSUM}: The Multilingual Summarization Corpus",
author = {Scialom, Thomas and Dray, Paul-Alexis and Lamprier, Sylvain and Piwowarski, Benjamin and Staiano, Jacopo},
booktitle = {Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
year = {2020}
}
@article{gehrmann2021gem,
author = {Sebastian Gehrmann and
Tosin P. Adewumi and
Karmanya Aggarwal and
Pawan Sasanka Ammanamanchi and
Aremu Anuoluwapo and
Antoine Bosselut and
Khyathi Raghavi Chandu and
Miruna{-}Adriana Clinciu and
Dipanjan Das and
Kaustubh D. Dhole and
Wanyu Du and
Esin Durmus and
Ondrej Dusek and
Chris Emezue and
Varun Gangal and
Cristina Garbacea and
Tatsunori Hashimoto and
Yufang Hou and
Yacine Jernite and
Harsh Jhamtani and
Yangfeng Ji and
Shailza Jolly and
Dhruv Kumar and
Faisal Ladhak and
Aman Madaan and
Mounica Maddela and
Khyati Mahajan and
Saad Mahamood and
Bodhisattwa Prasad Majumder and
Pedro Henrique Martins and
Angelina McMillan{-}Major and
Simon Mille and
Emiel van Miltenburg and
Moin Nadeem and
Shashi Narayan and
Vitaly Nikolaev and
Rubungo Andre Niyongabo and
Salomey Osei and
Ankur P. Parikh and
Laura Perez{-}Beltrachini and
Niranjan Ramesh Rao and
Vikas Raunak and
Juan Diego Rodriguez and
Sashank Santhanam and
Jo{\~{a} }o Sedoc and
Thibault Sellam and
Samira Shaikh and
Anastasia Shimorina and
Marco Antonio Sobrevilla Cabezudo and
Hendrik Strobelt and
Nishant Subramani and
Wei Xu and
Diyi Yang and
Akhila Yerukola and
Jiawei Zhou},
title = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
Metrics},
journal = {CoRR},
volume = {abs/2102.01672},
year = {2021},
url = {https://arxiv.org/abs/2102.01672},
archivePrefix = {arXiv},
eprint = {2102.01672}
}
Note that each GEM dataset has its own citation. Please see the source to see
the correct citation for each contained dataset."
gem/schema_guided_dialog
توضیحات پیکربندی : مجموعه داده گفتگوی طرحواره (SGD) شامل دیالوگ های 18K چند دامنه ای وظیفه محور بین یک انسان و یک دستیار مجازی است که 17 دامنه از بانک ها و رویدادها گرفته تا رسانه، تقویم، سفر و آب و هوا را پوشش می دهد.
حجم دانلود :
17.00 MiB
حجم مجموعه داده :
201.19 MiB
ذخیره خودکار ( مستندات ): بله (challenge_test_backtranslation، challenge_test_bfp02، challenge_test_bfp05، challenge_test_nopunc، challenge_test_scramble، challenge_train_sample، challenge_validation_sample، تست، اعتبارسنجی)، فقط زمانی که
shuffle_files=False
(train)تقسیم ها :
شکاف | مثال ها |
---|---|
'challenge_test_backtranslation' | 500 |
'challenge_test_bfp02' | 500 |
'challenge_test_bfp05' | 500 |
'challenge_test_nopunc' | 500 |
'challenge_test_scramble' | 500 |
'challenge_train_sample' | 500 |
'challenge_validation_sample' | 500 |
'test' | 10000 |
'train' | 164982 |
'validation' | 10000 |
- ساختار ویژگی :
FeaturesDict({
'context': Sequence(string),
'dialog_acts': Sequence({
'act': ClassLabel(shape=(), dtype=int64, num_classes=18),
'slot': string,
'values': Sequence(string),
}),
'dialog_id': string,
'gem_id': string,
'gem_parent_id': string,
'prompt': string,
'references': Sequence(string),
'service': string,
'target': string,
'turn_id': int32,
})
- مستندات ویژگی :
ویژگی | کلاس | شکل | نوع D | شرح |
---|---|---|---|---|
FeaturesDict | ||||
متن نوشته | دنباله (تنسور) | (هیچ یک،) | رشته | |
دیالوگ_عمل ها | توالی | |||
dialog_acts/act | ClassLabel | int64 | ||
dialog_acts/slot | تانسور | رشته | ||
دیالوگ_عمل ها/ ارزش ها | دنباله (تنسور) | (هیچ یک،) | رشته | |
dialog_id | تانسور | رشته | ||
gem_id | تانسور | رشته | ||
gem_parent_id | تانسور | رشته | ||
سریع | تانسور | رشته | ||
منابع | دنباله (تنسور) | (هیچ یک،) | رشته | |
سرویس | تانسور | رشته | ||
هدف | تانسور | رشته | ||
turn_id | تانسور | int32 |
- مثالها ( tfds.as_dataframe ):
- نقل قول :
@article{rastogi2019towards,
title={Towards Scalable Multi-domain Conversational Agents: The Schema-Guided Dialogue Dataset},
author={Rastogi, Abhinav and Zang, Xiaoxue and Sunkara, Srinivas and Gupta, Raghav and Khaitan, Pranav},
journal={arXiv preprint arXiv:1909.05855},
year={2019}
}
@article{gehrmann2021gem,
author = {Sebastian Gehrmann and
Tosin P. Adewumi and
Karmanya Aggarwal and
Pawan Sasanka Ammanamanchi and
Aremu Anuoluwapo and
Antoine Bosselut and
Khyathi Raghavi Chandu and
Miruna{-}Adriana Clinciu and
Dipanjan Das and
Kaustubh D. Dhole and
Wanyu Du and
Esin Durmus and
Ondrej Dusek and
Chris Emezue and
Varun Gangal and
Cristina Garbacea and
Tatsunori Hashimoto and
Yufang Hou and
Yacine Jernite and
Harsh Jhamtani and
Yangfeng Ji and
Shailza Jolly and
Dhruv Kumar and
Faisal Ladhak and
Aman Madaan and
Mounica Maddela and
Khyati Mahajan and
Saad Mahamood and
Bodhisattwa Prasad Majumder and
Pedro Henrique Martins and
Angelina McMillan{-}Major and
Simon Mille and
Emiel van Miltenburg and
Moin Nadeem and
Shashi Narayan and
Vitaly Nikolaev and
Rubungo Andre Niyongabo and
Salomey Osei and
Ankur P. Parikh and
Laura Perez{-}Beltrachini and
Niranjan Ramesh Rao and
Vikas Raunak and
Juan Diego Rodriguez and
Sashank Santhanam and
Jo{\~{a} }o Sedoc and
Thibault Sellam and
Samira Shaikh and
Anastasia Shimorina and
Marco Antonio Sobrevilla Cabezudo and
Hendrik Strobelt and
Nishant Subramani and
Wei Xu and
Diyi Yang and
Akhila Yerukola and
Jiawei Zhou},
title = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
Metrics},
journal = {CoRR},
volume = {abs/2102.01672},
year = {2021},
url = {https://arxiv.org/abs/2102.01672},
archivePrefix = {arXiv},
eprint = {2102.01672}
}
Note that each GEM dataset has its own citation. Please see the source to see
the correct citation for each contained dataset."
گوهر/تاتو
توضیحات پیکربندی : ToTTo یک کار NLG جدول به متن است. وظیفه به شرح زیر است: با توجه به یک جدول ویکیپدیا با نام ردیفها، نام ستونها و سلولهای جدول، با زیرمجموعهای از سلولهای برجسته، یک توصیف زبان طبیعی برای قسمت برجستهشده جدول ایجاد کنید.
حجم دانلود :
180.75 MiB
حجم مجموعه داده :
645.86 MiB
ذخیره خودکار ( اسناد ): خیر
تقسیم ها :
شکاف | مثال ها |
---|---|
'challenge_test_scramble' | 500 |
'challenge_train_sample' | 500 |
'challenge_validation_sample' | 500 |
'test' | 7700 |
'train' | 121,153 |
'validation' | 7700 |
- ساختار ویژگی :
FeaturesDict({
'example_id': string,
'gem_id': string,
'gem_parent_id': string,
'highlighted_cells': Sequence(Sequence(int32)),
'overlap_subset': string,
'references': Sequence(string),
'sentence_annotations': Sequence({
'final_sentence': string,
'original_sentence': string,
'sentence_after_ambiguity': string,
'sentence_after_deletion': string,
}),
'table': Sequence(Sequence({
'column_span': int32,
'is_header': bool,
'row_span': int32,
'value': string,
})),
'table_page_title': string,
'table_section_text': string,
'table_section_title': string,
'table_webpage_url': string,
'target': string,
'totto_id': int32,
})
- مستندات ویژگی :
ویژگی | کلاس | شکل | نوع D | شرح |
---|---|---|---|---|
FeaturesDict | ||||
example_id | تانسور | رشته | ||
gem_id | تانسور | رشته | ||
gem_parent_id | تانسور | رشته | ||
هایلایت شده_سلول ها | دنباله (سکانس (تنسور)) | (هیچ، هیچکدام) | int32 | |
همپوشانی_زیر مجموعه | تانسور | رشته | ||
منابع | دنباله (تنسور) | (هیچ یک،) | رشته | |
جمله_حاشیه ها | توالی | |||
جمله_حواشی/جمله_پایانی | تانسور | رشته | ||
جمله_حواشی/جمله_اصلی | تانسور | رشته | ||
جمله_حواشی/جمله_بعد_ابهام | تانسور | رشته | ||
جمله_حواشی/جمله_پس از_حذف | تانسور | رشته | ||
جدول | توالی | |||
جدول / دهانه_ستون | تانسور | int32 | ||
table/is_header | تانسور | بوول | ||
table/row_span | تانسور | int32 | ||
جدول/مقدار | تانسور | رشته | ||
table_page_title | تانسور | رشته | ||
جدول_بخش_متن | تانسور | رشته | ||
table_section_title | تانسور | رشته | ||
table_webpage_url | تانسور | رشته | ||
هدف | تانسور | رشته | ||
totto_id | تانسور | int32 |
- مثالها ( tfds.as_dataframe ):
- نقل قول :
@inproceedings{parikh2020totto,
title=ToTTo: A Controlled Table-To-Text Generation Dataset,
author={Parikh, Ankur and Wang, Xuezhi and Gehrmann, Sebastian and Faruqui, Manaal and Dhingra, Bhuwan and Yang, Diyi and Das, Dipanjan},
booktitle={Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
pages={1173--1186},
year={2020}
}
@article{gehrmann2021gem,
author = {Sebastian Gehrmann and
Tosin P. Adewumi and
Karmanya Aggarwal and
Pawan Sasanka Ammanamanchi and
Aremu Anuoluwapo and
Antoine Bosselut and
Khyathi Raghavi Chandu and
Miruna{-}Adriana Clinciu and
Dipanjan Das and
Kaustubh D. Dhole and
Wanyu Du and
Esin Durmus and
Ondrej Dusek and
Chris Emezue and
Varun Gangal and
Cristina Garbacea and
Tatsunori Hashimoto and
Yufang Hou and
Yacine Jernite and
Harsh Jhamtani and
Yangfeng Ji and
Shailza Jolly and
Dhruv Kumar and
Faisal Ladhak and
Aman Madaan and
Mounica Maddela and
Khyati Mahajan and
Saad Mahamood and
Bodhisattwa Prasad Majumder and
Pedro Henrique Martins and
Angelina McMillan{-}Major and
Simon Mille and
Emiel van Miltenburg and
Moin Nadeem and
Shashi Narayan and
Vitaly Nikolaev and
Rubungo Andre Niyongabo and
Salomey Osei and
Ankur P. Parikh and
Laura Perez{-}Beltrachini and
Niranjan Ramesh Rao and
Vikas Raunak and
Juan Diego Rodriguez and
Sashank Santhanam and
Jo{\~{a} }o Sedoc and
Thibault Sellam and
Samira Shaikh and
Anastasia Shimorina and
Marco Antonio Sobrevilla Cabezudo and
Hendrik Strobelt and
Nishant Subramani and
Wei Xu and
Diyi Yang and
Akhila Yerukola and
Jiawei Zhou},
title = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
Metrics},
journal = {CoRR},
volume = {abs/2102.01672},
year = {2021},
url = {https://arxiv.org/abs/2102.01672},
archivePrefix = {arXiv},
eprint = {2102.01672}
}
Note that each GEM dataset has its own citation. Please see the source to see
the correct citation for each contained dataset."
gem/web_nlg_en
توضیحات پیکربندی : WebNLG یک مجموعه داده دوزبانه (انگلیسی، روسی) از مجموعه های سه گانه موازی DBpedia و متون کوتاه است که حدود 450 ویژگی مختلف DBpedia را پوشش می دهد. دادههای WebNLG در ابتدا برای ترویج توسعه کلامیکنندههای RDF که قادر به تولید متن کوتاه و مدیریت برنامهریزی خرد بودند، ایجاد شد.
حجم دانلود :
12.57 MiB
حجم مجموعه داده :
19.91 MiB
ذخیره خودکار ( اسناد ): بله
تقسیم ها :
شکاف | مثال ها |
---|---|
'challenge_test_numbers' | 500 |
'challenge_test_scramble' | 500 |
'challenge_train_sample' | 502 |
'challenge_validation_sample' | 499 |
'test' | 1779 |
'train' | 35,426 |
'validation' | 1667 |
- ساختار ویژگی :
FeaturesDict({
'category': string,
'gem_id': string,
'gem_parent_id': string,
'input': Sequence(string),
'references': Sequence(string),
'target': string,
'webnlg_id': string,
})
- مستندات ویژگی :
ویژگی | کلاس | شکل | نوع D | شرح |
---|---|---|---|---|
FeaturesDict | ||||
دسته بندی | تانسور | رشته | ||
gem_id | تانسور | رشته | ||
gem_parent_id | تانسور | رشته | ||
ورودی | دنباله (تنسور) | (هیچ یک،) | رشته | |
منابع | دنباله (تنسور) | (هیچ یک،) | رشته | |
هدف | تانسور | رشته | ||
webnlg_id | تانسور | رشته |
- مثالها ( tfds.as_dataframe ):
- نقل قول :
@inproceedings{gardent2017creating,
author = "Gardent, Claire
and Shimorina, Anastasia
and Narayan, Shashi
and Perez-Beltrachini, Laura",
title = "Creating Training Corpora for NLG Micro-Planners",
booktitle = "Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
year = "2017",
publisher = "Association for Computational Linguistics",
pages = "179--188",
location = "Vancouver, Canada",
doi = "10.18653/v1/P17-1017",
url = "http://www.aclweb.org/anthology/P17-1017"
}
@article{gehrmann2021gem,
author = {Sebastian Gehrmann and
Tosin P. Adewumi and
Karmanya Aggarwal and
Pawan Sasanka Ammanamanchi and
Aremu Anuoluwapo and
Antoine Bosselut and
Khyathi Raghavi Chandu and
Miruna{-}Adriana Clinciu and
Dipanjan Das and
Kaustubh D. Dhole and
Wanyu Du and
Esin Durmus and
Ondrej Dusek and
Chris Emezue and
Varun Gangal and
Cristina Garbacea and
Tatsunori Hashimoto and
Yufang Hou and
Yacine Jernite and
Harsh Jhamtani and
Yangfeng Ji and
Shailza Jolly and
Dhruv Kumar and
Faisal Ladhak and
Aman Madaan and
Mounica Maddela and
Khyati Mahajan and
Saad Mahamood and
Bodhisattwa Prasad Majumder and
Pedro Henrique Martins and
Angelina McMillan{-}Major and
Simon Mille and
Emiel van Miltenburg and
Moin Nadeem and
Shashi Narayan and
Vitaly Nikolaev and
Rubungo Andre Niyongabo and
Salomey Osei and
Ankur P. Parikh and
Laura Perez{-}Beltrachini and
Niranjan Ramesh Rao and
Vikas Raunak and
Juan Diego Rodriguez and
Sashank Santhanam and
Jo{\~{a} }o Sedoc and
Thibault Sellam and
Samira Shaikh and
Anastasia Shimorina and
Marco Antonio Sobrevilla Cabezudo and
Hendrik Strobelt and
Nishant Subramani and
Wei Xu and
Diyi Yang and
Akhila Yerukola and
Jiawei Zhou},
title = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
Metrics},
journal = {CoRR},
volume = {abs/2102.01672},
year = {2021},
url = {https://arxiv.org/abs/2102.01672},
archivePrefix = {arXiv},
eprint = {2102.01672}
}
Note that each GEM dataset has its own citation. Please see the source to see
the correct citation for each contained dataset."
gem/web_nlg_ru
توضیحات پیکربندی : WebNLG یک مجموعه داده دوزبانه (انگلیسی، روسی) از مجموعه های سه گانه موازی DBpedia و متون کوتاه است که حدود 450 ویژگی مختلف DBpedia را پوشش می دهد. دادههای WebNLG در ابتدا برای ترویج توسعه کلامیکنندههای RDF که قادر به تولید متن کوتاه و مدیریت برنامهریزی خرد بودند، ایجاد شد.
حجم دانلود :
7.49 MiB
حجم مجموعه داده :
11.30 MiB
ذخیره خودکار ( اسناد ): بله
تقسیم ها :
شکاف | مثال ها |
---|---|
'challenge_test_scramble' | 500 |
'challenge_train_sample' | 501 |
'challenge_validation_sample' | 500 |
'test' | 1,102 |
'train' | 14630 |
'validation' | 790 |
- ساختار ویژگی :
FeaturesDict({
'category': string,
'gem_id': string,
'gem_parent_id': string,
'input': Sequence(string),
'references': Sequence(string),
'target': string,
'webnlg_id': string,
})
- مستندات ویژگی :
ویژگی | کلاس | شکل | نوع D | شرح |
---|---|---|---|---|
FeaturesDict | ||||
دسته بندی | تانسور | رشته | ||
gem_id | تانسور | رشته | ||
gem_parent_id | تانسور | رشته | ||
ورودی | دنباله (تنسور) | (هیچ یک،) | رشته | |
منابع | دنباله (تنسور) | (هیچ یک،) | رشته | |
هدف | تانسور | رشته | ||
webnlg_id | تانسور | رشته |
- مثالها ( tfds.as_dataframe ):
- نقل قول :
@inproceedings{gardent2017creating,
author = "Gardent, Claire
and Shimorina, Anastasia
and Narayan, Shashi
and Perez-Beltrachini, Laura",
title = "Creating Training Corpora for NLG Micro-Planners",
booktitle = "Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
year = "2017",
publisher = "Association for Computational Linguistics",
pages = "179--188",
location = "Vancouver, Canada",
doi = "10.18653/v1/P17-1017",
url = "http://www.aclweb.org/anthology/P17-1017"
}
@article{gehrmann2021gem,
author = {Sebastian Gehrmann and
Tosin P. Adewumi and
Karmanya Aggarwal and
Pawan Sasanka Ammanamanchi and
Aremu Anuoluwapo and
Antoine Bosselut and
Khyathi Raghavi Chandu and
Miruna{-}Adriana Clinciu and
Dipanjan Das and
Kaustubh D. Dhole and
Wanyu Du and
Esin Durmus and
Ondrej Dusek and
Chris Emezue and
Varun Gangal and
Cristina Garbacea and
Tatsunori Hashimoto and
Yufang Hou and
Yacine Jernite and
Harsh Jhamtani and
Yangfeng Ji and
Shailza Jolly and
Dhruv Kumar and
Faisal Ladhak and
Aman Madaan and
Mounica Maddela and
Khyati Mahajan and
Saad Mahamood and
Bodhisattwa Prasad Majumder and
Pedro Henrique Martins and
Angelina McMillan{-}Major and
Simon Mille and
Emiel van Miltenburg and
Moin Nadeem and
Shashi Narayan and
Vitaly Nikolaev and
Rubungo Andre Niyongabo and
Salomey Osei and
Ankur P. Parikh and
Laura Perez{-}Beltrachini and
Niranjan Ramesh Rao and
Vikas Raunak and
Juan Diego Rodriguez and
Sashank Santhanam and
Jo{\~{a} }o Sedoc and
Thibault Sellam and
Samira Shaikh and
Anastasia Shimorina and
Marco Antonio Sobrevilla Cabezudo and
Hendrik Strobelt and
Nishant Subramani and
Wei Xu and
Diyi Yang and
Akhila Yerukola and
Jiawei Zhou},
title = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
Metrics},
journal = {CoRR},
volume = {abs/2102.01672},
year = {2021},
url = {https://arxiv.org/abs/2102.01672},
archivePrefix = {arXiv},
eprint = {2102.01672}
}
Note that each GEM dataset has its own citation. Please see the source to see
the correct citation for each contained dataset."
gem/wiki_auto_asset_turk
توضیحات پیکربندی : WikiAuto مجموعهای از جملات تراز شده را از ویکیپدیای انگلیسی و ویکیپدیای ساده انگلیسی به عنوان منبعی برای آموزش سیستمهای سادهسازی جملات ارائه میکند. ASSET و TURK مجموعه داده های ساده سازی با کیفیتی هستند که برای آزمایش استفاده می شوند.
حجم دانلود :
121.01 MiB
حجم مجموعه داده :
202.40 MiB
Auto-cached ( documentation ): Yes (challenge_test_asset_backtranslation, challenge_test_asset_bfp02, challenge_test_asset_bfp05, challenge_test_asset_nopunc, challenge_test_turk_backtranslation, challenge_test_turk_bfp02, challenge_test_turk_bfp05, challenge_test_turk_nopunc, challenge_train_sample, challenge_validation_sample, test_asset, test_turk, validation), Only when
shuffle_files=False
(train)تقسیم ها :
شکاف | مثال ها |
---|---|
'challenge_test_asset_backtranslation' | 359 |
'challenge_test_asset_bfp02' | 359 |
'challenge_test_asset_bfp05' | 359 |
'challenge_test_asset_nopunc' | 359 |
'challenge_test_turk_backtranslation' | 359 |
'challenge_test_turk_bfp02' | 359 |
'challenge_test_turk_bfp05' | 359 |
'challenge_test_turk_nopunc' | 359 |
'challenge_train_sample' | 500 |
'challenge_validation_sample' | 500 |
'test_asset' | 359 |
'test_turk' | 359 |
'train' | 483801 |
'validation' | 20000 |
- ساختار ویژگی :
FeaturesDict({
'gem_id': string,
'gem_parent_id': string,
'references': Sequence(string),
'source': string,
'target': string,
})
- مستندات ویژگی :
ویژگی | کلاس | شکل | نوع D | شرح |
---|---|---|---|---|
FeaturesDict | ||||
gem_id | تانسور | رشته | ||
gem_parent_id | تانسور | رشته | ||
منابع | دنباله (تنسور) | (هیچ یک،) | رشته | |
منبع | تانسور | رشته | ||
هدف | تانسور | رشته |
- مثالها ( tfds.as_dataframe ):
- نقل قول :
@inproceedings{jiang-etal-2020-neural,
title = "Neural {CRF} Model for Sentence Alignment in Text Simplification",
author = "Jiang, Chao and
Maddela, Mounica and
Lan, Wuwei and
Zhong, Yang and
Xu, Wei",
booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics",
month = jul,
year = "2020",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/2020.acl-main.709",
doi = "10.18653/v1/2020.acl-main.709",
pages = "7943--7960",
}
@article{gehrmann2021gem,
author = {Sebastian Gehrmann and
Tosin P. Adewumi and
Karmanya Aggarwal and
Pawan Sasanka Ammanamanchi and
Aremu Anuoluwapo and
Antoine Bosselut and
Khyathi Raghavi Chandu and
Miruna{-}Adriana Clinciu and
Dipanjan Das and
Kaustubh D. Dhole and
Wanyu Du and
Esin Durmus and
Ondrej Dusek and
Chris Emezue and
Varun Gangal and
Cristina Garbacea and
Tatsunori Hashimoto and
Yufang Hou and
Yacine Jernite and
Harsh Jhamtani and
Yangfeng Ji and
Shailza Jolly and
Dhruv Kumar and
Faisal Ladhak and
Aman Madaan and
Mounica Maddela and
Khyati Mahajan and
Saad Mahamood and
Bodhisattwa Prasad Majumder and
Pedro Henrique Martins and
Angelina McMillan{-}Major and
Simon Mille and
Emiel van Miltenburg and
Moin Nadeem and
Shashi Narayan and
Vitaly Nikolaev and
Rubungo Andre Niyongabo and
Salomey Osei and
Ankur P. Parikh and
Laura Perez{-}Beltrachini and
Niranjan Ramesh Rao and
Vikas Raunak and
Juan Diego Rodriguez and
Sashank Santhanam and
Jo{\~{a} }o Sedoc and
Thibault Sellam and
Samira Shaikh and
Anastasia Shimorina and
Marco Antonio Sobrevilla Cabezudo and
Hendrik Strobelt and
Nishant Subramani and
Wei Xu and
Diyi Yang and
Akhila Yerukola and
Jiawei Zhou},
title = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
Metrics},
journal = {CoRR},
volume = {abs/2102.01672},
year = {2021},
url = {https://arxiv.org/abs/2102.01672},
archivePrefix = {arXiv},
eprint = {2102.01672}
}
Note that each GEM dataset has its own citation. Please see the source to see
the correct citation for each contained dataset."
gem/xsum
توضیحات پیکربندی : مجموعه داده برای وظیفه خلاصهسازی انتزاعی در شکل شدید آن است، یعنی خلاصه کردن یک سند در یک جمله.
حجم دانلود :
246.31 MiB
حجم مجموعه داده :
78.89 MiB
ذخیره خودکار ( اسناد ): بله
تقسیم ها :
شکاف | مثال ها |
---|---|
'challenge_test_backtranslation' | 500 |
'challenge_test_bfp_02' | 500 |
'challenge_test_bfp_05' | 500 |
'challenge_test_covid' | 401 |
'challenge_test_nopunc' | 500 |
'challenge_train_sample' | 500 |
'challenge_validation_sample' | 500 |
'test' | 1,166 |
'train' | 23,206 |
'validation' | 1,117 |
- ساختار ویژگی :
FeaturesDict({
'document': string,
'gem_id': string,
'gem_parent_id': string,
'references': Sequence(string),
'target': string,
'xsum_id': string,
})
- مستندات ویژگی :
ویژگی | کلاس | شکل | نوع D | شرح |
---|---|---|---|---|
FeaturesDict | ||||
سند | تانسور | رشته | ||
gem_id | تانسور | رشته | ||
gem_parent_id | تانسور | رشته | ||
منابع | دنباله (تنسور) | (هیچ یک،) | رشته | |
هدف | تانسور | رشته | ||
xsum_id | تانسور | رشته |
- مثالها ( tfds.as_dataframe ):
- نقل قول :
@inproceedings{Narayan2018dont,
author = "Shashi Narayan and Shay B. Cohen and Mirella Lapata",
title = "Don't Give Me the Details, Just the Summary! {T}opic-Aware Convolutional Neural Networks for Extreme Summarization",
booktitle = "Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing ",
year = "2018",
address = "Brussels, Belgium",
}
@article{gehrmann2021gem,
author = {Sebastian Gehrmann and
Tosin P. Adewumi and
Karmanya Aggarwal and
Pawan Sasanka Ammanamanchi and
Aremu Anuoluwapo and
Antoine Bosselut and
Khyathi Raghavi Chandu and
Miruna{-}Adriana Clinciu and
Dipanjan Das and
Kaustubh D. Dhole and
Wanyu Du and
Esin Durmus and
Ondrej Dusek and
Chris Emezue and
Varun Gangal and
Cristina Garbacea and
Tatsunori Hashimoto and
Yufang Hou and
Yacine Jernite and
Harsh Jhamtani and
Yangfeng Ji and
Shailza Jolly and
Dhruv Kumar and
Faisal Ladhak and
Aman Madaan and
Mounica Maddela and
Khyati Mahajan and
Saad Mahamood and
Bodhisattwa Prasad Majumder and
Pedro Henrique Martins and
Angelina McMillan{-}Major and
Simon Mille and
Emiel van Miltenburg and
Moin Nadeem and
Shashi Narayan and
Vitaly Nikolaev and
Rubungo Andre Niyongabo and
Salomey Osei and
Ankur P. Parikh and
Laura Perez{-}Beltrachini and
Niranjan Ramesh Rao and
Vikas Raunak and
Juan Diego Rodriguez and
Sashank Santhanam and
Jo{\~{a} }o Sedoc and
Thibault Sellam and
Samira Shaikh and
Anastasia Shimorina and
Marco Antonio Sobrevilla Cabezudo and
Hendrik Strobelt and
Nishant Subramani and
Wei Xu and
Diyi Yang and
Akhila Yerukola and
Jiawei Zhou},
title = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
Metrics},
journal = {CoRR},
volume = {abs/2102.01672},
year = {2021},
url = {https://arxiv.org/abs/2102.01672},
archivePrefix = {arXiv},
eprint = {2102.01672}
}
Note that each GEM dataset has its own citation. Please see the source to see
the correct citation for each contained dataset."
gem/wiki_lingua_arabic_ar
توضیحات پیکربندی : Wikilingua یک مجموعه داده چندزبانه در مقیاس بزرگ برای ارزیابی سیستمهای خلاصهسازی انتزاعی چند زبانه است.
حجم دانلود :
56.25 MiB
حجم مجموعه داده :
291.42 MiB
ذخیره خودکار ( اسناد ): خیر
تقسیم ها :
شکاف | مثال ها |
---|---|
'test' | 5,841 |
'train' | 20,441 |
'validation' | 2919 |
- ساختار ویژگی :
FeaturesDict({
'gem_id': string,
'gem_parent_id': string,
'references': Sequence(string),
'source': string,
'source_aligned': Translation({
'ar': Text(shape=(), dtype=string),
'en': Text(shape=(), dtype=string),
}),
'target': string,
'target_aligned': Translation({
'ar': Text(shape=(), dtype=string),
'en': Text(shape=(), dtype=string),
}),
})
- مستندات ویژگی :
ویژگی | کلاس | شکل | نوع D | شرح |
---|---|---|---|---|
FeaturesDict | ||||
gem_id | تانسور | رشته | ||
gem_parent_id | تانسور | رشته | ||
منابع | دنباله (تنسور) | (هیچ یک،) | رشته | |
منبع | تانسور | رشته | ||
source_aligned | ترجمه | |||
source_aligned/ar | متن | رشته | ||
source_aligned/en | متن | رشته | ||
هدف | تانسور | رشته | ||
target_aligned | ترجمه | |||
target_aligned/ar | متن | رشته | ||
target_aligned/en | متن | رشته |
- مثالها ( tfds.as_dataframe ):
- نقل قول :
@inproceedings{ladhak-wiki-2020,
title=WikiLingua: A New Benchmark Dataset for Multilingual Abstractive Summarization,
author={Faisal Ladhak, Esin Durmus, Claire Cardie and Kathleen McKeown},
booktitle={Findings of EMNLP, 2020},
year={2020}
}
@article{gehrmann2021gem,
author = {Sebastian Gehrmann and
Tosin P. Adewumi and
Karmanya Aggarwal and
Pawan Sasanka Ammanamanchi and
Aremu Anuoluwapo and
Antoine Bosselut and
Khyathi Raghavi Chandu and
Miruna{-}Adriana Clinciu and
Dipanjan Das and
Kaustubh D. Dhole and
Wanyu Du and
Esin Durmus and
Ondrej Dusek and
Chris Emezue and
Varun Gangal and
Cristina Garbacea and
Tatsunori Hashimoto and
Yufang Hou and
Yacine Jernite and
Harsh Jhamtani and
Yangfeng Ji and
Shailza Jolly and
Dhruv Kumar and
Faisal Ladhak and
Aman Madaan and
Mounica Maddela and
Khyati Mahajan and
Saad Mahamood and
Bodhisattwa Prasad Majumder and
Pedro Henrique Martins and
Angelina McMillan{-}Major and
Simon Mille and
Emiel van Miltenburg and
Moin Nadeem and
Shashi Narayan and
Vitaly Nikolaev and
Rubungo Andre Niyongabo and
Salomey Osei and
Ankur P. Parikh and
Laura Perez{-}Beltrachini and
Niranjan Ramesh Rao and
Vikas Raunak and
Juan Diego Rodriguez and
Sashank Santhanam and
Jo{\~{a} }o Sedoc and
Thibault Sellam and
Samira Shaikh and
Anastasia Shimorina and
Marco Antonio Sobrevilla Cabezudo and
Hendrik Strobelt and
Nishant Subramani and
Wei Xu and
Diyi Yang and
Akhila Yerukola and
Jiawei Zhou},
title = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
Metrics},
journal = {CoRR},
volume = {abs/2102.01672},
year = {2021},
url = {https://arxiv.org/abs/2102.01672},
archivePrefix = {arXiv},
eprint = {2102.01672}
}
Note that each GEM dataset has its own citation. Please see the source to see
the correct citation for each contained dataset."
gem/wiki_lingua_chinese_zh
توضیحات پیکربندی : Wikilingua یک مجموعه داده چندزبانه در مقیاس بزرگ برای ارزیابی سیستمهای خلاصهسازی انتزاعی چند زبانه است.
حجم دانلود :
31.38 MiB
حجم مجموعه داده :
122.06 MiB
ذخیره خودکار ( اسناد ): بله
تقسیم ها :
شکاف | مثال ها |
---|---|
'test' | 3775 |
'train' | 13,211 |
'validation' | 1,886 |
- ساختار ویژگی :
FeaturesDict({
'gem_id': string,
'gem_parent_id': string,
'references': Sequence(string),
'source': string,
'source_aligned': Translation({
'en': Text(shape=(), dtype=string),
'zh': Text(shape=(), dtype=string),
}),
'target': string,
'target_aligned': Translation({
'en': Text(shape=(), dtype=string),
'zh': Text(shape=(), dtype=string),
}),
})
- مستندات ویژگی :
ویژگی | کلاس | شکل | نوع D | شرح |
---|---|---|---|---|
FeaturesDict | ||||
gem_id | تانسور | رشته | ||
gem_parent_id | تانسور | رشته | ||
منابع | دنباله (تنسور) | (هیچ یک،) | رشته | |
منبع | تانسور | رشته | ||
source_aligned | ترجمه | |||
source_aligned/en | متن | رشته | ||
source_aligned/zh | متن | رشته | ||
هدف | تانسور | رشته | ||
target_aligned | ترجمه | |||
target_aligned/en | متن | رشته | ||
target_aligned/zh | متن | رشته |
- مثالها ( tfds.as_dataframe ):
- نقل قول :
@inproceedings{ladhak-wiki-2020,
title=WikiLingua: A New Benchmark Dataset for Multilingual Abstractive Summarization,
author={Faisal Ladhak, Esin Durmus, Claire Cardie and Kathleen McKeown},
booktitle={Findings of EMNLP, 2020},
year={2020}
}
@article{gehrmann2021gem,
author = {Sebastian Gehrmann and
Tosin P. Adewumi and
Karmanya Aggarwal and
Pawan Sasanka Ammanamanchi and
Aremu Anuoluwapo and
Antoine Bosselut and
Khyathi Raghavi Chandu and
Miruna{-}Adriana Clinciu and
Dipanjan Das and
Kaustubh D. Dhole and
Wanyu Du and
Esin Durmus and
Ondrej Dusek and
Chris Emezue and
Varun Gangal and
Cristina Garbacea and
Tatsunori Hashimoto and
Yufang Hou and
Yacine Jernite and
Harsh Jhamtani and
Yangfeng Ji and
Shailza Jolly and
Dhruv Kumar and
Faisal Ladhak and
Aman Madaan and
Mounica Maddela and
Khyati Mahajan and
Saad Mahamood and
Bodhisattwa Prasad Majumder and
Pedro Henrique Martins and
Angelina McMillan{-}Major and
Simon Mille and
Emiel van Miltenburg and
Moin Nadeem and
Shashi Narayan and
Vitaly Nikolaev and
Rubungo Andre Niyongabo and
Salomey Osei and
Ankur P. Parikh and
Laura Perez{-}Beltrachini and
Niranjan Ramesh Rao and
Vikas Raunak and
Juan Diego Rodriguez and
Sashank Santhanam and
Jo{\~{a} }o Sedoc and
Thibault Sellam and
Samira Shaikh and
Anastasia Shimorina and
Marco Antonio Sobrevilla Cabezudo and
Hendrik Strobelt and
Nishant Subramani and
Wei Xu and
Diyi Yang and
Akhila Yerukola and
Jiawei Zhou},
title = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
Metrics},
journal = {CoRR},
volume = {abs/2102.01672},
year = {2021},
url = {https://arxiv.org/abs/2102.01672},
archivePrefix = {arXiv},
eprint = {2102.01672}
}
Note that each GEM dataset has its own citation. Please see the source to see
the correct citation for each contained dataset."
gem/wiki_lingua_czech_cs
توضیحات پیکربندی : Wikilingua یک مجموعه داده چندزبانه در مقیاس بزرگ برای ارزیابی سیستمهای خلاصهسازی انتزاعی چند زبانه است.
حجم دانلود :
13.84 MiB
حجم مجموعه داده :
58.05 MiB
ذخیره خودکار ( اسناد ): بله
تقسیم ها :
شکاف | مثال ها |
---|---|
'test' | 1,438 |
'train' | 5,033 |
'validation' | 718 |
- ساختار ویژگی :
FeaturesDict({
'gem_id': string,
'gem_parent_id': string,
'references': Sequence(string),
'source': string,
'source_aligned': Translation({
'cs': Text(shape=(), dtype=string),
'en': Text(shape=(), dtype=string),
}),
'target': string,
'target_aligned': Translation({
'cs': Text(shape=(), dtype=string),
'en': Text(shape=(), dtype=string),
}),
})
- مستندات ویژگی :
ویژگی | کلاس | شکل | نوع D | شرح |
---|---|---|---|---|
FeaturesDict | ||||
gem_id | تانسور | رشته | ||
gem_parent_id | تانسور | رشته | ||
منابع | دنباله (تنسور) | (هیچ یک،) | رشته | |
منبع | تانسور | رشته | ||
source_aligned | ترجمه | |||
source_aligned/cs | متن | رشته | ||
source_aligned/en | متن | رشته | ||
هدف | تانسور | رشته | ||
target_aligned | ترجمه | |||
target_aligned/cs | متن | رشته | ||
target_aligned/en | متن | رشته |
- مثالها ( tfds.as_dataframe ):
- نقل قول :
@inproceedings{ladhak-wiki-2020,
title=WikiLingua: A New Benchmark Dataset for Multilingual Abstractive Summarization,
author={Faisal Ladhak, Esin Durmus, Claire Cardie and Kathleen McKeown},
booktitle={Findings of EMNLP, 2020},
year={2020}
}
@article{gehrmann2021gem,
author = {Sebastian Gehrmann and
Tosin P. Adewumi and
Karmanya Aggarwal and
Pawan Sasanka Ammanamanchi and
Aremu Anuoluwapo and
Antoine Bosselut and
Khyathi Raghavi Chandu and
Miruna{-}Adriana Clinciu and
Dipanjan Das and
Kaustubh D. Dhole and
Wanyu Du and
Esin Durmus and
Ondrej Dusek and
Chris Emezue and
Varun Gangal and
Cristina Garbacea and
Tatsunori Hashimoto and
Yufang Hou and
Yacine Jernite and
Harsh Jhamtani and
Yangfeng Ji and
Shailza Jolly and
Dhruv Kumar and
Faisal Ladhak and
Aman Madaan and
Mounica Maddela and
Khyati Mahajan and
Saad Mahamood and
Bodhisattwa Prasad Majumder and
Pedro Henrique Martins and
Angelina McMillan{-}Major and
Simon Mille and
Emiel van Miltenburg and
Moin Nadeem and
Shashi Narayan and
Vitaly Nikolaev and
Rubungo Andre Niyongabo and
Salomey Osei and
Ankur P. Parikh and
Laura Perez{-}Beltrachini and
Niranjan Ramesh Rao and
Vikas Raunak and
Juan Diego Rodriguez and
Sashank Santhanam and
Jo{\~{a} }o Sedoc and
Thibault Sellam and
Samira Shaikh and
Anastasia Shimorina and
Marco Antonio Sobrevilla Cabezudo and
Hendrik Strobelt and
Nishant Subramani and
Wei Xu and
Diyi Yang and
Akhila Yerukola and
Jiawei Zhou},
title = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
Metrics},
journal = {CoRR},
volume = {abs/2102.01672},
year = {2021},
url = {https://arxiv.org/abs/2102.01672},
archivePrefix = {arXiv},
eprint = {2102.01672}
}
Note that each GEM dataset has its own citation. Please see the source to see
the correct citation for each contained dataset."
gem/wiki_lingua_dutch_nl
توضیحات پیکربندی : Wikilingua یک مجموعه داده چندزبانه در مقیاس بزرگ برای ارزیابی سیستمهای خلاصهسازی انتزاعی چند زبانه است.
حجم دانلود :
53.88 MiB
حجم مجموعه داده :
237.97 MiB
ذخیره خودکار ( مستندات ): بله (تست، اعتبارسنجی)، فقط زمانی که
shuffle_files=False
(قطار)تقسیم ها :
شکاف | مثال ها |
---|---|
'test' | 6,248 |
'train' | 21,866 |
'validation' | 3,123 |
- ساختار ویژگی :
FeaturesDict({
'gem_id': string,
'gem_parent_id': string,
'references': Sequence(string),
'source': string,
'source_aligned': Translation({
'en': Text(shape=(), dtype=string),
'nl': Text(shape=(), dtype=string),
}),
'target': string,
'target_aligned': Translation({
'en': Text(shape=(), dtype=string),
'nl': Text(shape=(), dtype=string),
}),
})
- مستندات ویژگی :
ویژگی | کلاس | شکل | نوع D | شرح |
---|---|---|---|---|
FeaturesDict | ||||
gem_id | تانسور | رشته | ||
gem_parent_id | تانسور | رشته | ||
منابع | دنباله (تنسور) | (هیچ یک،) | رشته | |
منبع | تانسور | رشته | ||
source_aligned | ترجمه | |||
source_aligned/en | متن | رشته | ||
source_aligned/nl | متن | رشته | ||
هدف | تانسور | رشته | ||
target_aligned | ترجمه | |||
target_aligned/en | متن | رشته | ||
target_aligned/nl | متن | رشته |
- مثالها ( tfds.as_dataframe ):
- نقل قول :
@inproceedings{ladhak-wiki-2020,
title=WikiLingua: A New Benchmark Dataset for Multilingual Abstractive Summarization,
author={Faisal Ladhak, Esin Durmus, Claire Cardie and Kathleen McKeown},
booktitle={Findings of EMNLP, 2020},
year={2020}
}
@article{gehrmann2021gem,
author = {Sebastian Gehrmann and
Tosin P. Adewumi and
Karmanya Aggarwal and
Pawan Sasanka Ammanamanchi and
Aremu Anuoluwapo and
Antoine Bosselut and
Khyathi Raghavi Chandu and
Miruna{-}Adriana Clinciu and
Dipanjan Das and
Kaustubh D. Dhole and
Wanyu Du and
Esin Durmus and
Ondrej Dusek and
Chris Emezue and
Varun Gangal and
Cristina Garbacea and
Tatsunori Hashimoto and
Yufang Hou and
Yacine Jernite and
Harsh Jhamtani and
Yangfeng Ji and
Shailza Jolly and
Dhruv Kumar and
Faisal Ladhak and
Aman Madaan and
Mounica Maddela and
Khyati Mahajan and
Saad Mahamood and
Bodhisattwa Prasad Majumder and
Pedro Henrique Martins and
Angelina McMillan{-}Major and
Simon Mille and
Emiel van Miltenburg and
Moin Nadeem and
Shashi Narayan and
Vitaly Nikolaev and
Rubungo Andre Niyongabo and
Salomey Osei and
Ankur P. Parikh and
Laura Perez{-}Beltrachini and
Niranjan Ramesh Rao and
Vikas Raunak and
Juan Diego Rodriguez and
Sashank Santhanam and
Jo{\~{a} }o Sedoc and
Thibault Sellam and
Samira Shaikh and
Anastasia Shimorina and
Marco Antonio Sobrevilla Cabezudo and
Hendrik Strobelt and
Nishant Subramani and
Wei Xu and
Diyi Yang and
Akhila Yerukola and
Jiawei Zhou},
title = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
Metrics},
journal = {CoRR},
volume = {abs/2102.01672},
year = {2021},
url = {https://arxiv.org/abs/2102.01672},
archivePrefix = {arXiv},
eprint = {2102.01672}
}
Note that each GEM dataset has its own citation. Please see the source to see
the correct citation for each contained dataset."
gem/wiki_lingua_english_en
توضیحات پیکربندی : Wikilingua یک مجموعه داده چندزبانه در مقیاس بزرگ برای ارزیابی سیستمهای خلاصهسازی انتزاعی چند زبانه است.
حجم دانلود :
112.56 MiB
حجم مجموعه داده :
657.51 MiB
ذخیره خودکار ( اسناد ): خیر
تقسیم ها :
شکاف | مثال ها |
---|---|
'test' | 28614 |
'train' | 99,020 |
'validation' | 13,823 |
- ساختار ویژگی :
FeaturesDict({
'gem_id': string,
'gem_parent_id': string,
'references': Sequence(string),
'source': string,
'source_aligned': Translation({
'en': Text(shape=(), dtype=string),
}),
'target': string,
'target_aligned': Translation({
'en': Text(shape=(), dtype=string),
}),
})
- مستندات ویژگی :
ویژگی | کلاس | شکل | نوع D | شرح |
---|---|---|---|---|
FeaturesDict | ||||
gem_id | تانسور | رشته | ||
gem_parent_id | تانسور | رشته | ||
منابع | دنباله (تنسور) | (هیچ یک،) | رشته | |
منبع | تانسور | رشته | ||
source_aligned | ترجمه | |||
source_aligned/en | متن | رشته | ||
هدف | تانسور | رشته | ||
target_aligned | ترجمه | |||
target_aligned/en | متن | رشته |
- مثالها ( tfds.as_dataframe ):
- نقل قول :
@inproceedings{ladhak-wiki-2020,
title=WikiLingua: A New Benchmark Dataset for Multilingual Abstractive Summarization,
author={Faisal Ladhak, Esin Durmus, Claire Cardie and Kathleen McKeown},
booktitle={Findings of EMNLP, 2020},
year={2020}
}
@article{gehrmann2021gem,
author = {Sebastian Gehrmann and
Tosin P. Adewumi and
Karmanya Aggarwal and
Pawan Sasanka Ammanamanchi and
Aremu Anuoluwapo and
Antoine Bosselut and
Khyathi Raghavi Chandu and
Miruna{-}Adriana Clinciu and
Dipanjan Das and
Kaustubh D. Dhole and
Wanyu Du and
Esin Durmus and
Ondrej Dusek and
Chris Emezue and
Varun Gangal and
Cristina Garbacea and
Tatsunori Hashimoto and
Yufang Hou and
Yacine Jernite and
Harsh Jhamtani and
Yangfeng Ji and
Shailza Jolly and
Dhruv Kumar and
Faisal Ladhak and
Aman Madaan and
Mounica Maddela and
Khyati Mahajan and
Saad Mahamood and
Bodhisattwa Prasad Majumder and
Pedro Henrique Martins and
Angelina McMillan{-}Major and
Simon Mille and
Emiel van Miltenburg and
Moin Nadeem and
Shashi Narayan and
Vitaly Nikolaev and
Rubungo Andre Niyongabo and
Salomey Osei and
Ankur P. Parikh and
Laura Perez{-}Beltrachini and
Niranjan Ramesh Rao and
Vikas Raunak and
Juan Diego Rodriguez and
Sashank Santhanam and
Jo{\~{a} }o Sedoc and
Thibault Sellam and
Samira Shaikh and
Anastasia Shimorina and
Marco Antonio Sobrevilla Cabezudo and
Hendrik Strobelt and
Nishant Subramani and
Wei Xu and
Diyi Yang and
Akhila Yerukola and
Jiawei Zhou},
title = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
Metrics},
journal = {CoRR},
volume = {abs/2102.01672},
year = {2021},
url = {https://arxiv.org/abs/2102.01672},
archivePrefix = {arXiv},
eprint = {2102.01672}
}
Note that each GEM dataset has its own citation. Please see the source to see
the correct citation for each contained dataset."
gem/wiki_lingua_french_fr
توضیحات پیکربندی : Wikilingua یک مجموعه داده چندزبانه در مقیاس بزرگ برای ارزیابی سیستمهای خلاصهسازی انتزاعی چند زبانه است.
حجم دانلود :
113.26 MiB
حجم مجموعه داده :
522.28 MiB
ذخیره خودکار ( اسناد ): خیر
تقسیم ها :
شکاف | مثال ها |
---|---|
'test' | 12731 |
'train' | 44,556 |
'validation' | 6,364 |
- ساختار ویژگی :
FeaturesDict({
'gem_id': string,
'gem_parent_id': string,
'references': Sequence(string),
'source': string,
'source_aligned': Translation({
'en': Text(shape=(), dtype=string),
'fr': Text(shape=(), dtype=string),
}),
'target': string,
'target_aligned': Translation({
'en': Text(shape=(), dtype=string),
'fr': Text(shape=(), dtype=string),
}),
})
- مستندات ویژگی :
ویژگی | کلاس | شکل | نوع D | شرح |
---|---|---|---|---|
FeaturesDict | ||||
gem_id | تانسور | رشته | ||
gem_parent_id | تانسور | رشته | ||
منابع | دنباله (تنسور) | (هیچ یک،) | رشته | |
منبع | تانسور | رشته | ||
source_aligned | ترجمه | |||
source_aligned/en | متن | رشته | ||
source_aligned/fr | متن | رشته | ||
هدف | تانسور | رشته | ||
target_aligned | ترجمه | |||
target_aligned/en | متن | رشته | ||
target_aligned/fr | متن | رشته |
- مثالها ( tfds.as_dataframe ):
- نقل قول :
@inproceedings{ladhak-wiki-2020,
title=WikiLingua: A New Benchmark Dataset for Multilingual Abstractive Summarization,
author={Faisal Ladhak, Esin Durmus, Claire Cardie and Kathleen McKeown},
booktitle={Findings of EMNLP, 2020},
year={2020}
}
@article{gehrmann2021gem,
author = {Sebastian Gehrmann and
Tosin P. Adewumi and
Karmanya Aggarwal and
Pawan Sasanka Ammanamanchi and
Aremu Anuoluwapo and
Antoine Bosselut and
Khyathi Raghavi Chandu and
Miruna{-}Adriana Clinciu and
Dipanjan Das and
Kaustubh D. Dhole and
Wanyu Du and
Esin Durmus and
Ondrej Dusek and
Chris Emezue and
Varun Gangal and
Cristina Garbacea and
Tatsunori Hashimoto and
Yufang Hou and
Yacine Jernite and
Harsh Jhamtani and
Yangfeng Ji and
Shailza Jolly and
Dhruv Kumar and
Faisal Ladhak and
Aman Madaan and
Mounica Maddela and
Khyati Mahajan and
Saad Mahamood and
Bodhisattwa Prasad Majumder and
Pedro Henrique Martins and
Angelina McMillan{-}Major and
Simon Mille and
Emiel van Miltenburg and
Moin Nadeem and
Shashi Narayan and
Vitaly Nikolaev and
Rubungo Andre Niyongabo and
Salomey Osei and
Ankur P. Parikh and
Laura Perez{-}Beltrachini and
Niranjan Ramesh Rao and
Vikas Raunak and
Juan Diego Rodriguez and
Sashank Santhanam and
Jo{\~{a} }o Sedoc and
Thibault Sellam and
Samira Shaikh and
Anastasia Shimorina and
Marco Antonio Sobrevilla Cabezudo and
Hendrik Strobelt and
Nishant Subramani and
Wei Xu and
Diyi Yang and
Akhila Yerukola and
Jiawei Zhou},
title = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
Metrics},
journal = {CoRR},
volume = {abs/2102.01672},
year = {2021},
url = {https://arxiv.org/abs/2102.01672},
archivePrefix = {arXiv},
eprint = {2102.01672}
}
Note that each GEM dataset has its own citation. Please see the source to see
the correct citation for each contained dataset."
gem/wiki_lingua_german_de
توضیحات پیکربندی : Wikilingua یک مجموعه داده چندزبانه در مقیاس بزرگ برای ارزیابی سیستمهای خلاصهسازی انتزاعی چند زبانه است.
حجم دانلود :
102.65 MiB
حجم مجموعه داده :
452.46 MiB
ذخیره خودکار ( اسناد ): خیر
تقسیم ها :
شکاف | مثال ها |
---|---|
'test' | 11669 |
'train' | 40,839 |
'validation' | 5,833 |
- ساختار ویژگی :
FeaturesDict({
'gem_id': string,
'gem_parent_id': string,
'references': Sequence(string),
'source': string,
'source_aligned': Translation({
'de': Text(shape=(), dtype=string),
'en': Text(shape=(), dtype=string),
}),
'target': string,
'target_aligned': Translation({
'de': Text(shape=(), dtype=string),
'en': Text(shape=(), dtype=string),
}),
})
- مستندات ویژگی :
ویژگی | کلاس | شکل | نوع D | شرح |
---|---|---|---|---|
FeaturesDict | ||||
gem_id | تانسور | رشته | ||
gem_parent_id | تانسور | رشته | ||
منابع | دنباله (تنسور) | (هیچ یک،) | رشته | |
منبع | تانسور | رشته | ||
source_aligned | ترجمه | |||
source_aligned/de | متن | رشته | ||
source_aligned/en | متن | رشته | ||
هدف | تانسور | رشته | ||
target_aligned | ترجمه | |||
target_aligned/de | متن | رشته | ||
target_aligned/en | متن | رشته |
- مثالها ( tfds.as_dataframe ):
- نقل قول :
@inproceedings{ladhak-wiki-2020,
title=WikiLingua: A New Benchmark Dataset for Multilingual Abstractive Summarization,
author={Faisal Ladhak, Esin Durmus, Claire Cardie and Kathleen McKeown},
booktitle={Findings of EMNLP, 2020},
year={2020}
}
@article{gehrmann2021gem,
author = {Sebastian Gehrmann and
Tosin P. Adewumi and
Karmanya Aggarwal and
Pawan Sasanka Ammanamanchi and
Aremu Anuoluwapo and
Antoine Bosselut and
Khyathi Raghavi Chandu and
Miruna{-}Adriana Clinciu and
Dipanjan Das and
Kaustubh D. Dhole and
Wanyu Du and
Esin Durmus and
Ondrej Dusek and
Chris Emezue and
Varun Gangal and
Cristina Garbacea and
Tatsunori Hashimoto and
Yufang Hou and
Yacine Jernite and
Harsh Jhamtani and
Yangfeng Ji and
Shailza Jolly and
Dhruv Kumar and
Faisal Ladhak and
Aman Madaan and
Mounica Maddela and
Khyati Mahajan and
Saad Mahamood and
Bodhisattwa Prasad Majumder and
Pedro Henrique Martins and
Angelina McMillan{-}Major and
Simon Mille and
Emiel van Miltenburg and
Moin Nadeem and
Shashi Narayan and
Vitaly Nikolaev and
Rubungo Andre Niyongabo and
Salomey Osei and
Ankur P. Parikh and
Laura Perez{-}Beltrachini and
Niranjan Ramesh Rao and
Vikas Raunak and
Juan Diego Rodriguez and
Sashank Santhanam and
Jo{\~{a} }o Sedoc and
Thibault Sellam and
Samira Shaikh and
Anastasia Shimorina and
Marco Antonio Sobrevilla Cabezudo and
Hendrik Strobelt and
Nishant Subramani and
Wei Xu and
Diyi Yang and
Akhila Yerukola and
Jiawei Zhou},
title = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
Metrics},
journal = {CoRR},
volume = {abs/2102.01672},
year = {2021},
url = {https://arxiv.org/abs/2102.01672},
archivePrefix = {arXiv},
eprint = {2102.01672}
}
Note that each GEM dataset has its own citation. Please see the source to see
the correct citation for each contained dataset."
gem/wiki_lingua_hindi_hi
توضیحات پیکربندی : Wikilingua یک مجموعه داده چندزبانه در مقیاس بزرگ برای ارزیابی سیستمهای خلاصهسازی انتزاعی چند زبانه است.
حجم دانلود :
20.07 MiB
حجم مجموعه داده :
138.06 MiB
ذخیره خودکار ( اسناد ): بله
تقسیم ها :
شکاف | مثال ها |
---|---|
'test' | 1,984 |
'train' | 6942 |
'validation' | 991 |
- ساختار ویژگی :
FeaturesDict({
'gem_id': string,
'gem_parent_id': string,
'references': Sequence(string),
'source': string,
'source_aligned': Translation({
'en': Text(shape=(), dtype=string),
'hi': Text(shape=(), dtype=string),
}),
'target': string,
'target_aligned': Translation({
'en': Text(shape=(), dtype=string),
'hi': Text(shape=(), dtype=string),
}),
})
- مستندات ویژگی :
ویژگی | کلاس | شکل | نوع D | شرح |
---|---|---|---|---|
FeaturesDict | ||||
gem_id | تانسور | رشته | ||
gem_parent_id | تانسور | رشته | ||
منابع | دنباله (تنسور) | (هیچ یک،) | رشته | |
منبع | تانسور | رشته | ||
source_aligned | ترجمه | |||
source_aligned/en | متن | رشته | ||
source_aligned/سلام | متن | رشته | ||
هدف | تانسور | رشته | ||
target_aligned | ترجمه | |||
target_aligned/en | متن | رشته | ||
target_aligned/سلام | متن | رشته |
- مثالها ( tfds.as_dataframe ):
- نقل قول :
@inproceedings{ladhak-wiki-2020,
title=WikiLingua: A New Benchmark Dataset for Multilingual Abstractive Summarization,
author={Faisal Ladhak, Esin Durmus, Claire Cardie and Kathleen McKeown},
booktitle={Findings of EMNLP, 2020},
year={2020}
}
@article{gehrmann2021gem,
author = {Sebastian Gehrmann and
Tosin P. Adewumi and
Karmanya Aggarwal and
Pawan Sasanka Ammanamanchi and
Aremu Anuoluwapo and
Antoine Bosselut and
Khyathi Raghavi Chandu and
Miruna{-}Adriana Clinciu and
Dipanjan Das and
Kaustubh D. Dhole and
Wanyu Du and
Esin Durmus and
Ondrej Dusek and
Chris Emezue and
Varun Gangal and
Cristina Garbacea and
Tatsunori Hashimoto and
Yufang Hou and
Yacine Jernite and
Harsh Jhamtani and
Yangfeng Ji and
Shailza Jolly and
Dhruv Kumar and
Faisal Ladhak and
Aman Madaan and
Mounica Maddela and
Khyati Mahajan and
Saad Mahamood and
Bodhisattwa Prasad Majumder and
Pedro Henrique Martins and
Angelina McMillan{-}Major and
Simon Mille and
Emiel van Miltenburg and
Moin Nadeem and
Shashi Narayan and
Vitaly Nikolaev and
Rubungo Andre Niyongabo and
Salomey Osei and
Ankur P. Parikh and
Laura Perez{-}Beltrachini and
Niranjan Ramesh Rao and
Vikas Raunak and
Juan Diego Rodriguez and
Sashank Santhanam and
Jo{\~{a} }o Sedoc and
Thibault Sellam and
Samira Shaikh and
Anastasia Shimorina and
Marco Antonio Sobrevilla Cabezudo and
Hendrik Strobelt and
Nishant Subramani and
Wei Xu and
Diyi Yang and
Akhila Yerukola and
Jiawei Zhou},
title = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
Metrics},
journal = {CoRR},
volume = {abs/2102.01672},
year = {2021},
url = {https://arxiv.org/abs/2102.01672},
archivePrefix = {arXiv},
eprint = {2102.01672}
}
Note that each GEM dataset has its own citation. Please see the source to see
the correct citation for each contained dataset."
gem/wiki_lingua_i Indonesia_id
توضیحات پیکربندی : Wikilingua یک مجموعه داده چندزبانه در مقیاس بزرگ برای ارزیابی سیستمهای خلاصهسازی انتزاعی چند زبانه است.
حجم دانلود :
80.08 MiB
حجم مجموعه داده :
370.63 MiB
ذخیره خودکار ( اسناد ): خیر
تقسیم ها :
شکاف | مثال ها |
---|---|
'test' | 9,497 |
'train' | 33,237 |
'validation' | 4,747 |
- ساختار ویژگی :
FeaturesDict({
'gem_id': string,
'gem_parent_id': string,
'references': Sequence(string),
'source': string,
'source_aligned': Translation({
'en': Text(shape=(), dtype=string),
'id': Text(shape=(), dtype=string),
}),
'target': string,
'target_aligned': Translation({
'en': Text(shape=(), dtype=string),
'id': Text(shape=(), dtype=string),
}),
})
- مستندات ویژگی :
ویژگی | کلاس | شکل | نوع D | شرح |
---|---|---|---|---|
FeaturesDict | ||||
gem_id | تانسور | رشته | ||
gem_parent_id | تانسور | رشته | ||
منابع | دنباله (تنسور) | (هیچ یک،) | رشته | |
منبع | تانسور | رشته | ||
source_aligned | ترجمه | |||
source_aligned/en | متن | رشته | ||
source_aligned/id | متن | رشته | ||
هدف | تانسور | رشته | ||
target_aligned | ترجمه | |||
target_aligned/en | متن | رشته | ||
target_aligned/id | متن | رشته |
- مثالها ( tfds.as_dataframe ):
- نقل قول :
@inproceedings{ladhak-wiki-2020,
title=WikiLingua: A New Benchmark Dataset for Multilingual Abstractive Summarization,
author={Faisal Ladhak, Esin Durmus, Claire Cardie and Kathleen McKeown},
booktitle={Findings of EMNLP, 2020},
year={2020}
}
@article{gehrmann2021gem,
author = {Sebastian Gehrmann and
Tosin P. Adewumi and
Karmanya Aggarwal and
Pawan Sasanka Ammanamanchi and
Aremu Anuoluwapo and
Antoine Bosselut and
Khyathi Raghavi Chandu and
Miruna{-}Adriana Clinciu and
Dipanjan Das and
Kaustubh D. Dhole and
Wanyu Du and
Esin Durmus and
Ondrej Dusek and
Chris Emezue and
Varun Gangal and
Cristina Garbacea and
Tatsunori Hashimoto and
Yufang Hou and
Yacine Jernite and
Harsh Jhamtani and
Yangfeng Ji and
Shailza Jolly and
Dhruv Kumar and
Faisal Ladhak and
Aman Madaan and
Mounica Maddela and
Khyati Mahajan and
Saad Mahamood and
Bodhisattwa Prasad Majumder and
Pedro Henrique Martins and
Angelina McMillan{-}Major and
Simon Mille and
Emiel van Miltenburg and
Moin Nadeem and
Shashi Narayan and
Vitaly Nikolaev and
Rubungo Andre Niyongabo and
Salomey Osei and
Ankur P. Parikh and
Laura Perez{-}Beltrachini and
Niranjan Ramesh Rao and
Vikas Raunak and
Juan Diego Rodriguez and
Sashank Santhanam and
Jo{\~{a} }o Sedoc and
Thibault Sellam and
Samira Shaikh and
Anastasia Shimorina and
Marco Antonio Sobrevilla Cabezudo and
Hendrik Strobelt and
Nishant Subramani and
Wei Xu and
Diyi Yang and
Akhila Yerukola and
Jiawei Zhou},
title = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
Metrics},
journal = {CoRR},
volume = {abs/2102.01672},
year = {2021},
url = {https://arxiv.org/abs/2102.01672},
archivePrefix = {arXiv},
eprint = {2102.01672}
}
Note that each GEM dataset has its own citation. Please see the source to see
the correct citation for each contained dataset."
gem/wiki_lingua_italian_it
توضیحات پیکربندی : Wikilingua یک مجموعه داده چندزبانه در مقیاس بزرگ برای ارزیابی سیستمهای خلاصهسازی انتزاعی چند زبانه است.
حجم دانلود :
84.80 MiB
حجم مجموعه داده :
374.40 MiB
ذخیره خودکار ( اسناد ): خیر
تقسیم ها :
شکاف | مثال ها |
---|---|
'test' | 10,189 |
'train' | 35661 |
'validation' | 5,093 |
- ساختار ویژگی :
FeaturesDict({
'gem_id': string,
'gem_parent_id': string,
'references': Sequence(string),
'source': string,
'source_aligned': Translation({
'en': Text(shape=(), dtype=string),
'it': Text(shape=(), dtype=string),
}),
'target': string,
'target_aligned': Translation({
'en': Text(shape=(), dtype=string),
'it': Text(shape=(), dtype=string),
}),
})
- مستندات ویژگی :
ویژگی | کلاس | شکل | نوع D | شرح |
---|---|---|---|---|
FeaturesDict | ||||
gem_id | تانسور | رشته | ||
gem_parent_id | تانسور | رشته | ||
منابع | دنباله (تنسور) | (هیچ یک،) | رشته | |
منبع | تانسور | رشته | ||
source_aligned | ترجمه | |||
source_aligned/en | متن | رشته | ||
source_aligned/it | متن | رشته | ||
هدف | تانسور | رشته | ||
target_aligned | ترجمه | |||
target_aligned/en | متن | رشته | ||
target_aligned/it | متن | رشته |
- مثالها ( tfds.as_dataframe ):
- نقل قول :
@inproceedings{ladhak-wiki-2020,
title=WikiLingua: A New Benchmark Dataset for Multilingual Abstractive Summarization,
author={Faisal Ladhak, Esin Durmus, Claire Cardie and Kathleen McKeown},
booktitle={Findings of EMNLP, 2020},
year={2020}
}
@article{gehrmann2021gem,
author = {Sebastian Gehrmann and
Tosin P. Adewumi and
Karmanya Aggarwal and
Pawan Sasanka Ammanamanchi and
Aremu Anuoluwapo and
Antoine Bosselut and
Khyathi Raghavi Chandu and
Miruna{-}Adriana Clinciu and
Dipanjan Das and
Kaustubh D. Dhole and
Wanyu Du and
Esin Durmus and
Ondrej Dusek and
Chris Emezue and
Varun Gangal and
Cristina Garbacea and
Tatsunori Hashimoto and
Yufang Hou and
Yacine Jernite and
Harsh Jhamtani and
Yangfeng Ji and
Shailza Jolly and
Dhruv Kumar and
Faisal Ladhak and
Aman Madaan and
Mounica Maddela and
Khyati Mahajan and
Saad Mahamood and
Bodhisattwa Prasad Majumder and
Pedro Henrique Martins and
Angelina McMillan{-}Major and
Simon Mille and
Emiel van Miltenburg and
Moin Nadeem and
Shashi Narayan and
Vitaly Nikolaev and
Rubungo Andre Niyongabo and
Salomey Osei and
Ankur P. Parikh and
Laura Perez{-}Beltrachini and
Niranjan Ramesh Rao and
Vikas Raunak and
Juan Diego Rodriguez and
Sashank Santhanam and
Jo{\~{a} }o Sedoc and
Thibault Sellam and
Samira Shaikh and
Anastasia Shimorina and
Marco Antonio Sobrevilla Cabezudo and
Hendrik Strobelt and
Nishant Subramani and
Wei Xu and
Diyi Yang and
Akhila Yerukola and
Jiawei Zhou},
title = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
Metrics},
journal = {CoRR},
volume = {abs/2102.01672},
year = {2021},
url = {https://arxiv.org/abs/2102.01672},
archivePrefix = {arXiv},
eprint = {2102.01672}
}
Note that each GEM dataset has its own citation. Please see the source to see
the correct citation for each contained dataset."
gem/wiki_lingua_japanese_ja
توضیحات پیکربندی : Wikilingua یک مجموعه داده چندزبانه در مقیاس بزرگ برای ارزیابی سیستمهای خلاصهسازی انتزاعی چند زبانه است.
حجم دانلود :
21.75 MiB
حجم مجموعه داده :
103.19 MiB
ذخیره خودکار ( اسناد ): بله
تقسیم ها :
شکاف | مثال ها |
---|---|
'test' | 2,530 |
'train' | 8853 |
'validation' | 1264 |
- ساختار ویژگی :
FeaturesDict({
'gem_id': string,
'gem_parent_id': string,
'references': Sequence(string),
'source': string,
'source_aligned': Translation({
'en': Text(shape=(), dtype=string),
'ja': Text(shape=(), dtype=string),
}),
'target': string,
'target_aligned': Translation({
'en': Text(shape=(), dtype=string),
'ja': Text(shape=(), dtype=string),
}),
})
- مستندات ویژگی :
ویژگی | کلاس | شکل | نوع D | شرح |
---|---|---|---|---|
FeaturesDict | ||||
gem_id | تانسور | رشته | ||
gem_parent_id | تانسور | رشته | ||
منابع | دنباله (تنسور) | (هیچ یک،) | رشته | |
منبع | تانسور | رشته | ||
source_aligned | ترجمه | |||
source_aligned/en | متن | رشته | ||
source_aligned/ja | متن | رشته | ||
هدف | تانسور | رشته | ||
target_aligned | ترجمه | |||
target_aligned/en | متن | رشته | ||
target_aligned/ja | متن | رشته |
- مثالها ( tfds.as_dataframe ):
- نقل قول :
@inproceedings{ladhak-wiki-2020,
title=WikiLingua: A New Benchmark Dataset for Multilingual Abstractive Summarization,
author={Faisal Ladhak, Esin Durmus, Claire Cardie and Kathleen McKeown},
booktitle={Findings of EMNLP, 2020},
year={2020}
}
@article{gehrmann2021gem,
author = {Sebastian Gehrmann and
Tosin P. Adewumi and
Karmanya Aggarwal and
Pawan Sasanka Ammanamanchi and
Aremu Anuoluwapo and
Antoine Bosselut and
Khyathi Raghavi Chandu and
Miruna{-}Adriana Clinciu and
Dipanjan Das and
Kaustubh D. Dhole and
Wanyu Du and
Esin Durmus and
Ondrej Dusek and
Chris Emezue and
Varun Gangal and
Cristina Garbacea and
Tatsunori Hashimoto and
Yufang Hou and
Yacine Jernite and
Harsh Jhamtani and
Yangfeng Ji and
Shailza Jolly and
Dhruv Kumar and
Faisal Ladhak and
Aman Madaan and
Mounica Maddela and
Khyati Mahajan and
Saad Mahamood and
Bodhisattwa Prasad Majumder and
Pedro Henrique Martins and
Angelina McMillan{-}Major and
Simon Mille and
Emiel van Miltenburg and
Moin Nadeem and
Shashi Narayan and
Vitaly Nikolaev and
Rubungo Andre Niyongabo and
Salomey Osei and
Ankur P. Parikh and
Laura Perez{-}Beltrachini and
Niranjan Ramesh Rao and
Vikas Raunak and
Juan Diego Rodriguez and
Sashank Santhanam and
Jo{\~{a} }o Sedoc and
Thibault Sellam and
Samira Shaikh and
Anastasia Shimorina and
Marco Antonio Sobrevilla Cabezudo and
Hendrik Strobelt and
Nishant Subramani and
Wei Xu and
Diyi Yang and
Akhila Yerukola and
Jiawei Zhou},
title = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
Metrics},
journal = {CoRR},
volume = {abs/2102.01672},
year = {2021},
url = {https://arxiv.org/abs/2102.01672},
archivePrefix = {arXiv},
eprint = {2102.01672}
}
Note that each GEM dataset has its own citation. Please see the source to see
the correct citation for each contained dataset."
gem/wiki_lingua_korean_ko
توضیحات پیکربندی : Wikilingua یک مجموعه داده چندزبانه در مقیاس بزرگ برای ارزیابی سیستمهای خلاصهسازی انتزاعی چند زبانه است.
حجم دانلود :
22.26 MiB
حجم مجموعه داده :
102.35 MiB
ذخیره خودکار ( اسناد ): بله
تقسیم ها :
شکاف | مثال ها |
---|---|
'test' | 2,436 |
'train' | 8,524 |
'validation' | 1,216 |
- ساختار ویژگی :
FeaturesDict({
'gem_id': string,
'gem_parent_id': string,
'references': Sequence(string),
'source': string,
'source_aligned': Translation({
'en': Text(shape=(), dtype=string),
'ko': Text(shape=(), dtype=string),
}),
'target': string,
'target_aligned': Translation({
'en': Text(shape=(), dtype=string),
'ko': Text(shape=(), dtype=string),
}),
})
- مستندات ویژگی :
ویژگی | کلاس | شکل | نوع D | شرح |
---|---|---|---|---|
FeaturesDict | ||||
gem_id | تانسور | رشته | ||
gem_parent_id | تانسور | رشته | ||
منابع | دنباله (تنسور) | (هیچ یک،) | رشته | |
منبع | تانسور | رشته | ||
source_aligned | ترجمه | |||
source_aligned/en | متن | رشته | ||
source_aligned/ko | متن | رشته | ||
هدف | تانسور | رشته | ||
target_aligned | ترجمه | |||
target_aligned/en | متن | رشته | ||
target_aligned/ko | متن | رشته |
- مثالها ( tfds.as_dataframe ):
- نقل قول :
@inproceedings{ladhak-wiki-2020,
title=WikiLingua: A New Benchmark Dataset for Multilingual Abstractive Summarization,
author={Faisal Ladhak, Esin Durmus, Claire Cardie and Kathleen McKeown},
booktitle={Findings of EMNLP, 2020},
year={2020}
}
@article{gehrmann2021gem,
author = {Sebastian Gehrmann and
Tosin P. Adewumi and
Karmanya Aggarwal and
Pawan Sasanka Ammanamanchi and
Aremu Anuoluwapo and
Antoine Bosselut and
Khyathi Raghavi Chandu and
Miruna{-}Adriana Clinciu and
Dipanjan Das and
Kaustubh D. Dhole and
Wanyu Du and
Esin Durmus and
Ondrej Dusek and
Chris Emezue and
Varun Gangal and
Cristina Garbacea and
Tatsunori Hashimoto and
Yufang Hou and
Yacine Jernite and
Harsh Jhamtani and
Yangfeng Ji and
Shailza Jolly and
Dhruv Kumar and
Faisal Ladhak and
Aman Madaan and
Mounica Maddela and
Khyati Mahajan and
Saad Mahamood and
Bodhisattwa Prasad Majumder and
Pedro Henrique Martins and
Angelina McMillan{-}Major and
Simon Mille and
Emiel van Miltenburg and
Moin Nadeem and
Shashi Narayan and
Vitaly Nikolaev and
Rubungo Andre Niyongabo and
Salomey Osei and
Ankur P. Parikh and
Laura Perez{-}Beltrachini and
Niranjan Ramesh Rao and
Vikas Raunak and
Juan Diego Rodriguez and
Sashank Santhanam and
Jo{\~{a} }o Sedoc and
Thibault Sellam and
Samira Shaikh and
Anastasia Shimorina and
Marco Antonio Sobrevilla Cabezudo and
Hendrik Strobelt and
Nishant Subramani and
Wei Xu and
Diyi Yang and
Akhila Yerukola and
Jiawei Zhou},
title = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
Metrics},
journal = {CoRR},
volume = {abs/2102.01672},
year = {2021},
url = {https://arxiv.org/abs/2102.01672},
archivePrefix = {arXiv},
eprint = {2102.01672}
}
Note that each GEM dataset has its own citation. Please see the source to see
the correct citation for each contained dataset."
gem/wiki_lingua_portuguese_pt
توضیحات پیکربندی : Wikilingua یک مجموعه داده چندزبانه در مقیاس بزرگ برای ارزیابی سیستمهای خلاصهسازی انتزاعی چند زبانه است.
حجم دانلود :
131.17 MiB
حجم مجموعه داده :
570.46 MiB
ذخیره خودکار ( اسناد ): خیر
تقسیم ها :
شکاف | مثال ها |
---|---|
'test' | 16,331 |
'train' | 57,159 |
'validation' | 8,165 |
- ساختار ویژگی :
FeaturesDict({
'gem_id': string,
'gem_parent_id': string,
'references': Sequence(string),
'source': string,
'source_aligned': Translation({
'en': Text(shape=(), dtype=string),
'pt': Text(shape=(), dtype=string),
}),
'target': string,
'target_aligned': Translation({
'en': Text(shape=(), dtype=string),
'pt': Text(shape=(), dtype=string),
}),
})
- مستندات ویژگی :
ویژگی | کلاس | شکل | نوع D | شرح |
---|---|---|---|---|
FeaturesDict | ||||
gem_id | تانسور | رشته | ||
gem_parent_id | تانسور | رشته | ||
منابع | دنباله (تنسور) | (هیچ یک،) | رشته | |
منبع | تانسور | رشته | ||
source_aligned | ترجمه | |||
source_aligned/en | متن | رشته | ||
source_aligned/pt | متن | رشته | ||
هدف | تانسور | رشته | ||
target_aligned | ترجمه | |||
target_aligned/en | متن | رشته | ||
target_aligned/pt | متن | رشته |
- مثالها ( tfds.as_dataframe ):
- نقل قول :
@inproceedings{ladhak-wiki-2020,
title=WikiLingua: A New Benchmark Dataset for Multilingual Abstractive Summarization,
author={Faisal Ladhak, Esin Durmus, Claire Cardie and Kathleen McKeown},
booktitle={Findings of EMNLP, 2020},
year={2020}
}
@article{gehrmann2021gem,
author = {Sebastian Gehrmann and
Tosin P. Adewumi and
Karmanya Aggarwal and
Pawan Sasanka Ammanamanchi and
Aremu Anuoluwapo and
Antoine Bosselut and
Khyathi Raghavi Chandu and
Miruna{-}Adriana Clinciu and
Dipanjan Das and
Kaustubh D. Dhole and
Wanyu Du and
Esin Durmus and
Ondrej Dusek and
Chris Emezue and
Varun Gangal and
Cristina Garbacea and
Tatsunori Hashimoto and
Yufang Hou and
Yacine Jernite and
Harsh Jhamtani and
Yangfeng Ji and
Shailza Jolly and
Dhruv Kumar and
Faisal Ladhak and
Aman Madaan and
Mounica Maddela and
Khyati Mahajan and
Saad Mahamood and
Bodhisattwa Prasad Majumder and
Pedro Henrique Martins and
Angelina McMillan{-}Major and
Simon Mille and
Emiel van Miltenburg and
Moin Nadeem and
Shashi Narayan and
Vitaly Nikolaev and
Rubungo Andre Niyongabo and
Salomey Osei and
Ankur P. Parikh and
Laura Perez{-}Beltrachini and
Niranjan Ramesh Rao and
Vikas Raunak and
Juan Diego Rodriguez and
Sashank Santhanam and
Jo{\~{a} }o Sedoc and
Thibault Sellam and
Samira Shaikh and
Anastasia Shimorina and
Marco Antonio Sobrevilla Cabezudo and
Hendrik Strobelt and
Nishant Subramani and
Wei Xu and
Diyi Yang and
Akhila Yerukola and
Jiawei Zhou},
title = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
Metrics},
journal = {CoRR},
volume = {abs/2102.01672},
year = {2021},
url = {https://arxiv.org/abs/2102.01672},
archivePrefix = {arXiv},
eprint = {2102.01672}
}
Note that each GEM dataset has its own citation. Please see the source to see
the correct citation for each contained dataset."
gem/wiki_lingua_russian_ru
توضیحات پیکربندی : Wikilingua یک مجموعه داده چندزبانه در مقیاس بزرگ برای ارزیابی سیستمهای خلاصهسازی انتزاعی چند زبانه است.
حجم دانلود :
101.36 MiB
حجم مجموعه داده :
564.69 MiB
ذخیره خودکار ( اسناد ): خیر
تقسیم ها :
شکاف | مثال ها |
---|---|
'test' | 10,580 |
'train' | 37,028 |
'validation' | 5,288 |
- ساختار ویژگی :
FeaturesDict({
'gem_id': string,
'gem_parent_id': string,
'references': Sequence(string),
'source': string,
'source_aligned': Translation({
'en': Text(shape=(), dtype=string),
'ru': Text(shape=(), dtype=string),
}),
'target': string,
'target_aligned': Translation({
'en': Text(shape=(), dtype=string),
'ru': Text(shape=(), dtype=string),
}),
})
- مستندات ویژگی :
ویژگی | کلاس | شکل | نوع D | شرح |
---|---|---|---|---|
FeaturesDict | ||||
gem_id | تانسور | رشته | ||
gem_parent_id | تانسور | رشته | ||
منابع | دنباله (تنسور) | (هیچ یک،) | رشته | |
منبع | تانسور | رشته | ||
source_aligned | ترجمه | |||
source_aligned/en | متن | رشته | ||
source_aligned/ru | متن | رشته | ||
هدف | تانسور | رشته | ||
target_aligned | ترجمه | |||
target_aligned/en | متن | رشته | ||
target_aligned/ru | متن | رشته |
- مثالها ( tfds.as_dataframe ):
- نقل قول :
@inproceedings{ladhak-wiki-2020,
title=WikiLingua: A New Benchmark Dataset for Multilingual Abstractive Summarization,
author={Faisal Ladhak, Esin Durmus, Claire Cardie and Kathleen McKeown},
booktitle={Findings of EMNLP, 2020},
year={2020}
}
@article{gehrmann2021gem,
author = {Sebastian Gehrmann and
Tosin P. Adewumi and
Karmanya Aggarwal and
Pawan Sasanka Ammanamanchi and
Aremu Anuoluwapo and
Antoine Bosselut and
Khyathi Raghavi Chandu and
Miruna{-}Adriana Clinciu and
Dipanjan Das and
Kaustubh D. Dhole and
Wanyu Du and
Esin Durmus and
Ondrej Dusek and
Chris Emezue and
Varun Gangal and
Cristina Garbacea and
Tatsunori Hashimoto and
Yufang Hou and
Yacine Jernite and
Harsh Jhamtani and
Yangfeng Ji and
Shailza Jolly and
Dhruv Kumar and
Faisal Ladhak and
Aman Madaan and
Mounica Maddela and
Khyati Mahajan and
Saad Mahamood and
Bodhisattwa Prasad Majumder and
Pedro Henrique Martins and
Angelina McMillan{-}Major and
Simon Mille and
Emiel van Miltenburg and
Moin Nadeem and
Shashi Narayan and
Vitaly Nikolaev and
Rubungo Andre Niyongabo and
Salomey Osei and
Ankur P. Parikh and
Laura Perez{-}Beltrachini and
Niranjan Ramesh Rao and
Vikas Raunak and
Juan Diego Rodriguez and
Sashank Santhanam and
Jo{\~{a} }o Sedoc and
Thibault Sellam and
Samira Shaikh and
Anastasia Shimorina and
Marco Antonio Sobrevilla Cabezudo and
Hendrik Strobelt and
Nishant Subramani and
Wei Xu and
Diyi Yang and
Akhila Yerukola and
Jiawei Zhou},
title = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
Metrics},
journal = {CoRR},
volume = {abs/2102.01672},
year = {2021},
url = {https://arxiv.org/abs/2102.01672},
archivePrefix = {arXiv},
eprint = {2102.01672}
}
Note that each GEM dataset has its own citation. Please see the source to see
the correct citation for each contained dataset."
gem/wiki_lingua_spanish_es
توضیحات پیکربندی : Wikilingua یک مجموعه داده چندزبانه در مقیاس بزرگ برای ارزیابی سیستمهای خلاصهسازی انتزاعی چند زبانه است.
حجم دانلود :
189.06 MiB
حجم مجموعه داده :
849.75 MiB
ذخیره خودکار ( اسناد ): خیر
تقسیم ها :
شکاف | مثال ها |
---|---|
'test' | 22632 |
'train' | 79,212 |
'validation' | 11,316 |
- ساختار ویژگی :
FeaturesDict({
'gem_id': string,
'gem_parent_id': string,
'references': Sequence(string),
'source': string,
'source_aligned': Translation({
'en': Text(shape=(), dtype=string),
'es': Text(shape=(), dtype=string),
}),
'target': string,
'target_aligned': Translation({
'en': Text(shape=(), dtype=string),
'es': Text(shape=(), dtype=string),
}),
})
- مستندات ویژگی :
ویژگی | کلاس | شکل | نوع D | شرح |
---|---|---|---|---|
FeaturesDict | ||||
gem_id | تانسور | رشته | ||
gem_parent_id | تانسور | رشته | ||
منابع | دنباله (تنسور) | (هیچ یک،) | رشته | |
منبع | تانسور | رشته | ||
source_aligned | ترجمه | |||
source_aligned/en | متن | رشته | ||
source_aligned/es | متن | رشته | ||
هدف | تانسور | رشته | ||
target_aligned | ترجمه | |||
target_aligned/en | متن | رشته | ||
target_aligned/es | متن | رشته |
- مثالها ( tfds.as_dataframe ):
- نقل قول :
@inproceedings{ladhak-wiki-2020,
title=WikiLingua: A New Benchmark Dataset for Multilingual Abstractive Summarization,
author={Faisal Ladhak, Esin Durmus, Claire Cardie and Kathleen McKeown},
booktitle={Findings of EMNLP, 2020},
year={2020}
}
@article{gehrmann2021gem,
author = {Sebastian Gehrmann and
Tosin P. Adewumi and
Karmanya Aggarwal and
Pawan Sasanka Ammanamanchi and
Aremu Anuoluwapo and
Antoine Bosselut and
Khyathi Raghavi Chandu and
Miruna{-}Adriana Clinciu and
Dipanjan Das and
Kaustubh D. Dhole and
Wanyu Du and
Esin Durmus and
Ondrej Dusek and
Chris Emezue and
Varun Gangal and
Cristina Garbacea and
Tatsunori Hashimoto and
Yufang Hou and
Yacine Jernite and
Harsh Jhamtani and
Yangfeng Ji and
Shailza Jolly and
Dhruv Kumar and
Faisal Ladhak and
Aman Madaan and
Mounica Maddela and
Khyati Mahajan and
Saad Mahamood and
Bodhisattwa Prasad Majumder and
Pedro Henrique Martins and
Angelina McMillan{-}Major and
Simon Mille and
Emiel van Miltenburg and
Moin Nadeem and
Shashi Narayan and
Vitaly Nikolaev and
Rubungo Andre Niyongabo and
Salomey Osei and
Ankur P. Parikh and
Laura Perez{-}Beltrachini and
Niranjan Ramesh Rao and
Vikas Raunak and
Juan Diego Rodriguez and
Sashank Santhanam and
Jo{\~{a} }o Sedoc and
Thibault Sellam and
Samira Shaikh and
Anastasia Shimorina and
Marco Antonio Sobrevilla Cabezudo and
Hendrik Strobelt and
Nishant Subramani and
Wei Xu and
Diyi Yang and
Akhila Yerukola and
Jiawei Zhou},
title = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
Metrics},
journal = {CoRR},
volume = {abs/2102.01672},
year = {2021},
url = {https://arxiv.org/abs/2102.01672},
archivePrefix = {arXiv},
eprint = {2102.01672}
}
Note that each GEM dataset has its own citation. Please see the source to see
the correct citation for each contained dataset."
gem/wiki_lingua_thai_th
توضیحات پیکربندی : Wikilingua یک مجموعه داده چندزبانه در مقیاس بزرگ برای ارزیابی سیستمهای خلاصهسازی انتزاعی چند زبانه است.
حجم دانلود :
28.60 MiB
حجم مجموعه داده :
193.77 MiB
ذخیره خودکار ( مستندات ): بله (تست، اعتبارسنجی)، فقط زمانی که
shuffle_files=False
(قطار)تقسیم ها :
شکاف | مثال ها |
---|---|
'test' | 2950 |
'train' | 10,325 |
'validation' | 1,475 |
- ساختار ویژگی :
FeaturesDict({
'gem_id': string,
'gem_parent_id': string,
'references': Sequence(string),
'source': string,
'source_aligned': Translation({
'en': Text(shape=(), dtype=string),
'th': Text(shape=(), dtype=string),
}),
'target': string,
'target_aligned': Translation({
'en': Text(shape=(), dtype=string),
'th': Text(shape=(), dtype=string),
}),
})
- مستندات ویژگی :
ویژگی | کلاس | شکل | نوع D | شرح |
---|---|---|---|---|
FeaturesDict | ||||
gem_id | تانسور | رشته | ||
gem_parent_id | تانسور | رشته | ||
منابع | دنباله (تنسور) | (هیچ یک،) | رشته | |
منبع | تانسور | رشته | ||
source_aligned | ترجمه | |||
source_aligned/en | متن | رشته | ||
source_aligned/th | متن | رشته | ||
هدف | تانسور | رشته | ||
target_aligned | ترجمه | |||
target_aligned/en | متن | رشته | ||
target_aligned/th | متن | رشته |
- مثالها ( tfds.as_dataframe ):
- نقل قول :
@inproceedings{ladhak-wiki-2020,
title=WikiLingua: A New Benchmark Dataset for Multilingual Abstractive Summarization,
author={Faisal Ladhak, Esin Durmus, Claire Cardie and Kathleen McKeown},
booktitle={Findings of EMNLP, 2020},
year={2020}
}
@article{gehrmann2021gem,
author = {Sebastian Gehrmann and
Tosin P. Adewumi and
Karmanya Aggarwal and
Pawan Sasanka Ammanamanchi and
Aremu Anuoluwapo and
Antoine Bosselut and
Khyathi Raghavi Chandu and
Miruna{-}Adriana Clinciu and
Dipanjan Das and
Kaustubh D. Dhole and
Wanyu Du and
Esin Durmus and
Ondrej Dusek and
Chris Emezue and
Varun Gangal and
Cristina Garbacea and
Tatsunori Hashimoto and
Yufang Hou and
Yacine Jernite and
Harsh Jhamtani and
Yangfeng Ji and
Shailza Jolly and
Dhruv Kumar and
Faisal Ladhak and
Aman Madaan and
Mounica Maddela and
Khyati Mahajan and
Saad Mahamood and
Bodhisattwa Prasad Majumder and
Pedro Henrique Martins and
Angelina McMillan{-}Major and
Simon Mille and
Emiel van Miltenburg and
Moin Nadeem and
Shashi Narayan and
Vitaly Nikolaev and
Rubungo Andre Niyongabo and
Salomey Osei and
Ankur P. Parikh and
Laura Perez{-}Beltrachini and
Niranjan Ramesh Rao and
Vikas Raunak and
Juan Diego Rodriguez and
Sashank Santhanam and
Jo{\~{a} }o Sedoc and
Thibault Sellam and
Samira Shaikh and
Anastasia Shimorina and
Marco Antonio Sobrevilla Cabezudo and
Hendrik Strobelt and
Nishant Subramani and
Wei Xu and
Diyi Yang and
Akhila Yerukola and
Jiawei Zhou},
title = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
Metrics},
journal = {CoRR},
volume = {abs/2102.01672},
year = {2021},
url = {https://arxiv.org/abs/2102.01672},
archivePrefix = {arXiv},
eprint = {2102.01672}
}
Note that each GEM dataset has its own citation. Please see the source to see
the correct citation for each contained dataset."
gem/wiki_lingua_turkish_tr
توضیحات پیکربندی : Wikilingua یک مجموعه داده چندزبانه در مقیاس بزرگ برای ارزیابی سیستمهای خلاصهسازی انتزاعی چند زبانه است.
حجم دانلود :
6.73 MiB
حجم مجموعه داده :
30.75 MiB
ذخیره خودکار ( اسناد ): بله
تقسیم ها :
شکاف | مثال ها |
---|---|
'test' | 900 |
'train' | 3,148 |
'validation' | 449 |
- ساختار ویژگی :
FeaturesDict({
'gem_id': string,
'gem_parent_id': string,
'references': Sequence(string),
'source': string,
'source_aligned': Translation({
'en': Text(shape=(), dtype=string),
'tr': Text(shape=(), dtype=string),
}),
'target': string,
'target_aligned': Translation({
'en': Text(shape=(), dtype=string),
'tr': Text(shape=(), dtype=string),
}),
})
- مستندات ویژگی :
ویژگی | کلاس | شکل | نوع D | شرح |
---|---|---|---|---|
FeaturesDict | ||||
gem_id | تانسور | رشته | ||
gem_parent_id | تانسور | رشته | ||
منابع | دنباله (تنسور) | (هیچ یک،) | رشته | |
منبع | تانسور | رشته | ||
source_aligned | ترجمه | |||
source_aligned/en | متن | رشته | ||
source_aligned/tr | متن | رشته | ||
هدف | تانسور | رشته | ||
target_aligned | ترجمه | |||
target_aligned/en | متن | رشته | ||
target_aligned/tr | متن | رشته |
- مثالها ( tfds.as_dataframe ):
- نقل قول :
@inproceedings{ladhak-wiki-2020,
title=WikiLingua: A New Benchmark Dataset for Multilingual Abstractive Summarization,
author={Faisal Ladhak, Esin Durmus, Claire Cardie and Kathleen McKeown},
booktitle={Findings of EMNLP, 2020},
year={2020}
}
@article{gehrmann2021gem,
author = {Sebastian Gehrmann and
Tosin P. Adewumi and
Karmanya Aggarwal and
Pawan Sasanka Ammanamanchi and
Aremu Anuoluwapo and
Antoine Bosselut and
Khyathi Raghavi Chandu and
Miruna{-}Adriana Clinciu and
Dipanjan Das and
Kaustubh D. Dhole and
Wanyu Du and
Esin Durmus and
Ondrej Dusek and
Chris Emezue and
Varun Gangal and
Cristina Garbacea and
Tatsunori Hashimoto and
Yufang Hou and
Yacine Jernite and
Harsh Jhamtani and
Yangfeng Ji and
Shailza Jolly and
Dhruv Kumar and
Faisal Ladhak and
Aman Madaan and
Mounica Maddela and
Khyati Mahajan and
Saad Mahamood and
Bodhisattwa Prasad Majumder and
Pedro Henrique Martins and
Angelina McMillan{-}Major and
Simon Mille and
Emiel van Miltenburg and
Moin Nadeem and
Shashi Narayan and
Vitaly Nikolaev and
Rubungo Andre Niyongabo and
Salomey Osei and
Ankur P. Parikh and
Laura Perez{-}Beltrachini and
Niranjan Ramesh Rao and
Vikas Raunak and
Juan Diego Rodriguez and
Sashank Santhanam and
Jo{\~{a} }o Sedoc and
Thibault Sellam and
Samira Shaikh and
Anastasia Shimorina and
Marco Antonio Sobrevilla Cabezudo and
Hendrik Strobelt and
Nishant Subramani and
Wei Xu and
Diyi Yang and
Akhila Yerukola and
Jiawei Zhou},
title = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
Metrics},
journal = {CoRR},
volume = {abs/2102.01672},
year = {2021},
url = {https://arxiv.org/abs/2102.01672},
archivePrefix = {arXiv},
eprint = {2102.01672}
}
Note that each GEM dataset has its own citation. Please see the source to see
the correct citation for each contained dataset."
gem/wiki_lingua_vietnamese_vi
توضیحات پیکربندی : Wikilingua یک مجموعه داده چندزبانه در مقیاس بزرگ برای ارزیابی سیستمهای خلاصهسازی انتزاعی چند زبانه است.
حجم دانلود :
36.27 MiB
حجم مجموعه داده :
179.77 MiB
ذخیره خودکار ( اسناد ): بله
تقسیم ها :
شکاف | مثال ها |
---|---|
'test' | 3,917 |
'train' | 13707 |
'validation' | 1,957 |
- ساختار ویژگی :
FeaturesDict({
'gem_id': string,
'gem_parent_id': string,
'references': Sequence(string),
'source': string,
'source_aligned': Translation({
'en': Text(shape=(), dtype=string),
'vi': Text(shape=(), dtype=string),
}),
'target': string,
'target_aligned': Translation({
'en': Text(shape=(), dtype=string),
'vi': Text(shape=(), dtype=string),
}),
})
- مستندات ویژگی :
ویژگی | کلاس | شکل | نوع D | شرح |
---|---|---|---|---|
FeaturesDict | ||||
gem_id | تانسور | رشته | ||
gem_parent_id | تانسور | رشته | ||
منابع | دنباله (تنسور) | (هیچ یک،) | رشته | |
منبع | تانسور | رشته | ||
source_aligned | ترجمه | |||
source_aligned/en | متن | رشته | ||
source_aligned/vi | متن | رشته | ||
هدف | تانسور | رشته | ||
target_aligned | ترجمه | |||
target_aligned/en | متن | رشته | ||
target_aligned/vi | متن | رشته |
- مثالها ( tfds.as_dataframe ):
- نقل قول :
@inproceedings{ladhak-wiki-2020,
title=WikiLingua: A New Benchmark Dataset for Multilingual Abstractive Summarization,
author={Faisal Ladhak, Esin Durmus, Claire Cardie and Kathleen McKeown},
booktitle={Findings of EMNLP, 2020},
year={2020}
}
@article{gehrmann2021gem,
author = {Sebastian Gehrmann and
Tosin P. Adewumi and
Karmanya Aggarwal and
Pawan Sasanka Ammanamanchi and
Aremu Anuoluwapo and
Antoine Bosselut and
Khyathi Raghavi Chandu and
Miruna{-}Adriana Clinciu and
Dipanjan Das and
Kaustubh D. Dhole and
Wanyu Du and
Esin Durmus and
Ondrej Dusek and
Chris Emezue and
Varun Gangal and
Cristina Garbacea and
Tatsunori Hashimoto and
Yufang Hou and
Yacine Jernite and
Harsh Jhamtani and
Yangfeng Ji and
Shailza Jolly and
Dhruv Kumar and
Faisal Ladhak and
Aman Madaan and
Mounica Maddela and
Khyati Mahajan and
Saad Mahamood and
Bodhisattwa Prasad Majumder and
Pedro Henrique Martins and
Angelina McMillan{-}Major and
Simon Mille and
Emiel van Miltenburg and
Moin Nadeem and
Shashi Narayan and
Vitaly Nikolaev and
Rubungo Andre Niyongabo and
Salomey Osei and
Ankur P. Parikh and
Laura Perez{-}Beltrachini and
Niranjan Ramesh Rao and
Vikas Raunak and
Juan Diego Rodriguez and
Sashank Santhanam and
Jo{\~{a} }o Sedoc and
Thibault Sellam and
Samira Shaikh and
Anastasia Shimorina and
Marco Antonio Sobrevilla Cabezudo and
Hendrik Strobelt and
Nishant Subramani and
Wei Xu and
Diyi Yang and
Akhila Yerukola and
Jiawei Zhou},
title = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
Metrics},
journal = {CoRR},
volume = {abs/2102.01672},
year = {2021},
url = {https://arxiv.org/abs/2102.01672},
archivePrefix = {arXiv},
eprint = {2102.01672}
}
Note that each GEM dataset has its own citation. Please see the source to see
the correct citation for each contained dataset."