civil_comments

  • Description:

This version of the CivilComments dataset provides access to the primary seven labels that were annotated by crowd workers; the toxicity and other tags are values between 0 and 1 indicating the fraction of annotators who assigned those attributes to the comment text.

The other tags are available only for a fraction of the input examples. They are currently ignored for the main dataset; the CivilCommentsIdentities set includes those labels, but consists only of the subset of the data that has them. The other attributes that were part of the original CivilComments release are included only in the raw data. See the Kaggle documentation for more details about the available features.

The comments in this dataset come from an archive of the Civil Comments platform, a commenting plugin for independent news sites. These public comments were created from 2015 to 2017 and appeared on approximately 50 English-language news sites across the world. When Civil Comments shut down in 2017, they chose to make the public comments available in a lasting open archive to enable future research. The original data, published on figshare, includes the public comment text and some associated metadata such as article IDs, publication IDs, timestamps and commenter-generated "civility" labels, but does not include user IDs. Jigsaw extended this dataset by adding additional labels for toxicity and identity mentions, as well as covert offensiveness. This dataset is an exact replica of the data released for the Jigsaw Unintended Bias in Toxicity Classification Kaggle challenge. This dataset is released under CC0, as is the underlying comment text.

For comments that have a parent_id also present in the Civil Comments data, the text of the previous comment is provided as the "parent_text" feature. Note that the splits were made without regard to this information, so using previous comments may leak some information. The annotators did not have access to the parent text when making the labels.
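
As a quick reference, below is a minimal sketch (not part of the dataset card) of loading the default config with TensorFlow Datasets and binarizing the fractional toxicity score; the 0.5 threshold is an illustrative assumption, not something the dataset prescribes.

import tensorflow as tf
import tensorflow_datasets as tfds

# Load the default CivilComments config; labels are annotator fractions in [0, 1].
ds = tfds.load('civil_comments', split='train')

def binarize(example):
    # Treat a comment as toxic when at least half of the annotators did.
    # The 0.5 cut-off is an assumption for illustration only.
    return {
        'text': example['text'],
        'parent_text': example['parent_text'],
        'label': tf.cast(example['toxicity'] >= 0.5, tf.int32),
    }

for ex in ds.map(binarize).take(1):
    print(int(ex['label']), ex['text'].numpy()[:80])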

  • Homepage: https://www.kaggle.com/c/jigsaw-unintended-bias-in-toxicity-classification/data

  • Source code: tfds.text.CivilComments

  • Versions:

    • 1.0.0: Initial full release.
    • 1.0.1: Added a unique id for each comment.
    • 1.1.0: Added CivilCommentsCovert config.
    • 1.1.1: Added CivilCommentsCovert config with correct checksum.
    • 1.1.2: Added separate citation for CivilCommentsCovert dataset.
    • 1.1.3: Corrected id types from float to string.
    • 1.2.0: Add toxic spans, context, and parent comment text features.
    • 1.2.1: Fix incorrect formatting in context splits.
    • 1.2.2: Update to reflect context only having a train split.
    • 1.2.3: Add warning to CivilCommentsCovert as we fix a data issue.
    • 1.2.4 (default): Add publication IDs and comment timestamps.
  • Download size: 427.41 MiB

  • Figure (tfds.show_examples): Not supported.

civil_comments/CivilComments (default config)

  • Config description: The CivilComments set here includes all the data, but only the basic seven labels (toxicity, severe_toxicity, obscene, threat, insult, identity_attack, and sexual_explicit).

  • Dataset size: 1.54 GiB

  • Auto-cached (documentation): No

  • Splits:

Split Examples
'test' 97,320
'train' 1,804,874
'validation' 97,320
  • Feature structure:
FeaturesDict({
    'article_id': int32,
    'created_date': string,
    'id': string,
    'identity_attack': float32,
    'insult': float32,
    'obscene': float32,
    'parent_id': int32,
    'parent_text': Text(shape=(), dtype=string),
    'publication_id': string,
    'severe_toxicity': float32,
    'sexual_explicit': float32,
    'text': Text(shape=(), dtype=string),
    'threat': float32,
    'toxicity': float32,
})
  • Feature documentation:
Feature Class Shape Dtype Description
FeaturesDict
article_id Tensor int32
created_date Tensor string
id Tensor string
identity_attack Tensor float32
insult Tensor float32
obscene Tensor float32
parent_id Tensor int32
parent_text Text string
publication_id Tensor string
severe_toxicity Tensor float32
sexual_explicit Tensor float32
text Text string
threat Tensor float32
toxicity Tensor float32
  • Citation:
@article{DBLP:journals/corr/abs-1903-04561,
  author    = {Daniel Borkan and
               Lucas Dixon and
               Jeffrey Sorensen and
               Nithum Thain and
               Lucy Vasserman},
  title     = {Nuanced Metrics for Measuring Unintended Bias with Real Data for Text
               Classification},
  journal   = {CoRR},
  volume    = {abs/1903.04561},
  year      = {2019},
  url       = {http://arxiv.org/abs/1903.04561},
  archivePrefix = {arXiv},
  eprint    = {1903.04561},
  timestamp = {Sun, 31 Mar 2019 19:01:24 +0200},
  biburl    = {https://dblp.org/rec/bib/journals/corr/abs-1903-04561},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}

civil_comments/CivilCommentsIdentities

  • Config description: The CivilCommentsIdentities set here includes an extended set of identity labels in addition to the basic seven labels. However, it only includes the subset (roughly a quarter) of the data with all these features.
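
A minimal sketch, assuming TensorFlow Datasets is installed, of loading this config and selecting comments where a majority of annotators tagged a particular identity; the choice of 'female' and the 0.5 cut-off are illustrative only.

import tensorflow_datasets as tfds

# Identity labels, like the toxicity labels, are annotator fractions in [0, 1].
ds = tfds.load('civil_comments/CivilCommentsIdentities', split='test')

# Keep comments that a majority of annotators associated with the identity.
subgroup = ds.filter(lambda ex: ex['female'] >= 0.5)
for ex in tfds.as_numpy(subgroup.take(3)):
    print(float(ex['toxicity']), ex['text'][:60])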

  • Dataset size: 654.97 MiB

  • Auto-cached (documentation): No

  • Splits:

Split Examples
'test' 21,577
'train' 405,130
'validation' 21,293
  • Feature structure:
FeaturesDict({
    'article_id': int32,
    'asian': float32,
    'atheist': float32,
    'bisexual': float32,
    'black': float32,
    'buddhist': float32,
    'christian': float32,
    'created_date': string,
    'female': float32,
    'heterosexual': float32,
    'hindu': float32,
    'homosexual_gay_or_lesbian': float32,
    'id': string,
    'identity_attack': float32,
    'insult': float32,
    'intellectual_or_learning_disability': float32,
    'jewish': float32,
    'latino': float32,
    'male': float32,
    'muslim': float32,
    'obscene': float32,
    'other_disability': float32,
    'other_gender': float32,
    'other_race_or_ethnicity': float32,
    'other_religion': float32,
    'other_sexual_orientation': float32,
    'parent_id': int32,
    'parent_text': Text(shape=(), dtype=string),
    'physical_disability': float32,
    'psychiatric_or_mental_illness': float32,
    'publication_id': string,
    'severe_toxicity': float32,
    'sexual_explicit': float32,
    'text': Text(shape=(), dtype=string),
    'threat': float32,
    'toxicity': float32,
    'transgender': float32,
    'white': float32,
})
  • Feature documentation:
Feature Class Shape Dtype Description
FeaturesDict
article_id Tensor int32
asian Tensor float32
atheist Tensor float32
bisexual Tensor float32
black Tensor float32
buddhist Tensor float32
christian Tensor float32
created_date Tensor string
female Tensor float32
heterosexual Tensor float32
hindu Tensor float32
homosexual_gay_or_lesbian Tensor float32
id Tensor string
identity_attack Tensor float32
insult Tensor float32
intellectual_or_learning_disability Tensor float32
jewish Tensor float32
latino Tensor float32
male Tensor float32
muslim Tensor float32
obscene Tensor float32
other_disability Tensor float32
other_gender Tensor float32
other_race_or_ethnicity Tensor float32
other_religion Tensor float32
other_sexual_orientation Tensor float32
parent_id Tensor int32
parent_text Text string
physical_disability Tensor float32
psychiatric_or_mental_illness Tensor float32
publication_id Tensor string
severe_toxicity Tensor float32
sexual_explicit Tensor float32
text Text string
threat Tensor float32
toxicity Tensor float32
transgender Tensor float32
white Tensor float32
  • Citation:
@article{DBLP:journals/corr/abs-1903-04561,
  author    = {Daniel Borkan and
               Lucas Dixon and
               Jeffrey Sorensen and
               Nithum Thain and
               Lucy Vasserman},
  title     = {Nuanced Metrics for Measuring Unintended Bias with Real Data for Text
               Classification},
  journal   = {CoRR},
  volume    = {abs/1903.04561},
  year      = {2019},
  url       = {http://arxiv.org/abs/1903.04561},
  archivePrefix = {arXiv},
  eprint    = {1903.04561},
  timestamp = {Sun, 31 Mar 2019 19:01:24 +0200},
  biburl    = {https://dblp.org/rec/bib/journals/corr/abs-1903-04561},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}

civil_comments/CivilCommentsCovert

  • Config description: WARNING: there's a potential data quality issue with CivilCommentsCovert that we're actively working on fixing (06/28/22); the underlying data may change!

The CivilCommentsCovert set is a subset of CivilCommentsIdentities with ~20% of the train and test splits further annotated for covert offensiveness, in addition to the toxicity and identity labels. Raters were asked to categorize comments as explicitly offensive, implicitly offensive, not offensive, or not sure whether offensive, as well as whether they contained different types of covert offensiveness. The full annotation procedure is detailed in a forthcoming paper at https://sites.google.com/corp/view/hciandnlp/accepted-papers
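
A sketch of reading the covert annotations under the same fractional-label convention; taking the argmax over the four top-level offensiveness categories is an illustrative choice, not part of the dataset's documentation.

import tensorflow_datasets as tfds

ds = tfds.load('civil_comments/CivilCommentsCovert', split='train')

# Fractions of raters choosing each top-level offensiveness category.
categories = ['explicitly_offensive', 'implicitly_offensive',
              'not_offensive', 'not_sure_offensive']

for ex in tfds.as_numpy(ds.take(3)):
    fractions = [float(ex[c]) for c in categories]
    majority = categories[fractions.index(max(fractions))]
    print(majority, ex['text'][:60])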

  • Dataset size: 97.83 MiB

  • Auto-cached (documentation): Yes

  • Splits:

Split Examples
'test' 2,455
'train' 48,074
  • Feature structure:
FeaturesDict({
    'article_id': int32,
    'asian': float32,
    'atheist': float32,
    'bisexual': float32,
    'black': float32,
    'buddhist': float32,
    'christian': float32,
    'covert_emoticons_emojis': float32,
    'covert_humor': float32,
    'covert_masked_harm': float32,
    'covert_microaggression': float32,
    'covert_obfuscation': float32,
    'covert_political': float32,
    'covert_sarcasm': float32,
    'created_date': string,
    'explicitly_offensive': float32,
    'female': float32,
    'heterosexual': float32,
    'hindu': float32,
    'homosexual_gay_or_lesbian': float32,
    'id': string,
    'identity_attack': float32,
    'implicitly_offensive': float32,
    'insult': float32,
    'intellectual_or_learning_disability': float32,
    'jewish': float32,
    'latino': float32,
    'male': float32,
    'muslim': float32,
    'not_offensive': float32,
    'not_sure_offensive': float32,
    'obscene': float32,
    'other_disability': float32,
    'other_gender': float32,
    'other_race_or_ethnicity': float32,
    'other_religion': float32,
    'other_sexual_orientation': float32,
    'parent_id': int32,
    'parent_text': Text(shape=(), dtype=string),
    'physical_disability': float32,
    'psychiatric_or_mental_illness': float32,
    'publication_id': string,
    'severe_toxicity': float32,
    'sexual_explicit': float32,
    'text': Text(shape=(), dtype=string),
    'threat': float32,
    'toxicity': float32,
    'transgender': float32,
    'white': float32,
})
  • Feature documentation:
Feature Class Shape Dtype Description
FeaturesDict
article_id Tensor int32
asian Tensor float32
atheist Tensor float32
bisexual Tensor float32
black Tensor float32
buddhist Tensor float32
christian Tensor float32
covert_emoticons_emojis Tensor float32
covert_humor Tensor float32
covert_masked_harm Tensor float32
covert_microaggression Tensor float32
covert_obfuscation Tensor float32
covert_political Tensor float32
covert_sarcasm Tensor float32
created_date Tensor string
explicitly_offensive Tensor float32
female Tensor float32
heterosexual Tensor float32
hindu Tensor float32
homosexual_gay_or_lesbian Tensor float32
id Tensor string
identity_attack Tensor float32
implicitly_offensive Tensor float32
insult Tensor float32
intellectual_or_learning_disability Tensor float32
jewish Tensor float32
latino Tensor float32
male Tensor float32
muslim Tensor float32
not_offensive Tensor float32
not_sure_offensive Tensor float32
obscene Tensor float32
other_disability Tensor float32
other_gender Tensor float32
other_race_or_ethnicity Tensor float32
other_religion Tensor float32
other_sexual_orientation Tensor float32
parent_id Tensor int32
parent_text Text string
physical_disability Tensor float32
psychiatric_or_mental_illness Tensor float32
publication_id Tensor string
severe_toxicity Tensor float32
sexual_explicit Tensor float32
text Text string
threat Tensor float32
toxicity Tensor float32
transgender Tensor float32
white Tensor float32
  • Citation:
@inproceedings{lees-etal-2021-capturing,
    title = "Capturing Covertly Toxic Speech via Crowdsourcing",
    author = "Lees, Alyssa  and
      Borkan, Daniel  and
      Kivlichan, Ian  and
      Nario, Jorge  and
      Goyal, Tesh",
    booktitle = "Proceedings of the First Workshop on Bridging Human{--}Computer Interaction and Natural Language Processing",
    month = apr,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2021.hcinlp-1.3",
    pages = "14--20"
}

civil_comments/CivilCommentsToxicSpans

  • Config description: The CivilComments Toxic Spans set is a subset of CivilComments that is labeled at the span level: the indices of all character (Unicode code point) boundaries that were tagged as toxic by a majority of the annotators are returned in a 'spans' feature.
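
A sketch of consuming the 'spans' feature, whose values are character offsets into 'text'; uppercasing the flagged characters is purely illustrative.

import tensorflow_datasets as tfds

ds = tfds.load('civil_comments/CivilCommentsToxicSpans', split='train')

for ex in tfds.as_numpy(ds.take(1)):
    text = ex['text'].decode('utf-8')
    toxic = set(ex['spans'].tolist())  # character offsets flagged as toxic
    # Uppercase the characters that a majority of annotators marked as toxic.
    print(''.join(c.upper() if i in toxic else c for i, c in enumerate(text)))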

  • Dataset size: 5.81 MiB

  • Auto-cached (documentation): Yes

  • Splits:

Split Examples
'test' 2,000
'train' 7,939
'validation' 682
  • Feature structure:
FeaturesDict({
    'article_id': int32,
    'created_date': string,
    'id': string,
    'parent_id': int32,
    'parent_text': Text(shape=(), dtype=string),
    'publication_id': string,
    'spans': Tensor(shape=(None,), dtype=int32),
    'text': Text(shape=(), dtype=string),
})
  • Feature documentation:
Feature Class Shape Dtype Description
FeaturesDict
article_id Tensor int32
created_date Tensor string
id Tensor string
parent_id Tensor int32
parent_text Text string
publication_id Tensor string
spans Tensor (None,) int32
text Text string
  • Citation:
@inproceedings{pavlopoulos-etal-2021-semeval,
    title = "{S}em{E}val-2021 Task 5: Toxic Spans Detection",
    author = "Pavlopoulos, John  and Sorensen, Jeffrey  and Laugier, L{'e}o and Androutsopoulos, Ion",
    booktitle = "Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021)",
    month = aug,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.semeval-1.6",
    doi = "10.18653/v1/2021.semeval-1.6",
    pages = "59--69",
}

civil_comments/CivilCommentsInContext

  • Config description: The CivilCommentsInContext set is a subset of CivilComments that was labeled with the parent_text made available to the labelers. It includes a contextual_toxicity feature.
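
A sketch of comparing the two scores and assembling a context-aware model input; the ' [SEP] ' separator is an arbitrary illustrative choice.

import tensorflow_datasets as tfds

ds = tfds.load('civil_comments/CivilCommentsInContext', split='train')

for ex in tfds.as_numpy(ds.take(3)):
    # Per the config description, 'contextual_toxicity' was labeled with the
    # parent comment available to raters; comparing it with 'toxicity' is
    # illustrative.
    delta = float(ex['contextual_toxicity']) - float(ex['toxicity'])
    model_input = ex['parent_text'] + b' [SEP] ' + ex['text']
    print(round(delta, 3), model_input[:80])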

  • Dataset size: 9.63 MiB

  • Auto-cached (documentation): Yes

  • Splits:

Split Examples
'train' 9,969
  • Feature structure:
FeaturesDict({
    'article_id': int32,
    'contextual_toxicity': float32,
    'created_date': string,
    'id': string,
    'identity_attack': float32,
    'insult': float32,
    'obscene': float32,
    'parent_id': int32,
    'parent_text': Text(shape=(), dtype=string),
    'publication_id': string,
    'severe_toxicity': float32,
    'sexual_explicit': float32,
    'text': Text(shape=(), dtype=string),
    'threat': float32,
    'toxicity': float32,
})
  • Feature documentation:
Feature Class Shape Dtype Description
FeaturesDict
article_id Tensor int32
contextual_toxicity Tensor float32
created_date Tensor string
id Tensor string
identity_attack Tensor float32
insult Tensor float32
obscene Tensor float32
parent_id Tensor int32
parent_text Text string
publication_id Tensor string
severe_toxicity Tensor float32
sexual_explicit Tensor float32
text Text string
threat Tensor float32
toxicity Tensor float32
  • Citation:
@misc{pavlopoulos2020toxicity,
    title={Toxicity Detection: Does Context Really Matter?},
    author={John Pavlopoulos and Jeffrey Sorensen and Lucas Dixon and Nithum Thain and Ion Androutsopoulos},
    year={2020}, eprint={2006.00998}, archivePrefix={arXiv}, primaryClass={cs.CL}
}