CDL Misinfo Datasets | Documentation for the CDL Misinfo Datasets.

Misinformation is a challenging societal issue, and mitigating solutions are difficult to create due to data deficiencies. To address this problem, we have curated a growing collection of (mis)information datasets from the literature. From these, we evaluated the quality of all 36 datasets that consist of statements or claims. If you would like to contribute a novel dataset or report any issues, please email us or visit our Hugging Face or GitHub page.

A survey of multiple (mis)information datasets

A curated collection of misinformation datasets, with a unified setup for working with the claim and statement datasets, available here.

Dataset Quality Assessment

We evaluated the quality of the datasets in the survey, identifying potential flaws such as insufficient label quality and spurious correlations. This helps researchers select datasets that are suitable for their work.

Evaluation of Detection Models

Our paper provides state-of-the-art baselines for misinformation detection models on these datasets, demonstrating the limitations of categorical labels and suggesting alternative evaluation methods.