Datasets and Research for Online Abuse

Over the past few years, a number of datasets have been released in a multitude of languages. We recommend that you consult the Hate Speech Data Repository to identify the best dataset for your research purposes.

The Alan Turing Institute has also created an Online Hate Research Hub, which contains a large number of (mostly UK-focussed) resources for research, policymaking and civil society efforts to understand, analyse and counter online hate.

Stack Overflow Dataset

This year, we are repeating our data collaboration with Stack Overflow, who have released a dataset along with anonymised annotator information to help investigate any potential annotation biases in their data.

To acquire the dataset, complete the following steps:

  • Go to the Stack Overflow Academic Partnership Programme page, read the information on the website and open the application form.

  • You can complete the application form more quickly by using our boilerplate text (below) for the question "What research is being proposed? What are the specific requirements of the project? What datasets are you interested in, if any?"

“We are requesting comment data along with anonymised annotator information, without user information, for comments from Stack Exchange which have been subject to content moderation. The data will be used to research computational methods for detection of abusive or harmful content, with the aim of submitting our work to the 4th Workshop on Online Abuse and Harms (WOAH). To investigate automated methods for abuse detection, we require a dataset which contains the labels available in the Stack Overflow flagging structure, though part of it may be unlabelled so that we can experiment with empirical evaluation of the trained methods.”

[1] Given the current situation with COVID-19, we understand that people may not be able to print out and scan a signed copy. Instead, digital signatures made with tools such as DocuSign will suffice.