In the ongoing battle against disinformation, TITAN is developing an innovative solution that aims to help users navigate the flood of information online. The TITAN Socratic chatbot is specifically designed to engage users in meaningful conversations that enhance critical thinking and enable them to better spot false information. However, to develop this kind of advanced tool, robust and carefully curated datasets are crucial. This post explores our journey of building the high-quality datasets that fuel TITAN's AI algorithms, enabling them to detect disinformation and foster insightful dialogues with users.
The Role of Reliable Datasets in the TITAN Platform
For the TITAN platform, acquiring and curating datasets goes beyond just gathering data—it’s about ensuring the data is relevant, high-quality, and structured in a way that respects user privacy. TITAN’s dataset creation process utilises a privacy-preserving co-creation approach to minimise human intervention, creating data that reliably trains and adapts the chatbot's AI. Through careful data curation, TITAN aims to encourage further research within the AI community by publishing high-quality datasets focused on Socratic dialogues, micro-lessons, and disinformation detection.
These datasets support a variety of objectives. First, they provide content for the Socratic chatbot to engage in structured educational dialogues, allowing it to foster critical thinking. They also help detect various forms of disinformation, including hate speech, clickbait, and logical fallacies, and they enable the chatbot to assess the spread of disinformation.
Ultimately, TITAN aims to not only create an engaging AI chatbot but also to make a broader impact by advancing AI capabilities for critical disinformation detection.
Identifying Disinformation Signals: Hate Speech, Offensive Language, and Clickbait
One of TITAN’s primary focuses is identifying harmful disinformation signals, including hate speech, offensive language, and clickbait. In developing datasets for these purposes, TITAN’s team began by gathering open-source datasets, focusing on hate speech and offensive language in English texts. These initial datasets were cleaned and processed, removing duplicates to enhance data quality. Once ready, the datasets were used to fine-tune TITAN's language models, equipping them to flag hate speech, offensive language, and clickbait in real-time.
The team expanded its approach to tackle fact-checking and logical fallacies, gathering additional datasets to refine the chatbot’s capabilities in identifying misleading claims and flawed logic. This systematic approach aims to arm TITAN with nuanced, multi-faceted knowledge that allows it to address a broad range of disinformation challenges.
Tackling Hate Speech and Offensive Language
The rise of hate speech and offensive language online has necessitated specific models to detect these harmful signals. Given the contextual nature of these communications, separate classification models were created for hate speech and offensive language. By collecting data from various sources, such as Twitter, Reddit, and YouTube comments, TITAN’s team compiled a comprehensive dataset for both hate speech and offensive language detection.
This rigorous curation process involved analysing multiple datasets, merging relevant data, and eliminating duplicates. The resulting data collection, which included inputs from widely respected sources, was pre-processed and tokenised, removing any irrelevant content to create a reliable base for TITAN's models.
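The curation steps above can be sketched in a few lines. This is a hypothetical illustration of merging labelled datasets, normalising text, and removing duplicates; the column names ("text", "label") and the specific normalisation steps are assumptions, not TITAN's published pipeline.

```python
import re

import pandas as pd


def normalise(text: str) -> str:
    """Lightly clean a post: lowercase, strip URLs, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"https?://\S+", "", text)
    return re.sub(r"\s+", " ", text).strip()


def merge_and_dedupe(frames: list) -> pd.DataFrame:
    """Merge several labelled collections and drop duplicate texts."""
    merged = pd.concat(frames, ignore_index=True)
    merged["text"] = merged["text"].map(normalise)
    return merged.drop_duplicates(subset="text").reset_index(drop=True)
```

Deduplicating on the normalised text (rather than the raw string) catches near-identical posts that differ only in casing, links, or spacing, which is common when the same content circulates across Twitter, Reddit, and YouTube comments.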
Developing a Clickbait Dataset
Clickbait content is another form of misleading online material, often characterised by exaggerated, sensationalist headlines that aim to attract clicks. For TITAN to recognise clickbait effectively, the team gathered data from various public sources, including news articles, YouTube video titles, and online posts, resulting in a dataset of over 37,000 examples. These entries were then split into training, test, and validation sets to train models capable of distinguishing genuine content from clickbait. TITAN’s clickbait detection models can now help users identify sensationalised content and make more informed decisions about the material they consume.
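As a rough sketch of the kind of split described, the following divides a labelled collection into training, test, and validation portions. The 80/10/10 ratio, the stratification by label, and the function names are illustrative assumptions rather than TITAN's actual configuration.

```python
from sklearn.model_selection import train_test_split


def split_dataset(texts, labels, seed=42):
    """Split into train/test/validation, keeping label proportions."""
    # Hold out 20% first, then divide that half-and-half.
    x_train, x_rest, y_train, y_rest = train_test_split(
        texts, labels, test_size=0.2, stratify=labels, random_state=seed)
    x_test, x_val, y_test, y_val = train_test_split(
        x_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=seed)
    return (x_train, y_train), (x_test, y_test), (x_val, y_val)
```

Stratifying by label matters here because clickbait examples are usually a minority class; an unstratified split could leave the validation set with too few positive examples to measure performance reliably.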
Creating Context-Rich Logical Fallacy Datasets
Detecting logical fallacies (errors in reasoning that undermine an argument) is essential for identifying misleading information. For this, TITAN uses a unique dataset derived from an open-source collection of logical fallacies. The data was expanded with contextual content, embedding fallacious statements into news-like articles to simulate real-life scenarios. By presenting fallacies within realistic contexts, TITAN's models can now analyse arguments as they would appear in genuine media content, enhancing their ability to detect poor reasoning in online conversations.
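A minimal, hypothetical illustration of this "contextualising" step: wrapping a fallacious claim in neutral, news-like text so models encounter it in situ. The templates and the span-annotation format are invented for illustration only.

```python
import random

# News-like wrappers with a slot for the fallacious claim.
TEMPLATES = [
    ('In a statement released yesterday, a spokesperson argued: "{claim}" '
     "Observers noted the remark drew mixed reactions."),
    ("Local media reported growing debate after one commentator wrote: "
     '"{claim}" The discussion continues online.'),
]


def contextualise(claim: str, fallacy_type: str, seed=None) -> dict:
    """Embed a fallacious claim in a synthetic article, recording its span."""
    rng = random.Random(seed)
    article = rng.choice(TEMPLATES).format(claim=claim)
    start = article.index(claim)
    return {"text": article, "label": fallacy_type,
            "span": (start, start + len(claim))}
```

Recording the character span of the fallacy lets a model learn not just that an article contains flawed reasoning, but where the flawed statement sits relative to its surrounding context.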
The Socratic Dialogue Dataset: Engaging Users in Meaningful Conversations
At the heart of TITAN’s mission is its Socratic chatbot, designed to encourage critical thinking through guided dialogue. Experts created a set of dialogues that engage users by posing questions that challenge their assumptions, ask for evidence, and encourage exploring alternative viewpoints. This dialogue structure is based on the Socratic method, a classic technique for fostering critical analysis and deeper reflection.
The team designed 25 initial dialogues, carefully crafted to align with critical thinking skills. Annotations in these dialogues signify disinformation signals, helping the chatbot respond appropriately based on the type of disinformation detected. By guiding users through structured conversations, TITAN's Socratic chatbot serves as both a conversational partner and a critical-thinking coach.
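One hypothetical way to represent such an annotated dialogue turn is to tag each question with the Socratic move it makes and the disinformation signal it responds to. The field names and tag values here are illustrative assumptions, not TITAN's actual annotation scheme.

```python
# Each turn pairs a Socratic question with the move it performs and the
# disinformation signal that should trigger it (all names hypothetical).
SAMPLE_DIALOGUE = [
    {"speaker": "bot",
     "text": "What evidence does the article offer for that claim?",
     "move": "ask_for_evidence",
     "signal": "unsupported_claim"},
    {"speaker": "bot",
     "text": "Could there be another explanation for what it describes?",
     "move": "explore_alternatives",
     "signal": "false_dilemma"},
]


def moves_for_signal(dialogue, signal):
    """Return the Socratic moves annotated for a given disinformation signal."""
    return [turn["move"] for turn in dialogue if turn.get("signal") == signal]
```

With this kind of indexing, the chatbot can look up which questioning strategy fits the signal its detectors have flagged in a user's content.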
Assessing the Impact of Disinformation: The Propagation Impact Dataset
Understanding how disinformation spreads is crucial for mitigating its impact. For this aspect, the TITAN team built a dataset using data from Mastodon, a decentralised social platform, as a safer, privacy-conscious alternative to other social media. This dataset includes extensive details about user interactions and post-engagement, which help TITAN’s models understand the mechanisms of disinformation spread.
Beyond Mastodon, the team has also been exploring other data sources to expand this propagation impact dataset, contributing insights into how disinformation influences social media conversations over time.
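As a hedged sketch of gathering engagement data from Mastodon: the public-timeline endpoint and the status fields used below (`reblogs_count`, `favourites_count`, `replies_count`) are part of Mastodon's documented REST API, but the instance URL, function names, and the particular fields kept are illustrative choices, not TITAN's actual collection code.

```python
import requests


def extract_engagement(posts):
    """Reduce raw status objects to the engagement signals relevant to
    propagation analysis (field names follow Mastodon's Status entity)."""
    return [{"id": p["id"],
             "boosts": p["reblogs_count"],
             "favourites": p["favourites_count"],
             "replies": p["replies_count"]}
            for p in posts]


def fetch_public_posts(instance="https://mastodon.social", limit=20):
    """Fetch recent public posts from a Mastodon instance."""
    resp = requests.get(f"{instance}/api/v1/timelines/public",
                        params={"limit": limit}, timeout=10)
    resp.raise_for_status()
    return extract_engagement(resp.json())
```

Because every field here comes from a public, documented API, this style of collection avoids the privacy concerns that come with scraping closed platforms, which is one reason a decentralised network like Mastodon suits this dataset.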
Building Argumentation Mining Datasets
Finally, TITAN aims to help users deconstruct arguments and understand their structure. Using a carefully curated dataset from Greek news reports, TITAN’s team created an argumentation mining model that identifies argumentative structures such as claims, premises, and supporting or contradicting relationships between them. This model offers users a more detailed understanding of argumentative discourse, a valuable skill in assessing the quality of information they encounter.
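A toy sketch of the argument structure described above: claims, premises, and support or attack relations between them. The schema and class names are assumptions made for illustration, not TITAN's annotation format for the Greek news corpus.

```python
from dataclasses import dataclass, field


@dataclass
class ArgumentUnit:
    uid: str
    kind: str   # "claim" or "premise"
    text: str


@dataclass
class ArgumentGraph:
    units: dict = field(default_factory=dict)
    relations: list = field(default_factory=list)  # (src, dst, "support"|"attack")

    def add_unit(self, unit: ArgumentUnit) -> None:
        self.units[unit.uid] = unit

    def relate(self, src: str, dst: str, stance: str) -> None:
        self.relations.append((src, dst, stance))

    def premises_for(self, claim_uid: str) -> list:
        """All premises that support the given claim."""
        return [self.units[s] for s, d, r in self.relations
                if d == claim_uid and r == "support"]
```

Representing an argument as a small graph makes the output of a mining model easy to inspect: a user can see which premises actually back a claim and which statements merely attack or stand beside it.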
This comprehensive approach to dataset creation and curation underscores our commitment to building a powerful, multi-layered chatbot that addresses the many facets of disinformation. Through privacy-respecting data practices and expert-crafted dialogues, TITAN's Socratic chatbot is not only an educational tool but also a significant step forward in AI's battle against misleading and harmful information online. Each dataset contributes to TITAN's broader goal: to equip users with the tools needed to navigate information critically, make informed decisions, and ultimately, foster a more informed society.