We refer to propaganda whenever information is purposefully shaped to foster a predetermined agenda. Propaganda uses psychological and rhetorical techniques to reach its purpose. Such techniques include the use of logical fallacies and appealing to the emotions of the audience. Logical fallacies are usually hard to spot since the argumentation, at first sight, might seem correct and objective. However, a careful analysis shows that the conclusion cannot be drawn from the premise without the misuse of logical rules. Another set of techniques makes use of emotional language to induce the audience to agree with the speaker only on the basis of the emotional bond that is being created, provoking the suspension of any rational analysis of the argumentation. All of these techniques are intended to go unnoticed to achieve maximum effect.
The main task is to build models that can spot the text fragments of a news article in which propaganda techniques are used.
We have compiled a corpus of about 550 news articles in which fragments containing one out of 18 propaganda techniques have been annotated.
We have defined the following tasks on the corpus:
We provide a training set and a development set so that you can build your systems locally. We further provide a test set (without annotations) and an online submission website to score your systems. A public leaderboard will show the progress of the participating researchers on the task.
The input for all tasks will be news articles in plain text format. Task TC additionally requires a set of spans as input. Participants will be provided with three folders: train-articles and dev-articles, for which annotations are provided, and test-articles, for which they are not. Each article is stored in its own .txt file. The title is on the first row, followed by an empty row. The content of the article starts from the third row, one sentence per line. Each article has been retrieved with the newspaper3k library, and sentence splitting has been performed automatically with the NLTK sentence splitter.
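The file layout described above can be read with a few lines of Python. This is a minimal sketch, not part of the official tooling: the function name `parse_article` is hypothetical, and the sketch assumes that blank rows inside the body can be dropped.

```python
def parse_article(raw_text):
    """Split the content of one article file into its title and body sentences.

    Assumes the layout described above: title on row 1, an empty row 2,
    and one sentence per row from row 3 onwards.
    """
    rows = raw_text.split("\n")
    title = rows[0]
    # Skip the empty second row; drop any blank rows in the body (assumption).
    sentences = [row for row in rows[2:] if row.strip()]
    return title, sentences


# Small synthetic article, used only to illustrate the layout.
sample = "A title\n\nFirst sentence.\nSecond sentence.\n"
title, sentences = parse_article(sample)
print(title)
print(sentences)
```

Note that character offsets in the annotation files refer to the full article string, so the unsplit text should be kept around when working with spans.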
Here is an example article (we assume the article id is 123456):
^0Manchin says Democrats acted like ^34babies^40 at the SOTU (video) Personal Liberty Poll Exercise your right to vote.
Democrat West Virginia Sen. Joe Manchin says his colleagues’ refusal to stand or applaud during President Donald Trump’s State of the Union speech was disrespectful and a signal that ^296the party is more concerned with obstruction than it is with progress^365.
In a glaring sign of just how ^397stupid and petty^413 things have become in Washington these days, Manchin was invited on Fox News Tuesday morning to discuss how he was one of the only Democrats in the chamber for the State of the Union speech ^604not looking as though Trump ^632killed his grandma^650.
When others in his party declined to applaud even for the most uncontroversial of the president’s remarks, Manchin did.
He even stood for the president when Trump entered the room, a customary show of respect for the office in which his colleagues declined to participate.
Note that the superscripts are not present in the original article file; we have added them here so that text spans can be referenced. The first character of the article has index 0. The indices are the ones reported by the annotation platform Anafora, which, according to all our tests, correspond to the ones obtained by loading the full article into a Python string and then using string indexing.
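As a quick illustration of this indexing convention, the sketch below loads the title row of the example article into a Python string and slices it with the offsets of the first annotated fragment. Only the title row is used here; when reading a real file, decoding it as UTF-8 is an assumption on our side.

```python
# Title row of the example article, as it would appear at the start of the file.
article = (
    "Manchin says Democrats acted like babies at the SOTU (video) "
    "Personal Liberty Poll Exercise your right to vote.\n"
)

begin_offset, end_offset = 34, 40
# Python slicing is 0-based and excludes the end index.
fragment = article[begin_offset:end_offset]
print(fragment)  # babies
```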
The text is noisy, which makes the task trickier: for example, in row 1, "Personal Liberty Poll Exercise your right to vote." is clearly not part of the title.
There are several propaganda techniques that were used in the article above:
The format of a tab-separated line of the gold label and the submission files for task SI is:
id begin_offset end_offset
where id is the identifier of the article, begin_offset is the character where the covered span begins (included) and end_offset is the character where the covered span ends (not included). Therefore, a span ranges from begin_offset to end_offset-1. The first character of an article has index 0. The number of lines in the file corresponds to the number of fragments spotted. Notice that if two techniques overlap, for example "not looking as though Trump killed his grandma" (characters 607-653) and "killed his grandma" (characters 635-653), they are merged into one fragment (characters 607-653). This is the gold file for the article above, article123456.txt:
123456	34	40
123456	299	368
123456	400	416
123456	607	653
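The merging rule for task SI can be sketched as follows. The helper names `merge_spans` and `format_si_lines` are hypothetical and not part of the official scripts. The sketch merges only spans that strictly overlap; how spans that merely touch should be treated is not stated above, so leaving them separate is an assumption.

```python
def merge_spans(spans):
    """Merge overlapping [begin, end) spans into disjoint fragments."""
    merged = []
    for begin, end in sorted(spans):
        # Strict overlap: the new span starts before the previous one ends.
        if merged and begin < merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([begin, end])
    return [tuple(span) for span in merged]


def format_si_lines(article_id, spans):
    """Return tab-separated SI lines: id, begin_offset, end_offset."""
    return [f"{article_id}\t{b}\t{e}" for b, e in merge_spans(spans)]


# Technique-level spans of the example article, in arbitrary order.
spans = [(607, 653), (34, 40), (635, 653), (299, 368), (400, 416)]
lines = format_si_lines("123456", spans)
print("\n".join(lines))
```

With the five technique-level spans of the example, the two overlapping ones collapse into a single fragment, which yields four output lines.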
The format of a tab-separated line of the gold label and the submission files for task TC is:
id technique begin_offset end_offset
where id is the identifier of the article, technique is one out of the 14 techniques, begin_offset is the character where the covered span begins (included) and end_offset is the character where the covered span ends (not included). Therefore, a span ranges from begin_offset to end_offset-1. The first character of an article has index 0. The number of lines in the file corresponds to the number of techniques spotted (for this task overlapping techniques are not merged). This is the gold file for the article above, article123456.txt:
123456	Name_Calling,Labeling	34	40
123456	Black-and-White_Fallacy	299	368
123456	Loaded_Language	400	416
123456	Exaggeration,Minimization	607	653
123456	Loaded_Language	635	653
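A line in this format can be parsed by splitting on tabs. This is a minimal sketch; the name `parse_tc_line` is hypothetical. Splitting on tabs rather than on commas or arbitrary whitespace matters because some technique names, such as Name_Calling,Labeling, contain a comma.

```python
def parse_tc_line(line):
    """Parse one tab-separated TC line into (id, technique, begin, end)."""
    article_id, technique, begin, end = line.rstrip("\n").split("\t")
    return article_id, technique, int(begin), int(end)


gold_lines = [
    "123456\tName_Calling,Labeling\t34\t40\n",
    "123456\tLoaded_Language\t635\t653\n",
]
records = [parse_tc_line(line) for line in gold_lines]
for record in records:
    print(record)
```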
Upon registration, participants will have access to their team page, where they can also download scripts for scoring both tasks. Here is a brief description of the evaluation measures the scorers compute.
The SI task consists in identifying the propagandistic fragments. The evaluation function gives credit for partial matches between two spans. In a nutshell, the partial credit is proportional to the length of the intersection of the two spans, and it is normalized by the length of the two spans. To learn more, check our detailed description.
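To make the idea of partial credit concrete, here is a rough sketch. It is one plausible reading of the summary above and not the official scorer: we assume that precision normalizes each overlap by the length of the predicted span, that recall normalizes it by the length of the gold span, and that the spans within each list do not overlap. The function names are hypothetical.

```python
def overlap(a, b):
    """Number of characters shared by two [begin, end) spans."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def partial_credit_scores(predicted, gold):
    """Return (precision, recall, f1) with partial credit for overlaps.

    Assumption: spans inside each list are disjoint (already merged).
    """
    if not predicted or not gold:
        return 0.0, 0.0, 0.0
    # Credit for each predicted span, normalized by its own length.
    precision = sum(
        overlap(p, g) / (p[1] - p[0]) for p in predicted for g in gold
    ) / len(predicted)
    # Credit for each gold span, normalized by the gold span length.
    recall = sum(
        overlap(p, g) / (g[1] - g[0]) for g in gold for p in predicted
    ) / len(gold)
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


gold = [(34, 40), (607, 653)]
predicted = [(34, 40), (620, 653)]
precision, recall, f1 = partial_credit_scores(predicted, gold)
print(precision, recall, f1)
```

In this example the second predicted span lies entirely inside a gold span, so it keeps full credit for precision, while recall is reduced because only part of that gold span is covered.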
While TC is formally a multilabel multiclass classification problem, we turned it into a multiclass classification problem: if a span is associated with multiple techniques, the input file will have multiple copies of such fragments, so multiclass classification algorithms can be applied. The official evaluation measure for the task is the micro-averaged F1 measure.
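For reference, micro-averaged F1 pools true positives, false positives and false negatives over all classes before computing precision and recall. The sketch below assumes exactly one predicted label per input fragment and treats every fragment independently, including the repeated copies of a span; the official scorer may handle those copies differently. The function name `micro_f1` is hypothetical.

```python
def micro_f1(gold_labels, predicted_labels):
    """Micro-averaged F1 for single-label multiclass predictions."""
    tp = fp = fn = 0
    for gold, predicted in zip(gold_labels, predicted_labels):
        if gold == predicted:
            tp += 1
        else:
            fp += 1  # the predicted class gains a false positive
            fn += 1  # the gold class gains a false negative
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


gold_labels = [
    "Loaded_Language",
    "Name_Calling,Labeling",
    "Loaded_Language",
    "Exaggeration,Minimization",
]
predicted_labels = [
    "Loaded_Language",
    "Loaded_Language",
    "Loaded_Language",
    "Exaggeration,Minimization",
]
score = micro_f1(gold_labels, predicted_labels)
print(score)
```

Under the one-label-per-fragment assumption, every error counts once as a false positive and once as a false negative, so the micro-averaged F1 coincides with accuracy.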
FLC is a multi-label multi-class sequence tagging task. We modify the standard micro-averaged F1 to account for partial matching between the spans. In addition, an F1 value is computed for each propaganda technique.
We have created a Google group for the task. Join it to ask questions and to interact with other participants.
Follow us on Twitter to get the latest updates on the data and the competition!
If you need to contact only the organisers, send us an email.
This initiative is part of the Propaganda Analysis Project.