About

Methodology

Publicly available datasets from Twitter’s Transparency Centre Information Operations Archive were downloaded onto a virtual machine hosted on Google Cloud Platform. These datasets comprised tweet data (both tweets and retweets), user data and media content.

The tweet data was analysed to automatically extract the percentage of retweets, the percentage of accounts with at least one tweet (including retweets) in the dataset, the percentages of tweets with media, links and hashtags, and the number of likes. The top 10 most retweeted accounts, most used hashtags, most shared links and domains, most used Twitter clients and the posting patterns of the tweets were also calculated.
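As an illustration, the sketch below shows how these summary metrics could be computed with pandas. The column names (tweet_time, is_retweet, retweet_userid, userid, urls, hashtags, media_urls, like_count, tweet_client_name) and the list-like string format of the hashtags field are assumptions about the archive’s CSV schema and may need adjusting to the files actually downloaded.

```python
import pandas as pd

# Load one takedown dataset (schema assumed as described above).
tweets = pd.read_csv("tweets.csv", parse_dates=["tweet_time"], low_memory=False)

def summary_metrics(df: pd.DataFrame, n_accounts: int | None = None) -> dict:
    """Headline percentages and top-10 breakdowns for one dataset."""
    metrics = {
        "pct_retweets": 100 * df["is_retweet"].mean(),
        "pct_with_media": 100 * df["media_urls"].notna().mean(),
        "pct_with_links": 100 * df["urls"].notna().mean(),
        "pct_with_hashtags": 100 * df["hashtags"].notna().mean(),
        "total_likes": df["like_count"].sum(),
    }
    if n_accounts:
        # Share of listed accounts with at least one tweet or retweet in the data.
        metrics["pct_accounts_active"] = 100 * df["userid"].nunique() / n_accounts
    # Top-10 breakdowns.
    metrics["top_retweeted_accounts"] = (
        df.loc[df["is_retweet"], "retweet_userid"].value_counts().head(10)
    )
    metrics["top_clients"] = df["tweet_client_name"].value_counts().head(10)
    hashtags = (
        df["hashtags"].dropna().str.strip("[]").str.split(",").explode().str.strip()
    )
    metrics["top_hashtags"] = hashtags[hashtags != ""].value_counts().head(10)
    # Posting pattern: tweet volume by hour of day (UTC).
    metrics["posting_pattern"] = df["tweet_time"].dt.hour.value_counts().sort_index()
    return metrics

unfiltered = summary_metrics(tweets)
```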

Many of the accounts in the datasets were likely repurposed or purchased, and their tweets were often spammy or commercial content that fell outside the scope of this work. This material often skewed the apparent behaviour of the accounts and could make it difficult to identify and assess the most significant content shared in the datasets. Datasets containing such material were flagged as having repurposed accounts, a notable feature of those campaigns.

To filter out commercial and spammy material and focus the analysis on relevant political tweets, each dataset was restricted to tweets posted within 90 days of an account’s last tweet. The same metrics described above were then recalculated on this filtered subset.
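Continuing the sketch above, the 90-day window can be applied per account before the metrics are recomputed (again assuming the tweet_time and userid columns):

```python
# Timestamp of each account's final tweet, broadcast back onto every row.
last_tweet = tweets.groupby("userid")["tweet_time"].transform("max")

# Keep only tweets posted within 90 days of that account's last tweet.
recent = tweets[tweets["tweet_time"] >= last_tweet - pd.Timedelta(days=90)]

filtered = summary_metrics(recent)
```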

The 90-day cutoff was chosen after randomly sampling tweets from datasets observed to contain mostly spammy and commercial material. These samples were manually inspected for relevant political tweets under different time windows: the final 90, 180 and 365 days of an account’s activity were tested, and the 90-day window yielded the highest proportion of relevant tweets. Windows shorter than 90 days left too few tweets to analyse quantitatively.

One limitation of this methodology is that it does not adequately filter out commercial tweets from commercial activity running concurrently with the influence operations. One solution could be to apply an NLP topic classifier and discard tweets classified as commercial, although this would increase cost and inevitably remove some relevant tweets.

Natural language processing was used to extract entities and identify key languages from the filtered tweets in every dataset. Open-source pre-trained spaCy models were used for Chinese, English, Russian and Spanish tweets. Tweets from the datasets attributed to Bangladesh, Thailand, Indonesia and Turkey, as well as Catalan and Arabic tweets, were translated to English with Google Translate before entity recognition.
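A minimal sketch of the entity-extraction step using spaCy’s published small pipelines (the model names are real spaCy releases; the per-language routing and the assumption that other languages arrive pre-translated to English are illustrative):

```python
import spacy

# Pre-trained pipelines per language; larger models trade speed for accuracy.
MODELS = {
    "en": "en_core_web_sm",
    "zh": "zh_core_web_sm",
    "ru": "ru_core_news_sm",
    "es": "es_core_news_sm",
}
nlp_by_lang = {lang: spacy.load(name) for lang, name in MODELS.items()}

def extract_entities(text: str, lang: str) -> list[tuple[str, str]]:
    """Return (entity text, label) pairs; places are labelled GPE or LOC."""
    # Tweets in other languages are translated to English before this step,
    # so fall back to the English pipeline.
    nlp = nlp_by_lang.get(lang, nlp_by_lang["en"])
    return [(ent.text, ent.label_) for ent in nlp(text).ents]
```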

Geographical entities were then extracted from the tweets, and each entity was geocoded to a country to count how many times each country was mentioned in the dataset. This gave an indication of the countries targeted in these information campaigns. The results were displayed on a map showing the number of mentions of each country.
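The counting step could look roughly like the sketch below. geocode_to_country and its tiny lookup table are stand-ins for whatever gazetteer or geocoding service is actually used, and all_entities is assumed to be the list of (text, label) pairs produced by the extraction step above.

```python
from collections import Counter

# Stand-in lookup; in practice this would be backed by a full gazetteer.
PLACE_TO_COUNTRY = {
    "yerevan": "Armenia",
    "belgrade": "Serbia",
    "moscow": "Russia",
}

def geocode_to_country(place_name: str) -> str | None:
    return PLACE_TO_COUNTRY.get(place_name.strip().lower())

country_counts = Counter()
for text, label in all_entities:        # output of extract_entities above
    if label in ("GPE", "LOC"):         # keep only geographical entities
        country = geocode_to_country(text)
        if country:
            country_counts[country] += 1
# country_counts now holds mentions per country, ready to plot on a map.
```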

Accuracy was enhanced by cross-referencing the output against a database of over 120,000 place names. Results were also manually reviewed, with the review threshold based on how often a geographical entity occurred in the dataset: for example, every geographical entity mentioned at least once was reviewed in the Armenia dataset, compared with entities mentioned at least 20 times in the Serbia dataset.
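A hedged sketch of that validation and thresholding step, assuming the place-name database is available as a flat text file and that armenia_counts and serbia_counts are the per-dataset entity counts from the previous step (both assumptions):

```python
# Load the place-name list once (file name and format are assumptions).
with open("place_names.txt", encoding="utf-8") as f:
    known_places = {line.strip().lower() for line in f}

def review_candidates(entity_counts: dict[str, int], min_mentions: int) -> dict[str, int]:
    """Entities that match a known place name and occur at least min_mentions times."""
    return {
        name: count
        for name, count in entity_counts.items()
        if name.lower() in known_places and count >= min_mentions
    }

# Thresholds scale with dataset size, per the examples in the text.
armenia_review = review_candidates(armenia_counts, min_mentions=1)
serbia_review = review_candidates(serbia_counts, min_mentions=20)
```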

A literature review was conducted to identify previous analyses of the takedown datasets. The key takeaways were summarised and referenced. Geopolitical context and additional qualitative analysis were added to complement the summaries where relevant.

User data was also analysed to automatically extract the number of accounts, account creation dates and self-reported locations. Relevant and illustrative images were also identified, but exploration was limited by the volume of media present in the datasets.
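The user-level summaries amount to a few aggregations; the sketch below assumes a users.csv with account_creation_date and user_reported_location columns (column names are assumptions about the archive schema).

```python
import pandas as pd

users = pd.read_csv("users.csv", parse_dates=["account_creation_date"])

n_accounts = len(users)
creations_by_year = (
    users["account_creation_date"].dt.year.value_counts().sort_index()
)
top_reported_locations = users["user_reported_location"].value_counts().head(10)
```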