Infodemiology: Computational Methodologies for quantifying and visualizing key characteristics of the COVID-19 infodemic
Authors
Dominic Vincent Ligot, Frances Claire Tayco, Mark Toledo, Carlos Nazareno, and Denise Brennan-Rieder. Correspondence to Dominic Vincent Ligot (email: [email protected])
Abstract
Objectives. Infodemics of false information on social media are a growing societal problem, aggravated by the COVID-19 pandemic. The development of infodemics resembles that of epidemics of infectious diseases. This paper presents several methodologies that aim to measure the extent and development of infodemics through the lens of epidemiology.
Methods. Time-varying R was used as a measure of the infectiousness of the infodemic, topic modeling was used to create topic clouds and topic similarity heat maps, and network analysis was used to create directed and undirected graphs to identify super-spreader and multiple carrier communities on social media.
IR per URL were then represented as colored edges, with all edge values normalized to 1-50.
Color assignment is as follows: blue (1-10), yellow (11-20), orange (21-30), red orange (31-40),
and red (41-50).
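The edge weighting and coloring described above can be sketched as follows. This is a minimal illustration, not the authors' code: the text only says values were "normalized to 1-50", so min-max scaling is an assumption, and the names `normalize_to_range`, `COLOR_BINS`, and `edge_color` are hypothetical.

```python
def normalize_to_range(values, low=1, high=50):
    """Min-max scale raw values onto [low, high] (assumed scaling method)."""
    v_min, v_max = min(values), max(values)
    if v_max == v_min:
        # All values identical: map everything to the lower bound.
        return [low for _ in values]
    scale = (high - low) / (v_max - v_min)
    return [round(low + (v - v_min) * scale) for v in values]


# Color bins as listed in the text: (inclusive upper bound, color name).
COLOR_BINS = [
    (10, "blue"),
    (20, "yellow"),
    (30, "orange"),
    (40, "red orange"),
    (50, "red"),
]


def edge_color(weight):
    """Return the color bin for a normalized edge weight in 1-50."""
    if weight < 1:
        raise ValueError("weight must be at least 1")
    for upper, color in COLOR_BINS:
        if weight <= upper:
            return color
    raise ValueError("weight must be at most 50")


if __name__ == "__main__":
    raw_interaction_rates = [12, 340, 9800, 150000]  # made-up numbers
    for raw, w in zip(raw_interaction_rates, normalize_to_range(raw_interaction_rates)):
        print(raw, w, edge_color(w))
```

With heavy-tailed interaction counts, linear scaling pushes most edges into the lowest bin; a log transform before scaling is a common alternative, but the text does not say whether one was applied.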
Directed graph - grouping cases by sequence
Using the same dataset, cases were grouped by time and edges were created from paths
$P = (v_t, v_{t+1}, \ldots, v_{t+n})$ generated between the nodes in temporal sequence to visualize the
infodemic spread over a time window.
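One way to read the path construction is that, for each URL, the groups that shared it are ordered by time and consecutive groups are linked by a directed edge. The sketch below follows that reading; it is an assumption about the procedure, and the function name `temporal_paths` and the `(url, group, timestamp)` record layout are hypothetical.

```python
from collections import defaultdict


def temporal_paths(shares):
    """Build directed edges per URL from time-ordered shares.

    shares: iterable of (url, group, timestamp) records.
    Returns {url: [(earlier_group, later_group), ...]}.
    """
    events_by_url = defaultdict(list)
    for url, group, timestamp in shares:
        events_by_url[url].append((timestamp, group))

    edges = {}
    for url, events in events_by_url.items():
        # Order the groups by the time they shared the URL.
        ordered_groups = [group for _, group in sorted(events)]
        # Link each group to the next one in time, skipping self-loops.
        edges[url] = [
            (a, b) for a, b in zip(ordered_groups, ordered_groups[1:]) if a != b
        ]
    return edges


if __name__ == "__main__":
    sample = [
        ("site-1", "Group A", 3),
        ("site-1", "Group B", 1),
        ("site-1", "Group C", 2),
    ]
    print(temporal_paths(sample))
```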
Network Metrics and Visualization
Degree, betweenness, closeness, and PageRank centrality values were generated for each node.
Network visualizations were generated using the ForceAtlas2 algorithm (19) with a random
seed of 300.
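To make two of these metrics concrete, the sketch below computes degree and closeness centrality for a small undirected graph using only the standard library. It is illustrative only: the text does not name the tooling used, the function names are hypothetical, and betweenness and PageRank are omitted for brevity.

```python
from collections import deque


def build_adjacency(edges):
    """Turn a list of undirected (a, b) edges into an adjacency map."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency


def degree_centrality(adjacency):
    """Fraction of the other nodes each node is directly connected to."""
    n = len(adjacency)
    if n <= 1:
        return {node: 0.0 for node in adjacency}
    return {node: len(nbrs) / (n - 1) for node, nbrs in adjacency.items()}


def closeness_centrality(adjacency):
    """Reachable-node count divided by total shortest-path distance."""
    result = {}
    for source in adjacency:
        # Breadth-first search gives shortest hop counts from the source.
        distance = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for neighbor in adjacency[node]:
                if neighbor not in distance:
                    distance[neighbor] = distance[node] + 1
                    queue.append(neighbor)
        total = sum(distance.values())
        result[source] = (len(distance) - 1) / total if total else 0.0
    return result


if __name__ == "__main__":
    # Hypothetical star network: one URL shared into three groups.
    graph = build_adjacency([("url", "g1"), ("url", "g2"), ("url", "g3")])
    print(degree_centrality(graph))
    print(closeness_centrality(graph))
```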
Results
Topic Modeling
Forty-two (42) latent topics were discovered and visualized as word clouds, which display
clusters of words with varying sizes that represent the weight of the relevant terms in the topics.
Together, the words and their sizes indicate the contents of the posts under each
topic. For labeling, the top posts under each topic were examined.
Figure 1 shows some of the topic word clouds generated. For example, word cloud (a) highlights
the terms throat, drink, hot, water, prevent, and kill. Some top posts under this topic include
“Drinking water or hot teas kills the coronavirus, which is weak and cannot resist heat”,
“Drinking water and gargling helps against the coronavirus”, and “The coronavirus, before it
reaches the lungs, remain in the throat for four days…” This topic was labeled Virus in the
throat + water, gargling and tea as COVID treatment.
Figure 1: Example of Latent Topics’ Word Clouds:
(a) Virus in the throat + water, gargling and tea as COVID treatment; (b) Anti-masks; (c) COVID studies +
hydroxychloroquine; (d) Bat + 5G Origin + tracing; (e) President / VP related news; and (f) COVID cases numbers.
Several topic categories also surfaced. These include health-related misinformation (e.g.,
diagnostics, prevention, treatments, symptoms, transmission), false information
related to governments and public figures, false claims about virus severity, speculation
about the origin of COVID-19, and conspiracy theories.
Infodemic and Pandemic Rt
Time varying reproductive numbers, Rt, were derived based on the number of posts with
misinformation, number of interactions with such posts, number of posts by topic, and number of
COVID-19 cases.
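This section does not show which estimator was used for Rt, so the sketch below is only a simplified stand-in. It divides each day's count by a weighted sum of recent counts, which is the basic renewal-equation idea behind time-varying R. The function name `estimate_rt` and the lag weights are hypothetical, and published estimators add smoothing and uncertainty intervals that are omitted here.

```python
def estimate_rt(counts, weights):
    """Crude time-varying R from a daily count series.

    counts:  daily counts (posts, interactions, or cases).
    weights: assumed weights for lags 1..k, summing to 1, playing the
             role of a serial-interval distribution.
    Returns a list with None where Rt cannot be computed.
    """
    k = len(weights)
    rt_series = []
    for t in range(len(counts)):
        if t < k:
            # Not enough history to cover all lags yet.
            rt_series.append(None)
            continue
        # Weighted sum of the k previous days ("infection pressure").
        pressure = sum(w * counts[t - lag] for lag, w in enumerate(weights, start=1))
        rt_series.append(counts[t] / pressure if pressure > 0 else None)
    return rt_series


if __name__ == "__main__":
    daily_posts = [10, 10, 20, 40, 40, 30]  # made-up counts
    print(estimate_rt(daily_posts, weights=[0.5, 0.5]))
```

Values above 1.0 mean the series is growing relative to its recent past, which is the threshold the figures in this section are read against.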
Figure 2: Brazil’s Infodemic Effective Reproductive Number
Figure 2 presents comparisons of the Rt figures using Brazil as an example. The first graph
compares the Rt based on the number of posts and the COVID-19 case counts. In the example,
both series fluctuated mostly above the threshold of 1.0 within the covered period, which
suggests that an infodemic is happening simultaneously with the epidemic.
The second graph compares the Rt based on all misinformation posts against the Rt of posts
about a specific topic. In this example, the Rt of the topic “President and Vice President”
peaked at 4-5, well above the peaks of 1-3 in the general misinformation Rt.
Recurring peaks in the Rt of the “President and Vice President” topic were observed from May
2020.
Topic Similarity Study
The topic heat map shows similarity in content among several of the 42 misinformation themes
identified. Cosine similarity values range from 0 to 0.31, indicating some shared content
between some pairs of topics.
Figure 3 shows an excerpt of the heat map. In the example, the topic Coronavirus + claim +
evidence + share shares similar features with COVID Pre-2019 and Early News About It, with
a cosine value of 0.25. Likewise, the topic Coronavirus / COVID transmission + treatment +
survival in high temperature appeared closely related to the topic Virus in the throat + water,
gargling and tea as COVID treatment (cosine similarity = 0.29).
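Cosine similarity between two topics can be computed from their term weights. The sketch below treats each topic as a dictionary mapping terms to weights. The function name and the example weights are hypothetical and are not taken from the study's data.

```python
import math


def cosine_similarity(topic_a, topic_b):
    """Cosine similarity between two {term: weight} topic vectors."""
    # Only terms present in both topics contribute to the dot product.
    shared_terms = set(topic_a) & set(topic_b)
    dot = sum(topic_a[term] * topic_b[term] for term in shared_terms)
    norm_a = math.sqrt(sum(w * w for w in topic_a.values()))
    norm_b = math.sqrt(sum(w * w for w in topic_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Made-up term weights for two illustrative topics.
    throat_topic = {"throat": 0.4, "water": 0.3, "hot": 0.2, "kill": 0.1}
    heat_topic = {"temperature": 0.5, "hot": 0.3, "kill": 0.2}
    print(round(cosine_similarity(throat_topic, heat_topic), 3))
```

Computing this for every pair of topics gives the symmetric matrix that a similarity heat map displays.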
Figure 3: Topic Similarity Heat Map
Network Analysis
Figure 4: Undirected Network Graph from Reddit Groups (left)
and Directed Network Graph from Facebook Groups (right)
The left portion of Figure 4 is an undirected network graph that shows how 20 red-listed sites
(red nodes) are shared across Reddit communities known as subreddits (blue
nodes) and how much users in each community interact with the posted links. The network
contains 385 groups and 804 connections. The graph shows r/Conspiracy and r/Conservative
as the largest nodes, central to the network, which suggests that these groups are major
hotspots for misinformation. This mirrors the finding
that conservatives are more likely to share articles from fake news domains (14). r/Conservative
also has the highest share count (1,643 shares) and the highest interaction count
(1,063,579 interactions); these values were normalized to the range 1-50.
This undirected graph resulted in a clustering of hosts based on the specific URL. Each URL is a
source node (virus) and each social media group is a target node (host). An edge is created when
a red-listed site is shared or interacted with in the group (viral transmission), and a collection
of nodes and edges forms a cluster of digital hotspots (33) within the network (infected population).
The graph indicates the existence of super-host communities, illustrated by large interaction rates
and shared site postings, which further suggests that red-listed site operators or subreddit
members are sharing questionable content to target communities.
The directed graph visualizes the sharing of a red-listed site over time, showing which groups are
more susceptible to infection, what links are driving the infodemic, and which central nodes are
possible superspreaders. Groups with themes such as vaccine, anti-aging, and conservative were
observed to spread and react to content shared by the community more often than other groups.
Figure 5: Network Analysis Metrics Output
Figure 5 shows the network analysis metrics for the Reddit network containing the 20 unique
red-listed URLs with the highest interaction rates in the entire dataset. Interactions (likes,
comments, shares) correlate with degree (number of node connections), which indicates that users
were interacting heavily with the red-listed content shared in the group. Higher share rates
also indicate that more members are actively and deliberately sharing red-listed content to the
group. Nodes with higher closeness centrality spread information more efficiently relative
to other nodes.
Discussion
This paper presents methodologies to measure the extent and development of infodemics: Time
varying R to measure the infectiousness of an infodemic, topic clouds and topic similarity heat
maps to analyze content of false information, and directed and undirected graphs to identify
super spreader and multiple carrier accounts on social media. Posts from Facebook and Reddit
extracted through CrowdTangle were used to illustrate the approaches, but other data sets such as
from Twitter, Instagram, and other platforms may be substituted.
Topic modeling was used to describe the major misinformation topics and categories and
encapsulate the major misinformation narratives of the infodemic. Applied over smaller
windows of time, this approach may make it possible to demonstrate the evolution of
misinformation (35). In addition, topic modeling provides a means of monitoring the social
media landscape for the appearance of new topics or categories, which could enable early
detection of disinformation.
The Rt provides a means to illustrate the relative popularity of a given piece of
misinformation over time. Additionally, having the ability to estimate topic-level Rt and monitor
the ones exceeding the overall Rt helps in identifying the primary viral themes that are driving
the misinformation spreading during the infodemic. Furthermore, having the ability to visualize
and compare both infodemic and epidemic Rt can serve as a foundation for future research on
their relationship.
Identification of topic similarities can aid in development of communication campaigns that can
counter multiple misinformation themes. Moreover, since cosine values suggest the level of
similarity of the most relevant terms between topics, and automatically generated topics are
assumed to be distinct, identifying topics that are highly related with numerous other topics
might indicate potential cases of deliberate disinformation being engineered towards the
unsuspecting public. The similarity heat map analysis can therefore serve as an alert system
that flags topics needing further investigation.
The network analysis identifies the propagators of misinformation, major influencers, the extent
of their networks and the particular misinformation they are pushing. Ranking nodes by graph
metrics and user shares and interactions might identify digital hotspots that tend to be more
susceptible to misinformation, which are candidates for proactive management strategies. A Pareto
principle or power-law dynamic might apply to infodemics, in which 80% of new transmissions are
caused by fewer than 20% of the carriers, namely the individuals or members who aggressively
spread false information (25, 34).
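The Pareto-style claim can be checked against share counts with a short calculation: sort carriers by how much they shared and find the smallest fraction of them that accounts for 80% of all shares. The sketch below is illustrative; the function name and the sample counts are hypothetical.

```python
def carrier_fraction_for_share(shares_per_carrier, target=0.8):
    """Smallest fraction of carriers responsible for `target` of all shares."""
    total = sum(shares_per_carrier)
    if total == 0:
        return 0.0
    running = 0
    # Walk carriers from most to least active, accumulating their shares.
    ranked = sorted(shares_per_carrier, reverse=True)
    for rank, count in enumerate(ranked, start=1):
        running += count
        if running >= target * total:
            return rank / len(ranked)
    return 1.0


if __name__ == "__main__":
    # Made-up share counts for ten groups.
    counts = [900, 400, 50, 40, 30, 20, 20, 15, 15, 10]
    print(carrier_fraction_for_share(counts))
```

A result at or below 0.2 would be consistent with the 80/20 pattern for that dataset; a much larger value would suggest spreading is more evenly distributed.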
The approaches described in the paper are not confined to health misinformation, and may be
applied towards understanding other infodemic themes such as political disinformation, climate
change denial, and conspiracy theories. These approaches empower the analysis and
management of infodemics through an epidemiological lens, and provide bases for strategies to
stem an ongoing spread and mitigate against future infodemics.
Authors
The CirroLytix Research Team includes Dominic Vincent Ligot, Frances Claire Tayco, Mark Toledo, Carlos Nazareno, and Denise Brennan-Rieder. All team members jointly conceptualized the study, analyzed and interpreted the data, wrote and revised the manuscript, and decided to submit it. Corresponding author: Dominic Vincent Ligot, [email protected].
References
1. Aggarwal, C., & Zhai, C. (2012). Mining Text Data. Springer.
2. Ahmed, S. M., Mushtaq, K., & Al Soub, H. (2020). "Social Media
Misinformation" - An Epidemic within the COVID-19 Pandemic. The American
Journal of Tropical Medicine and Hygiene, 103(2), 920-921.
3. Aletras, N., & Stevenson, M. (2014). Measuring the Similarity between Automatically