Dr. Ming-Hsiang Tsou
[email protected]
Director of the Center for Human Dynamics in the Mobile Age
Professor, Department of Geography, San Diego State University
February 20, 2015
Tracking Disease Outbreaks and Health Disparities using Big Data, Social Media, GIS, and Social Network Analysis
Big Data are dynamic datasets created by or derived from human activities, communications, movements, and behaviors (Tsou, 2015, in review).
The term “big data” should refer to big ideas, big impacts, and big changes for our society rather than focusing only on big volume.
Big Data Categories (Tsou, 2015):
Social life data include popular social media services (Twitter, Flickr, Snapchat, YouTube, Foursquare, etc.), online forums, online video games, and web blogs.
Health data include electronic medical records (EMR) from hospitals and health centers, cancer registry data from state and local communities, and official disease outbreak tracking and epidemiology data.
Business and commercial data include credit card transactions, online business reviews (such as Yelp and Amazon reviews), supermarket membership records, shopping mall transaction records per store, credit card fraud examination data, enterprise management data, and marketing analysis data.
Transportation and traffic data include GPS tracks (from taxis, buses, Uber, bike sharing programs, and mobile phones), traffic sensor data (from subways, trolleys, buses, bike lanes, highways), social media data (from check-ins, Waze, and other social media platforms), and mobile phone data (from data transmission records and cellular network data).
Scientific research data include earthquake sensors, weather sensors, satellite images, crowdsourced data for biodiversity research, volunteered geographic information, and census data.
Health Data Layer
Data Integration / Data Fusion
Explore their spatiotemporal relationships in both network space and geographical space.
Image provided by Dr. Atsushi Nara (Associate Director of HDMA Center).
Human Dynamics in the Mobile Age (HDMA)
The Challenge of Big Data Analytics:
Big Data are very Messy, Noisy, and Unstructured!
Requires collaborative efforts from linguists, computer scientists, data mining experts, statisticians, physicists, modelers, and domain experts.
Geography (place and time) is the KEY for Understanding and Integrating Big Data
(Tsou and Leitner, 2013)
Research Showcase #1:
Geo-Targeted Social Media Analytics for Tracking Flu Outbreaks in the U.S.
[Workflow diagram labels: Geo-Targeting Data Collection (Twitter APIs); Filter; Machine Learning; Analysis; Trend Analysis; Spatial Analysis; Visualization; SMART Dashboard]
Source: Twitter Search APIs to collect public tweets (Geo-targeted)
Keywords: “flu” and “influenza”.
Region: 17-mile radius around the center of each of 31 U.S. cities.
Time: September 30, 2013 (Week 40) – March 23, 2014 (Week 12)
Twitter Search API: based on the user profile (locations) and gazetteers (e.g., San Diego includes La Jolla, La Mesa, and Chula Vista). The Search APIs are FREE.
How to Collect Social Media Data?
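The collection settings above (keywords, a 17-mile radius, one query per city) can be sketched as a small query builder. This is an illustration only, not the HDMA Center's collection code: it assumes the `q` and `geocode` parameters of Twitter's v1.1 search endpoint as it existed at the time, the `City` type and `build_search_params` helper are hypothetical names, and the San Diego coordinates are approximate. No network request is made.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class City:
    """A city center used as the origin of a geo-targeted search (hypothetical type)."""
    name: str
    lat: float
    lon: float


def build_search_params(city, keywords=("flu", "influenza"), radius_miles=17):
    """Build query parameters for one city.

    Assumes the v1.1 search parameters `q` (query text) and
    `geocode` ("latitude,longitude,radius" with a mi or km unit).
    """
    return {
        "q": " OR ".join(keywords),
        "geocode": f"{city.lat},{city.lon},{radius_miles}mi",
    }


# Approximate city center; an assumption for illustration.
san_diego = City("San Diego", 32.7157, -117.1611)
params = build_search_params(san_diego)
print(params)
```

Repeating the call for each of the 31 city centers would give one parameter set per city.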
Collect Tweets from Top 31 U.S. Cities (17-Mile Radius)
31 different cities across the United States (chosen based on their population sizes): Atlanta, Austin, Baltimore, Boston, Chicago, Cleveland, Columbus, Dallas, Denver, Detroit, El Paso, Fort Worth, Houston, Indianapolis, Jacksonville, Los Angeles, Memphis, Milwaukee, Nashville-Davidson, New Orleans, New York, Oklahoma City, Philadelphia, Phoenix, Portland, San Antonio, San Diego, San Francisco, San Jose, Seattle, and Washington, D.C.
Normalizing City Tweeting Rate by Census Data
City Population = sum of the population numbers of all census tracts within the circle.
(Different from the official record of City Population)
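A minimal sketch of this normalization, under stated assumptions: a tract counts as "within the circle" when its centroid lies inside the 17-mile radius (the slides do not say how partially covered tracts are handled), distances use the haversine formula, and the tract records and tweet count below are made-up values for illustration.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def circle_population(center, tracts, radius_miles=17.0):
    """Sum the population of tracts whose centroid falls inside the circle.

    `tracts` is an iterable of (centroid_lat, centroid_lon, population).
    The centroid rule is an assumption, not the documented method.
    """
    lat0, lon0 = center
    return sum(
        pop
        for lat, lon, pop in tracts
        if haversine_miles(lat0, lon0, lat, lon) <= radius_miles
    )


def tweets_per_100k(tweet_count, population):
    """Tweeting rate normalized per 100,000 residents."""
    return 100_000 * tweet_count / population


# Hypothetical tracts: two near the city center, one far outside the circle.
center = (32.7157, -117.1611)
tracts = [(32.72, -117.16, 4000), (32.80, -117.10, 6000), (34.05, -118.24, 5000)]
population = circle_population(center, tracts)
rate = tweets_per_100k(25, population)
print(population, rate)
```

Because the circle population is built from tracts, it will generally differ from the official city population, which is the point made above.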
Machine Learning
[Chart: number of tweets; values shown: 10,678; 5,398; 4,947; 4,944; 3,279]
Total flu tweets collected: 307,070. Final valid flu tweets: 88,979.
Filter and Refine Big Data (Remove Noise)
National ILI (influenza-like illness): two-week delay
Real-Time Monitoring of Flu Outbreaks in U.S. (National Scale – combined 31 Cities), 2013 – 2014 flu season
(R) value = 0.8494
National tweet rate (31-city average): real-time updates, 10 days earlier than ILI
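The (R) value reported above is a correlation between two weekly series. A minimal sketch, assuming it is a Pearson correlation coefficient (the slides do not name the statistic) and using made-up weekly values rather than the study data:

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length with at least 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / (sd_x * sd_y)


# Hypothetical weekly values, for illustration only.
weekly_tweet_rate = [1.0, 2.0, 4.0, 3.0]
weekly_ili_rate = [1.2, 1.9, 4.1, 3.2]
print(pearson_r(weekly_tweet_rate, weekly_ili_rate))
```

Both series must cover the same weeks in the same order for the coefficient to be meaningful.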
Multi-Level (Scale) Analysis: (Regional and Municipal Scales)
[Figure panels: Regional ILI, Region 4 and Region 5]
Weekly “Valid” Flu Tweeting Rates Compared with Emergency Department ILI in Different Cities
* P < 0.05; ** P < 0.005; *** P < 0.001
[Legend: tweet rate per 100,000; sentinel-provided ILI rate; averaged ILI rate (missing data)]
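Weekly tweeting rates like those in this comparison need tweets binned by week first. A minimal sketch, assuming ISO weeks (Monday to Sunday): the collection period stated earlier, September 30, 2013 (Week 40) to March 23, 2014 (Week 12), matches ISO week numbering, but the surveillance data being compared may use a different week definition. The sample dates are hypothetical.

```python
from collections import Counter
from datetime import date


def weekly_counts(tweet_dates):
    """Count tweets per (ISO year, ISO week)."""
    counts = Counter()
    for d in tweet_dates:
        iso_year, iso_week, _weekday = d.isocalendar()
        counts[(iso_year, iso_week)] += 1
    return counts


# Hypothetical tweet dates, for illustration only.
sample = [date(2013, 9, 30), date(2013, 10, 6), date(2013, 10, 7), date(2014, 3, 23)]
counts = weekly_counts(sample)
print(sorted(counts.items()))
```

Dividing each weekly count by the circle population and multiplying by 100,000 gives the weekly rate plotted in the legend.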
The correlation between city-level sentinel-provided (SP) ILI rates and tweeting rates is VERY BAD! (Is SP ILI not reliable?)
Emergency department (ER) ILI data are better than SP ILI data.
Trend Analysis at the Municipal Scale (San Diego) with Lab-Confirmed Flu Cases
San Diego: lab-confirmed flu cases vs. tweeting rate: (R) value = 0.9331
Two research papers in the Journal of Medical Internet Research (2013 and 2014)
Next Step:
Monitor Flu Outbreaks in Real-Time?
The HDMA Center is developing the SMART Dashboard:
SMART Dashboard: Social Media Analytics and Research Testbed
http://vision.sdsu.edu/hdma/smart/ (Beta)
CDC Influenza Positive Tests, National Data Summary, through Weeks 40-3, 2014-2015 Season
# of Filtered ILI Tweets, Top 30 U.S. Cities, as of February 9, 2015 (from SMART Dashboard)
Only 1%-4% of tweets have GPS coordinates.
Problem: Twitter broke its Search APIs on 11/20/2014, and they now return only GPS-tagged tweets. (This reduces the number of tweets collected by 90%-95%.)
Only GPS-tagged Tweets Collected Now
2014-2015 Comparison between ILI and GPS-only Tweets among 30 U.S. Cities
Red: national ILI. Purple: tweeting rate of GPS-only tweets, multiplied by 10, for the 2014-2015 flu season.
The Limitations of Social Media for Flu Outbreak Monitoring
Social Media User Profiles
Social media messages can NOT represent the whole population, but they can provide warning signals and real-time updates.
Some flu viruses may have bigger impacts on older people than on younger people.
2012 Survey
Only 1%-4% of the Free “Garden Hose” Tweets (1% of All Tweets) Are GPS Tweets
• Only 1%-4% of tweets have GPS coordinates.
• 80% of tweets have city-level location info (user profile).
• Public (free) Twitter APIs can collect only 1% of the total tweets within a region (the “Garden Hose”), but our unique Search API methods can collect up to 100% of requested tweets using a combination of keywords and region.
• Twitter “Firehose” (massive, real-time stream of tweets): not free, expensive. Historical Firehose tweets can be purchased from tweet data-reseller partners (GNIP, SocialFlow, MediaSift, etc.), which is also expensive.
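The two location sources in these bullets (GPS coordinates on a small share of tweets, a profile location on most) suggest a fallback rule: use GPS when present, otherwise match the profile text against a gazetteer. The sketch below is an assumption about how that could be done, not the method used in the study. It assumes v1.1-style tweet objects with a GeoJSON `coordinates` field (longitude first) and a `user.location` string; `GAZETTEER` and `locate_tweet` are hypothetical names, and the substring matching is deliberately naive.

```python
# Hypothetical gazetteer: place names that should count as a target city.
GAZETTEER = {
    "san diego": "San Diego",
    "la jolla": "San Diego",
    "la mesa": "San Diego",
    "chula vista": "San Diego",
}


def locate_tweet(tweet):
    """Return (source, value) describing where a tweet was located.

    source is "gps" (value is a (lat, lon) pair), "profile" (value is a
    city name from the gazetteer), or "unknown" (value is None).
    """
    coords = tweet.get("coordinates")
    if coords:
        lon, lat = coords["coordinates"]  # GeoJSON order is [longitude, latitude]
        return ("gps", (lat, lon))
    profile = ((tweet.get("user") or {}).get("location") or "").lower()
    for place, city in GAZETTEER.items():
        if place in profile:
            return ("profile", city)
    return ("unknown", None)


# Made-up tweet objects, for illustration only.
gps_tweet = {"coordinates": {"type": "Point", "coordinates": [-117.16, 32.72]},
             "user": {"location": "Somewhere"}}
profile_tweet = {"coordinates": None, "user": {"location": "La Jolla, CA"}}
print(locate_tweet(gps_tweet), locate_tweet(profile_tweet))
```

Free-text profile locations are noisy, so a real pipeline would need stricter matching than a substring test.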
User Privacy Issue
• Concerns about “Big Brother”.
• Although all the tweets collected from the APIs are “public tweets” (everyone can search and retrieve them), some tweet content may contain personal private information (real names, locations of homes, offices, private conversations, medical situations, etc.).
The top 30 U.S. cities “whooping cough” Twitter message networks (mentions networks): Social Network Nodes – User Profiles
Data collection temporal coverage: July 22 to December 9, 2014
• SharylAttkisson: Investigative journalist. Author of Stonewalled. Dreaming of a day when public officials answer questions as if they know they work for the public.
• NFIDvaccines: National Foundation for Infectious Diseases. Non-profit organization dedicated to educating the public and healthcare professionals about the causes, treatment, and prevention of infectious diseases.
• Roseperson: Regional support manager at the Aligning Forces for Quality (AF4Q) National Program Office (at George Washington University).
• Jonrappoport: Freelance investigative reporter for over 30 years.
• Acognews: The American Congress of Obstetricians and Gynecologists. The leading authority on women’s health care.
• CDCgov: CDC's official Twitter source for daily credible health & safety updates from the Centers for Disease Control & Prevention.
• Laurie_Garrett: Pulitzer Prize-winning science journalist and author of I Heard the Sirens Scream (2011) and two bestselling books, The Coming Plague and Betrayal of Trust.
• CFR_org: The Council on Foreign Relations is a resource for foreign policy news and analysis.
Social Network Analysis (SNA)
Top 30 U.S. Cities Clustering Coefficient (sub-groups, communities)
“Whooping cough” message networks in Twitter (among top 30 U.S. cities)
[Labeled nodes: SharylAttkisson, NFIDvaccines, Jonrappoport, CDCgov, roseperson, acognews, Laurie_garrett, CFR_org, LJbeil]
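The clustering coefficient used to find sub-groups in the network above can be sketched in a few lines. This assumes the standard local clustering coefficient on an undirected graph (for a node with k neighbors, the fraction of the k(k-1)/2 possible neighbor pairs that are themselves connected); the slides do not say which variant or tool was used, and the edge list below is made up.

```python
from itertools import combinations


def build_adjacency(edges):
    """Build an undirected adjacency map from (user, user) mention pairs."""
    adj = {}
    for a, b in edges:
        if a == b:
            continue  # ignore self-mentions
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def local_clustering(adj, node):
    """Fraction of a node's neighbor pairs that are connected to each other."""
    neighbors = adj.get(node, set())
    k = len(neighbors)
    if k < 2:
        return 0.0
    links = sum(1 for u, v in combinations(neighbors, 2) if v in adj[u])
    return 2.0 * links / (k * (k - 1))


# Hypothetical mention edges, for illustration only.
edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]
adj = build_adjacency(edges)
print({node: local_clustering(adj, node) for node in sorted(adj)})
```

Nodes with a coefficient near 1 sit inside tightly connected groups, while hubs that bridge otherwise unconnected users score low.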
Top 30 US Cities Whooping Cough Grouping Major Nodes (information flow)
“Whooping cough” message flows in Twitter (among top 30 U.S. cities)
Top 30 U.S. Cities In-Degrees (tweets were “retweeted” by others)
Top 30 U.S. Cities Out-Degrees (retweeting others’ messages)
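In-degree and out-degree as described here can be counted directly from retweet pairs. A minimal sketch, assuming each edge points from the retweeting user to the original author (so in-degree counts how often a user was retweeted and out-degree counts how often a user retweeted others); the user names are placeholders, not accounts from the study.

```python
from collections import Counter


def degree_counts(retweets):
    """Count in-degree and out-degree from (retweeter, original_author) pairs."""
    in_degree = Counter()
    out_degree = Counter()
    for retweeter, original_author in retweets:
        out_degree[retweeter] += 1       # this user retweeted someone
        in_degree[original_author] += 1  # this user was retweeted
    return in_degree, out_degree


# Hypothetical retweet pairs, for illustration only.
retweets = [("user_a", "user_c"), ("user_b", "user_c"), ("user_a", "user_b")]
in_degree, out_degree = degree_counts(retweets)
print(in_degree.most_common(), out_degree.most_common())
```

High in-degree marks accounts whose messages spread widely; high out-degree marks accounts that mainly relay others' messages.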