Cheng-Jun Wang Web Data Analysis Department of Communication PhD Student Workshop Web Mining for Communication Research April 22-25, 2014 http://weblab.com.cityu.edu.hk/blog/project/workshops
  • Cheng-Jun Wang

    Web Data Analysis

    Department of Communication PhD Student Workshop: Web Mining for Communication Research, April 22-25, 2014

    http://weblab.com.cityu.edu.hk/blog/project/workshops

  • Outline

    Key features of web data
    Major approaches to web data analysis
      Network analysis
      Temporal analysis
      Spatial analysis
      Sentiment analysis
    Reflections on web data analysis

  • FEATURES OF WEB DATA

  • Traditional vs. Web Data

    Analysis of traditional (cross-sectional, fat) data
      Multiple regression
      Log-linear model
      Multilevel analysis
      Structural equation modeling
      etc.
    Analysis of web (longitudinal, tall) data
      Time series analysis
      Network analysis
      Spatial analysis
      Text mining
      etc.

    [Table: fat data layout, with columns ID, V1, V2, V3, ... and rows 1, 2, ..., 1,000]

    [Table: tall data layout, with columns ID, Time, V1 and rows 1, 2, ..., 1,000, ..., 10,000, ...]

  • APPROACHES TO WEB DATA ANALYSIS

  • What Can We Do with Web Data?

    Features
      Temporal features
      Spatial features
      Structural/behavioral features (e.g., RT, @)
      Content features (term/topic/sentiment)
    Approaches
      Time series analysis
      Spatial analysis
      Network analysis
      Text mining

  • Frequently Used Tools

    Operation          Open Source                          Commercial
    Pull-down menus    OpenOffice, Google Docs Spreadsheet  SPSS, Excel
    Programming-based  R                                    Stata, SAS

  • NETWORK ANALYSIS (ELEMENTARY-LEVEL)

  • What Is a Network?

    A network consists of
      Nodes (actors, agents, etc.)
      Edges (relations, ties, etc.)
    The same set of nodes and edges can also be called:
      a graph
      a matrix
      a web
      a map
      etc.
    A pair of adjacent nodes are neighbors. (Are A and C neighbors?)

  • Key Concepts

    Network
    Node
    Edge
    Ego-network
    Component
    Triadic closure

    Individual-level analysis: centrality metrics
    Group-level analysis: transitivity
    Global-level analysis: density, modularity

  • Examples of Nodes and Edges

    Nodes:
      Persons (e.g., Facebook users)
      Organizations (McDonald's restaurants)
      Nations (EU members)
      Machines (web servers)
      Locations (airports)
      Ideas (words in articles)
      etc.
    Edges:
      Kinship links (family ties)
      Friendship ties (factual or perceived)
      Business transactions
      Travel routes (highways, subways, air flights)
      Similarities (word co-occurrences in articles)
      etc.

  • Examples of Innovative Network Analysis

    Food Flavor Network: http://www.nature.com/srep/2011/111215/srep00196/full/srep00196.html
    Music Notes Network: http://www.eie.polyu.edu.hk/~xfliu/publications/LiuXF.2010.physa.Music.pdf

  • More on Edges

    Directed (one-way) vs. undirected (two-way)
    Observed (directly measured, e.g., hyperlinks) vs. hidden (inferred, e.g., co-occurrences)
    Formal (institutionally arranged, top-down) vs. informal (self-organized, bottom-up)
    Static (unchanged over time) vs. dynamic (evolving)
    Positive (e.g., friending) vs. negative (e.g., de-friending)

    The key challenge to innovative network analysis is to identify hidden, informal, and evolving edges.

  • Classification of Online Social Networks

    Source: Ackland and Zhu (forthcoming). Social network analysis, Sage.

    Manifestation of ties (rows) by direction of ties (columns):

    Directly observed, undirected: Friendship networks (e.g., Facebook, Google+)
    Directly observed, directed: Microblog networks (e.g., Twitter, Sina Weibo)
    Indirectly inferred, undirected: Semantic networks (e.g., recommendation systems, social tagging systems)
    Indirectly inferred, directed: Newsgroups, blogs, WWW hyperlink networks

  • Components

    A component is a subset of a network:
      every node in the subset has a path to every other
      the subset is not part of some larger set
    Most online social networks have one (or a few) giant components.

  • Components in a High School Network

    Source: Bearman, Moody & Stovel (2004). http://www.jstor.org/discover/10.1086/386272?uid=3738176&uid=2&uid=4&sid=21103878405327

  • Components in World Wide Web

    Teapot Model (Zhu et al., 2008)
    Daisy Model (Donato et al., 2005)
    Bowtie Model (Broder et al., 2000)

    SCC: strongly connected component
    IN: unilaterally connected to SCC
    OUT: unilaterally connected by SCC

  • Ego-Centric Network

    Ego-network: a subset of a network including a particularly designated node (ego) and its neighbors (alters)
    For example, followers of a VIP account on Twitter or Sina Weibo form an ego-network.
    All snowballing samples of online social networks are ego-networks.
    An important property of ego-networks is the depth (see next slide).
    A family tree is a special case of ego-networks (see the second next slide).

  • The Depth of Ego Networks

    [Figure: four panels showing a 1.0, 1.5, 2.0, and 2.5 ego network]

  • Family Tree: Special Ego-Networks

    Is it an undirected or directed graph?
    Are there multiple paths from a parent node to a child node?
    What are the similarities or differences between family trees and other types of ego-networks?

  • Triadic Closure (Transitivity)

    Why are friends (B and C) of a common friend (A) more likely to become friends themselves?
      1. chances to meet each other
      2. similarity between them

    [Figure: the triad at t0 and t1]

  • Triads of Undirected Networks

    [Figure: four triad types: closed triad, open triad, connected pair, unconnected]

  • Triads of Directed Networks

    The 1st number: N of bi-directional edges
    The 2nd number: N of uni-directional edges
    The 3rd number: N of nonexistent edges
    The letter code: directed variations of the same triad, with U for up, D for down, C for circle, and T for transitive (i.e., having 2 paths that lead to the same endpoint)

  • Measure of Triadic Closure

    Nodes in a graph usually have multiple triads each.
    Therefore, there is a need to measure quantitatively the overall degree of triadic closure for each node.
    The clustering coefficient (CC) is the most frequently used measure for this purpose.
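
    A minimal base-R sketch of the clustering coefficient of a node, read here as the share of a node's neighbor pairs that are themselves tied. The toy graph and the helper name local_cc are assumptions for illustration and do not come from the slides.

```r
# Toy undirected graph (hypothetical): A is tied to B, C, D; only B-C are tied among them.
edges <- rbind(c("A", "B"), c("A", "C"), c("A", "D"), c("B", "C"))
nodes <- sort(unique(as.vector(edges)))

# Symmetric adjacency matrix
adj <- matrix(0, length(nodes), length(nodes), dimnames = list(nodes, nodes))
adj[edges] <- 1
adj[edges[, 2:1]] <- 1

# Local clustering coefficient: closed neighbor pairs / all neighbor pairs
local_cc <- function(adj, v) {
  nb <- names(which(adj[v, ] == 1))
  k <- length(nb)
  if (k < 2) return(NaN)            # undefined with fewer than two neighbors
  closed <- sum(adj[nb, nb]) / 2    # each tie among neighbors is counted twice
  closed / (k * (k - 1) / 2)
}

cc <- sapply(nodes, function(v) local_cc(adj, v))
print(round(cc, 3))
```

    In this toy graph, A has three neighbor pairs of which one (B-C) is closed, so its coefficient is 1/3, while D has a single neighbor and therefore no defined value.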

  • Triadic Closure: Driven by Social Selection or Social Influence?

    (a) Triadic closure: A, B, and C are all persons
    (b) Focal closure: B and C are persons; A is a focus (e.g., recommended books on Amazon)
    (c) Membership closure: A and B are persons; C is a focus (e.g., recommended groups on Facebook)

  • Goals of Social Network Analysis

    Perer & Shneiderman (2008):
      Overall network metrics (e.g., number of nodes, number of edges, density, diameter): global
      Node rankings (degree, betweenness, closeness centrality): individual
      Edge rankings (weight, betweenness centrality): local
      Node rankings in pairs (degree vs. betweenness, plotted on a scattergram): local
      Edge rankings in pairs: local
      Cohesive subgroups (finding communities): local
      Multiplexity (analyzing comparisons between different edge types, such as friends vs. enemies): cross-level

  • Levels of Network Analysis

    Individual level: nodes. Which nodes are the most popular/important/influential in the network?
    Local level: groups (or clusters, communities, components, etc.). How are the nodes clustered/grouped together?
    Global level: network. How densely/closely is the network connected as a whole?

  • Individual-level Analysis

    Find popular/important/influential nodes, usually based on centrality metrics:
      Degree centrality: How many nodes are you connected to?
      Closeness centrality: How close are you to other nodes?
      Betweenness centrality: How many paths run through you?
      Eigenvector centrality: How many important nodes are around you?

  • Interpretation of Centrality Scores

    High centrality scores: individuals with high scores are often more likely to be
      leaders
      key conduits of information
      early adopters of anything that spreads in a network
    Low centrality scores: individuals with low scores are in peripheral positions and
      may be protected from negative contagion and influence
      may be associated with less work overload in an organization

  • Example: Krackhardt's Kite Graph

    A network of 10 nodes and 18 edges:
      Who has the highest degree centrality?
      Who has the highest betweenness centrality?
      Who has the highest closeness centrality?

  • Degree Centrality

    The number of neighbors to which a node is directly connected
    Indicates how well the node is connected within the graph

    Degree of G = 6

  • Betweenness Centrality

    The number of shortest paths between pairs of other nodes that pass through a node (as compared with the total number of shortest paths in the graph)
    Indicates how critical the node is to the flow of information or resources in the graph

    Betweenness of H = 14

  • Closeness Centrality

    The total number of steps along the shortest paths from the focal node to all other nodes
    Indicates how quickly information travels between the node and anyone else in the graph

    Closeness of D and E = 14 each: 14 = 1*5 + 2*3 + 3*1 (five nodes at 1 step, three at 2 steps, one at 3 steps)
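
    The three kite-graph measures can be checked with a short base-R sketch. It assumes the conventional first names for Krackhardt's kite (the slide's letters A-J may be assigned differently) and counts shortest paths by walk counting, which is one simple approach rather than the method used in the workshop.

```r
# Krackhardt's kite graph with the conventional first names (an assumption;
# the slide's letter labels may map to these nodes differently).
edges <- rbind(
  c("Andre", "Beverly"), c("Andre", "Carol"), c("Andre", "Diane"),
  c("Andre", "Fernando"), c("Beverly", "Diane"), c("Beverly", "Ed"),
  c("Beverly", "Garth"), c("Carol", "Diane"), c("Carol", "Fernando"),
  c("Diane", "Ed"), c("Diane", "Fernando"), c("Diane", "Garth"),
  c("Ed", "Garth"), c("Fernando", "Garth"), c("Fernando", "Heather"),
  c("Garth", "Heather"), c("Heather", "Ike"), c("Ike", "Jane")
)
nodes <- sort(unique(as.vector(edges)))
n <- length(nodes)
adj <- matrix(0, n, n, dimnames = list(nodes, nodes))
adj[edges] <- 1
adj[edges[, 2:1]] <- 1

# Walk counting: the first power k with (A^k)[s, t] > 0 is the geodesic
# distance, and that entry is the number of shortest s-t paths.
dist_mat <- matrix(Inf, n, n, dimnames = list(nodes, nodes))
n_paths <- matrix(0, n, n, dimnames = list(nodes, nodes))
diag(dist_mat) <- 0
diag(n_paths) <- 1
walks <- diag(n)
for (k in seq_len(n - 1)) {
  walks <- walks %*% adj
  first_reached <- walks > 0 & is.infinite(dist_mat)
  dist_mat[first_reached] <- k
  n_paths[first_reached] <- walks[first_reached]
}

degree <- rowSums(adj)
farness <- rowSums(dist_mat)   # sum of geodesic distances to all other nodes

# Betweenness: share of shortest a-b paths passing through v, summed over pairs
betweenness <- sapply(nodes, function(v) {
  others <- setdiff(nodes, v)
  total <- 0
  for (i in seq_along(others)) {
    for (j in seq_along(others)) {
      if (i >= j) next
      a <- others[i]
      b <- others[j]
      if (dist_mat[a, v] + dist_mat[v, b] == dist_mat[a, b]) {
        total <- total + n_paths[a, v] * n_paths[v, b] / n_paths[a, b]
      }
    }
  }
  total
})

print(cbind(degree, farness, betweenness = round(betweenness, 2)))
```

    Under this labeling the highest degree is 6, the highest betweenness is 14, and the lowest sum of distances is 14 (shared by two nodes), which are the same three values quoted on the slides.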

  • Eigenvector Centrality

    The extent to which a node is a big fish connected with other big fish in a big pond.
    Calculated by assessing how well connected a node is to the parts of the network with the greatest connectivity.
    Nodes with high eigenvector scores have many connections who themselves have many connections, and so on, similar to the logic of Google PageRank.
    Highly connected individuals within highly interconnected clusters, or big fish in big ponds, have high eigenvector centrality.
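
    The "connections who have many connections" logic can be sketched in base R with power iteration. The five-node graph is a made-up example, and power iteration is one common way to obtain these scores, not necessarily what NodeXL or igraph do internally.

```r
# Toy undirected graph (hypothetical): a triangle A-B-C with a tail C-D-E.
edges <- rbind(c("A", "B"), c("A", "C"), c("B", "C"), c("C", "D"), c("D", "E"))
nodes <- sort(unique(as.vector(edges)))
adj <- matrix(0, length(nodes), length(nodes), dimnames = list(nodes, nodes))
adj[edges] <- 1
adj[edges[, 2:1]] <- 1

# Power iteration: replace each score by the sum of its neighbors' scores,
# rescale so the largest score is 1, and repeat until the scores settle.
score <- setNames(rep(1, length(nodes)), nodes)
for (i in 1:200) {
  updated <- drop(adj %*% score)
  updated <- updated / max(updated)
  converged <- max(abs(updated - score)) < 1e-10
  score <- updated
  if (converged) break
}
print(round(score, 3))
```

    C scores highest because it has the most ties and those ties lead to the well-connected triangle; E, at the end of the tail, scores lowest.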

  • Group-level Analysis

    Central question: How are nodes clustered (grouped) together?
    Based on clustering analysis, a method to merge n nodes into g groups such that:
      the nodes within the same group are maximally similar or homogeneous
      the nodes in different groups are maximally different or heterogeneous

  • Process of Clustering Analysis

    At step 1, there are 10 clusters, each with a node that is uniquely different from all others.
    At step 2, nodes 1 and 2 are considered similar enough to form a cluster; the same goes for nodes 9 and 10. There are now 8 clusters.
    At step 3, node 3 joins the cluster of 1 and 2, and node 8 joins the cluster of 9 and 10.
    The process continues until every node is included in one giant cluster at step 6.
    An optimal solution is to keep a small number of clusters with maximal similarity within and maximal difference between.

    [Figure: dendrogram of nodes 1-10 merged over steps 1-6]
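
    The merging process described above can be sketched with base R's hclust(). The ten values, the node names, and the choice of average linkage are assumptions for illustration; the slides do not specify a similarity measure or linkage rule.

```r
# Ten hypothetical nodes placed on a line; nearby values count as similar.
x <- c(1, 2, 4, 11, 12, 14, 30, 31, 33, 50)
names(x) <- paste0("n", 1:10)

# Agglomerative clustering: start with 10 singleton clusters and repeatedly
# merge the two closest clusters until one giant cluster remains.
tree <- hclust(dist(x), method = "average")

# Each row of `merge` is one merging step: negative numbers are single nodes,
# positive numbers refer to clusters formed at earlier steps.
print(tree$merge)

# Cut the tree to keep a small number of clusters.
groups <- cutree(tree, k = 4)
print(groups)
```

    Cutting at four clusters keeps the three tight triples together and leaves the outlying tenth node on its own.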

  • Island Method for Group Detection

    By raising the threshold of edge strength (e.g., the mean, the median, or k standard deviations above the mean), an increasing number of groups (communities) will emerge successively from a giant connected component.
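
    A minimal base-R sketch of the island idea: drop edges below a threshold and count the connected groups that remain. The weighted edge list, the thresholds, and the helper name count_islands are hypothetical.

```r
# Hypothetical weighted, undirected edge list
edges <- data.frame(
  from   = c("A", "B", "A", "C", "D", "E", "F"),
  to     = c("B", "C", "C", "D", "E", "F", "G"),
  weight = c(9, 8, 7, 1, 6, 3, 5),
  stringsAsFactors = FALSE
)

# Count connected groups among edges at or above a weight threshold by
# merging the group labels of linked nodes. Nodes left without any
# surviving edge are not counted.
count_islands <- function(edges, threshold) {
  kept <- edges[edges$weight >= threshold, ]
  nodes <- unique(c(kept$from, kept$to))
  label <- setNames(seq_along(nodes), nodes)
  for (i in seq_len(nrow(kept))) {
    a <- label[[kept$from[i]]]
    b <- label[[kept$to[i]]]
    label[label == max(a, b)] <- min(a, b)
  }
  length(unique(label))
}

for (threshold in c(1, 2, 4)) {
  cat("threshold", threshold, "->", count_islands(edges, threshold), "group(s)\n")
}
```

    Raising the threshold first cuts the weakest bridge (C-D), then the next weakest (E-F), so the single giant component splits into two and then three groups.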

  • Group-level Metrics in NodeXL

    Vertex counts
    Edge counts
    Geodesic distances
    Group density
    Number of edges between each pair of groups

  • Global-level Analysis

    Key question: How densely or closely connected is the network as a whole?

    Fig a (top): connected based on 67% agreement
    Fig b (bottom): connected based on 75% agreement

  • Global-level Metrics in NodeXL (1)

    Graph Type: Directed or undirected.
    Vertices: The number of vertices in the graph.
    Unique Edges: The number of edges that do not have duplicates.
    Edges With Duplicates: The number of edges that have duplicates.
    Total Edges: The number of edges in the graph. This is the sum of Unique Edges and Edges With Duplicates.
    Self-Loops: The number of edges that connect a vertex to itself.
    Reciprocated Vertex Pair Ratio: In a directed graph, this is the N of vertex pairs that have edges in both directions divided by the N of vertex pairs that are connected by any edge. Duplicate edges and self-loops are ignored. In an undirected graph, this is undefined.
    Reciprocated Edge Ratio: In a directed graph, this is the number of edges that are reciprocated divided by the total number of edges. Duplicate edges and self-loops are ignored. In an undirected graph, this is undefined and is not calculated.
    Connected Components: The number of connected components in the graph. A connected component is a set of vertices that are connected to each other but not to the rest of the graph.
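
    The two reciprocity ratios defined above differ in what they count (vertex pairs vs. edges). A small base-R sketch on a made-up directed edge list shows the difference; it assumes the list has no duplicate edges or self-loops and is not NodeXL's own code.

```r
# Hypothetical directed edge list: A and B point at each other; the rest is one-way.
edges <- data.frame(
  from = c("A", "B", "A", "C"),
  to   = c("B", "A", "C", "D"),
  stringsAsFactors = FALSE
)

# An edge is reciprocated when the reverse edge also exists.
key     <- paste(edges$from, edges$to)
reverse <- paste(edges$to, edges$from)
reciprocated <- reverse %in% key

# Reciprocated edge ratio: reciprocated edges / all edges
edge_ratio <- sum(reciprocated) / nrow(edges)

# Reciprocated vertex pair ratio: pairs tied in both directions / pairs tied at all
pair <- ifelse(edges$from < edges$to, key, reverse)   # order-free pair label
pair_ratio <- length(unique(pair[reciprocated])) / length(unique(pair))

cat("edge ratio:", edge_ratio, " pair ratio:", round(pair_ratio, 3), "\n")
```

    Here two of the four edges are reciprocated (ratio 0.5), but they form only one of the three connected vertex pairs (ratio 1/3).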

  • Global-level Metrics in NodeXL (2)

    Single-Vertex Connected Components: The N of connected components that have only one vertex.
    Maximum Vertices in a Connected Component: The N of vertices in the connected component that has the most vertices.
    Maximum Edges in a Connected Component: The N of edges in the connected component that has the most edges.
    Maximum Geodesic Distance (Diameter): The maximum geodesic distance among all vertex pairs, where geodesic distance is the length of the shortest path between two vertices.
    Average Geodesic Distance: The average geodesic distance among all vertex pairs, where geodesic distance is the distance between two vertices along the shortest path between them.
    Graph Density: A ratio that compares the N of edges with the maximum N of edges the graph would have if all the vertices were connected to each other. Duplicate edges and self-loops are ignored.
    Modularity: When the graph has groups, this is a measure of the "quality" of the grouping. Graphs with high modularity have dense connections among the vertices within the same group but sparse connections among vertices in different groups. When the graph does not have groups, this is undefined.
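
    The graph density definition above can be written out as a formula. The symbols n (number of vertices) and m (number of edges, ignoring duplicates and self-loops) are notation introduced here, not NodeXL's.

```latex
\[
\text{density}_{\text{undirected}} = \frac{m}{n(n-1)/2} = \frac{2m}{n(n-1)},
\qquad
\text{density}_{\text{directed}} = \frac{m}{n(n-1)}
\]
```

    For the kite graph with n = 10 nodes and m = 18 undirected edges, this gives 36/90 = 0.4.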

  • HANDS-ON TUTORIALS

  • use R! aRe you suRe?

    NodeXL and [Super R logo]

    Super R logo source: www.redbubble.com/

  • R for Web Data Analysis

    Network Analysis
      R packages: igraph, statnet, RSiena
    Spatial Analysis
      http://cran.r-project.org/web/views/Spatial.html
      R packages: sp, spatial, OpenStreetMap, RgoogleMaps
    Temporal Analysis
      http://cran.r-project.org/web/views/TimeSeries.html
      http://cran.r-project.org/web/views/SpatioTemporal.html
      R packages: tseries, forecast, urca, wavelets, SpatioTemporal
    Text Mining
      http://cran.r-project.org/web/views/NaturalLanguageProcessing.html
      R packages: tm, RWeka, openNLP, wordcloud, topicmodels, RTextTools, sentiment, ReadMe
    Machine Learning
      http://cran.r-project.org/web/views/MachineLearning.html
      R packages: nnet, rpart, trees, party, randomForest, lasso2, gbm, bst, e1071, kernlab, BayesTree, rgp, arules, frbs, rattle

  • I came, I saw, and I walked away?

    Plunge into the water!

    Figure from the movie Daddy Day Care (2003): http://img0.joyreactor.com/pics/post/gif-eddie-murphy-reaction-gifs-party-394848.gif
    Picture: Gareth Jenkins/Solent, http://www.telegraph.co.uk/news/picturegalleries/picturesoftheday/8561204/Pictures-of-the-day-7-June-2011.html?image=6

  • HANDS-ON!

    Let's Rock!

  • Demo 1. Software Installation

    Download and install R, RStudio, and NodeXL

    Learn the basics of R

    More information:
      https://www.rstudio.com/training/online.html
      http://tryr.codeschool.com/
      http://cran.r-project.org/
      https://www.rstudio.com/ide/
      http://nodexl.codeplex.com/

  • NETWORK ANALYSIS (ADVANCED-LEVEL)

  • Network Topology

  • Regular or random?

    Regular network
      Nodes are connected in a regular neighborhood, with a fixed number k of edges per node
      They do not exhibit the small-world characteristic
      They may exhibit clustering

    Random network
      Edges are connected at random
      Each node has an average number of edges (rather than a fixed k)
      They exhibit the small-world characteristic
      They do not exhibit clustering

  • Small-World Networks

    Between order and chaos
      Watts and Strogatz (1998) propose a model for networks between order and chaos, such that
      the network exhibits the small-world feature, as random networks do
      and exhibits clustering, as regular lattices do

    Network generation
      The model is built simply by re-wiring at random a small percentage of the regular edges,
      which dramatically shortens the average path length without destroying clustering

    Source: Watts and Strogatz (1998)

  • Scale-free network

    Power law
    Long-tail distribution
    P(k) ~ k^(-a), 0

  • Demo 2. Generate the Network

    R script

    install.packages("igraph")  # run once
    library(igraph)
    size = 50
    # regular structures: tree, star, complete graph, ring, ring lattice
    g = graph.tree(size, children = 2); plot(g)
    g = graph.star(size); plot(g)
    g = graph.full(size); plot(g)
    g = graph.ring(size); plot(g)
    g = connect.neighborhood(graph.ring(size), 2); plot(g)
    # random network
    g = erdos.renyi.game(size, 0.1); plot(g)
    # small-world network: rewire a small share of a ring lattice's edges
    g = watts.strogatz.game(dim = 1, size = size, nei = 2, p = 0.05); plot(g)
    # scale-free network
    g = barabasi.game(size); plot(g)

    http://chengjun.github.io/web_data_analysis/demo2_simulate_networks/

  • The Political Blogosphere vs. Congressmen's Retweet Network

    Peng, Zhu, Liu, Wu, Liu (2014). Friendship, Interaction networks and Vote agreement of congressmen in the United States. 7th APNC, Montreal, Canada
    L. A. Adamic and N. Glance, 'The Political Blogosphere and the 2004 U.S. Election: Divided They Blog', LinkKDD 2005

  • How to Represent a Network?

    A, B
    A, D
    A, C
    C, D
    C, E
    C, F
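
    The edge list above can be turned into an adjacency matrix, the other common representation, in a few lines of base R. Treating the edges as undirected is an assumption; the slide does not say.

```r
# Edge list from the slide, one row per edge (treated as undirected here).
edges <- rbind(c("A", "B"), c("A", "D"), c("A", "C"),
               c("C", "D"), c("C", "E"), c("C", "F"))
nodes <- sort(unique(as.vector(edges)))

adj <- matrix(0, length(nodes), length(nodes), dimnames = list(nodes, nodes))
adj[edges] <- 1          # row = first column, column = second column
adj[edges[, 2:1]] <- 1   # mirror each edge to make the matrix symmetric

print(adj)
print(rowSums(adj))      # degree of each node
```

    The edge list is compact for sparse networks, while the matrix makes row sums (degrees) and matrix operations straightforward.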

  • Demo 3. Describe the Network

    Compute graph metrics using NodeXL
    Step 1: Paste the edge list here

  • Demo 3. Describe the Network

    NodeXL: Set node attributes
    Step 2: Paste the node attributes here, and name the column "party"

  • Demo 3. Describe the Network

    NodeXL: Calculating graph metrics
    Step 3: Click graph metrics here

  • Demo 3. Describe the Network

    NodeXL: Set vertex color and vertex size
    Step 4: Set vertex color and vertex size by clicking here

  • Demo 3. Describe the Network

    R script

    Graph statistics
    Centrality measures
    Graph algorithms
      Shortest path
      Connected component algorithms

    http://chengjun.github.io/web_data_analysis/demo3_describe_the_network/

  • The exponential random graph model (p*)

    An ERGM (p*) model is a statistical model for the ties in a network
      Independent (pairs of) ties (p1, Holland and Leinhardt, 1981; Fienberg and Wasserman, 1979, 1981)
      Markov graphs (Frank and Strauss, 1986)
      Extensions (Pattison & Wasserman, 1999; Robins, Pattison & Wasserman, 1999; Wasserman & Pattison, 1996)
      New specifications (Snijders et al., 2006; Hunter & Handcock, 2006)

  • Why do we use stochastic network models?

    To capture complex social phenomena that are caused by both regularities and randomness
    To infer whether certain network signatures appear more often than expected by chance
    To distinguish between different social processes (e.g., homophily vs. structural balance)
    To better understand the way local social processes interact and combine to shape global network patterns
    Deterministic approaches are not always good enough

  • Procedures of ERGM

    Assume we have an observed network of size n. What are the mechanisms driving the formation of our network (e.g., reciprocity, transitivity)?
    Given those mechanisms, are some network configurations (e.g., mutual dyads, transitive triplets) more common than you would expect by chance?
    Include a parameter for each configuration in the model. Parameter values will help us identify a probability distribution for all graphs of size n (e.g., if we have a high value for the reciprocity parameter, graphs that have a lot of mutual dyads will be more probable than ones that do not).
    Estimate the parameters: find the parameter values that best match the observed network. We do that using MCMC-MLE: Markov Chain Monte Carlo Maximum Likelihood Estimation techniques.
    Once we have our probability distribution, we can draw random graphs from it and compare any of their characteristics to those of our observed network.

    http://www.kateto.net/wordpress/wp-content/uploads/2012/12/COMM%20645%20-%20ERGM.pdf

  • Network Configurations: Undirected Networks

    [Figure: configurations for undirected networks: edge, 2-star, 3-star, 4-star, k-star, triangle]

  • Network Configurations: Directed Networks

    [Figure: configurations for directed networks: arc, reciprocity, isolate, 2-mixed star, 2-in star, 2-out star, k-in star, k-out star, transitive triad, cyclic triad]

  • Exponential Random Graph Models (ERGM)

    Y: all the possible ties
    y: the observed ties
    X: node attributes
    g(y, X): network configurations (a vector)
    θ: a vector of model parameters
    κ(θ): normalizing constant
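
    Read together, the symbols listed above correspond to the standard ERGM form. It is written out here on the assumption that the slide follows the usual notation, with the normalizing constant summing over all possible networks y'.

```latex
\[
P(Y = y \mid X) = \frac{\exp\{\theta^{\top} g(y, X)\}}{\kappa(\theta)},
\qquad
\kappa(\theta) = \sum_{y'} \exp\{\theta^{\top} g(y', X)\}
\]
```

    A positive parameter in θ makes networks with more of the corresponding configuration more probable, which matches the reciprocity example given under the ERGM procedures.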

  • One example

    Source: Johan Koskinen (2012). An introduction to ERGM. 8th UKSNA Conference, Bristol

    denotes the vector of change statistics

  • Tie-Network configuration matrix

    [Table: one row per possible tie (Y1,2; Y1,3; Y2,3; ...; Yn,n-1) and one column per configuration (edges, 2-star, k-star, triangle)]

  • Online Collective Identity

    Ackland (2011). Online collective identity. SN

  • Demo 4. ERGM with R

    R script
      Load data
      Build up network objects
      Set the node attributes
      Plot the network
      Fit a basic ERG model

    http://chengjun.github.io/web_data_analysis/demo4_ergm_analysis/

  • TEMPORAL ANALYSIS

  • Time Series Analysis

    Time series data can be analyzed within either the time domain or the frequency domain.
    Time domain:
      ARIMA/VAR analysis
      Survival analysis
      Multilevel analysis
    Frequency domain:
      Fourier transformation
      Spectrum analysis (comparing a_k and b_k of different time series)

    While time domain analysis is routinely conducted, frequency domain analysis is rarely adopted.
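
    A minimal frequency-domain sketch in base R: compute the periodogram of a series with fft() and read off the dominant cycle. The synthetic series (a weekly cycle plus a weaker 28-day cycle) is made up for illustration.

```r
# Hypothetical daily series: 20 weeks with a weekly cycle plus a weaker 28-day cycle.
n <- 140
day <- seq_len(n)
x <- 3 * cos(2 * pi * day / 7) + sin(2 * pi * day / 28)

# Discrete Fourier transform; keep frequencies 1 .. n/2 (cycles per series)
spectrum <- Mod(fft(x))[2:(n / 2 + 1)]^2 / n
cycles <- seq_along(spectrum)

# The frequency with the largest power marks the dominant cycle.
dominant <- cycles[which.max(spectrum)]
cat("dominant period:", n / dominant, "days\n")
```

    The strongest peak sits at 20 cycles per 140 days, i.e., a period of 7 days, which a time-domain plot of noisy real data would show less directly.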

  • Time Series Analysis

    Forecasting and Univariate Modeling
    Frequency analysis
    Decomposition and Filtering
    Seasonality
    Stationarity, Unit Roots, and Cointegration
    Nonlinear Time Series Analysis
    Dynamic Regression Models
    Multivariate Time Series Models

  • Survival Analysis of Blogging Behavior

    Source: Zhu et al., ICA 2010

  • SPATIAL ANALYSIS

  • Spatial Analysis

    Spatial data:
      Location names
      IP addresses
      Map visits
      GPS usage
      etc.

    Spatial analysis:
      Spatial clusters/patterns (by visual inspection)
      Spatial autocorrelation
      Spatial regression
      Spatial dependence (correlation between nearby locations)
      Spatial interaction (correlation between geo-coded variables)

    Well developed for offline data but under-developed and under-utilized for web data beyond visual inspection.

  • Geospatial Distribution of the Communication on Twitter

    Conover MD, Davis C, Ferrara E, McKelvey K, Menczer F, et al. (2013) The Geospatial Characteristics of a Social Movement Communication Network. PLoS ONE 8(3): e55957. doi:10.1371/journal.pone.0055957

  • Spatial Distribution of Tweets in Milan

    Kernel smoother of point density
    Are the tweets randomly distributed?

  • Temporal Distribution of Tweets in Milan

  • SENTIMENT ANALYSIS

  • Sentiment Analysis

    Decompose sentiment
      Emotion: joy, surprise, anger, sadness, fear, disgust
      Polarity: positivity, negativity, neutral

    Lexicon method
      Carlo Strapparava and Alessandro Valitutti's emotions lexicon
      Janyce Wiebe's subjectivity lexicon
      Bing Liu's polarity lexicon

    Supervised machine learning

    Combine lexicon and machine learning
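
    The lexicon method can be sketched in base R as counting positive and negative words. The eight-word lexicon, the sample texts, and the helper name score_polarity are made up for illustration; a real analysis would load one of the published lexicons.

```r
# Tiny made-up polarity lexicon; real studies would load a published lexicon.
positive <- c("good", "great", "happy", "love")
negative <- c("bad", "sad", "angry", "hate")

# Polarity = (number of positive words) - (number of negative words)
score_polarity <- function(text) {
  words <- strsplit(tolower(text), "[^a-z]+")[[1]]
  score <- sum(words %in% positive) - sum(words %in% negative)
  if (score > 0) "positive" else if (score < 0) "negative" else "neutral"
}

tweets <- c("I love this great workshop",
            "Bad data makes me sad",
            "The workshop starts at nine")
labels <- sapply(tweets, score_polarity, USE.NAMES = FALSE)
print(labels)
```

    Word counting of this kind ignores negation and context, which is one reason the slide also lists supervised machine learning and combined approaches.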

  • Sentiment in the Tweet Stream

    Miller (2011). Social scientists wade into the tweet stream. Science

  • Twitter Mood Predicts the Stock Market?

    Decompose sentiment
      Emotion: calm, alert, sure, vital, kind, happy

    Bollen (2011). Twitter mood predicts the stock market. JOCS

  • Calm Sentiment Predicts the Stock Market

  • Demo 5. Sentiment Analysis with Supervised Machine Learning

    R script: http://chengjun.github.io/web_data_analysis/demo5_sentiment_analysis/

    Figure source: http://courtneylambert.co/official-twitter-stats-from-chirp

  • REFLECTION ON WEB DATA ANALYSIS

  • Google Correlate & Google Flu Prediction

    http://www.google.com/trends/correlate/comic?p=2

  • Lazer et al. (2014). The parable of Google Flu: Traps in big data analysis. Science

    Nature reported that Google Flu Trends (GFT) was predicting more than double the proportion of doctor visits for influenza-like illness (ILI) than the Centers for Disease Control and Prevention (CDC), which bases its estimates on surveillance reports from laboratories across the United States (1,2).

  • Tweet Sentiment and U.S. Election 2012

    Figures source: election.twitter.com

  • Facebook Insight and Twitter Mention

    Panels: Result, Facebook Insight

    http://www.huffingtonpost.com/simon-jackman/pollster-predictions_b_2081013.html
    http://www.zerogeography.net/2012/11/obama-wins-election-on-twitter.html
    http://www.cnn.com/election/2012/facebook-insights/

  • Predict Political Orientation with Machine Learning

    Colleoni et al. (2014). Echo chamber or public sphere? Predicting political orientation and measuring political homophily in Twitter. JOC

  • To Move On

    R Style Guide: http://adv-r.had.co.nz/Style.html
    R-bloggers
    Stack Overflow
    GitHub

  • Yet, It Is Not Finished

    Creating an R package (intro): http://gastonsanchez.com/teaching/

    CC = 1/6 in F3a; CC = 3/6 in F3b.

    Focal closure: a selection process, because the same interest in A leads to the friendship between B and C. Membership closure: a social influence process, because the friendship between A and B leads to the shared interest in C.