Easier than Excel: Social Network Analysis of DocGraph with Gephi

Post on 12-Jan-2016


Janos G. Hajagos

Stony Brook School of Medicine

Fred Trotter

fredtrotter.com

DocGraph

Based on a FOIA request to CMS by Fred Trotter

Pre-released at Strata RX 2012

Medicare providers (more than doctors)

CY 2011 dates of service

Share 11 or more patients in a 30 day forward window

Initial access restricted to MedStartr funders

DocGraph by the numbers

Directed graph

Average total degree 52.8

940,492 providers (graph nodes/vertices)

49,685,810 shared edges
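The average degree quoted above can be reproduced from the node and edge counts. A quick arithmetic check, assuming the 52.8 figure is edges divided by nodes (so each directed edge is counted once per node, and in-degree plus out-degree would average twice that):

```python
# Reproduce the slide's average degree from the node and edge counts.
nodes = 940_492       # providers (graph vertices)
edges = 49_685_810    # directed shared-patient edges

# In a directed graph every edge adds one out-degree and one in-degree,
# so mean in-degree == mean out-degree == edges / nodes.
mean_in_or_out_degree = edges / nodes
mean_in_plus_out_degree = 2 * edges / nodes

print(f"{mean_in_or_out_degree:.1f}")    # 52.8
print(f"{mean_in_plus_out_degree:.1f}")  # 105.7
```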


Geographic visualization


http://isurfsoftware.com/blog/2012/12/13/visualizing-geographic-connections-between-us-doctors/

DocGraph data

NPPES

National Plan and Provider Enumeration System

Source of NPI (National Provider Identifier)

No cost download

Information is entered and updated by provider

- Data quality is good to poor

CSV file with 314 columns

A custom MySQL load script is used to normalize the database

Bloom.api open source project to make data easier to access

- http://www.bloomapi.com/
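With 314 columns, it is usually easier to read the NPPES file by header name than by position. A minimal sketch using only the Python standard library; the three column names are assumptions about the NPPES header row, and the sample row is made up for illustration:

```python
import csv
import io

# Assumed NPPES header names; check them against the header row of the real file.
WANTED = ["NPI", "Entity Type Code", "Provider Last Name (Legal Name)"]

def read_selected_columns(handle, wanted=WANTED):
    """Yield one dict per row, keeping only the wanted columns."""
    reader = csv.DictReader(handle)
    for row in reader:
        yield {name: row.get(name, "") for name in wanted}

# Tiny made-up stand-in for the real 314-column file.
sample = io.StringIO(
    '"NPI","Entity Type Code","Provider Last Name (Legal Name)","Other"\n'
    '"1234567890","1","DOE","x"\n'
)
rows = list(read_selected_columns(sample))
print(rows)
```

For the real download, the same function can be given a file opened with `open(path, newline="")`, where `path` points at the extracted CSV.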


Tabular data


Things we can do with tabular data

Graph data

Relation between authors and MeSH terms from PubMed


http://dx.doi.org/10.6084/m9.figshare.94595

Graph types

Undirected graph

- Facebook friendships

Directed graph

- Twitter: follow and be followed

Bipartite graph

Multipartite

- RDF graph model

- Property graph model

Allow parallel edges

- RDF graph Model
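The differences between these graph types show up in how edges are stored. A small illustration with plain Python containers; the node names are invented, and graph libraries such as NetworkX provide dedicated classes for each case:

```python
from collections import Counter

# Undirected (friendship): store each edge as an unordered pair.
friendships = {frozenset(("alice", "bob"))}

# Directed (follows): store each edge as an ordered (source, target) pair.
follows = {("alice", "bob")}

# Bipartite: every edge joins a node from one set to a node from the other.
authors = {"author:1", "author:2"}
terms = {"term:Asthma"}
authorship = {("author:1", "term:Asthma"), ("author:2", "term:Asthma")}
is_bipartite = all(a in authors and t in terms for a, t in authorship)

# Parallel edges: a Counter keeps repeated edges between the same pair.
parallel = Counter([("a", "b"), ("a", "b"), ("b", "c")])

print(frozenset(("bob", "alice")) in friendships)  # True
print(("bob", "alice") in follows)                 # False
print(is_bipartite)                                # True
print(parallel[("a", "b")])                        # 2
```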


Components of a network/graph

Graphs in healthcare

Prescriber and patient (bipartite)

- NCPDP data with NPI

Referral data sets

Shared patients

- DocGraph

Social networks

- Tweeting about a disease

Limited by imagination

Generating GraphML

XML-based file format for graphs

Readable by a large number of tools

- Gephi

- Mathematica

- igraph (R)

NetworkX: a Python library for graphs that can export to GraphML

GraphML is not a file format for really large graphs

GraphML is not readable by d3.js
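To show what a GraphML file contains, here is a minimal sketch that writes a two-node directed graph using only the Python standard library. NetworkX can produce the same kind of file with `networkx.write_graphml`; the `to_graphml` helper and the `label` and `weight` attribute names here are illustrative assumptions, not taken from the DocGraph code:

```python
import xml.etree.ElementTree as ET

NS = "http://graphml.graphdrawing.org/xmlns"

def to_graphml(nodes, edges):
    """Return a GraphML string for a directed graph.

    nodes: dict of node id -> label
    edges: list of (source, target, weight)
    """
    ET.register_namespace("", NS)
    root = ET.Element(f"{{{NS}}}graphml")
    # Declare the attributes carried by nodes and edges.
    ET.SubElement(root, f"{{{NS}}}key", {
        "id": "label", "for": "node", "attr.name": "label", "attr.type": "string"})
    ET.SubElement(root, f"{{{NS}}}key", {
        "id": "weight", "for": "edge", "attr.name": "weight", "attr.type": "double"})
    graph = ET.SubElement(root, f"{{{NS}}}graph", {"edgedefault": "directed"})
    for node_id, label in nodes.items():
        node = ET.SubElement(graph, f"{{{NS}}}node", {"id": node_id})
        ET.SubElement(node, f"{{{NS}}}data", {"key": "label"}).text = label
    for source, target, weight in edges:
        edge = ET.SubElement(graph, f"{{{NS}}}edge",
                             {"source": source, "target": target})
        ET.SubElement(edge, f"{{{NS}}}data", {"key": "weight"}).text = str(weight)
    return ET.tostring(root, encoding="unicode")

xml_text = to_graphml({"n1": "Provider A", "n2": "Provider B"}, [("n1", "n2", 25)])
print(xml_text)
```

Because every node and edge becomes its own XML element, the file grows quickly, which is consistent with the point above that GraphML is not suited to really large graphs.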


GraphML can be loaded into Mathematica

Gephi

Gephi

Java-based open source tool

Focused on interactivity

- Fast graphics

- Multi-threaded

- Visual updates

Strong graph analytics

Graphs stored in memory

- Upper limit is about 100,000 nodes

NetBeans plugin architecture

- Integration with Neo4j

- Additional layout algorithms


Downloading Gephi

http://gephi.org/users/download/


Downloading sample files

https://dl.dropboxusercontent.com/u/21690634/DocGraph/docgraph_tutorial_examples.zip


Subsets are generated using a Python script

python extract_providers_to_graphml.py "npi='1750499653'" sterrence Leaf-edges
Opening connection referral
Configuration
Selection criteria for subset graph: npi='1750499653'
Referral table _name: referral.referral2011
NPI detail table name: referral.npi_summary_primary_taxonomy
Nodes will be labeled by: provider_name
Leaf-to-leaf edges will be exported? False
…
Imported 1 nodes
…
Imported 986 nodes
…
Imported 1724 edges
Edge types imported
{'core-to-leaf': 866, 'leaf-to-core': 856, None: 2}
Leaf-to-leaf edges were not selected for export
Writing GraphML file

Generating a subset: some concepts


Core nodes

Adding leaf nodes

Connecting core nodes

Connecting to leaf nodes

Connecting leaf nodes
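The core and leaf concepts can be sketched in a few lines of Python. This is only an illustration of the idea under stated assumptions: core nodes are the providers matching the selection criteria, leaf nodes are any other providers that share an edge with a core node, and leaf-to-leaf edges are optional. The function names are hypothetical and the actual extraction script may work differently:

```python
from collections import Counter

def select_subset(core, edges, leaf_to_leaf=False):
    """Return (leaf nodes, kept edges) for a set of core nodes."""
    core = set(core)
    # Leaf nodes: non-core endpoints of any edge that touches a core node.
    leaves = set()
    for source, target in edges:
        if source in core or target in core:
            leaves |= {source, target} - core
    kept = []
    for source, target in edges:
        touches_core = source in core or target in core
        both_leaves = source in leaves and target in leaves
        if touches_core or (leaf_to_leaf and both_leaves):
            kept.append((source, target))
    return leaves, kept

def edge_types(core, edges):
    """Count edges by the role (core or leaf) of their endpoints."""
    def kind(node):
        return "core" if node in core else "leaf"
    return Counter(f"{kind(s)}-to-{kind(t)}" for s, t in edges)

core = {"A", "B"}
edges = [("A", "B"), ("A", "X"), ("Y", "A"), ("X", "Y"), ("X", "Z")]
leaves, kept = select_subset(core, edges)
print(sorted(leaves))  # ['X', 'Y']
print(edge_types(core, kept))
```

Passing `leaf_to_leaf=True` also keeps the edge between X and Y, while the edge to Z is still dropped because Z never touches a core node.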

Sample files

jamestown_core_provider_graph.graphml

- Providers selected with practice addresses in Jamestown, NY

- Small city in far western New York (approximately 30,000 residents)

- 179 nodes with 5,560 edges

jamestown_core_and_leaf_provider_graph.graphml

- Includes providers above and those who are linked to them

- 1,322 nodes with 12,457 edges

albany_core_provider_graph.graphml

- Providers selected with practice addresses in Albany, NY

- A small city in New York (approximately 100,000 residents)

- 1,368 nodes with 44,711 edges

Sample files (continued)

bronx_core_provider_graph.graphml

- Providers selected with practice addresses in Bronx, NY

- Urban community (1.4 million residents)

- 3,268 nodes and 53,828 edges
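The node and edge counts listed for the sample files can be checked without opening Gephi. A minimal sketch with the Python standard library, assuming the files use the standard GraphML namespace; the small in-memory graph is a made-up stand-in for a real file:

```python
import io
import xml.etree.ElementTree as ET

# Standard GraphML namespace; assumed to be the one used by the sample files.
NS = {"g": "http://graphml.graphdrawing.org/xmlns"}

def count_nodes_and_edges(source):
    """Return (node_count, edge_count) for a GraphML path or binary file object."""
    root = ET.parse(source).getroot()
    return len(root.findall(".//g:node", NS)), len(root.findall(".//g:edge", NS))

# Made-up three-node graph standing in for a real sample file.
demo = io.BytesIO(b"""<?xml version="1.0" encoding="UTF-8"?>
<graphml xmlns="http://graphml.graphdrawing.org/xmlns">
  <graph edgedefault="directed">
    <node id="a"/><node id="b"/><node id="c"/>
    <edge source="a" target="b"/><edge source="b" target="c"/>
  </graph>
</graphml>""")
counts = count_nodes_and_edges(demo)
print(counts)  # (3, 2)
```

Calling `count_nodes_and_edges("jamestown_core_provider_graph.graphml")` on the unzipped sample file should then give the counts listed for it.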


Opening a graph file


Import report


Force directed layout of the graph


Results of the layout


ForceAtlas 2 works well for larger graphs

Navigating the graph

Best experience with a three-button mouse with a scroll wheel

- Right click and hold to pan

- Scroll wheel to zoom in and out

- Left click to select

- Right click for context menus

MacBook users

- Hold the Command key, then click and hold on the trackpad to pan

- Two fingers to zoom on trackpad

- Click on trackpad to select

- Control click for context menus


Coloring the graph (partitioning)


Varying node size based on importance

Step 1: Select a measure of node importance

- Degree

- PageRank

- Eigenvector centrality

Step 2: Run the measure against the graph

Step 3: Ranking tab and “Size/Weight”

Step 4: Set size range
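The four steps amount to computing a score per node and mapping it onto a size range. A minimal sketch using degree as the measure and a linear mapping; the linear mapping and the 10 to 50 size range are assumptions for illustration, not a description of how Gephi computes sizes internally:

```python
def scale_sizes(scores, min_size=10.0, max_size=50.0):
    """Linearly map importance scores onto [min_size, max_size]."""
    low, high = min(scores.values()), max(scores.values())
    if high == low:
        # All nodes are equally important; give them all the smallest size.
        return {node: min_size for node in scores}
    span = high - low
    return {node: min_size + (score - low) / span * (max_size - min_size)
            for node, score in scores.items()}

# Steps 1 and 2: choose degree as the measure and compute it for each node.
edges = [("A", "B"), ("C", "B"), ("D", "B"), ("B", "A")]
degree = {}
for source, target in edges:
    degree[source] = degree.get(source, 0) + 1
    degree[target] = degree.get(target, 0) + 1

# Steps 3 and 4: rank by the measure and map it onto a size range.
sizes = scale_sizes(degree)
print(sizes["B"], sizes["C"])  # 50.0 10.0
```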

Graph measures

Degree

- In-degree

- Out-degree

Graph structure measures

- Clustering (global and local)

- Network diameter

Centrality Measures

- Eigenvector centrality

- PageRank (Google search)

Community measures

And more . . . . .
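Two of these measures are simple enough to sketch directly. The code below computes in-degree and out-degree and runs a basic PageRank power iteration; the damping factor of 0.85 and the fixed iteration count are conventional choices assumed here, and Gephi's own implementation may differ in detail:

```python
def degrees(edges):
    """Return (in_degree, out_degree) dicts for a list of directed edges."""
    in_deg, out_deg = {}, {}
    for source, target in edges:
        out_deg[source] = out_deg.get(source, 0) + 1
        in_deg[target] = in_deg.get(target, 0) + 1
        in_deg.setdefault(source, 0)
        out_deg.setdefault(target, 0)
    return in_deg, out_deg

def pagerank(edges, damping=0.85, iterations=100):
    """Basic PageRank by power iteration; scores sum to 1."""
    nodes = sorted({n for edge in edges for n in edge})
    out_links = {n: [] for n in nodes}
    for source, target in edges:
        out_links[source].append(target)
    count = len(nodes)
    rank = {n: 1.0 / count for n in nodes}
    for _ in range(iterations):
        # Rank held by nodes with no out-links is spread evenly over all nodes.
        dangling = sum(rank[n] for n in nodes if not out_links[n])
        new_rank = {n: (1 - damping) / count + damping * dangling / count
                    for n in nodes}
        for n in nodes:
            for target in out_links[n]:
                new_rank[target] += damping * rank[n] / len(out_links[n])
        rank = new_rank
    return rank

edges = [("A", "C"), ("B", "C"), ("C", "A")]
in_deg, out_deg = degrees(edges)
ranks = pagerank(edges)
print(in_deg["C"], out_deg["C"])   # 2 1
print(max(ranks, key=ranks.get))   # C
```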


Interactively viewing node attributes


Click the “T” icon on the bottom to turn on node labeling

Data Laboratory


Selecting visible fields


Viewing edge attributes

Saving your graph

Save your graph in .gephi format

- XML-based format

- preserves layout, size, and color

Save in GraphML format for use with outside programs


Filtering nodes by attributes

Hints for filtering nodes

Drag the “is_physician” field filter from the top pane to the lower pane

Set the value to filter on

- Value should equal 1

- 1 is equivalent to true

Click “Filter” to apply
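The same filter can be expressed outside Gephi as keeping the nodes whose attribute equals the chosen value, plus the edges between them. A minimal sketch with made-up node data; it assumes `is_physician` is stored as the integer 1, whereas a value read from a GraphML file may arrive as a string and need converting first:

```python
def filter_nodes(nodes, edges, attribute, value):
    """Keep nodes whose attribute equals value, and the edges among them."""
    kept = {node_id for node_id, attrs in nodes.items()
            if attrs.get(attribute) == value}
    kept_edges = [(s, t) for s, t in edges if s in kept and t in kept]
    return kept, kept_edges

# Made-up node attributes for illustration.
nodes = {
    "n1": {"is_physician": 1},
    "n2": {"is_physician": 0},
    "n3": {"is_physician": 1},
}
edges = [("n1", "n2"), ("n1", "n3"), ("n3", "n2")]

physicians, physician_edges = filter_nodes(nodes, edges, "is_physician", 1)
print(sorted(physicians))   # ['n1', 'n3']
print(physician_edges)      # [('n1', 'n3')]
```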


Producing a final graph


We need to rescale the edge weights in the graph

Producing a final graph after scaling


Bronx core provider graph

Challenge questions

Which institution is the most “important” provider for the Bronx?

- Hint: try a centrality measure

Can you determine if geography plays a role in patient sharing in the Bronx?

- Which parameter could be used to partition the graph?

Can you filter the graph to show only radiologists?

Which radiologist has the highest “authority” in the graph?
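For the last question, "authority" presumably refers to the authority score of the HITS algorithm; that reading is an assumption. A minimal sketch of HITS on a toy directed graph, where a node's authority grows with the hub scores of the nodes pointing at it:

```python
import math

def hits(edges, iterations=50):
    """Return (hub, authority) score dicts computed by repeated updates."""
    nodes = {n for edge in edges for n in edge}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority: sum of the hub scores of nodes that point at this node.
        auth = {n: 0.0 for n in nodes}
        for source, target in edges:
            auth[target] += hub[source]
        # Hub: sum of the authority scores of nodes this node points at.
        hub = {n: 0.0 for n in nodes}
        for source, target in edges:
            hub[source] += auth[target]
        # Normalize so the scores do not grow without bound.
        auth_norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        hub_norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        auth = {n: v / auth_norm for n, v in auth.items()}
        hub = {n: v / hub_norm for n, v in hub.items()}
    return hub, auth

edges = [("A", "C"), ("B", "C"), ("A", "D")]
hub, auth = hits(edges)
print(max(auth, key=auth.get))  # C
print(max(hub, key=hub.get))    # A
```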

Other tools for graph analysis

NetworkX

- Python

- Lots of algorithms

igraph

- R and Python

Gremlin – graph traversal and manipulation

- Groovy shell

- Gremlin interface is implemented for Neo4j

And more . . .

Scaling the analysis to the entire DocGraph

Most healthcare graphs will be big (millions of nodes)

What we learn at the local level can be applied at the global level

- Importance of geography

- Supernodes (radiologists, ER docs, pathologists, transportation, …)

Many graph measures don’t scale well

- Maximal cliques

Currently exploring how to use Faunus to scale the analysis with Hadoop

Links

http://strata.oreilly.com/2012/11/docgraph-open-social-doctor-data.html (information)

https://github.com/jhajagos/DocGraph (code)

http://notonlydev.com/docgraph-data/ (open source $1 covers bandwidth fees)

https://groups.google.com/forum/#!forum/docgraph (mailing list)


Questions


Try to publish your own healthcare dataset as a graph!
