PyCon 2012: Python for data lovers: explore it, analyze it, map it

Post on 06-May-2015


DESCRIPTION

Slides from PyCon 2012. Speakers: Jacqueline Kazil, Dana Bauer. More info: https://us.pycon.org/2012/schedule/presentation/426/ | Video: http://pyvideo.org/video/676/python-for-data-lovers-explore-it-analyze-it-m

Transcript

Python for Open Data Lovers: Explore It, Analyze It, Map It

Jackie Kazil, @jackiekazil

Dana Bauer, @geography76

Saturday, March 10, 2012


Where are we going?

• open data everywhere

• a data swiss army knife

• finding network patterns

• finding spatial patterns

• which stories to pursue? moving beyond data analysis

• Data.gov

• OpenDataPhilly

• DC Data Catalog

• DataSF

• Chicago Data Portal

• NYC Open Data

• London Datastore


• assembly member expenses
• bicycle lanes
• city purchase orders
• dialysis centers
• elevation data
• filming locations
• Google Transit Feed Specification (GTFS)
• historical photos
• influenza rates
• judicial districts
• Key Stage 2 test results by free school meal eligibility
• land cover
• monthly calls to Human Services Agency switchboard operators
• neighborhood health clinics
• Oyster ticket stop locations
• political districts
• quality of life indicators
• restaurant inspections
• sewer lines
• traffic counts
• utility excavation and paving five-year plan
• violent crime incidents
• ward offices
• youth centers
• zoning

**real-time parking availability and pricing**


http://bit.ly/DCdatafail


• What are DC agencies spending money on?

• How much are they spending?

• What are the relationships between businesses and agencies?

• Where are these businesses located?


swiss army knife

• csvkit: http://csvkit.readthedocs.org/

• a set of Python utilities for working with csv

• meant to replace csv module

• pip install csvkit (no issues!)

$ csvcut -n purchase2011_cleaned.csv
  1: PO_NUMBER
  2: AGENCY_NAME
  3: NIGP_DESCRIPTION
  4: PO_TOTAL_AMOUNT
  5: ORDER_DATE
  6: SUPPLIER
  7: SUPPLIER_FULL_ADDRESS

$ csvcut -c 2,6 purchase2011_cleaned.csv | csvstat
  1. AGENCY_NAME
        <type 'unicode'>
        Nulls: False
        Unique values: 85
        5 most frequent values:
                DISTRICT OF COLUMBIA PUBLIC SCHOOLS:        2410
                STATE SUPERINTENDENT OF EDUCATION (OSSE):   1340
                DEPARTMENT OF HEALTH:                       895
                OFFICE OF CHIEF TECHNOLOGY OFFICER:         786
                OFF PUBLIC ED FACILITIES MODERNIZATION:     722
        Max length: 40
  2. SUPPLIER
        <type 'unicode'>
        Nulls: False
        Unique values: 4357
        5 most frequent values:
                OST, INC.:                        841
                DELL COMPUTER CORP.:              366
                AMERICAN EXPRESS COMPANY:         282
                MVS, INC.:                        176
                CAPITAL SERVICES AND SUPPLIES:    167
        Max length: 52

Row count: 16075
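For readers who would rather stay inside Python, a rough equivalent of this column summary can be sketched with the standard library alone. The inline sample rows and the function name `summarize` are made up for illustration; only the column names come from the slides, and this is not how csvstat itself is implemented.

```python
import csv
import io
from collections import Counter

# Hypothetical sample rows; only the header names come from the slides.
SAMPLE = """AGENCY_NAME,SUPPLIER
DEPARTMENT OF HEALTH,"OST, INC."
DEPARTMENT OF HEALTH,DELL COMPUTER CORP.
PUBLIC CHARTER SCHOOLS,"OST, INC."
"""

def summarize(csv_text, column, top=5):
    """Return (unique_count, most_common_values) for one column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    counts = Counter(row[column] for row in reader)
    return len(counts), counts.most_common(top)

unique, frequent = summarize(SAMPLE, "SUPPLIER")
print(unique, frequent)
```

On the real file, the same function could be fed the text of purchase2011_cleaned.csv instead of the sample string.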

$ csvgrep -c 6 -r ^MAYA purchase2011_cleaned.csv

PO_NUMBER,AGENCY_NAME,NIGP_DESCRIPTION,PO_TOTAL_AMOUNT,ORDER_DATE,SUPPLIER,SUPPLIER_FULL_ADDRESS
PO352244,STATE SUPERINTENDENT OF EDUCATION (OSSE),EDUCATION AND TRAINING CONSULTING 38,408644.73,01/04/2011,MAYA ANGELOU PCS,"1851 9TH STREET NW, WASHINGTON, DC, 20001"
PO352652,STATE SUPERINTENDENT OF EDUCATION (OSSE),EDUCATION AND TRAINING CONSULTING 38,111679.16,01/07/2011,MAYA ANGELOU PCS,"1851 9TH STREET NW, WASHINGTON, DC, 20001"
PO352920,PUBLIC CHARTER SCHOOLS,SCHOOL OPERATION AND MANAGEMENT SERVICES 71,2205630.13,01/11/2011,MAYA ANGELOU PCS,"1851 9TH STREET NW, WASHINGTON, DC, 20001"
PO355150,STATE SUPERINTENDENT OF EDUCATION (OSSE),EDUCATION AND TRAINING CONSULTING 38,391092.49,02/07/2011,MAYA ANGELOU PCS,"1851 9TH STREET NW, WASHINGTON, DC, 20001"
PO356426,STATE SUPERINTENDENT OF EDUCATION (OSSE),FINANCIAL SERVICES (NOT OTHERWISE CLASSIFIED) 49,999891,02/23/2011,MAYA ANGELOU PCS,"1851 9TH STREET NW, WASHINGTON, DC, 20001"
PO356632,STATE SUPERINTENDENT OF EDUCATION (OSSE),PROFESSIONAL SERVICES (NOT OTHERWISE CLASSIFIED) 58,187200,02/25/2011,MAYA ANGELOU PCS,"1851 9TH STREET NW, WASHINGTON, DC, 20001"
PO359961,PUBLIC CHARTER SCHOOLS,SCHOOL OPERATION AND MANAGEMENT SERVICES 71,1753238,04/12/2011,MAYA ANGELOU PCS,"1851 9TH STREET NW, WASHINGTON, DC, 20001"
PO360284,STATE SUPERINTENDENT OF EDUCATION (OSSE),EDUCATION AND TRAINING CONSULTING 38,110729.88,04/14/2011,MAYA ANGELOU PCS,"1851 9TH STREET NW, WASHINGTON, DC, 20001"
PO361203,STATE SUPERINTENDENT OF EDUCATION (OSSE),EDUCATION AND TRAINING CONSULTING 38,92617.32,04/28/2011,MAYA ANGELOU PCS,"1851 9TH STREET NW, WASHINGTON, DC, 20001"
PO351462-V2,STATE SUPERINTENDENT OF EDUCATION (OSSE),EDUCATIONAL RESEARCH SERVICES 19,152229.95,05/05/2011,MAYA ANGELOU PCS,"1851 9TH STREET NW, WASHINGTON, DC, 20001"
PO364208,STATE SUPERINTENDENT OF EDUCATION (OSSE),EDUCATION AND TRAINING CONSULTING 38,118825.51,06/09/2011,MAYA ANGELOU PCS,"1851 9TH STREET NW, WASHINGTON, DC, 20001"
PO366839,PUBLIC CHARTER SCHOOLS,SCHOOL OPERATION AND MANAGEMENT SERVICES 71,2767027,07/12/2011,MAYA ANGELOU PCS,"1851 9TH STREET NW, WASHINGTON, DC, 20001"
PO365094-V2,STATE SUPERINTENDENT OF EDUCATION (OSSE),YOUTH CARE SERVICES 95,98092.35,08/15/2011,MAYA ANGELOU PCS,"1851 9TH STREET NW, WASHINGTON, DC, 20001"
PO370948,STATE SUPERINTENDENT OF EDUCATION (OSSE),YOUTH CARE SERVICES 95,45736.58,08/25/2011,MAYA ANGELOU PCS,"1851 9TH STREET NW, WASHINGTON, DC, 20001"
PO361027-V5,STATE SUPERINTENDENT OF EDUCATION (OSSE),EDUCATION AND TRAINING CONSULTING 38,29424.86,09/06/2011,MAYA ANGELOU PCS,"1851 9TH STREET NW, WASHINGTON, DC, 20001"
PO374132,STATE SUPERINTENDENT OF EDUCATION (OSSE),YOUTH CARE SERVICES 95,9000,09/28/2011,MAYA ANGELOU PCS,"1851 9TH STREET NW, WASHINGTON, DC, 20001"
PO377919,STATE SUPERINTENDENT OF EDUCATION (OSSE),YOUTH CARE SERVICES 95,491663.6,10/25/2011,MAYA ANGELOU PCS,"1851 9TH STREET NW, WASHINGTON, DC, 20001"
PO381219,STATE SUPERINTENDENT OF EDUCATION (OSSE),EDUCATION AND TRAINING CONSULTING 38,120188.81,11/29/2011,MAYA ANGELOU PCS,"1851 9TH STREET NW, WASHINGTON, DC, 20001"
PO383965,STATE SUPERINTENDENT OF EDUCATION (OSSE),YOUTH CARE SERVICES 95,294690.57,12/22/2011,MAYA ANGELOU PCS,"1436 U STREET, NW SUITE 203, WASHINGTON, DC, 20009"
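The same kind of filter can be sketched in plain Python with the `csv` and `re` modules. The sample rows and the helper name `grep_rows` are hypothetical, and the sketch only loosely mirrors what csvgrep does with a column and a regular expression.

```python
import csv
import io
import re

# Hypothetical rows; the second and third suppliers are invented test cases.
SAMPLE = """PO_NUMBER,SUPPLIER
PO1,MAYA ANGELOU PCS
PO2,DELL COMPUTER CORP.
PO3,MAYATECH CORP
"""

def grep_rows(csv_text, column, pattern):
    """Keep the rows whose value in `column` matches the regular expression."""
    regex = re.compile(pattern)
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row for row in reader if regex.search(row[column])]

matches = grep_rows(SAMPLE, "SUPPLIER", r"^MAYA")
print([row["PO_NUMBER"] for row in matches])
```

Because the pattern is anchored with `^`, it keeps every supplier whose name starts with MAYA, not just one school.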

$ csvcut -c 4,2,6,5 purchase2011_cleaned.csv | csvsort -r | head -n 20 | csvlook
----------------------------------------------------------------------------------------------------------
| PO_TOTAL_AMOUNT | AGENCY_NAME                            | SUPPLIER                       | ORDER_DATE |
----------------------------------------------------------------------------------------------------------
| 154133337.02    | DEPARTMENT OF TRANSPORTATION           | SKANSKA-FACCHINA JV            | 2011-11-10 |
| 62677473.88     | DEPARTMENT OF REAL ESTATE SERVICES     | EEC OF DC INC-FORRESTER CONSTR | 2011-09-22 |
| 31809425.48     | DEPARTMENT OF HEALTH                   | DEFENSE LOGISTIC AGENCY        | 2011-09-08 |
| 23600580.0      | DEPARTMENT OF CORRECTIONS              | UNITY HEALTH CARE, INC.        | 2011-10-24 |
| 23538552.0      | DEPARTMENT OF REAL ESTATE SERVICES     | EEC-FORRESTER ANACOSTIA        | 2011-11-08 |
| 22375314.45     | DEPARTMENT OF CORRECTIONS              | CORRECTIONS CORPORATION OF     | 2011-05-25 |
| 21450000.04     | DEPARTMENT OF HUMAN SERVICES           | THE COMMUNITY PARTNERSHIP\HOME | 2011-08-18 |
| 20813348.99     | DEPARTMENT OF REAL ESTATE SERVICES     | THE JOHN AKRIDGE CO            | 2011-06-28 |
| 20622000.0      | DEPARTMENT OF TRANSPORTATION           | W M SCHLOSSER CO INC           | 2011-08-29 |
| 19824914.0      | DEPARTMENT OF CORRECTIONS              | CORRECTIONS CORPORATION OF     | 2011-10-24 |
| 18300956.56     | DEPARTMENT OF HUMAN SERVICES           | THE COMMUNITY PARTNERSHIP\HOME | 2011-11-29 |
| 18104339.98     | DEPARTMENT OF HUMAN SERVICES           | THE COMMUNITY PARTNERSHIP\HOME | 2011-05-17 |
| 18000000.0      | DEPARTMENT OF HEALTH                   | DC PRIMARY CARE ASSOCIATION    | 2011-03-10 |
| 17000000.0      | DEPARTMENT OF HEALTH                   | CHILDRENS NATIONAL MEDICAL CTR | 2011-11-25 |
| 16850000.0      | DEPUTY MAYOR FOR ECONOMIC DEVELOPMENT  | 2 M STREET REDEVELOPMENT LLC   | 2011-09-29 |
| 16333257.33     | DEPARTMENT OF HUMAN SERVICES           | THE COMMUNITY PARTNERSHIP\HOME | 2011-06-02 |
| 14206937.0      | PUBLIC CHARTER SCHOOLS                 | FRIENDSHIP PCS                 | 2011-07-12 |
| 13862557.44     | MUNICIPAL FACILITIES: NON-CAPITAL      | US SECURITY ASSOCIATES, INC.   | 2011-10-07 |
| 13800000.0      | DISTRICT DEPARTMENT OF THE ENVIRONMENT | VERMONT ENERGY INVESTMENT CORP | 2011-10-04 |
----------------------------------------------------------------------------------------------------------
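A "largest purchase orders first" listing like the one above can also be approximated in a few lines of Python. This is only a sketch: the sample rows and the helper name `largest_orders` are invented, and the amount column is converted to `float` so that the sort is numeric rather than alphabetical.

```python
import csv
import io

# Hypothetical rows; only the column names come from the slides.
SAMPLE = """PO_TOTAL_AMOUNT,AGENCY_NAME
9000,AGENCY A
154133337.02,AGENCY B
408644.73,AGENCY C
"""

def largest_orders(csv_text, n=20):
    """Return the n rows with the largest PO_TOTAL_AMOUNT, biggest first."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    # Sort on the numeric value; the raw strings would sort "9000" above "154...".
    rows.sort(key=lambda row: float(row["PO_TOTAL_AMOUNT"]), reverse=True)
    return rows[:n]

top = largest_orders(SAMPLE, n=2)
print([row["AGENCY_NAME"] for row in top])
```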

Social Network Analysis

“Social network analysis is focused on uncovering the patterning of people's interaction.” - http://www.insna.org/sna/what.html

99th House
President: Reagan
House majority: Democrats
Years: 1985, 1986

107th House
President: Bush
House majority: Republicans
Years: 2001, 2002

108th House
President: Bush
House majority: Republicans
Years: 2003, 2004

109th House
President: Bush
House majority: Republicans
Years: 2005, 2006

110th House
President: Bush
House majority: Democrats
Years: 2007, 2008

111th House
President: Obama
House majority: Democrats
Years: 2009, 2010

CSV to network

import networkx as nx

G = nx.Graph()
node_edgelist = []

# grab edges
for row in csv_file:
    node_edgelist.append((n, e))

# create edges
for f in node_edgelist:
    for t in node_edgelist:
        if t != f:
            add_edge_or_weight(G, f[0], t[0])
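The slide's loop leaves `n`, `e`, and `add_edge_or_weight` undefined, so the following is only one possible reading and not the speakers' code. It assumes that nodes are suppliers, that two suppliers are linked when the same agency bought from both, and that the edge weight counts the agencies they share. The sample rows and the name `build_weighted_edges` are hypothetical, and a plain dict stands in for the networkx graph.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical (agency, supplier) pairs standing in for rows of the CSV.
ROWS = [
    ("DEPARTMENT OF HEALTH", "OST, INC."),
    ("DEPARTMENT OF HEALTH", "DELL COMPUTER CORP."),
    ("PUBLIC SCHOOLS", "OST, INC."),
    ("PUBLIC SCHOOLS", "DELL COMPUTER CORP."),
    ("PUBLIC SCHOOLS", "MVS, INC."),
]

def build_weighted_edges(rows):
    """Link suppliers that share an agency; weight = number of shared agencies."""
    suppliers_by_agency = defaultdict(set)
    for agency, supplier in rows:
        suppliers_by_agency[agency].add(supplier)

    weights = defaultdict(int)
    for suppliers in suppliers_by_agency.values():
        # sorted() gives every undirected pair a single canonical key
        for a, b in combinations(sorted(suppliers), 2):
            weights[(a, b)] += 1  # assumed role of add_edge_or_weight
    return dict(weights)

edges = build_weighted_edges(ROWS)
print(edges)
```

The resulting pairs could then be loaded into a graph with `G.add_weighted_edges_from((a, b, w) for (a, b), w in edges.items())`.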

Centrality Analysis (networkx)

Degree - nx.degree(G)
Number of connections; more connections = more important

Closeness centrality - nx.closeness_centrality(G)
Distance to all other nodes; closer = more important

Betweenness centrality - nx.betweenness_centrality(G)
Based on how many shortest paths run through a node, i.e. how much it controls the flow of information

PageRank - nx.pagerank(G)
A node gains importance from the importance of the nodes around it
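To make the first two measures concrete, here is a from-scratch sketch of what degree and closeness compute, run on a small made-up path graph. It is not how networkx implements them, and the closeness formula shown assumes a connected, unweighted graph.

```python
from collections import deque

# Hypothetical toy graph as an adjacency dict: a path a - b - c - d.
GRAPH = {
    "a": {"b"},
    "b": {"a", "c"},
    "c": {"b", "d"},
    "d": {"c"},
}

def degree(graph):
    """Number of connections per node."""
    return {node: len(neighbors) for node, neighbors in graph.items()}

def closeness(graph):
    """(n - 1) / sum of shortest-path distances; assumes a connected graph."""
    scores = {}
    for source in graph:
        dist = {source: 0}
        queue = deque([source])
        while queue:  # breadth-first search yields unweighted shortest paths
            node = queue.popleft()
            for neighbor in graph[node]:
                if neighbor not in dist:
                    dist[neighbor] = dist[node] + 1
                    queue.append(neighbor)
        scores[source] = (len(graph) - 1) / sum(dist.values())
    return scores

deg = degree(GRAPH)
close = closeness(GRAPH)
print(deg, close)
```

The middle nodes b and c come out ahead on both measures, which matches the intuition on the slide: more connections and shorter distances mean more importance.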


Centrality Analysis (networkx)

Digi Docs Inc Document Managers (Dallas)
“Offers software that generates loan documents for electronic delivery.”

Iron Mountain (Mountain View)
“Iron Mountain provides information management services that help organizations lower the costs, risks and inefficiencies of managing their physical and digital data.”

MVS, Inc. (Washington, DC)
“MVS Consulting is an 8(a) STARS II, HUBZone, LSDBE, CBE, and MBE IT Solutions company that provides IT solutions to Federal, State and Local Government Agencies.”

MDM OFFICE SYSTEMS INC (Washington, DC)
“Standard Office Supply - Office Supplies, Furniture Dealer, Educational Products, Breakroom Supplies, Imaging Supplies, and Coffee Services”

Capital Services and Supplies (Washington, DC)
“CSSI is an office solutions firm located in Washington, DC since 1980. CSSI’s goods and services are available to commercial, government, and educational institutions throughout the continental United States.”

Centrality Analysis (networkx)

Not included in previous slide...

United States Postal Service & Dell Computer Corp

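Visualize the network

import matplotlib.pyplot as plt

# lay out the graph, then draw nodes and faint edges on a large, axis-free figure
pos = nx.spring_layout(G, iterations=100)
plt.figure(1, figsize=(15, 15))
plt.axis('off')

nx.draw_networkx_nodes(G, pos, node_size=100, alpha=1, node_color='g')
nx.draw_networkx_edges(G, pos, alpha=0.2)
plt.savefig('graph.png')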


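Trimming nodes

# the slide shows only the function body; the name trim_nodes is assumed
def trim_nodes(G, degree):
    g2 = G.copy()  # leave the original graph untouched
    d = dict(nx.degree(g2))
    # loop over a list copy because nodes are removed along the way
    for n in list(g2.nodes()):
        if d[n] <= degree:
            g2.remove_node(n)
    return g2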

Degree Distribution

import matplotlib.pyplot as plt

# dict() keeps this working on networkx versions where nx.degree returns a view
d = dict(nx.degree(G))
plt.figure(1, figsize=(15, 10))
h = plt.hist(list(d.values()), 100)

Trimmed nodes

Adding labels

nx.draw_networkx_labels(g3, pos, alpha=1)
nx.draw_networkx_edges(g3, pos, alpha=0.05)

Maps to maps


Spatial is special

• spatial data = attributes, location, time

• mappable!

• spatial data must be referenced in space

• Tobler’s First Law of Geography: everything is related to everything else, but near things are more related than distant things

Spatial analysis

• turns large data sets into a smaller amount of meaningful information

• exploratory spatial data analysis (ESDA)

• spatial statistics

• mathematical modeling and prediction of spatial processes

Techniques

• point pattern analysis -- hot spots, k density, nearest neighbor

• spatial interpolation -- kriging

• spatial regression -- ordinary least squares, geographically weighted regression
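As one concrete instance of the nearest neighbor idea, the sketch below computes the Clark-Evans ratio: the observed mean nearest-neighbor distance divided by the distance expected for a random pattern of the same density. The point sets, the study-area value, and the function name are made up for illustration, edge effects are ignored, and the slides do not say which nearest-neighbor statistic the speakers used.

```python
import math

def nearest_neighbor_index(points, area):
    """Clark-Evans ratio: below 1 suggests clustering, above 1 dispersion."""
    n = len(points)
    total = 0.0
    for i, (x1, y1) in enumerate(points):
        # distance from point i to its closest other point
        total += min(
            math.hypot(x1 - x2, y1 - y2)
            for j, (x2, y2) in enumerate(points)
            if i != j
        )
    observed = total / n
    expected = 0.5 / math.sqrt(n / area)  # mean distance under spatial randomness
    return observed / expected

# Hypothetical patterns: an evenly spaced grid and a tight cluster.
grid = [(x, y) for x in range(5) for y in range(5)]
clustered = [(x / 100, y / 100) for x in range(5) for y in range(5)]

r_grid = nearest_neighbor_index(grid, area=25.0)
r_clustered = nearest_neighbor_index(clustered, area=25.0)
print(r_grid, r_clustered)
```

The brute-force distance search is quadratic in the number of points, which is fine for a sketch but would need a spatial index for something the size of the purchase order data.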


PySAL

• GeoDa Center at ASU

• Python library for spatial analysis, with modules for exploratory spatial data analysis, spatial econometrics, and location modeling

• http://code.google.com/p/pysal/

• requires NumPy, SciPy

PySAL

• developers looking for spatial analytical methods to incorporate in application development

• analysts working on projects that require custom scripting

• looking for a user-friendly GUI? Try STARS, GeoDa, GeoDaSpace.

• want to integrate into a powerful GIS? Look for plug-ins for ArcGIS & QGIS.


Next steps

• quantify clusters in city, region, nation

• examine clusters along networks, business corridors

• create beautiful, interactive maps and charts to allow users to explore spending patterns on their own


From data analysis to stories


Which stories would we go after?

• construction contracts

• funding to charter schools

• health care costs in prisons

• local vs. regional vs. national purchases

• technology services -- look for overlap


Want to learn more?

The SAGE Handbook of Spatial Analysis, eds. A. Stewart Fotheringham and Peter A. Rogerson

Interactive Spatial Data Analysis, Trevor Bailey and Tony Gatrell

Geographic Information Analysis, David O’Sullivan and David Unwin

PySAL, Luc Anselin, GeoDa Center, Arizona State University

Mia, age 3, geographer in training
