-
Introduction Preview Sent. analysis Languages System req.
Sublime IDE Q & A Conclusion References Files
Big Data: Data Wrangling Boot Camp
Python Sentiment Analysis
Chuck Cartledge, PhD
28 January 2017
-
Table of contents I
1 Introduction
2 Preview
3 Sent. analysis
4 Languages
5 System req.
6 Sublime IDE
7 Q & A
8 Conclusion
9 References
10 Files
-
What are we going to cover?
Look to the future
Talk briefly about sentiment analysis
Address the polyglot of computer languages
Talk about our sentiment analysis system
Data wrangle tweets using Python
-
Things that will be happening today
Things that we will be doing.
1 Data wrangle tweets using Python
2 Conduct sentiment analysis on tweets
3 Look at the sentiments in different ways
Sentiments over time.
-
Things that will be happening today
Sentiment by sending device
-
Things that will be happening today
Sentiment by geographic location
-
How we’ll get there
How we’ll get to the images
We’ll walk before we run.
Start with a replay file
Data wrangle using the library file
Go live and download live tweets
Data flows through the backend, into the database, and out the frontend.
The software design document (attached) contains lots of details.
-
How we’ll get there
What some of the files do:
Replay file – previously recorded tweets
Configuration file – directives used by both backend and frontend script files
Shared library file – routines that are common to backend and frontend script files
Backend file – script file that populates the database from the replay file, or the Internet
Frontend file – script file that extracts data from the database and presents results
-
What is it, and why should I care?
A working definition
“Sentiment analysis (also known as opinion mining) refers to the use of natural language processing, text analysis and computational linguistics to identify and extract subjective information in source materials. Sentiment analysis is widely applied to reviews and social media for a variety of applications, ranging from marketing to customer service.”
W. Staff [3]
-
What is it, and why should I care?
More formal definitions
“The field of opinion mining and sentiment analysis is well-suited to various types of intelligence applications. Indeed, business intelligence seems to be one of the main factors behind corporate interest in the field.”
Pang and Lee [2]
“Sentiment analysis, also called opinion mining, is the field of study that analyzes people’s opinions, sentiments, evaluations, appraisals, attitudes, and emotions towards entities such as products, services, organizations, individuals, issues, events, topics, and their attributes.”
Liu [1]
-
What is it, and why should I care?
Our approach to sentiment analysis
We will:
1 Search the “twitterverse” for tweets using specific hashtags
2 Tokenize each tweet
3 Data wrangle each token
4 Remove all stop words from the tokens
5 Count number of positive and negative tokens
6 Compute the positive, negative, or neutral sentiment for the tokens
7 Display the results
Our approach is language agnostic.
-
A visualization
A “Universe” of words
-
A visualization
Some words are “Positive”
-
A visualization
Some words are “Negative”
-
A visualization
A tweet will/may have positive and negative words
-
A visualization
Some words we don’t care about
-
A visualization
Mechanically this is what we are doing
The steps are:
1 Break the tweet into tokens
2 Remove stop words from the tokens
3 Compute the percentage of remaining tweet tokens that are positive
4 Compute the percentage of remaining tweet tokens that are negative
5 Classify the tokens as positive, negative, or neutral
-
A visualization
Mathematically this is what we are doing
The steps are:
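\[
\begin{aligned}
\text{tokens} &= \{\text{words in tweet}\} \\
\text{tokensLessStop} &= \text{tokens} - \text{stopWords} \\
\text{positivePart} &= \text{positiveWords} \cap \text{tokensLessStop} \\
\text{negativePart} &= \text{negativeWords} \cap \text{tokensLessStop}
\end{aligned}
\]
\[
\text{classification} =
\begin{cases}
\text{positive}, & \text{if } \text{positiveThreshold} \le \dfrac{|\text{positivePart}|}{|\text{tokensLessStop}|} \text{ AND } \dfrac{|\text{negativePart}|}{|\text{tokensLessStop}|} < \text{negativeThreshold} \\[2ex]
\text{negative}, & \text{if } \text{negativeThreshold} \le \dfrac{|\text{negativePart}|}{|\text{tokensLessStop}|} \text{ AND } \dfrac{|\text{positivePart}|}{|\text{tokensLessStop}|} < \text{positiveThreshold} \\[2ex]
\text{neutral}, & \text{otherwise}
\end{cases}
\]
Here |S| is the number of tokens in the set S.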
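The classification rule above can be sketched in Python with built-in sets. This is a minimal sketch rather than the boot camp's actual code: the function name `classify`, its parameter names, and the choice to call a tweet with no usable tokens neutral are all assumptions made for illustration. The 0.33 defaults mirror the threshold defaults listed in the configuration file table of the design document.

```python
def classify(tweet_tokens, stop_words, positive_words, negative_words,
             positive_threshold=0.33, negative_threshold=0.33):
    """Classify a tweet's tokens as 'positive', 'negative', or 'neutral'."""
    tokens = set(tweet_tokens)
    tokens_less_stop = tokens - set(stop_words)
    if not tokens_less_stop:
        # Assumption: a tweet with no usable tokens is treated as neutral.
        return "neutral"
    positive_part = set(positive_words) & tokens_less_stop
    negative_part = set(negative_words) & tokens_less_stop
    # float() keeps the division fractional on Python 2.7 as well as Python 3.
    total = float(len(tokens_less_stop))
    positive_ratio = len(positive_part) / total
    negative_ratio = len(negative_part) / total
    if positive_threshold <= positive_ratio and negative_ratio < negative_threshold:
        return "positive"
    if negative_threshold <= negative_ratio and positive_ratio < positive_threshold:
        return "negative"
    return "neutral"
```

A tweet whose positive and negative shares both reach their thresholds falls through to neutral, which matches the "otherwise" branch of the rule.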
-
Comparing a known language with Python and R
A specific language was not a requirement for this boot camp.
Python and R are used in the boot camp.
There are too many languages to compare to Python and R.
Use Hyperpolyglot.org as a cross reference.
http://hyperpolyglot.org/
-
And implementation
A software design document
The document contains:
1 Overall system design
2 Algorithms used throughout the system
3 Details about the configuration file
4 Details about the database tables
The file is attached.
-
And implementation
What will happen to the tweets when we do nothing.
A histogram of how often a token occurs.
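A token histogram of this kind can be sketched with `collections.Counter` from the standard library. The function name `token_histogram`, the whitespace tokenizer, and the sample tweets are assumptions made for illustration; they are not taken from the boot camp files.

```python
from collections import Counter

def token_histogram(tweets):
    """Count how often each whitespace-separated token occurs."""
    counts = Counter()
    for tweet in tweets:
        counts.update(tweet.split())
    return counts

# Hypothetical tweets, used only to show the effect of doing no wrangling.
sample_tweets = ["Great day at #ODU", "great DAY at the boot camp"]
histogram = token_histogram(sample_tweets)
```

With no wrangling, "Great" and "great" are counted as different tokens, which is the kind of problem the later steps address.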
-
And implementation
What will happen to the tweets when we make everything the same case
Changing case is easy, unless they are emojis.
-
And implementation
What will happen to the tweets when we remove non-ASCII characters
We need to talk about ASCII and the rest of the character systems.
-
And implementation
What will happen to the tweets when we remove “stop words”
Stop words are words/tokens that have no use in whatever we are doing. Stop words are domain specific.
And we change case, remove non-ASCII, remove punctuation, etc.
-
Basic operations
A complete IDE
A complete Python integrated development environment (IDE)
1 Text editor
2 Python console
3 Extensible
4 Tabbed display for multiple files
We incorporate the Anaconda plugin for additional functionality.
-
Basic operations
Editor
“Smart” editor
CTRL + O to open a file
CTRL + S to save a file
CTRL + B to run a file
Multiple files can be opened at once
The Anaconda plugin enhances the editor, and adds “lint” and style functions.
-
Q & A time.
Q: Do you know what the death rate around here is?
A: One per person.
-
What have we covered?
A preview of today’s activities
An overview of the Sublime IDE
Next: Hands-on analysing tweets with Python.
-
References I
[1] Bing Liu, Sentiment analysis and opinion mining, 2012.
[2] Bo Pang and Lillian Lee, Opinion mining and sentiment analysis, 2008.
[3] Wikipedia Staff, Sentiment analysis, https://en.wikipedia.org/wiki/Sentiment_analysis, 2016.
-
Files of interest
1 Software design document
-
ODU Big Data, Data Wrangling Boot Camp
Software Overview and Design
Chuck Cartledge
January 17, 2017
Contents
1 Introduction
2 Software system design
2.1 Algorithms used by the software front and back ends
2.2 Configuration file
2.3 Design limitations
3 References
A Database tables
B Notational data structures
C Software on each workstation
List of Tables
1 Frontend and backend algorithm cross matrix.
2 Configuration file entries.
3 Tables to support Twitter analysis.
4 Notional plotting data structure.
5 Notional plotting data structure.
List of Figures
1 Notional data science data flow.
2 System design.
List of Algorithms
1 Process configuration file.
2 Update database with new data.
3 Normalize text.
4 Evaluate text.
5 Update display with new data.
6 Plotting hash tag based Tweet sentiment.
-
1 Introduction
The tweet sentiment analysis software used in the Old Dominion University College of Continuing Education and Professional Development Big Data: Data Wrangling boot camp¹ gives boot-camp attendees hands-on experience with data wrangling of textual data.
“We define such data wrangling as a process of iterative data exploration and transformation that enables analysis. . . . In other words, data wrangling is the process of making data useful.”
Kandel et al. [1]
In the boot camp, we will look at tweets to conduct sentiment analysis relative to arbitrary hashtags. We will focus on data wrangling (see Figure 1) using both the Python and R programming languages.
Each boot-camp workstation will have the same software load (see Section C), and almost fully functional software in Python and R. The software will be complete in that it will:
• Retrieve tweets from Twitter,
• Place tweets in a Postgres database,
• Retrieve tweets from the database,
• Tokenize the tweets,
• Qualify the tweets as positive, negative, or neutral, and
• Plot the results in different ways.
Data wrangling will focus on:
1. Identifying problems with the tweet tokens,
2. Developing solutions to those problems, and
3. Reducing the number of problematic tokens.
The remaining sections lay out in detail the overall system design, details of the major algorithms, database tables, and the configuration file used to control the system.
¹ https://www.odu.edu/cepd/bootcamps/data-wrangling
-
Figure 1: Notional data science data flow. Data wrangling requires domain specific knowledge to cleanse, merge, adapt, and evaluate raw data. Image from [1].
-
Figure 2: System design. Both front and back ends read data from a common configuration file, and use a shared library file of common functions. The back end will receive data from the internet or from a replay file, based on directives in the configuration file, and update the database with new data. The front end will connect to the database and retrieve data based on directives from the configuration file.
2 Software system design
The system is logically divided into three parts (see Figure 2):
1. A “backend” that gets tweets from Twitter or a data file.
2. A database to hold tweets from the backend.
3. A “frontend” that retrieves data from the database for analysis and display.
The frontend and backend processes are controlled by the contents of a configuration file (see Section 2.2).
The backend receives data from Twitter or a data file and puts the data into the database, and the frontend retrieves the data for processing.
2.1 Algorithms used by the software front and back ends
Details of the various algorithms used by the backend and frontend processes are outlined in this section.
-
Table 1: Frontend and backend algorithm cross matrix. Algorithms that are used by both the front and back ends are recommended for a “utility” file or library that can be accessed by both ends.
Num.  Name                                      Back end  Front end
1     Process configuration file                X         X
2     Update database with new data             X
3     Normalize text                                      X
4     Evaluate text                                       X
5     Update display with new data                        X
6     Plotting hash tag based Tweet sentiment             X
Input: Location of configuration file
Assign default values;
while not at the end of the file do
    read line;
    if not a comment line then
        get token;
        get value;
        if is a Hashtag then
            add value to list of hashtags;
        else
            structure token value = value;
        end
    end
end
Result: A language specific data structure, values from file override defaults.
Algorithm 1: Process configuration file.
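Algorithm 1 could be sketched in Python roughly as follows. The function name `read_configuration`, the choice to take an iterable of lines instead of a file location, and the skipping of blank lines are assumptions made for illustration. The `Hashtag` token and the rule that later values override earlier ones come from the configuration file description in this document.

```python
def read_configuration(lines, defaults=None):
    """Build a configuration dictionary from configuration file lines."""
    config = dict(defaults or {})  # assign default values
    hashtags = []
    for line in lines:
        line = line.strip()
        # Lines starting with '#' are comments; skipping blank lines is an assumption.
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)  # white space separates token from value
        token = parts[0]
        value = parts[1].strip() if len(parts) > 1 else ""
        if token == "Hashtag":
            hashtags.append(value)  # hashtags are collected, not overridden
        else:
            config[token] = value   # the last occurrence of a token wins
    config["Hashtag"] = hashtags
    return config
```

A file can be passed directly, for example `with open(path) as handle: config = read_configuration(handle)`. Values are left as strings in this sketch, so numeric entries such as `Poll` would still need converting.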
-
Input: Language specific configuration structure
start = first time in data file;
if Offset = TRUE then
    diff = now() - start;
else
    diff = 0
end
time end = start + SleepyTime + diff;
for Polls remaining do
    if Live then
        submit query to Twitter;
        request data from Twitter;
        for lines from Twitter do
            extract time from JSON;
            data = base64 encoding of entire JSON;
            insert time and data into database;
            if CollectionFile is not NULL then
                append time and data to CollectionFile;
            end
        end
        sleep SleepyTime;
    else
        read line from file;
        parse line into time and data;
        while time > time end do
            time end = time end + SleepyTime;
            sleep SleepyTime;
        end
        insert time and data into database;
    end
end
Result: An updated database.
Algorithm 2: Update database with new data.
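The replay branch of Algorithm 2 paces inserts so that replayed tweets arrive in SleepyTime-sized windows. The sketch below shows only that pacing logic, under the assumption that Offset is FALSE (so diff = 0). The function name `sleeps_before_insert` is hypothetical, and instead of actually sleeping or touching a database it reports how many waits would precede each insert.

```python
def sleeps_before_insert(record_times, sleepy_time):
    """Return, per replayed record, how many SleepyTime waits precede its insert."""
    if not record_times:
        return []
    # start = first time in the data file; diff = 0 because Offset is assumed FALSE.
    time_end = record_times[0] + sleepy_time
    waits = []
    for record_time in record_times:
        count = 0
        while record_time > time_end:
            time_end += sleepy_time
            count += 1  # the real loop would call time.sleep(sleepy_time) here
        waits.append(count)
    return waits
```

For record times 0, 3, 12, and 13 with a SleepyTime of 5, the first two records are inserted immediately, the third waits two intervals, and the fourth follows without a further wait.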
-
Input: Text to be “normalized”, “stop word list”
cleansed = Null;
for Text do
    lower case Text;
    remove non-ASCII;
    stemming;
    if Text not in “stop word list” then
        append Text to cleansed
    end
end
return cleansed;
Result: Normalized text
Algorithm 3: Normalize text.
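A minimal Python sketch of Algorithm 3 follows. It assumes the tokens are Unicode strings, and the names `normalize` and `stem` are chosen for illustration. Stemming is left as an optional callable so the sketch has no third-party dependency; one option would be the `stem` method of NLTK's `PorterStemmer`, since nltk is among the modules listed for the workstations. Dropping tokens that become empty once non-ASCII characters are removed is an added assumption.

```python
def normalize(tokens, stop_words, stem=None):
    """Lower-case, strip non-ASCII, optionally stem, and drop stop words."""
    cleansed = []
    for token in tokens:
        token = token.lower()
        # Encoding with errors="ignore" silently drops non-ASCII characters.
        token = token.encode("ascii", "ignore").decode("ascii")
        if stem is not None:
            token = stem(token)
        # Assumption: tokens left empty after cleaning are discarded.
        if token and token not in stop_words:
            cleansed.append(token)
    return cleansed
```

The stop word list should be normalized the same way before it is passed in, otherwise a capitalized stop word would never match.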
Input: sourceText, baseLineText
numberOfSourceWords = number of words of sourceText that appear in baseLineText;
percentage = numberOfSourceWords / numberOfWordsInSourceText;
return percentage;
Result: Percentage of source text in baseLineText
Algorithm 4: Evaluate text.
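Algorithm 4 reduces to a single ratio. The sketch below is one possible reading of it: it works on a list of tokens, counts repeated tokens each time they occur, and returns a fraction between 0 and 1 so that it can be compared directly with thresholds such as 0.33. The name `evaluate` and the decision to return 0.0 for empty input are assumptions.

```python
def evaluate(source_tokens, baseline_words):
    """Return the fraction of source tokens that appear in the baseline words."""
    if not source_tokens:
        # Assumption: an empty source text matches nothing.
        return 0.0
    baseline = set(baseline_words)
    matches = sum(1 for token in source_tokens if token in baseline)
    # float() keeps the division fractional on Python 2.7 as well as Python 3.
    return matches / float(len(source_tokens))
```

Multiplying the result by 100 gives a value on the 0 to 100 scale, if a percentage in that form is needed.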
-
Input: Language specific configuration structure
cleansedPositive = normalize positive words;
cleansedNegative = normalize negative words;
cleansedStopWords = normalize Stop words;
timeStart = minimum time from database;
for Polls remaining do
    timeEnd = timeStart + SleepyTime;
    hash tag new data = NULL;
    lines = query database from timeStart to timeEnd;
    for lines do
        tweet = base64 decode of data;
        if parse Tweet is GOOD then
            extract text;
            extract hash tag from Tweet text;
            cleansedText = normalized text less cleansedStopWords;
            positive percentage = evaluate cleansedText vs. cleansedPositive;
            negative percentage = evaluate cleansedText vs. cleansedNegative;
            neutral percentage = 100 - positive percentage - negative percentage;
            update plotting information (hash tag, source, location);
        end
    end
    plot hash tag results;
    timeStart = timeEnd;
    sleep SleepyTime;
end
plot source information;
plot location information;
Result: An updated display.
Algorithm 5: Update display with new data.
-
Input: The previous/current plotting data, and new data
for Each Tweet type do
    set the lower left polygon point as the previous poll and the last previous type count;
    set the upper left polygon point as the previous poll and the last previous type count + next previous type count;
    set the lower right polygon point as the current poll and the current type count;
    set the upper right polygon point as the current poll and the current type count + next current type count;
    plot the polygon, filling it with the Tweet type color
end
for Each Tweet type do
    set previous Tweet count value to current Tweet count value;
end
Result: An updated display data structure, and display.
Algorithm 6: Plotting hash tag based Tweet sentiment. From the user’s perspective, a stacked histogram is plotted. From a programmatic perspective, three filled polygons are plotted where the left and right edges are the poll numbers, and the vertical component is the number of Tweets per type (positive, neutral, and negative). The display will show the absolute number of Tweets, and the color bands will show the proportions of each type.
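The corner points described in Algorithm 6 can be computed without any plotting library. The sketch below is an interpretation of the algorithm, not the boot camp's code: the name `band_polygons`, the stacking order in `TWEET_TYPES`, and the use of dictionaries keyed by type are assumptions. Each band starts where the band below it ends, on both the previous-poll edge and the current-poll edge.

```python
# Assumed stacking order, bottom band first.
TWEET_TYPES = ("positive", "neutral", "negative")

def band_polygons(previous_poll, previous_counts, current_poll, current_counts):
    """Return the four corner points of each stacked band between two polls."""
    polygons = {}
    previous_base = 0
    current_base = 0
    for tweet_type in TWEET_TYPES:
        previous_top = previous_base + previous_counts[tweet_type]
        current_top = current_base + current_counts[tweet_type]
        polygons[tweet_type] = [
            (previous_poll, previous_base),  # lower left
            (previous_poll, previous_top),   # upper left
            (current_poll, current_top),     # upper right
            (current_poll, current_base),    # lower right
        ]
        # The next band sits on top of this one.
        previous_base = previous_top
        current_base = current_top
    return polygons
```

The corner lists are ordered so they trace each band's outline, and could then be handed to a plotting call that fills polygons in the configured color for each type.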
-
2.2 Configuration file
The software frontend and backend processes are coordinated by control values in a shared configuration file.
1. A common configuration file to be used by both the data capture and the data presentation programs.
2. The file will default to a “well known” name in a “well known” location.
3. An alternative file can be passed in as a command line argument.
4. Any line in the file starting with a hashtag (#) will be treated as a comment and not processed.
5. File entries are case sensitive.
6. All entries are optional. Some are required for live operation capture.
7. If the same option appears more than once, the last option will be honored, except for hashtags. Hashtags will be treated as a collective.
8. “White space” separates each token from its value.
Table 2: Configuration file entries. The default stop word file will be provided (source: http://xpo6.com/list-of-english-stop-words/). It can be modified or replaced as needed.
Token              Meaning                                               Default
APIPrivateKey      Twitter private API key. Must be supplied for         (None)
                   live operation.
APIPublicKey       Twitter public API key. Must be supplied for          (None)
                   live operations.
CollectionFile     A file to collect raw Tweets during live              (None)
                   operations.
ColorNegative      The color used to indicate negative tweets.           BLACK
ColorNeutral       The color used to indicate neutral tweets.            WHITE
ColorPositive      The color used to indicate positive tweets.           GREEN
Hashtag            This is the hashtag used to search Twitter            (None)
                   without the leading hashtag (#).
LexiconFile        A text file containing positive and negative          lexicon.csv
                   words.
Offset             Should the replay data be brought forward to          FALSE
                   current time? Accepted values are TRUE or FALSE.
Poll               How many times to add new data to the database.       10
                   If data is being replayed, the maximum number of
                   database updates will be this value, or the end
                   of data from the file. If live operations, then
                   this is how many times Twitter will be polled
                   for new data.
PostgresTable      The Postgres table containing the tweets.             tweeets
PostgresUser       The Postgres user name used to access the             openpg
                   database.
PostgresPassword   The Postgres password associated with the             new user password
                   Postgres user.
ResetDatabase      Should the database be reset, and all previous        FALSE
                   data lost when the program starts. Accepted
                   values are TRUE or FALSE.
SleepyTime         How many seconds between updates to the               5
                   database. It is possible that no data will be
                   added to the database if there isn’t any Twitter
                   activity for a hashtag.
SourceFile         The file containing the data to be replayed. If       (None)
                   this option is not set, then the operation is
                   assumed to be “live.”
StopwordsFile      The file containing “stop words” that will not        stopword.txt
                   be considered in determining positive or
                   negative sentiment.
ThresholdNegative  The percentage of words in a tweet considered         0.33
                   negative for the tweet to be labeled negative.
ThresholdPositive  The percentage of words in a tweet considered         0.33
                   positive for the tweet to be labeled positive.
-
2.3 Design limitations
The current design polls Twitter for new tweets on a periodic basis. The entire list of search hash tags is polled, any returned tweets are stored in the database, and the system “sleeps” for a number of seconds (as per the configuration file). There are a number of factors that affect this processing cycle, including:
1. The number of hash tags being queried. Each poll takes a finite amount of time, even if no tweets are returned, so the more hash tags being queried, the longer it will take to service the complete list of tags.
2. Each tweet becomes a single row in the database. The more tweets that are returned from the query, the longer it takes to update the database with all the tweets.
3. Each tweet has a unique serial number. Each query includes the serial number of the earliest tweet of interest (the one furthest in the past) in order to get a complete and continuous tweet stream for the hash tag. The earliest acceptable tweet is updated after a successful query.
4. The no-cost query capability is limited to 100 tweets per query. If more than 100 tweets are created between queries, then the polling process will continue to fall further and further behind.
Because of these design and implementation limitations, if tweets are being created faster than the polling process can collect them, then the system will fall further and further behind.
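The effect of the 100-tweet limit can be illustrated with a little arithmetic. The sketch below is a simplified model, not a measurement: it assumes one hash tag, one query per poll, and a constant number of new tweets between polls. The function name `backlog_after` and its parameters are hypothetical.

```python
def backlog_after(polls, tweets_per_interval, query_limit=100):
    """Return the number of uncollected tweets remaining after each poll."""
    backlog = 0
    history = []
    for _ in range(polls):
        backlog += tweets_per_interval        # tweets created since the last poll
        backlog -= min(backlog, query_limit)  # at most query_limit are collected
        history.append(backlog)
    return history
```

At 150 new tweets per interval the backlog grows by 50 with every poll, while at 80 per interval each poll clears everything that is waiting.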
The limitations imposed by a polling interface can be overcome by using a streaming interface². A polling interface was used because it is simple to design, simple to implement, and simple to test. The back-end process could be replaced by a streaming interface without affecting the front-end process.
3 References
[1] Sean Kandel, Jeffrey Heer, Catherine Plaisant, Jessie Kennedy, Frank van Ham, Nathalie Henry Riche, Chris Weaver, Bongshin Lee, Dominique Brodbeck, and Paolo Buono, Research directions in data wrangling: Visualizations and transformations for usable and credible data, Information Visualization 10 (2011), no. 4, 271–288.
² Python example: http://bogdanrau.com/blog/collecting-tweets-using-r-and-the-twitter-streaming-api/, R example: http://bogdanrau.com/blog/collecting-tweets-using-r-and-the-twitter-streaming-api/
-
A Database tables
These are the database tables/data structures to support Twitter
analysis:
Table 3: Tables to support Twitter analysis.
Column  Meaning
time    Unix seconds as extracted from the Tweet.
data    Base 64 encoding of the entire JSON Tweet.
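The two columns can be produced and read back with the standard `base64` and `json` modules. This is a minimal sketch: the helper names `encode_tweet` and `decode_tweet` and the sample tweet, including its `timestamp` field, are hypothetical and do not reflect Twitter's actual JSON layout.

```python
import base64
import json

def encode_tweet(tweet):
    """Serialize a tweet dictionary to JSON and base64-encode it for storage."""
    raw = json.dumps(tweet).encode("utf-8")
    return base64.b64encode(raw).decode("ascii")

def decode_tweet(data):
    """Reverse encode_tweet: base64-decode and parse the JSON."""
    return json.loads(base64.b64decode(data).decode("utf-8"))

# Hypothetical tweet; the field names are made up for this example.
tweet = {"id": 1, "text": "Data wrangling at #ODU", "timestamp": 1485612000}
row = (tweet["timestamp"], encode_tweet(tweet))  # (time, data)
```

Storing the whole tweet as base64 text keeps quotes and other special characters out of the SQL statements, at the cost of having to decode each row before it can be inspected.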
B Notational data structures
These are the notational data structures used by the various
processes.
Table 4: Notional plotting data structure. A multidimensional structure indexed by hashtag.
Name                   Purpose
PositiveTweetSource    A dictionary/hash table to keep track of the number of positive Tweets by software source. This is for the entire polling period.
NegativeTweetSource    A dictionary/hash table to keep track of the number of negative Tweets by software source. This is for the entire polling period.
PositiveTweetLocation  A dictionary/hash table to keep track of the geographic location of a positive Tweet.
NegativeTweetLocation  A dictionary/hash table to keep track of the geographic location of a negative Tweet.
Table 5: Notional plotting data structure. This structure is indexed by hashtag.
Cell  Use
0     Number of positive Tweets.
1     Number of neutral Tweets.
2     Number of negative Tweets.
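One way to hold both notional structures in Python is a dictionary keyed by hashtag. The sketch below is an assumption about how they might be combined: the names `new_hashtag_entry`, `record`, and the `"counts"` key are hypothetical, and the location dictionaries are assumed to count Tweets per location.

```python
from collections import defaultdict

# Cell indexes from Table 5.
POSITIVE, NEUTRAL, NEGATIVE = 0, 1, 2

def new_hashtag_entry():
    """Create the per-hashtag structures described in Tables 4 and 5."""
    return {
        "PositiveTweetSource": defaultdict(int),
        "NegativeTweetSource": defaultdict(int),
        "PositiveTweetLocation": defaultdict(int),
        "NegativeTweetLocation": defaultdict(int),
        "counts": [0, 0, 0],  # hypothetical key for the Table 5 cells
    }

def record(plotting, hashtag, cell, source, location):
    """Update the plotting data for one classified Tweet."""
    entry = plotting[hashtag]
    entry["counts"][cell] += 1
    if cell == POSITIVE:
        entry["PositiveTweetSource"][source] += 1
        entry["PositiveTweetLocation"][location] += 1
    elif cell == NEGATIVE:
        entry["NegativeTweetSource"][source] += 1
        entry["NegativeTweetLocation"][location] += 1

plotting = defaultdict(new_hashtag_entry)
record(plotting, "ODU", POSITIVE, "iPhone", "Norfolk")
record(plotting, "ODU", NEUTRAL, "Web", "Norfolk")
```

Neutral Tweets only touch the counts here, because Table 4 defines source and location dictionaries for positive and negative Tweets only.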
-
C Software on each workstation
This section contains the assumptions about the operating system environment, and software load out for each workstation.
1. Operating system: Windows 7
2. Database
(a) Name: PostgreSQL
(b) Version: 9.5.3
(c) Source: http://www.postgresql.org/download/windows/ and
http://www.enterprisedb.com/products-services-training/pgdownload#windows
(d) Superuser password: ODUBootcamp
(e) Misc: It may be necessary to manually start the Postgres server using these commands in a terminal window:
cd "\Program Files\PostgreSQL\9.5\bin"
.\pg_ctl -D "c:\Program Files\PostgreSQL\9.5\data" start
3. Software
(a) Python
i. Interpreter version: 2.7.12
ii. Available from:
https://www.python.org/downloads/release/python-2712/
iii. Modules:
• base64
• csv
• descartes
• geos
• json
• lxml==3.6.0
• math
• matplotlib
• nltk
• numpy
• os
• pickle
• psycopg2
• shapely³
• sys
• time
• traceback
• tweepy
• urllib
(b) Java
• Version: Java SE Development Kit 7u79 (assuming Windows 64 bit OS)
• Available from: http://www.oracle.com/technetwork/java/javase/downloads/jdk7-downloads-1880260.html
(c) pgAdmin
• Version: 1.22.1
• Available from: https://www.pgadmin.org/download/
(d) R
• Version: 3.3.1
• Available from: https://cran.r-project.org/bin/windows/base/
• Packages:
i. bitops
ii. DBI
iii. ggmap
iv. ggplot2
v. httr
vi. jsonlite
vii. mapdata
viii. mapplots
ix. mapproj
x. maps
xi. methods
xii. NLP
³ The shapely module depends on a number of other modules and can be difficult to install. The best way that I’ve found to install the module and all its dependencies is to use the Python and hardware architecture specific whl file from
http://www.lfd.uci.edu/~gohlke/pythonlibs/#shapely
After downloading the whl file, it can be installed using pip, for example:
pip install Shapely-1.5.17-cp27-cp27m-win_amd64.whl
-
xiii. openssl
xiv. RCurl
xv. rjson
xvi. ROAuth
xvii. RPostgreSQL
xviii. SnowballC
xix. streamR
xx. tm
(e) R-Studio
• Version: 0.99.903
• Available from: https://www.rstudio.com/products/rstudio/download/
(f) Sublime
• Version: build 3114
• Available from: https://www.sublimetext.com/3
Requires:
• Anaconda plugin
– Version: v1.4.4
– Available from: http://damnwidget.github.io/anaconda/
Anaconda can also be installed using Package Control for Sublime Text 3.
i. Install Package Control using the Python script found at: https://packagecontrol.io/installation in the Sublime Console window.
ii. Using the Package Control capability just added to Sublime, install Anaconda using the directions found at: http://damnwidget.github.io/anaconda/ basically: Tools → Command Palette → Install Package → anaconda
The PATH environment variable should be updated to include the location of the R and Python interpreters.