UnConference for Georgia Southern Computer Science March 31, 2015

Jul 19, 2015

Unconference

Topics

Agile in the real world
Career
Kafka
DevOps
Cloud and aaS
NoSQL: MongoDB
NoSQL: Redis
BloomFilters
Speed Round (5 topics in 8 minutes)
Big Data (Hadoop, Spark)
Actor Systems (Akka)
Streaming (Storm, InfoSphere Streams)

Agile

Scrum, Kanban

Processes for managing work
Team is not just software engineers
− QA, test automation
− Product Analyst
− Documentation
− Production/Operations Engineers

Co-location

Kanban

Limits the amount of work the team has in progress
Visually displays what is being worked on and what is waiting
Some online tools, but teams usually work on the walls

Video

Kafka

Message Queues

What if you had billions of messages each day? And couldn't lose any of them?
'Current' technology is a queue, held in memory or in an RDBMS
Lots of problems with holding in memory
Lots of RDBMS activity
Broker needs to know who consumed a message so it can delete it

Point to point integration

What we'd really like

Changing the Paradigm

Kafka uses disk as the primary store
Kernel level writes directly to disk controller = FAST
Kafka doesn't care if anyone consumed an event
Consumers ask for a specific offset in a topic
Consumers can listen for new events
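
The offset idea on this slide can be sketched as a toy append-only log. This is only an in-memory illustration under stated assumptions, not Kafka's API (and Kafka keeps the log on disk): the `CommitLog` class and its `append`/`read` methods are hypothetical names invented for the example.

```python
class CommitLog:
    """Toy append-only log. It never tracks who has read which event."""

    def __init__(self):
        self._events = []

    def append(self, event):
        """Append an event and return the offset it was stored at."""
        self._events.append(event)
        return len(self._events) - 1

    def read(self, offset, max_events=100):
        """Return events starting at `offset`; each consumer owns its position."""
        return self._events[offset:offset + max_events]


log = CommitLog()
for purchase in ["a", "b", "c"]:
    log.append(purchase)

# Two consumers at different positions; neither one changes the log.
fast_batch = log.read(2)   # a consumer that is almost caught up
slow_batch = log.read(0)   # a consumer that is far behind
replay = log.read(0)       # replay from the start for support or debugging
```

Because the log does not delete anything when an event is read, a slow consumer and a replay see the same events, which is what makes the "different times/rates" and "replay" use cases cheap.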

Cool Tech: Commit Logs

Use Cases

Huge volumes of input data > memory
Consumers at different times/rates
Replay for support, developers

DevOps

Old Way *

Someone gives you a 2” thick specification
You code it for 6 months
Test it for 3 months
Give it to operations to run
Get bug reports
Fix them, deploy them monthly
*unfortunately a lot of people still do this

Old Way - Operations

Manually build a server
Hope you got everything on it
Deploy the software, hope you got everything
Manually test that the software works
If the server fails, repeat
Upgrades take hours

DevOps and Agile Way

Customer presents stories to team
Team asks questions, scopes and defines tests
Team develops code, tests, talks with client
Production engineers learn what team is building, what tech etc.
Continuous Integration builds make sure it works
Minimum Viable Product is produced
Operations runs a command, code is deployed

DevOps - Operations

Template for a server is defined (OS, software etc.)
Template for software release is defined (versions, dependencies, automated tests)
Deployment:
− Stock OS server is booted
− Deployment command is run
− UrbanCode/Chef downloads server templates, application templates and automated tests. Runs all of them, tells operations it is ready.
Server failure? Boot new machine and repeat
Scale? Boot new machine and repeat

DevOps Software

Lots of features added to a release to support it in production
Alerts, monitoring, metrics
Automated testing
Interfaces for 'probing' what is going on inside
New Tech is vetted with operations during development, not 'throw it over the wall'

Cloud and aaS

Cloud

Most hyped technology in the last 10 years (Big Data is a close 2nd)
Basic idea: someone deploys thousands (millions in Amazon's case) of servers and makes it easy for you to use them
No capital costs. Pay for what you use
Infinite* capacity if you built it right
Vendors are continually adding new features

Why?

$0.20 an hour servers
IT doesn't know about it ...
Site gone viral? Spin up 100 more app servers, bigger database server (see DevOps discussion)
No capital up front. $100/month until the hockey stick, then revenue covers usage

As a Service (*aaS)

Business model for making $$ on the cloud
Offer a service to business at a cost less than if they did it themselves
IaaS – Infrastructure as a Service (Amazon, SoftLayer)
PaaS – Platform as a Service (BlueMix, Azure, AppEngine)
SaaS – Software as a Service (Silverpop, Hotmail, Gmail, LinkedIn)

AaS terminology

Virtual Machine
Software Defined Networking
Multi-tenancy
Scalability, Redundancy and Disaster Recovery
Netflix

Tips

See how far you can get with PaaS, but be wary of vendor-specific features

Look for standard databases (MySQL, MongoDB, Redis, SQL Server) offered as PaaS before you stand up your own
Docker
Shut it down when you aren't using it!

NoSQL: MongoDB

Document Databases

Schema-less
Store JSON, search by JSON, retrieve JSON
Document is the transaction layer
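
"Store JSON, search by JSON, retrieve JSON" can be illustrated with a tiny query-by-example matcher. This is a sketch of the idea only, not MongoDB's API (real document queries support far more than equality); the `matches` and `find` helpers and the sample documents are hypothetical.

```python
import json


def matches(document, query):
    """True if every field in `query` has the same value in `document`."""
    return all(document.get(key) == value for key, value in query.items())


def find(collection, query):
    """Return the documents that match the query document."""
    return [doc for doc in collection if matches(doc, query)]


collection = [
    json.loads('{"name": "Ada", "major": "CS", "year": 3}'),
    json.loads('{"name": "Bob", "major": "Math"}'),  # no "year" field: schema-less
    json.loads('{"name": "Cy", "major": "CS", "year": 1}'),
]

cs_students = find(collection, {"major": "CS"})
```

Note that the query is itself a JSON-shaped document, and that documents in the same collection do not have to share the same fields.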

Sharding & Partitioning

Lots of scale problems are addressed by splitting the data

Partitioning: the data is grouped by a key in the data. The location is usually fixed (or takes a lot to move).
Sharding: the data is grouped by a key, but the location can change.
Example: shard on event timestamp.
− Partition: all 12/2015 events go onto the same disk
− Shard: all 12/2015 events could start on the same disk, but be divided into smaller date ranges as volume increases
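
The difference between the two can be sketched in a few lines. This is an illustration under assumptions, not how any particular database does it: timestamps are ISO date strings (which sort correctly as text), and `partition_for`, `ShardMap`, and the node names are hypothetical.

```python
from bisect import bisect_right


def partition_for(timestamp):
    """Fixed partitioning: the month alone decides the location."""
    return timestamp[:7]  # "2015-12-14" -> "2015-12"


class ShardMap:
    """Range sharding: boundaries can be added as volume grows."""

    def __init__(self):
        self._starts = [""]        # sorted lower bounds; "" catches everything
        self._nodes = ["node-0"]

    def split(self, start, node):
        """Start a new shard at `start`, served by `node`."""
        index = bisect_right(self._starts, start)
        self._starts.insert(index, start)
        self._nodes.insert(index, node)

    def node_for(self, timestamp):
        """Find the shard whose range contains `timestamp`."""
        return self._nodes[bisect_right(self._starts, timestamp) - 1]


shards = ShardMap()
before = shards.node_for("2015-12-20")
shards.split("2015-12-16", "node-1")  # December got busy: split it in two
after_early = shards.node_for("2015-12-03")
after_late = shards.node_for("2015-12-20")
```

A real system would also have to move the existing data for the new range; the sketch only shows that the key-to-location mapping can change without changing the key.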

“Internet Scale”

Adding new nodes is easy
Shards 'move' based on load
Perfect for event storage with no changes or only in-place changes

But ...

Not for documents that grow
JSON is tricky to query
Schema-less sounds great, until you have to manage a database someone else built
No substitute for good design

NoSQL: Redis

Data Structure Server

CS 101 data structures: list, set, string
CS 101 operations: push, pop, add, delete
Very, very fast. Very simple
100% in memory
Single threaded
Lots of people use it as an application cache
Big O notation documented for every operation

So?

Lets you build shared state across lots of consumers

Pushes operations to the server, so they are atomic

For example: set the 1000th bit on a bit array and increment a population counter, without pulling the data back to the client
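
To make the "operation runs on the server, so it is atomic" point concrete, here is an in-process stand-in. It is not Redis and not a Redis client: `TinyServer` and `set_bit_and_count` are hypothetical, and a lock plays the role of the single server thread that runs one command at a time. (Redis itself has `SETBIT` and `BITCOUNT` commands for bit arrays; how to combine them is left out here.)

```python
import threading


class TinyServer:
    """In-process stand-in for a single-threaded data structure server."""

    def __init__(self, size_bits):
        self._bits = bytearray((size_bits + 7) // 8)
        self._population = 0
        self._lock = threading.Lock()  # one command at a time

    def set_bit_and_count(self, index):
        """Set bit `index`; bump the counter only if the bit was 0 before."""
        with self._lock:
            byte, mask = index // 8, 1 << (index % 8)
            if not self._bits[byte] & mask:
                self._bits[byte] |= mask
                self._population += 1
            return self._population


server = TinyServer(size_bits=4096)

# Fifty clients all report the same visitor at once.
clients = [
    threading.Thread(target=server.set_bit_and_count, args=(1000,))
    for _ in range(50)
]
for client in clients:
    client.start()
for client in clients:
    client.join()

population = server.set_bit_and_count(1000)
```

If each client instead read the bit array, changed it locally, and wrote it back, two clients could both see the bit as unset and both increment the counter.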

Use Cases

BloomFilters
Work queue for thousands of worker threads (Akka) without continual database polling
Cache of web sessions
“For the cost of an Oracle Enterprise License I can give you 64 cores and 3 TB of memory”

BloomFilters

Bloom Filters

From Wikipedia (Don't tell my kid's teacher!)

"A Bloom filter is a space-efficient probabilistic data structure, conceived by Burton Howard Bloom in 1970, that is used to test whether an element is a member of a set. False positive matches are possible, but false negatives are not, thus a Bloom filter has a 100% recall rate"

Hashing

Apply 'x' hash functions to the key to be stored/queried

Each function returns a bit to set in the bitset

Mathematical equations to determine how big to make the bitset, how many functions to use and your acceptable error level

http://hur.st/bloomfilter?n=4&p=1.0E-20
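
A minimal Bloom filter sketch, assuming the standard sizing formulas m = -n ln(p) / (ln 2)^2 for the number of bits and k = (m / n) ln 2 for the number of hash functions. Deriving the k positions from two halves of one SHA-256 digest is a choice made for this example, not something the slides prescribe; the class and method names are hypothetical.

```python
import hashlib
import math


class BloomFilter:
    def __init__(self, expected_items, false_positive_rate):
        # m = -n * ln(p) / (ln 2)^2 bits, k = (m / n) * ln 2 hash functions
        self.size = math.ceil(
            -expected_items * math.log(false_positive_rate) / math.log(2) ** 2
        )
        self.hash_count = max(1, round(self.size / expected_items * math.log(2)))
        self.bits = bytearray((self.size + 7) // 8)

    def _positions(self, key):
        """Derive k bit positions for `key` from one SHA-256 digest."""
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.size for i in range(self.hash_count)]

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        """False means definitely absent; True means probably present."""
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key)
        )


visitors = BloomFilter(expected_items=1000, false_positive_rate=0.01)
for user_id in range(1000):
    visitors.add(f"user-{user_id}")
```

With 1000 expected items and a 1% error target this allocates 9586 bits (about 1.2 KB) and uses 7 hash positions per key, far less than storing the keys themselves.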

Example

False Positives

Perfect hash functions aren't worth the cost to develop

Sometimes existing bits for a key are set by many other keys

Make sure you understand the business impact of a false positive

Remember, never a false negative

Question

How would you keep track of the number of unique visitors to a website today?

What if I wanted to know if a specific user had visited today?

SQL Query? List of visitors in memory?

Examples

Redis-backed
− Unique visitor counts this week
− White lists/black lists
− What ad to show a visitor? (hash cookie id)
− Client side joins (time based population counts)
Databases
− Is this key in the page?
Fail Fast

Speed Round

Things we could talk about for hours. In minutes

Version Control

Git is free. Use it
Local repository, keep copies of all your work
Very helpful when you're working on a project and need to revert
Plus a good thing to talk about in an interview

Google Docs

Huh?
Get a Gmail account (or custom domain, more about this later)
Use Gmail's 10 GB of free storage to put all your class assignments.
No risk of losing school work. Can access from any computer

IntelliJ

For the Java Users
Eclipse is 'difficult'. Sorry, but it is
(So is vi, don't get me started on emacs)
Community edition is free
Syntax highlighting, multiple language support

Clustering

Technology for building a set of processes that work together

Automatically detects failed processes (and relaunches them)
Keeps your service available 24x7
ZooKeeper, Mesos are common technologies
If done right, failure of a server isn't a 'wake up the team!' event.

Replication

Usually requires Clustering
One process is the 'master' of the data or processing.
One or more 'slaves' silently listen to updates/actions of the master
If the master dies, Clustering will promote a slave to master and the processing continues
Sometimes you can read from slaves for read performance increases since they should be identical (or nearly) to the master
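
The master/slave flow above can be sketched as follows. It is a deliberately simplified model under assumptions: replication is synchronous and in-process, failure detection is left out, and the `Replica` and `Cluster` classes are hypothetical names, not any product's API.

```python
class Replica:
    def __init__(self, name):
        self.name = name
        self.data = {}


class Cluster:
    def __init__(self, names):
        replicas = [Replica(name) for name in names]
        self.master, self.slaves = replicas[0], replicas[1:]

    def write(self, key, value):
        """Writes go to the master; slaves apply the same update."""
        self.master.data[key] = value
        for slave in self.slaves:
            slave.data[key] = value

    def read_from_slave(self, key, index=0):
        """Reads can be served by a slave to take load off the master."""
        return self.slaves[index].data.get(key)

    def fail_over(self):
        """The master died: promote the first slave and carry on."""
        self.master = self.slaves.pop(0)


cluster = Cluster(["a", "b", "c"])
cluster.write("balance", 42)
read_before = cluster.read_from_slave("balance")
cluster.fail_over()
cluster.write("balance", 43)
```

In practice the slaves usually lag the master slightly, which is why the slide says "identical (or nearly)".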

Bonus: Online Resources

Let me Google that for you
HighScalability.com
StackOverflow.com
Infoq.com
Feedly.com – RSS isn't dead yet

Big Data

What?

2nd most hyped concept in the last 5 years (Cloud is #1)

Basically, we generate too much data
Companies are afraid to throw it away
But don't know what to do with it, or if it is even valuable

How to store it?

Often unstructured
Lots and lots of it (sensor logs)
Can't store it on one disk, can't risk losing it if that disk is lost
Can't back it up
HDFS – Hadoop Distributed File System
Distributed, multiple copies, multiple servers

How to access it?

Not in a RDBMS
Not in a typical NoSQL (Redis, MongoDB etc.)
Hadoop is one common tool for making sense out of it
Still need to write code (Cascading is my favorite)

Map/Reduce
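
The content of this slide did not survive in the transcript, so as a stand-in here is the classic word-count illustration of the map, shuffle, and reduce steps. It runs in one process and is not Hadoop code; the function names and input lines are hypothetical.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)


lines = ["big data big hype", "data at rest"]
mapped = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
```

In a real cluster each map call can run on the machine that holds that piece of the input, and each reduce call can run on a different machine, because no step depends on shared state.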

Other Approaches

Spark
− Actor based approach to processing data
− Uses HDFS, but holds results in memory as long as possible
− Still need to write code
− Very, very fast
HBase
− Non-relational database with columns and rows
− Most JDBC drivers can talk to it

Careers

Data Scientist
− Apply knowledge of statistics and algorithms to the big data
− Find actionable insights in the data to drive the business
− Build predictive models based on data to determine what might happen next
− Both coding and math skills

Actor Systems

Your phone is more powerful than the Space Shuttle's computers

Typical phone has 2 cores (some 4)
Typical laptop has 4 cores
Typical server has 8 cores
Being able to do things in parallel is necessary to scale

Multi-threading is really hard

Books written about it
Courses taught about it
Java got it wrong TWICE so far
Concurrency bugs are very difficult to debug and fix
Servers are so cheap now, concurrency across machines is very common

Actors

So rather than writing the code yourself, why not rely on a concurrency model?

Actors are called with a piece of work to do, respond with an answer

All the threading, clustering, and messaging is handled by the actor framework.

Akka for Java. Scala Actors (or Akka for Scala)

Use case

Redis list with all the tasks to be performed
Akka actor tells 'manager' it is ready for work
Manager pops item from Redis into a key specific for that worker and tells the worker what to do.

When done, the worker tells the manager and the key is removed (or more tasks are created)

Scales to thousands of machines, 10 threads per machine
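
The pull-based flow above can be sketched with threads standing in for actors. This is a simplification under stated assumptions: a `queue.Queue` stands in for the Redis list, a dict stands in for the per-worker keys, workers pull tasks directly instead of asking a manager, and squaring a number stands in for the real work. None of this is Akka or Redis code.

```python
import queue
import threading

tasks = queue.Queue()   # stand-in for the Redis list of pending tasks
in_progress = {}        # stand-in for the per-worker keys
results = []
lock = threading.Lock()


def worker(worker_id):
    while True:
        try:
            task = tasks.get_nowait()      # "I'm ready": pull the next task
        except queue.Empty:
            return                         # nothing left to do
        with lock:
            in_progress[worker_id] = task  # record who holds which task
        outcome = task * task              # the actual work
        with lock:
            results.append(outcome)
            del in_progress[worker_id]     # done: remove the worker's key


for n in range(20):
    tasks.put(n)

workers = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for w in workers:
    w.start()
for w in workers:
    w.join()
```

Because workers ask for work when they are ready, a slow worker simply takes fewer tasks, and nobody has to poll a database to find out what to do next.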

Streaming

Real time and over time processing

Some events need to be responded to immediately

Some events need to be compared to other events in the last 'x' time periods to understand what to do about them

Some events can be discarded quickly, but need to evaluate them to determine that

Used to be $$$$ to use

Storm changed that
Open Source Event Processing System used by Twitter
You write code to evaluate events as they occur, not to query a set of events at rest
Not 'everyone in the last hour who bought more than $100'
Rather 'this person just bought >$100' so do something. Don't wait
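
A sketch of "evaluate events as they occur", including a comparison against recent events in a time window. It is plain Python, not Storm's API; the `on_purchase` handler, the one-hour window, and the $100 and $500 thresholds are hypothetical values chosen for the example.

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 3600
recent = defaultdict(deque)  # customer -> (timestamp, amount) pairs in the window


def on_purchase(customer, timestamp, amount):
    """Handle one event as it arrives; return the actions to take right now."""
    actions = []
    if amount > 100:
        actions.append(f"big-purchase:{customer}")

    window = recent[customer]
    window.append((timestamp, amount))
    while window and window[0][0] <= timestamp - WINDOW_SECONDS:
        window.popleft()  # forget events older than the window

    if sum(value for _, value in window) > 500:
        actions.append(f"high-hourly-spend:{customer}")
    return actions


stream = [("ann", 0, 40), ("bob", 10, 150), ("ann", 1800, 480), ("ann", 7200, 30)]
emitted = [on_purchase(*event) for event in stream]
```

The handler keeps only the small amount of state it needs per customer, so it can react to each event immediately instead of querying a stored set of events later.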

Use Cases

Combined with Kafka for event ingestion
Every purchase request, check if the geographic coordinates match the last 5 purchases, or their home zip code + 5 miles

Every download of a white paper, see if they are a sales prospect and alert the salesperson the customer is on the site and interested

Careers

Security and anti-fraud
Online marketing
Securities Trading (Wall Street)

Career

Managing your career before you have one

LinkedIn

Get an account
This is not Facebook or Instagram
Who would you want to work with in 5 years? Who wouldn't you?
Stay in touch with Intern/co-op co-workers
Lots of useful information, follow 'smart' people

Internships

Get one, even on campus
Use LinkedIn to search for them
Nepotism is okay!

Open Source

Find an interesting project
Does it have a 'startup guide'? Or a poorly written one?
Get an example working and submit a documentation update
Another thing to talk about during an interview

Study Groups

Get several!
No, not organized cheating
What you understand someone else doesn't
Eventually you won't be the smartest in a group
Interview discussion topic

Resumes

Gold Award? Eagle Scout?
List courses in major
Describe your impact on the business (even if a cook or waiting tables)
One page. Really.
Vanity Emails – time to get rid of them
− Seriously consider a Google-hosted domain ($50/year)

Interviews

Prepare, prepare, prepare
Details about course work
How will your internship help the company?
Test First Development during coding examples
Ask questions
Be ready: What else do you do for fun?

Life

Have one outside of computers
Have non-CS friends
Get a hobby
Exercise (Google 'The Hacker's Diet')
Why? Burn out

Other things

Learn about Mobile

Learn about JavaScript client libraries (jQuery, AngularJS etc.) even if you want to do backend work

Learn about security, both building secure software and the tools for testing it

Data Science has a lot of buzz (Learn R, Predictive Analytics)

Internet of Things (IoT)

Finally

Never stop asking why?

Never stop taking things apart

Never stop listening and learning
