© 2013 A. Haeberlen, Z. Ives
Welcome to CIS 455 / 555 – Internet and Web Systems
Zachary G. Ives
University of Pennsylvania
January 14, 2015
Dec 14, 2015
© 2011-14 A. Haeberlen, Z. Ives 2
What this Course Is About
• How do we build services like Google, Akamai, iTunes, Facebook, eBay, …?
• What are the principles behind them? (This is NOT a course on building Web sites! See CIS 450/550…)
• How do “cloud computing,” P2P, and Web services relate?
• The main themes of the course:
• Distributed systems concepts, with emphasis on data, scalability and interoperability (including “the cloud”)
• Data representation fundamentals, with emphasis on XML
• Information retrieval concepts, including ranking and indexing
• It’s a course that involves building software using the principles learned, evaluating it, and programming in teams
How Does this Relate to Other CIS Courses?
NETS 212
• Cloud service layers
• Key/value stores, in particular
• MapReduce, Spark, and data-parallel programming basics
CIS 450/550
• Data representation and management
• Relational querying with SQL; XML querying with XQuery
• DBMS-backed web sites
• 455/555 focuses on data with respect to interoperability
CIS 350/573: software engineering and mashups
CIS 505: focuses on distributed systems and algorithms
• CIS 505 is less project-oriented than CIS 555
• CIS 555 covers Web services, cloud architectures in more detail
Some Things We’ll Look at
• What are the principles behind building systems that work on the Internet?
• How do these relate to many of today’s hot technologies?
• Web servers, DHTML, Servlets, JSP, …
• XML
• Web services
• Peer-to-peer
• Application servers
• Cloud computing environments
• Content distribution networks
• Web search
• Mash-ups
• The cloud
• …
Staff
• Instructor: Zack Ives, zives@cis
• Office: 576 Levine North
• Office hours W 1:30-2:30 (and by arrangement)
• TAs:
• Avani Deshpande
• Akshay Hegde
• Mounica Maddela
• Shruthi Gorantala
• Shenga Ding
• Piazza: piazza.com/upenn/spring2015/cis455555
• Will have custom homework submission platform (coming soon)
Textbooks
• Distributed Systems: Principles and Paradigms, 2nd ed, Tanenbaum and van Steen
• We’ll read from the book ~50% of the time
• Frequent supplementary handouts
• Excerpts from several books
• Many recent research papers
• Your first one, which you should read by Wed: http://research.microsoft.com/en-us/um/people/blampson/33-Hints/Acrobat.pdf
(linked off the CIS 555 page)
Prerequisites, Workload, etc.
Necessary skills:
• Ability to code in Java: there is a substantial implementation project
• Good debugging skills – this will be the biggest time sink!
• The ability to work as a team with classmates (towards the end)
• A willingness to learn how to read API documentation
• Some exposure to threads and concurrent programming
• A willingness to “push the envelope”
Workload:
• Several programming/debugging-based homework assignments
• A substantial term project with experimental evaluation and a report
• Two midterms
Payoff:
• Lots of practical development and debugging experience
• A good working knowledge of the fundamentals behind scalable systems
• A working “academic clone of Google,” hosted on Amazon EC2!
WARNING: this course should be considered 1.5 CU!
A Disclaimer…
• This remains a “bleeding edge” course!
• Goal 0: an understanding of scalable distributed data-centric systems
• Goal 1: a look under the covers of today’s hottest topics – in lectures and in projects
• Goal 2: a level of comfort in managing large, complex software development with others’ code
• Part of this means doing a substantial implementation project
• As in the real world: learning APIs, dealing with inadequate tools
• Most of you will find this a struggle! You’ll spend many hours debugging!
• We will be using some immature technology
• Not everything will have been validated ahead of time
• We’ll do the best we can to smooth over the bugs!
• We hope it will be a fun course, though…
… And an interesting one!
What Exactly Is the Web?
• The Web consists of HTTP servers that publish HTML, XML, and a few other content types
• These are hyperlinked via URLs (a subset of URIs)
• Plus there are a huge number of web clients
• The Web is built on a number of Internet protocols:
• DNS, TCP, IP
• Other Internet services use other protocols
• SMTP, IMAP, POP, AIM, FTP, …
• Streaming media, music swapping protocols, …
• Web services and custom applications may also use HTTP in ways it wasn’t designed for
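To make the "URLs (a subset of URIs)" point concrete, here is a small sketch that splits a URL into its parts with the standard `java.net.URI` class. The class name `UrlPartsSketch`, the helper `describe`, and the example.org address are illustrative assumptions, not course material.

```java
import java.net.URI;

// Illustrative only: shows the pieces a web client extracts from a URL
// before it resolves the host (DNS) and opens a connection (TCP).
class UrlPartsSketch {
    // Formats the main components of a URL as "scheme|host|path|query".
    static String describe(String url) {
        URI uri = URI.create(url);
        return uri.getScheme() + "|" + uri.getHost() + "|" + uri.getPath() + "|" + uri.getQuery();
    }

    public static void main(String[] args) {
        // Hypothetical URL, used only as sample input.
        System.out.println(describe("http://www.example.org/courses/index.html?term=spring"));
    }
}
```

The host part is what DNS resolves; the path and query are what the HTTP server receives in the request.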
The Internet is Built in Layers
[Figure: protocol layers, listed here from bottom to top]
• Link: WiFi, ZigBee, Ethernet, WiMax
• IP: IPv4, IPv6; unicast, (multicast)
• Transport: TCP (session-based), UDP (sessionless)
• Session: SSH, FTP, HTTP, IM, P2P, …; lightweight streaming, etc.
• Middleware: Web services, distributed transactions, …
• Your Application: …
What Is an Internet System?
• Not just a web server or web application…
• An application built over the Internet, whose functionality is distributed across more than one machine
• Typically, at least in a client-server or server-to-server fashion, but may have many more participants
• Typically, data and/or code must be exchanged in distributed fashion for the functioning of the application
• Often, the data must be partitioned, replicated, translated, etc. (“shards” in Google-speak)
• Often, the code is written in multiple different environments, languages, etc.
• Often, there are concerns about handling failures, firewalls, attacks, …
Why Are Internet System Topics Interesting?
• Understanding what’s underneath today’s Web
• How does it work?
• What are its shortcomings?
• What are its strengths?
• Understanding distributed algorithms
• Using the right approach when designing new protocols and web systems
• Being able to anticipate what’s actually possible in the future
Example: Web Search, a Cloud Service
[Figure: architecture of a web search service]
• Components: clients, Search Interface Servers, Index Servers, Crawlers, Web pages
• Flow labels: queries (HTML forms) and results; query and results; pages; keywords + locations
• Annotation: Uses a model of document/word similarity to rank matches
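The "keywords + locations" handed to the index servers can be pictured as an inverted index. Below is a minimal sketch, assuming whitespace tokenization and hypothetical names (`InvertedIndexSketch`, `addPage`, `lookup`); it omits the similarity-based ranking mentioned above and is not the design used in the course project.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Illustrative only: maps each keyword to the set of page URLs that contain it.
class InvertedIndexSketch {
    private final Map<String, Set<String>> postings = new HashMap<>();

    // Called once per crawled page (hypothetical crawler hand-off).
    void addPage(String url, String text) {
        for (String token : text.toLowerCase().split("\\s+")) {
            if (token.isEmpty()) continue; // leading whitespace yields an empty token
            postings.computeIfAbsent(token, k -> new TreeSet<>()).add(url);
        }
    }

    // Returns the URLs of pages containing the keyword (empty set if none).
    Set<String> lookup(String keyword) {
        return postings.getOrDefault(keyword.toLowerCase(), Collections.emptySet());
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.addPage("http://example.org/a", "distributed systems and the web");
        index.addPage("http://example.org/b", "web search and ranking");
        System.out.println(index.lookup("web"));
    }
}
```

In a real deployment the postings would be partitioned across many index servers; this sketch keeps everything in one in-memory map.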
Example: Social Networking (Facebook / Twitter), a Cloud Service
[Figure: architecture of a social networking service]
• Components: clients, User Page Servers, Recommender, Users & entities
• Flow labels: clicks; updates, posts; pages & notifications; suggestions; common properties, usage logs, …
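Example: Enterprise (or Web) Information Integration
[Figure: a mediator system between clients and heterogeneous sources]
• Components: clients, Mediator System, XML sources, Relational sources, HTML sources
• Client flow labels: queries; results in “mediated schema”
• Source flow labels: XQuery + XPath over XML, returning XML; SQL over ODBC, returning results; HTTP POST, returning HTML
• Annotation: Maps all data into a single format and virtual schema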
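Example: SETI@home
[Figure: problem partitioning and data aggregation around a set of clients]
• Components: Problem Partitioning, clients, Data Aggregation
• Flow labels: New sub-problems; Computed subresults
• Annotation: Breaks computation into many parts and distributes them to the clients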
Example: P2P File Sharing
[Figure: four clients exchanging requests and data directly with one another]
• Components: clients (peers)
• Flow labels: request; data
• Annotation: Processes name-based requests for data; each node can make requests, forward requests, return data
What are the Hard Problems?
• Disclaimer: most of the hard problems AREN’T solved (or solvable) – and there often isn’t any single BEST solution
• Much of systems design is about finding the right compromise for each specific problem
• We can divide them into:
• Scalability
• Availability / reliability
• Consistency
• Interoperability
• Location and resource discovery
Scalability
• How do we support a large number of clients or requests?
• Distribute work!
• Challenges:
• Coordination – takes significant overhead in the general case
• Load balancing – avoid having bottlenecks
• Parts of the solution:
• Client-server, multi-tier, P2P architectures
• Restricted programming models, e.g., MapReduce
• Data partitioning, replication, remote procedure calls, …
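As a rough picture of a restricted programming model such as MapReduce, the sketch below counts words in the map/shuffle/reduce style. It runs in a single process using Java streams, so it only illustrates the shape of the computation; the class name `WordCountSketch` and the whitespace tokenization are assumptions made for this example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Illustrative only: word count expressed as map -> group by key -> reduce.
class WordCountSketch {
    static Map<String, Long> wordCount(List<String> documents) {
        return documents.stream()
            // "map" phase: each document emits its words
            .flatMap(doc -> Arrays.stream(doc.toLowerCase().split("\\s+")))
            .filter(word -> !word.isEmpty())
            // "shuffle" and "reduce": group equal words, then count each group
            .collect(Collectors.groupingBy(word -> word, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        System.out.println(wordCount(Arrays.asList("the web is big", "the cloud is bigger")));
    }
}
```

Because the programmer only supplies the per-record and per-key logic, a framework is free to spread the work across machines, which is the point of restricting the model.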
Availability/Reliability
• How do we ensure the system is “up” when we want it to be, and doing the “right” thing?
• Replication and redundancy
• Security measures against attacks
• Ability to undo/redo
• Challenges:
• Keeping things consistent
• Performance vs. security
• Acknowledgments
• Parts of the solution:
• Data partitioning, replication, …
• Logging, transactions, …
• Redundant hardware, multiple sites, …
• Quorum and consensus algorithms
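One common way to reason about quorums over replicated data: with N replicas, if every write reaches W of them and every read consults R of them, then R + W > N guarantees that each read overlaps the latest write. The sketch below only encodes that arithmetic; the names `QuorumSketch`, `quorumsOverlap`, and `majority` are hypothetical.

```java
// Illustrative only: the arithmetic behind read/write quorum sizes.
class QuorumSketch {
    // True if any read quorum of size r must share a replica with any write quorum of size w.
    static boolean quorumsOverlap(int n, int r, int w) {
        return r + w > n;
    }

    // Smallest group size such that any two groups of that size share a replica.
    static int majority(int n) {
        return n / 2 + 1;
    }

    public static void main(String[] args) {
        System.out.println(quorumsOverlap(3, 2, 2)); // 2 + 2 > 3
        System.out.println(majority(5));
    }
}
```

Smaller R makes reads cheaper at the cost of larger W, and vice versa, which is one instance of the trade-offs listed above.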
Consistency / Consensus
• Replication, distribution, and failures make it difficult to keep a unified, consistent view of the world – how do we combat this?
• Locking, concurrency control, and invalidation schemes
• Clock synchronization
• Challenges:
• Locking has huge performance overhead
• Network partitions, disconnected operation
• Parts of the solution:
• Optimistic concurrency control, 2-phase locking
• Distributed clock sync
• Conflict resolvers
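To illustrate the optimistic alternative to locking, here is a single-machine sketch: read a versioned snapshot without a lock, compute the update, and commit only if nobody else committed in the meantime, retrying otherwise. It uses `AtomicReference.compareAndSet` from the Java standard library; the class `OptimisticCounterSketch` and its methods are assumptions for illustration, and a distributed system would need more than this.

```java
import java.util.concurrent.atomic.AtomicReference;

// Illustrative only: optimistic concurrency control on one in-memory value.
class OptimisticCounterSketch {
    // Immutable snapshot of the state: a value plus the version that produced it.
    private static final class Versioned {
        final int version;
        final long value;
        Versioned(int version, long value) {
            this.version = version;
            this.value = value;
        }
    }

    private final AtomicReference<Versioned> state = new AtomicReference<>(new Versioned(0, 0L));

    // Adds delta without locking; retries if a concurrent writer committed first.
    long add(long delta) {
        while (true) {
            Versioned seen = state.get();                                        // read phase
            Versioned next = new Versioned(seen.version + 1, seen.value + delta); // compute
            if (state.compareAndSet(seen, next)) {                               // validate and commit
                return next.value;
            }
            // validation failed: another update won, so loop and try again
        }
    }

    long value() { return state.get().value; }

    int version() { return state.get().version; }
}
```

The approach pays off when conflicts are rare; under heavy contention the retries can cost more than a lock would.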
Interoperability
• How do we coordinate the efforts of components that have different data formats and/or source languages, and are on different machines?
• Standardization!
• Challenges:
• Everything has different semantics!
• Parts of the solution:
• Standard data formats: XML, XML schemas
• “Schema mediation” and data translation
• Remote procedure calls: CORBA, XML-RPC, …
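As a small illustration of why a standard data format helps, the sketch below reads a field out of an XML message using the JDK's built-in DOM parser. The element name `title`, the sample message, and the class `XmlExchangeSketch` are made up for this example; any component that can parse XML could consume the same message regardless of the language it was written in.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Illustrative only: extracts the text of the first <title> element from an XML string.
class XmlExchangeSketch {
    static String titleOf(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            return doc.getElementsByTagName("title").item(0).getTextContent();
        } catch (Exception e) {
            // Covers malformed XML as well as a missing <title> element.
            throw new IllegalArgumentException("could not read <title> from message", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(titleOf("<course><title>Internet and Web Systems</title></course>"));
    }
}
```

Agreeing on the syntax does not settle what a `title` means to each party, which is the semantics challenge noted above.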
Location & Resource Discovery
• How do you find what you’re looking for?
• Naming
• Declarative queries over standard schemas
• Advertisements
• Challenges:
• Naming has implicit semantics
• What do you do when you don’t know what to call something?
• Parts of the solution:
• Directory systems – DNS, LDAP, etc.
• Resource discovery and advertising protocols
• Overlay networks, sharding schemes
• Standardized schemas
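One sharding scheme often used in overlay networks is consistent hashing: nodes and keys are hashed onto the same ring, and a key belongs to the first node at or after its position. The sketch below is a simplified version under stated assumptions: it uses `String.hashCode()` as the hash (real systems typically use a stronger hash and several positions per node), and the names `HashRingSketch`, `addNode`, `removeNode`, and `nodeFor` are hypothetical.

```java
import java.util.TreeMap;

// Illustrative only: a minimal consistent-hashing ring.
class HashRingSketch {
    // Ring position -> node name, kept sorted by position.
    private final TreeMap<Integer, String> ring = new TreeMap<>();

    void addNode(String node) {
        ring.put(node.hashCode(), node);
    }

    void removeNode(String node) {
        ring.remove(node.hashCode());
    }

    // Finds the node responsible for a key: the first node clockwise from the key's position.
    String nodeFor(String key) {
        if (ring.isEmpty()) {
            throw new IllegalStateException("no nodes in the ring");
        }
        Integer position = ring.ceilingKey(key.hashCode());
        // Past the last node: wrap around to the first one.
        return ring.get(position != null ? position : ring.firstKey());
    }
}
```

When a node leaves, only the keys it owned move to its successor; keys owned by other nodes stay where they were.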
Our First Focus: Single Machines, aka Servers
• How do you handle large numbers of concurrent users?
• Processes
• Threads
• Events
• Hybrids (e.g., thread pools)
• Staged architectures
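The thread-pool hybrid can be sketched with the standard `java.util.concurrent` executors: a fixed number of worker threads serves a queue of requests, so concurrency is bounded no matter how many requests arrive. The requests here are plain strings and the class `ThreadPoolSketch` with its `handleAll` method is an assumption for illustration; a real server would read requests from sockets.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Illustrative only: a bounded pool of workers handling queued requests.
class ThreadPoolSketch {
    // Handles every request on the pool and returns responses in request order.
    static List<String> handleAll(List<String> requests, int poolSize) {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        try {
            List<Future<String>> pending = new ArrayList<>();
            for (String request : requests) {
                // Each task stands in for parsing a request and building a response.
                pending.add(pool.submit(() -> "handled " + request));
            }
            List<String> responses = new ArrayList<>();
            for (Future<String> future : pending) {
                responses.add(future.get()); // blocks until that task finishes
            }
            return responses;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(handleAll(Arrays.asList("GET /a", "GET /b", "GET /c"), 2));
    }
}
```

Compared with one thread per request, the pool caps memory and scheduling overhead; compared with a pure event loop, each handler can still be written as straight-line blocking code.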
Next Time…
• We’ll look under the covers of an HTTP server
• Key ideas in building scalable systems
• Principles of HTTP and web servers
• Management of concurrent sessions
• To read by next Wednesday:
• Lampson and Saltzer paper: http://research.microsoft.com/en-us/um/people/blampson/33-Hints/Acrobat.pdf
• Tanenbaum Ch. 3.1
• If necessary: Review Tanenbaum “Modern OS,” Ch. 2.3 or a similar OS book on interprocess communication