NetWork

Introduction to Computer Networks

Slides courtesy: T. S. Eugene Ng

Organizing Network Functionality

• Many kinds of networking functionality
  – e.g., encoding, framing, routing, addressing, reliability, etc.

• Many different network styles and technologies
  – circuit-switched vs packet-switched, etc.
  – wireless vs wired vs optical, etc.

• Many different applications
  – ftp, email, web, P2P, etc.

• Network architecture
  – How should different pieces be organized?
  – How should different pieces interact?

Problem

• A new application has to interface to all existing media
  – adding a new application requires O(m) work, m = number of media

• A new medium requires all existing applications to be modified
  – adding a new medium requires O(a) work, a = number of applications

• Total work in the system is O(ma); eventually it is too much work to add apps/media

• Application end points may not be on the same media!

[Figure: applications (SMTP, SSH, FTP) interfacing directly with each transmission medium (packet radio, coaxial cable, fiber optic)]

Solution: Indirection

• Solution: introduce an intermediate layer that provides a single abstraction for various network technologies
  – O(1) work to add app/media
  – Indirection is an often used technique in computer science

[Figure: applications (SMTP, SSH, NFS) and transmission media (802.11 LAN, coaxial cable, fiber optic) connected through a single intermediate layer]

Network Architecture

• Architecture is not the implementation itself

• Architecture is how to “organize” implementations
  – what interfaces are supported
  – where functionality is implemented

• Architecture is the modular design of the network

Software Modularity

Break system into modules:

• Well-defined interfaces give flexibility
  – can change implementation of modules
  – can extend functionality of system by adding new modules

• Interfaces hide information
  – allows for flexibility
  – but can hurt performance

Network Modularity

Like software modularity, but with a twist:

• Implementation distributed across routers and hosts

• Must decide both:
  – how to break system into modules
  – where modules are implemented

Outline

• Layering
  – how to break network functionality into modules

• The End-to-End Argument
  – where to implement functionality

Layering

• Layering is a particular form of modularization

• The system is broken into a vertical hierarchy of logically distinct entities (layers)

• The service provided by one layer is based solely on the service provided by layer below

• Rigid structure: easy reuse, performance suffers

ISO OSI Reference Model

• ISO – International Organization for Standardization
• OSI – Open Systems Interconnection
• Goal: a general open standard
  – allow vendors to enter the market by using their own implementation and protocols

ISO OSI Reference Model

• Seven layers
  – Lower two layers are peer-to-peer
  – Network layer involves multiple switches
  – Next four layers are end-to-end

[Figure: Host 1 and Host 2 each run the seven layers (Application, Presentation, Session, Transport, Network, Datalink, Physical); the intermediate switch runs Network, Datalink, and Physical, and joins physical medium A to physical medium B]

Layering Solves Problem

• Application layer doesn’t know about anything below the presentation layer, etc.

• Information about network is hidden from higher layers

• This ensures that we only need to implement an application once!

Key Concepts

• Service – says what a layer does
  – Ethernet: unreliable subnet unicast/multicast/broadcast datagram service
  – IP: unreliable end-to-end unicast datagram service
  – TCP: reliable end-to-end bi-directional byte stream service
  – Guaranteed bandwidth/latency unicast service

• Service Interface – says how to access the service
  – E.g. UNIX socket interface

• Protocol – says how the service is implemented
  – a set of rules and formats that govern the communication between two peers

Physical Layer (1)

• Service: move information between two systems connected by a physical link

• Interface: specifies how to send a bit

• Protocol: coding scheme used to represent a bit, voltage levels, duration of a bit

• Examples: coaxial cable, optical fiber links; transmitters, receivers

Datalink Layer (2)

• Service:
  – framing (attach frame separators)
  – send data frames between peers
  – others:
    • arbitrate the access to common physical media
    • per-hop reliable transmission
    • per-hop flow control

• Interface: send a data unit (packet) to a machine connected to the same physical media

• Protocol: layer addresses, implement Medium Access Control (MAC) (e.g., CSMA/CD)…

Network Layer (3)

• Service:
  – deliver a packet to specified network destination
  – perform segmentation/reassembly
  – others:
    • packet scheduling
    • buffer management

• Interface: send a packet to a specified destination

• Protocol: define globally unique addresses; construct routing tables

Transport Layer (4)

• Service:
  – Multiplexing/demultiplexing
  – optional: error-free and flow-controlled delivery

• Interface: send message to specific destination

• Protocol: implements reliability and flow control

• Examples: TCP and UDP

Session Layer (5)

• Service:
  – full-duplex
  – access management (e.g., token control)
  – synchronization (e.g., provide check points for long transfers)

• Interface: depends on service

• Protocol: token management; insert checkpoints, implement roll-back functions

Presentation Layer (6)

• Service: convert data between various representations

• Interface: depends on service

• Protocol: define data formats, and rules to convert from one format to another

Application Layer (7)

• Service: any service provided to the end user

• Interface: depends on the application

• Protocol: depends on the application

• Examples: FTP, Telnet, WWW browser

Who Does What?

[Figure: Host A and Host B connected through a router over the physical medium; the router implements only the Network, Datalink, and Physical layers]

Logical Communication

• Each layer interacts with the corresponding layer on its peer

[Figure: Host A, Router, and Host B over the physical medium]

Physical Communication

• Communication goes down to physical network, then to peer, then up to relevant layer

[Figure: Host A, Router, and Host B over the physical medium]

Encapsulation

• A layer can use only the service provided by the layer immediately below it

• Each layer may change the data packet and add a header to it

Example: Postal System

Standard process (historical):
• Write letter
• Drop an addressed letter off in your local mailbox
• Postal service delivers to address
• Addressee reads letter (and perhaps responds)

Postal Service as Layered System

Layers:
• Letter writing/reading
• Delivery

Information Hiding:
• Network need not know letter contents
• Customer need not know how the postal network works

Encapsulation:
• Envelope

[Figure: two customers, each served by a post office]

Functions of the Layers

Application Layer (telnet, ftp, email, www, AFS)
  – Service: Handles details of application programs.
  – Functions:

Transport Layer (TCP, UDP)
  – Service: Controls delivery of data between hosts.
  – Functions: Connection establishment/termination, error control, flow control, congestion control, etc.

Network Layer (IP, ICMP, OSPF, RIP, BGP)
  – Service: Moves packets inside the network.
  – Functions: Routing, addressing, switching, etc.

(Data) Link Layer (Ethernet, WiFi, T1)
  – Service: Reliable transfer of frames over a link.
  – Functions: Synchronization, error control, flow control, etc.

Internet Protocol Architecture

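[Figure: two FTP programs communicate using the FTP protocol over the TCP protocol; the IP protocol runs on each hop, with the Ethernet protocol between Ethernet drivers on one side and the ATM protocol between ATM drivers on the other]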

Internet Protocol Architecture

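[Figure: an MPEG server program and an MPEG player program communicate using the RTP protocol over the UDP protocol; the IP protocol runs on each hop, with the Ethernet protocol between Ethernet drivers on one side and the ATM protocol between ATM drivers on the other]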

Encapsulation

• As data moves down the protocol stack, each protocol adds layer-specific control information.

[Figure: encapsulation down the stack]
  Application:      Application Header | User data
  TCP segment:      TCP Header | Application data
  IP datagram:      IP Header | TCP Header | Application data
  Ethernet frame:   Ethernet Header | IP Header | TCP Header | Application data | Ethernet Trailer

Hourglass

Note: Additional protocols, such as routing protocols (RIP, OSPF), are needed to make IP work

Implications of Hourglass

A single Internet layer module:

• Allows all networks to interoperate
  – all network technologies that support IP can exchange packets

• Allows all applications to function on all networks
  – all applications that can run on IP can use any network

• Simultaneous developments above and below IP

Reality

• Layering is a convenient way to think about networks
• But layering is often violated
  – Firewalls
  – Transparent caches
  – NAT boxes

Summary

• Layering is a good way to organize network functions

• Unified Internet layer decouples apps from networks

• E2E argument argues to keep IP simple

• Be judicious when thinking about adding to the network layer

OSI & Internet protocol suite

Where do we work?

Sockets API

Open/X Transport Interface

Two reasons for this design

• Upper three layers handle all the details of the application and know little about communication, i.e., sending and receiving data

• Upper three layers form a user process while the lower four layers are provided as part of operating system or kernel.

About kernel

Kernel

• the part of the operating system that is mandatory and common to all other software

• simply the name given to the lowest level of abstraction that is implemented in software

Functionalities of Kernel

• Process Management
• Memory Management
• Device Management
• System Calls

Process Management

• A kernel typically sets up an address space for the process, loads the file containing the code into memory, sets up a stack for the program, and branches to a given location inside the program, thus starting its execution

Memory Management

• The kernel has full access to the system's memory and must allow processes to safely access this memory as they require it.

• Virtual addressing allows the kernel to make a given physical address appear to be another address, the virtual address.

• Virtual address spaces may be different for different processes;

Device Management

• Processes need access to the peripherals connected to the computer, which are controlled by the kernel through device drivers.

• For example, to show the user something on the screen, an application would make a request to the kernel, which would forward the request to its display driver, which is then responsible for actually plotting the character/pixel

System Calls

• A process must be able to access the services provided by the kernel. This is implemented differently by each kernel, but most provide a C library or an API, which in turn invokes the related kernel functions

• Implemented using software simulated interrupts

Programs and Processes

• A program is an executable file residing on disk. A program is read into memory and executed by the kernel

• An executing instance of a program is called a process

• Every process has a unique non-negative identifier called process id (PID)

Process Environment

• What happens when we execute a C program? ./a.out

• How are the command-line arguments passed to the process?

• Memory layout of a process

What happens when we execute a C program?

• int main(int argc, char *argv[]);
• When a C program is executed by the kernel by one of the exec functions, a special start-up routine is called before the main function is called.

• The executable program file specifies this routine as the starting address for the program.

• This start-up routine takes the command-line arguments and the environment from the kernel

Memory Layout of C Program

• Code – text segment
• Initialized data – data segment
• Uninitialized data – bss segment
• Heap
• Stack

• Code – text segment
  – Machine instructions that the CPU executes
  – Sharable
  – Read-only

• Initialized data – data segment
  – A variable declared outside any function with a nonzero initial value is stored in the initialized data segment with that value.
  – Statically allocated and global data that are initialized with nonzero values live in the data segment

• Uninitialized data – bss segment
  – BSS stands for ‘Block Started by Symbol’.
  – Global and statically allocated data that are initialized to zero by default are kept here

Memory Layout

• Stack
  – The stack segment is where local (automatic) variables are allocated.
  – Data is pushed onto and popped off the stack following the Last In First Out (LIFO) rule.
  – When a function is called, a stack frame is created and PUSHed onto the top of the stack. This stack frame contains information such as the address from which the function was called and where to jump back to when the function is finished (return address), parameters, local variables, and any other information needed by the invoked function.
  – When a function returns, the stack frame is POPped from the stack. Typically the stack grows downward, meaning that items deeper in the call chain are at numerically lower addresses and toward the heap.

• Heap
  – The heap is where dynamic memory (obtained by malloc(), calloc(), realloc()) comes from.
  – It is typical for the heap to grow upward. This means that successive items that are added to the heap are added at addresses that are numerically greater than previous items.
  – The end of the heap is marked by a pointer known as the break. You cannot reference past the break. You can, however, move the break pointer (via brk() and sbrk() system calls) to a new position to increase the amount of heap memory available.

Environment Variables

• Stored in process memory
• Set of parameters that are inherited from process to process.
• Each program is also passed an environment list like the argument list.
• The environment list is an array of character pointers, with each pointer pointing to a string of the form name=value.

Environment Variables

Listing all arguments and environment vars

#include <stdio.h>
#include <stdlib.h>

int
main (int argc, char *argv[])
{
  int i;
  char **ptr;
  extern char **environ;

  for (i = 0; i < argc; i++)    /* echo all command-line args */
    printf ("argv[%d]: %s\n", i, argv[i]);
  for (ptr = environ; *ptr != 0; ptr++)    /* and all env strings */
    printf ("%s\n", *ptr);
  exit (0);
}

Functions to access environment variables

Process Control

• Every process has a unique process ID, a non-negative integer.

• Although unique, process IDs are reused. As processes terminate, their IDs become candidates for reuse.

• Process ID 0 is usually the scheduler process and is often known as the swapper.

Process Control

• Process ID 1 is usually the init process and is invoked by the kernel at the end of the bootstrap procedure. This process is responsible for bringing up a UNIX system after the kernel has been bootstrapped.

• The init process never dies. It is a normal user process, not a system process within the kernel, although it does run with super user privileges.

• init becomes the parent process of any orphaned child process.

Process Identifiers

#include <unistd.h>
• pid_t getpid(void);
  Returns: process ID of calling process
• pid_t getppid(void);
  Returns: parent process ID of calling process
• uid_t getuid(void);
  Returns: real user ID of calling process
• uid_t geteuid(void);
  Returns: effective user ID of calling process
• gid_t getgid(void);
  Returns: real group ID of calling process
• gid_t getegid(void);
  Returns: effective group ID of calling process

fork()

• An existing process can create a new one by calling the fork function.
  #include <unistd.h>
  pid_t fork(void);
  Returns: 0 in child, process ID of child in parent, -1 on error
• The new process created by fork is called the child process. This function is called once but returns twice. The only difference in the returns is that the return value in the child is 0, whereas the return value in the parent is the process ID of the new child

fork()

• Both the child and the parent continue executing with the instruction that follows the call to fork.

• The child is a copy of the parent. For example, the child gets a copy of the parent's data space, heap, and stack. Note that this is a copy for the child; the parent and the child do not share these portions of memory. The parent and the child share the text segment

copy-on-write (COW)

• Many current implementations don't perform a complete copy of the parent's data, stack, and heap

• These regions are shared by the parent and the child and have their protection changed by the kernel to read-only

• If either process tries to modify these regions, the kernel then makes a copy of that piece of memory only, typically a "page" in a virtual memory system.

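#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

int glob = 6;    /* global variable */

int
main ()
{
  int var;
  pid_t pid;

  var = 88;
  printf ("Before fork\n");
  if ((pid = fork ()) < 0)
    perror ("fork");    /* print the error that occurred in the process */
  else if (pid == 0)
    {                   /* child: modifies its own copies */
      glob++;
      var++;
      printf ("pid = %d, glob=%d, var=%d\n", getpid (), glob, var);
      exit (0);
    }
  else
    {                   /* parent: its copies are unchanged */
      printf ("pid = %d, glob=%d, var=%d\n", getpid (), glob, var);
      exit (0);
    }
  return 1;             /* reached only if fork failed */
}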

fork()

• In general, we never know whether the child starts executing before the parent or vice versa. This depends on the scheduling algorithm used by the kernel.

• To synchronize child and parent, some form of interprocess communication is required.

File sharing between parent and child

• one characteristic of fork is that all file descriptors that are open in the parent are duplicated in the child.

• The parent and the child share a file table entry for every open descriptor .

• Generally, a shell process has three files open: standard input, standard output, and standard error. When a command is executed as a process, it inherits them

vfork()

• The vfork function is intended to create a new process when the purpose of the new process is to exec a new program

• The vfork function creates the new process, just like fork, without copying the address space of the parent into the child, as the child won't reference that address space

• vfork guarantees that the child runs first, until the child calls exec or exit. When the child calls either of these functions, the parent resumes.

What child inherits?

• Real user ID, real group ID, effective user ID, effective group ID
• Current working directory
• Root directory
• File mode creation mask
• Environment
• Process group ID
• Session ID
• Controlling terminal
• Attached shared memory segments
• Memory mappings
• Resource limits

What values in child are different from parent?

• The return value from fork
• The process IDs are different
• The two processes have different parent process IDs: the parent process ID of the child is the parent; the parent process ID of the parent doesn't change
• The child's tms_utime, tms_stime, tms_cutime, and tms_cstime values are set to 0
• File locks set by the parent are not inherited by the child
• Pending alarms are cleared for the child
• The set of pending signals for the child is set to the empty set

Process Termination

• Normal termination
  – Return from main
  – Calling exit
  – Calling _exit or _Exit
  – Return of the last thread from its start routine
  – Calling pthread_exit from the last thread

• Abnormal termination
  – Calling abort
  – Receipt of a signal
  – Response of the last thread to a cancellation request

Process Termination

• Regardless of how a process terminates, the same code in the kernel is eventually executed. This kernel code closes all the open descriptors for the process, releases the memory that it was using, and the like.

• To be able to notify its parent how it terminated, the child passes an exit status as the argument to one of the exit functions (exit, _exit, and _Exit).

• In the case of an abnormal termination, however, the kernel, not the process, generates a termination status to indicate the reason for the abnormal termination.

• In any case, the parent of the process can obtain the termination status using wait or the waitpid function

Process Termination

• When a process terminates, either normally or abnormally, the kernel notifies the parent by sending the SIGCHLD signal to the parent.

• This signal is the asynchronous notification from the kernel to the parent. The parent can choose to ignore this signal, or it can provide a function that is called when the signal occurs: a signal handler.

• The default action for this signal is that it is ignored.

wait() & waitpid()

• Parent can obtain termination status from kernel using these calls

• A process that calls wait or waitpid can
  – Block, if all of its children are still running
  – Return immediately with the termination status of a child, if a child has terminated and is waiting for its termination status to be fetched
  – Return immediately with an error, if it doesn't have any child processes

Syntax

waitpid()

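#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int
main (void)
{
  int i = 0, j = 0;
  pid_t ret;
  int status;

  ret = fork ();
  if (ret == 0)
    {
      for (i = 0; i < 5000; i++)
        printf ("Child: %d\n", i);
      printf ("Child ends\n");
    }
  else
    {
      wait (&status);    /* block until the child terminates */
      printf ("Parent resumes.\n");
      for (j = 0; j < 5000; j++)
        printf ("Parent: %d\n", j);
    }
  return 0;
}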

What happens if parent terminates before child?

• The init process becomes the parent process of any process whose parent terminates (the process is said to have been inherited by init)

• parent process ID of the surviving process is changed to be 1 (the process ID of init). This way, we're guaranteed that every process has a parent.

What happens when a child terminates before its parent ?

• Kernel keeps small amount of information (process ID, the termination status of the process, and the amount of CPU time taken by the process ) until parent asks for it

• a process that has terminated, but whose parent has not yet waited for it, is called a zombie

exec functions

• The fork function creates a new process (the child). The child then causes another program to be executed by calling one of the exec functions.

• When a process calls one of the exec functions, that process is completely replaced by the new program, and the new program starts executing at its main function.

• The process ID does not change across an exec, because a new process is not created;

• exec replaces the current process, its text, data, heap, and stack segments with a new program from disk.

#include <unistd.h>
• int execl(const char *pathname, const char *arg0, ... /* (char *)0 */ );
• int execv(const char *pathname, char *const argv[]);
• int execle(const char *pathname, const char *arg0, ... /* (char *)0, char *const envp[] */ );
• int execve(const char *pathname, char *const argv[], char *const envp[]);
• int execlp(const char *filename, const char *arg0, ... /* (char *)0 */ );
• int execvp(const char *filename, char *const argv[]);

Remembering arguments

Function          pathname  filename  Arg list  argv[]  environ  envp[]
execl                •         -         •        -        •       -
execlp               -         •         •        -        •       -
execle               •         -         •        -        -       •
execv                •         -         -        •        •       -
execvp               -         •         -        •        •       -
execve               •         -         -        •        -       •
(letter in name)     -         p         l        v        -       e

Example

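Output: Executes ls command with -l option

#include <stdio.h>
#include <unistd.h>

int main ()
{
  execl ("/bin/ls", "ls", "-l", (char *) 0);
  printf ("hello");    /* reached only if execl fails */
  return 1;
}

• Input: a command to execute and its arguments

#include <unistd.h>

int main(int argc, char **argv)
{
  execvp(argv[1], argv + 1);
  return 1;            /* reached only if execvp fails */
}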

Signals

• A signal is an asynchronous event which is delivered to a process.

• Asynchronous means that the event can occur at any time
  – may be unrelated to the execution of the process
  – e.g. user types ctrl-C, or the modem hangs

Signals

Name      Description                   Default Action
SIGINT    Interrupt character typed     terminate process
SIGQUIT   Quit character typed (^\)     terminate + create core image
SIGKILL   kill -9                       terminate process
SIGSEGV   Invalid memory reference      terminate + create core image
SIGPIPE   Write on pipe but no reader   terminate process
SIGALRM   alarm() clock ‘rings’         terminate process
SIGUSR1   user-defined signal type      terminate process
SIGUSR2   user-defined signal type      terminate process

• See man 7 signal

Signal Sources

• Terminal-generated signals: SIGINT, SIGQUIT
• Hardware exceptions generate signals: SIGFPE, SIGSEGV
• The kill function allows a process to send any signal to another process or process group
• The kill command allows us to send signals to other processes.
• Software conditions: SIGURG, SIGPIPE, SIGALRM

kill() and raise()function

• Send a signal to a process (or group of processes).

#include <signal.h>
int kill( pid_t pid, int signo );
int raise( int signo );

• pid > 0    send signal to process pid
  pid == 0   send signal to all processes whose process group ID equals the sender’s pgid,
             e.g. parent kills all children

• Return 0 if ok, -1 on error.

Responding to a Signal

• A process can:
  – Ignore/discard the signal (not possible with SIGKILL or SIGSTOP)
  – Catch the signal and execute a signal handler function, and then possibly resume execution
  – Let the default action apply. Every signal has a default action
• The choice is called the signal disposition

Signal Handler Function

• Specify a signal handler function to deal with a signal type.
• #include <signal.h>
  typedef void Sigfunc(int);    /* my defn */
  Sigfunc *signal( int signo, Sigfunc *handler );
  – signal returns a pointer to a function that takes an int (i.e. it returns a pointer to Sigfunc)
• Returns previous signal disposition if ok, SIG_ERR on error.

Example

#include <signal.h>

void foo( int signo );    /* handler, defined below */

int main()
{
  signal( SIGINT, foo );
  :
  /* do usual things until SIGINT */
  return 0;
}

void foo( int signo )
{
  :    /* deal with SIGINT signal */
  return;    /* return to program */
}

Special Sigfunc * Values

• Value Meaning

SIG_IGN Ignore / discard the signal.

SIG_DFL Use default action to handle signal.

SIG_ERR Returned by signal() as an error.

Signals Overview

• Three phases to processing signals:
  – Signal is generated
    • when the event that causes the signal occurs
  – Signal is delivered
    • a signal is said to be delivered to the process when the process takes action for the signal
  – Signal is pending
    • during the time between generation and delivery, the signal is said to be pending

Signal blocking

• Blocking the delivery of a signal
  – the process tells the kernel which signals to block
  – when such a signal is generated for the process, if the action is not ignore, that signal remains pending until the process either unblocks it or changes the action to ignore

Multiple Signals

• If a blocked signal is generated more than once then in most systems the signal is delivered only once. That is the signal is not queued.

• If many signals of different types are ready to be delivered (e.g. a SIGINT, SIGSEGV, SIGUSR1), they are not delivered in any fixed order.

Signal Sets

• A data type to represent multiple signals
• #include <signal.h>
  – int sigemptyset(sigset_t *set);
  – int sigfillset(sigset_t *set);
  – int sigaddset(sigset_t *set, int signo);
  – int sigdelset(sigset_t *set, int signo);
    All four return: 0 if OK, -1 on error
  – int sigismember(const sigset_t *set, int signo);
    Returns: 1 if true, 0 if false, -1 on error

sigprocmask()

• A process uses a signal set to create a mask which defines the signals it is blocking from delivery.
  – good for critical sections where you want to block certain signals.

• #include <signal.h>
  int sigprocmask( int how, const sigset_t *set, sigset_t *oldset );

• how – indicates how mask is modified

‘how’ Meanings

• Value Meaning

SIG_BLOCK set signals are added to mask

SIG_UNBLOCK set signals are removed from mask

SIG_SETMASK set becomes new mask

A Critical Code Region

sigset_t newmask, oldmask;

sigemptyset( &newmask );
sigaddset( &newmask, SIGINT );

/* block SIGINT; save old mask */
sigprocmask( SIG_BLOCK, &newmask, &oldmask );

/* critical region of code */

/* reset mask which unblocks SIGINT */
sigprocmask( SIG_SETMASK, &oldmask, NULL );

sigaction()

• Supersedes (more powerful than) signal()
  – sigaction() can be used to code a non-resetting signal()
• #include <signal.h>
  int sigaction( int signo, const struct sigaction *act, struct sigaction *oldact );

sigaction Structure

struct sigaction {
  void (*sa_handler)( int );    /* action to be taken or SIG_IGN, SIG_DFL */
  sigset_t sa_mask;             /* additional signals to be blocked */
  int sa_flags;                 /* modifies action of the signal */
  void (*sa_sigaction)( int, siginfo_t *, void * );
      /* The sa_sigaction field is an alternate signal handler used when
         the SA_SIGINFO flag is used with sigaction. */
};

• sa_flags
  – SA_RESETHAND: reset the disposition to the default (SIG_DFL) on entry to the handler
  – SA_SIGINFO: extra information is passed to the handler (i.e., specifies the use of the “second” handler, sa_sigaction, in the structure)

sigaction() Behavior

• A signo signal causes the sa_handler signal handler to be called.

• While sa_handler executes, the signals in sa_mask are blocked. Any more signo signals are also blocked.

• sa_handler remains installed until it is changed by another sigaction() call. No reset problem.

• sa_sigaction specifies handler if SA_SIGINFO flag is set.

struct siginfo {
  int si_signo;     /* signal number */
  int si_errno;     /* if nonzero, errno value from <errno.h> */
  int si_code;      /* additional info (depends on signal) */
  pid_t si_pid;     /* sending process ID */
  uid_t si_uid;     /* sending process real user ID */
  void *si_addr;    /* address that caused the fault */
  int si_status;    /* exit value or signal number */
  long si_band;     /* band number for SIGPOLL */
  /* possibly other fields also */
};

Other POSIX Functions

• sigpending()    examine signals that are blocked and pending

• sigsetjmp(), siglongjmp()    jump functions for use in signal handlers which handle masks correctly

• sigsuspend()    atomically reset mask and sleep

pause()

• Suspend the calling process until a signal is caught.
• #include <unistd.h>
  int pause(void);
• Returns -1 with errno assigned EINTR.
• pause() only returns after a signal handler has returned.

alarm()

• Set an alarm timer that will ‘ring’ after a specified number of seconds
  – a SIGALRM signal is generated

• #include <unistd.h>
  unsigned int alarm(unsigned int secs);

• Returns 0 or number of seconds until previously set alarm would have ‘rung’.

Some aspects of alarm()

• A process can have at most one alarm timer running at once.

• If alarm() is called when there is an existing alarm set then it returns the number of seconds remaining for the old alarm, and sets the timer to the new alarm value.

• An alarm(0) call causes the previous alarm to be cancelled.

setjmp() and longjmp()

• In C we cannot use goto to jump to a label in another function
  – use setjmp() and longjmp() for those ‘long jumps’

• Uses:
  – error handling which requires a deeply nested function to recover to a higher level (e.g. back to main())
  – coding timeouts with signals

Prototypes

• #include <setjmp.h>
  int setjmp( jmp_buf env );
• Returns 0 if called directly, non-zero if returning from a call to longjmp().
• #include <setjmp.h>
  void longjmp( jmp_buf env, int val );
• In the setjmp() call, env is initialized to information about the current state of the stack.
• The longjmp() call causes the stack to be reset to its env value.
• Execution restarts after the setjmp() call, but this time setjmp() returns val.

Example

#include <setjmp.h>
#include <stdio.h>

jmp_buf env;    /* global */

void process_line( char *ptr );
void cmd_add( void );
int get_token( void );

int main()
{
  char line[MAX];
  int errval;

  if(( errval = setjmp(env) ) != 0 )
    printf( "error %d: restart\n", errval );
  while( fgets( line, MAX, stdin ) != NULL )
    process_line(line);
  return 0;
}

void process_line( char *ptr )
{
  :
  cmd_add();
  :
}

void cmd_add()
{
  int token;

  token = get_token();
  if( token < 0 )    /* bad error */
    longjmp( env, 1 );
  /* normal processing */
}

int get_token()
{
  if( some error )
    longjmp( env, 2 );
}

Stack Frames before calling longjmp()

[Figure: the stack holds only the main() stack frame; setjmp(env) returns 0 and env records the stack frame information]

Stack Frames after longjmp()

[Figure: the stack holds the main(), process_line(), and cmd_add() stack frames, growing toward the top of the stack; longjmp(env, 1) causes the stack frames to be reset]

What happens if longjmp() is called in signal handler?

• The signal is automatically added to the signal mask (which prevents further delivery of it) when a signal handler is entered. When the signal handler is exited, the signal is removed from the mask.

• When longjmp() is called in signal handler, the signal remains blocked.

siglongjmp & sigsetjmp

• POSIX does not specify whether longjmp will restore the signal context. If you want to save and restore signal masks, use siglongjmp.

• POSIX does not specify whether setjmp will save the signal context. If you want to save signal masks, use sigsetjmp.

• #include <setjmp.h>
• int sigsetjmp(sigjmp_buf env, int savemask);
  Returns: 0 if called directly, nonzero if returning from a call to siglongjmp
• void siglongjmp(sigjmp_buf env, int val);

Inter Process Communication

Why do processes communicate?

• To share resources
• Client/server paradigms
• Inherently distributed applications
• Reusable software components
• etc.

Types of IPC

• Message Passing
  – Pipes, FIFOs, and Message Queues
• Synchronization
  – Mutexes, condition variables, read-write locks, file and record locks, and semaphores
• Shared memory
• Remote Procedure Calls
  – Solaris doors and Sun RPC

Sharing of information

What is IPC?

• Each process has a private address space. Normally, no process can write to another process’s space. How to get important data from process A to process B?

• Message passing between different processes running on the same operating system is IPC

• Synchronization is required in case of IPC through shared memory or file system

• Pipes are the oldest form of UNIX System IPC and are provided by all UNIX systems

• Most commonly used form of IPC
• Historically, they have been half duplex (i.e., data flows in only one direction).
• Because they don’t have names, pipes can be used only between processes that have a common ancestor.
  – Normally, a pipe is created by a process, that process calls fork, and the pipe is used between the parent and the child.

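UNIX Pipes

[Figure: parent process p1 passes the info to be shared to the write function; child process p2 gets a copy of the info from the read function. The pipe for p1 and p2 is a FIFO buffer, size = 4096 characters.]

int p[2];
pipe(p);
write(p[1], "hello", size);    /* parent process, p1 */
….
read(p[0], inbuf, size);       /* child process, p2 */
….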

• #include <unistd.h>
• int pipe(int fd[2]); returns 0 if OK, else -1
• fd[0] is for reading, fd[1] is for writing

• Pipes are rarely used in a single process. They are generally used between parent and child

#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>

int main ()
{
  int p[2];
  pid_t ret;
  char buf[100];
  pipe (p);                             //creating pipe
  ret = fork ();
  if (ret == 0)
    {
      write (p[1], "hello", 6);         //writing to parent through pipe
    }
  if (ret > 0)
    {
      read (p[0], buf, 6);              //reading from child via pipe
      printf ("Child Said:%s\n", buf);  //printing to stdout
    }
  return 0;
}
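The program above can be tightened by closing the unused pipe ends and checking return values. The following is a sketch under those assumptions; pipe_roundtrip is a hypothetical helper name, and the child writes while the parent reads, as on the slide.

```c
#include <assert.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: the child writes msg into a pipe and the parent reads
 * it back into out. Returns the number of bytes read, or -1 on error. */
ssize_t pipe_roundtrip(const char *msg, char *out, size_t outsz)
{
    int p[2];
    pid_t pid;
    ssize_t n;
    size_t total = 0;

    assert(outsz > 0);                   /* need room for the terminating '\0' */
    if (pipe(p) == -1)
        return -1;
    pid = fork();
    if (pid == -1) {
        close(p[0]);
        close(p[1]);
        return -1;
    }
    if (pid == 0) {                      /* child: writer */
        close(p[0]);                     /* unused read end */
        if (write(p[1], msg, strlen(msg)) == -1)
            _exit(1);
        close(p[1]);
        _exit(0);
    }
    close(p[1]);                         /* parent: close the unused write end
                                            so read() can see end-of-file */
    while (total < outsz - 1 &&
           (n = read(p[0], out + total, outsz - 1 - total)) > 0)
        total += (size_t)n;
    out[total] = '\0';
    close(p[0]);
    waitpid(pid, NULL, 0);
    return (ssize_t)total;
}
```

If the parent kept its copy of the write end open, the read loop would never see end-of-file, which is why each side closes the end it does not use.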

Pipes: who|sort

[Figure: the stdout of who is connected through a pipe to the input of sort.]

• Create a pipe in the parent
• Fork a child
• In the child, duplicate the write end of the pipe onto the standard output descriptor
• Exec ‘who’ program
• In the parent wait for the child.
• Duplicate the read end of the pipe onto the standard input descriptor
• Exec ‘sort’ program

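who|sort

#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

int main ()
{
  int p[2];
  pid_t ret;
  pipe (p);
  ret = fork ();
  if (ret == 0)
    {
      close (1);
      dup (p[1]);       //lowest free descriptor is 1: stdout is now the pipe
      close (p[0]);
      execlp ("who", "who", (char *) 0);
    }
  if (ret > 0)
    {
      close (0);
      dup (p[0]);       //lowest free descriptor is 0: stdin is now the pipe
      close (p[1]);
      wait (NULL);
      execlp ("sort", "sort", (char *) 0);
    }
  return 0;
}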

dup and dup2 Functions

• #include <unistd.h>
• int dup(int filedes);
• int dup2(int filedes, int filedes2);
  Both return: new file descriptor if OK, -1 on error
• The new file descriptor returned by dup is guaranteed to be the lowest-numbered available file descriptor.
• With dup2, we specify the value of the new descriptor with the filedes2 argument. If filedes2 is already open, it is first closed. If filedes equals filedes2, then dup2 returns filedes2 without closing it.
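A sketch of both calls working together: dup saves the current standard output, dup2 points descriptor 1 at a pipe, and a second dup2 restores it. The helper name capture_stdout_write is hypothetical, and the sketch assumes the message fits in the pipe buffer so that the write cannot block.

```c
#include <assert.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: temporarily points standard output at a pipe, writes
 * msg to descriptor 1, restores the old standard output, and returns what
 * arrived in the pipe. Returns the number of bytes captured, or -1. */
ssize_t capture_stdout_write(const char *msg, char *out, size_t outsz)
{
    int p[2];
    int saved;
    ssize_t n;

    assert(outsz > 0);
    if (pipe(p) == -1)
        return -1;
    saved = dup(STDOUT_FILENO);            /* lowest-numbered free descriptor */
    if (saved == -1 || dup2(p[1], STDOUT_FILENO) == -1) {
        if (saved != -1)
            close(saved);
        close(p[0]);
        close(p[1]);
        return -1;
    }
    close(p[1]);                           /* descriptor 1 is now the write end */

    n = write(STDOUT_FILENO, msg, strlen(msg));

    dup2(saved, STDOUT_FILENO);            /* closes the pipe end, restores stdout */
    close(saved);
    if (n > 0)
        n = read(p[0], out, outsz - 1);
    close(p[0]);
    if (n < 0)
        return -1;
    out[n] = '\0';
    return n;
}
```

Unlike the close-then-dup sequence in the who|sort program, dup2 names the target descriptor explicitly, so it does not rely on that descriptor being the lowest free one.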

popen and pclose

• #include <stdio.h>
• FILE *popen(const char *cmdstring, const char *type);
  Returns: file pointer if OK, NULL on error
• int pclose(FILE *fp);
• Together, popen and pclose handle creating a pipe, forking a child, closing the unused ends of the pipe, executing a shell to run the command, and waiting for the command to terminate
  – fp = popen("ls *.c", "r");

Name Spaces

• When two unrelated processes use some type of IPC to exchange information, the IPC object must have a name or identifier of some form

• The set of possible names for a given type of IPC is called its name space

• FIFOs have pathname in the file system as identifier

• Create a FIFO
  – #include <sys/types.h>
  – #include <sys/stat.h>
  – int mkfifo(const char *pathname, mode_t mode); //returns 0 if OK or -1
• Ex: if( mkfifo("fifo1", 0666)<0) perror("mkfifo");
  – mkfifo fails with the error ‘EEXIST’ if the FIFO already exists at the given path
• Once a FIFO is created, it should be opened either for reading or writing
  – wfd=open("fifo1",O_WRONLY); or
  – FILE *fp = fopen("fifo1", "w");
• A FIFO should not be opened for both reading and writing at the same time (POSIX leaves the result of O_RDWR on a FIFO undefined)
• Unlike a pipe, a FIFO is not deleted as soon as all the processes referring to it exit. It has to be explicitly deleted from the system.
  – unlink("fifo1")
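The create, open, use and unlink steps can be sketched end to end as follows. The helper name fifo_roundtrip is hypothetical; a parent and child are used only to keep the sketch in one program, since any two processes that know the pathname could take these roles.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: creates a FIFO at path, lets a child write msg into
 * it, reads the message back in the parent, and unlinks the FIFO again.
 * Returns the number of bytes read, or -1 on error. */
ssize_t fifo_roundtrip(const char *path, const char *msg,
                       char *out, size_t outsz)
{
    pid_t pid;
    int fd;
    ssize_t n;

    assert(outsz > 0);
    if (mkfifo(path, 0600) == -1)        /* fails with EEXIST if path exists */
        return -1;
    pid = fork();
    if (pid == -1) {
        unlink(path);
        return -1;
    }
    if (pid == 0) {                      /* child: writer */
        fd = open(path, O_WRONLY);       /* blocks until a reader opens */
        if (fd == -1 || write(fd, msg, strlen(msg)) == -1)
            _exit(1);
        close(fd);
        _exit(0);
    }
    fd = open(path, O_RDONLY);           /* blocks until the writer opens */
    n = (fd == -1) ? -1 : read(fd, out, outsz - 1);
    if (fd != -1)
        close(fd);
    waitpid(pid, NULL, 0);
    unlink(path);                        /* a FIFO persists until unlinked */
    if (n < 0)
        return -1;
    out[n] = '\0';
    return n;
}
```

Both open calls block until the other side has opened the FIFO, which is the default behaviour when O_NONBLOCK is not set.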

FIFOs between parent and child

Properties of FIFO

FIFOs between parent and child

Swap these two calls and see

Non-blocking option

• A descriptor can be set non-blocking in one of the two ways
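Two common ways to do this are passing O_NONBLOCK to open, as in open("fifo1", O_RDONLY | O_NONBLOCK), and enabling the flag on an already open descriptor with fcntl. The sketch below shows the fcntl route, which is the only option for a pipe because pipe() takes no flags; the helper names are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Turns on O_NONBLOCK for an already open descriptor; returns 0 or -1. */
int set_nonblocking(int fd)
{
    int flags;

    assert(fd >= 0);
    flags = fcntl(fd, F_GETFL, 0);       /* fetch the current status flags */
    if (flags == -1)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1 ? -1 : 0;
}

/* Hypothetical check: reads from an empty pipe whose read end is
 * nonblocking. Returns the errno value of the failed read, 0 if the read
 * unexpectedly succeeded, or -1 on setup error. */
int read_empty_pipe_nonblocking(void)
{
    int p[2];
    char c;
    int result;

    if (pipe(p) == -1)
        return -1;
    if (set_nonblocking(p[0]) == -1)
        result = -1;
    else
        result = (read(p[0], &c, 1) == -1) ? errno : 0;
    close(p[0]);
    close(p[1]);
    return result;
}
```

Because the write end is still open and the pipe is empty, the read fails with EAGAIN instead of blocking.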

Read and write operations Pipe and FIFO

Writing to pipe/fifo when pipe/fifo is open for reading

• If data size is less than or equal to PIPE_BUF, the write is atomic, i.e. either all the data is written or no data is written

• If there is no room in the pipe for the requested data (<= PIPE_BUF), by default the write blocks.
  – If the O_NONBLOCK option is set, an EAGAIN error is returned
• If data is > PIPE_BUF and the O_NONBLOCK option is set, then even if only 1 byte of space is available in the pipe, write transfers as much data as fits and returns
  – Atomicity is not guaranteed

Message Queues

• A message queue is a linked list of messages stored within the kernel and identified by a message queue identifier

• Any process with adequate privileges can place the message into the queue and any process with adequate privileges can read from queue

• There is no requirement that some process must be waiting to receive message before sending the message

Message Queues

• Every message queue has following structure in kernel

Message Queues

Permissions

• struct ipc_perm {
      uid_t uid;     /* owner's effective user id */
      gid_t gid;     /* owner's effective group id */
      uid_t cuid;    /* creator's effective user id */
      gid_t cgid;    /* creator's effective group id */
      mode_t mode;   /* access modes */
      . . .
  };

• Permission Bit
  – user-read             0400
  – user-write (alter)    0200
  – group-read            0040
  – group-write (alter)   0020
  – other-read            0004
  – other-write (alter)   0002

Message Queues

• First msgget is used to either open an existing queue or create a new queue

• #include <sys/msg.h>
  int msgget(key_t key, int flag);
  – Returns: message queue ID if OK, -1 on error

• Key value can be IPC_PRIVATE, key generated by ftok() or any key (long integer)

• Flag value must include
  – IPC_CREAT if a new queue has to be created
  – IPC_CREAT and IPC_EXCL if we want to create a new queue and fail, rather than reference it, when one already exists

Key Values

• The server can create a new IPC structure by specifying a key of IPC_PRIVATE

  – Kernel generates a unique id
• The client and the server can agree on a key by defining the key in a common header.
• The client and the server can agree on a pathname and project ID and call the function ftok to convert these two values into a key.
  – #include <sys/ipc.h>
  – key_t ftok(const char *path, int id);
  – The path argument must refer to an existing file. Only the lower 8 bits of id are used when generating the key.

Message Queues

• When a new queue is created, the following members of the msqid_ds structure are initialized.
  – The ipc_perm structure is initialized
  – msg_qnum, msg_lspid, msg_lrpid, msg_stime, and msg_rtime are all set to 0.
  – msg_ctime is set to the current time.
  – msg_qbytes is set to the system limit.

• On success, msgget returns the non-negative queue ID. This value is then used with the other three message queue functions.

Messages

• Each message is composed of a positive long integer type field, and the actual data bytes. Messages are always placed at the end of the queue.

• Message Template

• Most applications define their own message structure according to the needs of the application

Sending Messages

• #include <sys/msg.h>
  int msgsnd(int msqid, const void *ptr, size_t nbytes, int flag);

• msqid is the id returned by the msgget system call
• The ptr argument is a pointer to a message structure
• nbytes is the length of the user data, i.e. sizeof(struct mesg) - sizeof(long). Length can be zero.
• A flag value of 0 or IPC_NOWAIT can be specified
• Without IPC_NOWAIT, msgsnd() blocks when the queue is full until one of the following occurs
  – Room exists for the message
  – Message queue is removed (EIDRM error is returned)
  – Interrupted by a signal (EINTR is returned)

Receiving Messages

• #include <sys/msg.h>
  ssize_t msgrcv(int msqid, void *ptr, size_t length, long type, int flag);
• ptr points to the message structure where the message will be stored
• length is the size available in the message structure, excluding sizeof(long)
• type indicates the message desired from the message queue
• flag can be 0 or IPC_NOWAIT or MSG_NOERROR

Receiving Messages

• The type argument lets us specify which message we want.
  – type == 0: The first message on the queue is returned.
  – type > 0: The first message on the queue whose message type equals type is returned.
  – type < 0: The first message on the queue whose message type is the lowest value less than or equal to the absolute value of type is returned.
• A nonzero type is used to read the messages in an order other than first in, first out.
  – Priority to messages, Multiplexing

Receiving Messages

• IPC_NOWAIT flag makes the operation nonblocking, causing msgrcv to return -1 with errno set to ENOMSG if a message of the specified type is not available.

• If IPC_NOWAIT is not specified, the operation blocks until
  – a message of the specified type is available,
  – the queue is removed from the system (-1 is returned with errno set to EIDRM)
  – a signal is caught and the signal handler returns (causing msgrcv to return -1 with errno set to EINTR).

Receiving Messages

• If the returned message is larger than nbytes and the MSG_NOERROR bit in flag is set, the message is truncated.
  – no notification is given to us that the message was truncated, and the remainder of the message is discarded.
• If the message is too big and MSG_NOERROR is not specified, an error of E2BIG is returned instead (and the message stays on the queue).
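The type rules above can be tried with a short sketch that uses a private queue inside one process. The names demo_msg, send_text and msgq_priority_demo are hypothetical; the message structure follows the usual template of a long type followed by the data bytes.

```c
#include <assert.h>
#include <string.h>
#include <sys/ipc.h>
#include <sys/msg.h>
#include <sys/types.h>

struct demo_msg {              /* hypothetical application-defined message */
    long mtype;                /* message type, must be > 0 */
    char mtext[32];            /* message data */
};

static int send_text(int msqid, long type, const char *text)
{
    struct demo_msg m;

    assert(type > 0 && strlen(text) < sizeof m.mtext);
    m.mtype = type;
    strcpy(m.mtext, text);
    /* nbytes counts only the data, not the long type field */
    return msgsnd(msqid, &m, strlen(text) + 1, 0);
}

/* Sends "low" (type 2) and then "high" (type 1) on a private queue, then
 * receives with type = -2, which selects the lowest type <= 2 first.
 * Copies the first received text to out; returns 0 on success, -1 on error. */
int msgq_priority_demo(char *out, size_t outsz)
{
    struct demo_msg m;
    int rc = -1;
    int msqid = msgget(IPC_PRIVATE, IPC_CREAT | 0600);

    if (msqid == -1)
        return -1;
    if (send_text(msqid, 2, "low") == 0 &&
        send_text(msqid, 1, "high") == 0 &&
        msgrcv(msqid, &m, sizeof m.mtext, -2, 0) != -1 &&
        strlen(m.mtext) < outsz) {
        strcpy(out, m.mtext);
        rc = 0;
    }
    msgctl(msqid, IPC_RMID, NULL);       /* queues persist until removed */
    return rc;
}
```

Although "low" was sent first, the receive with a negative type returns "high", which is how message types can act as priorities.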

Control Operations on Message Queues

• #include <sys/msg.h>
  int msgctl(int msqid, int cmd, struct msqid_ds *buf );

• IPC_STAT: Fetch the msqid_ds structure for this queue, storing it in the structure pointed to by buf.

• IPC_SET: Copy the following fields from the structure pointed to by buf to the msqid_ds structure associated with this queue: msg_perm.uid, msg_perm.gid, msg_perm.mode, and msg_qbytes.

• IPC_RMID: Remove the message queue from the system and any data still on the queue. This removal is immediate.

– Any other process still using the message queue will get an error of EIDRM on its next attempted operation on the queue.

– IPC_SET and IPC_RMID can be executed only by a process whose effective user ID equals msg_perm.cuid or msg_perm.uid, or by a process with superuser privileges

Server.c

/*key.h*/
#define MSGQ_PATH "/home/students/f2007045/msgq_server.c"

/*Server.c*/
#include <stdio.h>
#include <stdlib.h>
#include <sys/ipc.h>
#include <sys/msg.h>
#include "key.h"

struct my_msgbuf
{
  long mtype;
  char mtext[200];
};

int main (void)
{
  struct my_msgbuf buf;
  int msqid;
  key_t key;
  if ((key = ftok (MSGQ_PATH, 'B')) == -1)
    {
      perror ("ftok");
      exit (1);
    }
  if ((msqid = msgget (key, IPC_CREAT | 0644)) == -1)
    {
      perror ("msgget");
      exit (1);
    }
  printf ("server: ready to receive messages\n");
  for (;;)
    {
      /* the size argument counts only mtext, not the long mtype */
      if (msgrcv (msqid, &buf, sizeof (buf.mtext), 0, 0) == -1)
        {
          perror ("msgrcv");
          exit (1);
        }
      printf ("server: \"%s\"\n", buf.mtext);
    }
  return 0;
}

Client.c

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ipc.h>
#include <sys/msg.h>
#include "key.h"

struct my_msgbuf
{
  long mtype;
  char mtext[200];
};

int main (void)
{
  struct my_msgbuf buf;
  int msqid;
  key_t key;
  if ((key = ftok (MSGQ_PATH, 'B')) == -1)
    {
      perror ("ftok");
      exit (1);
    }
  if ((msqid = msgget (key, 0)) == -1)
    {
      perror ("msgget");
      exit (1);
    }
  printf ("Enter lines of text, ^D to quit:\n");
  buf.mtype = 1;
  /* fgets replaces the unsafe gets; the trailing newline is stripped */
  while (fgets (buf.mtext, sizeof (buf.mtext), stdin) != NULL)
    {
      buf.mtext[strcspn (buf.mtext, "\n")] = '\0';
      if (msgsnd (msqid, &buf, strlen (buf.mtext) + 1, 0) == -1)
        perror ("msgsnd");
    }
  if (msgctl (msqid, IPC_RMID, NULL) == -1)
    {
      perror ("msgctl");
      exit (1);
    }
  return 0;
}

Multiplexing Messages

• Possibility of deadlock

Multiplexing Messages

System V Semaphores

• A semaphore is a primitive used to provide synchronization between various processes (or between various threads in a given process)

• Binary Semaphores: a semaphore that can assume only values 0 or 1

• Counting Semaphores: semaphore is initialized to N indicating the number of resources

System V Semaphores

• Semaphores are maintained by kernel

Semaphore operations

• Create a semaphore and initialize it – should be atomically done

• Wait for a semaphore: This tests the value of the semaphore, waits (blocks) if the value is less than or equal to 0, and then decrements the semaphore value once it is greater than 0 (aka P, lock, wait)
  – Testing and decrementing should be a single atomic operation
• Post a semaphore: This increments the semaphore value. If any processes are blocked waiting for this semaphore’s value to be greater than 0, one of those processes is woken up (aka V, unlock, signal)

Producer Consumer Problem

• Producer produces one item and keeps it in the buffer.
• Consumer removes that item for processing
• How to synchronize?

Producer Consumer Problem

• Semaphore put controls whether the producer can place an item into the shared buffer

• Semaphore get controls whether the consumer can remove an item from the shared buffer

System V Semaphores

• Add one more level of detail by defining “a set of counting semaphores”

• When we say System V semaphore it refers to a set of counting semaphores (the maximum size of a set is a system limit, e.g. 25)

System V Semaphores

• Kernel maintains the following structure for every set

• Sem structure maintains info about each semaphore. Sem_base contains pointer to an array of these structures

System V Semaphores

• Kernel structure for a semaphore set having 2 counting semaphores

Creating Semaphores

• The number of semaphores in the set is nsems. If a new set is being created, we must specify nsems. If we are referencing an existing set, we can specify nsems as 0.

• When a new set is created, the following members of the semid_ds structure are initialized.
  – The ipc_perm structure
  – sem_otime is set to 0.
  – sem_ctime is set to the current time.
  – sem_nsems is set to nsems.

Initializing a semaphore value

• semnum specifies which semaphore in the set (0, 1, 2, …)
• The semun union is used as the argument for some commands

• On many systems this union does not appear in any system header, so it should be declared in your program

Testing whether semaphore has been initialized

• When process P1 creates semaphore sem_otime is set to zero.

• When P1 calls semctl to initialize and then semop, sem_otime is set to current time.

• When process P2 checks sem_otime is non zero it understands that semaphore has been initialized.

semctl() commands

• IPC_STAT, IPC_SET, IPC_RMID same as in message queues
• GETVAL: Return the value of semval for the member semnum.
• SETVAL: Set the value of semval for the member semnum. The value is specified by arg.val.
• GETPID: Return the value of sempid for the member semnum.
• GETNCNT: Return the value of semncnt for the member semnum.
• GETZCNT: Return the value of semzcnt for the member semnum.
• GETALL: Fetch all the semaphore values in the set. These values are stored in the array pointed to by arg.array.
• SETALL: Set all the semaphore values in the set to the values pointed to by arg.array

Semaphore operations

• opsptr points to an array of the following structure (struct sembuf, with members sem_num, sem_op and sem_flg)

• nops specifies the number of structures in the array
• semop guarantees that either all these operations are done or none are done

• The operation on each member of the set is specified by the corresponding sem_op value. This value can be negative, 0, or positive.

• If sem_op > 0:
  – returning of resources by the process.
  – semval += sem_op
  – If the SEM_UNDO flag is specified, semadj -= sem_op, i.e. sem_op is subtracted from the semaphore's adjustment value for this process.

• If sem_op < 0:
  – obtain resources that the semaphore controls.
  – If semval >= |sem_op|, the resources are available
    • semval -= |sem_op|
    • If the SEM_UNDO flag is specified, semadj += |sem_op|, i.e. the absolute value of sem_op is added to the semaphore's adjustment value for this process.
  – If semval < |sem_op|, the resources are not available
    • If IPC_NOWAIT is specified, semop returns with an error of EAGAIN.
    • If IPC_NOWAIT is not specified, the semncnt value for this semaphore is incremented (since the caller is about to go to sleep), and the calling process is suspended until one of the following occurs.
      – semval >= |sem_op|, i.e. some other process has released some resources. semncnt--
      – The semaphore is removed from the system. In this case, the function returns an error of EIDRM.
      – A signal is caught by the process, and the signal handler returns. The function returns an error of EINTR. semncnt--

• If sem_op == 0:
  – this means that the calling process wants to wait until the semaphore's value becomes 0.
  – If the semaphore's value is currently 0, the function returns immediately.
  – If the semaphore's value is nonzero, the following conditions apply.
    • If IPC_NOWAIT is specified, return is made with an error of EAGAIN.
    • If IPC_NOWAIT is not specified, semzcnt++, and the calling process is suspended until one of the following occurs.
      – The semaphore's value becomes 0. semzcnt--
      – The semaphore is removed from the system. In this case, the function returns an error of EIDRM.
      – A signal is caught by the process, and the signal handler returns. The function returns an error of EINTR. semzcnt--

Semval adjustment on process termination

• It is a problem if a process terminates while it has resources allocated through a semaphore.

• Whenever we specify the SEM_UNDO flag for a semaphore operation and we allocate resources (a sem_op value less than 0), the kernel remembers how many resources we allocated from that particular semaphore (the absolute value of sem_op).

• When the process terminates, either voluntarily or involuntarily, the kernel checks whether the process has any outstanding semaphore adjustments and, if so, applies the adjustment to the corresponding semaphore value.

• If we set the value of a semaphore using semctl, with either the SETVAL or SETALL commands, the adjustment value for that semaphore in all processes is set to 0.

Producer Consumer

/* Producer */
unsigned short val[1];
id = semget (KEY, 1, IPC_CREAT | 0666);
setval.val = 2;
semctl (id, 0, SETVAL, setval);

operations[0].sem_num = 0;
operations[0].sem_op = 0;       /* wait until the semaphore value is 0 */
operations[0].sem_flg = 0;
operations[1].sem_num = 0;
operations[1].sem_op = 10;      /* then add 10 */
operations[1].sem_flg = 0;
for (;;)
  {
    retval = semop (id, operations, 2);
    if (retval == 0)
      {
        printf ("Producer: Adding 10 objects\n");
        getval.array = val;
        semctl (id, 0, GETALL, getval);
        printf ("Sem Val: %d\n", getval.array[0]);
      }
  }

/* Consumer */
id = semget (KEY, 1, 0666);
operations[0].sem_num = 0;
operations[0].sem_op = -1;      /* take one object */
operations[0].sem_flg = 0;
for (;;)
  {
    retval = semop (id, operations, 1);
    if (retval == 0)
      {
        printf ("Consumer: Getting one object from shelf.\n");
        setval.array = val;
        semctl (id, 0, GETALL, setval);
        printf ("Sem Value: %d\n", setval.array[0]);
      }
  }
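The fragments above omit their declarations, so here is a self-contained sketch of the same calls on a private one-member set. The union is declared by the program under the hypothetical name semun_arg (to avoid clashing with systems that already define semun), and sem_change and sem_demo are hypothetical helpers. IPC_NOWAIT is used so that a failed wait returns EAGAIN instead of blocking.

```c
#include <assert.h>
#include <errno.h>
#include <sys/ipc.h>
#include <sys/sem.h>
#include <sys/types.h>

/* Fourth argument of semctl(), declared by the application. */
union semun_arg {
    int val;
    struct semid_ds *buf;
    unsigned short *array;
};

static int sem_change(int semid, short delta)
{
    struct sembuf op;

    assert(delta != 0);           /* 0 would mean "wait for zero" instead */
    op.sem_num = 0;               /* first (and only) member of the set */
    op.sem_op = delta;            /* < 0 obtains, > 0 returns resources */
    op.sem_flg = IPC_NOWAIT;      /* fail with EAGAIN instead of blocking */
    return semop(semid, &op, 1);
}

/* Creates a one-member set with value 2, takes both resources, checks that
 * a third take fails with EAGAIN, gives one back, and returns the final
 * semaphore value, or -1 on error. */
int sem_demo(void)
{
    union semun_arg arg;
    int value = -1;
    int semid = semget(IPC_PRIVATE, 1, IPC_CREAT | 0600);

    if (semid == -1)
        return -1;
    arg.val = 2;
    if (semctl(semid, 0, SETVAL, arg) != -1 &&
        sem_change(semid, -1) == 0 &&
        sem_change(semid, -1) == 0 &&
        sem_change(semid, -1) == -1 && errno == EAGAIN &&
        sem_change(semid, +1) == 0)
        value = semctl(semid, 0, GETVAL);
    semctl(semid, 0, IPC_RMID);   /* sets persist until removed */
    return value;
}
```

The value goes 2, 1, 0, stays 0 on the refused third take, and ends at 1 after the post, so sem_demo() returns 1.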

Shared Memory

• Shared memory allows two or more processes to share a given region of memory.

• This is the fastest form of IPC, because the data does not need to be copied between the client and the server

Message Passing

• Takes 4 copies to transfer data between two processes

Shared Memory

• Takes only two steps
• Kernel is not involved in transferring data but it is involved in creating shared memory

Memory mapped files

• prot argument for read-write access is PROT_READ|PROT_WRITE

• Flags must be either MAP_SHARED or MAP_PRIVATE

• MAP_SHARED is used to share memory with other processes

Why mmap()?

• It makes file handling easy. We open some file and map that file into our process address space. To write or read from file we don’t have to use read(), write() or lseek()

• Another use is to provide shared memory between unrelated processes

Counter Example

• Closing file has no effect on memory mapping

• Memory mappings are propagated to newly created child
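Both points can be seen in a small counter sketch: the file is closed right after mmap, and the child updates the counter through the mapping it inherited. The helper name shared_file_counter and the temporary file location under /tmp are assumptions of this sketch; the parent waits for the child, so no semaphore is needed here.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: maps a temporary file MAP_SHARED, closes the file,
 * forks, lets the child add `add` to the mapped int, and returns the value
 * the parent sees afterwards, or -1 on error. */
int shared_file_counter(int add)
{
    char path[] = "/tmp/mmap_counter_XXXXXX";
    int zero = 0;
    int result;
    int *counter;
    pid_t pid;
    int fd;

    assert(add >= 0);                       /* keeps -1 free as the error value */
    fd = mkstemp(path);
    if (fd == -1)
        return -1;
    unlink(path);                           /* name is no longer needed */
    if (write(fd, &zero, sizeof zero) != (ssize_t)sizeof zero) {
        close(fd);                          /* file must be as long as the mapping */
        return -1;
    }
    counter = mmap(NULL, sizeof *counter, PROT_READ | PROT_WRITE,
                   MAP_SHARED, fd, 0);
    close(fd);                              /* the mapping stays valid */
    if (counter == MAP_FAILED)
        return -1;

    pid = fork();                           /* child inherits the mapping */
    if (pid == 0) {
        *counter += add;
        _exit(0);
    }
    if (pid == -1 || waitpid(pid, NULL, 0) == -1)
        result = -1;
    else
        result = *counter;
    munmap(counter, sizeof *counter);
    return result;
}
```

When parent and child update the counter concurrently, the increments have to be protected, for example with a semaphore.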

System V Shared Memory

• For every shared memory segment kernel maintains the following structure

System V Shared Memory

• Creating or opening shared memory
  – #include <sys/shm.h>
  – int shmget(key_t key, size_t size, int flag);
  – Size is given as zero if we are referencing an existing shared memory segment
  – When a new segment is created, the contents of the segment are initialized with zeros

Attaching shared memory to a process

• Once a shared memory segment has been created, a process attaches it to its address space by calling shmat.

  – #include <sys/shm.h>
  – void *shmat(int shmid, const void *addr, int flag);
    Returns: pointer to shared memory segment if OK, (void *)-1 on error
• The address in the calling process at which the segment is attached depends on the addr argument
• If addr is 0, the segment is attached at the first available address selected by the kernel. This is the recommended technique.

Detaching shared memory from a process

• #include <sys/shm.h>
• int shmdt(const void *addr);
• This does not remove the identifier and its associated data structure from the system.
• The identifier remains in existence until some process (often a server) specifically removes it by calling shmctl with a command of IPC_RMID.

shmctl

• #include <sys/shm.h>
• int shmctl(int shmid, int cmd, struct shmid_ds *buf);
• IPC_STAT, IPC_SET same as other XSI IPC.
• IPC_RMID: Remove the shared memory segment from the system. The segment is not removed until the last process using the segment terminates or detaches it.
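The four calls can be sketched together as follows. The helper name shm_roundtrip and the SEG_SIZE constant are hypothetical; a private segment is shared between a parent and its child to keep the sketch in one program.

```c
#include <assert.h>
#include <string.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define SEG_SIZE 64      /* hypothetical segment size for this sketch */

/* Hypothetical helper: the child copies msg into a shared memory segment and
 * the parent copies it out again. Returns 0 on success, -1 on error. */
int shm_roundtrip(const char *msg, char *out, size_t outsz)
{
    int rc = -1;
    pid_t pid;
    char *seg;
    int shmid;

    assert(strlen(msg) < SEG_SIZE && outsz >= SEG_SIZE);
    shmid = shmget(IPC_PRIVATE, SEG_SIZE, IPC_CREAT | 0600);
    if (shmid == -1)
        return -1;
    seg = shmat(shmid, NULL, 0);        /* addr 0: the kernel picks the address */
    if (seg != (void *)-1) {
        pid = fork();
        if (pid == 0) {                 /* child inherits the attached segment */
            strcpy(seg, msg);
            _exit(0);
        }
        if (pid > 0 && waitpid(pid, NULL, 0) == pid) {
            strcpy(out, seg);           /* a new segment starts zero-filled */
            rc = 0;
        }
        shmdt(seg);                     /* detach; the identifier still exists */
    }
    shmctl(shmid, IPC_RMID, NULL);      /* remove the identifier */
    return rc;
}
```

No read or write system call moves the message: the child stores it directly into memory that the parent can also see.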

Memory Mapping of /dev/zero

• Shared memory can be used between unrelated processes. But if the processes are related, some implementations provide a different technique.

• The device /dev/zero is an infinite source of 0 bytes when read. This device also accepts any data that is written to it, ignoring the data.

• An unnamed memory region is created and is initialized to 0.
• Multiple processes can share this region if a common ancestor specifies the MAP_SHARED flag to mmap.

int fd;
void *area;
if ((fd = open("/dev/zero", O_RDWR)) < 0)
    perror("open error");
if ((area = mmap(0, SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0)) == MAP_FAILED)
    perror("mmap error");
close(fd);

Anonymous Memory Mapping

• A facility similar to the /dev/zero feature. To use this facility, we specify the MAP_ANON flag to mmap and specify the file descriptor as -1.

• The resulting region is anonymous (since it's not associated with a pathname through a file descriptor) and creates a memory region that can be shared with descendant processes.

• In this call, we specify the MAP_ANON flag and set the file descriptor to -1.

void *area;
if ((area = mmap(0, SIZE, PROT_READ | PROT_WRITE, MAP_ANON | MAP_SHARED, -1, 0)) == MAP_FAILED)
    perror("mmap error");

Shared Memory

• Between unrelated processes:
  – XSI or System V shared memory
  – can use mmap to map the same file into another process's address space using the MAP_SHARED flag.
• Between related processes
  – Memory mapping of /dev/zero
  – Anonymous memory mapping

• Pipes and FIFOs
• System V Message Queues, Semaphores, Shared Memory

• Posix Message Queues, semaphores, shared memory

Effect of fork, exec, _exit on IPC

TCP/UDP

TCP/IP

TCP or UDP

• At the internet layer, a destination address identifies a host computer; no further distinction is made regarding which process will receive the datagram

• TCP or UDP add a mechanism that distinguishes among destinations within a given host, allowing multiple processes to send and receive datagrams independently

UDP (User Datagram Protocol)

• UDP provides an unreliable connectionless delivery service

• UDP uses IP to deliver datagrams to the right host.
• UDP uses ports to provide communication services to individual processes.

• TCP/IP uses an abstract destination point called a protocol port.

• Ports are identified by a positive integer.
• Operating systems provide some mechanism that processes use to specify a port.

Port Numbers

• The port numbers are divided into three ranges by Internet Assigned Numbers Authority

• The well-known ports: 0 through 1023. These port numbers are controlled and assigned by the IANA.

• The registered ports: 1024 through 49151. These are not controlled by the IANA, but the IANA registers and lists the uses of these ports as a convenience to the community.

• The dynamic or private ports, 49152 through 65535. The IANA says nothing about these ports. These are what we call ephemeral ports. (The magic number 49152 is three-fourths of 65536.)
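The three ranges can be written as a small classifier. The enum and function names are hypothetical; only the boundaries come from the list above.

```c
#include <assert.h>

enum port_range { PORT_WELL_KNOWN, PORT_REGISTERED, PORT_DYNAMIC };

/* Maps a 16-bit port number to the range it falls in. */
enum port_range classify_port(unsigned int port)
{
    assert(port <= 65535u);          /* port numbers are 16 bits wide */
    if (port <= 1023u)
        return PORT_WELL_KNOWN;      /* 0 through 1023 */
    if (port <= 49151u)
        return PORT_REGISTERED;      /* 1024 through 49151 */
    return PORT_DYNAMIC;             /* 49152 through 65535 */
}
```

For instance, port 80 is in the well-known range and port 8080 is in the registered range.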

UDP header

• Header size is 8 bytes
• Lack of reliability: If a datagram reaches its final destination but the checksum detects an error, or if the datagram is dropped in the network, it is not delivered to the UDP socket and is not automatically retransmitted.

• If we want to be certain that a datagram reaches its destination, we can build lots of features into our application: acknowledgments from the other end, timeouts, retransmissions, and the like.

Some standard UDP based services and their ports

TCP: Transmission Control Protocol

• TCP provides connections between clients and servers.
• TCP uses the connection, not the protocol port, as its fundamental abstraction.
• Connections are identified by a pair of endpoints.
  – Endpoint means (ip, port)
• TCP provides:
  – Connection-oriented
  – Reliable
  – Full-duplex
  – Byte-Stream

Connection-Oriented

• Connection oriented means that a virtual connection is established before any user data is transferred.

• A TCP client establishes a connection with a given server, exchanges data with that server across the connection, and then terminates the connection.

• If the connection cannot be established - the user program is notified.

• If the connection is ever interrupted - the user program(s) is notified.

Reliable

• TCP also provides reliability. When TCP sends data to the other end, it requires an acknowledgment in return.

• If an acknowledgment is not received, TCP automatically retransmits the data and waits a longer amount of time.

• After some number of retransmissions, TCP will give up
  – the total amount of time spent trying to send data is typically between 4 and 10 minutes (depending on the implementation).

Reliable

• How can TCP provide reliable transfer if the underlying communication system offers only unreliable packet delivery?

• Answer is positive acknowledgement with retransmission.

Positive Acknowledgement with Retransmission

Reliability - duplicates

• The underlying packet delivery system may duplicate packets.
  – Duplicates can arise when networks experience high delays that cause premature retransmission.
  – Both packets and acknowledgements can be duplicated.

• Duplicate packets are detected by assigning each packet a sequence number and requiring the receiver to remember which sequence numbers it has received.

• To avoid confusion caused by delayed or duplicated acknowledgements, a TCP acknowledgement specifies the sequence number of the next octet that the receiver expects to receive.
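The receiver side of this scheme can be sketched in a few lines. The struct and function names are hypothetical, and the sketch is deliberately simplified: it accepts only the segment that starts at the next expected octet and drops everything else, whereas a real TCP also buffers out-of-order data.

```c
#include <assert.h>
#include <stdint.h>

struct receiver {
    uint32_t next_expected;   /* sequence number of the next octet wanted */
};

/* Processes a segment carrying `len` octets starting at sequence number
 * `seq`. Sets *accepted to 1 if the data was new and in order, 0 if it was a
 * duplicate or out of order. Returns the acknowledgement number to send. */
uint32_t receive_segment(struct receiver *r, uint32_t seq, uint32_t len,
                         int *accepted)
{
    assert(len > 0);
    if (seq == r->next_expected) {
        r->next_expected += len;      /* wraps modulo 2^32, like TCP */
        *accepted = 1;
    } else {
        *accepted = 0;                /* duplicate or gap: data is ignored */
    }
    return r->next_expected;          /* always "next octet expected" */
}
```

A retransmitted copy of an already received segment produces the same acknowledgement number again, so a duplicated or delayed acknowledgement never tells the sender anything false.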

Byte Stream

• Stream means that the connection is treated as a stream of bytes.
  – If payroll data is being sent, there are no boundaries in the stream differentiating employee records
• The user application does not need to package data in individual datagrams (as with UDP).

Buffering

• TCP is responsible for buffering data and determining when it is time to send a datagram.

• It is possible for an application to tell TCP to send the data it has buffered without waiting for a buffer to fill up.

Full Duplex

• TCP provides transfer in both directions.
• To the application program these appear as 2 unrelated data streams, although TCP can piggyback control and data communication by providing control information (such as an ACK) along with user data.

TCP Ports

• Interprocess communication via TCP is achieved with the use of ports (just like UDP).

• UDP ports have no relation to TCP ports (different name spaces).

TCP Segments

• TCP views the data stream as a sequence of bytes that it divides into segments for transmission. Segments carry varying sizes of data.

• The chunk of data that TCP asks IP to deliver is called a TCP segment.

• Each segment contains:
  – data bytes from the byte stream
  – control information that identifies the data bytes

TCP Segment Format

TCP Segments

• Segments are exchanged to establish connections, transfer data, send acknowledgements, advertise window sizes, and close connections.

• Because TCP uses piggybacking, an acknowledgement can be sent along with data
  – an acknowledgement traveling from machine A to machine B may travel in the same segment as data traveling from machine A to machine B, even though the acknowledgement refers to data sent from B to A

• TCP advertises how much data it is willing to accept every time it sends a segment by specifying its buffer size in the WINDOW field.

Sliding Window

• TCP uses a specialized sliding window mechanism to solve two important problems
  – efficient transmission
  – flow control.

• The TCP window mechanism makes it possible to send multiple segments before an acknowledgement arrives.

• The TCP form of a sliding window protocol also solves the end-to-end flow control problem, by allowing the receiver to restrict transmission until it has sufficient buffer space to accommodate more data.

TCP Sliding Window

• Three markers are maintained (the left edge of the window, the boundary between sent and unsent octets, and the right edge of the window)

• octets up to 2 have been sent and acknowledged,
• octets 3 through 6 have been sent but not acknowledged,
• octets 7 through 9 have not been sent but will be sent without delay
• octets 10 and higher cannot be sent until the window moves
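The four cases above follow from the three markers, as this sketch shows. The struct layout and names are hypothetical, octet numbers are plain integers, and sequence number wraparound is ignored for simplicity.

```c
#include <assert.h>

enum octet_state { SENT_ACKED, SENT_UNACKED, SENDABLE, BLOCKED };

/* Hypothetical representation of the three markers. */
struct send_window {
    unsigned long left;    /* first octet not yet acknowledged */
    unsigned long next;    /* first octet not yet sent */
    unsigned long right;   /* first octet beyond the window */
};

enum octet_state classify_octet(const struct send_window *w, unsigned long n)
{
    assert(w->left <= w->next && w->next <= w->right);
    if (n < w->left)
        return SENT_ACKED;
    if (n < w->next)
        return SENT_UNACKED;
    if (n < w->right)
        return SENDABLE;
    return BLOCKED;
}

/* An acknowledgement for all octets before ack_no slides the window to the
 * right while keeping its size. */
void slide_on_ack(struct send_window *w, unsigned long ack_no)
{
    unsigned long size = w->right - w->left;

    if (ack_no > w->left && ack_no <= w->next) {
        w->left = ack_no;
        w->right = ack_no + size;
    }
}
```

The slide's situation corresponds to left = 3, next = 7 and right = 10. After an acknowledgement with ack_no = 5, octets 10 and 11 become sendable.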

Variable Window Size and Flow Control

• Each acknowledgement contains a window advertisement that specifies how many additional octets of data the receiver is prepared to accept.

• In response to an increased window advertisement, the sender increases the size of its sliding window

• In response to a decreased window advertisement, the sender decreases the size of its window and stops sending octets beyond the boundary.

• In the extreme case, the receiver advertises a window size of zero to stop all transmissions.

TCP Connection Establishment

• Three-way handshake
• It accomplishes two important functions.
  – It guarantees that both sides are ready to transfer data (and that they know they are both ready)
  – It allows both sides to agree on initial sequence numbers.
• Sequence numbers are sent and acknowledged during the handshake. Each machine must choose an initial sequence number at random that it will use to identify bytes in the stream it is sending.

• When a client requests a connection, it sends a “SYN” segment (a special TCP segment) to the server port.

• SYN stands for synchronize. The SYN message includes the client’s ISN.

• ISN is Initial Sequence Number.

• Every TCP segment includes a Sequence Number that refers to the first byte of data included in the segment.

• Every TCP segment includes a Request Number (Acknowledgement Number) that indicates the byte number of the next data that is expected to be received.
  – All bytes before this number have already been received.

• A server accepts a connection.
  – Must be looking for new connections!

• A client requests a connection.
  – Must know where the server is!

Client Starts

• A client starts by sending a SYN segment with the following information:
  – Client’s ISN (generated pseudo-randomly)
  – Maximum Receive Window for client.
  – Optionally (but usually) MSS (largest segment accepted).
  – No payload! (Only TCP headers)

Server Response

• When a waiting server sees a new connection request, the server sends back a SYN segment (with the ACK flag also set) with:
  – Server’s ISN (generated pseudo-randomly)
  – Request Number is Client ISN+1
  – Maximum Receive Window for server.
  – Optionally (but usually) MSS
  – No payload! (Only TCP headers)

Finally

• When the Server’s SYN is received, the client sends back an ACK with:
  – Request Number is Server’s ISN+1

• Why is the third message necessary?
  – HINTS:
    • TCP is a reliable service.
    • IP delivers each TCP segment.
    • IP is not reliable.

• Why does each connection not simply start with the initial sequence number 1?
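The numbers exchanged in the three messages can be traced with a small sketch. The tcp_seg structure and the make_ functions are hypothetical and carry only the fields discussed above; arithmetic on uint32_t wraps modulo 2^32, as TCP sequence numbers do.

```c
#include <assert.h>
#include <stdint.h>

struct tcp_seg {             /* hypothetical, header fields only */
    int syn, ack;            /* flag bits */
    uint32_t seq;            /* sequence number */
    uint32_t ack_no;         /* acknowledgement ("request") number */
};

/* Message 1: client -> server, carries the client's ISN. */
struct tcp_seg make_syn(uint32_t client_isn)
{
    struct tcp_seg s = { 1, 0, client_isn, 0 };
    return s;
}

/* Message 2: server -> client, carries the server's ISN and asks for
 * client ISN + 1. */
struct tcp_seg make_syn_ack(const struct tcp_seg *syn, uint32_t server_isn)
{
    struct tcp_seg s = { 1, 1, server_isn, syn->seq + 1 };
    assert(syn->syn && !syn->ack);
    return s;
}

/* Message 3: client -> server, asks for server ISN + 1. */
struct tcp_seg make_final_ack(const struct tcp_seg *syn_ack)
{
    struct tcp_seg s = { 0, 1, syn_ack->ack_no, syn_ack->seq + 1 };
    assert(syn_ack->syn && syn_ack->ack);
    return s;
}
```

With a client ISN of 1000 and a server ISN of 5000, the server's reply requests 1001 and the client's final ACK requests 5001.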

TCP Options

• MSS option. With this option, the TCP sending the SYN announces the maximum amount of data that it is willing to accept in each TCP segment on this connection.

• Window scale option. The maximum window that either TCP can advertise to the other TCP is 65,535, because the window field in the TCP header is 16 bits wide. This option specifies that the advertised window in the TCP header must be scaled (left-shifted) by 0–14 bits, providing a maximum window of almost one gigabyte (65,535 × 2^14).

• Timestamp option. This option is needed for high-speed connections to prevent possible data corruption caused by old, delayed, or duplicated segments.
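As a worked example of the window scale arithmetic, the sketch below computes the effective window from the 16-bit header field and the negotiated shift. The function name is hypothetical, and returning 0 for an out-of-range shift is just a convention chosen for this example.

```c
#include <stdint.h>

/* Effective receive window: the 16-bit advertised value left-shifted by the
 * negotiated scale factor. Returns 0 if the shift is outside 0..14. */
uint32_t effective_window(uint16_t advertised, unsigned shift)
{
    if (shift > 14)
        return 0;
    return (uint32_t)advertised << shift;
}
```

With the largest field value and the largest shift, the result is 65,535 × 16,384 = 1,073,725,440 bytes, just under 2^30.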

TCP Buffers

• Both the client and server allocate buffers to hold incoming and outgoing data– The TCP layer does this.

• Both the client and server announce with every ACK how much buffer space remains (the Window field in a TCP segment).

Send Buffers

• The application gives the TCP layer some data to send.
• The data is put in a send buffer, where it stays until the data is ACK’d.
– It has to stay, as it might need to be sent again!

• The TCP layer won’t accept data from the application unless (or until) there is buffer space.

Connection Termination

• The TCP layer can send a RST segment that terminates a connection if something is wrong.

• Usually the application tells TCP to terminate the connection gracefully with a FIN segment.

• Either end of the connection can initiate termination.
• A FIN is sent, which means the application is done sending data.
• The FIN is ACK’d.
• The other end must now send a FIN.
• That FIN must be ACK’d.

TCP Connection State Diagram

• There are 11 different states defined for a connection.
– Transitions from one state to another are based on the current state and the segment received in that state.

• One reason for showing the state transition diagram is to show the 11 TCP states with their names. These states are displayed by netstat, which is a useful tool when debugging client/server applications

What is the purpose of TIME_WAIT?

• Once a TCP connection has been terminated (the last ACK sent) there is some unfinished business:
– What if the ACK is lost? The last FIN will be resent and it must be ACK’d.
– What if lost or duplicated segments finally reach a new incarnation of the previous connection after a long delay?
• The MSL (maximum segment lifetime) is the maximum amount of time that any given IP datagram can live in a network. The end that performs the active close stays in TIME_WAIT for 2 × MSL.

Socket Pair

• The socket pair for a TCP connection is the four-tuple that defines the two endpoints of the connection:
– the local IP address, local port, foreign IP address, and foreign port.
• A socket pair uniquely identifies every TCP connection on a network.
• The two values that identify each endpoint, an IP address and a port number, are often called a socket.
• We can extend the concept of a socket pair to UDP, even though UDP is connectionless.

Socket Pair

Writing to TCP Socket

Writing to UDP Socket

Sockets

TCP/IP Model

TCP/IP

• TCP/IP does not include an API definition.
• There are a variety of APIs for use with TCP/IP:
– Sockets
– TLI, XTI
– Winsock
– MacTCP

Functions needed:

• Specify local and remote communication endpoints
• Initiate a connection
• Wait for incoming connection
• Send and receive data
• Terminate a connection gracefully
• Error handling

Berkeley Sockets

• Generic:– support for multiple protocol families.– address representation independence

• Uses existing I/O programming interface as much as possible.– Socket api is similar to file I/O

Socket

• A socket is an abstract representation of a communication endpoint.

• Sockets work with Unix I/O services just like files, pipes & FIFOs.

• Sockets (obviously) have special needs over files:– establishing a connection– specifying communication endpoint addresses

Unix Descriptor Table

Socket Descriptor Data Structure

Creating a Socket

int socket(int family, int type, int protocol);

• family specifies the protocol family (AF_INET for TCP/IP).

• type specifies the type of service (SOCK_STREAM, SOCK_DGRAM).

• protocol specifies the specific protocol (usually 0, which means the default).

socket()

• The socket() system call returns a socket descriptor (small integer) or -1 on error.

• socket() allocates resources needed for a communication endpoint - but it does not deal with endpoint addressing.

Specifying an Endpoint Address

• Remember that the sockets API is generic.
• There must be a generic way to specify endpoint addresses.
• TCP/IP requires an IP address and a port number for each endpoint address.
• Other protocol suites (families) may use other schemes.

Necessary Background Information: POSIX data types

int8_t      signed 8 bit int
uint8_t     unsigned 8 bit int
int16_t     signed 16 bit int
uint16_t    unsigned 16 bit int
int32_t     signed 32 bit int
uint32_t    unsigned 32 bit int

u_char, u_short, u_int, u_long

More POSIX data types

sa_family_t    address family
socklen_t      length of struct
in_addr_t      IPv4 address
in_port_t      IP port number

Generic socket addresses

struct sockaddr {
    uint8_t     sa_len;
    sa_family_t sa_family;
    char        sa_data[14];
};

• sa_family specifies the address type.
• sa_data specifies the address value.

AF_INET

• For AF_INET we need:– 16 bit port number – 32 bit IP address

struct sockaddr_in (IPv4)

struct sockaddr_in {
    uint8_t        sin_len;
    sa_family_t    sin_family;
    in_port_t      sin_port;
    struct in_addr sin_addr;
    char           sin_zero[8];
};

A special kind of sockaddr structure, used for IPv4 sockets

struct in_addr

struct in_addr {
    in_addr_t s_addr;
};

Byte Order

Network Byte Order

• Network communication uses big-endian byte order (most significant byte first), also known as Network Byte Order (NBO)

• All values stored in a sockaddr_in must be in network byte order.– sin_port a TCP/IP port number.– sin_addr an IP address.

Network Byte Order Functions

‘h’ : host byte order
‘n’ : network byte order
‘s’ : short (16bit)
‘l’ : long (32bit)

uint16_t htons(uint16_t);
uint16_t ntohs(uint16_t);

uint32_t htonl(uint32_t);
uint32_t ntohl(uint32_t);
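A small sketch of what htons() does to a port number. The helper name port_to_wire is hypothetical; it copies the converted value byte by byte so the on-the-wire order can be inspected regardless of the host's own byte order.

```c
#include <arpa/inet.h>
#include <stdint.h>
#include <string.h>

/* Store the network-byte-order form of a port into out[0..1]. */
void port_to_wire(uint16_t host_port, unsigned char out[2])
{
    uint16_t net = htons(host_port);  /* host order -> network order */
    memcpy(out, &net, sizeof net);    /* bytes exactly as they sit in memory */
}
```

For port 8080 (0x1F90) the first byte is 0x1F and the second is 0x90 on any host, which is what big-endian means. Converting back with ntohs() recovers the original value.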

TCP/IP Addresses

• We don’t need to deal with sockaddr structures since we will only deal with a real protocol family.

• We can use sockaddr_in structures.

BUT: The C functions that make up the sockets API expect structures of type sockaddr.

Assigning an address to a socket

• The bind() system call is used to assign an address to an existing socket.

int bind(int sockfd, const struct sockaddr *myaddr, socklen_t addrlen);

• bind returns 0 if successful or -1 on error.
• Note the const: bind() does not modify the address structure.

bind()

• calling bind() assigns the address specified by the sockaddr structure to the socket descriptor.

• You can give bind() a sockaddr_in structure: bind( mysock, (struct sockaddr*) &myaddr, sizeof(myaddr) );

bind() Example

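int mysock, err;
struct sockaddr_in myaddr;

mysock = socket(PF_INET, SOCK_STREAM, 0);
memset(&myaddr, 0, sizeof(myaddr));    /* clear sin_zero and any padding */
myaddr.sin_family = AF_INET;
myaddr.sin_port = htons(portnum);
myaddr.sin_addr.s_addr = htonl(ipaddress);    /* sin_addr is a struct; assign its s_addr member */

err = bind(mysock, (struct sockaddr *) &myaddr, sizeof(myaddr));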

Uses for bind()

• There are a number of uses for bind():
– Server would like to bind to a well known address (port number).
– Client can bind to a specific port.
– Client can ask the O.S. to assign any available port number.

IPv4 Address Conversion

int inet_aton(const char *, struct in_addr *);

Convert ASCII dotted-decimal IP address to network byte order 32 bit value. Returns 1 on success, 0 on failure.

char *inet_ntoa(struct in_addr);

Convert network byte ordered value to ASCII dotted-decimal (a string).
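The two conversions can be exercised together as a round trip. This is a minimal sketch; roundtrip_ipv4 is a hypothetical helper, and it copies the result because inet_ntoa() returns a pointer to a static buffer that the next call overwrites.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>

/* Parse a dotted-decimal string and format it back into out.
 * Returns 1 on success, 0 if the text is not a valid IPv4 address
 * or the output buffer is too small. */
int roundtrip_ipv4(const char *text, char *out, size_t outlen)
{
    struct in_addr addr;
    if (inet_aton(text, &addr) == 0)   /* 0 means "invalid address" */
        return 0;
    const char *s = inet_ntoa(addr);   /* static buffer owned by the library */
    if (strlen(s) + 1 > outlen)
        return 0;
    strcpy(out, s);
    return 1;
}
```

The in_addr filled in by inet_aton() is already in network byte order, so it can be stored in sin_addr without calling htonl().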

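TCP Client Server

[Figure: call sequence for a TCP client and server]

Server: socket() → bind() to a “well-known” port → listen() → accept() (blocks until a connection arrives) → read() the request → write() the reply → read() returns End-of-File → close()

Client: socket() → connect() (the “Handshake”) → write() the request → read() the reply → close()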

TCP Client

sd = socket (family, type, protocol);
– family: PF_INET, PF_INET6, PF_UNIX, PF_X25
– type: STREAM, DGRAM
– protocol: 0, used by RAW socket

connect (sd, server_addr, addr_len);
– server_addr holds the family, the server PORT#, and the server IP-ADDR
– the local end gets an ephemeral port and an IP addr (chosen by routing)
– starts the three way handshaking

read (sd, *buff, mbytes);

write (sd, *buff, mbytes);

close (sd);
– starts the disconnect sequence

CONNECT actions
1. socket is valid
2. fill remote endpoint addr/port
3. choose local endpoint addr/port
4. initiate 3-way handshaking

TCP Server

sd = socket (family, type, protocol);

bind (sd, *server_addr, len);
– server_addr holds the family, the well-known port #, and the addr INADDR_ANY

listen (sd, backlog);
1. Turn sd from active to passive
2. Queue length

ssd = accept (sd, *cliaddr, *len);
– sd is the LISTEN SOCKET; ssd is the CONNECT SOCKET returned after the three way handshaking

read (ssd, *buff, mbytes);

write (ssd, *buff, mbytes);

close (ssd);
– starts the disconnect sequence
– closes socket for R/W
– non-blocking
– attempts to send unsent data
– socket option SO_LINGER: block until data sent

socket() Create a socket

• family is one of
– PF_INET (IPv4), PF_INET6 (IPv6), PF_LOCAL (local Unix),
– PF_ROUTE (access to routing tables), PF_KEY (encryption)
• type is one of
– SOCK_STREAM (TCP), SOCK_DGRAM (UDP)
– SOCK_RAW (for special IP packets, PING, etc. Must be root)
• protocol is 0 (used for some raw socket options)
• upon success returns socket descriptor
– Integer, like file descriptor
– Return -1 if failure

int socket(int family, int type, int protocol);

connect()Connect to server

• sockfd is socket descriptor from socket()
• servaddr is a pointer to a structure with:
– port number and IP address
– must be specified (unlike bind())
• addrlen is length of structure
• client doesn’t need bind()
– OS will pick ephemeral port
• returns 0 if ok, -1 on error

int connect(int sockfd, const struct sockaddr *servaddr, socklen_t addrlen);

bind() Assign a local protocol address (“name”) to a socket

• sockfd is socket descriptor from socket()
• myaddr is a pointer to address struct with:
– port number and IP address
– if port is 0, then
• host will pick ephemeral port (very rare for server)
• How do you know the assigned port number? Call getsockname() after bind().
– if IP address is wildcard: INADDR_ANY (multiple net cards)
• host kernel will choose IP address
• INADDR_ constants are defined in <netinet/in.h>
• INADDR_ constants are in host byte order => htonl(INADDR_ANY)
• addrlen is length of structure
• returns 0 if ok, -1 on error
– EADDRINUSE (“Address already in use”)

int bind(int sockfd, const struct sockaddr *myaddr, socklen_t addrlen);

bind() address and port

Process specifies             Result
IP address       port
wildcard         0            kernel chooses IP addr and port
wildcard         nonzero      kernel chooses IP, process specifies port
local IP addr    0            process specifies IP, kernel chooses port
local IP addr    nonzero      process specifies IP and port

Wildcard specified as INADDR_ANY

listen()Change socket state to TCP server

• Sockets default to active (for a client)
– change to passive so OS will accept connection
• sockfd is socket descriptor from socket()
• backlog is maximum number of connections that the server should queue for this socket
– historically 5
– rarely above 15, even on a moderately busy Web server!

int listen(int sockfd, int backlog);

listen()

• Possibility of SYN flooding attack

accept() Return next completed connection

• sockfd is socket descriptor from socket()
• cliaddr and addrlen return protocol address from client
• returns brand new descriptor, created by OS
• if used with fork(), can create concurrent server

int accept(int sockfd, struct sockaddr *cliaddr, socklen_t *addrlen);

read() and write()

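ssize_t read(int sockfd, void *buff, size_t mbytes);
ssize_t write(int sockfd, const void *buff, size_t mbytes);

• Reading and writing data on the connection (TCP is a byte stream, so one write() need not match one read() at the other end)
• Both are system calls
• Both return the number of bytes transferred, or -1 on error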

close() Close socket for use

• sockfd is socket descriptor from socket()
• closes socket for reading/writing
– returns (doesn’t block)
– attempts to send any unsent data
– socket option SO_LINGER
• block until data sent
• or discard any remaining data
– Returns -1 if error

int close(int sockfd);

Descriptor Reference Counts

• For every socket a reference count is maintained, as to how many processes are accessing that socket

• When close() is called on socket descriptor reference count is decreased by 1

• When close() is called on socket descriptor, TCP 4 packet termination sequence will be initiated only if the reference count goes to zero

getsockname() and getpeername() Functions

• getsockname returns the local protocol address associated with a socket

• getpeername returns the foreign protocol address associated with a socket

#include <sys/socket.h>
int getsockname(int sockfd, struct sockaddr *localaddr, socklen_t *addrlen);
int getpeername(int sockfd, struct sockaddr *peeraddr, socklen_t *addrlen);

getsockname()

• In a TCP client that does not call bind, getsockname returns the local IP address and local port number assigned to the connection by the kernel.

• After calling bind with a port number of 0, getsockname returns the local port number that was assigned.

• getsockname can be called to obtain the address family of a socket.

• In a TCP server that binds the wildcard IP address, once a connection is established with a client (accept returns successfully), the server can call getsockname on the connected socket to obtain the local IP address assigned to the connection.
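The "bind with port 0, then ask which port was assigned" case can be sketched as follows. The function name is hypothetical, and binding to the loopback address is an assumption made so the example does not depend on any particular network interface.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Bind a TCP socket to 127.0.0.1 with port 0 and return the port the kernel
 * chose (host byte order), or -1 on any error. */
int bind_ephemeral_port(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = htons(0);            /* 0 asks the kernel to choose */

    socklen_t len = sizeof addr;
    int port = -1;
    if (bind(fd, (struct sockaddr *)&addr, sizeof addr) == 0 &&
        getsockname(fd, (struct sockaddr *)&addr, &len) == 0)
        port = ntohs(addr.sin_port);     /* convert back to host order */

    close(fd);
    return port;
}
```

The socket is closed before returning, so this only demonstrates the lookup; a real server would keep the descriptor and go on to call listen().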

getpeername()

• When a server is execed by the process that calls accept, the only way the server can obtain the identity of the client is to call getpeername

• inetd server works by execing the respective server’s image

getpeername() : inetd

TCP Echo Client

/* Capitalized calls (Socket, Connect, ...) are the error-checking wrappers from unp.h */
int
main(int argc, char **argv)
{
    int sockfd;
    struct sockaddr_in servaddr;

    if (argc != 2)
        err_quit("usage: tcpcli <IPaddress>");

    sockfd = Socket(PF_INET, SOCK_STREAM, 0);

    bzero(&servaddr, sizeof(servaddr));
    servaddr.sin_family = AF_INET;
    servaddr.sin_port = htons(SERV_PORT);
    Inet_pton(AF_INET, argv[1], &servaddr.sin_addr);

    Connect(sockfd, (SA *) &servaddr, sizeof(servaddr));

    str_cli(stdin, sockfd);
    exit(0);
}

str_cli function

void
str_cli(FILE *fp, int sockfd)
{
    char sendline[MAXLINE], recvline[MAXLINE];
    ssize_t n;

    while (Fgets(sendline, MAXLINE, fp) != NULL) {
        Write(sockfd, sendline, strlen(sendline));

        /* leave room for the terminating null that Fputs needs */
        if ( (n = Read(sockfd, recvline, MAXLINE - 1)) == 0)
            err_quit("str_cli: server terminated prematurely");
        recvline[n] = '\0';

        Fputs(recvline, stdout);
    }
}

TCP Concurrent Server

int
main(int argc, char **argv)
{
    int listenfd, connfd;
    pid_t childpid;
    socklen_t clilen;
    struct sockaddr_in cliaddr, servaddr;

    listenfd = Socket(AF_INET, SOCK_STREAM, 0);

    bzero(&servaddr, sizeof(servaddr));
    servaddr.sin_family = AF_INET;
    servaddr.sin_addr.s_addr = htonl(INADDR_ANY);
    servaddr.sin_port = htons(SERV_PORT);

    Bind(listenfd, (SA *) &servaddr, sizeof(servaddr));

    Listen(listenfd, LISTENQ);

    for ( ; ; ) {
        clilen = sizeof(cliaddr);
        connfd = Accept(listenfd, (SA *) &cliaddr, &clilen);

        if ( (childpid = Fork()) == 0) {    /* child process */
            Close(listenfd);    /* close listening socket */
            str_echo(connfd);   /* process the request */
            exit(0);
        }
        Close(connfd);    /* parent closes connected socket */
    }
}

str_echo function

void
str_echo(int sockfd)
{
    ssize_t n;
    char buf[MAXLINE];

again:
    while ( (n = read(sockfd, buf, MAXLINE)) > 0)
        Write(sockfd, buf, n);

    if (n < 0 && errno == EINTR)
        goto again;    /* interrupted by a signal: restart the read */
    else if (n < 0)
        err_sys("str_echo: read error");
}

TCP Concurrent Server

• Handling zombies
– while ( (pid = waitpid(-1, &stat, WNOHANG)) > 0) in SIGCHLD signal handler
• Handling interrupted system calls
– when writing network programs that catch signals, we must be cognizant of interrupted system calls, and we must handle them
– A slow system call is any system call that can block forever

Handling interrupted system calls

for ( ; ; ) {
    clilen = sizeof(cliaddr);
    if ( (connfd = accept(listenfd, (SA *) &cliaddr, &clilen)) < 0) {
        if (errno == EINTR)
            continue;    /* back to for () */
        else
            err_sys("accept error");
    }
    /* ... rest of the loop body ... */
}
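The same retry pattern applies to any slow system call. A minimal sketch for read(), using a hypothetical wrapper named read_retry:

```c
#include <errno.h>
#include <unistd.h>

/* read() that restarts when a signal interrupts it before any data arrives.
 * Returns what read() returns otherwise: bytes read, 0 at end-of-file,
 * or -1 with errno set to a real error. */
ssize_t read_retry(int fd, void *buf, size_t len)
{
    ssize_t n;
    do {
        n = read(fd, buf, len);
    } while (n < 0 && errno == EINTR);
    return n;
}
```

The loop only retries on EINTR, so genuine errors are still reported to the caller.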

Connection Abort before accept Returns

• SVR4 and POSIX return an error of EPROTO or ECONNABORTED

• Berkeley-derived kernels never return any error

Termination of Server Process

• FIN is sent to client
• Client TCP sends ACK to server
• What if the client application doesn’t take note of it, and sends data to the server?

SIGPIPE Signal

• When a process writes to a socket that has received an RST, the SIGPIPE signal is sent to the process. The default action of this signal is to terminate the process, so the process must catch the signal to avoid being involuntarily terminated.
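One common alternative to catching SIGPIPE is to ignore it and check the return value of write() instead, which then fails with EPIPE. The sketch below demonstrates this on a local AF_UNIX socket pair whose peer has been closed; that is an assumption standing in for a TCP connection that has received an RST, and the function name is hypothetical.

```c
#include <errno.h>
#include <signal.h>
#include <sys/socket.h>
#include <unistd.h>

/* Write to a stream socket whose peer is gone, with SIGPIPE ignored.
 * Returns the errno from write(), 0 if the write unexpectedly succeeded,
 * or -1 if the socket pair could not be created. */
int write_after_peer_close(void)
{
    int sv[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
        return -1;

    signal(SIGPIPE, SIG_IGN);   /* otherwise the default action terminates us */
    close(sv[1]);               /* the peer goes away */

    int err = 0;
    if (write(sv[0], "x", 1) < 0)
        err = errno;            /* expected: EPIPE */

    close(sv[0]);
    return err;
}
```

Ignoring the signal is process-wide, so a program that does this must check every write() for errors.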

Crashing of Server Host

• Nothing is sent to client
• Client will try to reach the host, but will get errors such as ETIMEDOUT, EHOSTUNREACH, ENETUNREACH

Crashing and Rebooting of Server Host

• When client sends packets, server will respond with RST

Shutdown of Server Host

• Init sends SIGTERM to all processes
• Then sends SIGKILL to all processes
• FIN is sent to the client

I/O Multiplexing

• We often need to be able to monitor multiple descriptors:
– a generic TCP client (like telnet)
– need to be able to handle unexpected situations, perhaps a server that shuts down without warning.
– A server that handles both TCP and UDP

Example - generic TCP client

• Input from standard input should be sent to a TCP socket.

• Input from a TCP socket should be sent to standard output.

• How do we know when to check for input from each source?

Generic TCP Client

[Figure: a generic TCP client connected to standard input, STDOUT, and a TCP socket]

Different Solutions

• Use nonblocking I/O.
– use fcntl() to set O_NONBLOCK
• Use alarm and signal handler to interrupt slow system calls.
• Use multiple processes/threads.
• Use functions that support checking of multiple input sources at the same time.

Non blocking I/O

• use fcntl() to set O_NONBLOCK:

int flags;
flags = fcntl(sock, F_GETFL, 0);
fcntl(sock, F_SETFL, flags | O_NONBLOCK);

• Now calls to read() (and other system calls) that would otherwise block will return an error and set errno to EWOULDBLOCK.

while (!done) {
    /* parenthesize the assignment: < binds more tightly than = */
    if ( (n = read(STDIN_FILENO, …)) < 0) {
        if (errno != EWOULDBLOCK)
            ; /* ERROR */
    } else
        write(tcpsock, …);

    if ( (n = read(tcpsock, …)) < 0) {
        if (errno != EWOULDBLOCK)
            ; /* ERROR */
    } else
        write(STDOUT_FILENO, …);
}
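The effect of O_NONBLOCK can be checked on any descriptor, not just a socket. In this sketch both helper names are hypothetical, and the test for both EAGAIN and EWOULDBLOCK is there because systems differ on whether the two names share one value.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Add O_NONBLOCK to a descriptor's flags. Returns 0 on success, -1 on error. */
int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0 ? -1 : 0;
}

/* Try to read one byte. Returns 1 if the read failed only because it would
 * have blocked, 0 in every other case (data, end-of-file, or a real error). */
int read_would_block(int fd)
{
    char c;
    ssize_t n = read(fd, &c, 1);
    return n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK);
}
```

Note that read_would_block() consumes a byte when one is available, so it is only suitable as a demonstration.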

The problem with nonblocking I/O

• Using blocking I/O allows the Operating System to put your program to sleep when nothing is happening (no input). Once input arrives the OS will wake up your program and read() (or whatever) will return.

• With nonblocking I/O the process will waste processor time in a busy-wait.

Using alarms

signal(SIGALRM, sig_alrm);
alarm(MAX_TIME);
read(STDIN_FILENO,…);
...

signal(SIGALRM, sig_alrm);
alarm(MAX_TIME);
read(tcpsock,…);
...

Alarming Problem

• What will be the effect on response time ?

• What is the ‘right’ value for MAX_TIME?

Select()

• The select() system call allows us to use blocking I/O on a set of descriptors (file, socket, …).

• For example, we can ask select to notify us when data is available for reading on either STDIN or a TCP socket.

I/O Models

• Blocking
• Non-Blocking
• IO Multiplexing
• Signal-driven IO
• Asynchronous IO

IO Models

• Two phases
– Waiting for the data
– Copying the data

Blocking I/O

[Figure: the application calls recvfrom (system call) while no datagram is ready. The process blocks in the call to recvfrom while the kernel waits for data and then copies the datagram from kernel to user. When the copy is complete, recvfrom returns OK and the application processes the datagram.]

Nonblocking I/O

[Figure: the process repeatedly calls recvfrom, waiting for an OK return (polling). While no datagram is ready, each system call returns EWOULDBLOCK immediately. Once a datagram is ready, the next recvfrom copies the datagram from kernel to user and returns OK, and the application processes the datagram.]

I/O multiplexing (select and poll)

[Figure: the process blocks in a call to select, waiting for one of possibly many sockets to become readable. When a datagram is ready, select returns readable. The application then calls recvfrom, and the process blocks while the data is copied from kernel to user into the application buffer. When the copy is complete, recvfrom returns OK and the application processes the datagram.]

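Signal-driven I/O (SIGIO)

[Figure: the application establishes a SIGIO signal handler with the sigaction system call, which returns immediately, and the process continues executing while the kernel waits for data. When a datagram is ready the kernel delivers SIGIO. The signal handler calls recvfrom, and the process blocks while the data is copied from kernel to user into the application buffer. When the copy is complete, recvfrom returns OK and the application processes the datagram.]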

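Asynchronous I/O

[Figure: the application calls aio_read (system call), which returns immediately even though no datagram is ready. The process continues executing while the kernel waits for data and copies it from kernel to user. When the copy is complete, the kernel delivers the signal specified in aio_read, and the signal handler processes the datagram.]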

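Comparison of the I/O Models

[Figure: the two phases (wait for data; copy data from kernel to user) compared across the five models]

– blocking: initiate, blocked through both phases, complete
– nonblocking: check repeatedly during the first phase, blocked during the copy, complete
– I/O multiplexing: check, blocked until ready, then initiate, blocked during the copy, complete
– signal-driven I/O: notification when ready, then initiate, blocked during the copy, complete
– asynchronous I/O: initiate, then notification when complete

In the first four models the 1st phase is handled differently and the 2nd phase is handled the same; asynchronous I/O handles both phases.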

Select()

int select(int maxfd, fd_set *readset, fd_set *writeset, fd_set *excepset, struct timeval *timeout);

maxfd: highest number assigned to a descriptor, plus one.
readset: set of descriptors we want to read from.
writeset: set of descriptors we want to write to.
excepset: set of descriptors to watch for exceptions.
timeout: maximum time select should wait

struct timeval

struct timeval {
    long tv_sec;     /* seconds */
    long tv_usec;    /* microseconds */
};

struct timeval max = {1,0};

Condition of select function

• Wait forever: return only when a descriptor is ready (timeout = NULL)
• Wait up to a fixed amount of time: return when a descriptor is ready, but no later than the time given in the timeval
• Do not wait at all: return immediately after checking the descriptors (timeval = 0)
• Either wait is normally interrupted if the process catches a signal and returns from the signal handler

• readset => descriptors to check for readability
• writeset => descriptors to check for writability
• exceptset => descriptors to check for two exception conditions:
– arrival of out-of-band data for a socket
– the presence of control status information to be read from the master side of a pseudo terminal

Select Function

Descriptor sets

• fd_set: an array of integers, with each bit in each integer corresponding to a descriptor.

• void FD_ZERO(fd_set *fdset); /* clear all bits in fdset */
• void FD_SET(int fd, fd_set *fdset); /* turn on the bit for fd in fdset */
• void FD_CLR(int fd, fd_set *fdset); /* turn off the bit for fd in fdset */
• int FD_ISSET(int fd, fd_set *fdset); /* is the bit for fd on in fdset? */

Example of Descriptor sets function

fd_set rset;

FD_ZERO(&rset);      /* all bits off: initialize */
FD_SET(1, &rset);    /* turn on bit for fd 1 */
FD_SET(4, &rset);    /* turn on bit for fd 4 */
FD_SET(5, &rset);    /* turn on bit for fd 5 */

Maxfdp1

• Specifies the number of descriptors to be tested.
• Its value is the maximum descriptor to be tested, plus one
– (example: fd 1, 2, 5 => maxfdp1: 6)
• The constant FD_SETSIZE, defined by including <sys/select.h>, is the number of descriptors in the fd_set datatype (1024).
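Putting the descriptor set, the "maximum descriptor plus one" argument, and the timeout together, a minimal sketch of waiting on a single descriptor might look like this. The function name wait_readable and its millisecond parameter are hypothetical conveniences, not part of the select() interface.

```c
#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>

/* Wait up to timeout_ms for fd to become readable.
 * Returns 1 if readable, 0 on timeout, -1 on error. */
int wait_readable(int fd, int timeout_ms)
{
    fd_set rset;
    FD_ZERO(&rset);                 /* start with every bit off */
    FD_SET(fd, &rset);              /* watch just this descriptor */

    struct timeval tv;
    tv.tv_sec  = timeout_ms / 1000;
    tv.tv_usec = (timeout_ms % 1000) * 1000;

    /* first argument is the highest descriptor in any set, plus one */
    int n = select(fd + 1, &rset, NULL, NULL, &tv);
    if (n < 0)
        return -1;
    return (n > 0 && FD_ISSET(fd, &rset)) ? 1 : 0;
}
```

Because select() overwrites the sets with the ready descriptors (and may modify the timeval), both are rebuilt on every call rather than reused.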

When is the descriptor ready for reading?

• The number of bytes of data in the socket receive buffer is greater than or equal to the current size of the low-water mark for the socket receive buffer. SO_RCVLOWAT socket option. It defaults to 1 for TCP and UDP sockets

• The read half of the connection is closed (i.e., a TCP connection that has received a FIN)

• The socket is a listening socket and the number of completed connections is nonzero.

• A socket error is pending. A read operation on the socket will not block and will return an error (–1) with errno set to the specific error condition.

– These pending errors can also be fetched and cleared by calling getsockopt and specifying the SO_ERROR socket option.

When is the socket ready for writing?

• The number of bytes of available space in the socket send buffer is greater than or equal to the current size of the low-water mark for the socket send buffer and eit

• The write half of the connection is closed. A write operation on the socket will generate SIGPIPE

• A socket using a non-blocking connect has completed the connection, or the connect has failed

• A socket error is pending. A write operation on the socket will not block and will return an error (–1) with errno set to the specific error condition.

– These pending errors can also be fetched and cleared by calling getsockopt with the SO_ERROR socket option.

When is the socket descriptor returned in exception list?

• A socket has an exception condition pending if there is out-of-band data for the socket

• or the socket is still at the out-of-band mark

Conditions that cause a socket to be ready for select

Condition                                    Readable?  Writable?  Exception?
Data to read                                 •
Read-half of the connection closed           •
New connection ready for listening socket    •
Space available for writing                             •
Write-half of the connection closed                     •
Pending error                                •          •
TCP out-of-band data                                               •

Conditions handled by select in str_cli

[Figure: the client calls select() for readability on either standard input or the socket. Standard input delivers data or EOF; the socket delivers data, a FIN (seen as EOF), or an error.]

Three conditions are handled with the socket

• Peer TCP sends data: the socket becomes readable and read returns greater than 0

• Peer TCP sends a FIN (peer process terminates): the socket becomes readable and read returns 0 (end-of-file)

• Peer TCP sends an RST (peer host has crashed and rebooted): the socket becomes readable, read returns -1, and errno contains the specific error code

Implementation of str_cli function using select

void
str_cli(FILE *fp, int sockfd)
{
    int maxfdp1;
    fd_set rset;
    char sendline[MAXLINE], recvline[MAXLINE];

    FD_ZERO(&rset);
    for ( ; ; ) {
        FD_SET(fileno(fp), &rset);
        FD_SET(sockfd, &rset);
        maxfdp1 = max(fileno(fp), sockfd) + 1;
        Select(maxfdp1, &rset, NULL, NULL, NULL);

        if (FD_ISSET(sockfd, &rset)) {    /* socket is readable */
            if (Readline(sockfd, recvline, MAXLINE) == 0)
                err_quit("str_cli: server terminated prematurely");
            Fputs(recvline, stdout);
        }

        if (FD_ISSET(fileno(fp), &rset)) {    /* input is readable */
            if (Fgets(sendline, MAXLINE, fp) == NULL)
                return;    /* all done */
            Writen(sockfd, sendline, strlen(sendline));
        }
    }
}

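Stop and wait

[Figure: the client sends a line (request) to the server and then waits for the reply before sending the next line.]

Batch input

[Figure: with batch input the pipe between client and server is kept full. At time 7, request8 through request5 are in flight toward the server while reply1 through reply4 are in flight toward the client. At time 8, request9 through request6 and reply2 through reply5 are in flight.]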

Handling batch input

• The problem with our revised str_cli function
– After handling an end-of-file on input, str_cli returns to the main function, and the program terminates.
– However, in batch mode, there are still other requests and replies in the pipe.
• We need a way to close one-half of the TCP connection
– send a FIN to the server, telling it we have finished sending data, but leave the socket descriptor open for reading <= shutdown function

Shutdown function

• Close one half of the TCP connection
• Close function:
– decrements the descriptor’s reference count and closes the socket only if the count reaches 0
– terminates both directions of data transfer (reading and writing)
• Shutdown function closes just one of them (reading or writing)

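Calling shutdown to close half of a TCP connection

[Figure: the client calls write twice and then shutdown, which sends a FIN after the data. On the server, read returns > 0 twice and then returns 0, and the server TCP sends an ACK of data and FIN. The server then calls write twice and close, which sends its data followed by a FIN. On the client, read returns > 0 twice and then returns 0, and the client TCP sends an ACK of data and FIN.]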

Shutdown function

#include <sys/socket.h>
int shutdown(int sockfd, int howto);    /* return: 0 if OK, -1 on error */

• howto argument
– SHUT_RD: read-half of the connection closed. No more reads can be issued.
– SHUT_WR: write-half of the connection closed. Also called half-close. Buffered data will be sent, followed by the termination sequence.
– SHUT_RDWR: both closed
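The half-close behaviour can be observed without a network. The sketch below uses a local AF_UNIX socket pair as a stand-in for a TCP connection (an assumption made for the sake of a self-contained example), and the function name is hypothetical.

```c
#include <sys/socket.h>
#include <unistd.h>

/* After shutdown(SHUT_WR) on one end, the peer should read the pending data,
 * then see end-of-file, and still be able to send a reply back.
 * Returns 1 if all of that holds, 0 if not, -1 if setup failed. */
int half_close_works(void)
{
    int sv[2];
    char buf[8];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
        return -1;

    int ok = write(sv[0], "req", 3) == 3
          && shutdown(sv[0], SHUT_WR) == 0        /* we are done sending */
          && read(sv[1], buf, sizeof buf) == 3    /* peer gets the buffered data */
          && read(sv[1], buf, sizeof buf) == 0    /* then end-of-file */
          && write(sv[1], "rep", 3) == 3          /* peer can still reply */
          && read(sv[0], buf, sizeof buf) == 3;   /* and we can still read it */

    close(sv[0]);
    close(sv[1]);
    return ok;
}
```

Calling close() instead of shutdown() at the same point would have discarded the descriptor, leaving no way to read the reply.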

Str_cli function using select and shutdown

#include "unp.h"

void
str_cli(FILE *fp, int sockfd)
{
    int maxfdp1, stdineof;
    fd_set rset;
    char sendline[MAXLINE], recvline[MAXLINE];

    stdineof = 0;
    FD_ZERO(&rset);
    for ( ; ; ) {
        if (stdineof == 0)    /* select on standard input for readability */
            FD_SET(fileno(fp), &rset);
        FD_SET(sockfd, &rset);
        maxfdp1 = max(fileno(fp), sockfd) + 1;
        Select(maxfdp1, &rset, NULL, NULL, NULL);

        if (FD_ISSET(sockfd, &rset)) {    /* socket is readable */
            if (Readline(sockfd, recvline, MAXLINE) == 0) {
                if (stdineof == 1)
                    return;    /* normal termination */
                else
                    err_quit("str_cli: server terminated prematurely");
            }
            Fputs(recvline, stdout);
        }

        if (FD_ISSET(fileno(fp), &rset)) {    /* input is readable */
            if (Fgets(sendline, MAXLINE, fp) == NULL) {
                stdineof = 1;
                Shutdown(sockfd, SHUT_WR);    /* send FIN */
                FD_CLR(fileno(fp), &rset);
                continue;
            }
            Writen(sockfd, sendline, strlen(sendline));
        }
    }
}

TCP echo server

• Single process server that uses select to handle any number of clients, instead of forking one child per client.

Data structure TCP server

[Figures: the client[] array (FD_SETSIZE entries, -1 marks an unused entry) and the rset descriptor set at four points in time. fd 0, 1, and 2 are stdin, stdout, and stderr; fd 3 is the listening socket.]

(1) Before first client has established a connection: every client[] entry is -1; in rset only the bit for fd3 is on; Maxfd + 1 = 4.

(2) After first client connection is established: fd3 => listening socket fd, fd4 => client socket fd; Maxfd + 1 = 5.

(3) After second client connection is established: fd4 => client1 socket fd; Maxfd + 1 = 6.

(4) After first client terminates its connection: fd4 => client1 socket fd deleted; Maxfd does not change, so Maxfd + 1 = 6.

TCP echo server using single process

#include "unp.h"

int
main(int argc, char **argv)
{
    int i, maxi, maxfd, listenfd, connfd, sockfd;
    int nready, client[FD_SETSIZE];
    ssize_t n;
    fd_set rset, allset;
    char line[MAXLINE];
    socklen_t clilen;
    struct sockaddr_in cliaddr, servaddr;

    listenfd = Socket(AF_INET, SOCK_STREAM, 0);

    bzero(&servaddr, sizeof(servaddr));
    servaddr.sin_family = AF_INET;
    servaddr.sin_addr.s_addr = htonl(INADDR_ANY);
    servaddr.sin_port = htons(SERV_PORT);

    Bind(listenfd, (SA *) &servaddr, sizeof(servaddr));
    Listen(listenfd, LISTENQ);

    maxfd = listenfd;    /* initialize */
    maxi = -1;           /* index into client[] array */
    for (i = 0; i < FD_SETSIZE; i++)
        client[i] = -1;  /* -1 indicates available entry */
    FD_ZERO(&allset);
    FD_SET(listenfd, &allset);

    for ( ; ; ) {
        rset = allset;    /* structure assignment */
        nready = Select(maxfd+1, &rset, NULL, NULL, NULL);

        if (FD_ISSET(listenfd, &rset)) {    /* new client connection */
            clilen = sizeof(cliaddr);
            connfd = Accept(listenfd, (SA *) &cliaddr, &clilen);

            for (i = 0; i < FD_SETSIZE; i++)
                if (client[i] < 0) {
                    client[i] = connfd;    /* save descriptor */
                    break;
                }
            if (i == FD_SETSIZE)
                err_quit("too many clients");

            FD_SET(connfd, &allset);    /* add new descriptor to set */
            if (connfd > maxfd)
                maxfd = connfd;    /* maxfd for select */
            if (i > maxi)
                maxi = i;    /* max index in client[] array */

            if (--nready <= 0)
                continue;    /* no more readable descriptors */
        }

        for (i = 0; i <= maxi; i++) {    /* check all clients for data */
            if ( (sockfd = client[i]) < 0)
                continue;
            if (FD_ISSET(sockfd, &rset)) {
                if ( (n = Readline(sockfd, line, MAXLINE)) == 0) {
                    /* connection closed by client */
                    Close(sockfd);
                    FD_CLR(sockfd, &allset);
                    client[i] = -1;
                } else
                    Writen(sockfd, line, n);

                if (--nready <= 0)
                    break;    /* no more readable descriptors */
            }
        }
    }
}

Denial of service attacks

• A malicious client connects to the server, sends 1 byte of data (other than a newline), and then goes to sleep.

=> The server calls readline and blocks waiting for the rest of the line, so it can no longer service any other client.

Denial of service attacks

• Solution
– use nonblocking I/O
– have each client serviced by a separate thread of control (spawn a process or a thread to service each client)
– place a timeout on the I/O operation

pselect function

#include <sys/select.h>
#include <signal.h>
#include <time.h>

int pselect(int maxfdp1, fd_set *readset, fd_set *writeset, fd_set *exceptset, const struct timespec *timeout, const sigset_t *sigmask);

• The pselect function was introduced by POSIX.1g.

• struct timespec {
      time_t tv_sec;     /* seconds */
      long   tv_nsec;    /* nanoseconds */
  };
• sigmask => pointer to a signal mask.

Name and Address Conversions

RFC 1034, RFC 1035

Hierarchical Namespace

Naming Authorities

DNS Record Types

Sample DNS Records

aix      IN  A     192.168.42.2
         IN  AAAA  3ffe:b80:1f8d:2:204:acff:fe17:bf38
         IN  MX    5  aix.unpbook.com.
         IN  MX    10 mailhost.unpbook.com.
aix-4    IN  A     192.168.42.2
aix-6    IN  AAAA  3ffe:b80:1f8d:2:204:acff:fe17:bf38
aix-611  IN  AAAA  fe80::204:acff:fe17:bf38

Resolvers and Name Servers

DNS library functions

gethostbyname

gethostbyaddr

getservbyname

getservbyport

getaddrinfo

gethostbyname

struct hostent *gethostbyname( const char *hostname);

struct hostent is defined in netdb.h:

#include <netdb.h>

struct hostent

struct hostent {
    char  *h_name;      /* official name (canonical) */
    char **h_aliases;   /* other names */
    int    h_addrtype;  /* AF_INET or AF_INET6 */
    int    h_length;    /* address length (4 or 16) */
    char **h_addr_list; /* array of ptrs to addresses */
};

gethostbyname and errors

• On error, gethostbyname returns NULL.
• gethostbyname sets the global variable h_errno to indicate the exact error:
– HOST_NOT_FOUND
– TRY_AGAIN
– NO_RECOVERY
– NO_DATA
– NO_ADDRESS (identical to NO_DATA)

Sample code using gethostbyname()

char *ptr, **pptr;
char str[INET_ADDRSTRLEN];
struct hostent *hptr;

while (--argc > 0) {
    ptr = *++argv;
    if ( (hptr = gethostbyname(ptr)) == NULL) {
        err_msg("gethostbyname error for host: %s: %s",
                ptr, hstrerror(h_errno));
        continue;
    }
    printf("official hostname: %s\n", hptr->h_name);

    for (pptr = hptr->h_aliases; *pptr != NULL; pptr++)
        printf("\talias: %s\n", *pptr);

    switch (hptr->h_addrtype) {
    case AF_INET:
        pptr = hptr->h_addr_list;
        for ( ; *pptr != NULL; pptr++)
            printf("\taddress: %s\n",
                   Inet_ntop(hptr->h_addrtype, *pptr, str, sizeof(str)));
        break;
    default:
        err_ret("unknown address type");
        break;
    }
}

gethostbyaddr

• #include <netdb.h>
  struct hostent *gethostbyaddr(const char *addr, socklen_t len, int family);
• The addr argument is not a char *, but is really a pointer to an in_addr structure containing the IPv4 address. len is the size of this structure: 4 for an IPv4 address. The family argument is AF_INET.
• The function gethostbyaddr takes a binary IPv4 address and tries to find the hostname corresponding to that address. This is the reverse of gethostbyname.

getservbyname and getservbyport

• Services are often known by names.
• The mapping from the name to the port number is contained in a file (normally /etc/services).
• If the port number changes, all we need to modify is one line in the /etc/services file instead of having to recompile the applications.

getservbyname

• #include <netdb.h>
  struct servent *getservbyname(const char *servname, const char *protoname);

  struct servent {
      char  *s_name;    /* official service name */
      char **s_aliases; /* alias list */
      int    s_port;    /* port number, network-byte order */
      char  *s_proto;   /* protocol to use */
  };

• The service name servname must be specified. If a protocol is also specified (protoname is a non-null pointer), then the entry must also have a matching protocol. Some Internet services are provided using either TCP or UDP.

Usage of getservbyname

struct servent *sptr;

sptr = getservbyname("domain", "udp"); /* DNS using UDP */
sptr = getservbyname("ftp", "tcp");    /* FTP using TCP */
sptr = getservbyname("ftp", NULL);     /* FTP using TCP */
sptr = getservbyname("ftp", "udp");    /* this call will fail */

/etc/services file

• freebsd % grep -e ^ftp -e ^domain /etc/services

ftp-data    20/tcp   #File Transfer [Default Data]
ftp         21/tcp   #File Transfer [Control]
domain      53/tcp   #Domain Name Server
domain      53/udp   #Domain Name Server
ftp-agent  574/tcp   #FTP Software Agent System
ftp-agent  574/udp   #FTP Software Agent System
ftps-data  989/tcp   # ftp protocol, data, over TLS/SSL
ftps       990/tcp   # ftp protocol, control, over TLS/SSL

getservbyport

• looks up a service given its port number and an optional protocol
• usage

struct servent *sptr;

sptr = getservbyport(htons(53), "udp"); /* DNS using UDP */
sptr = getservbyport(htons(21), "tcp"); /* FTP using TCP */
sptr = getservbyport(htons(21), NULL);  /* FTP using TCP */
sptr = getservbyport(htons(21), "udp"); /* this call will fail */

getaddrinfo

• The gethostbyname and gethostbyaddr functions only support IPv4.
• getaddrinfo handles both
– name-to-address translation
– service-to-port translation
• It returns sockaddr structures instead of a list of addresses.
• It hides all the protocol dependencies.
• The application deals only with the socket address structures that are filled in by getaddrinfo.

getaddrinfo

• #include <netdb.h>
  int getaddrinfo(const char *hostname, const char *service,
                  const struct addrinfo *hints, struct addrinfo **result);

  struct addrinfo {
      int              ai_flags;     /* AI_PASSIVE, AI_CANONNAME */
      int              ai_family;    /* AF_xxx */
      int              ai_socktype;  /* SOCK_xxx */
      int              ai_protocol;  /* 0 or IPPROTO_xxx for IPv4 and IPv6 */
      socklen_t        ai_addrlen;   /* length of ai_addr */
      char            *ai_canonname; /* ptr to canonical name for host */
      struct sockaddr *ai_addr;      /* ptr to socket address structure */
      struct addrinfo *ai_next;      /* ptr to next structure in linked list */
  };

Hints structure

• hints is either a null pointer or a pointer to an addrinfo structure that the caller fills in with hints about the types of information the caller wants returned.

• The members of the hints structure that can be set by the caller are:
– ai_flags (zero or more AI_XXX values OR'ed together)
– ai_family (an AF_xxx value)
– ai_socktype (a SOCK_xxx value)
– ai_protocol

• For example, if the specified service is provided for both TCP and UDP, set the ai_socktype member of the hints structure to SOCK_DGRAM. The only information returned will be for datagram sockets.

ai_flags

• AI_PASSIVE: The caller will use the socket for a passive open.
• AI_CANONNAME: Tells the function to return the canonical name of the host.
• AI_NUMERICHOST: Prevents any kind of name-to-address mapping; the hostname argument must be an address string.
• AI_NUMERICSERV: Prevents any kind of name-to-service mapping; the service argument must be a decimal port number string.

ai_flags

• AI_V4MAPPED: If specified along with an ai_family of AF_INET6, then returns IPv4-mapped IPv6 addresses corresponding to A records if there are no available AAAA records.
• AI_ALL: If specified along with AI_V4MAPPED, then returns IPv4-mapped IPv6 addresses in addition to any AAAA records belonging to the name.
• AI_ADDRCONFIG: Only looks up addresses for a given IP version if there is one or more interface that is not a loopback interface configured with an IP address of that version.

Result

• linked list of addrinfo structures, linked through the ai_next pointer.

• There are two ways that multiple structures can be returned:
– Multiple IP addresses per hostname; one sockaddr structure for each IP address
– Service is provided for multiple socket types; SOCK_STREAM or SOCK_DGRAM

• The sockaddr structure in each addrinfo structure is ready for
– a call to socket
– then either a call to connect or sendto (for a client), or bind (for a server).
• The arguments to socket are the members ai_family, ai_socktype, and ai_protocol.
• The second and third arguments to either connect or bind are ai_addr and ai_addrlen.

struct addrinfo hints, *res;

bzero(&hints, sizeof(hints));
hints.ai_flags = AI_CANONNAME;
hints.ai_family = AF_INET;

getaddrinfo("freebsd4", "domain", &hints, &res);
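Walking the returned linked list can be sketched as follows. The helper name count_numeric_addrs is hypothetical; it uses AI_NUMERICHOST and AI_NUMERICSERV so that no DNS or /etc/services lookup is involved, which is an assumption made to keep the sketch self-contained.

```c
#include <netdb.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical helper: count the addrinfo entries returned for a numeric
 * host string and a decimal port string. Returns -1 if getaddrinfo fails. */
int count_numeric_addrs(const char *host, const char *port, int socktype)
{
    struct addrinfo hints, *res, *ai;
    int count = 0;

    memset(&hints, 0, sizeof(hints));
    hints.ai_family = AF_UNSPEC;     /* accept IPv4 or IPv6 */
    hints.ai_socktype = socktype;    /* e.g. SOCK_DGRAM */
    hints.ai_flags = AI_NUMERICHOST | AI_NUMERICSERV; /* no lookups */

    if (getaddrinfo(host, port, &hints, &res) != 0)
        return -1;

    for (ai = res; ai != NULL; ai = ai->ai_next)
        count++;                     /* follow the ai_next chain */

    freeaddrinfo(res);               /* release the whole list */
    return count;
}
```

A call such as count_numeric_addrs("127.0.0.1", "53", SOCK_DGRAM) should succeed, while a host string that is not an address fails because AI_NUMERICHOST forbids name resolution.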

Passive sockets

• For a passive socket, the caller specifies the service but not the hostname, and sets the AI_PASSIVE flag in the hints structure.

• The socket address structures returned should contain an IP address of INADDR_ANY (for IPv4) or IN6ADDR_ANY_INIT (for IPv6).

Errors: gai_strerror

• const char *gai_strerror (int error);

freeaddrinfo

• All storage returned by getaddrinfo (the addrinfo structures, the ai_addr structures, and the ai_canonname string) is obtained dynamically (e.g., from malloc).

• This storage is released by calling freeaddrinfo.
• void freeaddrinfo (struct addrinfo *ai);

getnameinfo function

• Takes a socket address and returns one character string describing the host and another character string describing the service

int getnameinfo(const struct sockaddr *sockaddr, socklen_t addrlen, char *host, size_t hostlen, char *serv, size_t servlen, int flags);

Elementary UDP Socket

Contents
• recvfrom and sendto Functions
• UDP Echo Server (main, dg_echo Functions)
• UDP Echo Client (main, dg_cli Functions)
• Lost Datagrams
• Verifying Received Response
• Server Not Running
• connect Function with UDP
• Lack of Flow Control with UDP
• Determining Outgoing Interface with UDP
• TCP and UDP Echo Server Using select

• UDP is a connectionless, unreliable, datagram protocol.
• Popular applications built on UDP:
– DNS (the Domain Name System)
– NFS (the Network File System)
– SNMP (Simple Network Management Protocol)

[Figure: Socket functions for UDP client-server. UDP server: socket(), bind(), recvfrom() (blocks until a datagram is received from a client), process request, sendto(). UDP client: socket(), sendto() (data: request), recvfrom() (data: reply), close().]

recvfrom and sendto functions

#include <sys/socket.h>

ssize_t recvfrom(int sockfd, void *buff, size_t nbyte, int flag,
                 struct sockaddr *from, socklen_t *addrlen);

ssize_t sendto(int sockfd, const void *buff, size_t nbyte, int flag,
               const struct sockaddr *to, socklen_t addrlen);

Both return: number of bytes read or written if OK, -1 on error

Sending UDP Datagrams

ssize_t sendto(int sockfd, const void *buff, size_t nbytes, int flags,
               const struct sockaddr *to, socklen_t addrlen);

• sockfd is a UDP socket.
• buff is the address of the data (nbytes long).
• to is the address of a sockaddr containing the destination address.
• The return value is the number of bytes sent, or -1 on error.

sendto()

• You can send 0 bytes of data!
• Some possible errors:
– EBADF, ENOTSOCK: bad socket descriptor
– EFAULT: bad buffer address
– EMSGSIZE: message too large
– ENOBUFS: system buffers are full

More sendto()

• The return value of sendto() indicates how much data was accepted by the O.S. for sending as a datagram - not how much data made it to the destination.

• There is no error condition that indicates that the destination did not get the data!!!

Receiving UDP Datagrams

ssize_t recvfrom(int sockfd, void *buff, size_t nbytes, int flags,
                 struct sockaddr *from, socklen_t *fromaddrlen);

• sockfd is a UDP socket.
• buff is the address of a buffer (nbytes long).
• from is the address of a sockaddr.
• The return value is the number of bytes received and put into buff, or -1 on error.

recvfrom()
• If buff is not large enough, any extra data is lost forever...
• You can receive 0 bytes of data!
• The sockaddr at from is filled in with the address of the sender.
• You should set fromaddrlen before calling.
• If from and fromaddrlen are NULL, we don't find out who sent the data.

More recvfrom()

• Same errors as sendto, but also:
– EINTR: System call interrupted by signal.

• Unless you do something special, recvfrom doesn't return until there is a datagram available.
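The sendto and recvfrom calls described above can be exercised without a second host by sending a datagram over the loopback interface. This is a minimal sketch; the helper name udp_loopback is hypothetical, and the 2-second receive timeout is an assumption added so the sketch cannot block forever.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <unistd.h>

/* Hypothetical helper: send msg from one UDP socket to another over
 * 127.0.0.1. Returns the number of bytes received, or -1 on error. */
ssize_t udp_loopback(const char *msg, char *out, size_t outlen)
{
    struct sockaddr_in addr, from;
    socklen_t len = sizeof(addr), fromlen = sizeof(from);
    struct timeval tv = { 2, 0 };   /* give up after 2 seconds */
    ssize_t n = -1;
    int rx = socket(AF_INET, SOCK_DGRAM, 0);
    int tx = socket(AF_INET, SOCK_DGRAM, 0);

    if (rx < 0 || tx < 0)
        goto done;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0;              /* let the kernel pick a port */
    if (bind(rx, (struct sockaddr *) &addr, sizeof(addr)) < 0)
        goto done;
    if (getsockname(rx, (struct sockaddr *) &addr, &len) < 0)
        goto done;                  /* learn the port that was chosen */
    setsockopt(rx, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv));

    if (sendto(tx, msg, strlen(msg), 0, (struct sockaddr *) &addr, len) < 0)
        goto done;
    /* from/fromlen are filled in with the sender's address */
    n = recvfrom(rx, out, outlen, 0, (struct sockaddr *) &from, &fromlen);

done:
    if (rx >= 0) close(rx);
    if (tx >= 0) close(tx);
    return n;
}
```

Note that a successful return from sendto only means the kernel accepted the datagram; the receive side is what confirms delivery here.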

[Figure: Summary of TCP client-server with two clients. The listening server forks a server child for each connection, and each client talks to its own child over its own TCP connection.]

[Figure: Summary of UDP client-server with two clients. Both clients send datagrams to a single server, which receives them through one socket receive buffer.]

UDP Echo Client: main Function

#include "unp.h"

int main(int argc, char **argv)
{
    int sockfd;
    struct sockaddr_in servaddr;

    if (argc != 2)
        err_quit("usage: udpcli <IPaddress>");

    bzero(&servaddr, sizeof(servaddr));
    servaddr.sin_family = AF_INET;
    servaddr.sin_port = htons(SERV_PORT);
    Inet_pton(AF_INET, argv[1], &servaddr.sin_addr);

    sockfd = Socket(AF_INET, SOCK_DGRAM, 0);

    dg_cli(stdin, sockfd, (SA *) &servaddr, sizeof(servaddr));

    exit(0);
}

UDP Echo Client: dg_cli Function

#include "unp.h"

void dg_cli(FILE *fp, int sockfd, const SA *pservaddr, socklen_t servlen)
{
    int n;
    char sendline[MAXLINE], recvline[MAXLINE + 1];

    while (Fgets(sendline, MAXLINE, fp) != NULL) {
        Sendto(sockfd, sendline, strlen(sendline), 0, pservaddr, servlen);

        n = Recvfrom(sockfd, recvline, MAXLINE, 0, NULL, NULL);

        recvline[n] = 0; /* null terminate */
        Fputs(recvline, stdout);
    }
}

dg_cli function: client processing loop

Lost Datagrams

If the client's datagram is lost, the client blocks forever in its call to recvfrom. Likewise, if the client's datagram arrives at the server but the server's reply is lost, the client again blocks forever in its call to recvfrom.

The only way to prevent this is to place a timeout on the recvfrom.

Verify Received Response

#include "unp.h"

void dg_cli(FILE *fp, int sockfd, const SA *pservaddr, socklen_t servlen)
{
    int n;
    char sendline[MAXLINE], recvline[MAXLINE + 1];
    socklen_t len;
    struct sockaddr *preply_addr;

    preply_addr = Malloc(servlen);

    while (Fgets(sendline, MAXLINE, fp) != NULL) {
        Sendto(sockfd, sendline, strlen(sendline), 0, pservaddr, servlen);

        len = servlen;
        n = Recvfrom(sockfd, recvline, MAXLINE, 0, preply_addr, &len);
        /* ignore replies that did not come from the server we sent to */
        if (len != servlen || memcmp(pservaddr, preply_addr, len) != 0) {
            printf("reply from %s (ignored)\n", Sock_ntop(preply_addr, len));
            continue;
        }
        recvline[n] = 0; /* null terminate */
        Fputs(recvline, stdout);
    }
}

If the server has not bound an IP address to its socket, the kernel chooses the source address for the IP datagram. It is chosen to be the primary IP address of the outgoing interface.

Server Not Running

If the server is not running, the client blocks forever in its call to recvfrom. The ICMP "port unreachable" error is an asynchronous error. The basic rule is that asynchronous errors are not returned for UDP sockets unless the socket has been connected.

connect Function with UDP

This does not result in anything like a TCP connection: there is no three-way handshake. Instead, the kernel just records the IP address and port number of the peer.

With a connected UDP socket, three things change:
1. We can no longer specify the destination IP address and port for an output operation. That is, we do not use sendto but use write or send instead.
2. We do not use recvfrom but read or recv instead.
3. Asynchronous errors are returned to the process for a connected UDP socket.
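That connect on a UDP socket only records the peer can be seen by reading the address back with getpeername. This is a minimal sketch; the helper name udp_connected_peer_port and the use of the loopback address are assumptions for illustration.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: connect a UDP socket to 127.0.0.1:port and return
 * the peer port recorded by the kernel, or -1 on error. No packet is sent,
 * so nothing needs to be listening on that port. */
int udp_connected_peer_port(unsigned short port)
{
    struct sockaddr_in peer, recorded;
    socklen_t len = sizeof(recorded);
    int result = -1;
    int fd = socket(AF_INET, SOCK_DGRAM, 0);

    if (fd < 0)
        return -1;

    memset(&peer, 0, sizeof(peer));
    peer.sin_family = AF_INET;
    peer.sin_port = htons(port);
    peer.sin_addr.s_addr = htonl(INADDR_LOOPBACK);

    /* no handshake: connect just stores the peer address and port */
    if (connect(fd, (struct sockaddr *) &peer, sizeof(peer)) == 0 &&
        getpeername(fd, (struct sockaddr *) &recorded, &len) == 0)
        result = ntohs(recorded.sin_port);

    close(fd);
    return result;
}
```

After the connect, the process would use write or send for output and read or recv for input on this descriptor.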

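[Figure: connect Function with UDP. The application's connected UDP socket stores the peer IP address and port number from connect. UDP datagrams are exchanged with the peer; a UDP datagram from some other IP address and/or port number is not delivered to the connected socket.]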

Lack of Flow Control with UDP

#include "unp.h"

#define NDG   2000 /* number of datagrams to send */
#define DGLEN 1400 /* length of each datagram */

void dg_cli(FILE *fp, int sockfd, const SA *pservaddr, socklen_t servlen)
{
    int i;
    char sendline[MAXLINE];

    for (i = 0; i < NDG; i++) {
        Sendto(sockfd, sendline, DGLEN, 0, pservaddr, servlen);
    }
}

dg_cli function that writes a fixed number of datagrams to the server

#include "unp.h"

static void recvfrom_int(int);
static int count;

void dg_echo(int sockfd, SA *pcliaddr, socklen_t clilen)
{
    socklen_t len;
    char mesg[MAXLINE];

    Signal(SIGINT, recvfrom_int);

    for ( ; ; ) {
        len = clilen;
        Recvfrom(sockfd, mesg, MAXLINE, 0, pcliaddr, &len);
        count++;
    }
}

static void recvfrom_int(int signo)
{
    printf("\nreceived %d datagrams\n", count);
    exit(0);
}

Datagrams can be lost because the interface's buffers were full, or they could have been discarded by the sending host.

The counter "dropped due to full socket buffers" indicates how many datagrams were received by UDP but were discarded because the receiving socket's receive queue was full.

The number of datagrams received by the server in this example is nondeterministic. It depends on many factors, such as the network load, the processing load on the client host, and the processing load on the server host.

Solution: pair a fast server with a slow client, or increase the size of the socket receive buffer.

TCP and UDP Echo Server Using select

#include "unp.h"

int main(int argc, char **argv)
{
    int listenfd, connfd, udpfd, nready, maxfdp1;
    char mesg[MAXLINE];
    pid_t childpid;
    fd_set rset;
    ssize_t n;
    socklen_t len;
    const int on = 1;
    struct sockaddr_in cliaddr, servaddr;
    void sig_chld(int);

    /* Create listening TCP socket */
    listenfd = Socket(AF_INET, SOCK_STREAM, 0);

    bzero(&servaddr, sizeof(servaddr));
    servaddr.sin_family = AF_INET;
    servaddr.sin_addr.s_addr = htonl(INADDR_ANY);
    servaddr.sin_port = htons(SERV_PORT);

    Setsockopt(listenfd, SOL_SOCKET, SO_REUSEADDR, &on, sizeof(on));
    Bind(listenfd, (SA *) &servaddr, sizeof(servaddr));
    Listen(listenfd, LISTENQ);

    /* Create UDP socket */
    udpfd = Socket(AF_INET, SOCK_DGRAM, 0);

    bzero(&servaddr, sizeof(servaddr));
    servaddr.sin_family = AF_INET;
    servaddr.sin_addr.s_addr = htonl(INADDR_ANY);
    servaddr.sin_port = htons(SERV_PORT);

    Bind(udpfd, (SA *) &servaddr, sizeof(servaddr));

    Signal(SIGCHLD, sig_chld); /* must call waitpid() */

    FD_ZERO(&rset);
    maxfdp1 = max(listenfd, udpfd) + 1;
    for ( ; ; ) {
        FD_SET(listenfd, &rset);
        FD_SET(udpfd, &rset);
        if ( (nready = select(maxfdp1, &rset, NULL, NULL, NULL)) < 0) {
            if (errno == EINTR)
                continue; /* interrupted by signal: restart select */
            else
                err_sys("select error");
        }

        if (FD_ISSET(listenfd, &rset)) {
            len = sizeof(cliaddr);
            connfd = Accept(listenfd, (SA *) &cliaddr, &len);

            if ( (childpid = fork()) == 0) { /* child process */
                Close(listenfd);   /* close listening socket */
                str_echo(connfd);  /* process the request */
                exit(0);
            }
            Close(connfd); /* parent closes connected socket */
        }

        if (FD_ISSET(udpfd, &rset)) {
            len = sizeof(cliaddr);
            n = Recvfrom(udpfd, mesg, MAXLINE, 0, (SA *) &cliaddr, &len);
            Sendto(udpfd, mesg, n, 0, (SA *) &cliaddr, len);
        }
    } /* for */
} /* main */

Advanced UDP Sockets

When to use UDP instead of TCP?

• Advantages of UDP:
– UDP supports broadcasting and multicasting
– UDP has no connection setup or teardown
  • For a two-packet request-reply, TCP needs 8 extra packets to be transmitted
  • Minimum transaction time: UDP: RTT + SPT, TCP: 2 * RTT + SPT (SPT = server processing time)

• Features of TCP not provided by UDP:
– Positive acknowledgments, retransmission of lost packets, duplicate detection, and sequencing of packets reordered by the network
  • Sequence numbers, RTO estimation
– Windowed flow control
– Slow start and congestion avoidance
  • to determine the current network capacity and to handle periods of congestion

• Recommendations:
– UDP must be used for broadcast and multicast applications
  • Error control or reliability must be added at the application layer if required
– UDP can be used for simple request-reply applications, but error detection must be built into the application
  • Acknowledgments, timeouts, retransmissions
– UDP should not be used for bulk data transfer
  • Bulk transfer requires flow control along with error control, which amounts to replicating TCP at the application layer

Adding Reliability to a UDP Application

• UDP for a request-reply application
– Timeout and retransmission to handle datagrams that are discarded
– Sequence numbers so the client can verify that a reply is for the appropriate request
• Examples which use simple request-reply with reliability:
– DNS resolvers, SNMP agents, TFTP, and RPC

Handling Timeout and Retransmission

• Old-fashioned approach: send a request and wait N seconds (a linear retransmit timer)

• RTT on a network can vary from fractions of a second on a LAN to many seconds on a WAN.

• Factors affecting the RTT are distance, network speed, and congestion

• Timeout should take into account the actual RTTs that we measure along with the changes in the RTT over time

Retransmission Timeout (RTO): Jacobson's algorithm

• Two statistical estimators: srtt is the smoothed RTT estimator and rttvar is the smoothed mean deviation estimator.

• When the retransmission timer expires, an exponential backoff must be used for the next RTO.
– For example, if our first RTO is 2 seconds and the reply is not received in this time, then the next RTO is 4 seconds. If there is still no reply, the next RTO is 8 seconds, and then 16, and so on.
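The two estimators and the backoff rule can be sketched in a few lines. The slide does not give the update formulas, so the gains of 1/8 and 1/4, the formula RTO = srtt + 4 * rttvar, and the initial values below are assumptions taken from the classic form of Jacobson's algorithm. The struct and function names are hypothetical.

```c
/* Hypothetical RTT state; all times are in seconds. */
struct rtt_state {
    double srtt;    /* smoothed RTT estimator */
    double rttvar;  /* smoothed mean deviation estimator */
    double rto;     /* current retransmission timeout */
};

/* Assumed starting point: srtt = 0, rttvar = 0.75, so RTO = 3 seconds. */
void rtt_init(struct rtt_state *s)
{
    s->srtt = 0.0;
    s->rttvar = 0.75;
    s->rto = s->srtt + 4.0 * s->rttvar;
}

/* Fold one measured RTT into the estimators and recompute the RTO. */
void rtt_update(struct rtt_state *s, double measured)
{
    double delta = measured - s->srtt;
    double dev = delta < 0 ? -delta : delta;   /* |delta| */

    s->srtt += delta / 8.0;                    /* gain 1/8 (assumed) */
    s->rttvar += (dev - s->rttvar) / 4.0;      /* gain 1/4 (assumed) */
    s->rto = s->srtt + 4.0 * s->rttvar;
}

/* The retransmission timer expired: exponential backoff. */
void rtt_backoff(struct rtt_state *s)
{
    s->rto *= 2.0;
}
```

A real implementation would also clamp the RTO between a minimum and a maximum and give up after a fixed number of retransmissions; those limits are left out of this sketch.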

Retransmission ambiguity problem

• Jacobson's algorithms tell us how to calculate the RTO each time we measure an RTT and how to increase the RTO when we retransmit.

• But, a problem arises when we have to retransmit a packet and then receive a reply. This is called the retransmission ambiguity problem

Retransmission ambiguity problem

Retransmission ambiguity problem: Karn's Algorithm

• The following rules apply whenever a reply is received for a request that was retransmitted:
– If an RTT was measured, do not use it to update the estimators, since we do not know to which request the reply corresponds.
– Since this reply arrived before our retransmission timer expired, reuse this RTO for the next packet. Only when we receive a reply to a request that is not retransmitted will we update the RTT estimators and recalculate the RTO.

Concurrent UDP Servers

• Two different types of servers:
– The first is a simple UDP server that reads a client request, sends a reply, and is then finished with the client.
  • Fork a child and let it handle the request.
– The second is a UDP server that exchanges multiple datagrams with the client.
  • Create a new socket for each client, bind an ephemeral port to that socket, and use that socket for all its replies.
  • The client looks at the port number of the server's first reply and sends subsequent datagrams for this request to that port.

Concurrency in UDP server that exchanges multiple datagrams with the client

Socket Options

Contents

• Introduction
• getsockopt and setsockopt functions
• Socket states
• Generic socket options
• IPv4 socket options
• ICMPv6 socket options
• IPv6 socket options
• TCP socket options
• fcntl function

Introduction

• Three ways to get and set the options that affect a socket:
– getsockopt and setsockopt functions => IPv4 and IPv6 multicasting options
– fcntl function => nonblocking I/O, signal-driven I/O
– ioctl function => chapter 16

getsockopt and setsockopt function

#include <sys/socket.h>
int getsockopt(int sockfd, int level, int optname, void *optval, socklen_t *optlen);
int setsockopt(int sockfd, int level, int optname, const void *optval, socklen_t optlen);

• sockfd => open socket descriptor
• level => the code in the system that interprets the option (generic, IPv4, IPv6, TCP)
• optval => pointer to a variable from which the new value of the option is fetched by setsockopt, or into which the current value of the option is stored by getsockopt
• optlen => the size of the option variable

Generic socket option

• SO_BROADCAST => enables or disables the ability of the process to send broadcast messages (datagram sockets only, on networks that support broadcast: Ethernet, token ring..)

• SO_DEBUG => the kernel keeps track of detailed information about all packets sent or received by TCP (only supported by TCP)

• SO_DONTROUTE => outgoing packets are to bypass the normal routing mechanisms of the underlying protocol.

• SO_ERROR => when an error occurs on a socket, the protocol module in a Berkeley-derived kernel sets a variable named so_error for that socket. The process can obtain the value of so_error by fetching the SO_ERROR socket option.

• SO_KEEPALIVE => if no data is exchanged for 2 hours, TCP automatically sends a keepalive probe to the peer.
– Peer response
  • ACK (everything OK)
  • RST (peer crashed and rebooted): ECONNRESET
  • no response: ETIMEDOUT => socket closed
– example: Rlogin, Telnet…
– Normally used by servers

SO_KEEPALIVE

SO_LINGER

• SO_LINGER => specifies how the close function operates for a connection-oriented protocol (default: close returns immediately)

– struct linger {
      int l_onoff;  /* 0 = off, nonzero = on */
      int l_linger; /* linger time in seconds */
  };
• l_onoff = 0: the option is turned off and l_linger is ignored.
• l_onoff = nonzero and l_linger = 0: TCP aborts the connection (sends an RST) and discards any data remaining in the send buffer.
• l_onoff = nonzero and l_linger = nonzero: the process waits until the remaining data has been sent and acknowledged, or until the linger time expires. If the socket has been set nonblocking, it will not wait for the close to complete, even if the linger time is nonzero.
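Setting the option is a single setsockopt call on a filled-in struct linger. A minimal sketch follows; the helper name set_linger is hypothetical.

```c
#include <sys/socket.h>

/* Hypothetical helper: turn SO_LINGER on with the given linger time in
 * seconds. Passing 0 seconds selects the abortive close (RST) case.
 * Returns 0 on success, -1 on error. */
int set_linger(int sockfd, int seconds)
{
    struct linger lg;

    lg.l_onoff = 1;          /* nonzero = option on */
    lg.l_linger = seconds;   /* how long close may block */
    return setsockopt(sockfd, SOL_SOCKET, SO_LINGER, &lg, sizeof(lg));
}
```

Even with a positive linger time, a successful close only tells us that the peer TCP acknowledged the data, not that the peer application has read it.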

SO_LINGER

[Figure: Default operation of close: it returns immediately. The client's close returns while the data is still queued by TCP; the server application later reads the queued data and the FIN, then closes. The ACK of data and FIN arrives after close has returned.]

[Figure: close with SO_LINGER socket option set and l_linger a positive value. The client's close returns only after the ACK of data and FIN is received.]

[Figure: Using shutdown to know that the peer has received our data. The client calls shutdown and then blocks in read; read returns 0 after the server has read the queued data and the FIN and has closed its end.]

• A way to know that the peer application has read the data
– use an application-level acknowledgment (application ACK)
– client

char ack;
Write(sockfd, data, nbytes);  // data from client to server
n = Read(sockfd, &ack, 1);    // wait for application-level ack

– server

nbytes = Read(sockfd, buff, sizeof(buff)); // data from client
// server verifies it received the correct amount of data from the client
Write(sockfd, "", 1);         // server's ACK back to client

SO_RCVBUF , SO_SNDBUF

• These options let us change the default send-buffer and receive-buffer sizes.

– Default TCP send and receive buffer size:
  • 4096 bytes (older implementations)
  • 8192-61440 bytes (newer implementations)
– Default UDP buffer size: 9000 bytes (send), 40000 bytes (receive)
• The SO_RCVBUF option must be set before the connection is established.
– For a client, it should be set before calling connect()
– For a server, it should be set on the listening socket before calling listen()

• The TCP socket buffer size should be at least three times the MSS.

SO_RCVLOWAT , SO_SNDLOWAT

• Every socket has a receive low-water mark and a send low-water mark (used by the select function).

• Receive low-water mark:
– the amount of data that must be in the socket receive buffer for select to return "readable"
– Default receive low-water mark: 1 for TCP and UDP

• Send low-water mark:
– the amount of available space that must exist in the socket send buffer for select to return "writable"
– Default send low-water mark: 2048 for TCP
– The UDP send buffer never changes, because UDP does not keep a copy of a sent datagram.

SO_RCVTIMEO, SO_SNDTIMEO

• allow us to place a timeout on socket receives and sends.

• Default disabled

SO_REUSEADDR, SO_REUSEPORT

• Allow a listening server to start and bind its well known port even if previously established connection exist that use this port as their local port.

• Allow multiple instance of the same server to be started on the same port, as long as each instance binds a different local IP address.

• Allow a single process to bind the same port to multiple sockets, as long as each bind specifies a different local IP address.

• Allow completely duplicate bindings : multicasting

SO_TYPE

• Returns the socket type.
• The returned value is one such as SOCK_STREAM or SOCK_DGRAM.
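Fetching SO_TYPE is a small, self-contained illustration of the getsockopt calling convention (a value buffer plus a value-result length). The helper name socket_type is hypothetical.

```c
#include <sys/socket.h>

/* Hypothetical helper: return the type of a socket (SOCK_STREAM,
 * SOCK_DGRAM, ...) or -1 if getsockopt fails. */
int socket_type(int sockfd)
{
    int type;
    socklen_t len = sizeof(type);   /* in: buffer size, out: bytes stored */

    if (getsockopt(sockfd, SOL_SOCKET, SO_TYPE, &type, &len) < 0)
        return -1;
    return type;
}
```

This is typically useful to a process that inherits a descriptor and needs to find out what kind of socket it was handed.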

SO_USELOOPBACK

• This option applies only to sockets in the routing domain(AF_ROUTE).

• The socket receives a copy of everything sent on the socket.

IPv4 socket option

• Level => IPPROTO_IP
• IP_HDRINCL => If this option is set for a raw IP socket, we must build our own IP header for all the datagrams that we send on the raw socket.

IPv4 socket option

• IP_OPTIONS=>allows us to set IP option in IPv4 header.(chapter 24)

• IP_RECVDSTADDR=>This socket option causes the destination IP address of a received UDP datagram to be returned as ancillary data by recvmsg.(chapter20)

IP_RECVIF

• Cause the index of the interface on which a UDP datagram is received to be returned as ancillary data by recvmsg.(chapter20)

IP_TOS

• lets us set the type-of-service(TOS) field in IP header for a TCP or UDP socket.

• If we call getsockopt for this option, the current value that would be placed into the TOS(type of service) field in the IP header is returned

IP_TTL

• We can set and fetch the default TTL(time to live field).

ICMPv6 socket option

• This socket option is processed by ICMPv6 and has a level of IPPROTO_ICMPV6.

• ICMP6_FILTER => lets us fetch and set an icmp6_filter structure that specifies which of the 256 possible ICMPv6 message types are passed to the process on a raw socket. (chapter 25)

IPv6 socket option

• These socket options are processed by IPv6 and have a level of IPPROTO_IPV6.

• IPV6_ADDRFORM=>allow a socket to be converted from IPv4 to IPv6 or vice versa.(chapter 10)

• IPV6_CHECKSUM=>specifies the byte offset into the user data of where the checksum field is located.

IPV6_DSTOPTS

• Specifies that any received IPv6 destination options are to be returned as ancillary data by recvmsg.

IPV6_HOPLIMIT

• Setting this option specifies that the received hop limit field be returned as ancillary data by recvmsg.(chapter 20)

• Default off.

IPV6_HOPOPTS

• Setting this option specifies that any received IPv6 hop-by-hop option are to be returned as ancillary data by recvmsg.(chapter 24)

IPV6_NEXTHOP

• This is not a socket option but the type of an ancillary data object that can be specified to sendmsg. This object specifies the next-hop address for a datagram as a socket address structure.(chapter20)

IPV6_PKTINFO

• Setting this option specifies that the following two pieces of information about a received IPv6 datagram are to be returned as ancillary data by recvmsg: the destination IPv6 address and the arriving interface index. (chapter 20)

IPV6_PKTOPTIONS

• Most of the IPv6 socket options assume a UDP socket with the information being passed between the kernel and the application using ancillary data with recvmsg and sendmsg.

• A TCP socket fetches and stores these values using the IPV6_PKTOPTIONS socket option.

IPV6_RTHDR

• Setting this option specifies that a received IPv6 routing header is to be returned as ancillary data by recvmsg.(chapter 24)

• Default off

IPV6_UNICAST_HOPS

• This is similar to the IPv4 IP_TTL option.
• Setting the socket option specifies the default hop limit for outgoing datagrams sent on the socket, while fetching the socket option returns the value for the hop limit that the kernel will use for the socket.

TCP socket option

• There are five socket options for TCP, but three are new with POSIX.1g and not widely supported.

• Specify the level as IPPROTO_TCP.

TCP_KEEPALIVE

• This is new with POSIX.1g.
• It specifies the idle time in seconds for the connection before TCP starts sending keepalive probes.
• Default: 2 hours
• This option is effective only when the SO_KEEPALIVE socket option is enabled.

TCP_MAXRT

• This is new with POSIX.1g.
• It specifies the amount of time in seconds before a connection is broken once TCP starts retransmitting data.
– 0: use the default
– -1: retransmit forever
– positive value: rounded up to the next transmission time

TCP_MAXSEG

• This allows us to fetch or set the maximum segment size(MSS) for TCP connection.

TCP_NODELAY

• This option disables TCP's Nagle algorithm (by default this algorithm is enabled).
• Purpose of the Nagle algorithm:
  => prevent a connection from having multiple small packets outstanding at any time.
• Small packet => any packet smaller than the MSS.

Nagle algorithm

• Enabled by default.
• Reduces the number of small packets on a WAN.
• If a given connection has outstanding data, then no small packets will be sent on the connection until the existing data is acknowledged.
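Turning the Nagle algorithm off for one connection is a setsockopt call at level IPPROTO_TCP. A minimal sketch; the helper name disable_nagle is hypothetical.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Hypothetical helper: disable the Nagle algorithm on a TCP socket so
 * that small writes are sent without waiting for outstanding ACKs.
 * Returns 0 on success, -1 on error. */
int disable_nagle(int sockfd)
{
    int on = 1;   /* nonzero sets TCP_NODELAY */

    return setsockopt(sockfd, IPPROTO_TCP, TCP_NODELAY, &on, sizeof(on));
}
```

This trades more small packets on the network for lower latency, which tends to suit interactive traffic more than bulk transfer.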

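[Figure: Nagle algorithm disabled. Timeline from 0 to 2000 ms in 250 ms steps showing the characters of "hello!" being sent.]

[Figure: Nagle algorithm enabled. Timeline from 0 to 2500 ms in 250 ms steps showing the characters of "hello!" being sent.]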

fcntl function

• File control
• This function performs various descriptor control operations.
• It provides the following features:
– nonblocking I/O (chapter 15)
– signal-driven I/O (chapter 22)
– setting the socket owner to receive the SIGIO signal (chapters 21, 22)

#include <fcntl.h>
int fcntl(int fd, int cmd, ... /* int arg */);
Returns: depends on cmd if OK, -1 on error

O_NONBLOCK: nonblocking I/O
O_ASYNC: signal-driven I/O notification

Nonblocking I/O using fcntl

int flags;

/* set socket nonblocking */
if ( (flags = fcntl(fd, F_GETFL, 0)) < 0)
    err_sys("F_GETFL error");
flags |= O_NONBLOCK;
if (fcntl(fd, F_SETFL, flags) < 0)
    err_sys("F_SETFL error");

Each descriptor has a set of file flags that is fetched with the F_GETFL command and set with the F_SETFL command.

Misuse of fcntl

/* wrong way to set socket nonblocking */
if (fcntl(fd, F_SETFL, O_NONBLOCK) < 0)
    err_sys("F_SETFL error");

/* because it also clears all the other file status flags */

Turn off the nonblocking flag

flags &= ~O_NONBLOCK;
if (fcntl(fd, F_SETFL, flags) < 0)
    err_sys("F_SETFL error");

F_SETOWN

• The integer arg value can be either a positive value (a process ID) or a negative value (whose absolute value is a process group ID), identifying who is to receive the signal.

• F_GETOWN => returns the socket owner, either a process ID or a process group ID.

Unix Domain Protocols

Chapter 14

Unix domain protocol

contents

• Introduction
• Unix domain socket address structure
• socketpair function
• Socket functions
• Unix domain stream client-server
• Unix domain datagram client-server
• Passing descriptors
• Receiving sender credentials

Unix Domain Protocol

• perform client-server communication on a single host using same API that is used for client-server model on the different hosts.

• Faster than internet protocol suite– UNIX domain sockets only copy data; they have no protocol processing to

perform, no network headers to add or remove, no checksums to calculate, no sequence numbers to generate, and no acknowledgements to send.

• The Unix domain protocols are an alternative to other interprocess communication (IPC) methods.

• Two types of sockets are provided in the Unix domain:
– stream sockets (similar to TCP)
– datagram sockets (similar to UDP)

• Unlike UDP, however, the Unix domain datagram service is reliable: messages are neither lost nor delivered out of order.

• Unix domain sockets are used for three reasons:
– They are often twice as fast as a TCP socket when both peers are on the same host.
– They are used when passing descriptors between processes on the same host.
– They provide the client's credentials (user ID and group IDs) to the server, which can provide additional security checking.

• End point address
– Addresses are pathnames within the normal filesystem.
– The pathname associated with a Unix domain socket should be an absolute pathname.

unix domain socket address structure

• <sys/un.h>

struct sockaddr_un {
    uint8_t     sun_len;
    sa_family_t sun_family;     /* AF_LOCAL */
    char        sun_path[104];  /* null-terminated pathname */
};

• sun_path must be null terminated
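The null-termination requirement can be enforced when the structure is filled in. The sketch below is one way to do it; the helper name unix_addr_fill is hypothetical, and it uses AF_UNIX, the more widely available name for AF_LOCAL.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>

/* Fill *addr for the given pathname. Returns 0 if OK, -1 if the pathname
 * does not fit in sun_path together with its terminating null byte. */
int
unix_addr_fill(struct sockaddr_un *addr, const char *path)
{
    size_t len = strlen(path);

    if (len >= sizeof(addr->sun_path))
        return -1;                      /* would not be null terminated */
    memset(addr, 0, sizeof(*addr));     /* also zero-fills sun_path */
    addr->sun_family = AF_UNIX;
    memcpy(addr->sun_path, path, len);  /* the null byte is already there */
    return 0;
}
```

Checking the length first avoids the silent overflow that a bare strcpy into sun_path would allow.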

socketpair Function

• Creates two sockets that are then connected together (only available for Unix domain sockets)

• family must be AF_LOCAL
• protocol must be 0

#include <sys/socket.h>
int socketpair(int family, int type, int protocol, int sockfd[2]);
Returns: 0 if OK, -1 on error

socketpair Function

• Although the socketpair function creates sockets that are connected to each other, the individual sockets don't have names.

• This means that they can't be addressed by unrelated processes.
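As a small illustration of the connected, unnamed pair, the sketch below writes into one end and reads from the other. The function name socketpair_roundtrip is invented for this example.

```c
#include <sys/socket.h>
#include <unistd.h>

/* Write msg (len bytes) into one end of a fresh socketpair and read it back
 * from the other end. Returns the number of bytes read, or -1 on error. */
ssize_t
socketpair_roundtrip(const char *msg, size_t len, char *out, size_t outlen)
{
    int     sv[2];
    ssize_t n;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
        return -1;
    if (write(sv[0], msg, len) != (ssize_t) len)
        n = -1;
    else
        n = read(sv[1], out, outlen);
    close(sv[0]);
    close(sv[1]);
    return n;
}
```

In real programs the two ends are usually split between a parent and a child after fork, since an unrelated process has no name by which to reach them.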

unix domain stream client-server

#include "unp.h"

int
main(int argc, char **argv)
{
    int                 listenfd, connfd;
    pid_t               childpid;
    socklen_t           clilen;
    struct sockaddr_un  cliaddr, servaddr;
    void                sig_chld(int);

    listenfd = Socket(AF_LOCAL, SOCK_STREAM, 0);

    unlink(UNIXSTR_PATH);       /* remove any pathname left by an earlier run */
    bzero(&servaddr, sizeof(servaddr));
    servaddr.sun_family = AF_LOCAL;
    strcpy(servaddr.sun_path, UNIXSTR_PATH);

    Bind(listenfd, (SA *) &servaddr, sizeof(servaddr));
    Listen(listenfd, LISTENQ);
    Signal(SIGCHLD, sig_chld);

unix domain stream client-server(2)

    for ( ; ; ) {
        clilen = sizeof(cliaddr);
        if ( (connfd = accept(listenfd, (SA *) &cliaddr, &clilen)) < 0) {
            if (errno == EINTR)
                continue;       /* back to for() */
            else
                err_sys("accept error");
        }

        if ( (childpid = Fork()) == 0) {    /* child process */
            Close(listenfd);    /* close listening socket */
            str_echo(connfd);   /* process the request */
            exit(0);
        }
        Close(connfd);          /* parent closes connected socket */
    }
}

passing descriptors

• Current Unix systems provide a way to pass any open descriptor from one process to any other process (using sendmsg).

• The ability to pass an open file descriptor between processes is powerful. It can lead to different ways of designing client-server applications.

• It allows one process (typically a server) to do everything that is required to open a file (involving such details as translating a network name to a network address, dialing a modem, negotiating locks for the file, etc.) and simply pass back to the calling process a descriptor that can be used with all the I/O functions.

• All the details involved in opening the file or device are hidden from the client.

passing descriptors(2)

1. Create a Unix domain socket (stream or datagram).
2. One process opens a descriptor by calling any of the Unix functions that return a descriptor.
3. The sending process builds a msghdr structure containing the descriptor to be passed.
4. The receiving process calls recvmsg to receive the descriptor on the Unix domain socket.

Passing a descriptor is not passing a descriptor number. It involves creating a new descriptor in the receiving process that refers to the same file table entry within the kernel as the descriptor that was sent by the sending process.

Passing Descriptor

Descriptor passing example

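[Figure: the mycat program after creating a stream pipe using socketpair, with descriptors [0] and [1].]

[Figure: the mycat program after invoking the openfile program. mycat execs openfile with command-line arguments; the two share the stream pipe ([0] and [1]), over which the descriptor is passed.]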

recvmsg and sendmsg

#include <sys/socket.h>

ssize_t recvmsg (int sockfd, struct msghdr *msg, int flags);

ssize_t sendmsg (int sockfd, struct msghdr *msg, int flags);

struct msghdr {
    void         *msg_name;        /* protocol address */
    socklen_t     msg_namelen;     /* size of protocol address */
    struct iovec *msg_iov;         /* scatter/gather array */
    size_t        msg_iovlen;      /* # elements in msg_iov */
    void         *msg_control;     /* ancillary data; must be aligned
                                      for a cmsghdr structure */
    socklen_t     msg_controllen;  /* length of ancillary data */
    int           msg_flags;       /* flags returned by recvmsg() */
};

recvmsg and sendmsg

[Figure 13.8: Data structures when recvmsg is called for a UDP socket. A msghdr{} (msg_name, msg_namelen, msg_iov, msg_iovlen, msg_control, msg_controllen, msg_flags) whose msg_iov member points to an array of three iovec{} entries (iov_base, iov_len).]

recvmsg and sendmsg

[Figure 13.9: Update of Figure 13.8 when recvmsg returns. msg_name points to a sockaddr_in{} (16, AF_INET, 2000, 198.69.10.2); msg_control points to a cmsghdr{} with cmsg_len 16, cmsg_level IPPROTO_IP, cmsg_type IP_RECVDSTADDR, followed by the address 206.62.226.35.]

Ancillary Data

• Ancillary data can be sent and received using the msg_control and msg_controllen members of the msghdr structure with the sendmsg and recvmsg functions.

Protocol     cmsg_level    cmsg_type       Description
IPv4         IPPROTO_IP    IP_RECVDSTADDR  receive destination address with UDP datagram
                           IP_RECVIF       receive interface index with UDP datagram
IPv6         IPPROTO_IPV6  IPV6_DSTOPTS    specify / receive destination options
                           IPV6_HOPLIMIT   specify / receive hop limit
                           IPV6_HOPOPTS    specify / receive hop-by-hop options
                           IPV6_NEXTHOP    specify next-hop address
                           IPV6_PKTINFO    specify / receive packet information
                           IPV6_RTHDR      specify / receive routing header
Unix domain  SOL_SOCKET    SCM_RIGHTS      send / receive descriptors
                           SCM_CREDS       send / receive user credentials

Ancillary Data

[Figure 13.12: Ancillary data containing two ancillary data objects. msg_control points to the first object; each object is a cmsghdr{} (cmsg_len, cmsg_level, cmsg_type) followed by its data and occupies CMSG_SPACE() bytes.]

Ancillary Data

[Figure 13.13: cmsghdr structure when used with Unix domain sockets. Passing a descriptor: cmsg_len 16, cmsg_level SOL_SOCKET, cmsg_type SCM_RIGHTS, followed by the descriptor. Passing credentials: cmsg_len 16, cmsg_level SOL_SOCKET, cmsg_type SCM_CREDS, followed by an fcred{} structure.]

Control Message Header

struct cmsghdr {
    socklen_t cmsg_len;     /* data byte count, including header */
    int       cmsg_level;   /* originating protocol */
    int       cmsg_type;    /* protocol-specific type */
    /* followed by the actual control message data */
};

• To send a file descriptor:
– Set cmsg_len to the size of the cmsghdr structure plus the size of an integer (the descriptor).
– Set cmsg_level to SOL_SOCKET and cmsg_type to SCM_RIGHTS, to indicate that we are passing access rights. (SCM stands for socket-level control message.)
– Access rights can be passed only across a Unix domain socket. The descriptor is stored right after the cmsg_type field, using the macro CMSG_DATA to obtain the pointer to this integer.

#include <stdlib.h>         /* malloc */
#include <sys/socket.h>

/* size of control buffer to send/recv one file descriptor */
#define CONTROLLEN  CMSG_LEN(sizeof(int))

static struct cmsghdr *cmptr = NULL;    /* malloc'ed first time */

/*
 * Pass a file descriptor to another process.
 * If fd_to_send < 0, then -fd_to_send is sent back instead as the error status.
 */
int
send_fd(int fd, int fd_to_send)
{
    struct iovec    iov[1];
    struct msghdr   msg;
    char            buf[2];     /* send_fd()/recv_fd() 2-byte protocol */

    iov[0].iov_base = buf;
    iov[0].iov_len  = 2;
    msg.msg_iov     = iov;
    msg.msg_iovlen  = 1;
    msg.msg_name    = NULL;
    msg.msg_namelen = 0;

    if (fd_to_send < 0) {
        msg.msg_control    = NULL;
        msg.msg_controllen = 0;
        buf[1] = -fd_to_send;   /* nonzero status means error */
        if (buf[1] == 0)
            buf[1] = 1;
    } else {
        if (cmptr == NULL && (cmptr = malloc(CONTROLLEN)) == NULL)
            return(-1);
        cmptr->cmsg_level  = SOL_SOCKET;
        cmptr->cmsg_type   = SCM_RIGHTS;
        cmptr->cmsg_len    = CONTROLLEN;
        msg.msg_control    = cmptr;
        msg.msg_controllen = CONTROLLEN;
        *(int *)CMSG_DATA(cmptr) = fd_to_send;  /* the fd to pass */
        buf[1] = 0;             /* zero status means OK */
    }
    buf[0] = 0;                 /* null byte flag to recv_fd() */
    if (sendmsg(fd, &msg, 0) != 2)
        return(-1);
    return(0);
}

#include "apue.h"
#include <sys/socket.h>     /* struct msghdr */

/* size of control buffer to send/recv one file descriptor */
#define CONTROLLEN  CMSG_LEN(sizeof(int))

static struct cmsghdr *cmptr = NULL;    /* malloc'ed first time */

/*
 * Receive a file descriptor from a server process. Also, any data
 * received is passed to (*userfunc)(STDERR_FILENO, buf, nbytes).
 * We have a 2-byte protocol for receiving the fd from send_fd().
 */
int
recv_fd(int fd, ssize_t (*userfunc)(int, const void *, size_t))
{
    int             newfd, nr, status;
    char           *ptr;
    char            buf[MAXLINE];
    struct iovec    iov[1];
    struct msghdr   msg;

    status = -1;
    for ( ; ; ) {
        iov[0].iov_base = buf;
        iov[0].iov_len  = sizeof(buf);
        msg.msg_iov     = iov;
        msg.msg_iovlen  = 1;
        msg.msg_name    = NULL;
        msg.msg_namelen = 0;
        if (cmptr == NULL && (cmptr = malloc(CONTROLLEN)) == NULL)
            return(-1);
        msg.msg_control    = cmptr;
        msg.msg_controllen = CONTROLLEN;
        if ((nr = recvmsg(fd, &msg, 0)) < 0) {
            err_sys("recvmsg error");
        } else if (nr == 0) {
            err_ret("connection closed by server");
            return(-1);
        }

        /* look for the null byte followed by the status byte at the end */
        for (ptr = buf; ptr < &buf[nr]; ) {
            if (*ptr++ == 0) {
                if (ptr != &buf[nr-1])
                    err_dump("message format error");
                status = *ptr & 0xFF;   /* prevent sign extension */
                if (status == 0) {
                    if (msg.msg_controllen != CONTROLLEN)
                        err_dump("status = 0 but no fd");
                    newfd = *(int *)CMSG_DATA(cmptr);
                } else {
                    newfd = -status;
                }
                nr -= 2;
            }
        }
        if (nr > 0 && (*userfunc)(STDERR_FILENO, buf, nr) != nr)
            return(-1);
        if (status >= 0)    /* final data has arrived */
            return(newfd);  /* descriptor, or -status */
    }
}


Ancillary Data

#include "unp.h"

int my_open(const char *, int);

int
main(int argc, char **argv)
{
    int  fd, n;
    char buff[BUFFSIZE];

    if (argc != 2)
        err_quit("usage: mycat <pathname>");

    if ( (fd = my_open(argv[1], O_RDONLY)) < 0)
        err_sys("cannot open %s", argv[1]);

    while ( (n = Read(fd, buff, BUFFSIZE)) > 0)
        Write(STDOUT_FILENO, buff, n);

    exit(0);
}

mycat program (shown in Figure 14.7)

#include "unp.h"

int
my_open(const char *pathname, int mode)
{
    int   fd, sockfd[2], status;
    pid_t childpid;
    char  c, argsockfd[10], argmode[10];

    Socketpair(AF_LOCAL, SOCK_STREAM, 0, sockfd);

    if ( (childpid = Fork()) == 0) {    /* child process */
        Close(sockfd[0]);
        snprintf(argsockfd, sizeof(argsockfd), "%d", sockfd[1]);
        snprintf(argmode, sizeof(argmode), "%d", mode);
        execl("./openfile", "openfile", argsockfd, pathname, argmode,
              (char *) NULL);
        err_sys("execl error");
    }

myopen function(1) : open a file and return a descriptor

    /* parent process - wait for the child to terminate */
    Close(sockfd[1]);       /* close the end we don't use */

    Waitpid(childpid, &status, 0);
    if (WIFEXITED(status) == 0)
        err_quit("child did not terminate");
    if ( (status = WEXITSTATUS(status)) == 0)
        Read_fd(sockfd[0], &c, 1, &fd);
    else {
        errno = status;     /* set errno value from child's status */
        fd = -1;
    }

    Close(sockfd[0]);
    return(fd);
}

myopen function(2) : open a file and return a descriptor

receiving sender credentials

• User credentials via fcred structure

struct fcred {
    uid_t fc_ruid;                  /* real user ID */
    gid_t fc_rgid;                  /* real group ID */
    char  fc_login[MAXLOGNAME];     /* setlogin() name */
    uid_t fc_uid;                   /* effective user ID */
    short fc_ngroups;               /* number of groups */
    gid_t fc_groups[NGROUPS];       /* supplementary group IDs */
};
#define fc_gid fc_groups[0]         /* effective group ID */

receiving sender credentials(2)

• MAXLOGNAME is usually 16
• NGROUPS is usually 16
• fc_ngroups is at least 1

• The credentials are sent as ancillary data when data is sent on a Unix domain socket (only if the receiver of the data has enabled the LOCAL_CREDS socket option).

• On a datagram socket, the credentials accompany every datagram.
• Credentials cannot be sent along with a descriptor.
• Users are not able to forge credentials.

Advanced I/O Functions

Outline

• Socket Timeouts
• recv and send Functions
• readv and writev Functions
• recvmsg and sendmsg Functions
• Ancillary Data
• How Much Data Is Queued?
• Sockets and Standard I/O

Socket Timeouts

• Three ways to place a timeout on an I/O operation involving a socket:
– Call alarm, which generates the SIGALRM signal when the specified time has expired.
– Block waiting for I/O in select, which has a time limit built in, instead of blocking in a call to read or write.
– Use the newer SO_RCVTIMEO and SO_SNDTIMEO socket options.

Connect with a Timeout Using SIGALRM

static void connect_alarm(int);

int
connect_timeo(int sockfd, const SA *saptr, socklen_t salen, int nsec)
{
    Sigfunc *sigfunc;
    int      n;

    sigfunc = Signal(SIGALRM, connect_alarm);
    if (alarm(nsec) != 0)
        err_msg("connect_timeo: alarm was already set");

    if ( (n = connect(sockfd, (struct sockaddr *) saptr, salen)) < 0) {
        close(sockfd);
        if (errno == EINTR)
            errno = ETIMEDOUT;
    }
    alarm(0);                   /* turn off the alarm */
    Signal(SIGALRM, sigfunc);   /* restore previous signal handler */

    return(n);
}

static void
connect_alarm(int signo)
{
    return;     /* just interrupt the connect() */
}

recvfrom with a Timeout Using SIGALRM

static void sig_alrm(int);

void
dg_cli(FILE *fp, int sockfd, const SA *pservaddr, socklen_t servlen)
{
    int  n;
    char sendline[MAXLINE], recvline[MAXLINE + 1];

    Signal(SIGALRM, sig_alrm);

    while (Fgets(sendline, MAXLINE, fp) != NULL) {
        Sendto(sockfd, sendline, strlen(sendline), 0, pservaddr, servlen);

        alarm(5);
        if ( (n = recvfrom(sockfd, recvline, MAXLINE, 0, NULL, NULL)) < 0) {
            if (errno == EINTR)
                fprintf(stderr, "socket timeout\n");
            else
                err_sys("recvfrom error");
        } else {
            alarm(0);
            recvline[n] = 0;    /* null terminate */
            Fputs(recvline, stdout);
        }
    }
}

static void
sig_alrm(int signo)
{
    return;     /* just interrupt the recvfrom() */
}

recvfrom with a Timeout Using select

int
readable_timeo(int fd, int sec)
{
    fd_set          rset;
    struct timeval  tv;

    FD_ZERO(&rset);
    FD_SET(fd, &rset);

    tv.tv_sec = sec;
    tv.tv_usec = 0;

    return(select(fd+1, &rset, NULL, NULL, &tv));
        /* > 0 if descriptor is readable, 0 on timeout */
}

Timeout Using the SO_RCVTIMEO SO_SNDTIMEO Socket Option

• We set the option once for a descriptor, specifying the timeout value. SO_RCVTIMEO then applies to all read operations on that descriptor, and SO_SNDTIMEO to all write operations.

• Because the option is set only once, this differs from the previous two methods, which required doing something before every operation on which we wanted to place a time limit.

• Neither socket option can be used to set a timeout for a connect.

recvfrom with a Timeout Using the SO_RCVTIMEO Socket Option

int             n;
char            sendline[MAXLINE], recvline[MAXLINE + 1];
struct timeval  tv;

tv.tv_sec = 5;
tv.tv_usec = 0;
Setsockopt(sockfd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv));

while (Fgets(sendline, MAXLINE, fp) != NULL) {
    Sendto(sockfd, sendline, strlen(sendline), 0, pservaddr, servlen);

    n = recvfrom(sockfd, recvline, MAXLINE, 0, NULL, NULL);
    if (n < 0) {
        if (errno == EWOULDBLOCK) {
            fprintf(stderr, "socket timeout\n");
            continue;
        } else
            err_sys("recvfrom error");
    }
    recvline[n] = 0;    /* null terminate */
    Fputs(recvline, stdout);
}

recv and send Functions

ssize_t recv (int sockfd, void *buff, size_t nbytes, int flags);

ssize_t send (int sockfd, const void *buff, size_t nbytes, int flags);

Flags          Description                          recv  send
MSG_DONTROUTE  bypass routing table lookup                 •
MSG_DONTWAIT   only this operation is nonblocking    •     •
MSG_OOB        send or receive out-of-band data      •     •
MSG_PEEK       peek at incoming message              •
MSG_WAITALL    wait for all the data                 •
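To make the effect of one of these flags concrete, here is a small sketch of MSG_PEEK: the first recv looks at the queued data without removing it, so a second recv returns the same bytes. The function name peek_then_read is made up for this example.

```c
#include <sys/socket.h>
#include <unistd.h>

/* Peek at up to len bytes without consuming them, then read them for real.
 * Returns 1 if both calls returned the same byte count, 0 if not, -1 on error. */
int
peek_then_read(int sockfd, char *buf, size_t len)
{
    ssize_t peeked = recv(sockfd, buf, len, MSG_PEEK);  /* data stays queued */
    ssize_t got;

    if (peeked < 0)
        return -1;
    got = recv(sockfd, buf, len, 0);                    /* now consume it */
    if (got < 0)
        return -1;
    return peeked == got;
}
```

On a busy stream socket more data may arrive between the two calls, so the counts are only guaranteed to match when nothing else is being written.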

readv and writev Functions

– readv and writev let us read into or write from one or more buffers with a single function call.

• These operations are called scatter read and gather write.

#include <sys/uio.h>

ssize_t readv (int filedes, const struct iovec *iov, int iovcnt);

ssize_t writev (int filedes, const struct iovec *iov, int iovcnt);

struct iovec {
    void   *iov_base;   /* starting address of buffer */
    size_t  iov_len;    /* size of buffer */
};

readv and writev Functions

– The readv and writev functions can be used with any descriptor, not just sockets.
– writev is an atomic operation. For a record-based protocol such as UDP, one call to writev generates a single UDP datagram.
– writev is also an alternative to setting the TCP_NODELAY socket option.

• A write of 4 bytes followed by a write of 396 bytes could invoke the Nagle algorithm; a preferred solution is to call writev for the two buffers.
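The two-buffer case above can be sketched as follows. The function name send_header_and_body is hypothetical; the only library call is writev itself.

```c
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Send a header and a body with one writev call, so that they leave as one
 * write (and, on a datagram socket, as one datagram). */
ssize_t
send_header_and_body(int fd, const void *hdr, size_t hdrlen,
                     const void *body, size_t bodylen)
{
    struct iovec iov[2];

    iov[0].iov_base = (void *) hdr;     /* iov_base is not const-qualified */
    iov[0].iov_len  = hdrlen;
    iov[1].iov_base = (void *) body;
    iov[1].iov_len  = bodylen;
    return writev(fd, iov, 2);
}
```

Compared with copying both pieces into one temporary buffer, this avoids the extra copy while still handing the kernel a single request.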

Nagle’s Algorithm

if there is new data to send
    if the window size >= MSS and available data is >= MSS
        send complete MSS segment now
    else
        if there is unconfirmed data still in the pipe
            enqueue data in the buffer until an acknowledge is received
        else
            send data immediately
        end if
    end if
end if
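The pseudocode translates almost line for line into a decision function. This is only an illustration of the logic; real TCP stacks implement it inside the kernel, and the enum and function names here are invented.

```c
#include <stddef.h>

enum nagle_action { NAGLE_NOTHING, NAGLE_SEND_MSS, NAGLE_ENQUEUE, NAGLE_SEND_NOW };

/* window:  current send window in bytes
 * mss:     maximum segment size in bytes
 * avail:   bytes of new data waiting to be sent
 * unacked: bytes already sent but not yet acknowledged */
enum nagle_action
nagle_decide(size_t window, size_t mss, size_t avail, size_t unacked)
{
    if (avail == 0)
        return NAGLE_NOTHING;       /* no new data to send */
    if (window >= mss && avail >= mss)
        return NAGLE_SEND_MSS;      /* send a complete MSS segment now */
    if (unacked > 0)
        return NAGLE_ENQUEUE;       /* hold data until an acknowledgment arrives */
    return NAGLE_SEND_NOW;          /* nothing in flight: send immediately */
}
```

With an MSS of 1460, a first write of 4 bytes on an idle connection is sent at once, while a following write of 396 bytes is held as long as those 4 bytes are unacknowledged.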

recvmsg and sendmsg

ssize_t recvmsg (int sockfd, struct msghdr *msg, int flags);

ssize_t sendmsg (int sockfd, struct msghdr *msg, int flags);

struct msghdr {
    void         *msg_name;        /* protocol address */
    socklen_t     msg_namelen;     /* size of protocol address */
    struct iovec *msg_iov;         /* scatter/gather array */
    size_t        msg_iovlen;      /* # elements in msg_iov */
    void         *msg_control;     /* ancillary data; must be aligned
                                      for a cmsghdr structure */
    socklen_t     msg_controllen;  /* length of ancillary data */
    int           msg_flags;       /* flags returned by recvmsg() */
};

recvmsg and sendmsg

Flag            Examined by:     Examined by:     Returned by:
                send flags       recv flags       recvmsg msg_flags
                sendto flags     recvfrom flags
                sendmsg flags    recvmsg flags
MSG_DONTROUTE        •
MSG_DONTWAIT         •                •
MSG_PEEK                              •
MSG_WAITALL                           •
MSG_EOR              •                                 •
MSG_OOB              •                •                •
MSG_BCAST                                              •
MSG_MCAST                                              •
MSG_TRUNC                                              •
MSG_CTRUNC                                             •

How Much Data Is Queued?

• Use nonblocking I/O.
• Use the MSG_PEEK flag together with the MSG_DONTWAIT flag.
• Use the FIONREAD command of ioctl.
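The third option can be sketched in a few lines. The helper name bytes_queued is made up; FIONREAD itself is a widely supported ioctl request, though exactly what it counts on a datagram socket varies between systems.

```c
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <unistd.h>

/* Return the number of bytes queued for reading on fd, or -1 on error. */
int
bytes_queued(int fd)
{
    int n;

    if (ioctl(fd, FIONREAD, &n) < 0)
        return -1;
    return n;
}
```

Unlike the MSG_PEEK approach, this does not copy any data into a user buffer.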

Sockets and Standard I/O

• The standard I/O library can be used with sockets, but there are a few items to consider.

– A standard I/O stream can be created from any descriptor by calling the fdopen function. Similarly, given a standard I/O stream, we can obtain the corresponding descriptor by calling fileno.

– On a stream opened for both reading and writing, an output function cannot be followed by an input function (or the reverse) without an intervening call to fflush, fseek, fsetpos, or rewind. The problem with fseek, fsetpos, and rewind is that they all call lseek, which fails on a socket.

– The easiest way to handle this read-write problem is to open two standard I/O streams for a given socket: one for reading, and one for writing.

Standard i/O buffers

• Fully buffered: I/O takes place only when the buffer is full, or when fflush() or exit() is called. The buffer is usually 8192 bytes.

• Line buffered: I/O takes place when a newline is encountered, or when fflush() or exit() is called.

• Unbuffered: I/O takes place each time a standard I/O output function is called.

Standard I/O buffers

• Standard error is always unbuffered.
• Standard input and standard output are fully buffered, unless they refer to a terminal device, in which case they are line buffered.
• All other streams are fully buffered unless they refer to a terminal device, in which case they are line buffered.

Sockets and Standard I/O

#include "unp.h"

void
str_echo(int sockfd)
{
    char  line[MAXLINE];
    FILE *fpin, *fpout;

    fpin = Fdopen(sockfd, "r");
    fpout = Fdopen(sockfd, "w");

    for ( ; ; ) {
        if (Fgets(line, MAXLINE, fpin) == NULL)
            return;     /* connection closed by other end */

        Fputs(line, fpout);
    }
}

Chapter 12.

Daemon Processes and inetd Superserver

12.1 Introduction

• A daemon is a process that runs in the background and is independent of control from all terminals.

• There are numerous ways to start a daemon:
1. the system initialization scripts (/etc/rc)
2. the inetd superserver
3. the cron daemon
4. the at command
5. from user terminals

• Since a daemon does not have a controlling terminal, it needs some way to output messages when something happens, either normal informational messages or emergency messages that need to be handled by an administrator.

12.2 syslogd daemon

• Berkeley-derived implementations of syslogd perform the following actions upon startup.

1. The configuration file is read, specifying what to do with each type of log message that the daemon can receive.

2. A Unix domain socket is created and bound to the pathname /var/run/log (/dev/log on some systems).

3. A UDP socket is created and bound to port 514.

4. The pathname /dev/klog is opened. Any error messages from within the kernel appear as input on this device.

• We could send log messages to the syslogd daemon from our daemons by creating a Unix domain datagram socket and sending our messages to the pathname that the daemon has bound, but an easier interface is the syslog function.

syslogd

[Figure: syslogd and its connections: a UDP socket on port 514, a Unix domain socket (/dev/log), /dev/klog, the filesystem (/var/log/messages), a remote syslogd, and the console.]

12.3 syslog function

– The priority argument is a combination of a level and a facility.

– The message is like a format string to printf, with the addition of a %m specification, which is replaced with the error message corresponding to the current value of errno.

Ex) syslog(LOG_INFO|LOG_LOCAL2, "rename(%s, %s): %m", file1, file2);

#include <syslog.h>

void syslog(int priority, const char *message, . . . );

• Log messages have a level between 0 and 7.

level        value  description
LOG_EMERG    0      system is unusable (highest priority)
LOG_ALERT    1      action must be taken immediately
LOG_CRIT     2      critical conditions
LOG_ERR      3      error conditions
LOG_WARNING  4      warning conditions
LOG_NOTICE   5      normal but significant condition (default)
LOG_INFO     6      informational
LOG_DEBUG    7      debug-level message (lowest priority)

Figure 12.1 level of log message.

• A facility identifies the type of process sending the message.

facility      description
LOG_AUTH      security / authorization messages
LOG_AUTHPRIV  security / authorization messages (private)
LOG_CRON      cron daemon
LOG_DAEMON    system daemons
LOG_FTP       FTP daemon
LOG_KERN      kernel messages
LOG_LOCAL0    local use
LOG_LOCAL1    local use
LOG_LOCAL2    local use
LOG_LOCAL3    local use
LOG_LOCAL4    local use
LOG_LOCAL5    local use
LOG_LOCAL6    local use
LOG_LOCAL7    local use
LOG_LPR       line printer system
LOG_MAIL      mail system
LOG_NEWS      network news system
LOG_SYSLOG    messages generated internally by syslogd
LOG_USER      random user-level messages (default)
LOG_UUCP      UUCP system

Figure 12.2 facility of log messages.

• openlog and closelog
– openlog can be called before the first call to syslog, and closelog can be called when the application is finished sending log messages.

#include <syslog.h>

void openlog(const char *ident, int options, int facility);

void closelog(void);

options     description
LOG_CONS    log to console if cannot send to syslogd daemon
LOG_NDELAY  do not delay open, create socket now
LOG_PERROR  log to standard error as well as sending to syslogd daemon
LOG_PID     log the process ID with each message

Figure 12.3 options for openlog
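Putting the three functions together, a minimal sketch might look like this. The function name log_rename_failure and the ident string "example" are made up; the level, facility and option are taken from the tables above.

```c
#include <syslog.h>

/* Log a failed rename at level LOG_INFO with facility LOG_LOCAL2.
 * %m expands to the error string for the current value of errno. */
void
log_rename_failure(const char *from, const char *to)
{
    openlog("example", LOG_PID, LOG_LOCAL2);    /* "example" is a made-up ident */
    syslog(LOG_INFO, "rename(%s, %s): %m", from, to);
    closelog();
}
```

Because the facility is given to openlog, the syslog call only needs to supply the level. Where the message ends up depends on the syslogd configuration.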

Unix Login

Process Group

• A process group is a collection of one or more processes, usually associated with the same job.

• int setpgid(pid_t pid, pid_t pgid);
• pid_t getpgid(pid_t pid);
• It is possible for a process group leader to create a process group, create processes in the group, and then terminate. The process group still exists, as long as at least one process is in the group, regardless of whether the group leader terminates.

Process Groups in a Session

• The processes in a process group are usually placed there by a shell pipeline:
– proc1 | proc2 &
– proc3 | proc4 | proc5

Creating Session

• A process establishes a new session by calling the setsid function.

• If the calling process is not a process group leader, this function creates a new session. Three things happen:
– The process becomes the session leader of this new session. (A session leader is the process that creates a session.) The process is the only process in this new session.
– The process becomes the process group leader of a new process group. The new process group ID is the process ID of the calling process.
– The process has no controlling terminal. If the process had a controlling terminal before calling setsid, that association is broken.

setsid

• pid_t setsid(void);
• This function returns an error if the caller is already a process group leader.
• To ensure this is not the case, the usual practice is to call fork and have the parent terminate and the child continue. We are guaranteed that the child is not a process group leader, because the process group ID of the parent is inherited by the child, but the child gets a new process ID. Hence, it is impossible for the child's process ID to equal its inherited process group ID.
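The fork-then-setsid practice can be checked with a short experiment. In this sketch the parent waits for the child instead of terminating, purely so that the result can be reported; the function name is invented.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork, call setsid in the child, and report whether the child became the
 * leader of a new session and of a new process group.
 * Returns 1 if so, 0 if not, -1 on error. */
int
child_becomes_session_leader(void)
{
    int   status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0) {
        /* child: it has a new PID, so it cannot be a process group leader */
        if (setsid() < 0)
            _exit(2);
        _exit(getsid(0) == getpid() && getpgrp() == getpid() ? 0 : 1);
    }
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status) == 0;
}
```

After setsid the child's session ID and process group ID both equal its own process ID, which is what the two comparisons verify.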

Controlling Terminal

12.4 daemon_init Function

#include "unp.h"
#include <syslog.h>

#define MAXFD 64

extern int daemon_proc;     /* defined in error.c */

void
daemon_init(const char *pname, int facility)
{
    int   i;
    pid_t pid;

    if ( (pid = Fork()) != 0)
        exit(0);            /* parent terminates */

    /* 1st child continues */
    setsid();               /* become session leader */

    Signal(SIGHUP, SIG_IGN);
    if ( (pid = Fork()) != 0)
        exit(0);            /* 1st child terminates */

    /* 2nd child continues */
    daemon_proc = 1;        /* for our err_XXX() functions */

    chdir("/");             /* change working directory */
    umask(0);               /* clear our file mode creation mask */

    for (i = 0; i < MAXFD; i++)
        close(i);

    openlog(pname, LOG_PID, facility);
}

Daemon_init

1. We first call fork and then the parent terminates, and the child continues. If the process was started as a shell command in the foreground, when the parent terminates, the shell thinks the command is done. This automatically runs the child process in the background. Also, the child inherits the process group ID from the parent but gets its own process ID. This guarantees that the child is not a process group leader, which is required for the next call to setsid

2. The process becomes the session leader of the new session, becomes the process group leader of a new process group, and has no controlling terminal

Daemon_init

• We ignore SIGHUP and call fork again. When this function returns, the parent is really the first child and it terminates, leaving the second child running. The purpose of this second fork is to guarantee that the daemon cannot automatically acquire a controlling terminal should it open a terminal device in the future. When a session leader without a controlling terminal opens a terminal device (that is not currently some other session's controlling terminal), the terminal becomes the controlling terminal of the session leader. But by calling fork a second time, we guarantee that the second child is no longer a session leader, so it cannot acquire a controlling terminal. We must ignore SIGHUP because when the session leader terminates (the first child), all processes in the session (our second child) receive the SIGHUP signal.

12.5 inetd Daemon

• Problems on a typical Unix system with many servers, each started as its own daemon:
1. All these daemons contain nearly identical startup code.
2. Each daemon takes a slot in the process table, but each daemon is asleep most of the time.

• The inetd daemon fixes the two problems:
1. It simplifies writing daemon processes, since most of the startup details are handled by inetd.
2. It allows a single process (inetd) to wait for incoming client requests for multiple services, instead of one process for each service.

12.5 inetd daemon

• Figure 12.7: steps performed by inetd

For each service listed in the /etc/inetd.conf file:
1. socket()
2. bind()
3. listen() (if TCP socket)
4. select() for readability
5. accept() (if TCP socket)
6. fork()

Parent: close connected socket (if TCP), then go back to select().

Child:
1. close all descriptors other than socket
2. dup socket to descriptors 0, 1, and 2; close socket
3. setgid(), setuid() (if user not root)
4. exec() server
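Because inetd duplicates the socket onto descriptors 0, 1, and 2 before the exec, a server started this way can be written as an ordinary filter. The sketch below is an illustrative echo loop; the function name echo_descriptors is made up, and an inetd-launched server would call it as echo_descriptors(STDIN_FILENO, STDOUT_FILENO).

```c
#include <unistd.h>

/* Echo everything read from in_fd to out_fd until EOF.
 * Returns the total number of bytes echoed, or -1 on error. */
long
echo_descriptors(int in_fd, int out_fd)
{
    char    buf[512];
    ssize_t n;
    long    total = 0;

    while ((n = read(in_fd, buf, sizeof(buf))) > 0) {
        ssize_t off = 0;

        while (off < n) {                       /* handle short writes */
            ssize_t w = write(out_fd, buf + off, (size_t) (n - off));
            if (w < 0)
                return -1;
            off += w;
        }
        total += n;
    }
    return n < 0 ? -1 : total;
}
```

Nothing in the function mentions sockets at all, which is the simplification inetd is meant to provide.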

inetd service specification

• For each service, inetd needs to know:
– the socket type and transport protocol
– the wait/nowait flag
– the login name the process should run as
– the pathname of the real server program
– the command-line arguments to the server program

• Servers that are expected to deal with frequent requests are typically not run from inetd:
– mail, web, NFS

# Syntax for socket-based Internet services:
# <service_name> <socket_type> <proto> <flags> <user> <server_pathname> <args>
#
# comments start with #
echo     stream  tcp  nowait  root    internal
echo     dgram   udp  wait    root    internal
chargen  stream  tcp  nowait  root    internal
chargen  dgram   udp  wait    root    internal
ftp      stream  tcp  nowait  root    /usr/sbin/ftpd       ftpd -l
telnet   stream  tcp  nowait  root    /usr/sbin/telnetd    telnetd
finger   stream  tcp  nowait  root    /usr/sbin/fingerd    fingerd
# Authentication
auth     stream  tcp  nowait  nobody  /usr/sbin/in.identd  in.identd -l -e -o
# TFTP
tftp     dgram   udp  wait    root    /usr/sbin/tftpd      tftpd -s /tftpboot

Example /etc/inetd.conf

wait/nowait

• wait specifies that inetd should not look for new clients for the service until the child (the real server) has terminated.

• TCP servers usually specify nowait. This means inetd can start multiple copies of the TCP server program, providing concurrency.

• Most UDP services run with wait, so inetd waits until the child server has terminated.

Broadcasting 578

• Many networks support the notion of sending a message from one host to all other hosts on the network.

• A special address called the “broadcast address” is often used.

• Some popular network services are based on broadcasting (YP/NIS, rup, rusers)

Broadcasting

• TCP works only with unicast addresses; UDP also supports broadcasting and multicasting.

• Multicasting support is optional in IPv4, but mandatory in IPv6.
• Broadcasting support is not provided in IPv6; an IPv4 application that uses broadcasting must be recoded for IPv6 to use multicasting instead.

Type       IPv4  IPv6  TCP  UDP
Unicast    yes   yes   yes  yes
Broadcast  yes   no    no   yes
Multicast  opt.  yes   no   yes

Broadcasting

Types of casting:
• Unicast: one to one
• Anycast: a set to one in a set
• Multicast: a set to all in a set
• Broadcast: all to all

Broadcasting is useful over a LAN only, and with UDP.

Uses of Broadcasting

• Mainly used for resource discovery purposes (server is known to exist in the local subnet, but IP address is not known)

– ARP (Address Resolution Protocol)
  • Broadcast to find the MAC address for a known IP address; the owner of the IP address replies

– BOOTP (Bootstrap Protocol)
  • Lets a diskless workstation discover its own IP address, the IP address of a BOOTP server on the network, and a file to be loaded into memory to boot the machine

– NTP (Network Time Protocol)
  • Synchronizes time and coordinates time distribution in a large network

– Routing daemons
  • Broadcast their routing tables on the LAN


Broadcast Address Types

• IPv4 address: {netid; subnetid; hostid}

– Subnet-directed broadcast address:
  • {netid; subnetid; -1} (-1 means all bits are 1s)
  • Example: netid = 128.7 and subnetid = 6 give the broadcast address 128.7.6.255
  • Normally, routers do not forward these broadcasts

– All-subnets-directed broadcast address:
  • {netid; -1; -1}
  • Addresses all subnets on the specified network; very rarely used

– Network-directed broadcast address:
  • {netid; -1}
  • Applies only if a network has no subnetting; almost non-existent


– Limited broadcast address:
  • {-1; -1; -1} or 255.255.255.255
  • Must never be forwarded by a router

• Subnet-directed broadcast and limited broadcast are the most common
• Old systems do not understand subnet-directed broadcast
• For protocols like BOOTP, 255.255.255.255 is the only option


Unicast vs Broadcast

• In unicast, only the two peers process the packet
• In broadcast, every host on the subnet has to receive the packet and process it up to the transport layer, i.e. through the datalink, IP, and UDP layers
• Every non-IP host must also receive the frame at the datalink layer
• If broadcast datagrams arrive at a high rate, this processing can severely affect performance


Unicast

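[Figure: a unicast UDP datagram on subnet 128.7.6. The sending application calls sendto with destination IP 128.7.6.5 and destination port 7433. The frame consists of an Ethernet header (destination 08:00:20:03:f6:42, frame type 0800), an IPv4 header (destination 128.7.6.5, protocol UDP), a UDP header (destination port 7433), and the UDP data. The host whose Ethernet address is 08:00:20:03:f6:42 accepts the frame and passes it up by frame type 0800, protocol UDP, and port 7433 to the receiving application. The figure also labels an interface 02:60:8c:2f:4e:00 with 128.7.6.99 = unicast and 128.7.6.255 = broadcast.]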


Broadcast

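[Figure: a broadcast UDP datagram on subnet 128.7.6. After setting the SO_BROADCAST option using setsockopt(), the sending application calls sendto with destination IP 128.7.6.255 and destination port 520. The frame consists of an Ethernet header (destination ff:ff:ff:ff:ff:ff, frame type 0800), an IPv4 header, a UDP header (destination port 520), and the UDP data. Every host on the subnet receives the frame. A host with an application bound to port 520 passes it up by frame type 0800, protocol UDP, and port 520 to the receiving application; a host with no such application processes it by frame type 0800 and protocol UDP and then discards it. Interfaces are labeled 02:60:8c:2f:4e:00 and 02:60:20:03:f6:42.]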


Programming Requirements

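• The SO_BROADCAST socket option has to be set before sending to a broadcast address

• setsockopt(sockfd, SOL_SOCKET, SO_BROADCAST, &on, sizeof(on));

• IP fragmentation: Berkeley-derived kernels do not fragment broadcast datagrams; the send fails with EMSGSIZE if the datagram size exceeds the outgoing interface MTU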
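The option can be enabled and then read back to confirm it took effect. This is a minimal sketch under the assumption of an IPv4 UDP socket; `open_broadcast_socket` and `broadcast_enabled` are hypothetical helper names.

```c
#include <assert.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

/* Create a UDP socket with SO_BROADCAST switched on.
 * Returns the descriptor, or -1 on failure. Hypothetical helper. */
int open_broadcast_socket(void)
{
    const int on = 1;
    int fd = socket(AF_INET, SOCK_DGRAM, 0);

    if (fd < 0)
        return -1;
    if (setsockopt(fd, SOL_SOCKET, SO_BROADCAST, &on, sizeof on) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Read the option back: 1 if enabled, 0 if disabled, -1 on error. */
int broadcast_enabled(int fd)
{
    int val = 0;
    socklen_t len = sizeof val;

    assert(fd >= 0);
    if (getsockopt(fd, SOL_SOCKET, SO_BROADCAST, &val, &len) < 0)
        return -1;
    return val != 0;
}
```

On most systems a sendto to a broadcast address fails on a socket where the option is still off, which is why the check is worth having during debugging.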


Race Condition

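    void dg_cli(…) {
        setsockopt(sockfd, SOL_SOCKET, SO_BROADCAST, &on, sizeof(on));
        signal(SIGALRM, func);
        while (fgets(…) != NULL) {
            sendto(…);
            alarm(1);
            for ( ; ; ) {
                if ((n = recvfrom(…)) < 0) {
                    if (errno == EINTR) break;   /* alarm expired: stop waiting for replies */
                    else err_sys(…);
                } else {
                    recvline[n] = 0;
                    sleep(1);
                    printf(…);
                }
            }
        }
    }

    void func(int signo) { return; }

Problem?

- A race condition exists when multiple processes access shared data and the outcome depends on the order in which they execute.

- Here the race is between the signal and the main loop: if SIGALRM is delivered while the process is between recvfrom calls (for example during sleep or printf), the handler returns without interrupting anything, and the next recvfrom blocks forever.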


Solutions to Race Condition

1. By unblocking and blocking SIGALRM

    sigemptyset(&sig1);
    sigaddset(&sig1, SIGALRM);
    signal(SIGALRM, func);
    while (fgets(…) != NULL) {
        sendto(…);
        alarm(5);
        for ( ; ; ) {
            sigprocmask(SIG_UNBLOCK, &sig1, NULL);
            n = recvfrom(…);
            sigprocmask(SIG_BLOCK, &sig1, NULL);
            if (n < 0) {
                if (errno == EINTR) break;
                else err_sys(…);
            } else {
                recvline[n] = 0;
                printf(…);
            }
        }
    }

    void func(…) { return; }

• Signal generation and delivery are controlled: SIGALRM can be delivered only while it is unblocked around recvfrom

• The window is reduced, but the problem still persists: the signal can be delivered after sigprocmask(SIG_UNBLOCK, …) returns and before recvfrom is called


2. Use pselect: block SIGALRM first, then call pselect with an empty signal set as its last argument.

pselect installs the signal mask, tests the descriptors, and restores the previous mask as one atomic operation, so the earlier problem does not persist.
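The pselect approach can be sketched as two small helpers. This is an illustrative sketch, not the slides' own code: `setup_alarm`, `on_alarm`, and `wait_readable_or_alarm` are hypothetical names, and the caller is assumed to arm the timer with alarm() after sending.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <stddef.h>
#include <sys/select.h>
#include <unistd.h>

/* The handler does nothing; it exists only so SIGALRM interrupts pselect. */
void on_alarm(int signo) { (void)signo; }

/* Install the handler and block SIGALRM in the caller. Returns 0 or -1. */
int setup_alarm(void)
{
    struct sigaction sa;
    sigset_t alrm;

    sa.sa_handler = on_alarm;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    if (sigaction(SIGALRM, &sa, NULL) < 0)
        return -1;
    sigemptyset(&alrm);
    sigaddset(&alrm, SIGALRM);
    return sigprocmask(SIG_BLOCK, &alrm, NULL);
}

/* Wait until fd is readable or SIGALRM arrives.
 * Returns 1 if readable, 0 if the alarm expired, -1 on any other error.
 * SIGALRM is unblocked only for the duration of pselect, so an alarm
 * that fires earlier stays pending and is delivered inside the call. */
int wait_readable_or_alarm(int fd)
{
    fd_set rset;
    sigset_t empty;

    assert(fd >= 0 && fd < FD_SETSIZE);
    sigemptyset(&empty);
    FD_ZERO(&rset);
    FD_SET(fd, &rset);
    if (pselect(fd + 1, &rset, NULL, NULL, NULL, &empty) < 0)
        return errno == EINTR ? 0 : -1;
    return 1;
}
```

A client loop would call `setup_alarm()` once, then after each sendto call `alarm(5)` and keep calling recvfrom for as long as `wait_readable_or_alarm(sockfd)` returns 1.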


3. Using the non-local goto siglongjmp to jump from the signal handler back to the caller.

    signal(SIGALRM, func);
    while (fgets(…) != NULL) {
        sendto(…);
        alarm(5);
        for ( ; ; ) {
            if (sigsetjmp(jmpbuf, 1) != 0)
                break;              /* reached via siglongjmp from func */
            n = recvfrom(…);
            recvline[n] = 0;
            printf(…);
        }
    }

    void func(…) { siglongjmp(jmpbuf, 1); }


4. Using IPC from the signal handler to the function

    void dg_cli(…) {
        setsockopt(…);
        pipe(pipefd);
        FD_ZERO(&rset);
        signal(SIGALRM, func);
        while (fgets(…) != NULL) {
            sendto(…);
            alarm(5);
            for ( ; ; ) {
                FD_SET(sockfd, &rset);
                FD_SET(pipefd[0], &rset);
                if ((n = select(…)) < 0) {
                    if (errno == EINTR) continue;
                    else err_sys(…);
                }
                if (FD_ISSET(sockfd, &rset)) {
                    recvfrom(…);
                    printf(…);
                }
                if (FD_ISSET(pipefd[0], &rset)) {
                    read(pipefd[0], &n, 1);   /* alarm expired */
                    break;
                }
            }
        }
    }

    void func(int signo) { write(pipefd[1], " ", 1); return; }

Multicasting

• IPv4 Class D addresses are multicast addresses
– Range 224.0.0.0 to 239.255.255.255
– The 32-bit Class D address is called the group address

Address class formats (leading bits, then field widths; 32 bits in total):

CLASS A:  0        NET-ID (7b)    HOST-ID (24b)
CLASS B:  1 0      NET-ID (14b)   HOST-ID (16b)
CLASS C:  1 1 0    NET-ID (21b)   HOST-ID (8b)
CLASS D:  1 1 1 0  GROUP-ID (28b)

• A mapping from IPv4 multicast addresses to Ethernet addresses is also defined
– High-order 24 bits are always 01:00:5e
– 25th bit is 0
– Low-order 23 bits are copied from the low-order 23 bits of the multicast group address
– The mapping is not one-to-one: 32 multicast addresses map to a single Ethernet address

• Broadcasting is normally limited to LANs, whereas multicasting can be done on LANs or WANs

Multicast address
• IPv4 Class D address
– 224.0.0.0 ~ 239.255.255.255
– 224.0.0.1: all-hosts group; 224.0.0.2: all-routers group

Multicast Addresses Scope

Multicast Session

• Especially in the case of streaming multimedia, the combination of an IP multicast address (either IPv4 or IPv6) and a transport-layer port (typically UDP) is referred to as a session.

• For example, an audio/video teleconference may comprise two sessions; one for audio and one for video. These sessions almost always use different ports and sometimes also use different groups for flexibility in choice when receiving.


Multicast vs Broadcast

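[Figure: a multicast UDP datagram on subnet 128.7.6. The receiving application joins group 224.0.1.1, so its datalink layer starts receiving frames addressed to 01:00:5e:00:01:01. The sending application's frame consists of an Ethernet header (destination 01:00:5e:00:01:01, frame type 0800), an IPv4 header, a UDP header (destination port 123), and the UDP data. The receiving host passes the frame up by frame type 0800, protocol UDP, and port 123. Hardware filtering based on the destination Ethernet address is imperfect; software filtering based on the destination IP address is perfect. Interfaces are labeled 02:60:8c:2f:4e:00 and 02:60:20:03:f6:42.]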


Multicasting on a WAN

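[Figure: LANs interconnected by multicast routers, including MR2 and MR3.]

Hosts joining a Multicast Group

[Figure: hosts H2, H3, H4, and H5 join the group; the multicast routers, including MR2 and MR3, exchange this information using a multicast routing protocol (MRP).]

Sending packets on a WAN

[Figure: packets sent to the group travel through multicast routers MR2 and MR3 toward the hosts H2, H3, H4, and H5 that joined the group.]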

Multicasting

• Specifically note that:
– All interested multicast routers receive the packets; MR5 does not receive any, since there are no interested hosts on its LAN
– Packets are put onto a specific LAN only if there are hosts on that LAN to receive them; MR3 only forwards the packets
– Multicast router MR2 both puts packets on its LAN for hosts H2 and H3 and makes a copy of the packets to forward to MR3
– This behavior is unique to multicast forwarding

Source-Specific Multicast

• Multicasting on a WAN has been difficult to deploy for several reasons.
– The biggest problem is that the multicast routing protocol (MRP) needs to get the data from all the senders, which may be located anywhere in the network, to all the receivers, which may similarly be located anywhere.
– Another large problem is multicast address allocation: there are not enough IPv4 multicast addresses to statically assign them to everyone who wants one, as is done with unicast addresses.

• Source-specific multicast (SSM) combines the group address with a system's source address, which solves these problems as follows:
– The receivers supply the sender's source address to the routers as part of joining the group.
– This removes the rendezvous problem from the network, as the network now knows exactly where the sender is.
– However, it retains the scaling properties of not requiring the sender to know who all the receivers are. This simplifies multicast routing protocols immensely.

• SSM redefines the identifier from simply being a multicast group address to being a combination of a unicast source and a multicast destination, which SSM calls a channel.

• An SSM session is the combination of source, destination, and port.

    struct ip_mreq {
        struct in_addr imr_multiaddr;   /* IPv4 class D multicast addr */
        struct in_addr imr_interface;   /* IPv4 addr of local interface */
    };

    struct ipv6_mreq {
        struct in6_addr ipv6mr_multiaddr;   /* IPv6 multicast addr */
        unsigned int    ipv6mr_interface;   /* interface index, or 0 */
    };

    struct group_req {
        unsigned int            gr_interface;   /* interface index, or 0 */
        struct sockaddr_storage gr_group;       /* IPv4 or IPv6 multicast addr */
    };

    struct ip_mreq_source {
        struct in_addr imr_multiaddr;    /* IPv4 class D multicast addr */
        struct in_addr imr_sourceaddr;   /* IPv4 source addr */
        struct in_addr imr_interface;    /* IPv4 addr of local interface */
    };

    struct group_source_req {
        unsigned int            gsr_interface;   /* interface index, or 0 */
        struct sockaddr_storage gsr_group;       /* IPv4 or IPv6 multicast addr */
        struct sockaddr_storage gsr_source;      /* IPv4 or IPv6 source addr */
    };


Multicast Socket Options

• Use setsockopt() to modify socket options

– IP_ADD_MEMBERSHIP
  • Join a multicast group on a specified local interface

– IP_DROP_MEMBERSHIP
  • Leave a multicast group

– IP_MULTICAST_IF
  • Specify the interface for outgoing multicast datagrams sent on this socket

– IP_MULTICAST_TTL
  • Set the IPv4 TTL for outgoing multicast datagrams (if not specified, default = 1)

– IP_MULTICAST_LOOP
  • Enable or disable local loopback (default is enabled)
