TCP: Transmission Control Protocol
TCP provides a connection-oriented, reliable, byte-stream service.
- TCP communication is reliable but not guaranteed
TCP provides reliable communication only by detecting failed
transmissions and resending them. It cannot guarantee any
particular transmission, because it relies on IP, which is
unreliable. All it can do is keep trying if an initial delivery
attempt fails.
-TCP Byte Stream Service
TCP is designed to have applications send data to it as a stream of bytes, rather than requiring fixed-size messages. This provides maximum flexibility for a wide variety of uses, because applications don't need to worry about data packaging and can send files or messages of any size. TCP takes care of packaging these bytes into messages called segments.
-TCP Connection Establishment Terminology
Transmission Control Block (TCB)
For each TCP session, both devices create a data structure called a transmission control block (TCB), which holds important data related to the connection.
The TCB contains all the important information about the connection, such as the two socket numbers that identify it, pointers to buffers that hold incoming and outgoing data, variables that keep track of the number of bytes received and acknowledged, bytes received and not yet acknowledged, the current window size, and so forth. The TCB is maintained throughout the connection and destroyed when the connection is completely terminated.
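As a sketch, the kind of state a TCB holds can be modeled as a simple record. The field names below follow RFC 793 style, but the structure itself is illustrative, not any real stack's implementation:

```python
from dataclasses import dataclass, field

@dataclass
class TransmissionControlBlock:
    """Illustrative TCB sketch: one per connection, per device."""
    local_addr: str
    local_port: int
    remote_addr: str
    remote_port: int
    state: str = "CLOSED"        # current FSM state
    snd_una: int = 0             # oldest unacknowledged sequence number
    snd_nxt: int = 0             # next sequence number to send
    snd_wnd: int = 0             # send window size
    rcv_nxt: int = 0             # next sequence number expected
    rcv_wnd: int = 0             # receive window size
    send_buffer: bytearray = field(default_factory=bytearray)
    recv_buffer: bytearray = field(default_factory=bytearray)

    def socket_pair(self):
        """The two sockets that uniquely identify the connection."""
        return (self.local_addr, self.local_port,
                self.remote_addr, self.remote_port)

tcb = TransmissionControlBlock("10.0.0.1", 40000, "10.0.0.2", 80)
assert tcb.state == "CLOSED"
assert tcb.socket_pair() == ("10.0.0.1", 40000, "10.0.0.2", 80)
```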
Active and Passive Opens
A client process using TCP takes the active role and initiates the connection by sending a TCP message to start the connection (a SYN message). A server process using TCP prepares for an incoming connection request by performing a passive open.
-TCP Connection Establishment Process: The Three-Way Handshake
The normal process of establishing a connection between a TCP client and server involves three steps: the client sends a SYN message; the server sends a message that combines an ACK for the client's SYN with its own SYN; and the client sends an ACK for the server's SYN. This is called the TCP three-way handshake.
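The three steps can be sketched as an exchange of sequence and acknowledgment numbers. This is a toy model (real stacks carry these values in segment headers), showing how each ACK equals the peer's ISN plus one:

```python
import random

def three_way_handshake():
    """Simulate the SYN / SYN+ACK / ACK exchange (numbers only)."""
    client_isn = random.randrange(2**32)   # each side picks its own ISN
    server_isn = random.randrange(2**32)

    # Step 1: client -> server: SYN carrying the client's ISN.
    syn = {"flags": {"SYN"}, "seq": client_isn}
    # Step 2: server -> client: SYN+ACK; ack = client ISN + 1.
    syn_ack = {"flags": {"SYN", "ACK"}, "seq": server_isn,
               "ack": (syn["seq"] + 1) % 2**32}
    # Step 3: client -> server: ACK; ack = server ISN + 1.
    ack = {"flags": {"ACK"}, "ack": (syn_ack["seq"] + 1) % 2**32}
    return syn, syn_ack, ack

syn, syn_ack, ack = three_way_handshake()
assert syn_ack["ack"] == (syn["seq"] + 1) % 2**32
assert ack["ack"] == (syn_ack["seq"] + 1) % 2**32
```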
-TCP Connection Establishment: Sequence Number Synchronization and Parameter Exchange
Sequence Number Synchronization
When a connection is initiated, each TCP device chooses a 32-bit initial sequence number (ISN) for the connection. Each device has its own ISN, and the two normally won't be the same; each device chooses its ISN more or less at random. Once a device chooses its ISN, it sends the value to the other device in the Sequence Number field of its initial SYN message. The device receiving the SYN responds with an ACK message that acknowledges the SYN (and which may also contain its own SYN, as in step 2 of the three-way handshake). In the ACK message, the Acknowledgment Number field is set to the received ISN plus one: the next sequence number the device expects to receive from its peer for the first data transmission. This process is called sequence number synchronization.
TCP Parameter Exchange
In addition to the initial sequence numbers, SYN messages also convey important parameters about how the connection should operate. The variable-length Options field in the TCP segment carries these parameters. Some of them include the following:
Maximum Segment Size (MSS): The maximum size of segment that each end of the TCP connection can send to the other.
Selective Acknowledgment Permitted: Allows a pair of devices to use the optional selective acknowledgment feature, so that only certain lost segments need to be retransmitted.
Alternate Checksum Method: Lets devices specify a method of computing checksums other than the standard TCP checksum mechanism.
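On most Unix-like systems an application can inspect, and request, the MSS its stack will advertise in the SYN's MSS option via the TCP_MAXSEG socket option. Behavior is OS-dependent and the kernel may clamp the requested value, so treat this as a sketch:

```python
import socket

# Inspect and request a TCP maximum segment size (MSS) on a socket.
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
mss = s.getsockopt(socket.IPPROTO_TCP, socket.TCP_MAXSEG)  # current default
print("default MSS:", mss)
# Requesting a smaller MSS before connecting influences the MSS option
# this end advertises in its SYN (the kernel may clamp the value).
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_MAXSEG, 1000)
s.close()
```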
-Simultaneous Open Connection Establishment
It is possible, although improbable, for two applications to perform an active open to each other at the same time. Each end must transmit a SYN, and the SYNs must pass each other on the network. It also requires each end to have a local port number that is well known to the other end. This is called a simultaneous open.
TCP was purposely designed to handle simultaneous opens; the rule is that only one connection results, not two. We don't call either end a client or a server, because both ends act as both client and server. Both ends send a SYN at about the same time and enter the SYN_SENT state. When each end receives the SYN, its state changes to SYN_RCVD, and each end resends its SYN and acknowledges the SYN it received. When each end receives the SYN plus the ACK, its state changes to ESTABLISHED.
-TCP Connection Termination
A TCP connection is terminated using a special procedure by which each side independently closes its end of the link. Termination normally begins with one of the application processes signaling to its TCP layer that the session is no longer needed. That device sends a FIN message to tell the other device that it wants to end the connection, which the other device acknowledges. When the responding device is ready, it too sends a FIN, which the other device acknowledges; after waiting a period of time (2 MSL) to ensure its ACK was received, that device closes the session.
During the CLOSE-WAIT state the server may continue sending data, and the client will receive it. However, the client will not send data to the server.
-The TIME-WAIT State
The TIME-WAIT state is also
called the 2MSL wait state. Every implementation must choose a value for the maximum segment lifetime (MSL): the maximum amount of time any segment can exist in the network before being discarded. We know this time limit is bounded, since TCP segments are transmitted as IP datagrams, and the IP datagram has the TTL field that limits its lifetime. The TCP standard defines MSL as 120 seconds (2 minutes); common implementation values, however, are 30 seconds, 1 minute, or 2 minutes.
The TIME-WAIT state is required for two main reasons:
- To provide enough time to ensure that the other device receives the final ACK, and to retransmit it if it is lost.
- While the connection is in the 2MSL wait, the socket pair defining it (client IP address, client port number, server IP address, and server port number) cannot be reused. This prevents packets from different connections from being mixed.
-Quiet Time Concept
If a host with ports in
the 2MSL wait crashes, reboots within MSL seconds, and
immediately establishes new connections using the same local and
foreign IP addresses and port numbers corresponding to the local
ports that were in the 2MSL wait before the crash, delayed segments
from the connections that existed before the crash can be
misinterpreted as belonging to the new connections created after
the reboot. This can happen regardless of how the initial sequence
number is chosen after the reboot. To protect against this
scenario, RFC 793 states that TCP should not create any connections
for MSL seconds after rebooting. This is called the quiet
time.
-Simultaneous Connection Termination
Two devices can simultaneously terminate a TCP connection. In this case, a different state sequence is followed: each device responds to the other's FIN with an ACK, waits for the ACK to its own FIN, and then pauses for a period of time to ensure that the other device received its ACK before ending the connection.
-TCP Finite State Machine (FSM) States, Events and Transitions
For each state below: the state name, a description, and the events and transitions out of that state.
CLOSED: The default state that each connection starts in before the process of establishing it begins. The standard calls this state "fictional" because it represents the situation where there is no connection between the devices: either it hasn't been created yet, or it has just been destroyed.
Passive Open: A server begins the process of connection setup by doing a passive open on a TCP port. At the same time, it sets up the data structure (transmission control block, or TCB) needed to manage the connection. It then transitions to the LISTEN state.
Active Open, Send SYN: A client begins connection setup by
sending a SYN message, and also sets up a TCB for this connection.
It then transitions to the SYN-SENT state.
LISTEN: A device (normally a server) is waiting to receive a synchronize (SYN) message from a client. It has not yet sent its own SYN message.
Receive Client SYN, Send SYN+ACK: The server device receives a SYN from a client. It sends back a message that contains its own SYN and also acknowledges the one it received. The server moves to the SYN-RECEIVED state.
SYN-SENT: The device (normally a client) has sent a synchronize (SYN) message and is waiting for a matching SYN from the other device (usually a server).
Receive SYN, Send ACK: If the device that sent its SYN receives a SYN from the other device but not an ACK for its own SYN, it acknowledges the SYN it received and then transitions to SYN-RECEIVED to wait for the acknowledgment of its SYN.
Receive SYN+ACK, Send ACK: If the device that sent the SYN
receives both an acknowledgment to its SYN and also a SYN from the
other device, it acknowledges the SYN received and then moves
straight to the ESTABLISHED state.
SYN-RECEIVED: The device has both received a SYN (connection request) from its partner and sent its own SYN. It is now waiting for an ACK to its SYN to finish connection setup.
Receive ACK: When the device receives the ACK to the SYN it sent, it transitions to the ESTABLISHED state.
ESTABLISHED: The steady state of an open TCP connection. Data can be exchanged freely once both devices in the connection enter this state. This continues until the connection is closed for one reason or another.
Close, Send FIN: A device can close the connection by sending a message with the FIN (finish) bit set, transitioning to the FIN-WAIT-1 state.
Receive FIN: A device may receive a FIN message from its
connection partner asking that the connection be closed. It will
acknowledge this message and transition to the CLOSE-WAIT
state.
CLOSE-WAIT: The device has received a close request (FIN) from the other device. It must now wait for the application on the local device to acknowledge this request and generate a matching request.
Close, Send FIN: The application using TCP, having been informed that the other process wants to shut down, sends a close request to the TCP layer on its machine. TCP then sends a FIN to the remote device that already asked to terminate the connection. The device now transitions to LAST-ACK.
LAST-ACK: A device that has already received a close request and acknowledged it, has sent its own FIN, and is waiting for an ACK to this request.
Receive ACK for FIN: The device receives an acknowledgment for its close request. It has now sent its FIN and had it acknowledged, and received the other device's FIN and acknowledged it, so it goes straight to the CLOSED state.
FIN-WAIT-1: A device in this state is waiting for an ACK for a FIN it has sent, or for a connection termination request from the other device.
Receive ACK for FIN: The device receives an acknowledgment for its close request. It transitions to the FIN-WAIT-2 state.
Receive FIN, Send ACK: The device does not receive an ACK for its own FIN, but receives a FIN from the other device. It acknowledges it and moves to the CLOSING state.
FIN-WAIT-2: A device in this state has received an ACK for its request to terminate the connection and is now waiting for a matching FIN from the other device.
Receive FIN, Send ACK: The device receives a FIN from the other device. It acknowledges it and moves to the TIME-WAIT state.
CLOSING: The device has received a FIN from the other device and sent an ACK for it, but has not yet received an ACK for its own FIN message.
Receive ACK for FIN: The device receives an acknowledgment for its close request. It transitions to the TIME-WAIT state.
TIME-WAIT: The device has now received a FIN from the other device and acknowledged it, and sent its own FIN and received an ACK for it. All that remains is to wait long enough to ensure the final ACK was received and to prevent potential overlap with new connections.
Timer Expiration: After the designated wait period, the device transitions to the CLOSED state.
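The states and transitions above can be condensed into a transition table keyed by (state, event). The event names here are shorthand invented for this sketch:

```python
# Minimal transition table for the TCP FSM described above.
# Keys are (state, event); values are the next state.
TRANSITIONS = {
    ("CLOSED", "passive_open"): "LISTEN",
    ("CLOSED", "active_open/send_syn"): "SYN-SENT",
    ("LISTEN", "recv_syn/send_syn_ack"): "SYN-RECEIVED",
    ("SYN-SENT", "recv_syn/send_ack"): "SYN-RECEIVED",
    ("SYN-SENT", "recv_syn_ack/send_ack"): "ESTABLISHED",
    ("SYN-RECEIVED", "recv_ack"): "ESTABLISHED",
    ("ESTABLISHED", "close/send_fin"): "FIN-WAIT-1",
    ("ESTABLISHED", "recv_fin/send_ack"): "CLOSE-WAIT",
    ("CLOSE-WAIT", "close/send_fin"): "LAST-ACK",
    ("LAST-ACK", "recv_ack"): "CLOSED",
    ("FIN-WAIT-1", "recv_ack"): "FIN-WAIT-2",
    ("FIN-WAIT-1", "recv_fin/send_ack"): "CLOSING",
    ("FIN-WAIT-2", "recv_fin/send_ack"): "TIME-WAIT",
    ("CLOSING", "recv_ack"): "TIME-WAIT",
    ("TIME-WAIT", "2msl_timeout"): "CLOSED",
}

def run(events, state="CLOSED"):
    """Walk the FSM through a sequence of events."""
    for ev in events:
        state = TRANSITIONS[(state, ev)]
    return state

# A normal client lifetime: connect, then actively close.
client = ["active_open/send_syn", "recv_syn_ack/send_ack",
          "close/send_fin", "recv_ack", "recv_fin/send_ack",
          "2msl_timeout"]
assert run(client) == "CLOSED"
```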
-TCP Message (Segment) Format
Source Port and Destination Port
The source and destination port numbers identify the sending and receiving applications. These two values, along with the source and destination IP addresses in the IP header, uniquely identify each TCP connection.
Sequence Number
For normal transmissions, this is the sequence number of the first byte of data in this segment. In a connection request (SYN) message, it carries the initial sequence number (ISN) chosen by this host for the connection; the sequence number of the first byte of data sent by this host will then be the ISN plus one.
Acknowledgment Number
The acknowledgment number contains the next sequence number that the sender of the acknowledgment expects to receive. This is therefore one more than the sequence number of the last successfully received byte of data.
Header Length
The header length gives the length of the header in 32-bit words. This is required because the length of the Options field is variable. With a 4-bit field, TCP is limited to a 60-byte header; without options, the normal size is 20 bytes.
Reserved
This field is 6 bits reserved for future use; sent as zero.
Six Flag Bits
There are six flag bits in the TCP header. One or more of them can be turned on at the same time.
URG: The urgent pointer is valid.
ACK: The acknowledgment number is valid.
PSH: The receiver should pass this data to the application as soon as possible.
RST: Reset the connection.
SYN: Synchronize sequence numbers to initiate a connection.
FIN: The sender is finished sending data.
Window Size
This is the number of bytes, starting with the one specified by the Acknowledgment Number field, that the receiver is willing to accept. This is a 16-bit field, limiting the window to 65,535 bytes.
Checksum
The checksum covers the TCP segment: the TCP header and the TCP data. This is a mandatory field that must be calculated and stored by the sender, and then verified by the receiver. The TCP checksum is calculated similarly to the UDP checksum, using a pseudo header.
Urgent Pointer
The urgent pointer is valid only if the URG flag is set. This
pointer is a positive offset that must be added to the sequence
number field of the segment to yield the sequence number of the
last byte of urgent data. TCP's urgent mode is a way for the sender
to transmit emergency data to the other end.
Options
Specifies one or more options controlling how the TCP connection should operate. Common options include Maximum Segment Size (MSS), Alternate Checksum, and others.
-The TCP Reset Function
TCP uses Reset segments (segments with the RST flag set) to handle problems that occur during an established connection. The device detecting the problem sends a TCP segment with the RST (reset) flag set to 1. The receiving device either returns to the LISTEN state, if it was in the process of connection establishment, or closes the connection and returns to the CLOSED state. The following are some of the most common cases in which the TCP software generates a reset:
a) Half-Open Connection
A TCP connection is said to be
half-open if one end has closed or aborted the connection without
the knowledge of the other end. This can happen any time one of the
two hosts crashes. As long as there is no attempt to transfer data
across a half-open connection, the end that's still up won't detect
that the other end has crashed.
Another common cause of a half-open connection is when a client
host is powered off, instead of terminating the client application
and then shutting down the client host.
b) Connection Request to a Nonexistent Port
A connection request arrives and no process is listening on the destination port.
c) Receipt of any TCP segment from a device with which the receiving device does not currently have a connection (other than a SYN requesting a new connection).
d) Receipt of a message with an invalid or incorrect Sequence Number or Acknowledgment Number field, indicating that the message may belong to a prior connection or is spurious in some other way.
-TCP Checksum Calculation and the TCP Pseudo Header
To provide
basic protection against errors in transmission, TCP includes a
16-bit Checksum field in its header. Instead of computing the
checksum over only the actual data fields of the TCP segment, a
12-byte TCP pseudo header is created prior to checksum
calculation.
[Figure: TCP pseudo header for checksum calculation]
Once this 96-bit pseudo header has been formed, it is placed in a buffer, followed by the TCP segment itself. Then the checksum is computed over the entire set of data (pseudo header plus TCP segment). The value of the checksum is placed in the Checksum field of the TCP header, and the pseudo header is discarded.
The Checksum field is itself part of the TCP header and thus one
of the fields over which the checksum is calculated. This field is
assumed to be all zeros during calculation of the checksum.
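The calculation can be sketched as follows. The addresses and header values here are made-up example data; the verification step at the end mirrors what the receiver does:

```python
import struct

def internet_checksum(data: bytes) -> int:
    """One's-complement sum of 16-bit words (odd-length data is zero-padded)."""
    if len(data) % 2:
        data += b"\x00"
    total = 0
    for (word,) in struct.iter_unpack("!H", data):
        total += word
    while total >> 16:                      # fold carries back into 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return (~total) & 0xFFFF

def tcp_checksum(src_ip: bytes, dst_ip: bytes, segment: bytes) -> int:
    """Checksum over the 12-byte pseudo header followed by the TCP segment.
    The segment's own Checksum field must already be zero."""
    # Pseudo header: source IP, destination IP, zero, protocol 6, TCP length.
    pseudo = struct.pack("!4s4sBBH", src_ip, dst_ip, 0, 6, len(segment))
    return internet_checksum(pseudo + segment)

# A minimal 20-byte header (Checksum field zeroed) plus payload.
seg = bytearray(struct.pack("!HHIIBBHHH",
                            40000, 80, 100, 200, 5 << 4, 0x18, 512, 0, 0))
seg += b"hello"
csum = tcp_checksum(b"\x0a\x00\x00\x01", b"\x0a\x00\x00\x02", bytes(seg))
struct.pack_into("!H", seg, 16, csum)       # store checksum at offset 16
# Recomputing over a segment carrying a correct checksum yields zero.
assert tcp_checksum(b"\x0a\x00\x00\x01", b"\x0a\x00\x00\x02", bytes(seg)) == 0
```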
When the TCP segment arrives at its destination, the receiving TCP software performs the same calculation. It forms the pseudo header, prepends it to the actual TCP segment, and then computes the checksum (setting the Checksum field to zero for the calculation, as before). If there is a mismatch between its calculation and the value the source device put in the Checksum field, an error of some sort occurred, and the segment is normally discarded.
Advantages of the Pseudo Header Method
The checksum protects not just against errors in the TCP segment fields, but also against the following problems:
Incorrect Segment Delivery: If there is a mismatch in the destination/source address between what the source specified and what the destination that received the segment used, the checksum will fail.
Incorrect Protocol: If a datagram that actually belongs to a different protocol is routed to TCP for whatever reason, this can be immediately detected.
Incorrect Segment Length: If part of the TCP segment has been omitted by accident, the lengths the source and destination used won't match, and the checksum will fail.
TCP also supports an optional method of having the two devices agree on an alternative checksum algorithm. This must be negotiated during connection establishment.
-TCP Immediate Data Transfer: Push Function
TCP
includes a special push function to handle cases where data given to TCP needs to be sent immediately. An application can send data to its TCP software and indicate that it should be pushed. The segment will be sent right away rather than being buffered, and the pushed segment's PSH control bit will be set to 1 to tell the receiving TCP that it should immediately pass the data up to the receiving application. There is no API to set the PSH flag; typically it is set by the kernel when it empties its buffer.
Example Scenarios
- Telnet session
- HTTP request to a web server
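Since PSH itself has no portable socket API, the closest application-level knob for "send small writes now" is disabling Nagle's algorithm with TCP_NODELAY. Note this is a related mechanism, not a way to set the PSH flag directly:

```python
import socket

# Disabling Nagle's algorithm stops the kernel from coalescing small
# writes while waiting for outstanding ACKs, so interactive traffic
# (e.g. a Telnet keystroke or a small HTTP request) goes out at once.
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
nodelay = s.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY)
s.close()
```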
-TCP Priority Data Transfer: Urgent Function
To deal with situations where a certain part of a data stream needs to be sent with a higher priority than the rest, TCP incorporates an urgent function. When critical data needs to be sent, the application signals this to its TCP layer, which transmits it with the URG bit set in the TCP segment, bypassing any lower-priority data that may already have been queued for transmission. TCP also sets the Urgent Pointer field to an offset value that points to the last byte of urgent data in the segment. So, for example, if the segment contained 400 bytes of urgent data followed by 200 bytes of regular data, the URG bit would be set, and the Urgent Pointer field would have a value of 400.
The URG flag causes the receiving TCP to forward the urgent data on a separate channel to the application (for instance, on Unix the process gets a SIGURG signal). This allows the application to process the data out of band. As a side note, urgent data is rarely used today and not very well implemented.
Example Scenarios
- Aborting a file transfer: the abort command should be transferred urgently, not after the file data is sent.
-TCP Sliding Window Data Transfer and
Acknowledgment Mechanics
The TCP sliding window system forms the basis of TCP data transfer and flow control. Each of the two devices on a connection must keep track of the data it is sending, as well as the data it is receiving from the other device. This is done by conceptually dividing the bytes into categories.
Sliding Window Transmit Categories
For data being transmitted, there are four transmit categories:
Transmit Category 1: Bytes sent and acknowledged.
Transmit Category 2: Bytes sent but not yet acknowledged.
Transmit Category 3: Bytes not yet sent for which the recipient is ready.
Transmit Category 4: Bytes not yet sent for which the recipient is not ready.
The Send Window and Usable Window
The send window
represents the maximum number of unacknowledged bytes that a device is allowed to have outstanding at one time; it is often called simply the window. The usable window is the amount of the send window that the sender is still allowed to use at any point in time; it equals the size of the send window less the number of unacknowledged bytes already transmitted.
Basic Sliding Window Mechanism
When a device gets an acknowledgment for a range of bytes, it knows the destination has successfully received them. It moves them from the "sent but unacknowledged" category to the "sent and acknowledged" category. This causes the send window to slide to the right, allowing the device to send more data.
Example: Sliding the Window
Suppose the sender transmits all the bytes in its usable window (6 bytes). Now suppose the sender receives an acknowledgment for bytes 32 to 36; the send window then slides right by 5 bytes.
Three terms are used to describe the movement of the right and
left edges of the window.
1. The window closes as the left edge advances to the right. This happens when data is sent and acknowledged.
2. The window opens when the right edge moves to the right, allowing more data to be sent. This happens when the receiving process on the other end reads acknowledged data, freeing up space in its TCP receive buffer.
3. The window shrinks when the right edge moves to the left. The TCP standard strongly discourages this, but TCP must be able to cope with a peer that does it.
-Send (SND) Pointers
The four transmit categories are divided using three send (SND) pointers:
Send Unacknowledged (SND.UNA): The sequence number of the first byte of data that has been sent but not yet acknowledged. This marks the first byte of Transmit Category 2; all previous sequence numbers refer to bytes in Transmit Category 1.
Send Next (SND.NXT): The sequence number of the next byte of data to be sent to the other device. This marks the first byte of Transmit Category 3.
Send Window (SND.WND): The size of the send window. Recall that the window specifies the total number of bytes that a device may have outstanding (unacknowledged) at any one time. Thus, adding the send window (SND.WND) to the sequence number of the first unacknowledged byte (SND.UNA) marks the first byte of Transmit Category 4.
The usable window is given by the following formula:
SND.UNA + SND.WND - SND.NXT
-Receive Categories and Pointers
For data being received, there are three receive categories:
Receive Category 1+2: Bytes received and acknowledged. This is the receiver's complement to Transmit Categories 1 and 2.
Receive Category 3: Bytes not yet received for which the recipient is ready. This is the receiver's complement to Transmit Category 3.
Receive Category 4: Bytes not yet received for which the recipient is not ready. This is the receiver's complement to Transmit Category 4.
-Receive (RCV) Pointers
The three receive categories are divided using two pointers:
Receive Next (RCV.NXT): The sequence number of the next byte of data that is expected from the other device. This marks the first byte in Receive Category 3. All previous sequence numbers refer to bytes already received and acknowledged, in Receive Categories 1 and 2.
Receive Window (RCV.WND): The size of the receive window advertised to the other device. This is the number of bytes the device is willing to accept at one time from its peer, which is usually the size of the buffer allocated for receiving data on this connection. When added to the RCV.NXT pointer, it marks the first byte of Receive Category 4.
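The pointer arithmetic above can be checked with a few lines of code. The pointer values here are illustrative, not from any real trace:

```python
def usable_window(snd_una: int, snd_nxt: int, snd_wnd: int) -> int:
    """SND.UNA + SND.WND - SND.NXT: bytes the sender may still transmit."""
    return snd_una + snd_wnd - snd_nxt

def transmit_category(seq: int, snd_una: int, snd_nxt: int, snd_wnd: int) -> int:
    """Classify a byte's sequence number into the four transmit categories."""
    if seq < snd_una:
        return 1            # sent and acknowledged
    if seq < snd_nxt:
        return 2            # sent but not yet acknowledged
    if seq < snd_una + snd_wnd:
        return 3            # not yet sent, recipient ready
    return 4                # not yet sent, recipient not ready

# Illustrative pointers: 14 bytes in flight out of a 20-byte window.
snd_una, snd_nxt, snd_wnd = 32, 46, 20
assert usable_window(snd_una, snd_nxt, snd_wnd) == 6
assert [transmit_category(s, snd_una, snd_nxt, snd_wnd)
        for s in (31, 40, 50, 60)] == [1, 2, 3, 4]
```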
Both the client and server keep track of both streams being sent over the connection, using a set of special variables called pointers. A device's send pointers track its outgoing data, and its receive pointers track the incoming data. The receive pointers are the complement of the send (SND) pointers: the RCV.WND of one device equals the SND.WND of the other device on the connection.
-TCP Segment Fields Used to Exchange Pointer Information
Sequence Number: This will normally be equal to the value of the SND.UNA pointer at the time the data is sent.
Acknowledgment Number: This field will normally be equal to the RCV.NXT pointer of the device that sends it.
Window: The size of the receive window of the device sending the segment (and thus, the send window of the device receiving the segment).
-TCP Segment Retransmission Timers and the Retransmission Queue
To retransmit lost segments, TCP employs one retransmission
timer (for the whole connection period) that handles the
retransmission time-out (RTO), the waiting time for an
acknowledgment of a segment. We can define the following rules for
the retransmission timer:
1. When TCP sends the segment in front of the sending queue, it
starts the timer.
2. When the timer expires, TCP resends the first segment in
front of the queue, and
restarts the timer.
3. When a segment (or segments) is cumulatively acknowledged, it is purged from the queue.
4. If the queue is empty, TCP stops the timer; otherwise, TCP
restarts the timer.
The retransmission times and the number of attempts aren't enforced by the standard; they are implemented differently by different operating systems, but the methodology is fixed. Retransmission time-outs (RTO) are measured in terms of the round-trip time (RTT).
Round-Trip Time (RTT)
The typical time it takes to send a segment from a client to a server and for the server to send an acknowledgment back to the client.
Measured RTT: The measured round-trip time for a segment is the time required for the segment to reach the destination and be acknowledged, although the acknowledgment may cover other segments as well. Note that in TCP, only one RTT measurement can be in progress at any time: once an RTT measurement is started, no other measurement starts until the value of this RTT is finalized. We use the notation RTTM for the measured RTT.
Smoothed RTT: The measured RTT, RTTM, is likely to change for each round trip, and the fluctuation is so high in today's Internet that a single measurement alone cannot be used for retransmission time-out purposes. Most implementations use a smoothed RTT, called RTTS, which is a weighted average of RTTM and the previous RTTS:
Initially: no value
After the first measurement: RTTS = RTTM
After each subsequent measurement: RTTS = (1 - α) × RTTS + α × RTTM (α is implementation-dependent, typically 1/8)
RTT Deviation: Most implementations do not use RTTS alone; they also calculate the RTT deviation, called RTTD, based on RTTS and RTTM.
Retransmission Time-out (RTO)
The value of RTO is based on the smoothed round-trip time and its deviation. Most implementations use the following formula:
Initially: an original (default) value
After any measurement: RTO = RTTS + 4 × RTTD
-Retransmission Ambiguity and Karn's Algorithm
Suppose a packet is transmitted and a timeout occurs; the RTO is backed off and the packet is retransmitted with the longer RTO. Now an acknowledgment is received. Is the ACK for the first transmission or the second? This is called the retransmission ambiguity problem.
This ambiguity is solved by Karn's algorithm, which is simple: do not consider the round-trip time of a retransmitted segment in the calculation of RTTS. Do not update the value of RTTS until you send a segment and receive an acknowledgment without the need for retransmission.
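The RTT formulas, Karn's rule, and exponential backoff fit together as sketched below. The coefficients α = 1/8 and β = 1/4 and the 3-second initial RTO are the conventional choices, not mandated values:

```python
class RTOEstimator:
    """Smoothed RTT, deviation, and RTO, per the formulas above."""
    def __init__(self, alpha=1/8, beta=1/4):
        self.alpha, self.beta = alpha, beta
        self.rtts = None          # smoothed RTT (no value initially)
        self.rttd = None          # RTT deviation
        self.rto = 3.0            # conventional initial value (seconds)

    def on_measurement(self, rttm):
        """Karn's algorithm: call this only for segments that were
        acknowledged without ever being retransmitted."""
        if self.rtts is None:
            self.rtts = rttm                  # first measurement
            self.rttd = rttm / 2
        else:
            self.rtts = (1 - self.alpha) * self.rtts + self.alpha * rttm
            self.rttd = ((1 - self.beta) * self.rttd
                         + self.beta * abs(self.rtts - rttm))
        self.rto = self.rtts + 4 * self.rttd

    def on_timeout(self):
        """Exponential backoff: double the RTO on each retransmission."""
        self.rto *= 2

est = RTOEstimator()
est.on_measurement(1.0)       # RTTS = 1.0, RTTD = 0.5, RTO = 3.0
assert est.rto == 1.0 + 4 * 0.5
est.on_timeout()              # retransmission: RTO doubles to 6.0
assert est.rto == 6.0
```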
Exponential Backoff
What is the value of RTO if a retransmission occurs? Most TCP implementations use an exponential backoff strategy: the value of RTO is doubled for each retransmission. So if the segment is retransmitted once, the value is two times the RTO; if it is retransmitted twice, the value is four times the RTO; and so on.
-TCP Acknowledgment Handling and Selective Acknowledgment (SACK)
TCP uses a cumulative acknowledgment system. The Acknowledgment Number field in a segment received by a device indicates that all bytes of data with sequence numbers less than that value have been successfully received by the other device.
Because TCP's acknowledgment system is cumulative, if a segment is lost in transit, no subsequent segments can be acknowledged until the missing one is retransmitted and successfully received.
There are two approaches to handling retransmission in TCP. In the more conservative approach, only the segments whose timers expire are retransmitted. This saves bandwidth, but it may cause performance degradation if many segments in a row are lost. The alternative is that when a segment's retransmission timer expires, both it and all subsequent unacknowledged segments are retransmitted. This provides better performance if many segments are lost, but it may waste bandwidth on unnecessary retransmissions.
The optional TCP selective acknowledgment feature provides a more elegant way of handling the segments that follow a lost one. When a device receives a noncontiguous segment, it includes a special Selective Acknowledgment (SACK) option in its duplicate acknowledgment (duplicate because of the missing segment) that identifies noncontiguous segments that have already been received, even if they have not yet been cumulatively acknowledged. This saves the original sender from needing to retransmit them.
To use SACK, the two devices on the connection must both support the feature, and must enable it by negotiating the Selective Acknowledgment Permitted (SACK-Permitted) option in the SYN segments they use to establish the connection.
Example Scenario
Step 1 Response segment #2 is lost.
Step 2 The client realizes it is missing a segment between
segments #1 and #3. It sends a duplicate acknowledgment for segment
#1, and attaches a SACK option indicating that it has received
segment #3.
Step 3 The client receives segment #4 and sends another
duplicate acknowledgment for segment #1, but this time expands the
SACK option to show that it has received segments #3 through
#4.
Step 4 The server receives the client's duplicate ACK for
segment #1 and SACK for segment #3 (both in the same TCP packet).
From this, the server deduces that the client is missing segment
#2, so segment #2 is retransmitted. The next SACK received by the
server indicates that the client has also received segment #4
successfully, so no more segments need to be transmitted.
Step 5 The client receives segment #2 and sends an acknowledgment indicating that it has received all data up to and including segment #4.
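The receiver-side bookkeeping in this scenario can be sketched by collapsing the set of received segments into SACK ranges. This toy model uses segment numbers rather than byte sequence numbers for simplicity:

```python
def sack_blocks(received):
    """Collapse received, possibly noncontiguous segment numbers
    into (first, last) SACK ranges."""
    blocks, run = [], []
    for n in sorted(received):
        if run and n == run[-1] + 1:
            run.append(n)               # extends the current range
        else:
            if run:
                blocks.append((run[0], run[-1]))
            run = [n]                   # starts a new range
    if run:
        blocks.append((run[0], run[-1]))
    return blocks

# Scenario above: segment #2 is lost; #1, #3, #4 arrive.
# The cumulative ACK still covers only #1, but SACK reports 3-4,
# so the sender retransmits only #2.
assert sack_blocks({1, 3, 4}) == [(1, 1), (3, 4)]
```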
-TCP Window Size Adjustment and Flow Control
The TCP sliding window system is used not just to ensure reliability through acknowledgments and retransmissions; it is also the basis for TCP's flow control mechanism. By increasing or reducing the size of its receive window, a device can raise or lower the rate at which its connection partner sends it data. If a device becomes extremely busy, it can even reduce the receive window to zero, closing the window and halting any further transmission of data until the window is reopened.
Example: TCP Closing the Send Window and Zero Window
This diagram shows three
message cycles, each of which results in the server reducing its receive window. In the first cycle, the server reduces it from 360 to 260 bytes, so the client's usable window can increase by only 40 bytes when it gets the server's acknowledgment. In the second and third cycles, the server reduces the window size by the amount of data it receives, which temporarily freezes the client's send window size and halts it from sending new data.
Handling a Closed Window and Sending Probe Segments
A device that reduces its receive window to zero is said to have closed the window. The other device's send window is thus closed; it may not send regular data segments. It may, however, send probe segments to check the status of the window, ensuring that it does not miss the notification when the window reopens.
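The basic flow control arithmetic described above can be sketched as follows (a hypothetical helper, not an actual TCP implementation):

```python
def usable_window(advertised, unacked_bytes):
    """Bytes the sender may still transmit: the receiver's advertised window
    minus data already sent but not yet acknowledged.
    Zero means the send window is closed."""
    return max(advertised - unacked_bytes, 0)

print(usable_window(advertised=360, unacked_bytes=140))  # 220
print(usable_window(advertised=0, unacked_bytes=0))      # 0 -> closed; only probes allowed
```

When the result is zero, the sender must stop transmitting data and fall back to the probe segments just described.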
-TCP Window Management Issues
Shrinking the TCP Window
A phenomenon called shrinking the window occurs when a device reduces its receive window so much that its partner device's usable transmit window shrinks in size (meaning that the right edge of its send window moves to the left).
Example
Diagram Description
The client begins with a usable window size of 360 bytes. It sends a 140-byte segment and then, a short time thereafter, sends one of 180 bytes. The server is busy, however, and when it receives the first transmission, it decides to reduce its buffer to 240 bytes. It holds the 140 bytes just received and reduces its receive window all the way down to 100 bytes. When the client's 180-byte segment arrives, there is room for only 100 of the 180 bytes in the server's buffer. When the client gets the new window size advertisement of 100, it has a problem, because it already has 180 bytes sent but not acknowledged.
Handling Shrinking Issues
Shrinking occurs whenever the server sends back a window size
advertisement smaller than what the client considers its usable
window size to be at that time.
Shrinking can result in data already in transit needing to be
discarded.
To prevent the problems associated with shrinking windows, TCP adds a simple rule to the basic sliding window mechanism: a device is not allowed to shrink the window; devices must instead reduce their receive window size more gradually. Of course, there may be cases where we do need to reduce a buffer, so how should this be handled? Instead of shrinking the window, the server must be more patient. In the previous example, where the buffer needs to be reduced to 240 bytes, the server must send back a window size of 220, freezing the right edge of the client's send window. The client can still fill the 360-byte buffer, but it cannot send more than that. As soon as 120 bytes are removed from the server's receive buffer, the buffer can then be reduced to 240 bytes with no data loss. Then the server can resume normal operation, increasing the window size as bytes are taken from the receive buffer.
-TCP Silly Window Syndrome
The basic TCP sliding window system sets no minimum size on transmitted
segments. Under certain circumstances, this can result in a
situation where many small, inefficient segments are sent, rather
than a smaller number of large ones. Affectionately termed silly
window syndrome (SWS), this phenomenon can occur either as a result
of a recipient advertising window sizes that are too small or a
transmitter being too aggressive in immediately sending out very
small amounts of data.
Example
This diagram shows one example of how the phenomenon known as TCP silly window syndrome can arise. The client is trying
to send data as fast as possible
to the server, which is very busy and cannot clear its buffers
promptly. Each time the client sends data, the server reduces its
receive window. The size of the messages the client sends shrinks
until it is sending only very small, inefficient segments.
-Silly Window Syndrome Avoidance Algorithms
Since both the sender and recipient of data contribute to SWS, the behavior of both is changed to avoid it. These changes are collectively termed SWS avoidance algorithms.
Receiver SWS Avoidance
The receiver contributes to SWS by reducing the size of its receive window to smaller and smaller values. This causes the right edge of the sender's send window to move by ever-smaller increments, leading to smaller and smaller segments. To avoid SWS, we restrict the receiver from moving the right edge of the window by too small an amount. The usual minimum that the edge may be moved is either the value of the MSS parameter or one-half the buffer size, whichever is less.
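This receiver-side rule can be sketched as follows (an illustrative helper; the function and parameter names are invented):

```python
def window_edge_advance(free_buffer, mss, buffer_size):
    """Receiver-side SWS avoidance: advertise newly freed window space only
    when the right edge would move by at least min(MSS, buffer_size / 2);
    otherwise keep advertising zero additional space."""
    threshold = min(mss, buffer_size // 2)
    return free_buffer if free_buffer >= threshold else 0

print(window_edge_advance(free_buffer=100, mss=536, buffer_size=8192))  # 0: too small an advance
print(window_edge_advance(free_buffer=600, mss=536, buffer_size=8192))  # 600: edge may move
```

By withholding tiny window advances, the receiver prevents the sender from being coaxed into emitting tiny segments.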
Sender SWS Avoidance and Nagle's Algorithm
Instead of trying to send data immediately, we wait to send it until we have a segment of a reasonable size. The specific method for doing this is called Nagle's algorithm. Simplified, this algorithm works as follows:
- As long as there is no unacknowledged data outstanding on the connection, data can be sent as soon as the application wants. For example, in the case of an interactive application like Telnet, a single keystroke can be pushed in a segment.
- While there is unacknowledged data, all subsequent data to be sent is held in the transmit buffer and not transmitted until either all the unacknowledged data is acknowledged or we have accumulated enough data to send a full-sized (MSS-sized) segment. This applies even if a push is requested by the user.
-TCP Congestion Handling and Congestion Avoidance Algorithms
Congestion in the network layer is a situation in which too many datagrams are present in an area of the internetwork. Congestion may occur if the load on the network (the number of packets sent to the network) is greater than the capacity of the network (the number of packets the network can handle).
During congestion, routers may drop some packets.
To deal with congestion and avoid contributing to it unnecessarily, modern TCP implementations include a set of congestion avoidance algorithms that alter the normal operation of the sliding window system to ensure more efficient overall operation. The four algorithms, Slow Start, Congestion Avoidance, Fast Retransmit, and Fast Recovery, are described below.
Slow Start
Slow Start, a requirement for TCP software implementations, is a mechanism used by the sender to control the transmission rate, otherwise known as sender-based flow control. This is accomplished through the return rate of acknowledgments from the receiver. In other words, the rate of acknowledgments returned by the receiver determines the rate at which the sender can transmit data.
When a TCP connection first begins, the Slow Start algorithm
initializes a congestion window to one segment, which is the
maximum segment size (MSS) initialized by the receiver during the
connection establishment phase. When acknowledgements are returned
by the receiver, the congestion window increases by one segment for
each acknowledgement returned. Thus, the sender can transmit the
minimum of the congestion window and the advertised window of the
receiver, which is simply called the transmission window.
Slow Start is actually not very slow when the network is not
congested and network
response time is good. For example, the first successful
transmission and acknowledgement of a TCP segment increases the
window to two segments. After
successful transmission of these two segments and
acknowledgements completes, the window is increased to four
segments. Then eight segments, then sixteen segments and so on,
doubling from there on out up to the maximum window size advertised
by the receiver or until congestion finally does occur.
At some point the congestion window may become too large for the
network or network conditions may change such that packets may be
dropped. Packets lost will trigger a timeout at the sender. When
this happens, the sender goes into congestion avoidance mode as
described in the next section.
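The doubling behavior described above can be modeled with a short sketch (illustrative only; real implementations grow the window per ACK received, which works out to roughly doubling per round trip):

```python
def slow_start_windows(mss, rwnd, rtts):
    """Model congestion window growth during Slow Start: cwnd starts at one
    MSS and roughly doubles every round trip, capped by the receiver's
    advertised window (rwnd)."""
    cwnd = mss
    history = []
    for _ in range(rtts):
        history.append(min(cwnd, rwnd))  # sender may transmit min(cwnd, rwnd)
        cwnd *= 2  # each ACKed segment adds one MSS -> doubling per RTT
    return history

# With a 1,460-byte MSS and a 64 KB receive window:
print(slow_start_windows(1460, 65535, 7))
```

The window grows 1460, 2920, 5840, ... until it hits the receiver's advertised limit, matching the two, four, eight, sixteen segment progression in the text.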
Congestion Avoidance
During the initial data transfer phase of a TCP connection the
Slow Start algorithm is used. However, there may be a point during
Slow Start that the network is forced to drop one or more packets
due to overload or congestion. If this happens, Congestion
Avoidance is used to slow the transmission rate. However, Slow Start is used in conjunction with Congestion Avoidance as the means to get the data transfer going again so it doesn't slow down and stay slow.
In the Congestion Avoidance algorithm, a retransmission timer expiring or the reception of duplicate ACKs can implicitly signal the sender that network congestion is occurring. The sender immediately sets its transmission window to one half of the current window size (the minimum of the congestion window and the receiver's advertised window size), but to at least two segments. If congestion was indicated by a timeout, the congestion window is reset to one segment, which automatically puts the sender into Slow Start mode. If congestion was indicated by duplicate ACKs, the Fast Retransmit and Fast Recovery algorithms are invoked. As data is
received during Congestion Avoidance, the congestion window is
increased. However, Slow Start is only used up to the halfway point
where congestion originally occurred. This halfway point was
recorded earlier as the new transmission window. After this halfway
point, the congestion window is increased by one segment for all
segments in the transmission window that are acknowledged. This
mechanism will force the sender to more slowly grow its
transmission rate, as it will approach the point where congestion
had previously been detected.
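The reaction described above can be sketched as follows (an illustrative model; the name ssthresh follows common usage, but the helper itself is invented):

```python
def on_congestion(cwnd, rwnd, mss, cause):
    """Model TCP's reaction to congestion as described above: record half the
    effective window (but at least two segments) as the threshold, then
    restart from one segment if the signal was a timeout."""
    ssthresh = max(min(cwnd, rwnd) // 2, 2 * mss)  # half the window, >= 2 segments
    if cause == "timeout":
        cwnd = mss  # back to Slow Start from one segment
    # On duplicate ACKs, Fast Retransmit / Fast Recovery take over instead.
    return cwnd, ssthresh

print(on_congestion(cwnd=16 * 1460, rwnd=65535, mss=1460, cause="timeout"))
```

With a 16-segment window, a timeout records a threshold of 11,680 bytes (half of 23,360) and drops the congestion window back to a single 1,460-byte segment.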
Fast Retransmit
When a duplicate ACK is received, the sender does not know if it
is because a TCP
segment was lost or simply that a segment was delayed and
received out of order at the receiver. If the receiver can reorder segments, it should not be long before the receiver sends the latest expected acknowledgment. Typically, no more than one or two duplicate ACKs should be received when simple out-of-order conditions exist. If, however, more than two duplicate ACKs are received by the sender, it is a strong indication that at least one segment has been lost. The TCP sender assumes enough time has elapsed for all segments to be properly reordered, since the receiver had enough time to send three duplicate ACKs.
When three or more duplicate ACKs are received, the sender does not even wait for a retransmission timer to expire before retransmitting the segment (as indicated by the position of the duplicate ACK in the byte stream). This process is called the Fast Retransmit algorithm. Immediately following Fast Retransmit is the Fast Recovery algorithm.
Fast Recovery
Since the Fast Retransmit algorithm is used when duplicate ACKs are being received, the TCP sender has implicit knowledge that data is still flowing to the receiver. Why? Because duplicate ACKs can only be generated when a segment is received. This is a strong indication that serious network congestion may not exist and that the lost segment was a rare event. So instead of reducing the flow of data abruptly by going all the way into Slow Start, the sender only enters Congestion Avoidance mode.
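The three-duplicate-ACK trigger described under Fast Retransmit can be sketched as follows (a hypothetical helper, for illustration only):

```python
def should_fast_retransmit(acks):
    """Scan a stream of ACK numbers and report whether the third duplicate
    ACK (four identical ACKs in a row) was seen, which is the Fast
    Retransmit trigger described above."""
    dup_count = 0
    last_ack = None
    for ack in acks:
        if ack == last_ack:
            dup_count += 1
            if dup_count >= 3:  # three duplicates -> assume the segment is lost
                return True
        else:
            dup_count = 0
            last_ack = ack
    return False

# ACK 2000 repeated: the receiver keeps asking for the segment starting at 2000
print(should_fast_retransmit([1000, 2000, 2000, 2000, 2000]))  # True
```

One or two duplicates (simple reordering) do not trip the threshold; only the third duplicate does, so the sender retransmits without waiting for the timer.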
Rather than starting at a window of one segment as in Slow Start mode, the sender resumes transmission with a larger window, incrementing as if in Congestion Avoidance mode. This allows for higher throughput under conditions of only moderate congestion.
-TCP Maximum Segment Size (MSS)
TCP is designed to restrict the size of the segments it sends to a certain maximum limit, to reduce the likelihood that segments will need to be fragmented for transmission at the IP level. The TCP maximum segment size (MSS) specifies the maximum number of bytes in the TCP segment's Data field, regardless of any other factors that influence segment size.
The default MSS for TCP is 536 bytes, which is calculated by starting with the minimum IP MTU of 576 bytes and subtracting 20 bytes each for the IP and TCP headers.
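This calculation can be expressed directly (a trivial sketch; the header sizes assume no IP or TCP options):

```python
IP_HEADER = 20   # minimum IPv4 header, bytes
TCP_HEADER = 20  # minimum TCP header, bytes

def default_mss(ip_minimum_mtu=576):
    """Derive TCP's default MSS from the minimum IP MTU, as described above."""
    return ip_minimum_mtu - IP_HEADER - TCP_HEADER

print(default_mss())      # 536
print(default_mss(1500))  # 1460, the typical MSS over Ethernet
```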
Devices can indicate that they wish to use a different MSS value
from the default by including a Maximum Segment Size option in the
SYN message they use to establish a connection. Each device in the
connection may use a different MSS value.
IP: Internet Protocol
The primary purpose of IP is the delivery of datagrams across an internetwork of connected networks.
-IP Characteristics
- Underlying Protocol-Independent: IP is designed to allow the transmission of data across any type of underlying (layer 2) network that is designed to work with a TCP/IP stack.
- Connectionless Delivery: IP is a connectionless protocol. This means that when point A wants to send data to point B, it doesn't first set up a connection to point B and then send the data; it just makes the datagram and sends it.
- Unreliable Delivery: IP does not provide reliability or service-quality capabilities, such as error protection for the data it sends (though it does protect the IP header), flow control, or retransmission of lost datagrams. For this reason, IP is sometimes called a best-effort protocol.
-IP Functions
- Addressing: IP defines the addressing mechanism for the network.
- Data Encapsulation and Formatting/Packaging: As the TCP/IP network layer protocol, IP accepts data from the transport layer protocols UDP and TCP. It then encapsulates this data into an IP datagram using a special format prior to transmission.
- Fragmentation and Reassembly
- Routing
-IP Address Classes
The classful IP addressing scheme divides the IP address space into five classes, A through E, of differing sizes. Classes A, B, and C are the most important ones, designated for conventional unicast addresses. Class D is reserved for IP multicasting, and Class E is reserved for experimental use.
IP Address Class | First Octet Pattern | Lowest First Octet (binary) | Highest First Octet (binary) | First Octet Range (decimal) | Octets in Network ID / Host ID | Theoretical IP Address Range
Class A | 0xxx xxxx | 0000 0001 | 0111 1110 | 1 to 126 | 1 / 3 | 1.0.0.0 to 126.255.255.255
Class B | 10xx xxxx | 1000 0000 | 1011 1111 | 128 to 191 | 2 / 2 | 128.0.0.0 to 191.255.255.255
Class C | 110x xxxx | 1100 0000 | 1101 1111 | 192 to 223 | 3 / 1 | 192.0.0.0 to 223.255.255.255
Class D | 1110 xxxx | 1110 0000 | 1110 1111 | 224 to 239 | - | 224.0.0.0 to 239.255.255.255
Class E | 1111 xxxx | 1111 0000 | 1111 1111 | 240 to 255 | - | 240.0.0.0 to 255.255.255.255
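The first-octet ranges in the table can be turned into a small classifier (an illustrative sketch; first octets 0 and 127 fall outside the normal unicast ranges shown above):

```python
def ip_class(address):
    """Classify an IPv4 address (dotted-quad string) by its first octet,
    following the classful ranges in the table above."""
    first = int(address.split(".")[0])
    if 1 <= first <= 126:
        return "A"
    if 128 <= first <= 191:
        return "B"
    if 192 <= first <= 223:
        return "C"
    if 224 <= first <= 239:
        return "D"
    if 240 <= first <= 255:
        return "E"
    return "reserved"  # 0 and 127 are reserved / loopback, handled separately

print(ip_class("154.3.99.6"))      # B
print(ip_class("227.82.157.160"))  # D
```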
-IP Address Patterns With Special Meanings
Network ID | Host ID | Class A Example | Class B Example | Class C Example | Special Meaning and Description
Network ID | Host ID | 77.91.215.5 | 154.3.99.6 | 227.82.157.160 | Normal meaning: refers to a specific device.
Network ID | All Zeroes | 77.0.0.0 | 154.3.0.0 | 227.82.157.0 | The specified network: this notation, with a 0 at the end of the address, refers to an entire network.
All Zeroes | Host ID | 0.91.215.5 | 0.0.99.6 | 0.0.0.160 | Specified host on this network: addresses a host on the current or default network when the network ID is not known, or when it doesn't need to be explicitly stated.
All Zeroes | All Zeroes | 0.0.0.0 | | | "Me" (alternately, "this host" or the current/default host): used by a device to refer to itself when it doesn't know its own IP address. The most common use is when a device attempts to determine its address using a host-configuration protocol like DHCP. May also be used to indicate that any address of a multihomed host may be used.
Network ID | All Ones | 77.255.255.255 | 154.3.255.255 | 227.82.157.255 | All hosts on the specified network: used for broadcasting to all hosts on the local network.
All Ones | All Ones | 255.255.255.255 | | | All hosts on the network: specifies a broadcast to all hosts on the directly connected network. Note that there is no address that would imply sending to all hosts everywhere on the global Internet, since this would be very inefficient and costly.
-Reserved, Loopback and Private IP Addresses
Range Start | Range End | Classful Equivalent | Classless Equivalent | Description
0.0.0.0 | 0.255.255.255 | Class A network 0.x.x.x | 0/8 | Reserved.
10.0.0.0 | 10.255.255.255 | Class A network 10.x.x.x | 10/8 | Class A private address block.
127.0.0.0 | 127.255.255.255 | Class A network 127.x.x.x | 127/8 | Loopback address block.
128.0.0.0 | 128.0.255.255 | Class B network 128.0.x.x | 128.0/16 | Reserved.
169.254.0.0 | 169.254.255.255 | Class B network 169.254.x.x | 169.254/16 | Class B private address block reserved for automatic private address allocation. See the section on DHCP for details.
172.16.0.0 | 172.31.255.255 | 16 contiguous Class B networks, 172.16.x.x through 172.31.x.x | 172.16/12 | Class B private address blocks.
191.255.0.0 | 191.255.255.255 | Class B network 191.255.x.x | 191.255/16 | Reserved.
192.0.0.0 | 192.0.0.255 | Class C network 192.0.0.x | 192.0.0/24 | Reserved.
192.168.0.0 | 192.168.255.255 | 256 contiguous Class C networks, 192.168.0.x through 192.168.255.x | 192.168/16 | Class C private address blocks.
223.255.255.0 | 223.255.255.255 | Class C network 223.255.255.x | 223.255.255/24 | Reserved.
-IP Datagram General Format
Version
This 4-bit field defines the version of the IP protocol.
Header Length
This 4-bit field defines the total length of the datagram header in 4-byte words. This field is needed because the length of the header is variable (between 20 and 60 bytes). When there are no options, the header length is 20 bytes, and the value of this field is 5 (5 × 4 = 20). When the option field is at its maximum size, the value of this field is 15 (15 × 4 = 60).
Type of Service (TOS) Field
In the original design of the IP header, TOS defined how the datagram should be handled. Part of the field was used to define the precedence of the datagram; the rest defined the type of service (low delay, high throughput, maximize reliability). The IETF has since changed the interpretation of this 8-bit field, which now defines a set of differentiated services. In the new interpretation, the first 6 bits make up the codepoint subfield and the last 2 bits are not used. The codepoint subfield can be used in two different ways.
a. When the 3 rightmost bits are 0s, the 3 leftmost bits are interpreted the same as the precedence bits in the service type interpretation. In other words, it is compatible with the old interpretation. The precedence defines the eight-level priority of the datagram (0 to 7) in issues such as congestion. If a router is congested and needs to discard some datagrams, those with the lowest precedence are discarded first.
b. When the 3 rightmost bits are not all 0s, the 6 bits define 56 (64 − 8) services based on the priority assignment by the Internet or local authorities.
Total Length
The total length field defines the total length of the datagram, including the header, in bytes.
Length of data = total length - header length
Why do we need this field at all? There are occasions in which the datagram is not the only thing encapsulated in a frame; padding may have been added. For example, the Ethernet protocol has minimum and maximum restrictions on the size of data that can be encapsulated in a frame (46 to 1,500 bytes). If the size of an IP datagram is less than 46 bytes, some padding is added to meet this requirement. In this case, when a machine decapsulates the datagram, it needs to check the total length field to determine how much is really data and how much is padding.
Identification
This field uniquely identifies each datagram (fragment) during fragmentation.
When a datagram is fragmented, the value in the identification field is copied into all fragments; thus, all fragments belonging to a fragmented datagram are identified during reassembly by the destination.
Flags
This is a three-bit field. The first bit is reserved (not used). The second bit is called the do-not-fragment (DF) bit. If its value is 1, the machine must not fragment the datagram; if it cannot pass the datagram through any available physical network, it discards the datagram and sends an ICMP error message to the source host. If its value is 0, the datagram can be fragmented if necessary. The third bit is called the more-fragments (MF) bit. If its value is 1, the datagram is not the last fragment; there are more fragments after this one. If its value is 0, this is the last or only fragment.
Fragmentation Offset
This 13-bit field
shows the relative position of this fragment with respect to the
whole datagram. It is the offset of the data in the original
datagram measured in units of 8 bytes (64 bits). The figure below shows a datagram with a data size of 4,000 bytes fragmented into three fragments. The bytes in the original datagram are numbered 0 to
3999. The first fragment carries bytes 0 to 1399. The offset for
this datagram is 0/8 = 0. The second fragment carries bytes 1400 to
2799; the offset value for this fragment is 1400/8 = 175. Finally,
the third fragment carries bytes 2800 to 3999. The offset value for
this fragment is 2800/8 = 350
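The offset arithmetic in this example can be checked with a one-line helper (illustrative only):

```python
def fragment_offsets(first_bytes):
    """Compute the Fragment Offset field for each fragment, given the starting
    byte number of each fragment's data in the original datagram.
    Offsets are measured in 8-byte units."""
    return [start // 8 for start in first_bytes]

# The three fragments above carry data starting at bytes 0, 1400, and 2800:
print(fragment_offsets([0, 1400, 2800]))  # [0, 175, 350]
```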
Time to Live (TTL) FieldTTL is used as maximum hop count for the
datagram. Each time a router processes a datagram, it reduces the
value of the TTL field by one. Once the TTL value becomes zero, the
datagram is said to have expired, at which point it is dropped, and
usually an Internet Control Message Protocol (ICMP) Time Exceeded
message is sent to inform the originator of the message that it has
expired. The TTL field is one of the primary mechanisms used to prevent routing loops.
Protocol
The Protocol field identifies the higher-layer protocol, generally either a transport layer protocol or an encapsulated network layer protocol, carried in the datagram. Some common Protocol field values are 1 (ICMP), 2 (IGMP), 6 (TCP), and 17 (UDP).
Header Checksum
A checksum is computed over the header to provide basic protection against corruption in transmission. To compute the IP checksum for an outgoing datagram, the value of the checksum field is first set to 0. Then the 16-bit one's complement sum of the header is calculated (i.e., the entire header is treated as a sequence of 16-bit words). The 16-bit one's complement of this sum is stored in the checksum field. At each hop, the device receiving the datagram performs the same checksum calculation, and if there is a mismatch, it discards the datagram as damaged.
Source Address
This 32-bit field defines the IP address of the source. It must remain unchanged while the IP datagram travels from the source host to the destination host.
Destination Address
This 32-bit field defines the IP address of the destination. It must remain unchanged while the IP datagram travels from the source host to the destination host.
Options
The
options field is a variable-length list of optional information for the datagram. Options can be used for network testing and debugging. Some of the options include record route and timestamp.
-Fragmentation
-MTU
The size of the largest IP datagram that can be transmitted over a physical network is called that network's maximum transmission unit (MTU). The Ethernet MTU is 1,500 bytes.
If a datagram is passed from a network with a high MTU to one with a low MTU, it must be fragmented to fit the other network's smaller MTU.
Since some physical networks on the path between devices may have a smaller MTU than others, it may be necessary to fragment the datagram more than once.
Internet minimum MTU: 576 bytes. Routers are required to handle an MTU of at least 576 bytes; this value is specified in the IP standard. Hence, 576 bytes is the default MTU for IP datagrams.
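Since the smallest link limits the whole path, the effective path MTU can be sketched as follows (a hypothetical helper; the link values are invented for illustration):

```python
def path_mtu(link_mtus):
    """The largest datagram that can cross a path without fragmentation is
    limited by the smallest MTU of any link on that path."""
    return min(link_mtus)

# Hypothetical path: Ethernet (1500) -> small-MTU serial link (576) -> Ethernet (1500)
print(path_mtu([1500, 576, 1500]))  # 576
```

Finding this minimum without sending a datagram across every link is exactly what the MTU path discovery mechanism in the next section automates.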
-MTU Path Discovery
MTU path discovery is used to determine the optimal MTU to use for a route between two devices. It uses the ICMP error-reporting mechanism and the Don't Fragment (DF) bit of the IP header.
MTU Path Discovery Mechanism
The source node sends a packet (datagram) sized to the MTU of its local physical link and with the Don't Fragment (DF) bit set. If this packet goes through without any errors, the devices can use that value for future packets to that destination. If the packet encounters a router whose local MTU is less than the packet size, the packet is discarded and an ICMP Destination Unreachable - Fragmentation Needed, Don't Fragment Bit Set message is sent to the originating host. The ICMP error message includes the MTU of the link necessitating fragmentation. The source node then tries again with an MTU no larger than the MTU mentioned in the ICMP error message. This continues until it finds the largest MTU that can be used for the path.
-The IP Fragmentation Process
When an MTU requirement forces a
datagram to be fragmented, it is split into several smaller IP datagrams, each containing part of the original. The header of the original datagram becomes the header of the first fragment, with a few fields modified, and new headers are created for the other fragments. Each fragment is given the same Identification value to mark it as part of the same original datagram. The Fragment Offset of each is set to the location where the fragment belongs in the original datagram. The More Fragments field is set to 1 for all fragments except the last, to let the recipient know when it has received all the fragments.
-Fragmentation Example
The four fragments shown in the figure below are created as follows:
The first fragment is created by taking the first 3,300 bytes of the 12,000-byte IP datagram. This includes the original header, which becomes the IP header of the first fragment (with certain fields changed, as described in the next section). So, 3,280 bytes of data are in the first fragment. This leaves 8,700 bytes (11,980 − 3,280) to encapsulate.
The next 3,280 bytes of data are taken from the 8,700 bytes that remain after the first fragment is built and paired with a new header to create the second fragment. This leaves 5,420 bytes.
The third fragment is created from the next 3,280 bytes of data, with a 20-byte header. This leaves 2,140 bytes of data.
The remaining 2,140 bytes are placed into the fourth fragment, with a 20-byte header.
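The example above can be reproduced with a short sketch (an illustrative model, not a real IP stack; it assumes every fragment reuses a 20-byte header and keeps data sizes in 8-byte multiples so the offsets land on unit boundaries):

```python
def fragment(total_len, header_len, fragment_mtu):
    """Split a datagram into fragments. Each fragment carries at most
    (fragment_mtu - header_len) bytes of data, rounded down to a multiple
    of 8 so Fragment Offset values stay in 8-byte units.
    Returns (offset, data_bytes, more_fragments_flag) per fragment."""
    data_left = total_len - header_len
    per_fragment = (fragment_mtu - header_len) // 8 * 8
    fragments, offset = [], 0
    while data_left > 0:
        size = min(per_fragment, data_left)
        data_left -= size
        more_fragments = 1 if data_left > 0 else 0
        fragments.append((offset // 8, size, more_fragments))
        offset += size
    return fragments

# The 12,000-byte datagram with a 20-byte header, fragmented to 3,300-byte pieces:
print(fragment(12000, 20, 3300))
```

This yields four fragments of 3,280, 3,280, 3,280, and 2,140 data bytes, with the More Fragments flag cleared only on the last, matching the walkthrough above.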
-Fragmentation-Related IP Datagram Header Fields
The following IP header fields participate in IP fragmentation.
Total Length: After fragmenting, the Total Length field indicates the length of each fragment.
The Identification, More Fragments, and Fragment Offset fields are used as described above in the IP header description.
-IP Message Reassembly
When a datagram is fragmented, it becomes multiple fragment datagrams. The destination of the overall message must collect these fragments and reassemble them into the original message.
In IP version 4 (IPv4), fragmentation can be performed by a
router between the source and destination of an IP datagram, but
reassembly is done only by the destination device.
Reasons/Advantages of Reassembly at the End
-Fragments can take different routes to get from the source to the destination, so any given router on the path may not see all the fragments in a message.
-Reassembly at intermediate routers would increase complexity.
-Routers doing reassembly would need to wait for all the fragments before sending the reassembled message, which would slow down routing.
Disadvantages of Reassembly at the End
-Reassembly at the end results in more, smaller fragments traveling over longer routes than if intermediate reassembly occurred. This increases the chances of a fragment getting lost and the entire message being discarded.
-Potential inefficiency in the utilization of data link layer frame capacity: where some of the links on the path have a higher MTU, reassembly at the end may leave those links' capacity underutilized.
-Reassembly Process
-The receiving device initializes a buffer where it can store the fragments of the message as they are received. It keeps track of which portions of this buffer have been filled with received fragments, perhaps using a special table.
-The recipient knows it has received a message fragment the first time it sees a datagram with the More Fragments bit set to 1 or the Fragment Offset field set to a value other than 0. It identifies the message based on the source and destination IP addresses, the protocol specified in the header, and the Identification field generated by the sender.
-The receiving device sets up a timer for reassembly of the message. Since it is possible that some fragments may never show up, this timer ensures that the device will not wait an infinite time trying to reassemble the message.
-Reassembly is complete when the entire buffer has been filled and the fragment with the More Fragments bit set to 0 is received, indicating that it is the last fragment of the datagram.
-On the other hand, if the reassembly timer expires with any of the fragments missing, the message cannot be reconstructed. The fragments are discarded, and an ICMP Time Exceeded message is generated. Since IP is unreliable, it relies on higher-layer protocols such as the Transmission Control Protocol (TCP) to determine that the message was not properly received and then retransmit it.
ICMP: Internet Control Message Protocol
-In TCP/IP, diagnostic, test, and error-reporting functions at the internetwork layer are performed by the Internet Control Message Protocol (ICMP), which is like IP's administrative assistant.
-ICMP is not like most other TCP/IP protocols in that it does
not perform a specific task. It defines a mechanism by which
various control messages can be transmitted and received to
implement a variety of functions.
-ICMP messages are transmitted within IP datagrams.
-ICMP Message Format
-The first 4 bytes have the same format for all messages, but the remainder differs from one message to the next.
-The type field identifies the particular ICMP message. There are 15 different values for the type field.
-The code field identifies the subtype of message within each ICMP message type value.
-The checksum field covers the entire ICMP message. The algorithm used is the same as for the IP header checksum.
-ICMP Message Classes
ICMP messages are divided into two
classes:
Error Messages: These messages are used to provide feedback to a source device about an error that has occurred. They are typically generated in response to some sort of action, usually the transmission of a datagram. Errors are usually related to the structure or content of a datagram or to problem situations on the internetwork encountered during datagram routing.
Informational (or Query) Messages: These messages are used to let devices exchange information, implement certain IP-related features, and perform testing. They are generated either when directed by an application or on a regular basis to provide information to other devices.
-ICMP Messages with their Type
-ICMP Error Messages
-ICMP error messages always contain the IP header and the first 8 bytes of the IP datagram that caused the ICMP error to be generated. This lets the receiving ICMP module associate the message with one particular protocol (TCP or UDP, from the Protocol field in the IP header) and one particular user process (from the TCP or UDP port numbers in the TCP or UDP header contained in the first 8 bytes of the IP datagram).
-ICMP error-reporting messages sent in response to a problem seen in an IP datagram can be sent back only to the originating device. Intermediate devices cannot be the recipients of an ICMP message because their addresses are normally not carried in the IP datagram's header.
-An ICMP error message must not be generated in response to any of the following:
An ICMP Error Message: Responding to an ICMP error message with another error message would create message loops. An ICMP error message can, however, be generated in response to an ICMP informational message.
A Broadcast or Multicast Datagram.
IP Datagram Fragments Except the First: In many cases, the same situation that might cause a device to generate an error for one fragment would also apply to each successive one, causing unnecessary ICMP traffic. For this reason, when a datagram is fragmented, a device may send an error message only in response to a problem in the first fragment.
Datagrams with a Non-Unicast Source Address: If a datagram's source address doesn't define a unique, unicast device address, an error message cannot be sent back to that source. This prevents ICMP messages from being broadcast or multicast, or sent to non-routable special addresses such as the loopback address.
-ICMPv4 Destination Unreachable Messages
ICMPv4 Destination Unreachable messages are used to inform a sending device of a failure to deliver an IP datagram.
Example Scenarios (Code Value / Message Subtype / Description):
Code 0, Network Unreachable: The datagram could not be delivered to
the network specified in the network ID portion of the IP address.
This usually means a problem with routing but could also be caused
by a bad address.
Code 1, Host Unreachable: The datagram was delivered to the network
specified in the network ID portion of the IP address but could not
be sent to the specific host indicated in the address. Again, this
usually implies a routing issue.
Code 2, Protocol Unreachable: The protocol specified in the
Protocol field was invalid for the host to which the datagram was
delivered.
Code 3, Port Unreachable: The destination port specified in the UDP
or TCP header was invalid.
Code 4, Fragmentation Needed and DF Set: If a packet with the DF
bit set encounters a router whose local MTU is smaller than the
packet, the router drops the packet and sends this ICMP error
message.
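As a small illustration, the code values in the table above can be captured in a lookup; the helper name and the dictionary are illustrative, not from any real implementation.

```python
# Hypothetical helper mapping ICMPv4 Destination Unreachable (type 3)
# code values to the subtypes listed in the table above.
DEST_UNREACHABLE_CODES = {
    0: "Network Unreachable",
    1: "Host Unreachable",
    2: "Protocol Unreachable",
    3: "Port Unreachable",
    4: "Fragmentation Needed and DF Set",
}

def describe_unreachable(code: int) -> str:
    """Return a human-readable subtype for a type-3 ICMP code."""
    return DEST_UNREACHABLE_CODES.get(code, f"Unknown code {code}")
```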
-ICMPv4 Source Quench Messages
A source-quench message informs the source that a datagram has
been discarded due to congestion in a router or the destination
host. The source must slow down its sending of datagrams until the
congestion is relieved. One source-quench message is sent for each
datagram that is discarded due to congestion.
-ICMPv4 Time Exceeded Messages
The time-exceeded message is generated in two cases:
-The first is when a router decrements a datagram's time-to-live
value to zero; it discards the datagram and sends a time-exceeded
message to the original source.
-The second is when the final destination does not receive all of
the fragments in a set time; it discards the received fragments and
sends a time-exceeded message to the original source.
-Parameter Problem Messages
Any ambiguity in the header part of a datagram can
create serious problems as the datagram travels through the
Internet. If a router or the destination host discovers an
ambiguous or missing value in any field of the datagram, it
discards the datagram and sends a parameter-problem message back to
the source.
-ICMP Query Messages
-Echo (Request) and Echo Reply Messages
ICMPv4 Echo (Request) and Echo Reply messages are used to
facilitate network reachability testing. A device can test its
ability to perform basic communication with another one by sending
an Echo message and waiting for an Echo Reply message to be
returned by the other device. The ping utility, a widely used
diagnostic tool in TCP/IP internetworks, makes use of these
messages.
-Ping Program
The TCP/IP ping utility is used to verify the
ability of two devices on a TCP/IP internetwork to communicate. It
operates by having one device send ICMP Echo (Request) messages to
another, which responds with Echo Reply messages. The program can
be helpful in diagnosing network connectivity issues.
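The Echo message that ping relies on can be sketched by constructing an ICMP Echo Request by hand. This is a minimal illustration of the message layout (type 8, code 0, identifier, sequence number) and the Internet checksum; actually transmitting it would require a raw socket and elevated privileges, which are omitted here.

```python
import struct

def icmp_checksum(data: bytes) -> int:
    """Internet checksum: one's complement of the one's-complement
    sum of the data taken as 16-bit words."""
    if len(data) % 2:
        data += b"\x00"
    total = sum(struct.unpack(f"!{len(data)//2}H", data))
    total = (total >> 16) + (total & 0xFFFF)   # fold carries back in
    total += total >> 16
    return ~total & 0xFFFF

def build_echo_request(ident: int, seq: int, payload: bytes = b"ping") -> bytes:
    """ICMP Echo Request: type 8, code 0, with the checksum computed
    over the whole message (checksum field zeroed first)."""
    header = struct.pack("!BBHHH", 8, 0, 0, ident, seq)
    csum = icmp_checksum(header + payload)
    return struct.pack("!BBHHH", 8, 0, csum, ident, seq) + payload
```

A receiver verifies the message by checksumming the whole packet, checksum field included; a valid packet sums to zero.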
-Methods of Diagnosing Connectivity Problems Using Ping
Internal Device TCP/IP Stack Operation: By performing a ping on the
device's own address, you can verify that its internal TCP/IP stack
is working. This can also be done using the standard IP loopback
address, 127.0.0.1.
Local Network Connectivity: If the internal test succeeds, it's a
good idea to ping another device on the local network, to verify
that local communication is possible.
Local Router Operation: If there is no problem on the local
network, it makes sense to ping whatever local router the device is
using, to make sure it is operating and reachable.
Domain Name Resolution Functionality: If a ping performed on a DNS
domain name fails, you should try it with the device's IP address
instead. If that works, it implies a problem with domain name
configuration or resolution.
Remote Host Operation: If all the preceding checks succeed, you can
try performing a ping to a remote host to see if it responds. If it
does not, you can try a different remote host. If that one works,
it is possible that the problem is actually with the first remote
device itself and not with your local device.
-Trace Route Program/Utility
-The traceroute program
sends a dummy UDP datagram to an invalid port that cannot be in use
by an application at the destination. The TTL field of the IP
datagram is set to 1.
-The first router to handle the datagram decrements the TTL value,
and it becomes 0. The router discards the datagram and sends back
an ICMP time-exceeded message. The IP datagram containing this ICMP
message has the router's IP address as the source address. This
identifies the first router in the path.
-Traceroute then sends a datagram with a TTL of 2, and we find the
IP address of the second router. This continues until the datagram
reaches the destination host.
-Once the datagram reaches the destination, the destination host's
UDP module generates an ICMP "port unreachable" error, as the UDP
datagram arrived on an invalid port. With this message, the
traceroute program concludes its operation.
ARP: Address Resolution Protocol
- Address resolution is
required because internetworked devices communicate logically using
layer 3 addresses, but the actual transmissions between devices
take place using layer 2 (hardware) addresses.
- ARP is a full-featured, dynamic resolution protocol used to
match IP addresses to underlying data link layer addresses.
Originally developed for Ethernet, it has now been generalized to
allow IP to operate over a wide variety of layer 2 technologies.
ARP General Operation
1. Source Device Checks Cache The source device will first check
its cache to determine if it already has a resolution of the
destination device. If so, it can skip to step 9.
2. Source Device Generates ARP Request Message The source device
generates an ARP Request message. It puts its own data link layer
address as the Sender Hardware Address and its own IP address as
the Sender Protocol Address. It fills in the IP address of the
destination as the Target Protocol Address. (It must leave the
Target Hardware Address blank, since that is what it is trying to
determine!)
3. Source Device Broadcasts ARP Request Message The source
broadcasts the ARP Request message on the local network.
4. Local Devices Process ARP Request Message The message is
received by each device on the local network. It is processed, with
each device looking for a match on the Target Protocol Address.
Those that do not match will drop the message and take no further
action.
5. Destination Device Generates ARP Reply Message The one device
whose IP address matches the contents of the Target Protocol
Address of the message will generate an ARP Reply message. It takes
the Sender Hardware Address and
Sender Protocol Address fields from the ARP Request message and
uses these as the values for the Target Hardware Address and Target
Protocol Address of the reply. It then fills in its own layer 2
address as the Sender Hardware Address and its IP address as the
Sender Protocol Address. Other fields are filled in, as explained
in the description of the ARP message format in the following
section.
6. Destination Device Updates ARP Cache Next, as an
optimization, the destination device will add an entry to its own
ARP cache that contains the hardware and IP addresses of the source
that sent the ARP Request. This saves the destination from needing
to do an unnecessary resolution cycle later on.
7. Destination Device Sends ARP Reply Message The destination
device sends the ARP Reply message. This reply is, however, sent
unicast to the source device, because there is no need to broadcast
it.
8. Source Device Processes ARP Reply Message The source device
processes the reply from the destination. It stores the Sender
Hardware Address as the layer 2 address of the destination and uses
that address for sending its IP datagram.
9. Source Device Updates ARP Cache The source device uses the
Sender protocol Address and Sender Hardware Address to update its
ARP cache for use in the future when transmitting to this
device.
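The nine steps above can be condensed into a small simulation. This is an illustrative sketch only: the dict-based devices and the `network` mapping that stands in for the broadcast domain are assumptions for the example, not a real link layer.

```python
def arp_resolve(source: dict, target_ip: str, network: dict):
    """Sketch of steps 1-9: `source` is the querying device's state,
    `network` maps IP -> device dict for every host on the local link
    (standing in for the broadcast of the ARP Request)."""
    # Step 1: check the cache first and skip the exchange on a hit.
    if target_ip in source["arp_cache"]:
        return source["arp_cache"][target_ip]
    # Steps 2-4: broadcast the request; only the device whose IP
    # matches the Target Protocol Address responds.
    target = network.get(target_ip)
    if target is None:
        return None                       # no device answered
    # Step 6: the target caches the sender's mapping as an optimization.
    target["arp_cache"][source["ip"]] = source["mac"]
    # Steps 7-9: the unicast reply lets the source cache the answer.
    source["arp_cache"][target_ip] = target["mac"]
    return target["mac"]
```

Note how both caches end up populated after one exchange, which is what saves the destination an unnecessary resolution cycle later.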
-ARP Message Format
An ARP packet is encapsulated directly into a data link frame.
The type field in the frame indicates that the data carried by the
frame is an ARP packet.
Field Name / Size (bytes) / Description
HRD (2) Hardware Type: Specifies the type of hardware used for the
local network transmitting the message. For Ethernet, this value
is 1.
PRO (2) Protocol Type: This field is the complement of the Hardware
Type field, specifying the type of layer three addresses used in
the message. For IPv4 addresses, this value is 2048 (0800 hex),
which corresponds to the EtherType code for the Internet Protocol.
HLN (1) Hardware Address Length: Specifies how long hardware
addresses are in this message. For Ethernet or other networks using
IEEE 802 MAC addresses, the value is 6.
PLN (1) Protocol Address Length: Again, the complement of the
preceding field; specifies how long protocol (layer three)
addresses are in this message. For IP(v4) addresses this value is
of course 4.
OP (2) Opcode: Specifies the nature of the ARP message: 1 for an
ARP Request and 2 for an ARP Reply.
SHA (variable, equals value in HLN field) Sender Hardware Address:
The hardware (layer two) address of the device sending this message
(which is the IP datagram source device on a request, and the IP
datagram destination on a reply, as discussed in the topic on ARP
operation).
SPA (variable, equals value in PLN field) Sender Protocol Address:
The IP address of the device sending this message.
THA (variable, equals value in HLN field) Target Hardware Address:
The hardware (layer two) address of the device this message is
being sent to (the IP datagram destination device on a request, and
the IP datagram source on a reply).
TPA (variable, equals value in PLN field) Target Protocol Address:
The IP address of the device this message is being sent to.
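Given the field layout above, an ARP request for Ethernet/IPv4 can be packed by hand. This sketch assumes addresses are passed as raw bytes and omits the enclosing Ethernet frame (whose type field would mark the payload as ARP).

```python
import struct

def build_arp_request(sender_mac: bytes, sender_ip: bytes,
                      target_ip: bytes) -> bytes:
    """Pack an ARP Request using the field layout in the table above:
    HRD=1 (Ethernet), PRO=0x0800 (IPv4), HLN=6, PLN=4, OP=1 (request).
    The Target Hardware Address is zeroed, since it is the unknown
    the request is trying to resolve."""
    header = struct.pack("!HHBBH", 1, 0x0800, 6, 4, 1)
    return header + sender_mac + sender_ip + b"\x00" * 6 + target_ip
```

With 6-byte hardware and 4-byte protocol addresses, the whole message comes to 28 bytes: 8 bytes of fixed fields plus SHA, SPA, THA, and TPA.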
Four Different Cases
The following are four different cases in which the services of
ARP can be used Case 1: The sender is a host and wants to send a
packet to another host on the same network. In this case, the
logical address that must be mapped to a physical address is the
destination IP address in the datagram header. Case 2: The sender
is a host and wants to send a packet to another host on another
network. In this case, the host looks at its routing table and
finds the IP address of the next hop (router) for this destination.
If it does not have a routing table, it looks for the IP address of
the default router. The IP address of the router becomes the
logical address that must be mapped to a physical address.
Case 3: The sender is a router that has received a datagram
destined for a host on another network. It checks its routing table
and finds the IP address of the next router. The IP address of the
next router becomes the logical address that must be mapped to a
physical address.
Case 4: The sender is a router that has received a datagram
destined for a host in the same network. The destination IP address
of the datagram becomes the logical address that must be mapped to
a physical address.
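All four cases reduce to one decision: ARP for the destination IP itself when it is on the local network, otherwise ARP for the next-hop router's IP. A minimal sketch, with illustrative parameter names:

```python
import ipaddress

def address_to_resolve(dest_ip: str, local_net: str,
                       next_hop_router: str) -> str:
    """Return the logical (IP) address that must be mapped to a
    physical address: the destination itself if it is local (cases 1
    and 4), or the next-hop router if it is not (cases 2 and 3)."""
    if ipaddress.ip_address(dest_ip) in ipaddress.ip_network(local_net):
        return dest_ip
    return next_hop_router
```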
-ARP Caching
A sender usually has more than one IP datagram to send to the same
destination. It is inefficient to use the ARP protocol for each
datagram destined for the same host or router. The solution is the
cache table. When a host or router receives the corresponding
physical address for an IP datagram, the address can be saved in
the cache table. This address can be used for the datagrams
destined for the same receiver within the next few minutes.
The ARP cache takes the form of a table containing matched sets
of hardware and
IP addresses. Each device on the network manages its own ARP
cache table. There are two different ways that cache entries can be
put into the ARP cache:
Static ARP Cache Entries: These are address resolutions that are
manually added to the cache table for a device and are kept in the
cache on a permanent basis.
Dynamic ARP Cache Entries: These are hardware and IP address pairs
that are added to the cache by the software itself as a result of
past ARP resolutions that were successfully completed. They are
kept in the cache for only a period of time and are then removed.
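The two kinds of entries can be sketched with a small in-memory table in which only dynamic entries expire. The class and its default timeout are illustrative, not a real stack's implementation.

```python
import time

class ArpCache:
    """Minimal sketch of an ARP cache holding static entries (which
    never expire) and dynamic entries (which expire after `timeout`
    seconds; real stacks typically use 10 or 20 minutes)."""

    def __init__(self, timeout: float = 1200.0):
        self.timeout = timeout
        self.entries = {}                 # ip -> (mac, expiry or None)

    def add_static(self, ip: str, mac: str) -> None:
        self.entries[ip] = (mac, None)    # permanent entry

    def add_dynamic(self, ip: str, mac: str) -> None:
        self.entries[ip] = (mac, time.monotonic() + self.timeout)

    def lookup(self, ip: str):
        entry = self.entries.get(ip)
        if entry is None:
            return None
        mac, expiry = entry
        if expiry is not None and time.monotonic() > expiry:
            del self.entries[ip]          # stale: force a fresh resolution
            return None
        return mac
```

A miss from `lookup` is exactly the point at which a fresh ARP request would be broadcast.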
Cache Entry Expiration
Dynamic entries cannot be added to the cache and left there
forever; dynamic entries left in place for a long time can become
stale. Consider Device A's ARP cache, which contains a dynamic
mapping for Device B, another host on the network. If dynamic
entries stayed in the cache forever, the following situations might
arise.
Device Hardware Changes: Device B might experience a hardware
failure that requires its network interface card to be replaced.
The mapping in Device A's cache would become invalid, since the
hardware address in the entry is no longer on the network.
Device IP Address Changes: Similarly, the mapping in Device A's
cache also would become invalid if Device B's IP address changed.
Device Removal: Suppose Device B is removed from the local network.
Device A would never need to send to it again at the data link
layer, but the mapping would remain in Device A's cache, wasting
space and possibly taking up search time.
To avoid these problems, dynamic cache entries must be set to
automatically expire after a period of time. This is handled
automatically by the ARP implementation, with typical timeout
values being 10 or 20 minutes. After a particular entry times out,
it is removed from the cache. The next time that address mapping is
needed, a fresh resolution is performed to update the cache.
-Proxy ARP
Since ARP relies on broadcasts for address resolution, and
broadcasts are not propagated beyond a physical network, ARP cannot
function between devices on different physical networks. When such
operation is required, a device, such as a router, can be
configured as an ARP proxy to respond to ARP requests on behalf of
a device on a different network.
These two examples show how a router acting as an ARP proxy
returns its own hardware address in response to requests by one
device for an address on the other network.
In this small internetwork shown, a single router connects two
LANs that are on the same IP network or subnet. The router will not
pass ARP broadcasts, but has been configured to act as an ARP
proxy. In this example, Device A and Device D are each trying to
send an IP datagram to the other, and so each broadcasts an ARP
request. The router responds to the request sent by Device A as if
it were Device D, giving to Device A its own hardware address
(without propagating Device A's broadcast). It will forward the
message sent by Device A to Device D on Device D's network.
Similarly, it responds to Device D as if it were Device A, giving
its own address, then forwarding what Device D sends to it over to
the network where Device A is located.
Proxy ARP Pros and Cons
The
main advantage of proxying is that it is transparent to the hosts
on the different physical network segments. The technique has some
drawbacks, however.
First, it introduces added complexity. Second, if more than one
router connects two physical networks using the same network ID,
problems may arise. Third, it introduces potential security risks:
since it essentially means that a router impersonates devices by
acting as a proxy for them, the potential for a device spoofing
another is real.
-Gratuitous ARP
A gratuitous ARP request is a broadcast request for a host's
(router, switch, or other device) own IP address. If a host sends
an ARP request for its own IP address and no ARP replies are
received, the host's assigned IP address is not being used by other
nodes. If a host sends an ARP request for its own IP address and an
ARP reply is received, the host's assigned IP address is already
being used by another node.
A gratuitous ARP
request is an Address Resolution Protocol request packet where the
source and destination IP are both set to the IP of the machine
issuing the packet and the destination MAC is the broadcast address
ff:ff:ff:ff:ff:ff. Ordinarily, no reply packet will occur. A
gratuitous ARP reply is a reply to which no request has been
made.
Gratuitous ARPs are useful for three reasons:
They can help detect IP conflicts. When a machine receives an ARP
request containing a source IP that matches its own, it knows there
is an IP conflict.
They assist in the updating of other machines' ARP tables. If the
host sending the gratuitous ARP has just changed its hardware
address (perhaps the host was shut down, the interface card
replaced, and then the host was rebooted), this packet causes any
other host on the cable that has an entry in its cache for the old
hardware address to update its ARP cache entry accordingly.
They inform switches of the MAC address of the machine on a given
switch port, so that the switch knows that it should transmit
packets sent to that MAC address on that switch port.
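The two checks described above (recognizing a gratuitous ARP and detecting an address conflict) can be sketched as simple predicates over the relevant fields; the function names are illustrative.

```python
BROADCAST_MAC = b"\xff" * 6

def is_gratuitous_arp(sender_ip: bytes, target_ip: bytes,
                      dest_mac: bytes) -> bool:
    """Per the description above: sender and target protocol addresses
    both carry the issuing machine's own IP, and the frame is sent to
    the broadcast MAC address."""
    return sender_ip == target_ip and dest_mac == BROADCAST_MAC

def detect_ip_conflict(my_ip: bytes, request_sender_ip: bytes) -> bool:
    """An incoming ARP request whose Sender Protocol Address matches
    our own IP signals an address conflict."""
    return my_ip == request_sender_ip
```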
-Summary Comparison of UDP and TCP
General Description
UDP: Simple, high-speed, low-functionality wrapper that interfaces
applications to the network layer and does little else.
TCP: Full-featured protocol that allows applications to send data
reliably without worrying about network layer issues.
Protocol Connection Setup
UDP: Connectionless; data is sent without setup.
TCP: Connection-oriented; a connection must be established prior to
transmission.
Data Interface to Application
UDP: Message-based; data is sent in discrete packages by the
application.
TCP: Stream-based; data is sent by the application with no
particular structure.
Reliability and Acknowledgments
UDP: Unreliable, best-effort delivery without acknowledgments.
TCP: Reliable delivery of messages; all data is acknowledged.
Retransmissions
UDP: Not performed. The application must detect lost data and
retransmit if needed.
TCP: Delivery of all data is managed, and lost data is
retransmitted automatically.
Features Provided to Manage Flow of Data
UDP: None.
TCP: Flow control using sliding windows; window size adjustment
heuristics; congestion avoidance algorithms.
Overhead
UDP: Very low. TCP: Low, but higher than UDP.
Transmission Speed
UDP: Very high. TCP: High, but not as high as UDP.
Data Quantity Suitability
UDP: Small to moderate amounts of data (up to a few hundred bytes).
TCP: Small to very large amounts of data (up to gigabytes).
Types of Applications That Use the Protocol
UDP: Applications where data delivery speed matters more than
completeness, where small amounts of data are sent, or where
multicast/broadcast are used.
TCP: Most protocols and applications sending data that must be
received reliably, including most file and message transfer
protocols.
Well-Known Applications and Protocols
UDP: Multimedia applications, DNS, BOOTP, DHCP, TFTP, SNMP, RIP,
NFS (early versions).
TCP: FTP, Telnet, SMTP, DNS, HTTP, POP, NNTP, IMAP, BGP, IRC, NFS
(later versions).
-Examples of protocols that use both TCP and UDP are DNS and NFS.
UDP: User Datagram Protocol
- The User Datagram Protocol (UDP) was
developed for use by application protocols that do not require
reliability, acknowledgment, or flow control features at the
transport layer. It is designed to be simple and fast. It provides
only transport layer addressing (in the form of UDP ports), an
optional checksum capability, and little else.- UDP is probably the
simplest protocol in all of TCP/IP. It takes application layer data
that has been passed to it, packages it in a simplified message
format, and sends it to IP for transmission.
-A protocol uses UDP instead of TCP in two situations. The first
is when an application values timely delivery over reliable
delivery, and when TCP's retransmission of lost data would be of
limited or even no value. The second is when a simple protocol can
handle the potential loss of an IP datagram itself at the
application layer using a timer/retransmit strategy, and when the
other features of TCP are not required. Applications that require
multicast or broadcast transmissions also use UDP, because TCP does
not support them.
UDP Message Format
Length The length of the entire UDP datagram, including both
header and Data fields.
Checksum An optional 16-bit checksum computed over the entire UDP
datagram plus a special pseudo header of fields. The method is the
same as that of TCP.
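The checksum method can be sketched as follows, assuming IPv4 addresses are passed as raw 4-byte values. The pseudo header (source IP, destination IP, a zero byte, protocol number 17, and the UDP length) is prepended to the UDP header and data before the Internet checksum is computed; TCP uses the same method with protocol number 6.

```python
import struct

def inet_checksum(data: bytes) -> int:
    """One's complement of the one's-complement sum of 16-bit words."""
    if len(data) % 2:
        data += b"\x00"                   # pad odd-length data
    total = sum(struct.unpack(f"!{len(data)//2}H", data))
    while total >> 16:                    # fold carries back in
        total = (total >> 16) + (total & 0xFFFF)
    return ~total & 0xFFFF

def udp_checksum(src_ip: bytes, dst_ip: bytes, udp_segment: bytes) -> int:
    """Checksum over the pseudo header plus the whole UDP datagram
    (header and data). The pseudo header is never transmitted; it only
    participates in the calculation."""
    pseudo = src_ip + dst_ip + struct.pack("!BBH", 0, 17, len(udp_segment))
    return inet_checksum(pseudo + udp_segment)
```

A receiver repeats the calculation over the received segment (checksum field included); a valid datagram yields zero.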