TCP Performance and Flow Control

from http://www.candle.com/www1/cnd/portal/CNDportal_Article_Master/0,2245,2683_2985_48626,00.html

Introduction

TCP/IP is now the de facto networking protocol of the world. TCP/IP consists of four protocol layers: Data Link, Network, Transport, and Application. The layered protocol specifications are considerably simpler than Systems Network Architecture (SNA), and this simplicity is its major strength, providing the network implementation flexibility and low application entry costs that are directly responsible for its network dominance.

The Network layer, frequently referred to as the Internet Protocol (IP) layer, provides connectionless, best-effort datagram delivery service. This is the protocol implemented by the network infrastructure, the protocol routers and transmission links that deliver data packets from origin to destination.

The Transport layer consists of the User Datagram Protocol (UDP) and the Transmission Control Protocol (TCP). UDP is essentially a pass-through between the Application layer and the Network layer. It gives the application full control, but the application also assumes the full burden of interfacing directly with the network. When application-A exchanges data with application-B using UDP, one application-A send request corresponds to exactly one application-B receive request. This is generally referred to as record-bound data exchange, and the programming logic for it is very simple.

TCP provides a reliable stream transport service on behalf of the communicating applications. It establishes a connection, or session, between the parties and takes full responsibility for data delivery, sequencing, error detection, and recovery. Since TCP treats application data as a stream of bytes and buffers it according to an internal algorithm, an application-A send request may not correspond to an application-B receive request. In fact, data sent by multiple application-A send requests may be received by one application-B receive request, and vice versa. This makes application programming logic a bit more complicated, because the applications must be able to recognize the boundaries of an application data record within the received byte stream and process it accordingly. However, this minor complexity is far less work than duplicating TCP functionality in the application.

TCP is the default preferred protocol of Internet applications because all e-commerce transactions require reliable data transport services. TCP provides reliable transport by:

  • Breaking application data into segments for transmission.
  • Calculating a checksum over its header and segment data to maintain transmission data integrity.
  • Sending positive acknowledgements for data received.
  • Detecting segment loss (IP is a best-effort delivery service) and retransmitting.
  • Providing data flow control to ensure orderly data exchange and avoid network congestion.
  • Arranging out-of-order segments in sequence before passing the data to the application.
  • Detecting and discarding duplicate received segments.

It is critical to understand how TCP operates in detail, and its built-in data flow control strategies, in particular, to establish a baseline for eBusiness network performance and management.

This article assumes that the readers are familiar with fundamental TCP/IP protocols, such as network address structures, TCP data exchange procedures, and basic socket programming. The following discussions and examples are based on the widely implemented TCP/IP version 4. The proposed TCP/IP version 6 addresses many version 4 limitations, but the basic operations remain unchanged.

Data Segmentation

Data may be broken into smaller pieces for transmission in two places:

At origin host TCP

At TCP session connection time, each session partner notifies the other of its Maximum Segment Size (MSS) in its SYN packet. A session partner never sends a segment greater than the MSS the other announced. TCP determines its own MSS based, in effect, on whether the destination partner is on its own local network or is non-local. For partners residing on the same local network, the MSS is the network adapter's Maximum Transmission Unit (MTU) size minus 40 (20 bytes for the IP header, 20 bytes for the TCP header). For an Ethernet-type LAN connection, for instance, the MSS is usually 1460. The default MSS for non-local destinations is 536. In the example in Fig. 1, Node B establishes a connection with Node A. Because Node A is located on a non-local network, Node B uses the default MSS of 536. Node A in return announces its MSS to be 256 because its MTU is only 296. In this scenario, both sides use a segment size of 256 for data exchange. TCP always chooses the segment size so as to avoid fragmentation.

Fig.1: Ethernet-type LAN Connection
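
The MSS decision just described is easy to express in code. The following C fragment is a minimal sketch of that decision, not any particular stack's logic; choose_mss and dest_is_local are hypothetical names:

    #include <stdint.h>

    #define IP_HDR_LEN   20
    #define TCP_HDR_LEN  20
    #define DEFAULT_MSS 536    /* conservative default for non-local destinations */

    /* Local destinations get the interface MTU minus the 40 bytes of
     * IP and TCP headers; non-local destinations get the default. */
    uint16_t choose_mss(uint16_t if_mtu, int dest_is_local)
    {
        if (dest_is_local)
            return if_mtu - IP_HDR_LEN - TCP_HDR_LEN;  /* e.g. 1500 - 40 = 1460 */
        return DEFAULT_MSS;
    }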

It is easy to determine that a session partner is located on a non-local network if the partner's address has a different network ID and subnet ID. However, a partner with the same network ID but a different subnet ID may or may not reside on a local network, which affects the TCP MSS decision. Most TCP/IP implementations allow the network administrator to set a configuration option, SUBNETSARELOCAL. If this is set, then TCP will use the maximum possible MSS, up to the MTU limit, when connecting to partners with the same network ID but a different subnet ID. For the example above, if the SUBNETSARELOCAL option is set, then Node B will initiate the connection with an MSS of 1460. Note, however, that the segment size in use is still 256 because it is limited by the Node A MTU size.

TCP breaks application data into segments (up to the MSS). Each segment is assigned a sequence number. The receiver TCP reassembles segments into the original application datastream based on the sequence number in the TCP header (Fig. 2).

Fig. 2: TCP Header Datastream

IP fragmentation at network router

TCP determines the data exchange segment size based on the MSS in an attempt to avoid further segmentation. However, this strategy doesn't totally prevent segment fragmentation by the IP network.

Fig. 3: TCP Segmentation

In this example (Fig. 3), both Node A and Node B advertise an MSS of 1460, which is, in effect, the data segment size. Application data of 1350 bytes flowing from Node A to Node B is transmitted in one segment. Router-X receives this data segment and must break it into two fragments of 984 and 366 bytes (each fragment carries a copy of the IP header; the TCP header travels only in the first fragment) in order to route them to Router-Y. Once an IP packet is fragmented, the fragments are not reassembled until they reach the final destination. This avoids possible refragmentation by the next-hop router and makes the process transparent to the transport protocol layer. In this case, Node B receives two fragments (Fig. 4).

Fig. 4: IP Header Segmentation

The sender IP assigns a unique identifier to each datagram. The intermediary IP fragments the segment, copying the same identifier into each fragment. The "more fragments" header flag (IP_MF, 0x2000 in the BSD headers) is set in every fragment except the last. The fragment-offset field contains the data offset of the fragment within the original datagram. The receiver IP reassembles the fragments by copying fragment data into a segment buffer according to the data offset before passing the data to TCP or UDP.
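
A short C sketch makes the offset arithmetic concrete. It assumes a next-hop MTU of 1024 bytes (the example above does not state one) and prints the fragments a router would produce; note that every fragment except the last must carry a multiple of eight payload bytes, because the offset field counts 8-byte units:

    #include <stdio.h>

    int main(void)
    {
        int mtu = 1024;           /* assumed next-hop MTU */
        int ip_hdr = 20;
        int payload = 20 + 1350;  /* TCP header + data ride as IP payload */
        int per_frag = (mtu - ip_hdr) & ~7;  /* round down to 8-byte multiple */

        for (int off = 0; off < payload; off += per_frag) {
            int len = payload - off < per_frag ? payload - off : per_frag;
            int more = off + len < payload;   /* "more fragments" flag */
            printf("fragment: offset=%d (field value %d), len=%d, MF=%d\n",
                   off, off / 8, len, more);
        }
        return 0;
    }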

Fragments have just as much chance of being lost as any other datagram. If one is lost, the original segment must be retransmitted. However, IP has no timer or recovery mechanism, so TCP, or the UDP application, must detect the loss and handle recovery.

TCP can avoid network fragmentation by employing the RFC 1191 path MTU discovery strategy, which is implemented by AIX and Solaris. This approach attempts to discover the smallest MTU en route and, by setting the transmission segment size accordingly, avoid fragmentation in the network. When a TCP session is established, TCP uses the smaller of its network adapter's MTU size or the MSS advertised by its partner (536 if the partner does not specify an MSS) as the initial segment size. Segments are sent with the "don't fragment" bit (IP_DF) set in the IP header flags. A router that receives a packet needing fragmentation but carrying the DF bit drops the packet and generates an ICMP error, ICMP_UNREACH_NEEDFRAG. Upon receiving this ICMP message, the sender adjusts the segment size to the next-hop MTU returned in the message and retransmits. Since routes change dynamically in IP networks and a larger MTU may become available, this procedure must be retried periodically. The default recheck interval is 10 minutes.
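
On systems that expose path MTU discovery through the sockets API, an application can request it explicitly. The sketch below uses the Linux-specific IP_MTU_DISCOVER and IP_MTU socket options (AIX and Solaris enable discovery through system configuration instead):

    #include <netinet/in.h>   /* IP_MTU_DISCOVER, IP_PMTUDISC_DO, IP_MTU (Linux) */
    #include <sys/socket.h>
    #include <stdio.h>

    /* Ask the kernel to set the DF bit and perform RFC 1191 path MTU
     * discovery on this socket. */
    void enable_pmtu_discovery(int sock)
    {
        int val = IP_PMTUDISC_DO;          /* always set "don't fragment" */
        if (setsockopt(sock, IPPROTO_IP, IP_MTU_DISCOVER,
                       &val, sizeof(val)) < 0)
            perror("setsockopt IP_MTU_DISCOVER");
    }

    /* Read back the path MTU the kernel has learned; the socket must
     * be connected for this to be meaningful. */
    void print_path_mtu(int sock)
    {
        int mtu;
        socklen_t len = sizeof(mtu);
        if (getsockopt(sock, IPPROTO_IP, IP_MTU, &mtu, &len) == 0)
            printf("current path MTU: %d\n", mtu);
    }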

Is segmentation good or bad? That is, is a small MSS better than a larger one? A large segment size allows more application data to be sent in one segment, minimizing the IP and TCP header overhead and the TCP/router processing per byte transmitted. Clearly, for data exchanges between applications located on the same local network, the largest possible MSS should be used. However, for data exchanges across wide area networks, especially the Internet, where it's almost guaranteed that data packets will be routed by a dozen or so routers, a small segment size improves overall response time (similar to the chaining benefit in the SNA world). Let's illustrate this with an example. Say you're sending 1400 bytes of data through three routers, assuming router processing time is zero and propagation delay is negligible due to the short distance (Fig. 5).

Fig. 5: Data Transmission Routing

Using a segment size of 1460, the transmission time for 1400 bytes of data plus 20 bytes of IP header and 20 bytes of TCP header over a 56 Kbps line as a single segment is 0.206 seconds. Therefore, it takes 0.618 seconds for this segment to travel from Router 1 to Router 4.

If a segment size of 536 bytes is used, the application data is segmented into three segments for transmission. At time 0.082 seconds, the first 536-byte segment arrives at Router 2 and can be forwarded to Router 3 immediately while the second 536-byte segment is still being transmitted from Router 1 to Router 2 (Fig. 6).

Fig. 6: Data Transmission Routing

At time 0.246 seconds, the first 536-byte segment arrives at Router 4, the second 536-byte segment reaches Router 3, and the third segment of 328 data bytes (368 bytes with headers) is already en route between Router 2 and Router 3 (Fig. 7).

Fig. 7: Data Transmission Routing

Because of this parallelism, the total transmission time is approximately 0.382 seconds (Fig. 8).

Fig. 8: Parallelism in Data Transmission Routing

Routers are store-and-forward machines: they must receive an entire data segment before they can process it. Small segments enable parallelism in the network and improve overall response time, and the improvement grows as the number of routers en route increases. However, shorter segment sizes increase the total number of IP datagrams in the network. Today's routers are several times faster in throughput than routers of just a year or two ago, so the network's capacity for routing packets, generally speaking, isn't an issue. Shorter segments also decrease router memory demand per packet, but the increased packet count may offset this advantage. Router memory requirements rise when routing packets from faster network interfaces to slower ones, because packets must be buffered. This is especially true at the Internet boundary, where transmission capacity from consumer to ISP is considerably slower than the backbone network.
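
The arithmetic behind Figs. 5 through 8 generalizes readily. The C sketch below computes the total transfer time for a given amount of data, MSS, hop count, and line speed under the same assumptions (equal-speed links, zero processing and propagation delay); its output may differ from the figures above in the last digit because the text rounds intermediate values:

    #include <stdio.h>

    /* Store-and-forward timing: the first segment must cross every hop,
     * and with equal-speed links the remaining segments pipeline behind
     * it, each adding one serialization time at the final hop. */
    double transfer_time(int data_bytes, int mss, int hops, double bps)
    {
        const int hdr = 40;   /* IP + TCP headers per segment */
        double total = 0.0;
        int first = 1;

        while (data_bytes > 0) {
            int chunk = data_bytes < mss ? data_bytes : mss;
            double t = (chunk + hdr) * 8.0 / bps;  /* serialization time */
            total += first ? hops * t : t;
            first = 0;
            data_bytes -= chunk;
        }
        return total;
    }

    int main(void)
    {
        printf("one 1460-byte segment: %.3f s\n",
               transfer_time(1400, 1460, 3, 56000.0));   /* ~0.617 */
        printf("536-byte segments:     %.3f s\n",
               transfer_time(1400,  536, 3, 56000.0));   /* ~0.382 */
        return 0;
    }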

Today's transmission networks are very reliable, and packets lost to transmission errors are rare. When a router is short on resources, most likely memory, it discards packets. This is the major cause of data loss in a network and the source of TCP retransmission. Let's first examine how TCP detects lost data and retransmits.

Data Acknowledgement

As shown before, the TCP header contains a 32-bit sequence number, a 32-bit acknowledgement number, and a 16-bit window size field. TCP is a full-duplex protocol. Each connection partner independently selects an initial sequence number (a global variable initialized from the time-of-day clock and incremented by 904 for each connection) when establishing a new connection. The initial sequence numbers are carried in the SYN packets exchanged at start-up. Thereafter, the sequence number in each segment identifies where the segment's first data byte sits in the sender's byte stream.

TCP acknowledges bytes received rather than particular data segments. The acknowledgement number in the TCP header indicates to the sender the next byte the receiver expects, and hence acknowledges all bytes sent up to the acknowledgement number minus one. The acknowledgement may be a bare IP and TCP header with no data, or simply an outgoing data segment with the acknowledgement number and flag set in its TCP header. The receiver TCP doesn't immediately generate an acknowledgement upon receiving data; it waits, either for more segments to arrive, which allows acknowledgements to be combined, or for outgoing application data on which the acknowledgement can be piggybacked. RFC 1122 states that TCP should implement delayed acknowledgement but that the delay must be less than 500 milliseconds. A 200-millisecond acknowledgement delay is the common implementation value. Of course, the delay is bypassed if there is data ready to send, as indicated above.

One frequently wonders what happens when the TCP sequence number wraps, since TCP needs to compare sequence numbers to determine whether one segment comes before or after another. The TCP sequence number and acknowledgement number are defined in the TCP header as unsigned 32-bit integers, so they can count up to 4,294,967,295 bytes. Mathematically, however, simple arithmetic determines the correct relationship between two sequence numbers as long as their difference is no greater than one-half of the total integer space, in this case 2,147,483,648 bytes. TCP can assume this because no network is (yet) able to deliver that amount of data within the IP datagram lifetime (T3 is six minutes and FDDI is three minutes). A set of C macros defined in the TCP header file (tcp.h or tcp_seq.h) returns true or false by casting the difference of the two sequence numbers to a signed integer. When the sequence number has wrapped, subtracting a very large number from a very small one wraps around, producing a small positive signed result and hence a true macro result. TCP program logic therefore simply uses the predefined macros, such as SEQ_GT(s1, s2), for checking data segment sequence and acknowledgement numbers.
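
For illustration, the comparison macros look essentially like this (modeled on the 4.4BSD tcp_seq.h definitions):

    #include <stdint.h>
    #include <stdio.h>

    typedef uint32_t tcp_seq;

    /* Casting the unsigned difference to a signed integer gives the
     * right answer as long as the two sequence numbers are within
     * 2^31 of each other, even across the 32-bit wrap. */
    #define SEQ_LT(a, b)   ((int32_t)((a) - (b)) < 0)
    #define SEQ_LEQ(a, b)  ((int32_t)((a) - (b)) <= 0)
    #define SEQ_GT(a, b)   ((int32_t)((a) - (b)) > 0)
    #define SEQ_GEQ(a, b)  ((int32_t)((a) - (b)) >= 0)

    int main(void)
    {
        tcp_seq before_wrap = 4294967000u;  /* near the top of the space */
        tcp_seq after_wrap  = 100u;         /* sequence space has wrapped */
        /* 100 - 4294967000 wraps to a small positive signed value (396),
         * so after_wrap correctly compares as "greater". */
        printf("SEQ_GT(after, before) = %d\n", SEQ_GT(after_wrap, before_wrap));
        return 0;
    }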

Nagle Algorithm

Many old TCP/IP applications, such as rlogin, simulate TTY operation: each time a key is pressed, a data segment carrying one byte of data is sent. This floods the network with many "small" segments. In a LAN environment this is not a concern, because congestion usually is not an issue, but this mode of operation causes problems in wide area networks.

TCP is required to implement the RFC 896 Nagle algorithm, which addresses this behavior. The algorithm basically states that a TCP connection can have only one small unacknowledged segment outstanding. This forces the sending application to pause before sending the next small segment and reduces the total number of small segments in the network. The solution is sound because it is self-clocking: the faster acknowledgements return, the faster the application can send.

Nevertheless, the Nagle algorithm imposes an application throughput constraint similar to that of SNA applications running in definite-response mode. Furthermore, TCP considers a "small" segment to be any segment smaller than the MSS which, as you may recall from the previous discussion, is related to the network interface MTU. The application, however, is unaware of the MSS unless it interrogates the network characteristics using the ioctl socket API.

Consequently, if the application's data size is generally less than the MSS, if its traffic is usually one-directional, if the server application frequently requires multiple application data items before processing, or if the TCP connection acts as a pipe conduit on behalf of multiple application sub-tasks, then the application always appears to run slow regardless of how one tunes the application logic. The reason is now obvious: the sending TCP obeys the Nagle algorithm, and the receiving TCP, having no immediate reply data a majority of the time, waits out the 200-millisecond acknowledgement delay. This is also why the same application processing the same data over UDP always outperforms TCP: UDP has no such operational constraints.

Because an application knows when it is not sending trivial small data, it can disable the Nagle algorithm using the setsockopt socket call with the TCP_NODELAY option.
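
The call itself is one line; a minimal sketch with error handling:

    #include <netinet/in.h>
    #include <netinet/tcp.h>   /* TCP_NODELAY */
    #include <sys/socket.h>
    #include <stdio.h>

    /* Disable the Nagle algorithm on a connected TCP socket. Suitable
     * when each send() is a complete application message and the extra
     * small packets are an acceptable trade-off. */
    int disable_nagle(int sock)
    {
        int on = 1;
        if (setsockopt(sock, IPPROTO_TCP, TCP_NODELAY, &on, sizeof(on)) < 0) {
            perror("setsockopt TCP_NODELAY");
            return -1;
        }
        return 0;
    }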

Segment Lifetime

IP is a connectionless, best-effort delivery network. How can one prevent a data segment from being stuck in the network undelivered, and how can one prevent it from appearing after the existing connection has closed and a new connection using the same host and port pair has been established? TCP/IP implements two strategies to resolve this situation.

The IP header contains an 8-bit time-to-live (TTL) field initialized by the sender. The Assigned Numbers RFC (RFC 1340) currently specifies a TTL default value of 64. RFC 1009 states that a network router should decrease the datagram TTL field by the number of seconds the datagram was held by the router, or by one if held for less than one second. When the TTL reaches zero, the datagram is discarded and an ICMP notification, type 11 (ICMP_TIMXCEED), is generated and returned to the sender. In reality, routers hardly ever hold a datagram for more than a fraction of a second, so the TTL effectively represents a maximum hop count set by the sender and an upper limit on an otherwise indefinite routing loop. Through the header TTL field, then, IP imposes a lifetime on each data segment.

TCP implements a Maximum Segment Lifetime (MSL) in order to prevent delayed segments of a previous connection from mistakenly appearing in a new connection. RFC 793 specifies that the MSL should be set to two minutes, but the system administrator may configure a shorter value.

TCP does not immediately delete the connection control block when closing a connection. It schedules a timer for a period of two MSL before cleaning up and allowing reuse of the same socket port. This is in case the final FIN acknowledgement gets lost, allowing the opportunity to re-acknowledge when the connection partner retransmits. TCP discards all data segments received during the two-MSL wait. Of course, this approach cannot work if the system crashes and reboots within the MSL interval and an application immediately establishes a connection using the same socket port pair (many server applications start automatically after reboot). RFC 793 also states that a system should not allow any TCP connection during the MSL interval after rebooting. No system implements this requirement, since most systems take much longer than the MSL time to go from crash to complete restart.

Note that the two-MSL wait explains why a server program that terminates without closing its connection and then restarts cannot immediately bind to the same well-known port: unless it issues a setsockopt function call with the SO_REUSEADDR option, it must wait two MSL intervals (generally four minutes).
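
A typical server therefore sets the option before calling bind. A minimal sketch (the port number is only an example):

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    /* Create a listening socket that can be rebound immediately after a
     * restart, even while the old connection sits in its two-MSL wait. */
    int listen_on(unsigned short port)
    {
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        int on = 1;
        struct sockaddr_in addr;

        if (fd < 0)
            return -1;
        setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &on, sizeof(on));

        memset(&addr, 0, sizeof(addr));
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        addr.sin_port = htons(port);

        if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
            listen(fd, 5) < 0) {
            close(fd);
            return -1;
        }
        return fd;
    }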

Transmission Error Detection

IP, TCP, and UDP headers all include a 16-bit checksum field. The checksum is computed by padding the transmission data to a 16-bit boundary with zeros if necessary, then treating the data as a sequence of 16-bit words and summing them. The checksum is the one's complement of the sum. The sender stores the calculated checksum in the corresponding header checksum field for transmission. A zero checksum field indicates that the sender did not compute a checksum. Note that this algorithm is not foolproof, because it cannot detect reordered data: for example, the bytes 12 34 56 sum the same as 56 12 34.

The receiver applies the same checksum algorithm to the received data. Since the receiver's calculation includes the sender's checksum, the receiver's sum should be all one bits (0xFFFF) if the data has not been altered, which is very simple for the program to verify.
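
The algorithm is short enough to show in full. The following C function is a straightforward RFC 1071-style implementation, not taken from any particular stack:

    #include <stddef.h>
    #include <stdint.h>

    /* Internet checksum: sum the data as 16-bit words in one's-complement
     * arithmetic (padding an odd trailing byte with zero), then return
     * the one's complement of the sum. A receiver summing the data
     * together with the transmitted checksum gets 0xFFFF if intact. */
    uint16_t inet_checksum(const uint8_t *data, size_t len)
    {
        uint32_t sum = 0;

        while (len > 1) {                     /* add 16-bit words */
            sum += (uint32_t)data[0] << 8 | data[1];
            data += 2;
            len -= 2;
        }
        if (len == 1)                         /* pad the odd byte */
            sum += (uint32_t)data[0] << 8;

        while (sum >> 16)                     /* fold carries back in */
            sum = (sum & 0xFFFF) + (sum >> 16);

        return (uint16_t)~sum;
    }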

The IP header checksum protects the 20-byte IP header since this is the only part of the packet that concerns the IP layer. If the checksum fails, the packet is thrown away and no error message is generated. It is up to TCP or UDP to detect the lost packet and take appropriate action.

Both the UDP and TCP checksums protect the header and the data. In addition, the checksum covers a 12-byte pseudo header containing parts of the IP header, such as the origination and destination IP addresses, the protocol flag (UDP, TCP, ICMP, etc.), and the total data length. The pseudo header is not transmitted, but it enables TCP or UDP to verify that IP delivered the packet to the right node and the right application. This is probably an overcautious operating procedure.

Per RFC 768, the UDP checksum is optional, which can save some computing cycles; the receiver does not verify the checksum if the header checksum field is zero. However, a simple loop adding 16-bit integers is hardly a measurable chore for today's computers, and all current UDP implementations compute the checksum by default. UDP discards a received packet if the checksum verification fails, and no notification is created. The sending and receiving applications must be programmed to handle timed-out and lost packets.

A TCP checksum is mandatory and must be computed. The receiver drops the received segment if the checksum computation fails. The sender TCP detects lost data and retransmits the unacknowledged segment.

Retransmission

Both data segments (and their fragments) and acknowledgement packets can get lost in the network. TCP sets a retransmission timer when sending data segments (TCP does not acknowledge acknowledgements). If an acknowledgement is not received within the retransmission time interval, the unacknowledged data is considered lost. When sending more than one segment, TCP does not set a retransmission timer for each segment; the timer is set only when there is no outstanding segment awaiting acknowledgement, as shown in the example below (Fig. 9):

Fig. 9: Lost Transmission

TCP A sent three segments, 101101, 101637, and 102173, each with 536 bytes of data. Segments 101101 and 101637 arrived at TCP B, and 200 milliseconds later TCP B acknowledged them. Shortly after, the third segment reached TCP B, but its acknowledgement was lost. When TCP A's retransmission timer expired, it re-sent segment 102173, since data up to that sequence number had been acknowledged. In the meantime, the TCP B application had received all three segments' data and generated a 300-byte reply, which was forwarded to TCP A in segment 334987. The duplicate segment 102173 arrived at TCP B; it was acknowledged but its data was discarded. Meanwhile, segment 334987 arrived at TCP A and an acknowledgement was generated.

Since IP is a best-effort delivery network, the determination of accurate transmission timeout interval before initiating retransmission is critical to TCP operation:

  1. Premature retransmissions add unnecessary processing, delay, and network load.
  2. The segment round-trip time (RTT), that is, the time from sending a data segment to receiving the corresponding acknowledgement, varies with network conditions and other application activity. Segments may also travel different paths, adding further RTT variation. TCP should therefore take RTT fluctuations into account over the life of a connection.
  3. RTT becomes much longer when a network is heavily loaded or congested, and retransmission further aggravates a congested network, like pouring gasoline on a fire.

Per RFC 793, the TCP sets the retransmission timeout (RTO) value by factoring in the RTT history and the current measured RTT:

RTT = (0.9 * Last RTT) + (0.1 * Current Measured RTT)
RTO = 2 * RTT

This smooths the RTT by weighting accumulated RTT history heavily, and it serves connections with steady RTT well. However, it does not stabilize a connection that experiences wide RTT fluctuations. TCP improves the RTO calculation by using both the RTT average estimate and its mean deviation. An integer implementation of this strategy (using bit shifts instead of floating point and square roots) becomes:

Variance = Measured RTT - Last RTT 
RTT = Last RTT + (Variance / 8) 
Deviation = Last Deviation + ((Absolute Value of Variance - Last Deviation) / 4) 
RTO = RTT + 4 * Deviation
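
A C rendering of these formulas, using BSD-style scaled integers (smoothed RTT scaled by 8, mean deviation scaled by 4) so that the divisions become shifts; this is a sketch of the calculation, not a particular stack's code:

    /* m is the newly measured RTT, in clock ticks. */
    struct rtt_state {
        int srtt;    /* smoothed RTT, scaled by 8 */
        int rttvar;  /* mean deviation, scaled by 4 */
    };

    int update_rto(struct rtt_state *s, int m)
    {
        int err = m - (s->srtt >> 3);   /* Variance = Measured RTT - Last RTT */
        s->srtt += err;                 /* RTT += Variance / 8 (scaled by 8)  */
        if (err < 0)
            err = -err;                 /* |Variance|                         */
        err -= (s->rttvar >> 2);
        s->rttvar += err;               /* Dev += (|Variance| - Dev) / 4      */
        return (s->srtt >> 3) + s->rttvar;  /* RTO = RTT + 4 * Deviation      */
    }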

An exponential back-off method determines the retransmission timer value: each successive timeout value doubles, to a maximum of 64 seconds, so the retransmission timer multiplication factors are two, four, eight, 16, and so on. TCP attempts twelve retransmissions before giving up and closing the connection. For example, assume the calculated RTO is 2.5 seconds and a timeout occurs. TCP now applies exponential back-off. Because this is the first timeout, a factor of two is used and the retransmission timer is set to five seconds. If the segment times out again, a factor of four applies and the retransmission timer becomes 10 seconds, then 20 seconds, 40 seconds, and finally stays at 64 seconds.

An ambiguity develops when a segment is retransmitted. Suppose the retransmission timer expires, the RTO backs off as discussed above, the segment is retransmitted with the new RTO, and then an acknowledgement is received: the sender cannot tell whether the acknowledgement is for the first transmission or for the retransmission. TCP implements Karn's algorithm to address this situation. Essentially, Karn's algorithm states that the estimated RTT should not be updated when an acknowledgement arrives after a retransmission and, since exponential back-off was already applied to the RTO, the same backed-off RTO is reused without further back-off. The RTO is recalculated only when an acknowledgement is received for a segment that was not retransmitted.

TCP sets estimated RTT based on data exchange history and RTT mean deviation. This helps to ensure effective application data flow and efficient use of network resources by improving the data loss detection accuracy and minimizing unnecessary retransmissions. The TCP retransmission methodology is not configurable by the network administrator.

Window Size

The connecting TCP partners advertise their window sizes in the TCP header of data segments and/or acknowledgements. The window size is the receive data buffer size that TCP has allocated on behalf of the application. TCP holds received application data in this buffer until the application issues a socket-receive function call, at which point TCP moves the received data from the buffer to the application's data storage. The traditional default receive buffer size is 4096 bytes; current operating systems, such as Solaris and AIX, use a larger default of 8192 or even 16384 bytes. Larger buffer sizes improve performance, and application programs frequently issue the setsockopt socket function call with the SO_RCVBUF option to change the buffer size (see the window scale option below).
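
A sketch of that call; the 64 KB figure is only an example, and the option should be set before the connection is established if it is to influence the advertised window (and any scale factor, discussed below):

    #include <stdio.h>
    #include <sys/socket.h>

    int set_recv_buffer(int sock, int bytes)
    {
        if (setsockopt(sock, SOL_SOCKET, SO_RCVBUF, &bytes, sizeof(bytes)) < 0) {
            perror("setsockopt SO_RCVBUF");
            return -1;
        }
        return 0;
    }

    /* Usage: set_recv_buffer(sock, 65536) before connect() or listen(). */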

The advertised window size is the vehicle by which the receiver controls data flow from the sender. It tells the sender the amount of data the receiver is able to accept, and the sender never sends data that would exceed the advertised window at that instant. Let's use the data exchange between TCP A and TCP B below to illustrate the TCP window size.

TCP A sends three segments to TCP B, all advertising a window size of 4096. TCP B acknowledges the two arrived segments but advertises a window size of 3024, because the application has not yet issued a socket-receive call to retrieve the inbound data. By the time the third segment is acknowledged, after the 200-millisecond delay, the application has already retrieved the data, so that acknowledgement carries a window size of 4096. TCP A retransmits the third segment, itself advertising a window size of 4096 because it has not received any data. TCP B re-acknowledges the third segment with a window size of 4096 because the duplicated data was discarded. The last acknowledgement from TCP A also advertises a window size of 4096, since the application has already processed the reply data (Fig. 10).

Fig. 10: Lost Data Transmission

TCP mandates that the advertised window cannot shrink once it is in effect: the window size may decrease only because the sender sends data and the acknowledgement number increases. The receiver cannot simply advertise a smaller window merely because it decides to accept less data.

When the window size goes to zero, the receiver effectively shuts off the data flow from the sender. In the example below, TCP A sends four 1024-byte segments to TCP B. TCP B correctly acknowledges receiving 4096 bytes of data but, until the application issues a socket-receive function call to retrieve the data, it advertises a window size of zero. After the data has been copied to application storage, TCP B reopens the data flow by acknowledging the same received sequence number, but with an open window size (Fig. 11).

Fig. 11: Open Data Flow

What happens if the window open acknowledgement gets lost? TCP A and TCP B will be deadlocked as each side is waiting for the other to proceed. TCP prevents this condition by implementing two strategies:

  1. Persist timer: The TCP persist timer is set when the sender receives a zero window size notification from the receiver. A TCP with outbound data sends a periodic window probe, a data segment with one byte of data. Since the receiver has no room for this byte, it returns an acknowledgement without acknowledging the byte (i.e., the acknowledgement number does not increase), so the same single byte is retransmitted repeatedly until the window reopens. The persist timer backs off exponentially like the retransmission timer (factors of two, four, eight, 16, and so on), except that it caps at a maximum of 60 seconds and it never gives up.
  2. Gratuitous acknowledgement: Whenever TCP passes data from the receive buffer to the application, it checks whether doing so causes the receive window to open. If so, it automatically sends a gratuitous acknowledgement. When the sender processes this acknowledgement, it discovers that the window has opened and resumes transmission.

If TCP advertises an open window as soon as a small amount of buffer space becomes available, it may fall victim to silly window syndrome, in which the receiver's window size oscillates between zero and a small value while the sender transmits small data segments trying to fill the buffer as quickly as possible. This yields poor network utilization, because little application data is carried relative to the TCP/IP overhead. TCP programming logic avoids this unproductive situation: once the receive window has reached zero, it does not advertise an open window until the window can advance by a full MSS or by half of the receive buffer, whichever is smaller (the rule given in RFC 1122).

The TCP header window size is a 16-bit field, limiting the TCP window to 65535 bytes, a rather small limit for today's networking applications. Most current implementations employ the window scale option of RFC 1323, which widens the effective TCP window well beyond 16 bits. The implementation maintains downward compatibility by keeping the same 16-bit window size field in the TCP header but defining an option field that contains a window size scaling factor; internally, the TCP code represents the window size as a 32-bit integer. The window scale option (type 3, TCPOPT_WINDOWSCALE) follows the TCP header and carries a one-byte shift count between 0 and 14. Zero means no scaling, and a shift count of 14 allows a window of 65535 x 2^14, or 1,073,725,440 bytes. The window scale option may be sent only in the SYN segment, at connection establishment time, so the scale is fixed for the life of the connection but may differ in each direction. The active-open partner (the one that starts the connection) must include the window scale option in its SYN segment, and the passive-open partner may send this option only if the active opener did. If the active opener receives no window scale option in reply, it assumes its partner does not support the feature and resets its own shift count to zero. TCP chooses the scaling factor based on the system default receive buffer size or the value set by the application with the setsockopt socket function call.
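
How an implementation might choose the advertised shift count is easy to sketch: the smallest shift, capped at 14, under which the receive buffer still fits the 16-bit window field. This is an illustration, not a particular stack's code:

    #include <stdint.h>

    int window_scale(uint32_t rcv_buffer)
    {
        int shift = 0;
        while (shift < 14 && (rcv_buffer >> shift) > 65535)
            shift++;
        return shift;   /* e.g. 0 for a 16 KB buffer, 5 for 1 MB */
    }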

Congestion Window

TCP detects network congestion through retransmission, that is, through a transmission timeout or the receipt of duplicate acknowledgements. This simple inference works because network congestion implies a shortage of network resources and added delay. The natural consequence of IP operation under congestion is to drop datagrams, so some transmissions go unacknowledged, time out, and must be retransmitted.

If the sender receives duplicate acknowledgements, say two acknowledgements both acknowledging data sequence 123456 when it has already sent up to 126528, this implies lost data. TCP acknowledges only contiguous, correctly received data, and it does so in response to receiving data segments; per the RFC specification, the receiver of an out-of-sequence segment must immediately generate a duplicate acknowledgement, bypassing the 200-millisecond delay rule. Duplicate acknowledgements therefore tell the sender that a gap exists in the received data. For example, of three 1024-byte segments, the receiver may hold the segment beginning at 122432 and, out of order, the segment beginning at 125504, while the segments beginning at 123456 and 124480 are missing, so it keeps acknowledging 123456. The sender therefore needs to retransmit data starting from sequence number 123456. Retransmission, however, could further exacerbate network congestion by adding more traffic to the network.

TCP avoids compounding network problems by slowing its transmission rate after detecting congestion. The window size discussed above is the data flow mechanism imposed by the receiver; the congestion window (an internal program variable) is the sender's own flow control strategy, limiting the amount of data it injects into the network. In other words, the sender sends data up to the receiver's advertised window or the congestion window, whichever is smaller, but never less than the MSS (one data segment).

Initially, the sender sets the congestion window to the receiver's advertised window size. When congestion arises (i.e., an acknowledgement timeout), the congestion window is halved each time a retransmission occurs, but never falls below the MSS, as stated above. For instance, with an advertised window of 8192 and an MSS of 536, the congestion window decreases to 4096, then 2048, then 1024 as the sender retransmits: the outbound data rate collapses within a few retransmissions. TCP, however, tries to avoid oscillating, repeatedly ramping the flow rate up, causing congestion, and backing off again. Therefore, the congestion window increases slowly, by approximately one segment divided by the number of segments in the congestion window per acknowledgement received. In general, the segment size equals the MSS and the congestion window equals the MSS times n segments, so the congestion window grows by this formula:

Increase in segments = 1 segment / n segments
Increase in bytes    = MSS / (congestion window / MSS)
                     = (MSS * MSS) / congestion window

For example, if the MSS is 536 and the advertised window size is 8192, the congestion window increment is 536 times 536 divided by 8192, or 35 bytes. Thus TCP slows the outbound data flow exponentially when it detects network congestion, while the additive increase reopens the transmission window slowly. This reduces the probability of renewed congestion and promotes steady, smooth data exchange.
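
The increment is a one-line computation; this tiny program reproduces the 35-byte figure:

    #include <stdio.h>

    int main(void)
    {
        int mss = 536;
        int cwnd = 8192;   /* congestion window, bytes */
        printf("per-acknowledgement increment: %d bytes\n",
               (mss * mss) / cwnd);   /* prints 35 */
        return 0;
    }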

Slow Start

Starting a data exchange by sending multiple segments, running into congestion, and then backing off does not seem an efficient way to begin. Ideally, the rate of sending segments should match the rate at which the connecting partner returns acknowledgements. The slow start algorithm states that, after a connection has been established, the congestion window is initialized to one segment size (i.e., the MSS advertised by the partner) and the sender increases the congestion window by one segment size for each segment acknowledged. All TCP implementations today are required to support the slow start algorithm.

The sender starts by sending one segment and, if the acknowledgement is successfully received, increases the congestion window to two segments and sends two segments. When these two segments are acknowledged, the congestion window becomes four segments, and four segments are sent, and so on. The congestion window opens exponentially until it reaches the receiver's window size or congestion is detected.

In practice, TCP uses slow start and the congestion window in concert to avoid congestion:

  1. At the beginning, the congestion window is set to one segment size and the slow start threshold (a program variable) is set to 65535 bytes.
  2. When congestion occurs, one-half of the current congestion window size, but not less than two segment sizes, is saved in the slow start threshold variable. In addition, if a timeout (rather than a duplicate acknowledgement) signaled the congestion, the congestion window is reset to one segment and slow start begins again.
  3. As data is acknowledged by the connecting partner, the congestion window increases.
    1. If the congestion window is less than the slow start threshold, the congestion window grows per the slow start criteria (i.e., exponentially).
    2. Otherwise, it grows per the congestion avoidance algorithm ((MSS * MSS) / congestion window).

Hence, TCP performs slow start, reopening the window quickly until it reaches half of the window size at which it previously ran into congestion; it then switches to the avoidance algorithm and increases the data transmission rate slowly.

A slight enhancement to this implementation is commonly referred to as the Fast Retransmit and Fast Recovery algorithm. As noted above, when a duplicate acknowledgement is received, the congestion window is not reset to one, because the sender cannot determine for certain whether the duplicate acknowledgement is due to lost data or merely to out-of-sequence delivery (recall that the receiver must immediately generate an acknowledgement upon receiving out-of-sequence data). If it is simply an out-of-sequence condition, the sender should see only a few duplicate acknowledgements before all segments arrive at the receiver and are processed; several duplicate acknowledgements in a row are a good indication that data has been lost. Therefore, TCP waits for three duplicate acknowledgements, then sets the slow start threshold to half of the current congestion window and immediately retransmits the lost segment(s) without waiting for the retransmission timer to expire. This is the Fast Retransmit algorithm. Since the congestion window is not reset to one in this case (slow start is not performed), the existing data flow rate resumes much more quickly. This is the Fast Recovery algorithm.
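
Taken together, the algorithms of this section fit in a few lines. The C sketch below models them under simplifying assumptions (byte-counted windows, the receiver's advertised window and all timer details ignored); it is an illustration of the logic, not any stack's actual code:

    struct cc_state {
        unsigned cwnd;      /* congestion window, bytes */
        unsigned ssthresh;  /* slow start threshold, bytes */
        unsigned mss;
        int dup_acks;
    };

    void cc_init(struct cc_state *s, unsigned mss)
    {
        s->cwnd = mss;             /* slow start: one segment */
        s->ssthresh = 65535;
        s->mss = mss;
        s->dup_acks = 0;
    }

    void cc_on_new_ack(struct cc_state *s)
    {
        s->dup_acks = 0;
        if (s->cwnd < s->ssthresh)
            s->cwnd += s->mss;                       /* slow start: exponential */
        else
            s->cwnd += (s->mss * s->mss) / s->cwnd;  /* avoidance: additive */
    }

    void cc_on_timeout(struct cc_state *s)
    {
        unsigned half = s->cwnd / 2;
        s->ssthresh = half > 2 * s->mss ? half : 2 * s->mss;
        s->cwnd = s->mss;          /* timeout: restart slow start */
    }

    void cc_on_dup_ack(struct cc_state *s)
    {
        if (++s->dup_acks == 3) {  /* fast retransmit threshold */
            unsigned half = s->cwnd / 2;
            s->ssthresh = half > 2 * s->mss ? half : 2 * s->mss;
            s->cwnd = s->ssthresh; /* fast recovery: skip slow start */
            /* ...and retransmit the missing segment immediately. */
        }
    }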

Conclusion

This document has discussed TCP operation, flow control, lost data detection, and congestion avoidance methodologies. Today's TCP/IP implementations incorporate the experience of the past. In comparison to SNA, TCP/IP's data flow control procedures are relatively simple and self-contained. As TCP/IP becomes the pivotal component of the e-commerce Internet world, the following improvements would be of value to customers:

  • Customer-centric instead of network-centric. As with all network architectures, network designers tend to be network-centric, focusing on network resource usage, efficiency, and management. They should instead be customer-centric, focusing on application response time and throughput. Network defaults should be chosen, and be customizable, to achieve this objective.
  • Intelligent network. A TCP/IP network consists of routers and switches, but these components do not share their knowledge among themselves unless they are like devices from the same vendor. Standard procedures are needed for diverse network components to exchange operating conditions, parameters, resource boundaries, and controls in order to balance workloads, proactively avoid bottlenecks, and dynamically revise routing choices.
  • Knowledge sharing. A host supports many applications, and each application may have many active TCP connections. Many connections from one host may run to the same target host, yet no information is shared among them: each must endure the same difficulties, such as lost data, retransmission, delays, and congestion avoidance, even though they run from the same host to the same destination, most likely over the same network path. Data exchange history needs to be shared and persistent across connections to improve application effectiveness and overall network efficiency.
  • Class of service. As in SNA, applications should be able to define and select a desired class of service so that application data is delivered according to business criteria. Note that IP defines a Type-of-Service field, but TCP/IP implementations provide no means for the application to define or select it.
  • Data compression. TCP/IP should, by default, compress application data, since Internet applications now deliver ever more complex and richer content. Transmitting less data translates directly into improved response time and network efficiency.
