Analysis and Research of Communication Interrupt Fault for Shanghai Metro Data Transmission System

A line of the Shanghai metro has been in service for nearly fifteen years and has been extended three times during that period. Over the years, the data transmission system of the existing line has been modified and has adopted many kinds of data transmission technology. By analyzing the communication interruptions that repeatedly occurred at certain sites in this line's data transmission system, maintainers can find hidden safety hazards in time and take corresponding measures, which improves the quality of metro operation.


I. INTRODUCTION
One of the Shanghai metro lines has adopted a new data transmission technology that provides a high-speed, large-capacity optical fiber transmission channel. As a result, the data transmission system of the Shanghai metro has become safer, faster, and more reliable. Recently, however, the signal systems at several stations experienced communication interruptions. By analyzing this fault, metro maintenance workers can find hidden safety hazards in time and take corresponding corrective measures.

II. DTS SYSTEM NETWORK TOPOLOGY
Network topology refers to the physical layout of the devices interconnected by the transmission media, in particular the placement of computers and the routing of cables. There are many kinds of network topology, such as star, bus, ring, tree, distributed, mesh, and cellular structures. One line of the Shanghai metro adopts a ring topology, in which all nodes are connected end to end to form a closed communication ring. Information can be transmitted in one direction or in both directions. The ring structure has the following advantages: 1) adding or removing a workstation takes only a simple connection operation; 2) optical fiber can be used to extend the transmission distance; 3) when a node fails, it can be bypassed automatically, giving high reliability. The ring structure also has some disadvantages: 1) the failure of any node in the ring can paralyze the whole network; 2) faults are difficult to locate, because the ring has no centralized control and every node on the network has to be examined when a fault occurs; 3) since information is transmitted serially, too many nodes in the ring reduce transmission efficiency and lengthen the response time. Because of these disadvantages, one line of the Shanghai metro adopts a double channel redundancy design, namely a double loop network topology.
The double channel redundancy structure is characterized by continuous, real-time, unrestricted, and fast operation. In the double loop network topology, switchover is performed automatically by hardware, and fault detection and switching are completed mainly at the bottom layer, which improves processing efficiency, shortens the recovery time after a network communication fault, and meets the real-time requirements.
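The benefit of the second channel can be illustrated with a minimal connectivity sketch. It assumes that node 0 stands for the central server, that each channel (S1, S2) is an independent bidirectional ring, and that a station is served if it can be reached on either channel. The function names are hypothetical, and the sketch models reachability only, not the hardware switchover itself.

```python
from collections import deque


def ring_links(n):
    """Links of a ring with n nodes: node i is connected to node (i + 1) % n."""
    return {frozenset((i, (i + 1) % n)) for i in range(n)}


def reachable(n, links, start=0):
    """Return the set of nodes reachable from start over the given links."""
    adjacency = {i: set() for i in range(n)}
    for link in links:
        a, b = tuple(link)
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency[node] - seen:
            seen.add(neighbour)
            queue.append(neighbour)
    return seen


def served(n, failed_s1, failed_s2):
    """Stations reachable from node 0 on channel S1 or on channel S2."""
    s1 = reachable(n, ring_links(n) - failed_s1)
    s2 = reachable(n, ring_links(n) - failed_s2)
    return s1 | s2


# Two cuts split channel S1, but the intact channel S2 still reaches every station.
cuts = {frozenset((1, 2)), frozenset((4, 5))}
print(sorted(served(6, cuts, set())))
```

A single cut leaves one ring connected, two cuts partition it, and the second channel covers that case as long as it is not cut in the same places.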

The network topology of the DTS (Data Transmission System) of one Shanghai metro line is shown in Fig. 1. Each loop is a two-fiber optical ring with counter-rotating directions, which provides redundancy. When one set of hardware fails, the system can continue to work. The loops transmit data through redundant Ethernet switches and terminal servers. Ethernet packets are handled between the central control room and each signal equipment room by optical fiber industrial Ethernet switches, while serial data is handled by redundant serial-to-Ethernet device servers.
In one line of the Shanghai metro, the central server and each site are connected by optical fiber to form a ring. The whole ring has two channels, S1 and S2, which are redundant. Each site uses two Layer 3 Ethernet switches as relay nodes. In each site, the man-machine dialogue workstation, the terminal servers, and the platform screen door (PSD) AP are all managed by the switches. The RS910 terminal server is an industrial Ethernet switch with two serial ports and two Fast Ethernet ports; it can also limit the port rate and suppress broadcast storms. Because the fault points usually occur in the terminal server, the following gives a brief introduction to the working conditions and functions of the RS910 terminal server. Fig. 2 is the network connection diagram of the RS910 as a terminal server. Two terminal servers connect to switches A and B separately; their main function is to carry interlocking data between adjacent stations and to support the redundancy decision (MI - terminal server - switches). The third terminal server connects to network segments A and B at the same time; it is mainly used for platform screen door linkage control at the site (MI - terminal server - switches - wayside AP - train MR) and is known as the PSD door terminal server.

III. FAULT PHENOMENON AND ITS INFLUENCE
Recently, the DTS of a Shanghai metro line suffered communication interruptions at multiple sites, which led to interlocking failures. Large parts of the interlocking control area turned to red bands, and the station ATS and the central ATS displayed inconsistent states, so that routes could not be set and trains stopped, which affected the normal operation of the subway. The faults are listed in TABLE I.

IV. FAILURE ANALYSIS AND PROCESSING
On December 9 and 11, 2013, the switches of both the A and B network segments at several stations of one subway line went down simultaneously within a few minutes. The fundamental reason was the presence of a large amount of data on the network in a short time: the utilization of the switching equipment, as reported through SNMP at the network monitoring station, reached 100%. In addition, when inspecting the switches on those dates, the maintenance personnel found that none of the switch ports, including the console control ports, were responding. The main reasons for the switch downtime were: 1) a firmware bug in the interlocking RS910 terminal server that stopped the serial port from working; 2) the broadcast suppression function of the switches was not enabled. A detailed analysis follows.

A. Firmware bug in the interlocking RS910 terminal server causing the serial port to stop working
The first symptom of both failures was a communication failure on the west side of K-station; a few hours later, several switches went down. On the 9th, for example, the communication failure on the west side of K-station occurred shortly after 2:00, and the switches at E-station went down at 7:00. The cause is a bug in the terminal server of the interlocking system: its availability degrades after long periods of operation, which shows up as serial communication errors. Worse, it releases a large amount of data onto the network in a short time, which drives switch utilization very high and finally brings the switches down. The RS910 terminal server has a TCP sequence number wraparound bug. The sequence number occupies 4 bytes, so its range is $[0, 2^{32}-1]$, a total of $2^{32}$ values (i.e., 4 294 967 296). After reaching $2^{32}-1$, the next sequence number returns to zero; that is, sequence numbers use modulo $2^{32}$ arithmetic. If the sequence number of an established TCP connection wraps around (the last acknowledged sequence number is a large unsigned value and the next sequence number is a small unsigned value), the receive window is closed permanently. This problem can be regarded as a "serial lock": data can no longer be delivered to the serial port, although data received from the serial port can still be sent to the remote end.
The most basic role of the RS910 terminal server is protocol and physical-line conversion, as shown in Fig. 3. The port that connects the terminal server to the computer interlocking MI is called the serial port, and the port that connects it to the MOXA Layer 3 switch is called the Ethernet port. TCP is used as the transport protocol for the conversion between the serial port and the Ethernet port of the RS910 terminal server. TCP requires a three-way handshake before communication is established. With the wraparound bug, the TCP sequence numbers that start from the handshake wrap around; afterwards the RS910 terminal server still forwards data received from the local interlocking to the remote end normally, but data from the remote end can no longer be delivered to the interlocking MI. The wraparound bug in the RS910 terminal server firmware therefore causes the serial port to stop working.
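The wraparound described above can be made concrete with a short sketch of modulo $2^{32}$ sequence arithmetic. The source does not describe the internals of the RS910 firmware, so `naive_is_newer` is only a hypothetical illustration of how this class of bug can arise, and `modular_is_newer` is one common wrap-safe comparison, not the vendor's actual fix.

```python
SEQ_MODULUS = 2 ** 32  # a 4-byte sequence number has 4 294 967 296 values


def next_seq(seq, length):
    """Advance a sequence number by length bytes, wrapping modulo 2**32."""
    return (seq + length) % SEQ_MODULUS


def naive_is_newer(seq, last_acked):
    """Hypothetical buggy check: plain unsigned comparison, breaks at the wrap."""
    return seq > last_acked


def modular_is_newer(seq, last_acked):
    """Wrap-safe check: seq is ahead of last_acked by less than half the space."""
    distance = (seq - last_acked) % SEQ_MODULUS
    return 0 < distance < SEQ_MODULUS // 2


last_acked = SEQ_MODULUS - 10       # a large unsigned value just before the wrap
seq = next_seq(last_acked, 100)     # a small unsigned value just after the wrap
print(seq, naive_is_newer(seq, last_acked), modular_is_newer(seq, last_acked))
```

With the plain comparison, the small post-wrap value looks older than the large pre-wrap value, so new data would be rejected indefinitely, which matches the permanently closed receive window described in the text.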

B. Broadcast suppression function of the switches not enabled
During both failures, on the 9th and the 11th, the broadcast storm protection option of the switches was disabled, so the switches did not apply their own broadcast storm suppression.
In computer networks, both the data link layer and the network layer use broadcasting: the former delivers broadcast information to multiple physical devices, the latter to multiple logical devices. TCP/IP has two kinds of broadcast: 1) full broadcast, which sends broadcast data to every host; 2) partial broadcast, which sends broadcast information to a particular network or subnet. During broadcasting, a "broadcast storm" may arise among multiple forwarding devices: large numbers of broadcast packets are retransmitted repeatedly, degrading network performance and exhausting bandwidth until the network goes down.
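The two kinds of IP broadcast can be shown with Python's standard `ipaddress` module. This sketch reads "full broadcast" as the IPv4 limited broadcast address 255.255.255.255 and "partial broadcast" as the directed broadcast address of a subnet, which is an interpretation of the text. The subnets below are example values, not addresses from the DTS network, and the helper names are hypothetical.

```python
import ipaddress

# IPv4 limited broadcast address, received by every host on the local segment.
LIMITED_BROADCAST = ipaddress.ip_address("255.255.255.255")


def directed_broadcast(cidr):
    """Return the directed broadcast address of a subnet given in CIDR form."""
    return ipaddress.ip_network(cidr).broadcast_address


def is_broadcast_for(address, cidr):
    """True if address is the limited broadcast or the subnet's directed broadcast."""
    addr = ipaddress.ip_address(address)
    return addr == LIMITED_BROADCAST or addr == directed_broadcast(cidr)


print(directed_broadcast("192.168.10.0/24"))
print(is_broadcast_for("192.168.10.7", "192.168.10.0/24"))
```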
The principle of broadcast storm control is to let each port filter broadcast storms that appear on the network. When the function is turned on, the port automatically discards received broadcast frames once the number of broadcast frames it has received reaches a predetermined threshold. When the function is disabled, or the broadcast frames have not accumulated to the threshold, broadcast frames are forwarded normally to the other ports of the switch.
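The threshold behaviour described above can be sketched as a per-port counter. This is a simplified model that assumes a fixed counting window; the class name, the `threshold` and `window` parameters, and the windowing scheme are illustrative assumptions, and real switches may count and rate-limit broadcast frames differently.

```python
class StormControlPort:
    """Simplified per-port broadcast storm control with a fixed counting window."""

    def __init__(self, threshold, window=1.0, enabled=True):
        self.threshold = threshold    # broadcast frames allowed per window (assumed)
        self.window = window          # window length in seconds (assumed)
        self.enabled = enabled
        self.window_start = 0.0
        self.count = 0

    def accept_broadcast(self, now):
        """Return True to forward a broadcast frame, False to discard it."""
        if not self.enabled:
            return True               # function disabled: forward everything
        if now - self.window_start >= self.window:
            self.window_start = now   # a new window begins, restart the count
            self.count = 0
        self.count += 1
        return self.count <= self.threshold


port = StormControlPort(threshold=3)
decisions = [port.accept_broadcast(t) for t in (0.0, 0.1, 0.2, 0.3, 0.4)]
print(decisions)
```

Frames beyond the threshold inside one window are discarded, and forwarding resumes when the next window starts. With `enabled=False` every frame is forwarded, which corresponds to the switch configuration during the two failures.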
As shown in Fig. 4, assume that host A keeps sending broadcast packets until the network is paralyzed. The broadcast packets sent by host A have to be forwarded by the switch. With storm control enabled, the switch drops host A's broadcast traffic (e.g., DHCP, ARP, RARP, NetBIOS, RIP), so that data traffic from other hosts is forwarded normally and the broadcast packets sent by host A do not affect other devices in the broadcast domain. If broadcast storms are not suppressed, the broadcast packets are flooded to all ports. At the same time, a large number of broadcast replies propagate in the VLAN, causing a broadcast storm.
Most switches now support broadcast storm control. Once this feature is configured, the broadcast packets on each port are kept below a specific ratio, so that bandwidth remains available for other traffic and broadcast storms are suppressed. The PT-7828 switch has a broadcast storm control function, as shown in Fig. 5.

V. REFORM PROGRAM AND RESULTS
For the RUGGEDCOM RS910 TCP sequence number wraparound bug, the firmware of the RS910 terminal server needs to be upgraded.
The broadcast storm protection option is enabled on all switches, so that a switch automatically discards excess broadcast data when a port receives a particularly large amount of it, protecting the security of the operational system.
For some stations, an iConverter photoelectric converter is being tried in place of the RS910 terminal servers and switches to transmit interconnection information between adjacent stations. The network connections of the terminal server and of the iConverter photoelectric converter are compared in Fig. 6. The main function of this communication device is to replace the serial communication device so that adjacent stations can exchange interconnection information.
The iConverter photoelectric converter consists mainly of three parts: a 5-Module Chassis, RS485/422 cards, and an NMM2 photoelectric conversion module. The chassis has two built-in power modules, master and slave. The NMM2 photoelectric conversion module is installed in slot 1 and the RS485/422 cards in slots 2 to 5. Taking station B (middle) as an example, slots 3 and 2 correspond to slots 5 and 4 of station A (west), and slots 5 and 4 correspond to slots 3 and 2 of station C (east). Slot 1 holds the NMM2 module, which is connected to port M1-P4 of the DTS switches.
Using iConverter photoelectric converters in place of the RS910 terminal servers and switches effectively avoids the software protocol conversion performed by the terminal server firmware.
Since the end of December 2013, when the metro adopted the rectification program, broadcast storms in the DTS have been effectively controlled.

VI. EPILOGUE
One line of the Shanghai metro has been extended three times, and the DTS of the existing line has been modified and now uses a variety of new data transmission technologies. The use of new equipment also raises new problems: the sequence number bug in the terminal server firmware stops the serial port from working, which triggers a chain reaction of broadcast storms and switch crashes, interrupting interlocking communication between neighboring stations or even paralyzing the whole ring and affecting normal subway operation. Analyzing the malfunction and developing appropriate corrective measures reduces the probability of similar failures and helps ensure the normal operation of the subway. It also provides maintenance experience for other urban rail transit maintenance units, so that on-site maintenance staff can deal quickly with the same or similar failures and keep their impact on operation to a minimum.

Fig. 1. The network topology of a Shanghai Metro line's Data Transmission System

Fig. 5. Broadcast storm protection

Fig. 6. Comparison diagram of network connection between terminal server and iConverter photoelectric converter

TABLE I. LIST OF COMMUNICATION INTERRUPT FAULTS IN DTS SYSTEM