
IEEE TRANSACTIONS ON INDUSTRIAL ELECTRONICS, VOL. 61, NO. 10, OCTOBER 2014 5527

Adaptive Fault Diagnosis Algorithm for Controller Area Network

Supriya Kelkar, Member, IEEE, and Raj Kamal, Member, IEEE

Abstract—A controller area network (CAN)-based distributed system may develop faults at run-time. These faults need to be detected and diagnosed. This paper proposes a new algorithm named adaptive fault diagnosis algorithm for CAN (AFDCAN). It is designed for low-cost resource-constrained distributed embedded systems. The proposed algorithm detects all faulty nodes on the CAN. It allows new node entry and reentry of repaired faulty nodes during a diagnostic cycle. AFDCAN is found to provide high fault tolerance and to ensure reliable communication. It uses single-channel communication deploying the bus-based standard CAN protocol. A hardware implementation of the proposed algorithm has been used to obtain the results. The results show that the proposed algorithm diagnoses all faults in the system. Analysis of the proposed algorithm proves that the algorithm uses a definite and bounded number of testing rounds and messages to complete one diagnostic cycle.

Index Terms—Adaptive algorithms, automotive applications, controller area network (CAN) protocol, distributed networks, distributed systems, fault diagnosis, real-time systems.

I. INTRODUCTION

Different techniques are used to handle faults in a distributed network. Faults in the network may be due to node failures or connection failures. Faults at nodes can arise due to failures in the processor, in the memory, or in the input–output hardware interfaces. The faults in the network can be at the physical layer, data link layer, application layer, or network management level [1].

The controller area network (CAN) protocol is widely used in real-time industrial and automotive applications. There are many other proprietary as well as standard serial protocols used for automotive multiplexing [2], [3]. CAN handles the errors efficiently at the node level. Error confinement depends upon the behavior of the node when the node is in one of the three states, namely, active state, passive state, or bus-off state [4]. The confinement of faults is a part of the CAN protocol [4]. However, along with capabilities of fault diagnosis and fault confinement at node level, CAN-based distributed embedded

Manuscript received April 6, 2013; revised July 22, 2013 and September 8, 2013; accepted November 14, 2013. Date of publication January 2, 2014; date of current version May 2, 2014. This work was supported by the Cummins College of Engineering for Women, Pune, India, and the Institute of Engineering and Technology, Devi Ahilya University, Indore, India.

S. Kelkar is with the Computer Engineering Department, Cummins College of Engineering for Women, Pune 411052, India (e-mail: [email protected]).

R. Kamal is with the School of Computer Science and Information Technology, Devi Ahilya University, Indore 452017, India (e-mail: [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TIE.2013.2297296

systems need a robust fault diagnosis algorithm at network level. This is to assure a reliable communication to achieve expected functioning of the system. Reliability can be achieved by providing redundancy at node level or at channel level [5]–[7]. However, this additional use of hardware increases the cost. In a distributed embedded system, every node needs to have knowledge about other nodes in the system. In case of master-slave configuration, the functions are distributed among different nodes. Every node in the system in such a situation should be able to detect the faulty nodes in the system and should relay this information on the network. There will be a considerable rise in the number of diagnostic messages if every node tries to detect the health of all other nodes. This, in turn, increases the bus-load and slows down the communication. There is a high probability of denial of bus access to low-priority messages due to the increase in bus-load.

The authors address these issues and propose the adaptive fault diagnosis algorithm for CAN (AFDCAN). Every node in AFDCAN tests other nodes until it detects a first fault-free node. Thus, a node in AFDCAN does not test all other nodes in the system. All of the fault-free nodes together detect the faulty nodes in the system. This reduces the number of diagnostic messages on the bus. Thus, the system diagnostic function is distributed among the nodes. Also, redundancy in terms of extra nodes or channels is not required [5]–[7].

II. EARLIER WORK AND MOTIVATION FOR THE PRESENT WORK

CAN is a single-channel protocol [4]. It does not support a redundant bus for communication among different nodes in the network [5]. Considerable literature is available for CAN to achieve redundancy at channel or media level [5], [6] and at node level [6]. Redundancy at media and node levels is a desired feature for fault handling in safety-critical applications. Time-triggered CAN (TT-CAN) supports channel redundancy with the use of additional hardware [7], [8]. The time-triggered protocol supports node and channel duplication for real-time control systems such as automotive applications [9]. An active star topology called CANcentrate has been proposed for providing a solution to communication failures in CAN-based systems [10], [11]. CANcentrate uses an active hub to connect the CAN-based nodes and prevents propagation of errors from one port to others. However, CANcentrate requires additional hardware, which itself may become faulty.

Automotive applications such as steer-by-wire are driven by actuators [12]. The faulty actuators need to be switched off within a short time span to avoid unwanted system behavior.

0278-0046 © 2014 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.


Time-constrained failure diagnosis uses a subset of the total number of processors in the system to test an actuator independently [12]. The individual test results are exchanged among these processors, and the fault-free processors correctly identify the actuator fault. Here, more processors are required for testing [12]. Another approach for diagnosing faults in automotive applications is based on condensed diagnosis instead of global diagnosis [13]. Condensed diagnosis includes diagnostic information of components, inputs, and signals for a unit or an agent along with a scalar variable representing removed components. Using condensed local diagnosis of different agents, the condensed global diagnosis of the system can be computed. Arogeti et al. [14] have proposed a new fault detection and isolation technique for an electrohydraulic steering system of an electric vehicle. They have proposed a hybrid bond graph modeling technique and have derived the global analytical redundancy relations [14]. Vong et al. [15] have proposed a new framework for simultaneous-fault diagnosis. They have used pairwise probabilistic multilevel classification based on time-dependent patterns and have applied the framework to diagnose automotive engine-ignition systems. The effects of faults on the estimation of schedulability and reliability are illustrated in [16]. The authors of [16] have applied the outcome of the analysis to a distributed antilock braking system of an automobile.

Networked control systems (NCSs) are used in applications such as automotive, manufacturing processes, etc. [17]. Jiang et al. have considered the Takagi-Sugeno fuzzy model for the NCSs and have proposed H∞ filter design with respect to transfer delays and packet loss [17]. Yang et al. [18] have modeled the NCSs as nonlinear switched systems and have addressed the problem of system stabilization. Their results have been applied to multiagent systems with leader-follower structure, especially when the follower agent may not cooperate with the leader and run away from a target or a goal [18].

The data-driven approach is used for fault diagnosis in the automotive domain, in manufacturing systems, and in process control applications. Faults are detected and isolated using a multivariate statistical technique, namely, principal component analysis (PCA), and pattern classifier techniques such as support vector machines, probabilistic neural network, and K-nearest neighbor [19]. PCA and independent component analysis (ICA) are used together to detect the faults in operating parameter identifiers which are collected through various sensors and diagnostic routines situated in the electronic control units (ECUs) [20].

ICA is better suited than PCA for fault detection in automotive applications. ICA is more effective for automotive subsystem data of non-Gaussian nature [21]. Although PCA is used for fault isolation, it is not well suited for task discrimination [22], [23]. PCA shows limited discrimination or classification when data for both normal operation and fault operation are available [23]. Among correspondence analysis (CA) and PCA, better results are observed with the former [22], [24].

Kelkar and Kamal conclude that these data-driven methods are useful in cases where the system is multivariate and the data set available for fault detection is very large. Techniques for multivariate statistical analysis such as PCA tend to reduce the number of variables in a data set, thus limiting the dimensions of the data. Statistical techniques such as PCA, ICA, and CA are applied on normal or fault diagnosis data acquired through data loggers. Analyses using these techniques are carried out outside the actual control units or systems.

TABLE I. COMPARISON OF DISTRIBUTED ALGORITHMS

A number of algorithms have been suggested in the area of fault diagnosis in distributed networks. Table I shows the comparison of some of these algorithms [25]–[27]. A distributed system-level diagnosis algorithm for arbitrary networks such as point-to-point, broadcast, or a combination of both has been proposed [28]. Here, a fault-free node is made responsible to get tested by another fault-free node. The tests are conducted periodically, and the node transmits diagnostic information to all of its neighbors about detection of a new node or a repaired faulty node. This information is received simultaneously by all of the neighbors. A hierarchical adaptive distributed system-level diagnosis algorithm (Hi-ADSD) uses clusters containing nodes, where the number of nodes in a cluster is always a power of two [29]. Here, a node can get tested by more than one node, and tests are conducted asynchronously. The number of test rounds required for Hi-ADSD [29] is less than that for adaptive DSD [27]. There is also an improvement in diagnosis latency [29].

The comparison-based model is another approach for fault diagnosis in multiprocessor systems. In this model, redundant tasks are executed at two processors, and the outcomes of the tasks are sent to a central observer which, in turn, compares the outcomes to find the faulty processor [30], [31]. In a variant of this model, the processors themselves, instead of the central observer, compare the outcomes of the tasks executed by two other processors [32]. In the generalized comparison model (GCM), the comparison is done by one of the two processors being tested [33].

The outcomes are also sent to the central observer, even though the processors themselves perform the comparison [32], [33]. The outcomes of the tasks executed by two processors are also sent to all other processors in the broadcast comparison model [34]. The artificial neural network (ANN)-based GCM employs backpropagation neural networks to diagnose the faults in multiprocessor or multicomputer systems [35]. These ANN-based techniques require considerable memory and processing power.

Fig. 1. System under consideration for fault diagnosis.

The SAE J1939 protocol uses the CAN bus at the physical layer. This protocol is widely used in automotive applications and also supports extensive diagnostic features. SAE J1979 is the SAE guideline for On-Board-Diagnostics-II (OBD-II) [36]. Specific online and offline diagnostic features for different categories of automobiles are available in SAE documents.

While working with distributed embedded systems, the authors felt a need to incorporate an effective fault detection technique in these systems. For example, in process control applications, some nodes or microcontroller-based systems may fail during operation. Such information of failed nodes needs to be known to all other nodes in the network. For this purpose, the methodologies used in distributed computer systems [25]–[29] have been found suitable. These methods have been used for hybrid or star networks where fault detection algorithms are mostly adaptive and parallel.

An adaptive algorithm does not use a fixed testing scheme but adapts to the current fault situation in the system, and the nodes are tested accordingly. In distributed computer systems, diagnostic tests are carried out by different nodes simultaneously and hence are termed parallel. A fault detection method using parallel diagnostic tests would load the shared bus heavily in distributed embedded systems which use bus topology. Hence, in this paper, the authors have proposed an adaptive sequential algorithm for fault detection in bus-based distributed embedded systems, named AFDCAN. It is an adaptive fault detection algorithm, which resides inside the nodes along with the applications and detects faults in the system as and when they occur.

AFDCAN uses a distributed network approach for detecting faults. AFDCAN is designed for CAN-based embedded distributed networks which specifically use bus topology. This algorithm needs comparatively less memory, and the number of messages required for a diagnostic cycle is optimized. AFDCAN can be adapted for other automotive protocols such as local interconnect network and SAE J1939 with a few modifications. In the present work, focus is on finding the faulty nodes in the network and, in turn, informing the statuses of faulty nodes to all of the healthy nodes. As mentioned earlier, different approaches have been used for fault diagnosis in CAN-based systems. These approaches include additional use of hardware in case of CANcentrate [10], specific domains like automotive using SAE J1979 [36], time-constrained fault diagnosis [12], condensed diagnosis [13], generic distributed multicomputer systems [25]–[29], etc. To the best of the authors’ knowledge, specifically in the CAN protocol, fault detection techniques for a bus topology have not been proposed earlier.

The remainder of this paper is organized as follows. Section III proposes the new AFDCAN algorithm. Section IV provides its modeling and analysis. Section V proposes further modifications to AFDCAN. Comparison of AFDCAN with other distributed diagnosis algorithms is presented in Section VI. Section VII presents important experimental results for different fault cases when implemented using the hardware. This is followed by Section VIII giving the conclusion. The mathematical representation of the AFDCAN algorithm is presented in the Appendix.

III. AFDCAN

A. System Details, Assumptions, and Fault Model for AFDCAN

The AFDCAN algorithm uses a single bidirectional channel and bus-based CAN topology. Let V be the set containing n nodes in the system, where N1 is the first node, N2 is the second node, etc. Then

$$ V = \{N_1, N_2, N_3, N_4, \ldots, N_n\}. \tag{1} $$

The following assumptions are considered for fault conditions in AFDCAN.

1) The node is assumed faulty when it stops functioning.

2) There can be one or more faulty nodes in the system, where faults are bounded.

3) The number of faulty nodes in the system is bounded by (n − 2), where n is the total number of nodes in the system.

4) A node can become faulty any time during the diagnostic cycle. This fault condition of the node may be temporary or permanent.

5) Fluctuations or changes in the status of a node are not detected during the same diagnostic cycle if that node has already been tested. However, the change in status of that node will be detected in the next diagnostic cycle.

6) Every node is tested by only one other node in a diagnostic cycle.

Fig. 1 shows a generic system utilizing the AFDCAN algorithm for fault diagnosis. AFDCAN uses the following fault model for the CAN network. The fault model defines the outcome of the test, i.e., response from the fault-free node or from the faulty node when the test message is sent by another fault-free node.


Fig. 2. Format of the buffer at every node.

Fig. 3. Test/result/second result/broadcast frame.

Let t(ni, nj) be the test performed by node ni on node nj. Then, the result Re of t(ni, nj) will be

Re = 0, if ni is fault free and nj is faulty;

Re = 1, if both ni and nj are fault free;

ni cannot test nj and evaluate nj as fault free if ni itself is faulty.
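The fault model above can be written as a small function. The following is a minimal Python sketch, not the authors' implementation; the function name `outcome_re` and the use of `None` to mean "no test took place" are illustrative assumptions.

```python
from typing import Optional


def outcome_re(tester_fault_free: bool, testee_fault_free: bool) -> Optional[int]:
    """Return the result Re of the test t(ni, nj) under the fault model.

    Returns None when the tester ni is itself faulty: a faulty node has
    stopped functioning, so it performs no test (illustrative convention).
    """
    if not tester_fault_free:
        return None
    return 1 if testee_fault_free else 0
```

For example, `outcome_re(True, False)` yields 0, which corresponds to a fault-free tester that receives no response from a faulty testee.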

In the present algorithm, the faulty node stops functioning, and therefore, it is unable to communicate with the other nodes.

B. Details of the AFDCAN Algorithm

A node continues to diagnose until it finds a fault-free node. This approach is similar to the adaptive DSD algorithm proposed for distributed systems [27].

Each node is assigned a unique node identification (NID). Each node maintains a buffer (NBUFF) containing n fields, where n is the total number of nodes in the system. The NBUFF of each node consists of individual fields containing node ID and state for all of the nodes in the system. The state indicates whether the node is faulty, is fault free, or is in an undefined state. Fig. 2 shows NBUFF of a node in a system containing five nodes. A fault-free node in the network initiates the AFDCAN algorithm by sending the test frame.
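As an illustration of the buffer described above, the following Python sketch models NBUFF as a mapping from NID to state. The names `NodeState` and `make_nbuff` and the numeric state encodings are assumptions made for this sketch; the actual field layout is the one given in Fig. 2.

```python
from enum import Enum
from typing import Dict


class NodeState(Enum):
    # Numeric encodings are assumptions; the paper's layout is in Fig. 2.
    UNDEFINED = 0
    FAULT_FREE = 1
    FAULTY = 2


def make_nbuff(n: int) -> Dict[int, NodeState]:
    """Create an NBUFF with one field per node, keyed by NID 1..n."""
    return {nid: NodeState.UNDEFINED for nid in range(1, n + 1)}


nbuff = make_nbuff(5)              # system with five nodes
nbuff[1] = NodeState.FAULT_FREE    # e.g., a node that answered a test
nbuff[2] = NodeState.FAULTY        # e.g., a testee that timed out
```

Every node would hold its own copy of such a buffer, and the broadcast frame at the end of a diagnostic cycle brings the copies of all fault-free nodes into agreement.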

The format for different types of frames used in AFDCAN is shown in Fig. 3. The data field of the CAN data frame is allocated 8 B. Fig. 3 shows these 8 B and their use in AFDCAN.

A node can be a tester which tests the other nodes or it can be a testee which gets tested by any one fault-free node. While sending the test frame, the NID of the tester becomes the source node identification (SID), and the NID of the testee becomes the destination node identification (DID).

Any node that does not receive a test frame as a testee within a specific time becomes the tester and initiates the diagnosis process. The testee receives the test frame if it is not faulty. It marks the tester as fault free in its buffer. Also, it updates its buffer with the diagnostic information of all of the nodes preceding the tester. This status information is available in the test frame sent by the tester. Then, the testee will send the result frame to the tester. After receiving the result frame from the testee within timeout, the tester reads the diagnostic information from the result frame. As read from the result frame sent by the testee, the tester updates its buffer, marking its own status as fault free. Also, the tester marks the testee as fault free in its buffer. This communication is called “test round” between two nodes. This is shown in Fig. 4.

The aforementioned test rounds continue until the last node in the system is tested. The last fault-free node sends the second result frame to the earlier fault-free node after all of the test rounds are completed. After the reception of the second result frame, the (m − 1)th fault-free node transmits the broadcast frame to all of the nodes, which contains the complete diagnosis report of the network. Here, m ≤ n, with n being the total number of nodes in the system and the mth node being the last fault-free node in the system. This is the end of one diagnostic cycle as shown in Fig. 4. These diagnostic cycles are periodic in nature.

In Fig. 5, as discussed earlier, the “test round” will be completed between N1 and N2, N2 and N3, and N3 and N4. However, when N4 sends the test frame to N5, node N5, being faulty, does not respond and thus does not send the result frame to N4. Therefore, N4 marks N5 as faulty in its buffer and then sends the second result frame to N3. Now, N3 will update its buffer for the final result and will send the broadcast frame to all of the nodes so that all of the fault-free nodes will update their buffers with the complete diagnostic information of the system. This is the reason why the second result frame is transmitted in the AFDCAN algorithm.
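The frame sequence of a diagnostic cycle described above can be traced with a short simulation. The following Python sketch is a simplified model under stated assumptions, not the authors' firmware: the function name `run_cycle`, the string labels, and the tuple layout are hypothetical; timeouts are modeled as an immediate "no response"; the broadcast is logged as a single bus frame; and nodes ahead of the first tester are assumed to be recorded as faulty because they never sent a test frame.

```python
from typing import Dict, List, Optional, Set, Tuple

Frame = Tuple[str, int, Optional[int]]  # (type, source NID, destination NID)


def run_cycle(n: int, faulty: Set[int]) -> Tuple[Dict[int, str], List[Frame]]:
    """Trace one diagnostic cycle; return the diagnosis and the frame log."""
    healthy = [nid for nid in range(1, n + 1) if nid not in faulty]
    if len(healthy) < 2:
        raise ValueError("the sketch needs at least two fault-free nodes")

    log: List[Frame] = []
    tester = healthy[0]  # first node that times out waiting for a test frame
    # Assumption: nodes ahead of the first tester are recorded as faulty.
    status = {nid: "faulty" for nid in range(1, tester)}
    status[tester] = "fault-free"
    previous: Optional[int] = None

    for testee in range(tester + 1, n + 1):
        log.append(("T", tester, testee))      # test frame
        if testee in faulty:
            status[testee] = "faulty"          # no result frame before timeout
            continue
        log.append(("R", testee, tester))      # result frame ends the test round
        status[testee] = "fault-free"
        previous, tester = tester, testee      # the testee tests the next node

    log.append(("2R", tester, previous))       # second result frame
    log.append(("B", previous, None))          # broadcast of the full diagnosis
    return status, log


status, log = run_cycle(5, {5})  # the case of Fig. 5: node N5 is faulty
```

In this run, the last two logged frames are the second result frame from N4 to N3 and the broadcast from N3, matching the sequence described for Fig. 5.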

C. Entry of New Nodes and Reentry of Faulty Nodes After Repairs

AFDCAN supports new node entry and its participation in the fault diagnostic cycle.

When a new node is powered up, it needs to listen to the bus. The new node sends the new node entry frame after the detection of a result frame on the bus. All of the nodes which are part of the diagnostic cycle receive the new node entry frame. These nodes recognize the arrival of the new node and allocate a field in their buffer for this node. The allocated field in the buffer is then updated with the node ID and state of the new node. Thus, the new node starts participating from the next diagnostic cycle.
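The buffer update performed by the existing nodes on reception of an entry frame can be sketched as follows. This is a minimal Python illustration; the function name `handle_entry_frame`, the dictionary-based buffer, and the state labels are assumptions of the sketch, and the parsing of the entry frame itself (Fig. 6) is not modeled.

```python
from typing import Dict


def handle_entry_frame(nbuff: Dict[int, str], new_nid: int, state: str) -> Dict[int, str]:
    """Allocate (or refresh) the buffer field for an entering node.

    `state` is whatever state the receiving node records for the new node;
    how it is carried in the entry frame is not modeled here.
    """
    nbuff[new_nid] = state
    return nbuff


# A four-node system receives an entry frame from a new node with NID 5.
nbuff = {1: "fault-free", 2: "fault-free", 3: "fault-free", 4: "fault-free"}
handle_entry_frame(nbuff, 5, "fault-free")
```

After the call, the buffer has five fields, so the new node can be addressed as a testee from the next diagnostic cycle onward.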

Fig. 6 shows the format of the entry frame for a new node. Faulty nodes can also reenter the network and participate in the diagnostic cycle after they are repaired. A faulty node can join the current or ongoing diagnostic cycle if its tester has not yet sent the test frame to it. Otherwise, it starts participating from the next diagnostic cycle. The number of faulty nodes is bounded by (n − 2).

In the Appendix, Section A presents the algorithm for a test round between the tester and the testee, Section B presents the algorithms for different timeout conditions, and the algorithm for the new node entry is presented in Section C.

IV. MODELING AND ANALYSIS OF THE AFDCAN ALGORITHM

The system considered by AFDCAN is represented by a directed multigraph G, where G = (V, E) consists of a set V of vertices, a set E of edges, and a function f from E to {(u, v) | u, v ∈ V}. The edges e1 and e2 are multiple edges if f(e1) = f(e2) [37].

Fig. 4. Complete diagnostic cycle when all of the nodes are fault free.

Fig. 5. Complete diagnostic cycle when node N5 is faulty.

Fig. 6. Entry frame for a new node.

In the graph, the vertices represent the nodes, and the edges represent the communication paths. These communication paths are unidirectional. Fig. 7 is the graphical representation of the system under consideration for a case where there are no faulty nodes. “T” represents the test frame sent by the node, “R”, the result frame sent by the node, “2R”, the second result frame sent by the node, and “B”, the broadcast frame sent by the node. It may be noted that these edges are directed outward from the nodes.

Consider nodes N3 and N4 shown in Fig. 7. As there are two edges, namely, result and broadcast, directed from N4 to N3, one can say that (N4, N3) is an edge of G(V, E), with f(e1) = f(e2) = (N4, N3). Similarly, for nodes N4 and N5, it can be proved that (N5, N4) is an edge of G(V, E), with f(e1) = f(e2) = (N5, N4). Thus, it is proved that Fig. 7 represents a multigraph. Fig. 8 represents the multigraph of the system with fault condition.
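The multigraph argument can be checked mechanically. The following Python sketch builds the labeled edge list for the fault-free five-node case from the frame sequence described in Section III-B and reports the ordered node pairs that carry more than one edge. The edge list is reconstructed from the text rather than copied from Fig. 7, and the helper name `multiple_edges` is hypothetical.

```python
from collections import Counter
from typing import List, Set, Tuple

Edge = Tuple[int, int, str]  # (source NID, destination NID, frame label)

n = 5
edges: List[Edge] = []
for i in range(1, n):
    edges.append((i, i + 1, "T"))   # test frame from Ni to N(i+1)
    edges.append((i + 1, i, "R"))   # result frame back to the tester
edges.append((n, n - 1, "2R"))      # second result from the last node
# Broadcast from the (m - 1)th fault-free node to every other node.
edges += [(n - 1, j, "B") for j in range(1, n + 1) if j != n - 1]


def multiple_edges(edge_list: List[Edge]) -> Set[Tuple[int, int]]:
    """Return the ordered pairs (u, v) that carry more than one edge."""
    counts = Counter((u, v) for u, v, _ in edge_list)
    return {pair for pair, count in counts.items() if count > 1}


pairs = multiple_edges(edges)
```

The pairs (4, 3) and (5, 4) are among those reported, which corresponds to the result and broadcast edges from N4 to N3 and the result and second result edges from N5 to N4 discussed above.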

The system under consideration has been proved to be a planar graph for all fault conditions, with the number of faulty nodes bounded by (n − 2). That means that the system with five nodes is a planar graph with the maximum allowed number of faulty nodes being three.

Fig. 7. System with no faulty node.

Fig. 8. System with node N2 faulty.


Fig. 9. Adjacency matrix for out-degree of nodes, when node N2 is faulty.

The following discussion provides the proofs for a lemma and two theorems.

Lemma: There is a directed path from any fault-free node to any other fault-free node. This leads to the following theorems.

Theorem 1: The number of test rounds for one complete diagnostic cycle is definite and bounded.

Theorem 2: The number of diagnostic messages required for a complete diagnostic cycle is definite and bounded.

The proofs for the aforementioned lemma and theorems are given in Sections IV-A, IV-B, and IV-C, respectively.

A. Proof for the Lemma

Fig. 9 shows the adjacency matrix N for out-degree of nodes N1, N2, N3, N4, and N5. N represents the case where node N2 is faulty as shown in Fig. 8. In an adjacency matrix N = [aij],

$$ a_{ij} = \begin{cases} 1, & \text{if there is an edge from } N_i \text{ to } N_j \\ 0, & \text{otherwise} \end{cases} $$

where i, j = 1, 2, ..., 5 [37].

For Fig. 8, when aij = 0, no directed path exists from Ni to Nj, and when aij = 1, a directed path exists.

The rows of the adjacency matrix represent nodes N1, N2, N3, N4, and N5, respectively. In the adjacency matrix shown in Fig. 9, an element with a value of 1 indicates that a directed path exists, and an element with a value of 2 indicates that more than one path exists between the nodes. When all of the elements in the ith row are zero, node Ni is faulty. The second row (row N2) of Fig. 9 contains all zeros, indicating that N2 is a faulty node. The following are some of the observations regarding the fault conditions in the system after analyzing the adjacency matrix (Fig. 9).

1) If a particular row contains all of the elements as zero, the corresponding node is a faulty node.

2) The elements above the main diagonal of the adjacency matrix provide the information about the directed path between all of the fault-free nodes of the system.
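Observation 1 translates directly into a row scan. In the Python sketch below, the matrix `example` is an illustrative reconstruction for the case where N2 is faulty, derived from the frame flow described in Section III-B under the assumption that broadcast edges are drawn only to fault-free nodes; it is not a copy of Fig. 9. The function name `faulty_nodes` is hypothetical.

```python
from typing import List


def faulty_nodes(matrix: List[List[int]]) -> List[int]:
    """Return the 1-based indices of nodes whose out-edge row is all zeros."""
    return [i + 1 for i, row in enumerate(matrix) if not any(row)]


# Entry [i][j] counts the edges from N(i+1) to N(j+1); 2 marks multiple edges.
example = [
    [0, 1, 1, 0, 0],  # N1: test to N2 (unanswered) and test to N3
    [0, 0, 0, 0, 0],  # N2: faulty, sends nothing
    [1, 0, 0, 1, 0],  # N3: result to N1, test to N4
    [1, 0, 2, 0, 2],  # N4: result and broadcast to N3, test and broadcast to N5
    [0, 0, 0, 2, 0],  # N5: result and second result to N4
]
```

Calling `faulty_nodes(example)` returns the list containing only node 2, in line with the all-zero second row described for Fig. 9.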

B. Proof for Theorem 1

One test round in AFDCAN consists of two messages between any two fault-free nodes. They are the test message and the result message.

With n being the total number of nodes in the system, the number of test rounds tR will be (n − 1) for the fault-free system.

If there are f faulty nodes in the system, then

$$ t_R = n - 1 - f. \tag{2} $$

There is an additional result message sent from the mth fault-free node to the (m − 1)th fault-free node, where m ≤ n and the mth node is the last fault-free node in the system with f faulty nodes. This constitutes half of a test round, i.e., one message of a pair. Also, the (m − 1)th fault-free node broadcasts the result which contains the complete diagnosis report of the network.

For the fault-free system, the number of broadcasting messages will constitute {(n − 1)/2} test rounds, where (n − 1) is the total number of broadcasted messages from the (m − 1)th fault-free node. Also, there are test messages sent to f faulty nodes, where these nodes do not respond and thus do not complete the test round. This constitutes {(n − 2)/2} test rounds, where (n − 2) is the maximum number of faulty nodes. Thus, there is a need to consider all of the aforementioned additional communication edges as part of the diagnostic cycle. These communication edges will also contribute to (2).

Let ce be the total of the aforementioned additional communication edges. With these edges included, tR is called tRPair, which represents the total number of pairs of messages inclusive of all of the communication edges for a single diagnostic cycle in a system. The maximum number of pairs of messages in one complete diagnostic cycle of the system is represented by tRPair(max).

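Therefore

$$ t_{R\mathrm{Pair(max)}} = n - 1 - f + c_e \tag{3} $$

where

$$ c_e = \left[ \left( \frac{n - 1}{2} \right) + \frac{1}{2} + \left( \frac{n - 2}{2} \right) \right] $$

and ce is the total of all of the additional communication edges as described previously.

Therefore

$$ t_{R\mathrm{Pair(max)}} = 2n - f - 2. \tag{4} $$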

Consider the system consisting of five nodes, with three nodes detected as faulty. In this case, as per (4), there will be a maximum of five pairs of messages (tRPair(max)) in one complete diagnostic cycle. Equation (4) satisfies all of the 26 fault conditions for the system with five nodes.
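The step from (3) to (4) and the five-node example can be written out explicitly. The following worked derivation uses only the quantities defined above; it is included as a reading aid.

```latex
\begin{align*}
c_e &= \frac{n-1}{2} + \frac{1}{2} + \frac{n-2}{2}
     = \frac{2n-2}{2} = n - 1, \\
t_{R\mathrm{Pair(max)}} &= (n - 1 - f) + c_e
     = (n - 1 - f) + (n - 1) = 2n - f - 2, \\
n = 5,\ f = 3:\quad t_{R\mathrm{Pair(max)}} &= 2(5) - 3 - 2 = 5.
\end{align*}
```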

C. Proof for Theorem 2

Let tm be the total number of test messages sent, r be the total number of result messages sent, and (n − 1) be the total number of broadcasted messages.

Then, the total number of messages (ms) in one complete diagnostic cycle is

$$ m_s = t_m + r + (n - 1). \tag{5} $$

If there are f faulty nodes in the system, then

$$ r = n - f. \tag{6} $$

Therefore

$$ m_s = t_m + 2n - f - 1. \tag{7} $$

Equation (7) provides the exact number of messages required for one complete diagnostic cycle for the fault-free or fault condition of the system. The value of ms can also be obtained from the arrow diagram. The arrow diagram shown in Fig. 10 represents the condition where nodes N1 and N3 are faulty. The in-degree din(G) gives the exact number of messages for a complete diagnostic cycle. For the system with five nodes, the values of ms can be obtained for all of the 26 fault conditions using arrow diagrams similar to Fig. 10.

Fig. 10. Arrow diagram for the system with nodes N1 and N3 faulty.

Let msmax be the upper bound for the number of messages in one complete diagnostic cycle. It is known from earlier discussion that

$$ t_m \le (n - 1). \tag{8} $$

Thus, by adding (2n − f − 1) on both sides of (8), we get

$$ m_{s\max} = 3n - f - 2. \tag{9} $$

Thus, the total number of messages in one complete diagnostic cycle is bounded and also definite.
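The closed-form expressions (4), (7), and (9) are easy to evaluate for a given configuration. The following Python sketch wraps them in small functions; the function names are hypothetical, and the range check on f reflects the (n − 2) bound assumed in Section III-A.

```python
def _check(n: int, f: int) -> None:
    """Validate the fault bound 0 <= f <= n - 2 assumed by AFDCAN."""
    if n < 2 or not 0 <= f <= n - 2:
        raise ValueError("require n >= 2 and 0 <= f <= n - 2")


def t_r_pair_max(n: int, f: int) -> int:
    """Maximum number of message pairs per diagnostic cycle, as in (4)."""
    _check(n, f)
    return 2 * n - f - 2


def ms_exact(tm: int, n: int, f: int) -> int:
    """Number of messages per cycle for tm test messages, as in (7)."""
    _check(n, f)
    return tm + 2 * n - f - 1


def ms_max(n: int, f: int) -> int:
    """Upper bound on messages per cycle, using tm <= n - 1, as in (9)."""
    _check(n, f)
    return 3 * n - f - 2
```

For the fault-free five-node system, where tm = n − 1 = 4, `ms_exact(4, 5, 0)` and `ms_max(5, 0)` both evaluate to 13, so the bound is met with equality in that case.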

V. PROPOSED MODIFICATIONS IN AFDCAN

The diagnostic cycle in AFDCAN is periodic, and the algorithm checks each and every node during the diagnostic cycle. A node is tested again even if it is detected as faulty in the earlier cycle. This technique increases the time required for the completion of a diagnostic cycle, as the earlier node waits for the response from the next faulty node until timeout. A solution to this additional latency would be to avoid testing the faulty nodes in the next diagnostic cycles. A fault-free node can check its buffer to identify faulty nodes, if any, for the currently completed diagnostic cycle. During the next diagnostic cycle, a fault-free node can avoid testing these faulty nodes. Thus, the diagnostic time can be reduced. This will improve the performance of AFDCAN. In such a case, the faulty node after repair can enter into the diagnostic cycle by sending the entry frame for the repaired faulty node.

The entry frame for the repaired faulty node will contain SID specifying the node ID, DID as zero, and frame type equal to “0101.” The entry frame for the repaired faulty node will be similar to the entry frame for the new node (Fig. 6) but with the frame type bits as “0101.” Also, the buffer writing at every node can be avoided if the current fault status of a tested node is the same as that found in the buffer.

Presently, the number of faulty nodes in the system is bounded by (n − 2). This bound can be made equal to (n − 1) with minor changes in the algorithm. Thus, AFDCAN can operate even if there is only one fault-free node in the system.

TABLE II. COMPARISON OF DISTRIBUTED FAULT DIAGNOSIS ALGORITHMS

VI. COMPARISON OF AFDCAN WITH OTHER FAULT DIAGNOSIS ALGORITHMS FOR DISTRIBUTED NETWORKS

The AFDCAN algorithm is based on the CAN protocol. In this section, the authors compare AFDCAN with other algorithms [25]–[29] for fault diagnosis in distributed networks. These algorithms [25]–[29] are discussed in Section II.

AFDCAN is a fault diagnosis algorithm for distributed embedded systems, whereas the algorithms in [25]–[29] have been implemented on Ethernet-based distributed computer systems. AFDCAN is an adaptive fault diagnosis algorithm, and the algorithms in [25]–[29] also are of adaptive type. The algorithms in [25]–[29] are executed in parallel by the computer systems, whereas the tests in AFDCAN are executed sequentially. However, in AFDCAN, the transmission of the final result to all of the nodes is done at the same time, which means “in parallel.”

The fault model used in AFDCAN is asymmetric as discussed in Section III-A. Presently, in AFDCAN, the faulty node does not respond to the test message sent by the other node and does not understand the final result. This is because AFDCAN considers the faulty node as “ceased to function,” as in the fail-stop model discussed in [28]. The fault model is symmetric [27] for algorithms described in [25]–[27], and [29]. A comparison of AFDCAN and other algorithms is presented in Table II.

VII. RESULTS AND DISCUSSIONS

A. System With Five Nodes

The AFDCAN algorithm has been implemented using five embedded hardware units based on the Renesas M16C/6N group microcontroller [38]. The baud rate used for CAN data transfer is 125 kb/s. All 26 conditions for fault diagnosis of the system containing five nodes are considered and verified. Figs. 11 and 12 show two fault conditions detected by AFDCAN. The marked area in each of these figures indicates one complete diagnostic cycle. Lines 1 and 2 of the marked areas show one test round between two nodes. The different cases which make up 26 conditions for fault diagnosis in the system with five nodes are as follows: 1) no faulty node in the system; 2) one faulty node in the system; 3) two faulty nodes in the system; and 4) three faulty nodes in the system.

Fig. 11. All of the nodes are fault-free.

Fig. 12. Nodes N2, N3, and N4 are faulty.

The bus activity is captured using Vector CANalyzer, and the results have been verified.
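The count of 26 fault conditions can be checked with a short calculation. The sketch below assumes that each fault condition corresponds to one distinct set of faulty nodes and that at most (n − 2) nodes are faulty, which is the bound stated for the present algorithm. The function name `fault_conditions` is introduced here for illustration only.

```python
from math import comb

def fault_conditions(n: int, max_faulty: int) -> int:
    """Count the distinct sets of faulty nodes of size 0..max_faulty among n nodes."""
    return sum(comb(n, f) for f in range(max_faulty + 1))

n = 5
# One entry per case: zero, one, two, and three faulty nodes.
per_case = [comb(n, f) for f in range(n - 2 + 1)]
print(per_case)                    # [1, 5, 10, 10]
print(fault_conditions(n, n - 2))  # 26
```

Under these assumptions, the four cases listed above contribute 1, 5, 10, and 10 conditions, respectively.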

B. Entry of New Nodes and Reentry of Faulty Nodes

Entry of a new node and reentry of repaired faulty nodes are also verified experimentally for AFDCAN, as shown in Figs. 13–16. A system with four nodes N1, N2, N3, and N4 is considered. Node N5 is used to test the entry of a new node. Fig. 13 shows the entry of N5, and Fig. 14 shows N5 becoming part of the diagnostic cycle. Node N5 is then treated as a faulty node for testing the reentry of a repaired faulty node. N5 is powered off to indicate the fault condition (Fig. 15), and later, N5 is powered on again to indicate the reentry in the system (Fig. 16).

Fig. 13. New node N5 entry.

Fig. 14. New node N5 is part of the diagnostic cycle.

C. Diagnostic Cycle Time of AFDCAN

AFDCAN is executed on the hardware as explained earlier in Section VII-A at 125-kb/s baud rate, with all of the five nodes being fault free. The bus activity of the complete diagnostic cycle captured on the CAN bus is shown in Fig. 11 as the marked area. The time taken for completion of one diagnostic cycle of Fig. 11 is 560 ms.


Fig. 15. Faulty node N5 is powered off.

Fig. 16. Faulty node N5 reenters the diagnostic cycle after repair.

The authors have further improved this diagnostic cycle time at the same baud rate without changing the AFDCAN algorithm. The improved time for completion of one diagnostic cycle is 526 ms, as shown in the marked area of Fig. 17.

When the AFDCAN algorithm is executed at 500-kb/s baud rate, the diagnostic cycle time is found to be 526 ms. This diagnostic cycle time is the same as when AFDCAN is executed at 125 kb/s. As mentioned earlier, while implementing AFDCAN, the authors have made use of the local timer module of the microcontroller at every node. In order to achieve synchronization between all of the nodes in the 26 fault conditions, it was required to provide a time window for checking the reception of different AFDCAN frames at every node. Therefore, the diagnostic cycle time depends on the timer at every node and is not affected by the baud rate of the CAN bus.

Fig. 17. Improved diagnostic cycle time when all of the five nodes are fault free.

D. Support for Large Number of Nodes in AFDCAN

The CAN data frame has 64 b (8 B) of data field along with other fields. The data field is used for sending information on the CAN bus. Out of 8 B of data field, 2 B are required for specifying SID and DID, respectively (Fig. 3). Also, 4 b are required for specifying the frame type (Fig. 3). Therefore, the remaining 44 b can be used for holding diagnostic data of the nodes present in the system. Two bits per node are required to represent the state of the node (Fig. 3) in AFDCAN. Hence, AFDCAN supports 22 nodes in the system, including entries of new nodes. Thus, a large number of nodes can be part of the fault diagnosis system. This feature is useful for both automotive and industrial applications.
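The bit budget behind the 22-node limit can be written out as a short calculation. The sketch below uses only the field sizes quoted above; the constant and function names are chosen here for illustration and do not come from the AFDCAN implementation.

```python
DATA_FIELD_BITS = 8 * 8      # 8-byte CAN data field
ADDRESS_BITS = 2 * 8         # SID and DID, one byte each
FRAME_TYPE_BITS = 4          # frame type field
STATE_BITS_PER_NODE = 2      # node state encoding

def max_supported_nodes(data_bits=DATA_FIELD_BITS,
                        address_bits=ADDRESS_BITS,
                        type_bits=FRAME_TYPE_BITS,
                        state_bits=STATE_BITS_PER_NODE):
    """Return how many node states fit in the bits left after SID, DID, and frame type."""
    remaining = data_bits - address_bits - type_bits
    return remaining // state_bits

print(DATA_FIELD_BITS - ADDRESS_BITS - FRAME_TYPE_BITS)  # 44
print(max_supported_nodes())                              # 22
```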

VIII. CONCLUSION

The AFDCAN algorithm uses a definite number of test rounds and sends a definite number of messages to find the fault conditions in the CAN-based distributed embedded system. Therefore, AFDCAN uses a definite bandwidth, based on the total number of nodes in the system. The number of test rounds and the number of messages decrease with the increase in the number of faulty nodes.

AFDCAN also supports the entry of new nodes and reentry of repaired faulty nodes, as demonstrated in Section VII-B. The failure of the response is detected by the testee or tester with the help of a timeout.

Thus, AFDCAN uses definite time for fault diagnosis of the system. The improved diagnostic cycle time of AFDCAN is 526 ms, when all of the five nodes are fault free.

Looking further, synchronization of timings at different nodes in the system is required for better performance of AFDCAN. Also, the time taken by the AFDCAN diagnostic cycle can be reduced by implementing the primary message generator [6]. The improvements proposed in Section V may also be considered.


TT-CAN is a higher layer protocol above the standard CAN. Standard CAN messages are sent in TT-CAN [7]. These messages are transmitted at a specific time slot, and thus, they do not compete with other CAN messages for bus access. Communication schedules of all CAN nodes are synchronized by TT-CAN for a network. By designing the schedules of different CAN nodes for different AFDCAN diagnostic messages, AFDCAN may be adopted for TT-CAN.

AFDCAN may become part of CAN-based automotive networks. In automotive systems, both periodic and aperiodic messages may appear on the CAN bus. It is essential to meet the transmission deadline for each message. The AFDCAN messages need to be scheduled carefully in such a way that they do not interfere with the existing schedules of message transfers at every node. This will ensure the real-time transfer and safety aspects of CAN messages in automotive applications.

In automotive systems, the CAN messages are sent or received by nodes or ECUs of the CAN network. These messages are given priority with the help of the identifier present in the CAN data frame. For a CAN message, the lower the identifier, the higher is the priority assigned. As far as AFDCAN is concerned, the priority of the diagnostic messages is decided based on the priority of the existing periodic and aperiodic messages. While allocating identifiers to the diagnostic messages for AFDCAN, the transmission deadlines of the periodic and aperiodic messages need to be considered.
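The rule that a lower identifier means a higher priority follows from bitwise arbitration on the CAN bus, where a dominant bit (0) overrides a recessive bit (1). The following is a minimal sketch of that arbitration for 11-bit standard identifiers. The function name `arbitrate` and the three identifier values are hypothetical and serve only to illustrate how a diagnostic message with a higher identifier yields to existing traffic.

```python
def arbitrate(pending_ids, width=11):
    """Return the identifier that wins bitwise arbitration among pending frames.

    Identifiers are compared most significant bit first. The bus behaves as a
    wired-AND, so a dominant 0 overrides a recessive 1, and any node that sent
    a 1 but reads back a 0 drops out. pending_ids must be non-empty.
    """
    contenders = set(pending_ids)
    for bit in range(width - 1, -1, -1):
        bus_level = min((ident >> bit) & 1 for ident in contenders)
        contenders = {i for i in contenders if (i >> bit) & 1 == bus_level}
    return contenders.pop()

# Hypothetical identifiers, chosen only for this example.
PERIODIC_ID = 0x100      # existing periodic message
APERIODIC_ID = 0x250     # existing aperiodic message
DIAGNOSTIC_ID = 0x400    # diagnostic message placed below both in priority

winner = arbitrate([DIAGNOSTIC_ID, PERIODIC_ID, APERIODIC_ID])
print(hex(winner))  # 0x100
```

With this allocation, the diagnostic frame is sent only when no higher-priority frame is pending, which is one way of keeping it from delaying the existing messages. Whether the deadlines are actually met still depends on the complete message schedule.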

The collective diagnostic information of all of the nodes in AFDCAN is available with any fault-free node in the system. The corrective action can be taken, or the faults can be reported depending on the severity of the problem. Thus, the use of AFDCAN gives a new proposition for network diagnosis in applications such as automotive and industrial automation.

APPENDIX

A. Algorithm to Find First the Fault-Free Node by the Tester

1) Initialization.
2) Make test frame, TFRM:
   $\mathrm{TFRM} = \{\mathrm{NID}\} \cup \{\mathrm{DID}\} \cup \{\mathrm{Tbit}\} \bigcup_{j=1}^{n} \{\mathrm{TBUFF}_{1,j}\}$
   where $\bigcup_{j=1}^{n} \{\mathrm{TBUFF}_{1,j}\} = \bigcup_{j=2k}^{n} \{\mathrm{NBUFF}_{1,j}\}$, $k = 1, 2, \ldots, n$.
3) Send TFRM to testee.
4) Wait until timeout.
5) If result frame is received within timeout, then read the received frame in RFRM.
   a. If frame type bits found in RFRM indicate R bits (“0001”), then modify node buffer as follows:
      $\bigcup_{j=2k}^{n} \{\mathrm{NBUFF}_{1,j}\} = \bigcup_{j=l+m} \{\mathrm{RFRM}_{1,j}\}$
      where $k = 1, 2, \ldots, n$, $m = 1, 2, \ldots, (n-1)$, $l = \{|\mathrm{SID}| + |\mathrm{DID}| + 2\}$.
   b. Mark testee as fault-free, $\mathrm{NBUFF}_{1,2\mathrm{DID}} = 1$.
   c. Exit.
6) If result frame is not received by tester from testee within timeout, increment faulty node counter by one.
7) Check the destination ID field (DID) for the following conditions:
   a. DID less than n;
   b. DID equal to n.
8) If condition (7.a) is true, then
   a. $\mathrm{NBUFF}_{1,2\mathrm{DID}} = 0$ and make DID = DID + 1.
   b. $\mathrm{TFRM} = \{\mathrm{NID}\} \cup \{\mathrm{DID}\} \cup \{\mathrm{Tbit}\} \bigcup_{j=1}^{n} \{\mathrm{TBUFF}_{1,j}\}$
      where $\bigcup_{j=1}^{n} \{\mathrm{TBUFF}_{1,j}\} = \bigcup_{j=2k}^{n} \{\mathrm{NBUFF}_{1,j}\}$, $k = 1, 2, \ldots, n$.
   c. Send TFRM to testee and exit.
9) If condition (7.b) is true, then
   $\mathrm{RFRM} = \{\mathrm{NID}\} \cup \{\mathrm{DID}\} \cup \{\mathrm{2Rbit}\} \bigcup_{j=1}^{n} \{\mathrm{TBUFF}_{1,j}\}$,
   where DID is the NID of the earlier fault-free node and
   $\bigcup_{j=1}^{n} \{\mathrm{TBUFF}_{1,j}\} = \bigcup_{j=2k}^{n} \{\mathrm{NBUFF}_{1,j}\}$, $k = 1, 2, \ldots, n$.
10) RFRM is the 2nd result frame. Send it to the earlier fault-free node.
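The control flow of steps 3) to 9) can be illustrated with a small simulation. The sketch below is not the authors' implementation and rests on several assumptions: the event-driven steps are folded into one loop, the callable `responds` stands in for "result frame received within timeout", one state value is kept per node in place of the two-bit NBUFF entries, and the merging of the received buffer in step 5.a is omitted. The function name and the return values are hypothetical.

```python
FAULTY, FAULT_FREE = 0, 1

def find_first_fault_free(first_did, n, responds):
    """Test nodes first_did..n in order until one answers within the timeout.

    responds(did) -> bool is an assumed stand-in for the timeout check.
    Returns (fault_free_did or None, states, faulty_count).
    """
    states = {}          # simplified node buffer: node ID -> state
    faulty_count = 0
    did = first_did
    while True:
        if responds(did):                  # step 5: result frame arrived
            states[did] = FAULT_FREE       # step 5.b
            return did, states, faulty_count
        faulty_count += 1                  # step 6
        if did >= n:                       # condition 7.b: no node left to test
            # Step 9 would now build the 2nd result frame. The listing states
            # no buffer update in this branch, so none is made here.
            return None, states, faulty_count
        states[did] = FAULTY               # step 8.a
        did += 1                           # step 8.a: next testee

# Example with five nodes, tester N1, and only N4 answering.
result = find_first_fault_free(2, 5, lambda did: did == 4)
print(result)  # (4, {2: 0, 3: 0, 4: 1}, 2)
```

In the example, N2 and N3 time out and are marked faulty before N4 answers, so the faulty node counter ends at two.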

B. Algorithm for the Following Timeout Conditions in AFDCAN

1) Algorithm for steps to be taken when the test frame sent from the tester is not received by the testee within timeout.
   a. Update node buffer as:
      $\bigcup_{j=2k}^{2(\mathrm{NID}-1)} \{\mathrm{NBUFF}_{1,j}\} = 0$, where $k = 1, 2, \ldots, (\mathrm{NID}-1)$.
   b. Increment faulty node counter by one.
   c. Send test frame to next node.
2) Second result frame not received by tester or broadcast frame not received by the node.
   a. $\forall j = 2k$, $\{\mathrm{NBUFF}_{1,j}\} = u$, where $u = 2d$ indicating undefined state of the node and $k = 1, 2, \ldots, n$.
   b. Start new diagnostic cycle.
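The two buffer updates above can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: buffer position 2k is taken to hold the state of node k (consistent with the use of position 2·DID for the testee in Appendix A), one state value per node replaces the two-bit entries, and the state values 0, 1, and 2 are read as faulty, fault-free, and undefined. Sending the next test frame and restarting the diagnostic cycle are left out. The function names are hypothetical.

```python
FAULTY, FAULT_FREE, UNDEFINED = 0, 1, 2

def on_test_frame_timeout(states, nid, faulty_count):
    """Case 1: node `nid` received no test frame within the timeout.

    Marks nodes 1..nid-1 as faulty (step a) and returns the incremented
    faulty node counter (step b).
    """
    for k in range(1, nid):
        states[k] = FAULTY
    return faulty_count + 1

def on_result_or_broadcast_timeout(states, n):
    """Case 2: 2nd result frame or broadcast frame missing; all states become undefined."""
    for k in range(1, n + 1):
        states[k] = UNDEFINED

states = {k: FAULT_FREE for k in range(1, 6)}
count = on_test_frame_timeout(states, nid=3, faulty_count=0)
print(states, count)  # {1: 0, 2: 0, 3: 1, 4: 1, 5: 1} 1

on_result_or_broadcast_timeout(states, n=5)
print(states)         # {1: 2, 2: 2, 3: 2, 4: 2, 5: 2}
```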

C. Algorithm for Sending the New Node Entry Frame to All of the Nodes in the System

(Refer to Fig. 6)
1) $\mathrm{TFRM} = \{\mathrm{NID}\} \cup \{0\} \cup \{\mathrm{NERbit}\}$.
2) Send TFRM to all of the nodes in the system after sensing a result frame on the bus.

REFERENCES

[1] L. Cauffriez, J. Ciccotelli, B. Conrard, and M. Bayart, “Design of intelligent distributed control systems: A dependability point of view,” Reliab. Eng. Syst. Safety, vol. 84, pp. 19–32, 2004.

[2] S. Kelkar and R. Kamal, “Control area network based quotient remainder compression-algorithm for automotive applications,” in Proc. 38th Annu. IEEE IECON, Montreal, QC, Canada, Oct. 2012, pp. 3030–3036.

[3] S. Kelkar and R. Kamal, “Comparison and analysis of quotient remainder compression algorithms for automotives,” in Proc. IEEE INDICON, Kochi, India, Dec. 2012, pp. 802–807.

[4] Robert Bosch GmbH, Controller Area Network (CAN) Protocol Specification, Ver. 2.0, 1991.

[5] M. J. Short and M. J. Pont, “Fault-tolerant time-triggered communication using CAN,” IEEE Trans. Ind. Informat., vol. 3, no. 2, pp. 131–142, May 2007.

[6] J. R. Pimentel and J. A. Fonseca, “FlexCAN: A flexible architecture for highly dependable embedded applications,” in Proc. 3rd Int. Workshop Real-Time Netw., Italy, 2004. [Online]. Available: http://paws.kettering.edu//~jpimente/flexcan/FlexCAN-architecture.pdf

[7] T. Fuhrer, B. Muller, W. Dieterie, F. Hartwich, R. Hugel, and H. Weiler, “Time triggered communication on CAN,” in Proc. 7th Int. CAN Conf., Amsterdam, Netherlands, 2000. [Online]. Available: http://www.bosch-semiconductors.de/media/pdf_1/canliteratur/cia2000paper_1.pdf

[8] B. Muller, T. Fuhrer, F. Hartwich, R. Hugel, and H. Weiler, “Fault tolerant TTCAN networks,” in Proc. 8th Int. CAN Conf., Las Vegas, NV, USA, 2002. [Online]. Available: http://www.bosch-semiconductors.de/media/pdf_1/canliteratur/fault_tolerant_ttcan.pdf


[9] H. Kopetz and G. Grünsteidl, “TTP-A protocol for fault-tolerant real-time systems,” Computer, vol. 27, no. 1, pp. 14–23, Jan. 1994.

[10] B. Manuel, P. Julián, N. Guillermo, and A. Luís, “An active star topology for improving fault confinement in CAN networks,” IEEE Trans. Ind. Informat., vol. 2, no. 2, pp. 78–85, May 2006.

[11] M. Barranco, J. Proenza, and L. Almeida, “Quantitative comparison of the error-containment capabilities of a bus and a star topology in CAN networks,” IEEE Trans. Ind. Electron., vol. 58, no. 3, pp. 802–813, Mar. 2011.

[12] N. Kandasamy, J. P. Hayes, and B. T. Murray, “Time-constrained failure diagnosis in distributed embedded systems: application to actuator diagnosis,” IEEE Trans. Parallel Distrib. Syst., vol. 16, no. 3, pp. 258–270, Mar. 2005.

[13] J. Biteus, E. Frisk, and M. Nyberg, “Distributed diagnosis using a condensed representation of diagnoses with application to an automotive vehicle,” IEEE Trans. Syst., Man, Cybern. A, Syst. Humans, vol. 41, no. 6, pp. 1262–1267, Nov. 2011.

[14] S. A. Arogeti, D. Wang, C. B. Low, and M. Yu, “Fault detection isolation and estimation in a vehicle steering systems,” IEEE Trans. Ind. Electron., vol. 59, no. 12, pp. 4810–4820, Dec. 2012.

[15] C. M. Vong, P. K. Wong, and W. F. Ip, “A new framework of simultaneous-fault diagnosis using pairwise probabilistic multi-label classification for time-dependent patterns,” IEEE Trans. Ind. Electron., vol. 60, no. 8, pp. 3372–3385, Aug. 2013.

[16] H. A. Hansson, T. Nolte, C. Norstrom, and S. Punnekkat, “Integrating reliability and timing analysis of CAN-based systems,” IEEE Trans. Ind. Electron., vol. 49, no. 6, pp. 1240–1250, Dec. 2002.

[17] B. Jiang, Z. Mao, and P. Shi, “H∞-filter design for a class of networked control systems via T-S fuzzy-model approach,” IEEE Trans. Fuzzy Syst., vol. 18, no. 1, pp. 201–208, Feb. 2010.

[18] H. Yang, B. Jiang, V. Cocquempot, and H. Zhang, “Stabilization of switched nonlinear systems with all unstable modes: Applications to multi-agent systems,” IEEE Trans. Autom. Control, vol. 56, no. 9, pp. 2230–2235, Sep. 2011.

[19] K. Choi, J. Luo, K. Pattipati, S. M. Namburu, L. Qiao, and S. Chigusa, “Data reduction techniques for intelligent fault diagnosis in automotive systems,” in Proc. IEEE Int. Conf. Autotestcon, Sep. 2006, pp. 66–72.

[20] A. Routray, A. Rajguru, and S. Singh, “Data reduction and clustering techniques for fault detection diagnosis in automotives,” in Proc. 6th Annu. Conf. Autom. Sci. Eng., Toronto, ON, Canada, 2010, pp. 326–331.

[21] N. Das, A. Routray, and P. K. Dash, “ICA methods for blind source separation of instantaneous mixtures: A case study,” Neural Inf. Process Lett. Rev., vol. 11, no. 11, pp. 225–246, Nov. 2007.

[22] K. P. Detroja, R. D. Gudi, S. C. Patwardhan, and K. Roy, “Fault detection and isolation using correspondence analysis,” Ind. Eng. Chem. Res., vol. 45, no. 1, pp. 223–235, 2006.

[23] K. P. Detroja, R. D. Gudi, and S. C. Patwardhan, “Data reduction algorithm based on principle of distributional equivalence for fault diagnosis,” Control Eng. Practice, vol. 20, no. 10, pp. 1033–1041, Oct. 2012.

[24] S. Pusha, R. D. Gudi, and S. Noronha, “Polar classification with correspondence analysis for fault isolation,” J. Process Control, vol. 19, no. 4, pp. 656–663, Apr. 2009.

[25] S. H. Hosseini, J. G. Kuhl, and S. M. Reddy, “A diagnosis algorithm for distributed computing systems with dynamic failure and repair,” IEEE Trans. Comput., vol. 33, no. 3, pp. 223–233, Mar. 1984.

[26] A. Bagchi and S. L. Hakimi, “An optimal algorithm for distributed system-level diagnosis,” in Proc. 21st IEEE Int. Symp. Fault-Tolerant Comput., Montreal, QC, Canada, 1991, pp. 214–221.

[27] R. P. Bianchini and R. W. Buskens, “Implementation of on-line distributed system-level diagnosis theory,” IEEE Trans. Comput., vol. 41, no. 5, pp. 616–626, May 1992.

[28] S. Rangarajan, A. T. Dahbura, and E. A. Ziegler, “A distributed system-level diagnosis algorithm for arbitrary network topologies,” IEEE Trans. Comput., vol. 44, no. 2, pp. 312–334, Feb. 1995.

[29] E. P. Duarte and T. Nanya, “A hierarchical adaptive distributed system-level diagnosis algorithm,” IEEE Trans. Comput., vol. 47, no. 1, pp. 34–45, Jan. 1998.

[30] M. Malek, “A comparison connection assignment for diagnosis of multiprocessor systems,” in Proc. 7th Int. Symp. Comput. Architect., 1980, pp. 31–35.

[31] K. Chwa and S. Hakimi, “Schemes for fault tolerant computing: A comparison of modularly redundant and t-diagnosable systems,” Inf. Control, vol. 49, no. 3, pp. 212–238, Jun. 1981.

[32] J. Maeng and M. Malek, “A comparison connection assignment for self-diagnosis of multiprocessor systems,” in Proc. 11th Int. Symp. Fault-Tolerant Comput., 1981, pp. 173–175.

[33] A. Sengupta and A. Dahbura, “On self-diagnosable multiprocessor systems: Diagnosis by the comparison approach,” IEEE Trans. Comput., vol. 41, no. 11, pp. 1386–1395, Nov. 1992.

[34] M. Blough and H. Brown, “The broadcast comparison model for on-line fault-diagnosis in multicomputer systems: Theory and implementation,” IEEE Trans. Comput., vol. 48, no. 5, pp. 470–493, May 1999.

[35] E. Mourad and A. Nayak, “Comparison-based system-level fault diagnosis: A neural network approach,” IEEE Trans. Parallel Distrib. Syst., vol. 23, no. 6, pp. 1047–1059, Jun. 2012.

[36] C. A. Lupini, Vehicle Multiplex Communication-Serial Data Networking Applied to Vehicular Engineering. Warrendale, PA, USA: SAE, 2004.

[37] K. H. Rosen, Discrete Mathematics and its Application, 5th ed. New York, NY, USA: McGraw-Hill, 1988.

[38] Renesas M16C/6N Group Hardware manual, Oct. 2005.

Supriya Kelkar (M’02) received the Bachelor’s degree in electronics and communication engineering from Karnataka University, Karnataka, India, in 1989, and the Master’s degree in electronics and telecommunication engineering from Pune University, Pune, India, in 1999. She received the Ph.D. degree in the area of automotive multiplex systems from the Institute of Engineering and Technology, Devi Ahilya University, Indore, India, in 2014.

She was a Research and Development Engineer for electronics systems with Chromatography and Instruments Company, Vadodara, India, for three years. She is currently an Associate Professor with the Computer Engineering Department, Cummins College of Engineering for Women, Pune. Her research interests include fault diagnosis in distributed real-time systems, data compression algorithms for automotive distributed systems, and real-time networks for industrial and automotive applications, such as controller area networks.

Dr. Kelkar is a member of the IEEE Vehicular Technology Society.

Raj Kamal (M’07) received the Doctoral degree in the area of physics from the Indian Institute of Technology, New Delhi, India, in 1972.

He has over 40 years of experience in research, has published over 125 research papers, and has taught physics, electronics, computer science, and information technology. He carried out postdoctoral research at Uppsala University, Uppsala, Sweden, in 1978–1979 and 1984. He is currently a Professor with the School of Computer Science and Information Technology, Devi Ahilya University, Indore, India. He is widely recognized for his research and engineering books: Embedded Systems (McGraw-Hill), Computer Architecture (a Schaum Series Adaptation by McGraw-Hill), Microcontrollers (Pearson Education), and Mobile Computing (Oxford University Press).

Dr. Kamal is a member of the IEEE Computer Society.
