# A Stimulus-free Probabilistic Model for Single-Event-Upset Sensitivity

Thara Rejimon and Sanjukta Bhanja University of South Florida, Tampa, FL, USA. E-mail: (rejimon, bhanja)@eng.usf.edu

#### Abstract—

With device size shrinking and fast rising frequency ranges, effect of cosmic radiations and alpha particles known as Single-Event-Upset (SEU), is a growing concern in logic circuits. Accurate understanding and estimation of Single-Event-Upset sensitivities of individual nodes is necessary to achieve better soft error hardening techniques at logic level design abstraction. We propose a probabilistic framework to study the effect of inputs, circuit structure and delay on Single-Event-Upset sensitivity of nodes in logic circuits as a single joint probability distribution function (pdf). To model the effect of timing, we consider signals at their possible arrival times as the random variables of interest. The underlying joint probability distribution function, consists of two components: ideal random variables without the effect of SEU and the random variables affected by the SEU. We use a Bayesian Network to represent the joint pdf which is a minimal compact directional graph for efficient probabilistic modeling of uncertainty. The attractive feature of this model is that not only does it use the conditional independence to arrive at a sparse structure, but also utilizes the same for smart probabilistic inference. We show that results with exact (exponential complexity) and approximate non-simulative stimulus-free inference (linear in number of nodes and samples) on benchmark circuits yield accurate estimates in reasonably small computation time.

#### I. INTRODUCTION

High-energy neutrons present in cosmic radiations and alpha particles from packaging materials give rise to single event upsets (SEUs) resulting in soft errors in logic circuits. When particles hit a semiconductor material, electron-hole pairs are generated, which may be collected by a P-N junction, resulting in a short current pulse which may cause a logic upset or Single Event Upset (SEU) in the signal value. An SEU may occur in an internal node of a combinational circuit and propagate to an output latch. When a latch captures the SEU, it may cause a bit flip, which can alter the state of the system resulting in a soft error. As process technology scales down, due to high operating frequencies, low supply voltages, shrinking device geometry and small noise margin, soft errors are of serious concern in logic circuits. Soft error susceptibility of a node *j* with respect to a latch,  $SES_{j,Q_L}$  is the soft error rate at the latch output  $Q_L$ , contributed by node j. Let  $T_{j,i}$  be a Boolean variable which takes logic value 1 if an SEU at a node j causes an error at an output node i. Then  $P(T_{i,i}|SEU_i)$  is the probability of occurrence of an error at output node *i*, given a particle hit at node *j* causes an SEU. Let  $P(SEU_i)$  be the probability that a particle hit at node j generates an SEU at that node and let  $P(Q_L|T_{i,j})$  be the probability that an error at output node *i* causes an erroneous signal at latch output  $Q_L$ . Mathematically  $SES_{j-Q_L}$  is expressed by Eq. 1.

$$SES_{j\_Q_L} = R_H P(SEU_j) P(T_{j\_i} | SEU_j) P(Q_L | T_{j\_i})$$
(1)

where  $R_H$  is the particle hit rate on a chip which is fairly uniform in space and time.  $P(SEU_i)$  depends on  $V_{dd}$ ,  $V_{th}$  and also

Proceedings of the 19th International Conference on VLSI Design (VLSID'06)  $1063-9667/06 \$20.00 \$ C 2006 **IEEE** 

on temperature.  $P(Q_L|T_{j,j})$  is a function of latch characteristics and the switching frequency.

In this work, we explore  $P(T_{j,i}|SEU_j)$  by accurately considering the effect of (1) SET duration, (2) effect of gate delay and (3) timing, (4) re-convergence in the circuit structure and most importantly (5) inputs. Several works have been done on soft error analysis which estimate the overall output signal errors due to SEUs at the internal nodes [6], [7], [9], [10], [11]. Note that our focus is to identify the SEU locations which cause soft errors at the output(s) with high probabilities and **not** on the overall soft error rates. Knowledge of relative contribution of individual nodes to output error will help designers to apply selective radiation hardening techniques. This model can be fused with the modeling of the latches [10], [12] considering latching window, setup and hold time etc and also with the process parameters like  $V_{th}$ ,  $V_{dd}$  [6], [7], [10] for a comprehensive model capturing processing, electrical and logical effect.

We model internal dependency of the signals in timingaware model such that the SEU sensitization probability  $(P(T_{i,i}|SEU_i))$  captures the effect of circuit structure, circuit path delay and also the input space. We use a circuit expansion algorithm similar to that presented in [4], [16] to embed time-related information in the circuit topology without affecting its original functionality. We assume a fan-out dependent delay model, where gate delay of each node is equal to its fanout. Due to the temporal nature of SEUs, not all of the SEUs cause soft errors. From the expanded circuit, we generate a list of SEUs (possible SEU list) which are possibly sensitized to the circuit outputs at the time frame when output signals are latched. From the expanded circuit and the possible SEU list, we construct an error detection circuit and model SEU in large combinational circuits using a Timing-aware Logic induced Soft Error Sensitivity model (TALI-SES), which is a Bayesian Network.

Bayesian Networks are causal graphical probabilistic models representing the joint probability function over a set of random variables. A Bayesian Network is a directed acyclic graphical structure (DAG), whose nodes describe random variables and node to node arcs denote direct causal dependencies. A directed link captures the direct cause and effect relationship between two random variables. Each node is quantified by the conditional probability of the states of that node *given* the states of its parents, or its direct causes. The attractive feature of this graphical representation of the joint probability distribution is that not only does it make conditional dependency relationships among the nodes explicit but it also serves as a computational mechanism for efficient probabilistic updating. Bayesian networks have traditionally been used in medical diagnosis, artificial intelligence, image analysis, and specifically in s



model [2] and single stuck-at-fault/error model [13] in VLSI but their use in timing-aware modeling of Single-Event-Upsets is new. We use two inference schemes for probabilistic updating. An exact method also known as clustering technique [14] and an approximate non-simulative, stimulus-free inference scheme known as Probabilistic Logic Sampling (PLS) [18]. These inference schemes are discussed in detail in section V.

This paper is organized as follows. Section II is a summary of the prior works done on soft error modeling and analysis. In section III, we give an outline of our modeling. In section IV we discuss our model in detail explaining the timing issues, features of Bayesian network-based modeling and the proposed TALI-SES model. Section V deals with Bayesian inference where we discuss both exact and approximate(stochastic) inference schemes. In section VI we give experimental results. We compare our results with logic simulation results and found that this novel probabilistic estimation technique is accurate (closeto-zero error) and efficient.

#### II. BACKGROUND

Effects of Single-Event-Upsets have been studied by several authors in the past [20], [21], [22]. A concurrent error detection scheme to detect and reduce soft error failure rates have been described in [1]. A model that captures the effects of technology trends in the Soft Error failure Rates (SER), considering different types of masking phenomena such as electrical masking, latching window masking and logical masking, is presented in [3]. Results from their model predict that SER per chip will increase nine orders of magnitude from 1992 to 2011 and then it will become comparable to the soft error failure rates of unprotected memory chips. A model to analyze Single Event Upsets with zero-delay logic simulation is presented in [4]. A mathematical model to estimate the possible propagation of glitches due to transient faults has been presented in [17]. Zhang et al. in [10] proposed a composite soft error rate analysis method (SERA). This method uses a conditional probability based parameter extraction technique by resorting to device and logic level simulation. In their work, combinational circuits are assumed to have unbalanced re-convergent paths. This is not true even for small benchmarks such as c17. Another method for soft error tolerance analysis and optimization has been discussed in [8] by modeling glitch generation and propagation characteristics of gates in a circuit. However, they use circuit level simulation results to incorporate logical masking effects.

Since all the state-of-the-art techniques have resorted to simulation for logical and device level effects (known to be expensive and pattern-sensitive), we felt the need to explore the input data-driven uncertainty in a comprehensive manner through a probabilistic model to capture the effect of primary inputs, the effect of gate delay and SEU duration on the logical masking. There is future scope for this model to be fused with other models [6], [7], [10], [11], [12] for capturing device effects such as electrical masking, threshold voltage and supply voltage.

#### III. OVERVIEW

We model the effect of single event upset produced at an internal node of a circuit on the output signals, by computing the

Proceedings of the 19th International Conference on VLSI Design (VLSID'06) 1063-9667/06 \$20.00 © 2006 IEEE

joint probability distribution described by Eq. 2.

$$P(T_{j,i}) = \sum_{j, \{I_l\}, X_k, k \neq j} P(T_{j,i}, I_0, \cdots, I_l, \cdots, I_N, X_1, \cdots, X_k, \cdots, X_M)$$
(2)

where  $P(T_{j,i})$  is the probability that an SEU generated at an internal node j causes an erroneous signal at output i.  $T_{j,j}$  is a test signal which compares the error-free signal at the  $i^{th}$  output with the corresponding error-sensitized output caused by an SEU at the *j*<sup>th</sup> node. If  $T_{j,j} = 1$ , it indicates the occurrence of an error at output i due to an SEU at j.  $P(T_{j\_i})$  depends on the N input signals  $I_0, \dots, I_N$ , M internal signals  $X_1, \dots, X_M$ , and the type of SEU at j (SEU<sup>1</sup> caused by 0\_1\_0 transition or SEU<sup>0</sup> caused by 1\_0\_1 transition). Ideally, the real effect of a particle hit at node j on the  $i^{th}$  output is product of the conditional probability  $P(T_{i,i}|SEU_i)$  and  $P(SEU_i)$ , where  $P(SEU_i)$  is the probability that a particle hit at a node produces an SEU at node j and it depends on process parameters such as  $V_{dd}$  and  $V_{th}$  and also depend dynamically on temperature. With reduced supply voltages and diminishing dimensions, this probability will be very close to one. In this work, we compute the upper bound of SEU sensitivity by setting  $P(SEU_i) = 1$ .

In Eq.2, the probability  $P(T_{j\_i})$  does not consider the transient nature of SEU. For example, the SEU effect may reach the output for a short time span, but the output signal can be reinforced to its correct value before it is sampled by the latch. SEU propagation depends on the gate delays and SEU duration. Let  $t_h$  be the time when an SEU originates at a node,  $\delta$  be the SEU duration,  $t_s$  be the time when outputs are sampled and  $\Pi$  be the set of propagation delays  $(t_d)$  of sensitized paths from the node to the circuit outputs. Nodes satisfying the following conditions do not cause soft error [4]:

$$t_h + \delta + t_d < t_s \quad \forall t_d \in \Pi. \tag{3}$$

To capture the effect of gate delays and SEU duration, we do a time-space transformation of the original circuit, by means of a circuit expansion algorithm similar to that presented in [4]. Our model captures not only the effect of gate delays, but also effect of difference in path delays (arrival times) between the input signals of gates assuming fan-out-dependant delay model. In the expanded circuit, each gate is replicated several times corresponding to the time instants at which the gate output is evaluated. The circuit outputs are also replicated.

Thus each of the random variables in Eq.2 represent a set of variables at different time frames.  $I_i = \{I_{i,0}, I_{i,1}\}$  where  $I_{i,0}$  and  $I_{i,1}$  are the input signal values of  $I_i$  before and after the application of a clock pulse. The new input signal  $I_{i,1}$  remains the same throughout the clock cycle.  $X_i = \{X_{i,t_k}\} \forall t_k$  where  $t_k$  is the signal evaluation time. By considering the temporal nature of SEUs we obtain a list of SEU which can cause erroneous outputs. This is explained in section A.

TALI-SES is a Directed Acyclic Graph we build from the expanded circuit and the SEU list to capture the effect of each SEU at a node to the output. We discuss TALI-SES construction in section C.

# IV. THE PROPOSED MODEL

In this section, we first focus on handling the timing-aware feature of our probabilistic model, followed by the fault



struction. We conclude the section with discussion about the model itself, given the timing-aware graph and the fault list.

# A. Timing Issues

We first expand the circuit by time-space transformation of the original circuit, without changing its functionality. A gate in the original circuit *C* will have many replicated gates in the expanded circuit *C'*, corresponding to different time-frames at which the gate output is evaluated. The output evaluation time  $\{T\}$  of each gate in the circuit is calculated based on variable delay model. We assume that the delay associated with a gate is equal to its fan-out. For each gate *g* whose output is evaluated at time  $t \in \{T\}$  a replicate node *g*, *t* is constructed. The inputs of *g*, *t* are the replicate nodes of the gates, which are the inputs of *g* in the original circuit and belongs to the time-frames t' < t.

Fig. 1(a) is the expanded circuit of benchmark c17. We consider the value of signal *i* at time *t* by (i, t). Now the random variable that represents the value of a signal *i* at time *t* is denoted by  $X_{i,t}$ . The circuit outputs reach steady state values,  $X_{22,0}$  and  $X_{23,0}$  at t = 0, after the application of the previous inputs,  $\{X_{1,0}, X_{2,0}, X_{3,0}, X_{6,0}, X_{7,0}\}$ . Let the new inputs  $\{X_{1,1}, X_{2,1}, X_{3,1}, X_{6,1}, X_{7,1}\}$  be applied at t = 1.  $X_{10,2}$  is the signal value at the output of gate 10 at time instant 2.

Input signals of certain gates in the circuit might have different arrival time due to the difference in path delays. We need to introduce additional duplicate nodes for those internal signals with less path delay in order to model the effect of any SEU generated at the junction of the gates at times later than the signal's arrival. For example, in Fig. 1(a), input signals to gate 22 have path delays 2 and 5 respectively. The final output signal (22, 6)is evaluated with input signals (16, 5) and (10, 5) (assuming fanout of output gates to be one). If no SEUs originated at the output of gate 10 between time instants 2 and 5, (10, 2) and (10, 5)would be the same. However, in the event an SEU occurs at node 10 at t = 5, (10,2) and (10,5) may differ depending on the inputs, which can cause a wrong output signal at (22, 6). We model the effect of SEU at (10, 5) by introducing a duplicate gate (10,5) whose inputs are (1,1) and (3,1). In Fig. 1(a), it can be seen that (10, 3), (10, 4), (10, 5), (19, 3) and (19, 5) are duplicate gates, represented by filled gate symbols.

Steps for constructing the timing-aware expanded circuit, based on fan-out dependent delay model are the following:

- 1. Arrange gates in the order of levels, with the level of input gates equal to zero.
- 2. Include all gates that are present in the original circuit. Output signals of these gates represent the steady state signal values at t = 0, before the application of new inputs.
- 3. Add additional input nodes representing new input signal values at t = 1;
- 4. For each level of the circuit starting from level  $l_i = 1$ , repeat the following step:

For each gate g in level  $l_i$ , create replicate gates at time frame  $t = t_p + f_g$ , where  $t_p$  is the maximum time frame of the previously inserted parent gates of g and  $f_g$  is the fanout of gate g. Update time frames of gate g.

Output signals of a circuit are sampled at  $t = t_s$ , where  $t_s$  is the maximum of the latest signal arrival times of the output signals. SEUs which do not satisfy Eq. 3 affect circuit outputs

Proceedings of the 19th International Conference on VLSI Design (VLSID'06) 1063-9667/06 \$20.00 © 2006 IEEE

resulting in soft errors. These SEUs are the upsets generated at the output of gates, which are in the fan-in cones of final outputs, outputs evaluated at time  $t_s$ . SEUs occurring at certain other gates, which are not in the fan-in cones of the final outputs, may also affect circuit outputs. These nodes arise due to the SEU duration time  $\delta$ . For example in Fig. 1(a), we see that the final outputs are generated at time instant t=6. If an SEU occurs at signal 19 at 4 ns and lasts for one time unit, it will essentially be capable of tampering the value of node 23 at 6 ns. Note that we assume that  $\delta$  is one time unit. The fault list will be different if we change the value of  $\delta$ . Thus we can see that SEUs which are sensitized to outputs at time frames between  $t_s$ and  $t_s - \delta$  may cause soft errors, depending on the input signals and circuit structure.

Based on the above considerations, we modify the expanded circuit by including only those gates that propagate SEU effects to the outputs between time instants,  $t_s$  and  $t_s - \delta$ . Thus we get a considerable reduction in the circuit size. Fig. 1(b) is the modified expanded circuit of c17, which models all SEUs possibly sensitized to a final output.

Next, we discuss how to generate a list of possible SEUs affecting the circuit outputs. Not all gates in Fig.1(b) are SEU sensitive. As discussed above, a duplicate node introduces an additional delay of at least one time unit. If the delay introduced by a duplicate gate is greater than or equal to  $\delta$ , the SEU duration time, the effect of SEUs originated at any of the gates in the fan-in cones of the duplicate gate is nullified and correct signal value is restored at the output of the duplicate gate, and hence those SEUs are effect-less. Thus we create a reduced list of SEUs by traversing the modified expanded circuit from each of the circuit outputs at time instants between  $t_s$  and  $t_s - \delta$ , until a duplicate gate or an input node is reached.

# B. Bayesian Networks

A Bayesian network[19] is a Directed Acyclic Graph (DAG) in which the nodes of the network represent random variables and a set of directed links connect pairs of nodes. The links represent causal dependencies among the variables. Each node has a conditional probability table (CPT) except the root nodes. Each root node has a prior probability table. The CPT quantifies the effect the parents have on the node. Bayesian networks compute the joint probability distribution over all the variables in the network, based on the conditional probabilities and the observed evidence about a set of nodes.

The exact generalized joint probability distribution over the n variables in a Bayesian Network is given by Eq. 4.

$$P(x_{n}, x_{n-1}, \cdots, x_{k}, \cdots, x_{3}, x_{2}, x_{1})$$
  
=  $P(x_{n}|x_{n-1}, \cdots, x_{k}, \cdots, x_{3}, x_{2}, x_{1})P(x_{n-1}|x_{n-2}, \cdots, x_{k}, \cdots, x_{1})$   
 $\cdots P(x_{k}|x_{k-1}, x_{k-2}, \cdots, x_{2}, x_{1}) \cdots P(x_{2}|x_{1})P(x_{1})$   
(4)

Here  $x_k$  denotes some value of the variable  $X_k$ . Let  $pa(x_k)$  be a set of values for  $X_k$ 's parents. In a Bayesian Network, the node  $X_k$  becomes conditionally independent of all the other nodes given only its parents. Using this conditional independence, a minimal factored representation of exact joint probability distribution over *n* random variables can be expressed as in Eq. 5.





Fig. 1. (a) Time-space transformed circuit of benchmark c17, modeling all SEUs. (b) Modified time-space transformed circuit of benchmark c17, modeling only the possibly sensitized SEUs.



Fig. 2. (a) An illustrative SEU sensitivity logic for a sub-circuit of c17. (b) Timing-aware-Logic-induced-DAG model of the SEU sensitivity logic in (a)

$$P(X) = \prod_{k=1}^{n} P(x_k | pa(x_k))$$
(5)

We adapted the theoretical understanding based on [19]. The reader is referred to these works for details.

# C. TALI: Timing-aware-Logic-induced Soft error model

In this section, we first describe the proposed Bayesian network based model, which can be used to estimate the soft error sensitivity of logic blocks. This model captures the dependence of SEU sensitivity on the input pattern, circuit structure and the gate delays. Note that this probabilistic modeling does not require any assumptions on the inputs and can be used with any biased workload patterns. The proposed model, Timing-Aware-Logic-Induced-Soft-Error-Sensitivity (TALI-SES) Model is a Directed Acyclic Graph (DAG) representing the time-space transformed, SEU-encoded combinational circuit,  $\langle C', J \rangle$ where C' is the expanded circuit created by time-space transformation as discussed in section. A and J is the set of possible SEUs (also discussed in section A). The error detection circuit consists of the expanded circuit C', an error sensitization logic for each SEU and a detection unit T consisting of several comparator gates. We explain it with the help of a small exam-

Proceedings of the 19th International Conference on VLSI Design (VLSID'06) 1063-9667/06 \$20.00 © 2006 IEEE

ple shown in Fig 2(a), which is the error detection circuit for a small portion of benchmark c17. The error sensitization logic for an SEU at node j consists of the duplicate descendant nodes from *j*. In Fig. 2(a), the block with the dotted square is the sensitization logic for  $16, 5_s^1$  [An SEU<sup>1</sup> at node 16 at time t = 5]. It consists of nodes 22,  $5_s$  and 22,  $6_s$  descending from node 16, 5 of the time-space transformed circuit. For simplicity, we show the modeling of only one SEU in this example. Our model can handle any number of SEUs simultaneously. Each SEU sensitization logic has an additional input to model the SEU. Example: input  $SEU_{16,5}^1$ . This input signal value is set to logic one in order to model the effect of a 0-1-0 SEU occurring at node 16 at time frame 5.

As discussed previously in section A, an SEU lasting for a duration  $\delta$  can cause an erroneous output if its effect reaches the output at any instant between the sampling time  $t_s$  and time frame  $t_s - \delta$ . In this work we assume  $\delta$  to be one. Hence we get error sensitized outputs at time frame  $t_s$  and for some SEUs at time frame  $t_s - 1$  also. We need to compare the SEU-free output signals evaluated at the sampling time,  $t_s$  with the corresponding SEU-sensitized output signals arriving at  $t_s - 1$  and  $t_s$ . Hence these signals are sent to a detection unit T. The comparators in the detection unit compare the ideal and error sensitized outputs with the corresponding error-free outputs and generate test signals. For example, the test signals for an SEU at node j at time t, (represented as  $(j, t)_s$ ) are  $T_{(j,t)\_(i,t_s)}$  and  $T_{(j,t)\_(i,t_s-1)}$ . If any of these test signal values is 1, it indicates the occurrence of an error. The probability  $P(T_{(i, t)-i})$ , which is a measure of the effect of SEU  $(j, t)_s$  on the output node *i* is computed as  $P(T_{(j,t)-i}) = \max\{P(T_{(j,t)-(i,t_s)}), P(T_{(j,t)-(i,t_s-1)})\}.$ An SEU can have effect on more than one output. The overall effect of an SEU  $(j, t)_s$  on the outputs is computed as  $P(T_{(i,t)}) = \max_{\forall i} \{ P(T_{(i,t),i}) \}$ . In the example the SEU (16,5), is sensitized to outputs 22,6 and 23,6. Hence the two te



for this SEU are  $T_{(16,5)}(22,6)$  and  $T_{(16,5)}(23,6)$ .

More than one SEU can originate at a node at different time frames. Considering the effect of both types of SEUs,  $SEU^0$  and  $SEU^1$ , at node j at all time frames, we compute the worst case output error probability due to node j as  $P(T_j) = \max_{\forall t} \{P(T_{(j,t)}^1), P(T_{(j,t)}^0)\}$ , which is the maximum probability over all time frames of  $SEU^0$  and  $SEU^1$ .

These detection probabilities depend on the circuit structural dependence, the inputs, dependencies amongst the inputs, gate delays and the SEU duration. In this work we assume random inputs for experimentation and validation of our model.

We construct the TALI-SES DAG of the SEU detection circuit by nodes which are random variables representing signal values of the SEU detection circuit. A signal i in the detection circuit is represented by the random variable  $X_i$  in the DAG.

In TALI-SES DAG structure the parents of each node are its Markov boundary elements. Hence TALI-SES is a boundary DAG. For definition of Markov Boundary and boundary DAG, please refer to [19]. Note that TALI-SES is a boundary DAG because of the causal relationship between the inputs and the outputs of a gate that is induced by logic. It has been proven in [19] that if graph structure is a boundary DAG D of a dependency model M, then D is a minimal I-map of M ([19]). This theorem along with definitions of conditional independencies, in [19] (we omit the details) specifies the structure of the Bayesian network. Thus TALI-SES DAG is a minimal I-map and thus a Bayesian network (BN).

**Quantification of TALI-SES-BN**: TALI-SES-BN consists of nodes that are random variables of the underlying probabilistic model and edges denote direct dependencies. All the edges are quantified with the conditional probabilities making up the joint probability function over all the random variables. The overall joint probability function that is modeled by a Bayesian network can be expressed as the product of the conditional probabilities. Let us say,  $X' = \{X'_1, X'_2, \dots, X'_m\}$  are the node set in TALI-SES Bayesian Network, then we can say

$$P(X') = \prod_{k=1}^{m} P(x'_{k} | Pa(X'_{k}))$$
(6)

where  $Pa(X'_k)$  is the set of nodes that has directed edges to  $X'_k$ . A complete specification of the conditional probability of a two input AND gate output will have  $2^3$  entries, with 2 states for each variable. These conditional probability specifications are determined by the gate type. By specifying the appropriate conditional probability we ensure that the spatial dependencies among sets of nodes (not only limited to just pair-wise) are effectively modeled.

# V. BAYESIAN INFERENCE

We explore two inference schemes for the TALI-SES. The first inference scheme is cluster based exact inference and the second one is based on stochastic inference algorithm which is an approximate non-simulative scalable anytime method.

# A. Junction Tree Based Inference

We demonstrate this inference scheme with an example. A small combinational circuit is shown in Fig. 3a. We first con-

Proceedings of the 19th International Conference on VLSI Design (VLSID'06) 1063-9667/06 \$20.00 © 2006 IEEE

struct a Bayesian network for a subset of the time transformed circuit of Fig. 3a that captures the effect of SEU of "zero" at node 5 at a time instant 2 unit (denoted by the random variable  $X_{5,2}s^0$ ) on the output signal 6 at 3 time unit(denoted by random variable  $X_{6,3}$ ) Fig. 3b is the moralized graph (discussed below) derived from the Bayesian Network. Note that the error in output signal  $X_{6,3}$  is determined by  $T_{6,(5,2)}$  which is an xor combination of  $X_{6,3}$  and  $X_{6,35}$  where  $X_{6,35}$  is the node that captures the effect of SEU at node 5 at 2 time unit. This is the original TALI-SES Bayesian Networks that we further process for exact inference. The steps involved in the exact inference scheme are described below. Moralization: Create an undirected graph structure called the moral graph from the Bayesian network DAG structure by adding undirected edges between the parents of a common child node and dropping the directions of the links. The moral graph represents the Markov structure of the underlying joint function [23]. The dependencies that are preserved in the original DAG are also preserved in the moral graph [23]. The dashed edges in Fig. 3c are added at this stage. This step ensures that every parent child set is a complete sub graph. Triangularization: In this step, every cycle of length greater than or equal to 4 is reduced to a cycle of 3 by inserting additional links (chords) to the moral graph. The moral graph is said to be triangulated if it is chordal [23]. Note that in this particular example, moral graph is chordal and no additional links are needed. Message passing in Junction Tree: A junction tree is defined as a tree of cliques (collection of completely connected sub graph) of the choral graph (cliques are connected by unique path as in Fig 3c). Junction tree possesses running intersection property [23] that ensures that if two cliques share a common variable, the variable should be present in all the cliques that lie in the unique path between them. Fig. 3c is the junction tree derived from the chordal graph of Fig. 3b in this example. Interested readers are referred to [2] for a detailed description of how local message passing is performed in junction trees. Probabilistic Backtracking: Note that since junction tree has no cycle and it is also not directional, we can propagate evidence from any node at any clique and propagate the evidence in any direction. It is in sharp contrast with simulative approaches where flow of information always propagates from input to the outputs. Thus, we would be able to use it for input space characterization for achieving zero output error due to SEUs. We would instantiate a desired observation in an output node (say zero output error in presence of SEU(s)) and backtrack the inputs that can create such a situation. If the input trace has large distance from the characterized input space, we can conclude that zero error is reasonably unlikely. Note that this backtracking feature of Bayesian Networks is already used in medical diagnosis but are new in the context of input space modeling for soft error.

This exact inference is expensive in terms of time and hence for larger circuits, we explore a stochastic sampling algorithm, namely probabilistic Logic Sampling (PLS). This algorithm has been proven to converge to the correct probability estimates [18], without the added baggage of high space complexity.





Fig. 3. (a) A Small Circuit

(b) Chordal Graph (Same as Moral Graph in this example)

(c) Junction Tree

# B. Probabilistic Logic Sampling (PLS)

Probabilistic logic sampling is the earliest and the simplest stochastic sampling algorithm proposed for Bayesian Networks [18]. Probabilities are inferred by a complete set of samples or instantiations that are generated for each node in the network according to local conditional probabilities stored at each node. The advantages of this inference are that: (1) its complexity scales linearly with network size, (2) it is an any-time algorithm, providing adequate accuracy-time trade-off, and (3) the samples are not based on inputs and the approach is input pattern insensitive. The salient aspects of the algorithm are as follows.

- 1. Each sampling iteration stochastically instantiates all the nodes, guided by the link structure, to create a network instantiation.
- 2. At each node,  $x_k$ , generate a random sample of its state based on the conditional probability,  $P(x_k|Pa(x_k))$ , where  $Pa(x_k)$  represent the states of the parent nodes. This is the local, importance sampling function.
- 3. The probability of all the query nodes are estimated by the relative frequencies of the states in the stochastic sampling trace.
- 4. If states of some of the nodes are known (evidence), such as in diagnostic backtracking, network instantiations that are incompatible with the evidence set are disregarded.
- 5. Repeat steps 1, 2, 3 and 4, until the probabilities converge.

The above scheme is efficient for predictive inference, when there is no evidence for any node, but is not efficient for diagnostic reasoning due to the need to generate, but disregard samples that do not satisfy the given evidence. We adopt the tool GeNie [15] for inference using Probabilistic Logic Sampling.

**Complexity** The computational complexity of the exact method is exponential in terms of number of variables in the largest cliques. Space complexity of the exact inference is  $n.2^{|Cmax|}$  [2], where n is the number of nodes in the Bayesian Network, and |Cmax| is the number of variables in the largest clique. The time complexity is given by  $p.2^{|Cmax|}$  [2] where p is the number of cliques.

The time complexity, based on the stochastic inference scheme, is linear in *n*, the number of nodes in the expanded circuit, specifically, it is  $O(n|N_{SEU}|N)$ , where  $N_{SEU}$  is the number of SEUs and *N* is the number of samples.

Proceedings of the 19th International Conference on VLSI Design (VLSID'06)  $1063-9667/06 \$20.00 \ \odot 2006$  **IEEE** 

TABLE I Size of original and time-expanded ISCAS circuits

|       | Gates | Gates    | # of nodes | Time   |
|-------|-------|----------|------------|--------|
|       |       | expanded | (TALI)     | frames |
| c432  | 196   | 476      | 1947       | 55     |
| c499  | 243   | 464      | 1596       | 30     |
| c880  | 443   | 729      | 2486       | 51     |
| c1355 | 587   | 1440     | 3388       | 55     |
| c1908 | 913   | 1524     | 18020      | 79     |
| c2670 | 1426  | 2584     | 4097       | 81     |
| c3540 | 1719  | 3795     | 15472      | 93     |
| c5315 | 2485  | 4887     | 13138      | 90     |
| c6288 | 2448  | 30113    | 31157      | 263    |
| c7552 | 3719  | 10006    | 45367      | 88     |

# VI. EXPERIMENTAL RESULTS

We demonstrate the modeling of SEU based on TALI-SES using ISCAS benchmark circuits. The logical relationship between the inputs and the output of a gate determines the conditional probability of a child node, given the states of its parents, in the TALI-DAG.

In Table I we report the total number of gates in the actual circuit (column 2), total number of gates in the modified expanded circuit (column 3), and the total number of nodes in the resulting TALI-SES (column 4). Column 5 lists the maximum time-frames of the circuits.

We compute the worst case SEU sensitivity of an individual node  $P(T_i)$  in a circuit as follows:

1. Compute the output error probability at output node *i* due to an SEU at node j at time t by Eq. 7 (as discussed in section IV C).

$$P(T_{(j,t)-i}) = \max\{P(T_{(j,t)-(i,t_s)}), P(T_{(j,t)-(i,t_s-1)})\}$$
(7)

2. Considering the effect of all SEUs at node j at all possible time frames, compute the probability of occurrence of an error at the *i*<sup>th</sup> output due to SEUs at node j by Eq. 8.

$$P(T_{j\_i}) = \max_{\forall t} \{ P(T_{(j,t)\_i}) \}$$
(8)

 Compute the worst case SEU sensitivity of a node j due to an SEU<sup>1</sup> and SEU<sup>0</sup> and all for outputs by Eq. 9

$$P(T_j) = \max_{\forall i} \{ P(T_{j \perp i}^1), P(T_{j \perp i}^0) \}$$

$$(9)$$

# A. Exact Inference

In this section, we explore a small circuit c17, with exact inference where we transform the original graph into junction tree and compute probabilities by local message passing



TABLE II ESTIMATED  $P(T_{j,i})$  OF NODES IN C17 (EXACT INFERENCE)

| node j | $SEU^1$       |               | $SEU^0$       |               |
|--------|---------------|---------------|---------------|---------------|
|        | $P(T_{j_22})$ | $P(T_{j_23})$ | $P(T_{j_22})$ | $P(T_{j_23})$ |
| 10     | 0.2813        | 0             | 0.4375        | 0             |
| 11     | 0.0625        | 0.2188        | 0.3125        | 0.5625        |
| 16     | 0.3125        | 0.1875        | 0.4375        | 0.4375        |
| 19     | 0             | 0.375         | 0             | 0.4375        |
| 22     | 0.4375        | 0             | 0.5625        | 0             |
| 23     | 0             | 0.4375        | 0             | 0.5625        |

the neighboring cliques of the junction tree as outlined in section VA. Note that this inference is proven to be exact [19], [23](zero estimation error).

Table II tabulates the results of the TALI-SES of benchmark c17 using the exact inference. In this table, we report the probabilities of error at output nodes 22 and 23 due to an SEU at each node *j* (column 1) namely (10, 11, 16, 19, 22 and 23). Column 2 and 3 of Table II give error probabilities due to  $SEU^1$  (0-1-0 transition) at output nodes 22 and 23 respectively. Similarly 4 and 5 give error probabilities due to  $SEU^0$  (1-0-1 transition) at output nodes 22 and 23 respectively. We compare the errorfree outputs at 22 and 23 at sampling time  $t_s$  with corresponding error sensitized outputs arriving at time frames  $t_s - 1$  and  $t_s$  due to SEUs generated at a node at all possible time frames (as discussed in section IV C). Columns 2, 3, 4 and 5 of Table II reports the maximum of error probabilities due to SEUs originated at individual nodes at all time frames. From this table it can be seen that for this benchmark circuit  $SEU^0$ s have high impact on the output error probabilities than  $SEU^{1}$ s. Error probability at output node 22 due to an  $SEU^1$  at node 11, is very low (0.0625) whereas error probability at output 22 due to  $SEU^0$  at 11 is 0.3125. It also shows that the effect of SEUs are not the same over all outputs. For example, an  $SEU^1$  at node 19 causes no error at output 22 whereas error probability due to this SEU at output node 23 is 0.4375. Note that nodes 22 and 23 are the output nodes. SEUs occurring at these nodes at sampling time  $t_s$  or time  $t_s - 1$  will be latched by an output latch, and are expected to cause very high error probability. However from Table II, it is observed that probability of occurrence of an error due to  $SEU^1$  at node 23 is only 0.4375. Similarly, probability of occurrence of an error due to  $SEU^1$  at node 22 is also 0.4375. This is due to the type of input pattern. In this work, we assume random inputs. This result shows the dependence of input pattern on  $P(T_{i-i}|SEU_i)$ .

# A.1 Input Space Characterization

In this section, we describe the input space characterization for a particular observation exploring the diagnostic feature of the TALI-SES model. Note that this feature makes it really unique as instead of predicting the effect of inputs and SEU at a node on the outputs, we try to answer queries like "What input behavior will make SEU at node j definitely causing a bit-flip at the circuit outputs?" or "What input behavior will be more conducive to no error at output given that there is an SEU at node j?" Resolving queries like this, aids the designer in observing the input space and helps perform input clustering or modeling.

Let us take an example of c17 benchmark. We explore the input space for studying the effect of  $SEU^0$  and  $SEU^{\bar{1}}$  at node 19 on errors on both the outputs (22 and 23). One can charac-

TABLE III SEU Sensitivity ranges of gates in ISCAS circuits, when  $\delta = 1$ AND INPUT BIAS = 0.5

|       | # of SEU's | $p \le 0.3$ | $0.3$ | <i>p</i> > 0.6 |
|-------|------------|-------------|-------|----------------|
| c432  | 86         | 9           | 11    | 10             |
| c499  | 288        | 8           | 40    | 32             |
| c880  | 296        | 18          | 24    | 54             |
| c1355 | 512        | 0           | 192   | 0              |
| c1908 | 414        | 29          | 90    | 9              |
| c2670 | 366        | 0           | 46    | 61             |
| c3540 | 418        | 75          | 61    | 15             |
| c5315 | 1332       | 76          | 87    | 193            |
| c6288 | 1332       | 1           | 89    | 6              |
| c7552 | 1460       | 240         | 199   | 23             |

terize input space for any one of the outputs (or in general effect of SEU at any node on any other subset of nodes). Fig 4a characterizes the input space for an  $SEU^0$  at node 19 such that no bit-flip occur at the outputs. We plot the probabilities of each inputs 1, 2, 3, 6 and 7 for observing no output error for an  $SEU^0$ at 19. Each column in the plot represents an input. Now the lighter color represents the probability of that input = 0 and the black color represents the probability of input = 1 (sum of these two part should always be one). One can see that for observing no output error for an  $SEU^0$  at 19, input 1 can be random, input 2 and 7 have 67% probability of being at logic one and node 3 and 6 has probability of 33% for logic 1. Note that if the input space is nearly random (p(1)=p(0)=0.5), then an SEU<sup>1</sup> at node 19 causes no output error for both the outputs. Similar plots are shown in Fig. 4c and 4d for characterizing the input space for zero output errors while  $SEU^0$  or  $SEU^1$  occurs at node 11 of c17 benchmarks. Once again it can be seen that zero output error for  $SEU^1$  can be more likely by a random inputs than for SEU<sup>0</sup>. B. Larger Benchmarks

We use approximate inference for larger circuit using Probabilistic Logic sampling [18] which is pattern independent random markov chain sampling and has shown good results in many large industry-size applications.

In Table III, we report the number of possible SEU's causing soft errors (column 2). The reduced SEU list was created based on fanout-dependent delay model and assuming an SEU duration period  $\delta$  equal to one time unit. We report the estimated SEU sensitivity  $P(T_i)$  calculated as in Eq. 9 for all gates in the circuits that might cause error. Column 3 of Table III lists the number of all gates j where  $P(T_i)$  is less than or equal to 0.3, whereas columns 4 and 5 list number of gates for which  $P(T_i)$ lies in between 0.3 and 0.6, and above 0.6 respectively. These results are helpful to identify and classify nodes and apply redundancy measures or modify  $P(SEU_i)$  (by changing device features) to nodes selectively giving higher priority to nodes in column 5 and than that in lower ranges.

We implemented the SEU simulator based on the work done in [4] with a fanout-dependent delay model for the ground truth. We performed the simulation with 500,000 random vectors obtained by changing seed after every 50000 vectors to get the ground-truth SEU probabilities. For our probabilistic framework, we use Probabilistic Logic Sampling [18] inference scheme. SEU detection probabilities computed by Probabilistic

Logic Sampling (PLS) [18] with 9999 samples were c





Fig. 4. Input probabilities which give zero output errors for c17, in the presence of SEU's: node 11 (d)  $SEU^1$  at node 11

TABLE IV

SEU SENSITIVITY ESTIMATION ERRORS AND TIME FOR 9999 SAMPLES.

|       | $(E_{mean})$ | $(E_{max})$ | $T_{bn}(sec)$ |
|-------|--------------|-------------|---------------|
| c432  | 0.0031       | 0.0069      | 18.18         |
| c499  | 0.0024       | 0.0198      | 13.43         |
| c880  | 0.0027       | 0.0090      | 28.31         |
| c1355 | 0.0027       | 0.0120      | 28.84         |
| c1908 | 0.0028       | 0.0120      | 175.67        |
| c2670 | 0.0034       | 0.0130      | 34.70         |
| c3540 | 0.0023       | 0.0101      | 146.20        |
| c5315 | 0.0045       | 0.0112      | 120.79        |
| c7552 | 0.0035       | 0.0100      | 507.02        |

with ground-truth simulation results and average and maximum estimation errors,  $E_{mean}$  and  $E_{max}$  are reported in columns 2 and 3 respectively of Table IV. Propagation time,  $T_{bn}$  (column 4) is the time taken by the PLS scheme for belief propagation. We estimated the SEU sensitivities of all the ISCAS'85 benchmarks with an average belief propagation time of 140.49 sec, whereas the average estimation error is below 0.0034 which shows an excellent accuracy-time trade-off.  $T_{bn}$  is the total elapsed time, *including memory and I/O access*. This time is obtained by the ftime command in the WINDOWS environment on a Pentium-4 2.0 GHz PC.

It is evident from the results that using a graph-based causal, compact probabilistic framework, Bayesian Network, we are able to accurately model the Single-event-upset (SEU)sensitivities of logic circuit signals accounting for temporal and spatial dependencies. The exciting feature of this stimulus-free approach is that it uses conditional independencies in modeling spatial correlations and time-space transformation for capturing temporal dependencies.

# VII. CONCLUSION

We are able to effectively model the SEU sensitivity of individual nodes capturing spatial, temporal correlations, specially emphasizing the effect of inputs, gate delay, SEU duration and circuit structure. We show results with exact and approximate inferences showing excellent accuracy-time trade-offs. Moreover, we are able to characterize input space with respect to SEU causing error and no error. Future effort includes modeling with biased input patterns, for different SEU width  $\delta$  and also for other delay models, to study the effect of these factors on SEU sensitivities.

#### References

 K. Mohanram and N. A. Touba, "Cost-Effective Approach for Reducing Soft Error Failure Rate in Logic Circuits," *International Test Conference*, pp. 893–901, 2003.

Proceedings of the 19th International Conference on VLSI Design (VLSID'06)  $1063-9667/06 \$20.00 \ \odot 2006$  **IEEE** 

(c) (d) <sup>7</sup> SEU's: (a)  $SEU^0$  at node 19 (b)  $SEU^1$  at node 19 (c)  $SEU^0$  at [2] S. Bhanja and N. Ranganathan, "Switching activity estimation of VLSI

In 3 In 6 INPUTS

- [2] S. Bhanja and N. Ranganathan, "Switching activity estimation of VLSI circuits using Bayesian Networks," *IEEE Transaction on VLSI systems*, pp. 558 – 567, Feb. 2003.
- [3] P. Shivakumar, et al., "Modeling the Effect of Technology Trends on the Soft Error Rate of Combinational Logic," *Proc. International Conference* on Dependable Systems and Networks, pp. 389–398, 2002.
- [4] M. Sonza Reorda and M. Violante, "Accurate and efficient analysis of single event transients in VLSI circuits," *On-Line Testing Symposium*, pp. 101–105, 2003.
- [5] M. Sonza Reorda and M. Violante, "Fault List Compaction through Static Timing Analysis for Efficient Fault Injection Experiments," *Proc. Defect* and Fault Tolerance Symposium, pp. 263–271, 2002.
- [6] T. Karnik and P. Hazucha, "Characterization of soft errors caused by single event upsets in CMOS processes," *IEEE Transactions on Dependable* and Secure Computing, Volume: 1-2, pp. 128–143, Apr-Jun. 2004.
- [7] V. Degalahal, R. Rajaram, N. Vijaykrishan, Y. Xie and M. J Irwin, "The effect of threshold voltages on soft error rate," 5th International Symposium on Quality Electronic Design, March 2004.
- [8] Y. S. Dhillon, A. U. Diril and A. Chatterjee, "Soft-Error Tolerance Analysis and Optimization of nanometer circuits," *Proceedings of Design, Automation and Test in Europe*, Volume: 1, pp. 288–293, Mar. 2005.
- [9] Chong Zhao, Xiaoliang Bai and S. Dey, "A scalable soft spot analysis methodology for compound noise effects in nano-meter circuits," *Proceedings of Design Automation Conference*, pp. 894–899, Jun. 2004.
- [10] M. Zhang and N. R. Shanbhag, "A Soft Error Rate Analysis (SERA) Methodology" *International Conference on Computer Aided Design*, November, 2004.
- [11] Norbert Seifert *et al.* "Impact of Scaling on Soft-Error Rates in Commercial Microprocessors" *IEEE Transactions on Nuclear Science*, Volume: 49, No. 6, pp. 3100–3106, Dec. 2002.
- [12] P. Hazucha *et al.* "Measurement and Analysis of SER-Tolerant Latch in a 90-nm Dual-V – T CMOS Process" *IEEE Transactions on Solid-State Circuits*, Volume: 39, No. 9, pp. 1536–1543, Sept. 2004.
- [13] T. Rejimon and S. Bhanja, "An Accurate Probabilistic Model for Error Detection," *Proc. IEEE International Conference on VLSI Design*, pp. 717–722, Jan. 2005.
- [14] URL http://www.hugin.com
- [15] "GeNie", URL http://www.sis.pitt.edu/~genie/genie2
- [16] S. Manich and J. Figueras, "Maximizing the weighted switching activity in combinational CMOS circuits under the variable delay model," *European Design and Test Conference*, pp. 597–602, 1997.
- [17] M. Omana, G. Papasso, D. Rossi, C. Metra, "A Model for Transient Fault Propagation in Combinatorial Logic," *On-Line Testing Symposium*, pp. 111-115, 2003.
- [18] M. Henrion, "Propagation of uncertainty by probabilistic logic sampling in Bayes' networks," *Uncertainty in Artificial Intelligence*, 1988.
- [19] J. Pearl, "Probabilistic Reasoning in Intelligent Systems: Network of Plausible Inference," Morgan Kaufmann Publishers, Inc., 1988.
- [20] P. Robinson, W. Lee, R. Aguero and S. Gabriel, "Anomalies due to single event upsets," *Journal of Spacecraft and Rockets*, vol. 31, no. 2, pp. 166– 171, Mar-Apr 1994.
- [21] J.T. Wallmark and S.M. Marcus, "Minimum size and maximum packaging density of non-redundant semiconductor devices," *Proceedings of IRE*, vol. 50, pp. 286–298, March 1962.
- [22] G.H. Johnson, J.H. Hohl, R.D. Schrimpf and K.F. Galloway, "Simulating Single-event burnout in n-channel power MOSFETs," *IEEE Transactions* on *Electron Devices*, vol. 40, pp. 1001–1008, 1993.
- [23] R. G. Cowell, A. P. David, S. L. Lauritzen, D. J. Spiegelhalter, "Probabilistic Networks and Expert Systems", Springer-Verlag New York, Inc., 1999.
- [24] N. Ramalingam and S. Bhanja, "Causal Probabilistic Input Dependency Learning for Switching Model in VLSI Circuits", ACM GLSVLSI, 2005.

