Monday, February 4, 2013
From The Desk of
Here are some of the Basic Points of Embedded System Design:
A perfect design is the enemy of a good design. Designers striving for a perfect design often end up with no design at all due to schedule and cost overruns. A simple design may not provide the best solution to a given problem, but it probably has the best chance of meeting the schedule and cost constraints with acceptable quality. A simple design is also easier to implement, maintain and enhance.
It is advisable not to introduce additional complexity in the name of future design hooks. More often than not, these hooks turn out to be a liability rather than an asset, since future designers may feel forced to build their designs around them. Avoid adding design hooks for the future: they add work you do not need to do now and will be of little help to future designers.
Use a special-purpose computing platform only after you have exhausted all possibilities of using a general-purpose platform. For example, if your application requires signal processing capabilities, consider whether the performance goals can be met by a general-purpose PC platform without using Digital Signal Processors (DSPs). General-purpose processors may support specialized instructions that bring them on par with specialized platforms like DSPs. General-purpose platforms also offer low-cost software and hardware development tools, it is easy to find people skilled in using them, and their much higher market volume often makes them an order of magnitude cheaper than specialized platforms.
When developing a hardware and software architecture, prefer designs that reuse already developed software and hardware modules. The future reusability of these modules should also be a factor in choosing new architectures. Avoid the "let's start with a clean slate" approach to developing systems: new projects should build on the results of previous projects. This lowers cost by reducing the complexity of the system being developed.
Many embedded systems use home-grown protocols and operating systems, which adds cost to maintain the associated software. Using standard protocols and operating systems lowers cost and improves the stability of the product, as standard products have been subjected to rigorous testing in countless systems. Proprietary protocols and operating systems often cost much more due to the need to train developers.
Sunday, July 15, 2012
From The Desk of
Challenges in Independent Verification and Validation of Safety Critical Railway Signalling Systems
CMC Americas, Inc., Pittsburgh, PA, 15220
A railway signalling system is a safety-critical system that controls railway traffic, including train routes, shunting moves and the movements of all other railway vehicles, in accordance with railway rules, regulations and the technological processes required for the operation of the railway system. The overall signalling system consists of microprocessor-based wayside controllers, on-board systems controlling the railway vehicle, and supervision systems that monitor vehicle movements from a centralized location. The complex nature of railway signalling rules and of the operational practices adopted by different railroads makes software development for these systems difficult, and makes the Independent Verification and Validation of the system even more challenging. The CENELEC set of standards is widely accepted as the governing standard for the design, development and Independent Verification and Validation (IV and V) of railway signalling systems. This paper describes the challenges faced during the different phases of IV and V of safety-critical railway signalling software, which are unique compared to other domains.
IV and V = Independent Verification and Validation
ATP = Automatic Train Protection
On-Board = Embedded systems used on the Train
CENELEC = European standards for Railway Signalling
IV and V is the most important phase of any safety-critical system life cycle: its result decides the final outcome of the project and whether the product is fit for use. The IV and V of safety-critical software for railway signalling applications faces many challenges due to the complexity of the systems and the variations that depend on the geography and environment in which they must operate. This paper focuses on the experiences and challenges during the different phases of IV and V in a railway signalling project. The following areas will be discussed:
1) Systematic Problems
2) Challenges during Software Analysis
3) Challenges during System Integration and Field Validation Testing
4) Challenges during Test Result Analysis
II. Systematic Problems during IV and V of Railway Signalling Software
The following systematic problems are experienced during the IV and V of safety critical software developed for railway signalling applications:
1) Lack of formal methods in developing the control algorithms results in poor understanding of the system by a test engineer.
2) Lack of domain knowledge in railway signalling systems creates a technological gap between the software test engineers and the domain consultants. This leads to errors in software testing, which might leave unsafe failures undetected.
3) Since the software and hardware are so complex, complete testing of the system is not possible, and most faults are revealed at the field installation stage or during normal working of the system in the field.
4) The software is often changed for every geographical location, resulting in location-specific code. When the software structure is not in a generic form, it becomes difficult for the test engineer to develop test cases for every possible scenario.
5) The lack of standardization in railway working principles results in incomplete test cases, as test engineers are not well versed in all types of railroads.
6) Increasing software complexity makes testing difficult; since most railway systems are sequential machines, they are error-prone and very difficult to test.
III. Challenges during Software Analysis
The following section describes the challenges faced during Static and Dynamic analysis of safety critical software developed for railway signalling applications:
1) Static analysis of software is the analysis of the code without actually executing it. Railway signalling software, particularly the vehicle braking algorithms, is very complex and requires the test engineer to be well versed in the dynamics of the vehicle as well as to have good mathematical knowledge. These algorithms require the test engineer to envisage all the possible states of the algorithm and then create a formal model of the system. In many cases, the test engineer's lack of knowledge about these algorithms results in insufficient test cases for the model, so many errors are only revealed later, during dynamic analysis.
2) Dynamic analysis of software is the analysis of the code by actually executing it and observing the executions. The dynamic analysis of safety-critical software is an important phase of the Independent Verification and Validation of the system, and the test engineer should be well versed in the domain of the system's inputs and outputs. In many cases the test engineer chooses boundary values based on the data range of the variable type, but in the real world the boundary values depend on the actual working environment of the system. For example, the boundary values of the GPS signal received by the vehicle to determine its position vary with the geographical location of the railroad and are embedded in the vehicle database. Test cases for this part of the software should therefore change with the geographical location where the train operates; test cases based merely on the literal boundary values of the variable type would pass dynamic analysis, and the error would only be revealed during field validation tests.
3) Inexperienced test engineers just follow the rule book, which often results in insufficient test scenarios. Testing experience and intuition, combined with knowledge of and curiosity about the system under test, may add some uncategorized test cases to the designed test case set. Special values or combinations of values may be error-prone, and some interesting test cases may be derived from inspection checklists.
4) Test engineers are generally not well versed in the concept of error seeding and do not try to measure the effectiveness of their test cases. Some known error types should be inserted in the program, and the program should be executed with the test cases under test conditions. If only some of the seeded errors are found, the test case set is not adequate. The ratio of found seeded errors to the total number of seeded errors is an estimate of the ratio of found real errors to the total number of errors, which makes it possible to estimate the number of remaining errors and thereby the remaining test effort.
5) Performance testing of the system in the lab environment is often inadequate, since the simulators do not exactly replicate the field environment; as a result, many errors are only revealed during the field validation phase.
6) Test engineers often follow the concept of "Equivalence Classes and Input Partition Testing" to save time and testing effort. The application of this principle is often flawed due to the inexperience of the test engineer or insufficient coverage of the data classes.
7) The lack of formal methods in developing software prevents the test engineer from taking full advantage of "Structure Based Testing", which requires clearly defined states and modules for generating test cases with complete coverage of the system.
IV. Challenges during System Integration and Field Validation Testing
The following section describes the challenges faced during System Integration testing of safety critical railway signalling systems:
1) System integration testing should ideally start only after the unit/module tests have been successfully passed. In reality, due to the delayed nature of railway signalling projects, the system integration tests are carried out in parallel with the unit/module tests, which causes many problems to be revealed only at the system-level tests.
2) Many unexpected behaviours of the system are revealed at this stage because the software has not completely gone through unit tests. This causes further delays in system integration testing, and changes in the requirements become necessary because not all scenarios at the unit level have been accounted for.
3) The test engineers who should be devoted to finding integration issues find themselves more involved in sorting out problems that should have been caught during unit tests.
4) Integration issues found during this phase often lead to design changes, which are costly to fix and in turn increase the complexity of the system.
5) The field validation tests are executed in the actual field environment, and many of these scenarios are not accounted for in the lab. The same software therefore behaves differently in the lab and in the field, which causes confusion and is hard for the developers to debug.
V. Challenges in Analysis of Test Results
Railway signalling systems generally have large sets of test scenarios to be performed in the field, and much of the data collected during the tests requires offline analysis. The section below describes the challenges and problems associated with this phase.
1) In many railway projects, field engineers are recruited locally to ensure easy access to the test site; often these engineers are new to railway signalling and have little or no training in executing the tests.
2) The complex nature of offline log analysis requires the test analyst to be fully synchronized with the field engineer who executed the test. In many cases, the lack of communication between the field and offline test analysts results in tests being falsely reported as failed.
3) In many cases, due to inherent errors in the test procedure, the log file analyst reports the test as failed, and the test goes through multiple cycles of re-execution.
4) On-board ATP log analysis often requires checking the braking profiles of the system, which demands complete knowledge of the braking algorithms. In many cases the test engineer is not qualified to perform this analysis, resulting in poorly reported analysis.
5) The lack of coordination between the test lead and the field technicians often results in incomplete tests and, later, in incomplete log analysis.
Railway signalling is a very specialized and unique area where a high level of planning is required for all phases of the project life cycle, especially for the IV and V of safety-critical software. Poor planning at the start of the project usually results in cost overruns and delays. In our experience with railway signalling projects, limited budget and time are generally allocated to the IV and V phase, which in reality consumes the majority of the project budget. If the IV and V phase is planned well in advance and specific managerial responsibility is assigned to this task, projects can be completed on time and with better results, which in turn makes the job of the safety assessor easier. We suggest the following mitigation measures to ensure successful IV and V of railway signalling systems:
1) Care should be taken to recruit test engineers who at least have basic knowledge of railway signalling and associated systems.
2) In case the test engineers are new recruits, they should be put through rigorous training before being assigned critical tasks such as writing test procedures and analyzing the test data logs.
3) Regular training sessions should be conducted for the test engineers in the project to impart in-depth knowledge of the system.
4) Encourage test engineers to be innovative in their testing methods instead of just following the regular patterns; this reveals more errors in the system that often go undetected with traditional test methods.
5) Create an environment where test engineers regularly interact with the design team to share their experiences and concerns.
6) Create a dedicated managerial team to monitor and coordinate all the test activities occurring at different sites. Better coordination between the lab and field test teams leads to better analysis of the system.
7) Never follow the approach of parallel testing activities, for example, the system integration tests should never be planned in parallel with the unit tests.
The author would like to express his gratitude to Stephen A. Jacklin of the NASA Ames Research Center for his encouragement to take up this study and present these experiences with IV and V in the railway signalling domain.
1. S. Vinogradov, V. Okulevich, M. Gitsels, "Approaches to Meet Certification Requirements for Mission-Critical Domains", Software Engineering Conference (Russia), 16 Nov. 2006.
2. Ulrich Haspel, Gunni S. Frederiksen, "The Automated Copenhagen Metro in the First Year of Operation - Experience and Outlook", 9th International Conference on Automated People Movers, 2-5 September 2003, Singapore.
3. K. K. Bajpayee, "Emerging Trends in Signalling on Indian Railways", IRSTE Conference, 2003.
4. Peter Wigger, "Experience with Safety Integrity Level (SIL) Allocation in Railway Applications", WCRR 2001, 25-29 November 2001, Köln.
5. Dr. Hendrik Schäbe, "The Safety Philosophy Behind the CENELEC Railway Standards", ESREL 2002, Lyon, March 19-21, 2002.
6. G. Biswas, S. Kumar, T. K. Ghoshal, V. Chandra, "Independent Verification and Validation of Software with Reference to UFSBI", IRSTE Seminar, 1999.
7. Chinnarao Mokkapati, Terry Tse, Alan Rao, "A Practical Risk Assessment Methodology for Safety-Critical Train Control Systems", Office of Research and Development, Washington, D.C. 20590, DOT/FRA/ORD-09/15.
8. EN 50126: Railway Applications - The Specification and Demonstration of Reliability, Availability, Maintainability and Safety (RAMS). Issue: March 2000.
9. prEN 50129: Railway Applications - Communications, signalling and processing systems - Safety related electronic systems for signalling. Issue: May 2002.
10. prEN 50128: Railway Applications - Communications, signalling and processing systems - Software for railway control and protection systems. Issue: March 2001.
 Senior Systems Engineer, Embedded Systems Group, CMC Americas, Inc., email@example.com.
Saturday, January 21, 2012
From the desk of
Predictive Maintenance of Railway Points
Senior Systems Engineer, CMC Ltd
Abstract— The railway points (switches) are a vital component of any railway interlocking system. Regular maintenance is required to keep points in operating condition. Present maintenance of points involves frequent inspection by maintenance staff and is not foolproof. Currently available electronic monitoring systems only log events and give no predictive analysis of the health of the points subsystem. This paper discusses a new approach to the maintenance and diagnosis of railway points that is capable of remote monitoring and is intelligent enough to give predictive maintenance reports about the points' health. This reduces the effort and large costs of manual monitoring and, being foolproof, helps avoid accidents. Distributed data gathering and centralized data processing methods are discussed that not only report faults but also give predictive measures to be taken by the field staff to avoid catastrophic failures.
Railways traverse the length and breadth of our country, covering 63,140 route km, comprising broad gauge (45,099 km), metre gauge (14,776 km) and narrow gauge (3,265 km). The most important part of the railways for carrying out operations such as the safe movement of trains and communication between different entities is signalling. Railway signalling is governed by a concept called interlocking, and the main component of interlocking is the railway points, which use DC electric motors to switch the rails to a different route. Maintaining these vast and widespread assets to meet the growing traffic needs of a developing economy is no easy task and makes Indian Railways a complex cybernetic system. The current mechanism for maintaining railway points is completely manual and requires a large pool of maintainers to regularly check the validity of the point machine and the related point infrastructure; this process is neither cost-effective nor foolproof. With the traditional method of manual maintenance, rail operators have no prior warning of the need to replace or repair points. The discussion in this paper focuses on the development of a system that not only monitors the points remotely without manual intervention, but also diagnoses problems in the points, thus saving human lives and large manual maintenance costs. The motivation for developing a predictive maintenance system for railway points is as follows:
- To use an array of sensors to monitor all relevant parameters, in order to provide advanced warning of degradation prior to railways points failure.
- To provide predictive maintenance reports about the point machines to the maintainers.
- To provide continuous monitoring at both local and centralized locations.
- To provide an automated archival record from which broad trends can be extracted from the entire railway asset base.
- To provide, in the event of a catastrophic failure, the immediate past history to identify the cause.
Railway Points Structure
The following Figure 1 describes the architecture of railway points in operation.
Points, or switches as they are known, allow a rail vehicle to move from one set of rails to another. They are a 'digital output device' in that there are only two acceptable states for the point to be set in: 'normal' and 'reverse'. Movement is carried out by way of a geared motor, which actuates the stretcher bar. Location or state detection is made by a two-position, polarized, magnetic stick contactor. A signal is fed back from these switches to the signal box, where all point directions are controlled and monitored. The snap-action switches at the end of the stroke stop the machine and help brake the motor to reduce any impact at the end of the travel. Two stretcher bars (Figure 1) make sure that the switch rails remain the correct distance apart; this distance can vary between installations depending on the curvature of the main rails and the speed limit of that section of the track. There are usually two stretcher bars for each point machine. Any fault in this mechanism, such as poor securing of the bolts holding the stretcher bars or loose bolts, may lead to deadly accidents.
Proposed Predictive Maintenance System Architecture
The proposed architecture of the Predictive Maintenance System (PMS) for railway points is discussed below with reference to Figure 2.
Figure 2 Architecture of PMS
Sensors are used to measure the voltage, current, load and temperature of the point motor. The throwing load sensor measures the stress in the operating rod of the point machine. The sensor values are read in real time by the wayside device and sent over the GSM/GPRS network to a central location for analysis. The central station analyzes the data in real time, makes predictions about the point machines, and stores them in a database. The central location uses a web-server-based architecture, so the status of any point machine can be viewed in any internet browser at the central station, and local station maintainers can view the data by logging in to the web server. Based on the current consumption, the load sensor values and the point motor temperature, predictions are made for the maintenance or replacement of the point motors.
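The central-station prediction step described above can be sketched as a simple comparison of live sensor readings against a reference profile of a healthy point machine. This is an illustrative sketch, not the deployed PMS: the parameter names, nominal values and tolerances below are all hypothetical.

```python
# Hypothetical reference profile of a healthy point machine and the
# tolerance band allowed around each parameter before a maintenance
# alert is raised. Field names and figures are illustrative.
REFERENCE = {"current_a": 3.2, "load_n": 1800.0, "temp_c": 45.0}
TOLERANCE = {"current_a": 0.5, "load_n": 250.0, "temp_c": 15.0}

def assess_point_machine(reading: dict) -> list:
    """Return the parameters that deviate beyond tolerance."""
    alerts = []
    for key, nominal in REFERENCE.items():
        if abs(reading[key] - nominal) > TOLERANCE[key]:
            alerts.append(key)
    return alerts

# Elevated motor current with normal load and temperature suggests
# increased mechanical friction in the point mechanism.
reading = {"current_a": 4.1, "load_n": 1850.0, "temp_c": 46.0}
print(assess_point_machine(reading))  # ['current_a']
```

A real system would build the reference profile from the recorded characteristics of good working points, as described in the data processing section below.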
Data processing and analytics
The system has a database of current and load characteristics of good working railway points. This data is used as a reference for processing real time data received from the wayside units. The following figures show the current (i) and load sensor values plotted against time during point machine operation.
Data Processing Techniques
Various signal processing techniques are available for the analysis of real-time data, as described below:
1) Data cluster method – This involves recording the characteristics of a parameter of a subsystem under different simulated conditions and then using this as a reference to validate the real-time data. This method differs from template matching in that it is not entirely based on matching the plotted characteristics.
2) Template matching – Entails comparing complete data sets with pre-recorded examples of data resulting from known fault conditions. The method can be used effectively in some circumstances, provided a representation of the data that produces good discrimination between pattern classes can be made. However, this requires a substantial amount of experimentation with different transformations of the data sets to find such distinctions, and would be a computationally intensive process.
3) Statistical and decision theoretic methods – Matches are made based on statistical features of the signal. For example, the mean and peak-to-peak value are evaluated for each vector, and plotted in feature space, whereby different patterns are distinguishable because they form clusters for each class that are located apart from the fully functioning case.
4) Structural or syntactic methods – Involves deconstructing a pattern or vector into structural components, to enable comparisons to be made on more simple, sub-segments of data rather than a complete vector. Mathematically, these methods are similar to fractal-based compression routines.
The method of specific interest to this project was a data clustering methodology: a database of good measurements, along with load sensor readings recorded under various faults simulated in the laboratory on specimen railway points, is stored, and the real-time load sensor data is then plotted against it. This generates very distinctive clusters of data points, each representing a type of fault.
By applying the above techniques, we obtain clusters of fault data. We have found that these data clusters are unique in the sense that they represent different types of faults.
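The clustering idea can be sketched as follows: each fault type recorded in the lab is reduced to a feature vector (here the mean and peak-to-peak value, the statistical features mentioned above), and a live trace is assigned to the nearest cluster centre. The fault labels and centroid values below are illustrative, not measured data.

```python
import math

def features(trace):
    """Reduce a sensor trace to a (mean, peak-to-peak) feature vector."""
    mean = sum(trace) / len(trace)
    return (mean, max(trace) - min(trace))

# Hypothetical cluster centres built from faults simulated in the lab
CENTROIDS = {
    "healthy": (3.0, 1.0),
    "obstruction_at_toe": (4.5, 2.5),
    "loose_stretcher_bar": (3.0, 3.5),
}

def classify(trace):
    """Assign a live trace to the nearest reference cluster."""
    f = features(trace)
    return min(CENTROIDS, key=lambda label: math.dist(f, CENTROIDS[label]))

print(classify([2.8, 3.1, 3.0, 3.2, 2.9]))  # 'healthy'
```

In practice the reference clusters would come from the stored database of good and simulated-fault measurements, and richer features than mean and peak-to-peak would likely be needed to separate all fault classes.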
Figure 5 Force Data Clusters
Types of faults detectable
- Tight lock on reverse side (sand on bearers both sides) – Refers to the lock which holds the point in position after it has changed direction. This lock prevents the point from moving out of position because of vibration.
- A 12-mm obstruction at toe on normal side – Simulates a piece of ballast impeding point motion between the toe of the switch rail (the mobile section of rail), and the stock rail.
- Back drive slackened off at toe end on LHS – The drive to the midpoint of the switch rail is only loosely connected to the stretcher bar. The stretcher bar holds the mobile rails a fixed distance apart.
- Back drive slackened off at toe end on RHS – Similar to the above.
- Back drive tightened at heel end on RHS – Similar to the above.
- Back drive tightened at heel end on LHS – Similar to the above.
- Diode snubbing block disconnected – An electrical fault.
- Drive rod stretcher bar loose on RHS – Connecting bar between the switch rails is loose. A dangerous fault.
- Operational contact slackened off by four holes – Applies to the contact for detecting when the point has completed motion.
Saturday, September 24, 2011
From the desk of
Background: The dynamic analysis of safety-critical software is an important phase of the Independent Verification and Validation of the system. EN 50128 details the methods that shall be used for this phase of the verification life cycle. This phase is so critical to the project outcome that it demands meticulous planning and organization. Here we discuss the dynamic analysis methods suggested by the CENELEC standards for SIL 4 software.
Boundary Value Analysis
The aim of this method is to remove software errors occurring at parameter limits or boundaries. The input domain of the program is divided into a number of input classes. The tests should cover the boundaries and extremes of the classes. The tests check that the boundaries in the input domain of the specification coincide with those in the program. The use of the value zero, in a direct as well as in an indirect translation, is often error-prone and demands special attention:
- Zero divisor;
- Blank ASCII characters;
- Empty stack or list element;
- Null matrix;
- Zero table entry.
Normally the boundaries for input have a direct correspondence to the boundaries for the output range. Test cases should be written to force the output to its limit values. Consider also whether it is possible to specify a test case that causes the output to exceed the specification boundary values. If the output is a sequence of data, for example a printed table, special attention should be paid to the first and last elements and to lists containing zero, one and two elements.
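The boundary selection described above can be illustrated with a small example. The function under test is hypothetical (a speed-limit check with an allowed range of 0 to 120 km/h inclusive); the point is the choice of test values sitting exactly on, just inside, and just outside each boundary, with zero covered explicitly since it is error-prone.

```python
def speed_in_range(speed_kmh: int) -> bool:
    """Hypothetical unit under test: valid speeds are 0..120 km/h."""
    return 0 <= speed_kmh <= 120

# Boundary value test cases: on, just inside, and just outside each limit
boundary_cases = [
    (-1, False),   # just below the lower boundary
    (0, True),     # the lower boundary itself (zero demands attention)
    (1, True),     # just above the lower boundary
    (119, True),   # just below the upper boundary
    (120, True),   # the upper boundary itself
    (121, False),  # just above the upper boundary
]

for value, expected in boundary_cases:
    assert speed_in_range(value) == expected, value
print("all boundary cases pass")
```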
Error Guessing
The aim of this method is to remove common programming errors. Testing experience and intuition, combined with knowledge of and curiosity about the system under test, may add some uncategorised test cases to the designed test case set. Special values or combinations of values may be error-prone. Some interesting test cases may be derived from inspection checklists. It may also be considered whether the system is robust enough. Can the buttons on the front panel be pushed too fast or too often? What happens if two buttons are pushed simultaneously?
Error Seeding
The aim of this method is to ascertain whether a set of test cases is adequate. Some known error types are inserted in the program, and the program is executed with the test cases under test conditions. If only some of the seeded errors are found, the test case set is not adequate. The ratio of found seeded errors to the total number of seeded errors is an estimate of the ratio of found real errors to the total number of errors. This gives a possibility of estimating the number of remaining errors and thereby the remaining test effort. The detection of all the seeded errors may indicate either that the test case set is adequate, or that the seeded errors were too easy to find. A limitation of the method is that, in order to obtain any usable results, the error types as well as the seeding positions must reflect the statistical distribution of real errors.
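The estimate described above is a simple ratio calculation, sketched below with made-up numbers: if 8 of 10 seeded errors are found, the test set is assumed to find real errors at the same 80% rate, so the count of remaining real errors can be estimated from the count already found.

```python
def estimate_remaining(seeded: int, seeded_found: int, real_found: int) -> float:
    """Estimate remaining real errors from error-seeding results."""
    detection_ratio = seeded_found / seeded        # e.g. 8/10 = 0.8
    estimated_total = real_found / detection_ratio  # e.g. 40/0.8 = 50
    return estimated_total - real_found             # e.g. 50 - 40 = 10

# 10 errors seeded, 8 found by the tests; the same tests found 40 real errors
print(estimate_remaining(10, 8, 40))  # ~10 real errors estimated remaining
```

Note the limitation stated above: the estimate is only as good as the match between the seeded errors and the statistical distribution of real errors.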
Performance Modelling
The aim of this method is to ensure that the working capacity of the system is sufficient to meet the specified requirements. The requirements specification includes throughput and response requirements for specific functions, perhaps combined with constraints on the use of total system resources. The proposed system design is compared against the stated requirements by:
- Defining a model of the system processes, and their interactions,
- Identifying the use of resources by each process (for example, processor time, communications bandwidth, storage devices, etc.),
- Identifying the distribution of demands placed upon the system under average and worst-case conditions,
- Computing the mean and worst-case throughput and response times for the individual system functions.
For simple systems an analytic solution may be possible, whilst for more complex systems some form of simulation is required to obtain accurate results. Before detailed modelling, a simpler 'resource budget' check can be used which sums the resource requirements of all the processes. If the requirements exceed the designed system capacity, the design is infeasible. Even if the design passes this check, performance modelling may show that excessive delays and response times occur due to resource starvation. To avoid this situation, engineers often design systems to use only some fraction (e.g. 50%) of the total resources so that the probability of resource starvation is reduced.
Equivalence Classes and Input Partition Testing
The aim of this method is to test the software adequately using a minimum of test data. The test data is obtained by selecting the partitions of the input domain required to exercise the software. This testing strategy is based on the equivalence relation of the inputs, which determines a partition of the input domain.
Test cases are selected with the aim of covering all subsets of this partition. At least one test case is taken from each equivalence class. There are two basic possibilities for input partitioning which are:
- Equivalence classes may be defined on the specification. The interpretation of the specification may be either input oriented, for example the values selected are treated in the same way or output oriented, for example the set of values leading to the same functional result; and
- Equivalence classes may be defined on the internal structure of the program. In this case the equivalence class results are determined from static analysis of the program, for example the set of values leading to the same path being executed
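The specification-based partitioning described above can be illustrated with a hypothetical signal aspect decoder: the input domain is split into classes that the specification treats identically, and one representative value is tested from each class rather than every possible value. The decoder and its classes are invented for illustration.

```python
def decode_aspect(code: int) -> str:
    """Hypothetical unit under test: map a wire code to a signal aspect."""
    if code < 0 or code > 255:
        return "invalid"
    if code == 0:
        return "stop"
    if code <= 127:
        return "caution"
    return "proceed"

# One representative per equivalence class, including the invalid classes
partition = {
    "invalid_low": (-5, "invalid"),    # class: code < 0
    "stop": (0, "stop"),               # class: code == 0
    "caution": (64, "caution"),        # class: 1..127
    "proceed": (200, "proceed"),       # class: 128..255
    "invalid_high": (300, "invalid"),  # class: code > 255
}

for name, (value, expected) in partition.items():
    assert decode_aspect(value) == expected, name
print("one test case per equivalence class passed")
```

Combined with the boundary value analysis above, representatives would also be taken at the edges of each class (0, 127, 128, 255).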
Structure Based Testing
The aim of this method is to apply tests which exercise certain subsets of the program structure. Based on an analysis of the program, a set of input data is chosen such that a large fraction of selected program elements are exercised. The program elements exercised can vary depending upon the level of rigour required.
- Statements: This is the least rigorous test since it is possible to execute all code statements without exercising both branches of a conditional statement.
- Branches: Both sides of every branch should be checked. This may be impractical for some types of defensive code.
- Compound Conditions: Every condition in a compound conditional branch (i.e. conditions linked by AND/OR) is exercised.
- LCSAJ: A linear code sequence and jump is any linear sequence of code statements including conditional jumps terminated by a jump. Many potential sub-paths will be infeasible due to constraints on the input data imposed by the execution of earlier code.
- Data Flow: The execution paths are selected on the basis of data usage, for example a path where the same variable is both written and read.
- Call Graph: A program is composed of subroutines which may be invoked from other subroutines. The call graph is the tree of subroutine invocations in the program. Tests are designed to cover all invocations in the tree.
- Entire Path: Execute all possible paths through the code. Complete path testing is normally infeasible due to the very large number of potential paths.
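The difference between the statement, branch, and compound-condition levels can be shown with a small, hypothetical interlocking-style check (the function and its conditions are invented for illustration only):

```python
def allow_movement(route_locked, track_clear, signal_ok):
    """Hypothetical permissive check with a compound condition."""
    if route_locked and (track_clear or signal_ok):
        return True
    return False

# A single test reaching the 'return True' statement would satisfy
# statement coverage of that line alone; branch coverage additionally
# requires the False outcome, and compound-condition coverage forces
# each condition inside the AND/OR expression to vary:
assert allow_movement(True, True, False) is True    # OR true via left operand
assert allow_movement(True, False, True) is True    # OR true via right operand
assert allow_movement(True, False, False) is False  # OR wholly false
assert allow_movement(False, True, True) is False   # AND fails on first condition
```

Each successive level demands strictly more test cases for the same code, which is the sense in which the list above is ordered by rigour.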
Saturday, September 17, 2011
From the desk of
The Computer Based Interlocking Architecture
Solid State Interlocking systems for railways should ensure the following:
- Fail safety
Architecture and methodology
Generally, the following three types of redundancy techniques are used for achieving fail-safety in the design of signaling systems:
Hardware Redundancy – In this case, two or more hardware modules of identical design with common software are used to carry out the safety functions, and their outputs are continuously compared. The hardware units operate in a tightly synchronised mode, with comparison of outputs in every clock cycle. Due to the tight synchronisation, it is not possible to use diverse hardware or software. In this method, although random failures are taken care of, it is difficult to ensure detection of systematic failures due to the use of identical hardware and software.
Software Redundancy – This approach uses a single hardware unit with diverse software. The two software modules are developed independently and generally utilize inverted data structures to take care of common-mode failures. However, rigorous self-check procedures must be adopted to compensate for the use of a single hardware unit.
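The inverted-data-structure idea can be sketched as follows: safety-relevant state is held both in its normal form and as its bitwise complement, so that a corruption of either copy is revealed at the next read. The class, word width, and values below are hypothetical:

```python
MASK = 0xFFFF  # assumed 16-bit storage word

class InvertedRegister:
    """Hold a value together with its bitwise complement; a mismatch
    on read reveals corruption of either copy, which must then drive
    the system to its safe state."""
    def __init__(self, value):
        self.value = value & MASK
        self.inverted = ~value & MASK

    def read(self):
        # Self-check: value XOR complement must be all ones.
        if (self.value ^ self.inverted) != MASK:
            raise RuntimeError("self-check failed: enforce safe state")
        return self.value

r = InvertedRegister(0x1234)
assert r.read() == 0x1234   # both copies consistent
r.inverted ^= 0x0001        # simulate a single-bit corruption
# r.read() would now raise and force the transition to the safe state
```

This is only one of the self-check mechanisms such a design needs; it guards stored data, not the processor's computation itself.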
Hybrid Model – The hardware units are loosely synchronised: the units operate in alternate cycles, and the outputs are compared after a full operation of the two modules. Therefore, it is no longer required to use identical hardware and software. Although the systems installed in the field use identical hardware and software, the architecture permits the use of diverse hardware and software. Moreover, operation of the two units in alternate cycles permits the use of a common system bus and interface circuitry.
The hardware and software are designed to meet the above requirements. The various techniques used in practice are discussed below:
Table 1: Existing Fail-safe Methods employed in Design of Computer Based Interlocking Systems

1. Time Redundancy (Refer: Figure 5: Time Redundancy)
- Method of Implementation: The same software is executed on the same hardware during two different time intervals.
- Type of Errors Detected: Errors caused by transients; they are avoided by reading at two different time intervals.
- Practical Problems with the Method: A single hardware fault leads to shutdown of the system. This method is not used, since software faults are not completely found in validation, and the self-diagnostics employed for checking hardware faults are not complete.

2. Hardware Redundancy (Refer: Figure 6: Hardware Redundancy)
- Method of Implementation: The same software is executed on two identical hardware channels.
- Type of Errors Detected: Hardware faults are detected, since the outputs from both channels are compared, and a single hardware fault does not lead to shutdown of the system.
- Practical Problems with the Method: Software faults are not detected, since the same software runs on two identical hardware channels; software faults introduced at the design stage remain undetected.

3. Hardware Diversity (Refer: Figure 7: Hardware Diversity)
- Method of Implementation: Identical software is executed on different hardware versions.
- Type of Errors Detected: Hardware design faults introduced at the initial stage are detected.
- Practical Problems with the Method: Software faults at the design stage are still not detected.

4. Software Diversity (Refer: Figure 8: Software Diversity)
- Method of Implementation: Different software versions are executed on the same hardware during two different time intervals.
- Type of Errors Detected: Software faults at the design stage are detected.
- Practical Problems with the Method: Even though the software is diverse, it is executed on a single hardware channel; a single hardware fault leads to shutdown of the system.

5. Diverse Software on Redundant Hardware (Refer: Figure 9: Diverse software on redundant hardware)
- Method of Implementation: Different software versions are executed on two identical hardware channels.
- Type of Errors Detected: Software faults at the design stage are detected, and a single hardware fault does not lead to system shutdown.
- Practical Problems with the Method: Hardware faults at the design stage are not detected.

6. Diverse Software on Diverse Hardware (Refer: Figure 10: Diverse software on Diverse Hardware)
- Method of Implementation: Different software versions are executed on two different hardware channels.
- Type of Errors Detected: Software faults and hardware faults are detected at the design stage.
- Practical Problems with the Method: This method is rarely used, since the design complexity involved is high.
Sunday, September 11, 2011
From the Desk of
CENELEC Standard: Faults and Effects
Effects of single faults
It is necessary to ensure that the system/sub-system/equipment meets its Tolerable Hazard Rate (THR) in the event of a single random fault. It is necessary to ensure that SIL 3 and SIL 4 systems remain safe in the event of any kind of single random hardware fault which is recognized as possible. Faults whose effects have been demonstrated to be negligible may be ignored. This principle, which is known as fail-safety, can be achieved in several different ways:
1) Composite fail-safety
With this technique, each safety-related function is performed by at least two items. Each of these items shall be independent from all others, to avoid common-cause failures. Non-restrictive activities are allowed to progress only if the necessary number of items agree. A hazardous fault in one item shall be detected and negated in sufficient time to avoid a co-incident fault in a second item.
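The composite principle, reduced to its essentials, is a two-out-of-two agreement check: a non-restrictive output is issued only when both independent items agree, and any disagreement is negated to the restrictive (safe) state. The sketch below is illustrative only; the item functions and aspect names are invented:

```python
SAFE = "restrictive"  # most restrictive aspect, assumed to be the safe default

def two_out_of_two(item_a, item_b, inputs):
    """Allow a non-restrictive output only when both independent
    items agree; any disagreement is negated to the safe state."""
    a, b = item_a(inputs), item_b(inputs)
    return a if a == b else SAFE

# Two hypothetical, independently developed evaluations of the same rule
item1 = lambda s: "proceed" if s["track_clear"] else "restrictive"
item2 = lambda s: "proceed" if s.get("track_clear", False) else "restrictive"

assert two_out_of_two(item1, item2, {"track_clear": True}) == "proceed"

# A hazardous fault injected into one item is detected and negated:
faulty = lambda s: "proceed"   # stuck-at-permissive failure
assert two_out_of_two(item1, faulty, {"track_clear": False}) == "restrictive"
```

In a real system the two items would run on independent channels with enforced diversity, and the comparison itself would have to meet the detection-and-negation timing requirement discussed below.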
2) Reactive fail-safety
This technique allows a safety-related function to be performed by a single item, provided its safe operation is assured by rapid detection and negation of any hazardous fault (for example, by encoding, by multiple computation and comparison, or by continual testing). Although only one item performs the actual safety-related function, the checking/testing/detection function shall be regarded as a second item, which shall be independent to avoid common-cause failures.
3) Inherent fail-safety
This technique allows a safety-related function to be performed by a single item, provided all the credible failure modes of the item are non-hazardous. Any failure mode which is claimed to be incredible (for example, because of inherent physical properties) shall be justified using the procedure defined in Annex C. Inherent fail-safety may also be used for certain functions within Composite and Reactive fail-safe systems, for example to ensure independence between items, or to enforce shut-down if a hazardous fault is detected.
Whichever technique or combination of techniques is used, assurance that no single random hardware component failure mode is hazardous shall be demonstrated using appropriate structured analysis methods. The component failure modes to be considered in the analysis shall be identified using the procedures defined in Annex C.
In systems containing more than one item whose simultaneous malfunction could be hazardous, independence between items is a mandatory precondition for safety concerning single faults. Appropriate rules or guidelines shall be fulfilled to ensure this independence. The measures taken shall be effective for the whole life-cycle of the system. In addition, the system/sub-system design shall be arranged to minimize potentially hazardous consequences of loss-of-independence caused by, for example, a systematic design fault, if one could exist.
Detection of single faults
A first fault (single fault) which could be hazardous, either alone or if combined with a second fault, shall be detected and a safe state enforced (i.e.: negated) in a time sufficiently short to fulfill the specified quantified safety target. Demonstration of this shall be achieved by a combination of Failure Modes and Effects Analysis (FMEA) and quantified assessment of Random Failure Integrity.
In the case of Composite fail-safety, this requirement means that a first fault shall be detected, and a safe state enforced, in a time sufficiently short to ensure that the risk of a second fault occurring during the detection-plus-negation time is smaller than the specified probabilistic target. In the case of Reactive fail-safety, this requirement means that the maximum total time taken for detection-plus-negation shall not exceed the specified limit for the duration of a transient, potentially hazardous, condition.
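The composite-fail-safety requirement above amounts to a small quantified check: given a constant failure rate for the second item, the probability that it fails during the first fault's detection-plus-negation window must stay below the probabilistic target. The figures below are hypothetical and taken from no standard:

```python
import math

def coincident_fault_probability(failure_rate_per_h, negation_time_h):
    """Probability that a second item fails during the first fault's
    detection-plus-negation window, assuming a constant failure rate
    (exponential failure model): P = 1 - exp(-lambda * T)."""
    return 1.0 - math.exp(-failure_rate_per_h * negation_time_h)

# Assumed figures: second-item failure rate of 1e-5 per hour and a
# detection-plus-negation time of 1 second (1/3600 hour).
p = coincident_fault_probability(1e-5, 1.0 / 3600.0)
print(p)            # on the order of 3e-9 per detected first fault
assert p < 1e-8     # compared against a hypothetical probabilistic target
```

For small lambda * T the result reduces to approximately lambda * T, which is the usual back-of-the-envelope form; shortening the negation time directly tightens the achievable risk figure.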
Effects of multiple faults
A multiple fault (for example, a double or triple fault) which could be hazardous, either directly or if combined with a further fault, shall be detected and a safe state enforced (i.e.: negated) in a time sufficiently short to fulfill the specified safety target. A suitable method, for example Fault Tree Analysis (FTA), shall be used to demonstrate the effects of multiple faults. The techniques used to achieve detection-plus-negation of multiple faults within the permitted time shall be shown, including supporting calculations.