Sunday, December 30, 2007

Design principles in Safety Technology

In safety technology, several basic design principles are applied. Two of them are briefly described below.

Fail safe

The fail-safe principle requires that a safety-relevant system or component enters a safe state when it fails. The main precondition for applying this principle is that a safe state exists. For the railway, this is a state in which all trains are at a standstill on the track. If such a state exists, technical systems can be designed to enter it when they fail. A typical example is the train protection system. However, the fail-safe principle cannot always be applied.
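The idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Aspect` type, the `signal_aspect` function and its two inputs are hypothetical names, and the safe state is assumed to be a signal at stop.

```python
from enum import Enum


class Aspect(Enum):
    STOP = "stop"        # assumed safe state: trains are held at standstill
    PROCEED = "proceed"


def signal_aspect(route_clear, self_test_ok):
    """Return the aspect to display (hypothetical controller logic).

    Any detected fault, and any input that is not an explicit True,
    forces the safe state.
    """
    if self_test_ok is not True:
        return Aspect.STOP          # failure detected: enter the safe state
    if route_clear is True:
        return Aspect.PROCEED
    return Aspect.STOP              # unknown or missing input is treated as unsafe
```

The permissive output is produced only when everything is positively confirmed; every other path falls through to the safe state.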

Safe life

An airplane, for example, is a system that does not have a safe state. In such cases, the safe-life principle has to be applied. It requires the use of redundant and highly reliable components to make sure that the system always functions.
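The effect of redundancy can be illustrated with the usual textbook formula for parallel components. This is a sketch that assumes the redundant components fail independently of each other, which real systems only approximate; the function name is illustrative.

```python
def parallel_reliability(reliabilities):
    """Reliability of a redundant (parallel) arrangement.

    Assumes independent failures: the arrangement fails only if
    every component fails, so R = 1 - product(1 - R_i).
    """
    unreliability = 1.0
    for r in reliabilities:
        if not 0.0 <= r <= 1.0:
            raise ValueError("each reliability must lie between 0 and 1")
        unreliability *= 1.0 - r
    return 1.0 - unreliability


# Two redundant channels, each assumed to be 99% reliable
print(parallel_reliability([0.99, 0.99]))
```

With two such channels the result is approximately 0.9999, which shows why redundancy is combined with highly reliable components rather than used as a substitute for them.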

 



DISCLAIMER "The information contained in this e-mail message and/or attachments to it may contain confidential or privileged information. If you are not the intended recipient, any dissemination, use, review, distribution, printing or copying of the information contained in this e-mail message and/or attachments to it are strictly prohibited. If you have received this communication in error, please notify us by reply e-mail or telephone and immediately and permanently delete the message and any attachments. Thank you"

Friday, December 28, 2007

HALT and HASS Testing: Learning to Handle the Big Guns

 

A lack of standards for the correct implementation of the stress test techniques known as HALT and HASS has resulted in widespread confusion. When implemented correctly, HALT and HASS provide a fast, cost-effective path to greater product reliability and customer satisfaction, as well as reduced warranty costs.

Since they were first introduced in the early 1980s, Highly Accelerated Life Testing (HALT) and Highly Accelerated Stress Screening (HASS) have been successfully adopted for a host of high-performance applications, such as mission-critical avionics equipment. With their promise of quickly providing valuable information about the reliability of a new or modified design, and the ability to monitor production and prevent component variations from causing latent field reliability issues, HALT and HASS techniques are ideal for designing and manufacturing with commercial-grade components.

Both test methods use direct inject, high flow rate liquid nitrogen cooling, tens of kilowatts of heating and powerful, multi-axis broad-spectrum vibration. Although these aggressive test methods are very different from standard life testing, design verification testing (DVT) and end-of-production testing, there are no published industry standards that define these powerful test methods. Since they deploy extreme stresses designed to rapidly precipitate flaws and force them to failure, misapplications or misinterpretations of these tests can easily result in damaged products, wasted money and frustrated engineers.

HALT is used as part of the new product design process and is typically performed on pilot or pre-production units. During HALT testing, the product is subjected to increasing stresses until weak points in the design emerge. Failure modes are identified and analyzed, and the product design is modified based on the results of that analysis. A typical HALT test will take three to five days. HASS, on the other hand, is a production screen, and typically tests 100% of production units. HASS uses similar stresses to those used in HALT, but at lower levels based on the limits identified in HALT. HALT must be completed before HASS can be implemented, and HALT is the more widely used of the two tests.

HALT and DVT

Although HALT may appear similar to DVT, it has different goals, uses different stresses and provides different results. The goal of DVT is to demonstrate whether a product will function in its intended environment and meet its specifications. The purpose of HALT, however, is to subject the product to environmental overstress, effectively forcing failure modes to emerge by accelerating mechanical fatigue. HALT quickly identifies a particular product's set of failure modes by applying the same environmental stresses that occur in the field, but at much higher levels. DVT and life testing can sometimes identify those failures, but this rarely occurs because the required time and number of units in test would be extreme.

One of the most significant characteristics of HALT is that it is not a pass/fail test. There are no pre-established limits. The test concludes when product destruct limits have been reached or the engineers determine that no more useful information can be gained. A final HALT test report includes detailed data on the product's operating margin, destruct margin and design flaws, along with what the new margins will be if each of the design flaws is eliminated.

When HALT is used, it is performed before DVT, so failure modes are exposed quickly and inexpensively before DVT begins. At that point, they can be analyzed and corrected without the pressure of a looming release date. If this is not done, many products will exhibit multiple failures during DVT. This can initiate costly and time-consuming redesign/retest cycles. But as a product nears its scheduled release date, the pressure to pass DVT can be intense. Too often, dealing with these critical failures may be postponed until after product launch, resulting in even greater losses and customer dissatisfaction.

The HALT Test Method

The stresses used in HALT are applied beginning with the least destructive and ending with the most destructive. A test sequence starts with cold step stressing and proceeds to hot step, rapid thermal ramps and vibration. It ends with a combined environment of vibration and rapid thermal ramps, dwelling at both temperature extremes. Other stresses include input voltage variations, loading, clock frequency variations and mechanical loading, if appropriate. Combining stresses will often reveal failure modes that individual stresses cannot.
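The step-stress part of the sequence can be sketched as a simple loop. This is an illustration only: `find_operating_limit` and the `passes_functional_test` callback are hypothetical names, and the temperatures and step sizes in the usage lines are made-up values, not recommended test levels.

```python
def find_operating_limit(start, step, stop, passes_functional_test):
    """Step a single stress from `start` toward `stop`.

    `step` is negative for cold step stressing and positive for hot
    step stressing. Returns the last level at which the unit still
    passed its functional test, or None if it failed at the first level.
    """
    if step == 0:
        raise ValueError("step must be non-zero")

    def within_range(level):
        return level >= stop if step < 0 else level <= stop

    last_pass = None
    level = start
    while within_range(level):
        if not passes_functional_test(level):
            break                      # document the failure, then stop stepping
        last_pass = level
        level += step
    return last_pass


# Hypothetical unit that works between -45 and +95 degrees C
unit_ok = lambda temp_c: -45 <= temp_c <= 95
cold_limit = find_operating_limit(20, -10, -100, unit_ok)
hot_limit = find_operating_limit(20, 10, 150, unit_ok)
print(cold_limit, hot_limit)
```

In a real HALT run, each failure would be analyzed and, where possible, worked around before stepping continues; the sketch simply stops at the first failure.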

Each time a failure occurs it is carefully documented and, if possible, a quick work-around is identified. Testing concludes when multiple failures occur simultaneously or fundamental design or technology limits have been reached for individual and combined stresses.

The potential benefits from HALT are significant. A single failure mode, caught before it becomes an issue that requires field rework, can save millions of dollars and help maintain a company's reputation and likelihood of getting future contracts. In addition, using HALT helps DVT go smoothly, so products are more likely to be released on time.

HALT may be considered successful when DVT and product launch proceed without last-minute design changes caused by late detection of failures. Success is further characterized by a lack of field issues in the weeks and months following launch. But a successful HALT also requires other conditions. The development team must accept ownership of the process from the beginning. HALT must be applied as early as practical in the design process, and failure analysis must be fast and accurate. It is imperative that failures are not overlooked or explained away, and the product development team must apply solid judgment when deciding which failure modes to eliminate.

The vibration stress used in HALT can be another source of confusion, since it deploys a type of shaker system different from that used in DVT. The Electro-Dynamic (ED) shakers deployed in DVT can be carefully controlled to provide exactly the stimulus needed for an analysis of the product's vibration response. They provide this stimulation in only one axis at a time. In HALT, rapid fatigue, not analysis, is the goal, and Repetitive Shock (RS) systems are used. These systems (Figure 1) can stimulate a product with a much wider range of frequencies, in all three axes and the rotations about these axes simultaneously. This stimulation will rapidly drive a poor solder joint or weak mechanical connection to failure.

HASS Production Screening

Once a product has been ruggedized with HALT, the question of production testing arises. Manufacturing variations and vendor changes can mean disaster, whether in a high-dollar, low-volume product, or one to be used in critical applications where failure can be very expensive or dangerous. Companies often use long burn-in tests to reduce these risks, only to discover that burn-in failures are rare, yet warranty issues are still a problem.

This is where the HASS production screen comes in. It applies stresses similar to those used in HALT, but at substantially reduced levels, based on the limits identified in HALT for each of the applied stresses. HASS provides continuous verification that additional failure modes, resulting from manufacturing or component variations, have not crept into the product.

Unlike HALT, HASS is a pass/fail test. A HASS screen consists of a “precipitation” phase that may exceed operating limits. This is followed by a detection phase in which the stresses are reduced to within operating limits and the product is monitored for failures. The test usually requires from 30 minutes to two hours, and in many cases eliminates the need for 24 or 48 hours of largely ineffective burn-in. Commercial-grade components bring potential lot-to-lot variations, and with them the risk that a component change introduces a new field failure mode that would go undetected by functional testing or a few days of burn-in. HASS applies combined stresses to precipitate these failure modes and then detects them via a change in the operating margins or a hard failure.

Many engineers have expressed the concern that HASS can damage products and may actually cause field failures. However, proper implementation of the HASS Proof of Screen provides a clear understanding of screen effectiveness and ensures that there is no effect on product life or performance. Proof of Screen includes repetitive application of the HASS stress profile to a small population of production samples. HASS is only implemented after it has been proven that all “good” samples can withstand from 20 to 50 repeated HASS cycles without damage or wear.
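The Proof of Screen logic described above can be sketched as follows. The function and callback names are hypothetical, and `run_hass_cycle` stands in for whatever chamber control and functional monitoring a lab actually uses.

```python
def proof_of_screen(sample_ids, run_hass_cycle, cycles=20):
    """Apply the HASS profile repeatedly to a small sample population.

    `run_hass_cycle(unit, n)` is assumed to return True if the unit
    is still good after its n-th cycle. Returns (accepted, failures),
    where failures maps each failed unit to the cycle at which it failed.
    """
    failures = {}
    for unit in sample_ids:
        for n in range(1, cycles + 1):
            if not run_hass_cycle(unit, n):
                failures[unit] = n     # record the first failing cycle
                break
    return len(failures) == 0, failures
```

The screen would be accepted for production only if every known-good sample survives the chosen number of cycles, which the text puts at 20 to 50.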

Conclusion:

HALT and HASS chambers are expensive, but the cost is minimal compared to what many companies pay in direct costs and lost business if failures occur in the field. Furthermore, most companies approach HALT and HASS carefully, in stages. The first stage might consist of using HALT on a single new product and conducting the tests in an established commercial test lab. As more products follow and confidence increases, it may become cost-effective to purchase a chamber. After additional time and solid experience with HALT, many companies are making the move to HASS.

When designing with commercial-grade components, there is always a valid concern about potential degradation of product life and performance. With adequate training, the right equipment and a clear commitment from the organization, the powerful tools of HALT and HASS can very effectively reduce those risks.

 

 

 

 

 

 




Thursday, December 27, 2007

Analysis of Failures of Solid State Interlocking Systems

 

  1. Lack of domain knowledge in signalling and traditional Route Relay Interlocking (RRI) systems creates a technological gap between the software programmers and the domain consultants. This leads to errors in the software, which might lead to unsafe failures of the system.
  2. Employing a distributed architecture increases the complexity of the system, making it difficult to validate, verify and maintain, and thus leading to very long repair times.
  3. Extending the working scope of the interlocking system to monitoring and other non-interlocking functions leads to degraded performance of the system.
  4. Employing non-formal interlocking principles instead of traditional RRI principles leads to software complexity. For example, with the geographical method, every system installed for a new yard needs validation, which is not practicable.
  5. Since the software and hardware are so complex, a complete test of the system is not possible, and most faults are revealed at the field installation stage or during normal working of the system in the field.
  6. The software has to be changed for every yard. The software structure should therefore be generic, but a generic form is seldom seen, and this is the stage at which errors creep in.
  7. Because of the lack of standardization in railway working principles and core interlocking principles, software developers are forced to change the software for every yard in different railway zones, and this is when errors creep into the software.

 

For the reasons given above, solid state interlocking systems have failed to create the necessary confidence among railway operators and have therefore become unpopular.

 


Examined broadly, the reasons for failure and for the lack of reliability and maintainability that are forced on the designers are as follows:

 

  1. Lack of standardization of interlocking principles: every railway zone has its own set of rules and principles, which conflict with those of other railways. This makes life difficult for the developers because they have to change their system settings and software accordingly.
  2. There is no standard book or reference describing the core interlocking principles, since these rules are known only to the people working in this domain.
  3. Increasing software complexity makes testing difficult. Since most interlocking systems are sequential machines, they are error prone and very difficult to test.

 




Thursday, December 20, 2007

ALARP and software

At one level the ALARP principle seems like common sense, and would be expected to be broadly applicable. However, people have found difficulty in applying it to software (and in some other circumstances, e.g. ordnance and explosives). Why should this be?
There seem to be three related issues which make it difficult to apply the ALARP principle to software:
1. Most of the techniques we are interested in, e.g. rigorous testing, provide information about risk; they do not reduce risk (in this sense ALARP simply doesn’t apply);
2. Even if we assume we will remove the faults we find by carrying out some analysis, we cannot predict in advance what these faults will be. We therefore cannot know the benefit of applying the technique in advance, so there is no prior basis for judging whether or not applying the technique complies with ALARP;
3. Less obviously, there is an implicit assumption behind the ALARP principle that determining risk is cheap, but that reducing risk is expensive. This is not the case for software – finding the problems through testing, etc. is the expensive part of the process, and writing the code is only 5-10% of the cost.

Research Work in Solid state Interlocking

Research in the field of solid state interlocking systems dates back to the 1970s; the first prototypes came into existence in Britain in the early 1980s. In India, research in this area was started in the mid-1980s by IIT Delhi, in a project funded by RDSO, Lucknow. Prof. Vinod Chandra and Dr. M. Verma were involved in this project. They developed a prototype in which a different Safety Integrity Level was allocated to each module, which avoids the complexity of validating a whole system that has only one safety integrity level. The prototype was developed using the 2/2 hardware redundancy method.

Union Switch and Signal Company developed a system with a single processor using the diverse software method. The total system had a hot standby system that would take over if one system failed. Westrace Inc, Australia has implemented the distributed architecture of SSI, where the control is distributed over the entire yard as opposed to centralized systems.
Michele Banci of the Formal Methods and Tools Group, ISTI - CNR, Pisa, Italy has worked on statecharts and graphical methods for implementing interlocking.
Dejan Lutovac and Tatjana Lutovac of RMIT University, Melbourne, Australia have worked on generalizing the software and working towards a Universal Interlocking System.
Peter Wigger of the Institute for Software, Electronics, Railroad Technology (ISEB), TÜV InterTraffic GmbH, Berlin-Brandenburg Group has worked on allocating Safety Integrity Levels (SIL) in railway applications.
Radek Dobias and Hana Kubatova of the Department of Computer Science and Engineering, Czech Technical University Prague have worked on the use of FPGAs in safety-critical railway applications.
Kotaro Shimamura and Shin’ichiro Yamaguchi of Hitachi Research Laboratory, Hitachi Ltd have worked on fail-safe hardware using dual synthesizable processor cores, which gives redundancy at the component level itself.
Tomoji Kishi and Natsuko Noda of Software Design Laboratories, NEC Corporation have worked on software architecture, architecture conformance, non-functional properties, design methods and layered systems for interlocking software.

Conclusion:
It is suggested that in the complex field of railway signalling, where safety, availability and maintainability are the prime issues, the railway operators must be taken into confidence, and the method applied to design these systems should be reliable and validatable so that it creates confidence in the railway operators. The paper suggests a method to be employed for the design and development of SSIs for safe and reliable operation.

Types of Software Interlocking

2. Types of Methods applied in software for Interlocking

2.1 Geographical Method:
In the geographical method, the input to the SSI is given as the positions of the signals, points, track circuits and slots. The interlocking is implemented based on generic rules, for example that no part of the track may be shared by two routes at the same time and that conflicting routes must not be set at the same time. This type of implementation requires extensive knowledge of the yard elements and the interconnections between them. In this method the software does not have a one-to-one relationship to the relay circuits used for RRI and is very difficult to validate, so the method has failed to create the necessary confidence in the railway operators.
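One of the generic rules, that two routes may not share any part of the track at the same time, can be sketched as a set operation. The yard layout, route names and function names below are invented for illustration and do not come from any real installation.

```python
# Hypothetical yard: each route is described by the track sections it occupies
ROUTES = {
    "R1": {"T1", "T2", "T3"},
    "R2": {"T3", "T4"},
    "R3": {"T5", "T6"},
}


def conflicts(route_a, route_b, routes=ROUTES):
    """Two routes conflict if they share at least one track section."""
    return not routes[route_a].isdisjoint(routes[route_b])


def can_set(route, routes_already_set, routes=ROUTES):
    """A route may be set only if it conflicts with no route already set."""
    return all(not conflicts(route, other, routes) for other in routes_already_set)


print(can_set("R3", {"R1"}))   # R3 shares no section with R1
print(can_set("R2", {"R1"}))   # R2 and R1 both use T3
```

Because the rule is derived from the layout data rather than copied from a relay circuit, there is no circuit diagram to compare it against, which is the validation difficulty the text describes.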



2.2 Boolean Equation Method:
The Boolean equation method is an implementation of the traditional relay interlocking principles. In this method the relay circuits are implemented as Boolean equations, so there is a one-to-one relationship between the relay circuits and the software variables. Because of this one-to-one relationship between the software and the RRI relay circuits, the railway operator can easily validate the software entries made, and the method gives the operator sufficient confidence.
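A relay circuit written as a Boolean equation might look like the sketch below. The relay name and its four inputs are hypothetical and are not taken from any actual RRI circuit; a series contact is modelled as AND and a back (normally closed) contact as NOT.

```python
def signal_relay_picked_up(route_locked, track_clear, points_detected,
                           conflicting_route_set):
    """Hypothetical signal control relay expressed as one Boolean equation.

    Mirrors a series circuit: three front contacts in series with
    one back contact of the conflicting-route relay.
    """
    return (route_locked and track_clear and points_detected
            and not conflicting_route_set)


# The relay picks up only when every series contact is made
print(signal_relay_picked_up(True, True, True, False))
print(signal_relay_picked_up(True, False, True, False))
```

Each term in the equation corresponds to one contact in the circuit, so someone familiar with the relay drawing can check the equation term by term.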

Failsafe and Fault Tolerant Systems

Embedded System:

An embedded system is a combination of hardware and software that performs a specific job. A general-purpose computing system such as a PC, even though it has a good amount of hardware and software, is not an embedded system because it does not perform one specific function.

Real Time System: A real-time system is an embedded system that operates on current data and not on saved data.

Hard Real Time System or Mission Critical System: A real-time computer system must react to inputs from the controlled object and from the operator. The instant at which a result must be produced is called a deadline. If missing a firm deadline could result in a catastrophe, the deadline is called hard. A real-time computer system that must meet at least one hard deadline is called a hard real-time computer system or a safety-critical real-time computer system.

Railway Interlocking System: A railway interlocking system controls the traffic in a railway station, and between adjacent stations. The control includes train routes, shunting moves and the movements of all other railway vehicles in accordance with railway rules, regulations and technological processes required for the operation of the railway station.

Interlocking Logic: A term used for the logical relationships between physical entities in the railway yard such as points, signals, track circuits, and so on. In SSI, this is programmed in the Software; in relay-based interlocking this is hardwired into the relay circuitry, and in ground-frame interlocking it is manifest in the mechanical linkages between physical components.

Ground-frame interlocking: An interlocking system built using mechanical linkages between levers (physical entities) is called a ground-frame interlocking system.

Route Relay Interlocking System (RRI): An interlocking system built completely using electromechanical relays is called a Route Relay Interlocking system.

Solid State Interlocking System (SSI): An interlocking system built using electronics in place of traditional mechanical levers and electromechanical relays is called a Solid State Interlocking system.

Reliability: The reliability can be defined as the ability of an item to perform a required function under stated conditions for a stated period of time.

Redundancy: The existence of more than one means of accomplishing a given function. Each means of accomplishing the function need not be necessarily identical.

Hardware (Software) Diversity: Two or more different versions of hardware (software) working in a system to achieve the same result.

Failure: The termination of the ability of an item to perform a required function.

Maintainability: The ability of an item, under stated conditions of use, to be retained in, or restored to, a state in which it can perform its required function, when maintenance is performed under stated conditions and using prescribed procedures and resources.

Availability: The ability of an item (under the combined aspects of its reliability, maintainability, and maintenance support) to perform its required function over a stated period of time.
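A common way to put a number on availability is the steady-state ratio of mean time between failures (MTBF) to the sum of MTBF and mean time to repair (MTTR). This formula is a widely used approximation and is not part of the definitions above; the function name and the figures in the usage line are illustrative.

```python
def steady_state_availability(mtbf_hours, mttr_hours):
    """Steady-state availability A = MTBF / (MTBF + MTTR).

    MTBF reflects reliability; MTTR reflects maintainability and
    maintenance support. Both are assumed to be in the same unit.
    """
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Illustrative figures only: 999 h between failures, 1 h to repair
print(steady_state_availability(999, 1))
```

The ratio shows how the three aspects combine: availability improves either by failing less often or by being repaired faster.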

Software/Hardware Integration

Software/Hardware Integration is the process of testing the interfaces between the hardware and the associated software, and also the functional behaviour of the system. The input documents are the Software Requirement Specification, the Hardware Requirement Specification and the detailed design documents.