RTES participation in Super Computer 2003

These are the initial ideas for what we want to do for Super Computer 2003.

Overview

Where do we fit in? We will use one panel of the central kiosk for a fault handling demonstration and an explanation of the project, its purpose, and its benefits. The fault handling demonstration will show quick detection of errors or problems, some form of problem analysis, and corrective actions being carried out.

What is available for us to use? It sounds like we have a 2'x3' area that will contain two 19" LCD monitors. In front of the monitors will be a small bar-like table with stools.

What will we bring? We want to bring a 7-slot VME crate containing DSPs working together to process events like the ones that will be generated by a Fermilab collider/detector experiment. We also need two PCs, one running Windows and the other running Linux.

The Context

The BTeV first-level event filter will be the setting for the primary demonstration. This demonstration can be done using hardware that we will bring to the conference. This particular trigger is a good setting because each crossing will be processed by a CPU running a physics filter algorithm. The actual trigger will require about 2500 processors. The demonstration will contain only a small number of processors (fewer than 10) acting as event processors in order to show fault handling.

If possible, we would also like to demonstrate fault handling in the third-level trigger using a remote farm of Linux PCs. We have not discussed this in any detail and do not know at this point whether it is possible. The real combined level 2/3 trigger processing will require an additional 2500 CPUs. It might be possible to show fault handling and management at level 2/3 in a demonstration using 100 or so nodes from the Fermilab offline reconstruction farms.

Presentation

As a conference goer enters the area around this panel, he or she will see introductory information. In addition, viewers will see the crate with the processors running the demonstration. Using the screens on the kiosk panel, conference goers will experience the following from the demonstration (listed chronologically):
  1. A screen showing a normally running system, with web pages on the other screen that ask the user whether he or she wants to know more about any of the things shown.
  2. A planned sequence of failures triggered by interacting with the control windows. Along with this part, the web pages will let the user find out more about the changes observed on the monitoring display.
  3. A path leading to a place where cables can be unplugged from the running system in order to observe the effects.
  4. A chance to bring up the tool that configured the system. This will lead to an explanation of the various components of the system and how they relate to the BTeV experiment.
  5. A chance to go through some presentation material on the technology that is involved in the demonstration and where it is headed in the future.
This seems like a backwards way to present the material, but I think that it is a very good way to generate interest quickly. Slide material will be available to explain the technology that the group is exploring and the complications of developing such tools.

Inside the Demonstration

The demonstration is geared to show a handful of scenarios that are likely to be common when the real system is operating. Below is a list of the particular failures we expect to generate, the monitoring information necessary to discover the problem, and the associated responses or actions.
Problem                                  | Response                                    | InfoNeeded
increased data rate                      | threshold/prescale change, disable services | ?
decreased data rate                      | threshold/prescale change, enable services  | ?
broken communication link to a DSP       | take DSP out of service                     | ?
death of a DSP                           | take unit out of service                    | ?
trigger filter application hung          | restart application                         | ?
death of manager process on the host PCs | restart application                         | ?
increased processing time per event      | log problem                                 | ?
input queue high water mark reached      | threshold/prescale change, disable services | ?
unable to keep up                        | switch to hot spare                         | ?
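
One way to read the list of problems and responses above is as a simple lookup from a detected problem to its planned actions. The following is only a rough sketch of that idea in Python, not a description of how the demonstration software is or will be written. The names Fault, RESPONSES, and respond are invented for this note, and the action strings are just labels copied from the list, not real commands. The InfoNeeded column is left out because it has not been filled in yet.

```python
from enum import Enum, auto


class Fault(Enum):
    """Problems the demonstration expects to generate (hypothetical names)."""
    INCREASED_DATA_RATE = auto()
    DECREASED_DATA_RATE = auto()
    BROKEN_DSP_LINK = auto()
    DSP_DEATH = auto()
    FILTER_APP_HUNG = auto()
    MANAGER_PROCESS_DEATH = auto()
    INCREASED_PROCESSING_TIME = auto()
    INPUT_QUEUE_HIGH_WATER = auto()
    UNABLE_TO_KEEP_UP = auto()


# Planned responses, copied from the problem/response list.
# These are descriptive labels only; nothing here talks to hardware.
RESPONSES = {
    Fault.INCREASED_DATA_RATE: ["threshold/prescale change", "disable services"],
    Fault.DECREASED_DATA_RATE: ["threshold/prescale change", "enable services"],
    Fault.BROKEN_DSP_LINK: ["take DSP out of service"],
    Fault.DSP_DEATH: ["take unit out of service"],
    Fault.FILTER_APP_HUNG: ["restart application"],
    Fault.MANAGER_PROCESS_DEATH: ["restart application"],
    Fault.INCREASED_PROCESSING_TIME: ["log problem"],
    Fault.INPUT_QUEUE_HIGH_WATER: ["threshold/prescale change", "disable services"],
    Fault.UNABLE_TO_KEEP_UP: ["switch to hot spare"],
}


def respond(fault):
    """Return a copy of the planned actions for the given fault."""
    return list(RESPONSES[fault])


if __name__ == "__main__":
    # Print the full mapping, one problem per line.
    for fault in Fault:
        print(fault.name, "->", ", ".join(respond(fault)))
```

Keeping the mapping in one table like this would make it easy to add the monitoring information for each problem once the InfoNeeded column is worked out.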

Sample Screen Ideas

Other Information

List of the information needed to make decisions: