
Understanding and Coping with Hardware and Software Failures in a Very Large Trigger Farm

Document #:
BTeV-doc-1857-v1
Document type:
Proceeding
Submitted by:
Jim Kowalkowski
Updated by:
Jim Kowalkowski
Document Created:
13 Jun 2003, 16:35
Contents Revised:
13 Jun 2003, 16:35
DB Info Revised:
13 Jun 2003, 16:35
Accessible by:
  • Public document
Abstract:
When thousands of processors perform event filtering on a trigger farm, a large number of failures within the software and hardware systems is likely. BTeV, a proton/antiproton collider experiment at Fermi National Accelerator Laboratory, has designed a trigger that includes several thousand processors. If fault conditions are not handled properly, this trigger system could fail often enough to reduce its effectiveness. The RTES (Real Time Embedded Systems) collaboration is a group of physicists, engineers, and computer scientists working to address the problem of reliability in large-scale clusters with real-time constraints, such as this one. The resulting infrastructure must be highly scalable, verifiable, extensible by users, and dynamically changeable.
Keywords:
RTES CHEP2003
Associated with Events:
held from 24 Mar 2003 to 28 Mar 2003 in La Jolla, CA
