Data Distribution Strategies with Fault-Tolerant Components OMG's Workshop on Distributed Object Computing for Real-time and Embedded Systems Richard T. McClain [email protected] Joseph N. Licameli [email protected] Lockheed Martin Naval Electronic & Surveillance Systems Syracuse, NY Advanced Systems Architecture Session Outline • • • • Intro Component Design Challenge Fault Tolerant Strategies – Master/Standby Component – Master/Standby Data Distribution – Master/Standby Storage with Distribution Capability • Lessons Learned • Wrap-Up • Q&A Data Distribution Strategies for Fault Tolerant Components RTM/JNL 2 Intro • AN/SQQ-89 is an undersea warfare (USW) system for Aegis destroyers • Multiple variants exist • COTS Incorporation – Development began in 1990’s – Design Constraints • • • • • Open System Architecture Distributed, real-time and embedded (DRE) VxWorks & HP-UX C++, Ada-95 and Java Combinations of legacy and new software OMG CORBA chosen as middleware for communications among different software entities Data Distribution Strategies for Fault Tolerant Components RTM/JNL 3 AN/SQQ-89 USW Combat System • • Full COTS Implementation Network centric utilizing Internet Protocol suite OMG CORBA as primary communication protocol 2 Million Software Lines of Code - C++, Ada-95 and Java. Unit 861 - Signal & Data Processor A11A3 - CSSFS (Pri.) /UCFS (Pri.) A21A3 - LSFS to ALIS Unit 866 - Signal & Data Processor A11A3 -CSSFS (Pri.)/TRAFS A21A3 IFCP WS App. AEGIS LSPCP CMCP DCCP FCCP (M) WS App. 834 DMCP (M) IPCP (M) RDPCP(M) TRAFS WS App. 830 DMCP (M) A1A1 HP744 A1A6 HP744 A1A11 PPC A1A1 HP744 A1A6 HP744 A1A11 HP744 A2A1 HP744 A2A6 HP744 A2A11 HP744 A1A1 HP744 A1A11 PPC(2) A2A1 HP744 A2A6 HP744 SRMCP IPCP (M) A2A11 HP744 WS App. 831 A1A1 HP744 WS App. TRAFS RPMICP A1A6 HP744 A1A11 HP744 Unit 804 - System Equipment Rack S-L Recor der RMS Recorder Unit 863 A11A3 HRBCP A2A1 PPC VLS TSP AN/SRQ-4 Color/BW Printer Unit 861 15(16)/16 AN/ARR-75 HRBCP A2A14 PPC ATP FODMS XBT Unit 866 18(22)/24 SPPFS App. Host HP J5600 ATM Switch ATM Switch to ALIS Unit 862 A11A3 XMCP A2A1 PPC A2A1 HP744 Spare A2A6 HP744 OBTHCP A2A11 PPC TCW A2A1 HP744 OBTLCP A1A1 HP744 HTCCP Unit 864 19(22)/24 HTCCP RAACP ATM Switch X-Server Unit 831 - AN/UY Q-70 A1a HP744 X-Server ATM Switch Unit 832 - AN/UY Q-70 A1a HP744 VLS TSP ATP FODMS X-Server Unit 833 - AN/UY Q-70 A1a HP744 A2A11 PPC A1A6 HP744 RCSCP A3A4 PPC A1a HP744 to RMS Unit 865 16/16 OBTFS ASFS OBTCCP A3A1 PPC Unit 830 - AN/UY Q-70 Unit 860 A11A3 A21A3 • • X-Server Unit 834 - AN/UY Q-70 A1A11 PPC HSPCP A2A1 HP744 TMCP A2A6 HP744 DMCP (S) A11A3 - CSSFS (Sec.) / ASFS (Pri.) Unit 865 - Signal & Data Processor A2A11 HP744 SRMCP IPCP (S) A2A1 HP744 WS App. 833 A21A3 A1A1 HP744 VDPCP A1A6 HP744 RDMCP A1A11 HP744 RDPCP(S) A2A1 HP744 FCCP (S) A2A6 HP744 DMCP (S) A11A3 -CSSFS (Sec.)/UCFS(Sec.) Unit 864 - Signal & Data Processor A2A11 HP744 IPCP (S) A1A1 HP744 CMCP A21A3 A1A6 HP744 DCCP A1A11 HP744 WS App. 832 A1a HP744 X-Server Unit 540 - Dual Display A1 HP744 CADRT-L A2 HP744 CADRT-S 8 Ma y 2002 Data Distribution Strategies for Fault Tolerant Components RTM/JNL 4 Real-Time - Hard vs Soft • Hard Real-Time – Used Motorola PPC running VxWorks • Mercury Quad PPC for Digital Signal Processing (DSP) – Processing localized to node; avoided network interference – Inter-process communications via UDP • Connectionless transmission and low latency • Used CORBA as a means for senders to setup channels to receivers • Soft Real-Time – Used HP744 running HP-UX 11.0 – Used Motorola PPC running VxWorks • Non real-time string – Inter-process communications via CORBA Data Distribution Strategies for Fault Tolerant Components RTM/JNL 5 Hard Real-Time Processing Nodes Unit 861 - Signal & Data Processor A11A3 - CSSFS (Pri.) /UCFS (Pri.) A21A3 - LSFS to ALIS Unit 866 - Signal & Data Processor A11A3 -CSSFS (Pri.)/TRAFS A21A3 IFCP WS App. AEGIS LSPCP CMCP DCCP FCCP (M) WS App. 834 DMCP (M) IPCP (M) RDPCP(M) TRAFS WS App. 830 DMCP (M) A1A1 HP744 A1A6 HP744 A1A11 PPC A1A1 HP744 A1A6 HP744 A1A11 HP744 A2A1 HP744 A2A6 HP744 A2A11 HP744 A1A1 HP744 A1A11 PPC(2) A2A1 HP744 A2A6 HP744 SRMCP IPCP (M) A2A11 HP744 WS App. 831 WS App. TRAFS RPMICP A1A1 HP744 A1A6 HP744 A1A11 HP744 Unit 804 - System Equipment Rack S-L Recorder RMS Recorder Unit 863 A11A3 HRBCP A2A1 PPC VLS TSP AN/SRQ-4 Color/BW Printer Unit 861 15(16)/16 AN/ARR-75 HRBCP A2A14 PPC ATP FODMS XBT Unit 866 18(22)/24 SPPFS App. Host HP J5600 ATM Switch ATM Switch to ALIS Unit 862 A11A3 XMCP A2A1 PPC Unit 860 Unit 865 16/16 A11A3 A21A3 OBTFS ASFS OBTCCP Spare A2A6 HP744 OBTHCP A2A11 PPC TCW A2A1 HP744 OBTLCP A2A11 PPC RCSCP A3A4 PPC RAACP Unit 830 - AN/UYQ-70 A1a HP744 to RMS A2A1 HP744 A3A1 PPC Unit 864 19(22)/24 ATM Switch X-Server Unit 831 - AN/UYQ-70 A1a HP744 X-Server ATM Switch Unit 832 - AN/UYQ-70 A1a HP744 VLS TSP ATP FODMS X-Server Unit 833 - AN/UYQ-70 A1a HP744 X-Server Unit 834 - AN/UYQ-70 A1A1 HP744 A1A6 HP744 A1A11 PPC A2A1 HP744 A2A6 HP744 A2A11 HP744 A2A1 HP744 A1A1 HP744 A1A6 HP744 HTCCP HTCCP HSPCP TMCP DMCP (S) SRMCP IPCP (S) WS App. 833 VDPCP RDMCP A11A3 - CSSFS (Sec.) / ASFS (Pri.) Strategies for Fault UnitData 865Distribution - Signal & Data Processor Tolerant Components A21A3 A1A11 HP744 A2A1 HP744 A2A6 HP744 A2A11 HP744 A1A1 HP744 A1A6 HP744 A1A11 HP744 RDPCP(S) FCCP (S) DMCP (S) IPCP (S) CMCP DCCP WS App. 832 A11A3 -CSSFS (Sec.)/UCFS(Sec.) Unit 864 - Signal & Data Processor A21A3 A1a HP744 X-Server Unit 540 - Dual Display A1 HP744 CADRT-L A2 HP744 CADRT-S RTM/JNL 6 Applied Elements of CORBA • Naming Service – Developed Fault Tolerant Scheme • Static Invocations – Did not utilize Dynamic CORBA • Basic Object Adapter (BOA) – POA not yet adopted • Publish-Subscribe Semantics for Data Distribution – Event Service Not Readily Available • Immature Implementations • Some products relied on Multicast Data Distribution Strategies for Fault Tolerant Components RTM/JNL 7 Interoperability HP-UX 11.0 Java Windows-NT HP Java Orb Java Ada-95 SUN Java Orb ORBexpress RT IIOP C++ Visibroker C++ Visibroker VxWorks 5.3.1 Data Distribution Strategies for Fault Tolerant Components RTM/JNL 8 Component Development • Why Components? – IDL does not define all of the requirements placed upon an interface. • Fault Tolerant Requirements • Communication Maintenance • Failure Events – Multiple IDL users leads to multiple solutions to meet requirements • Leads to multiple problems that are unique • Examples of Common System Services (CSS) Components with Fault Tolerant Requirements – System Resource Management – Ownship Data Management – Database Management Components allow for a common mechanism to meet system requirements Data Distribution Strategies for Fault Tolerant Components RTM/JNL 9 CSS Components Unit 861 - Signal & Data Processor A11A3 - CSSFS (Pri.) /UCFS (Pri.) A21A3 - LSFS to ALIS Unit 866 - Signal & Data Processor A11A3 -CSSFS (Pri.)/TRAFS A21A3 IFCP WS App. AEGIS LSPCP CMCP DCCP FCCP (M) WS App. 834 DMCP (M) IPCP (M) RDPCP(M) TRAFS WS App. 830 DMCP (M) A1A1 HP744 A1A6 HP744 A1A11 PPC A1A1 HP744 A1A6 HP744 A1A11 HP744 A2A1 HP744 A2A6 HP744 A2A11 HP744 A1A1 HP744 A1A11 PPC(2) A2A1 HP744 A2A6 HP744 SRMCP IPCP (M) A2A11 HP744 WS App. 831 WS App. TRAFS RPMICP A1A1 HP744 A1A6 HP744 A1A11 HP744 Unit 804 - System Equipment Rack S-L Recorder RMS Recorder Unit 863 A11A3 HRBCP A2A1 PPC VLS TSP AN/SRQ-4 Color/BW Printer Unit 861 15(16)/16 AN/ARR-75 HRBCP A2A14 PPC ATP FODMS XBT Unit 866 18(22)/24 SPPFS App. Host HP J5600 ATM Switch ATM Switch to ALIS Unit 862 A11A3 XMCP A2A1 PPC Unit 860 Unit 865 16/16 A11A3 A21A3 OBTFS ASFS OBTCCP Spare A2A6 HP744 OBTHCP A2A11 PPC TCW A2A1 HP744 OBTLCP A2A11 PPC RCSCP A3A4 PPC RAACP Unit 830 - AN/UYQ-70 A1a HP744 to RMS A2A1 HP744 A3A1 PPC Unit 864 19(22)/24 ATM Switch X-Server Unit 831 - AN/UYQ-70 A1a HP744 X-Server ATM Switch Unit 832 - AN/UYQ-70 A1a HP744 VLS TSP ATP FODMS X-Server Unit 833 - AN/UYQ-70 A1a HP744 X-Server Unit 834 - AN/UYQ-70 A1A1 HP744 A1A6 HP744 A1A11 PPC A2A1 HP744 A2A6 HP744 A2A11 HP744 A2A1 HP744 A1A1 HP744 A1A6 HP744 HTCCP HTCCP HSPCP TMCP DMCP (S) SRMCP IPCP (S) WS App. 833 VDPCP RDMCP A11A3 - CSSFS (Sec.) / ASFS (Pri.) Strategies for Fault UnitData 865Distribution - Signal & Data Processor Tolerant Components A21A3 A1A11 HP744 A2A1 HP744 A2A6 HP744 A2A11 HP744 A1A1 HP744 A1A6 HP744 A1A11 HP744 RDPCP(S) FCCP (S) DMCP (S) IPCP (S) CMCP DCCP WS App. 832 A11A3 -CSSFS (Sec.)/UCFS(Sec.) Unit 864 - Signal & Data Processor A21A3 A1a HP744 X-Server Unit 540 - Dual Display A1 HP744 CADRT-L A2 HP744 CADRT-S RTM/JNL 10 CSS Component Design • CSS consists of an interface layer and a realization layer • Interface layer provides a set of services commonly required by tactical applications in their native language • For example: Alerts, Time, Ownship and Database • Realization layer contains a set of executables that collaborate to provide an implementation to the services offered at the interface layer • For example: Ownship Management provides the implementation to the Ownship interface. Tactical Application Data Distribution Strategies for Fault Tolerant Components API CSS Interface Layer CORBA CSS Realization Layer RTM/JNL 11 Fault Tolerant Challenge & Strategies • Challenge – Provide a common strategy for meeting systemic fault tolerant requirements – Isolate the strategies from the applications utilizing the fault tolerant components – Handle different levels of fault tolerant quality of service • Strategies – Master/Standby • System Resource Management Component – Master/Standby Data Distribution • Ownship Management Component – Master/Standby Storage with Distribution Capability • Database Management Component Data Distribution Strategies for Fault Tolerant Components RTM/JNL 12 Master/Standby – System Resource Management Cont rol User Application Interface St C on tr o l at us Secondary ... Implementation Framework Interface Implementation Framework Component Standby User Application C on tr o l us at Arbitration Interface t r ol C on Primary St Component Master Implementation Framework connected to Primary Component running as Master • User Application Implementation Framework t us St a Primary Interface Interface Implementation Framework ... User Application St at us Implementation Framework Interface User Application Implementation Framework User Application St a tus Co ntr ol St atus Cont rol Component Standby Arbitration Component Master Secondary Implementation Framework connected to Secondary Component after a transition to Master Mechanism – Interface Layer maintains connection to one component site (Master) – Only master site publishes objects into the Naming Service – Upon failure standby site overwrites references in the Naming Service • Interface Layer reestablishes connection to new master site – Arbitration Exists between Primary and Secondary Sites • Suitability – Bi-directional Data Flow – Temporary loss of data is allowed during transition – Data Exchanged – Posting Alerts & System State Updates Data Distribution Strategies for Fault Tolerant Components RTM/JNL 13 Master/Standby Data Distribution – Ownship Management Data Distribution Standby da Arbitration ta ... User Application Interface Secondary User Application Implementation Framework ta da Data Distribution Standby ta Arbitration Interface a dat Primary da Data Distribution Master Implementation Framework connected to Primary & Secondary Component. Data distributed from Master only. • User Application Implementation Framework Interface Implementation Framework User Application Primary Interface Interface Implementation Framework ... data Implementation Framework Interface User Application Implementation Framework User Application Data Distribution Master data Secondary Implementation Framework connected to Primary & Secondary Component. Data distributed from Master only. Mechanism – Interface Layer maintains connection to both sites – Publish-Subscribe pattern utilized – Arbitration exists between sites • Only master site distributes data – Hides multiple distribution sites from applications and provides uninterrupted flow of data • Suitability – Unidirectional Data Flow (Pure Data Sources) – Data Exchanged – Ship Positional Data Data Distribution Strategies for Fault Tolerant Components RTM/JNL 14 Master/Standby Storage with Distribution Capability – Database Management Primary User Application Data or e St User Application Interface at a D Implementation Framework connected to Primary & Secondary Component. Bi-directional interface with Master only. St o re Component Standby Arbitration Da ta ... Component Standby Secondary at a Implementation Framework Interface • Implementation Framework User Application D e or Arbitration User Application Primary St a D at Interface re St o Component Master Implementation Framework Interface Implementation Framework ... Interface St ore Implementation Framework Interface User Application Implementation Framework User Application St ore Data Component Master Secondary Implementation Framework connected to Secondary & Primary Component. During transition, framework updated to communicate with new Master. Mechanism – Interface Layer maintains connection to both sites – Arbitration exists between sites • Only master site distributes data – Publish-Subscribe pattern • Applications only allowed to send and retrieve from master site • Standby site notifies Interface Layer when transitioning to master • – Hides multiple storage/distribution sites from applications and provides uninterrupted flow of data Suitability – Bi-directional Data Flow where information is stored, retrieved and distributed – Loss of Data Not Tolerated – Data Exchanged – System Information, Persistent Information, … Data Distribution Strategies for Fault Tolerant Components RTM/JNL 15 Lessons Learned • Implementation of Publish-Subscribe patterns – Obtaining a Key when Registering with a Data Server • Subscribers need to pass a unique key to the Data Servers • Data Servers providing the key to the subscriber can cause performance issues – Clean Up of Failed Subscriber References • Destroy Failed References in a separate thread • QoS of underlying TCP socket closure can block thread of execution – Data Servers Provide Tolerance on Subscriber Reference Failures • Intermittent failures can cause the break down of otherwise valid communication channel • Do not generically process exceptions Data Distribution Strategies for Fault Tolerant Components RTM/JNL 16 Lessons Learned -- Observations with Subscription Failures • CORBA Setup – Oneway invocations for distributing subscription data – Original Assumption – Oneway Invocations will not block • Observations on TCP socket in flow control – Oneway invocations blocked thread – Eventually No Response exception thrown • Observations on Kernel Crashes and TCP sockets – Mainly Observed with VxWorks nodes which needed a reset to restart applications – Oneway invocations blocked thread – Eventually No Response exception thrown • Summary – One subscriber effected distribution to all other subscribers • Solution – Distribution with Active Objects Data Distribution Strategies for Fault Tolerant Components RTM/JNL 17 Lessons Learned -- Distribution with Active Objects Subscriber List No Failures Scenario Subscriber Thread Subscriber Distribute To Subscriber Subscriber Thread Trigger Data Distribution Subscriber List Failure Scenario Distribute To Subscriber Thread Main Distribution Thread Mark For Clean Up Distribute To Subscriber Subscriber Thread Subscriber Thread Subscriber Distribute To Subscriber Subscriber Thread Thread Distribute To Subscriber Subscriber Distribute To Subscriber Data Distribution Strategies for Fault Tolerant Components Thread Distribute To Subscriber RTM/JNL 18 Clean Up Failed Subscriber Lessons Learned • Real-Time Implications – Handling Failures • Minimize effect of a failed subscriber on the data flow to other system subscribers • Build in a tolerance for intermittent failures on subscribers – Prevents time consuming recovery actions – Performance • Simplify the subscription process to minimize the processing involved with removal and additions to the distribution list. Data Distribution Strategies for Fault Tolerant Components RTM/JNL 19 Wrap-Up • IDL alone is not sufficient for defining all of the requirements of an interface – Fault tolerant requirements are not apparent from the IDL • Component Development provides a common mechanism for meeting these interface requirements • Isolate communication failure events from the main thread of execution – Do not allow the handling of a failure on a subscriber reference to interrupt the flow of data to other system subscribers • SQQ-89 successfully deployed real-time embedded system that used CORBA as the system middleware Data Distribution Strategies for Fault Tolerant Components RTM/JNL 20 Questions & Answers Data Distribution Strategies for Fault Tolerant Components RTM/JNL 21
© Copyright 2026 Paperzz