Data Distribution Strategies with Fault-Tolerant Components

Data Distribution Strategies with
Fault-Tolerant Components
OMG's Workshop on Distributed Object Computing for
Real-time and Embedded Systems
Richard T. McClain
[email protected]
Joseph N. Licameli
[email protected]
Lockheed Martin
Naval Electronic & Surveillance Systems
Syracuse, NY
Advanced Systems Architecture
Session Outline
•
•
•
•
Intro
Component Design
Challenge
Fault Tolerant Strategies
– Master/Standby Component
– Master/Standby Data Distribution
– Master/Standby Storage with Distribution Capability
• Lessons Learned
• Wrap-Up
• Q&A
Data Distribution Strategies for Fault
Tolerant Components
RTM/JNL 2
Intro
• AN/SQQ-89 is an undersea warfare (USW) system for
Aegis destroyers
• Multiple variants exist
• COTS Incorporation
– Development began in 1990’s
– Design Constraints
•
•
•
•
•
Open System Architecture
Distributed, real-time and embedded (DRE)
VxWorks & HP-UX
C++, Ada-95 and Java
Combinations of legacy and new software
OMG CORBA chosen as middleware for
communications among different software entities
Data Distribution Strategies for Fault
Tolerant Components
RTM/JNL 3
AN/SQQ-89 USW Combat System
•
•
Full COTS Implementation
Network centric utilizing
Internet Protocol suite
OMG CORBA as primary
communication protocol
2 Million Software Lines of
Code - C++, Ada-95 and
Java.
Unit 861 - Signal & Data Processor
A11A3 - CSSFS (Pri.) /UCFS (Pri.)
A21A3 - LSFS
to
ALIS
Unit 866 - Signal & Data Processor
A11A3 -CSSFS (Pri.)/TRAFS
A21A3
IFCP
WS App.
AEGIS
LSPCP
CMCP
DCCP
FCCP
(M)
WS App.
834
DMCP
(M)
IPCP
(M)
RDPCP(M)
TRAFS
WS App.
830
DMCP
(M)
A1A1
HP744
A1A6
HP744
A1A11
PPC
A1A1
HP744
A1A6
HP744
A1A11
HP744
A2A1
HP744
A2A6
HP744
A2A11
HP744
A1A1
HP744
A1A11
PPC(2)
A2A1
HP744
A2A6
HP744
SRMCP
IPCP
(M)
A2A11
HP744
WS App.
831
A1A1
HP744
WS App.
TRAFS
RPMICP
A1A6
HP744
A1A11
HP744
Unit 804 - System
Equipment Rack
S-L Recor der
RMS Recorder
Unit 863 A11A3
HRBCP
A2A1
PPC
VLS
TSP
AN/SRQ-4
Color/BW Printer
Unit 861
15(16)/16
AN/ARR-75
HRBCP
A2A14
PPC
ATP
FODMS
XBT
Unit 866
18(22)/24
SPPFS App. Host
HP J5600
ATM Switch
ATM Switch
to
ALIS
Unit 862 A11A3
XMCP
A2A1
PPC
A2A1
HP744
Spare
A2A6
HP744
OBTHCP
A2A11
PPC
TCW
A2A1
HP744
OBTLCP
A1A1
HP744
HTCCP
Unit 864
19(22)/24
HTCCP
RAACP
ATM Switch
X-Server
Unit 831 - AN/UY Q-70
A1a
HP744
X-Server
ATM Switch
Unit 832 - AN/UY Q-70
A1a
HP744
VLS
TSP
ATP
FODMS
X-Server
Unit 833 - AN/UY Q-70
A1a
HP744
A2A11
PPC
A1A6
HP744
RCSCP
A3A4
PPC
A1a
HP744
to
RMS
Unit 865
16/16
OBTFS ASFS
OBTCCP
A3A1
PPC
Unit 830 - AN/UY Q-70
Unit 860
A11A3 A21A3
•
•
X-Server
Unit 834 - AN/UY Q-70
A1A11
PPC
HSPCP
A2A1
HP744
TMCP
A2A6
HP744
DMCP
(S)
A11A3 - CSSFS (Sec.) / ASFS (Pri.)
Unit 865 - Signal & Data Processor
A2A11
HP744
SRMCP
IPCP
(S)
A2A1
HP744
WS App.
833
A21A3
A1A1
HP744
VDPCP
A1A6
HP744
RDMCP
A1A11
HP744
RDPCP(S)
A2A1
HP744
FCCP
(S)
A2A6
HP744
DMCP
(S)
A11A3 -CSSFS (Sec.)/UCFS(Sec.)
Unit 864 - Signal & Data Processor
A2A11
HP744
IPCP
(S)
A1A1
HP744
CMCP
A21A3
A1A6
HP744
DCCP
A1A11
HP744
WS App.
832
A1a
HP744
X-Server
Unit 540 - Dual Display
A1
HP744
CADRT-L
A2
HP744
CADRT-S
8 Ma y 2002
Data Distribution Strategies for Fault
Tolerant Components
RTM/JNL 4
Real-Time - Hard vs Soft
• Hard Real-Time
– Used Motorola PPC running VxWorks
• Mercury Quad PPC for Digital Signal Processing (DSP)
– Processing localized to node; avoided network interference
– Inter-process communications via UDP
• Connectionless transmission and low latency
• Used CORBA as a means for senders to setup channels to
receivers
• Soft Real-Time
– Used HP744 running HP-UX 11.0
– Used Motorola PPC running VxWorks
• Non real-time string
– Inter-process communications via CORBA
Data Distribution Strategies for Fault
Tolerant Components
RTM/JNL 5
Hard Real-Time Processing Nodes
Unit 861 - Signal & Data Processor
A11A3 - CSSFS (Pri.) /UCFS (Pri.)
A21A3 - LSFS
to
ALIS
Unit 866 - Signal & Data Processor
A11A3 -CSSFS (Pri.)/TRAFS
A21A3
IFCP
WS App.
AEGIS
LSPCP
CMCP
DCCP
FCCP
(M)
WS App.
834
DMCP
(M)
IPCP
(M)
RDPCP(M)
TRAFS
WS App.
830
DMCP
(M)
A1A1
HP744
A1A6
HP744
A1A11
PPC
A1A1
HP744
A1A6
HP744
A1A11
HP744
A2A1
HP744
A2A6
HP744
A2A11
HP744
A1A1
HP744
A1A11
PPC(2)
A2A1
HP744
A2A6
HP744
SRMCP
IPCP
(M)
A2A11
HP744
WS App.
831
WS App.
TRAFS
RPMICP
A1A1
HP744
A1A6
HP744
A1A11
HP744
Unit 804 - System
Equipment Rack
S-L Recorder
RMS Recorder
Unit 863 A11A3
HRBCP
A2A1
PPC
VLS
TSP
AN/SRQ-4
Color/BW Printer
Unit 861
15(16)/16
AN/ARR-75
HRBCP
A2A14
PPC
ATP
FODMS
XBT
Unit 866
18(22)/24
SPPFS App. Host
HP J5600
ATM Switch
ATM Switch
to
ALIS
Unit 862 A11A3
XMCP
A2A1
PPC
Unit 860
Unit 865
16/16
A11A3 A21A3
OBTFS ASFS
OBTCCP
Spare
A2A6
HP744
OBTHCP
A2A11
PPC
TCW
A2A1
HP744
OBTLCP
A2A11
PPC
RCSCP
A3A4
PPC
RAACP
Unit 830 - AN/UYQ-70
A1a
HP744
to
RMS
A2A1
HP744
A3A1
PPC
Unit 864
19(22)/24
ATM Switch
X-Server
Unit 831 - AN/UYQ-70
A1a
HP744
X-Server
ATM Switch
Unit 832 - AN/UYQ-70
A1a
HP744
VLS
TSP
ATP
FODMS
X-Server
Unit 833 - AN/UYQ-70
A1a
HP744
X-Server
Unit 834 - AN/UYQ-70
A1A1
HP744
A1A6
HP744
A1A11
PPC
A2A1
HP744
A2A6
HP744
A2A11
HP744
A2A1
HP744
A1A1
HP744
A1A6
HP744
HTCCP
HTCCP
HSPCP
TMCP
DMCP
(S)
SRMCP
IPCP
(S)
WS App.
833
VDPCP
RDMCP
A11A3 - CSSFS (Sec.) / ASFS (Pri.)
Strategies
for Fault
UnitData
865Distribution
- Signal & Data
Processor
Tolerant Components
A21A3
A1A11
HP744
A2A1
HP744
A2A6
HP744
A2A11
HP744
A1A1
HP744
A1A6
HP744
A1A11
HP744
RDPCP(S)
FCCP
(S)
DMCP
(S)
IPCP
(S)
CMCP
DCCP
WS App.
832
A11A3 -CSSFS (Sec.)/UCFS(Sec.)
Unit 864 - Signal & Data Processor
A21A3
A1a
HP744
X-Server
Unit 540 - Dual Display
A1
HP744
CADRT-L
A2
HP744
CADRT-S
RTM/JNL 6
Applied Elements of CORBA
• Naming Service
– Developed Fault Tolerant Scheme
• Static Invocations
– Did not utilize Dynamic CORBA
• Basic Object Adapter (BOA)
– POA not yet adopted
• Publish-Subscribe Semantics for Data Distribution
– Event Service Not Readily Available
• Immature Implementations
• Some products relied on Multicast
Data Distribution Strategies for Fault
Tolerant Components
RTM/JNL 7
Interoperability
HP-UX 11.0
Java
Windows-NT
HP Java
Orb
Java
Ada-95
SUN Java
Orb
ORBexpress
RT
IIOP
C++
Visibroker
C++
Visibroker
VxWorks 5.3.1
Data Distribution Strategies for Fault
Tolerant Components
RTM/JNL 8
Component Development
• Why Components?
– IDL does not define all of the requirements placed upon an
interface.
• Fault Tolerant Requirements
• Communication Maintenance
• Failure Events
– Multiple IDL users leads to multiple solutions to meet requirements
• Leads to multiple problems that are unique
• Examples of Common System Services (CSS) Components
with Fault Tolerant Requirements
– System Resource Management
– Ownship Data Management
– Database Management
Components allow for a common
mechanism to meet system requirements
Data Distribution Strategies for Fault
Tolerant Components
RTM/JNL 9
CSS Components
Unit 861 - Signal & Data Processor
A11A3 - CSSFS (Pri.) /UCFS (Pri.)
A21A3 - LSFS
to
ALIS
Unit 866 - Signal & Data Processor
A11A3 -CSSFS (Pri.)/TRAFS
A21A3
IFCP
WS App.
AEGIS
LSPCP
CMCP
DCCP
FCCP
(M)
WS App.
834
DMCP
(M)
IPCP
(M)
RDPCP(M)
TRAFS
WS App.
830
DMCP
(M)
A1A1
HP744
A1A6
HP744
A1A11
PPC
A1A1
HP744
A1A6
HP744
A1A11
HP744
A2A1
HP744
A2A6
HP744
A2A11
HP744
A1A1
HP744
A1A11
PPC(2)
A2A1
HP744
A2A6
HP744
SRMCP
IPCP
(M)
A2A11
HP744
WS App.
831
WS App.
TRAFS
RPMICP
A1A1
HP744
A1A6
HP744
A1A11
HP744
Unit 804 - System
Equipment Rack
S-L Recorder
RMS Recorder
Unit 863 A11A3
HRBCP
A2A1
PPC
VLS
TSP
AN/SRQ-4
Color/BW Printer
Unit 861
15(16)/16
AN/ARR-75
HRBCP
A2A14
PPC
ATP
FODMS
XBT
Unit 866
18(22)/24
SPPFS App. Host
HP J5600
ATM Switch
ATM Switch
to
ALIS
Unit 862 A11A3
XMCP
A2A1
PPC
Unit 860
Unit 865
16/16
A11A3 A21A3
OBTFS ASFS
OBTCCP
Spare
A2A6
HP744
OBTHCP
A2A11
PPC
TCW
A2A1
HP744
OBTLCP
A2A11
PPC
RCSCP
A3A4
PPC
RAACP
Unit 830 - AN/UYQ-70
A1a
HP744
to
RMS
A2A1
HP744
A3A1
PPC
Unit 864
19(22)/24
ATM Switch
X-Server
Unit 831 - AN/UYQ-70
A1a
HP744
X-Server
ATM Switch
Unit 832 - AN/UYQ-70
A1a
HP744
VLS
TSP
ATP
FODMS
X-Server
Unit 833 - AN/UYQ-70
A1a
HP744
X-Server
Unit 834 - AN/UYQ-70
A1A1
HP744
A1A6
HP744
A1A11
PPC
A2A1
HP744
A2A6
HP744
A2A11
HP744
A2A1
HP744
A1A1
HP744
A1A6
HP744
HTCCP
HTCCP
HSPCP
TMCP
DMCP
(S)
SRMCP
IPCP
(S)
WS App.
833
VDPCP
RDMCP
A11A3 - CSSFS (Sec.) / ASFS (Pri.)
Strategies
for Fault
UnitData
865Distribution
- Signal & Data
Processor
Tolerant Components
A21A3
A1A11
HP744
A2A1
HP744
A2A6
HP744
A2A11
HP744
A1A1
HP744
A1A6
HP744
A1A11
HP744
RDPCP(S)
FCCP
(S)
DMCP
(S)
IPCP
(S)
CMCP
DCCP
WS App.
832
A11A3 -CSSFS (Sec.)/UCFS(Sec.)
Unit 864 - Signal & Data Processor
A21A3
A1a
HP744
X-Server
Unit 540 - Dual Display
A1
HP744
CADRT-L
A2
HP744
CADRT-S
RTM/JNL 10
CSS Component Design
• CSS consists of an interface layer and a realization
layer
• Interface layer provides a set of services commonly
required by tactical applications in their native
language
• For example: Alerts, Time, Ownship and Database
• Realization layer contains a set of executables that
collaborate to provide an implementation to the
services offered at the interface layer
• For example: Ownship Management provides the
implementation to the Ownship interface.
Tactical
Application
Data Distribution Strategies for Fault
Tolerant Components
API
CSS
Interface
Layer
CORBA
CSS
Realization
Layer
RTM/JNL 11
Fault Tolerant Challenge & Strategies
• Challenge
– Provide a common strategy for meeting systemic fault
tolerant requirements
– Isolate the strategies from the applications utilizing the fault
tolerant components
– Handle different levels of fault tolerant quality of service
• Strategies
– Master/Standby
• System Resource Management Component
– Master/Standby Data Distribution
• Ownship Management Component
– Master/Standby Storage with Distribution Capability
• Database Management Component
Data Distribution Strategies for Fault
Tolerant Components
RTM/JNL 12
Master/Standby – System Resource Management
Cont rol
User
Application
Interface
St
C
on
tr o
l
at
us
Secondary
...
Implementation
Framework
Interface
Implementation
Framework
Component
Standby
User
Application
C
on
tr o
l
us
at
Arbitration
Interface
t r ol
C on
Primary
St
Component
Master
Implementation Framework connected to Primary
Component running as Master
•
User
Application
Implementation
Framework
t us
St a
Primary
Interface
Interface
Implementation
Framework
...
User
Application
St at us
Implementation
Framework
Interface
User
Application
Implementation
Framework
User
Application
St a
tus
Co
ntr
ol
St atus
Cont rol
Component
Standby
Arbitration
Component
Master
Secondary
Implementation Framework connected to Secondary
Component after a transition to Master
Mechanism
– Interface Layer maintains connection to one component site (Master)
– Only master site publishes objects into the Naming Service
– Upon failure standby site overwrites references in the Naming Service
• Interface Layer reestablishes connection to new master site
– Arbitration Exists between Primary and Secondary Sites
•
Suitability
– Bi-directional Data Flow
– Temporary loss of data is allowed during transition
– Data Exchanged – Posting Alerts & System State Updates
Data Distribution Strategies for Fault
Tolerant Components
RTM/JNL 13
Master/Standby Data Distribution –
Ownship Management
Data Distribution
Standby
da
Arbitration
ta
...
User
Application
Interface
Secondary
User
Application
Implementation
Framework
ta
da
Data Distribution
Standby
ta
Arbitration
Interface
a
dat
Primary
da
Data Distribution
Master
Implementation Framework connected to Primary &
Secondary Component. Data distributed from Master only.
•
User
Application
Implementation
Framework
Interface
Implementation
Framework
User
Application
Primary
Interface
Interface
Implementation
Framework
...
data
Implementation
Framework
Interface
User
Application
Implementation
Framework
User
Application
Data Distribution
Master
data
Secondary
Implementation Framework connected to Primary &
Secondary Component. Data distributed from Master only.
Mechanism
– Interface Layer maintains connection to both sites
– Publish-Subscribe pattern utilized
– Arbitration exists between sites
• Only master site distributes data
– Hides multiple distribution sites from applications and provides
uninterrupted flow of data
•
Suitability
– Unidirectional Data Flow (Pure Data Sources)
– Data Exchanged – Ship Positional Data
Data Distribution Strategies for Fault
Tolerant Components
RTM/JNL 14
Master/Standby Storage with Distribution Capability –
Database Management
Primary
User
Application
Data
or
e
St
User
Application
Interface
at
a
D
Implementation Framework connected to Primary & Secondary
Component. Bi-directional interface with Master only.
St o
re
Component
Standby
Arbitration
Da
ta
...
Component
Standby
Secondary
at
a
Implementation
Framework
Interface
•
Implementation
Framework
User
Application
D
e
or
Arbitration
User
Application
Primary
St
a
D at
Interface
re
St o
Component
Master
Implementation
Framework
Interface
Implementation
Framework
...
Interface
St ore
Implementation
Framework
Interface
User
Application
Implementation
Framework
User
Application
St ore
Data
Component
Master
Secondary
Implementation Framework connected to Secondary & Primary
Component. During transition, framework updated to
communicate with new Master.
Mechanism
– Interface Layer maintains connection to both sites
– Arbitration exists between sites
• Only master site distributes data – Publish-Subscribe pattern
• Applications only allowed to send and retrieve from master site
• Standby site notifies Interface Layer when transitioning to master
•
– Hides multiple storage/distribution sites from applications and provides uninterrupted
flow of data
Suitability
– Bi-directional Data Flow where information is stored, retrieved and distributed
– Loss of Data Not Tolerated
– Data Exchanged – System Information, Persistent Information, …
Data Distribution Strategies for Fault
Tolerant Components
RTM/JNL 15
Lessons Learned
• Implementation of Publish-Subscribe patterns
– Obtaining a Key when Registering with a Data Server
• Subscribers need to pass a unique key to the Data Servers
• Data Servers providing the key to the subscriber can cause
performance issues
– Clean Up of Failed Subscriber References
• Destroy Failed References in a separate thread
• QoS of underlying TCP socket closure can block thread of
execution
– Data Servers Provide Tolerance on Subscriber Reference
Failures
• Intermittent failures can cause the break down of otherwise valid
communication channel
• Do not generically process exceptions
Data Distribution Strategies for Fault
Tolerant Components
RTM/JNL 16
Lessons Learned -- Observations with Subscription Failures
• CORBA Setup
– Oneway invocations for distributing subscription data
– Original Assumption – Oneway Invocations will not block
• Observations on TCP socket in flow control
– Oneway invocations blocked thread
– Eventually No Response exception thrown
• Observations on Kernel Crashes and TCP sockets
– Mainly Observed with VxWorks nodes which needed a reset to
restart applications
– Oneway invocations blocked thread
– Eventually No Response exception thrown
• Summary
– One subscriber effected distribution to all other subscribers
• Solution
– Distribution with Active Objects
Data Distribution Strategies for Fault
Tolerant Components
RTM/JNL 17
Lessons Learned -- Distribution with Active Objects
Subscriber List No
Failures Scenario
Subscriber
Thread
Subscriber
Distribute To
Subscriber
Subscriber
Thread
Trigger Data
Distribution
Subscriber List
Failure Scenario
Distribute To
Subscriber
Thread
Main Distribution
Thread
Mark For
Clean Up
Distribute To
Subscriber
Subscriber
Thread
Subscriber
Thread
Subscriber
Distribute To
Subscriber
Subscriber
Thread
Thread
Distribute To
Subscriber
Subscriber
Distribute To
Subscriber
Data Distribution Strategies for Fault
Tolerant Components
Thread
Distribute To
Subscriber
RTM/JNL 18
Clean Up
Failed
Subscriber
Lessons Learned
• Real-Time Implications
– Handling Failures
• Minimize effect of a failed subscriber on the data flow to other
system subscribers
• Build in a tolerance for intermittent failures on subscribers
– Prevents time consuming recovery actions
– Performance
• Simplify the subscription process to minimize the processing
involved with removal and additions to the distribution list.
Data Distribution Strategies for Fault
Tolerant Components
RTM/JNL 19
Wrap-Up
• IDL alone is not sufficient for defining all of the
requirements of an interface
– Fault tolerant requirements are not apparent from the IDL
• Component Development provides a common
mechanism for meeting these interface requirements
• Isolate communication failure events from the main
thread of execution
– Do not allow the handling of a failure on a subscriber
reference to interrupt the flow of data to other system
subscribers
• SQQ-89 successfully deployed real-time embedded
system that used CORBA as the system middleware
Data Distribution Strategies for Fault
Tolerant Components
RTM/JNL 20
Questions & Answers
Data Distribution Strategies for Fault
Tolerant Components
RTM/JNL 21