SRS Red Team Approach

RABA’s Red Team Assessments
QuickSilver
14 December 2005
Agenda
• Tasking for this talk…
• Projects Evaluated
• Approach / Methodology
• Lessons Learned
  o and Validations Achieved
• The Assessments
  o General Strengths / Weaknesses
  o AWDRAT (MIT)
    • Success Criteria
    • Assessment Strategy
    • Strengths / Weaknesses
  o LRTSS (MIT)
  o QuickSilver / Ricochet (Cornell)
  o Steward (JHU)
The Tasking
“Lee would like a presentation from the Red Team
perspective on the experiments you've been involved with.
He's interested in a
• talk that's heavy on lessons learned and benefits gained.
Also of interest would be
• red team thoughts on strengths and weaknesses of the
technologies involved.
Keeping in mind that no rebuttal would be able to take place
beforehand,
• controversial observations should be either generalized
(i.e., false positives as a problem across several projects) or
left to the final report.”
-- John Frank e-mail (November 28, 2005)
Specific Teams We Evaluated
• Architectural-Differencing, Wrappers, Diagnosis, Recovery,
  Adaptive Software and Trust Management (AWDRAT)
  o October 18-19, 2005
  o MIT
• Learning and Repair Techniques for Self-Healing Systems (LRTSS)
  o October 25, 2005
  o MIT
• QuickSilver / Ricochet
  o November 8, 2005
  o Cornell University
• Steward
  o December 9, 2005
  o JHU
Basic Methodology
• Planning
  o Present High Level Plan at July PI Meeting
  o Interact with White Team to schedule
  o Prepare Project Overview
  o Prepare Assessment Plan
    • Coordinate with Blue Team and White Team
• Learning
  o Study documentation provided by team
  o Conference Calls
  o Visit with Blue Team day prior to assessment
    • Use system, examine output, gather data
• Test
• Formal De-Brief at end of Test Day
Lessons Learned
(and VALIDATIONS achieved)
Validation / Lessons Learned
• Consistent Discontinuity of Expectations
  o Scope of the Assessment + Success Criteria
    • Boiling it down to “Red Team Wins” or “Blue Team Wins” on each test
      required significant clarity
      o Unique to these assessments because the metrics were unique
    • Lee/John instituted an assessment scope conference call halfway through
      o We think that helped a lot
  o Scope of Protection for the systems
    • Performer’s Assumptions vs. Red Team’s Expectations
    • In all cases, we wanted to see a more holistic approach to the
      security model
    • We assert each program needs to define its security policy
      o And especially document what it assumes will be protected / provided
        by other components or systems
LL: Scope of Protection
[Figure: each project's assumed scope of protection (Existing System’s
Perspective) contrasted with the Red Team’s Perspective, which additionally
expected, per project:]
- LRTSS: protect ALL data structures; protect dependent relationships;
  detect pointer corruption
- AWDRAT: OS environment (s/w & data at rest; services); complete path
  protection
- QuickSilver: OS environment (s/w & data at rest; services)
- Steward: protect keys & key mgmt; defense against evil clients; OS
  environment (s/w & data at rest; services)
Validation / Lessons Learned
• More time would have helped A LOT
  o Longer Test Period (2-3 day test vice 1 day test)
    • Having an evening to digest then return to test would have allowed
      more effective additional testing and insight
  o We planned an extra 1.5 days for most, and that was very helpful
    • We weren’t rushing to get on an airplane
    • We could reduce the data and come back for clarifications if needed
    • We could defer non-controversial tests to the next day to allow focus
      with Government present
• More Communication with Performers
  o Pre-Test Site/Team Visit (~2-3 weeks prior to test)
    • Significant help in preparing testing approach
    • The half-day that we implemented before the test was crucial for us
  o More conference calls would have helped, too
  o Hard to balance against performers’ main focus, though
Validation / Lessons Learned
• A Series of Tests Might Be Better
  o Perhaps one day of tests similar to what we did
  o Then a follow-up test a month or two later as prototypes matured
    • With the same test team to leverage understanding of system gained
• We Underestimated the Effort in Our Bid
  o Systems were more unusual and complex than we anticipated
  o 20-25% more hours would have helped us a lot in data reduction
• Multi-talented team proved vital to success
  o We had programming (multi-lingual), traditional red team, computer
    security, systems engineering, OS, system admin, network engineering,
    etc. talent present for each test
• Highly tailored approach proved appropriate and necessary
  o Using a more traditional network-oriented Red Team Assessment approach
    would have failed
The Assessments
Overall Strengths / Weaknesses of Projects
• Strengths
  o Teams worked hard to support our assessments
  o The technologies are exciting and powerful
• Weaknesses
  o Most Suffered a Lack of System Documentation
    • We understand there is a balance to strike – these are essentially
      research prototypes, after all
    • Really limited our ability to prepare for the assessment
  o All are Prototypes -- stability needed for deterministic test results
  o All provide incomplete security / protection almost by definition
  o Most Suffered a Lack of Configuration Management / Control
  o Test “Harnesses” far from optimal for Red Team use
    • Of course, they are oriented around supporting the development
    • But we’re fairly limited in using other tools due to the uniqueness of
      the technologies
AWDRAT
Assessment
October 18-19, 2005
Success Criteria
• The target application can successfully and/or correctly perform its
  mission
• The AWDRAT system can
  o detect an attacked client’s misbehavior
  o interrupt a misbehaving client
  o reconstitute a misbehaving client in such a way that the reconstituted
    client is not vulnerable to the attack in question
• The AWDRAT system must
  o Detect / Diagnose at least 10% of attacks/root causes
  o Take effective corrective action on at least 5% of the successfully
    identified compromises/attacks
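
The arithmetic behind these two thresholds can be sketched as follows. The
helper function and counts are hypothetical illustrations of the pass/fail
math only; official scoring belongs to the White Team.

    # Hypothetical sketch of the 10% detection / 5% correction threshold arithmetic.
    # Counts are illustrative; official scoring comes from the White Team.
    def awdrat_meets_thresholds(attacks_run, attacks_detected, detected_and_corrected):
        detection_rate = attacks_detected / attacks_run
        # Corrective action is measured against successfully identified attacks,
        # not against the total number of attacks run.
        correction_rate = (detected_and_corrected / attacks_detected) if attacks_detected else 0.0
        return detection_rate >= 0.10 and correction_rate >= 0.05

    # Example: 20 attacks run, 3 detected, 1 of those 3 corrected -> criteria met.
    print(awdrat_meets_thresholds(20, 3, 1))   # True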
Assessment Strategy
• Denial of Service
  o aimed at disabling or significantly modifying the operation of the
    application to an extent that mission objectives cannot be accomplished
  o attacks using buffer-overflow and corrupted data injection to gain
    system access
• False Negative Attacks
  o a situation in which a system fails to report an occurrence of anomalous
    or malicious behavior
  o Red Team hoped to perform actions that would fall "under the radar". We
    targeted the modules of AWDRAT that support diagnosis and detection.
• False Positive Attacks
  o system reports an occurrence of malicious behavior when the activity
    detected was non-malicious
  o Red Team sought to perform actions that would excite AWDRAT's monitors.
    Specifically, we targeted the modules supporting diagnosis and detection.
• State Disruption Attacks
  o interrupt or disrupt AWDRAT's ability to maintain its internal state
    machines
• Recovery Attacks
  o disrupt attempts to recover or regenerate a misbehaving client
  o target the Adaptive Software and Recovery and Regeneration modules in an
    attempt to allow a misbehaving client to continue operating
Strengths / Weaknesses
• Strengths
  o With a reconsideration of the system’s scope of responsibility, we
    anticipate the system would have performed far better in the tests
  o We see great power in the concept of wrapping all the functions
• Weaknesses
  o Scope of Responsibility / Protection far too Limited
  o Need to Develop Full Security Policy
  o Single points of failure
  o Application-Specific Limitations
  o Application Model Issues
    • Incomplete – by design?
    • Manually Created
    • Limited Scope
    • Doesn’t really enforce multi-layered defense
LRTSS
Assessment
October 25, 2005
Success Criteria
• The instrumented Freeciv server does not core dump under a condition in
  which the uninstrumented Freeciv server does core dump
• The LRTSS system can
  o Detect a corruption in a data structure that causes an uninstrumented
    Freeciv server to exit
  o Repair the data corruption in such a way that the instrumented Freeciv
    server can continue running
• The LRTSS system must
  o Detect / Diagnose at least 10% of attacks/root causes
  o Take effective corrective action on at least 5% of the successfully
    identified compromises/attacks
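
To make the detect-and-repair idea concrete, the toy sketch below applies the
same pattern to a doubly linked list: check an invariant, and if it is
violated, patch the structure so execution can continue. It is purely
illustrative and assumes nothing about the invariants LRTSS actually learns
or the repair algorithm it uses on Freeciv's real data structures.

    # Toy illustration of consistency checking and repair on a doubly linked list.
    # The invariant and repair are hypothetical stand-ins for the general idea.
    class Node:
        def __init__(self, value):
            self.value = value
            self.prev = None
            self.next = None

    def check_and_repair(head):
        """Detect broken back-pointers (node.next.prev is not node) and repair them."""
        repairs = 0
        node = head
        while node and node.next:
            if node.next.prev is not node:   # invariant violated: corruption detected
                node.next.prev = node        # repair so traversal can continue
                repairs += 1
            node = node.next
        return repairs

    # Build a small list, corrupt one back-pointer, then detect and repair it.
    a, b, c = Node(1), Node(2), Node(3)
    a.next, b.prev, b.next, c.prev = b, a, c, b
    c.prev = None                            # simulated corruption
    print(check_and_repair(a))               # prints 1 (one repair performed)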
Assessment Strategy
• Denial of Service
  o Aimed at disabling or significantly modifying the operation of the
    Freeciv server to an extent that mission objectives cannot be
    accomplished
  o In this case, not achieving mission objectives is defined as the Freeciv
    server exiting or dumping core
  o Attacks using buffer-overflow, corrupted data injection, and resource
    utilization
  o Various data corruptions aimed at causing the server to exit
  o Formulated the attacks by targeting the uninstrumented server first,
    then running the same attack against the instrumented server (see the
    sketch after this list)
• State Disruption Attacks
  o interrupt or disrupt LRTSS's ability to maintain its internal state
    machines
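
The differential formulation in the last Denial of Service bullet follows the
general pattern sketched below. The server binaries and the way an attack is
applied are hypothetical placeholders; the real harness drove actual Freeciv
processes.

    # Sketch of the differential test pattern: an attack only counts if it kills the
    # uninstrumented server, and LRTSS succeeds if the instrumented server survives
    # the same attack. The binaries named here are hypothetical placeholders.
    import subprocess, time

    def survives_attack(server_cmd, attack):
        """Launch a server, apply the attack, and report whether it is still running."""
        proc = subprocess.Popen(server_cmd)
        try:
            attack(proc)              # e.g., inject corrupted data over the game protocol
            time.sleep(5)             # give the server time to crash if it is going to
            return proc.poll() is None
        finally:
            if proc.poll() is None:
                proc.terminate()

    def evaluate(attack):
        if survives_attack(["./civserver"], attack):
            return "attack ineffective against the baseline server; not counted"
        if survives_attack(["./civserver-instrumented"], attack):
            return "LRTSS detected and repaired the corruption"
        return "both servers failed"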
Strengths / Weaknesses
• Strengths
  o Performs very well under simple data corruptions
    • (that would cause the system to crash under normal operation)
  o Performs well under a large number of these simple data corruptions
    • (200 to 500 corruptions are repaired successfully)
  o Learning and Repair algorithms well thought out
• Weaknesses
  o Scope of Responsibility / Protection too limited
  o Complex Data Structure Corruptions not handled well
  o Secondary Relationships are not protected against
  o Pointer Data Corruptions not entirely tested
  o Timing of Check and Repair Cycles not optimal
  o Description of “Mission Failure” as core dump may be excessive
QuickSilver
Assessment
November 8, 2005
Success Criteria
• Ricochet can successfully and/or correctly perform its mission
  o “Ricochet must consistently achieve a fifteen-fold reduction in latency
    (with benign failures) for achieving consistent values of data shared
    among one hundred to ten thousand participants, where all participants
    can send and receive events."
• Per client direction, elected to use average latency time as the
  comparative metric
  o Ricochet’s Average Recovery demonstrates 15-fold improvement over SRM
  o Additional constraint levied requiring 98% update saturation (imposing
    the use of the NACK failover for Ricochet)
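
Stated as arithmetic on the reduced data, the comparative metric looks like
the sketch below. The function and sample numbers are our own shorthand for
illustration, not Cornell's instrumentation.

    # Sketch of the comparative metric: average recovery latency of Ricochet vs. SRM,
    # plus the additional 98% update-saturation constraint. Inputs are placeholders
    # for measurements reduced from the test logs.
    def meets_criteria(ricochet_recovery_ms, srm_recovery_ms, updates_delivered, updates_sent):
        avg_ricochet = sum(ricochet_recovery_ms) / len(ricochet_recovery_ms)
        avg_srm = sum(srm_recovery_ms) / len(srm_recovery_ms)
        improvement = avg_srm / avg_ricochet            # must be at least 15-fold
        saturation = updates_delivered / updates_sent   # must be at least 98%
        return improvement >= 15.0 and saturation >= 0.98

    # Illustrative numbers only:
    print(meets_criteria([2.0, 3.0, 2.5], [40.0, 45.0, 42.0], 990, 1000))   # True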
Assessment Strategy
• Scalability Experiments -- test scalability in terms of number of groups
  per node and number of nodes per group. Here, no node failures will be
  simulated, and no packet losses will be induced (aside from those that
  occur as a by-product of normal network traffic).
  o Baseline Latency
  o Group Scalability
  o Large Repair Packet Configuration
  o Large Data Packet Storage Configuration
• Simulated Node Failures – simulate benign node failures.
  o Group Membership Overhead / Intermittent Network Failure
• Simulated Packet Losses – introduce packet loss into the network (see the
  relay sketch after this list).
  o High Packet Loss Rates
    • Node-driven Packet Loss
    • Network-driven Packet Loss
    • Ricochet-driven Packet Loss
  o High Ricochet Traffic Volume
  o Low Bandwidth Network
• Simulated Network Anomalies – simulate benign routing and network errors
  that might exist on a deployed network. The tests will establish whether
  or not the protocol is robust in its handling of typical network
  anomalies, as well as those atypical network anomalies that may be induced
  by an attacker.
  o Out of Order Packet Delivery
  o Packet Fragmentation
  o Duplicate Packets
  o Variable Packet Sizes
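
For the packet-loss and anomaly categories above, the generic mechanism is a
lossy relay placed between group members, sketched below. It is a minimal
illustration with hypothetical addresses and rates, not the batch-oriented
harness the tests actually had to run through.

    # Minimal sketch of a lossy UDP relay of the kind used to induce packet loss and
    # duplication. Addresses, ports, and rates are hypothetical.
    import random, socket

    LISTEN = ("0.0.0.0", 9000)          # where senders are pointed
    FORWARD = ("10.0.0.2", 9001)        # the real receiver
    DROP_RATE, DUP_RATE = 0.10, 0.05    # 10% loss, 5% duplication

    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(LISTEN)
    while True:
        data, _ = sock.recvfrom(65535)
        if random.random() < DROP_RATE:
            continue                    # simulate packet loss
        sock.sendto(data, FORWARD)
        if random.random() < DUP_RATE:
            sock.sendto(data, FORWARD)  # simulate a duplicate packet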
Strengths / Weaknesses
• Strengths
  o Appears to be very resilient when operating within its assumptions
  o Very stable software
  o Significant performance gains over SRM
• Weaknesses
  o FEC-orientation - the focus in the statistics obscures valuable data
    regarding complete packet delivery
  o Batch-oriented Test Harness
    • Impossible to perform interactive attacks
    • Very limited insight into blow-by-blow performance
  o Metrics collected were very difficult to understand fully
STEWARD
Assessment
December 9, 2005
Success Criteria
• The STEWARD system must:
  o Make progress in the system when under attack.
    • Progress is defined as the eventual global ordering, execution, and
      reply to any request which is assigned a sequence number within the
      system
  o Maintain the consistency of data replicated on each of the servers in
    the system
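
Read against per-server logs, the two criteria reduce to checks like those
sketched below. The log format (lists of (sequence_number, request_id) pairs
per server) is our own simplification for illustration, not STEWARD's actual
instrumentation.

    # Sketch of checking progress and consistency from simplified per-server logs.
    def check_consistency(server_logs):
        """No two servers may order different requests at the same sequence number."""
        ordered = {}
        for log in server_logs:
            for seq, request_id in log:
                if ordered.setdefault(seq, request_id) != request_id:
                    return False
        return True

    def check_progress(assigned_sequence_numbers, replied_sequence_numbers):
        """Every request assigned a sequence number is eventually executed and replied to."""
        return set(assigned_sequence_numbers) <= set(replied_sequence_numbers)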
Assessment Strategy
• Validation Activities - tests we will perform to verify that STEWARD can
  endure up to five Byzantine faults while maintaining a three-fold
  reduction in latency with respect to BFT (sketched below)
  o Byzantine Node Threshold
  o Benchmark Latency
• Progress Attacks - attacks we will launch to prevent STEWARD from
  progressing to a successful resolution of an ordered client request
  o Packet Loss
  o Packet Delay
  o Packet Duplication
  o Packet Re-ordering
  o Packet Fragmentation
  o View Change Message Flood
  o Site Leader Stops Assigning Sequence Numbers
  o Site Leader Assigns Non-Contiguous Sequence Numbers
  o Suppressed New-View Messages
  o Consecutive Pre-Prepare Messages in Different Views
  o Out of Order Messages
  o Byzantine Induced Failover
• Data Integrity Attacks - attempts to create an inconsistency in the data
  replicated on the various servers in the network
  o Arbitrarily Execute Updates
  o Multiple Pre-Prepare Messages using Same Sequence Numbers and Different
    Request Data
  o Spurious Prepare, Null Messages
  o Suppressed Checkpoint Messages
  o Prematurely Perform Garbage Collection
  o Invalid Threshold Signature
• Protocol State Attacks - attacks focused on interrupting or disrupting
  STEWARD's ability to maintain its internal state machines
  o Certificate Threshold Validation Attack
  o Replay Attack
  o Manual Exploit of Client or Server

Note: We did not try to validate or break the encryption algorithms.
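
The pass/fail arithmetic behind the Validation Activities bullet above is
simple; the sketch below uses placeholder measurements and is not the
official scoring.

    # Sketch of the validation check: progress with up to five Byzantine servers and
    # a three-fold latency reduction relative to BFT. Values are placeholders.
    def validation_passes(progress_with_faults, steward_latency_ms, bft_latency_ms):
        """progress_with_faults maps number of Byzantine servers -> progress observed (bool)."""
        tolerates_faults = all(progress_with_faults.get(f, False) for f in range(1, 6))
        reduction = bft_latency_ms / steward_latency_ms     # must be at least 3-fold
        return tolerates_faults and reduction >= 3.0

    print(validation_passes({f: True for f in range(1, 6)}, 20.0, 75.0))   # True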
Strengths / Weaknesses
• Strengths
  o First system that assumes and actually tolerates corrupted components
    (Byzantine attack)
  o Blue Team spent extensive time up front in analysis, design and proof of
    the protocol – and it showed in the performance
  o System was incredibly stable and resilient
  o We did not compromise the system
• Weaknesses
  o Limited Scope of Protection
    • Relies on external entity to secure and manage keys which are
      fundamental to the integrity of the system
    • STEWARD implicitly and completely trusts the client
      o Client-side attacks were out of scope of the assessment
Going Forward
• White Team will generate the definitive report on this Red Team Test
  activity
  o It will have the official scoring and results
• RABA (Red Team) will generate a test report from our perspective
  o We will publish to:
    • PI for the Project
    • White Team (Mr. Do)
    • DARPA (Mr. Badger)
Questions or Comments
Any Questions, Comments, or Concerns?