Cyber Genome TA3 Volume I 1.34 MiB application

HBGary Federal, LLC.
3604 Fair Oaks Blvd. Suite 250, Sacramento, CA. 95864
Phone: (916) 459-4727
Fax: (916) 481-1460
VOLUME I TECHNICAL MANAGEMENT PROPOSAL
Prepared for DARPA
CYBER GENOME PROGRAM
STRATEGIC TECHNOLOGY OFFICE
DARPA-BAA-10-36
March 21, 2010
Table of Contents
This proposal includes data that shall not be disclosed outside the Government and shall not be
duplicated, used, or disclosed-in whole or in part-for any purpose other than to evaluate this
proposal. If, however, a contract is awarded to this proposer as a result of, or in connection with,
the submission of this data, the Government shall have the right to duplicate, use, or disclose the
data to the extent provided in the resulting contract. This restriction does not limit the
Government's right to use information contained in this data if it is obtained from another source
without restriction. The data subject to this restriction are contained in marked sheets.
The table of contents is empty because none of the paragraph styles selected in the Document
Inspector are used in the document.
Section I. Administrative
A. Proposal Cover Sheet
1
Broad Agency
Announcement
DARPA-BAA-10-36
Cyber Genome Program
2
Prime Organization
HBGary Federal, LLC.
3
Proposal Title
Cyber Genome Program
Use or disclosure of data contained on this sheet is subject to the restriction on the title page of
this proposal.
2
4
5
□ Large Business
□ Small Disadvantaged
Type of Business (Check
Business
one)
□ Other Small Business
□ Government Laboratory
or FFRDC
Contractor’s Reference
Number
(if applicable)
Contractor and
Government Entity
(CAGE) Code
Dun and Bradstreet (DUN)
7
Number
6
8
9
□ Historically-Black Colleges
□ Minority Institution (MI)
□ Other Educational
□ Other Nonprofit
(if applicable)
(if applicable)
North American Industrial
NOTE: This was formerly the Standard Industrial Classification
Classification System
(SIC) Number
(NAICS) Number
Taxpayer Identification
Number (TIN)
Include: salutation, last name, first name, street address, city,
10 Technical Point of Contact state, zip code, telephone, fax (if available), electronic mail (if
available)
Include: salutation, last name, first name, street address, city,
Administrative Point of
11
state, zip code, telephone, fax (if available), electronic mail (if
Contact
available)
Include: salutation, last name, first name, street address, city,
12 Security Point of Contact state, zip code, secure telephone, secure fax, sterile mailing
address
13
Technical POC salutation, last name,
first name, street address, city, state,
Other Team Members (if Name, type of business
zip code, telephone, fax (if available),
applicable)
(see 4 above)
electronic mail (if available), CAGE
Code
Base Effort:
(Phase 1)
14
Funds Requested From
DARPA
Option Effort:
(Phase 2)
Total Proposed Cost
(Including Options)
Base Effort Cost
Base Options Cost: (list all)
Option Effort Cost
Phase II Options Cost: (list all)
Total
Use or disclosure of data contained on this sheet is subject to the restriction on the title page of
this proposal.
3
Amount of Cost Share
15
Award Instrument
Requested
16
Proposers Cognizant
Government
Administration Office
17
□cost-plus-fixed-fee
□cost-contract-no-fee
□cost sharing contract-no fee
□other procurement
contract:______________
Amount of cost share (if any)
□grant
□agreement
□other award instrument:
__________
Name, mailing address, telephone number and Point of Contact
of the Proposers cognizant government administration office
(i.e., Defense Contract Management Agency (DCMA))
Proposer’s Cognizant
Defense Contract Audit Name, mailing address, telephone number, and Point of Contact
Agency (DCAA) audit
if known
Office
18
Other
Any Forward Pricing Rate Agreement, other such Approved
Rate Information, or such other documentation that may assist in
expediting negotiations (if available)
19
Date Proposal Prepared
Date
20 Proposal Expiration Date
21
22
Date cost proposal valid through date
Place(s) and Period(s) of Location where the proposed work will be performed and dates
of proposed performance
Performance
Technical Area
(check one)
□ Technical Area 1 - Cyber Genetics
□ Technical Area 2 - Cyber Anthropology and Sociology
□ Technical Area 3 - Cyber Physiology
□ Technical Area 4 - Other
B. Official transmittal letter.
HBGary Federal, LLC.
3604 Fair Oaks Blvd. Suite 250, Sacramento, CA. 95864
Phone: (916) 459-4727
Fax: (916) 481-1460
March 21, 2010
Attn: XXX
DARPA
Subject: DARPA Cyber Genome Program
Use or disclosure of data contained on this sheet is subject to the restriction on the title page of
this proposal.
4
HBGary Federal is pleased to present this proposal to DARPA in response to DARPA BAA-1036 Cyber Genome Program. This proposal assumes a XXX type contract and is valid through
July 30, 2010.
Intellectual Property Claims / Technical Data Rights
HBGary understands and appreciates DARPA’s needs for rights in data and provides the
following good faith representation that we own and possess appropriate licensing rights to all
intellectual property that will be utilized under this proposal. While this contract, if awarded to
HBGary would constitute a Phase III contract, HBGary has also structured this proposal to
provide DARPA extensive rights by waiving some of its Phase III SBIR rights. This would
constitute a Phase III award, if awarded to HBGary because the award would “derive from,
extend, or logically conclude HBGary’s prior SBIR effort and would be funded with non-SBIR
funds.” See SBA SBIR Policy Directive of September 24, 2002 at Sec. 4(c)(2). Phase IIIs can
also be subcontracts and subcontracts can be Phase III awards. Id. at 4(c)(5). Thus, HBGary has
a right to exert SBIR data rights to all of the data generated under the award, if given to
HBGary. However, because HBGary understands DARPA’s needs for flexibility with the data
generated under this award. HBGary is providing the data generated under this contract to the
Government with Unlimited Rights. This waiver of HBGary’s SBIR rights is a clear
demonstration of HBGary’s commitment to the mission and goals of DARPA.
HBGary notes that it has developed two patented technologies that it brings to the table for
possible use to fulfill this requirement -- Digital DNA Sequence and Fuzzy Hash Algorithm.
These two technologies, described elsewhere in this proposal, are patented and HBGary
developed both at private expense. HBGary proposes these technologies for possible use to
fulfill this requirement, although it is possible these technologies may have no role in developing
the methodology that DARPA seeks.
At very least, HBGary will leverage the experience gained in developing these two technologies.
If and to the extent that these two technologies become deliverables in the resulting contract,
HBGary will deliver them with Restricted Rights. (See table below). To the extent that any
modifications to these two existing, proprietary technologies need to be made, HBGary will
perform such modifications under pre-existing administrative codes billed to HBGary indirect
accounts, and not charged under the contract.
Assertion of Technical Data Rights in accordance with DFARS 252.227-7018
Technical Data Computer
Software To be Furnished
With Restrictions
Basis for
Assertion
Asserted Rights
Category
Name of Person Asserting
Restrictions
Digital DNA Sequence
Developed at
Bob Slapnik, Vice President
Restricted Rights
Private Expense
HBGary, Inc.
Fuzzy Hash Algorithm
Developed at
Bob Slapnik, Vice President
Restricted Rights
Private Expense
HBGary, Inc.
Use or disclosure of data contained on this sheet is subject to the restriction on the title page of
this proposal.
5
HBGary Digital DNA™
commercial software (1)
Developed at
Bob Slapnik, Vice President
Restricted Rights
Private Expense
HBGary, Inc.
HBGary Responder™
Professional commercial
software (1)
Developed at
Private Expense
Bob Slapnik, Vice President
Restricted Rights
and SBIR, nonHBGary, Inc.
severable
HBGary Responder™
Field Edition commercial
software (1)
Developed at
Private Expense
Bob Slapnik, Vice President
Restricted Rights
and SBIR, nonHBGary, Inc.
severable
(1) Data involved in and related to commercial software products listed above will not be
delivered nor do they need to be delivered to fulfill the requirements of this BAA contract, if
awarded, but will be discussed in the proposal.
HBGary Technologies Summary Descriptions
HBGary Federal owns the Digital DNA Sequence and Fuzzy Hash Algorithm inventions for
which patent application(s) have been filed and are pending. The inventions, developed at the
private expense of HBGary Inc., will be referenced and utilized in this proposal for the DARPA
Cyber Genome Program. The following Patent applications have not yet been made publicly
available, contain proprietary information, and may not be disclosed outside of the federal
government without the permission of HBGary.
Digital DNA Sequence
·
·
Patent application number: 12/386,970
Inventor name(s): Michael Gregory Hoglund
·
·
·
·
Assignee names: HBGary, Inc.
Filing date: April 24, 2009
Filing date of any related provisional application: not applicable
Summary of the patent title: Digital DNA Sequence
HBGary's ownership of the invention is indicated in Reel/Frame 023009/0815 in the
Assignment Division of the US Patent and Trademark Office.
Fuzzy Hash Algorithm
·
·
·
·
·
·
Patent application number: 12/459,203
Inventor name(s): Michael Gregory Hoglund
Assignee names: HBGary, Inc.
Filing date: June 26, 2009
Filing date of any related provisional application: not applicable
Summary of the patent title: Fuzzy Hash Algorithm
HBGary's ownership of the invention is indicated in Reel/Frame 023441/0496 in the Assignment
Division of the US Patent and Trademark Office.
Use or disclosure of data contained on this sheet is subject to the restriction on the title page of
this proposal.
6
HBGary’s proposal will discuss these inventions. However, the inventions, methodologies, traits
and other concepts that these inventions represent will not be delivered under the proposal and
will not be deliverables.
Organizational Conflict of Interest
HBGary Federal, LLC. does not provide scientific, engineering and technical assistance (SETA)
or similar support to any DARPA technical office(s) through active contracts or subcontracts.
We therefore do not have any organizational conflicts of interest that require affirmation.
Sincerely yours,
Aaron D. Barr
CEO
HBGary Federal, LLC.
Section II. Summary of Proposal
A. Innovative Claims for the Proposed Research.
This section is the centerpiece of the proposal and should succinctly describe the uniqueness and
benefits of the proposed approach relative to the current state-of-art alternate approaches.
We will focus using binary analysis to define triggers, code branches, and other external
characteristics that can aid in the more comprehensive methods of direct memory and dynamic
analysis of functions and behaviors of malware. Our methodology of static analysis of memory
images and low-level runtime analysis of binaries provides the most comprehensive malware
analysis approach, circumventing many of the challenges of static analysis which could employ
de-obfuscation and encryption techniques to hide malicious code.
Our core research will focus in Four areas:
1. Traits Library - Malware traits or genes that identify and describe mathematically discrete
behaviors and functions within suspected malware objects. In order to understand
malware you need to first have a methodology for describing malware at the atomic level
which is expressed as individual functions or actions taken by the malware.
2. Genomes Library - Malware genome types to identify and classify patterns of traits
within suspected malware objects. It is specific patterns, or genome sequences of
malware that express themselves during runtime that illustrate higher level functions,
behaviors, purpose, and severity of malware.
3. Static Memory Analysis and Runtime Tracing - Iterative Static and Dynamic automated
analysis for intelligently exercising execution paths during runtime analysis and
recording of all the low level data. In order to fully understand a specimens behaviors
and functions you need as full of a data set as possible.
4. Reasoning Engine Use or disclosure of data contained on this sheet is subject to the restriction on the title page of
this proposal.
7
We have assembled the most capable team and have selected the right approach within an
operational framework. Each company is well documented leads in their area of expertise and
round out all the capabilities you would want on such an effort. We have research company with
research and IP in static binary analysis, we have two small product companies with unique
capabilities in malware memory and behavior analysis on windows and linux platforms, we have
a leading visualization company with a distinguished track record in research and development
for the DoD, and we have an large defense contractor with critical programs in incident response,
net defense, and cyber investigations. If any team can tackle the challenges of this project AND
produce working capabilities, it is this team.
Innovative Claim
State of the Art
Benefit
Hybrid passive and active malware
harvesting instrumented through
feedback of malware analysis
results
All malware is just a collection of instructions and data. Instructions must flow to the processor
in order to execute and the data must be readily available if software is to run quickly. Memory
is the bridge all software takes to reach the processor, thus it is ideal for analyzing malware.
Additionally, malware often exists in an obfuscated and/or encrypted form on disk, however
when it moves to memory it is most often deobfuscated and/or decrypted. This makes memory
the prime location for capturing malware for static analysis and monitoring the malware during
dynamic analysis. Finally, some malware only resides in memory, so traditional static binary
analysis would not be effective.
B. In many cases the capabilities this team has in windows and linux malware collection and
analysis is the current state of the art. There are many technical challenges in automation,
full-code execution, complete malware behavioral and functional analysis, but as this
proposal will demonstrate, the HBGary Team is confidant our innovative approach,
technically talented engineers, and sound management approach will achieve success in these
areas.Deliverables.
associated with the proposed research and the plans and capability to accomplish technology
transition and commercialization. Include in this section all proprietary claims to the results,
prototypes, intellectual property, or systems supporting and/or necessary for the use of the
research, results, and/or prototype. If there are not proprietary claims, this should be stated.
Table #, describes the products that will be delivered in execution of the Cyber Genome Project.
DARPA will have unrestricted rights to all developed capabilities and deliverables under this
effort.
HBGary and Pikewerks are leaders in memory and behavior based malware analysis products, as
such, they are in highly capable positions to transition technologies developed under this project
Use or disclosure of data contained on this sheet is subject to the restriction on the title page of
this proposal.
8
to their existing product lines, or if not commercially viable but deemed essential by the
government to operations, develop stable and mature products under a separate contract
specifically for the government.
#
Deliverable
Type
Phase
Delivered
1
Linux and Windows-based malware feeds
Data
1a
2
Linux and Windows-based malware harvesters
Prototype
2a
3
Binary re-writing tool and corresponding de-obfuscation rewrite rules
Prototype
2b
Algorithms
2a
4
Table #: Cyber Genome Project Deliverables
Table # describes existing technologies and capabilities developed by their respective companies
that will be used in the coarse of project execution to expedite the research and development of
cyber genome capabilities for DARPA. All technologies and capabilities listed in table # are the
sole intellectual property of their respective companies and said companies assert restricted
rights over these technologies.
Name
Digital DNA
Company
Description
HBGary
Restricted Rights
Responder
Table #: Existing Intellectual Property Table
HBGary notes that sec. 7 of the BAA states that: “A more favorable evaluation will be given to
those proposals that do not contain any limitations on the software and technical data, and
associated license rights, respectively.” If HBGary wins an award under this competition, such
an award will constitute an SBIR Phase III award. That is because the requirement stated in the
BAA “derives from, extends, or logically concludes prior [HBGary] research and will be funded
with non-SBIR funds.” See SBA SBIR Policy Directive, September 24, 2002 at sec. 4. Phase III
award can be competitively awarded. Id. at Sec. 4(c)(2). The Government cannot by this
Use or disclosure of data contained on this sheet is subject to the restriction on the title page of
this proposal.
9
Solicitation diminish HBGary’s rights. Id. at sec. 8(b)(4). That said, HBGary recognizes and
respects DARPA’s needs for flexibility with data generated under the contract. Therefore,
HBGary proposes to provide DARPA with all of the data rights it requires to accomplish its
mission under the award. HBGary proposes to provide the Government with Specially
Negotiated Data Rights in data generated under the award, in accordance with DFARS 252.2277018(b)(5). HBGary will modify its SBIR rights to such data provide for both a license and
agreements for necessary use by and disclosure to entities of the cyber security community that
DARPA will designate. HBGary will not refuse to provide use by or disclosure to any entity
DARPA will designate. HBGary and DARPA will negotiate a mutually agreeable license
governing such use and disclosure.
Need to insert language for data rights related to Digital DNA.
5.1.5 Plans and Capability to Accomplish Technology Transition
HBGary has commercial software products in the cyber security space focused on Windows
malware detection and malware analysis. Many of the underlying technologies were researched
and prototyped during various Government SBIR contracts. HBGary has successful transitioned
the technologies as evidenced by hundreds of active customers throughout the Government and
private sectors.
Technology transition and commercialization are not easy. Software won’t transition very far if
it is not of commercial grade. Our experience dictates that it costs considerably more money and
effort to develop commercial grade software than the R&D prototypes. HBGary’s extensive
IRAD investments have allowed our exceptional development team to create software that realworld users want. Quality software that meets customer needs is not enough. HBGary has
senior marketing and sales personnel with proven track records to take new products to market.
Elements of effective marketing are messaging that resonates with paying customers, sales
collateral tools, full feature website, trade show presence, conference speaking, press releases,
press interviews, and strategic alliances. We have a field sales team consisting of senior sales
veterans and field cyber security experts. After the sale customers are offered training classes
and receive ongoing software maintenance and onsite and remote tech support. Furthermore,
HBGary has strategy alliances with larger companies such as McAfee, Guidance Software, and
Verdasys.
C. If awarded the contract, we anticipate that promising technologies will emerge from our
research. We will use the same commitment and methodologies described above to
transition those technologies into high quality commercial grade software to serve the cyber
security marketplace where we are active.Cost, schedule and measurable milestones for the
proposed research, including estimates of cost for each task in each year of the effort
delineated by the prime and major subcontractors, total cost and company cost share, if
applicable. Note: Measurable milestones should capture key development points in tasks and
should be clearly articulated and defined in time relative to start of effort. These milestones
should enable and support a decision for the next part of the effort. Additional interim noncritical management milestones are also highly encouraged at a regular interval.
D. Technical rationale, Technical approach, and Constructive plan.
Use or disclosure of data contained on this sheet is subject to the restriction on the title page of
this proposal.
1
for accomplishment of technical goals in support of innovative claims and deliverable
production. (In the proposal, this section should be supplemented by a more detailed plan in
Section III.)
While the focus of this effort is to conduct revolutionary research in key areas related to
automated malware analysis for behavior, function, and intent, we believe it is important to
structure this research within an operational framework that is maintainable and functional over
time as the science of malware analysis matures. Our approach provides for continual
improvement and illustrates how the individual research areas fit within this operational
framework.
There are many valid methods for conducting malware analysis and deriving malware
functionality, behaviors, and purpose. However some are likely to yield more accurate and
efficient results than others. The HBGary team has extensive experience and expertise in
malware analysis, including detection of malware based on exhibited functionality and
behaviors, and delivering market leading capability in this area.
Static file analysis probably still represents the largest portion of static malware analysis
conducted in organizations today, but this technique is wrought with growing problems as
malware finds increasingly complex ways to protect itself from static file analysis techniques,
and the mechanisms used to defeat file-based malware security greatly lengthens the time for
complete analysis. Likewise traditional runtime analysis is typically done using debuggers
which requires a person to manually step through the execution of a running program. Using
static file or traditional runtime analysis using debuggers we do not believe is scalable for
automatic analysis of malware. No matter the security techniques used on the filesystem, to be
effective, the malware has to run and when it runs it does so in the clear. Based on our extensive
experience we believe the best approach for analysis to satisfy the requirements of the cyber
genome project is a combination of static memory analysis and runtime tracing. Static memory
analysis is a methodology where you take a snapshot of physical memory and analyze the static
contents of that snapshot, included in the snapshot is all the representations of files in memory at
the time the image is taken. These code representations can then be reconstructed to provide a
full representation of an executed file, un-obfuscated, as it exists in memory. Our approach to
runtime tracing hooks a running binary and collects all the data as the program executes for a set
period of time, with no interruptions or manual input required.
Nearly all malware is small enough that the entire execution space is captured in a single
memory snapshot. But there are some limitations to static memory analysis, such as ..., which is
why we will also monitor the execution of the software within a sandbox. The resulting
combination of the two analysis techniques will give us a nearly complete picture of the linear
execution of any piece of software. As mentioned this only covers the linear execution space of
the analyzed software, which in most cases is all that is needed, for completeness, we will
conduct research in expanding the execution paths of running programs that require specific IO
input or environment conditions for specific branch executions.
Fig. 1, illustrates our malware analysis framework, which emphasizes a fully automated malware
analysis process enhanced by manual analysis to develop new traits and sequences for flagged
Use or disclosure of data contained on this sheet is subject to the restriction on the title page of
this proposal.
1
unknown functions and behaviors. The end goal is to develop and mature an automated malware
analysis system that can recognize new traits or trait patterns automatically and classify and
categorize them, or if it can not flag for manual analysis, and update the trait and genome
libraries. Using an iterative static memory and runtime analysis approach to smartly execute as
much of the code as possible for analysis. The Specimen repository will be continually updated
with specimen data through the analysis process, to include a full malware physiology profile.
The physiology profile will contain mathematical and visual representations of the malware as
well as a human readable summary of the malware's overall and more detailed behaviors,
functions, and purpose.
Fig. 1: Cyber Physiology Analysis Framework
Cyber Physiology Analysis Framework:
1.
2.
3.
Specimen Feeds/Harvester/Samples. Subscriptions to Malware feeds to continually
populate our specimen repository and exercise our automated malware analysis
framework. HBGary currently subscribes to multiple feeds and processes around 20,000
new malware samples per day using our automated malware analysis and detection
technology. In addition we propose to research methods for identifying and collecting
emergent malware specimens using a combination techniques (i.e. intelligence harvesters
and honeynets).
Pre-processor. For this phase we will research automated and comprehensive methods
for static binary preparation, external analysis, and instrumentation, including;
unpacking, de-obfuscating, reconstructing, removing anti-analysis mechanisms, and
discovering environmental triggers. The goal of this phase is to normalize and prepare
malware specimens for automated memory and runtime analysis. Once operationalized
other inputs could be included into the pre-processor such as open source and intelligence
information.
Specimens Repository. The central repository for specimen raw files as well as analytical
information collected during pre-processing and the analysis process, including all of the
memory data related to the specimen, low level data collected during runtime tracing, and
Use or disclosure of data contained on this sheet is subject to the restriction on the title page of
this proposal.
1
4.
5.
6.
7.
8.
the final physiology profile. The goal of this phase is to create a single malware
repository that contains enough data, organized to improve malware analysis and incident
response capabilities. HBGary brings an existing malware repository, approximately
500GB of unique malware samples to start the effort. In this area we will conduct
research for data format normalization and standardization for malware analysis results.
Information maintained will include; specimen raw files, hard artifacts, associated traits
and genomes, all low level data recorded through static and runtime analysis, and a full
malware physiology profile.
Specimen Analysis & Visualization Interface (SAVI). Methodology for streamlined
analysis to assist in identifying new traits and genomes as well as present malware
physiology profiles once a specimen has been processed. Research will focus on visual
representations of malware data to aid in analysis and understanding of malware's
functions and behaviors and purpose. When there are function and behavior traits or
genome sequences that are not fully understood by the automated system, those are
flagged in the malware physiology profile stored in the specimen repository and
scheduled for manual analysis. Both HBGary and Pikewerks have industry leading
experience and products analyzing windows and linux-based malware. SecureDecisions
is a market leader in developing advanced visualization capabilities for a variety of
datasets.
Traits (Gene) Library. Developed trait rules that represent discrete functions and
behaviors of software. We propose the best methodology for understanding the
aggregate functions, behaviors, and purpose of malware is to first identify and understand
the discrete expressed parts of malware at their lowest level and build up, qualifying them
in a way that can be classified and mathematically calculated. The goal of the research in
this phase will be to develop increasingly complex traits for malware functions and
behavior identification.
Both HBGary and Pikewerks have existing methodologies for detecting malware based
on behavioral charateristics/traits. This unique knowledge will significantly aid in our
research and development of traits for more complex behavioral and functional
understanding.
Genomes Library. Much like biological gene/trait sequences. To understand how a
biological system works, or how genes are expressed within an aggregated system you
need to understand the importances of sequences, ordering, and clustering of traits. Our
research here will focus on identifying trait patterns that express an aggregated
functionality or behavior. These are the algorithms and patterns used to develop the
visual and mathematical graphs that examine the malwares overall function, purpose,
severity. Develop behavior and function correlation engines and visual representations
based on exhibited traits, external and environmental artifacts, space and temporal artifact
relationships, sequencing, etc. (fuzzy hashing, etc.)
Static Memory Analysis and Runtime Tracing Engine (SMART) - Uses a combination of
static memory analysis and runtime tracing techniques to record as much of the malware
internals as possible, including exercising as much of the full execution tree as possible.
Our research will focus on full branch execution as well as automated analysis and
tracing. HBGary and Pikwerks have existing semi-automated technologies that we can
leverage for the research and development in this task.
Use or disclosure of data contained on this sheet is subject to the restriction on the title page of
this proposal.
1
9.
Bayesian Reasoning Analysis and Inference Node (BRAIN). As our understanding of
malware internals and the relationships of traits and genomes matures we should be able
to instrument something like a bayesian reasoning engine to automatically identify
mutations within the genomes and classify those mutations to some degree without any
manual analysis. Our research will focus on building the malware behavior and function
inference models to do the automated analysis of malware.
E. Cyber Physiology Profile. These profiles are the output of the analysis processes and contain
both mathmatical and human readable representations (fingerprints) of a specimens aggregate
functions, behaviors, and intent. These profiles are stored in the repository for future quick
retrieval.Organizational Chart
Company Name
Position Unique
Capabilties
Tasks
HBGary Federal
Prime
Malware Analysis Program
Services
Management,
Specimen
Harvesting,
Specimen
Repository,
Traits Library,
Genome Library,
Reasoning
Engine
Advanced
Aaron Barr
Malware analysis (PM)
services and
Mark Trynor
information
operations
expertise
HBGary
Sub
Windows
Malware behavior
analysis, memory
forensics, runtime
tracing, exploit
development
Highly skilled,
product based,
Windows
Malware analysis
engineers with
existing behavior
analysis
capabilities and
products.
Existing memory
General
Dynamics
Sub
Malware Analysis Support
Operations
Specimen
Analysis and
Visualization
Interface,
Support to Traits
Library
Static Memory
Analysis and
Runtime Tracing,
Support to Traits
Library
Support to
Genomes
Library,
Support to
Reasoning
Engine
Teaming
Strategy
Key Personnel
% of
work
Greg Hoglund
(PI)
Martin Pillion
Shawn Bracken
Key contracts
Jason Upchurch
related to DoD
Operational
Malware analysis
capabilities,
bring use case
and operational
expertise as well
as strong
malware analysis
shop
Use or disclosure of data contained on this sheet is subject to the restriction on the title page of
this proposal.
1
Pikewerks
Sub
Linux Malware
Analysis
Specimen
Harvesting,
Traits Library,
Genomes Library
High skilled,
Adam Fraser
product based,
Tom O'Connor
Linux Malware
Analysis
engineers with
existing behavior
analysis
capabilities and
products.
SRI
Sub
Static Binary
Analysis
Pre-Processing
Research
Organization
with significant
research in static
binary analysis,
trigger analysis,
de-obfuscation
AVI/Secure
Decisions
Sub
Advanced
Visualization
Specimen
Analysis and
Visualization
Interface
Advanced
Kenneth Prole
visualization
techniques for
malware analysis
Phil Porras
F. Four-slide Summary
PowerPoint that quickly and succinctly indicates the concept overview, key innovations,
expected impact, and other unique aspects of the proposal. The format for the summary slides is
included as APPENDIX 1 to this BAA and does not count against the page limit. Include the
original PowerPoint slides on the program CD-ROMs provided to DARPA.
Row 1
Row 2
row 3
Row 4
Sample Table
Data
Data
Data
Data
Section III. Detailed Proposal Information
A. Statement of Work (SOW)
In plain English, clearly define the technical tasks/subtasks to be performed, their durations, and
dependencies among them. For each task/subtask, provide:
 A general description of the objective (for each defined task/activity);
 A detailed description of the approach to be taken to accomplish each defined
task/activity);
 Identification of the primary organization responsible for task execution (prime, sub, team
member, by name, etc.);
Use or disclosure of data contained on this sheet is subject to the restriction on the title page of
this proposal.
1
 The completion criteria for each task/activity - a product, event or milestone that defines
its completion.
 Define all deliverables (reporting, data, reports, software, etc.) to be provided to the
Government in support of the proposed research tasks/activities.
B. Description of the results, products, transferable technology, and expected technology
transfer path enhancing that of Section II. B.
WBS
Task/Subtask
1.0 Emulation
System
Duration
Dependancies
Description/O ResponsibleOr Completion
bjective
g
Criteria
1.1,1.2
The emulated Prime
process
environment is
designed to
execute the
target under
test.
1.1
Javascript SS
Javascript
interpreter
subsystems to
execute/parse
any language
that is
supported
Prime
1.2 x86
Machine Code
SS
X86
interpreter
subsystems to
execute/parse
any language
that is
supported
Prime
I/O emulation
will be a
subset of the
API surface.
I/O is
important
because the
emulation
environment
will not know
how to
respond to a
data query
made to an
external
element.
Prime
2.0 Input
Expression
Solver
1.0,1.1,1.2
Deliverables
Use or disclosure of data contained on this sheet is subject to the restriction on the title page of
this proposal.
1
3.0 Data State
Progression
Map
2.0
This tree of
data states
represent
important
points along
the control
flow of the
target under
test
Prime
4.0 Recording
System
3.0
The definition Prime
of what
behaviors are
noteworthy
can be defined
in a symbolic
language that
is used and
evaluated
while the data
state tree is
recorded
5.0 Reporting
System
4.0
report would Prime
contain data
about what
network
packets had to
arrive, what
commands
were issued,
which values
had to exist in
memory, etc
IV&V
C. Detailed Technical Rationale
Malware analysis is difficult. Why? It requires large amounts of very specialized knowledge
and a diverse skill set. You need knowledge of programming, assembly language, binary
packing, compiler optimization, debugging, anti-debugging tricks, internet protocols, network
communication, OS APIs, OS design and layout, OS security mechanisms, Internet security
mechanisms, cryptography, and potentially a thousand little techniques that could be used to
obtain sensitive information from a compromised machine (and I’m sure there are other
useful/applicable skills).
here are multiple approaches to reliably extracting all the executable content from a given binary:
1) Statically analyze the binary. This may require unpacking and de-obfuscating
protected code.
2) Execute the binary in a controlled, recorded environment. This may require a special
system for avoiding detection by the binary (anti-debugging tricks).
3) Combined approach using both 1 and 2
Use or disclosure of data contained on this sheet is subject to the restriction on the title page of
this proposal.
1
Option 1 is the traditional method. It relies upon tools like IDA Pro and a strong library
of tools to unpack/de-obfuscate code. One of the largest negatives for this method is that
code packers/obfuscators are usually a step ahead of the unpackers/de-obfuscators.
Another negative is that self-modifying code can be very difficult to analyze. A final
negative is that full analysis may be a non-tractable problem, either requiring too much
processing power or too much memory/space to solve in a reasonable time frame.
Option 2 is a dynamic method, but is limited to recording behaviors that a binary exhibits
in a small window of time. A large negative is that many potential behaviors are never
called in a binary until specifically requested by an attacker. A positive is that we don’t
have to worry about packers and obfuscation, but we do have to prevent the binary from
detecting that it is in a controlled environment.
Option 3 is our desired approach, mixing the information gained from static analysis with
a run-time execution system. This combined approach works like this:
Obtain a memory snapshot of the binary and all its loaded components (or unpack the
binary with a tool). Use static analysis to determine likely code sections and locations of
interest (we should be able to pattern or trait match these areas). Use dataflow and
codeflow emulation to track the expected behavior of the binary. This can provide us
with ideal data and locations for use in our actual tracing/execution system.
Our approach will be to use dynamic analysis of malware in memory as the primary means for
this effort. As part of this effort we will perform static analysis where
A malware object is any piece of executable code, embedded in to any kind of structure (file,
packet, etc) that could be used for malicious purposes. (i.e. Windows executables, Microsoft
Office Files, Adobe pdf files, Linux executables, scripts, shell code, etc.)
We will leverage our existing technologies in memory behavior based analysis of malware. We
propose to introduce revolutionary new features which will vastly improve the capabilities to
reverse engineer malware to identify their behaviors and intent.
D.
Detailed Technical Approach
The HBGary Federal team is a diverse, highly capable, and proven team in different areas of
cyber genome research and development. We have selected an overall approach that allows the
individual team members to work in their areas of speciality within a connected operational
framework so in the end we have a working, integrated system as opposed to a grouping of
individual research projects. When looking at the challenges around automated malware
analysis there are a few questions that automatically rise to the top. What malware analysis
techniques are better for scale and automation? What methods are best suited to characterize
functions and behaviors of malware? How can I differentiate good from bad functions and
behaviors? What are the best methods to automate malware analysis and how can I ensure the
results are accurate and consistent?
We believe the best approach is to start by researching the detailed mechanisms of software and
develop a language and ruleset that accurately qualifies discrete software functions and
behaviors, followed by an aggregate analysis of discrete functions to discern patterns; sequences
and clusters of these traits that connote a higher order of software functionality and behaviors.
Part of our research will also focus on best methods to exercise software in an analysis
environment to expand our visibility into variable dependent branches in code. Then tying this
Use or disclosure of data contained on this sheet is subject to the restriction on the title page of
this proposal.
1
research together through a reasoning engine that can make automatic probability decisions on
the behavior and functionality of malware based on historical inference models. The final goal
will be to submit an unknown malware specimen with previously undocumented functions and
behaviors and automatically generate a cyber physiology profile that characterizes the new traits
and discerns and describes the overall function, behavior, and intent of the malware.
E.1 Specimen Feeds/Harvesters/Samples
Team Lead: HBGary Federal and Pikewerks
Collection methods need to be addressed to ensure we are developing capabilities using the most
recent and challenging malware specimens available. There are feeds for malware to which we
have existing subscriptions and will research to ensure we have the most relevant data available.
In addition we will conduct research and develop malware harvesters and honeynets to collect
malware in the wild not contained in feeds. The challenge here is in finding or attracting
malware that has propogated under the radar enough so as not to have been detected and
collected by one of the feed providers. The team is highly qualified in this area with existing
malware behavior and functionality detection capabilities that can be leveraged to target
previously unidentified samples in the wild.
Variations of honeypots have been in existence for many years on both windows and linux
platforms. Where our research differs is in an integrated approach between collection and
analysis that trains our sensors how to behave in order to maximize new collections. We propose
to research and develop a passive and active collection capability for linux and windows based
malware using virtualized clients and webhosts configured with variations of operating systems,
patches, and services. The passive systems will emulate persistent, commercial web services,
while the active systems will emulate client systems that will browse websites, conduct p2p file
transfers, open email attachments, and perform numerous other high-risk activities. The
personas of the passive and active systems will receive periodic updates through scripts that pull
from the malware repository ensuring maximum exposure to new collections.
Innovative Claims: Hybrid Passive and Active malware harvesting capability instrumented
through feedback from the malware analysis results.
Deliverables: Linux and Windows based malware feeds and smart harvesters
E.2 Pre-Processing
Team Lead: SRI
Increasingly malware employs sophisticated anti-detection and analysis techniques such as;
obfuscation, packing, encryption, and modularization. While conducting malware analysis on
running programs alleviates some of the complexity since in order to run binaries typically need
to be complete, unpacked, and unencrypted, their are exceptions and there are techniques used by
malware authors to try and protect their binaries from analysis. The goal of the research in this
phase will be to investigate methods used to protect malware from detection and analysis and
develop capabilities that allow automated analysis to continue. The HBGary Federal team has
extensive experience in this area, specifically with SRI's Eureka unpacking technology which
Use or disclosure of data contained on this sheet is subject to the restriction on the title page of
this proposal.
1
can automatically recover unpacked executable images from packed binaries. Eureka implements
a coarse-grained execution tracking strategy that allows for efficient monitoring of malware
execution progress. A memory snapshot is triggered by its hypothesis testing algorithm when
several criteria are satisfied. These criteria includes the number of system calls, process
execution time, a bigram count indicating a sharp increase of the code to data ratio, or specific
system calls such as process fork or terminate process.
We will research and develop binary evaluation metrics for the purpose of assessing the quality
of the unpacked code and rerunning the Eureka unpacker if necessary to obtain more complete
unpacked code. In addition to integrating SRI's speculative API resolution algorithm to
automatically resolve call sites. The post unpacking analysis capability will be delivered as an
add on to the Eureka framework to enable further analysis and classification of malware.
We also plan on developing additional criteria that determine the optimal moment for taking a
memory snapshot of the running process and recovering the original entry point. We will also
investigate novel ways of hiding Eureka from being detected by the running binary to avoid
triggering suicide logic. We will also explore snapshot-stitching techniques for dealing with
multi-stage packers and block encryption.
Innovative Claims: Application of hypothesis testing and bigram analysis; speculative api
resolution; snapshot stitching
As the origin entry point of the malware binary is usually not known at the point of unpacking,
we will employ novel approaches to determine the OEP in the captured memory image of the
process. We will then automatically rewrite the binary's header to set the OEP and rebuild import
tables. We will also research automated techniques for informed reconstruction of malware
binaries to enable execution and bypass suicide logic. We will use the output from static analysis
of malware samples to enable guided executions of unpacked binaries. An important first step
toward this end is transforming automatically unpacked binaries to running executables for
example by fixing the origin entry point, reconstructing import tables and removing suicide
checks. We will employ novel approaches to determine the OEP in the captured memory image
of the process and automatically rewrite the binary's header to set the OEP and rebuild import
tables. We will also develop static analysis and instrumentation techniques to identify and bypass
unnecessary suicide logic. We will also modify the OEP to point to code segments of interests to
enable exercising specific isolated code logics that have been identified by static analysis.
Innovative Claims: Statically Informed Binary Reconstruction; Automated OEP identification;
Automated suicide logic identification and bypass. Automated code logic isolation and
instrumentation.
Lastly, we will research and develop automated ways to recognize obfuscated code and identify
the obfuscation steps employed to hinder automated analysis, then systematically de-obfuscate to
restore the binary to an equivalent but unobfuscated form. This will be done by using binary
rewriting techniques. To validate the binary rewrite step, we will use decompilation tools to
recover a high-level C and C++ source code of the binary code. By assessing the quality of the
source code, we can assess the quality of our deobfuscation steps and can improve it accordingly.
Use or disclosure of data contained on this sheet is subject to the restriction on the title page of
this proposal.
2
Innovative Claims: Automated malware deobfuscation; Malware normalization; Evaluation
metrics
Deliverables: Binary rewriting tool and corresponding de-obfuscation rewrite rules.
E.3 Specimen Repository
Team Lead: HBGary Federal
This needs to talk about the normalization of the data and all the data we will collect from the
collection sensors, pre-processor, traits and genomes, memory analysis, dynamic analysis,
graphical and mathematical models for physiology, etc.
Innovative Claims: A single repository for aggregated information on malware samples.
Deliverables: Database schema and standards for malware analysis information.
E.4 Specimen Analysis and Visualization Interface (SAVI)
Team Lead: Secure Decisions and General Dynamics
Even in an automated malware analysis system there needs to be a human interface to aid in
training the system, verifying data, and viewing results. Today most malware analysis is still a
slow and tedious process that requires highly trained and frequently unavailable reverse
engineers and malware analysts to do the work. Even tools such as those developed by the
HBGary Federal team which expedite the reverse engineering process and display information in
far more digestible forms, stop short of displaying more simplified visual representations of
malware that show at a glance the characteristics of a malware specimen.
We propose to develop a Specimen Analysis and Visualization Interface (SAVI), where we will
investigate various representations of malware that can provide information at a glance to the
analysts, and allow the analyst to visualize malware in different ways from an aggregate view
drilling down to a more detailed view while continuing to provide context of the details within
the specimen. At its end you can imagine this technology being integrated into a multi-touch
form factor where you can interact organically with the visual models. The displays will be
interactive in the sense that the analyst will be able to flag code segments, functions within the
graphical view and pull up a more traditional analyst view for further inspection. If for example,
the analyst within a code view decided to flag a particular trait or sequence differently than had
been automatically defined, the analyst could then flow back up into the graphical model and
immediately discern how the changes effected the overall specimen.
As automated methods mature, visualization can be used to both present the results of automated
tools and serve as a feedback loop back into the automated tools, thus fine tuning the algorithms
based on the human’s visual ability to find new traits and patterns of significance. Allowing the
human analyst to add input into the automated Cyber Genome system can improve both the
Use or disclosure of data contained on this sheet is subject to the restriction on the title page of
this proposal.
2
system’s ability to discover unknowns and the analyst’s ability to understand the results and
mitigate zero-day attacks.
Malware analysis based on multiple dimensions, such as disk, memory and network data, can
lead to copious amounts of data that needs to be presented to the operator.
Secure Decisions proposes to visually represent this copious data using multiple coordinated
views, starting out with a high level overview, and then providing details-on-demand[1]. This
will be the focus of the Phase I visualizations. In addition, Secure Decisions’ approach will
provide the user with an interface that guides the analyst’s approach to analysis, for example
suggesting where the user should start their analysis and what paths they can take next.
This guided approach, accompanied by interactions with the data, will be the focus of Phase II
visualizations.
Current tools used for malware analysis are limited. Popular tools such as hex editors and IDA
Pro[2] have mostly textual visualization displays and simple node-link diagrams based on static
data. Secure Decisions intends to bring a revolutionary approach to malware analysis, by visually
linking all aspects of a threat or attack - malicious code signatures, their variants, their
polymorphic behaviors, their author(s), their behaviors and origins, and other artifacts that have
influenced the character and conduct of those threats or attacks – into a complete, compelling
and understandable picture working with both static and dynamic data.
Secure Decisions has extensive experience in the cyber security domain with respect to
requirements gathering. An example of this is a cognitive task analysis that resulted in the
VIAssist for correlation and threat analysis of network traffic data. VIAssist, which exemplifies
our approach to multiple-coordinated views, is currently deployed in the intelligence community
and scheduled for deployment at US-CERT in Q2 of CY10.
Secure Decisions will develop prototype visualizations based on factors such as exhibited traits,
external and environmental artifacts, space and temporal artifact relationships, sequencing. This
will support the identification and understanding of functions and behaviors to aid malware
analysts in developing new traits and patterns of significance. They will also develop visual
representations of a malware specimens physiology profile to provide visual fingerprinting
capabilities to malware analysts and to provide graphical cues for physiology reports. Secure
Decisions has an extensive visualization toolkit that can be leveraged to create novel
visualization for malware analysis. Our tools and skills have been used to prototype and field a
variety of visualizations for government and commercial cyber defense experts.
AVI-Secure Decisions’ toolkit is currently being used in their iTVO project as part of a Phase II
NAVAIR SBIR to visualize how software code structure and behavior are related and change
through the lifetime of code execution and to facilitate debugging and optimization using both
static and dynamic techniques. This technology is also being used in a DHS SBIR called
Software Assurance Analysis and Visual Analytics that provides visual analytics to facilitate the
understanding and prioritization of software vulnerabilities. Figure 3 is taken from iTVO and
shows dependencies between classes. Stronger dependencies are clearly seen based on the size of
the connecting ribbons. This type of visualization has been used in the past for comparing
genomes and this principle can also be used in malware analysis, for example showing how
various malware traits are related.
Use or disclosure of data contained on this sheet is subject to the restriction on the title page of
this proposal.
2
iTVO screenshot
showing
dependencies between software classes
Screenshot showing the contextual information of a running code (top) with the software
structure information (bottom)
Innovative Claims:
Use or disclosure of data contained on this sheet is subject to the restriction on the title page of
this proposal.
2
Deliverables: Graphical interface and malware interactive graphical models for representing
behavior and function.
E.5 Traits Library
Team Lead: HBGary Federal and Pikewerks
At its most fundamental level binaries are a compilation of discrete functions that do work. In
order to build a capability to automatically analyze malware for aggregate function and behavior
we believe you must first be able to accurately qualify all of its discrete parts. We propose to
build a body of knowledge about code (aka, Traits), for example:
•
•
•
Identify Usage of API calls (WriteFile, RegOpenKey, InternetConnect, etc)
Identify algorithms in code logic (copy loop, decrypt block, parse string, etc)
Identify typical coding structures such as (if/else blocks, do/while loops, class structures,
etc)
We propose to research and develop a trait coding system, an example of which is HBGary's
existing trait coding system used to detect the presence of malware, as shown in Fig. 3. The
existing trait system is comprised of the rules, an expression language, and a fuzzy matching
system. We will use the existing system as a basis of research to determine the best
methodology for developing a more complete trait coding system for the purposes of
enumerating the low level and high level functions and behaviors for a more sophisticated
analysis of the malware specimen.
Use or disclosure of data contained on this sheet is subject to the restriction on the title page of
this proposal.
2
Figure x: HBGary's Trait Coding System for Detecting Malware
Innovative Claims:
Deliverables: Malware traits library
E.6 Genomes Library
Team Lead: HBGary Federal and Pikewerks
Using the traits library we will research and develop a patterns or genomes library. While some
traits alone can aid in the detection or identification of potentially malicious activity in code,
such as specimen uses a packer, the traits alone are not enough to determine automatically the
aggregate functions and behaviors of a specimen. For example, some malware might try to
elevate privileges, or open up a file and directly after open a network connection, or try to use
obfuscation techniques. In each of these cases there are legitimate programs, even security
programs, that would employ these functions or exhibit this type of behavior. So with traits
alone the best you might be able to develop is a probability based on an aggregate of traits
exhibited.
Use or disclosure of data contained on this sheet is subject to the restriction on the title page of
this proposal.
2
To truly develop a comprehensive view of malware behavior and function takes some analysis of
the traits and the patterns they exhibit in malware. We propose to research and develop patterns
of traits, such as sequencing or clustering, of good and bad software, to develop strong indicators
that can be relied upon during automated analysis.
The first is a classifier, much like the genome that HBGary sells commercially. The system
would use the weight values to determine if a program is actually malware. We can call this
the classifier genome.
Once something has been determined as malware, it should be fed into a second genome. The
second genome has trait-codes for all the code idioms used to develop software functions. For
example, it would contain traits for all the ways a developer might code a TCP/IP recv loop. It
would also contain all the traits for malicious behaviors, such as all the ways a developer might
sniff keystrokes. We could call this the lineage genome.
Finally, using the results from the lineage genome, analysts can develop archetypes. We can
spend development money building statistical tools and visualization so that 'colonies' of largely
similar malware can be grouped. When a new colony starts to form in the data-set, we can
construct a new archetype to represent it. The archetype will contain the traits from the lineage
genome that are common to most of the colony. Once the archetype has been created, malware
can be automatically classified into the archetype as it comes in. The archetypes are not a
genome, but a secondary layer of sorting over the lineage genome.
The above system should be able to predict upcoming attacks. When new samples are collected
from the wild, they will automatically be classified into an archetype. A sudden growth of a new
colony would represent a new malware variant that needs to be addressed. Any such outbreak
would soon find a way into DoD and customer networks, so this offers a predictive capability for
defense.
build a body of knowledge, using the traits library, to categorize and classify code behavior and
intent.
- Analyze trait patterns to identify and associate actions and intents
For example:
Noticing the following traits in a code sequence: URLDownloadToFile(somefile.exe) followed
by CreateProcess(somefile.exe). This could be labeled as a “Download and execute” meta-trait
or behavior, and the intent could be identified as “Suspicious”, perhaps we could also call it a
“Risky” or “Dangerous” behavior.
Visualize legos… piece A is identifying all the individual lego blocks… piece B is assembling
some of the blocks into identifiable/recognizable items like squares, pyramids, or eventually
more complex items like cars…
Use or disclosure of data contained on this sheet is subject to the restriction on the title page of
this proposal.
2
I would suggest that several genomes be maintained. The first is a classifier, much like the
genome that HBGary sells commercially. The system would use the weight values to determine
if a program is actually malware. We can call this the classifier genome.
Once something has been determined as malware, it should be fed into a second genome. The
second genome has trait-codes for all the code idioms used to develop software functions. For
example, it would contain traits for all the ways a developer might code a TCP/IP recv loop. It
would also contain all the traits for malicious behaviors, such as all the ways a developer might
sniff keystrokes. We could call this the lineage genome or sequence genome.
Finally, using the results from the lineage genome, analysts can develop archetypes. We can
spend development money building statistical tools and visualization so that 'colonies' of largely
similar malware can be grouped. When a new colony starts to form in the data-set, we can
construct a new archetype to represent it. The archetype will contain the traits from the lineage
genome that are common to most of the colony. Once the archetype has been created, malware
can be automatically classified into the archetype as it comes in. The archetypes are not a
genome, but a secondary layer of sorting over the lineage genome.
I would suggest that three full-time analysts are assigned to genome development and
measurement. These positions should be considered engineering, since they would be
responsible for building statistical tools and visualization aids, in addition to having low level
reverse engineering experience.
At this point the enumeration of malware behavior only goes as far as is needed for detection.
Several new rule types, including:
1. combining a set of rules into a larger group known as a 'strand'. Each rule in the strand
would need to exist in the order they appear
2. allowing a rule body to specify a CLASS as opposed to an individual data artifact. This
allows us to develop a gouping under the factors.
3. allowing an import rule ("I" rule) to include argument and value restrictors. I want to
know not only that a file was created but where the file was created and what the files
name is.
Additional rule types will be added as the team performs research into the malware genome and
new types of data are found to be useful. It will be expected that several new rule types will be
developed.
Applications for Prediction
The above system should be able to predict upcoming attacks. When new samples are collected
from the wild, they will automatically be classified into an archetype. A sudden growth of a new
colony would represent a new malware variant that needs to be addressed. Any such outbreak
would soon find a way into DoD and customer networks, so this offers a predictive capability for
defense.
Data State Progression Map
Use or disclosure of data contained on this sheet is subject to the restriction on the title page of
this proposal.
2
As execution emulation continues in this manner, multiple snapshot will be created and will
result in a single-root, directed graph of data states. This tree of data states represent important
points along the control flow of the target under test. The further down the tree, the more state
transitions that have taken place. It will be possible to define data states that represent known
malware behaviors. For example, writing to a registry key, sniffing a keystroke, or logging
particular kinds of data to a log file. There are nearly limitless possibilities, restricted only by
that which can be defined as software behavior (in other words, nearly limitless). The definition
of what behaviors are noteworthy can be defined in a symbolic language that is used and
evaluated while the data state tree is recorded. Once a clear malware behavior is identified, it
will exist at a leaf node of the data state tree. When that occurs, the data state tree can be
traversed backwards and a complete trace of the malware execution leading up to the suspicious
behavior can be recovered.
Innovative Claims:
Deliverables: Malware Genomes library
4.7 Static Malware Analysis and Runtime Tracing (SMART)
Team Lead: HBGary and Pikewerks
Directly observing program behavior during execution is superior to modeling software using
static analysis. Static analysis requires an interpretation of what would happen when the
program is run. Information collected during runtime reflects what actually happened and is not
subject to interpretation.
To completely reverse engineer software with static analysis requires the resulting software
model to approximate the complexity of the actual program. Why create and study the model
when you can observe the real software in action?
Use or disclosure of data contained on this sheet is subject to the restriction on the title page of
this proposal.
2
Static analysis requires that control flows within a program be resolved through painstaking
analysis and tracing flows by hand. Why do this hard work when the control flow can be
automatically observed during runtime?
Collecting on the linear execution of binaries does not provide information on branches that
require a specific condition to be met in order to execute and that was not met during a trace.
Without an ability to fully or nearly fully exercise all branch conditions within a binary you can
not say for certain that you understand the full function and behaviors, and certainly not the full
intent of a malware specimen. There are a few methodologies
we now need a way to reliably extract all the executable content from a given binary. There are
multiple approaches to reliably extracting all the executable content from a given binary:
1) Statically analyze the binary. This may require unpacking and de-obfuscating
protected code.
2) Execute the binary in a controlled, recorded environment. This may require a special
system for avoiding detection by the binary (anti-debugging tricks).
3) Combined approach using both 1 and 2
Option 1 is the traditional method. It relies upon tools like IDA Pro and a strong library
of tools to unpack/de-obfuscate code. One of the largest negatives for this method is that
code packers/obfuscators are usually a step ahead of the unpackers/de-obfuscators.
Another negative is that self-modifying code can be very difficult to analyze. A final
negative is that full analysis may be a non-tractable problem, either requiring too much
processing power or too much memory/space to solve in a reasonable time frame.
Option 2 is a dynamic method, but is limited to recording behaviors that a binary exhibits
in a small window of time. A large negative is that many potential behaviors are never
called in a binary until specifically requested by an attacker. A positive is that we don’t
have to worry about packers and obfuscation, but we do have to prevent the binary from
detecting that it is in a controlled environment.
Option 3 is our desired approach, mixing the information gained from static analysis with
a run-time execution system. This combined approach works like this:
Obtain a memory snapshot of the binary and all its loaded components (or unpack the
binary with a tool). Use static analysis to determine likely code sections and locations of
interest (we should be able to pattern or trait match these areas). Use dataflow and
codeflow emulation to track the expected behavior of the binary. This can provide us
with ideal data and locations for use in our actual tracing/execution system.
Imagine the execution path of a binary as a tree system. Starting from a single point (the
trunk), execution flows through various branches. Unlike a tree though, branches can
loop back upon themselves, or jump to entirely different branches. The complexity this
presents can quickly overwhelm the processing power and memory of a typical
computer. For example: With 1 branch condition (2 branches), there are three possible
states (A, A->B, A->C), with 2 branch conditions there are five possible states (A, A->B,
A->C, A->B->D, A->B->E), with 3 there are 7 (A, A->B, A->C, A->B->D, A->B->E, A>B->E->F, A->B->E->G)… seems reasonable, looks like just x *2 +1… however, now
Use or disclosure of data contained on this sheet is subject to the restriction on the title page of
this proposal.
2
consider a forth branch that loops back to A… we have just introduced many more states
(
A, A->B, A->C, A->B->D, A->B->E, A->B->E->F, A->B->E->G,
A->B->E->G->A,
A->B->E->G->A->C,
A->B->E->G->A->B,
A->B->E->G->A->B->D,
A->B->E->G->A->B->E,
A->B->E->G->A->B->E->F
A->B->E->G->A->B->E->G
A->B->E->G->A->B->E->G->A….
This is with just a single loop and a few branching conditions. Imagine thousands
of branches and hundreds of loops. Or tens of thousands or hundreds of
thousands of branches... The basic fact that a loop could repeat forever and have
different state with every pass means we cannot easily compute all its possible
states (without approximations, models, or shortcuts… and even those are
limited).
What are ideal data and locations for tracing? They are locations and data (aka, state)
that cause alarge change in the execution path. The goal should be to find the minimal
state changes needed to cause the largest run-time execution paths, or the largest changes
from previous execution paths.
Given the following code path:
A->B or A->C, depending on data1
B->D or B->E, depending on data2
D-> ends the program
E->F or E->G, depending on data3
F->H or F->I, depending on data4
B+Data2 is an ideal state because changing it can change the execution path from
A->B->D to potentially A->B->E->F->H, etc… a large increase in the number of
states.
What does the tracing system do with the ideal state information? It uses the information
to record and execute the binary and obtain as large a sampling of executable content as
possible. This sampled data is then fed into the reasoning system built in piece D.
By combining static and dynamic analysis we can leverage the benefits of both to avoid
some of the most difficult challenges of the other. Static analysis can provide us a solid
skeleton or starting set of code and data to trigger the greatest state coverage in dynamic
analysis. One of our biggest problems building AFR was that the execution time and
memory required to examine every possible was too large and too lengthy. There were
literally hundreds of millions of different ways to execute average size binaries… Only a
small percentage of those different ways would yield useful information. Another
problem with the initial AFR implementation was that it led to logically inconsistent
states in the binary that would sometimes lead to crashes. By forcing some branches with
AFR (via direct CPU flag changes), we violated logic in the binary that may have
prevented that branch from ever occurring, yielding, at best, questionable data.
Use or disclosure of data contained on this sheet is subject to the restriction on the title page of
this proposal.
3
Going back to the lego analogy, given an already constructed lego set, build an automated
system to take apart the set into each individual trait, while maintaining the relational
information about where each trait exists in the overall set.
Fig X. orange blocks branch based on controlled input, but the second leg of the branch has not
yet been exercised. These represent 'targets'. The purple block represents a branch that is
controlled where both sides of the branch have been successfully visited. This is considered
'fully resolved'.
This is a system to calculate and automatically recover control flow in executable code. A
prototype exists that demonstrates this successfully against test binaries running on a windows
x86 platform. The AFR algorithms can be extended to address code recovery of malware
programs.
Use or disclosure of data contained on this sheet is subject to the restriction on the title page of
this proposal.
3
I/O is important because the emulation environment will not know how to respond to a data
query made to an external element. To address the possible tree of control flows, whenever an
I/O operation is performed, any subsequent control flow that is driven by the values contained in
the response data will be crafted based upon the arithmetic comparisons made against the data
once it returns. First, a random or preset response will be provided. Following this, data flow
tracing will be used to track every derived memory location that sources from the response data.
Whenever a control flow decision is based upon this sourced data, the original location it was
sourced from is recorded. Then, using this source location information, the I/O response data
will be precisely mutated to affect the control flow, increasing code coverage. This process will
be repeated as necessary to cover all control flow that is influenced by external I/O response
data.
In order to increase the performance, the design will include the ability to snapshot ( ) the
program state at any point. Using such snapshot capability, the system will snapshot execution
and data state immediately prior to any crafted I/O response. This allows the snapshot state to be
restored for every subsequent crafted data mutation. In other words, the 'target under test' will
not need to be re-executed from the root, but rather can be restored directly before the mutation
operation, thus increasing speed and effectiveness.
4.8 Bayesian Reasoning and Inference Node (BRaIN)
Team Lead: HBGary Federal
Research and develop an expert or AI model that can be trained and used to classify a binary into
categories. This will require processing a large set of known malware and a large set of known
“clean” applications and code so that the model can reliably judge the intent of a given binary. A
stochastic approach, such as a Bayesian inference model, can be matched with the probabilities
learned and weights given to individual traits and behaviors.
[insert blurbs about our previous Bayesian reasoning work]
More about Bayesian reasoning: Bayesian analysis is better thought of as probability theory. It
is a model that can use the probability of events to calculate the probability of a more complex
probability. The simplest examples are usually given as a deck of cards. The probability of
drawing a spade from a normal deck of cards is 13 in 52 or 1 in 4. The probability of drawing a
second spade is 12 in 51, or 4 in 17 times the probability of drawing the first, 1/4*4/17= 1/17
(0.0588235…). In Bayesian terms, the unconditional probability of the event (a card being a
spade), with no additional knowledge or events, is 1 in 4. The conditional probability of an event
(drawing a second spade), requires some additional evidence to compute (that we previously
drew a spade). Bayesian probabilities are either computed analytically, or sampled empirically.
Every possible event and potential evidence increases the complexity of Bayesian calculations,
but is also likely to increase the accuracy and improve the understanding of the relationship
between events and evidence. For our system, we will likely be using empirically sampled traits
and behaviors and conditional probabilities between them to determine the probability of a
binary being malicious or not malicious. [that was a very simplistic explanation of Bayesian
reasoning, there is a lot more that could be explained, such as negative information, avoiding
circular reasoning, joint probabilities, belief networks, etc]
Use or disclosure of data contained on this sheet is subject to the restriction on the title page of
this proposal.
3
E. The purpose of the Bayesian Reasoning Engine is to encode our prior knowledge about traits
and genomes and to provide a mechanism to reason over that prior knowledge when new
evidence is collected. The model construction process involves: identifying the evidence
with discriminatory value, collecting that evidence, and constructing the model. Models for
different malware will have some common elements and some unique elements. The goal for
the model design is to maximize accuracy and generality. Generality is important so that each
type of malware does not require a unique model, which would increase the effort to build
the models and reduces the chances of detecting malware variants.Comparison with other
ongoing research indicating advantages and disadvantages of the proposed effort.
We summarize prior attempts at building automated and generic tools for unpacking malware,
most notably PolyUnpack[], Renovo[], and OmniUnpack[], and differentiate it from the
approach that we adopt in Eureka. One of the early attempts at automated unpacking was the
PolyUnpack system which worked by buildung a static model of the program and uses finegrained execution tracking to detect when an instruction outside of the model is executed.
PolyUnpack uses the Windows debugging API to single-step through the process execution. A
fundamental deficiency of this approach is that most contemporary malware detect attempts to
hook into the debugging API and incorporate suicide logic which is triggered upon detection.
Like PolyUnpack, Renovo uses a fine-grained execution monitoring approach to track unpacking
progress and considers the execution of newly written code as an indicator of unpack
completion. Renovo is implemented using the QEMU emulator, which resides outside the
execution environment of the malware. The overhead of fine-grained execution tracking
limits scalability of this system.
OmniUnpack is most similar to Eureka in that it uses a coarse-grained execution tracking
approach. However, their granularities are orthogonal: OmniUnpack tracks execution at the page
level while Eureka tracks execution at the system call level. OmniUnpack uses page-level
protection mechanisms available in hardware to identify when code is executed from a page that
was newly modified. Eureka's use of bigram analysis and statistical hypothesis testing for
tracking unpacking progress are novel and enable it to handle advanced unpacking strategies
like multiply packed malware more effectively. A fundamental limitation of all existing
automated approaches are strategies like the use of emulators and block level unpacking. We
propose to extend Eureka with innovative snapshot stitching strategies to address this limitation.
As existing automated unpacking systems do not support automated executable recontruction,
our proposed strategies for automated OEP identification and binary reconstruction are also new
areas for exploratory research.
CWSandbox[] and TTAnalyze[] are dynamic analysis systems that execute programs in a
restricted environment and observe sequence of system interactions (using system calls).
Pararoma[] uses system-wide taint propagation to analyze information flow, which it uses for
detecting malware. Bitscope[] incorporates symbolic execution-based static analysis to analyze
malicious behavior.
F. Previous Accomplishments:
Use or disclosure of data contained on this sheet is subject to the restriction on the title page of
this proposal.
3
The HBGary Federal Team brings significant experience and capabilities directly related to the
objectives of the Cyber Genome Program. We can utilize existing foundational technologies to
more quickly focus on the primary goals of automated binary analysis and visualization.
HBGary and Pikewerks lead the way with Windows and Linux technologies to acquire extensive
low level data of binary behaviors with sandbox systems to harvest low level data of executing
binaries and physical memory imaging and reconstruction systems to uncover all digital objects
in a running host system. We will begin the cyber genome work with the existing ability to
acquire lowest level binary and memory data, therefore setting the stage for new research for
automated binary and malware data collection, automated data analysis and visualization.
HBGary has a database of 500GB of malware samples and receives 20,000 new malware
samples per day which are automatically analyzed to test its malware detection system. HBGary
Digital DNA is a patent pending technology to identify and classify digital objects based on their
observed behavioral traits. With approximately 5,000 traits identified as of this date, the work of
trait identification and development has only just begun with the bulk of work still ahead. While
HBGary focuses on Windows, Pikewerks has been doing its own work to identify malware
indicators on Linux systems. These past automation experiences will allow HBGary and
Pikewerks to contribute to new research efforts for binary analysis and reverse engineering.
SRI’s Eureka unpacking technology which automatically recovers unpacked executable images
will provide valuable experience in our efforts to prepare new binaries and malware samples for
analysis. Pikewerks will also be instrumental in these efforts as they have had several
Government contracts developing advanced technologies to protect software against reverse
engineering and have a successful commercial product for protecting digital objects.
Need evidence that we can do automated reasoning of data.
Secure Decisions is a leading provider of computer security visualization systems. Of particular
interest to this effort is their past work called iTVO which is used to visualize static and dynamic
software analysis. Since binaries and malware are also software their past work will be directly
applicable to this program.
While HBGary, Pikewerks and SRI have leading binary reverse engineering expertise, General
Dynamics has done the most malware analysis within Government and law enforcement
programs, albeit using traditional mostly manual reverse engineering tools and techniques. We
look to GD to define real life binary analysis use cases and conduct testing of any prototypes
developed.
BAA Section 5.1.3 says to identify Gov’t sponsors (name PM?), assess performance and ability
to control technical, cost and schedule against what was proposed. This info isn’t included
below.
SRI
HBGary
IP
Malware Feed Processor. Currently, HBGary receives 7,000 malware per day that are
automatically analyzed to assess the quality of its Digital DNA malware detection engine.
Use or disclosure of data contained on this sheet is subject to the restriction on the title page of
this proposal.
3
“Digital DNA Sequence” uses a set of rules (“genome”) to evaluate any data object, classify it
into an object type, and identify a sequence of observed behavioral “traits”.
“Fuzzy Hash Algorithm” generates a calculated sequence of hexadecimal bytes to identify
contents of a digital object, to classify digital objects, and to compare them to each other.
Digital DNA™. Implemented today for malware detection only, this malware genome codifies
malware behaviors into a sequence of observed traits with automated analysis.
“Kernel Mode Reverse Engineering Tool” for AFRL. Developed a kernel driver to execute
malware in a sandboxed environment and harvest low level runtime behaviors.
“Next Generation Software Reverse Engineering Tools” for AFRL. Prototyped reverse
HBGary engineering tools to overcome packing, encryption and obfuscation. Research focused on
Contract automated runtime tracing, stealthy debugging, data flow tracing, dynamic data sampling,
automated flow resolution and control flow execution tree graphing.
HBGary “Rootkit Detection & Mitigation” for DARPA. Tested known rootkits against known malware
Contract detection tools to determine that state-of-the-art of rootkit detection was very poor.
Responder™. Cyber security analysis system with mature features for automated Windows
HBGary
physical memory forensics, recover binary objects, binary extraction and disassembly, control
Product
flow graphing, static and dynamic reverse engineering, and reporting.
HBGary FastDump Pro. Windows physical memory imaging tool. Supports Win2K thru Win7, 32- & 64Product bit systems. Image RAM + pagefile and RAM larger than 4GB. Small RAM footprint.
Pikewerks Second Look™. Cyber security analysis system for automated Linux physical memory
Product forensics, recover binary objects, identify rootkits and malware, and binary disassembly.
REcon™. Malware runtime tracing tool that collects low level malware behaviors such as
HBGary
instruction executed, processes and threads launched or killed, changes to memory, registry or
Product
filesystem, and network activity. Has ability to replay and report observed behaviors.
“Anti-Forensics” for AFRL. Characterize and detect many anti-forensics techniques used by
PikewerksCo
malware authors such as file system storage techniques, indirect function hooking, memory
ntract
protection techniques using processor debug registers, and BIOS-based strategies.
Pikewerks Electronic Armor®. Protects Linux and Solaris executables, libraries, and scripts in RAM and
Product on disk from unauthorized access, reverse engineering, and signature detection.
“Botnet Detection & Mitigation” for DHS. Research focused on malware detection with host
HBGary
physical memory reconstruction and digital object analysis. Researched and prototyped Bayesian
Contract
Reasoning Networks to automate analysis of large amount of data for bot and malware detection.
“Visualization for Mission Critical & Mobile Cyber Assets” for DARPA. A framework to
Secure Dec
visualize physical and logical locations of mobile cyber assets to assess vulnerability and
Contract
security status. Featured in DARPA Success Stories. DISA chose for Phase III transition.
“Multilayer Wireless Network Flow Graph for Network and Security Analysis”. A method for
Secure Dec
converting large volumes of packet data into multilayer flow records, rendering them
Patent
graphically into visualizations showing communications patterns with querying capabilities.
“iTVO – Visualization for Static & Dynamic Software Analysis” for NAVAIR. Graphically
Secure Dec
visualizes software operation such as code blocks, object behavior, calls made, memory usage,
Contract
variable changes, disk utilization, network activity and I/O activity.
HBGary HBGary has had numerous unclassified contracts to develop cyber attack vectors and persistent,
Contracts stealthy host agent software with keylogging and covert C&C communications.
GD
DC3 contract. Real world malware r/e.
Contract
HBGary
Patent
HBGary
Patent
HBGary
Product
HBGary
Contract
G. Facilities:
Use or disclosure of data contained on this sheet is subject to the restriction on the title page of
this proposal.
3
Company
Facility
Address
Classification Lvl
HBGary Federal
HBGary
Pikewerks
SRI International
Secure Decisions
General Dynamics
H. Detail support enhancing that of Section II, including formal teaming agreements that are
required to execute this program.
I. Cost schedules and measurable milestones for the proposed research, including estimates of
cost for each task in each year of the effort delineated by the primes and major
subcontractors, total cost, and any company cost share. Note: Measurable milestones should
capture key development points in tasks and should be clearly articulated and defined in time
relative to start of effort. These milestones should enable and support a decision for the next
part of the effort. Additional interim non-critical management milestones are also highly
encouraged at regular intervals. Where the effort consists of multiple portions that could
reasonably be partitioned for purposes of funding, these should be identified as options with
separate cost estimates for each. Additionally, proposals should clearly explain the technical
approach(es) that will be employed to meet or exceed each program metric and provide
ample justification as to why the approach(es) is/are feasible. Note: Task descriptions related
to the technical approach and associated technical elements need to be complete and clearly
related to satisfying the program metrics as stated in Section 1.2.1.
J. All proposals must include a description of the data they will use during their research,
potential privacy issues, and how they propose mitigating any privacy issues.
HBGary Federal subscribes to commercial malware feeds and has an existing 500GB unique
sample malware repositories that will be used for this effort. We will also develop malware
smart crawlers to find and capture new malware that is not available in the feeds.
Section IV. Additional Information
A brief bibliography of relevant technical papers and research notes (published and unpublished)
that document the technical ideas upon which the proposal is based. Copies of not more than
three (3) relevant papers can be included in the submission.
(Minoru 2000; Chouchane, Walenstein et al. 2007; Jiang, Wang et al. 2007; Preda,
Christodorescu et al. 2007; Crandall, Ensafi et al. 2008; Preda, Christodorescu et al. 2008; Zhou
and Inge 2008; Ahmed, Hameed et al. 2009; Anh 2009; Bertrand 2009; Carbone, Cui et al. 2009;
Chen, Yuan et al. 2009; George 2009; Hu, Chiueh et al. 2009; Kang, Yin et al. 2009; Min 2009;
Sean 2009; Shakeel 2009; Tabish, Shafiq et al. 2009; Tengfei 2009; Wei 2009; Wei 2009;
Use or disclosure of data contained on this sheet is subject to the restriction on the title page of
this proposal.
3
Wenjian 2009; Xinran 2009; Xue, Hu et al. 2009; Yuan-yuan 2009; Desmond 2010; Hengli
2010; Lakhotia, Boccardo et al. 2010; Maughan 2010; Mohammad 2010; Norman 2010)
Ahmed, F., H. Hameed, et al. (2009). Using spatio-temporal information in API calls with
machine learning algorithms for malware detection. Proceedings of the 2nd ACM workshop on
Security and artificial intelligence. Chicago, Illinois, USA, ACM: 55-62.
Run-time monitoring of program execution behavior is widely used to discriminate
between benign and malicious processes running on an end-host. Towards this end, most
of the existing run-time intrusion or malware detection techniques utilize information
available in Windows Application Programming Interface (API) call arguments or
sequences. In comparison, the key novelty of our proposed tool is the use of statistical
features which are extracted from both spatial arguments) and temporal (sequences)
information available in Windows API calls. We provide this composite feature set as an
input to standard machine learning algorithms to raise the final alarm. The results of our
experiments show that the concurrent analysis of spatio-temporal features improves the
detection accuracy of all classifiers. We also perform the scalability analysis to identify a
minimal subset of API categories to be monitored whilst maintaining high detection
accuracy.
Anh, M. N. (2009). MAVMM: Lightweight and Purpose Built VMM for Malware Analysis.
Malicious software is rampant on the Internet and costs billions of dollars each year. Safe
and thorough analysis of malware is key to protecting vulnerable systems and cleaning
those that have already been infected. Most current state-of-the-art analysis platforms run
alongside the malware, increasing their detectability. This reduces the value of analysis
because some malware is known to behave differently when being analyzed.
Virtualization offers a compelling platform for malware analysis, with strong isolation
and the ability to save and restore guest state. Current virtual machine monitors (VMMs),
however, are not designed for malware analysis. Due to their complexity, they often fail
to provide transparency and even expose vulnerabilities which could be exploited by the
malware running inside guest system. We propose a lightweight VMM (namely
MAVMM) that is designed specially for a single job: malware analysis. MAVMM does
not implement unnecessary virtualization features commonly found in general purpose
hypervisors, including virtual device emulation. We take advantage of hardware
virtualization support to make MAVMM more simple, secure and transparent. In this
paper, we describe the design and implementation of MAVMM, and the features that we
can extract from programs running inside the guest OS. We evaluate our platform in three
aspects: functionality, detectability and performance. We show that our system can
extract useful information from malicious software, and that it is not susceptible to
known virtualization detection techniques.
Bertrand, A. (2009). Runtime Protection via Dataflow Flattening.
Software running on an open architecture, such as the PC, is vulnerable to inspection and
modification. Since software may process valuable or sensitive information, many
defenses against data analysis and modification have been proposed. This paper
complements existing work and focuses on hiding data location throughout program
execution. To achieve this, we combine three techniques: (i) periodic reordering of the
Use or disclosure of data contained on this sheet is subject to the restriction on the title page of
this proposal.
3
heap, (ii) migrating local variables from the stack to the heap and (iii) pointer scrambling.
By essentially flattening the dataflow graph of the program, the techniques serve to
complicate static dataflow analysis and dynamic data tracking. Our methodology can be
viewed as a data-oriented analogue of control-flow flattening techniques.Dataflow
flattening is useful in practical scenarios like DRM, information-flow protection, and
exploit resistance. Our prototype implementation compiles C programs into a binary for
which every access to the heap is redirected through a memory management unit. Stackbased variables may be migrated to the heap, while pointer accesses and arithmetic may
be scrambled and redirected. We evaluate our approach experimentally on the SPEC
CPU2006 benchmark suite.
Carbone, M., W. Cui, et al. (2009). Mapping kernel objects to enable systematic integrity
checking. Proceedings of the 16th ACM conference on Computer and communications security.
Chicago, Illinois, USA, ACM: 555-565.
Dynamic kernel data have become an attractive target for kernel-mode malware.
However, previous solutions for checking kernel integrity either limit themselves to code
and static data or can only inspect a fraction of dynamic data, resulting in limited
protection. Our study shows that previous solutions may reach only 28% of the dynamic
kernel data and thus may fail to identify function pointers manipulated by many kernelmode malware.
To enable systematic kernel integrity checking, in this paper we present KOP, a system that can
map dynamic kernel data with nearly complete coverage and nearly perfect accuracy.
Unlike previous approaches, which ignore generic pointers, unions and dynamic arrays
when locating dynamic kernel objects, KOP (1) applies inter-procedural points-to
analysis to compute all possible types for generic pointers (e.g., void*), (2) uses a pattern
matching algorithm to resolve type ambiguities (e.g., unions), and (3) recognizes
dynamic arrays by leveraging knowledge of kernel memory pool boundaries. We
implemented a prototype of KOP and evaluated it on a Windows Vista SP1 system
loaded with 63 kernel drivers. KOP was able to accurately map 99% of all the dynamic
kernel data.
To demonstrate KOP's power, we developed two tools based on it to systematically identify
malicious function pointers and uncover hidden kernel objects. Our tools correctly
identified all malicious function pointers and all hidden objects from nine real-world
kernel-mode malware samples as well as one created by ourselves, with no false alarms.
Chen, H., L. Yuan, et al. (2009). Control flow obfuscation with information flow tracking.
Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture.
New York, New York, ACM: 391-400.
Recent micro-architectural research has proposed various schemes to enhance processors
with additional tags to track various properties of a program. Such a technique, which is
usually referred to as information flow tracking, has been widely applied to secure
software execution (e.g., taint tracking), protect software privacy and improve
performance (e.g., control speculation).
Use or disclosure of data contained on this sheet is subject to the restriction on the title page of
this proposal.
3
In this paper, we propose a novel use of information flow tracking to obfuscate the whole control
flow of a program with only modest performance degradation, to defeat malicious code
injection, discourage software piracy and impede malware analysis. Specifically, we
exploit two common features in information flow tracking: the architectural support for
automatic propagation of tags and violation handling of tag misuses. Unlike other
schemes that use tags as oracles to catch attacks (e.g., taint tracking) or speculation
failures, we use the tags as flow-sensitive predicates to hide normal control flow
transfers: the tags are used as predicates for control flow transfers to the violation
handler, where the real control flow transfer happens.
We have implemented a working prototype based on Itanium processors, by leveraging the
hardware support for control speculation. Experimental results show that BOSH can
obfuscate the whole control flow with only a mean of 26.7% (ranging from 4% to 59%)
overhead on SPECINT2006. The increase in code size and compilation time is also
modest.
Chouchane, M. R., A. Walenstein, et al. (2007). Statistical signatures for fast filtering of
instruction-substituting metamorphic malware. Proceedings of the 2007 ACM workshop on
Recurring malcode. Alexandria, Virginia, USA, ACM: 31-37.
Introducing program variations via metamorphic transformations is one of the methods
used by malware authors in order to help their programs slip past defenses. A method is
presented for rapidly deciding whether or not an input program is likely to be a variant of
a given metamorphic program. The method is defined for the prominent class of
metamorphic engines that work by probabilistically selecting instruction-substituting
program transformations. A model of the probabilistic engine is used to predictthe
expected distribution of instruction forms for different generations ofvariants. These
predicted distributions form a type of "statistical signature" for the output of the
metamorphic engines. A classifier is defined based on distance between the observed and
the predicted instruction form distributions. A case study using the W32.Evol virus shows
the classifier can distinguish between malicious samples from multiple generations. The
classification method may be useful for practical malware detection by serving as an
inexpensive filter to avoid more in-depth analyses where they are unnecessary
Crandall, J. R., R. Ensafi, et al. (2008). The ecology of Malware. Proceedings of the 2008
workshop on New security paradigms. Lake Tahoe, California, USA, ACM: 99-106.
The fight against malicious software (or malware, which includes everything from worms
to viruses to botnets) is often viewed as an "arms race." Conventional wisdom is that we
must continually "raise the bar" for the malware creators. However, the multitude of
malware has itself evolved into a complex environment, and properties not unlike those
of ecological systems have begun to emerge. This may include competition between
malware, facilitation, parasitism, predation, and density-dependent population regulation.
Ecological principles will likely be useful for understanding the effects of these
ecological interactions, for example, carrying capacity, species-time and species-area
relationships, the unified neutral theory of biodiversity, and the theory of island biogeography. The emerging malware ecology can be viewed as a critical challenge to all
Use or disclosure of data contained on this sheet is subject to the restriction on the title page of
this proposal.
3
aspects of malware defense, including collection, triage, analysis, intelligence estimates,
detection, mitigation, and forensics. It can also be viewed as an opportunity.
In this position paper, we argue that taking an ecological approach to malware defense will
suggest new defenses. In particular, we can exploit the fact that interactions of malware
with its environment, and with other malware, are neither fully predictable nor fully
controllable by the malware author--yet the emergent behavior will follow general
ecological principles that can be exploited for malware defense.
Desmond, L. (2010). RBACS: Rootkit Behavioral Analysis and Classification System.
In this paper, we focus on rootkits, a special type of malicious software (malware) that
operates in an obfuscated and stealthy mode to evade detection. Categorizing these
rootkits will help in detecting future attacks against the business community. We first
developed a theoretical framework for classifying rootkits. Based on our theoretical
framework, we then proposed a new rootkit classification system and tested our system
on a sample of rootkits that use inline function hooking. Our experimental results showed
that our system could successfully categorize the sample using unsupervised clustering.
George, S. O. (2009). Using Nature to Best Clarify Computer Security and Threats.
Many computer security technologies are at risk of proving failure to their extreme
vulnerability. Our computers defense mechanisms have been going on for years now and
yet have not resulted in an extremely flexible set of protections. Organizations focus on
the ability of security technology to minimize risks but threats to computer security are
continuing. In other disciplines, looking to nature has proven extremely valuable. Perhaps
we can look to nature to help the understanding of threats to computer systems and even
find strategies for protecting them. A significant focus on nature is more useful in
divulging computer security issues. The work presented here is an attempt to work out
the rudimentary principles of defense mechanisms theory. More specifically, we have
attempted to lay out a framework within which the defence mechanisms may be
systematically defined, generated and classified through nature. The approach we have
used in this paper falls generally within the use of nature to explain security threats,
although the developments are elementary we believe, they are totally self-contained
within the context of our discussion.
Hengli, Z. (2010). Malicious Executables Classification Based on Behavioral Factor Analysis.
Malware is an increasingly important problem that threatens the security of computer
systems. The new concept of cloud security require rapid and automated detection and
classification of malicious software. In this paper, we propose a behavior-based
automated classification method. Depends on behavioral analysis we characterize
malware behavioral profile in a trace report. This report contains the status change caused
by the executable and event which are transfered from corresponding Win32 API calls
and their certain parameters. we extract behaviour unit strings as features which reflect
diffierent malware families behavioral patterns. These features vector space servered as
input to the SVM. We use string similarity and information gain to reduce the dimension
of feature space. Comparative experiments with a real world data set of malicious
Use or disclosure of data contained on this sheet is subject to the restriction on the title page of
this proposal.
4
executables shows that our proposed method can classify malware into diffierent
malware families with higher accuracy and efficiency.
Hu, X., T.-c. Chiueh, et al. (2009). Large-scale malware indexing using function-call graphs.
Proceedings of the 16th ACM conference on Computer and communications security. Chicago,
Illinois, USA, ACM: 611-620.
A major challenge of the anti-virus (AV) industry is how to effectively process the huge
influx of malware samples they receive every day. One possible solution to this problem
is to quickly determine if a new malware sample is similar to any previously-seen
malware program. In this paper, we design, implement and evaluate a malware database
management system called SMIT (Symantec Malware Indexing Tree) that can efficiently
make such determination based on malware's function-call graphs, which is a structural
representation known to be less susceptible to instruction-level obfuscations commonly
employed by malware writers to evade detection of AV software. Because each malware
program is represented as a graph, the problem of searching for the most similar malware
program in a database to a given malware sample is cast into a nearest-neighbor search
problem in a graph database. To speed up this search, we have developed an efficient
method to compute graph similarity that exploits structural and instruction-level
information in the underlying malware programs, and a multi-resolution indexing scheme
that uses a computationally economical feature vector for early pruning and resorts to a
more accurate but computationally more expensive graph similarity function only when it
needs to pinpoint the most similar neighbors. Results of a comprehensive performance
study of the SMIT prototype using a database of more than 100,000 malware demonstrate
the effective pruning power and scalability of its nearest neighbor search mechanisms.
Jiang, X., X. Wang, et al. (2007). Stealthy malware detection through vmm-based "out-of-thebox" semantic view reconstruction. Proceedings of the 14th ACM conference on Computer and
communications security. Alexandria, Virginia, USA, ACM: 128-138.
An alarming trend in malware attacks is that they are armed with stealthy techniques to
detect, evade, and subvert malware detection facilities of the victim. On the defensive
side, a fundamental limitation of traditional host-based anti-malware systems is that they
run inside the very hosts they are protecting ("in the box"), making them vulnerable to
counter-detection and subversion by malware. To address this limitation, recent solutions
based on virtual machine (VM) technologies advocate placing the malware detection
facilities outside of the protected VM ("out of the box"). However, they gain tamper
resistance at the cost of losing the native, semantic view of the host which is enjoyed by
the "in the box" approach, thus leading to a technical challenge known as the semantic
gap.
In this paper, we present the design, implementation, and evaluation of VMwatcher - an "out-ofthe-box" approach that overcomes the semantic gap challenge. A new technique called
guest view casting is developed to systematically reconstruct internal semantic views
(e.g., files, processes, and kernel modules) of a VM from the outside in a non-intrusive
manner. Specifically, the new technique casts semantic definitions of guest OS data
Use or disclosure of data contained on this sheet is subject to the restriction on the title page of
this proposal.
4
structures and functions on virtual machine monitor (VMM)-level VM states, so that the
semantic view can be reconstructed. With the semantic gap bridged, we identify two
unique malware detection capabilities: (1) view comparison-based malware detection and
its demonstration in rootkit detection and (2) "out-of-the-box" deployment of host-based
anti-malware software with improved detection accuracy and tamper-resistance. We have
implemented a proof-of-concept prototype on both Linux and Windows platforms and
our experimental results with real-world malware, including elusive kernel-level rootkits,
demonstrate its practicality and effectiveness.
Kang, M. G., H. Yin, et al. (2009). Emulating emulation-resistant malware. Proceedings of the
1st ACM workshop on Virtual machine security. Chicago, Illinois, USA, ACM: 11-22.
The authors of malware attempt to frustrate reverse engineering and analysis by creating
programs that crash or otherwise behave differently when executed on an emulated
platform than when executed on real hardware. In order to defeat such techniques and
facilitate automatic and semi-automatic dynamic analysis of malware, we propose an
automated technique to dynamically modify the execution of a whole-system emulator to
fool a malware sample's anti-emulation checks. Our approach uses a scalable trace
matching algorithm to locate the point where emulated execution diverges, and then
compares the states of the reference system and the emulator to create a dynamic state
modification that repairs the difference. We evaluate our technique by building an
implementation into an emulator used for in-depth malware analysis. On case studies that
include real samples of malware collected in the wild and an attack that has not yet been
exploited, our tool automatically ameliorates the malware sample's anti-emulation checks
to enable analysis, and its modifications are robust to system changes.
Lakhotia, A., D. R. Boccardo, et al. (2010). Context-sensitive analysis of obfuscated x86
executables. Proceedings of the 2010 ACM SIGPLAN workshop on Partial evaluation and
program manipulation. Madrid, Spain, ACM: 131-140.
A method for context-sensitive analysis of binaries that may have obfuscated procedure
call and return operations is presented. Such binaries may use operators to directly
manipulate stack instead of using native call and ret instructions to achieve equivalent
behavior. Since definition of context-sensitivity and algorithms for context-sensitive
analysis have thus far been based on the specific semantics associated to procedure call
and return operations, classic interprocedural analyses cannot be used reliably for
analyzing programs in which these operations cannot be discerned. A new notion of
context-sensitivity is introduced that is based on the state of the stack at any instruction.
While changes in `calling'-context are associated with transfer of control, and hence can
be reasoned in terms of paths in an interprocedural control flow graph (ICFG), the same
is not true of changes in 'stack'-context. An abstract interpretation based framework is
developed to reason about stack-contexts and to derive analogues of call-strings based
methods for the context-sensitive analysis using stack-context. The method presented is
used to create a context-sensitive version of Venable et al.'s algorithm for detecting
obfuscated calls. Experimental results show that the context-sensitive version of the
algorithm generates more precise results and is also computationally more efficient than
its context-insensitive counterpart.
Use or disclosure of data contained on this sheet is subject to the restriction on the title page of
this proposal.
4
Maughan, D. (2010). "The need for a national cybersecurity research and development agenda."
Commun. ACM 53(2): 29-31.
Government-funded initiatives, in cooperation with private-sector partners in key
technology areas, are fundamental to cybersecurity technical transformation.
Min, F. (2009). Detecting virus mutations via dynamic matching.
To defeat current commercial antivirus software, the virus developers are employing
obfuscation techniques to create mutating viruses. The current antivirus software cannot
handle the obfuscated viruses well since its detection methods that are based upon static
signatures are not resilient to even slight variations in the code that forms the virus. In
this paper, we propose a new type of virus signature, called dynamic signature, and an
algorithm for matching dynamic signatures. Our dynamic signature is created based on
the runtime behavior of a virus. Therefore, an obfuscated virus can also be detected using
a dynamic signature as long as it dynamically behaves like the original virus. We also
discuss issues related to deploying our virus detection approach. Our experiments based
upon several known mutating viruses show that our method is effective in identifying
obfuscated viruses.
Minoru, F. (2000). A New Rule Generation Method from Neural Networks Formed Using a
Genetic Algorithm with Virus Infection.
In this paper, a new rule generation method from neural networks is presented. A neural
network (NN) is formed using a genetic algorithm (GA) with virus infection and
deterministic mutation to represent regularities in training data. This method utilizes a
modular structure in GA. Each module learns a different neural network architecture,
such as sigmoid and a high order neural networks. That information is communicated to
the other modules by the virus infection. The results of computer simulations show that
this approach can generate obvious network structures and as a result simple rules.
Mohammad, T. (2010). A Survey of Hardware Trojan Taxonomy and Detection. K. Farinaz. 27:
10-25.
Today's integrated circuits are vulnerable to hardware Trojans, which are malicious
alterations to the circuit, either during design or fabrication. This article presents a
classification of hardware Trojans and a survey of published techniques for Trojan
detection.
Norman, S. (2010). Metrics for Mitigating Cybersecurity Threats to Networks. 14: 64-71.
To achieve their full potential, networks must be secure as well as functional. With this in
mind, the author identifies metrics designed to mitigate vulnerabilities to cyberattacks in
networks that are key to the critical infrastructure of the US. He discusses both growth
metrics — based on data obtained from the US National Institute of Standards and
Technology and Department of Homeland Security vulnerability database — and metrics
designed to mitigate the risk of security vulnerabilities in networks. If used together,
these two types of metrics can help make networks more secure.
Use or disclosure of data contained on this sheet is subject to the restriction on the title page of
this proposal.
4
Preda, M. D., M. Christodorescu, et al. (2007). A semantics-based approach to malware
detection. Proceedings of the 34th annual ACM SIGPLAN-SIGACT symposium on Principles of
programming languages. Nice, France, ACM: 377-388.
Malware detection is a crucial aspect of software security. Current malware detectors
work by checking for "signatures," which attempt to capture (syntactic) characteristics of
the machine-level byte sequence of the malware. This reliance on a syntactic approach
makes such detectors vulnerable to code obfuscations, increasingly used by malware
writers, that alter syntactic properties of the malware byte sequence without significantly
affecting their execution behavior.This paper takes the position that the key to malware
identification lies in their semantics. It proposes a semantics-based framework for
reasoning about malware detectors and proving properties such as soundness and
completeness of these detectors. Our approach uses a trace semantics to characterize the
behaviors of malware as well as the program being checked for infection, and uses
abstract interpretation to "hide" irrelevant aspects of these behaviors. As a concrete
application of our approach, we show that the semantics-aware malware detector
proposed by Christodorescu et al. is complete with respect to a number of common
obfuscations used by malware writers.
Preda, M. D., M. Christodorescu, et al. (2008). "A semantics-based approach to malware
detection." ACM Trans. Program. Lang. Syst. 30(5): 1-54.
Malware detection is a crucial aspect of software security. Current malware detectors
work by checking for signatures, which attempt to capture the syntactic characteristics of
the machine-level byte sequence of the malware. This reliance on a syntactic approach
makes current detectors vulnerable to code obfuscations, increasingly used by malware
writers, that alter the syntactic properties of the malware byte sequence without
significantly affecting their execution behavior.
This paper takes the position that the key to malware identification lies in their semantics. It
proposes a semantics-based framework for reasoning about malware detectors and
proving properties such as soundness and completeness of these detectors. Our approach
uses a trace semantics to characterize the behavior of malware as well as that of the
program being checked for infection, and uses abstract interpretation to “hide” irrelevant
aspects of these behaviors. As a concrete application of our approach, we show that (1)
standard signature matching detection schemes are generally sound but not complete, (2)
the semantics-aware malware detector proposed by Christodorescu et al. is complete with
respect to a number of common obfuscations used by malware writers and (3) the
malware detection scheme proposed by Kinder et al. and based on standard modelchecking techniques is sound in general and complete on some, but not all, obfuscations
handled by the semantics-aware malware detector.
Sean, F. (2009). Analyzing and Detecting Malicious Flash Advertisements.
The amount of dynamic content on the web has been steadily increasing. Scripting
languages such as JavaScript and browser extensions such as Adobe's Flash have been
instrumental in creating web-based interfaces that are similar to those of traditional
applications. Dynamic content has also become popular in advertising, where Flash is
used to create rich, interactive ads that are displayed on hundreds of millions of
Use or disclosure of data contained on this sheet is subject to the restriction on the title page of
this proposal.
4
computers per day. Unfortunately, the success of Flash-based advertisements and
applications attracted the attention of malware authors, who started to leverage Flash to
deliver attacks through advertising networks. This paper presents a novel approach whose
goal is to automate the analysis of Flash content to identify malicious behavior. We
designed and implemented a tool based on the approach, and we tested it on a large
corpus of real-world Flash advertisements. The results show that our tool is able to
reliably detect malicious Flash ads with limited false positives. We made our tool
available publicly and it is routinely used by thousands of users.
Shakeel, B. (2009). Protecting Commodity Operating System Kernels from Vulnerable Device
Drivers.
Device drivers on commodity operating systems execute with kernel privilege and have
unfettered access to kernel data structures. Several recent attacks demonstrate that such
poor isolation exposes kernel data to exploits against vulnerable device drivers, for
example through buffer overruns in packet processing code. Prior architectures to isolate
kernel data from driver code either sacrifice performance, execute too much driver code
with kernel privilege, or are incompatible with commodity operating systems. In this
paper, we present the design, implementation and evaluation of a novel security
architecture that better isolates kernel data from device drivers without sacrificing
performance or compatibility. In this architecture, a device driver is partitioned into a
small, trusted kernel-mode component and an untrusted user-mode component. The
kernel-mode component contains privileged and performance-critical code. It
communicates via RPC with the user-mode component which contains the rest of the
driver code. A RPC monitor mediates all control and data transfers between the kerneland user-mode components. In particular, it verifies that all data transfers from the
untrusted user-mode component to the kernel-mode component preserve kernel data
structure integrity. We also present a runtime technique to automatically infer such
integrity specifications. Our experiments with a Linux implementation of this architecture
show that it can prevent compromised device drivers from affecting the integrity of
kernel data and do so without impacting common-case performance.
Tabish, S. M., M. Z. Shafiq, et al. (2009). Malware detection using statistical analysis of bytelevel file content. Proceedings of the ACM SIGKDD Workshop on CyberSecurity and
Intelligence Informatics. Paris, France, ACM: 23-31.
Commercial anti-virus software are unable to provide protection against newly launched
(a.k.a "zero-day") malware. In this paper, we propose a novel malware detection
technique which is based on the analysis of byte-level file content. The novelty of our
approach, compared with existing content based mining schemes, is that it does not
memorize specific byte-sequences or strings appearing in the actual file content. Our
technique is non-signature based and therefore has the potential to detect previously
unknown and zero-day malware. We compute a wide range of statistical and informationtheoretic features in a block-wise manner to quantify the byte-level file content. We
leverage standard data mining algorithms to classify the file content of every block as
normal or potentially malicious. Finally, we correlate the block-wise classification results
of a given file to categorize it as benign or malware. Since the proposed scheme operates
at the byte-level file content; therefore, it does not require any a priori information about
Use or disclosure of data contained on this sheet is subject to the restriction on the title page of
this proposal.
4
the filetype. We have tested our proposed technique using a benign dataset comprising of
six different filetypes --- DOC, EXE, JPG, MP3, PDF and ZIP and a malware dataset
comprising of six different malware types --- backdoor, trojan, virus, worm, constructor
and miscellaneous. We also perform a comparison with existing data mining based
malware detection techniques. The results of our experiments show that the proposed
nonsignature based technique surpasses the existing techniques and achieves more than
90% detection accuracy.
Tengfei, Y. (2009). Anti-debugging Framework Based on Hardware Virtualization Technology.
Anti-debugging technique is widely used to protect executable files in commercial
software applications. However, most of contemporary anti-debugging products fail to
guarantee their functionalities in that when the application code is running on Ring 0 or
above, malicious attackers can still manipulate it to block the anti-debugging process.
This paper introduces an anti-debugging framework based on hardware virtualization
technology called Virtual Machine Monitor (VMM), which can monitor each code
running above its privilege level on Intel x86 platform. Our experiments demonstrate that
major debuggers running on Microsoft Windows, such as VC2005 and WinDBG, are
incapable to debug the target application with the protection of our anti-debugging
framework.
Wei, P. (2009). A Novel Anomaly Detection Approach for Executable Program Security.
Anomaly detection of executable program is a security detection solution that examines
whether security violation issues exist in programs. The paper presents a novel anomaly
detection approach for executable program security (ADEPS), which monitors program
executions and detects anomalous program behaviors. Through reverse analysis of
executable program, critical behavior monitoring points can be extracted from binary
code sequences and memory space. A hybrid neural network model is proposed to detect
abnormal attacks and classify detected attacks from actual program behaviors. The
experimental results demonstrate that the proposed approach can effectively and
accurately perform anomaly detection.
Wei, W. (2009). A Hierarchical Artificial Immune Model for Virus Detection.
As viruses become more complex, existing antivirus methods are inefficient to detect
various forms of viruses, especially new variants and unknown viruses. Inspired by
immune system, a hierarchical artificial immune system (AIS) model, which is based on
matching in three layers, is proposed to detect a variety of forms of viruses. In the bottom
layer, a non-stochastic but guided candidate virus gene library is generated by statistical
information of viral key codes. Then a detecting virus gene library is upgraded from the
candidate virus gene library using negative selection. In the middle layer, a novel storage
method is used to keep a potential relevance between different signatures on the
individual level, by which the mutual cooperative information of each instruction in a
virus program can be collected. In the top layer, an overall matching process can reduce
the information loss considerably. Experimental results indicate that the proposed model
can recognize obfuscated viruses efficiently with an averaged recognition rate of 94%,
including new variants of viruses and unknown viruses.
Use or disclosure of data contained on this sheet is subject to the restriction on the title page of
this proposal.
4
Wenjian, Y. (2009). A Control Flow Graph Reconstruction Method from Binaries Based on
XML.
The first step of decompilation and other reverse analysis for binary codes is constructing
control flow graph of program. Due to the loss of structure in compiled code, extracting a
control flow graph from an executable is not simply a matter. The usual way which scans
the branch instructions in the disassembly file and creates the CFG can not suitable for
multiply architecture and file format. This paper analyzes the structure of assembly
program and designs a data information description language based on XML----Reverse
Meta Language, which is to describe structured disassembling information and then
brings forward the control flow graph reconstruction algorithm. This method abstracts
disassembly representations and supports multiple architectures. The testing result shows
that this method can reconstruct control flow graph of binaries effectively.
Xinran, W. (2009). Detecting Software Theft via System Call Based Birthmarks.
Along with the burst of open source projects, software theft (or plagiarism) has become a
very serious threat to the healthiness of software industry. Software birthmark, which
represents the unique characteristic of a program, can be used for software theft
detection. We propose two system call based software birthmarks: SCSSB (System Call
Short Sequence Birthmark) and IDSCSB (Input Dependant System Call Subsequence
Birthmark), and examine how well they reflect unique behavioral characteristics of a
program. To our knowledge, our detection system based on SCSSB and IDSCSB is the
first one that is capable of software component theft detection where only partial code is
stolen. We demonstrate the strength of our birthmarks against various evasion techniques,
including those based on different compilers and different compiler optimization levels as
well as those based on very powerful obfuscation techniques supported by SandMark.
Unlike the existing work that were evaluated through small or toy software, we also
evaluate our birthmarks on a set of large software (web browsers). Our results show that
system call based birthmarks are very practical and effective in detecting software theft
that even adopts advanced evasion techniques.
Xue, J., C. Hu, et al. (2009). Metamorphic malware detection technology based on aggregating
emerging patterns. Proceedings of the 2nd International Conference on Interaction Sciences:
Information Technology, Culture and Human. Seoul, Korea, ACM: 1293-1296.
Obfuscating technology is used widely in metamorphic malware and most of current
detection methods fail to completely identify such ever-increasingly covert metamorphic
malware. In this paper, system call sequences in the process of software execution are
researched and metamorphic malware detection method based on aggregating emerging
patterns is proposed. Experimental results show most metamorphic malware can be
detected effectively by this method and it has higher detection rate and lower false alarm
rate when the minimum support and growth rate thresholds are set reasonably.
Yuan-yuan, L. (2009). AOP-Based Attack on Obfuscated Java Code.
A new approach to attack on obfuscated code is proposed using the join-point model and
the bytecode instrument mechanism provided by AOP. All operations are implemented
on bytecode level without acquiring the source code of target program, which makes
obfuscation useless. As for attackers, they can modify the behavior of target code without
Use or disclosure of data contained on this sheet is subject to the restriction on the title page of
this proposal.
4
decompiling and recompiling it. As experiment is shown, this approach to attack on
obfuscated code is very straightforward and simple to be implemented.
Zhou, Y. and W. M. Inge (2008). Malware detection using adaptive data compression.
Proceedings of the 1st ACM workshop on Workshop on AISec. Alexandria, Virginia, USA,
ACM: 53-60.
A popular approach in current commercial anti-malware software detects malicious
programs by searching in the code of programs for scan strings that are byte sequences
indicative of malicious code. The scan strings, also known as the signatures of existing
malware, are extracted by malware analysts from known malware samples, and stored in
a database often referred to as a virus dictionary. This process often involves a significant
amount of human efforts. In addition, there are two major limitations in this technique.
First, not all malicious programs have bit patterns that are evidence of their malicious
nature. Therefore, some malware is not recorded in the virus dictionary and can not be
detected through signature matching. Second, searching for specific bit patterns will not
work on malware that can take many forms--obfuscated malware. Signature matching has
been shown to be incapable of identifying new malware patterns and fails to recognize
obfuscated malware. This paper presents a malware detection technique that discovers
malware by means of a learning engine trained on a set of malware instances and a set of
benign code instances. The learning engine uses an adaptive data compression model-prediction by partial matching (PPM)--to build two compression models, one from the
malware instances and the other from the benign code instances. A code instance is
classified, either as "malware" or "benign", by minimizing its estimated cross entropy.
Our preliminary results are very promising. We achieved about 0.94 true positive rate
with as low as 0.016 false positive rate. Our experiments also demonstrate that this
technique can effectively detect unknown and obfuscated malware.
Use or disclosure of data contained on this sheet is subject to the restriction on the title page of
this proposal.
4