Static Analysis of Java and Soot
Mooly Sagiv
Main Java Features
• Class based object oriented
• Type safe
• No explicit free
• Portable with bytecode
– Interpreted by the Java Virtual Machine
• Clean and rich library
• Verbose
• Carefully designed
Java Bytecode
[Figure: Java source code is compiled to Java bytecode, which runs on the Linux JVM (Linux machine), the Win JVM (Windows machine), and the Mac JVM (Mac machine)]
Types of Java Bugs
• Null dereferences
• “Memory” and resource leaks
• Data races
• Concurrent modification via Iterators
• Incorrect API usage
Program Slicing
• Program Slice [Mark Weiser]
– the statements of a program that may affect the
values of some variables in a set V at some point
of interest p
Original program:
read(n);
i := 1 ;
sum := 0;
prod := 1 ;
while (i <= n) do {
sum := sum + i;
prod := prod * i;
i := i + 1 ;
}
print sum;
print prod ;
Slice with respect to prod at "print prod":
read(n);
i := 1 ;
prod := 1 ;
while (i <= n) do {
prod := prod * i;
i := i + 1 ;
}
print prod ;
Applications of Slicing
• Debugging
• Program Comprehension
• Reverse Engineering
• Program Testing
• Measuring Program Metrics
– Coverage, Overlap, Clustering
• Refactoring
• Program integration
Program Dependence Graph(PDG)
• A directed graph
• Nodes are basic instructions
(statements/conditions)
• Two types of edges between u and v
– Flow-Dependence
• The value assigned at u is directly used at v
– Control-Dependence
• The value of the condition u “controls” the execution of v
PDG Example
read(n);
i := 1 ;
sum := 0;
prod := 1 ;
while (i <= n) do {
sum := sum + i;
prod := prod * i;
i := i + 1 ;
}
print sum;
print prod ;
[Figure: PDG of the program above. Nodes: read(n), i := 1, sum := 0, prod := 1, i <= n?, sum := sum + i, prod := prod * i, i := i + 1, print sum, print prod. Control edges are labeled T or F.]
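A slice can be read off the PDG by walking dependence edges backwards from the point of interest. The following is a minimal sketch of that idea, assuming the loop condition is i <= n; the edge list is written by hand from the example program, and the class and method names are hypothetical.

```java
import java.util.*;

class PdgSliceSketch {
    // Dependence edges of the PDG: key = statement, value = statements it depends on.
    static final Map<String, List<String>> DEPENDS_ON = new LinkedHashMap<>();

    static void dep(String stmt, String... sources) {
        DEPENDS_ON.computeIfAbsent(stmt, k -> new ArrayList<>()).addAll(Arrays.asList(sources));
    }

    static {
        // Flow dependences (a definition reaches a use) and control dependences
        // (the loop condition controls the loop body), entered by hand.
        dep("while (i <= n)", "read(n)", "i := 1", "i := i + 1");
        dep("sum := sum + i", "sum := 0", "sum := sum + i", "i := 1", "i := i + 1", "while (i <= n)");
        dep("prod := prod * i", "prod := 1", "prod := prod * i", "i := 1", "i := i + 1", "while (i <= n)");
        dep("i := i + 1", "i := 1", "i := i + 1", "while (i <= n)");
        dep("print sum", "sum := 0", "sum := sum + i");
        dep("print prod", "prod := 1", "prod := prod * i");
    }

    // Backward slice: every statement reachable by following dependence edges backwards.
    static Set<String> backwardSlice(String criterion) {
        Set<String> slice = new LinkedHashSet<>();
        Deque<String> work = new ArrayDeque<>();
        work.push(criterion);
        while (!work.isEmpty()) {
            String s = work.pop();
            if (slice.add(s)) {
                for (String d : DEPENDS_ON.getOrDefault(s, List.of())) {
                    work.push(d);
                }
            }
        }
        return slice;
    }

    public static void main(String[] args) {
        // The result contains the seven statements of the slice for prod and none of the sum statements.
        System.out.println(backwardSlice("print prod"));
    }
}
```

Calling backwardSlice("print sum") instead yields the symmetric slice that keeps the sum statements and drops the prod statements.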
Flow Dependences with Pointers
List p, q, y;
q = (List *) malloc();
p = q;
l1: p->d = 5;
l1,5
l2: printf(q->d);
List p, q, y;
q = (List *) malloc();
p = q;
l1: p->d = 5;
l1,5 : t->d = 7;
l2: printf(q->d);
List p, q, y;
q = (List *) malloc();
p = q;
l1: p->d = 5;
l1,5 : p = (List *) malloc();
l2: printf(q->d);
Constructing Flow Dependences
• Instrument the program with statements that record, for each memory location, the statement that last wrote to it
• v depends on u iff v reads a location last written at u
• Can be approximated statically
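The instrumentation idea can be sketched as a small recorder that remembers the last writer of each location and emits a dependence whenever a location is read. This is only an illustration: the recorder class, the location naming scheme ("object.field") and the cell name "obj1" are assumptions, and the run replays the first pointer example (q = malloc; p = q; l1: p->d = 5; l2: printf(q->d)).

```java
import java.util.*;

class LastWriterSketch {
    // Maps each memory location ("object.field") to the label of the statement that last wrote it.
    private final Map<String, String> lastWriter = new HashMap<>();
    // Recorded dynamic flow dependences, formatted as "writer -> reader".
    final Set<String> flowDeps = new LinkedHashSet<>();

    // Instrumentation hook placed at every write.
    void write(String label, String location) {
        lastWriter.put(location, label);
    }

    // Instrumentation hook placed at every read.
    void read(String label, String location) {
        String writer = lastWriter.get(location);
        if (writer != null) {
            flowDeps.add(writer + " -> " + label);
        }
    }

    // Replays: q = malloc(); p = q; l1: p->d = 5; l2: printf(q->d);
    static Set<String> run() {
        LastWriterSketch rec = new LastWriterSketch();
        String q = "obj1";            // hypothetical name for the allocated cell
        String p = q;                 // p and q alias the same cell
        rec.write("l1", p + ".d");    // l1 writes the field d through p
        rec.read("l2", q + ".d");     // l2 reads the same field through q
        return rec.flowDeps;
    }

    public static void main(String[] args) {
        System.out.println(run());    // prints [l1 -> l2]
    }
}
```

Because locations are keyed by the cell rather than by the pointer variable, the dependence is found even though the write goes through p and the read through q.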
Simple Example
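#include <stdio.h>
#include <stdlib.h>

/* List is a pointer to a node holding one character and a next pointer n. */
typedef struct node { char data; struct node *n; } *List;

void Append(void) {
List head, tail, temp;
l1: head = (List) malloc(sizeof(*head));
l2: scanf("%c", &head->data);
l3: head->n = NULL;
l4: tail = head;
l5: if (tail->data == 'x') goto l12;
l6: temp = (List) malloc(sizeof(*temp));
l7: scanf("%c", &temp->data);
l8: temp->n = NULL;
l9: tail->n = temp;
l10: tail = tail->n;
l11: goto l5;
l12: printf("%c", head->data);
l13: printf("%c", tail->data);
}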
Project 1: Java Slicer
• Develop Slicer for Java Programs with Shallow
Pointers
• Develop abstract domain and transformers
• Implement with Soot for intraprocedural
programs
• Evaluate the project on real and artificial
benchmarks
Taint Checking
• Enforce code security by tracking propagated
information
• Prevent bad behaviors (e.g., SQL injection)
HttpServletRequest request = ...;
String userName = request.getParameter("name");
Connection con = ...;
String query = "SELECT * FROM Users " +
" WHERE name = '" + userName + "'";
con.createStatement().execute(query);
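Tracking propagated information can be illustrated on straight-line code: a variable becomes tainted when it is assigned from a source or from an expression with a tainted operand. The sketch below models the three assignments of the SQL example under that rule; the Assign class and the variable name "prefix" are hypothetical, and real taint analyses also handle branches, calls, and the heap.

```java
import java.util.*;

class TaintSketch {
    // One statement "target = f(operands)". A source statement taints its target.
    static final class Assign {
        final String target;
        final boolean isSource;
        final List<String> operands;

        Assign(String target, boolean isSource, String... operands) {
            this.target = target;
            this.isSource = isSource;
            this.operands = Arrays.asList(operands);
        }
    }

    // Forward propagation over straight-line code: the target is tainted if the
    // statement is a source or any operand is tainted; otherwise it becomes clean.
    static Set<String> taintedVars(List<Assign> program) {
        Set<String> tainted = new HashSet<>();
        for (Assign a : program) {
            boolean t = a.isSource;
            for (String op : a.operands) {
                t |= tainted.contains(op);
            }
            if (t) {
                tainted.add(a.target);
            } else {
                tainted.remove(a.target);
            }
        }
        return tainted;
    }

    static boolean queryIsTainted() {
        List<Assign> program = List.of(
            new Assign("userName", true),                       // request.getParameter("name")
            new Assign("prefix", false),                        // constant SQL text
            new Assign("query", false, "prefix", "userName"));  // string concatenation
        return taintedVars(program).contains("query");          // query reaches the sink
    }

    public static void main(String[] args) {
        System.out.println("query tainted: " + queryIsTainted()); // prints query tainted: true
    }
}
```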
FlowDroid: Precise Context, Flow, Field, Object-sensitive and Lifecycle-aware Taint Analysis for Android Apps
FlowTwist: Efficient Context-Sensitive Inside-Out Taint Analysis for Large Codebases
Johannes Lerch, Ben Hermann, Eric Bodden, and Mira Mezini
{lastname}@cs.tu-darmstadt.de
https://github.com/johanneslerch/FlowTwist
@stg_darmstadt
17.01.2016 | Technische Universität Darmstadt | Software Technology Group | 15
Oracle patches Java 7 vulnerability
Breaking its quarterly update schedule, Oracle has released a new Java runtime that addresses
recent security flaws.
Aug 30, 2012
Theory: Stack-based Access Control
Call stack (top first) and the permissions of each frame:
SecurityManager.checkPermission: FilePermission, SocketPermission, ...
SecurityManager.checkWrite: FilePermission, SocketPermission, ...
FileOutputStream.<init>: FilePermission, SocketPermission, ...
Attacker.doEvil: Ø
MyApplet.init: Ø
Intersection over all frames: ∩ = Ø
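The check amounts to intersecting the permission sets of every frame on the call stack. Here is a minimal sketch of that rule, with permissions modeled as plain strings; it does not use the real SecurityManager API, and all names in it are hypothetical.

```java
import java.util.*;

class StackInspectionSketch {
    // Effective permissions: the intersection of the permission sets of all frames on the stack.
    static Set<String> effectivePermissions(List<Set<String>> framePermissions) {
        Iterator<Set<String>> it = framePermissions.iterator();
        if (!it.hasNext()) {
            return new HashSet<>();   // convention chosen for this sketch: empty stack, no permissions
        }
        Set<String> result = new HashSet<>(it.next());
        while (it.hasNext()) {
            result.retainAll(it.next());
        }
        return result;
    }

    static Set<String> example() {
        Set<String> system = Set.of("FilePermission", "SocketPermission");
        Set<String> applet = Set.of();   // untrusted code holds no permissions
        // Frames, top of stack first: checkPermission, checkWrite, FileOutputStream.<init>,
        // Attacker.doEvil, MyApplet.init.
        return effectivePermissions(List.of(system, system, system, applet, applet));
    }

    public static void main(String[] args) {
        System.out.println(example());   // prints []
    }
}
```

A single unprivileged frame anywhere on the stack empties the intersection, which is why the write attempted by Attacker.doEvil is denied.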
Theory: Stack-based Access Control
public static Class<?> forName(String className) throws ClassNotFoundException {
return forName0(className,
true,
ClassLoader.getCallerClassLoader());
}
Implicit Permission Check
Class.forName: Privileged ClassLoader
Attacker.doEvil: Unprivileged ClassLoader
MyApplet.init: Unprivileged ClassLoader
Leak in sun.beans.finder.ClassFinder (CVE-2012-4681)
public static Class<?> findClass(String className)
{
try
{
ClassLoader cl = ...
...
return Class.forName(className, false, cl); // throws SecurityException
}
catch (ClassNotFoundException e)
{
}
catch (SecurityException e)
{
// “handled” here
}
return Class.forName(className);
}
Leak in sun.beans.finder.ClassFinder (CVE-2012-4681): call stack
Class.forName: Privileged ClassLoader
ClassFinder.findClass: Privileged ClassLoader
Attacker.doEvil: Unprivileged ClassLoader
MyApplet.init: Unprivileged ClassLoader
Deriving the Static Program Analysis Problem
[Figure: call stack with Class.forName (caller sensitive, privileged ClassLoader), ClassFinder.findClass (privileged ClassLoader), Attacker.doEvil and MyApplet.init (unprivileged ClassLoader); private methods sit between the caller-sensitive method and the public method. Annotations: track the return value; track the parameter.]
Two Independent Analyses
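[Figure: two analyses over the chain from a public method through private methods to a caller-sensitive method. Tracking the parameter: source at the public method, sink at the caller-sensitive method. Tracking the return value: source at the caller-sensitive method, sink at the public method.]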
Two Independent Analyses: Not Context-Sensitive
[Figure: call graph with a caller-sensitive method, three private methods, and two public methods; one analysis tracks the return value, the other tracks the parameter]
Pure Forward Context-Sensitive Approach
[Figure: call graph with a caller-sensitive method, three private methods, and two public methods, annotated with source and sink]
Results – only Class.forName
[Figure: Runtime [min] versus Maximum Heap Size [GB]]
Results – all Caller Sensitive Methods
[Figure: Runtime [min] versus Maximum Heap Size [GB]]
Scale of the Problem
[Figure: call graph connecting a caller-sensitive method through private methods to public methods (~45,000 methods)]
Scale of the Problem
[Figure: call graph connecting caller-sensitive methods (64 methods, 3,656 call sites) through private methods to public methods (~45,000 methods)]
Exploit Imbalance to Improve Scalability
[Figure: call graph connecting caller-sensitive methods (64 methods, 3,656 call sites) through private methods to public methods (~45,000 methods)]
IFDS Algorithm [17,19] Reports Leaks
Caller Sensitive
?
Public Method
IFDS Algorithm: Computing Summaries
foo(a) {
1: b = a       flow: a → {a, b}
2: c = b       flow: a → {a}, b → {b, c}
3: return c
}
On entry: a → {a}
Composing the flows of statements 1 and 2 gives the summary a → {a, b, c}
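The summary for foo is the composition of the per-statement flows. The sketch below shows only that composition step, not the IFDS tabulation algorithm itself; the map encoding of flow functions and all names are assumptions made for illustration.

```java
import java.util.*;

class SummarySketch {
    // A flow function maps a fact to the facts that hold after the statement.
    // Facts without an entry are treated as mapped to themselves (identity).
    static Set<String> apply(Map<String, Set<String>> flow, String fact) {
        return flow.getOrDefault(fact, Set.of(fact));
    }

    // Composition: push a start fact through the statements in order.
    static Set<String> summary(String startFact, List<Map<String, Set<String>>> flows) {
        Set<String> current = new TreeSet<>(Set.of(startFact));
        for (Map<String, Set<String>> flow : flows) {
            Set<String> next = new TreeSet<>();
            for (String fact : current) {
                next.addAll(apply(flow, fact));
            }
            current = next;
        }
        return current;
    }

    static Set<String> fooSummary() {
        // 1: b = a   gives  a -> {a, b}
        Map<String, Set<String>> stmt1 = Map.of("a", Set.of("a", "b"));
        // 2: c = b   gives  a -> {a}, b -> {b, c}
        Map<String, Set<String>> stmt2 = Map.of("a", Set.of("a"), "b", Set.of("b", "c"));
        return summary("a", List.of(stmt1, stmt2));
    }

    public static void main(String[] args) {
        System.out.println("a -> " + fooSummary());   // prints a -> [a, b, c]
    }
}
```

Once such a summary is known for a method, it can be reused at every call site instead of re-analyzing the method body.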
Path Construction
foo(a) {
1: b = a       flow: a → {a, b};           pred(b) = a, stmt(b) = #1
2: c = b       flow: a → {a}, b → {b, c};  pred(c) = b, stmt(c) = #2
3: return c    summary: a → {a, b, c}
}
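With pred and stmt recorded for each derived fact, a witness path can be rebuilt by following predecessors back to the start fact. This is a minimal sketch under the assumption that each fact has at most one predecessor and the chain is acyclic; the method names are hypothetical.

```java
import java.util.*;

class PathSketch {
    // Walks pred(...) links from the given fact back to the start fact and
    // returns the steps in forward order, each as "stmt: from -> to".
    static List<String> pathTo(String fact, Map<String, String> pred, Map<String, String> stmt) {
        Deque<String> path = new ArrayDeque<>();
        String current = fact;
        while (pred.containsKey(current)) {
            String from = pred.get(current);
            path.addFirst(stmt.get(current) + ": " + from + " -> " + current);
            current = from;
        }
        return new ArrayList<>(path);
    }

    static List<String> example() {
        // pred(b) = a, stmt(b) = #1, pred(c) = b, stmt(c) = #2
        Map<String, String> pred = Map.of("b", "a", "c", "b");
        Map<String, String> stmt = Map.of("b", "#1", "c", "#2");
        return pathTo("c", pred, stmt);
    }

    public static void main(String[] args) {
        System.out.println(example());   // prints [#1: a -> b, #2: b -> c]
    }
}
```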
Path Construction: Merge at Branches
bar(a) {
if (...) {
1: b = a       flow: a → {a, b1};  pred(b1) = a, stmt(b1) = #1
}
else {
2: b = a       flow: a → {a, b2};  pred(b2) = a, stmt(b2) = #2
}
3: return b    merged: a → {a, b1, b2}
}
Results – only Class.forName
[Figure: Runtime [min] versus Maximum Heap Size [GB]; series: Pure Forward Baseline, Independent Inside-Out]
Results – all Caller Sensitive Methods
[Figure: Runtime [min] versus Maximum Heap Size [GB]; series: Pure Forward Baseline, Independent Inside-Out]
Two Synchronized/Dependent Analyses
[Figure: call graph connecting a caller-sensitive method through private methods to public methods]
Two Synchronized/Dependent Analyses
[Figure: call graph connecting a caller-sensitive method through private methods to public methods, with edges marked as balanced return and unbalanced return]
Results – only Class.forName
[Figure: Runtime [min] versus Maximum Heap Size [GB]; series: Pure Forward Baseline, Independent Inside-Out, Dependent Inside-Out]
Results – all Caller Sensitive Methods
[Figure: Runtime [min] versus Maximum Heap Size [GB]; series: Pure Forward Baseline, Independent Inside-Out, Dependent Inside-Out]
Summary
Android Taint Flow
Analysis for App Sets
Will Klieber*, Lori Flynn,
Amar Bhosale, Limin Jia, and Lujo Bauer
Carnegie Mellon University
*presenting
Motivation
Detect malicious apps that leak sensitive data.
E.g., leak contacts list to marketing company.
“All or nothing” permission model.
Apps can collude to leak data.
Evades precise detection if only analyzed individually.
We build upon FlowDroid.
FlowDroid alone handles only intra-component flows.
We extend it to handle inter-app flows.
Introduction: Android
Android apps have four types of components:
Activities (our focus)
Services
Content providers
Broadcast receivers
Intents are messages to components.
Explicit or implicit designation of recipient
Components declare intent filters to receive implicit intents.
Matched based on properties of intents, e.g.:
Action string (e.g., “android.intent.action.VIEW ”)
Data MIME type (e.g., “image/png”)
Introduction
Taint Analysis tracks the flow of sensitive data.
Can be static analysis or dynamic analysis.
Our analysis is static.
We build upon existing Android static analyses:
FlowDroid [1]: finds intra-component information flow
Epicc [2]: identifies intent specifications
[1] S. Arzt et al., “FlowDroid: Precise Context, Flow, Field, Object-sensitive and
Lifecycle-aware Taint Analysis for Android Apps”. PLDI, 2014.
[2] D. Octeau et al., “Effective inter-component communication mapping in
Android with Epicc: An essential step towards holistic security analysis”.
USENIX Security, 2013.
Our Contribution
We developed a static analyzer called “DidFail”
(“Droid Intent Data Flow Analysis for Information Leakage”).
Finds flows of sensitive data across app boundaries.
Source code and binaries available at:
http://www.cert.org/secure-coding/tools/didfail.cfm
(or google “DidFail SOAP”)
Two-phase analysis:
1. Analyze each app in isolation.
2. Use the result of Phase-1 analysis to determine inter-app flows.
We tested our analyzer on two sets of apps.
Terminology
Definition. A source is an external resource (external to the app,
not necessarily external to the phone) from which data is read.
Definition. A sink is an external resource to which data is written.
For example,
Sources: Device ID, contacts, photos, current location, etc.
Sinks: Internet, outbound text messages, file system, etc.
Motivating Example
App SendSMS.apk sends an intent (a message) to Echoer.apk,
which sends a result back.
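[Figure: in SendSMS.apk, the Device ID (source) flows into startActivityForResult(); Echoer.apk reads the intent with getIntent() and replies with setResult(); back in SendSMS.apk, onActivityResult() passes the data to a Text Message (sink)]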
SendSMS.apk tries to launder the taint through Echoer.apk.
Existing static analysis tools cannot precisely detect such inter-app data flows.
Analysis Design
Phase 1: Each app analyzed once, in isolation.
FlowDroid: Finds tainted dataflow from sources to sinks.
Received intents are considered sources.
Sent intents are considered sinks.
Epicc: Determines properties of intents.
Each intent-sending call site is labelled with a unique intent ID.
Phase 2: Analyze a set of apps:
For each intent sent by a component,
determine which components can
receive the intent.
Generate & solve taint flow equations.
Running Example
Three components: C1, C2, C3.
C1 = SendSMS
C2 = Echoer
C3 is similar to C1
[Figure: components C1, C2, C3 with sources src1, src3, sinks sink1, sink3, and intents I1, I3]
• sink1 is tainted with only src1.
• sink3 is tainted with only src3.
Running Example
[Figure: components C1, C2, C3 with sources src1, src3, sinks sink1, sink3, and intents I1, I3]
Final Sink Taints:
• T(sink1) = {src1}
• T(sink3) = {src3}
Phase-1 Flow Equations
Analyze each component separately.
Phase 1 Flow Equations:
[Figure: components C1, C2, C3 with sources src1, src3 and sinks sink1, sink3]
Notation
• An asterisk (“*”) indicates an unknown component.
Phase-2 Flow Equations
Instantiate Phase-1 equations for all possible sender/receiver pairs.
[Figure: components C1, C2, C3 with sources src1, src3, sinks sink1, sink3, and intents I1, I3, shown next to the Phase 1 and Phase 2 flow equations]
Phase-2 Taint Equations
For each flow equation “src → sink”, generate taint equation “T(src) ⊆ T(sink)”.
[Figure: components C1, C2, C3 with sources src1, src3, sinks sink1, sink3, and intents I1, I3, shown next to the Phase 2 flow and taint equations]
If s is a non-intent source,
then T(s) = {s}.
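Taint equations of this form can be solved by iterating to a least fixpoint. The sketch below does that over a hand-written flow list; the flows are hypothetical, written in the spirit of the running example (names such as "I1" and "R(I1)" stand for an intent and its result), and are not the equations from the slides.

```java
import java.util.*;

class TaintEquationSketch {
    // Solves T(src) ⊆ T(sink) for every flow {src, sink}, with T(s) = {s}
    // for each non-intent source s, by iterating until nothing changes.
    static Map<String, Set<String>> solve(List<String[]> flows, Set<String> nonIntentSources) {
        Map<String, Set<String>> taint = new HashMap<>();
        for (String s : nonIntentSources) {
            taint.computeIfAbsent(s, k -> new TreeSet<>()).add(s);
        }
        boolean changed = true;
        while (changed) {
            changed = false;
            for (String[] flow : flows) {
                Set<String> from = taint.getOrDefault(flow[0], Set.of());
                Set<String> to = taint.computeIfAbsent(flow[1], k -> new TreeSet<>());
                changed |= to.addAll(from);
            }
        }
        return taint;
    }

    static Map<String, Set<String>> example() {
        // Hypothetical flow equations: each source reaches its own sink via an intent and its result.
        List<String[]> flows = List.of(
            new String[] {"src1", "I1"}, new String[] {"I1", "R(I1)"}, new String[] {"R(I1)", "sink1"},
            new String[] {"src3", "I3"}, new String[] {"I3", "R(I3)"}, new String[] {"R(I3)", "sink3"});
        return solve(flows, Set.of("src1", "src3"));
    }

    public static void main(String[] args) {
        Map<String, Set<String>> t = example();
        System.out.println("T(sink1) = " + t.get("sink1"));   // prints T(sink1) = [src1]
        System.out.println("T(sink3) = " + t.get("sink3"));   // prints T(sink3) = [src3]
    }
}
```

The loop terminates because taint sets only grow and are bounded by the finite set of sources.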
[Figure: Phase 1 pipeline with the stages Original APK, TransformAPK, Epicc, FlowDroid (modified), and Extract manifest]
Implementation: Phase 1
APK Transformer
Assigns unique Intent ID to each call site of intent-sending methods.
Enables matching intents from the output of FlowDroid and Epicc
Uses Soot to read APK, modify code (in Jimple), and write new APK.
Problem: Epicc is closed-source. How to make it emit Intent IDs?
Solution (hack): Add putExtra call with Intent ID.
Implementation: Phase 1
FlowDroid Modifications:
Extract intent IDs inserted by APK Transformer, and include in output.
When sink is an intent, identify the sending component.
In base.startActivity, assume base is the sending component.
(Soundness?)
For deterministic output: Sort the final list of flows.
Implementation: Phase 2
Phase 2
Take the Phase 1 output.
Generate and solve the data-flow equations.
Output:
1. Directed graph indicating information flow between
sources, intents, intent results, and sinks.
2. Taintedness of each sink.
Testing DidFail analyzer: App Set 1
SendSMS.apk: Reads device ID, passes through Echoer, and leaks it via SMS
Echoer.apk: Echoes the data received via an intent
WriteFile.apk: Reads physical location (from GPS), passes through Echoer, and writes it to a file
Testing DidFail analyzer: App Set 2 (DroidBench)
Graph generated using GraphViz.
Int3 = I(IntentSink2.apk, IntentSource1.apk, id3)
Int4 = I(IntentSource1.apk, IntentSink1.apk, id4)
Res8 = R(Int4)
Src15 = getDeviceId
Snk13 = Log.i
Some taint flows:
Limitations
Unsoundness
Inherited from FlowDroid/Epicc
Native code, reflection, etc.
Shared static fields
Implicit flows
Currently, only activity intents
Bugs
Imprecision
Inherited from FlowDroid/Epicc
DidFail doesn’t consider permissions when matching intents
All intents received by a component are conflated together as a single source
Use of Two-Phase Approach in App Stores
We envision that the two-phase analysis can be used as follows:
An app store runs the phase-1 analysis for each app it has.
When the user wants to download a new app, the store runs the phase-2
analysis and indicates new flows.
Fast response to user.
DidFail vs IccTA
IccTA was developed (at roughly the same time as DidFail) by:
Li Li, Alexandre Bartel, Jacques Klein, Yves Le Traon (Luxembourg);
Steven Arzt, Siegfried Rasthofer, Eric Bodden (EC SPRIDE);
Damien Octeau, Patrick McDaniel (Penn State).
IccTA uses a one-phase analysis
IccTA is more precise than DidFail’s two-phase analysis.
Two-phase DidFail analysis allows fast 2nd-phase computation.
Future collaboration between IccTA and DidFail teams?
Conclusion
We introduced a new analysis that integrates and enhances existing
Android app static analyses.
Demonstrated feasibility by implementing a prototype and testing it.
Two-phase analysis can be used by app store to provide fast response.
Future work:
Implicit flows
Static fields
Distinguish different received intents
Other data channels (file system, non-activity intents)
Etc.
Concurrent Modification
class Make {
private Worklist worklist;
public static void main(String[] args) {
Make m = new Make();
m.initializeWorklist(args);
m.processWorklist();
}
void initializeWorklist(String[] args) { ...; worklist = new Worklist(); ... }
void processWorklist() {
HashSet s = worklist.unprocessedItems();
for (Iterator i = s.iterator(); i.hasNext(); ) {
Object item = i.next(); // CME may occur here
if (...) processItem(item);
}
}
void processItem(Object i) { ...; doSubproblem(...); }
// Adds to the same set that processWorklist is iterating over.
void doSubproblem(...) { ... worklist.addItem(newitem); ... }
}
public class Worklist {
HashSet s;
public Worklist() { s = new HashSet(); ... }
public void addItem(Object item) { s.add(item); }
public HashSet unprocessedItems() { return s; } // exposes the internal set
}
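The failure can be reproduced in a few lines: adding to a HashSet while an iterator over it is live makes the next call to next() throw. The sketch below is a stripped-down stand-in for the Make/Worklist code above, with hypothetical names and string items.

```java
import java.util.*;

class WorklistCmeSketch {
    // Returns true if adding to the set while iterating over it raises a CME.
    static boolean addWhileIterating() {
        Set<String> items = new HashSet<>(List.of("a", "b", "c"));
        try {
            for (Iterator<String> i = items.iterator(); i.hasNext(); ) {
                String item = i.next();      // the second call throws, after the add below
                items.add(item + "-sub");    // plays the role of worklist.addItem(newitem)
            }
            return false;
        } catch (ConcurrentModificationException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("CME raised: " + addWhileIterating());   // prints CME raised: true
    }
}
```

The exception surfaces at next(), far from the add that caused it, which is what makes this class of bug hard to locate by inspection.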
An Illustrating Example
/* 0 */ Set v = new Set();
/* 1 */ Iterator i1 = v.iterator();
/* 2 */ Iterator i2 = v.iterator();
/* 3 */ Iterator i3 = i1;
/* 4 */ i1.next();
// The following update via i1 invalidates the
// iterator referred to by i2.
/* 5 */ i1.remove();
/* 6 */ if (...) { i2.next(); /* CME thrown */ }
// i3 refers to the same, valid, iterator as i1
/* 7 */ if (...) { i3.next(); /* CME not thrown */ }
// The following invalidates all iterators over v
/* 8 */ v.add("...");
/* 9 */ if (...) { i1.next(); /* CME thrown */ }
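The example can be replayed against java.util.HashSet to check the three outcomes. This is a sketch that assumes HashSet in place of the slide's Set class and pre-populates the set so that next() has elements to return; the helper names are hypothetical.

```java
import java.util.*;

class IteratorInvalidationSketch {
    // Reports whether calling next() on the iterator raises a CME.
    static boolean throwsCme(Iterator<String> it) {
        try {
            it.next();
            return false;
        } catch (ConcurrentModificationException e) {
            return true;
        }
    }

    static List<Boolean> run() {
        Set<String> v = new HashSet<>(List.of("x", "y", "z"));
        Iterator<String> i1 = v.iterator();
        Iterator<String> i2 = v.iterator();
        Iterator<String> i3 = i1;            // alias of i1
        i1.next();
        i1.remove();                         // update via i1 invalidates i2
        boolean line6 = throwsCme(i2);       // i2 is stale
        boolean line7 = throwsCme(i3);       // i3 is the same valid iterator as i1
        v.add("...");                        // direct update invalidates all iterators over v
        boolean line9 = throwsCme(i1);
        return List.of(line6, line7, line9);
    }

    public static void main(String[] args) {
        System.out.println(run());           // prints [true, false, true]
    }
}
```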
Java Project 3:
• Read PLDI’02 paper Deriving Specialized Program Analyses for Certifying Component-Client Conformance
• Design a simple abstract domain for CME
• Implement in Soot
SOOT
By Joe Palmer
Information taken from http://www.sable.mcgill.ca/soot/tutorial/pldi03/tutorial.pdf
General Overview
Developed by Sable Research Group out
of McGill University in 1996-1997
Used to optimize Java Bytecode
4 source languages
4 intermediate representations used
Source Languages
Primarily takes Java Source as its input
Can also take:
SML
Scheme
Eiffel
Scala
I.R.’s
Baf:
Streamlined, stack-based representation of bytecode
Abstracts type dependent variations of expressions into a single expression
Jimple (main IR used!!):
Stack-less, typed, 3-address representation of bytecode
Mix between Java source and Java bytecode
Linearization of a single expression into 3 separate statements
Only refers to 3 local vars or constants at once
Only 15 Jimple instructions are used
Compared to 200 possible instructions in Java bytecode!
Shimple:
SSA-form version of Jimple
Each local var has a single static point of definition (never reassigned)
Uses Phi-nodes for control flow
Grimp:
Similar to Jimple but allows trees of expressions together with a representation of a “new” operator
Expressions are “aggregated”
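The linearization that a 3-address form performs can be sketched independently of Soot: each operator in an expression tree gets its own statement and a fresh temporary. The code below does not use the Soot API or emit real Jimple; the Expr class and the "$t" temporary naming are assumptions made for illustration.

```java
import java.util.*;

class ThreeAddressSketch {
    // Minimal expression tree: either a leaf (variable or constant) or a binary operation.
    static final class Expr {
        final String leaf;
        final String op;
        final Expr left, right;

        Expr(String leaf) {
            this.leaf = leaf; this.op = null; this.left = null; this.right = null;
        }

        Expr(Expr left, String op, Expr right) {
            this.leaf = null; this.op = op; this.left = left; this.right = right;
        }
    }

    private final List<String> stmts = new ArrayList<>();
    private int temps = 0;

    // Emits one statement per operator and returns the name that holds the result.
    private String lower(Expr e) {
        if (e.leaf != null) {
            return e.leaf;
        }
        String l = lower(e.left);
        String r = lower(e.right);
        String t = "$t" + temps++;
        stmts.add(t + " = " + l + " " + e.op + " " + r);
        return t;
    }

    static List<String> linearize(String target, Expr e) {
        ThreeAddressSketch s = new ThreeAddressSketch();
        String result = s.lower(e);
        s.stmts.add(target + " = " + result);
        return s.stmts;
    }

    public static void main(String[] args) {
        // x = a + b * c
        Expr e = new Expr(new Expr("a"), "+", new Expr(new Expr("b"), "*", new Expr("c")));
        // Prints: $t0 = b * c, then $t1 = a + $t0, then x = $t1
        linearize("x", e).forEach(System.out::println);
    }
}
```

Every emitted statement mentions at most three names, which is the property that makes dataflow analyses over such an IR simple to write.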
Phases of the Optimization