Menu
cs3102: Theory of Computation
Class 21:
Undecidability in Theory and Practice
Exam 2: Out at end of class
today, due Tuesday at 2:01pm
Spring 2010
University of Virginia
David Evans
What does undecidability mean for problems
real people (not just CS theorists and Busy
Beavers) care about?
• Turing-equivalent Grammar
• Universal Programming Languages
• “Return-Oriented Programming”
• Problems Computers Can and Cannot Solve
Adapted from Class 1:
Models of Computation
Machine | Replacement Grammar
Finite Automata (Classes 2-5) | Regular Grammar (A→aB)
Pushdown Automata (add a stack) (Classes 6-7) | Context-free Grammar (Classes 7-9) (A→BC)
Turing machine (add an infinite tape) (Classes 14-20) | ?
Unrestricted Grammar
Right and left sides of a grammar rule can be any sequence of
terminals and nonterminals.
aXbY → cZd
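A rule such as aXbY → cZd is just a substring replacement. The sketch below is a minimal illustration, assuming every terminal and nonterminal is a single character; the function name `one_step` is hypothetical and not from the slides.

```python
def one_step(s, rules):
    """Return every string reachable from s by applying one rule once."""
    results = set()
    for lhs, rhs in rules:
        start = s.find(lhs)
        while start != -1:
            # Replace this one occurrence of the left side with the right side.
            results.add(s[:start] + rhs + s[start + len(lhs):])
            start = s.find(lhs, start + 1)
    return results


# The rule from the slide: both sides are arbitrary symbol sequences.
rules = [("aXbY", "cZd")]
print(one_step("aaXbYb", rules))  # {'acZdb'}
```

Unlike a context-free rule, the left side here may contain several symbols, including terminals.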
How can we prove unrestricted grammars are equivalent to a Turing machine?
Simulation Proof
Show that we can simulate every Unrestricted Grammar with
some TM.
Fairly easy, but tedious (not shown): design a TM that
does the grammar replacements by writing on the tape.
Show that we can simulate every TM with some Unrestricted
Grammar.
Simulating TM with UG
(No replacement rules with R on left side)
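The construction on these slides is not recoverable from the text, so the following is only a sketch of the usual core idea: a TM configuration can be written as a string with the state symbol placed just left of the scanned cell, and each transition then becomes a local replacement rule. The encoding, the blank symbol `_`, and all function names are assumptions made for illustration.

```python
BLANK = "_"


def rules_from_tm(delta, tape_symbols):
    """Turn TM transitions into rewrite rules (lhs tuple -> rhs tuple)."""
    rules = []
    for (q, a), (r, b, move) in delta.items():
        if move == "R":
            # q a  ->  b r   (write b, head moves right)
            rules.append(((q, a), (b, r)))
        else:
            # c q a  ->  r c b   for every tape symbol c (head moves left)
            for c in tape_symbols:
                rules.append(((c, q, a), (r, c, b)))
    return rules


def step(config, rules):
    """Apply the first rule whose left side occurs in config; None if stuck."""
    for lhs, rhs in rules:
        n = len(lhs)
        for i in range(len(config) - n + 1):
            if tuple(config[i:i + n]) == lhs:
                return config[:i] + list(rhs) + config[i + n:]
    return None


def run(config, rules, max_steps=1000):
    """Rewrite until no rule applies, padding the tape with blanks on the right."""
    for _ in range(max_steps):
        if config[-1] != BLANK:
            config = config + [BLANK]
        nxt = step(config, rules)
        if nxt is None:
            return config
        config = nxt
    raise RuntimeError("step budget exhausted")


# Hypothetical machine: flip every bit, halt in qH at the first blank.
FLIP = {
    ("q0", "0"): ("q0", "1", "R"),
    ("q0", "1"): ("q0", "0", "R"),
    ("q0", "_"): ("qH", "_", "R"),
}
rules = rules_from_tm(FLIP, ["0", "1", "_"])
print("".join(run(["q0", "1", "0", "1"], rules)))  # 010_qH_
```

This shows only that one TM step is one grammar replacement. A full proof also needs rules that set up the initial configuration and clean up after the machine accepts.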
[Figure: Initial Configuration, tape cells w0 w1 w2 ... with the machine in state q0]
Models of Computation
Machine | Replacement Grammar
Finite Automata (Classes 2-5) | Regular Grammar (A→aB)
Pushdown Automata (add a stack) (Classes 6-7) | Context-free Grammar (Classes 7-9) (A→BC)
Turing machine (add an infinite tape) (Classes 14-20) | Unrestricted Grammar (Class 21) (α→β)
Universal Programming Language
• Definition: a programming language that can
describe every algorithm.
• Equivalently: a programming language that
can simulate every Turing Machine.
• Equivalently: a programming language in
which you can implement a Universal Turing
Machine.
Which of these are
Universal Programming Languages?
BASIC
C++
COBOL
Python
x86
Java
HTML
C#
PDF
JavaScript
PostScript
Fortran
Scheme
TeX
Ruby
Proofs
• BASIC, C, C++, C#, Fortran, Java, JavaScript, PDF,
PostScript, Python, Ruby, Scheme, TeX, etc. are
universal
– Proof: implement a TM simulator in the PL
• HTML (before HTML5) is not universal:
– Proof: show some algorithm that cannot be
implemented in HTML
• An infinite loop
– HTML5 might be a universal programming
language! (Proof is worth challenge bonus.)
Why is it impossible for a
programming language to be
both universal
and resource-constrained?
Resource-constrained means it is possible to determine
an upper bound on the resources any program in the
language can consume.
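The universality proof strategy, implementing a TM simulator in the language, can be sketched in Python as follows. The transition-table format, the state names, and the example machine (which accepts strings of the form a^n b^n) are all assumptions made for illustration. The `max_steps` budget is also an assumption: a faithful simulator would run forever on a machine that does not halt.

```python
from collections import defaultdict


def run_tm(delta, tape_input, start="q0", accept="qA", max_steps=10_000):
    """Simulate a one-tape TM. A missing transition means reject.

    Returns True (accept), False (reject), or None (budget exhausted).
    """
    tape = defaultdict(lambda: "_", enumerate(tape_input))  # unbounded tape
    state, head = start, 0
    for _ in range(max_steps):
        if state == accept:
            return True
        key = (state, tape[head])
        if key not in delta:
            return False
        state, tape[head], move = delta[key]
        head = max(0, head + (1 if move == "R" else -1))
    return None


# Hypothetical machine for { a^n b^n }: mark one a as X, match it with a b (Y).
DELTA = {
    ("q0", "a"): ("q1", "X", "R"),
    ("q0", "Y"): ("q3", "Y", "R"),
    ("q0", "_"): ("qA", "_", "R"),
    ("q1", "a"): ("q1", "a", "R"),
    ("q1", "Y"): ("q1", "Y", "R"),
    ("q1", "b"): ("q2", "Y", "L"),
    ("q2", "a"): ("q2", "a", "L"),
    ("q2", "Y"): ("q2", "Y", "L"),
    ("q2", "X"): ("q0", "X", "R"),
    ("q3", "Y"): ("q3", "Y", "R"),
    ("q3", "_"): ("qA", "_", "R"),
}

print(run_tm(DELTA, "aabb"))  # True
print(run_tm(DELTA, "aab"))   # False
```

Because `run_tm` takes the machine as data, it plays the role of a Universal Turing Machine, up to the step budget and the host's finite memory.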
Why so many equally powerful
programming languages?
All universal programming
languages are equivalent in power:
they can all simulate a TM, which
can carry out any mechanical
algorithm.
Proliferation of Universal PLs
• “Aesthetics”
– Some people like :=, others prefer =.
– Some people think whitespace shouldn’t matter (e.g.,
Java), others think programs should be formatted like
they mean (e.g., Python)
– Some people like goto, others like throw.
• Expressiveness vs. Simplicity
x86
– Hard to write programs in SUBLEQ
• Expressiveness vs. “Truthiness”
– How much you can say with a little code vs. how likely
it is your code means what you think it does
Programming Language Design Space
[Figure: languages plotted by Expressiveness (low to high) against “Truthiness” (more mistake prone to less mistake prone). Labels: C, Scheme, Python, Java, Ada, Spec#; annotation: strict typing, static]
Scheme:
(display "Hello!")
Python:
print ("Hello!")
Java:
public class HelloWorld {
  public static void main(String[] args) {
    System.out.println ("Hello!");
  }
}
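To make the Expressiveness vs. Simplicity point concrete, here is a minimal sketch of a SUBLEQ interpreter. SUBLEQ has a single instruction, "subtract and branch if less than or equal to zero". The memory layout and the convention that a negative jump target halts are assumptions, since conventions vary between descriptions of the language.

```python
def run_subleq(mem, max_steps=10_000):
    """Run a SUBLEQ program stored in mem; return the final memory.

    Each instruction is three cells a, b, c:
        mem[b] -= mem[a]; if mem[b] <= 0, jump to c; else fall through.
    A negative program counter halts (assumed convention).
    """
    mem = list(mem)
    pc = 0
    for _ in range(max_steps):
        if pc < 0:
            return mem
        a, b, c = mem[pc:pc + 3]
        mem[b] -= mem[a]
        pc = c if mem[b] <= 0 else pc + 3
    raise RuntimeError("step budget exhausted")


# Adding two numbers takes three instructions: B = B + A using scratch cell Z.
# Cells 9, 10, 11 hold A, B, Z.
ADD = [
    9, 11, 3,    # Z = Z - A   (Z becomes -A)
    11, 10, 6,   # B = B - Z   (B becomes B + A)
    11, 11, -1,  # Z = Z - Z   (reset Z, then halt)
    7, 35, 0,    # data: A = 7, B = 35, Z = 0
]
print(run_subleq(ADD)[10])  # 42
```

The language could hardly be simpler, yet even addition needs a scratch cell and three instructions, which is the sense in which SUBLEQ programs are hard to write.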
Do most x86 programs contain
Universal Turing Machines?
Hovav Shacham. The Geometry of Innocent Flesh on the Bone:
Return-into-libc without Function Calls (on the x86). CCS 2007.
CS 3330 Condensed
[Figure: fetch-execute cycle. The Instruction Pointer (ip) selects what to load from Memory: Fetch Instruction, Update ip, Execute Instruction. The Execution Stack (not like PDA stack) holds Return Addresses.]
CS 2150 Really Condensed
x86 programs are just sequences of bytes (1 byte = 8 bits = 2 hex
characters)
Instructions use a variable-length encoding (1-15 bytes)
First byte is opcode that identifies the type of instruction
bb ef be ad de    MOV $0xDEADBEEF, %ebx
5-byte instruction that writes the constant 0xDEADBEEF into register %ebx
c3    RET
1-byte instruction that returns (jumps to the return address that is
stored in a location on the stack)
eb fe    JMP -2
2-byte instruction that moves the instruction pointer back 2
Sequences of Instructions
f7 c7 07 00 00 00 0f 95 45 c3
Decoded from the first byte:
f7 c7 07 00 00 00    test $0x00000007, %edi
0f 95 45 c3          setnzb -61(%ebp)
Decoded starting one byte later:
c7 07 00 00 00 0f    movl $0x0f000000, (%edi)
95                   xchg %ebp, %eax
45                   inc %ebp
c3                   ret
(Example from Shacham’s paper)
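The two readings of the same ten bytes can be reproduced with a toy decoder. This is only a sketch: it recognizes just the handful of encodings that appear in this example and is in no way a real x86 disassembler. The function name `decode` is hypothetical.

```python
CODE = bytes.fromhex("f7c7070000000f9545c3")


def decode(code, start):
    """Decode code from offset start, knowing only this example's encodings."""
    out, i = [], start
    while i < len(code):
        if code[i:i + 2] == b"\xf7\xc7":        # test $imm32, %edi
            imm = int.from_bytes(code[i + 2:i + 6], "little")
            out.append(f"test $0x{imm:08x}, %edi")
            i += 6
        elif code[i:i + 2] == b"\xc7\x07":      # movl $imm32, (%edi)
            imm = int.from_bytes(code[i + 2:i + 6], "little")
            out.append(f"movl $0x{imm:08x}, (%edi)")
            i += 6
        elif code[i:i + 3] == b"\x0f\x95\x45":  # setnz disp8(%ebp)
            disp = int.from_bytes(code[i + 3:i + 4], "little", signed=True)
            out.append(f"setnzb {disp}(%ebp)")
            i += 4
        elif code[i] == 0x95:
            out.append("xchg %ebp, %eax")
            i += 1
        elif code[i] == 0x45:
            out.append("inc %ebp")
            i += 1
        elif code[i] == 0xC3:
            out.append("ret")
            i += 1
        else:
            raise ValueError(f"byte {code[i]:#04x} not handled by this toy decoder")
    return out


print(decode(CODE, 0))  # the instructions the compiler intended
print(decode(CODE, 1))  # a different sequence, ending in ret
```

Starting one byte into the intended instruction yields a different, equally valid sequence that ends in `ret`. That is what makes unintended instruction sequences usable as building blocks.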
Injecting Malicious Code
C Program
int main (void) {
  int x = 9;
  char s[4];
  gets(s);
  printf ("s is: %s\n", s);
  printf ("x is: %d\n", x);
}
Execution Stack (top of frame first)
return address
x
s[3]
s[2]
s[1]
s[0]
Input bytes a, b, c, d, e, f, g, h are written upward starting at s[0].
Buffer Overflows
#include <stdio.h>
int main (void) {
  int x = 9;
  char s[4];
  gets(s); /* no bounds check: long input overflows s */
  printf ("s is: %s\n", s);
  printf ("x is: %d\n", x);
}
> gcc -o bounds bounds.c
> bounds
abcdefghijkl (User input)
s is: abcdefghijkl
x is: 9
> bounds
abcdefghijklm
s is: abcdefghijklm
x is: 1828716553 = 0x6d000009
> bounds
abcdefghijkln
s is: abcdefghijkln
x is: 1845493769 = 0x6e000009
> bounds
aaa... [a few thousand characters]
crashes shell
Note: your results may vary (depending on machine, compiler, what
else is running, time of day, etc.). This is what makes C fun!
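The outputs above can be reproduced with a small model of the stack frame. The layout is an assumption chosen so that the model matches the slide: x starts 12 bytes above s[0] and is stored big-endian, which is consistent with 'm' (0x6d) landing in the most significant byte. A different compiler or machine would give a different layout.

```python
import struct

S_OFFSET = 0     # char s[4] starts here
X_OFFSET = 12    # assumed: int x starts 12 bytes above s[0]
FRAME_SIZE = 16  # assumed size of the modeled region


def run_bounds(user_input: bytes) -> int:
    """Model gets(s) writing into the frame; return the resulting value of x."""
    frame = bytearray(FRAME_SIZE)
    struct.pack_into(">i", frame, X_OFFSET, 9)   # int x = 9; (big-endian, assumed)
    data = user_input + b"\x00"                  # gets() appends a NUL terminator
    frame[S_OFFSET:S_OFFSET + len(data)] = data  # no bounds check on the copy
    (x,) = struct.unpack_from(">i", frame, X_OFFSET)
    return x


print(run_bounds(b"abcdefghijkl"))   # 9
print(run_bounds(b"abcdefghijklm"))  # 1828716553
print(run_bounds(b"abcdefghijkln"))  # 1845493769
```

With 12 characters only the NUL terminator reaches x, and it lands on a byte that was already zero, so x is unchanged. The 13th character is the first to alter it. The model stops short of the return address, which is where much longer input does its damage.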
Code Red
What does this kind of mistake look like
in a popular server?
Defenses
• Use a type-safe programming language (e.g.,
bounds checking)
• Write-xor-Execute pages
– When the OS loads a page into memory, it is
marked as either executable or writable: can’t be
both
– Hence: attacker can inject all the code it wants on
the stack, but can’t jump to it and execute it
Does it really work?
• Likelihood of finding enough gadgets in
“random” bytes to make a Turing-complete set:
– libc (C library included in nearly all Unix programs)
contains more than enough (18MB, so expect
~74000 RET (c3) bytes)
• Demonstration of attack on voting machine:
http://www.youtube.com/watch?v=lsfG3KPrD1I
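The ~74000 figure follows from treating the bytes of libc as uniformly random, so that each byte equals c3 with probability 1/256. The check below is a sketch: reading 18MB as 18 × 1024 × 1024 bytes is an assumption, and `count_ret_bytes` is a hypothetical helper for counting c3 bytes in any byte string.

```python
LIBC_BYTES = 18 * 1024 * 1024  # the slide's 18MB, read as MiB (assumption)

# Each byte value is equally likely under the "random bytes" model.
expected_rets = LIBC_BYTES / 256
print(round(expected_rets))  # 73728


def count_ret_bytes(code: bytes) -> int:
    """Count bytes equal to 0xc3, the one-byte RET opcode."""
    return code.count(0xC3)


# The ten-byte example from Shacham's paper contains one c3 byte.
print(count_ret_bytes(bytes.fromhex("f7c7070000000f9545c3")))  # 1
```

Real code is not uniformly random, so this is only an order-of-magnitude estimate.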
“Return-Oriented Programming”
[Figure: bytes of program memory shown as hex, alongside an Execution Stack holding Return Address 1 through Return Address 6]
Defeats the Write-xor-Execute (W⊕X) defense! Attacker can run any code they want by finding a
Turing-complete set of “gadgets” in your program and jumping to them!
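The idea can be sketched with a toy machine in which the attacker injects no code at all, only a stack of return addresses and data words. The gadget addresses, register names, and the three gadgets are hypothetical. Each stands for a short instruction sequence that already exists in the program and ends in RET.

```python
def pop_eax(regs, stack):
    regs["eax"] = stack.pop()        # models: pop %eax; ret


def pop_ebx(regs, stack):
    regs["ebx"] = stack.pop()        # models: pop %ebx; ret


def add_eax_ebx(regs, stack):
    regs["eax"] += regs["ebx"]       # models: add %ebx, %eax; ret


# Hypothetical addresses of gadgets found in existing, executable code.
GADGETS = {0x1000: pop_eax, 0x1010: pop_ebx, 0x1020: add_eax_ebx}


def run_chain(stack_words):
    """Execute a chain: every RET pops the next gadget address off the stack."""
    stack = list(reversed(stack_words))  # so pop() yields the words in order
    regs = {"eax": 0, "ebx": 0}
    while stack:
        addr = stack.pop()               # RET: next return address
        GADGETS[addr](regs, stack)       # the gadget may pop data words too
    return regs


# Attacker-controlled stack contents: addresses interleaved with data.
chain = [0x1000, 2, 0x1010, 40, 0x1020]
print(run_chain(chain)["eax"])  # 42
```

Nothing on the stack is ever executed, so marking the stack non-executable does not stop the chain. The "program" is the sequence of return addresses.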
Vulnerability Detection
Input: an x86 program P
Output: True if there is some input w, such that
running P on w allows the attacker to
overwrite return addresses on stack; False
otherwise.
Example: Morris Internet Worm (1988)
P = fingerd
– Program used to query user status (running on most
Unix servers)
isVulnerable(P)?
Yes, for w = “nop (×400) pushl $68732f pushl $6e69622f
movl sp,r10 pushl $0 pushl $0 pushl r10 pushl $3 movl
sp,ap chmk $3b”
– Worm infected several thousand computers (~10% of
Internet in 1988)
Vulnerability Detection is Undecidable
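The proof on these slides is not reproduced in the text. A standard argument is a reduction from the halting problem, sketched below. Here `is_vulnerable` stands for the hypothetical decider, which cannot actually exist, and `Overflow` is a stand-in for "the attacker overwrote a return address". All names are assumptions made for illustration.

```python
class Overflow(Exception):
    """Stands in for 'a return address on the stack was overwritten'."""


def make_p(machine, w):
    """Build a program that is vulnerable if and only if machine halts on w."""
    def p(attacker_input):
        machine(w)                      # may run forever
        raise Overflow(attacker_input)  # reached only if machine(w) halted
    return p


def halts(machine, w, is_vulnerable):
    """If is_vulnerable were a correct decider, this would decide halting."""
    return is_vulnerable(make_p(machine, w))


# The constructed program reaches its vulnerable point when the machine halts.
p = make_p(lambda w: None, "some input")
try:
    p(b"AAAA")
except Overflow:
    print("vulnerable point reached")
```

Since no program can decide halting, no program can decide vulnerability either. The construction assumes that `machine` itself contains no overflow.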
“Solving” Undecidable Problems
“Impossibility” of Vulnerability Detection
• Undecidable means there is no program that
1. Always gives the correct answer, and
2. Always terminates
• Must give up one of these:
– Giving up #2 is not acceptable in most cases
– Must give up #1: cannot be correct on all inputs
• Or change the problem
– e.g., modify P to make it invulnerable, etc.
Actual Vulnerability Detectors
• Sometimes give the wrong answer:
– “False positive”: say P is vulnerable when it isn’t
– “False negative”: say P is safe when it is vulnerable
• Heuristics to find common errors
• Heuristics to rank-order possible problems
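A heuristic detector can be as simple as flagging calls to C library functions that do no bounds checking. The sketch below is illustrative only: the list of functions is an assumption and far from exhaustive, and it is not how Prefix or any particular tool works.

```python
import re

UNSAFE_CALLS = ("gets", "strcpy", "sprintf")  # illustrative, not exhaustive
PATTERN = re.compile(r"\b(" + "|".join(UNSAFE_CALLS) + r")\s*\(")


def flag_unsafe_calls(c_source):
    """Return (line number, function name) for each suspicious call."""
    findings = []
    for lineno, line in enumerate(c_source.splitlines(), start=1):
        for match in PATTERN.finditer(line):
            findings.append((lineno, match.group(1)))
    return findings


SOURCE = """int main (void) {
  int x = 9;
  char s[4];
  gets(s);
  /* gets(s) is unsafe */
  return 0;
}
"""
print(flag_unsafe_calls(SOURCE))  # [(4, 'gets'), (5, 'gets')]
```

The detector always terminates but is not always correct. The hit on line 5 is a false positive, because the call sits inside a comment. A hand-written copy loop with no bounds check would be a false negative.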
Can Microsoft squash 63,000 bugs in Windows 2000?
… Overall, there are more than 65,000 "potential issues" that could
emerge as problems, as discovered by Microsoft's Prefix tool. Microsoft is
estimating that 28,000 of these are likely to be "real" problems.
Computability in
Theory and Practice
(Intellectual Computability
Discussion on TV)
http://video.google.com/videoplay?docid=1623254076490030585#
Ali G Problem
Input: a list of numbers (mostly 9s)
Output: the product of the numbers
L_ALIG = { <k_0, k_1, …, k_n, p> | each k_i
represents a number and p represents a
number that is the product of all the k_i }
Is L_ALIG decidable?
Yes. It is easy to see a simple algorithm
(e.g., elementary school multiplication)
that decides it.
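A decider for this language is a few lines of Python. The function name `in_lalig` is hypothetical, and the input is assumed to be already parsed into integers.

```python
from math import prod


def in_lalig(ks, p):
    """Decide whether p is the product of all the numbers in ks."""
    return prod(ks) == p


print(in_lalig([9, 9, 9, 9], 6561))  # True
print(in_lalig([9, 9], 80))          # False
```

Python integers are arbitrary precision, so the algorithm is correct for every input in principle. In practice the product of a long enough list of 9s will not fit in any real machine's memory.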
Can real computers solve it?
Ali G was Right!
• Theory assumes ideal computers:
– Unlimited, perfect memory
– Unlimited (finite) time
• Real computers have:
– Limited memory, time, power outages, flaky
programming languages, etc.
– There are many decidable problems we cannot
solve with real computers: the actual inputs do
matter (in practice, but not in theory!)
Charge
• Exam 2 out now
• Due at beginning of class, Tuesday
• It has some pretty tough questions (and no
really easy questions): don’t get stressed out if
you can’t answer everything