Probabilistic Higher Order Grammar: Probabilistic Model for Code

Probabilistic Higher Order Grammar:
Probabilistic Model for Code
Pavol Bielik, Veselin Raychev, Martin Vechev
Write new code:
Code Completion
Debug code:
Statistical Bug Detection
likely error
...
for x in range(a):
print a[x]
Camera camera = Camera.Open();
camera.SetOrientation(90);
?
Port code:
Programming Language Translation
C#
Java
Console.WriteLine('Hi');
...
System.out.println('Hi');
...
Widely Applicable
Agnostic to
Efficient Learning
Trained as efficiently
as PCFGs and n-gram models
applications
programming language
Probabilistic
Model
15 million
repositories
Probabilistic
Model
How can we learn probabilistic
models directly from data?
Billions of
lines of code
Flexible Representation
Conditioning for the predictions
is determined dynamically
High quality, tested,
maintained programs
number of repositories
last 8 years
The Need for Better Conditioning
Context Free Grammar
� → �1… �n
poor context
Property → x
Property → y
Property → promise
P
0.05
0.03
0.001
ReturnStatement
MemberExpression
Identifier
Property
defer
?
AST
Using Programs to Explain Data
awaitReset = function(){
...
return defer.promise;
}
In general:
Unrestricted programs
(Turing complete)
�[�] → �1… �n
Our Work:
TCond Language for navigating over
trees and accumulating context
richer context
awaitRemoved = function(){
fail(function(error){
if (error.status === 401){
...
}
defer.reject(error);
});
...
richer context
return defer.?
on which
}
to condition
Property
→ promise
P
0.67
TCond
::=
MoveOp
::=
� | WriteOp TCond | MoveOp TCond
Up, Left, Right, DownFirst, DownLast,
NextDFS, PrevDFS, NextLeaf, PrevLeaf
PrevNodeType, PrevNodeValue, PrevNodeContext
[defer, reject, promise]
Property
→ notify
0.12
WriteOp ::=
[defer, reject, promise]
Property
→ resolve
WriteValue, WriteType, WritePos
0.11
Example TCond program for predicting property names
[defer, reject, promise]
1
2
3
4
5
Probabilistic Higher Order Grammar (PHOG)
elem.notify(
... ,
... ,
{
position: ‘top’,
autoHide: false,
?
}
);
N: set of non-terminal symbols
Σ: set of terminal symbols
s: start symbol
Condition the predictions
on richer context
C: conditioning set where �
∊
C
Parameter
Position
API
name
CallExpression
5
6
...
MemberExpression
7
Identifier
elem
...
3
8
4
ObjectExpression
1
Property
notify
2
Property
position
Property
autoHide
String
‘top’
Boolean
false
Property
?
Learning
p: AST → C
q: R → ℝ+
valid probability distribution
∑ q(�[�] → � … � ) = 1
1
Existing programs as trees
n
Search technique
Enumerative search
Genetic programming
�[�] → �1… �n∊ R
Execute
program p
AST
p(
Select most likely
production rule
pbest = arg min cost(D, p) + � ∙ Ω(p)
)= �
arg max q(�[�] → �1… �n)
p ∊ TCond
�1… �n | �[�] → �1… �n ∊ R
Example query for most likely prediction
TCond
::=
� | WriteOp TCond | MoveOp TCond
MoveOp
::=
Up, Left, Right, ...
WriteOp ::=
Regularization to
avoid too complex
programs
|d| << |D|
|cost(d, p) - cost(D,p)| < �
Representative sampling
WriteValue, WriteType, ...
TCond Language
Raychev, V., Bielik, P., Vechev, M. and Krause, A.
Learning Programs from Noisy Data. POPL ’16, ACM
Evaluation
150k JavaScript Programs
http://www.srl.inf.ethz.ch/js150.php
PCFG
8
R: set of rules in form: �[�] → �1… �n
program that dynamically
obtains the context
for given query
176 unique values
7
Left, WriteValue, Up, WritePos, Up, DownFirst, DownLast, WriteValue
Previous
Property
Source
Code
6
Code Completion Error Rate
Non-Terminals
Terminals
48.5%
49.9%
100k: Training Set
(1.07 ∙ 108 completion queries)
50k: Evaluation Set (5.3 ∙ 107 completion queries)
Code Completion Examples
Training Time
Queries
per Second
1 min
71 000
109 unique values
PCFG
n-gram
30.8%
28.7%
n-gram
4 min
15 000
Naive Bayes
41.6%
45.8%
Naive Bayes
3 min
10 000
SVM
32.5%
29.5%
SVM
PHOG
25.9%
18.5%
PHOG
36 hours
12 500
162 + 3 min
50 000
ICML @ NYC
JUNE 19-24 2016
Completion Kind
Error Rate
Completion Example
Identifier
38%
contains = jQuery …
Property
35%
start = list.length;
String
48%
‘[‘ + attrs + ‘]’
Number
36%
canvas(xy[0], xy[1], …)
RegExp
34%
line.replace(/(&nbsp;| )+/, …)
UnaryExpr
3%
BinaryExpr
26%
LogicalExpr
8%
if (!events || !…)
while (++index < …)
frame = frame || …