Hidden Information in the DNS Protocol

Christian Dietz and Rocco Mandrysch
Universität der Bundeswehr München and RUAG Defence
SIGS Technology Summit
16.06.2016
Contents
1 Introduction
2 DNS Name Properties
3 Emergent Self-organizing Maps
4 Summary and Conclusion
Rocco Mandrysch (RUAG Defence)
Hidden Information in the DNS Protocol
16.06.2016
1 / 18
Introduction
Approach:
Bots use the DNS protocol as a communication channel:
a covert channel
to transport commands or information
In many cases the DNS names are generated
by a "Domain Generation Algorithm" (DGA)
Examples: Tinba, Pisloader, PadCrypt, ...
Goal:
Bot detection by identifying generated domain names with
machine learning (ESOMs, which have not yet been used in such an analysis)
Analysis strategy
1 Study the properties of "normal" and DGA domain names
2 Extract problem-relevant "features" from the domain names
3 Train the ESOM with the extracted "features" as input
DNS Name Properties
Data used for the analysis
Taken from a research network
Collected over a period of one day
(from midnight to midnight)
Only the domain names from DNS request packets were selected
No IPv6
DNS: number of characters in domain name
The maximum is caused by requests from AntiVir software
and cloud infrastructure domains
Relative digit occurrence
Figure panels: second-level domain, third-level domain
Relative vowel occurrence
Figure panels: second-level domain, third-level domain
Relative consonant occurrence
Figure panels: second-level domain, third-level domain
Shannon Entropy - Definition
Introduced by Claude Shannon in the paper "A Mathematical Theory
of Communication" in 1948
Entropy H(X) describes the amount of information in a variable X
Calculated via
$H(X) = -\sum_{i=1}^{n} P(x_i) \log_b P(x_i)$
X is a discrete random variable
with possible values x_1, ..., x_n
P(X) is its probability mass function
b is the base of the logarithm
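The definition can be applied to a domain name by treating the relative frequency of each character as P(x_i). The following is a minimal sketch under that assumption; the function name `shannon_entropy` and the default base b = 2 are choices made here for illustration, not taken from the slides.

```python
import math
from collections import Counter


def shannon_entropy(text: str, base: float = 2.0) -> float:
    """Return H = -sum(p_i * log_b(p_i)) over the character frequencies of text."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    # p_i is estimated as (occurrences of character i) / (length of the string)
    return -sum((c / total) * math.log(c / total, base) for c in counts.values())


if __name__ == "__main__":
    for name in ("google", "npggxbrbwqpo"):
        print(name, round(shannon_entropy(name), 3))
```

A string consisting of a single repeated character has entropy 0, and a string of two distinct characters has an entropy of 1 bit.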
Shannon Entropy
Entropy of whole domain names compared with generated strings
Generated: strings with a length of 5, 10, 26 and 50 characters,
produced with a simple random number generator
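A comparison of this kind could be set up roughly as follows. The sketch assumes a lowercase a-z alphabet and a fixed seed, neither of which is stated on the slide, and the helper names are hypothetical. It also prints the upper bound log2(min(length, 26)), since a string cannot contain more distinct characters than its own length or the alphabet size.

```python
import math
import random
import string
from collections import Counter


def entropy_bits(text: str) -> float:
    """Character-level Shannon entropy of text in bits."""
    total = len(text)
    return -sum(c / total * math.log2(c / total) for c in Counter(text).values())


def mean_entropy(length: int, samples: int = 1000, seed: int = 0) -> float:
    """Average entropy of `samples` random lowercase strings of the given length."""
    rng = random.Random(seed)  # fixed seed (an assumption) for repeatable output
    names = ("".join(rng.choices(string.ascii_lowercase, k=length))
             for _ in range(samples))
    return sum(entropy_bits(name) for name in names) / samples


if __name__ == "__main__":
    for length in (5, 10, 26, 50):
        bound = math.log2(min(length, 26))
        print(f"length {length:2d}: mean {mean_entropy(length):.2f} bits, "
              f"upper bound {bound:.2f} bits")
```

Short strings are capped at a low entropy regardless of how random they are, which is one reason entropy alone is a weak feature for short domain names.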
Emergent Self-organizing Maps
Emergent Self-Organizing Maps - Definition
A SOM is an artificial neural network (ANN) algorithm
ANNs are used to estimate or approximate functions
The Self-Organizing Map (SOM) is an unsupervised learning algorithm
Unsupervised learning: algorithms that try to find correlations
without any external input other than the raw data
A SOM maps high-dimensional input to a low-dimensional output,
such as 2-D or 3-D
Similar inputs are mapped to nearby locations in the low-dimensional
map
Large SOMs are called Emergent Self-Organizing Maps (ESOMs) to
emphasize the distinction
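To make the mapping idea concrete, here is a minimal sketch of classic online SOM training on a small flat grid, written in plain Python. It is not the ESOM implementation used for the analysis: the grid size, decay schedules, and function names are assumptions chosen for illustration, and a real ESOM would use a far larger grid.

```python
import math
import random


def best_matching_unit(weights, x):
    """Return (row, col) of the neuron whose weight vector is closest to x."""
    best, best_d2 = (0, 0), float("inf")
    for r, row in enumerate(weights):
        for c, w in enumerate(row):
            d2 = sum((a - b) ** 2 for a, b in zip(w, x))
            if d2 < best_d2:
                best, best_d2 = (r, c), d2
    return best


def train_som(data, rows=6, cols=6, epochs=30, lr0=0.5, radius0=3.0, seed=0):
    """Train a small rectangular SOM; weights[r][c] is one neuron's weight vector."""
    rng = random.Random(seed)
    dim = len(data[0])
    weights = [[[rng.random() for _ in range(dim)] for _ in range(cols)]
               for _ in range(rows)]
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1.0 - frac)                    # learning rate decays linearly
        radius = max(radius0 * (1.0 - frac), 0.5)  # neighbourhood shrinks over time
        for x in rng.sample(data, len(data)):      # one shuffled pass over the data
            br, bc = best_matching_unit(weights, x)
            for r in range(rows):
                for c in range(cols):
                    grid_d2 = (r - br) ** 2 + (c - bc) ** 2
                    h = math.exp(-grid_d2 / (2.0 * radius ** 2))  # Gaussian neighbourhood
                    w = weights[r][c]
                    for i in range(dim):
                        w[i] += lr * h * (x[i] - w[i])  # pull neuron toward the input
    return weights


if __name__ == "__main__":
    rng = random.Random(1)
    cluster_a = [[rng.gauss(0.2, 0.03), rng.gauss(0.2, 0.03)] for _ in range(40)]
    cluster_b = [[rng.gauss(0.8, 0.03), rng.gauss(0.8, 0.03)] for _ in range(40)]
    som = train_som(cluster_a + cluster_b)
    print("BMU of (0.2, 0.2):", best_matching_unit(som, [0.2, 0.2]))
    print("BMU of (0.8, 0.8):", best_matching_unit(som, [0.8, 0.8]))
```

After training, inputs from the two synthetic clusters land on different neurons, which is the property that lets similar inputs be read off as neighbouring regions of the map.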
Analysis setup
Architecture: boundless toroid grids
to avoid border effects and topology errors, and to enable an intuitive,
undistorted visualization
Training data:
standard domain names
Tinba domain names
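On a toroid grid the map wraps around in both directions, so no neuron sits at a border. A minimal sketch of the corresponding grid distance is shown below; the function name and the 50 x 80 grid in the usage example are hypothetical and not taken from the analysis.

```python
def toroid_distance_sq(a, b, rows, cols):
    """Squared grid distance between neurons a and b on a wrap-around grid."""
    dr = abs(a[0] - b[0])
    dc = abs(a[1] - b[1])
    dr = min(dr, rows - dr)  # the path across the top/bottom edge may be shorter
    dc = min(dc, cols - dc)  # the same holds for the left/right edge
    return dr * dr + dc * dc


if __name__ == "__main__":
    # Opposite corners of a 50 x 80 grid are diagonal neighbours on the toroid.
    print(toroid_distance_sq((0, 0), (49, 79), rows=50, cols=80))
```

Using this distance in the neighbourhood function of a SOM gives every neuron the same number of neighbours, which is what removes the border effects.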
Analysis setup - Tinba
Tinba DGA: the current domain name is used as the seed for generating
the next domain
Example generated domain names:
npggxbrbwqpo.com
qmecbeefjkxg.com
cgpyiywvqqqx.com
bkttfgbbfegc.com
dbrqkktiyxng.com
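The chaining idea (each name seeds the next) can be illustrated with a toy generator for producing test data. This is explicitly not Tinba's algorithm: the SHA-256 step, the 12-letter label, and the function names are assumptions made only to show the structure.

```python
import hashlib
import string


def next_domain(previous: str, length: int = 12, tld: str = "com") -> str:
    """Derive a pseudo-random lowercase label from the previous domain (toy scheme)."""
    digest = hashlib.sha256(previous.encode("ascii")).digest()  # 32 bytes
    letters = string.ascii_lowercase
    # Map the first `length` digest bytes (length <= 32) onto the letters a-z.
    label = "".join(letters[b % 26] for b in digest[:length])
    return f"{label}.{tld}"


def chain(seed_domain: str, count: int) -> list:
    """Generate `count` names, each one seeded by the name before it."""
    names, current = [], seed_domain
    for _ in range(count):
        current = next_domain(current)
        names.append(current)
    return names


if __name__ == "__main__":
    for name in chain("example.com", 5):
        print(name)
```

Because the sequence is fully determined by the first seed, anyone who knows the scheme and the seed can reproduce the whole chain of names.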
Analysis setup
Input variables
entropy
relative occurrence of
letters, digits, consonants, vowels, symbols
n-gram occurrence (n > 1)
Each value is calculated for the TLD, SLD and THLD
In total 26 input variables
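A feature extractor along these lines might look like the sketch below. It does not reproduce the 26 input variables of the analysis, whose exact definitions are not given here: the feature names are hypothetical, and the n-gram feature is reduced to an assumed "share of repeated bigrams".

```python
import math
from collections import Counter

VOWELS = set("aeiou")
FEATURE_NAMES = ("entropy", "letters", "digits", "vowels",
                 "consonants", "symbols", "bigram_repeats")


def block_features(block: str) -> dict:
    """Entropy and relative character-class occurrences for one block."""
    n = len(block)
    if n == 0:
        return {name: 0.0 for name in FEATURE_NAMES}
    text = block.lower()
    letters = sum(ch.isalpha() for ch in text)
    digits = sum(ch.isdigit() for ch in text)
    vowels = sum(ch in VOWELS for ch in text)
    entropy = -sum(c / n * math.log2(c / n) for c in Counter(text).values())
    bigrams = Counter(text[i:i + 2] for i in range(n - 1))
    repeated = sum(c for c in bigrams.values() if c > 1)  # assumed n-gram feature
    return {
        "entropy": entropy,
        "letters": letters / n,
        "digits": digits / n,
        "vowels": vowels / n,
        "consonants": (letters - vowels) / n,
        "symbols": (n - letters - digits) / n,
        "bigram_repeats": repeated / max(n - 1, 1),
    }


def domain_features(domain: str) -> dict:
    """Compute the block features separately for the TLD, SLD and THLD."""
    blocks = domain.rstrip(".").split(".")
    levels = {
        "tld": blocks[-1],
        "sld": blocks[-2] if len(blocks) >= 2 else "",
        "thld": blocks[-3] if len(blocks) >= 3 else "",
    }
    return {f"{level}_{name}": value
            for level, block in levels.items()
            for name, value in block_features(block).items()}


if __name__ == "__main__":
    for key, value in domain_features("mail.google.com").items():
        print(f"{key}: {value:.3f}")
```

Each domain name becomes one fixed-length numeric vector, which is the form a SOM expects as input; missing levels are filled with zeros in this sketch.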
Results
Dots with a black border are overlays of
normal domain names (red) and
Tinba domain names (green)
Summary and Conclusion
Summary and next steps
Studied properties of "normal" and DGA domain names
Extracted features as input for training the ESOM
Single features are not enough to distinguish between "normal"
and DGA domain names → ESOM is a promising approach
Next steps of the ESOM analysis: optimize input features
More results will follow → to be continued ...
Thank you!
Backup slides
Related work
"Finding Domain-Generation Algorithms by Looking at Length
Distributions", Miranda Mowbray & Josiah Hagen, DOI:
10.1109/ISSREW.2014.20, 2014
"Detecting DNS Tunnels Using Character Frequency Analysis",
Kenton Born & David Gustafson, arXiv:1004.4358, 2010
"Botnet Detection Using Passive DNS", Pedro Marques de Luz,
Master Thesis, 2014
Vocabulary
Example: mail.google.com
com: top-level domain (TLD)
google: second-level domain (SLD)
mail: third-level domain (THLD)
block: single element of the domain name, such as the TLD or SLD
(blocks are separated by dots)
length of a block: number of symbols/letters/digits in a block
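The vocabulary translates directly into a small helper. This is a sketch only; the function name `split_blocks` and the returned dictionary layout are choices made here for illustration.

```python
def split_blocks(domain: str) -> dict:
    """Split a domain name into its blocks and name the last three levels."""
    blocks = domain.rstrip(".").lower().split(".")
    return {
        "blocks": blocks,
        "block_lengths": [len(b) for b in blocks],
        "tld": blocks[-1],
        "sld": blocks[-2] if len(blocks) > 1 else None,
        "thld": blocks[-3] if len(blocks) > 2 else None,
    }


if __name__ == "__main__":
    print(split_blocks("mail.google.com"))
    print(len(split_blocks("4.3.2.1.in-addr.arpa")["blocks"]))
```

For mail.google.com the block lengths are 4, 6 and 3; a reverse-lookup name of the form *.in-addr.arpa for an IPv4 address has six blocks.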
DNS: number of blocks
Many requests such as *.in-addr.arpa
(four address octets plus in-addr and arpa)
→ maximum at 6 blocks
DNS: length of a block
The length of top-level domains causes the maximum at 3 characters
Generated Domain Names
In general, DGAs are based on
a "random number generator" together with a seeding scheme, for example:
SHA256 hashing as generation scheme and the current date for seeding
(PadCrypt)
the previous domain name as seed for generating the next domain (Tinba)
Generated Domain Names: analysis setup
Simple pseudo-random number generator
Default setup of the Python 3 module random
Core generator: Mersenne Twister
50k generated domain names
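A setup of this kind could be reproduced roughly as follows. The lowercase a-z alphabet, the fixed seed, and the helper name `generate_names` are assumptions; the slides only state that Python's default generator was used for 50k names.

```python
import random
import string
from collections import Counter


def generate_names(count: int, length: int, seed: int = 0) -> list:
    """Return `count` random lowercase strings of the given length."""
    rng = random.Random(seed)  # random.Random is backed by the Mersenne Twister
    return ["".join(rng.choices(string.ascii_lowercase, k=length))
            for _ in range(count)]


if __name__ == "__main__":
    names = generate_names(50_000, 26)
    letter_counts = Counter("".join(names))
    expected = 50_000 * 26 / 26  # each letter is equally likely
    spread = max(abs(c - expected) / expected for c in letter_counts.values())
    print(f"largest relative deviation from uniform: {spread:.3%}")
```

With a uniform generator every letter should appear about equally often, so the printed deviation gives a quick sanity check on the generated sample.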
Generated Domain Names
Every letter occurs with a similar frequency
Generated Domain Names
relative vowel occurrence
length: 26 letters
Generated Domain Names
relative vowel occurrence
length: 5 letters
Generated Domain Names
relative consonant occurrence
length: 26 letters
Generated Domain Names
relative consonant occurrence
length: 5 letters
Shannon Entropy - Definition
Two probability distributions over 30 bins illustrating the higher value
of the entropy H for the broader distribution.
The largest entropy would arise from a uniform distribution, which
would give H = -ln(1/30) = ln 30 ≈ 3.40.
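For reference, the value for the uniform case follows from the entropy definition with the natural logarithm (b = e) and P(x_i) = 1/30 for every bin. The short derivation below is a worked instance of that definition:

```latex
H = -\sum_{i=1}^{30} \frac{1}{30} \ln\frac{1}{30}
  = -\ln\frac{1}{30}
  = \ln 30
  \approx 3.40
```

Any non-uniform distribution over the same 30 bins has a smaller H, which is why the narrower distribution shows the lower entropy.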
ESOM: Results