
João Nogueira
www.joaonogueira.eu
Fusion Tech Talk 2# (powered by Simplify Digital)
April 4, 2017
Outline

What is R?

Platforms & Libs

How to Use & Examples

Conclusions & Resources
Fusion Tech Talk 2# - R
What is R?
R

Open source programming language for statistical computing
• Supported by the R Foundation

Evolution of the S programming language

Designed to process data

Can call C, C++ and Fortran code

Can be called from C and C# code
• Possibly from other languages as well
R Markdown

Simplicity of Markdown style

With support for embedded code chunks
• R, Python, SQL, Bash, Rcpp, Stan, JavaScript, CSS

Generates static or dynamic documents
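
To make the idea of embedded code chunks concrete, here is a minimal sketch that writes a tiny R Markdown file and renders it when possible. The file name, title and chunk content are made up for illustration, and rendering assumes the rmarkdown package and pandoc are installed.

```r
# The chunk fence is built programmatically to avoid literal nested fences.
fence <- strrep("`", 3)

# A minimal R Markdown document: YAML header, Markdown text, one R chunk.
rmd_lines <- c(
  "---",
  "title: \"Example report\"",
  "output: html_document",
  "---",
  "",
  "Some *Markdown* text, followed by an embedded R chunk:",
  "",
  paste0(fence, "{r}"),
  "summary(cars)",
  fence
)

rmd_file <- file.path(tempdir(), "example.Rmd")
writeLines(rmd_lines, rmd_file)

# Rendering is skipped when rmarkdown or pandoc is not available.
if (requireNamespace("rmarkdown", quietly = TRUE) && rmarkdown::pandoc_available()) {
  rmarkdown::render(rmd_file, quiet = TRUE)
}
```

When rendered, the chunk is executed and its output is placed in the generated document next to the surrounding text.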
Platforms & Libs
RStudio Ecosystem

RStudio Desktop
• Standalone R IDE

RStudio Server (Pro)
• Multi-user Web R IDE

Shiny (Pro)
• Build interactive Web-Apps with R
Microsoft R Ecosystem
Packages & Dependencies

R has an embedded package manager
• install.packages() / require() / library() commands
• Dependency tree is automatically resolved

CRAN (The Comprehensive R Archive Network)
• Holds most public packages (binaries & source)

Packages can be installed from source
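
The commands above can be combined into a small install-if-missing pattern. This is only a sketch: use_package is a hypothetical helper name, not part of R, and installation from CRAN is attempted only when explicitly requested.

```r
# Hypothetical helper: attach a package, optionally installing it first.
use_package <- function(pkg, install_missing = FALSE) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    if (!install_missing) {
      return(FALSE)            # package not available, nothing attached
    }
    install.packages(pkg)      # also installs the packages it depends on
  }
  library(pkg, character.only = TRUE)  # attach; errors if still missing
  TRUE
}

use_package("stats")  # ships with R, so no installation is needed
```

library() stops with an error when a package is missing, while require() returns FALSE with a warning, which makes require() the one to use inside conditions.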
Essential Packages

data.table
• Data storage & processing workhorse

dplyr / plyr
• Data manipulation helpers
• e.g. Split-Apply-Combine

ggplot2
• Swiss Army Knife of plots
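
As an illustration of Split-Apply-Combine, the sketch below computes the mean fuel consumption per cylinder count on the built-in mtcars dataset, first in base R and then, if data.table happens to be installed, in data.table syntax. The dataset and column choice are arbitrary examples.

```r
# Base R: split the values by group, apply a function, combine the results.
groups <- split(mtcars$mpg, mtcars$cyl)      # split mpg by number of cylinders
means <- vapply(groups, mean, numeric(1))    # apply mean() and combine
print(means)

# Same computation in data.table syntax, only run if the package is installed.
if (requireNamespace("data.table", quietly = TRUE)) {
  dt <- data.table::as.data.table(mtcars)
  print(dt[, .(mean_mpg = mean(mpg)), by = cyl])
}
```

The data.table form uses the same [i, j, by] pattern as the password counting later in the talk.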
Other Packages

ggmap - Google Maps API

RODBC - ODBC Connectivity

caret - Quick start to machine learning

bigmemory / ff - Datasets too big for memory

parallel / doMC / doParallel - Parallelization

h2o - AI / Deep Learning

Many many many many others
How to Use & Examples
Basic Commands

Get help with ‘?command’, e.g. ‘?fread’
• Most of the time that is all you need

install.packages / require / library – Install and load packages

Local assignment operators
• Use <-, not =, for assignment (long explanation)

source – Run an R script
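
One short piece of the "long explanation" can be shown in code: inside a function call, = names an argument, while <- really assigns. The function name demo_assignment is made up for this sketch.

```r
demo_assignment <- function() {
  env <- environment()

  # `=` inside a call only names the argument x of median(); no variable is created.
  m1 <- median(x = 1:10)
  eq_created_x <- exists("x", envir = env, inherits = FALSE)

  # `<-` inside a call assigns x in the calling environment, then passes it on.
  m2 <- median(x <- 1:10)
  arrow_created_x <- exists("x", envir = env, inherits = FALSE)

  list(m1 = m1, m2 = m2,
       eq_created_x = eq_created_x,
       arrow_created_x = arrow_created_x)
}

result <- demo_assignment()
result$eq_created_x     # FALSE
result$arrow_created_x  # TRUE
```

At the top level both operators assign, which is why using <- consistently avoids the ambiguity.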
Example

Let’s see what we can do with a leaked dataset of 10 M passwords
• The dataset has a raw size of ~200MB and is publicly available on the Internet

How can we use it to improve password requirements on a registration page?

Password rules should not be too strict
• Don’t drive your users crazy!
Goals & Challenges

Goals:
• We should tell users if their password is commonly used
• We should tell users if their password has been previously leaked and is vulnerable

Challenge:
• It’s not efficient/feasible to search within the dataset in real time as the user is typing
Data Loading & Exploration

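Let’s use data.table for that:
library(data.table)  # provides fread() and the [i, j, by] syntax
#### Load passwords ####
# Each row of the file is one user/password combination
rawPasswords <- fread('10-million-combos.txt', col.names = c("User", "Password"))
#### Sort by popularity ####
# Count occurrences of each password (.N becomes column N), most frequent first
sortedPasswords <- rawPasswords[, .(.N), by = .(Password)][order(-N), ]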
Visualization - WordCloud
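library(tm)            # Corpus(), VectorSource()
library(wordcloud)     # wordcloud()
library(RColorBrewer)  # brewer.pal()
lords <- Corpus(VectorSource(rawPasswords$Password))  # corpus built from the password column
wordcloud(lords, scale = c(8, 1), max.words = 100, random.color = TRUE,
          random.order = FALSE, rot.per = 0.35, use.r.layout = TRUE, colors = brewer.pal(8, 'Dark2'))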
Common Passwords

We should tell users if their password is commonly used
• Assume: “common” = 25% most popular passwords

Quantiles offer a good indication of how common certain passwords are
# Add password Rank column (popularity; 1 = most common)
sortedPasswords[, Rank := 1:nrow(sortedPasswords)]
# Merge Rank information with the original dataset
rankedPasswords <- base::merge(rawPasswords, sortedPasswords, by = "Password")
# Quantile levels: assumed values, the slides do not show this definition
quantilesToUse <- c(0.25, 0.50, 0.75)
# Compute quantiles of the per-account password rank
qt <- quantile(rankedPasswords$Rank, quantilesToUse)
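
Why a quantile of the rank tells us how many passwords are "common" is easier to see on a toy leak. The sketch below uses made-up data and base R only; it follows the same rank-then-quantile idea as the code above, not the exact data.table calls.

```r
# Toy leak: one entry per account (made-up data).
passwords <- c(rep("123456", 5), rep("password", 3), rep("qwerty", 2),
               "letmein", "dragon")

# Popularity table, most frequent first; rank 1 = most common password.
counts <- sort(table(passwords), decreasing = TRUE)
rank_of <- setNames(seq_along(counts), names(counts))

# Rank of the password used by each account.
account_rank <- unname(rank_of[passwords])

quantile(account_rank, c(0.25, 0.50))
# The 25% quantile is 1 here: the single most common password already
# covers at least a quarter of all accounts.
```

Read this way, the 25% quantile of Rank is the number of top passwords that have to be checked to cover a quarter of all leaked accounts.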
Exploration – Quantiles

Hey, it’s only 25k passwords!
• 0.2% of passwords are responsible for 25% of leaks

Perfectly feasible to check in real time and warn users if they are entering a commonly used password
Exploration – CDF

Cumulative Distribution Functions (CDFs) are a nice way to explore password frequency as well
• (Plotting code omitted for brevity)
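
Since the plotting code is not shown, here is one possible way to draw such a curve in base R. This is a sketch on made-up counts, not the code used for the talk.

```r
# Toy counts per password, already sorted by popularity (made-up numbers).
counts <- c("123456" = 5, "password" = 3, "qwerty" = 2, "letmein" = 1, "dragon" = 1)

# Share of accounts covered by the top-k passwords, for k = 1..5.
coverage <- cumsum(counts) / sum(counts)

# Step plot of the cumulative share against the password rank.
plot(seq_along(coverage), coverage, type = "s", ylim = c(0, 1),
     xlab = "Password rank", ylab = "Cumulative share of accounts")
```

On the real dataset the same curve could be obtained from the N column of sortedPasswords, i.e. cumsum(sortedPasswords$N) / sum(sortedPasswords$N).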
Leaked Passwords

We should tell users if their password is known to have been leaked

But how to do it quickly and efficiently?

Why not Bloom Filters?
• Good at telling us quickly if a given element belongs to a set without having to store the full set
• Efficiency limited by the allowable false-positive rate
• O(k) complexity (k = number of hashing functions, not number of items!)
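
The properties listed above can be seen in a minimal Bloom filter written in base R. This is an illustrative sketch only: the function names and the simple seeded hash are assumptions made for this example and are unrelated to the bloom package used in the talk.

```r
# Minimal Bloom filter sketch: m = number of bits, k = number of hash functions.
bloom_new <- function(m, k) {
  list(bits = logical(m), m = m, k = k)
}

# Hypothetical helper: map a string to k bit positions using a seeded
# polynomial hash. Real implementations use stronger hash functions.
bloom_positions <- function(bf, item) {
  codes <- utf8ToInt(item)
  vapply(seq_len(bf$k), function(seed) {
    h <- seed
    for (code in codes) {
      h <- (h * 31 + code + seed) %% 2147483647
    }
    (h %% bf$m) + 1  # R vectors are 1-indexed
  }, numeric(1))
}

# Adding sets k bits; the updated filter is returned because R copies on modify.
bloom_add <- function(bf, item) {
  bf$bits[bloom_positions(bf, item)] <- TRUE
  bf
}

# Lookup checks k bits: the cost depends on k, not on how many items were added.
bloom_contains <- function(bf, item) {
  all(bf$bits[bloom_positions(bf, item)])
}

bf <- bloom_new(m = 1024, k = 3)
for (pw in c("123456", "password", "qwerty")) {
  bf <- bloom_add(bf, pw)
}
bloom_contains(bf, "123456")  # TRUE: added items are never missed
```

A lookup for a password that was never added returns FALSE unless all k of its bits happen to be set by other entries, which is the false positive that the error rate controls.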
Bloom Filter for Passwords
#### Create Bloom Filter ####
library(plyr)  # d_ply(), aaply() and .()
# The Bloom class is fetched directly from the bloom package's namespace
Bloom <- get("Bloom", envir = asNamespace("bloom"))
bloomFilter <- new(Bloom, capacity = nrow(sortedPasswords), error_rate = 0.001,
                   filename = "./bloom.bin", file.exists("./bloom.bin"))
#### Add known passwords to Bloom Filter ####
dataToAdd <- sortedPasswords[order(Rank), ]
d_ply(.data = dataToAdd,
      .variables = .(Password, Rank),
      .fun = function(x) {
        bloomFilter$add(x$Password, x$Rank)
      }
)
#### Check if all passwords are in Bloom Filter ####
addedPasswords <- dataToAdd$Password
containsResult <- aaply(addedPasswords, .margins = 1, bloomFilter$contains)
# TRUE when every added password is reported as present (no false negatives)
sum(containsResult) == length(addedPasswords)
Bloom Filter for Passwords

The output *.bin file is about 40MB
• 5x footprint reduction at 0.1% false-positive rate

Tell users in real time if their password has been leaked before!
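
For a rough feel of how the false-positive rate drives the size, the textbook sizing formulas for a Bloom filter can be evaluated directly. This is a sketch: bloom_size is a hypothetical helper, n is an arbitrary illustration value, and the bloom package may lay out its file differently, so this does not attempt to reproduce the 40MB figure.

```r
# Textbook sizing of a Bloom filter for n items at false-positive rate p.
bloom_size <- function(n, p) {
  bits <- ceiling(-n * log(p) / log(2)^2)   # optimal number of bits m
  k <- max(1, round(bits / n * log(2)))     # optimal number of hash functions
  list(bits = bits, bits_per_item = bits / n, k = k, megabytes = bits / 8 / 1e6)
}

s <- bloom_size(n = 1e6, p = 0.001)
s$bits_per_item  # about 14.4 bits per stored item
s$k              # 10 hash functions
```

The cost per item depends only on the false-positive rate, not on the length of the stored passwords, which is where the footprint reduction comes from.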
Other Considerations

The results could obviously be improved, for instance by forcing a minimum password length

But they serve to show how flexible R is

And the vast number of packages that exist
Conclusions & Resources
Conclusions

R is great for data processing

Great for data visualization/plotting too!

Good starting point for machine learning

Tons of packages that do the work for you

Lots of open-source & free tools

Easy to integrate with popular ecosystems
Resources

TONS of material on Coursera & in books

http://www.cyclismo.org/tutorial/R/

https://www.r-project.org/foundation/

https://cran.r-project.org/

https://www.rstudio.com/

http://rmarkdown.rstudio.com

https://bookdown.org/yihui/bookdown/

https://plot.ly/r/getting-started/

https://msdn.microsoft.com/en-us/microsoft-r/microsoft-r-more-resources

https://www.r-bloggers.com/

https://github.com/RevolutionAnalytics
Thank you!