João Nogueira
[email protected]
www.joaonogueira.eu

Fusion Tech Talk 2# (powered by Simplify Digital)
April 4, 2017

Outline
What is R?
Platforms & Libs
How to Use & Examples
Conclusions & Resources

Fusion Tech Talk 2# - R
João Nogueira

What is R?

R
Open source programming language for statistical computing
  • Supported by the R Foundation
Evolution of the S programming language
Designed to process data
Can use C, C++ and Fortran code
Can be called by C and C# code
  • Possibly others as well

R Markdown
Simplicity of Markdown style
With support for embedded code chunks
  • R, Python, SQL, Bash, Rcpp, Stan, JavaScript, CSS
Generates static or dynamic documents

Platforms & Libs

RStudio Ecosystem
RStudio Desktop
  • Standalone R IDE
RStudio Server (Pro)
  • Multi-user Web R IDE
Shiny (Pro)
  • Build interactive Web-Apps with R

Microsoft R Ecosystem

Packages & Dependencies
R has an embedded package manager
  • install.packages() / require() / library() commands
  • Dependency tree is automatically resolved
CRAN (The Comprehensive R Archive Network)
  • Holds most public packages (binaries & source)
Packages can be installed from source

Essential Packages
data.table
  • Data storage & processing workhorse
dplyr / plyr
  • Data manipulation helpers
  • e.g. Split-Apply-Combine
ggplot2
  • Swiss Army Knife of plots

Other Packages
ggmap - Google Maps API
RODBC - ODBC connectivity
caret - Quick start to machine learning
bigmemory / ff - Datasets too big for memory
parallel / doMC / doParallel - Parallelization
h2o - AI / Deep Learning
Many, many others

How to Use & Examples

Basic Commands
Get help: ‘?command’, e.g. ‘?fread’
  • Most of the time, that is all you need
install.packages / require / library - Install and load packages
Local assignment operators
  • Use <-, not =, for assignment (long explanation)
source - Run an R script

Example
Let’s see what we can do with a leaked dataset of 10 M passwords
  • The dataset has a raw size of ~200 MB and is publicly available on the Internet
How can we use it to improve password requirements on a registration page?
Password rules should not be too strict
  • Don’t drive your users crazy!

Goals & Challenges
Goal:
  • We should tell users if their password is commonly used
  • We should tell users if their password has been previously leaked and is vulnerable
Challenge:
  • It’s not efficient/feasible to search the full dataset in real time as the user is typing

Data Loading & Exploration
Let’s use data.table for that:

    library(data.table)

    #### Load passwords ####
    rawPasswords <- fread('10-million-combos.txt', col.names = c("User", "Password"))

    #### Sort by popularity ####
    sortedPasswords <- rawPasswords[, .(.N), by = .(Password)][order(-N), ]

Visualization - WordCloud

    library(tm)            # Corpus, VectorSource
    library(wordcloud)     # wordcloud
    library(RColorBrewer)  # brewer.pal

    lords <- Corpus(VectorSource(rawPasswords$Password))
    wordcloud(lords, scale = c(8, 1), max.words = 100, random.color = TRUE,
              random.order = FALSE, rot.per = 0.35, use.r.layout = TRUE,
              colors = brewer.pal(8, 'Dark2'))

Common Passwords
We should tell users if their password is commonly used
  • Assume: “common” = the most popular passwords that together account for 25% of all entries
Quantiles offer a good indication of how common certain passwords are

    # Add password Rank column (popularity); 1 = most popular
    sortedPasswords[, Rank := seq_len(nrow(sortedPasswords))]
    # Merge Rank information with the original dataset
    rankedPasswords <- base::merge(rawPasswords, sortedPasswords, by = "Password")
    # Compute quantiles (quantilesToUse is not defined on the slide)
    qt <- quantile(rankedPasswords$Rank, quantilesToUse)

Exploration - Quantiles
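A minimal sketch of this quantile exploration on toy data, using base R only rather than data.table. The password values and the probabilities in `quantilesToUse` are assumptions made for illustration; the slides do not show the values actually used.

```r
# Toy stand-in for the leaked dataset (made-up values, one entry per account).
passwords <- c("123456", "123456", "123456", "123456",
               "password", "password", "password",
               "qwerty", "qwerty",
               "dragon", "letmein", "hunter2")

# Count each password and sort by popularity, most common first.
counts <- sort(table(passwords), decreasing = TRUE)

# Rank 1 = most popular; look up the rank of every entry in the dataset.
rankPerEntry <- match(passwords, names(counts))

# Assumed probabilities; the slides do not show the values they used.
quantilesToUse <- c(0.25, 0.50, 0.75)
rankQuantiles <- quantile(rankPerEntry, probs = quantilesToUse)

# Passwords whose rank falls within the 25% quantile count as "common" here.
cutoffRank <- floor(rankQuantiles[["25%"]])
commonPasswords <- names(counts)[seq_len(cutoffRank)]
print(commonPasswords)
```

On this toy data a single password already covers a quarter of all entries, so the "common" list has one element. The same idea applied to the real dataset yields the figures reported next.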
Hey, it’s only 25k passwords!
  • 0.2% of passwords are responsible for 25% of leaks
Perfectly feasible to check in real time and warn users if they are entering a commonly used password

Exploration - CDF
Cumulative Distribution Functions (CDFs) are a nice way to explore password frequency as well
  • (Plotting code omitted for brevity)

Leaked Passwords
We should tell users if their password is known to have been leaked
But how to do it quickly and efficiently?
Why not Bloom Filters?
  • Good at telling us quickly if a given element belongs to a set without having to store the full set
  • Efficiency limited by the allowable false-positive rate
  • O(k) complexity (k = number of hashing functions, not number of items!)

Bloom Filter for Passwords

    library(plyr)  # d_ply, aaply

    #### Create Bloom Filter ####
    Bloom <- get("Bloom", envir = asNamespace("bloom"))
    bloomFilter <- new(Bloom, capacity = nrow(sortedPasswords), error_rate = 0.001,
                       filename = "./bloom.bin", file.exists("./bloom.bin"))

    #### Add known passwords to Bloom Filter ####
    dataToAdd <- sortedPasswords[order(Rank), ]
    d_ply(.data = dataToAdd, .variables = .(Password, Rank),
          .fun = function(x) {
            bloomFilter$add(x$Password, x$Rank)
          })

    #### Check if all passwords are in Bloom Filter ####
    addedPasswords <- dataToAdd$Password
    containsResult <- aaply(addedPasswords, .margins = 1, bloomFilter$contains)
    sum(containsResult) == length(addedPasswords)

Bloom Filter for Passwords
The output *.bin file is about 40 MB
  • 5x footprint reduction at 0.1% false-positive rate
Tell users in real time if their password has been leaked before!
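To make the Bloom filter idea concrete, here is a minimal base-R sketch of the data structure itself. It is an illustration only and not how the `bloom` package works internally: the function name `newBloomFilter`, the polynomial string hash, and the sizes `m` and `k` are all assumptions chosen to keep the example short.

```r
# Build a Bloom filter with m bits and k hash functions (illustrative only).
newBloomFilter <- function(m = 1024, k = 3) {
  bits <- logical(m)

  # Polynomial string hash; the modulus is applied at every step so the
  # intermediate values stay small enough for exact double arithmetic.
  polyHash <- function(s, multiplier) {
    h <- 0
    for (code in utf8ToInt(s)) h <- (h * multiplier + code) %% m
    h
  }

  # Derive k bit positions from two base hashes (double hashing).
  positions <- function(s) {
    h1 <- polyHash(s, 31)
    h2 <- polyHash(s, 37)
    ((h1 + (0:(k - 1)) * (h2 + 1)) %% m) + 1  # +1: R vectors are 1-indexed
  }

  list(
    # Set the k bits for this element; cost is O(k), independent of set size.
    add = function(s) { bits[positions(s)] <<- TRUE; invisible(NULL) },
    # TRUE means "possibly in the set", FALSE means "definitely not".
    contains = function(s) all(bits[positions(s)])
  )
}

leaked <- c("123456", "password", "qwerty")
bf <- newBloomFilter(m = 1024, k = 3)
emptyAnswer <- bf$contains("123456")  # nothing added yet
for (p in leaked) bf$add(p)
allFound <- all(vapply(leaked, bf$contains, logical(1)))
```

An added element is always reported as present, so the filter never produces false negatives. A password that was never added can still be reported as present if its bits happen to be set by other elements, which is the false-positive rate traded against file size above.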
Other Considerations
The results could obviously be improved, for instance by enforcing a minimum password length
But they serve to show how flexible R is
And the vast number of packages that exist

Conclusions & Resources

Conclusions
R is great for data processing
Great for data visualization/plotting too!
Good starting point for machine learning
Tons of packages that do the work for you
Lots of open-source & free tools
Easy to integrate with popular ecosystems

Resources
TONS of material on Coursera and in books
http://www.cyclismo.org/tutorial/R/
https://www.r-project.org/foundation/
https://cran.r-project.org/
https://www.rstudio.com/
http://rmarkdown.rstudio.com
https://bookdown.org/yihui/bookdown/
https://plot.ly/r/getting-started/
https://msdn.microsoft.com/en-us/microsoft-r/microsoft-r-more-resources
https://www.r-bloggers.com/
https://github.com/RevolutionAnalytics

Thank you!