Machine Learning Algorithms – DenStream Clustering
Introducing Machine Learning Algorithms in Smart Data Streaming
TABLE OF CONTENTS
Before you start
Setting up the Environment
    Setup
    Create a New Model
Create and Run the DenStream Project
    Create the Project
    Add the CCL File
Upload Data and View Output
Additional Materials
BEFORE YOU START
Prerequisites:
• You must have an SAP HANA SPS11 system with smart data streaming installed that you can connect to and use.
• You must have SAP HANA Studio 2 installed, with the streaming plug-in installed and configured.
• Download the tutorial files to your local Windows machine (the one on which HANA Studio is installed).
• These files will be used later in the tutorial (make sure you have all of them):
    o denstream_demo.ccl
    o denstream_input_temps.xml
    o denstream_script.sql
• This tutorial assumes you will be using the HANA SYSTEM user to connect to your HANA system.
• You should have completed the Freezer Monitoring Tutorial for smart data streaming or otherwise be comfortable with smart data streaming.
Overview:
SAP HANA SPS11 introduces two machine learning algorithms that can be used in streaming projects: Adaptive Hoeffding Tree and DenStream Clustering. Together they bring supervised learning (Adaptive Hoeffding Tree) and unsupervised learning (DenStream Clustering) to smart data streaming, so that data models can be trained efficiently in real time.
This tutorial walks you through a demo of the DenStream Clustering algorithm. The algorithm discovers data clusters of any shape and updates them continuously, so that only relevant and necessary information is kept. It forms core-micro-clusters and outlier-micro-clusters; as the weights of the clusters change over time, an outlier-micro-cluster may become a core-micro-cluster.
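The way an outlier-micro-cluster can become a core-micro-cluster can be sketched in a few lines of Python. This is a minimal sketch based on the published DenStream algorithm, not on the smart data streaming implementation: the fading function 2^(-decay * age), the promotion threshold BETA * MU, and all parameter values are assumptions chosen for illustration.

```python
# Sketch of time-decayed micro-cluster weights, following the published
# DenStream algorithm. DECAY, BETA and MU are illustrative assumptions,
# not values taken from SAP HANA smart data streaming.

DECAY = 0.25   # lambda: larger values make old events fade faster
BETA = 0.5     # outlier threshold factor, 0 < BETA <= 1
MU = 4.0       # weight a micro-cluster needs to count as fully "core"


def faded_weight(event_times, now, decay=DECAY):
    """Sum of 2^(-decay * age) over all events in a micro-cluster."""
    return sum(2.0 ** (-decay * (now - t)) for t in event_times)


def classify(event_times, now):
    """Label a micro-cluster by comparing its weight with BETA * MU."""
    weight = faded_weight(event_times, now)
    return "core" if weight >= BETA * MU else "outlier"


if __name__ == "__main__":
    sparse = [0.0, 1.0]                 # two old events: too light to be core
    dense = sparse + [9.0, 9.5, 10.0]   # recent events arrive in the same area
    print(classify(sparse, now=10.0))
    print(classify(dense, now=10.0))
```

With these assumed parameters the two old events alone stay below the threshold, while the same micro-cluster crosses it once three recent events have been added.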
The data used in this tutorial has been gathered from the database group at MIT, which can be accessed at:
http://db.csail.mit.edu/labdata/labdata.html.
Outline:
1. Set up the environment
2. Create a new model using a SQL script
3. Create a new streaming project
4. Add the CCL file
5. Compile and run the project
6. Upload data
7. View output
SETTING UP THE ENVIRONMENT
Setup
Click Window at the top, then go to Preferences.
Change the number of rows displayed in the StreamViewer to 35000 so that you can see all the record types and locate all the outliers in the output.
Create a New Model
You will use a SQL script to create the model.
In the Administration Console perspective, click File, then Open File…, and open the “denstream_script.sql” file, which is one of the files you downloaded earlier.
The “insert” command creates the model, while the “create table” command creates the table in which you will view the output.
In the top right corner, click the connection symbol and choose the system to connect to.
Run the script.
Note: You may get a “unique constraint violated” error. This means the model has already been created.
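Why a second run of the script produces this error can be illustrated with a small stand-in. The sketch below uses Python's built-in sqlite3 module rather than SAP HANA, and the MODELS table, its NAME column, and the model name are hypothetical; the tutorial's script uses its own names. It only shows the general pattern: inserting a row whose key already exists violates the uniqueness constraint, which can safely be read as "already created".

```python
import sqlite3

# Stand-in for the tutorial's script: an in-memory SQLite database with a
# hypothetical MODELS table. The real script targets SAP HANA and uses its
# own table and column names.


def create_model(conn, name):
    """Insert a model row; report instead of failing if it already exists."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("INSERT INTO MODELS (NAME) VALUES (?)", (name,))
        return "created"
    except sqlite3.IntegrityError:
        # Same situation as "unique constraint violated": the row is there.
        return "already exists"


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE MODELS (NAME TEXT PRIMARY KEY)")
    print(create_model(conn, "denstream_model"))  # first run of the script
    print(create_model(conn, "denstream_model"))  # second run of the script
```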
CREATE AND RUN THE DENSTREAM PROJECT
Create the Project
In the Streaming Development perspective, go to the Project Explorer tab and create a new streaming project called “my_denstream_project”.
Add the CCL File
Switch to the CCL text view by right-clicking the empty space in the editor and clicking Switch to Text.
Replace the auto-generated CCL with the CCL from “denstream_demo.ccl”, which is one of the files you downloaded earlier.
Change the dataservice to the data service in which the model was created (if you completed the Freezer Monitoring Tutorial, this is “freezermon_service”).
Right-click the project to compile and run it.
UPLOAD DATA AND VIEW OUTPUT
In the Streaming Run-Test perspective, expand your project and double-click Cluster_Assignments, denstream_input, and denstream_output to open them.
Under the Server View, navigate to the File Upload tab. Click Browse… and open the “denstream_input_temps.xml” file, which is one of the files you downloaded earlier. Then click Upload.
You can now view the data in the three windows.
In denstream_output, the CLUSTER_INFO column holds the cluster data, which is explained below. The OUTLIER_INFO column holds similar information for the outliers.
To read the information more easily, right-click any row and click Clipboard Copy, then paste the result into any text editor.
Cluster Info:
ClusterID – the number of the cluster; outliers have a ClusterID of -1.
Weight – initially equal to the number of entries in the cluster. A “decay factor” then reduces the weights of older events over time, so the overall weight becomes a decimal value.
Center – the middle temperature point of the cluster.
Radius – the size of the cluster. It reflects how much the entries in the cluster differ. A radius of 0 means that all entries in the cluster have exactly the same temperature.
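The relationship between these fields can be illustrated with a short Python sketch. It assumes, as in the published DenStream algorithm, that each entry is weighted by 2^(-decay * age), that the center is the weighted mean temperature, and that the radius is the weighted standard deviation. Whether smart data streaming computes CLUSTER_INFO in exactly this way is not confirmed by this tutorial, and the decay value and sample readings are made up.

```python
import math

DECAY = 0.25  # assumed decay factor (lambda); not a value from the tutorial


def summarize(points, now, decay=DECAY):
    """Return (weight, center, radius) for (timestamp, temperature) pairs."""
    fades = [2.0 ** (-decay * (now - t)) for t, _ in points]
    weight = sum(fades)
    # Center: temperatures averaged with the faded weights.
    center = sum(f * temp for f, (_, temp) in zip(fades, points)) / weight
    # Radius: weighted standard deviation around the center.
    variance = sum(
        f * (temp - center) ** 2 for f, (_, temp) in zip(fades, points)
    ) / weight
    return weight, center, math.sqrt(variance)


if __name__ == "__main__":
    # Three readings with the same temperature: radius is (numerically) zero
    # and the weight is below 3 because the older readings have decayed.
    print(summarize([(0.0, 20.0), (1.0, 20.0), (2.0, 20.0)], now=2.0))
    # Two different readings: the center leans toward the newer one.
    print(summarize([(0.0, 19.0), (2.0, 21.0)], now=2.0))
```

The standard deviation is computed around the center directly rather than from running sums of squares, which is mathematically equivalent but avoids rounding noise when all temperatures are identical.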
In Cluster_Assignments, you can see overall information about each cluster. The clusters are numbered from 1 to 7, while the outliers have an assignment of -1. The COUNT_ID is the number of points in that cluster.
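The per-cluster count shown in COUNT_ID amounts to grouping points by their assignment. As a rough illustration in Python, with made-up assignments rather than the tutorial's data and a hypothetical helper name:

```python
from collections import Counter

OUTLIER_ID = -1  # cluster ID the tutorial uses for outliers


def count_per_cluster(assignments):
    """Map each cluster ID to the number of points assigned to it."""
    return dict(Counter(assignments))


if __name__ == "__main__":
    # Made-up assignments, one per input point; not the tutorial's data.
    sample = [1, 1, 2, -1, 1, 2, -1, 3]
    counts = count_per_cluster(sample)
    for cluster_id in sorted(counts):
        if cluster_id == OUTLIER_ID:
            label = "outliers"
        else:
            label = f"cluster {cluster_id}"
        print(label, counts[cluster_id])
```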
ADDITIONAL MATERIALS
Links
What’s New in HANA SPS11
Streaming Developer Guide
Machine Learning with Streaming
DenStream Clustering
© 2015 SAP SE or an SAP affiliate company. All rights reserved.
No part of this publication may be reproduced or transmitted in any form or for any purpose without the express permission of SAP SE or an SAP
affiliate company. SAP and other SAP products and services mentioned herein as well as their respective logos are trademarks or registered
trademarks of SAP SE (or an SAP affiliate company) in Germany and other countries.
Please see http://www.sap.com/corporate-en/legal/copyright/index.epx#trademark for additional trademark information and notices.