Machine Learning Algorithms – DenStream Clustering
Introducing Machine Learning Algorithms in Smart Data Streaming

TABLE OF CONTENTS

Before You Start
Setting Up the Environment
    Setup
    Create a New Model
Create and Run the DenStream Project
    Create the Project
    Add the CCL File
Upload Data and View Output
Additional Materials

BEFORE YOU START

Pre-requisites:
- You must have an SAP HANA SPS11 system with smart data streaming installed that you can connect to and use.
- You must have SAP HANA Studio 2 installed, with the streaming plug-in installed and configured.
- Download the files here to your local Windows machine (the machine where HANA Studio is installed).
  These files will be used later in the tutorial (make sure you have all of them):
    o denstream_demo.ccl
    o denstream_input_temps.xml
    o denstream_script.sql
- This tutorial assumes you will be using the HANA SYSTEM user to connect to your HANA system.
- You should have completed the Freezer Monitoring Tutorial for smart data streaming or otherwise be comfortable with smart data streaming.

Overview:
SAP HANA SPS11 introduces two machine learning algorithms that can be used in streaming projects: Adaptive Hoeffding Tree and DenStream Clustering. Integrating these algorithms with smart data streaming makes both supervised learning (Adaptive Hoeffding Tree) and unsupervised learning (DenStream Clustering) available, so that data models can be trained efficiently in real time.

This tutorial walks you through a demo of the DenStream Clustering machine learning algorithm. This algorithm discovers data clusters of any shape and updates them constantly, so that only relevant and necessary information is kept. Core-micro-clusters and outlier-micro-clusters are formed, and as the weights of the clusters change over time, an outlier-micro-cluster may become a core-micro-cluster.

The data used in this tutorial comes from the database group at MIT and can be accessed at: http://db.csail.mit.edu/labdata/labdata.html.

Outline:
1. Set up the environment
2. Create a new model using a SQL script
3. Create a new streaming project
4. Add the CCL file
5. Compile and run the project
6. Upload data
7. View output

SETTING UP THE ENVIRONMENT

Setup

Click Window at the top, then go to Preferences. Change the StreamViewer number of rows displayed to 35000 so that you can see all the record types and locate all the outliers in the output.

Create a New Model

You will use a SQL script to create the model. In the Administration Console perspective, click File, then Open File…, and then open the “denstream_script.sql” file, which was one of the files downloaded previously.
The “insert” command creates the model, while the “create table” command creates the table in which you will view the output.

In the top right corner, click the symbol and choose the system to connect to. Run the script.

Note: You may get an error stating “unique constraint violated”. This means the model has already been created.

CREATE AND RUN THE DENSTREAM PROJECT

Create the Project

In the Streaming Development perspective, go to the Project Explorer tab and create a new streaming project called “my_denstream_project”.

Add the CCL File

Switch to the CCL view by right-clicking the empty space and clicking Switch to Text.

Replace the auto-generated CCL with the CCL from “denstream_demo.ccl”, which is one of the files downloaded previously. Change the dataservice to the service the model was created in (if you completed the Freezer Monitoring Tutorial, it would be “freezermon_service”).

Compile and run the project by right-clicking the project.

UPLOAD DATA AND VIEW OUTPUT

In the Streaming Run-Test perspective, expand your project and double-click Cluster_Assignments, denstream_input and denstream_output to open them.

Under the Server View, navigate to the File Upload tab. Click Browse… and open the “denstream_input_temps.xml” file, which is one of the files downloaded previously. Then click Upload.

You can view the data in the three windows. In denstream_output, the cluster data appears under the CLUSTER_INFO column, which is explained below. Similar information for the outliers appears under the OUTLIER_INFO column.

To view the information more easily, right-click any row and click Clipboard Copy. Then paste into any text editor, as shown in the accompanying screenshot.

Cluster Info:
- ClusterID – the number of the cluster; outliers have a clusterID of -1.
- Weight – initially equals the number of entries in the cluster. As events age, a “decay factor” reduces the weights of older events, so the overall weight becomes a decimal value.
- Center – the middle temperature point of the cluster.
- Radius – the size of the cluster. It reflects how much the entries in the cluster differ. If the radius is 0, every entry in the cluster has exactly the same temperature.

In Cluster_Assignments, you can see overall information about each cluster. The different clusters are assigned a number from 1 to 7, while the outliers have an assignment of -1. The COUNT_ID is the number of points in that cluster.

ADDITIONAL MATERIALS

Links:
- What’s New in HANA SPS11
- Streaming Developer Guide
- Machine Learning with Streaming
- DenStream Clustering

© 2015 SAP SE or an SAP affiliate company. All rights reserved. No part of this publication may be reproduced or transmitted in any form or for any purpose without the express permission of SAP SE or an SAP affiliate company. SAP and other SAP products and services mentioned herein as well as their respective logos are trademarks or registered trademarks of SAP SE (or an SAP affiliate company) in Germany and other countries. Please see http://www.sap.com/corporate-en/legal/copyright/index.epx#trademark for additional trademark information and notices.
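
The Weight, Center and Radius fields described under Cluster Info can be illustrated with a small sketch. The Python below is not the smart data streaming implementation and does not read the tutorial's output. It assumes the fading function 2^(-decay_factor * age) from the published DenStream algorithm, and the names fade, summarize and decay_factor are hypothetical, chosen only for this illustration.

```python
import math


def fade(age, decay_factor):
    """Assumed fading function: an event's weight halves every 1/decay_factor time units."""
    return 2.0 ** (-decay_factor * age)


def summarize(points, now, decay_factor):
    """Summarize one cluster of (timestamp, temperature) pairs.

    Returns (weight, center, radius), where each point contributes
    according to how recently it arrived.
    """
    fades = [fade(now - t, decay_factor) for t, _ in points]
    weight = sum(fades)                                   # decayed count of entries
    linear_sum = sum(f * x for f, (_, x) in zip(fades, points))
    square_sum = sum(f * x * x for f, (_, x) in zip(fades, points))
    center = linear_sum / weight                          # weighted mean temperature
    variance = max(square_sum / weight - center ** 2, 0.0)  # clamp rounding noise
    radius = math.sqrt(variance)                          # weighted spread of the entries
    return weight, center, radius


if __name__ == "__main__":
    # Three identical readings: without decay the weight is the entry count.
    same = [(0, 20.0), (1, 20.0), (2, 20.0)]
    print(summarize(same, now=2, decay_factor=0.0))
    # With decay, older readings count for less and the weight becomes a decimal.
    print(summarize(same, now=2, decay_factor=1.0))
    # Readings that differ produce a non-zero radius.
    print(summarize([(0, 10.0), (0, 30.0)], now=0, decay_factor=0.5))
```

Under these assumptions, identical readings give a radius of 0 regardless of the decay factor, while a non-zero decay factor turns the weight from a whole number into a decimal value, which matches the behaviour described for the Weight and Radius fields.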