Part 1 Configuring Oracle Big Data SQL

Oracle Big Data, Data Science, Advanced Analytics & Oracle NoSQL Database
Securely analyze data across the big data platform, whether that data resides in Oracle Database 12c, in Hadoop,
or in a combination of these sources. You will be able to leverage your existing Oracle skill sets and applications to
gain these insights. Apply Oracle's rich SQL dialect and security policies across the data platform, greatly
simplifying the ability to gain insights from all your data.
There are two parts to Big Data SQL:
• Enhanced Oracle external tables
• Oracle Big Data SQL Server
  o Big Data SQL Server applies Smart Scan over data stored in Hadoop in order to achieve fast performance. (Big Data SQL Server is available on the Oracle Big Data Appliance only, not on the VM/OVA.)
Copy/Download bigdatasql_hol_otn_setup.sql and bigdatasql_hol.sql files.
Run the bigdatasql_hol_otn_setup.sql script in SQL Developer; when prompted for a connection, select the moviedemo
connection and click OK. This completes the setup for this tutorial. The script bigdatasql_hol.sql contains the DEMO steps.
The virtual environment for this tutorial is mostly preconfigured for Oracle Big Data SQL:
There are six simple tasks required to configure Oracle Big Data SQL:
1. Create the Common Directory and a Cluster Directory on the Exadata Server. DONE.
2. Create and populate the bigdata.properties file in the Common Directory. DONE.
3. Copy the Hadoop configuration files into the Cluster Directory. DONE.
4. Create corresponding Oracle directory objects that reference these configuration directories.
5. Install Oracle Big Data SQL on the BDA using Mammoth, the BDA's installation and configuration utility. DONE.
6. Install a CDH client on each Exadata Server. DONE.
Common Directory
The Common directory contains a few subdirectories and an important file, named bigdata.properties. This file stores
configuration information that is common to all BDA clusters. Specifically, it contains property/value pairs used to configure
the JVM and identify a default cluster.
For Exadata, the Common directory must be on a clusterwide file system; it is critical that all Exadata Database nodes
access the exact same configuration information.
cd /u01/bigdatasql_config/
cat bigdata.properties
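For reference, bigdata.properties contains entries along these lines (an illustrative sketch only; the exact property names, paths and versions depend on your installation):
# illustrative bigdata.properties - values are environment specific
java.libjvm.file=/usr/java/latest/jre/lib/amd64/server/libjvm.so
java.classpath.oracle=/u01/app/oracle/product/12.1.0.2/dbhome_1/jlib/*
java.classpath.hadoop=/usr/lib/hadoop/*:/usr/lib/hadoop/lib/*
LD_LIBRARY_PATH=/usr/lib/hadoop/lib/native
bigdata.cluster.default=bigdatalite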
Cluster Directory
The Cluster directory contains configuration files required to connect to a specific BDA cluster. In addition, the Cluster
directory must be a subdirectory of the Common directory and the name of the directory is important: It is the name that
you will use to identify the cluster.
Notes:
• The properties, which are not specific to a Hadoop cluster, include items such as the location of the Java VM, classpaths and the LD_LIBRARY_PATH.
• In addition, the last line of the file specifies the default cluster property, in this case bigdatalite.
• As you will see later, the default cluster simplifies the definition of Oracle tables that access data in Hadoop.
• In our hands-on lab, there is a single cluster: bigdatalite. The bigdatalite subdirectory contains the configuration files for the bigdatalite cluster.
• The name of the cluster must match the name of the subdirectory (and it is case sensitive!).
cd /u01/bigdatasql_config/bigdatalite
ls
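On this VM the listing typically shows the Hadoop and Hive client configuration files; for example (the exact set of files may vary by environment):
core-site.xml  hdfs-site.xml  hive-site.xml  mapred-site.xml  yarn-site.xml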
Notes:
• These are the files required to connect Oracle Database to HDFS and to Hive.
• Although not required, in our example these files were previously retrieved by using Cloudera Manager.
• The screenshot below shows the home page for a Cloudera Manager cluster. In our example, we selected View Client URLs from the actions menu, and then downloaded the configuration files for both YARN and Hive to the Cluster Directory.
Create the Corresponding Oracle Directory Objects (Task #4)
• ORACLE_BIGDATA_CONFIG: the Oracle directory object that references the Common Directory.
• ORA_BIGDATA_CL_bigdatalite: the Oracle directory object that references the Cluster Directory.
• The naming convention for this directory object is as follows: the name begins with ORA_BIGDATA_CL_ followed by the cluster name (i.e. "bigdatalite"). The cluster name is case sensitive, is limited to 15 characters, and must match the physical directory name in the file system (repeat: it's case sensitive!).
SQL> create or replace directory ORACLE_BIGDATA_CONFIG as '/u01/bigdatasql_config';
SQL> create or replace directory "ORA_BIGDATA_CL_bigdatalite" as '';
Notice that there is no location specified for the Cluster Directory. The directory is expected to be a
subdirectory of ORACLE_BIGDATA_CONFIG, named with the cluster name identified by the Oracle directory object.
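To confirm that the directory objects are in place, you can query the standard Oracle data dictionary view ALL_DIRECTORIES (nothing lab specific is assumed here):
SQL> SELECT directory_name, directory_path
     FROM all_directories
     WHERE directory_name LIKE '%BIGDATA%';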
Recommended Practice: In addition to the Oracle directory objects, you should also create the Big Data SQL Multithreaded
Agent (MTA). (Already done as pre-configuration.) This agent bridges the metadata between Oracle Database and
Hadoop. Technically, the MTA allows the external process to be multithreaded instead of launching a JVM for every process
(which can be quite slow).
SQL> create public database link BDSQL$_bigdatalite using 'extproc_connection_data';
SQL> create public database link BDSQL$_DEFAULT_CLUSTER using 'extproc_connection_data';
Part 2 Create Oracle Table Over Application Log
The movie application streamed data into HDFS, specifically into the following directory:
/user/oracle/moviework/applog_json
Execute the following commands to review the log file stored in HDFS:
hadoop fs -ls /user/oracle/moviework/applog_json
hadoop fs -tail /user/oracle/moviework/applog_json/movieapp_log_json.log
The JSON log records every click that happened on the web site, capturing details about each interaction.
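Each line of the log file is a single JSON document. A hypothetical record might look like the following (the field names mirror the table definitions used later in this lab; the values shown are invented purely for illustration):
{"custid":1185972,"movieid":12569,"genreid":7,"time":"2012-07-01:12:06:23","recommended":"Y","activity":5,"rating":4,"price":2.99}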
Create Oracle Table:
SQL>
CREATE TABLE movielog
(click VARCHAR2(4000))
ORGANIZATION EXTERNAL
(TYPE ORACLE_HDFS
DEFAULT DIRECTORY DEFAULT_DIR
LOCATION ('/user/oracle/moviework/applog_json/')
)
REJECT LIMIT UNLIMITED;
SELECT * FROM movielog WHERE rownum < 20;
SQL> CREATE TABLE movielog_plus
(click VARCHAR2(40))
ORGANIZATION EXTERNAL
(TYPE ORACLE_HDFS
DEFAULT DIRECTORY DEFAULT_DIR
ACCESS PARAMETERS (
com.oracle.bigdata.cluster=bigdatalite
com.oracle.bigdata.overflow={"action":"truncate"}
)
LOCATION ('/user/oracle/moviework/applog_json/')
)
REJECT LIMIT UNLIMITED;
• The click column has been changed to a VARCHAR2(40). Clearly, this is going to be a problem; the length of a JSON document exceeds that size.
• There are numerous ways to handle this situation, including:
  o Generate an error and then reject the record,
  o Set the value to null or replace it with an alternate value, or
  o Simply truncate the data. Here, we are truncating the data, and we have applied this truncate action to all columns in the table; you can also specify the individual column(s) to truncate.
• A cluster, bigdatalite, has been specified. This cluster will be used instead of the default (which in this case happens to be the same).
• Currently, a given session may only connect to a single cluster.
SELECT * FROM movielog_plus WHERE rownum < 20;
Oracle Database 12c (12.1.0.2) includes native JSON support. This allows queries to easily extract attribute data from
JSON documents. Run the following query in SQL Developer:
SQL> SELECT m.click.custid, m.click.movieid, m.click.genreid, m.click.time
FROM movielog m
WHERE rownum < 20;
The column specification in the select list is a full path to the JSON attribute.
The specification starts with the table alias ("m"; note that the alias is required!), followed by the column name ("click"), and then a
case-sensitive JSON path (e.g. "genreId").
Combine data from Oracle Database and Hadoop: join the "click" data with data sourced from the movie dimension table in Oracle Database.
SQL>
SELECT f.click.custid, m.title, m.year, m.gross, f.click.rating
FROM movielog f, movie m
WHERE f.click.movieid = m.movie_id
AND f.click.rating > 4;
Create a view to simplify queries against the JSON data. (The view name movielog_v is the one used by later queries in this lab.)
SQL>
CREATE OR REPLACE VIEW movielog_v AS
SELECT
CAST(m.click.custid AS NUMBER) custid,
CAST(m.click.movieid AS NUMBER) movieid,
CAST(m.click.activity AS NUMBER) activity,
CAST(m.click.genreid AS NUMBER) genreid,
CAST(m.click.recommended AS VARCHAR2(1)) recommended,
CAST(m.click.time AS VARCHAR2(20)) time,
CAST(m.click.rating AS NUMBER) rating,
CAST(m.click.price AS NUMBER) price
FROM movielog m;
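Once the view exists, the JSON attributes can be queried as ordinary typed columns; for example (a quick check, assuming the view was named movielog_v as above):
SQL> SELECT custid, movieid, rating FROM movielog_v WHERE rownum < 10;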
Use Oracle SQL to compare the MoviePlex average ratings for the top 10 grossing movies:
SQL>
SELECT m.title, m.year, m.gross, round(avg(f.rating), 1)
FROM movielog_v f, movie m
WHERE f.movieid = m.movie_id
GROUP BY m.title, m.year, m.gross
ORDER BY m.gross desc
FETCH FIRST 10 ROWS ONLY;
Part 3 Leverage the Hive Metastore to Access Data in Hadoop
Hive enables SQL access to data stored in Hadoop and NoSQL stores.
There are two parts to Hive: the Hive execution engine and the Hive Metastore.
The Hive execution engine launches MapReduce job(s) based on the SQL that has been issued.
MapReduce is a batch processing framework and is not intended for interactive query and analysis, but it is extremely useful
for querying massive data sets using the well-understood SQL language. Importantly, no coding is required (Java, Pig, etc.).
The SQL supported by Hive is still limited (SQL-92), but improvements are being made over time.
The Hive Metastore has become the standard metadata repository for data stored in Hadoop. It contains the definitions of
tables (table name, columns and data types), the location of data files (e.g. directory in HDFS), and the routines required
to parse that data (e.g. StorageHandlers, InputFormats and SerDes - serializer/deserializer).
The same metadata can be shared across multiple products (e.g. Hive, Oracle Big Data SQL, Impala, Pig, Stinger, etc.).
Review Tables Stored in Hive: CLI
hive > show tables;
The movielog table is equivalent to the external table that was defined in Oracle Database in the previous
exercise. Review the definition of the table by executing the following command at the hive prompt:
hive> show create table movielog;
hive> select * from movielog limit 10;
Because there are no columns in the select list and no filters applied, the query simply scans the file and returns the
results. No MapReduce job is executed.
The second table queries that same file; however, this time it uses a SerDe that translates the attributes into
columns. There are columns defined for each field in the JSON document, making it much easier to understand and query
the data. The Java class org.apache.hive.hcatalog.data.JsonSerDe is used to deserialize the JSON file. Review the
definition of the table by executing the following command:
hive > show create table movieapp_log_json;
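The output should resemble the following sketch (the exact DDL, table properties and location in your environment may differ):
CREATE EXTERNAL TABLE movieapp_log_json(
  custid INT,
  movieid INT,
  genreid INT,
  time STRING,
  recommended STRING,
  activity INT,
  rating INT,
  price DOUBLE)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION 'hdfs://bigdatalite.localdomain:8020/user/oracle/moviework/applog_json'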
This is an illustration of Hadoop's schema on read paradigm; a file is stored in HDFS, but there is no schema
associated with it until that file is read. Our examples are using two different schemas to read that same data; these
schemas are encapsulated by the Hive tables movielog and movieapp_log_json.
The Hive query execution engine converts the HiveQL query below into a MapReduce job.
The author of the query does not need to worry about the underlying implementation; Hive handles this automatically.
hive > select * from movieapp_log_json where rating > 4;
hive > exit;
Leverage Hive Metadata When Creating Oracle Tables:
Create a table over the Hive movieapp_log_json table using the following DDL. The ORACLE_HIVE access driver type
invokes Oracle Big Data SQL at query compilation time to retrieve the metadata details from the Hive Metastore; by default,
the Hive table with the same name as the Oracle table is used, and this default can be overridden using ACCESS PARAMETERS.
The metadata includes the location of the data and the classes required to process the data (e.g. StorageHandlers,
InputFormats and SerDes). When queried, the table scans the files found in the /user/oracle/moviework/applog_json directory
and then uses the Hive SerDe to parse each JSON document.
In a true Oracle Big Data Appliance environment, the input splits would be processed in parallel across the nodes
of the cluster by the Big Data SQL Server, the data would then be filtered locally using Smart Scan, and only the
filtered results (rows and columns) would be returned to Oracle Database.
SQL> CREATE TABLE movieapp_log_json (
custid INTEGER ,
movieid INTEGER ,
genreid INTEGER ,
time VARCHAR2 (20) ,
recommended VARCHAR2 (4) ,
activity NUMBER,
rating INTEGER,
price NUMBER
)
ORGANIZATION EXTERNAL
(
TYPE ORACLE_HIVE
DEFAULT DIRECTORY DEFAULT_DIR
)
REJECT LIMIT UNLIMITED;
SQL> SELECT * FROM movieapp_log_json WHERE rating > 4 ;
The second Hive table is defined over the same movie log content, except that the data is in Avro format rather than JSON text format.
Create an Oracle table over that Avro-based Hive table using the following command.
The Oracle table name does not match the Hive table name; therefore, an ACCESS PARAMETER is specified that
references the Hive table (default.movieapp_log_avro).
SQL> CREATE TABLE mylogdata (
custid INTEGER ,
movieid INTEGER ,
genreid INTEGER ,
time VARCHAR2 (20) ,
recommended VARCHAR2 (4) ,
activity NUMBER,
rating INTEGER,
price NUMBER
)
ORGANIZATION EXTERNAL
(
TYPE ORACLE_HIVE
DEFAULT DIRECTORY DEFAULT_DIR
ACCESS PARAMETERS ( com.oracle.bigdata.tablename=default.movieapp_log_avro )
)
REJECT LIMIT UNLIMITED;
SQL> SELECT custid, movieid, time FROM mylogdata;
To illustrate how Oracle Big Data SQL uses the Hive Metastore at query compilation time to determine query execution
parameters, change the definition of the Hive table movieapp_log_json. In Hive, alter the table's LOCATION so
that it points to a directory containing a file with only two records.
The Oracle query then runs without making any changes to the Oracle table movieapp_log_json:
hive > ALTER TABLE movieapp_log_json SET LOCATION
"hdfs://bigdatalite.localdomain:8020/user/oracle/moviework/two_recs";
hive > SELECT * FROM movieapp_log_json;
SQL > SELECT * FROM movieapp_log_json;
Reset the Hive table and then confirm that there are more than two rows. Execute the following commands.
hive > ALTER TABLE movieapp_log_json SET LOCATION
"hdfs://bigdatalite.localdomain:8020/user/oracle/moviework/applog_json";
hive > select * from movieapp_log_json limit 10;
Part 4 Applying Oracle Database Security Policies Over Data in Hadoop
Oracle Database security features, including strong authentication, row-level access, data redaction, data masking, auditing
and more, have been utilized to ensure that data remains safe in Hadoop/HDFS.
For example, to protect personally identifiable information such as the customer last name and customer id, an
Oracle Data Redaction policy has already been set up on the customer table that obscures these two fields. This was
accomplished by using the DBMS_REDACT PL/SQL package.
SQL>
BEGIN
  DBMS_REDACT.ADD_POLICY(
    object_schema => 'MOVIEDEMO',
    object_name => 'CUSTOMER',
    column_name => 'CUST_ID',
    policy_name => 'customer_redaction',
    function_type => DBMS_REDACT.PARTIAL,
    function_parameters => '9,1,7',
    expression => '1=1'
  );
END;
/
This creates a policy called customer_redaction:
• It is applied to the cust_id column of the moviedemo.customer table.
• It performs a partial redaction, i.e. it is not necessarily applied to all characters in the field.
• It replaces the first 7 characters with the number "9" (so the first seven characters display as 9999999).
• The redaction policy will always apply, since the expression describing when it applies is specified as "1=1".
SQL>
BEGIN
  DBMS_REDACT.ALTER_POLICY(
    object_schema => 'MOVIEDEMO',
    object_name => 'CUSTOMER',
    action => DBMS_REDACT.ADD_COLUMN,
    column_name => 'LAST_NAME',
    policy_name => 'customer_redaction',
    function_type => DBMS_REDACT.PARTIAL,
    function_parameters => 'VVVVVVVVVVVVVVVVVVVVVVVVV,VVVVVVVVVVVVVVVVVVVVVVVVV,*,3,25',
    expression => '1=1'
  );
END;
/
This updates the customer_redaction policy, redacting a second column in that same table:
• It replaces characters 3 to 25 of the LAST_NAME column with an '*'.
• The fact that the data is redacted is transparent to application code.
SELECT cust_id, last_name FROM customer;
Apply Redaction Policies to Data Stored in Hadoop:
Apply an equivalent redaction policy to two of our Oracle Big Data SQL tables, with the following effects:
• The first procedure redacts data sourced from JSON in HDFS.
• The second procedure redacts Avro data sourced from Hive.
• Both policies redact the custid attribute.
SQL> BEGIN
-- JSON file in HDFS
DBMS_REDACT.ADD_POLICY(
object_schema => 'MOVIEDEMO',
object_name => 'MOVIELOG_V',
column_name => 'CUSTID',
policy_name => 'movielog_v_redaction',
function_type => DBMS_REDACT.PARTIAL,
function_parameters => '9,1,7',
expression => '1=1'
);
-- Avro data from Hive
DBMS_REDACT.ADD_POLICY(
object_schema => 'MOVIEDEMO',
object_name => 'MYLOGDATA',
column_name => 'CUSTID',
policy_name => 'mylogdata_redaction',
function_type => DBMS_REDACT.PARTIAL,
function_parameters => '9,1,7',
expression => '1=1'
);
END;
/
Review the redacted data from the Avro source:
SQL>
SELECT * FROM mylogdata WHERE rownum < 20;
Join the redacted HDFS data to the customer table by executing the following SELECT statement:
SQL>
SELECT f.custid, c.last_name, f.movieid, f.time
FROM customer c, movielog_v f
WHERE c.cust_id = f.custid;
Part 5 Using Oracle Analytic SQL Across All Your Data
Oracle Big Data SQL allows you to utilize Oracle's rich SQL dialect to query all your data, regardless of where that data may
reside.
Improve Oracle MoviePlex's understanding of its customers by utilizing an RFM analysis:
• Recency: when was the last time the customer accessed the site?
• Frequency: what is the level of activity for that customer on the site?
• Monetary: how much money has the customer spent?
SQL analytic functions will be applied to data residing in both the application logs on Hadoop and the sales tables in Oracle
Database.
An RFM combined score of 551 indicates that the customer is in the highest tier of customers in terms of recent visits (R=5) and
activity on the site (F=5); however, the customer is in the lowest tier in terms of spend (M=1). Apply the Oracle NTILE function
across all the data:
The customer_sales subquery selects from the Oracle Database fact table movie_sales to categorize customers
based on sales. The click_data subquery performs a similar task for web site activity stored in the application logs,
categorizing customers based on their activity and recent visits. These two subqueries are then joined to produce the
complete RFM score.
SQL>
WITH customer_sales AS (
-- Sales and customer attributes
SELECT m.cust_id,
c.last_name,
c.first_name,
c.country,
c.gender,
c.age,
c.income_level,
NTILE (5) over (order by sum(sales)) AS rfm_monetary
FROM movie_sales m, customer c
WHERE c.cust_id = m.cust_id
GROUP BY m.cust_id,
c.last_name,
c.first_name,
c.country,
c.gender,
c.age,
c.income_level
),
click_data AS (
-- Clicks from application log
SELECT custid,
NTILE (5) over (order by max(time)) AS rfm_recency,
NTILE (5) over (order by count(1)) AS rfm_frequency
FROM movielog_v
GROUP BY custid
) SELECT c.cust_id,
c.last_name,
c.first_name,
cd.rfm_recency,
cd.rfm_frequency,
c.rfm_monetary,
cd.rfm_recency*100 + cd.rfm_frequency*10 + c.rfm_monetary AS rfm_combined,
c.country,
c.gender,
c.age,
c.income_level
FROM customer_sales c, click_data cd
WHERE c.cust_id = cd.custid;
2) We want to target customers who we may be losing to the competition. Therefore, amend the
query to find important customers (high monetary score) that have not visited the site recently (low recency score):
SQL>
WITH customer_sales AS (
-- Sales and customer attributes
SELECT m.cust_id,
c.last_name,
c.first_name,
c.country,
c.gender,
c.age,
c.income_level,
NTILE (5) over (order by sum(sales)) AS rfm_monetary
FROM movie_sales m, customer c
WHERE c.cust_id = m.cust_id
GROUP BY m.cust_id,
c.last_name,
c.first_name,
c.country,
c.gender,
c.age,
c.income_level
),
click_data AS (
-- Clicks from application log
SELECT custid,
NTILE (5) over (order by max(time)) AS rfm_recency,
NTILE (5) over (order by count(1)) AS rfm_frequency
FROM movielog_v
GROUP BY custid
) SELECT c.cust_id,
c.last_name,
c.first_name,
cd.rfm_recency,
cd.rfm_frequency,
c.rfm_monetary,
cd.rfm_recency*100 + cd.rfm_frequency*10 + c.rfm_monetary AS rfm_combined,
c.country,
c.gender,
c.age,
c.income_level
FROM customer_sales c, click_data cd
WHERE c.cust_id = cd.custid
AND c.rfm_monetary >= 4
AND cd.rfm_recency <= 2
ORDER BY c.rfm_monetary desc, cd.rfm_recency desc
;
Pattern Matching and Advanced Analytics with PIVOT Tables: