CS435 Introduction to Big Data
Spring 2017 Colorado State University
4/19/2017 Week 14-B
Sangmi Pallickara
Today’s topics
•  FAQs
•  Pig Latin
PART 3.
DATA STORAGE AND FLOW MANAGEMENT
Sangmi Lee Pallickara
Computer Science, Colorado State University
http://www.cs.colostate.edu/~cs435
FAQs
•  Quiz 8
Dataflow management over HDFS: Apache Pig
This material is based on:
•  Olston, Christopher; Reed, Benjamin; Srivastava, Utkarsh; Kumar, Ravi; and Tomkins, Andrew, "Pig Latin: A Not-so-foreign Language for Data Processing," Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, 2008, pp. 1099-1110

What is Pig?
•  An engine for executing data flows in parallel on Hadoop
•  Includes a language, Pig Latin, to express these data flows
•  Pig Latin provides:
   •  Operators (join, sort, filter, etc.)
   •  Customized functions for reading, processing, and writing data
•  Apache open source project: http://pig.apache.org
Pig on Hadoop
•  Pig runs on Hadoop
   •  Hadoop Distributed File System (HDFS)
   •  Hadoop's processing system, MapReduce

HDFS
•  A distributed file system
   •  Stores files across all of the nodes in a Hadoop cluster
   •  Breaks files into large blocks
   •  Distributes them across different machines
•  Pig
   •  Reads input files from HDFS
   •  Uses HDFS to store intermediate data between MapReduce jobs
   •  Writes its output to HDFS
Apache Pig: What's new with Pig?

A Parallel Dataflow Language
•  Allows users to describe how data from one or more inputs should be read, processed, and then stored to one or more outputs in parallel
•  Simple linear flows
   •  e.g., word count (see the sketch below)
•  Complex workflows
   •  e.g., multiple inputs are joined
•  Uses a Directed Acyclic Graph (DAG)
   •  Edges: dataflows
   •  Nodes: operators that process the data
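For instance, the word-count flow mentioned above can be written as a few lines of Pig Latin. This is a hedged sketch; the input file 'input.txt' and the output path 'wordcounts' are assumptions for illustration:

-- load lines of text, split them into words, group identical words, and count them
lines  = load 'input.txt' as (line:chararray);
words  = foreach lines generate flatten(TOKENIZE(line)) as word;
grpd   = group words by word;
counts = foreach grpd generate group, COUNT(words) as cnt;
store counts into 'wordcounts';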
Query vs. Dataflow language
•  SQL: a query language
   •  Focus on allowing users to form queries
   •  Allows users to describe WHAT question they want answered
   •  Not HOW they want it answered!
•  Pig
   •  What and how
   •  Users can define HOW they get the answers

Dealing with multiple operations-1
•  Using SQL
   •  Write separate queries, storing the intermediate data in temporary tables
   •  OR, write a query with sub-queries

Temporary table:
CREATE TEMP TABLE t1 AS
SELECT customer, sum(purchase) AS total_purchases
FROM transactions
GROUP BY customer;

SELECT customer, total_purchases, zipcode
FROM t1, customer_profile_Colorado
WHERE t1.customer = customer_profile_Colorado.customer;
Dealing with multiple operations-2
•  Using Pig
   •  A long series of data operations is the main part of the design

--Load the transactions file, group it by customer, and sum their total purchases
txns = load 'transactions' as (customer, purchase);
grouped = group txns by customer;
total = foreach grouped generate group, SUM(txns.purchase) as tp;
--Load the customer_profile file
profile = load 'customer_profile_Colorado' as (customer);
--Join the grouped and summed transactions with the customer_profile data
answer = join total by group, profile by customer;
--Write the results to the screen
dump answer;
•  SQL
   •  Designed for the RDBMS environment
   •  Schemas and proper constraints are enforced
   •  Data is normalized
•  Pig
   •  Designed for the Hadoop data-processing environment
   •  Schemas are sometimes unknown or inconsistent
   •  Data may not be properly constrained
   •  Data is rarely normalized
   •  Does not require data to be loaded into tables first
How Pig differs from Pure MapReduce
•  MapReduce DOES the data-processing
•  Then, why is Pig necessary?
•  Pig Latin provides all of the standard data-processing operations
   •  Join, filter, group by, order by, union, etc. (see the sketch below)
•  Pig can analyze a Pig Latin script and understand the dataflow
   •  Early error checking
   •  Optimization
•  Pig provides types
   •  Error checking before and during runtime
•  More efficient implementation of complex data operations
   •  e.g., re-balancing during the reduce phase
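As a hedged illustration of a few of these operations in Pig Latin (the file names and fields below are assumptions, not from the slides):

-- combine two inputs with union, then filter and order the result
a    = load 'sales_2016' as (customer:chararray, amount:double);
b    = load 'sales_2017' as (customer:chararray, amount:double);
both = union a, b;
big  = filter both by amount > 100.0;
srtd = order big by amount desc;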
Easy to use
•  Find the five pages most visited by users between the ages of 18 and 25

users = load 'users' as (name, age);
fltrd = filter users by age >= 18 and age <= 25;
pages = load 'pages' as (user, url);
jnd = join fltrd by name, pages by user;
grpd = group jnd by url;
smmd = foreach grpd generate group, COUNT(jnd) as clicks;
srtd = order smmd by clicks desc;
top5 = limit srtd 5;
store top5 into 'top5sites';

What is Pig Useful for?
•  Traditional ETL (extract, transform, load) data pipelines
•  Research on raw data
•  Iterative processing
What is Pig NOT useful for?
•  If you need to process gigabytes or terabytes of data, Pig is a good choice
   •  It expects to read all the records of a file and write all of its output sequentially
•  If you need to write single records or small groups of records, or to look up many different records in random order, Pig is NOT a good choice
   •  See NoSQL databases (Hive/HBase)

Apache Pig: Getting started with Pig
Grunt
•  Grunt is Pig's interactive shell
   •  Provides command-line history and editing
   •  It is not a full-featured shell
      •  Pipes, redirection, and background execution will not work in your Grunt shell
•  Starting Pig in local mode,

pig -x local

•  will result in the prompt

grunt>

•  To exit Grunt you can type

grunt> quit

•  or enter Ctrl-D
Some HDFS commands in Grunt
•  Grunt acts as a shell for HDFS

grunt> fs -ls

•  cat filename: Print the contents of a file to stdout
•  copyFromLocal localfile hdfsfile: Copy a file from your local disk to HDFS
•  copyToLocal hdfsfile localfile: Copy a file from HDFS to your local disk
•  rmr filename: Remove files recursively (equivalent to rm -r in Unix)

Controlling Pig from Grunt
•  kill jobid: Kill the MapReduce job associated with jobid
•  exec [-param param_name=param_value] [-param_file filename] script: Execute the Pig Latin script
   •  Aliases defined in the script are not imported into Grunt
   •  Good for testing
•  run [-param param_name=param_value] [-param_file filename] script: Execute the Pig Latin script in the current Grunt shell
   •  All aliases referenced in the script are available to Grunt
   •  Shell history is available
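A hedged example of a short Grunt session using these commands (the file and script names are assumptions):

grunt> fs -ls
grunt> copyFromLocal daily.csv NYSE_daily
grunt> exec -param outdir=report1 daily_report.pig
grunt> run daily_report.pig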
Apache Pig: Data types

Data types: scalar types
•  Pig's data types
   •  Scalar types
   •  Complex types
•  Scalar types
   •  int
      •  An integer
      •  Ints are represented in interfaces by java.lang.Integer
      •  4-byte signed integer
   •  long
      •  A long integer
      •  Longs are represented in interfaces by java.lang.Long
      •  8-byte signed integer
   •  float
      •  A floating-point number
      •  Floats are represented in interfaces by java.lang.Float
      •  4 bytes to store the value
   •  double
      •  A double-precision floating-point number
      •  Doubles are represented in interfaces by java.lang.Double
      •  8 bytes to store the value
   •  chararray
      •  A string or character array
      •  Represented in interfaces by java.lang.String
   •  bytearray
      •  A blob or array of bytes
      •  Represented in interfaces by DataByteArray, which wraps a Java byte[]

Data types: complex types
•  Tuple
   •  A fixed-length, ordered collection of Pig data elements
   •  Divided into fields
   •  ('bob', 55) describes a tuple with two fields
   •  ('John', 18, 4.0F) describes a tuple with three fields
•  Bag
   •  An unordered collection of tuples
   •  Not possible to reference tuples by position
   •  {('bob',55),('sally',52),('john',25)}
   •  A bag can have:
      •  Duplicate tuples
      •  Tuples with different numbers of fields
      •  Tuples with different data types
•  Map
   •  A set of key-value pairs
   •  A mapping from a chararray key to a data element, where that element can be any Pig type
   •  ['name'#'bob','age'#55] creates a map with two keys, "name" and "age"
      •  The first value ('bob') is a chararray and the second value (55) is an integer
   •  Keys should be unique
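A hedged sketch of declaring complex types in a load schema; it previews the 'baseball' example used later in these slides, with the map lookup added here for illustration:

-- pos is a bag of single-field tuples; bat is a map with chararray keys
player = load 'baseball' as (name:chararray, team:chararray,
         pos:bag{t:(p:chararray)}, bat:map[]);
-- project a value out of the map by key
walks  = foreach player generate name, bat#'base_on_balls';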
Nulls
•  Data of any type can be null
•  Null in Pig means the data is unknown
   •  Missing data, error, etc.
•  Pig does not have a notion of constraints on the data
   •  A null can appear at any time

Schemas
•  Pig supports schemas
•  If a schema for the data is available:
   •  Up-front error checking
   •  Optimization
•  If no schema is available:
   •  Pig will still process the data
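A hedged sketch of how nulls typically show up in a script (the file and field names follow the NYSE examples used below):

divs = load 'NYSE_dividends' as (exchange, symbol, date, dividend);
-- records whose dividend value is missing or unreadable carry a null; drop them
good = filter divs by dividend is not null;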
If a schema is available..
•  Your program should tell Pig what it is when you load the data:

Dividends = load 'NYSE_dividends' as (exchange:chararray,
    symbol:chararray, date:chararray, dividend:float);

•  Or, declaring only the field names:

Dividends = load 'NYSE_dividends' as (exchange, symbol,
    date, dividend);

If no schema is available
•  Use a dollar sign + position

daily = load 'NYSE_daily';
calcs = foreach daily generate $7 / 1000, $3 * 100.0,
    SUBSTRING($0, 0, 1), $6 - $3;

•  Pig makes a safe guess
   •  $7 / 1000: guess that $7 (the eighth field) is a numeric type
   •  $3 * 100.0: guess that $3 is a numeric type
   •  SUBSTRING($0, 0, 1): guess that $0 is a chararray
   •  $6 - $3: guess that $6 and $3 are numeric types
If no schema is available (continued)
daily = load 'NYSE_daily';
fltrd = filter daily by $6 > $3;
•  > is a valid operator on:
   •  numeric
   •  chararray
   •  bytearray
•  No safe guess!
   •  In this case, Pig treats these fields as if they were bytearrays
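If you know the intended types, one hedged way to make the comparison unambiguous is to cast the positional fields explicitly (a sketch, not from the original slides):

daily = load 'NYSE_daily';
-- cast both fields so > compares floats rather than bytearrays
fltrd = filter daily by (float)$6 > (float)$3;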
Multiple types with schema [1/2]

cat student;
John    18  4.0
Mary    19  3.8
Bill    20  3.9
Joe     18  3.8
A = LOAD 'student' AS (name:chararray,age:int,gpa:float);
DESCRIBE A;
A: {name:chararray,age:int,gpa:float}
DUMP A;
(John,18,4.0F)
(Mary,19,3.8F)
(Bill,20,3.9F)
(Joe,18,3.8F)
Multiple types with schema [2/2]

cat student;
John    18  4.0
Mary    19  3.8
Bill    20  3.9
Joe     18  3.8

A = LOAD 'student' AS (name:chararray, age:int, gpa);
DESCRIBE A;
A: {name: chararray, age: int, gpa: bytearray}
DUMP A;
(John,18,4.0)
(Mary,19,3.8)
(Bill,20,3.9)
(Joe,18,3.8)

Casts
•  Casts to bytearray are NOT allowed
•  Casts from bytearray to any type are allowed
•  Casts to and from complex types are NOT allowed
•  Casts between scalar types (FROM → TO):
   •  int → long, float, double, chararray: yes (int → int: N.A.)
   •  long → int: yes; values greater than 2^31 or less than -2^31 will be truncated
   •  long → float, double, chararray: yes
   •  float → int: yes; values will be truncated to int values
   •  float → long: yes; values will be truncated to long values
   •  float → double, chararray: yes
   •  double → int: yes; values will be truncated to int values
   •  double → long: yes; values will be truncated to long values
   •  double → float: yes; values with precision beyond what float can represent will be truncated
   •  double → chararray: yes
   •  chararray → int, long, float, double: yes; chararrays with nonnumeric characters result in null
   •  chararray → chararray: N.A.
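A hedged sketch of an explicit cast following the table above (the file and field names are assumptions):

-- age is loaded as a chararray; cast it explicitly to int
players = load 'player_ages' as (name:chararray, age:chararray);
typed   = foreach players generate name, (int)age;
-- per the table, an age with nonnumeric characters becomes null after the cast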
Internal Casts

daily = load 'NYSE_daily' as (exchange:chararray,
    symbol:chararray, date:chararray, open:float, low:float,
    close:float, volume:int, adj_close:float);
rough = foreach daily generate volume * close;

•  Pig will change volume to (float)volume internally
•  Pig widens types without losing precision
   •  int with long → (long)int with long → long
   •  long with float → (float)long with float → float
•  There are NO internal casts between numeric types and chararrays
How strongly typed is Pig?
•  Strongly typed computer languages
   •  Users must declare up front the type for all variables
   •  e.g., Java
•  Weakly typed languages
   •  Variables can take on values of different types
   •  Adapt as the occasion demands
   •  e.g., Perl
•  Pig: "gently typed"
   •  If a schema is provided, it follows the schema
   •  If no schema is provided, it adapts to the actual types at runtime

How strongly typed is Pig? (continued)

player = load 'baseball' as (name:chararray,
    team:chararray, pos:bag{t:(p:chararray)}, bat:map[]);
unintended = foreach player generate bat#'base_on_balls' - bat#'ibbs';

•  '-' is an operator expecting integers as operands
•  What will be the return type of unintended?
Continued

player = load 'baseball' as (name:chararray,
    team:chararray, pos:bag{t:(p:chararray)}, bat:map[]);
unintended = foreach player generate bat#'base_on_balls' - bat#'ibbs';

•  bat#'base_on_balls': bytearray
•  bat#'ibbs': bytearray
•  With explicit casts:

player = load 'baseball' as (name:chararray,
    team:chararray, pos:bag{t:(p:chararray)}, bat:map[]);
unintended = foreach player generate
    (int)bat#'base_on_balls' - (int)bat#'ibbs';

•  Return type will be integer

Relation Names
•  Pig Latin is a dataflow language
   •  Each processing step results in a new data set, or relation

input = load 'data'

•  input
   •  Name of the relation
   •  Results from loading the data set 'data'
•  Relation names look like variables
   •  But they are NOT
Reusing the relation names
•  It is possible to reuse relation names

A = load 'NYSE_dividends' as (exchange, symbol, date, dividends);
A = filter A by dividends > 0;
A = foreach A generate UPPER(symbol);

•  This example creates a new relation called A repeatedly
   •  It loses track of the old relations called A
   •  Not recommended

Field Names
•  Fields (or columns) in a relation
   •  e.g., dividends, symbol, date in the example above
•  Field names look like variables
   •  But you cannot assign values to them!
Preliminary Matters - 1
•  Both relation and field names MUST start with an alphabetic character
   •  Then they can have 0 or more alphabetic, numeric, or _ characters
   •  All characters in the name must be ASCII
•  Keywords in Pig Latin are NOT case-sensitive
   •  LOAD = load
•  Field/relation names ARE case-sensitive
   •  A = load 'foo' and a = load 'foo' define two different relations

Preliminary Matters - 2
•  UDF names ARE case-sensitive
   •  COUNT() is not the same UDF as count()
•  Comments
   •  SQL-style: --
   •  Java-style: /* */

A = load 'foo'; -- this is a single-line comment
/*
 * This is a multiline comment.
 */
B = load /* a comment in the middle */ 'bar';
Input and Output: load
•  A tab-separated file stored in HDFS:

divs = load '/data/examples/NYSE_dividends'

•  Using a relative path name
   •  Relative paths to data files are resolved against your home directory on HDFS
      •  /users/yourlogin
•  Using a complete path name
   •  nn.acme.com: URL of the NameNode

divs = load 'hdfs://nn.acme.com/data/examples/NYSE_dividends'

•  Data stored in HBase
   •  Use the loader for HBase

divs = load 'NYSE_dividends' using HBaseStorage()
PigStorage()
•  The default data loader for Pig
•  Indicating the separator:

divs = load 'NYSE_dividends' using PigStorage(',');

•  Specifying the schema:

divs = load 'NYSE_dividends' as (exchange, symbol, date, dividends);

•  Loads data from files and directories
•  Allows using globs and multiple file loading (see the sketch at the end of this section)

Input and Output: Store
•  Pig stores your data on HDFS in a tab-delimited file using PigStorage()

store processed into '/data/examples/processed';
store processed into 'hdfs://nn.acme.com/data/examples/processed';

•  Pig can store your data to HBase

store processed into 'processed' using HBaseStorage();
•  PigStorage() takes an argument to indicate the separator.
store processed into 'processed' using PigStorage(',');
•  Print on the screen.
dump processed;
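A hedged sketch tying these together: loading several comma-delimited files with a glob, then storing the result (the paths and fields are assumptions):

-- the glob 'part-*' loads every matching file under the directory
daily = load 'NYSE_daily/part-*' using PigStorage(',')
        as (exchange, symbol, date, close);
highs = filter daily by close > 100.0;
store highs into 'high_close' using PigStorage(',');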