HPE Vertica Pulse

Vertica Pulse
HPE Vertica Analytic Database
Software Version: 7.2.x
Document Release Date: 2/8/2017
Legal Notices
Warranty
The only warranties for Hewlett Packard Enterprise products and services are set forth in the express warranty
statements accompanying such products and services. Nothing herein should be construed as constituting an
additional warranty. HPE shall not be liable for technical or editorial errors or omissions contained herein.
The information contained herein is subject to change without notice.
Restricted Rights Legend
Confidential computer software. Valid license from HPE required for possession, use or copying. Consistent with
FAR 12.211 and 12.212, Commercial Computer Software, Computer Software Documentation, and Technical Data
for Commercial Items are licensed to the U.S. Government under vendor's standard commercial license.
Copyright Notice
© Copyright 2006 - 2016 Hewlett Packard Enterprise Development LP
Trademark Notices
Adobe™ is a trademark of Adobe Systems Incorporated.
Apache® Hadoop® and Hadoop are either registered trademarks or trademarks of the Apache Software
Foundation in the United States and/or other countries.
Microsoft® and Windows® are U.S. registered trademarks of Microsoft Corporation.
UNIX® is a registered trademark of The Open Group.
This product includes an interface of the 'zlib' general purpose compression library, which is Copyright © 1995-2002 Jean-loup Gailly and Mark Adler.
HPE Vertica Analytic Database (7.2.x)
Page 2 of 117
Contents
Pulse Virtual Machine Quick Start
About the Vertica Pulse Package
Installing or Upgrading Vertica Pulse
  Vertica Pulse Package Version Requirements
  Installation Overview
  Installing Java on Vertica Hosts
    Setting the JavaBinaryForUDx Configuration Parameter
  Installing or Upgrading the Vertica Pulse Package on Your Host
    Install or Upgrade the Pulse Package
    Running the Pulse Install Script
  Tuning the jvm Resource Pool for Vertica Pulse
    Configuring the jvm Resource Pool for your System
  Assign Users to the pulse_users Role and Allow Access to Pulse Functions
  Uninstalling Vertica Pulse and Pulse Packages
    Uninstall Vertica Pulse on Your Hosts
    Uninstall Pulse Packages
Using Pulse
  Dictionaries and Mappings
    Dictionaries
    Mappings
    Dictionary and Mapping Tables
    Loading Dictionaries and Mappings into Pulse
      Automatically Loading Dictionaries and the Normalization Map
      Manually Loading Dictionaries and the Normalization Map
    Dictionary and Mapping Labels
    Normalization Map Effect on Results
      Before Mapping
      Insert Normalization Values and Load Map
      After Mapping
    Creating Tables for Custom Dictionary Mappings
    Using Action Patterns in Dictionaries
      Action Pattern Syntax
      Default Action Patterns
      Examples
    Using Lists In Dictionaries
    Using Regular Expressions in Dictionaries
  Determining Sentiment
  Sentiment Analysis Levels
    Attribute-Level Analysis
    Sentence-Level Analysis
    Document-Level Analysis
  Tuning Pulse
    Improving Automatic Attribute Discovery
    Determining How Pulse Scores Sentiment
    Improving Sentiment Scores
    Sentiment Scoring and the Precedence of Pulse User-Dictionaries
    Tuning Example
    Additional Tuning Examples
  Bulk Loading Word Lists from Text Files
    Bulk Loading User Dictionary Lists
    Bulk Loading the Normalization Map
Multilingual Pulse
  Spanish Pulse
  Multilingual Examples
Pulse Cookbook
  Batch Analyzing Data as It Is Loaded
  Analyzing Comments for a Company or Product
  Determining Popular Topics
  Determining Prolific Authors
  Analyzing the Sentiment of Specific Authors
  Finding Associated Attributes
  Using Pulse as an Aid in Competitive Analysis
Pulse Function Reference
  LoadDictionary
  LoadMapping
  SentimentAnalysis
  PartsOfSpeech
  GetAllDictionarySetLabels
  GetAllDictionaryWords
  GetAllLoadedDictionaries
  GetAllMappingWords
  CommentAttributes
  GetSentenceCount
  ExtractSentence
  GetAllSentences
  SetDefaultLanguage
  GetLoadedDictionary
  GetLoadedMapping
  GetStorage
  UnloadLabeledDictionary
  UnloadLabeledDictionarySet
  UnloadLabeledMapping
Send Documentation Feedback
Pulse Virtual Machine Quick Start
These Quick Start instructions detail the minimal steps for installing and using Pulse
with the Vertica Virtual Machine Image. Consult the complete documentation for
detailed steps on installing Pulse on your own platform.
Downloading and Installing Pulse
1. Go to http://my.vertica.com/ and sign in. Then, click the Download tab.
2. Scroll down to the section "Download Vertica 7.1 Virtual Machines" and click the
download link for your VM environment. These instructions assume you are
installing the VMDK version - VMware Server 2.0 and Workstation 7.0.
3. After the download completes, unzip the file.
4. Double-click the .vmx file in vmsrvr_64/Vertica 7.1.x x64 for VMware. The
VM starts in your VMware application.
5. You are automatically logged in as dbadmin. The password for dbadmin (and for
root) is 'password'.
6. In the VM, select Applications > Accessories > Terminal to open a terminal.
7. In the terminal, type admintools to start the administration tools.
8. You are prompted for a license when admintools starts for the first time. To use the
community edition license, simply click OK. You are then prompted to accept the
EULA. Accept the EULA then exit admintools.
9. As dbadmin, using vsql on any node in the cluster, set the JavaBinaryForUDx
configuration parameter (use which java to determine your Java location):
vsql -t -c "ALTER DATABASE mydb SET JavaBinaryForUDx = '/usr/bin/java';"
10. Copy the Vertica Pulse install package to the VM then, as root, install the Pulse
Package:
rpm -Uvh /path/to/vertica-pulse.x86_64.xxx.rpm
Note: Only install Vertica Pulse on a single node. All Pulse functions are
available on all nodes. However, the installation SQL scripts and user-dictionary
loading script are only available on the node on which you install the Pulse
package.
11. As dbadmin, run the Pulse install script on the node on which you installed the
Pulse Package:
vsql -f /opt/vertica/packages/pulse/ddl/install.sql
Using Pulse
1. Run a sentiment function:
select sentimentanalysis('Cookies are sweet.') OVER(PARTITION BEST);
 sentence | attribute | sentiment_score
----------+-----------+-----------------
        1 | cookies   |               1
(1 row)
Note: By default, Vertica Pulse analyzes English text. However, you can also specify
the language of the text being analyzed as an argument of the sentimentanalysis()
function. For example:
select sentimentanalysis('Cookies are sweet.', 'english') OVER(PARTITION BEST);
English and Spanish are the supported languages.
About the Vertica Pulse Package
Vertica Pulse provides a suite of functions that allow you to analyze and extract the
sentiment from English and Spanish language text directly from your Vertica database.
Vertica Pulse features include:
• Attribute based sentiment scoring - Pulse scores the sentiment of attributes in a
sentence. Attributes are generally nouns and are automatically discovered by Pulse.
Pulse typically scores sentiment from a range of -1 (negative sentiment) to +1
(positive sentiment). A sentiment of 0 is considered neutral. Scoring individual
attributes in a sentence instead of scoring the sentence as a whole provides a more
granular analysis for the text. For example, consider the sentence "The quick brown
fox jumped over the lazy dog." It would be difficult to score the sentiment on the
sentence as a whole, but if you score on the attributes of fox and dog, you could say
the sentiment on the fox was positive (the fox is quick), and the sentiment on the dog
is negative (the dog is lazy).
• Tuning to your domain - Pulse provides functionality to recognize attributes that
are specific to your domain. For example, you can add the name of your product or
company to a 'white_list' so that it is discovered by Pulse.
• Tuning of how sentiment is scored - Pulse includes user-dictionaries of words that
are used to help score sentiment. You can alter these user-dictionaries to fine tune
the way your text is analyzed.
• Filtering of attributes you are not interested in - Pulse supports a special 'stop
words' user-dictionary to indicate attributes that should not be analyzed. Alternately,
you can choose to score sentiment only on attributes defined in your white_list.
• Synonym mappings - Pulse provides customizable mappings so that you can map
synonyms to a base word, and then normalize the analysis for the synonyms to the
base word. For example, you can map Hewlett Packard to HP.
Vertica Pulse requires that Java and the Vertica Java Support Package are installed on
all nodes in the Vertica cluster.
Depending on the version of Pulse, it may support only one language (English or
Spanish) or multiple languages (English and Spanish). For multilingual versions, Pulse
can analyze each text row (for example, a tweet) in the language specified as an
argument, the language specified by the user as a parameter, or the default language.
See Multilingual Pulse for details.
Installing or Upgrading Vertica Pulse
The Vertica Pulse Package requires that Java be installed prior to installing Vertica
Pulse.
• Vertica Pulse Package Version Requirements
• Installation Overview
• Installing Java on Vertica Hosts
• Installing or Upgrading the Vertica Pulse Package on Your Host
• Tuning the jvm Resource Pool for Vertica Pulse
• Assign Users to the pulse_users Role and Allow Access to Pulse Functions
• Uninstalling Vertica Pulse and Pulse Packages
Vertica Pulse Package Version Requirements
Your server must be running version 7.1.x or later to run Pulse. Pulse must be installed
on a Vertica node.
You can download the Vertica server package from the Vertica Marketplace.
Installation Overview
1. Verify that your Vertica server version matches your Vertica Pulse version.
2. Install Java on all hosts and set the JavaBinaryForUDx Vertica configuration
parameter to your Java binary location. For example, using vsql:
ALTER DATABASE mydb SET JavaBinaryForUDx = '/usr/bin/java';
3. Install the Vertica Pulse Package on a single node in the cluster. The process is the
same for installation or upgrade. You need only install it on a single node, but note
that the SQL scripts used to install and uninstall the Pulse functions, and the SQL
script that creates the pulse schema and the user-dictionary tables, are available only
on the node on which you installed the Pulse package. Once installed, the Pulse
functions are available on all nodes, regardless of whether the package is installed
on the node to which you are connecting.
4. Modify the jvm resource pool so that Pulse performs optimally on your system
hardware.
Installing Java on Vertica Hosts
You must install a Java Virtual Machine (JVM) on every host in your Vertica cluster in
order to run Pulse. Pulse requires a 64-bit Java Standard Edition 6 or 7 (Java version
1.6 or 1.7) runtime. Both the Oracle JDK and OpenJDK are supported. You can choose to
install either the Java Runtime Environment (JRE) or Java Development Kit (JDK),
since the JDK also includes the JRE. See the Java Standard Edition (SE) Download
Page to download an Oracle installation package for your Linux platform, or use your
platform's packaging tool (such as yum or apt-get) to get a Java 1.6 or 1.7 compatible
version of OpenJDK.
Once you have installed a JVM on each host, ensure that the java command is in the
search path and calls the correct JVM by running the command:
java -version
This command should print something similar to:
java version "1.6.0_37"
Java(TM) SE Runtime Environment (build 1.6.0_37-b06)
Java HotSpot(TM) 64-Bit Server VM (build 20.12-b01, mixed mode)
Setting the JavaBinaryForUDx Configuration Parameter
The JavaBinaryForUDx configuration parameter tells Vertica where to look for the JRE
to execute Java UDFs. After you have installed the JRE on all of the nodes in your
cluster, you need to set this parameter to the absolute path of the Java executable. You
can use the symbolic link that some Java installers create (for example /usr/bin/java). If
the Java executable is in your shell search path, you can get the path of the Java
executable by running the following command from the Linux command line shell:
$ which java
/usr/bin/java
If the java command is not in the shell search path, use the path to the Java executable
in the directory where you installed the JRE. For example, if you installed the JRE in
/usr/java/default (which is where the installation package supplied by Oracle
installs the Java 1.6 JRE), the Java executable is /usr/java/default/bin/java.
You set the configuration parameter by executing the following statement as a database
superuser:
=> ALTER DATABASE mydb SET JavaBinaryForUDx = '/usr/bin/java';
See ALTER DATABASE for more information on setting configuration parameters.
To view the current setting of the configuration parameter, query the
CONFIGURATION_PARAMETERS system table:
=> \x
Expanded display is on.
=> SELECT * FROM CONFIGURATION_PARAMETERS WHERE parameter_name =
'JavaBinaryForUDx';
-[ RECORD 1 ]-----------------+----------------------------------------------
node_name                     | ALL
parameter_name                | JavaBinaryForUDx
current_value                 | /usr/bin/java
default_value                 |
change_under_support_guidance | f
change_requires_restart       | f
description                   | Path to the java binary for executing UDx written in Java
Once you have set the configuration parameter, Vertica will be able to find the Java
executable on each node in your cluster in order to execute Java UDFs.
Note: Since the location of the Java executable is set by a single configuration
parameter for the entire cluster, you must ensure that the path to the Java executable is
the same across all of the nodes in the cluster.
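Because a single parameter holds the path for the whole cluster, it can help to compare the Java path reported by each host before you set it. The following is a minimal shell sketch and is not part of the Pulse package: the function name check_same_java_path is hypothetical, and the host names node01, node02, and node03 are placeholders for your own hosts. The function reads "host path" pairs and reports whether every host agrees.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not shipped with Pulse): reads lines of the form
# "<host> <java-path>" on stdin. Prints "consistent" when every host reports
# the same path; otherwise prints "mismatch" followed by the distinct paths.
# Empty input is also reported as "mismatch".
check_same_java_path() {
  awk '
    NF >= 2 { seen[$2] = 1 }
    END {
      n = 0
      for (p in seen) n++
      if (n == 1) {
        print "consistent"
      } else {
        print "mismatch"
        for (p in seen) print "  " p
      }
    }'
}

# Demonstration with literal sample data.
printf '%s\n' 'node01 /usr/bin/java' 'node02 /usr/bin/java' | check_same_java_path

# On a real cluster you might collect the pairs over ssh
# (host names are placeholders; assumes passwordless ssh between hosts):
#   for h in node01 node02 node03; do
#     echo "$h $(ssh "$h" 'which java')"
#   done | check_same_java_path
```

If the result is "mismatch", one option is to create the same symbolic link (for example /usr/bin/java) on every host and use that link as the parameter value.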
Installing or Upgrading the Vertica Pulse Package on Your Host
After you install a JVM on all of the nodes in your cluster, you must install the Pulse
Package on a single node. If upgrading, install the new package on the same host on
which you previously installed the package. Pulse installation or upgrade is a two-step
process:
1. Install or update the RPM or DEB package for Pulse.
2. Run the included SQL scripts to install or update the Pulse functions and create the
user dictionaries.
The Pulse install process installs the functions and schema required for sentiment
analysis. You need only install it on a single node. However, be aware that the following
SQL scripts are only available from the node on which you installed the Pulse package:
• SQL scripts used to install and uninstall the Pulse functions
• SQL script that populates and loads the dictionaries
You can access Pulse functions on all nodes, regardless of whether the package is
installed on the node to which you are connecting.
Install or Upgrade the Pulse Package
When you upgrade or reinstall Pulse, it automatically uses port 5433 for vsql. If you are
using a different port, configure it using the command export VSQL_PORT=<port_number>.
1. Copy the RPM or DEB package to the node where you want to install or upgrade
Pulse. If you are upgrading Pulse, copy the new package to the same node
where you previously installed the Pulse package. The version of Vertica Pulse
must match the version of the Vertica server. For example, if your Vertica server is
version 7.1.0, then the Vertica Pulse version must also be 7.1.0.
If you are upgrading Pulse, you can find the currently-installed version number of
Pulse with the command:
select lib_version, lib_sdk_version from user_libraries where lib_name = 'SentimentLib';
2. Log into the host and install the package.
  • For Red Hat, use:
  sudo rpm -Uvh /path-to-package/vertica-pulse.x86_64.xxx.rpm
  • For Debian, use:
  sudo dpkg -i /path-to-package/vertica-pulse.x86_64.xxx.deb
The Pulse Package is installed to /opt/vertica/packages/pulse.
After you install the package, you must run the appropriate SQL scripts to install or
upgrade the Pulse functions and install the dictionary tables. Vertica automatically
reloads any labeled user-defined dictionaries.
Running the Pulse Install Script
Run the install script to install or upgrade the Pulse functions and the schema for the
dictionaries and mappings required for sentiment analysis. You must run the install
script once on the node on which you installed the package. After you run the install
script, all nodes can use the Pulse functions.
Important! Before running the install script, you must set the JavaBinaryForUDx
configuration parameter or the install script fails to install the Pulse functions. See
Installing Java on Vertica Hosts.
To run the install script:
1. As the dbadmin user, on the node on which you installed the Pulse RPM/DEB, run
the install.sh script:
bash /opt/vertica/packages/pulse/install.sh
Note: You must run the install script for installs or upgrades.
2. The script installs/upgrades the Pulse functions:
CREATE LIBRARY
CREATE TRANSFORM FUNCTION
CREATE TRANSFORM FUNCTION
CREATE TRANSFORM FUNCTION
CREATE TRANSFORM FUNCTION
CREATE TRANSFORM FUNCTION
CREATE TRANSFORM FUNCTION
CREATE TRANSFORM FUNCTION
CREATE TRANSFORM FUNCTION
CREATE TRANSFORM FUNCTION
etc...
3. If this is a fresh installation, modify the jvm resource pool to match your
system hardware.
Tuning the jvm Resource Pool for Vertica Pulse
Note: You must modify the jvm resource pool to match the capabilities of your
hardware so that Vertica Pulse has adequate resources to perform queries. If a
cluster does not have sufficient resources to run a Vertica Pulse query, then such a
query can fail with an Out Of Memory (OOM) exception.
Vertica Pulse runs as a Java UDx (User Defined eXtension) and uses the jvm resource
pool to define the resources available to run Vertica Pulse queries.
Vertica starts a Java Virtual Machine (JVM) when you perform a Vertica Pulse query.
The session from which you issue the query reserves resources for the JVM (across all
nodes in the cluster) and it releases the resources when the session ends. You can also
explicitly close the JVM attached to the session by using the command SELECT
release_jvm_memory();.
The most critical resource pool settings that affect Vertica Pulse are MAXMEMORYSIZE
and PLANNEDCONCURRENCY.
• MAXMEMORYSIZE defines the amount of RAM that a JVM can use. By default
MAXMEMORYSIZE is set to either 10% of system memory or 2GB, whichever is
smaller.
• PLANNEDCONCURRENCY defines how many JVMs are allowed to run across the
cluster and how many Pulse sessions you are able to run cluster-wide. By default,
PLANNEDCONCURRENCY is set to AUTO, which is the lower of either the number
of cores on the node, or memory / 2GB, but it is never automatically set to less than
"4".
The amount of memory that each JVM is allocated is determined by MAXMEMORYSIZE
/ PLANNEDCONCURRENCY. For example, suppose MAXMEMORYSIZE is set to 8G
and PLANNEDCONCURRENCY is set to 2. In this case, only a maximum of 2 sessions
can run Vertica Pulse queries and the session JVMs have a maximum memory size of
4GB.
Tip: The basic thing to remember is that PLANNEDCONCURRENCY controls the
number of sessions across the entire cluster that can run the sentimentAnalysis()
function. If set to 1, then only a single session can run Pulse functions. No other
sessions are able to run Pulse or Java UDx functions until the session currently
running Pulse functions is closed.
While resource pool settings are based on the resources of a node, they apply across
the entire cluster. A session with a Vertica Pulse query reserves the same resources
for its JVM on all nodes in the cluster. Therefore, it doesn't matter if the cluster contains 3
nodes or 30 nodes; each node reserves, for example, 4GB of the node's memory for the
JVM used by the Vertica Pulse session, and PLANNEDCONCURRENCY limits the
number of sessions that can run Pulse cluster-wide. If PLANNEDCONCURRENCY is 1,
then only 1 vsql session (or client connection) in the entire cluster can run Pulse.
You can display the current resource pool settings for the jvm resource pool with the
following command:
select name, MAXMEMORYSIZE, PLANNEDCONCURRENCY from V_CATALOG.RESOURCE_POOLS
where name = 'jvm';
Configuring the jvm Resource Pool for your System
Do not use the default jvm resource pool settings for Vertica Pulse. You must configure
the jvm resource pool to match your hardware and workload requirements. In particular,
set PLANNEDCONCURRENCY and MAXMEMORYSIZE to match your hardware.
You may need to experiment to find the optimal settings for your hardware and your
specific workloads. As a best practice, allow:
• At least 2GB of memory per session for Vertica Pulse
• At least 25% of the memory available for general Vertica overhead. Essentially,
MAXMEMORYSIZE must never exceed 75% of total system memory.
Note: If you are running a lot of queries not in the context of Vertica Pulse, then you
should allow for more memory to be available outside of the jvm resource pool.
To configure your system for Vertica Pulse:
• Determine the number of cores on a node. Your PLANNEDCONCURRENCY setting
cannot exceed this value. For example, you can run the following from a shell to
determine cores:
cat /proc/cpuinfo | egrep "core id|physical id" | tr -d "\n" | sed s/physical/\\nphysical/g | grep -v ^$ | sort | uniq | wc -l
• Determine the amount of memory in GB on a node. Your MAXMEMORYSIZE cannot
exceed 75% of the total system memory. For example, you can run the following from
a shell to determine the total system memory in GB for any particular node:
awk /MemTotal/'{printf "%f GB\n", $2/1024/1024}' /proc/meminfo
• Use the formula MAXMEMORYSIZE / PLANNEDCONCURRENCY to determine
how much memory each Vertica Pulse JVM receives. For example, you can use
(.75 * Total System Memory) / PLANNEDCONCURRENCY if you plan to use most
of your RAM for Vertica Pulse. The outcome of the formula must be 2 (which denotes
GB) or greater. For example, if you have 8GB of total system memory, and your
estimated PLANNEDCONCURRENCY is 3, then the formula results in "2"
((.75 * 8) / 3) and is acceptable. However, if you have the same amount of memory
and PLANNEDCONCURRENCY is set to 4, then the result of the formula is "1.5", which
is below the recommended minimum of 2GB. You can either add more RAM to the
system or reduce PLANNEDCONCURRENCY to get the resulting number up to "2".
• Finally, alter the jvm resource pool. For example, for a cluster in which each node
has 16GB of memory, if you decide to use up to 75% of the total system
memory (0.75 * 16GB = 12GB) for Vertica Pulse, then you can set the resource pool
as follows:
ALTER RESOURCE POOL jvm MAXMEMORYSIZE '12G' PLANNEDCONCURRENCY 3;
Note: For evaluation purposes on systems with lower memory, set
MAXMEMORYSIZE to 75% and PLANNEDCONCURRENCY to 1:
ALTER RESOURCE POOL jvm MAXMEMORYSIZE '75%' PLANNEDCONCURRENCY 1;
While these settings are unsupported, they do allow you to run simple Vertica Pulse
queries. You may experience Out Of Memory exceptions and slow performance.
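The sizing rule above can be checked with a few lines of shell before you alter the pool. This is only a sketch under the assumptions stated in this section (MAXMEMORYSIZE capped at 75% of total system memory, at least 2GB per JVM); the function name pulse_jvm_settings is hypothetical and is not part of Vertica or Pulse.

```shell
#!/usr/bin/env bash
# Hypothetical helper: given total system memory in GB and a candidate
# PLANNEDCONCURRENCY, print three fields:
#   <MAXMEMORYSIZE in GB at 75%> <GB per JVM> <ok|too-low>
# "too-low" means the per-JVM share is under the 2GB recommended minimum.
pulse_jvm_settings() {
  awk -v total="$1" -v pc="$2" 'BEGIN {
    maxmem = 0.75 * total          # 75% ceiling for MAXMEMORYSIZE
    per_jvm = maxmem / pc          # MAXMEMORYSIZE / PLANNEDCONCURRENCY
    verdict = (per_jvm >= 2) ? "ok" : "too-low"
    printf "%.2f %.2f %s\n", maxmem, per_jvm, verdict
  }'
}

# The worked examples from the text:
pulse_jvm_settings 8 3    # 6.00 2.00 ok
pulse_jvm_settings 8 4    # 6.00 1.50 too-low
pulse_jvm_settings 16 3   # 12.00 4.00 ok
```

The last call corresponds to the ALTER RESOURCE POOL example: 12GB shared by 3 planned sessions leaves 4GB per JVM. The helper does not check the separate rule that PLANNEDCONCURRENCY cannot exceed the number of cores.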
For additional details, see:
• ALTER RESOURCE POOL
• Managing Workloads
• Java UDx Resource Management
Assign Users to the pulse_users Role and Allow Access to Pulse Functions
When you install Pulse, the install script creates a pulse schema, which contains the
user-dictionary and mapping lists used by Pulse. Initially only administrators can read or
edit tables in the pulse schema. To give non-administrator database users access to the
pulse schema, you assign the user to the 'pulse_users' role, which has all privileges for
the pulse schema. The role is created automatically when you install Pulse.
Note: The default dbadmin user has access to the pulse schema by default. You do
not need to add the pulse_users role to the dbadmin account.
Granting Users Access to the Pulse Schema
To grant non-administrator users access to the tables in the Pulse schema:
1. As the dbadmin, if the user does not exist, create the user with the command:
create user username identified by 'password';
2. As the dbadmin, if the user does not have access to functions in the public schema,
then grant execute privileges with the command:
GRANT execute ON ALL FUNCTIONS IN SCHEMA public TO username;
Note: By default, the Pulse functions are created in the public schema.
3. As the dbadmin, grant the pulse_users role to the new user with the command:
grant pulse_users to username;
4. As the user to which you granted the pulse_users role, set the user's role to
pulse_users with the command:
set role pulse_users;
Note: The user must run the set role command per session to read or edit
tables in the pulse schema.
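If you need to grant access to several users, the dbadmin statements from steps 2 and 3 can be generated rather than typed. The following is a minimal sketch: the function name pulse_grant_sql and the user name alice are hypothetical, and the script only prints SQL text. Piping that text to vsql is shown as a comment and assumes vsql is on the PATH and reads statements from standard input.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints the dbadmin statements from steps 2 and 3
# for one existing database user. Nothing is sent to the database here.
pulse_grant_sql() {
  printf 'GRANT EXECUTE ON ALL FUNCTIONS IN SCHEMA public TO %s;\n' "$1"
  printf 'GRANT pulse_users TO %s;\n' "$1"
}

# Print the statements for a placeholder user.
pulse_grant_sql alice

# To apply them as dbadmin (assumption: vsql accepts SQL on stdin):
#   pulse_grant_sql alice | vsql
```

Each user must still run set role pulse_users in their own session, as described in step 4.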
Uninstalling Vertica Pulse and Pulse Packages
Uninstalling Vertica Pulse on hosts and uninstalling Pulse packages require different
procedures.
Uninstall Vertica Pulse on Your Hosts
As the dbadmin, run the uninstall script from the node on which you installed the Pulse
package:
bash /opt/vertica/packages/pulse/uninstall.sh
The uninstall script removes all Pulse functions, but does not remove the pulse schema
containing the user-dictionary and mapping tables.
To remove all Pulse dictionaries and mappings, including custom dictionaries, include
the -r parameter:
bash /opt/vertica/packages/pulse/uninstall.sh -r
Uninstall Pulse Packages
To uninstall the Pulse package, on the nodes that have the Pulse package installed,
use the appropriate command for your package.
• For RPM packages:
sudo rpm -e vertica-pulse
• For DEB packages:
sudo dpkg --remove vertica-pulse
The Pulse schema and associated user-dictionary and mapping tables remain in the
database. To remove the Pulse schema and its associated tables, run the following
command:
DROP SCHEMA pulse CASCADE;
Using Pulse
• Dictionaries and Mappings
• Determining Sentiment
• Sentiment Analysis Levels
• Tuning Pulse
• Bulk Loading Word Lists from Text Files
Dictionaries and Mappings
Pulse contains built-in dictionaries and maps that help determine the sentiment of
sentences. You have the option of creating and loading user-defined dictionaries and
maps.
Dictionaries and Mappings are loaded across all client sessions and remain in memory
even if the database is stopped and started.
Dictionaries
Pulse uses a proprietary system dictionary to help score sentiment. The system
dictionary is not visible or modifiable. You can, however, alter the default way that Pulse
scores sentiment by modifying user dictionaries. The user dictionaries provide flexibility
so that you can tune sentiment scoring for your specific domain. You do not have to
modify user dictionaries if Pulse is scoring your data appropriately.
Users can apply dictionaries on a per-user basis. Any number of Pulse users can
concurrently apply different sets of dictionaries without conflicts and without disrupting
the sessions of other users. Each user can have one dictionary of each type loaded at
any given time. If a user does not specify a dictionary of a given type, Pulse uses the
default dictionary for that type.
Mappings
Maps are lists of synonyms of one or more words that map to another word. Using maps
allows you to analyze text that pertains to the same subject or concept but may use
slightly different terminology.
For example, you can map both 'Hewlett Packard' and 'Hewlett-Packard' (with hyphen)
to 'HP'. Pulse substitutes the core word for the mapped words when it runs its analysis.
Dictionary and Mapping Tables
User dictionaries and a normalization map for each supported language reside in tables
inside the Pulse schema. You can see the contents of the tables with simple queries
such as:
SELECT * FROM pulse.pos_words_en;
Or:
SELECT * FROM pulse.pos_words_es;
There is one table per dictionary/map for each language. The table name has the
language abbreviation as a suffix. For example, English tables have the suffix "_en" and
Spanish tables have the suffix "_es". By default, the user dictionaries and normalization
map are empty. You can modify these tables to tune Pulse to your specific needs. After
you modify these tables, you must load the changes into memory.
You can update the user dictionaries and normalization tables at any time. To do so,
you must run load functions (see LoadDictionary() and LoadMapping()) to load the
values from the tables into memory. Your changes affect sentiment scoring only after
you load the new values.
Note: Loading a user dictionary or loading a normalization map overwrites the
values in memory with the values from the specified table. You cannot append to user
dictionaries or the normalization map in memory.
The following table describes the English user dictionaries by table name. For Pulse
versions that support Spanish, the same set of dictionaries with the suffix "_es" is
present in the Pulse schema.

Dictionary Table Name   Description
---------------------   -----------
white_list_en           Words that are always marked as an attribute. This list
                        augments the built-in Pulse attribute discovery process.
                        Add words that you always want scored to the white_list
                        user dictionary. For example, such words can include
                        nouns, phrases or business-dependent attributes that are
                        not auto-discovered by Pulse.
                        This list is typically modified to increase the accuracy
                        of sentiment scoring for your domain.

stop_words_en           Words that are never marked as an attribute. Add words
                        that you do not want scored to the stop_words user
                        dictionary. Use this dictionary to filter out attributes
                        that are not of interest to your analysis. This list is
                        typically modified to increase the accuracy of sentiment
                        scoring for your domain.
                        The stop_words dictionary can only contain nouns and
                        compound nouns. If Pulse does not identify a stop word as
                        a noun, it ignores it.

pos_words_en            Positive words that can be any type of word or phrase.
                        Words in this list are more likely to carry a positive
                        polarity in general.
                        You can also add exact phrases, such as idioms, to this
                        list.
                        Examples: adroit, resolve, strong, hit the nail on the
                        head

neg_words_en            Negative words that can be any type of word or phrase
                        that have a negative connotation. Words in this list are
                        deemed more likely to carry a negative polarity in
                        general.
                        You can also add exact phrases, such as idioms, to this
                        list.
                        Examples: abhorrent, butcher, racist, wrath, flash in the
                        pan.

neutral_words_en        Words that indicate a neutral connotation. Words in this
                        list are scored with a sentiment of 0, meaning not
                        positive or negative.
The following table describes the mapping table within Pulse.
Mapping Table Name: Description
normalization_en: A list of word pairs used to map like terms
  (synonyms). You can use this to correct common
  misspellings and map them to the correct
  spelling. This list is frequently modified and is
  empty by default.
  Example base/synonym pairs:
  • 'hp' / 'hewlettpackard'
  • 'hp' / 'HewlettPackard'
  • 'Obama' / 'President Obama'
  • 'Obama' / 'Barack Obama'
Loading Dictionaries and Mappings into Pulse
If you have made changes to the Pulse schema tables, then you must load either the
dictionaries, the normalization map, or both. After the changes are loaded, Pulse stores
them in memory across all sessions in the cluster. Because Pulse automatically loads
the dictionaries and mapping at startup, you do not need to reload them after a database
restart or system reboot.
• To load an individual user dictionary into memory, use the LoadDictionary() function
  with the appropriate parameter and word list.
  ▪ LoadDictionary() does not append user-dictionary lists. Instead, it overwrites them. If
    you load a user dictionary more than once with the same list name, then only the
    most recent user dictionary is loaded for that list name.
• To load the normalization mapping into memory, use the LoadMapping() function
  with the normalization map.
  ▪ If you load a mapping with an incorrect mapName, then the result of LoadMapping()
    is false and the map is not loaded. LoadMapping() does not append maps. Instead,
    it overwrites them. If you load a map more than once with the same mapName, then
    only the most recent mapping is loaded for that mapName.
  ▪ If LoadMapping() is successful, Vertica returns a success message from each node
    in the cluster.
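The overwrite behavior described above can be pictured with a small model. The following Python sketch is illustrative only and is not part of Pulse: the InMemoryStore class, its method names, and the assumption that "normalization" is the only valid map name are hypothetical stand-ins for what LoadDictionary() and LoadMapping() do inside the database.

```python
# Illustrative model only, not Pulse code. It mimics the documented behavior
# that a load replaces the in-memory contents instead of appending to them.

VALID_MAP_NAMES = {"normalization"}  # assumption: the only mapName used in this guide


class InMemoryStore:
    def __init__(self):
        self.dictionaries = {}  # listName -> set of words
        self.mappings = {}      # mapName -> {synonym: base}

    def load_dictionary(self, list_name, words):
        # Overwrite: words from any earlier load of list_name are discarded.
        self.dictionaries[list_name] = set(words)
        return len(self.dictionaries[list_name])

    def load_mapping(self, map_name, pairs):
        # An unknown map name loads nothing and reports failure.
        if map_name not in VALID_MAP_NAMES:
            return False
        self.mappings[map_name] = {synonym: base for base, synonym in pairs}
        return True


store = InMemoryStore()
store.load_dictionary("white_list", ["apple", "pear"])
store.load_dictionary("white_list", ["plum"])  # replaces the first load
print(sorted(store.dictionaries["white_list"]))
print(store.load_mapping("normalisation", [("hp", "hewlett packard")]))
```

In this model the second load leaves only "plum" in the white list, and the misspelled map name is rejected without changing any stored mapping.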
Automatically Loading Dictionaries and the Normalization Map
For ease of use, Pulse ships with a script to automatically load into memory all of the
required user dictionaries and the normalization mapping. This script only exists on the
node on which you installed the Pulse RPM/DEB package.
You can run the script from within vsql with the following command:
\i /opt/vertica/packages/pulse/ddl/loadUserDictionaries.sql
Manually Loading Dictionaries and the Normalization Map
If you want to manually load certain user dictionaries or mappings from the Pulse
schema tables, use the following steps. This example adds words to the pos_words and
white_list dictionaries and then loads the white_list dictionary. Repeat the load step
for each dictionary that you change. See LoadDictionary() for valid values for the
listName parameter and for multilingual version loading.
Note: The following examples use the English dictionaries. For Spanish, replace "_en"
with "_es".
1. Add a word to the pos_words dictionary:
=> INSERT INTO pulse.pos_words_en VALUES('SuperDuper');
=> COMMIT;
By default, added words are not case sensitive. "ERROR" produces the same results
as "error". You can, however, specify a case setting for a single word using the
$Case parameter. For example, to identify "Apple", rather than "apple", you would
add the following:
=> INSERT INTO pulse.white_list_en VALUES('$Case(Apple)');
=> COMMIT;
2. Load the updated dictionary into Pulse:
=> SELECT LoadDictionary(standard USING PARAMETERS
listName='white_list') OVER()
FROM pulse.white_list_en;
3. If you change the normalization map, you can load the new normalization values
with the following command:
=> SELECT LoadMapping(standard_base, standard_synonym USING PARAMETERS
mapName='normalization') OVER() FROM pulse.normalization_en;
After loading, Vertica returns a success message and the number of rows (words or
word pairs) loaded.
Dictionary and Mapping Labels
You can apply a label to any user-defined dictionary or mapping when you load that
object. Labels enable you to perform sentiment analysis against a predetermined set of
dictionaries and mappings without having to specify a list of dictionaries. For example,
you might have a set of dictionaries labeled "music" and a set labeled "movies." The
default user dictionaries automatically have a label of "default."
A single dictionary or mapping can have multiple labels. For example, you might label a
white list of artists as both "painters" and "renaissance." You could load the dictionary
by loading either label. A label can only apply to one dictionary of each type. For
example, you cannot have two dictionaries of stop words that share the same label. If
you apply a label to multiple dictionaries of the same type, Pulse uses the most recently
applied label.
You can view the labels associated with your current dictionaries using the
GetAllLoadedDictionaries() function. You can also view the label associated with your
current mapping using the GetLoadedMapping() function.
Normalization Map Effect on Results
Before any of the sentiment analysis functions are run on the text, the normalization map
is applied. When a sentiment analysis function is run, Pulse replaces the synonym with
the base word. The result of the sentiment analysis function displays the mapped words
and not the original text. For example, Pulse maps both 'Hewlett Packard' and
'Hewlett-Packard' (with a hyphen) to 'HP' in the results when the normalization map is
populated with those terms.
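To see why both spellings collapse into one attribute, the substitution step can be approximated outside the database. This Python sketch assumes a simple case-insensitive, whole-word replacement; apply_normalization is a hypothetical helper, and the matching that Pulse performs internally may differ in detail.

```python
import re


def apply_normalization(text, pairs):
    """Replace each synonym with its base word, ignoring case."""
    # Longer synonyms first, so a short synonym cannot split a longer one.
    for base, synonym in sorted(pairs, key=lambda p: len(p[1]), reverse=True):
        pattern = r"(?<!\w)" + re.escape(synonym) + r"(?!\w)"
        text = re.sub(pattern, lambda _m, b=base: b, text, flags=re.IGNORECASE)
    return text


# (base, synonym) pairs, in the same column order as the normalization table.
pairs = [("hp", "Hewlett-Packard"), ("hp", "Hewlett Packard")]
text = "Hewlett-Packard was founded in 1939. Hewlett Packard was started in a garage."
print(apply_normalization(text, pairs))
```

Because the replacement happens before analysis in this model, any later step sees only the base word, which is consistent with results that list the base term rather than the original text.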
Before Mapping
The following example demonstrates sentiment analysis before mapping:
=> SELECT SentimentAnalysis('Hewlett-Packard was founded in 1939.
Hewlett Packard was started in a garage in Palo Alto California')
OVER(PARTITION BEST);
 sentence |      attribute       | sentiment_score
----------+----------------------+-----------------
        1 | hewlett-packard      |               0
        2 | hewlett packard      |               0
        2 | garage               |               0
        2 | palo alto california |               0
(4 rows)
Insert Normalization Values and Load Map
You can add values to the normalization map using an INSERT statement. The
following example demonstrates how to insert normalization values and load the map:
=> INSERT INTO pulse.normalization_en VALUES('HP', 'Hewlett-Packard');
=> INSERT INTO pulse.normalization_en VALUES('HP', 'Hewlett Packard');
=> COMMIT;
=> SELECT LoadMapping(standard_base, standard_synonym
USING PARAMETERS mapName='normalization') OVER()
FROM pulse.normalization_en;
You can also map multiple values to the same term using a $LIST parameter. The
following example would map multiple alternate names for the city of Boston to the
value 'Boston'.
INSERT INTO normalization_en Values( 'Boston', '$LIST(BOS,beantown,the hub)');
After Mapping
The mapping operation replaces the attributes with their counterparts from the
normalization list and displays the base terms:
=> SELECT SentimentAnalysis('Hewlett-Packard was founded in 1939.
Hewlett Packard was started in a garage in Palo Alto California')
OVER(PARTITION BEST);
 sentence |      attribute       | sentiment_score
----------+----------------------+-----------------
        1 | hp                   |               0
        2 | hp                   |               0
        2 | garage               |               0
        2 | palo alto california |               0
(4 rows)
The CommentAttributes() function also uses the normalization map and displays the
base terms instead of the original text:
=> SELECT CommentAttributes('Hewlett-Packard was founded in 1939.
Hewlett Packard was started in a garage in Palo Alto California')
OVER(PARTITION BEST);
 sentence |      attribute
----------+----------------------
        1 | hp
        2 | hp
        2 | garage
        2 | palo alto california
(4 rows)
Creating Tables for Custom Dictionary Mappings
The Vertica Pulse package includes all the necessary user dictionary and mapping
tables. However, you can create your own tables to store additional user dictionaries or
mappings. For example:
CREATE TABLE my_positive_words(word VARCHAR(64));
The following example shows how to create a table, add terms to it, and then load the
table as a normalization map:
=> CREATE TABLE myNormalization(base VARCHAR(64), synonym VARCHAR(64));
=> INSERT INTO myNormalization VALUES('hp','Hewlett Packard');
=> INSERT INTO myNormalization VALUES('hp','Hewlett-Packard');
=> COMMIT;
=> SELECT LoadMapping(base, synonym USING PARAMETERS
mapName='normalization') OVER() FROM myNormalization;
After loading, Vertica returns a success message from each node in the cluster.
Using Action Patterns in Dictionaries
Vertica Pulse supports the use of action patterns in white_list dictionaries only. An
action pattern enables Pulse to recognize phrases that denote action, intention, or
interest, such as going to buy, waiting to see, and so on. Action patterns can identify
behaviors associated with your sentiment analysis terms.
Action patterns can:
• Connect Word Forms to a Root Word: Vertica Pulse lemmatizes all words.
  Lemmatization recognizes different word forms and maps them to the root word. For
  example, Pulse would map bought and buying to buy. This ability extends to
  misspellings. For example, tryiiiing and seeeeeing taaablets would map to trying and
  seeing tablets.
• Create Object-Specific Queries: To identify only the attributes that are objects of
  action patterns, create a whitelist dictionary that contains only action patterns of
  interest. In your sentiment analysis query, set the actionPattern and whiteListOnly
  parameters to true.
Note: Action patterns exist in the whitelist dictionary. If a word that matches an
action pattern appears in both the white_list and stop_words dictionaries, the
white_list takes precedence. The stop_words word would appear in sentiment analysis
results.
Action Pattern Syntax
Construct an action pattern by combining action parameters within an #action. By
default, parameters match any instance of the associated part of speech. You can match
specific terms by listing them with the parameter. For example, the parameter
$PREP(to,on) would match only to and on.
Parameters can also accept $regex and $list operators.
#action{$ADV $VERB $PREP $ADJ}
Parameter      Short form   Description
$ADJECTIVE     $ADJ         Matches any adjective.
$ADVERB        $ADV         Matches any adverb.
$PREPOSITION   $PREP        Matches any preposition.
$VERB          $VERB        Matches any verb.
Default Action Patterns
Pulse includes default action patterns in the whitelist dictionary. You cannot remove
these patterns. Pulse always evaluates them when you perform a sentiment analysis.
Language   Pattern             Example
English    $Verb $Prep $Verb   planning on buying, thinking about dropping
English    $Verb TO $Verb      going to buy, looking to acquire
Spanish    $Verb $Prep $Verb   voy a comprar, pienso en dejar
Spanish    $Verb $Verb         quiero solicitar, planean adquirir
Examples
The following example shows how Pulse can match customer or client, followed by the
word would and any verb. It would match phrases like customer would buy or client would
cancel.
INSERT INTO pulse.white_list_en values('#action{$LIST(customer,client) would $VERB}');
The following example shows a match for the specific verbs "like", "want", and "plan",
plus any preposition and any other verb. It would match phrases like want to own or
plan on buying.
INSERT INTO pulse.white_list_en values('#action{$VERB(like,want,plan) $PREP $VERB}');
The following example identifies words ending in ember, such as December, and uses
a regular expression to identify date references, such as 2nd or 4th. This action pattern
could identify users planning to attend an event or making holiday plans.
INSERT INTO pulse.white_list_en values('#action{On $regex(.+ember) $regex(\d+(th|st|rd|nd)) I will $verb to}');
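It can help to check what the two regular-expression fragments in the last pattern accept before adding them to a dictionary. The Python sketch below tests each fragment against one token at a time. Whole-token matching and case insensitivity are assumptions made here for illustration, and Python's re module is used in place of java.util.regex because these simple patterns behave the same way in both.

```python
import re

# The two $regex fragments from the action pattern, compiled separately.
MONTH = re.compile(r".+ember", re.IGNORECASE)
ORDINAL = re.compile(r"\d+(th|st|rd|nd)", re.IGNORECASE)


def matches(pattern, token):
    """True when the whole token matches the pattern."""
    return pattern.fullmatch(token) is not None


for token in ["December", "september", "March", "remember", "2nd", "4th", "4"]:
    print(token, matches(MONTH, token), matches(ORDINAL, token))
```

Note that .+ember accepts any token that ends in "ember", so "remember" passes as well as the month names. The literal words around the fragments (On, I will, to) are what narrow the action pattern to date-like phrases.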
Using Lists In Dictionaries
Vertica Pulse supports the use of token-based lists in dictionaries. You can use a list to
match multiple terms to a single word. Token-based lists differ from mapping by allowing
you to create multiple associations in a single action rather than a series of pairs. Unlike
mapping, lists are not restricted to the normalization dictionary.
Note: Token-based lists do not apply to the base word in normalization dictionaries.
You can add lists to the following dictionaries:
• pos_words
• neg_words
• neutral_words
• normalization
• white_list
• stop_words
You can add a list to a user-defined dictionary using an INSERT statement and a
$LIST parameter containing the list of values to match.
The following example would match slang terms to the word "good".
INSERT INTO pos_words Values( 'good', '$LIST(sweet,dope,tight)');
Using Regular Expressions in Dictionaries
Vertica Pulse supports the use of regular expressions in user-defined dictionaries.
Vertica Pulse regular expressions use the java.util.regex package syntax. For more
information on this syntax, refer to the Oracle documentation.
Note: Regular expressions do not apply to the base word in normalization
dictionaries.
You can add regular expressions to the following dictionaries:
• pos_words
• neg_words
• neutral_words
• normalization
• white_list
• stop_words
You can add a regular expression to a user-defined dictionary using an INSERT
statement and $REGEX parameter containing the regular expression. Regular
expressions are case insensitive. For example, the regular expression $regex(apple)
produces the same matches as the regular expression $regex(Apple).
Note: A regular expression can support a single token or word. Smartphone would
be a valid token, but smart phone would not.
The following example uses $REGEX(.*day) to match any word ending with the string "day"
when that word follows nice, good, or fine. You could use the regular expression to
identify the days of the week or words such as yesterday and today.
INSERT INTO stop_words_en Values( '$LIST(nice,good,fine) $REGEX(.*day)');
The following example matches references to iPhones, including the number and letter
version.
INSERT INTO white_list_en Values('Iphone $REGEX(\d{1}(S|C)?)');
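The single-token rule is easier to see with a rough model of how an entry such as the iPhone example might be applied. The Python sketch below is an approximation and not how Pulse is implemented: compile_entry and entry_matches are hypothetical helpers, the entry is split on spaces into one matcher per token, and $LIST is deliberately not handled.

```python
import re


def compile_entry(entry):
    """Split a dictionary entry into one case-insensitive matcher per token."""
    matchers = []
    for part in entry.split(" "):
        found = re.fullmatch(r"\$REGEX\((.*)\)", part, re.IGNORECASE)
        # A $REGEX(...) part contributes its pattern; anything else is literal.
        body = found.group(1) if found else re.escape(part)
        matchers.append(re.compile(body, re.IGNORECASE))
    return matchers


def entry_matches(entry, phrase):
    """True when every token of phrase matches the matcher in the same position."""
    matchers = compile_entry(entry)
    tokens = phrase.split()
    return len(tokens) == len(matchers) and all(
        m.fullmatch(t) for m, t in zip(matchers, tokens)
    )


entry = r"Iphone $REGEX(\d{1}(S|C)?)"
for phrase in ["iPhone 5S", "iphone 4", "iPhone 55", "iPhone"]:
    print(phrase, entry_matches(entry, phrase))
```

In this model a regular expression only ever sees one token, which is why a pattern meant to cover "smart phone" as two words could not match.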
To use a parenthesis as a literal part of a regular expression, you must use the escape
character \ twice to prevent Pulse from interpreting the parenthesis as a metacharacter in
the regular expression. The following example would match the literal string (hugs).
INSERT INTO white_list_es Values('$REGEX(\\(hugs\\))');
Determining Sentiment
You determine sentiment by using the SentimentAnalysis() function on text.
The SentimentAnalysis() function first extracts the attributes (typically nouns) from the
sentence, and then applies a sentiment score to each attribute. Scores can be one of the
following:
• 1 - Positive Sentiment
• 0 - Neutral Sentiment
• -1 - Negative Sentiment
This provides a more granular analysis than just determining the sentiment for the
sentence as a whole. Consider the following quote from Abraham Lincoln: "Force is
all-conquering, but its victories are short-lived." If you were to score the sentiment of the
sentence as a whole by averaging the sentiment of its parts, then the sentiment is
neutral.
=> select avg(t1.sentiment_score) as 'Average Sentiment' from (
select sentimentAnalysis('Force is all-conquering, but its victories are short-lived.')
over (PARTITION BEST)
) as t1;
 Average Sentiment
-------------------
                 0
If you score the individual attributes of the sentence, then you can obtain a much more
precise analysis of the sentiment than if you were trying to assign a single score to the
entire sentence. For example:
=> select sentimentAnalysis('Force is all-conquering, but its victories are short-lived.') over
(PARTITION BEST);
 sentence | attribute | sentiment_score
----------+-----------+-----------------
        1 | force     |               1
        1 | victories |              -1
"Force" is scored with positive sentiment because it is "all-conquering". "Victories" is
scored with negative sentiment because it is "short-lived".
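The arithmetic behind the two views is small enough to check by hand. The Python sketch below copies the attribute rows from the result above into a list (the rows variable and its tuple layout are only for illustration) and shows how averaging hides the disagreement between the two attributes.

```python
from statistics import mean

# Rows copied from the attribute-level result: (sentence, attribute, score).
rows = [(1, "force", 1), (1, "victories", -1)]

# Sentence-wide view: one number for the whole sentence.
average = mean(score for _, _, score in rows)

# Attribute view: each attribute keeps its own score.
by_attribute = {attribute: score for _, attribute, score in rows}

print(average)
print(by_attribute)
```

The average is 0, which reads as neutral, while the per-attribute view keeps the positive score for "force" and the negative score for "victories".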
Note: Vertica Pulse does not recognize personal pronouns (I, you, we, he, she, it,
etc.) as attributes.
SentimentAnalysis() also extracts the sentiment from multiple sentences and returns
the sentence in which attributes are found:
=> SELECT SentimentAnalysis('Force is all-conquering, but its victories are short-lived. Every good
boy deserves fudge.') OVER(PARTITION BEST);
 sentence | attribute | sentiment_score
----------+-----------+-----------------
        1 | force     |               1
        1 | victories |              -1
        2 | boy       |               1
        2 | fudge     |               1
(4 rows)
"Boy" is scored with positive sentiment because he is good. Fudge is scored with
positive sentiment because it is something that is deserved.
Note: The sentence detector considers a period to mark the end of a sentence. Some
abbreviations that use a period, such as Dr. or Mr., cause the sentence detector to end
the sentence at the abbreviation.
The SentimentAnalysis function also identifies attributes with neutral sentiment (a
sentiment score of zero). For example:
SELECT SentimentAnalysis('Roses are red. Violets are blue.') OVER(PARTITION BEST);
 sentence | attribute | sentiment_score
----------+-----------+-----------------
        1 | roses     |               0
        2 | violets   |               0
(2 rows)
Both roses and violets receive neutral sentiment because neither being red nor blue is
considered positive or negative in this context.
See the Pulse Cookbook for more examples of determining sentiment.
Sentiment Analysis Levels
Vertica Pulse is capable of determining sentiment at the following levels:
• Attribute
• Sentence
• Document
You can specify an analysis level using the granularity parameter of the
SentimentAnalysis function. You can perform multiple levels of analysis simultaneously
within the same query.
Attribute-Level Analysis
Attribute-level analysis provides a sentiment for each object in a sentence. This
behavior is the default level of analysis for Pulse. Attribute analysis identifies the objects
of a sentence and any sentiment expressed regarding those objects.
The following example shows the sentiment expressed with regard to "camera" and
"quality pictures."
Select SentimentAnalysis ('The camera takes great quality pictures but is expensive. It feels like a professional one' USING
PARAMETERS granularity='A') over();
 sentence |    attribute     | sentiment_score
----------+------------------+-----------------
        1 | camera           |               1
        1 | quality pictures |               1
Sentence-Level Analysis
A sentence-level analysis provides the overall sentiment of each sentence in a
document. If a sentence contains both positive and negative sentiments, it appears as
mixed.
The following example shows two sentences, the first of which is mixed. As a mixed
sentiment, the sentiment score is 0, or neutral, and the mixed value is true. The second
sentence is entirely positive. Its sentiment is 1, or positive, and the mixed value is false.
Select SentimentAnalysis ('The camera takes great quality pictures but is expensive. It feels like a professional one' USING
PARAMETERS granularity='S') over();
 sentence | sentiment_score | mixed
----------+-----------------+-------
        1 |               0 | true
        2 |               1 | false
Document-Level Analysis
Document level analysis provides the overall sentiment of an entire document. If you
wanted to know if a movie review was positive, negative, or mixed, a document level
analysis could provide that information. Document level analysis gives both the overall
sentiment score and a mixed rating if the sentiment is not exclusively positive or
negative.
The following example shows that overall, the writer is positive but does express some
negative sentiments.
Select SentimentAnalysis ('The camera takes great quality pictures but is expensive. It feels like a professional one' USING
PARAMETERS granularity='D') over();
 sentiment_score | mixed
-----------------+-------
               1 | true
Tuning Pulse
Pulse contains built-in dictionaries that help to determine the sentiment of sentences.
These dictionaries are not directly readable. However, you can modify the Pulse
dictionary tables to improve automatic attribute discovery and provide more accurate
results for sentiment scoring based on your specific data sets. The dictionary tables are
available in the Pulse schema. Any words you add to these dictionaries take
precedence over the built-in dictionaries.
Improving Automatic Attribute Discovery
Pulse identifies nouns in sentences and marks them as attributes. However, there are
two dictionaries and one mapping that you can modify to improve automatic attribute
discovery. These are:
• white_list - a list of words on which you want to score sentiment, but are not
  auto-discovered by Pulse. Typically these are product or company names, or special
  words in the domain of the data you are analyzing. You can also add noun phrases to
  the white_list.
  ▪ Consider the term "President Smith". Pulse automatically marks "President" as an
    attribute. However, you can add "President Smith" to the white_list and Pulse then
    uses "President Smith" as the attribute instead of just "President".
  ▪ If your white_list contains phrases that are subsets of other phrases in the white list,
    then the shorter phrase is not matched if the text being analyzed matches the
    superset phrase. For example, if both "Honest Al" and "Honest Al Car Emporium"
    are in the white_list, then the latter phrase is identified as an attribute in the text
    "Honest Al Car Emporium is not honest.", not the shorter "Honest Al" white_list
    phrase.
• stop_words - a list of words on which you do not want to score sentiment, but may
  appear frequently in your data set. stop_words is basically a way to filter out
  attributes.
  ▪ If a word appears in both stop_words and white_list, then the white_list word
    takes precedence. The word appears in results even though it is in the stop_words
    dictionary.
• normalization - a map of base words and synonyms that allow you to normalize
  words for easy comparison. For example, you can normalize "Hewlett Packard" to
  "HP", then count the number of times "HP" appears as an attribute in your data. Any
  text that contains "HP" or "Hewlett Packard" is counted towards the total.
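The superset rule for white_list phrases can be sketched as a longest-phrase-first search. The Python example below is an illustration of that rule and not the Pulse implementation: match_white_list is a hypothetical helper that assumes case-insensitive, whole-word matching.

```python
import re


def match_white_list(text, phrases):
    """Return white_list phrases found in text, preferring superset phrases."""
    lowered = text.lower()
    taken = [False] * len(lowered)  # characters already claimed by a match
    found = []
    # Try longer phrases first so a superset phrase wins over its subset.
    for phrase in sorted(phrases, key=len, reverse=True):
        pattern = r"(?<!\w)" + re.escape(phrase.lower()) + r"(?!\w)"
        for m in re.finditer(pattern, lowered):
            if not any(taken[m.start():m.end()]):
                taken[m.start():m.end()] = [True] * (m.end() - m.start())
                found.append((m.start(), phrase))
    # Report matches in the order they appear in the text.
    return [phrase for _, phrase in sorted(found)]


white_list = ["Honest Al", "Honest Al Car Emporium"]
print(match_white_list("Honest Al Car Emporium is not honest.", white_list))
print(match_white_list("Honest Al sells cars.", white_list))
```

The shorter phrase is still available when the longer one is absent, so keeping both entries in the white_list does not cost any matches.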
Determining How Pulse Scores Sentiment
When tuning Pulse it is important to understand why Pulse may not be scoring a
particular attribute the way you want it to be scored. For example, consider the sentence
"The quick brown fox jumped over the lazy dog." By default, Pulse scores the fox as
positive and the dog as negative. If you want to better understand how the words in the
sentence affect the attributes, then you can use the relatedwords parameter to see
which words are affecting the score. For example:
select SentimentAnalysis('The quick brown fox jumped over the lazy dog.'
USING PARAMETERS relatedwords=true) OVER(PARTITION BEST);
 sentence | attribute | sentiment_score | related_word_1 | related_word_2 | related_word_3
----------+-----------+-----------------+----------------+----------------+----------------
        1 | fox       |               1 | quick          | lazy           |
        1 | dog       |              -1 | lazy           |                |
(2 rows)
The output details that "quick" and "lazy" impacted the scoring of the "fox" attribute, and
that "lazy" affected the scoring of the "dog" attribute. "Quick" (positive) is weighted
more heavily than "lazy" (negative) when scoring "fox" because the word "quick" is closer to
the attribute "fox" in the sentence, and the result is that "fox" is scored positively. "Lazy"
(negative) is the only related word being used to score the sentiment for "dog". If you
don't agree with the scoring, you can change how these related words affect the score
by adding them to the appropriate user-dictionary, as described in "Improving Sentiment
Scores".
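The effect of distance on the score can be illustrated with a toy calculation. The actual weighting that Pulse applies is not documented here, so the inverse-distance weight in the Python sketch below is purely an assumption, and score_attribute is a hypothetical helper that takes the related words and their polarities as input.

```python
def score_attribute(tokens, attribute, related):
    """Combine related-word polarities, weighting nearer words more heavily."""
    position = tokens.index(attribute)
    total = 0.0
    for word, polarity in related.items():
        distance = abs(tokens.index(word) - position)
        total += polarity / distance  # assumed inverse-distance weight
    return (total > 0) - (total < 0)  # collapse to 1, 0, or -1


tokens = "the quick brown fox jumped over the lazy dog".split()
print(score_attribute(tokens, "fox", {"quick": 1, "lazy": -1}))
print(score_attribute(tokens, "dog", {"lazy": -1}))
```

Under this assumption "quick" sits two tokens from "fox" and "lazy" four tokens away, so the positive word outweighs the negative one, while "dog" has only the adjacent negative word.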
Improving Sentiment Scores
Pulse scores sentiment on attributes (nouns) in sentences using Natural Language
Processing (NLP) algorithms and rules. Pulse attempts to identify the parts of speech in a
sentence (for example, verbs, nouns/attributes, and adjectives), and
then scores the attributes based on which system dictionaries those words
appear in (positive, negative, or neutral), where they appear in relation
to the attributes, and other contextual information. Pulse does not identify personal
pronouns (he, you, we, she, etc.) as attributes.
Pulse provides a PartsOfSpeech function so that you can verify which parts of speech
are being identified in a sentence.
Sentiment Scoring and the Precedence of Pulse User-Dictionaries
The negative, positive, and neutral user-dictionaries adjust the score of an attribute
based on which dictionary the words in the sentence appear in. User-dictionaries take
precedence over the internal dictionaries that Pulse uses for analyzing text, so that you
can override the default polarity of an opinion word or phrase by inserting it in the
appropriate user-dictionary table.
Pulse also supports using phrases in the pos_words, neg_words and neutral_words
dictionaries. Phrases, such as idioms ("hit the nail on the head."), can be added to the
user dictionaries. Phrases of two or more words support "fuzzy" matching. For example,
the phrase "solve problem" also matches "solves problems".
Pulse uses an order of precedence to determine which user dictionary is used to modify
the default score. The order of precedence of the user dictionary that Pulse uses to
score attributes is as follows:
1. Phrases or strings that occur in the "neutral_words" dictionary
2. Phrases or strings that occur in the "neg_words" dictionary
3. Phrases or strings that occur in the "pos_words" dictionary
4. Single words appearing in the "neutral_words" dictionary
5. Single words appearing in the "neg_words" dictionary
6. Single words appearing in the "pos_words" dictionary
Note: If a word is present in both stop_words and white_list, then the white_list
word takes precedence. The word is present in results even though it exists in
stop_words.
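The six-step order of precedence can be expressed as a ranked lookup. The Python sketch below is a model of that ordering only: PRECEDENCE and winning_match are hypothetical names, and each match is represented as a (dictionary, is_phrase, text) tuple for illustration.

```python
# (dictionary, is_phrase) pairs in the documented order of precedence.
PRECEDENCE = [
    ("neutral_words", True),
    ("neg_words", True),
    ("pos_words", True),
    ("neutral_words", False),
    ("neg_words", False),
    ("pos_words", False),
]


def winning_match(matches):
    """Pick the match whose (dictionary, is_phrase) pair ranks first."""
    return min(matches, key=lambda m: PRECEDENCE.index((m[0], m[1])))


matches = [
    ("neg_words", False, "problem"),       # single negative word
    ("pos_words", True, "solve problem"),  # positive phrase
]
print(winning_match(matches))
```

Here the positive phrase outranks the single negative word, so a phrase entry such as "solve problem" would decide the score even though "problem" alone is negative.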
Consider the sentence "Fudge is good". It contains three parts: a noun (fudge), a verb
(is), and an adjective (good). When you analyze the sentence using Pulse, it identifies
"fudge" as an attribute, because it is a noun, and then assigns "fudge" a positive
sentiment:
select sentimentAnalysis('Fudge is good') OVER(PARTITION BEST);
 sentence | attribute | sentiment_score
----------+-----------+-----------------
        1 | fudge     |               1
The number of words matched against a dictionary also has an impact on which
dictionaries take precedence. Phrases or word combinations in the user-dictionary lists
take precedence over single words. For example, the positive phrase
"solve problem" causes a positive score on the text "Joe solves problems", even though
"problem" is a negative word. Since phrases have precedence over single words, a
positive score is applied to Joe.
SELECT SentimentAnalysis('Joe solves problems.') OVER(PARTITION BEST);
 sentence | attribute | sentiment_score
----------+-----------+-----------------
        1 | joe       |               1
(1 row)
SELECT SentimentAnalysis('Joe is a problem.') OVER(PARTITION BEST);
 sentence | attribute | sentiment_score
----------+-----------+-----------------
        1 | joe       |              -1
        1 | problem   |               0
(2 rows)
Tuning Example
You can modify any of the user-dictionaries to improve the accuracy of sentiment
scores. The two basic dictionaries, "neg_words" and "pos_words", are typically the
easiest to modify to get good results. Words in these two dictionaries can be any part of
speech (verb, adjective, etc.). If you find a word that is causing an attribute to be scored
positively or negatively, but it should be scored as neutral, then you can add that word to
the "neutral_words_en" dictionary to cause it to be scored 0.
Consider the sentence "The product delivers simplicity.":
select sentimentAnalysis('The product delivers simplicity.') over(PARTITION BEST);
 sentence | attribute  | sentiment_score
----------+------------+-----------------
        1 | product    |               0
        1 | simplicity |               0
(2 rows)
If you want "product" to be scored positively in this sentence, then you must add "deliver
simplicity" to the pos_words user-dictionary. "deliver simplicity" will also match "delivers
simplicity" due to the "fuzzy" matching feature of phrases in the dictionaries. If you add
"simplicity" by itself to the "pos_words" dictionary, then simplicity in any context is
considered positive, which may not be the result you want to achieve across your entire
domain. The following example adds "deliver simplicity" to the pos_words
user-dictionary for the English language:
insert into pulse.pos_words_en values ('deliver simplicity');
commit;
-- you must reload the dictionaries for the changes to be effective
\i /opt/vertica/packages/pulse/ddl/loadUserDictionaries.sql
select sentimentAnalysis('The product delivers simplicity.') over(PARTITION BEST);
 sentence | attribute  | sentiment_score
----------+------------+-----------------
        1 | product    |               1
(1 row)
Notice that "simplicity" is not positive if it is not paired with "deliver":
select sentimentAnalysis('The product provides simplicity.') over(PARTITION BEST);
 sentence | attribute  | sentiment_score
----------+------------+-----------------
        1 | product    |               0
        1 | simplicity |               0
(2 rows)
If you want "simplicity" to always be positive, add it to the "pos_words" list. This
example replaces "deliver" with "provides":
insert into pulse.pos_words_en values ('simplicity');
commit;
\i /opt/vertica/packages/pulse/ddl/loadUserDictionaries.sql
select sentimentAnalysis('The product provides simplicity.') over(PARTITION BEST);
 sentence | attribute  | sentiment_score
----------+------------+-----------------
        1 | product    |               1
        1 | simplicity |               0
(2 rows)
Notice that the sentiment score for the attribute (noun) "simplicity" is not affected by
having the word "simplicity" in a Pulse user-dictionary, since it has been identified as an
attribute.
Additional Tuning Examples
The following table provides additional examples for tuning Pulse:
Text: New product smashes kickstarter target in a day!
  Attribute: New Product
  Score: Default: -1. After tuning: 1.
  Tuning steps: "Smash" is scored negatively by default. Add "smash target" to "pos_words".
Text: Get a sneak peek of the new movie.
  Attribute: Movie
  Score: Default: -1. After tuning: 1.
  Tuning steps: "sneak" is scored negatively by default. Add "sneak peek" to "pos_words".
Text: Google was able to spot trends in flu outbreaks in the United States using the
  collection and analysis of big data.
  Attribute: Google
  Score: Default: -1. After tuning: 1.
  Tuning steps: "outbreak" is scored negatively by default. Add "spot trend" to "pos_words".
Text: Five health tips that will knock your socks off!
  Attribute: health tips
  Score: Default: -1. After tuning: 1.
  Tuning steps: "knock" is scored negatively by default. Add "knock your socks off" to
  "pos_words".
If you have many words or base/synonym pairs to add to the user-dictionaries, you can
bulk load the lists from text files. See Bulk Loading Word Lists from Text Files.
Bulk Loading Word Lists from Text Files
If you have many words that you need to add to the user-dictionary or normalization
mapping, then it may be easier to create the word lists in a text file and load the lists
using the COPY command.
Bulk Loading User Dictionary Lists
To bulk load user-dictionary lists into the pulse schema, first create a text file with the list
of words to add, one word per line, for each of the user-dictionaries. See Dictionaries
and Mappings for a list of the user-dictionaries and normalization map. Optionally name
each text file to match the name of the corresponding user-dictionary. Place these text
files in the /home/dbadmin directory.
Then, in vsql, use one or more of the following commands to load the respective text file
into the pulse schema. These commands assume that you are using the English version of
Pulse, that you are using the built-in user-dictionary tables in the pulse schema, and that
the text files are named the same as the user-dictionaries.
copy pulse.neg_words_en(standard) from '/home/dbadmin/neg_words.txt';
copy pulse.neutral_words_en(standard) from '/home/dbadmin/neutral_words.txt';
copy pulse.pos_words_en(standard) from '/home/dbadmin/pos_words.txt';
copy pulse.stop_words_en(standard) from '/home/dbadmin/stop_words.txt';
copy pulse.white_list_en(standard) from '/home/dbadmin/white_list.txt';
Bulk Loading the Normalization Map
You can load normalization terms into the pulse schema similarly to loading
user-dictionaries. However, instead of one word per line, use the convention of one pair of
words per line, separated by a comma. For example, to map the different forms of
Hewlett Packard Enterprise to HP, create a text file in /home/dbadmin named
normalization.txt with the following content:
hp, hewlett packard
hp, hewlett-packard
Then, in vsql, use the following command to load the normalization into the pulse
schema.
copy pulse.normalization_en (standard_base, standard_synonym) from '/home/dbadmin/normalization.txt'
delimiter ',';
When you have finished loading the text files, run the loadUserDictionaries.sql
script to update the new terms in memory:
vsql -f /opt/vertica/packages/pulse/ddl/loadUserDictionaries.sql
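The text files described in this section can be produced by any tool. The following is a
minimal Python sketch, not part of Pulse, that writes a word list (one word per line) and a
normalization file (one base,synonym pair per line). The function names are hypothetical.
The sketch writes no space after the comma, because the source does not state whether
COPY trims whitespace around the delimiter.

```python
from pathlib import Path


def write_word_list(path, words):
    """Write one word per line, the layout the COPY commands above read."""
    # Drop blank entries and duplicates while keeping first-seen order.
    unique = list(dict.fromkeys(w.strip() for w in words if w.strip()))
    Path(path).write_text("".join(w + "\n" for w in unique), encoding="utf-8")
    return len(unique)


def write_normalization_map(path, pairs):
    """Write one 'base,synonym' pair per line for COPY ... DELIMITER ','."""
    lines = []
    for base, synonym in pairs:
        # A comma inside a term would be read as an extra column.
        if "," in base or "," in synonym:
            raise ValueError("terms must not contain the delimiter ','")
        lines.append(f"{base.strip()},{synonym.strip()}\n")
    Path(path).write_text("".join(lines), encoding="utf-8")
    return len(lines)
```

For example, write_word_list('/home/dbadmin/pos_words.txt', ['simplicity', 'sneak peek'])
would create a file that the pos_words COPY command could then load.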
Multilingual Pulse
This section describes the multilingual features of Pulse and briefly explains
how to use the sentimentAnalysis() function with each supported language.
Pulse can analyze text in different languages. Currently, English and Spanish are
supported. You can specify the language to analyze in three ways:
• Provide the language as argument: if there is a language specified in the document
  record, then it can be used for analyzing the text by passing it as argument. This is
  particularly useful when a dataset contains texts in different languages. If the
  language in a record is not a supported one, then it is ignored.
• Provide the language as parameter: if there is no value specified for the language for
  a document record, Pulse uses the value specified for the language parameter in the
  query to get the language.
  Note: If you provide the language parameter more than once, then the last value
  specified is used.
• Do not provide an argument or parameter and use the default language. If the
  language is neither specified in the record nor by the user, then Pulse defaults to
  English unless you have changed the default language. To change the default
  language use the SetDefaultLanguage function.
Note: If you provide both an argument and a parameter, then the argument is used
as the language. If the argument is not valid, then the parameter is used. If neither the
argument nor the parameter is valid, then the default language is used.
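The precedence in the note above can be sketched in a few lines of Python. This is an
illustration of the stated rule only, not Pulse code: the function name is hypothetical, and
treating language names as case-insensitive is an assumption. It also does not model how
Pulse handles a record whose language is unsupported.

```python
SUPPORTED = {"english", "spanish"}


def resolve_language(argument=None, parameter=None, default="english"):
    """Pick the language using the order: argument, then parameter, then default."""
    for candidate in (argument, parameter):
        # Assumption: names are compared case-insensitively.
        if candidate is not None and candidate.lower() in SUPPORTED:
            return candidate.lower()
    return default
```

With these rules, a record tagged 'spanish' is analyzed as Spanish even when the query
passes language='english' as a parameter.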
Note: Accents are removed from characters in attributes. Additionally, a "u" with a
dieresis is converted to a plain "u" and an "n" with a diacritical tilde is replaced with a
plain "n".
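To anticipate how an attribute will appear in results, you can apply a similar
transformation yourself. The following Python sketch approximates the behavior described
in the note by removing Unicode combining marks. It is an assumption that this matches
what Pulse does for every character, and the function name is hypothetical.

```python
import unicodedata


def strip_accents(text):
    """Remove combining marks, so accented letters become their base letters."""
    # NFD splits a character such as 'ñ' into 'n' plus a combining tilde.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))
```

This can be useful when you match attributes from Pulse output against accented terms
in your own tables.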
Functions that use language as parameter and/or as argument:
• CommentAttributes
• ExtractSentence
• GetAllSentences
• GetSentenceCount
• PartsOfSpeech
• SentimentAnalysis
Other functions can use the language only as a parameter (if not provided, the function
uses the default language):
• GetLoadedDictionary
• GetLoadedMapping
• LoadDictionary
• LoadMapping
• GetAllDictionaryWords
• GetAllMappingWords
In This Section
• Spanish Pulse
• Multilingual Examples
Spanish Pulse
The only visible difference between the English and Spanish versions is in the table
names for the user dictionaries. The modifications for dictionaries/mappings must be
done in the following tables:
• white_list_es
• stop_words_es
• pos_words_es
• neg_words_es
• neutral_words_es
• normalization_es
Consider the text "El producto provee simplicidad" (the product provides simplicity). If
the word 'simplicidad' (simplicity) should be positive, it has to be loaded into the
pos_words dictionary for Spanish as follows:
select sentimentanalysis('El producto provee simplicidad') OVER(PARTITION BEST);
 sentence | attribute   | sentiment_score
----------+-------------+-----------------
        1 | producto    |               0
        1 | simplicidad |               0
(2 rows)
insert into pulse.pos_words_es values('simplicidad');
 OUTPUT
--------
      1
(1 row)
select LoadDictionary(standard USING PARAMETERS listName='pos_words') over() from pulse.pos_words_es;
 Success
---------
 t
(1 row)
select sentimentanalysis('El producto provee simplicidad') OVER(PARTITION BEST);
 sentence | attribute   | sentiment_score
----------+-------------+-----------------
        1 | producto    |               1
        1 | simplicidad |               0
(2 rows)
Multilingual Examples
Language as an Argument
select sentimentanalysis('Cookies are sweet.', 'english') OVER(PARTITION BEST);
 sentence | attribute | sentiment_score
----------+-----------+-----------------
        1 | cookies   |               1
(1 row)
select sentimentanalysis('Las galletas son dulces','spanish') OVER(PARTITION BEST);
 sentence | attribute | sentiment_score
----------+-----------+-----------------
        1 | galletas  |               1
(1 row)
The following example shows how to analyze tweets from a table where each tweet
record contains the language of the tweet in addition to the text.
create table myTweets (text varchar(300), language varchar(15));
insert into myTweets values ('Wired reviews Amazon''s tiny-screen Kindle Fire: Web browsing sucks, emotionally draining, makes reading a chore', 'english');
insert into myTweets values ('Cookies are sweet', 'english');
insert into myTweets values ('Why does my iPhone have 6 GB of corrupted space I can''t use? That is obnoxious.', 'english');
insert into myTweets values ('Las galletas son dulces', 'spanish');
insert into myTweets values ('el iPhone es el celular mas popular', 'spanish');
select sentimentanalysis(text,language) OVER(PARTITION BEST) from MyTweets;
 sentence | attribute      | sentiment_score
----------+----------------+-----------------
        1 | reviews amazon |              -1
        1 | kindle fire    |              -1
        1 | web            |              -1
        1 | chore          |              -1
        1 | cookies        |               1
        1 | iphone         |              -1
        1 | gb             |              -1
        1 | space          |              -1
        1 | galletas       |               1
        1 | iphone         |               1
        1 | celular        |               1
(11 rows)
Language as a Parameter
select sentimentanalysis('Las galletas son dulces' using PARAMETERS language='spanish') OVER(PARTITION BEST);
 sentence | attribute | sentiment_score
----------+-----------+-----------------
        1 | galletas  |               1
(1 row)
select sentimentanalysis('Cookies are sweet' using PARAMETERS language='english') OVER(PARTITION BEST);
 sentence | attribute | sentiment_score
----------+-----------+-----------------
        1 | cookies   |               1
(1 row)
Although it is possible to specify the language as a parameter for a specific text given in a
query, using the language argument is more appropriate. The language parameter is
intended for queries that analyze a set of texts (from a table) that are all written in
the same language. Pulse does not use the language parameter to skip texts in other
languages, because Pulse does not automatically detect the language. Instead, Pulse
uses the language specified as the parameter to analyze every text from the table
(consequently, the sentiment scores for texts in other languages may be incorrect).
The following example shows a query that analyzes tweets from a table where the
tweets do not have a language value stored in the table, but are all in the same
language.
create table myTweets (text varchar(300));
insert into myTweets values ('Las galletas son dulces');
insert into myTweets values ('el iphone es el celular mas popular');
insert into myTweets values ('el zorro rapido brinco sobre el perro flojo');
select sentimentanalysis(text using PARAMETERS language='spanish') OVER(PARTITION BEST) from MyTweets;
 sentence | attribute | sentiment_score
----------+-----------+-----------------
        1 | galletas  |               1
        1 | iphone    |               1
        1 | celular   |               1
        1 | zorro     |               1
        1 | perro     |              -1
(5 rows)
The following example shows a query that analyzes tweets from a table with tweets in
different languages. The Spanish tweets do not have the language value. In a single
query you can specify both an argument and parameter. The argument has precedence
over the parameter setting. In this case the parameter is only used when a tweet doesn't
provide a language value.
create table myTweets (doc_id int, text varchar(300), language varchar(15));
insert into myTweets values (1, 'Vertica is the best company', 'english');
insert into myTweets values (2, 'Cookies are sweet', 'english');
insert into myTweets values (3, 'The quick brown fox jumped over the lazy dog', 'english');
insert into myTweets values (4, 'Las galletas son dulces');
insert into myTweets values (5, 'el iphone es el celular mas popular');
select doc_id, sentimentanalysis(text,language using PARAMETERS language='spanish') OVER(PARTITION BY doc_id, text) from MyTweets;
 doc_id | sentence | attribute | sentiment_score
--------+----------+-----------+-----------------
      1 |        1 | vertica   |               1
      1 |        1 | company   |               1
      2 |        1 | cookies   |               1
      3 |        1 | fox       |               1
      3 |        1 | dog       |              -1
      4 |        1 | galletas  |               1
      5 |        1 | iphone    |               1
      5 |        1 | celular   |               1
(8 rows)
Using the Default Language
select sentimentanalysis('Cookies are sweet') OVER(PARTITION BEST);
 sentence | attribute | sentiment_score
----------+-----------+-----------------
        1 | cookies   |               1
(1 row)
Pulse Cookbook
This section contains the following recipes for using Pulse:
• Batch Analyzing Data as It Is Loaded
• Analyzing Comments for a Company or Product
• Determining Popular Topics
• Determining Prolific Authors
• Analyzing the Sentiment of Specific Authors
• Finding Associated Attributes
• Using Pulse as an Aid in Competitive Analysis
Batch Analyzing Data as It Is Loaded
If you are constantly loading data that needs to be analyzed with Pulse, then you should
run the sentimentAnalysis() function in batches on the newly loaded data. You can store
the sentiment scores in a separate table and associate the rows in the scored table with
the original table by joining on IDs between the tables. Running sentimentAnalysis() as
the data is loaded and storing the results is more efficient than running
sentimentAnalysis() during interactive sessions because sentimentAnalysis() can
take a few seconds to return results.
For example, suppose that you are using the Social Media Connector (available in the
ETL and Data Ingest section of the Vertica Marketplace) to retrieve Twitter tweets and
load them into Vertica. In this case, you can create shell scripts and a cron job to
automatically run sentimentAnalysis() on the text of the tweets. Then you can store the
resulting scores in a table for quick retrieval later on.
Complete the following steps as the dbadmin user to run sentimentAnalysis() on your
Twitter data. This task also sets up the system to run sentimentAnalysis() on new Twitter
data every 2 minutes.
1. Create a table to hold the tweets (for example, named tweets) with the following
structure:
create table tweets(
id int,
created_at timestamptz,
"user.name" varchar(144),
"user.screen_name" varchar(144),
text varchar(500),
"retweeted_status.retweet_count" int,
"retweeted_status.id" int,
"retweeted_status.favorite_count" int,
"user.location" varchar(144),
"coordinates.coordinates.0" float,
"coordinates.coordinates.1" float,
lang varchar(5)
);
The columns are based on the data returned by Twitter's streaming API. The fields
are defined in the Twitter Field Guide at
https://dev.twitter.com/docs/platform>/tweets.
Note that the columns with quoted names, such as "user.name" and "user.screen_name", are
sub-fields within a larger field. For example, the "users" field is described here:
https://dev.twitter.com/docs/platform>/users.
You must at least have columns for id, text, and "user.screen_name".
2. Create a table to hold the sentiment scores (for example, named tweet_sentiment).
Then load it with the scores from your existing tweets. Make sure no new tweets are
loaded until this step completes.
Replace the column names in the following example with the column names from
your twitter table. The example uses the column names used by the Social Media
Connector:
create table tweet_sentiment as
(select id, "user.screen_name",
SentimentAnalysis(text using parameters filterlinks=true,
filterusermentions=true)
over (partition by id, "user.screen_name", text)
from tweets where lang='en' order by attribute );
-- The following table defines which data has been analyzed
create table dt_start as (select max(created_at) dt from tweets);
Note: If you have a large number of tweets then this command can take a long
time to run. However, it is important to score your existing data before you start
scoring newly loaded data.
3. Create a SQL script to update the tweet_sentiment table with data from newly
loaded tweets. Save it in the home folder of the Vertica database admin user. For
example, this path could be /home/dbadmin/tweet_update.sql.
Replace the column names with the column names from your twitter table. The
following example uses the column names used by the Vertica Social Media
Connector:
\i /opt/vertica/packages/pulse/ddl/loadUserDictionaries.sql
drop table if exists dt_end;
create table dt_end as (select max(created_at) dt from tweets);
-- run sentiment
insert into tweet_sentiment
(select id, "user.screen_name",
SentimentAnalysis(text using parameters filterlinks=true,
filterusermentions=true)
over (partition by id, "user.screen_name", text)
from tweets where lang='en' and
tweets.created_at > (select dt from dt_start) and
tweets.created_at <= (select dt from dt_end)
order by attribute);
-- copy date end into new start date
drop table if exists dt_start;
create table dt_start as (select dt from dt_end);
-- free up jvm resource pool memory used by this script
select release_jvm_memory();
4. Create a shell script named tweet_update.sh that is run from a cron job. This shell
script runs the tweet_update.sql script and logs the results to the file
tweet_update.log. Save the tweet_update.sh script in the home folder of the Vertica
database admin user. For example, this path could be
/home/dbadmin/tweet_update.sh.
Replace the dbadmin, password, and databasename values with the values for your
system.
/opt/vertica/bin/vsql -U dbadmin -w password -d databasename -f /home/dbadmin/tweet_update.sql > tweet_update.log
After you have created the shell script tweet_update.sh, make the script executable
by entering the following command: chmod +x tweet_update.sh.
5. Create a cron job to run the script every two minutes. Use the command crontab -e
to create the cron job. You can view all of your created cron jobs by using the
command crontab -l.
*/2 * * * * /home/dbadmin/tweet_update.sh
The script runs every two minutes. Any new tweets that have been loaded in that
two-minute window are analyzed and the results are added to the tweet_sentiment table.
In your queries, you can join the tweets and tweet_sentiment tables on their id columns.
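The windowing logic in tweet_update.sql can be sketched outside the database. The
following Python illustration is not part of Pulse; the function name and the in-memory row
layout are hypothetical. It selects rows whose created_at falls in the window
dt_start < created_at <= dt_end and returns dt_end as the next starting point, which is
what the dt_start and dt_end tables do in the script.

```python
from datetime import datetime


def next_batch(tweets, dt_start, lang="en"):
    """Return (rows to score, new watermark) for one run of the update script."""
    if not tweets:
        return [], dt_start
    # dt_end mirrors: select max(created_at) from tweets
    dt_end = max(t["created_at"] for t in tweets)
    batch = [
        t for t in tweets
        if t["lang"] == lang and dt_start < t["created_at"] <= dt_end
    ]
    # The caller stores dt_end as the next dt_start.
    return batch, dt_end
```

Because the lower bound is exclusive and the upper bound is inclusive, consecutive runs
do not score the same tweet twice, provided that tweets loaded later have later created_at
values than the stored watermark.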
Analyzing Comments for a Company or Product
Pulse allows you to analyze comments (such as tweets) for a particular company or
product.
For example, imagine that the fictional company Pytell Corp has just released a new
product called Owl-2. You want to analyze the sentiment of both the company and the
product.
You've collected several tweets from Twitter about several companies and products into
your database. However, for this analysis you only want to target tweets that have to do
with Pytell Corp and/or Owl-2.
The dataset for this example is below:
create table tweets_sample(id int, author varchar(50), text varchar(400));
insert into tweets_sample values(400900, 'DramaBugs',
'Pytell Corp has horrible customer support. On Hold 2 hours!');
insert into tweets_sample values(401200, 'Gemball',
'Owl-2 doesn''t fly!');
insert into tweets_sample values(403070, 'Postta',
'Pytell finally released Owl-2!');
insert into tweets_sample values(480920, 'Instana',
'Unboxing Owl-2 after work today! Stay Tuned!');
insert into tweets_sample values(434500, 'Dailydant',
'Owl-2 flies great! I like it!');
insert into tweets_sample values(450670, 'HelpfulBen',
'Owl-2 keeps crashing into things!');
insert into tweets_sample values(402092, 'Championtips',
'Owl-2 has solved our rodent infestation!');
insert into tweets_sample values(434950, 'Editone',
'Pytell fail? Reports of Owl-2 crashing through windows.');
insert into tweets_sample values(413956, 'CzarLatest',
'Pytell Corp''s Owl-2 just released!');
insert into tweets_sample values(459988, 'CelticMiss', 'I like Ponies!');
insert into tweets_sample values(403511, 'BuffDrama',
'I am afraid of small spiders.');
commit;
1. Run SentimentAnalysis to get an idea of how Pulse is analyzing the data:
SELECT author, SentimentAnalysis(text) OVER(PARTITION BY author, text) FROM tweets_sample ORDER BY attribute;
 author       | sentence | attribute          | sentiment_score
--------------+----------+--------------------+-----------------
 DramaBugs    |        1 | customer support   |              -1
 Championtips |        1 | owl-2              |               0
 HelpfulBen   |        1 | owl-2              |              -1
 Instana      |        1 | owl-2              |               1
 CzarLatest   |        1 | owl-2              |               0
 Gemball      |        1 | owl-2              |               0
 Postta       |        1 | owl-2              |               0
 Dailydant    |        1 | owl-2              |               1
 Editone      |        2 | owl-2              |              -1
 CelticMiss   |        1 | ponies             |               1
 Editone      |        2 | reports            |              -1
 Championtips |        1 | rodent infestation |               0
 BuffDrama    |        1 | spiders            |              -1
 Instana      |        2 | tuned              |               0
 Editone      |        1 | Pytell             |              -1
 Postta       |        1 | Pytell             |               0
 CzarLatest   |        1 | Pytell corp        |               0
 DramaBugs    |        1 | Pytell corp        |              -1
 Editone      |        2 | windows            |              -1
 Instana      |        1 | work today         |               0
(20 rows)
2. There are some attributes listed (ponies!) that do not apply to the analysis that you
are doing. You can focus your analysis by adding whitelist entries and filtering on
the whitelist. Insert whitelist entries for the company and product name into the
standard whitelist:
INSERT INTO pulse.white_list_en VALUES ('Pytell Corp');
INSERT INTO pulse.white_list_en VALUES ('owl-2');
commit;
Reload the whitelist into Pulse. Loading a user-dictionary or mapping overwrites the
existing user-dictionary or mapping:
SELECT LoadDictionary(standard USING PARAMETERS listName='white_list') OVER() FROM pulse.white_list_en;
3. Also, note that Pulse is not identifying all variations on the company name. There
are two obvious attributes for the company name ('Pytell' and 'Pytell corp'). You can
normalize these values by using a normalization mapping. Add the synonyms to the
standard normalization mapping:
insert into pulse.normalization_en values('Pytell', 'Pytell Corp');
commit;
4. Reload the normalization mapping to load the new values into Pulse:
SELECT LoadMapping(standard_base, standard_synonym USING PARAMETERS mapName='normalization')
OVER() FROM pulse.normalization_en;
5. Run the query again to see how the normalization affects the results.
Note that 'pytell corp' has been normalized to 'pytell' and Pulse is correctly
identifying the synonyms and mapping them to the base term.
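The effect of a normalization mapping can be pictured as a lookup from synonym to base
term. The following Python sketch is an illustration only, not Pulse code. The function
name is hypothetical, and lowercasing before the lookup is an assumption based on the
lowercase attributes in the text above.

```python
def normalize_attribute(attribute, mapping):
    """Map a synonym to its base term; other attributes pass through unchanged."""
    key = attribute.lower()  # assumption: matching ignores case
    return mapping.get(key, key)


# synonym -> base, mirroring the row inserted into pulse.normalization_en
mapping = {"pytell corp": "pytell"}
```

With this mapping, both 'Pytell' and 'Pytell Corp' resolve to 'pytell', so their counts and
scores can be aggregated under one attribute.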
Determining Popular Topics
The next examples in this cookbook use a table with the following structure:
create table tweets(
id int,
created_at timestamptz,
"user.name" varchar(144),
"user.screen_name" varchar(144),
text varchar(500),
"retweeted_status.retweet_count" int,
"retweeted_status.id" int,
"retweeted_status.favorite_count" int,
"user.location" varchar(144),
"coordinates.coordinates.0" float,
"coordinates.coordinates.1" float,
lang varchar(5)
);
The columns are based on the data returned by Twitter's streaming API. The fields are
defined in the Twitter Field Guide at https://dev.twitter.com/docs/platform>/tweets.
Note that the columns with quoted names, such as "user.name" and "user.screen_name", are
sub-fields within a larger field. For example, the "users" field is described here:
https://dev.twitter.com/docs/platform>/users.
The example queries provided work with any Twitter data that follows the above table
structure.
Determining Popular Topics
The Pulse attribute discovery feature allows you to easily find popular topics in a data
set. Use the CommentAttributes() function to extract the attributes from rows of text and
count the number of times the attribute occurs.
For example, using a dataset of 30,000 tweets that matched a keyword of "D11"
collected during the D11 tech conference in 2013, you could get a count of the attributes
discovered by Pulse to determine popular topics:
SELECT t.attribute, count(*) FROM(SELECT CommentAttributes(text)
OVER(PARTITION BEST) FROM tweets) as t
GROUP BY t.attribute ORDER BY count(*) DESC LIMIT 10;
 attribute   | count
-------------+-------
 http        |  3631
 d11         |  3281
 rt          |  2453
 encryption  |  2356
 usb         |  2121
 aes-256     |  1859
 smartphones |  1843
 rt @hp      |  1788
 world       |  1609
 ceo         |  1520
(10 rows)
If the dataset contains tweets in English and Spanish languages, then (using the Pulse
multilingual version) each tweet can be analyzed according to its language by
specifying the language as argument in the CommentAttributes() function. If the
language of a specific tweet is not supported, then that tweet is ignored by the function.
For example:
SELECT t.attribute, count(*) FROM(SELECT CommentAttributes(text,lang)
OVER(PARTITION BEST) FROM tweets) as t
GROUP BY t.attribute ORDER BY count(*) DESC LIMIT 10;
Notice that the top attribute is "http". This is due to the large number of links in tweets.
You can ignore links by using the filterlinks parameter of CommentAttributes():
SELECT t.attribute, count(*) FROM
(SELECT CommentAttributes(text USING PARAMETERS filterlinks=true)
OVER(PARTITION BEST) FROM tweets) as t
GROUP BY t.attribute ORDER BY count(*) DESC LIMIT 10;
 attribute   | count
-------------+-------
 d11         |  4757
 rt          |  2397
 encryption  |  2356
 usb         |  2121
 aes-256     |  1871
 smartphones |  1829
 rt @hp      |  1788
 world       |  1611
 ceo         |  1542
 interview   |  1346
(10 rows)
The attribute "http" is now gone from the list, but we still have "rt" (for retweet) on the list
and it is not helpful in this context. You can omit terms such as "rt" by adding them to the
stop_words list and reloading the stop_words user-dictionary:
INSERT INTO pulse.stop_words_en VALUES('rt');
commit;
SELECT LoadDictionary(standard USING PARAMETERS
listName='stop_words') OVER() FROM pulse.stop_words_en;
When you rerun the query you get more accurate results for the popular topics in the
data set:
SELECT t.attribute, count(*) FROM
(SELECT CommentAttributes(text USING PARAMETERS filterlinks=true)
OVER(PARTITION BEST) FROM tweets) as t
GROUP BY t.attribute
ORDER BY count(*) DESC LIMIT 10;
 attribute  | count
------------+-------
 d11        |  4757
 encryption |  2356
 perfume    |  2121
 usb        |  1871
 aes-256    |  1829
 rt @hp     |  1788
 world      |  1611
 ceo        |  1542
 interview  |  1346
 cloud      |  1306
(10 rows)
You can further refine the list to topics that contain specific attributes by adding the
attributes in which you are interested to the white_list, and then filtering with the whitelist
parameter:
SELECT t.attribute, count(*) FROM
(SELECT CommentAttributes(text USING PARAMETERS filterlinks=true,
whitelistonly=true) OVER(PARTITION BEST) FROM tweets) as t
GROUP BY t.attribute
ORDER BY count(*) DESC LIMIT 10;
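The counting that these queries perform can be sketched in Python for readers who want to
prototype on a small extract. This is a simplified illustration, not Pulse code: it applies
the stop_words and white_list as filters on attributes that were already extracted, whereas
Pulse applies them during attribute discovery. The function name is hypothetical.

```python
from collections import Counter


def top_attributes(attributes, stop_words=(), white_list=None, limit=10):
    """Count attributes, like GROUP BY attribute ORDER BY count(*) DESC LIMIT n."""
    stop = {w.lower() for w in stop_words}
    allowed = None if white_list is None else {w.lower() for w in white_list}
    counts = Counter()
    for attr in attributes:
        key = attr.lower()
        if key in stop:
            continue  # comparable in effect to adding the term to stop_words
        if allowed is not None and key not in allowed:
            continue  # comparable in effect to whitelistonly=true
        counts[key] += 1
    return counts.most_common(limit)
```

Passing stop_words=['rt'] removes the retweet marker from the ranking, which parallels
the stop_words change made earlier in this topic.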
Determining The Sentiment of Popular Topics
In addition to finding popular, or most discussed, topics in your data set, you can also
easily get an average sentiment for the topics.
The following example uses a dataset of 10,000 tweets containing the hashtag #sports.
SELECT * from
(SELECT attribute, count(attribute) AS
cnt, AVG(sentiment_score) FROM (select
SentimentAnalysis(text USING PARAMETERS
filterlinks=true) OVER(PARTITION BEST) from tweets)
AS t1 GROUP BY attribute ORDER BY
AVG(sentiment_score) desc) AS t2
WHERE t2.cnt > 500
LIMIT 5;
The result shows the five attributes with the highest average sentiment among the
attributes that occur more than 500 times:
 attribute  | cnt  |        avg
------------+------+-------------------
 football   |  817 | 0.290085679314565
 game       |  638 | 0.134796238244514
 baseball   | 1558 | 0.128369704749679
 basketball |  776 | 0.114690721649485
 hockey     | 2610 | 0.113409961685824
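The aggregation in this query (count and average per attribute, keep attributes above a
count threshold, rank by average) can be sketched as follows. This Python illustration is
not part of Pulse. The function name and the (attribute, sentiment_score) input layout are
hypothetical.

```python
from collections import defaultdict


def top_sentiment(rows, min_count=500, limit=5):
    """rows: iterable of (attribute, sentiment_score) pairs.

    Returns (attribute, count, average) tuples for attributes whose count is
    greater than min_count, ordered by average score, highest first.
    """
    totals = defaultdict(lambda: [0, 0])  # attribute -> [count, score sum]
    for attribute, score in rows:
        totals[attribute][0] += 1
        totals[attribute][1] += score
    ranked = [(a, n, s / n) for a, (n, s) in totals.items() if n > min_count]
    ranked.sort(key=lambda r: r[2], reverse=True)
    return ranked[:limit]
```

The count threshold keeps rarely mentioned attributes, whose averages rest on very few
scores, out of the ranking.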
Determining Prolific Authors
You can identify prolific authors of your textual data without using any of the Pulse
functions. For example, using the same dataset as the examples in Determining Popular
Topics, you can easily determine how many tweets were made by authors:
select "user.name", count(*) as post_count from tweets group by
"user.name" order by count(*) DESC limit 10;
 user.name            | post_count
----------------------+------------
 Nick Cicero          |        182
 Networked Society    |        171
 AllThingsD           |        137
 Stephanie~           |        117
 Jennifer Ives        |        105
 Claudia-ElasticMinds |        101
 Needful Things       |         96
 Poptart Tech         |         85
 Patrick Bertrand     |         84
 Alessandro Piol      |         81
(10 rows)
Analyzing the Sentiment of Specific Authors
You can use the white_list feature of SentimentAnalysis() to filter the attributes so only
the white_list terms are returned. You can combine the white_list with a query for a list of
specific authors to narrow down the results to a specific subset of authors.
Using the same tweets_sample table as in Analyzing Comments for a Company or
Product, add the following sample tweets:
INSERT INTO tweets_sample VALUES('123', 'bcook',
'The hyperdrive is a great machine.');
INSERT INTO tweets_sample VALUES('124', 'sprock',
'The hyperdrive is a pinnacle of technology.');
INSERT INTO tweets_sample VALUES('125', 'tgates',
'What is a hyperdrive?');
INSERT INTO tweets_sample VALUES('126', 'bcook', 'Roses are red.');
INSERT INTO tweets_sample VALUES('127', 'sprock',
'Energy equals mass times the speed of light squared.');
INSERT INTO tweets_sample VALUES('128', 'tgates', 'Violets are blue.');
commit;
Create an authors table to hold the names of the authors whose sentiment you want to
analyze:
CREATE TABLE authors (name VARCHAR, screenname VARCHAR);
Then insert the following authors:
INSERT INTO authors VALUES('Brian Cook','bcook');
INSERT INTO authors VALUES('Tom Gates', 'tgates');
INSERT INTO authors VALUES('Jim Sprock', 'sprock');
commit;
Add the word 'hyperdrive' to your existing white_list and reload the white_list
user-dictionary:
INSERT INTO pulse.white_list_en VALUES('hyperdrive');
SELECT LoadDictionary(standard USING PARAMETERS
listName='white_list') OVER() FROM pulse.white_list_en;
Then, you can run a query that filters on authors and the white_list and provides you
with a sentiment score and the content of the analyzed text:
SELECT t1.id, t1.author, t1.attribute, t1.sentiment_score, t2.text from (SELECT id, author,
SentimentAnalysis(text USING PARAMETERS
whitelistonly=true) OVER (PARTITION BY id, author) FROM tweets_sample
WHERE author IN (SELECT screenname FROM authors)) AS t1 JOIN (SELECT id,
text FROM tweets_sample) AS t2 ON t1.id = t2.id ;
 id  | author | attribute  | sentiment_score | text
-----+--------+------------+-----------------+------------------------------
 123 | bcook  | hyperdrive |               1 | The hyperdrive is a great
 124 | sprock | hyperdrive |               1 | The hyperdrive is a pinnacle
 125 | tgates | hyperdrive |               0 | What is a hyperdrive?
(3 rows)
Finding Associated Attributes
Once you've analyzed your tweets and stored them in a table (see Batch Analyzing
Data as It Is Loaded) you can use the analyzed data to make quick comparisons, such
as finding attributes most associated with another attribute.
For example, if your primary attribute is 'microsoft', you may want to determine which
other attributes are used most often with the word 'microsoft' in the same tweet. This can
be accomplished with the following SQL:
select t1.attribute, count(*), avg(t1.sentiment_score) from tweet_sentiment t1,
tweet_sentiment t2 where t1.id=t2.id and not t1.attribute=t2.attribute and
t2.attribute = 'microsoft' group by t1.attribute order by count desc limit 5;
We get the following results from a data set of 25,000 PC Manufacturer tweets:
 attribute         | count |        avg
-------------------+-------+--------------------
 windows phone     |    81 | 0.0238095238095238
 power data center |    77 |   0.58974358974359
 wind project      |    77 |                  0
 investment        |    73 |                  0
 windows           |    57 |  0.175438596491228
The query allows you to gain additional insight into the scope of an attribute and may
aid in determining the context of why a certain attribute is scored a certain way.
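The self-join used above pairs every row for the target attribute with every other
attribute row from the same tweet. The following Python sketch reproduces that logic on
in-memory rows so that you can check the idea on a small sample. It is an illustration
only: the function name and the (id, attribute, sentiment_score) row layout are
hypothetical.

```python
from collections import defaultdict


def associated_attributes(rows, target, limit=5):
    """rows: iterable of (tweet_id, attribute, sentiment_score) tuples.

    Returns (attribute, count, average score) for attributes that appear in
    the same tweet as `target`, most frequent first.
    """
    rows = list(rows)
    # Rows per tweet whose attribute is the target (the t2 side of the join).
    target_hits = defaultdict(int)
    for tweet_id, attribute, _ in rows:
        if attribute == target:
            target_hits[tweet_id] += 1
    stats = defaultdict(lambda: [0, 0.0])  # attribute -> [count, score sum]
    for tweet_id, attribute, score in rows:
        hits = target_hits.get(tweet_id, 0)
        if attribute != target and hits:
            # Each matching target row yields one joined row.
            stats[attribute][0] += hits
            stats[attribute][1] += score * hits
    ranked = [(a, n, s / n) for a, (n, s) in stats.items()]
    ranked.sort(key=lambda r: r[1], reverse=True)
    return ranked[:limit]
```

As in the SQL, the average is taken over the scores of the associated attribute, not over
the scores of the target attribute.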
Using Pulse as an Aid in Competitive Analysis
This topic details how you can use Pulse to conduct basic competitive analysis for
products or brands. Pulse makes basic competitive analysis simple through use of its
white list feature. By using the white list feature, you can analyze only the tweets that
pertain to the brands or products that you are evaluating.
For example, say you want to analyze the sentiment of major food brands to
determine how the brands compare to each other and what words people associate
(positively and negatively) with the brands. Your workflow to do this analysis with
Twitter and Vertica Pulse could be as follows:
1. Start collecting tweets based on the brands or products that you are following. For
example, you can use the Social Media Connector (available on the Vertica
Marketplace) to collect tweets matching keywords.
2. Create a white_list that contains the same keywords as the tweets that you are
collecting. The whitelist allows you to later group and filter the collected tweets. For
example:
insert into pulse.white_list_en values ('productA');
insert into pulse.white_list_en values ('productB');
insert into pulse.white_list_en values ('productC');
\i /opt/vertica/packages/pulse/ddl/loadUserDictionaries.sql
3. Batch load the tweets, and be sure to specify whitelistonly=true and
relatedwords=true in the sentimentAnalysis() function. This creates a table with
the sentiment score for your white-listed attributes. For large data sets, load the
tweets in batches. For smaller data sets (depending on your hardware), you
can try to analyze all the tweets at once. For example:
create table tweet_sentiment as
(select id, "user.screen_name",
SentimentAnalysis(text using parameters filterlinks=true,
filterusermentions=true, relatedwords=true,
filterretweets=true, whitelistonly=true)
over (partition by id, "user.screen_name", text)
from tweets where lang='en' order by attribute );
4. Verify that your tweet_sentiment table contains only your whitelist attributes. The
following query should only return the brands/products that you have white listed.
For example:
=> select distinct(attribute) from tweet_sentiment;
 attribute
-----------
 ProductA
 ProductB
 ProductC
(3 rows)
5. You can get a basic idea of which product or brand is being talked about the most
by seeing how many instances of each attribute appear in your data set:
=> select attribute, count(*) from tweet_sentiment group by (attribute) order by count(*)
desc;
 attribute | count
-----------+-------
 ProductA  |   701
 ProductB  |   192
 ProductC  |    52
(3 rows)
You can see that ProductA is the most talked-about of the three products being
analyzed over the time frame in which the tweets were collected.
6. Determine the average sentiment scores of the tweets you have collected:
=> select attribute, avg(sentiment_score) as score from tweet_sentiment group by (attribute)
order by score DESC;
 attribute |        score
-----------+---------------------
 ProductC  |   0.192307692307692
 ProductB  | -0.0729166666666667
 ProductA  |  -0.122681883024251
(3 rows)
From this basic analysis, you can see that ProductC has the most positive sentiment
of the three brands being analyzed over the time period when the tweets were
collected, and ProductA has the lowest sentiment.
7. You can also determine which words or phrases are associated with each attribute
in their positive and negative contexts. For example, to see the list of words that are
most associated with positive sentiment for ProductC, you can look at the related
words fields and add up the occurrences of words associated with positive
sentiment:
=> select count(*), related_word_1 from tweet_sentiment where attribute = 'ProductC' and
sentiment_score > 0 group by related_word_1 order by count DESC;
 count | related_word_1
-------+----------------
    11 | delicious
     2 | love
     1 | best
     1 | bless
     1 | good
     1 | work
(6 rows)
You can also do the same for negative sentiment:
=> select count(*), related_word_1 from tweet_sentiment where attribute = 'ProductC' and
sentiment_score < 0 group by related_word_1 order by count DESC;
 count | related_word_1
-------+----------------
     1 | working
     1 | dragging
     1 | bad
     1 | doomed
     1 | loud
     1 | stressful
     1 | damn
(7 rows)
8. Finally, Pulse makes it easy to see other attributes associated with your target
attributes to help you better understand the context in which people are discussing
the brands or products that you are analyzing.
a. Create another sentiment table from your data, but this time omit the
whitelistonly and relatedwords parameters:
create table tweet2_sentiment as
(select id, "user.screen_name",
SentimentAnalysis(text using parameters filterlinks=true,
filterusermentions=true, filterretweets=true)
over (partition by id, "user.screen_name", text)
from tweets where lang='en' order by attribute );
b. Next, query the tweets that contain your target attribute and find all the other
attributes associated with those tweets. Display a count of the top 5 attributes
(not including the target attribute):
=> select count(attribute), attribute from tweet2_sentiment where id in (select id from
tweet_sentiment where attribute = 'ProductC') and attribute <> 'ProductC' group by
(attribute) order by count(attribute) DESC limit 5;
 count | attribute
-------+-----------
    13 | bbq
    11 | state
    11 | sandwich
    11 | steak
     3 | ProductB
(5 rows)
As you can see, a few basic queries can tell you the general sentiment differences
between multiple brands or products. You can also determine which words are
contributing to the sentiment of each product/brand that you are analyzing and which
other attributes people are talking about when they mention the brand or product(s) that
you are analyzing.
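Steps 5 and 6 can also be combined into a single summary query. The following is a
minimal sketch, assuming the tweet_sentiment table created in step 3. The column
aliases (mentions, avg_score, positive, negative) are illustrative names chosen for
this example, not Pulse output columns:

```sql
-- One row per white-listed attribute: volume, average score, and how many
-- scored rows were positive or negative. All aliases are illustrative.
SELECT attribute,
       COUNT(*)                                             AS mentions,
       AVG(sentiment_score)                                 AS avg_score,
       SUM(CASE WHEN sentiment_score > 0 THEN 1 ELSE 0 END) AS positive,
       SUM(CASE WHEN sentiment_score < 0 THEN 1 ELSE 0 END) AS negative
FROM tweet_sentiment
GROUP BY attribute
ORDER BY avg_score DESC;
```

Showing the positive and negative counts next to the average helps distinguish an
attribute that is mostly neutral from one that draws strong opinions in both
directions.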
You could further refine these queries by breaking out different geographic locations or
times of day: join the IDs of the tweet_sentiment table back to the main tweets table
and filter by location or time.
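As a sketch of such a refinement, the following query breaks average sentiment out by
hour of day. It assumes that the tweets table has a TIMESTAMP column holding the time
each tweet was posted. The name created_at is a placeholder for whatever column your
collector stores; it is not defined elsewhere in this guide:

```sql
-- Assumption: tweets.created_at is a TIMESTAMP column (placeholder name).
SELECT s.attribute,
       EXTRACT(HOUR FROM t.created_at) AS hour_of_day,
       COUNT(*)                        AS mentions,
       AVG(s.sentiment_score)          AS avg_score
FROM tweet_sentiment s
JOIN tweets t ON t.id = s.id
GROUP BY s.attribute, EXTRACT(HOUR FROM t.created_at)
ORDER BY s.attribute, hour_of_day;
```

A geographic breakdown would follow the same pattern, grouping on a location column
from the tweets table instead of the extracted hour.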
Pulse Function Reference
• LoadDictionary
• LoadMapping
• SentimentAnalysis
• PartsOfSpeech
• GetAllDictionarySetLabels
• GetAllDictionaryWords
• GetAllLoadedDictionaries
• GetAllMappingWords
• CommentAttributes
• GetSentenceCount
• ExtractSentence
• GetAllSentences
• SetDefaultLanguage
• GetLoadedDictionary
• GetLoadedMapping
• GetStorage
• UnloadLabeledDictionary
• UnloadLabeledDictionarySet
• UnloadLabeledMapping
LoadDictionary
Loads words from a Pulse user-defined dictionary into memory for use by
sentimentAnalysis() and other Pulse functions.
This function must be used with the OVER() clause.
For more information on Pulse user-defined dictionaries, see Dictionaries and
Mappings.
Syntax
SELECT LoadDictionary(word USING PARAMETERS listName='listname'[, language='lang'] [, label='label'])
OVER() FROM table;
Parameters
Argument    Description
word        A column of words to assign to a user-dictionary list. The column name
            must match the value of word.
listName    The user-dictionary list into which the values from word are loaded.
            Valid values:
            • pos_words
            • neg_words
            • neutral_words
            • stop_words
            • white_list
            See Dictionaries and Mappings for details on each list type.
language    The language of the dictionary:
            • 'english' or 'en'
            • 'spanish' or 'es'
label       The label that you want to assign to the dictionary.
table       The table from which values are loaded.
Examples
SELECT LoadDictionary(standard USING PARAMETERS listName=
'neg_words_en') OVER() from pulse.neg_words_en;
SELECT LoadDictionary(standard USING PARAMETERS listName=
'pos_words_en') OVER() from pulse.pos_words_en;
SELECT LoadDictionary(standard USING PARAMETERS listName=
'pos_words_en', language='english') OVER() from pulse.pos_words_en;
SELECT LoadDictionary(standard USING PARAMETERS listName=
'pos_words_es', language='spanish') OVER() from pulse.pos_words_es;
SELECT LoadDictionary(standard USING PARAMETERS listName=
'neg_words',label='custom_negatives') OVER() from pulse.neg_words_en;
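The following sketch shows how a labeled dictionary might be used end to end. It only
combines calls documented in this reference (LoadDictionary,
GetAllDictionarySetLabels, and the label parameter of SentimentAnalysis). The label
name custom_negatives is taken from the example above and the sample sentence is
illustrative. Whether the other dictionary types must also be loaded under the same
label is not covered here, so verify the behavior on your system:

```sql
-- 1. Load the negative words under a custom label.
SELECT LoadDictionary(standard USING PARAMETERS listName='neg_words',
       label='custom_negatives') OVER() FROM pulse.neg_words_en;

-- 2. Confirm that the label is now loaded in the session.
SELECT GetAllDictionarySetLabels() OVER();

-- 3. Score text using the dictionaries that carry that label.
SELECT SentimentAnalysis('The lazy dog slept all day.'
       USING PARAMETERS label='custom_negatives') OVER(PARTITION BEST);
```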
See Also
• LoadMapping()
• GetLoadedDictionary()
• GetStorage()
LoadMapping
Loads a Pulse user-mapping into memory for use by sentimentAnalysis() and other
Pulse functions.
This function must be used with the OVER() clause.
For more information on Pulse user-mappings, see Dictionaries and Mappings.
Syntax
SELECT LoadMapping(base, wordToMap USING PARAMETERS mapName='mapName' [, language='lang'][,
label='label']) OVER() FROM table;
Parameters
Argument    Description
base        A column of base words to assign to a mapped word. The column name
            must match the value of base.
wordToMap   A column of words to map to the base word in the same row. The
            column name must match the value of wordToMap.
mapName     The mapping to load the words into.
            Valid values:
            • irregular_verbs: list of conjugations of verbs and their bases.
            • normalization: list of synonyms and their base word.
language    The language of the dictionary:
            • 'english' or 'en'
            • 'spanish' or 'es'
label       The label of the mapping that you want to load. If you do not provide a
            label, Pulse uses the default mapping.
table       The specified table from which values are loaded.
Examples
SELECT LoadMapping(standard_base, standard_synonym USING PARAMETERS
mapName='normalization') OVER() from pulse.normalization_en;
SELECT LoadMapping(standard_base, standard_synonym USING PARAMETERS
mapName='normalization', language='english') OVER() from pulse.normalization_en;
SELECT LoadMapping(standard_base, standard_synonym USING PARAMETERS
mapName='normalization', language='spanish') OVER() from pulse.normalization_es;
See Also
• LoadDictionary()
• GetLoadedMapping()
• GetStorage()
SentimentAnalysis
Provides a sentiment score for each attribute (noun) in a given body of text. Positive
sentiment receives a positive integer score and negative sentiment receives a negative
integer score. A score of 0 indicates that the sentiment for the attribute is neutral.
This function must be used with the OVER() clause. Use OVER(PARTITION BEST) for
the best performance if the query does not require specific columns in the OVER()
clause. Any valid PARTITION BY clause is acceptable. However, only the PARTITION
BY clause which matches the segmentation clause of the table's projection provides
optimum performance. You can improve performance by segmenting on the columns in
the PARTITION BY clause.
Syntax
SentimentAnalysis(text [, 'language'] [ USING PARAMETERS
[ whitelistonly = boolean ]
[, filterlinks = boolean ]
[, filterusermentions = boolean ]
[, filterhashtags = boolean ]
[, filterpunctuation = boolean ]
[, filterretweets = boolean ]
[, relatedwords = boolean ]
[, adjustcasing = boolean ]
[, language = string ]
[, label='label']
[, granularity='ASD']
[, actionPattern = boolean ]
])
Note: language can be specified as an argument and/or as a parameter. When
specified as both, the argument value supersedes the parameter value.
Parameters
Argument            Description
text                The text to analyze. Limited to 65,000 bytes.
whitelistonly       Optional. Default false. When set to true, only attributes defined in
                    the whitelist user-dictionary are scored. Use this setting to limit
                    your analysis to the objects of action patterns.
filterlinks         Optional. Default false. When set to true, links are not included as
                    attributes.
filterusermentions  Optional. Default false. When set to true, Twitter user mentions
                    (@username) are not included as attributes.
filterhashtags      Optional. Default false. When set to true, Twitter hashtags
                    (#hashtag) are not included as attributes.
filterpunctuation   Optional. Default true. Filters any punctuation that occurs at the
                    beginning of an attribute other than @ and #.
filterretweets      Optional. Default false. Filters out the characters "RT" from
                    retweets in attributes.
relatedwords        Optional. Default false. When set to true, provides up to three
                    words from the sentence used to help determine the sentiment of
                    the attribute.
adjustcasing        Optional. Default false. When set to true, all letters in the text
                    are converted to uppercase before sentence detection. After
                    performing sentence detection, Vertica converts all letters to
                    lowercase. This option can help you in cases where the original
                    data is all in lowercase letters and Pulse is incorrectly identifying
                    sentence boundaries.
language            The language:
                    • 'english' or 'en'
                    • 'spanish' or 'es'
label               Optional. The label of the dictionaries that you want to use for
                    sentiment analysis. If you do not include a label, Pulse uses the
                    default dictionaries.
granularity         Optional. The level of the sentiment analysis that you want to
                    perform:
                    • A: Attribute level analysis
                    • S: Sentence level analysis
                    • D: Document level analysis
                    You can specify any granularity level or combination of levels
                    with your sentiment analysis. If you do not specify a granularity
                    level, Pulse performs an attribute level analysis.
actionPattern       Optional. Default false. When set to true, checks for action patterns
                    in the analyzed content.
Examples
These examples show various ways you can use Pulse to detect user sentiment.
Query for sentiment in the following sentence.
SELECT SentimentAnalysis('The quick brown fox jumped over the lazy dog.') OVER(PARTITION BEST);
 sentence | attribute | sentiment_score
----------+-----------+-----------------
        1 | fox       |               1
        1 | dog       |              -1
(2 rows)
Query to identify the words that triggered the sentiment score.
SELECT SentimentAnalysis('The quick brown fox jumped over the lazy dog.'
USING PARAMETERS relatedwords=true) OVER(PARTITION BEST);
 sentence | attribute | sentiment_score | related_word_1 | related_word_2 | related_word_3
----------+-----------+-----------------+----------------+----------------+----------------
        1 | fox       |               1 | quick          | lazy           |
        1 | dog       |              -1 | lazy           |                |
(2 rows)
SELECT SentimentAnalysis('The quick brown fox jumped over the lazy dog.', 'english')
OVER(PARTITION BEST);
 sentence | attribute | sentiment_score
----------+-----------+-----------------
        1 | fox       |               1
        1 | dog       |              -1
(2 rows)
SELECT SentimentAnalysis('The quick brown fox jumped over the lazy dog.'
using PARAMETERS language='english') OVER(PARTITION BEST);
 sentence | attribute | sentiment_score
----------+-----------+-----------------
        1 | fox       |               1
        1 | dog       |              -1
(2 rows)
SELECT SentimentAnalysis('El zorro rapido brinco sobre el perro flojo.',
'spanish') OVER(PARTITION BEST);
 sentence | attribute | sentiment_score
----------+-----------+-----------------
        1 | zorro     |               1
        1 | perro     |              -1
(2 rows)
SELECT SentimentAnalysis('El zorro rapido brinco sobre el perro flojo.'
using PARAMETERS language='spanish') OVER(PARTITION BEST);
 sentence | attribute | sentiment_score
----------+-----------+-----------------
        1 | zorro     |               1
        1 | perro     |              -1
(2 rows)
SELECT SentimentAnalysis('The camera takes great quality pictures but is
expensive. It feels like a professional one.'
USING PARAMETERS granularity='ASD') over();
 sentence |    attribute     | sentiment_score | mixed
----------+------------------+-----------------+-------
          |                  |               1 | true
        1 |                  |               0 | true
        2 |                  |               1 | false
        1 | camera           |               1 |
        1 | quality pictures |               1 |
SELECT sentimentAnalysis('Right after school on November 8th I will go to target, walmart, and best
buy and buy #blueslidepark just for @MacMiller' USING PARAMETERS
actionPattern=true,whitelistonly=true) over();
 sentence | attribute | sentiment_score |    action    |       action_pattern
----------+-----------+-----------------+--------------+----------------------------
        1 | walmart   |               1 | go to target | #action{$verb $prep $verb}
        1 | walmart   |               1 | go to target | #action{$verb to $verb}
(2 rows)
Getting Twitter User-Mentioned Sentiment
SELECT SentimentAnalysis('@company is great!') OVER(PARTITION BEST);
 sentence | attribute | sentiment_score
----------+-----------+-----------------
        1 | @company  |               1
(1 row)
Filtering Twitter User Sentiment
SELECT SentimentAnalysis('@company is great!' USING PARAMETERS
filterusermentions=true) OVER(PARTITION BEST);
 sentence | attribute | sentiment_score
----------+-----------+-----------------
(0 rows)
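Matching the PARTITION BY clause to the projection's segmentation
The performance guidance at the start of this topic can be sketched as follows. The
table name tweets_by_id and its columns are hypothetical and exist only for this
example. SEGMENTED BY HASH is standard Vertica DDL rather than a Pulse feature:

```sql
-- Hypothetical table, segmented on the column used to partition the analysis.
CREATE TABLE tweets_by_id (
    id   INTEGER,
    text VARCHAR(65000)   -- SentimentAnalysis accepts up to 65,000 bytes
)
SEGMENTED BY HASH(id) ALL NODES;

-- PARTITION BY id matches the segmentation expression, so the rows for a
-- given id should already be located on a single node.
SELECT id, SentimentAnalysis(text) OVER(PARTITION BY id)
FROM tweets_by_id;
```

If the query does not need id in its output, OVER(PARTITION BEST) remains the simpler
choice.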
See Also
• LoadDictionary()
• LoadMapping()
• ExtractSentence()
• GetSentenceCount()
• GetAllSentences()
• CommentAttributes()
PartsOfSpeech
Tags the words in one or more sentences with their part of speech classification, using
Penn Treebank parts of speech tags.
Syntax
SELECT PartsOfSpeech('sentences' [, 'language'] [USING PARAMETERS [language='lang']
[, adjustcasing=boolean]]) OVER(PARTITION BEST);
Parameters
Argument      Description
sentences     One or more sentences to be tagged with parts of speech markup.
language      The language:
              • 'english' or 'en'
              • 'spanish' or 'es'
adjustcasing  Optional. Default false. When set to true, all letters in the text are
              converted to uppercase before sentence detection. After performing
              sentence detection, Vertica converts all letters to lowercase. This option
              can help you in cases where the original data is all in lowercase letters
              and Pulse is incorrectly identifying sentence boundaries.
Notes
• This function returns a part of speech markup for each word. For English, the
  markup used is the Penn Treebank Project Parts of Speech Tags; for Spanish, the
  Parole Reduced Tagset is used.
• This function must be used with the over() clause. Use with OVER(PARTITION BEST)
  for the best performance if the query does not require specific columns in the
  over() clause.
Examples
select partsOfSpeech('The quick brown fox jumped over the lazy dog.') OVER(PARTITION BEST);
 sentence | token  | part_of_speech
----------+--------+----------------
        1 | the    | DT
        1 | quick  | JJ
        1 | brown  | JJ
        1 | fox    | NN
        1 | jumped | VBD
        1 | over   | IN
        1 | the    | DT
        1 | lazy   | JJ
        1 | dog    | NN
        1 | .      | .
(10 rows)
select partsOfSpeech('Every good boy deserves fudge.') OVER(PARTITION BEST);
 sentence |  token   | part_of_speech
----------+----------+----------------
        1 | every    | DT
        1 | good     | JJ
        1 | boy      | NN
        1 | deserves | VBZ
        1 | fudge    | NN
        1 | .        | .
(6 rows)
select partsOfSpeech('The quick brown fox jumped over the lazy dog.', 'english')
OVER(PARTITION BEST);
 sentence | token  | part_of_speech
----------+--------+----------------
        1 | the    | DT
        1 | quick  | JJ
        1 | brown  | JJ
        1 | fox    | NN
        1 | jumped | VBD
        1 | over   | IN
        1 | the    | DT
        1 | lazy   | JJ
        1 | dog    | NN
        1 | .      | .
(10 rows)
select partsofSpeech('El zorro rapido brinco sobre el perro flojo','spanish')
over();
 sentence | token  | part_of_speech
----------+--------+----------------
        1 | El     | DA
        1 | zorro  | NC
        1 | rapido | AQ
        1 | brinco | AQ
        1 | sobre  | SP
        1 | el     | DA
        1 | perro  | NC
        1 | flojo  | AQ
(8 rows)
See Also
• SentimentAnalysis()
GetAllDictionarySetLabels
Lists all the dictionary labels that are loaded into the current Pulse session. This
function shows you which labels are currently in use. You can load only one dictionary
of each type in a single session.
Syntax
SELECT GetAllDictionarySetLabels() OVER();
Examples
SELECT GetAllDictionarySetLabels() OVER();
 label
--------------
 default
 sports_teams
(2 rows)
GetAllDictionaryWords
Lists all dictionary words that are currently loaded into Pulse. This function can help you
determine which user-defined words in a sentence might be affecting the sentiment
score of an attribute.
Syntax
SELECT GetAllDictionaryWords([USING PARAMETERS [language='language'] [, label='label']]) OVER();
Parameters
Argument    Description
language    The language of the dictionary:
            • 'english' or 'en'
            • 'spanish' or 'es'
label       The label of the dictionaries that you want to list. If you do not provide a
            label, Pulse uses the default dictionaries.
Examples
SELECT GetAllDictionaryWords() OVER();
 dictionary |   word
------------+-----------
 neg_words  | ratchet
 neg_words  | squirelly
select GetAllDictionaryWords(using parameters language='english') over();
 dictionary   |    word
--------------+------------
 pos_words_en | simplicity
(1 row)
select GetAllDictionaryWords(using parameters label='music') over();
 dictionary    |   word
---------------+-----------
 white_list_en | classical
 white_list_en | popular
 white_list_en | rock
(3 rows)
See Also
• GetAllMappingWords()
GetAllLoadedDictionaries
Lists all the dictionaries and dictionary labels that are loaded into the current Pulse
session. This function shows you which dictionaries are determining the sentiment
score of an attribute. Only one dictionary of each type can be loaded in a single session.
Syntax
SELECT GetAllLoadedDictionaries() OVER();
Examples
SELECT GetAllLoadedDictionaries() OVER();
 dictionary       | label
------------------+---------
 neg_words_en     | default
 stop_words_es    | default
 neutral_words_es | default
 white_list_en    | default
 normalization_en | default
 pos_words_es     | default
 neg_words_es     | default
 pos_words_en     | default
 white_list_es    | default
 neutral_words_en | default
 stop_words_en    | default
 normalization_es | default
(12 rows)
GetAllMappingWords
Lists all user-defined bases and synonyms that are currently loaded into Pulse. This
function helps you determine which user-defined mappings in a sentence might be
affecting the sentiment score of an attribute.
Syntax
SELECT GetAllMappingWords([USING PARAMETERS [language='language'] [, label='label']]) OVER();
Parameters
Argument    Description
language    The language of the dictionary:
            • 'english' or 'en'
            • 'spanish' or 'es'
label       The label of the mappings that you want to list. If you do not provide a
            label, Pulse uses the default mappings.
Examples
SELECT GetAllMappingWords() OVER() limit 10;
 mapping       |     key     |      value
---------------+-------------+-----------------
 normalization | hp          | hewlett packard
 normalization | hp          | hewlett-packard
 normalization | companycorp | company-corp
 normalization | companycorp | companycorps
 normalization | companycorp | companycorp's
 normalization | producthd   | product hd
 normalization | producthd   | product-hd
 normalization | companycorp | company corp
(8 rows)
select getAllMappingWords(using parameters language='english') over();
 mapping          | key |      value
------------------+-----+-----------------
 normalization_en | hp  | hewlett-packard
 normalization_en | hp  | hewlett Packard
(2 rows)
select getAllMappingWords(using parameters language='spanish') over();
 mapping          |   key   |     value
------------------+---------+----------------
 normalization_es | hidalgo | miguel hidalgo
(1 row)
See Also
• GetAllDictionaryWords()
CommentAttributes
Retrieves the attributes (nouns) from a given piece of text.
Syntax
CommentAttributes(text [, language] [ USING PARAMETERS
[ whitelistonly = boolean ]
[, filterlinks = boolean ]
[, filterusermentions = boolean ]
[, filterhashtags = boolean ]
[, filterpunctuation = boolean ]
[, filterretweets = boolean ]
[, adjustcasing = boolean ]
[, language = string ]
])
Parameters
Argument            Description
text                The text from which to extract the attributes.
language            The language:
                    • 'english' or 'en'
                    • 'spanish' or 'es'
whitelistonly       Optional. Default false. When set to true, only attributes defined in
                    the white_list user-dictionary are returned.
filterlinks         Optional. Default false. When set to true, links are not set as
                    attributes.
filterusermentions  Optional. Default false. When set to true, Twitter usernames
                    (@username) are not set as attributes.
filterhashtags      Optional. Default false. When set to true, removes the following
                    from tweets:
                    • hashtag symbols - For example, #pizza becomes pizza.
                    • @mentions - For example, Vertica would remove
                      @NewYorkCity from a tweet.
                    • Link URLs
filterpunctuation   Optional. Default true. Filters any punctuation that occurs at the
                    beginning of an attribute other than @ and #.
filterretweets      Optional. Default false. Filters out the characters "RT" from
                    retweets in attributes.
adjustcasing        Optional. Default false. When set to true, all letters in the
                    sentence are converted to upper-case before sentence detection.
                    After sentence detection all letters are converted to lower-case.
                    This option is helpful if the original data is all in lower-case and
                    Pulse is incorrectly identifying parts of speech in the sentence.
Notes
• The text argument is limited to 65,000 bytes.
• This function must be used with the over() clause. Use with OVER(PARTITION BEST)
  for the best performance if the query does not require specific columns in the
  over() clause.
• language can be specified as an argument and/or as a parameter where the
  argument value supersedes the parameter value.
Examples
select CommentAttributes('The quick brown fox jumped over the lazy dog. All good boys deserve
fudge.') OVER(PARTITION BEST);
 sentence | attribute
----------+-----------
        1 | fox
        1 | dog
        2 | boys
        2 | fudge
(4 rows)
select commentattributes('the quick brown fox jumped over the lazy dog. All good boys deserve
fudge'
,'english') over();
 sentence | attribute
----------+-----------
        1 | fox
        1 | dog
        2 | boys
        2 | fudge
(4 rows)
select commentattributes('the quick brown fox jumped over the lazy dog. All good boys deserve
fudge'
using parameters language='english') over();
 sentence | attribute
----------+-----------
        1 | fox
        1 | dog
        2 | boys
        2 | fudge
select commentattributes('el zorro rapido brinco sobre el perro flojo. Todos los chicos buenos
merecen un premio'
,'spanish') over();
 sentence | attribute
----------+-----------
        1 | zorro
        1 | perro
        2 | chicos
        2 | premio
(4 rows)
select commentattributes('el zorro rapido brinco sobre el perro flojo. Todos los chicos buenos
merecen un premio'
using PARAMETERS language='spanish') over();
 sentence | attribute
----------+-----------
        1 | zorro
        1 | perro
        2 | chicos
        2 | premio
(4 rows)
Filtering User-mentions
SELECT CommentAttributes('@user is always late. He kept me waiting 20 minutes last weekend.'
USING PARAMETERS filterusermentions=true) OVER(PARTITION BEST);
 sentence | attribute
----------+-----------
        2 | weekend
(1 row)
See Also
• SentimentAnalysis()
GetSentenceCount
Returns the number of sentences in a body of text. You can use this function to count the
number of sentences in a long piece of text. It is also useful if you are programmatically
using the ExtractSentence function and need to know the number of sentences in a
piece of text.
Syntax
select GetSentenceCount(text [, language] [ USING PARAMETERS
[ filterlinks = boolean ]
[, filterusermentions = boolean ]
[, filterhashtags = boolean ]
[, adjustcasing = boolean ]
[, language = string ]
])
Parameters
Argument            Description
text                The text from which to extract the number of sentences. Currently
                    English and Spanish language text are supported for analysis.
language            The language:
                    • 'english' or 'en'
                    • 'spanish' or 'es'
filterlinks         Optional. Default false. When set to true, sentences that are only
                    links are not counted as a sentence.
filterusermentions  Optional. Default false. When set to true, sentences that are only
                    Twitter user mentions (@username) are not counted as a
                    sentence.
filterhashtags      Optional. Default false. When set to true, sentences that are only
                    Twitter hashtags (#hashtag) are not counted as a sentence.
adjustcasing        Optional. Default false. When set to true, all letters in the
                    sentence are converted to upper-case before sentence detection.
                    After sentence detection all letters are converted to lower-case.
                    This option is helpful if the original data is all in lower-case and
                    Pulse is incorrectly identifying parts of speech in the sentence.
Notes
• The text argument is limited to 65,000 bytes.
• This function must be used with the over() clause. Use with OVER(PARTITION BEST)
  for the best performance if the query does not require specific columns in the
  over() clause.
• language can be specified as an argument and/or as a parameter where the
  argument value supersedes the parameter value.
Examples
SELECT GetSentenceCount('The quick brown fox jumped over the lazy dog. Every good boy deserves
fudge') OVER(PARTITION BEST);
 sentence_count
----------------
              2
(1 row)
SELECT getsentencecount('http://hp.com. @hp. http://hp.com is great!') OVER(PARTITION BEST);
 sentence_count
----------------
              3
(1 row)
select getsentencecount('el zorro rapido brinco sobre el perro flojo. Todos los chicos buenos
merecen un premio'
using PARAMETERS language='spanish') over();
 sentence_count
----------------
              2
(1 row)
select getsentencecount('el zorro rapido brinco sobre el perro flojo. Todos los chicos buenos
merecen un premio'
,'spanish') over();
 sentence_count
----------------
              2
(1 row)
select getsentencecount('the quick brown fox jumped over the lazy dog. All good boys deserve fudge'
using parameters language='english') over();
 sentence_count
----------------
              2
(1 row)
select getsentencecount('the quick brown fox jumped over the lazy dog. All good boys deserve fudge'
,'english') over();
 sentence_count
----------------
              2
(1 row)
Filtering Links and User Mentions
SELECT GetSentenceCount('http://hp.com. @hp. http://hp.com is great!' USING PARAMETERS
filterlinks=true, filterusermentions=true) OVER(PARTITION BEST);
 sentence_count
----------------
              1
(1 row)
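Counting Sentences for Each Row of a Table
To count sentences for every row of a table rather than for a literal string, the
function can be applied to a column. This is a sketch that reuses the tweets table
and the partitioning pattern from the Pulse Cookbook. It assumes that id identifies
each tweet and that text holds the tweet body:

```sql
-- Returns one sentence count per tweet, ignoring sentences that consist
-- only of a link or a user mention.
SELECT id,
       GetSentenceCount(text USING PARAMETERS filterlinks=true,
                        filterusermentions=true)
       OVER(PARTITION BY id, text)
FROM tweets
WHERE lang = 'en';
```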
See Also
• GetAllSentences()
• ExtractSentence()
ExtractSentence
Returns the specified sentence from a body of text.
Syntax
ExtractSentence(text, sentence [, language] [USING PARAMETERS
[ filterlinks = boolean ]
[, filterusermentions = boolean ]
[, filterhashtags = boolean ]
[, adjustcasing = boolean ]
[, language = string ]
])
Parameters
Argument            Description
text                The text containing the sentence to extract.
language            The language:
                    • 'english' or 'en'
                    • 'spanish' or 'es'
sentence            Integer value. The number of the sentence in the text.
filterlinks         Optional. Default false. When set to true, sentences that are only
                    links are skipped over and ignored. Any links in a sentence are
                    not included in the extracted sentence.
filterusermentions  Optional. Default false. When set to true, sentences that are only
                    Twitter user mentions (@username) are skipped over and
                    ignored. Any user-mentions in a sentence are not included in the
                    extracted sentence.
filterhashtags      Optional. Default false. When set to true, sentences that are only
                    Twitter hashtags (#hashtag) are skipped over and ignored. Any
                    hashtags in a sentence are not included in the extracted
                    sentence.
adjustcasing        Optional. Default false. When set to true, all letters in the
                    sentence are converted to upper-case before sentence detection.
                    After sentence detection all letters are converted to lower-case.
                    This option is helpful if the original data is all in lower-case and
                    Pulse is incorrectly identifying parts of speech in the sentence.
Notes
• The text argument is limited to 65,000 bytes.
• This function must be used with the over() clause. Use with OVER(PARTITION BEST)
  for the best performance if the query does not require specific columns in the
  over() clause.
• language can be specified as an argument and/or as a parameter where the
  argument value supersedes the parameter value.
Examples
select ExtractSentence('The quick brown fox jumped. Every good boy deserves fudge', 2)
OVER(PARTITION BEST);
            sentence
--------------------------------
 Every good boy deserves fudge.
(1 row)
select extractSentence('the quick brown fox jumped over the lazy dog. All good boys deserve fudge'
, 2, 'english') over();
          sentence
-----------------------------
 All good boys deserve fudge
(1 row)
select extractSentence('the quick brown fox jumped over the lazy dog. All good boys deserve fudge'
,2 using parameters language='english') over();
          sentence
-----------------------------
 All good boys deserve fudge
(1 row)
select extractSentence('el zorro rapido brinco sobre el perro flojo. Todos los chicos buenos merecen un premio'
, 2, 'spanish') over();
                 sentence
-------------------------------------------
 Todos los chicos buenos merecen un premio
(1 row)
select extractSentence('el zorro rapido brinco sobre el perro flojo. Todos los chicos buenos merecen un premio'
,2 using parameters language='spanish') over();
                 sentence
-------------------------------------------
 Todos los chicos buenos merecen un premio
(1 row)
Filtering Links
SELECT ExtractSentence('HP - http://hp.com is a useful website. I like HP.', 1
USING PARAMETERS filterlinks=true) OVER(PARTITION BEST);
         sentence
---------------------------
 hp - is a useful website.
(1 row)
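The examples above pass literal strings. As a rough sketch of applying the function to stored text, the following query reads from a hypothetical table named customer_comments with a hypothetical VARCHAR column named comment_text; neither name comes from this guide. It also passes language both as an argument and as a parameter to illustrate the precedence described in the Notes, so the argument value ('en') would be the one that applies.

```sql
-- Hypothetical table and column names; substitute your own.
-- Extracts the first sentence of each comment, skipping link-only sentences.
-- language is supplied twice: per the Notes, the argument ('en')
-- supersedes the parameter ('es').
SELECT ExtractSentence(comment_text, 1, 'en'
       USING PARAMETERS filterlinks=true, language='es')
OVER(PARTITION BEST)
FROM customer_comments;
```

Because the text argument is limited to 65,000 bytes, a column used this way should not hold values longer than that.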
See Also
• GetSentenceCount()
• GetAllSentences()
GetAllSentences
Extracts a row for each sentence in a body of text. This ability is useful if you need to
programmatically get each sentence in a piece of text.
Syntax
GetAllSentences(text [, language] [USING PARAMETERS
[ filterlinks = boolean ]
[, filterusermentions = boolean ]
[, filterhashtags = boolean ]
[, adjustcasing = boolean ]
[, language = string ]
])
Parameters
Argument            Description
text                The text from which to get the sentences.
language            The language:
                    • 'english' or 'en'
                    • 'spanish' or 'es'
filterlinks         Optional. Default false. When set to true, sentences that are only
                    links are skipped over and ignored. Any links in a sentence are
                    not included in the extracted sentence.
filterusermentions  Optional. Default false. When set to true, sentences that are only
                    Twitter user mentions (@username) are skipped over and
                    ignored. Any user-mentions in a sentence are not included in the
                    extracted sentence.
filterhashtags      Optional. Default false. When set to true, sentences that are only
                    Twitter hashtags (#hashtag) are skipped over and ignored. Any
                    hashtags in a sentence are not included in the extracted
                    sentence.
adjustcasing        Optional. Default false. When set to true, all letters in the
                    sentence are converted to upper-case before sentence detection.
                    After sentence detection all letters are converted to lower-case.
                    This option is helpful if the original data is all in lower-case and
                    Pulse is incorrectly identifying parts of speech in the sentence.
Notes
• The text argument is limited to 65,000 bytes.
• This function must be used with the OVER() clause. Use OVER(PARTITION BEST)
  for the best performance if the query does not require specific columns in the
  OVER() clause.
• language can be specified as an argument, as a parameter, or both. If both are
  specified, the argument value supersedes the parameter value.
Examples
SELECT GetAllSentences('The quick brown fox jumped over the lazy dog. Every good boy deserves fudge')
OVER(PARTITION BEST);
                   sentence
-----------------------------------------------
 The quick brown fox jumped over the lazy dog.
 Every good boy deserves fudge.
(2 rows)
select getAllSentences('the quick brown fox jumped over the lazy dog. All good boys deserve fudge'
,'english') over();
 sentence_index |                 sentence_text
----------------+-----------------------------------------------
              1 | the quick brown fox jumped over the lazy dog.
              2 | All good boys deserve fudge
(2 rows)
select getAllSentences('the quick brown fox jumped over the lazy dog. All good boys deserve fudge'
using parameters language='english') over();
 sentence_index |                 sentence_text
----------------+-----------------------------------------------
              1 | the quick brown fox jumped over the lazy dog.
              2 | All good boys deserve fudge
(2 rows)
select getAllSentences('el zorro rapido brinco sobre el perro flojo. Todos los chicos buenos merecen un premio'
,'spanish') over();
 sentence_index |                sentence_text
----------------+----------------------------------------------
              1 | el zorro rapido brinco sobre el perro flojo.
              2 | Todos los chicos buenos merecen un premio
(2 rows)
select getAllSentences('el zorro rapido brinco sobre el perro flojo. Todos los chicos buenos merecen un premio'
using parameters language='spanish') over();
 sentence_index |                sentence_text
----------------+----------------------------------------------
              1 | el zorro rapido brinco sobre el perro flojo.
              2 | Todos los chicos buenos merecen un premio
(2 rows)
Filtering User-mentions
SELECT GetAllSentences('@user is always late. He kept me waiting 20 minutes last time.'
USING PARAMETERS filterusermentions=true)
OVER(PARTITION BEST);
                 sentence
------------------------------------------
 is always late.
 he kept me waiting 20 minutes last time.
(2 rows)
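The filter parameters can be combined in one call. The following is a minimal sketch that assumes a hypothetical table named tweets with a hypothetical VARCHAR column named tweet_text (both names are illustrative and do not appear elsewhere in this guide). It returns one row per sentence, with links, user mentions, and hashtags filtered out.

```sql
-- Hypothetical table and column names; substitute your own.
-- Returns one row per sentence found in each tweet, with links,
-- @username mentions, and #hashtags filtered as described in the
-- Parameters table.
SELECT GetAllSentences(tweet_text
       USING PARAMETERS filterlinks=true,
                        filterusermentions=true,
                        filterhashtags=true)
OVER(PARTITION BEST)
FROM tweets;
```

No language is given here, so the current default language would apply.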
See Also
• GetSentenceCount()
• ExtractSentence()
SetDefaultLanguage
Sets the default language that Pulse functions use when no language is specified in the
function call.
Syntax
SetDefaultLanguage(language)
Parameters
Argument            Description
language            The language:
                    • 'english' or 'en'
                    • 'spanish' or 'es'
Notes
• This function must be used with the OVER() clause.
• The default language immediately after installation is English.
• The language that is set when using this function is the default language across all
  sessions and is persistent across database restarts.
Examples
=> select setDefaultLanguage('es') over();
 Success
---------
 t
(1 row)
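Because the setting applies to all sessions and persists across database restarts, it may be worth restoring the original default after a temporary change. A minimal sketch, using the 'en' value listed in the Parameters table:

```sql
-- Restore English as the default language for all sessions.
select setDefaultLanguage('en') over();
```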
See Also
• SentimentAnalysis
GetLoadedDictionary
Lists the currently loaded words for the specified user-dictionary.
If the user-dictionary is not loaded, then nothing is returned. You must use the OVER()
clause with this function.
Syntax
SELECT GetLoadedDictionary(user-dictionary [using PARAMETERS language = string][, label='label']) OVER();
Parameters
Argument            Description
user-dictionary     The user-dictionary list to retrieve.
                    Valid values:
                    • pos_words
                    • neg_words
                    • neutral_words
                    • stop_words
                    • white_list
                    See Dictionaries and Mappings for details on each type.
language            The language of the dictionary:
                    • 'english' or 'en'
                    • 'spanish' or 'es'
label               The label of the dictionaries that you want to list. If you do not provide a
                    label, Pulse uses the default dictionaries.
Examples
Note: This example is from a three node cluster, so three copies of the words are
returned.
SELECT GetLoadedDictionary('pos_words') OVER();
          word
------------------------
 :-)
adequate
admire
admiringly
adore
adoringly
adulation
adventuresome
advocated
affable
affably
affordable
affordably
afordable
all-around
alluringly
amazement
ameliorate
ample
amusing
--More--
SELECT GetLoadedDictionary('pos_words' using PARAMETERS language='english') OVER();
    word
------------
 simplicity
(1 row)
SELECT GetLoadedDictionary('pos_words' using PARAMETERS language='spanish') OVER();
    word
-------------
 simplicidad
(1 row)
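The label parameter from the syntax above can be combined with language. The following sketch assumes that a dictionary set has already been loaded under the label 'custom_negatives'; that label is borrowed from the UnloadLabeledDictionary example and is only illustrative here.

```sql
-- Lists the English negative words loaded under an assumed label.
-- 'custom_negatives' is illustrative; use a label you have loaded.
SELECT GetLoadedDictionary('neg_words'
       USING PARAMETERS language='english', label='custom_negatives')
OVER();
```

If nothing is loaded for that dictionary, the description above indicates that no rows are returned.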
See Also
• LoadDictionary()
• GetLoadedMapping()
GetLoadedMapping
Lists the currently loaded words for the specified user-defined mapping.
If the mapping is not loaded with LoadMapping, then nothing is returned. This function
must be used with the OVER() clause.
Syntax
SELECT GetLoadedMapping('normalization' [using PARAMETERS language = string][, label='label']) OVER();
Parameters
Argument            Description
mapping             The mapping list to retrieve. Currently the only mapping supported is:
                    normalization
                    Note: By default, the normalization list is empty.
language            The language of the dictionary:
                    • 'english' or 'en'
                    • 'spanish' or 'es'
label               The label to which you want to load the specified mapping. If you do not
                    include a label, Pulse loads the default UDDs.
Examples
SELECT GetLoadedMapping('normalization') OVER();
 key |      value
-----+-----------------
 hp  | hewlett packard
(1 row)
SELECT GetLoadedMapping('normalization' using PARAMETERS language='english') OVER();
 key |      value
-----+-----------------
 hp  | hewlett-packard
 hp  | hewlett packard
(2 rows)
SELECT GetLoadedMapping('normalization' using PARAMETERS language='spanish') OVER();
   key   |     value
---------+----------------
 hidalgo | miguel hidalgo
(1 row)
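As with GetLoadedDictionary, the language and label parameters can be supplied together. This sketch assumes a normalization mapping was previously loaded under the label 'custom_mapping'; that label is borrowed from the UnloadLabeledMapping example and is only illustrative here.

```sql
-- Lists the English normalization mapping loaded under an assumed label.
-- 'custom_mapping' is illustrative; use a label you have loaded.
SELECT GetLoadedMapping('normalization'
       USING PARAMETERS language='english', label='custom_mapping')
OVER();
```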
See Also
• LoadMapping()
• GetLoadedDictionary()
GetStorage
Lists the currently loaded user-dictionaries and user-defined mappings.
This function must be used with the OVER() clause.
Syntax
SELECT GetStorage([using PARAMETERS label='label']) OVER();
Parameters
Argument            Description
label               The label of the dictionaries and mapping names that you want to list. If
                    you do not provide a label, Pulse uses the default dictionaries.
Examples
SELECT GetStorage() OVER();
       key
------------------
 neg_words_en
 neutral_words_en
 pos_words_en
 stop_words_en
 white_list_en
 normalization_en
 neg_words_es
 neutral_words_es
 pos_words_es
 stop_words_es
 white_list_es
 normalization_es
(12 rows)
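To list only the dictionaries and mappings stored under a particular label, pass the label parameter shown in the syntax. The label 'custom_negatives' below is an assumed example and must match a label that you have actually loaded.

```sql
-- Lists the dictionaries and mappings loaded under an assumed label.
-- 'custom_negatives' is illustrative; use a label you have loaded.
SELECT GetStorage(USING PARAMETERS label='custom_negatives') OVER();
```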
See Also
• LoadDictionary()
• LoadMapping()
• GetLoadedDictionary()
• GetLoadedMapping()
UnloadLabeledDictionary
Unloads a specific dictionary from a Pulse session. The dictionary continues to exist,
and a user can later reload the dictionary, if needed.
You cannot unload a default dictionary, but you can replace it by loading a custom
user-defined dictionary.
Syntax
SELECT unloadLabeledDictionary(USING PARAMETERS listname='listname'[, language='lang'] [,
label='label']) OVER();
Parameters
Argument            Description
listName            The type of the dictionary that you want to unload. listName must be one
                    of:
                    • pos_words
                    • neg_words
                    • neutral_words
                    • stop_words
                    • white_list
                    See Dictionaries and Mappings for details on each list type.
language            The language:
                    • 'english' or 'en'
                    • 'spanish' or 'es'
label               The label of the dictionary that you want to unload.
Examples
select unloadLabeledDictionary(USING PARAMETERS listname='neg_words',
label='custom_negatives') OVER();
 success
---------
 t
(1 row)
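The optional language parameter restricts the unload to one language. The sketch below unloads the Spanish negative-word list under the same illustrative label, then calls GetStorage with that label to check what remains loaded. It assumes that a Spanish neg_words dictionary was loaded under 'custom_negatives' beforehand.

```sql
-- Unload only the Spanish neg_words dictionary for the assumed label.
select unloadLabeledDictionary(USING PARAMETERS listname='neg_words',
       language='spanish', label='custom_negatives') OVER();

-- Check which dictionaries and mappings remain under that label.
SELECT GetStorage(USING PARAMETERS label='custom_negatives') OVER();
```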
See Also
• UnloadLabeledDictionarySet()
UnloadLabeledDictionarySet
Unloads all user-defined dictionaries with a particular label from a Pulse session. The
dictionaries continue to exist, and a user can later reload the dictionaries, if needed.
You cannot unload a default dictionary, but you can replace it by loading a custom
user-defined dictionary.
Syntax
SELECT unloadLabeledDictionarySet(USING PARAMETERS label='labelName') OVER();
Parameters
Argument            Description
label               The label of the dictionary set that you want to unload.
Examples
select unloadLabeledDictionarySet(USING PARAMETERS label='custom_negatives') OVER();
 success
---------
 t
(1 row)
See Also
• UnloadLabeledDictionary()
UnloadLabeledMapping
Unloads a specific mapping from a Pulse session. The mapping continues to exist, and
a user can later reload it, if needed.
Syntax
SELECT unloadLabeledMapping(USING PARAMETERS mapName='normalization' [, language='lang'][,
label='label']) OVER();
Parameters
Argument            Description
mapName             The name of the mapping from which you are unloading the dictionary.
language            The language:
                    • 'english' or 'en'
                    • 'spanish' or 'es'
label               The label of the mapping that you want to unload.
Examples
select unloadLabeledMapping(USING PARAMETERS mapName='normalization', label='custom_mapping') OVER();
 success
---------
 t
(1 row)