A Developer's Guide to Coprocessors
John Weatherford
https://github.com/jweatherford
HBaseCon 2013
Who Is Telescope?
Telescope is the leading provider of interactive television,
audience participation, and customer engagement solutions.
Clients include TV networks, producers, digital platforms,
studios, and sponsors seeking to reach, engage, and retain
mass audiences and consumers in real time.
What Is a Coprocessor?
Arbitrary code that can run on each region server
Extend the functionality of HBase
Avoid bothering the core committers
Two Types of Coprocessors
Observers
React to an event
Run code before or after the action

Endpoints
Call a function explicitly
Execute code on all regions

[Diagram: an observer wraps the client's action with pre-action and post-action code on the server; an endpoint is called by the client and runs on every region (Region 1, Region 2, Region 3).]
What Can I Do With Coprocessors?
Access Control
Secondary Indexes
Optimized Search
Data Aggregation
Ideas of what can be done
Real-time analytics
Email alerts on region splits (see the sketch below)
Cache requests
Reduce result sets
Control compaction times
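As one illustration, a region observer can watch for splits and send a notification. A minimal sketch, assuming the 0.94-era RegionObserver API used throughout this deck; AlertMailer and its send method are hypothetical stand-ins for your own notifier:

import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;
import org.apache.hadoop.hbase.coprocessor.ObserverContext;
import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
import org.apache.hadoop.hbase.regionserver.HRegion;

public class SplitAlertObserver extends BaseRegionObserver {
    @Override
    public void postSplit(ObserverContext<RegionCoprocessorEnvironment> e,
            HRegion left, HRegion right) {
        // runs on the region server after a region has split in two
        String table = left.getTableDesc().getNameAsString();
        // AlertMailer is a hypothetical helper; swap in your own alerting
        AlertMailer.send("Region split on " + table,
            "Daughter regions: " + left.getRegionNameAsString()
            + " and " + right.getRegionNameAsString());
    }
}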
A Short Story
Nothing ventured
Nothing gained
Getting Started With Code
preGet(ObserverContext<RegionCoprocessorEnvironment> c, Get get,
List<KeyValue> result)
postGet(ObserverContext<RegionCoprocessorEnvironment> c, Get get,
List<KeyValue> result)
prePut(ObserverContext<RegionCoprocessorEnvironment> c, Put put,
WALEdit edit, boolean writeToWAL)
postPut(ObserverContext<RegionCoprocessorEnvironment> c, Put put,
WALEdit edit, boolean writeToWAL)
preDelete(ObserverContext<RegionCoprocessorEnvironment> c, Delete delete,
WALEdit edit, boolean writeToWAL)
postDelete(ObserverContext<RegionCoprocessorEnvironment> c, Delete delete,
WALEdit edit, boolean writeToWAL)
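Put together, a minimal observer just extends BaseRegionObserver and overrides one of these hooks. A small sketch against the 0.94 API above; the println stands in for real logic:

import java.io.IOException;
import java.util.List;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;
import org.apache.hadoop.hbase.coprocessor.ObserverContext;
import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
import org.apache.hadoop.hbase.util.Bytes;

public class LoggingObserver extends BaseRegionObserver {
    @Override
    public void postGet(ObserverContext<RegionCoprocessorEnvironment> c,
            Get get, List<KeyValue> result) throws IOException {
        // runs on the region server after every Get on this table
        System.out.println("postGet: " + result.size() + " KeyValues for row "
            + Bytes.toString(get.getRow()));
    }
}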
Our First Observer
Intercept and modify the action
Consider all circumstances that will trigger the observer
Compile your jar with the same Java version that runs your
HBase region servers
Look for output from the coprocessor
Our First Observer
Motivation: Apache Flume only writes one column per Put

JSON:
{ twitter:
    { name: "loljk4u",
      message: "<3",
      length: 2,
      registered: true
    },
  favorite:
    { name: "Taylor",
      ...

Single row Put:
key: id-1332343
family: twitter
qualifier: json_raw
value: "{twitter:
  {name: \"loljk4u\",
   message: \"<3\",
   length: 2,
   registered: true
   ...

After prePut():
key: id-1332343
twitter:name = "loljk4u"
twitter:message = "<3"
twitter:length = 0x2
twitter:registered = 0xFF
favorite:name = "Taylor"
favorite:song = "I knew you were trouble"
JsonColumnExpander
// read the arguments passed to the coprocessor at load time
public void start(CoprocessorEnvironment env) throws IOException {
    Configuration c = env.getConfiguration();
    families = c.get("families", "").split(":");
}

public void prePut(ObserverContext<RegionCoprocessorEnvironment> e, Put put,
        WALEdit edit, boolean writeToWAL) throws IOException {
    // only act on puts that carry the raw JSON column
    if (!put.has(FAMILY, JSON_COLUMN)) { return; }
    String json = Bytes.toString(put.get(FAMILY, JSON_COLUMN).get(0).getValue());
    // parse the JSON into column/value pairs (parseJson is sketched below)
    Map<String, String> columns = parseJson(json);
    for (Entry<String, String> column : columns.entrySet()) {
        put.add(FAMILY, Bytes.toBytes(column.getKey()),
                Bytes.toBytes(column.getValue()));
    }
    // overwrite the original JSON so the raw blob is not stored twice
    put.add(FAMILY, JSON_COLUMN, Bytes.toBytes("--removed--"));
}
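The parseJson helper above is ours, not part of HBase. A minimal sketch using the Jackson 1.x library that ships with Hadoop, flattening only top-level fields (a real version would walk nested objects like twitter and favorite):

import java.io.IOException;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import org.codehaus.jackson.JsonNode;
import org.codehaus.jackson.map.ObjectMapper;

// hypothetical helper: flatten one level of JSON into column/value pairs
private Map<String, String> parseJson(String json) throws IOException {
    Map<String, String> columns = new HashMap<String, String>();
    JsonNode root = new ObjectMapper().readTree(json);
    Iterator<String> names = root.getFieldNames();
    while (names.hasNext()) {
        String name = names.next();
        columns.put(name, root.get(name).getValueAsText());
    }
    return columns;
}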
Loading the Coprocessor
Push the jar to where your cluster can find it
$> hadoop fs -put JsonColumnExpander.jar /
Alter the table to enable the coprocessor
$> alter 'test', METHOD => 'table_att',
'coprocessor' => 'hdfs:///JsonColumnExpander.jar|telescope.hbase.JsonColumnExpander|1001|arg1=1,arg2=2'
Verify the load by checking the master web UI.
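You can also check from the HBase shell; describe prints the table attributes, including loaded coprocessors (output trimmed here, and the exact formatting varies by version):

$> describe 'test'
...'coprocessor$1' => 'hdfs:///JsonColumnExpander.jar|telescope.hbase.JsonColumnExpander|1001|arg1=1,arg2=2'...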
Running The Code
Trigger the coprocessor with a put on the table
Put put = new Put(Bytes.toBytes("rowkey"));
put.add(Bytes.toBytes("goat"), Bytes.toBytes("json_raw"), json_data);
Check each server's local logs
http://regionnode:60030/logs/hbase-hbase-regionserver-node2.dev-hadoop.telescope.tv.out
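To confirm that the observer expanded the JSON, read the row back. This sketch assumes the row key and column names from the Put above and an open HTable named table:

Get get = new Get(Bytes.toBytes("rowkey"));
Result result = table.get(get);
// the expanded columns should now sit alongside the overwritten json_raw
String name = Bytes.toString(
    result.getValue(Bytes.toBytes("goat"), Bytes.toBytes("name")));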
Creating Your First Endpoint
Define the available methods in a protocol
Implement the protocol
Extend BaseEndpointCoprocessor
Load the endpoint on the table
Endpoint Example
public interface TrendsProtocol extends CoprocessorProtocol {
    HashMap<String, Long> getData() throws IOException;
}

// The endpoint class implements the protocol we wrote above
public class TrendsEndpoint extends BaseEndpointCoprocessor implements TrendsProtocol {
    @Override
    public HashMap<String, Long> getData() throws IOException {
        RegionCoprocessorEnvironment environment =
            (RegionCoprocessorEnvironment) getEnvironment();
        InternalScanner scanner = environment.getRegion().getScanner(new Scan());
        HashMap<String, Long> trends = new HashMap<String, Long>();
        try {
            List<KeyValue> curVals = new ArrayList<KeyValue>();
            boolean hasMore;
            do {
                curVals.clear();
                hasMore = scanner.next(curVals); // fills curVals with one row
                for (KeyValue pair : curVals) {
                    // loop through the values on the region and process them
                }
            } while (hasMore);
        } finally {
            scanner.close();
        }
        return trends;
    }
}
Endpoint Returned Results
HTable htable = HBaseDB.getTable(connection, "hbase_demo");
Map<byte[], HashMap<String, Long>> results = null;
results = htable.coprocessorExec(
    TrendsProtocol.class,
    null,  // start row
    null,  // end row
    new Batch.Call<TrendsProtocol, HashMap<String, Long>>() {
        @Override
        public HashMap<String, Long> call(TrendsProtocol trends) throws IOException {
            return trends.getData();
        }
    }
);
for (Map.Entry<byte[], HashMap<String, Long>> entry : results.entrySet()) {
    // process the results from each region server
}
Addendum to Endpoints
0.96 changes Endpoints to use protocol buffers
public static abstract class RowCountService
    implements com.google.protobuf.Service {
  ...
  public interface Interface {
    public abstract void getRowCount(
        com.google.protobuf.RpcController controller,
        CountRequest request,
        com.google.protobuf.RpcCallback<CountResponse> done);
    public abstract void getKeyValueCount(
        com.google.protobuf.RpcController controller,
        CountRequest request,
        com.google.protobuf.RpcCallback<CountResponse> done);
  }
}
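The client side changes to match. A sketch against the 0.96 API, using the example row-count service above (ExampleProtos, ServerRpcController, and BlockingRpcCallback come from the HBase examples and ipc packages; treat the details as illustrative):

Map<byte[], Long> results = table.coprocessorService(
    ExampleProtos.RowCountService.class,
    null, null,  // start and end row
    new Batch.Call<ExampleProtos.RowCountService, Long>() {
        @Override
        public Long call(ExampleProtos.RowCountService counter) throws IOException {
            ServerRpcController controller = new ServerRpcController();
            BlockingRpcCallback<ExampleProtos.CountResponse> callback =
                new BlockingRpcCallback<ExampleProtos.CountResponse>();
            // invoke the protobuf-defined RPC on each region
            counter.getRowCount(controller,
                ExampleProtos.CountRequest.getDefaultInstance(), callback);
            return callback.get().getCount();
        }
    }
);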
Telescope’s Coprocessors
Observers collect real-time analytics data for our
moderation platform and build aggregate tables for the
streaming data.
Endpoints optimize searches and transmit only the
necessary data, and handle simple reporting queries that
don't need the full power of MapReduce.
Questions?
Already using coprocessors? I would love to hear about it.
Curious to know more about a specific part?
All code samples and table definitions can be found at
https://github.com/jweatherford