Topics - Meetup

• Flume
• Kafka
• Storm
• Demo
Flume
• Used for creating streaming data flows
• Distributed
• Reliable
• Support for many inbound ingest protocols
• Real-Time Streaming
• Offline/Batch processing
• Event flow: Source (Web/File/…) → Channel → Sink (HDFS/NoSQL…)
• HTTP
a1.sources.r1.type = http
a1.sources.r1.port = 5140
a1.sources.r1.channels = c1
a1.sources.r1.handler = org.example.rest.RestHandler
• Spool Directory
a1.sources.src-1.type = spooldir
a1.sources.src-1.channels = ch-1
a1.sources.src-1.spoolDir = /var/log/apache/flumeSpool
a1.sources.src-1.fileHeader = true
• Exec
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/secure
• Memory – High Throughput, not reliable
• JDBC – Durable, slower
• File – Good throughput, supports recovery
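The memory-channel trade-off above can be sketched with a toy bounded in-memory channel (pure Python, for illustration only, not Flume's actual Channel API): fast FIFO buffering, but anything still in the queue is lost if the process dies.

```python
from collections import deque

class MemoryChannel:
    """Toy stand-in for Flume's memory channel: high throughput,
    but buffered events vanish on a crash (illustration only)."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.queue = deque()

    def put(self, event):
        # A full channel pushes back on the source, which must retry.
        if len(self.queue) >= self.capacity:
            raise OverflowError("channel full - source must retry")
        self.queue.append(event)

    def take(self):
        # Sinks drain events in FIFO order; None means "nothing buffered".
        return self.queue.popleft() if self.queue else None

ch = MemoryChannel(capacity=2)
ch.put("e1")
ch.put("e2")
first = ch.take()   # FIFO: the first event put is the first taken
```

A file or JDBC channel would persist each `put` before acknowledging it, which is what buys recovery at the cost of throughput.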
• HDFS
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = /flume/events/%y-%m-%d/%H%M/%S
a1.sinks.k1.hdfs.filePrefix = events-
• Hive
a1.sinks.k1.type = hive
a1.sinks.k1.channel = c1
a1.sinks.k1.hive.metastore = thrift://127.0.0.1:9083
a1.sinks.k1.hive.database = logsdb
a1.sinks.k1.hive.table = weblogs
• Kafka
a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k1.topic = mytopic
a1.sinks.k1.brokerList = localhost:9092
• Agents can be chained
• Multiplexing (a channel selector routes events to channels by header)
• Fan-in / Fan-out
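The multiplexing and fan-out ideas can be sketched in a few lines of pure Python (an illustration of the behaviour, not Flume's selector API): a multiplexing selector picks channels by a header value, while a replicating selector copies every event to all channels.

```python
def multiplex(event, mapping, default):
    """Route an event to channels based on a header value,
    as Flume's multiplexing channel selector does (sketch only)."""
    return mapping.get(event["headers"].get("state"), default)

def replicate(event, channels):
    """Fan-out: deliver a copy of the event to every channel."""
    for ch in channels:
        ch.append(event)

# Hypothetical routing table: CA events to c1, NY to c2, rest to c3.
mapping = {"CA": ["c1"], "NY": ["c2"]}
evt = {"headers": {"state": "CA"}, "body": b"hello"}
routed = multiplex(evt, mapping, default=["c3"])
```

Fan-in is simply the mirror image: many sources feeding one channel.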
# example.conf: A single-node Flume configuration
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
# Describe the sink
a1.sinks.k1.type = logger
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
• bin/flume-ng agent --conf conf --conf-file example.conf --name a1 -Dflume.root.logger=INFO,console
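The netcat-source / logger-sink loop in the example above can be mimicked end to end in pure Python (a self-contained sketch, not Flume itself): a loopback TCP server plays the netcat source, appending each received line to a list that stands in for the logger sink, and replying OK as Flume's netcat source does. An ephemeral port is used here instead of 44444 so the sketch runs anywhere.

```python
import socket
import threading

def logger_sink(records, line):
    records.append(line)               # stand-in for the logger sink

def netcat_source(server_sock, records):
    conn, _ = server_sock.accept()     # wait for one client, like `nc`
    with conn:
        data = conn.recv(1024).decode().strip()
        logger_sink(records, data)
        conn.sendall(b"OK\n")          # netcat source acknowledges each event

records = []
srv = socket.socket()
srv.bind(("127.0.0.1", 0))             # ephemeral port instead of 44444
srv.listen(1)
port = srv.getsockname()[1]
t = threading.Thread(target=netcat_source, args=(srv, records))
t.start()

# The "nc localhost 44444" side: send one event, read the OK.
with socket.create_connection(("127.0.0.1", port)) as c:
    c.sendall(b"hello flume\n")
    reply = c.recv(16).decode()
t.join()
srv.close()
```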
Kafka
• Design Considerations
  • Log aggregator
  • Distributed
  • Batch messages to reduce the number of connections
  • Offline/Periodic consumption
  • Pull model
• Uses Zookeeper for node and consumer status.
• At-least-once delivery (by tracking offsets, a consumer can achieve exactly-once processing)
• Built-in data loss auditing
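The at-least-once bullet above is easy to demonstrate with a toy consumer (pure Python, not the Kafka client API): the offset is committed only after processing, so a crash between processing and commit replays the message, and keying results by offset collapses the duplicates back to exactly-once.

```python
log = ["m0", "m1", "m2"]          # one partition's message log
committed = 0                     # last committed consumer offset
processed = []                    # side effects of processing

def poll_and_process(crash_before_commit=False):
    """Process from the committed offset; commit AFTER each message."""
    global committed
    for offset in range(committed, len(log)):
        processed.append((offset, log[offset]))   # do the work first
        if crash_before_commit:
            return                                # died before committing
        committed = offset + 1                    # commit after processing

poll_and_process(crash_before_commit=True)   # processes m0, commit is lost
poll_and_process()                           # restart: m0 is delivered again

# Deduplicating by offset recovers exactly-once semantics downstream.
seen = {}
for off, msg in processed:
    seen[off] = msg
```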
• A producer writes to a topic and consumer reads from a topic.
• A topic is divided into an ordered set of partitions. Each partition is consumed by one consumer (per group) at a time, and an offset is maintained for each consumer per partition.
• Partition count determines the maximum consumer parallelism.
• Each partition can have multiple replicas. This provides failover.
• A broker can host multiple partitions and may be the leader for several of them; each partition has exactly one leader at a time.
• The leader receives messages and replicates them to the follower replicas.
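The partition/offset model above can be sketched in pure Python (an illustration of the semantics, not the Kafka client API): the producer hashes the key to pick a partition, so ordering is preserved per key, and each partition is drained by one consumer tracking its own offset.

```python
NUM_PARTITIONS = 3
topic = [[] for _ in range(NUM_PARTITIONS)]   # one append-only log per partition

def produce(key, value):
    """Key-hash partitioning: the same key always lands in the same
    partition, so per-key ordering is preserved (sketch of the idea)."""
    p = hash(key) % NUM_PARTITIONS            # stable within one process
    topic[p].append(value)
    return p

p1 = produce("user-42", "login")
p2 = produce("user-42", "click")              # same key -> same partition
p3 = produce("user-7", "event")

# One consumer per partition (max parallelism = partition count),
# each maintaining its own offset into that partition's log.
offsets = [0] * NUM_PARTITIONS

def consume(partition):
    if offsets[partition] < len(topic[partition]):
        msg = topic[partition][offsets[partition]]
        offsets[partition] += 1
        return msg
    return None
```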
• server.properties file
  • Host name, port
  • ZooKeeper connection string
• bin/zookeeper-server-start.sh config/zookeeper.properties
• bin/kafka-server-start.sh config/server.properties
Storm
• Topologies
• Streams
• Spouts
• Bolts
• Stream groupings
• Reliability
• Tasks
• Workers
Streams
• Unbounded sequence of tuples
• Tuple is a list of values
Spouts
• Generate streams
• Can be Reliable or Unreliable
• Reliable spouts use ack() and fail(). Tuples can be replayed.
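The reliable-spout behaviour above can be sketched in pure Python (illustration only, not Storm's ISpout API): each emitted tuple stays pending until it is acked, and fail() puts it back for replay.

```python
import itertools

class ReliableSpout:
    """Sketch of a reliable spout: tuples remain pending until acked;
    fail() re-queues them so they are replayed later."""
    def __init__(self, source):
        self.source = list(source)
        self.pending = {}                 # msg_id -> tuple awaiting ack
        self.ids = itertools.count()

    def next_tuple(self):
        if self.source:
            msg_id = next(self.ids)
            tup = self.source.pop(0)
            self.pending[msg_id] = tup    # remember until acked
            return msg_id, tup
        return None

    def ack(self, msg_id):
        del self.pending[msg_id]          # fully processed, forget it

    def fail(self, msg_id):
        self.source.append(self.pending.pop(msg_id))  # schedule a replay

spout = ReliableSpout(["a", "b"])
m0, t0 = spout.next_tuple()
spout.fail(m0)                # "a" goes back for replay
m1, t1 = spout.next_tuple()   # "b" comes next
spout.ack(m1)
```

An unreliable spout is the same loop with the pending map removed: nothing is remembered, so nothing can be replayed.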
Bolts
• Used for filtering, functions, aggregations, joins, talking to databases, and more.
• Complex processing is achieved by using multiple bolts.
• Types of Bolt Interfaces:
• IRichBolt: the general interface for bolts; manual ack is required.
• IBasicBolt: a convenience interface for bolts that do filtering or simple functions; tuples are acked automatically.
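The difference between the two interfaces can be sketched in pure Python (a conceptual illustration, not Storm's Java API): a rich-style bolt calls ack itself, while a basic-style wrapper acks automatically after execute returns.

```python
acked = []
results = []

class RichCounterBolt:
    """IRichBolt-style: the bolt is responsible for acking each tuple."""
    def __init__(self):
        self.count = 0

    def execute(self, tup, ack):
        self.count += 1
        ack(tup)              # forget this call and the tuple is replayed

def basic_bolt(execute_fn):
    """IBasicBolt-style convenience wrapper: the ack happens
    automatically after execute returns (sketch only)."""
    def wrapped(tup, ack):
        execute_fn(tup)
        ack(tup)              # auto-ack on success
    return wrapped

# A trivial "simple function" bolt: uppercase each tuple.
upper = basic_bolt(lambda tup: results.append(tup.upper()))

rich = RichCounterBolt()
rich.execute("t1", acked.append)
upper("t2", acked.append)
```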
Stream Groupings
• Tell Storm how to distribute tuples among the available tasks
• Shuffle grouping – Tuples are randomly sent to tasks
• Fields grouping – Groups processing by field values; ensures that all tuples with the same field value are processed by the same task.
[Diagram: shuffle grouping sends tuples (XYZ, CAB, XYF) to tasks at random; fields grouping routes every tuple with the same field value (e.g. "X" or "A") to one fixed task]
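The two groupings above can be sketched as task-selection functions (pure Python, not Storm's API): shuffle picks any task at random, while fields grouping hashes the chosen field so equal values always map to the same task.

```python
import random

def shuffle_grouping(num_tasks):
    """Shuffle grouping: any task may receive the tuple,
    spreading load evenly at random."""
    return random.randrange(num_tasks)

def fields_grouping(tup, field, num_tasks):
    """Fields grouping: hash the field value, so all tuples with the
    same value (e.g. the same word) hit the same task."""
    return hash(tup[field]) % num_tasks

# Hypothetical word-count tuples.
t1 = {"word": "X", "count": 1}
t2 = {"word": "X", "count": 5}
t3 = {"word": "A", "count": 2}

# Same field value -> same task, which is what makes
# per-key aggregation in a bolt correct.
same = fields_grouping(t1, "word", 4) == fields_grouping(t2, "word", 4)
```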
• Nimbus – Master node
• Zookeeper – Cluster Coordination
• Supervisor – Worker Processes
• Nimbus: Master node. There can be only one master node in a cluster; it reassigns tasks in case of worker node failure.
• Zookeeper: Communication backbone of the cluster. Maintains state to aid failover/recovery.
• Supervisor: Worker node daemon. Governs the worker processes on its node.
[Diagram: Nimbus coordinates through a Zookeeper node with the Supervisor on each worker node; a worker process runs executors, and each executor runs one or more tasks]