Introduction to Apache Kafka

Jun Rao
Co-founder of Confluent

Agenda
•  Why people use Kafka
•  Technical overview of Kafka
•  What's coming

What's Apache Kafka
Distributed, high-throughput pub/sub system

Kafka Usage

Common Pattern in Data-driven Companies
[Figure: diagram with nodes User, Product, Data, and Science, and labels Value ↑, Virality ↑, Signals ↑, Insights ↑]

Getting Data Is Hard
•  Data scientists spend 70% of their time getting and cleaning data
•  Why?

Variety in Data Sources
•  Database records
   –  Users, products, orders
•  Business metrics
   –  Clicks, impressions, pageviews
•  Operational metrics
   –  CPU usage, requests/sec
•  Application logs
   –  Service calls, errors
•  IoT
•  …

Variety in New Specialized Systems
•  Batch oriented
   –  Hadoop ecosystem (Pig, Hive, Spark, …)
•  Real time
   –  Key/value store (Cassandra, MongoDB, HBase, …)
   –  Search (Elasticsearch, Solr, …)
   –  Stream processing (Storm, Spark Streaming, Samza)
   –  Graph (GraphLab, FlockDB, …)
   –  Time series DB (OpenTSDB, …)
   –  …

Danger of Point-to-point Pipelines
[Figure: data sources (Espresso, Voldemort, Oracle, User Tracking, Operational Logs, Operational Metrics, Production Services) each wired directly to destination systems (Hadoop, Log Search, Monitoring, Data Warehouse, Social Graph, Rec. Engine, Search, Security, Email, ...)]

Ideal Architecture: Stream Data Platform
[Figure: the same data sources (Espresso, Voldemort, Oracle, User Tracking, Operational Logs, Operational Metrics, Production Services) and destination systems (Hadoop, Log Search, Monitoring, Data Warehouse, Social Graph, Rec Engine, Search, Security, Email, ...) connected through a central Log]
Kafka is the center of the stream data platform!

Kafka at LinkedIn
•  800 billion messages, 175 TB data per day
•  13 million messages written/sec
•  65 million messages read/sec
•  Tens of thousands of producers
•  Thousands of consumers

Agenda
•  Why people use Kafka
•  Technical overview of Kafka
   –  Throughput and scalability
   –  Real time and batch consumption
   –  Durability and availability
•  What's coming

Concepts and API
•  Topic defines a message queue
•  Producer (Java)
   byte[] key = "key".getBytes();
   byte[] value = "value".getBytes();
   // key and value are raw bytes, so the record is typed <byte[], byte[]>
   ProducerRecord<byte[], byte[]> record =
       new ProducerRecord<>("my-topic", key, value);
   producer.send(record);
•  Consumer (Java)
   // simplified slide code: request one stream for "topic1"
   streams[] = Consumer.createMessageStreams("topic1", 1);
   for (message : streams[0]) {
       // do something with message
   }

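The producer example above attaches a key to every record. A common use of that key is to decide which partition of the topic a record goes to (partitions come up later in the deck), so that records with the same key stay together and in order. The following is a minimal sketch of that idea only: `KeyPartitionSketch` and `partitionFor` are hypothetical names, and `Arrays.hashCode` is a stand-in, since the real Java client uses its own hash function.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class KeyPartitionSketch {
    // Hypothetical helper: map a record key to one of numPartitions partitions.
    // Assumption: Arrays.hashCode stands in for the client's real hash function.
    public static int partitionFor(byte[] key, int numPartitions) {
        // Clear the sign bit so the modulo result is never negative.
        return (Arrays.hashCode(key) & 0x7fffffff) % numPartitions;
    }

    public static void main(String[] args) {
        // The two "user-1" records print the same partition number.
        for (String k : new String[] {"user-1", "user-2", "user-1"}) {
            byte[] key = k.getBytes(StandardCharsets.UTF_8);
            System.out.println(k + " -> partition " + partitionFor(key, 4));
        }
    }
}
```

Because the mapping depends only on the key bytes and the partition count, changing the number of partitions changes where keys land.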
Distributed Architecture
[Figure: many producers write to the Kafka cluster; many consumers read from it]
Topics are partitioned for parallelism

Built-in Cluster Management
•  Automated failure detection and failover
   –  Leverage Apache ZooKeeper
•  Online data movement

Simple Efficient Storage with a Log
[Figure: a data source appends writes to a log with entries numbered 0 to 12; Destination System A reads at position 7 (time = 7) and Destination System B reads at position 11 (time = 11)]
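The log in the figure can be sketched as an append-only list in which every record gets the next offset and each destination system keeps its own read position. This is an illustrative in-memory model, not Kafka's storage code; `LogSketch`, `append`, and `read` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

public class LogSketch {
    private final List<String> entries = new ArrayList<>();

    // Append a record to the end of the log and return the offset it was assigned.
    public long append(String record) {
        entries.add(record);
        return entries.size() - 1;
    }

    // Return up to maxRecords records starting at fromOffset.
    // Reading never removes anything, so readers do not affect each other.
    public List<String> read(long fromOffset, int maxRecords) {
        int start = (int) Math.max(0, Math.min(fromOffset, entries.size()));
        int end = Math.min(start + maxRecords, entries.size());
        return new ArrayList<>(entries.subList(start, end));
    }

    public static void main(String[] args) {
        LogSketch log = new LogSketch();
        for (int i = 0; i <= 12; i++) {
            log.append("record-" + i);
        }
        long systemA = 7;   // Destination System A's position, as in the figure
        long systemB = 11;  // Destination System B's position, as in the figure
        System.out.println("A reads " + log.read(systemA, 2));
        System.out.println("B reads " + log.read(systemB, 2));
    }
}
```

The only per-reader state is a single number, which is what lets a slow batch reader and a fast real-time reader share the same log.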
Batching and Compression
•  Batching
   –  Producer, broker, consumer
•  Compression
   –  Gzip, Snappy, LZ4
   –  End-to-end

Real Time and Batch Consumption
•  Multi-subscription model
•  Data persisted on disk, NOT cached in JVM
   –  Rely on pagecache
•  Zero-copy transfer (broker -> consumer)
•  Ordered consumption per topic partition
   –  Significantly less bookkeeping

Durability and Availability
Built-in replication
•  Configurable replication factor
•  Tolerating f - 1 failures with f replicas
•  Automated failover

Replicas and Layout
•  Topic partition has replicas
•  Replicas spread evenly among brokers
[Figure: brokers 1 to 4, each holding logs; topic1-part1, topic1-part2, topic2-part1, and topic2-part2 each appear three times across the brokers]
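One simple way to get the even spread shown in the layout figure is to place replicas round-robin across brokers. The sketch below assumes that scheme purely for illustration; it is not presented as Kafka's actual assignment algorithm, and `ReplicaLayoutSketch` and `assign` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

public class ReplicaLayoutSketch {
    // For each partition, pick replicationFactor distinct brokers, starting one
    // broker further along for each successive partition. Brokers are numbered from 1.
    public static List<List<Integer>> assign(int numPartitions, int numBrokers,
                                             int replicationFactor) {
        if (replicationFactor > numBrokers) {
            throw new IllegalArgumentException("need at least one broker per replica");
        }
        List<List<Integer>> layout = new ArrayList<>();
        for (int p = 0; p < numPartitions; p++) {
            List<Integer> replicas = new ArrayList<>();
            for (int r = 0; r < replicationFactor; r++) {
                replicas.add((p + r) % numBrokers + 1);
            }
            layout.add(replicas);
        }
        return layout;
    }

    public static void main(String[] args) {
        // Same shape as the figure: 4 partitions, 4 brokers, 3 replicas per partition.
        List<List<Integer>> layout = assign(4, 4, 3);
        for (int p = 0; p < layout.size(); p++) {
            System.out.println("partition " + p + " -> brokers " + layout.get(p));
        }
    }
}
```

With 4 partitions, 4 brokers, and 3 replicas per partition, every broker ends up with 3 replicas, and losing any one broker still leaves two copies of each partition.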
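Data Flow in Replication
[Figure: a producer writes to the leader replica of topic1-part1; two follower replicas copy from the leader; the three replicas sit on brokers 1 to 3; numbered steps 1 to 4 include ack and commit; a consumer reads from the partition]
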
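The replication flow in the figure can be approximated with a small in-memory model: the producer appends to the leader, followers copy what they are missing, a record counts as committed once every replica has it, and consumers see only committed records. This is a sketch under simplifying assumptions (all replicas stay in sync, no failures, no networking); `ReplicationSketch` and its methods are hypothetical names, not Kafka classes.

```java
import java.util.ArrayList;
import java.util.List;

public class ReplicationSketch {
    private final List<String> leader = new ArrayList<>();
    private final List<List<String>> followers = new ArrayList<>();
    private int committed = 0; // count of records present on every replica

    public ReplicationSketch(int numFollowers) {
        for (int i = 0; i < numFollowers; i++) {
            followers.add(new ArrayList<>());
        }
    }

    // The producer's write lands on the leader only.
    public void produce(String record) {
        leader.add(record);
    }

    // Follower f copies the records it is missing, then the commit point is
    // recomputed as the shortest log length across leader and followers.
    public void followerFetch(int f) {
        List<String> log = followers.get(f);
        log.addAll(leader.subList(log.size(), leader.size()));
        int min = leader.size();
        for (List<String> other : followers) {
            min = Math.min(min, other.size());
        }
        committed = min;
    }

    // Consumers are served committed records only.
    public List<String> consume() {
        return new ArrayList<>(leader.subList(0, committed));
    }

    public static void main(String[] args) {
        ReplicationSketch partition = new ReplicationSketch(2);
        partition.produce("m1");
        System.out.println("before replication: " + partition.consume());
        partition.followerFetch(0);
        partition.followerFetch(1);
        System.out.println("after replication:  " + partition.consume());
    }
}
```

In this model a producer that waits for the commit point to pass its record gets the strongest durability, at the cost of waiting for the followers to fetch.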
When producer receives ack    Latency                 Durability on failures
no ack                        no network delay        some data loss
wait for leader               1 network roundtrip     a little data loss
wait for committed            2 network roundtrips    no data loss

Extend to Multiple Partitions
[Figure: brokers 1 to 4 host topic1-part1, topic2-part1, and topic3-part1; each partition has one leader and two followers and receives writes from its own producer]
Leaders are evenly spread among brokers

Agenda
•  Why people use Kafka
•  Technical overview of Kafka
•  What's coming

Stream Data Platform

Future Releases of Apache Kafka
•  New Java consumer client
   –  Better performance
   –  Easier protocol for non-Java clients
•  Security
   –  Authentication: SSL and Kerberos
   –  Authorization
•  Better cluster management
   –  More automated tools
   –  Quotas
•  Transactional support
   –  Exactly once delivery

Confluent
•  Mission: Make stream data platform a reality
•  Kafka development and support
•  Product:
   –  Metadata management (released)
   –  REST endpoint (released)
   –  Connectors for common systems
   –  Monitor data flow end-to-end
   –  Stream processing integration

Q&A
•  More info on Apache Kafka
   –  http://kafka.apache.org/
•  Confluent
   –  http://confluent.io
   –  http://confluent.io/careers
•  Kafka meetup tonight at 6:30 (Texas V)