Control Plane Architectures:
Design Solutions
Shane Gibson – Cloud Infrastructure Architect
ZeroStack, Inc. - https://zerostack.com/
OpenStack Summit - Boston, MA - May 11, 2017
© ZeroStack Inc. | zerostack.com
QR Code
Why take pix? Just use the QR Code!
https://www.slideshare.net/ShaneGibson3/openstack-control-plane-architectures-design-solutions
IMPORTANT LEGAL STUFF
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Mauris venenatis
posuere odio vel auctor. Fusce non turpis nec lorem varius dictum. Nulla
felis neque, convallis a congue sed, molestie vel lectus. Aenean hendrerit
metus non nunc commodo sodales. Etiam ac erat at massa tincidunt lobortis at
et ligula. Fusce sed lorem tellus. Suspendisse potenti. Ut dignissim
suscipit aliquet. Donec luctus pulvinar lectus quis condimentum. Etiam sed
iaculis nunc, sed blandit magna. Fusce mattis nisl nec dapibus luctus. Proin
a augue facilisis, vehicula mauris non, cursus augue. Mauris tristique,
justo vitae tempor tincidunt, metus ligula ornare tellus, at condimentum ex
diam nec odio. Sed porttitor ultrices libero sed efficitur.
Cras sem diam, eleifend sit amet dui eu, pretium cursus
fermentum nibh. Donec tincidunt cursus enim a varius.
Nam placerat eu nunc id rutrum. Praesent ullamcorper
fringilla eros, vitae rutrum elit consectetur in. Aliquam
eu tempus dui, a feugiat nulla. Quisque laoreet imperd ex,
in facilisis elit tristique a. Nunc ante felis, faucibus
at semper nec, consequat commodo magna.
About Shane Gibson
Shane Gibson serves as the Cloud Infrastructure Architect for ZeroStack, Inc., which is a private cloud
solutions company. There he is responsible for the architecture, implementation, and management of the
internal cloud platform that drives the SaaS and Cloud Portal that power the ZeroStack solution.
Previously, he served as Sr. Principal Infrastructure Architect at Symantec for the Cloud Platform
Engineering (CPE) team. He was responsible for the infrastructure design of the underlying platforms,
operating systems, tools, and application stack that enables the OpenStack clusters within the CPE group.
In previous roles, Shane has served as a Systems Architect, Network Architect,
Security Architect, Unix Systems Administrator, Mainframe Operator, Mainframe
Hardware Specialist, and has also served in the United States Marine Corps.
In his "spare" time, he loves anything on two wheels: motorcycling, mountain
biking, road biking, cyclocross, etc…
Agenda
● what we'll be talking about (and not)
● problem statement
● needs analysis
● solutions
● summary
● questions
● thank you
● references
what we'll be talking about (and not) …
What we'll be talking about
● Short definition of what "Control Plane" means
● Short definition of what "Data Plane" means
● How much Control Plane do you need?
● Briefly discuss general HA design solutions
● Introduce four design architectures
○ Stand alone (seriously!)
○ Active/passive
○ Fully Redundant, separate control plane
○ Distributed, embedded control plane
● Discuss the architecture of these design solutions
What we won't be talking about
● Things that aren't OpenStack
○ Ancillary services (eg AD/LDAP behind Keystone)
○ Server Load Balancer architectures (they're key to HA!)
■ ok, we'll talk about them a bit …
● Specifics of Network Controller architecture
● Container Orchestration Engine (COE) HA
● Physical infrastructure (eg power, cooling, etc.)
● Complex DB setups (sharding, multisite … )
● Multi-site Control Plane
● Storage HA architecture (Ceph, Swift, etc…)
Control Plane Definition
● The control plane is the management traffic responsible for
sending signaling and commands, examples:
○ give me a token so I can do something
○ create port, network, router
○ instantiate/terminate an instance
● Sort of like a Drill Sergeant:
○ instructs recruits (data plane)
○ signals and commands
Ref: 1
Data Plane Definition
● The data plane is all of the bits and bytes moving around
related to doing the work as instructed by the control plane:
○ actually instantiating the instance
○ east/west traffic between VMs
○ north/south traffic in and out of your cloud
● Kind of like these poor Recruits
○ stand at attention, pass out at attention !!
Ref: 1
problem statement
Problem Statement
● So you've completed a PoC … like what you see …
● Need to build a shiny new cloud
● From PoC to production - what architecture do you need?
● Understand your needs
● Match your needs to a design
● Overbuilding is just as dangerous as underbuilding
● But, keep in mind - you may need, want, or be forced to scale
○ Your control plane needs to grow with your cloud
(image caption: "man, this devstack is easy !!")
needs analysis
Needs Analysis
● Understanding how much reliability you need is critical to
determining an appropriate CP architecture
● Quantify how available your platform needs to be
● Be honest … can you live with a 95% available CP? How
about 98% ? Do you *need* 99.9%? Can you afford to
build, staff, support, and maintain 99.999%?
● Complexity adds cost, time, and significant risk
Needs Analysis - how much is enough
● Downtime, based on percentage of availability:
Percentage   Yearly             Monthly            Weekly         Daily
95%          18d 6h 17m 27.6s   1d 12h 31m 27.3s   8h 24m 0.0s    1h 12m 0.0s
98%          7d 7h 18m 59.0s    14h 36m 34.9s      3h 21m 36.0s   28m 48.0s
99%          3d 15h 39m 29.5s   7h 18m 17.5s       1h 40m 48.0s   14m 24.0s
99.5%        1d 19h 49m 44.8s   3h 39m 8.7s        50m 24.0s      7m 12.0s
99.9%        8h 45m 57.0s       43m 49.7s          10m 4.8s       1m 26.4s
99.99%       52m 35.7s          4m 23.0s           1m 0.5s        8.6s
99.999%      5m 15.6s           26.3s              6.0s           0.9s

365.243 days per year (leap year, baby!)
52.178 weeks per year
30.437 days per month
4.348 weeks per month
calculations source: http://uptime.is/
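The downtime figures above follow from simple arithmetic: allowed downtime is the period length multiplied by the unavailable fraction. The sketch below is one way to reproduce them in Python. The function names are illustrative, not from the talk, and the period lengths use the rounded values quoted on the slide, so the last digit of the monthly figures can differ slightly from uptime.is.

```python
# Period lengths in seconds. The year and month lengths use the rounded
# values quoted on the slide (365.243 days per year, 30.437 days per month).
SECONDS_PER_DAY = 24 * 60 * 60
PERIOD_SECONDS = {
    "yearly": 365.243 * SECONDS_PER_DAY,
    "monthly": 30.437 * SECONDS_PER_DAY,
    "weekly": 7 * SECONDS_PER_DAY,
    "daily": SECONDS_PER_DAY,
}


def allowed_downtime_seconds(availability_percent: float, period: str) -> float:
    """Return the downtime budget, in seconds, for one period."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability must be between 0 and 100 percent")
    unavailable_fraction = (100.0 - availability_percent) / 100.0
    return PERIOD_SECONDS[period] * unavailable_fraction


def format_duration(seconds: float) -> str:
    """Render seconds in the table's 'Xd Xh Xm X.Xs' style."""
    days, rest = divmod(seconds, 86400)
    hours, rest = divmod(rest, 3600)
    minutes, secs = divmod(rest, 60)
    parts = []
    if days:
        parts.append(f"{int(days)}d")
    if days or hours:
        parts.append(f"{int(hours)}h")
    if days or hours or minutes:
        parts.append(f"{int(minutes)}m")
    parts.append(f"{secs:.1f}s")
    return " ".join(parts)


if __name__ == "__main__":
    for target in (95, 98, 99, 99.5, 99.9, 99.99, 99.999):
        row = [format_duration(allowed_downtime_seconds(target, p))
               for p in ("yearly", "monthly", "weekly", "daily")]
        print(f"{target}%", *row, sep="  ")
```

Running the script prints one row per availability target, which makes it easy to add a target of your own (say 99.95%) before deciding what you can afford to build.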
Needs Analysis
● To match your uptime/downtime threshold
○ Understand business use of your platform
○ Survey your user groups to determine what
applications they will be using, and how critical they are
● Determine how much talent (be honest) you have to build
or you can buy (hire or rent) for the platform you need…
○ *You* might be a rock star, but you need a dedicated
and competent team to tend to a complex HA solution
○ A well tended single server solution *may* outperform
a poorly managed highly complex one
■ performance, of course, notwithstanding …
Needs Analysis: match uptime to solution
● A complete (bogus?) guideline:
Figure: availability ranges (95 to 98 %, 98 to 99.5 %, 99.5 to 99.99 %, 99.99+ %) matched to Standalone, Active/Passive, and Active/Active or Distributed.
Needs Analysis
● How much capacity (compute, memory, storage, etc) do
you need for your control plane services?
○ Great resource/data:
○ URL: https://docs.openstack.org/developer/performance-docs/test_results/
○ Example Control Plane resource consumption for:
■ 6 nodes
■ 200 nodes
■ 400 nodes
■ 1000 nodes
patterns - basics of availability designs
HA Design Solutions - single system with hardware redundancy
● Server (redundant hardware subsystems)
● Typically located in a datacenter(-like) location with redundant power, network, cooling, etc…
● Capacity / scaling is going to be your bugaboo (you can only scale "up" so much)
● Suggest building in service LB from the beginning
HA Design Solutions - active/passive
● either bare metal or virtualized /
containerized work loads
● Service based replication (example: mysql repl.): mysql (active) replicates to mysql (standby) via mysql replication
● Externally replicated, service is unaware (eg use of load balancer and pacemaker + DRBD): a VIP fronts svc A, svc B, and svc C on each node, with replicated data (eg DRBD) underneath
HA Design Solutions - clustered
Diagram: one leader and two followers, each member holding data A, B, and C
● application maintains and controls cluster replication, leader election, and take-overs
HA Design Solutions - virtualized services
● Implement simple hypervisors (eg just bare KVM)
● or implement a small OpenStack cluster (caution !!)
● a lot of interesting Containerized CP solutions are maturing
Diagram: hypervisor 1, hypervisor 2, and hypervisor 3 each run a VM for service A and a VM for service B, fronted by VIP A and VIP B
HA Design Solutions - distributed services
● Embed a VM or Container in each hypervisor of your cluster
which is responsible for service orchestration tasks
Diagram: hypervisor A, hypervisor B, and hypervisor C each run a set of VMs plus one embedded controller (controller service A, B, and C respectively) that carries VIP A / B, services, orchestration, and data
solutions - as applied to control plane
Solutions: Overview
● Standalone (yes, still!)
● Active/Passive
○ one master, one standby system
● Active/Active Cluster
○ Multiple members in cluster
○ Leader election system / quorum protocols
○ Or - use of load balancers for singleton systems
● Distributed System
○ Embedded across the cluster
Solutions: how the control plane correlates
● Remember our Drill Sergeant?
Figure: control plane and data plane labels (Ref: 3, Ref: 1)
Solutions: how the control plane correlates
● how this might look in racks …
Top-of-rack switches: tor01 through tor12

infra01        infra03        compute01   compute07   compute13   compute19
infra02        builder        compute02   compute08   compute14   compute20
controller01   controller03   ...         ...         ...         ...
controller02   db01           ...         ...         ...         ...
db02           db03           ...         ...         ...         ...
network01      network03      ...         ...         ...         ...
network02      storage01      data01      data04      data07      data10
storage02      storage03      data02      data05      data08      data11
                              ...         ...         ...         ...
Solutions: standalone
● Stand alone doesn't have to mean "prone to failure"
○ Redundant power supplies (with redundant feeds)
○ Redundant NICs/separate LOM + PCIe (or 2x PCIe)
○ Hardware RAID based storage
○ Redundant Top-of-Rack (bonded NICs)
○ In an environmentally controlled facility
■ cooling, power, electrical, etc.
● You would be surprised how fault tolerant a single, well
designed system can be…
● Can only "scale up" so much before you have to "scale out"
○ Edgar Magana of Workday: OpenStack HA, or not HA
■ not HA - Level 4 Ballroom G at 5:30pm
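To put a rough number on why a well-designed single system can be surprisingly fault tolerant, the sketch below applies the standard availability formulas for redundant (parallel) and dependent (series) components. It assumes component failures are independent, which real hardware only approximates, and the 99% figures used are made-up inputs, not measurements from the talk.

```python
from math import prod
from typing import Iterable


def parallel_availability(component_availability: float, copies: int) -> float:
    """Availability of `copies` redundant parts when any one of them suffices.

    Assumes independent failures: the group is down only if every copy is down.
    """
    return 1.0 - (1.0 - component_availability) ** copies


def series_availability(availabilities: Iterable[float]) -> float:
    """Availability of a chain where every part must be up at the same time."""
    return prod(availabilities)


if __name__ == "__main__":
    # Hypothetical inputs: each power supply and each NIC is 99% available.
    psu_pair = parallel_availability(0.99, copies=2)
    nic_pair = parallel_availability(0.99, copies=2)
    # The server needs power AND network, so the two pairs are in series.
    print(f"redundant PSUs: {psu_pair:.4%}")
    print(f"redundant NICs: {nic_pair:.4%}")
    print(f"server (power and network): {series_availability([psu_pair, nic_pair]):.4%}")
```

Under these assumptions, doubling a 99% component lifts that subsystem to roughly 99.99%, while every additional subsystem in series pulls the total back down, which is why each single point of failure in the list above gets its own redundant pair.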
Solutions: how much can HA/Reliability cost you ?
● Have you ever heard of the Jepsen tests or articles?
○ Check out "The Network is Reliable" [Ref: 2] (Kyle Kingsbury and Peter Bailis)
○ it just might chill your blood …
Solutions: active/passive
● Ok, maybe standalone doesn't cut it for you …
● Active/Passive utilizes a service to monitor the main (active)
service, and then execute a coup if trouble is detected…
for example: STONITH
(Shoot The Other Node In The Head)
Solutions: active/passive
● Most of the services aren't aware of the fact they have a
"shadow partner" …
● Utilize various tools to monitor services, and initiate a takeover if the primary/active service fails
○ keepalived, pacemaker, corosync, STONITH, etc…
● Data is usually replicated outside of the application's knowledge
○ DRBD (Distributed Replicated Block Device)
■ very stable, around a LONG time, actively maintained and supported
○ xNBD/bNBD, SAN based replication
○ Ceph RBD (replica of 2), GlusterFS, etc…
○ Or … "simply" via database replication
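The monitor-and-takeover loop described above can be sketched in a few lines. This is a toy model, not how keepalived or pacemaker are implemented: the `FailoverMonitor` class, the node names, the failure threshold, and the `check` callable are all hypothetical, and fencing and VIP moves are only recorded as log entries rather than performed.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class FailoverMonitor:
    """Toy active/passive watchdog: promote the standby after repeated failures."""

    check: Callable[[str], bool]      # hypothetical health probe, True if node is healthy
    failure_threshold: int = 3        # consecutive failures before a takeover
    active: str = "node-a"
    standby: str = "node-b"
    consecutive_failures: int = 0
    events: List[str] = field(default_factory=list)

    def poll(self) -> str:
        """Probe the active node once and return whichever node is active afterwards."""
        if self.check(self.active):
            self.consecutive_failures = 0
            return self.active
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.failure_threshold:
            self._fail_over()
        return self.active

    def _fail_over(self) -> None:
        failed = self.active
        # Fence first (the STONITH idea) so the old active cannot keep writing,
        # then move the virtual IP to the promoted standby.
        self.events.append(f"fence {failed}")
        self.active, self.standby = self.standby, failed
        self.events.append(f"move VIP to {self.active}")
        self.consecutive_failures = 0


if __name__ == "__main__":
    # Scripted probe results: node-a flaps once, then fails three times in a row.
    script = {
        "node-a": iter([True, False, True, False, False, False]),
        "node-b": iter([True]),
    }
    monitor = FailoverMonitor(check=lambda node: next(script[node]))
    for _ in range(7):
        print(monitor.poll())
    print(monitor.events)
```

Requiring several consecutive failures before acting is a deliberate choice in this sketch: a single missed probe resets on the next success, so one dropped health check does not trigger a takeover.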
Solutions: active/passive
● Primary mechanism is Service LB with a watchdog of some type
● Let distributed services (eg rabbitmq and mysql) replicate natively
● Shared storage for things like configurations, backing instances, etc.
Solutions: fully redundant
● So you've decided you're "all in"
○ Fully Redundant - requires very careful consideration
○ Complex HA and Reliability solutions have their own
baggage that just might cost you more than you bargained for
○ But if you need to drive towards the 99% and better
uptime…
○ Each service requires its own
treatment in terms of architecture … but there are
common threads
Solutions: fully redundant - virtualized
● Like active/passive - but we now scale 3, 5, etc… (odd
numbers for proper quorum) of fully active members
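The preference for odd member counts comes from majority-quorum arithmetic, sketched below. It assumes a simple strict-majority rule, which is the common case but not the only quorum scheme a clustered service might use; the function names are illustrative.

```python
def quorum_size(members: int) -> int:
    """Smallest group that is a strict majority of the cluster."""
    if members < 1:
        raise ValueError("a cluster needs at least one member")
    return members // 2 + 1


def tolerated_failures(members: int) -> int:
    """How many members can fail while a majority is still reachable."""
    return members - quorum_size(members)


if __name__ == "__main__":
    for n in range(1, 8):
        print(f"{n} members: quorum {quorum_size(n)}, "
              f"survives {tolerated_failures(n)} failure(s)")
```

Under this rule a four-member cluster survives the same single failure as a three-member one, and a two-member cluster survives none, so the even-numbered member adds cost and another thing to break without adding fault tolerance.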
Solutions: fully redundant - containerized
● New alternatives emerging around COE models for
managing your Control Plane services.
● Kubernetes Example:
Figure: Kubernetes Master with HA, one of many proposed HA models
Solutions: fully redundant - containerized
● Kubernetes Worker Nodes:
Diagram: three kubernetes masterN nodes and three workers (worker 1, worker 2, worker 3), each worker running a kubelet, with mysql, neutron, glance, nova, cinder, ...etc... containers spread across the workers
Solutions: distributed
● Big departure from the traditional model
● With distributed (embedded) clusters, there are some special
considerations necessary:
○ Be very careful of "noisy neighbor" problem causing your
control plane grief
○ See "Quantifying the Noisy Neighbor Problem" by ZeroStack
from Austin 2016 Summit
○ Designing the algorithms on placing and managing your
control plane systems in the cluster can be very complex
○ Need a distributed state/service orchestration piece (eg etcd,
consul, serf, atomix, zookeeper)
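As a deliberately naive starting point for the placement problem mentioned above, the sketch below puts each control plane replica on a different hypervisor and prefers the least-loaded hosts, which is one simple hedge against noisy neighbors. The `place_replicas` function, the host names, and the single load number per host are all hypothetical; a real placement algorithm would also weigh failure domains, memory, I/O, and how placements are rebalanced over time.

```python
from typing import Dict, List


def place_replicas(host_load: Dict[str, float], replicas: int) -> List[str]:
    """Pick `replicas` distinct hypervisors, least loaded first.

    `host_load` maps a hypervisor name to a load figure between 0.0 and 1.0.
    Each replica lands on its own host (anti-affinity), so losing one
    hypervisor costs at most one control plane member.
    """
    if replicas > len(host_load):
        raise ValueError("not enough hypervisors for one replica per host")
    # Sort by load, then by name so ties are resolved deterministically.
    ranked = sorted(host_load, key=lambda host: (host_load[host], host))
    return ranked[:replicas]


if __name__ == "__main__":
    # Hypothetical four-node cluster with one busy hypervisor.
    load = {"hv-a": 0.9, "hv-b": 0.2, "hv-c": 0.5, "hv-d": 0.4}
    print(place_replicas(load, replicas=3))
```

Even this small version shows where the complexity comes from: the load figures change after placement, so a real system has to decide when moving a controller is worth the disruption.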
Solutions: distributed
● or ...
● Consider a COE (container orchestration engine) to manage
the placement and healing properties of your CP:
○ Still a relatively young solution with potential pitfalls
○ Can utilize this model with Fully Redundant or Distributed
models
○ Consider tight QoS controls (eg namespaces and cgroups)
for service guarantees if using Distributed
Solutions: distributed - four node cluster
Solutions: distributed
● When you have a CP that dynamically does this, auto-heals,
deals with noisy neighbors, and can scale on demand …
QUESTIONS ?
We are hiring!! Check us out on the
thingy called the "web", at:
https://www.zerostack.com/careers/
THANK YOU!
Shane Gibson
References
[1] CartoonStock License Agreement: https://www.cartoonstock.com/licenseagreement.asp
[2] "The network is reliable" (Kyle Kingsbury and Peter Bailis): https://aphyr.com/posts/288-the-network-is-reliable
[3] OpenStack Operators Guide: http://docs.openstack.org/openstack-ops/content/example_architecture.html#example_archs_conclusion