Fast Read Scaling PostgreSQL with repmgr

Greg Smith
2ndQuadrant US
© 2ndQuadrant Limited 2010-2011
Overview

• repmgr is an open source clusterware tool for PostgreSQL replication
• Uses Streaming Replication and Hot Standby
• Provides:
  – Ease of Use
  – Performance
  – Monitoring
  – Best Practice
Target Cluster Architecture

[Diagram: a Master node using streaming replication to feed multiple Standby nodes]
Actions

• master
  – register
• standby
  – clone
  – register
  – promote
  – follow
Adding Standby node

[Diagram: Node1 (current master) streaming to Node2 (new standby)]

• repmgr standby clone
  – Performs all actions required to add one standby node
Adding Standby node

[Diagram: Node1 (current master) streaming to Node2 and Node3 (standby nodes)]

• repmgr standby clone
  – Performs all actions required to add a second standby node
Sample setup

# On the master, register it with repmgr:
[postgres@node1]:~$ repmgr master register

# On the new standby, clone the master's data directory:
[postgres@node2]:~$ repmgr -D $PGDATA -U repmgr standby clone node1

# Start the standby and its monitoring daemon:
[postgres@node2]:~$ pg_ctl start
[postgres@node2]:~$ repmgrd -f $HOME/repmgr/repmgr.conf
Usage

• All actions are simple one-line commands
• The default target is “the current node”
• Actions can be executed on other nodes by explicitly naming them
• The configuration file provides additional parameters
Failover (1)

[Diagram: the old Master node has failed; Node3 is promoted to Master while Node2 remains a Standby]

• repmgr standby promote
  – Changes standby into new master node
  – Fencing the old master: still your problem (for now!)
Failover (2)

[Diagram: Node2 (standby) is redirected to follow Node3, the newly promoted master]

• repmgr standby follow
  – Changes standby to follow newly promoted master
Failover (3)

[Diagram: Node1, the old master, rejoins as a Standby of Node3, the new Master; Node2 also remains a Standby]

• repmgr standby clone --force
  – Forces old master into being a standby of newly promoted master
  – Takes advantage of rsync optimization
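
Putting the three failover steps together, a minimal sketch of the whole sequence might look like the following; node names, $PGDATA, and the repmgr.conf path are carried over from the earlier sample setup, and exact option placement can vary by repmgr version:

# Node3: promote this standby to be the new master
[postgres@node3]:~$ repmgr standby promote

# Node2: re-point the remaining standby at the newly promoted master
[postgres@node2]:~$ repmgr standby follow

# Node1 (old master, once fenced): rejoin as a standby of node3,
# letting the rsync optimization reuse the existing data directory
[postgres@node1]:~$ repmgr -D $PGDATA -U repmgr --force standby clone node3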
repmgrd

• Monitoring daemon on each node
  – repmgr master register
  – repmgr standby register
• Allows monitoring and management
repmgrd configuration

cluster=test                                        # cluster name shared by all nodes
node=1                                              # unique number for this node
conninfo='host=node1 user=repmgr dbname=pgbench'    # how to reach this node's database
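
With a repmgr.conf like that on each node, registration and daemon startup is only a couple of commands per node; this sketch reuses the config path from the earlier sample setup:

# Register each node so repmgrd can track it
[postgres@node1]:~$ repmgr master register
[postgres@node2]:~$ repmgr standby register

# Start the monitoring daemon on each node
[postgres@node1]:~$ repmgrd -f $HOME/repmgr/repmgr.conf
[postgres@node2]:~$ repmgrd -f $HOME/repmgr/repmgr.conf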
Monitoring

$ psql -x -c "SELECT * FROM repmgr_test.repl_status"
primary_node              | 1
standby_node              | 2
last_monitor_time         | 2011-02-23 08:19:39.791974-05
last_wal_primary_location | 0/1902D5E0
last_wal_standby_location | 0/1902D5E0
replication_lag           | 0 bytes
apply_lag                 | 0 bytes
time_lag                  | 00:00:13.30293
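
The same repl_status data can feed simple alerting. A hedged example, assuming time_lag can be compared as an interval and using an arbitrary 5 minute threshold:

$ psql -c "SELECT standby_node, time_lag
             FROM repmgr_test.repl_status
            WHERE time_lag > interval '5 minutes'"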
Read Scaling Use Cases

• High availability with active monitoring
• Offload long running reports
• Materialize views
• Load balance small read-only queries
Routing reads and writes

• Writes must go to the master
• Reads can execute against the master or any standby
• The application may know which of its queries are read-only
• Application servers may support this concept
  – JDBC
  – Django
• Database proxy servers can do this routing
  – pgpool-II 3.0 “tastes” queries
• Hard to solve in all cases
  – Where do functions (stored procedures) go?
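
Whatever layer does the routing needs a way to tell the master apart from the standbys. One building block for that, a standard PostgreSQL 9.0 function rather than anything repmgr-specific, is pg_is_in_recovery():

$ psql -h node1 -At -c "SELECT pg_is_in_recovery()"
f    -- the master: accepts writes
$ psql -h node2 -At -c "SELECT pg_is_in_recovery()"
t    -- a hot standby: read-only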
Architecture: Read Scaling

• Many read copies with slight lag
• Each is also a potential failover node
• Not suitable for long reports

[Diagram: a pgpool-II router sends writes to the Primary node and spreads reads across multiple “Hot” read-only nodes]
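
For reference, the pgpool-II 3.0 side of this diagram is driven by a handful of settings along these lines; the hostnames and weights here are illustrative only, not values from the talk:

# pgpool.conf excerpt: route writes to the primary, load balance reads
master_slave_mode     = on
master_slave_sub_mode = 'stream'
load_balance_mode     = on

backend_hostname0 = 'node1'    # primary node, receives all writes
backend_port0     = 5432
backend_weight0   = 1

backend_hostname1 = 'node2'    # hot read-only node
backend_port1     = 5432
backend_weight1   = 1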
Architecture: Reporting Server

• Rolling reporting server(s)
• The live server runs the reporting queries
• Other servers provide failover capability for the Primary

[Diagram: Primary node feeding a “Hot” read-only reporting node plus an archive/failover node]
Architecture: Relay Server

• Archive data streamed to a standby
• Ship the result to a second-layer standby
• pg_streamrecv
  – https://github.com/mhagander

[Diagram: Primary node streaming to a Hot Standby, which relays to a second-layer Hot Standby]
Architecture challenges

• Doing all maintenance on the master is hard
  – VACUUM
  – CREATE INDEX CONCURRENTLY
• All writes are still going to all the slaves
• Write bottlenecks can occur in multiple places
  – 5 hour checkpoints are no fun
• Query cancellation
Prioritization

• Keep the standby current for failover
• Long running reports on the standby
• Avoid adding overhead to the master
Query Conflicts

• Primary: DROP DATABASE X
• Standby: query on database X
• Cannot do both
• The action on the primary has already happened, so whatever occurs, WAL recovery must always win
Query Visibility: 9.0

• Queries executing on the standby are independent
• The master does not know what is running there
9.0 tuning in theory

• Increase vacuum_defer_cleanup_age to reduce vacuum cleanup cancellation
• Increase max_standby_*_delay for long running reports
• Use the dblink “sleep on open snapshot” technique to export MVCC snapshot data back to the master
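
As a concrete illustration of the first two settings (the values below are examples, not recommendations; vacuum_defer_cleanup_age goes on the master, the delay settings on the standby):

# Master postgresql.conf: defer cleanup so standby queries see old rows longer
vacuum_defer_cleanup_age = 100000      # in transactions (xid units), hard to size well

# Standby postgresql.conf: allow reports to delay WAL apply
max_standby_streaming_delay = 30min    # 9.0 caps this at roughly 35 minutes
max_standby_archive_delay   = 30min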
9.0 tuning in practice

• Setting vacuum_defer_cleanup_age in txid units is impossible for most
• The maximum values available for max_standby_*_delay are only ~35 minutes
• dblink snapshot export techniques work, but are difficult for most to implement
• Spurious cancellations are hard to eliminate completely
What's new in 9.1

• pg_stat_replication makes non-wizard monitoring possible
• max_standby_*_delay can be big
• hot_standby_feedback makes MVCC-style snapshot export easy
• Base backups possible using the database connection
• Synchronous replication
• Improvements in b-tree delete handling
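
A quick sketch of the first and fourth items, using the standard 9.1 tools rather than anything repmgr-specific (column list trimmed for readability):

$ psql -x -c "SELECT application_name, client_addr, state,
                     sent_location, replay_location
                FROM pg_stat_replication"

$ pg_basebackup -h node1 -U repmgr -D $PGDATA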
Why still care about repmgr?

• Remote node command execution makes management easier
• The rsync based approach can make fail-back dramatically faster
• Newer features like autofailover
• Best practices and workarounds for real-world issues are incorporated
Shared knowledge helps

• Albourne deployment and feedback from Martin Eriksson were critical to the V1.0 design
• Heroku deployment contributed support for a new use case
• Early adopters of V2.0 autofailover are paving that part of the roadmap right now
• Unusual issues are being identified by community bug reports
Community

• Publicly released in December 2010
• Project hosted at GitHub
• Core team: Simon Riggs, Jaime Casanova, Greg Smith, Cédric Villemain
• GPL license to encourage sharing modifications
• V1.1 included code from 3 other companies
• V2.0 already pushed out, in unannounced beta