High-Volume Writes with PostgreSQL

High-Volume Writes
with PostgreSQL
Greg Smith - © 2ndQuadrant US 2011
Major parameters to set
•
shared_buffers: 512MB to 8GB
•
checkpoint_segments: 16 to 256
•
effective_cache_size: typically ¾ RAM
•
wal_buffers: typically 16MB
–
Auto-tuned in 9.1
Greg Smith - © 2ndQuadrant US 2011
Checkpoints



Dirty data in buffer must be flushed
WAL segments are 16MB
Requested checkpoint


checkpoint_segments of writes
Timed checkpoint

checkpoint_timeout (5 minute default)
Greg Smith - © 2ndQuadrant US 2011
Checkpoint spikes




8.3 added Spread Checkpoints
Aims to finish at 50% of progress
fsync flush to disk at end of checkpoint
Optimal behavior:



OS wrote data out before fsync call
Spreading sync out didn’t work usefully
Spikes still happen
Greg Smith - © 2ndQuadrant US 2011
A bad checkpoint
LOG: checkpoint complete: wrote 127961 buffers (12.2%); 0 transaction log file(s)added, 1818 removed,
0 recycled; write=80.190 s, sync=359.823 s, total=520.913 s
Greg Smith - © 2ndQuadrant US 2011
A funding checkpoint
LOG: checkpoint complete: wrote 141563 buffers (13.5%); 0 transaction log file(s) added,
1109 removed, 257 recycled; write=944.601 s, sync=10635.130 s, total=11613.953 s
Greg Smith - © 2ndQuadrant US 2011
Types of writes



Checkpoint write: most efficient
Background writer write: still good
Backend write, fsync



Fine if aborbed by background writer
Write will be cached by OS later
Backend write, BGW queue filled



backend does fsync itself
Very bad, multi-hour checkpoints possible
Improved in 9.1
Greg Smith - © 2ndQuadrant US 2011
bgwriter monitoring
$ psql ­x ­c "select * from pg_stat_bgwriter"
checkpoints_timed | 0
checkpoints_req | 4
buffers_checkpoint | 6
buffers_clean | 0
maxwritten_clean | 0
buffers_backend | 654685
buffers_backend_fsync | 84
buffers_alloc | 1225
Greg Smith - © 2ndQuadrant US 2011
Time analysis
$ psql ­c "select now(),* from pg_stat_bgwriter"






Sample two points
Buffers are 8K each (normally)
Compute time delta, value delta
Buffers allocated: read MB/s
Sum of buffers written: write MB/s
Compute or graph

Munin has an example
Greg Smith - © 2ndQuadrant US 2011
bgwriter trends
Greg Smith - © 2ndQuadrant US 2011
Cache refill
Greg Smith - © 2ndQuadrant US 2011
Linux tuning



ext3 on old kernels does blocky fsync
dirty_ratio lowers write cache size in %
Kernel 2.6.29 is finer grained


dirty_bytes sets exact amount of RAM
Cannot go too far


OS write caching is expected
VACUUM slows a lot: 50% drop possible
Greg Smith - © 2ndQuadrant US 2011
VACUUM
•
•
•
Cleans up after UPDATE and DELETE
The hidden cost of MVCC
Must happen eventually
–
Frozen ID cleanup
Greg Smith - © 2ndQuadrant US 2011
Autovacuum
•
•
•
•
Cleans up after dead rows
Also updates database stats
Large tables: 20% change required
autovacuum_vacuum_scale_factor=20
Greg Smith - © 2ndQuadrant US 2011
VACUUM Overhead
•
•
Intensive when it happens
Focus on smoothing and scheduling
•
Dead rows add invisible overhead
–
Putting it off makes it worse
–
–
Table “bloat” can be very large
Thresholds can be per-table
Greg Smith - © 2ndQuadrant US 2011
Index Bloating

Indexes can become less efficient after deletes

VACUUM FULL before 9.0 makes this worse

REINDEX helps, but it locks the table

CREATE INDEX can run CONCURRENTLY
–
–

Rename: simulate REINDEX CONCURRENTLY
All transactions must end to finish
CLUSTER does a full table rebuild
–
–
Same “fresh” performance as after dump/reload
Full table lock to do it
Greg Smith - © 2ndQuadrant US 2011
VACUUM Gone Wrong
•
Aim at a target peak performance
•
VACUUM isn't accounted for
•
Just survive peak load?
–
You won't survive VACUUM
Greg Smith - © 2ndQuadrant US 2011
VACUUM monitoring
•
Watch pg_stat_user_tables timestamps
•
Beware long-running transactions
•
log_autovacuum_min_duration
•
Sizes of tables/indexes critical too
Greg Smith - © 2ndQuadrant US 2011
Improving efficiency




maintenance_work_mem: up to 2GB
shared_buffers & checkpoint_segments
(again)
Hardware write caches
Tune read-ahead
Greg Smith - © 2ndQuadrant US 2011
VACUUM Cost Limits
vacuum_cost_page_hit = 1
vacuum_cost_page_miss = 10
vacuum_cost_page_dirty = 20
vacuum_cost_limit = 200
autovacuum_vacuum_cost_delay = 20ms
Greg Smith - © 2ndQuadrant US 2011
autovacuum Cost Basics

Every 20 ms = 50 runs/second

Each run accumulates 200 cost units

200 * 50 = 10000 cost / second
Greg Smith - © 2ndQuadrant US 2011
Costs and Disk I/O
20ms = 10000 cost/second

All misses @ 10 cost?




–
–
10000 / 10 = 1000 reads/second
1000*8192/(1024*1024)=7.8MB/s read
–
–
10000 / 20 = 500 writes/second
500*8192/(1024*1024)=3.9 MB/s write
–
Doubles the rate: 17.2 MB/s / 7.8 MB/s
–
Halves the rate: 3.9 MB/s / 1.95 MB/s
All dirty @ 20 cost?
Halve the delay to 10ms?
Double the delay to 40ms?
Greg Smith - © 2ndQuadrant US 2011
Submission for 9.2

“Displaying accumulated autovacuum
cost”

In November CommitFest

Easily applies to older versions
–
–

Not very risky to production
Just adds some logging
Useful for learning how to tune costs
Greg Smith - © 2ndQuadrant US 2011
Sample logging output
LOG: automatic vacuum of table "pgbench.public.pgbench_accounts": index scans: 1
pages: 0 removed, 163935 remain
tuples: 2000000 removed, 2928356 remain
buffer usage: 117393 hits, 123351 misses, 102684 dirtied, 2.168 MiB/s write rate
system usage: CPU 2.54s/6.27u sec elapsed 369.99 sec Greg Smith - © 2ndQuadrant US 2011
Common tricks

Manual VACUUM during slower periods
–
–

•
Break down by table size
Alternate fast/slow configurations
–
–

Make sure to set vacuum_cost_delay
Start with daily
Two postgresql.conf files, or edit script
Swap/change using cron or pgAgent
Aggressive freezing
Greg Smith - © 2ndQuadrant US 2011
Write to disk, slow way

Data page change to pg_xlog WAL

Checkpoint pushes page to disk

Hint bits update page for faster visibility

Autovacuum marks free space

Freeze old transaction IDs
Greg Smith - © 2ndQuadrant US 2011
Manually maintained path

Data page change to pg_xlog WAL

Checkpoint pushes page to disk

Manually freeze old transaction Ids
–
Tweak vacuum_freeze_min_age and/or
vacuum_freeze_table_age
Greg Smith - © 2ndQuadrant US 2011
Hardware for fast writes
•
•
Log checkpoints to catch spikes
Battery-backed write cache
–
–
Fast commits
Beware volatile write caches
http://wiki.postgresql.org/wiki/Reliable_Writes
Greg Smith - © 2ndQuadrant US 2011
Hard Drive Latency
Type
Latency-ms Transactions/Sec
5400 RPM
11.1
90
7200 RPM
8.3
120
6.0
167
Greg Smith - © 2ndQuadrant US 2011
10K RPM
Latency driving TPS
Greg Smith - © 2ndQuadrant US 2011
Partitioning





Time-series data splits most easily
Monthly partitions typical
Setup is manual and requires some code
Queries can only exclude partitions
Old partitions don't need vacuum


Indexes are smaller


Once frozen, they're ignored
Less used indexes fade from cache
Oldest data can be truncated

No deletion VACUUM cleanup!
Greg Smith - © 2ndQuadrant US 2011
Skytools



Proven to handle write scaling
Database access wrapped in functions
PL/Proxy routes to appropriate notes



Replication used for shared data
Rebalancing is tricky


Any, all, etc.
Pause feature in pgbouncer helps
Hard to retrofit to existing system
Greg Smith - © 2ndQuadrant US 2011
Special thanks
Some monitoring samples provided by:
Track, measure, and improve your fitness
Clients for Android and iPhone
http://runkeeper.com/
Greg Smith - © 2ndQuadrant US 2011
PostgreSQL Books
http://www.2ndQuadrant.com/books/
Greg Smith - © 2ndQuadrant US 2011
Questions
Slides at 2ndQuadrant.com
Resources / Talks
Greg Smith - © 2ndQuadrant US 2011