High-Volume Writes with PostgreSQL
Greg Smith - © 2ndQuadrant US 2011

Major parameters to set
• shared_buffers: 512MB to 8GB
• checkpoint_segments: 16 to 256
• effective_cache_size: typically ¾ of RAM
• wal_buffers: typically 16MB
  – Auto-tuned in 9.1

Checkpoints
• Dirty data in the buffer cache must be flushed
• WAL segments are 16MB
• Requested checkpoint: triggered after checkpoint_segments of writes
• Timed checkpoint: triggered by checkpoint_timeout (5 minute default)

Checkpoint spikes
• 8.3 added Spread Checkpoints
• Aims to finish writing 50% of the way to the next checkpoint
• fsync flush to disk at end of checkpoint
• Optimal behavior: OS wrote data out before the fsync call
• Spreading sync out didn't work usefully
• Spikes still happen

A bad checkpoint
    LOG: checkpoint complete: wrote 127961 buffers (12.2%); 0 transaction log file(s) added, 1818 removed, 0 recycled; write=80.190 s, sync=359.823 s, total=520.913 s

A funding checkpoint
    LOG: checkpoint complete: wrote 141563 buffers (13.5%); 0 transaction log file(s) added, 1109 removed, 257 recycled; write=944.601 s, sync=10635.130 s, total=11613.953 s

Types of writes
• Checkpoint write: most efficient
• Background writer write: still good
• Backend write, fsync absorbed
  – Fine if absorbed by the background writer
  – Write will be cached by the OS and flushed later
• Backend write, BGW queue filled
  – Backend does the fsync itself
  – Very bad, multi-hour checkpoints possible
  – Improved in 9.1

bgwriter monitoring
    $ psql -x -c "select * from pg_stat_bgwriter"
    checkpoints_timed     | 0
    checkpoints_req       | 4
    buffers_checkpoint    | 6
    buffers_clean         | 0
    maxwritten_clean      | 0
    buffers_backend       | 654685
    buffers_backend_fsync | 84
    buffers_alloc         | 1225

Time analysis
    $ psql -c "select now(),* from pg_stat_bgwriter"
• Sample two points
• Buffers are 8K each (normally)
• Compute time delta, value delta
• Buffers allocated: read MB/s
• Sum of buffers written: write MB/s
• Compute or graph
• Munin has an example

bgwriter trends (graph)

Cache refill (graph)

Linux tuning
• ext3 on old kernels does blocky fsync
• dirty_ratio lowers write cache size in %
• Kernel 2.6.29 is finer grained
  – dirty_bytes sets exact amount of RAM
• Cannot go too far
  – OS write caching is expected
  – VACUUM slows a lot: 50% drop possible

VACUUM
• Cleans up after UPDATE and DELETE
• The hidden cost of MVCC
• Must happen eventually
  – Frozen ID cleanup

Autovacuum
• Cleans up dead rows
• Also updates database stats
• Large tables: 20% change required
• autovacuum_vacuum_scale_factor=0.2

VACUUM Overhead
• Intensive when it happens
• Focus on smoothing and scheduling
• Dead rows add invisible overhead
  – Putting it off makes it worse
  – Table “bloat” can be very large
  – Thresholds can be per-table

Index Bloating
• Indexes can become less efficient after deletes
• VACUUM FULL before 9.0 makes this worse
• REINDEX helps, but it locks the table
• CREATE INDEX can run CONCURRENTLY
  – Rename: simulate REINDEX CONCURRENTLY
  – All transactions must end to finish
• CLUSTER does a full table rebuild
  – Same “fresh” performance as after dump/reload
  – Full table lock to do it

VACUUM Gone Wrong
• Aim at a target peak performance
• VACUUM isn't accounted for
• Just survive peak load?
  – You won't survive VACUUM

VACUUM monitoring
• Watch pg_stat_user_tables timestamps
• Beware long-running transactions
• log_autovacuum_min_duration
• Sizes of tables/indexes critical too

Improving efficiency
• maintenance_work_mem: up to 2GB
• shared_buffers & checkpoint_segments (again)
• Hardware write caches
• Tune read-ahead

VACUUM Cost Limits
    vacuum_cost_page_hit = 1
    vacuum_cost_page_miss = 10
    vacuum_cost_page_dirty = 20
    vacuum_cost_limit = 200
    autovacuum_vacuum_cost_delay = 20ms

autovacuum Cost Basics
• Every 20 ms = 50 runs/second
• Each run accumulates 200 cost units
• 200 * 50 = 10000 cost / second

Costs and Disk I/O
• 20ms = 10000 cost/second
• All misses @ 10 cost?
  – 10000 / 10 = 1000 reads/second
  – 1000*8192/(1024*1024) = 7.8 MB/s read
• All dirty @ 20 cost?
  – 10000 / 20 = 500 writes/second
  – 500*8192/(1024*1024) = 3.9 MB/s write
• Halve the delay to 10ms?
  – Doubles the rate: 15.6 MB/s read / 7.8 MB/s write
• Double the delay to 40ms?
  – Halves the rate: 3.9 MB/s read / 1.95 MB/s write
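
The cost arithmetic on the last two slides can be checked with a short Python sketch. It assumes every page is charged the same cost and that the work between sleeps takes no time, so the result is an upper bound on throughput rather than a measurement. The function name autovacuum_rate_mib is made up for illustration; the defaults are the values from the "VACUUM Cost Limits" slide.

```python
PAGE_SIZE = 8192  # bytes per buffer, the usual PostgreSQL block size


def autovacuum_rate_mib(cost_limit=200, cost_delay_ms=20, cost_per_page=10):
    """Upper bound on autovacuum throughput in MiB/s.

    Assumes every page is charged cost_per_page and that the work done
    between sleeps takes no time, so only the delay limits the rate.
    """
    runs_per_second = 1000 / cost_delay_ms          # 20 ms -> 50 runs/second
    cost_per_second = cost_limit * runs_per_second  # 200 * 50 -> 10000
    pages_per_second = cost_per_second / cost_per_page
    return pages_per_second * PAGE_SIZE / (1024 * 1024)


if __name__ == "__main__":
    for delay in (10, 20, 40):
        read = autovacuum_rate_mib(cost_delay_ms=delay, cost_per_page=10)
        write = autovacuum_rate_mib(cost_delay_ms=delay, cost_per_page=20)
        print(f"delay {delay} ms: {read:.2f} MiB/s read, {write:.2f} MiB/s write")
```

With the 20 ms default this gives about 7.8 MiB/s for all-miss reads and 3.9 MiB/s for all-dirty writes; halving the delay doubles both figures and doubling it halves them.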

Submission for 9.2
• “Displaying accumulated autovacuum cost”
• In November CommitFest
• Easily applies to older versions
  – Not very risky to production
  – Just adds some logging
• Useful for learning how to tune costs

Sample logging output
    LOG: automatic vacuum of table "pgbench.public.pgbench_accounts": index scans: 1
    pages: 0 removed, 163935 remain
    tuples: 2000000 removed, 2928356 remain
    buffer usage: 117393 hits, 123351 misses, 102684 dirtied, 2.168 MiB/s write rate
    system usage: CPU 2.54s/6.27u sec elapsed 369.99 sec

Common tricks
• Manual VACUUM during slower periods
  – Make sure to set vacuum_cost_delay
  – Start with daily
  – Break down by table size
• Alternate fast/slow configurations
  – Two postgresql.conf files, or edit script
  – Swap/change using cron or pgAgent
• Aggressive freezing

Write to disk, slow way
• Data page change to pg_xlog WAL
• Checkpoint pushes page to disk
• Hint bits update page for faster visibility
• Autovacuum marks free space
• Freeze old transaction IDs

Manually maintained path
• Data page change to pg_xlog WAL
• Checkpoint pushes page to disk
• Manually freeze old transaction IDs
  – Tweak vacuum_freeze_min_age and/or vacuum_freeze_table_age

Hardware for fast writes
• Log checkpoints to catch spikes
• Battery-backed write cache
  – Fast commits
  – Beware volatile write caches
http://wiki.postgresql.org/wiki/Reliable_Writes

Hard Drive Latency
    Type        Latency (ms)   Transactions/sec
    5400 RPM    11.1           90
    7200 RPM    8.3            120
    10K RPM     6.0            167

Latency driving TPS (graph)

Partitioning
• Time-series data splits most easily
• Monthly partitions typical
• Setup is manual and requires some code
• Queries can only exclude partitions
• Old partitions don't need vacuum
  – Once frozen, they're ignored
• Indexes are smaller
  – Less used indexes fade from cache
• Oldest data can be truncated
  – No deletion VACUUM cleanup!

Skytools
• Proven to handle write scaling
• Database access wrapped in functions
• PL/Proxy routes to appropriate nodes
  – Any, all, etc.
• Replication used for shared data
• Rebalancing is tricky
  – Pause feature in pgbouncer helps
• Hard to retrofit to existing system

Special thanks
Some monitoring samples provided by:
Track, measure, and improve your fitness
Clients for Android and iPhone
http://runkeeper.com/

PostgreSQL Books
http://www.2ndQuadrant.com/books/

Questions
Slides at 2ndQuadrant.com: Resources / Talks
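
The "Time analysis" procedure earlier in the deck (sample pg_stat_bgwriter twice, then divide the value deltas by the time delta) can be sketched in Python as below. This is only an illustration: the function name bgwriter_rates and the two sample dictionaries are hypothetical, and the 8192-byte block size is the usual default rather than a guarantee. Following the slide, buffers_alloc is treated as an approximation of read traffic, and writes are the sum of the checkpoint, background writer, and backend counters.

```python
from datetime import datetime, timedelta

BLOCK_SIZE = 8192  # bytes per buffer; assumed default, check your build


def bgwriter_rates(first, second, block_size=BLOCK_SIZE):
    """Turn two pg_stat_bgwriter samples into approximate MiB/s figures.

    Each sample is a dict holding the "now" timestamp plus the counter
    columns, as returned by: select now(),* from pg_stat_bgwriter
    """
    elapsed = (second["now"] - first["now"]).total_seconds()
    if elapsed <= 0:
        raise ValueError("second sample must be later than the first")

    def delta(column):
        return second[column] - first[column]

    def to_mib_per_second(buffers):
        return buffers * block_size / (1024 * 1024) / elapsed

    written = (delta("buffers_checkpoint")
               + delta("buffers_clean")
               + delta("buffers_backend"))
    return {
        "read_mib_s": to_mib_per_second(delta("buffers_alloc")),
        "write_mib_s": to_mib_per_second(written),
    }


# Hypothetical samples taken 60 seconds apart (made-up numbers).
start = datetime(2011, 1, 1, 12, 0, 0)
sample_a = {"now": start, "buffers_checkpoint": 0, "buffers_clean": 0,
            "buffers_backend": 0, "buffers_alloc": 0}
sample_b = {"now": start + timedelta(seconds=60), "buffers_checkpoint": 3840,
            "buffers_clean": 1920, "buffers_backend": 1920,
            "buffers_alloc": 7680}

if __name__ == "__main__":
    print(bgwriter_rates(sample_a, sample_b))
```

In practice the two dictionaries would come from running the psql query twice, or from a monitoring tool that stores the samples, such as the Munin example the slide mentions.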