Key points of "A brief analysis and optimization of the PostgreSQL 9.6 IO hang problem":
This article gives a brief analysis of the PostgreSQL 9.6 IO hang problem and how to optimize it; we hope you find it useful. If you have questions, feel free to contact us.
Background
A PostgreSQL checkpoint marks the dirty pages in shared buffers and then flushes them all to disk (fsync); during the write-out phase there may be scheduling to reduce the IO impact when there are many dirty pages.
Outside of checkpoints, the bgwriter process normally writes dirty pages out with buffered IO (write), turning them into OS dirty pages.
If shared buffers are very large and the workload produces dirty pages at a high rate, the performance impact of checkpoints becomes very noticeable.
For example, with 100 GB of shared buffers and 100 GB of active data that is constantly being updated (producing dirty pages), performance can drop sharply during the fsync part of a checkpoint.
Symptoms
Let us reproduce the problem.
Start 100 PG instances on one machine, each limited in memory, CPU, and IO; the log (WAL) disk is limited to 4000 IOPS and the data disk to 800 IOPS.
Stress-test method
Each instance holds at most 100 million rows, and the data is hit with random UPSERTs.
echo "\set id random(1,100000000)" > ~/test$i.sql
echo "insert into test (id,info,crt_time) values (:id, md5(random()::text), now()) on conflict on constraint test_pkey do update set info=excluded.info, crt_time=excluded.crt_time;" >> ~/test$i.sql
So the whole table is a hot spot.
Each instance is stressed with 4 concurrent connections.
Test case reference
20160927_01.md
Because the tests start at the same time, every node enters the checkpoint at almost the same moment,
producing a large amount of dirty/writeback memory.
This can be observed as follows:
while(true) ; do cat /proc/meminfo |grep -E "Dirty|Writeback"; sleep 0.5; done
Dirty: 24752872 kB
Writeback: 11312408 kB
WritebackTmp: 0 kB
Explanation
Dirty — The total amount of memory, in kilobytes, waiting to be written back to the disk.
Writeback — The total amount of memory, in kilobytes, actively being written back to the disk.
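As a small self-contained alternative to the shell loop above, the following C sketch (not from the original article; the 0.5 s interval just mirrors the shell loop) polls the same two counters:
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
    for (;;)
    {
        FILE *f = fopen("/proc/meminfo", "r");
        char  line[256];

        if (f == NULL)
            return 1;
        while (fgets(line, sizeof(line), f) != NULL)
        {
            /* print only the dirty-page and writeback counters */
            if (strncmp(line, "Dirty:", 6) == 0 ||
                strncmp(line, "Writeback:", 10) == 0)
                fputs(line, stdout);
        }
        fclose(f);
        usleep(500 * 1000);     /* sample every 0.5 s, like the shell loop */
    }
}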
Once a large amount of writeback memory has accumulated, the dirty pages are still not fully on disk when the checkpoint finally calls fsync, so during the fsync phase the instance has to spend its own IOPS budget flushing them, which is very slow.
The instance may even become completely unusable.
Observed symptoms
Whole-machine database IO is very low (only data-disk IO, and that is capped by the cgroup).
TPS drops to 0 (updates are blocked; no free blocks left in shared buffers?).
progress: 1321.0 s, 0.0 tps, lat -nan ms stddev -nan
progress: 1322.0 s, 0.0 tps, lat -nan ms stddev -nan
progress: 1323.0 s, 0.0 tps, lat -nan ms stddev -nan
progress: 1324.0 s, 0.0 tps, lat -nan ms stddev -nan
progress: 1325.0 s, 0.0 tps, lat -nan ms stddev -nan
progress: 1326.0 s, 0.0 tps, lat -nan ms stddev -nan
progress: 1327.0 s, 0.0 tps, lat -nan ms stddev -nan
progress: 1328.0 s, 0.0 tps, lat -nan ms stddev -nan
progress: 1329.0 s, 0.0 tps, lat -nan ms stddev -nan
progress: 1330.0 s, 0.0 tps, lat -nan ms stddev -nan
The instance does not recover until all of its writeback has been flushed to disk.
Process states during this period:
  PID USER     PR NI  VIRT  RES  SHR S %CPU %MEM   TIME+ COMMAND
49799 digoal   20  0 1300m 155m 155m S  0.0  0.0 0:00.59 postgres -B 1GB -c port=1922 -c listen_addresses=0.0.0.0 -c synchronous_commit=on -c full_page_writes=on -c wal_buffers=128MB -c wal_writer_flush_after=0 -c bgwriter_delay=10ms
49844 digoal   20  0 1300m 129m 128m S  0.0  0.0 0:09.01 postgres: wal writer process
49845 digoal   20  0 1300m 1952 1224 S  0.0  0.0 0:05.71 postgres: autovacuum launcher process
49838 digoal   20  0  113m  892  460 S  0.0  0.0 0:00.03 postgres: logger process
16531 digoal   20  0 1300m 1.1g 1.1g D  0.0  0.2 1:22.71 postgres: postgres postgres 127.0.0.1(49777) INSERT
16534 digoal   20  0 1300m 1.1g 1.1g D  0.0  0.2 1:22.32 postgres: postgres postgres 127.0.0.1(49778) INSERT
16535 digoal   20  0 1300m 1.1g 1.1g D  0.0  0.2 1:22.73 postgres: postgres postgres 127.0.0.1(49780) INSERT
16537 digoal   20  0 1300m 1.1g 1.1g D  0.0  0.2 1:22.43 postgres: postgres postgres 127.0.0.1(49781) INSERT
49842 digoal   20  0 1301m 1.0g 1.0g D  0.0  0.2 0:23.70 postgres: checkpointer process
49846 digoal   20  0  115m 1048  552 D  0.0  0.0 0:12.83 postgres: stats collector process
49843 digoal   20  0 1300m 978m 977m D  0.0  0.2 0:46.35 postgres: writer process
State legend
w: S -- Process Status
The status of the task which can be one of:
’D’ = uninterruptible sleep
’R’ = running
’S’ = sleeping
’T’ = traced or stopped
’Z’ = zombie
Process stack traces
checkpointer process
cat /proc/49842/stack
[<ffffffff81121281>] generic_file_aio_write+0x71/0x100
[<ffffffffa00c0463>] ext4_file_write+0x43/0xe0 [ext4]
[<ffffffff8118863a>] do_sync_write+0xfa/0x140
[<ffffffff81188938>] vfs_write+0xb8/0x1a0
[<ffffffff81189231>] sys_write+0x51/0x90
[<ffffffff8100c072>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
stats collector process
cat /proc/49846/stack
[<ffffffffa00a708a>] start_this_handle+0x25a/0x480 [jbd2]
[<ffffffffa00a7495>] jbd2_journal_start+0xb5/0x100 [jbd2]
[<ffffffffa00e4b24>] ext4_journal_start_sb+0x74/0x140 [ext4]
[<ffffffffa00d20ba>] ext4_create+0x7a/0x150 [ext4]
[<ffffffff811972c4>] vfs_create+0xb4/0xe0
[<ffffffff8119ad90>] do_filp_open+0xb10/0xdd0
[<ffffffff81185829>] do_sys_open+0x69/0x140
[<ffffffff81185940>] sys_open+0x20/0x30
[<ffffffff8100c072>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
bgwriter process
cat /proc/49843/stack
[<ffffffffa00a708a>] start_this_handle+0x25a/0x480 [jbd2]
[<ffffffffa00a7495>] jbd2_journal_start+0xb5/0x100 [jbd2]
[<ffffffffa00e4b24>] ext4_journal_start_sb+0x74/0x140 [ext4]
[<ffffffffa00c896a>] ext4_dirty_inode+0x2a/0x60 [ext4]
[<ffffffff811b461b>] __mark_inode_dirty+0x3b/0x160
[<ffffffff811a3e12>] file_update_time+0xf2/0x170
[<ffffffff81120fb0>] __generic_file_aio_write+0x230/0x490
[<ffffffff81121298>] generic_file_aio_write+0x88/0x100
[<ffffffffa00c0463>] ext4_file_write+0x43/0xe0 [ext4]
[<ffffffff8118863a>] do_sync_write+0xfa/0x140
[<ffffffff81188938>] vfs_write+0xb8/0x1a0
[<ffffffff81189231>] sys_write+0x51/0x90
[<ffffffff8100c072>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
backend process
cat /proc/16537/stack
[<ffffffffa00bfff0>] ext4_llseek+0x60/0x110 [ext4]
[<ffffffff81186eda>] vfs_llseek+0x3a/0x40
[<ffffffff81188b96>] sys_lseek+0x66/0x80
[<ffffffff8100c072>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
logger process
cat /proc/49838/stack
[<ffffffffa00a708a>] start_this_handle+0x25a/0x480 [jbd2]
[<ffffffffa00a7495>] jbd2_journal_start+0xb5/0x100 [jbd2]
[<ffffffffa00e4b24>] ext4_journal_start_sb+0x74/0x140 [ext4]
[<ffffffffa00c896a>] ext4_dirty_inode+0x2a/0x60 [ext4]
[<ffffffff811b461b>] __mark_inode_dirty+0x3b/0x160
[<ffffffff811a3e12>] file_update_time+0xf2/0x170
[<ffffffff81120fb0>] __generic_file_aio_write+0x230/0x490
[<ffffffff81121298>] generic_file_aio_write+0x88/0x100
[<ffffffffa00c0463>] ext4_file_write+0x43/0xe0 [ext4]
[<ffffffff8118863a>] do_sync_write+0xfa/0x140
[<ffffffff81188938>] vfs_write+0xb8/0x1a0
[<ffffffff81189231>] sys_write+0x51/0x90
[<ffffffff8100c072>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
WAL writer process
cat /proc/49844/stack
[<ffffffff811d0bfd>] ep_poll+0x2ad/0x330
[<ffffffff811d0d45>] sys_epoll_wait+0xc5/0xe0
[<ffffffff8100c072>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
The filesystems are mounted with data=writeback:
/dev/mapper/vgdata01-lv01 on /u01 type ext4 (rw,noatime,nodiratime,nodelalloc,barrier=0,data=writeback)
/dev/mapper/vgdata01-lv02 on /u02 type ext4 (rw,noatime,nodiratime,nodelalloc,barrier=0,data=writeback)
Cause analysis
The PostgreSQL 9.6 checkpoint was improved as follows:
1. Phase 1 (write + checkpoint scheduling)
2. Phase 2 (sync_file_range)
In fact, this can also be mitigated by tuning the OS writeback scheduling, for example:
vm.dirty_background_ratio = 0
vm.dirty_background_bytes = 102400000
vm.dirty_ratio = 95
vm.dirty_bytes = 0
vm.dirty_writeback_centisecs = 100
vm.dirty_expire_centisecs = 3000
3. Phase 3 (fsync)
Analysis
1. Start from the checkpoint source code.
/*
* CheckPointBuffers
*
* Flush all dirty blocks in buffer pool to disk at checkpoint time.
*
* Note: temporary relations do not participate in checkpoints, so they don't
* need to be flushed.
 */
void
CheckPointBuffers(int flags)
{
TRACE_POSTGRESQL_BUFFER_CHECKPOINT_START(flags);
CheckpointStats.ckpt_write_t = GetCurrentTimestamp();
BufferSync(flags);
CheckpointStats.ckpt_sync_t = GetCurrentTimestamp();
TRACE_POSTGRESQL_BUFFER_CHECKPOINT_SYNC_START();
smgrsync();
CheckpointStats.ckpt_sync_end_t = GetCurrentTimestamp();
TRACE_POSTGRESQL_BUFFER_CHECKPOINT_DONE();
}
Phase 1 (write + checkpoint scheduling)
2. Call BufferSync.
/*
* BufferSync -- Write out all dirty buffers in the pool.
*
* This is called at checkpoint time to write out all dirty shared buffers.
* The checkpoint request flags should be passed in. If CHECKPOINT_IMMEDIATE
* is set, we disable delays between writes; if CHECKPOINT_IS_SHUTDOWN,
* CHECKPOINT_END_OF_RECOVERY or CHECKPOINT_FLUSH_ALL is set, we write even
* unlogged buffers, which are otherwise skipped. The remaining flags
* currently have no effect here.
 */
static void
BufferSync(int flags)
{
.....
WritebackContextInit(&wb_context, &checkpoint_flush_after);
.....
/*
* Iterate through to-be-checkpointed buffers and write the ones (still)
* marked with BM_CHECKPOINT_NEEDED. The writes are balanced between
* tablespaces; otherwise the sorting would lead to only one tablespace
* receiving writes at a time, making inefficient use of the hardware.
*/
num_processed = 0;
num_written = 0;
while (!binaryheap_empty(ts_heap))
{
......
if (pg_atomic_read_u32(&bufHdr->state) & BM_CHECKPOINT_NEEDED)
{
// Call write(), producing OS dirty pages, and record each page in the writeback context wb_context.
if (SyncOneBuffer(buf_id, false, &wb_context) & BUF_WRITTEN)
{
TRACE_POSTGRESQL_BUFFER_SYNC_WRITTEN(buf_id);
BgWriterStats.m_buf_written_checkpoints++;
num_written++;
}
}
.......
/*
* Sleep to throttle our I/O rate.
*/
// Checkpoint scheduling happens here, controlled by the GUC variable checkpoint_completion_target.
// Not expanded here; see src/backend/postmaster/checkpointer.c for details.
// Note that this schedules only the writes, not fsync.
CheckpointWriteDelay(flags, (double) num_processed / num_to_scan);
.....
}
.....
// Tell the OS kernel to start writing the dirty pages out to disk (asynchronously).
/* issue all pending flushes */
IssuePendingWritebacks(&wb_context);
.....
3. Call SyncOneBuffer.
...
FlushBuffer(bufHdr, NULL);
...
ScheduleBufferTagForWriteback(wb_context, &tag);
...
4. Call FlushBuffer.
...
/*
* bufToWrite is either the shared buffer or a copy, as appropriate.
*/
smgrwrite(reln,
buf->tag.forkNum,
buf->tag.blockNum,
bufToWrite,
false);
...
5. Call mdwrite.
nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ);
6. Call FileWrite.
returnCode = write(VfdCache[file].fd, buffer, amount);
Calling write produces OS dirty pages.
7. Call ScheduleBufferTagForWriteback.
/*
* Perform pending flushes if the writeback limit is exceeded. This
* includes the case where previously an item has been added, but control
* is now disabled.
*/
if (context->nr_pending >= *context->max_pending)
IssuePendingWritebacks(context);
8. Call IssuePendingWritebacks.
Its role is described in phase 2 below.
Phase 2 (sync_file_range)
9. Call IssuePendingWritebacks.
/*
* Issue all pending writeback requests, previously scheduled with
* ScheduleBufferTagForWriteback, to the OS.
*
* Because this is only used to improve the OSs IO scheduling we try to never
* error out - it's just a hint.
 */
void
IssuePendingWritebacks(WritebackContext *context)
{
int i;
if (context->nr_pending == 0)
return;
/*
* Executing the writes in-order can make them a lot faster, and allows to
* merge writeback requests to consecutive blocks into larger writebacks.
*/
// Sort the dirty pages to reduce random IO at fsync time.
qsort(&context->pending_writebacks, context->nr_pending,
sizeof(PendingWriteback), buffertag_comparator);
/*
* Coalesce neighbouring writes, but nothing else. For that we iterate
* through the, now sorted, array of pending flushes, and look forward to
* find all neighbouring (or identical) writes.
*/
for (i = 0; i < context->nr_pending; i++)
{
PendingWriteback *cur;
PendingWriteback *next;
SMgrRelation reln;
int ahead;
BufferTag tag;
Size nblocks = 1;
cur = &context->pending_writebacks[i];
tag = cur->tag;
/*
* Peek ahead, into following writeback requests, to see if they can
* be combined with the current one.
*/
// Merge consecutive blocks to reduce the number of IO requests. (On XFS, sync_file_range already does this automatically.)
for (ahead = 0; i + ahead + 1 < context->nr_pending; ahead++)
{
next = &context->pending_writebacks[i + ahead + 1];
/* different file, stop */
if (!RelFileNodeEquals(cur->tag.rnode, next->tag.rnode) ||
cur->tag.forkNum != next->tag.forkNum)
break;
/* ok, block queued twice, skip */
if (cur->tag.blockNum == next->tag.blockNum)
continue;
/* only merge consecutive writes */
if (cur->tag.blockNum + 1 != next->tag.blockNum)
break;
nblocks++;
cur = next;
}
i += ahead;
/* and finally tell the kernel to write the data to storage */
reln = smgropen(tag.rnode, InvalidBackendId);
// Tell the OS kernel to prepare to flush these dirty pages; one range covers the pages merged above.
smgrwriteback(reln, tag.forkNum, tag.blockNum, nblocks);
}
context->nr_pending = 0;
}
......
10. Call smgrwriteback.
src/backend/storage/smgr/md.c
/*
* mdwriteback() -- Tell the kernel to write pages back to storage.
*
* This accepts a range of blocks because flushing several pages at once is
* considerably more efficient than doing so individually.
 */
void
mdwriteback(SMgrRelation reln, ForkNumber forknum,
            BlockNumber blocknum, BlockNumber nblocks)
{
/*
* Issue flush requests in as few requests as possible; have to split at
* segment boundaries though, since those are actually separate files.
*/
while (nblocks > 0)
{
BlockNumber nflush = nblocks;
off_t seekpos;
MdfdVec *v;
int segnum_start,
segnum_end;
v = _mdfd_getseg(reln, forknum, blocknum, true /* not used */ ,
EXTENSION_RETURN_NULL);
/*
* We might be flushing buffers of already removed relations, that's
* ok, just ignore that case.
*/
if (!v)
return;
/* compute offset inside the current segment */
segnum_start = blocknum / RELSEG_SIZE;
/* compute number of desired writes within the current segment */
segnum_end = (blocknum + nblocks - 1) / RELSEG_SIZE;
if (segnum_start != segnum_end)
nflush = RELSEG_SIZE - (blocknum % ((BlockNumber) RELSEG_SIZE));
Assert(nflush >= 1);
Assert(nflush <= nblocks);
seekpos = (off_t) BLCKSZ *(blocknum % ((BlockNumber) RELSEG_SIZE));
// Call FileWriteback.
FileWriteback(v->mdfd_vfd, seekpos, (off_t) BLCKSZ * nflush);
nblocks -= nflush;
blocknum += nflush;
}
}
11. Call FileWriteback.
void
FileWriteback(File file, off_t offset, off_t nbytes)
{
int returnCode;
Assert(FileIsValid(file));
DO_DB(elog(LOG, "FileWriteback: %d (%s) " INT64_FORMAT " " INT64_FORMAT,
file, VfdCache[file].fileName,
(int64) offset, (int64) nbytes));
/*
* Caution: do not call pg_flush_data with nbytes = 0, it could trash the
* file's seek position. We prefer to define that as a no-op here.
*/
if (nbytes <= 0)
return;
returnCode = FileAccess(file);
if (returnCode < 0)
return;
// Call pg_flush_data.
pg_flush_data(VfdCache[file].fd, offset, nbytes);
}
12. Call pg_flush_data.
src/backend/storage/file/fd.c
void
pg_flush_data(int fd, off_t offset, off_t nbytes)
{
...
#if defined(HAVE_SYNC_FILE_RANGE)
{
int rc;
// Note: when there are many dirty pages, even the asynchronous mode of sync_file_range can block.
/*
* sync_file_range(SYNC_FILE_RANGE_WRITE), currently linux specific,
* tells the OS that writeback for the specified blocks should be
* started, but that we don't want to wait for completion. Note that
* this call might block if too much dirty data exists in the range.
* This is the preferable method on OSs supporting it, as it works
* reliably when available (contrast to msync()) and doesn't flush out
* clean data (like FADV_DONTNEED).
*/
// Call sync_file_range.
rc = sync_file_range(fd, offset, nbytes,
SYNC_FILE_RANGE_WRITE);
/* don't error out, this is just a performance optimization */
if (rc != 0)
{
ereport(WARNING,
(errcode_for_file_access(),
errmsg("could not flush dirty data: %m")));
}
return;
}
...
(write has already been called earlier; at this point the OS kernel is told to start flushing the dirty pages to disk.)
Note: when the range covers many dirty pages, even the asynchronous mode of sync_file_range may block.
sync_file_range is called in asynchronous mode:
SYNC_FILE_RANGE_WRITE
Start write-out of all dirty pages in the specified range which are not presently under write-out.
This is an asynchronous flush-to-disk operation.
This is not suitable for data integrity operations.
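To make the three phases concrete, here is a standalone, Linux-specific C sketch (not PostgreSQL source; the file name demo.dat is arbitrary) that performs a buffered write, then the asynchronous sync_file_range hint, then the final fsync:
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
    char buf[8192];
    int  fd = open("demo.dat", O_CREAT | O_WRONLY | O_TRUNC, 0644);

    if (fd < 0)
    {
        perror("open");
        return 1;
    }
    memset(buf, 'x', sizeof(buf));

    /* Phase 1: buffered write - the data only becomes an OS dirty page */
    if (write(fd, buf, sizeof(buf)) != (ssize_t) sizeof(buf))
        perror("write");

    /* Phase 2: hint the kernel to start writing this range back.
     * Asynchronous: it usually returns before the data reaches disk,
     * but it can still block if too much dirty data has piled up. */
    if (sync_file_range(fd, 0, sizeof(buf), SYNC_FILE_RANGE_WRITE) != 0)
        perror("sync_file_range");

    /* Phase 3: fsync waits until everything is durable on disk */
    if (fsync(fd) != 0)
        perror("fsync");

    close(fd);
    return 0;
}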
Destabilizing factors
1. Even after all of the above, the OS has not necessarily flushed every dirty page to disk,
because the asynchronous form of sync_file_range was used.
2. Meanwhile, bgwriter and backend processes may keep writing newly produced dirty pages from shared buffers into the OS page cache,
and those dirty pages may belong to files that the checkpoint is about to fsync.
Phase 3 (fsync)
13. Next, the checkpoint calls smgrsync,
which starts fsync at the file level. What if the files have produced new dirty pages in the meantime? (See the destabilizing-factor analysis above.)
/*
* smgrsync() -- Sync files to disk during checkpoint.
 */
void
smgrsync(void)
{
int i;
for (i = 0; i < NSmgr; i++)
{
if (smgrsw[i].smgr_sync)
(*(smgrsw[i].smgr_sync)) ();
}
}
14. Call mdsync.
/*
* mdsync() -- Sync previous writes to stable storage.
 */
void
mdsync(void)
{
......
/*
* If we are in the checkpointer, the sync had better include all fsync
* requests that were queued by backends up to this point. The tightest
* race condition that could occur is that a buffer that must be written
* and fsync'd for the checkpoint could have been dumped by a backend just
* before it was visited by BufferSync(). We know the backend will have
* queued an fsync request before clearing the buffer's dirtybit, so we
* are safe as long as we do an Absorb after completing BufferSync().
*/
AbsorbFsyncRequests();
.....
/* Now scan the hashtable for fsync requests to process */
absorb_counter = FSYNCS_PER_ABSORB;
hash_seq_init(&hstat, pendingOpsTable);
while ((entry = (PendingOperationEntry *) hash_seq_search(&hstat)) != NULL)
{
.....
/*
* Scan over the forks and segments represented by the entry.
*
* The bitmap manipulations are slightly tricky, because we can call
* AbsorbFsyncRequests() inside the loop and that could result in
* bms_add_member() modifying and even re-palloc'ing the bitmapsets.
* This is okay because we unlink each bitmapset from the hashtable
* entry before scanning it. That means that any incoming fsync
* requests will be processed now if they reach the table before we
* begin to scan their fork.
*/
for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
{
......
/* Attempt to open and fsync the target segment */
seg = _mdfd_getseg(reln, forknum,
(BlockNumber) segno * (BlockNumber) RELSEG_SIZE,
false,
EXTENSION_RETURN_NULL
| EXTENSION_DONT_CHECK_SIZE);
INSTR_TIME_SET_CURRENT(sync_start);
if (seg != NULL &&
// Call FileSync to sync the whole file.
FileSync(seg->mdfd_vfd) >= 0)
{
/* Success; update statistics about sync timing */
INSTR_TIME_SET_CURRENT(sync_end);
sync_diff = sync_end;
INSTR_TIME_SUBTRACT(sync_diff, sync_start);
elapsed = INSTR_TIME_GET_MICROSEC(sync_diff);
if (elapsed > longest)
longest = elapsed;
total_elapsed += elapsed;
processed++;
if (log_checkpoints)
elog(DEBUG1, "checkpoint sync: number=%d file=%s time=%.3f msec",
processed,
FilePathName(seg->mdfd_vfd),
(double) elapsed / 1000);
break; /* out of retry loop */
}
15. Call FileSync to sync the entire file.
int
FileSync(File file)
{
int returnCode;
Assert(FileIsValid(file));
DO_DB(elog(LOG, "FileSync: %d (%s)",
file, VfdCache[file].fileName));
returnCode = FileAccess(file);
if (returnCode < 0)
return returnCode;
// Call pg_fsync.
return pg_fsync(VfdCache[file].fd);
}
16. Call pg_fsync.
/*
* pg_fsync --- do fsync with or without writethrough
 */
int
pg_fsync(int fd)
{
// From the code path, pg_fsync_writethrough is not called on Linux.
/* #if is to skip the sync_method test if there's no need for it */
#if defined(HAVE_FSYNC_WRITETHROUGH) && !defined(FSYNC_WRITETHROUGH_IS_FSYNC)
if (sync_method == SYNC_METHOD_FSYNC_WRITETHROUGH)
return pg_fsync_writethrough(fd);
else
#endif
return pg_fsync_no_writethrough(fd);
}
17. Call pg_fsync_no_writethrough.
/*
* pg_fsync_no_writethrough --- same as fsync except does nothing if
* enableFsync is off
 */
int
pg_fsync_no_writethrough(int fd)
{
if (enableFsync)
return fsync(fd);
else
return 0;
}
18. fsync flushes the data to disk.
Destabilizing factors introduced by the checkpoint
1. Before fsync is called, the OS has not necessarily flushed every dirty page to disk,
because the asynchronous form of sync_file_range was used.
2. Meanwhile, bgwriter and backend processes may keep writing newly produced dirty pages from shared buffers into the OS page cache,
and those dirty pages may belong to files that the checkpoint is about to fsync.
Because of these two destabilizing factors, and because this environment runs many PG instances each capped at a fairly small data-disk IO budget, the flush during fsync becomes extremely slow.
When the REDO (WAL) disk's IO capacity far exceeds the data disk's, many hot dirty pages can be produced again while the checkpoint is running,
so at the final fsync stage the checkpoint still has dirty pages to flush while being throttled by the instance's cgroup limits; the instance then looks as if it has hung.
At which stage does checkpoint scheduling take place?
Only the write stage is scheduled; there is no scheduling at all during sync_file_range or fsync.
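The following is a deliberately simplified, hypothetical sketch of what "scheduling in the write stage" means; the function throttle_checkpoint_writes and its parameters are illustrative, not the real CheckpointWriteDelay. Progress through the buffer list is compared with the progress implied by checkpoint_completion_target, and the checkpointer naps when it is ahead of schedule; no equivalent pacing exists in the sync_file_range or fsync stages.
#include <unistd.h>

/* GUC in the real system; 0.5 is the PostgreSQL default */
static double checkpoint_completion_target = 0.5;

/*
 * buffers_done_fraction: fraction of the to-be-written buffers already written
 * elapsed_fraction:      fraction of the checkpoint interval that has elapsed
 */
static void
throttle_checkpoint_writes(double buffers_done_fraction, double elapsed_fraction)
{
    /* progress we should have reached by now to finish writing at the target */
    double target_progress = elapsed_fraction / checkpoint_completion_target;

    /* ahead of schedule: nap briefly before writing the next buffer */
    if (buffers_done_fraction > target_progress)
        usleep(100 * 1000);
}

int
main(void)
{
    /* halfway through the buffers but only 10% into the interval: sleeps */
    throttle_checkpoint_writes(0.5, 0.1);
    return 0;
}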
How to reduce checkpoint jitter
1. Address destabilizing factor 1: avoid accumulating unflushed dirty pages during the checkpoint.
During a checkpoint, when bgwriter or backend processes write dirty pages out of shared buffers, they use write, i.e. buffered IO.
Once a checkpoint has started, whenever bgwriter or a backend writes such a dirty page out, also record that page's ID in a list (list 1 or list 2).
2. In the checkpoint's final stage, just before calling fsync, insert an extra step:
run sync_file_range on the pages in list 1 (or list 2) and wait for them to reach disk,
using the following flags:
SYNC_FILE_RANGE_WAIT_BEFORE | SYNC_FILE_RANGE_WRITE
Ensures that all pages in the specified range which were dirty when sync_file_range() was called are placed under write-out. This is a start-write-for-data-integrity operation.
or
SYNC_FILE_RANGE_WAIT_BEFORE | SYNC_FILE_RANGE_WRITE | SYNC_FILE_RANGE_WAIT_AFTER
This is a write-for-data-integrity operation that will ensure that all pages in the specified range which were dirty when sync_file_range() was called are committed to disk.
3. To avoid conflicts between the sync_file_range work of bgwriter/backend processes and that of the checkpoint,
use two lists, alternating between them, to record the pages evicted from shared buffers after the checkpoint begins (a toy sketch of this two-list idea appears at the end of this section).
4. Add a GUC variable: when the number of pages remaining in the list for the checkpoint's final sync_file_range pass falls below this value, proceed to the fsync phase.
Users can then set this GUC according to their IOPS provisioning, reducing the number of pages that still have to be waited on at the final fsync.
Note that the value must not be set too small either, or it may lead to many long rounds of sync_file_range over list 1 and list 2.
This requires modifying the PostgreSQL kernel, which is a fairly large change.
5. Address destabilizing factor 2: in the checkpoint's final stage, before calling fsync, make sure all of the file descriptor's dirty pages have already been written out.
The pg_flush_data currently called by the checkpoint uses the asynchronous sync_file_range; we need to change it to a synchronous mode.
It is advisable to change only the checkpoint's call path and leave the existing logic untouched.
void
pg_flush_data(int fd, off_t offset, off_t nbytes)
{
...
#if defined(HAVE_SYNC_FILE_RANGE)
{
int rc;
// Note: when there are many dirty pages, even the asynchronous mode of sync_file_range can block.
/*
* sync_file_range(SYNC_FILE_RANGE_WRITE), currently linux specific,
* tells the OS that writeback for the specified blocks should be
* started, but that we don't want to wait for completion. Note that
* this call might block if too much dirty data exists in the range.
* This is the preferable method on OSs supporting it, as it works
* reliably when available (contrast to msync()) and doesn't flush out
* clean data (like FADV_DONTNEED).
*/
// Call sync_file_range, modified as follows:
rc = sync_file_range(fd, offset, nbytes,
SYNC_FILE_RANGE_WAIT_BEFORE | SYNC_FILE_RANGE_WRITE | SYNC_FILE_RANGE_WAIT_AFTER);
/* don't error out, this is just a performance optimization */
if (rc != 0)
{
ereport(WARNING,
(errcode_for_file_access(),
errmsg("could not flush dirty data: %m")));
}
return;
}
6. Solve the IO hang problem at the operating-system kernel level.
Alibaba Cloud RDS PostgreSQL has already solved this problem at the database kernel level; you are welcome to use it.
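For illustration only, here is a toy sketch of the two-list idea from points 1-4 above. All names (record_evicted_page, drain_evicted_pages, sync_threshold, the fixed-size arrays) are hypothetical, locking and error handling are omitted, and this is not an actual PostgreSQL patch.
#define _GNU_SOURCE
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

typedef struct EvictedPage
{
    int    fd;        /* file containing the evicted page */
    off_t  offset;    /* byte offset of the page          */
    size_t len;       /* page size                        */
} EvictedPage;

typedef struct EvictedList
{
    EvictedPage pages[4096];
    int         count;
} EvictedList;

static EvictedList lists[2];
static int         collecting = 0;       /* list bgwriter/backends append to */
static int         sync_threshold = 128; /* the proposed GUC                 */

/* Called by bgwriter/backends whenever they evict a dirty page while a
 * checkpoint is in progress (locking omitted in this toy). */
static void
record_evicted_page(int fd, off_t offset, size_t len)
{
    EvictedList *l = &lists[collecting];

    if (l->count < 4096)
        l->pages[l->count++] = (EvictedPage) {fd, offset, len};
}

/* Called by the checkpointer just before the fsync phase: keep swapping the
 * lists and waiting for the frozen one to hit disk until few pages remain. */
static void
drain_evicted_pages(void)
{
    for (;;)
    {
        EvictedList *drain = &lists[collecting];

        if (drain->count <= sync_threshold)
            return;                   /* few enough left: go do the fsyncs */
        collecting = 1 - collecting;  /* new evictions go to the other list */
        for (int i = 0; i < drain->count; i++)
            (void) sync_file_range(drain->pages[i].fd,
                                   drain->pages[i].offset,
                                   drain->pages[i].len,
                                   SYNC_FILE_RANGE_WAIT_BEFORE |
                                   SYNC_FILE_RANGE_WRITE |
                                   SYNC_FILE_RANGE_WAIT_AFTER);
        drain->count = 0;
    }
}

int
main(void)
{
    int fd = open("evicted-demo.dat", O_CREAT | O_WRONLY | O_TRUNC, 0644);

    if (fd < 0)
        return 1;
    (void) write(fd, "page", 4);
    record_evicted_page(fd, 0, 4);
    sync_threshold = 0;               /* force one drain pass for the demo */
    drain_evicted_pages();
    close(fd);
    return 0;
}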
Excerpt: analysis of sync_file_range
http://yoshinorimatsunobu.blogspot.com/2014/03/how-syncfilerange-really-works.html