group by + limit 의 성능 이슈

PostgreSQL

group by + limit 의 성능 이슈

DBASeoul 2023. 1. 17. 15:34

테스트 데이터 생성(테이블 생성 + 벌크 데이터)

CREATE TABLE public.test2m 
(  
	id serial primary key,  
	timeslice timestamp ,  
	system_id	VARCHAR(100),  
	group_id	VARCHAR(100),  
	agent_name	VARCHAR(100),  
	svc_name	VARCHAR(100),  
	COUNT	INT  
);  

INSERT INTO    public.test2m  (timeslice, system_id, group_id, agent_name, svc_name, COUNT)   
SELECT   
now() as timeslice,   
LEFT(md5(random()::text),3) as system_id,   
LEFT(md5(random()::text),5) as group_id,   
md5(random()::text) as agent_name,   
md5(random()::text) as svc_name,   
1 as count   
FROM     generate_series(1, 1000000) as gs;


select * from public.test2m  limit 100;

--alter table public.test2m add primary key(id);
analyze public.test2m;

explain analyze 
SELECT system_id,group_id,max(agent_name),COUNT(*) FROM public.test2m  
where id between 1 and 500000
GROUP BY system_id , group_id 
order by system_id , group_id 
LIMIT 5000

검색 조건 id 단일 조건이기때문에 id 만 인덱스를 생성한경우 16만건을 모두 읽고, group by 적용후 5000건을 잘라냅니다.

Limit  (cost=45404.69..46196.17 rows=5000 width=50) (actual time=1081.514..1143.438 rows=5000 loops=1)
  ->  Finalize GroupAggregate  (cost=45404.69..105299.76 rows=378377 width=50) (actual time=1081.513..1143.232 rows=5000 loops=1)
        Group Key: system_id, group_id
        ->  Gather Merge  (cost=45404.69..97410.53 rows=410546 width=50) (actual time=1081.509..1141.291 rows=5001 loops=1)
              Workers Planned: 2
              Workers Launched: 2
              ->  Partial GroupAggregate  (cost=44404.67..49023.31 rows=205273 width=50) (actual time=965.212..968.167 rows=2122 loops=3)
                    Group Key: system_id, group_id
                    ->  Sort  (cost=44404.67..44917.85 rows=205273 width=43) (actual time=965.201..967.283 rows=2123 loops=3)
                          Sort Key: system_id, group_id
                          Sort Method: external merge  Disk: 9696kB
                          Worker 0:  Sort Method: external merge  Disk: 8136kB
                          Worker 1:  Sort Method: external merge  Disk: 8136kB
                          ->  Parallel Index Scan using test2m_pkey on test2m  (cost=0.42..19974.72 rows=205273 width=43) (actual time=0.294..54.606 rows=166667 loops=3)
                                Index Cond: ((id >= 1) AND (id <= 500000))
Planning Time: 1.383 ms
Execution Time: 1165.056 ms

집계레벨 1번째 컬럼인 system_id + id 인덱스 추가후 동일 쿼리실행

create index ix_test2m_system_id_id on public.test2m(system_id,id);

explain analyze 
SELECT system_id,group_id,max(agent_name),COUNT(*) FROM public.test2m  
where id between 1 and 500000
GROUP BY system_id , group_id 
order by system_id , group_id 
LIMIT 5000

아래와 같이 5000건정도만 읽고 멈춤(excute 시간도 감소됩니다.)

Limit  (cost=28.65..1768.01 rows=5000 width=50) (actual time=0.478..22.543 rows=5000 loops=1)
  ->  GroupAggregate  (cost=28.65..131400.21 rows=377644 width=50) (actual time=0.478..22.321 rows=5000 loops=1)
        Group Key: system_id, group_id
        ->  Incremental Sort  (cost=28.65..122627.53 rows=499624 width=43) (actual time=0.473..20.306 rows=5001 loops=1)
              Sort Key: system_id, group_id
              Presorted Key: system_id
              Full-sort Groups: 42  Sort Method: quicksort  Average Memory: 30kB  Peak Memory: 30kB
              Pre-sorted Groups: 42  Sort Method: quicksort  Average Memory: 35kB  Peak Memory: 35kB
              ->  Index Scan using ix_test2m_system_id_id on test2m  (cost=0.42..87516.65 rows=499624 width=43) (actual time=0.059..5.022 rows=5072 loops=1)
                    Index Cond: ((id >= 1) AND (id <= 500000))
Planning Time: 1.945 ms
Execution Time: 22.671 ms

아래쿼리는 system_id 로 먼저 group by 후에 각 system_id 별로 개별 loop join 하여 5000번 호출 합니다.

explain analyze 
select * from (
	select system_id from public.test2m  group by system_id
) as A
INNER JOIN LATERAL( 
		SELECT group_id, COUNT(*) AS CNT ,max(agent_name) as agent_name FROM public.test2m AS B  
		WHERE A.system_id = B.system_id and id between 1 and 500000
		group by system_id ,group_id 
	) as c ON TRUE  
limit 5000

성능은 직전 쿼리와 비교하여 비슷한 수행 시간을 보여줍니다.

내부적으로 거이 비슷한 데이터 탐색방식으로 데이터를 처리할것으로 예상 됩니다.

이런형태의 데이터 조회방식을 부분 범위 처리 라고들 합니다.

Limit  (cost=468.18..20131.36 rows=5000 width=50) (actual time=0.371..18.333 rows=5000 loops=1)
  ->  Nested Loop  (cost=468.18..1966613.78 rows=499956 width=50) (actual time=0.370..18.131 rows=5000 loops=1)
        ->  Group  (cost=0.42..28480.42 rows=4098 width=4) (actual time=0.009..1.366 rows=42 loops=1)
              Group Key: test2m.system_id
              ->  Index Only Scan using ix_test2m_system_id_id on test2m  (cost=0.42..25980.42 rows=1000000 width=4) (actual time=0.008..0.804 rows=10020 loops=1)
                    Heap Fetches: 0
        ->  GroupAggregate  (cost=467.75..470.50 rows=122 width=50) (actual time=0.341..0.391 rows=119 loops=42)
              Group Key: b.system_id, b.group_id
              ->  Sort  (cost=467.75..468.06 rows=122 width=43) (actual time=0.339..0.343 rows=119 loops=42)
                    Sort Key: b.group_id
                    Sort Method: quicksort  Memory: 32kB
                    ->  Bitmap Heap Scan on test2m b  (cost=5.98..463.52 rows=122 width=43) (actual time=0.033..0.120 rows=121 loops=42)
                          Recheck Cond: (((test2m.system_id)::text = (system_id)::text) AND (id >= 1) AND (id <= 500000))
                          Heap Blocks: exact=5027
                          ->  Bitmap Index Scan on ix_test2m_system_id_id  (cost=0.00..5.95 rows=122 width=0) (actual time=0.022..0.022 rows=121 loops=42)
                                Index Cond: (((system_id)::text = (test2m.system_id)::text) AND (id >= 1) AND (id <= 500000))
Planning Time: 0.224 ms
Execution Time: 18.498 ms