Archive

Posts Tagged ‘SQL’

String Aggregation Techniques

September 3, 2010 Leave a comment

I’ve recently been getting my head around the model clause in Oracle.

I’ve never had a use for it previously, but recently I remembered the first time I learned regular expressions: Shortly after mastering them, I found a use for them all over the place.

I’m assuming that once I’ve mastered model clauses, I’ll be able to find quite a few uses for them also, but for the moment I’m struggling to find a use where they would be better than good old analytical clauses.

Just the other day, however I read that good old string aggregation can be much faster when performed using a model cluase rather than the standard pseudo-hierachical hack.

So I made a quick example to test it:

bear in mind that this is on my crappy little ibm thinkpad that I got for practically free.

--create a test table
create table straggtest (id number(12)
                        ,str varchar2(10));
                       
--Populate with groups of 10 rows with random strings                     
insert into straggtest(
select trunc(rownum/10), dbms_random.string('a',dbms_random.value(2,10)) 
  from dual
connect by rownum <= 1000000);

commit;

Create index straggtest on straggtest (id);

Now, the Standard method that I’ve used in the past of aggregating these strings into a single row per id uses a hierarchical query:

select id, ltrim(sys_connect_by_path(str,'|'),'|')
 from ( select id
              ,str
              ,row_number() over (partition by id order by str) rn
   from straggtest stragg)
 where connect_by_isleaf = 1
 connect by id = prior id and rn = prior rn + 1
 start with rn = 1;

This produces rows as follows:

2142   GmgagsI|NWUZtqsCp|ON|PTqt|VLVpk|WYXEQj|XgE|ZnZm|aMHf|kpEQdbQ

2162  JHiB|MDcHpEKGw|VWjL|XYB|YHt|cEPLDfQsC|fD|qJAkFqX|roJYxjXukd|uK

This runs in a respectable minute or so for 1 million records.

This, however is apparently not the most efficient way of doing things.

Using the Model clause and the POWER OF ORACLE!(tm) we can reduce that significantlyi by executing the following:

select id, rtrim(agg,'|') string_agg
 from (select id
             ,agg
             ,rn
         from straggtest
      model
      partition by (id)
      dimension by (row_number() over (partition by id order by str) rn )
      measures (cast(str as varchar2(100)) agg)
      rules
      (agg[any] order by rn desc = agg[cv()]||'|'||agg[cv() + 1]))
      where rn = 1;

Here are the trace results – as you can see, the model clause runs in just over 11 seconds.

Hierachical:


select id, ltrim(sys_connect_by_path(str,'|'),'|')
 from ( select id
              ,str
              ,row_number() over (partition by id order by str) rn
   from straggtest stragg)
 where connect_by_isleaf = 1
 connect by id = prior id and rn = prior rn + 1
 start with rn = 1

call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        1      0.00       0.00          0          0          0           0
Execute      2      0.00       0.00          0          0          0           0
Fetch        4     48.34      56.84      11634      21267    3072251          46
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total        7     48.34      56.84      11634      21267    3072251          46

Misses in library cache during parse: 0
Optimizer mode: ALL_ROWS
Parsing user id: 36  

Rows     Row Source Operation
-------  ---------------------------------------------------
     46  FILTER  (cr=21267 pr=11634 pw=17853 time=56829889 us)
    459   CONNECT BY WITH FILTERING (cr=21267 pr=11634 pw=17853 time=56844717 us)
 200002    FILTER  (cr=4726 pr=0 pw=0 time=8267464 us)
2000000     COUNT  (cr=4726 pr=0 pw=0 time=25871530 us)
2000000      VIEW  (cr=4726 pr=0 pw=0 time=17988862 us)
2000000       WINDOW SORT (cr=4726 pr=0 pw=0 time=9915924 us)
2000000        TABLE ACCESS FULL STRAGGTEST (cr=4726 pr=0 pw=0 time=4000358 us)
 549366    HASH JOIN  (cr=14178 pr=735 pw=735 time=28640690 us)
 600001     CONNECT BY PUMP  (cr=0 pr=0 pw=0 time=1200107 us)
5493656     COUNT  (cr=14178 pr=0 pw=0 time=66408293 us)
5493656      VIEW  (cr=14178 pr=0 pw=0 time=44422445 us)
5493656       WINDOW SORT (cr=14178 pr=0 pw=0 time=22447768 us)
6000000        TABLE ACCESS FULL STRAGGTEST (cr=14178 pr=0 pw=0 time=12000741 us)
1000000    COUNT  (cr=2363 pr=0 pw=0 time=12917606 us)
1000000     VIEW  (cr=2363 pr=0 pw=0 time=8917601 us)
1000000      WINDOW SORT (cr=2363 pr=0 pw=0 time=4917594 us)
1000000       TABLE ACCESS FULL STRAGGTEST (cr=2363 pr=0 pw=0 time=2000091 us)

Model Clause:

select id, rtrim(agg,'|') string_agg
 from (select id
             ,agg
             ,rn
         from straggtest
      model
      partition by (id)
      dimension by (row_number() over (partition by id order by str) rn )
      measures (cast(str as varchar2(100)) agg)
      rules
      (agg[any] order by rn desc = agg[cv()]||'|'||agg[cv() + 1]))
      where rn = 1

call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        1      0.00       0.00          0          0          0           0
Execute      2      0.00       0.00          0          0          0           0
Fetch       16      4.67      11.72        496       2363          0         226
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total       19      4.67      11.72        496       2363          0         226

Misses in library cache during parse: 0
Optimizer mode: ALL_ROWS
Parsing user id: 36  

Rows     Row Source Operation
-------  ---------------------------------------------------
    226  VIEW  (cr=2363 pr=496 pw=14973 time=11723552 us)
   2251   SQL MODEL ORDERED (cr=2363 pr=496 pw=14973 time=11727580 us)
1000000    WINDOW SORT (cr=2363 pr=0 pw=0 time=5102306 us)
1000000     TABLE ACCESS FULL STRAGGTEST (cr=2363 pr=0 pw=0 time=2000155 us)

So it was definitely worth learning to use the model clause, if only for the savings that it will give me on this particular piece of work.