Archive

Archive for the ‘Technology’ Category

String Aggregation Techniques

September 3, 2010 Leave a comment

I’ve recently been getting my head around the model clause in Oracle.

I’ve never had a use for it previously, but recently I remembered the first time I learned regular expressions: Shortly after mastering them, I found a use for them all over the place.

I’m assuming that once I’ve mastered model clauses, I’ll be able to find quite a few uses for them also, but for the moment I’m struggling to find a use where they would be better than good old analytical clauses.

Just the other day, however I read that good old string aggregation can be much faster when performed using a model cluase rather than the standard pseudo-hierachical hack.

So I made a quick example to test it:

bear in mind that this is on my crappy little ibm thinkpad that I got for practically free.

--create a test table
create table straggtest (id number(12)
                        ,str varchar2(10));
                       
--Populate with groups of 10 rows with random strings                     
insert into straggtest(
select trunc(rownum/10), dbms_random.string('a',dbms_random.value(2,10)) 
  from dual
connect by rownum <= 1000000);

commit;

Create index straggtest on straggtest (id);

Now, the Standard method that I’ve used in the past of aggregating these strings into a single row per id uses a hierarchical query:

select id, ltrim(sys_connect_by_path(str,'|'),'|')
 from ( select id
              ,str
              ,row_number() over (partition by id order by str) rn
   from straggtest stragg)
 where connect_by_isleaf = 1
 connect by id = prior id and rn = prior rn + 1
 start with rn = 1;

This produces rows as follows:

2142   GmgagsI|NWUZtqsCp|ON|PTqt|VLVpk|WYXEQj|XgE|ZnZm|aMHf|kpEQdbQ

2162  JHiB|MDcHpEKGw|VWjL|XYB|YHt|cEPLDfQsC|fD|qJAkFqX|roJYxjXukd|uK

This runs in a respectable minute or so for 1 million records.

This, however is apparently not the most efficient way of doing things.

Using the Model clause and the POWER OF ORACLE!(tm) we can reduce that significantlyi by executing the following:

select id, rtrim(agg,'|') string_agg
 from (select id
             ,agg
             ,rn
         from straggtest
      model
      partition by (id)
      dimension by (row_number() over (partition by id order by str) rn )
      measures (cast(str as varchar2(100)) agg)
      rules
      (agg[any] order by rn desc = agg[cv()]||'|'||agg[cv() + 1]))
      where rn = 1;

Here are the trace results – as you can see, the model clause runs in just over 11 seconds.

Hierachical:


select id, ltrim(sys_connect_by_path(str,'|'),'|')
 from ( select id
              ,str
              ,row_number() over (partition by id order by str) rn
   from straggtest stragg)
 where connect_by_isleaf = 1
 connect by id = prior id and rn = prior rn + 1
 start with rn = 1

call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        1      0.00       0.00          0          0          0           0
Execute      2      0.00       0.00          0          0          0           0
Fetch        4     48.34      56.84      11634      21267    3072251          46
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total        7     48.34      56.84      11634      21267    3072251          46

Misses in library cache during parse: 0
Optimizer mode: ALL_ROWS
Parsing user id: 36  

Rows     Row Source Operation
-------  ---------------------------------------------------
     46  FILTER  (cr=21267 pr=11634 pw=17853 time=56829889 us)
    459   CONNECT BY WITH FILTERING (cr=21267 pr=11634 pw=17853 time=56844717 us)
 200002    FILTER  (cr=4726 pr=0 pw=0 time=8267464 us)
2000000     COUNT  (cr=4726 pr=0 pw=0 time=25871530 us)
2000000      VIEW  (cr=4726 pr=0 pw=0 time=17988862 us)
2000000       WINDOW SORT (cr=4726 pr=0 pw=0 time=9915924 us)
2000000        TABLE ACCESS FULL STRAGGTEST (cr=4726 pr=0 pw=0 time=4000358 us)
 549366    HASH JOIN  (cr=14178 pr=735 pw=735 time=28640690 us)
 600001     CONNECT BY PUMP  (cr=0 pr=0 pw=0 time=1200107 us)
5493656     COUNT  (cr=14178 pr=0 pw=0 time=66408293 us)
5493656      VIEW  (cr=14178 pr=0 pw=0 time=44422445 us)
5493656       WINDOW SORT (cr=14178 pr=0 pw=0 time=22447768 us)
6000000        TABLE ACCESS FULL STRAGGTEST (cr=14178 pr=0 pw=0 time=12000741 us)
1000000    COUNT  (cr=2363 pr=0 pw=0 time=12917606 us)
1000000     VIEW  (cr=2363 pr=0 pw=0 time=8917601 us)
1000000      WINDOW SORT (cr=2363 pr=0 pw=0 time=4917594 us)
1000000       TABLE ACCESS FULL STRAGGTEST (cr=2363 pr=0 pw=0 time=2000091 us)

Model Clause:

select id, rtrim(agg,'|') string_agg
 from (select id
             ,agg
             ,rn
         from straggtest
      model
      partition by (id)
      dimension by (row_number() over (partition by id order by str) rn )
      measures (cast(str as varchar2(100)) agg)
      rules
      (agg[any] order by rn desc = agg[cv()]||'|'||agg[cv() + 1]))
      where rn = 1

call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        1      0.00       0.00          0          0          0           0
Execute      2      0.00       0.00          0          0          0           0
Fetch       16      4.67      11.72        496       2363          0         226
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total       19      4.67      11.72        496       2363          0         226

Misses in library cache during parse: 0
Optimizer mode: ALL_ROWS
Parsing user id: 36  

Rows     Row Source Operation
-------  ---------------------------------------------------
    226  VIEW  (cr=2363 pr=496 pw=14973 time=11723552 us)
   2251   SQL MODEL ORDERED (cr=2363 pr=496 pw=14973 time=11727580 us)
1000000    WINDOW SORT (cr=2363 pr=0 pw=0 time=5102306 us)
1000000     TABLE ACCESS FULL STRAGGTEST (cr=2363 pr=0 pw=0 time=2000155 us)

So it was definitely worth learning to use the model clause, if only for the savings that it will give me on this particular piece of work.

Advertisements

You want me to tell you what your requirements are?

August 30, 2010 Leave a comment

I am an Oracle Programmer. I have been for quite a while. I specialise in Oracle PL/SQL and SQL and I also do a bit of forms. In the past I have dabbled in other languages as the need arises, but I am primarily an Oracle Database guy.

So I was a little bit surprised today when I was given the task of reverse engineering a MS SQL server T-SQL script that someone had written before I joined and who was no longer here to ask questions of. Not only that, the e-mail from the business analyst requesting this went something like this:

So I can advise the business on comprehensive UAT testing can you please tell me what data bases the [script name here] will impact and what front end software can be used to validate the expected changes have taken place.

I had to read this e-mail a few times before I fully understood the implications of what he was asking.

Let me break it down:

In a normal company where the IT department is supporting the business, a waterfall-style Software Development Lifecycle would go like this:

1) The Business users would identify the need for change in the system that they use

2) The Business analyst would specify this and get sign-off from the business users.

3) The IT department would complete their own version of this with a technical design, specification and sign-off.

4) The Developers would then develop the requested solution perform Unit testing and integration testing

5) The testing team would perform testing

6) The users would then verify the system does what they asked for it to do by comparing the delivered solution against the requirements that they specified in the beginning.

The e-mail that I received from the business analyst basically said the complete opposite of what should happen.

They want me to reverse engineer a script, written by someone else who is no longer here, in a language that I’m not familiar with, for an application that I’ve never used in order to tell the users how they’re supposed to test to see if it does what I say it does…

This comes to mind – The systems architect has it pinned to his cubicle wall:

Photobucket

… surely if they can’t tell me what they want it to do, then I have to question whether they actually need it or not…

At this point of course, I seriously considered just telling that it doesn’t do anything that they can see, and then just NOT release it into production….

I have a frustration brick. it’s a piece of foam shaped and coloured to look realistically like a brick. It’s good for bashing one’s head against it and for throwing at other people’s heads. I bought it as a gag for home use, but tomorrow I shall take it to work.