2012-05: SOUNDEX and other “cool” features – part two for DB2 V9

 

Part two of my walk through new Scalar functions and I will start with a whole new “group” of functions all to do with character manipulation for cross-system strings. ASCII_CHR, ASCII_STR, EBCDIC_CHR, EBCDIC_STR, UNICODE, and UNICODE_STR

 

See the  Scalar functions in DB2 V8: Newsletter 2012-04 SOUNDEX part 1

See the Scalar functions in DB2 10:  Newsletter 2012-06  SOUNDEX part 3

 

Character manipulation for cross-system strings in DB2 V9

SELECT ASCII('$') -- THIS IS THE V8 FUNCTION AS AN EXAMPLE TO GIVE US INPUT                              
FROM SYSIBM.SYSDUMMY1 ;                                        
         36                                                    
                                                               
SELECT ASCII_CHR(36)                                           
FROM SYSIBM.SYSDUMMY1 ;                                        
          $                                                              

SELECT ASCII_STR(                                              
                'EMBEDDED UNICODE ' CONCAT UX'C385' CONCAT ' FRED'
                )                                              
FROM SYSIBM.SYSDUMMY1 ;                                        
EMBEDDED UNICODE C385 FRED                                    

DSNT400I SQLCODE = 000,  SUCCESSFUL EXECUTION                  
DSNT418I SQLSTATE   = 01517 SQLSTATE RETURN CODE               

SELECT EBCDIC_CHR(124)                                                 
FROM SYSIBM.SYSDUMMY1 ;                                                
          §                                                                      

SELECT EBCDIC_STR(                                                     
                 'EMBEDDED UNICODE ' CONCAT UX'C385' CONCAT ' FRED'       
                 )                                                      
FROM SYSIBM.SYSDUMMY1 ;                                                
EMBEDDED UNICODE C385 FRED                                            

DSNT400I SQLCODE = 000,  SUCCESSFUL EXECUTION                          
DSNT418I SQLSTATE   = 01517 SQLSTATE RETURN CODE                       
                                                                       
SELECT UNICODE(UX'C385')                                               
      ,HEX(UNICODE(UX'C385'))                                          
FROM SYSIBM.SYSDUMMY1 ;                                                
      50053  0000C385                                                  
                                                                     
SELECT UNICODE_STR(                                                    
                  'A UNICODE CHAR C385 WITH BACK SLASH AT THE END'     
                  )                                                      
FROM SYSIBM.SYSDUMMY1 ;                                                
A UNICODE CHAR . WITH BACK SLASH AT THE END                           

DSNT400I SQLCODE = 000,  SUCCESSFUL EXECUTION 
DSNT418I SQLSTATE   = 01517 SQLSTATE RETURN CODE

The _CHR functions return the character in the respective code page using the decimal number as input. The _STR functions return a string where the “unprintable” data is prefixed with a and then the hex codes are listed out. Note the extra use of the \ in the UNICODE_STR to get a in UNICODE.

Then we have the SOUNDEX and DIFFERENCE functions which are pretty neat and excellent for searching for names that are similar but not exact (Eg Roy and Roi)  here are some samples of both

SELECT DIFFERENCE('LUCY' , 'JUICY')
      ,SOUNDEX('LUCY')            
      ,SOUNDEX('JUICY')           
FROM SYSIBM.SYSDUMMY1 ;           
          3  L200  J200           

SELECT DIFFERENCE('BEER' , 'BUYER')
      ,SOUNDEX('BEER')            
      ,SOUNDEX('BUYER')           
FROM SYSIBM.SYSDUMMY1 ;           
          4  B600  B600           

SELECT DIFFERENCE('COOL' , 'KEWL')
      ,SOUNDEX('COOL')            
      ,SOUNDEX('KEWL')            
FROM SYSIBM.SYSDUMMY1 ;           
          3  C400  K400           

SELECT DIFFERENCE('AARDVARK' , 'ZOMBIE')
      ,SOUNDEX('AARDVARK')             
      ,SOUNDEX('ZOMBIE')               
FROM SYSIBM.SYSDUMMY1 ;                
          0 A631  Z510

The SOUNDEX takes any input string and converts it down into a four character string. As you can see the first character is the first character of the string the rest are the “value” of the string. This is of academic interest as it really is of use when you plug in the DIFFERENCE which returns a value of 0, 1, 2, 3, or 4. 0 is the words have no similarity 4 is the words are practically the same. That BEER == BUYER was clear!

MONTHS_BETWEEN harks back to one of my examples from last month

SELECT MONTHS_BETWEEN(                                      
              CURRENT_TIMESTAMP                             
            , CAST('1939-05-10-01.00.00.000000' AS TIMESTAMP)
                     ) AS GRANDADS_AGE_IN_MONTHS            
FROM SYSIBM.SYSDUMMY1 ;                                     
---------+---------+---------+---------+---------+---------+-
           GRANDADS_AGE_IN_MONTHS                           
---------+---------+---------+---------+---------+---------+-
              874.645161290322580

Given two timestamps it calculates the number of months between – However this time more accurately than the bit switch stuff in TIMESTAMPDIFF

Lots of string stuff came in 9 as well. LPAD, RPAD, LOCATE_IN_STRING, and OVERLAY here are some examples:

SELECT LPAD('FRED', 8, 'Q')
FROM SYSIBM.SYSDUMMY1 ;   
QQQQFRED                  

SELECT RPAD('FRED', 8, 'Q')
FROM SYSIBM.SYSDUMMY1 ;   
FREDQQQQ                  

SELECT LOCATE_IN_STRING('RÖY LIVES IN DÜSSELDORF' ,
                        'Ü' , 1 , CODEUNITS32)    
FROM SYSIBM.SYSDUMMY1 ;                           
         15

LPAD pads any given input string to the left with the supplied padding character. RPAD does the same but from the right hand side. These two are very handing for filling data up with “.” To make alignment better on the eye. LOCATE_IN_STRING does what it says – it returns the position in the string of the character given but this works, like the CHARACTER_LENGTH and SUBSTRING functions in V8, at the character and not the byte level. Further there is an “instance”, which I did not use in the example but it defaults to 1, so that you can find the nth character if desired. Very handy if you want to check that the user has entered four –‘s in the input field etc.

SELECT LOCATE_IN_STRING('TEST 1-2-3-4-5' , '-' , 1 , 4 , CODEUNITS32)
FROM SYSIBM.SYSDUMMY1 ;                                             
         13                                                         

SELECT LOCATE_IN_STRING('TEST 1-2-3-4 5' , '-' , 1 , 4 , CODEUNITS32)
FROM SYSIBM.SYSDUMMY1 ;                                             
          0

Here you can see that I start the search at the start (1) and that I want the position of the 4th occurrence. So the first returns 13 which is correct and the next returns 0 as there are indeed only three hyphens. As you can imagine – lots of potential uses here!

OVERLAY is the same as good old INSERT except that the length argument is optional.

Maths got a small boost in DB2 9 with the introduction of QUANTIZE. A strange function that outputs a quantized DECFLOAT field based on the parameters input. Again I have never found a use for it but it *must* be used somewhere… Here are the documented examples of what it does:

 QUANTIZE(2.17, DECFLOAT(0.001)) = 2.170
 QUANTIZE(2.17, DECFLOAT(0.01)) = 2.17
 QUANTIZE(2.17, DECFLOAT(0.1)) = 2.2
 QUANTIZE(2.17, DECFLOAT(’1E+0’)) = 2
 QUANTIZE(2.17, DECFLOAT(’1E+1’)) = 0E+1
 QUANTIZE(2, DECFLOAT(INFINITY)) = NAN –- exception
 QUANTIZE(-0.1, DECFLOAT(1) ) = 0
 QUANTIZE(0, DECFLOAT(’1E+5’)) = 0E+5
 QUANTIZE(217, DECFLOAT(’1E-1’)) = 217.0
 QUANTIZE(217, DECFLOAT(’1E+0’)) = 217
 QUANTIZE(217, DECFLOAT(’1E+1’)) = 2.2E+2
 QUANTIZE(217, DECFLOAT(’1E+2’)) = 2E+2

The first parameter is simply changed to have the same look-and-feel as the second parameter definition. I guess when working with floating point, bigint etc. it would be nice to get the output all the same format…

Finally they brought out a few new DATE, TIME functions. EXTRACT, TIMESTAMPADD and TIMESTAMP_ISO. EXTRACT is a high level version of existing functions and looks like this

SELECT EXTRACT(YEAR FROM CURRENT_TIMESTAMP)                            
      ,        YEAR(CURRENT_TIMESTAMP)                                         
      ,EXTRACT(MONTH FROM CURRENT_TIMESTAMP)                           
      ,        MONTH(CURRENT_TIMESTAMP)                                        
      ,EXTRACT(DAY FROM CURRENT_TIMESTAMP)                             
      ,        DAY(CURRENT_TIMESTAMP)                                          
FROM SYSIBM.SYSDUMMY1 ;                                                
  2012         2012            3            3           30           30
                                                                     
SELECT EXTRACT(HOUR FROM CURRENT_TIMESTAMP)                            
      ,        HOUR(CURRENT_TIMESTAMP)                                         
      ,EXTRACT(MINUTE FROM CURRENT_TIMESTAMP)                          
      ,        MINUTE(CURRENT_TIMESTAMP)                                       
      ,EXTRACT(SECOND FROM CURRENT_TIMESTAMP)                          
      ,        SECOND(CURRENT_TIMESTAMP)                                       
FROM SYSIBM.SYSDUMMY1 ;                                                
 10           10           11           11   26.511601              26

As can be seen the functions are the same (apart from the better accuracy at EXTRACT(SECOND FROM xxxx) but you could always simply encapsulate it in an INTEGER function and then you would really have 1:1 functionality. The results are all INTEGER apart from SECOND which is DECIMAL(8,6)

TIMESTAMP_ISO simply returns a version of your given date in the ISO timestamp format – pretty handy when all you get as input is a date.

SELECT TIMESTAMP_ISO (CURRENT DATE)                         
FROM SYSIBM.SYSDUMMY1 ;                                     
2012-03-30-00.00.00.000000

TIMESTAMPADD was obviously written by the same programmer who developed TIMESTAMPDIFF as it uses the same “odd” Bit masking to get the job done. If you remember my example with granddads age in months we can now reverse it so

SELECT TIMESTAMPADD( 64                                     
            , 874                                           
            , CAST('1939-05-10-01.00.00.000000' AS TIMESTAMP)
                     ) AS GRANDADS_BIRTHDATE_PLUS_MONTHS    
FROM SYSIBM.SYSDUMMY1 ;                                     
---------+---------+---------+---------+---------+---------+-
GRANDADS_BIRTHDATE_PLUS_MONTHS                              
---------+---------+---------+---------+---------+---------+-
2012-03-10-01.00.00.000000

Aha! The first parameter is 64 (Which means Months – Look at last month’s newsletter for a table listing all valid values) the second parameter is 874 so it adds 874 months to the supplied timestamp. There are better uses I am sure!

Last but not least are two TEXT extender functions that I have not actually used but seem pretty neat. CONTAINS searches through the text index for a hit. The result is 1 if the text is found in the document. Here is an example from the docu:

SELECT EMPNO
FROM EMP_RESUME
WHERE RESUME_FORMAT = ’ascii’
AND CONTAINS(RESUME, ’cobol’) = 1

If the text index column RESUME contains the word “cobol” it returns 1

Finally SCORE just cut-and-pasted from the docu as I have not got the text extender stuff installed

The following statement generates a list of employees in the order of how well their resumes matches the query “programmer AND (java OR cobol)”, along with a relevance value that is normalized between 0 (zero) and 100.

SELECT EMPNO, INTEGER(SCORE(RESUME, ’programmer AND
      (java OR cobol)’) * 100) AS RELEVANCE
  FROM EMP_RESUME
 WHERE RESUME_FORMAT = ’ascii’
   AND CONTAINS(RESUME, ’programmer AND (java OR cobol)’) = 1
 ORDER BY RELEVANCE DESC

 

DB2 first evaluates the CONTAINS predicate in the WHERE clause, and therefore, does not evaluate the SCORE function in the SELECT list for every row of the table. In this case, the arguments for SCORE and CONTAINS must be identical.

If you have the text support then these two functions could be very useful indeed!
OK, next month DB2 10 functions!
Feel free to send me your comments and ask questions.

TTFN,
Roy Boxwell
Senior Architect