Tuesday, April 21, 2009

Data Patterns and the LIKE Clause

The "LIKE" clause of the SQL SELECT statement is one of the more interesting features of SQL's character processing. Most everyone is familiar with the "%" wildcard, which allows queries such as "SELECT LastName FROM Customers WHERE LastName LIKE 'Mc%'". This returns all the customer last names beginning with "Mc" (such as yours truly). But I suspect that many developers are unaware of some of the deeper uses of data pattern expressions.

The list of other wildcard characters related to LIKE includes "_", "[", "-", "]", and "^". The first, "_", is the 'any single character' expression. The "[]" characters act as a single character wildcard, but allow us to specify which characters will match. The WHERE clause above is equivalent to "WHERE LastName LIKE '[M][c]%'". When multiple characters reside within the brackets, the filter acts like an "or" expression. So changing the filter to "WHERE LastName LIKE '[M][c][aeiou]%'" would produce last names beginning with "Mc", then followed by a vowel, then any terminating string.

If you use the "-" with the brackets, you can specify ranges of characters (ranges defined by ASCII order). For example, let's say we want to search for user names that begin with 'jmclain' and are then followed by a single digit number. We would execute "SELECT * FROM Users WHERE UserName LIKE 'jmclain[0-9]'".

Where it gets complicated is when you want to search a column for wildcard literals. For example, let's say that you have a column called 'SalesDescription', and you want to count the rows where the SalesDescription column contains the string "50% off". If you were to execute "SELECT COUNT(*) FROM Sales WHERE SalesDescription LIKE '50% off'", you would mistakenly pull in rows with SalesDescription values such as '50 cents off', since the "%" wildcard represents "any string". To correct this, you have two options. The simplest is to enclose the "%" wildcard with brackets, so that the filter changes to "WHERE SalesDescription LIKE '50[%] off'".

The second option is to make use of the ESCAPE clause of the LIKE operator. What this method lacks in simplicity, it make up in robustness (and isn't really that complicated anyways). To solve the above problem suchwise, the filter changes to "WHERE SalesDescription LIKE '50!% off' ESCAPE '!'". I prefer the first method above because 1. it is simpler, and 2. in order to use the ESCAPE clause, you must be certain that your target expression doesn't contain the escape character. So if a given SalesDescription value in the table was, unbeknowst to you, something like '50% off!!!', the results start to become unreliable. Best practices for using ESCAPE stipulate first starting with uncommon characters such as "~" or "", and then querying your column to make sure they are not present.

The best use of ESCAPE is when you want to find brackets in your target. Let's say that you wanted to find the SalesDescription value "[50% off]". After checking to ensure that the column values don't contain the tilde ("~") character, you would use the filter "WHERE SalesDescription LIKE '~[50~% off~]' ESCAPE '~'".

Friday, April 17, 2009

Converting Datetime Values to Varchar

Back when I used to code in Foxpro, I had to write custom code to convert date/time values to various formats. T-SQL provides for a great number of formats using the CONVERT function. An article on MSSQLTips lists many (but not all) of these formats. This code (which can easily be wrapped into a stored procedure), will list out all valid format codes and an example of how they will appear:

CREATE TABLE #Fmts (FmtNo tinyint, Example varchar(max))
DECLARE @fmt int; SET @fmt = 0
DECLARE @dt datetime; SET @dt = GETDATE()
WHILE @fmt < 132
INSERT INTO #Fmts (FmtNo, Example)
VALUES (@fmt, CONVERT(varchar, @dt, @fmt))
PRINT '@fmt = ' + LTRIM(STR(@fmt)) + ' is not valid.'
SET @fmt = @fmt + 1
SELECT FmtNo, Example = LEFT(Example, 30) FROM #Fmts

And sample output:

Wednesday, April 15, 2009

Question of the Day

I stumped the gurus on SQLServerCentral.com with another challenging Question of the Day on April 6th, 2009:

Given this code,

DECLARE @val int;
SET @val = -1
CREATE TABLE #empty (val int)

which statement(s) will result in @val being NULL? (select all that apply)

  1. SET @val = NULL
  2. SELECT @val = NULL FROM #empty
  3. SELECT @val = val FROM #empty
  4. SELECT @val = (SELECT val FROM #empty)
As of today, only about 30% of respondents answered correctly, and judging from the comments in the discussion section, a lot of them gained a deeper understanding of the concept of null in SQL. I try to make my questions (and answers), tricky enough so as to not be obvious, but my goal isn't to trick people with arcane technicalities - I want to make them aware of certain subtleties of the database engine. This question arose after some unexpected results made me delve into a bit of code, and tested scenarios just like those in the question.

Monday, April 13, 2009

Collation Sequences

Being a database developer rather than a DBA, I rarely deal with collation types, but I came across a situation recently where I had to dig into the issue. My objective was to produce a breakdown of how many rows contained each ASCII character. I considered two approaches: slice the values into one-character chunks, or loop through all 256 ASCII values and count the number of rows containing each character. The former approach has the advantage of not only counting the rows but the frequency of characters (e.g., "100 rows contain 250 instances of the character 'A'"), but I opted for the second approach since it intuitively seemed faster. If your database was created with case-insensitive collation (such as "SQL_Latin1_General_CP1_CI_AS"), checking for the characters 'A', would pull in values of 'a':

USE NorthWind

FROM dbo.Customers

To fix, simply add the "COLLATE" clause to the query:

FROM dbo.Customers
WHERE CHARINDEX('A' COLLATE Latin1_General_BIN, City) > 0

Friday, April 3, 2009

"Average" Date

Have you ever tried to calculate the average of a datetime column, and gotten this error:

"Msg 8117, Level 16, State 1, Line 1
Operand data type datetime is invalid for avg operator."

When you first think about it, the error makes sense - are you trying to determine the average month, year, hour, or second? But shouldn't there be such a thing as an "average" date? If we have a bunch of sales orders in a given month, doesn't the "average" date of sale actually mean something?

What if I calculated the MIN value, then calc'd the DATEDIFF between the MIN and all other values? At that point I'd essentially have an integer value, which of course I could average, and then derive the "average" date:

CvrtToDate AS (
/* "DataValue" is assumed to be varchar(max) */
DataValue = CONVERT(datetime, DataValue)
FROM DataSet
WHERE ISDATE(DataValue) = 1
,MinAndMax AS (
ValueMin = MIN(DataValue)
,ValueMax = MAX(DataValue)
FROM CvrtToDate
,DateDiffs AS (
DaysFromMin = DATEDIFF(d, MinAndMax.ValueMin, DataValue)
FROM CvrtToDate, MinAndMax
,AvgDaysFromMin AS (
SELECT DataValue = AVG(DaysFromMin)
FROM DateDiffs
AvgDate = DATEADD(d, AvgDaysFromMin.DataValue, MinAndMax.ValueMin)
FROM MinAndMax, AvgDaysFromMin

This query bears a result that makes sense - we have a date that is between the oldest, and most recent, that is somewhere near the midway point.

A little Google research bears fruit for a much simpler calculation. From "Ask Ben: Averaging Date/Time Stamps In SQL": "The secret to date/time averaging is that date/time stamps can be represented as a floating point number. I have covered this a number of times on this blog so I won't go into too much detail, but the idea is that as a floating point number, the integer part represents the number of days since the beginning of time (as the SQL server defines it) and the decimal part represents the time or rather, the fraction of days. SQL does not make this conversion for you; you have to CAST the date/time stamp as a FLOAT value."

This leads to the revised calculation:

ValueMin = MIN(CONVERT(datetime, DataValue))
,ValueMax = MAX(CONVERT(datetime, DataValue))
,ValueAvg = CONVERT(datetime, AVG(CONVERT(float,
CONVERT(datetime, DataValue))))
FROM DataSet
WHERE ISDATE(DataValue) = 1

Not only is this calc far simpler, but it is slightly more precise, as it includes time-of-day in the result.

Calculating Median

I found an article on calculating the median in SQL (http://www.sqlservercentral.com/scripts/Miscellaneous/31775/), and after reading a Wikipedia article on it, I realized that it was incorrect for sample sets of even size. I left a comment with a version of the calc that accounts for this, with a copy of this code:

TopHalf AS (
FROM DataSet
,BottomHalf AS (
FROM DataSet
,BottomOfTopHalf AS (
SELECT TOP 1 DataValue
FROM TopHalf
,TopOfBottomHalf AS (
SELECT TOP 1 DataValue
FROM BottomHalf
Median = (BottomOfTopHalf.DataValue
+ TopOfBottomHalf.DataValue) / 2.0
FROM BottomOfTopHalf, TopOfBottomHalf

Wednesday, April 1, 2009

Function to Insert Commas Into Number String

While looking into some data quality issues, I considered the case of source data arriving as a string of numbers with embedded commas (for example, "1,526,734.56"), and as I didn't have any such data to test with, I decided to create some, and also created this function in the process (which can be used to create some): fn_AddCommasToNumberString.

(Followup on 4/23/09): A forum post on SQLServerCentral.com explained a very easy way to do this using varchar conversion:

declare @test float
set @test = 7265342.12
select @test, convert(varchar(20),cast(@test as money),1)