Wednesday, June 26, 2013

Case-Sensitive Comparisons

One of the first things we "learn" about SQL is that it is case-insensitive, so that "Smith", "SMITH", and "smith" are equal in comparison. In fact, it is not SQL Server that is case-insensitive, or even the database, but the collation of the database. Collation is the way the database sorts string values. An example of why this is important can be found in the ordering of alphabets of foreign languages. Another example is case-sensitivity. Most SQL Server databases are set up as case-insensitive, which allows searches for string values to ignore case.
If we right-click on a typical database in the Object Explorer of SSMS, and select the last menu option, "Properties", a window named "Database Properties" pops up. The last property listed on the General page is Collation, and we can see that it is set to SQL_Latin1_General_CP1_CI_AS. According to this SO Post, this collation has the following properties:
  • Latin1 makes the server treat strings using charset Latin 1, basically ASCII
  • CI case insensitive comparisons so 'ABC' would equal 'abc'
  • AS accent sensitive, so 'ΓΌ' does not equal 'u'
Let's run a query to look for all of the "Smiths" in an imaginary sample table
SELECT * FROM contacts
WHERE [last_name] = 'smith'
Let's say that the query retrieves 10 rows, and that 5 of the rows have names that appear completely in upper-case, and none are completely lower-cased. If we wanted to clean up this data so that these 5 names appear in mixed-case (e.g., "Smith" rather than "SMITH"), how would we rewrite the query to isolate these rows?
If we run this
SELECT * FROM contact
WHERE [last_name] = 'smith'
AND [last_name] = UPPER([last_name])
We still get 10 rows. The additional restriction had no effect, in the same way that it makes no difference in the first query whether we search for Smith, smith, or SMITH. The way we get around this is by forcing the collation in our query
SELECT * FROM contacts
WHERE [last_name] = 'smith'
AND [last_name] = UPPER([last_name]) COLLATE SQL_Latin1_General_CP1_CS_AS
Notice the "CS" in the collation string. It specifies case-sensitive comparisons. Running this query successfully returns the 5 rows in which we are interested. This override of the database collation applies only to the comparison preceding it (otherwise no rows would be returned, because there are no rows where [last_name] = 'smith' with respect to case-sensitivity).
References
Collation - Wikipedia, the free encyclopedia
COLLATE (Transact-SQL)
http://msdn.microsoft.com/en-us/library/aa174903%28v=sql.80%29.aspx
sql - what does COLLATE SQL_Latin1_General_CP1_CI_AS do - Stack Overflow