Monday, August 23, 2010

Getting Row Counts

Have you ever tried to get the row count of a table by a simple "SELECT COUNT(*)" statement, and been stupefied by how long it took to return the result? This article demonstrates how to use sysindexes to speed up that row count:

SELECT rows FROM sysindexes WHERE id = OBJECT_ID('table_name') AND indid < 2
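
The same idea, extended to list counts for every user table at once (a quick sketch; sysindexes keeps one row per heap, indid = 0, or clustered index, indid = 1):

SELECT OBJECT_NAME(si.id) AS TableName, si.rows AS NumRows
FROM sysindexes AS si
WHERE si.indid < 2
  AND OBJECTPROPERTY(si.id, 'IsUserTable') = 1
ORDER BY si.rows DESC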

Params for Pivoting

So let's add a second parameter to indicate the format of the return set. The default value will return it as

AttribName AttribValue Row Number
---------- ----------- ----------

and the other value will return the fully-pivoted data set:

AttribName AttribValue_Row1 ... AttribValue_RowN
---------- ---------------- ... ----------------
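
A sketch of what the proc's signature might look like with the new parameter (all names here are placeholders for now):

CREATE PROCEDURE dbo.usp_PivotTable
    @TableName sysname,
    @Format    tinyint = 0   -- 0 = AttribName/AttribValue/RowNumber; 1 = fully pivoted
AS
BEGIN
    SET NOCOUNT ON;
    -- Body to come: build the unpivoted result, then pivot further if @Format = 1.
END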

Thursday, August 19, 2010

How to Pivot

How about we construct a CTE using dynamic SQL, pivoting the attribute values with a series of SELECT...UNION SELECT statements so that we end up with one SELECT per attribute?
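
Something along these lines, as a rough sketch (the table name is hypothetical, everything is CONVERTed to varchar so the UNION is legal, and the ORDER BY inside ROW_NUMBER would need a real key for the rows to line up):

DECLARE @TableName sysname, @unions nvarchar(max), @sql nvarchar(max);
SET @TableName = 'dbo.Test4';   -- hypothetical table

-- One SELECT per attribute, glued together with UNION ALL
SELECT @unions = COALESCE(@unions + ' UNION ALL ', '')
    + 'SELECT ''' + c.name + ''' AS AttribName,'
    + ' CONVERT(varchar(255), [' + c.name + ']) AS AttribValue,'
    + ' ROW_NUMBER() OVER (ORDER BY (SELECT 1)) AS RowNumber'
    + ' FROM ' + @TableName
FROM sys.columns AS c
WHERE c.object_id = OBJECT_ID(@TableName)
ORDER BY c.column_id;

SET @sql = 'WITH Unpivoted AS (' + @unions + ') '
         + 'SELECT AttribName, AttribValue, RowNumber '
         + 'FROM Unpivoted ORDER BY RowNumber, AttribName;';

EXEC (@sql);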

PivotTable SQL Code

I want to write T-SQL code that will pivot any given table. The way I imagine it, I'd implement this code as a stored proc that accepts a table name as a parameter. It returns a result set comprising the perfect pivot of that table. The result set need not be named, but could be inserted into a table. It will have to convert all values to varchar (or char for easier printing; perhaps with a switch as a parameter).

Structure of return set:

AttribName AttribValue_Row1 ... AttribValue_RowN
---------- ---------------- ... ----------------

So, how do we do this? We can easily pivot the column names into values for 'AttribName' by pulling from the sys.columns view. The problem then becomes: how do we pivot the attribute values into the genericized column?
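
Pulling the attribute names is the trivial part; something like:

SELECT c.name AS AttribName
FROM sys.columns AS c
WHERE c.object_id = OBJECT_ID('dbo.Test4')   -- hypothetical table
ORDER BY c.column_id;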

We need to first pivot the data into this structure:

AttribName AttribValue Row Number
---------- ----------- ----------

Then we can self-join the data set as many times as there are rows in the original data. Or not.

Stairway to Database Design

"Designing a database is very easy to do, but hard to do well. There are many databases that have been built in the world that do not meet the needs of the applications they support.

The fundamentals of design are often not taught to developers or DBAs when they are given the task of building tables. Many people do not understand the logic used to decide what columns to use, the data types for each one, constraints between columns, and the keys that should be defined.

Joe Celko, widely viewed as an expert in the SQL language, brings us a series that looks to help you understand the fundamentals of the design process. The articles in the series are linked below..."

Monday, August 9, 2010

Random Rows

Here's an easy way to select a random row from a table:

SELECT TOP 1 * FROM Test4 ORDER BY NEWID()

Friday, July 30, 2010

SQL Date Functions Syntax

One thing that drives me a little crazy with T-SQL is the syntax of the date functions:

DATEADD (datepart, number, date)
DATEDIFF (datepart, startdate, enddate)
DATEPART (datepart, date)

The designers were consistent in putting the 'datepart' as the first parameter, but for DATEADD, why did they make the base date the last parameter? I think it should be consistent with DATEPART and DATEDIFF, so that the second parameter is always a DATETIME.
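
Side by side, the inconsistency looks like this:

SELECT DATEPART(day, GETDATE());                -- date is the 2nd argument
SELECT DATEDIFF(day, '20100101', GETDATE());    -- dates start at the 2nd argument
SELECT DATEADD(day, 30, GETDATE());             -- but here the date drops to the 3rd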

Tips to optimize your SQL statements

Interesting article on SQLServerCentral.com by Brian Ellul, 2010/07/29:

"There is a huge difference between writing an SQL statement which works and one which works well and performs well. Sometimes developers are too focused on just writing their SQL to perform the required task, without taking into consideration its performance and most importantly its impact on the SQL Server Instance, i.e. the amount of CPU, IO, and memory resources their SQL is consuming. Thus, they starve other SQL Server processes during the SQL statement execution bringing the whole instance to its knees. This article is intended to provide the SQL developer with a set of easy checks to perform to try and optimize the SQL Statements."

Tuesday, July 27, 2010

Level of measurement

Interesting Wikipedia article about metrics and the different types of measurement scales:

The "levels of measurement", or scales of measure are expressions that typically refer to the theory of scale types developed by the psychologist Stanley Smith Stevens. Stevens proposed his theory in a 1946 Science article titled "On the theory of scales of measurement"[1]. In this article Stevens claimed that all measurement in science was conducted using four different types of scales that he called "nominal", "ordinal", "interval" and "ratio".

Monday, July 26, 2010

NK Search

I just tested the NK search algorithm on an unindexed table of 9 columns and over 10 million rows, and it ran for 4 minutes before returning the result that no natural key of 4 columns or fewer exists. It ran 255 queries in that time. The biggest performance boost was the use of the sample check, whereby the duplicate count is run on the TOP 1000 rows first, then on the full data set.
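
For anyone curious what the sample check looks like, here's a minimal sketch of the idea (the table and candidate-key columns are made up; the real code generates these queries dynamically):

DECLARE @SampleDupes int;

-- Cheap test first: look for duplicates of the candidate key in a small sample.
SELECT @SampleDupes = COUNT(*)
FROM (
    SELECT Col1, Col2, Col3, Col4
    FROM (SELECT TOP 1000 * FROM dbo.BigTable) AS Sample
    GROUP BY Col1, Col2, Col3, Col4
    HAVING COUNT(*) > 1
) AS Dupes;

-- Only if the sample is clean do we pay for the full scan.
IF @SampleDupes = 0
    SELECT Col1, Col2, Col3, Col4
    FROM dbo.BigTable
    GROUP BY Col1, Col2, Col3, Col4
    HAVING COUNT(*) > 1;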

I'm polishing off the code and should post it by the end of the week.

Thursday, July 15, 2010

New and Improved Natural Key Discovery Algorithm

While preparing for Monday's Baltimore SQL Server Users Group presentation on the natural key discovery code I wrote, I found a blind spot in the design. I hit the panic button a few days ago, but today I think I've got a new solution that solves a lot of problems.

Wednesday, July 14, 2010

Table Variables and Query Optimizer

In yesterday's post I mentioned that I replaced the temp tables in a stored procedure with table variables. Functionally this worked perfectly, but it doomed the performance by orders of magnitude. The problem was that the query optimizer was choosing MERGE JOINs instead of HASH JOINs for the table variables. Forcing the hash joins with query hints fixed the problem.
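
For reference, the two hint styles I mean (table and column names here are just illustrative):

DECLARE @Orders TABLE (OrderID int, CustName varchar(50));
DECLARE @OrderLines TABLE (OrderID int, Amount money);

-- Join-level hint: force a hash join between the two table variables.
SELECT o.OrderID, l.Amount
FROM @Orders AS o
INNER HASH JOIN @OrderLines AS l
    ON o.OrderID = l.OrderID;

-- Or the query-level version, which applies to every join in the statement.
SELECT o.OrderID, l.Amount
FROM @Orders AS o
INNER JOIN @OrderLines AS l
    ON o.OrderID = l.OrderID
OPTION (HASH JOIN);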

[Ed. note 7/26] Here's another interesting discussion about this. Note the post that reads "And also, because table variables do not have statistics, the query optimizer will often take dubious choices because it believes the row count to be 1 all the time - leading to inappropriate join methods etc..."

Tuesday, July 13, 2010

VS2008 SSRS and Temp Tables in Spds

Ran into a problem using VS2008 to report on data returned by a SQL Server 2005 stored procedure. Apparently there is an issue with VS2008 and stored procedures that make use of temp tables. Following this article's advice, I replaced all of the temp tables with table variables, and the problem went away.
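
The change itself was mechanical; something like this (the columns and source table are hypothetical):

-- Before: the report designer stumbles over the temp table.
CREATE TABLE #Results (ID int, Name varchar(100));
INSERT INTO #Results (ID, Name)
SELECT CustomerID, CustomerName FROM dbo.Customers;
SELECT ID, Name FROM #Results;

-- After: same shape, but declared as a table variable.
DECLARE @Results TABLE (ID int, Name varchar(100));
INSERT INTO @Results (ID, Name)
SELECT CustomerID, CustomerName FROM dbo.Customers;
SELECT ID, Name FROM @Results;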

Code Review, Years Later

I'm reviewing code I wrote a couple of years ago to determine natural keys of raw data sets, in order to do a talk on it at the local SQL Server users' group meeting next week. After an initial reading, I thought that the code worked a certain way (doing breadth-first searches), but it actually works in a hybrid breadth/depth first fashion.

The first thought one might have would be something along the lines of "comment much?", but truthfully, I do, or at least I think I do. The code appears in a SQL Server Central article that explains much of the background of the problem I was trying to solve, and the code itself has an entire page of comments; yet I find now that there simply isn't one concise explanation of the resultant design of my algorithm.

There must be a magic number of weeks after code is finished and polished off, when the original intent of the developer is still in his or her brain, but beginning to fade. That is the perfect time to finish commenting the code, because at that time the developer will have to approach the code as an outsider, but still be able to summon those fuzzy thoughts about why something was done a certain way. Does anyone know what that magic number of weeks is?

Sunday, July 11, 2010

Recursive Stored Procedures and Temp Tables

If a stored procedure creates a temp table and then calls itself recursively, will the newly instantiated stored procedure have access to the temp table?
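I suspect the answer is yes, since a temp table is visible to any procedure called by its creator, and a recursive call is just a special case of that. A quick test sketch (hypothetical proc) would settle it:

CREATE PROCEDURE dbo.Recurse_Demo
    @Depth int
AS
BEGIN
    -- Only the outermost call creates the temp table.
    IF @Depth = 0
        CREATE TABLE #Trace (Depth int);

    INSERT INTO #Trace (Depth) VALUES (@Depth);

    IF @Depth < 3
    BEGIN
        DECLARE @Next int;
        SET @Next = @Depth + 1;
        EXEC dbo.Recurse_Demo @Next;
    END

    -- If the recursive calls can see #Trace, this returns rows 0 through 3.
    IF @Depth = 0
        SELECT Depth FROM #Trace ORDER BY Depth;
END
GO

EXEC dbo.Recurse_Demo @Depth = 0;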

Friday, July 2, 2010

Temp Table Scope

One of the gotchas of T-SQL is temp table scope. The fact that I can create a temp table within a stored procedure, then call a second stored procedure and reference that temp table from it, scares me a little. It reminds me of the old third-generation (3GL) languages, with their global variables and modified private scope and such.

A big enhancement in SQL Server in this department: table variables. The scope of a table variable is the same as that of a "regular" variable: local to the procedure that declares it. So replacing a temp table with a table variable in a stored procedure eliminates the concern that a called inner procedure will unexpectedly change the data. However, there are major performance declines when using table variables to hold large amounts of data, as this article explains.
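
A little demo of the difference (made-up procs; the inner proc can see and modify the caller's temp table, but has no way to even reference the caller's table variable):

CREATE PROCEDURE dbo.Inner_Touch
AS
    -- Compiles via deferred name resolution; at run time it binds to the
    -- #Work table created by whichever procedure called us.
    UPDATE #Work SET Val = Val + 1;
GO

CREATE PROCEDURE dbo.Outer_Demo
AS
BEGIN
    CREATE TABLE #Work (Val int);
    INSERT INTO #Work (Val) VALUES (1);

    DECLARE @Work TABLE (Val int);
    INSERT INTO @Work (Val) VALUES (1);

    EXEC dbo.Inner_Touch;    -- changes #Work; could not touch @Work if it tried

    SELECT Val AS TempTableVal FROM #Work;   -- 2
    SELECT Val AS TableVarVal  FROM @Work;   -- still 1
END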

Another link: http://www.mssqltips.com/tip.asp?tip=1556

Wednesday, June 30, 2010

Installing Second Hard Drive

DISCLAIMER: I am not a hardware guy, nor am I a sysadmin type. But I'm trying to do things on the cheap in the DIY spirit. A friend gave me some old hardware, and I'm using it to try to upgrade my old desktop.

I had some misadventures last night trying to install a second IDE drive in my Dell Optiplex 270 (yes, I know it's old; it's just a lab computer). The drive is a 40 GB Seagate that I'm planning on using for TempDb. When I first put it in, I didn't change the jumper settings, so the PC thought I had two master IDE drives. This caused BIG problems in the BIOS. I could no longer boot, and my computer no longer recognized my original master drive. I changed the jumper settings of the second drive and tried to reboot - no luck. Now the PC was reporting TWO unknown drives (which I suppose is progress). I played around with the BIOS settings, but again, no luck.

Then today, after googling the task, I came across this how-to article that explains that the IDE cable connecting the drives to the motherboard must be plugged in to the drives in a very specific way. The instructions worked perfectly. When I booted up in Windows, I noticed that the drive had two partitions of 20 GB each. Wanting to use the entire drive for my new TempDb, I researched how to delete the partitions and reformat.

Tuesday, June 29, 2010

T-SQL Challenge #33

Just finished T-SQL Challenge #33. It was less challenging than some of the others (I finished it in under 20 minutes), but the problem was interesting enough and still required the declarative thinking that is the aim of these challenges.

Also, last week R. Barry Young published a series of articles on how to develop this mode of thinking: "There Must Be 15 Ways To Lose Your Cursors... part 1, Introduction".

Thursday, June 24, 2010

SQL Search Tool

I'm not big on 3rd-party plug-in tools, and am not much for shilling for companies, but I LOVE Red Gate's SQL Search product. It makes searching your database for text fragments very easy (when you're analyzing the impact of changing a column name, for example), and displays the results in an easy-to-use interface that lets you click through and edit the affected objects.

Friday, June 11, 2010

Needle in a Haystack

While I was loading 13 million rows into a staging table, a name-parsing routine came across unexpected data, and the load failed. One of the greatest drawbacks of set-based ETL such as T-SQL, versus row- or batch-based ETL tools such as Informatica or SSIS, is that data discrepancies cause failures that are difficult to diagnose. In this case, the load failed with the error:

"Msg 537, Level 16, State 5, Procedure spd_Load_Stg_Names, Line 31
Invalid length parameter passed to the LEFT or SUBSTRING function.
The statement has been terminated."

This is almost useless for finding the problem data, as it does not tell me which value, or which row, caused the failure.
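
The workaround is usually to go hunting for the rows that would break the arithmetic. For example, if the parsing does something like LEFT(FullName, CHARINDEX(' ', FullName) - 1), then any value with no space in it produces a length of -1; a query along these lines (the column and table names are hypothetical) flushes out those rows:

SELECT TOP 100 FullName
FROM dbo.Stg_Names_Raw
WHERE CHARINDEX(' ', FullName) = 0;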

Sunday, May 23, 2010

Bulk Insert CSV with Text Qualifiers

One of the biggest shortcomings of SQL Server's BCP/BULK INSERT tools is the inability to specify a text qualifier for a comma-delimited ("CSV") text file on import. Let's take a look at an example of a record from a CSV file that we wish to import into a table:

"John Smith", "123 Main St, Apt 3", "Anytown", "XX", "12345"

So the format of the record is Name, Address, City, State, ZIP. The fields are separated by commas and encapsulated with double quotes. The "Address" field demonstrates the need for the text qualifier: the embedded comma that separates the first address line from the second. Without text qualifiers, each line of the address would appear as a separate field rather than as one discrete value. That in itself would not be a problem, but because some addresses have only one line, the number of delimiters becomes variable, and the import is then faulty.

The problem crystallizes when we try to import a file of that format into SQL Server. The BCP/BULK INSERT utilities do not have an option to specify the text qualifier. That leaves us a couple of kludgy options. First, we can specify the delimiter as ','; the problem here is that every field encapsulated in double quotes will retain that character in the database, leaving us to update the table afterward to remove those characters. Second, we can specify that the delimiter is '","' instead of ','; this is a step in the right direction, but it will still leave a leading double quote on the first field and a trailing one on the last, leaving us with less cleanup work than in the first case.
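
The second option looks something like this (the paths and table are made up, and it assumes the fields are separated by exactly "," with no space after the comma):

BULK INSERT dbo.Stg_Addresses
FROM 'C:\data\addresses.csv'
WITH (FIELDTERMINATOR = '","', ROWTERMINATOR = '\n');

-- Clean up the stray qualifiers left on the outermost fields
-- (assuming no legitimate embedded double quotes).
UPDATE dbo.Stg_Addresses
SET [Name] = REPLACE([Name], '"', ''),
    ZIP    = REPLACE(ZIP, '"', '');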

Given the maturity of the SQL Server product, I'm surprised that Microsoft hasn't added this feature. I suppose that is their way of moving developers towards SSIS, which of course does have it.

If we really want to properly import this file using BCP or BULK INSERT without any weird cleanup kludges, we have to use a format file. Here's a good article on MSDN about how to create a format file from scratch using a BCP option. To complete my task, I will take the resultant format file and modify it to account for my comma delimiters and text qualifiers.
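
For what it's worth, here's roughly what that modified non-XML format file might look like for the five-field record above (9.0 is the SQL Server 2005 version; the widths, names, and collation are placeholders, and the first entry is a dummy field mapped to server column 0 just to swallow the leading quote):

9.0
6
1   SQLCHAR   0   0     "\""       0   FIRST_QUOTE   ""
2   SQLCHAR   0   100   "\",\""    1   Name          SQL_Latin1_General_CP1_CI_AS
3   SQLCHAR   0   200   "\",\""    2   Address       SQL_Latin1_General_CP1_CI_AS
4   SQLCHAR   0   100   "\",\""    3   City          SQL_Latin1_General_CP1_CI_AS
5   SQLCHAR   0   2     "\",\""    4   State         SQL_Latin1_General_CP1_CI_AS
6   SQLCHAR   0   10    "\"\r\n"   5   ZIP           SQL_Latin1_General_CP1_CI_AS

Then point the load at it:

BULK INSERT dbo.Stg_Addresses
FROM 'C:\data\addresses.csv'
WITH (FORMATFILE = 'C:\data\addresses.fmt');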