Wednesday, June 30, 2010
Installing Second Hard Drive
DISCLAIMER: I am not a hardware guy, nor am I a sysadmin type. But I'm trying to do things on the cheap in the DIY spirit. A friend gave me some old hardware, and I'm using it to try and upgrade my old desktop. I had some misadventures last night trying to install a second IDE drive on my Dell Optiplex 270 (yes I know it's old, it's just a lab computer). The drive is a 40gb Seagate that I'm planning on using for TempDb. When I first put it in, I didn't change the jumper settings, so the pc thought I had two master IDE drives. This caused BIG problems in BIOS. I could no longer boot, and my computer no longer recognized my original master drive. I changed the jumper settings of the second drive and tried to reboot - no luck. Now the pc was reporting TWO unknown drives (which I suppose is progress). I played around with the BIOS settings, but again, no luck. Then today after googling the task, I came across this how-to article that explains that the IDE cable that connects the IDE drives to the motherboard must be plugged in to the drives in a very specific way. The instructions worked perfectly. When I booted up in Windows, I noticed that that drive had two partitions of 20gb each. Wanting to use the entire drive for my new TempDb, I researched how to delete the partitions and reformat.
Tuesday, June 29, 2010
T-SQL Challenge #33
Just finished T-SQL Challenge #33. It was less challenging than some of the others (I finished it in under 20 minutes), but the problem was interesting enough and still required the declarative thinking that is the aim of these challenges.
Also last week R. Barry Young published articles on how to gain this mode of thinking: "There Must Be 15 Ways To Lose Your Cursors... part 1, Introduction".
Also last week R. Barry Young published articles on how to gain this mode of thinking: "There Must Be 15 Ways To Lose Your Cursors... part 1, Introduction".
Thursday, June 24, 2010
SQL Search Tool
I'm not big on 3rd party plug-in tools, and am not much for shilling for companies, but I LOVE Red Gate's SQL Search product. It makes searching your database for text fragments very easy (like when you analyze the impact of changing a column name, for example), and displays the results in an easy-to-use interface that enables you to click and edit the affected objects.
Friday, June 11, 2010
Needle in a Haystack
While working on loading 13m rows into a staging table, a name-parsing routine came across unexpected data, and the load failed. One of the greatest drawbacks to set-based ETL such as T-SQL, versus row- or batch-based ETL tools such as Informatica or SSIS, is that data discrepancies cause failures that are difficult to diagnose. In this case, the load failed with the error:
"Msg 537, Level 16, State 5, Procedure spd_Load_Stg_Names, Line 31
Invalid length parameter passed to the LEFT or SUBSTRING function.
The statement has been terminated."
This is almost useless as far as finding the problem data, as it does not tell me what value, or which row, caused the failure.
"Msg 537, Level 16, State 5, Procedure spd_Load_Stg_Names, Line 31
Invalid length parameter passed to the LEFT or SUBSTRING function.
The statement has been terminated."
This is almost useless as far as finding the problem data, as it does not tell me what value, or which row, caused the failure.
Sunday, May 23, 2010
Bulk Insert CSV with Text Qualifiers
One of the biggest shortcomings of SQL Server's BCP/Bulk Insert tool is the lack of specifying a text qualifier in a comma-delimited ("CSV"), text file for import. Let's take a look at an example of a record from a CSV file that we wish to import into a table:
"John Smith", "123 Main St, Apt 3", "Anytown", "XX", "12345"
So the format of the record is Name, Address, City, State, ZIP. The fields are separated by commas, and encapsulated with double-quotes. The "Address" field demonstrates the need for the text qualifier: the embedded comma that separates the first address line from the second. Without text qualifiers each line of address would appear as a separate field, rather than one discrete value. That in itself would not be a problem, but because some addresses only have one line, the number of delimiters becomes variable, and the import is then faulty.
The problem crystalizes when we try to import a file of that format into SQL Server. The BCP/Bulk Insert utilities do not have an option to specify the text qualifier. We have a couple of "kloodgy" options before us: We can specify the delimiter as ','. The problem here is that every field encapsulated with the double quotes will retain that character in the database, leaving us to update the table to remove those characters. We can specify that the delimiter is '","' instead of ','; this is a step in the right direction, but it will still leave a double quote leading in the first field and trailing in the last, leaving us with less work than in the first case.
Given the maturity of the SQL Server product, I'm surprised that Microsoft hasn't added this feature. I suppose that is their way of moving developers towards SSIS, which of course does have it.
If we really want to properly import this file using BCP or BULK INSERT without any weird cleanup kludges, we have to use a format file. Here's a good article on MSDN about how to create a format file from scratch using a BCP option. To complete my task, I will take the resultant format file and modify it to account for my comma delimiters and text qualifiers.
"John Smith", "123 Main St, Apt 3", "Anytown", "XX", "12345"
So the format of the record is Name, Address, City, State, ZIP. The fields are separated by commas, and encapsulated with double-quotes. The "Address" field demonstrates the need for the text qualifier: the embedded comma that separates the first address line from the second. Without text qualifiers each line of address would appear as a separate field, rather than one discrete value. That in itself would not be a problem, but because some addresses only have one line, the number of delimiters becomes variable, and the import is then faulty.
The problem crystalizes when we try to import a file of that format into SQL Server. The BCP/Bulk Insert utilities do not have an option to specify the text qualifier. We have a couple of "kloodgy" options before us: We can specify the delimiter as ','. The problem here is that every field encapsulated with the double quotes will retain that character in the database, leaving us to update the table to remove those characters. We can specify that the delimiter is '","' instead of ','; this is a step in the right direction, but it will still leave a double quote leading in the first field and trailing in the last, leaving us with less work than in the first case.
Given the maturity of the SQL Server product, I'm surprised that Microsoft hasn't added this feature. I suppose that is their way of moving developers towards SSIS, which of course does have it.
If we really want to properly import this file using BCP or BULK INSERT without any weird cleanup kludges, we have to use a format file. Here's a good article on MSDN about how to create a format file from scratch using a BCP option. To complete my task, I will take the resultant format file and modify it to account for my comma delimiters and text qualifiers.
Tuesday, December 8, 2009
TSQL Challenge #18 Blues
I'm working on TSQL Challenge #18, and hitting up against a brick wall. The challenge involves building a calendar for given months of interest (month/year that is). So I build a CTE that first figures out the first & last day of the month, then the week numbers for those days. Matching the week numbers against a tally table, I can then tell which weeks that the days of a particular month will span - those weeks then become rows in the results. I then build a CTE to hold all of my days of interest, with the week number and day-of-month as columns. Using my 'Weeks' CTE as a base, I LEFT JOIN the 'Days' dataset seven times - one for each day of the week.
This is the error I encounter: "The conversion of a char data type to a datetime data type resulted in an out-of-range datetime value." If I limit my day-related tally table to a max of 28 (rather than 31), the error disappears, but I don't get any days in my calendar past the 28th of course. The weird thing is, if I set the tally to max of 31, and I select from the 'Days2' CTE, I don't get an error, and all the days I expect to see are present. Something strange is going on under the hood of SQL.
This is the error I encounter: "The conversion of a char data type to a datetime data type resulted in an out-of-range datetime value." If I limit my day-related tally table to a max of 28 (rather than 31), the error disappears, but I don't get any days in my calendar past the 28th of course. The weird thing is, if I set the tally to max of 31, and I select from the 'Days2' CTE, I don't get an error, and all the days I expect to see are present. Something strange is going on under the hood of SQL.
Thursday, December 3, 2009
Kimball Design Tip #119 Updating the Date Dimension
I've always thought of the Time or Date dimension to be a static one, rather than slowly-changing, until I read this tip from Kimball U: "...there are some attributes you may add to the basic date dimension that will change over time. These include indicators such as IsCurrentDay, IsCurrentMonth, IsPriorDay, IsPriorMonth, and so on. IsCurrentDay obviously must be updated each day."
Creating and automatically updating these attributes of the Date dimension will save time (sorry, no pun intended), for report writers and data mart developers, and also standardize such calculations across the enterprise. For example, let's say you are responsible for the data warehouse of a catalog company, and you want to calculate lag time between ordering and shipping, a common metric of customer satisfaction. Furthermore, management decides to give everyone an extra holiday as a bonus for company profitability from last year, and so lets everyone take off Groundhog Day. To account for this in your metric, you are going to want to have an attribute like "BusinessDayOfYear" as an incrementing integer, so that the first business day in January gets assigned BusinessDayOfYear=1, the second day is 2, and so forth, until you have the highest number as the last business day in December (which is most likely Dec 30th). If the company now has Groundhog Day as a holiday, then the IsHoliday for Feb 2nd is changed from 0 to 1, and then the BusinessDayOfYear attribute is recalculated.
Calculating how many business days between ordering and shipping is then trivial. Of course, this does not account for orders placed in December that ship in January, so you might want to have an attribute BusinessDayLifetime, which starts at 1 the first day the company was in business, and increments freely forever.
Creating and automatically updating these attributes of the Date dimension will save time (sorry, no pun intended), for report writers and data mart developers, and also standardize such calculations across the enterprise. For example, let's say you are responsible for the data warehouse of a catalog company, and you want to calculate lag time between ordering and shipping, a common metric of customer satisfaction. Furthermore, management decides to give everyone an extra holiday as a bonus for company profitability from last year, and so lets everyone take off Groundhog Day. To account for this in your metric, you are going to want to have an attribute like "BusinessDayOfYear" as an incrementing integer, so that the first business day in January gets assigned BusinessDayOfYear=1, the second day is 2, and so forth, until you have the highest number as the last business day in December (which is most likely Dec 30th). If the company now has Groundhog Day as a holiday, then the IsHoliday for Feb 2nd is changed from 0 to 1, and then the BusinessDayOfYear attribute is recalculated.
Calculating how many business days between ordering and shipping is then trivial. Of course, this does not account for orders placed in December that ship in January, so you might want to have an attribute BusinessDayLifetime, which starts at 1 the first day the company was in business, and increments freely forever.
Monday, November 30, 2009
Building a Yahoo Finance input adapter for SQL Server StreamInsight
Here's an excellent SSC article by Johan Åhlén on building CEP (Complex Event Processing) solutions in SQL 2008. "What is StreamInsight and what is it good for? StreamInsight is a platform for developing and deploying applications that handle high-speed streaming data. It could be used for near real-time processing of data from production environments, structured data such as financial information and unstructured data such as Facebook, Twitter and blogs. Multiple sources can be combined and refined before they are being output. This technology is called CEP - complex event processing.
"Combine StreamInsight with data mining, and you can have real-time fraud detection, stock market prediction, you name it... Imagine a real-time Business Intelligence-application where the management can actually see the KPI gauges moving on the screen. I would say that monitoring real-time flow of information is one of the key success factors for tomorrow's Business Intelligence. This is also supported by Ralph Kimball, saying in an interview that Business Intelligence is moving from the strategic level towards operational level processes such as customer support, logistics and sales. At the operational level data must be continuously updated, not just once per day. I would add also that the new generation, that has grown up with Facebook and Twitter, will make it necessary to monitor new sources for successful Business Intelligence."
"Combine StreamInsight with data mining, and you can have real-time fraud detection, stock market prediction, you name it... Imagine a real-time Business Intelligence-application where the management can actually see the KPI gauges moving on the screen. I would say that monitoring real-time flow of information is one of the key success factors for tomorrow's Business Intelligence. This is also supported by Ralph Kimball, saying in an interview that Business Intelligence is moving from the strategic level towards operational level processes such as customer support, logistics and sales. At the operational level data must be continuously updated, not just once per day. I would add also that the new generation, that has grown up with Facebook and Twitter, will make it necessary to monitor new sources for successful Business Intelligence."
Monday, November 23, 2009
Data Vault Institute
While reading the comments on an SSC article about fault-tolerant ETL loading, I came across a link to the Data Vault Institute, and what sparked my attention was the comment by Dan Linstedt that "today's data warehouse has become a system of record. Due in part for the need of compliance." This struck me as an important point, given that many of the databases on which I've worked in the past were marketing databases, where bad data was generally disposed of. But in financial, health care, and governmental data warehouses, this approach would be completely unsatisfactory.
Monday, November 16, 2009
TSQL Challenge #17
"TSQL Challenge 17 - Creating cross rows references with inline hyperlinks" is now out. Due in - a couple of weeks?
Monday, November 9, 2009
Intelligent Keys
I'm engaged in an internal debate about the use of intelligent versus surrogate keys. Typically when this issue arises, the debate centers around whether we want to use a key that is already present in the data (such as SSN in an employee table - this is an intelligent key, also known as a natural key), or if it's better to generate a new meaningless key (such as an auto-incrementing integer - this is a surrogate key).
Now the internal debate isn't over that issue per se - I fall on the side that favors the surrogate key creation. The real debate I'm in is whether it's okay to create an intelligent surrogate key. The most typical surrogate as mentioned previously is an auto-incrementing integer identity - every time a row is inserted into the table, a new key is created by adding one to the max value. These keys have zero business meaning - that's their advantage, that they are decoupled from the business data. However, there are situations where it makes sense to create this value intelligently. One example is creating a time dimension in a data warehouse, whereby the primary key consists of an integer in the form "YYYYMMDD". Microsoft favors this method (as I discovered in their training kit for exam 70-448). A big advantage to this approach is that if you create a clustered index on that intelligent surrogate key, all of your data will be sorted in date order (of course, if you insert into that table by earliest date first, it will also be in that order - unless you add an earlier time period at a later date).
Now the internal debate isn't over that issue per se - I fall on the side that favors the surrogate key creation. The real debate I'm in is whether it's okay to create an intelligent surrogate key. The most typical surrogate as mentioned previously is an auto-incrementing integer identity - every time a row is inserted into the table, a new key is created by adding one to the max value. These keys have zero business meaning - that's their advantage, that they are decoupled from the business data. However, there are situations where it makes sense to create this value intelligently. One example is creating a time dimension in a data warehouse, whereby the primary key consists of an integer in the form "YYYYMMDD". Microsoft favors this method (as I discovered in their training kit for exam 70-448). A big advantage to this approach is that if you create a clustered index on that intelligent surrogate key, all of your data will be sorted in date order (of course, if you insert into that table by earliest date first, it will also be in that order - unless you add an earlier time period at a later date).
Tuesday, October 27, 2009
SQL/ETL/BI 101
I'm going to compile a list of articles that provide introductions to topics in SQL and BI.
Introduction to Indexes, by Gail Shaw
Index of MS-SQL Articles on DatabaseJournal.com, a gold mine of introductory articles.
"What a Data Warehouse is Not" by Bill Inmon.
Introduction to Indexes, by Gail Shaw
Index of MS-SQL Articles on DatabaseJournal.com, a gold mine of introductory articles.
"What a Data Warehouse is Not" by Bill Inmon.
Friday, October 23, 2009
SSIS Data Cleansing Webinar
I watched a good webinar on SSIS Data Cleansing, by Brian Knight and the good folks of Pragmatic Works of Jacksonville FL. "In this session with SQL Server MVP, Brian Knight, you'll learn how to cleanse your data and apply business rules to your data in SSIS. Learn how to solve complex data problems quickly in SSIS using simple techniques in the data flow. Brian will start by showing you the Data Profiling Task. Then he'll show how to use transforms like Fuzzy Grouping to de-duplicate your data and SSIS scripts to satisfy common scenarios he sees in the industry."
Help, my database is corrupt. Now what?
Found a good article on SSC about database corruption by Gail Shaw:
"What to do when the database is corrupt.
- Don't panic
- Don't detach the database
- Don't restart SQL
- Don't just run repair.
- Run an integrity check
- Afterwards, do a root-cause analysis"
Don't have a corrupt database, but still want to play in the sandbox? Click here and here for a couple of ways to corrupt your database. (Don't do this to production data!)
And if all else fails, there's a product called Recovery for SQL Server to help fix your files.
Thursday, October 22, 2009
SQL Server Execution Plans
Here's a link on SSC to a free PDF download of Grant Fitchley's "SQL Server Execution Plans", which is, funnily enough, about how to interpret and act upon an execution plan in SQL Server. "Every day, out in the various SQL Server forums, the same types of questions come up again and again: why is this query running slow? Why isn't my index getting used? And on and on. In order to arrive at the answer you have to ask the same return question in each case: have you looked at the execution plan? " – Grant Fritchey
Labels:
DBA,
Query Performance,
SSC
Tuesday, October 20, 2009
Kimball University: Six Key Decisions for ETL Architectures
SSC posted a link to a good article titled "Kimball University: Six Key Decisions for ETL Architectures", by Bob Becker. Although written for directors/managers, I think developers will also find many of the points useful to understanding the context of an ETL project.
Monday, October 19, 2009
The ‘Subscription List’ SQL Problem
SQL ServerCentral now has a "Stack-Overflow"-type forum for SQL questions, ask.sqlservercentral.com. Phil Factor posted an interesting problem here that asks developers to post solutions to a report on a subscription list.
Celko on Stats
Click here for a really good article by Joe Celko on statistics. He discusses the difference between causation and correlation, and how to compute such things in SQL.
T-SQL Challenge #15
T-SQL Challenge #15 is out. It requires the use of PIVOT; an example can be found on MSDN at the bottom of this entry.
Subscribe to:
Posts (Atom)