Category Archives: SQL

Concatenating strings in SQL Server, or undefined behaviour by design

We just ran into a funny problem here, using a “tried and true” technique in SQL Server to concatenate strings. I use the quotes advisedly. This technique is often suggested on blogs and sites such as Stack Overflow, but we found out (by painful experience) that it is not to be relied on.

Update, 9 Mar 2016: Bruce Gordon from Webucator has turned this into a great little 5 minute video. Thanks Bruce! I don’t know anything much about Webucator, but they are doing some good stuff with creating well-attributed videos about blog posts such as this one and apparently they do SQL Server training.

The problem

So, given the following setup:

CREATE TABLE BadConcat (
  BadConcatID INT NOT NULL,
  Description NVARCHAR(100) NOT NULL,
  SortIndex INT NOT NULL
  CONSTRAINT PK_BadConcat PRIMARY KEY CLUSTERED (BadConcatID)
)
GO

INSERT BadConcat 
  SELECT 1, 'First Item', 1 union all
  SELECT 2, 'Second Item', 2 union all
  SELECT 3, 'Third Item', 3
GO

We need to concatenate those Descriptions. I have avoided fine tuning such as dropping the final comma or handling NULLs for the purpose of this example. This example shows one of the most commonly given answers to the problem:

DECLARE @Summary NVARCHAR(100) = ''

SELECT @Summary = @Summary + ec.Description + ', '
FROM BadConcat ec
ORDER BY ec.SortIndex 

PRINT @Summary

And we get the following:

First Item, Second Item, Third Item, 

And that works fine. However, if we want to include a WHERE clause, even if that clause still selects everything, then we suddenly get something weird:

SET @Summary = ''

SELECT @Summary = @Summary + ec.Description + ', '
FROM BadConcat ec
WHERE ec.BadConcatID in (1,2,3)
ORDER BY ec.SortIndex 

PRINT @Summary

Now we get the following:

Third Item, 

What? What has SQL Server done? What’s happened to the first two items?

You’ll probably do what we did, which is to go through and make sure that you are selecting everything properly, which we are, and eventually come to the conclusion that “there must be a bug in SQL Server”.

The answer

It turns out that this iterative concatenation is unsupported functionality. Microsoft Knowledge Base article 287515 states:

You may encounter unexpected results when you apply any operators or expressions to the ORDER BY clause of aggregate concatenation queries.

Now, at first glance that does not directly apply. But we can extrapolate from that, as Microsoft developer support have done, in response to a bug report on SQL Server, to learn that:

The variable assignment with SELECT statement is a proprietary syntax (T-SQL only) where the behavior is undefined or plan dependent if multiple rows are produced

And again, in response to another bug report:

Using assignment operations (concatenation in this example) in queries with ORDER BY clause has undefined behavior. This can change from release to release or even within a particular server version due to changes in the query plan. You cannot rely on this behavior even if there are workarounds.

Some alternative solutions are given, also, in that second report:

The ONLY guaranteed mechanism are the following:

1. Use cursor to loop through the rows in specific order and concatenate the values
2. Use for xml query with ORDER BY to generate the concatenated values
3. Use CLR aggregate (this will not work with ORDER BY clause)

And the article “Concatenating Row Values in Transact-SQL” by Anith Sen goes through some of those solutions in detail. Sadly, none of them are as clean or as easy to understand as that original example.

Another example is given on Stack Overflow, which details how to safely use XML PATH to concatenate, without breaking on the XML special characters &, < and >. Applying that example into my problem code given above, we should use the following:

SELECT @Summary = (
  SELECT ec.Description + ', ' 
  FROM BadConcat ec 
  WHERE ec.BadConcatID in (1,2,3)
  ORDER BY ec.SortIndex 
  FOR XML PATH(''), TYPE
).value('.','varchar(max)')

PRINT @Summary

Voilà.

First Item, Second Item, Third Item, 

RIGHTeously tripping over T-SQL’s LEN function

We tripped over recently on our understanding of Microsoft SQL Server’s T-SQL LEN function.  The following conversation encapsulates in a nutshell what any sensible developer would assume about the function.

@marcdurdin I guess the answer is, removing the line would give the same result. Right? 🙂

— S Krupa Shankar (@tamil) April 23, 2013

Now, I wouldn’t be writing this blog post if that was the whole story.  Because, like so many of these things, it’s not quite that simple.

Originally, we were using RIGHT(string, LEN(string) – 2) to trim two characters off the front of our string.  Or something like that.  Perfectly legitimate and reasonable, one would think.  But we were getting strange results, trimming more characters than we expected.

It turns out that T-SQL’s LEN function does not return the length of the string.  It returns the length of the string, excluding trailing blanks.  That is, excluding U+0020, 0x20, 32, ‘ ‘, or whatever you want to call it.  But not tabs, new lines, zero width spaces, breaking or otherwise, or any other Unicode space category characters.  Just that one character.  This no doubt comes from the history of the CHAR(n) type, where fields were always padded out to n characters with spaces.  But today, this is not a helpful feature.

Of course, RIGHT(string) does not ignore trailing blanks

But here’s where it gets really fun.  Because a post about strings is never complete until you’ve done it in Unicode.  Some pundits suggest using DATALENGTH to get the actual length of the string.  This returns the length in bytes, not characters (remember that NCHAR is UTF-16, so 2 bytes per character… sorta!).  Starting with SQL Server 2012, UTF-16 supplementary pairs can be treated as a single character, with the _SC collations, so you can’t even use DATALENGTH*2 to get the real length of the string!

OK.  So how do you get the length of a string, blanks included, now that we’ve established that we can’t use DATALENGTH?  Here’s one simple way:

  SET @realLength = LEN(@str + ‘.’) – 1

Just to really do your head in, let’s see what happens when you use LEN with a bunch of garden variety SQL strings.  I should warn you that this is a display artefact involving the decidedly unrecommended use of NUL characters, and no more, but the weird side effects are too fun to ignore.

First, here’s a set of SQL queries for our default collation (Latin1_General_CI_AS):

Note the output.  Not exactly intuitive!  Now we run the Unicode version:

Not quite as many gotchas in that one?  Or are there?  Notice how the first line shows a wide space for the three NUL characters — but not quite as wide as the Ideographic space…

Now here’s where things go really weird.  Running in Microsoft’s SQL Server Management Studio, I switch back to the first window, and, I must emphasise, without running any queries, or making any changes to settings, take a look at how the Messages tab appears now!

That’s right, the first line in the messages has magically changed!  The only thing I can surmise is that switching one window into Unicode output has affected the whole program’s treatment of NUL characters.  Let that be a warning to you (and I haven’t even mentioned consuming said NULs in C programs).

A story, or will pay anybody five pounds to remove database from one computer to another

Today I needed to download a recent backup of a database from a server in the data centre for testing major changes to the database locally…  Things don’t really go as well as I expect:

  1. Backup is 12GB.  Yike, that’s going to take ages to download!
  2. OK, so let’s zip it.  20 minutes later… we have a 2GB zip file.  Well, that was worth doing.
  3. Now copy (encrypted) zip to a server with HTTP access (much faster than downloading over SSH).  About 1 minute (yep, nice fast network in the data centre)
  4. Download zip to database server on local network.  Takes about 45 minutes.
  5. So try to unzip on server with Win Server 2003 Zip tool.  Huh.  It doesn’t support files > 4GB (plus it’s running low on disk space, so a bit of a juggling act to have enough space to unzip anyway).
  6. OK, move the zip file to a client machine.  5 minutes on LAN
  7. Unzip locally then move 12GB backup back to server.  Best part of an hour.  Yes, it’s a slow client machine and the LAN is only 100mbit.
  8. Try to restore to SQL Server.  Oops, old version of SQL Server on that server.
  9. So install SQL Server 2008 on another server.  That takes 45 minutes (including adding .NET framework).
  10. Can’t do anything with it until we have patched SQL Server.  So download and run SQL Server 2008 SP3.  20 minutes to download.
  11. Installer starts while I’m not at computer, automatically cancels when screensaver starts (consent.exe)
  12. Drat.  Download SQL Server 2008 SP3 again as IE has deleted installer from cache.  This time I save the service pack installer.  20 minutes.
  13. Finally! Install SQL Server 2008 SP3.  10 minutes.
  14. Restore database to new server.  5 minutes.

And finally it works.

Why do the little things take so much energy?

UPDATE STATISTICS and hints in SQL Server

I was working today on a slow query in SQL Server — it was a simple query on a well-indexed table, and I
could not initially see why it would be so slow. So I did some tests, and was surprised by the results.

I have reproduced the situation I was working through with a test table, with the following structure:

CREATE TABLE HintTest (
 HintTestID INT NOT NULL IDENTITY(1,1),
 Col1 INT NULL,
 Col2 INT NULL,
 Col3 INT NULL,
 CONSTRAINT PK_HintTest PRIMARY KEY (HintTestID)
);

CREATE NONCLUSTERED INDEX HintTest_Col1 ON HintTest (Col1)
-- column 2 was not indexed
CREATE NONCLUSTERED INDEX HintTest_Col3 ON HintTest (Col3)

Then I copied a set of records from the original data I was working with. The following code will
not reproduce this (and you’ll see the reason later) but does very roughly mimic the distribution of the data:

I used a quick and dirty test harness:

DECLARE @v INT, @StartTime DATETIME
SET @StartTime = GETDATE()
SET @v = 0

DECLARE @prmRecordFound BIT, @prmCol1 INT, @prmCol2 INT, @prmCol3 INT

SET @prmCol1 = 50
SET @prmCol2 = 3
SET @prmCol3 = 1750

WHILE @v &lt; 10000
BEGIN
 SET @v = @v + 1

 -- Test Case Here
END

PRINT DATEDIFF(ms, @StartTime, GETDATE())

And here are the tests I wrote:

  -------------------------------------------
  -- TEST CASE 1: SELECT primary key and separate IF
  -------------------------------------------

  SET @prmRecordFound = 0

  DECLARE @HintTestID INT

  SELECT TOP 1 @HintTestID = HintTestID
  FROM
    HintTest
  WHERE
    Col1 = @prmCol1 AND
    Col2 = @prmCol2 AND
    Col3 = @prmCol3

  IF @HintTestID IS NOT NULL
    SET @prmRecordFound = 1

  -------------------------------------------
  -- TEST CASE 2: SELECT COUNT and separate IF
  -------------------------------------------

  SET @prmRecordFound = 0

  DECLARE @Count INT

  SELECT @Count = COUNT(*)
  FROM
    HintTest
  WHERE
    Col1 = @prmCol1 AND
    Col2 = @prmCol2 AND
    Col3 = @prmCol3

  IF @Count &gt; 0
    SET @prmRecordFound = 1

  -------------------------------------------
  -- TEST CASE 3: SELECT COUNT nested in IF
  -------------------------------------------

  SET @prmRecordFound = 0

  IF (SELECT COUNT(*)
  FROM
    HintTest
  WHERE
    Col1 = @prmCol1 AND
    Col2 = @prmCol2 AND
    Col3 = @prmCol3) &gt; 0
    SET @prmRecordFound = 1

  -------------------------------------------
  -- TEST CASE 4: SELECT COUNT with hint nest in IF
  -------------------------------------------

  SET @prmRecordFound = 0

  IF (SELECT COUNT(*)
  FROM
    HintTest WITH(INDEX(HintTest_Col1, HintTest_Col23))
  WHERE
    Col1 = @prmCol1 AND
    Col2 = @prmCol2 AND
    Col3 = @prmCol3) &gt; 0
    SET @prmRecordFound = 1

  -------------------------------------------
  -- TEST CASE 5: EXISTS SELECT * in IF
  -------------------------------------------

  SET @prmRecordFound = 0

  DECLARE @Count INT

  IF EXISTS(SELECT *
  FROM
    HintTest
  WHERE
    Col1 = @prmCol1 AND
    Col2 = @prmCol2 AND
    Col3 = @prmCol3)
    SET @prmRecordFound = 1

  -------------------------------------------
  -- TEST CASE 6: EXISTS SELECT * with hint in IF
  -------------------------------------------

  SET @prmRecordFound = 0

  DECLARE @Count INT

  IF EXISTS(SELECT *
  FROM
    HintTest WITH(INDEX(HintTest_Col1, HintTest_Col23))
  WHERE
    Col1 = @prmCol1 AND
    Col2 = @prmCol2 AND
    Col3 = @prmCol3)
    SET @prmRecordFound = 1

The first run results reproduced the situation quite well, and returned the following surprising statistics (10,000 iterations):

1 (SELECT primary key and separate IF)  846 ms
2 (SELECT COUNT and separate IF)        203 ms
3 (SELECT COUNT nested in IF)           3523 ms
4 (SELECT COUNT with hint nested in IF) 226 ms
5 (EXISTS SELECT * in IF)               3460 ms
6 (EXISTS SELECT * with hint in IF)     263 ms

I was puzzled why there would be such a difference between cases 2 and 3, given that they were so similar,
so I looked at the execution plan for the query. It turns out that the query optimizer was selecting a very
non-optimal plan (a clustered index scan) when the SELECT statement was nested in an IF statement, whereas for
case 2 it was using two index seeks followed by a merge join. Adding hints overrode the query optimizer, and
the results can be seen in cases 4 and 6.

I ran the following statement to refresh the data the query optimizer uses:

UPDATE STATISTICS HintTest

UPDATE STATISTICS will update information about the distribution of keys for indexes on the table, which is then used by the query optimizer. Then I re-ran the test (100,000 iterations shown below). Dramatic difference. Now the hinted queries were among the slowest:

1 (SELECT primary key and separate IF)  2266 ms
2 (SELECT COUNT and separate IF)        2313 ms
3 (SELECT COUNT nested in IF)           2500 ms
4 (SELECT COUNT with hint nested in IF) 2546 ms
5 (EXISTS SELECT * in IF)               2656 ms
6 (EXISTS SELECT * with hint in IF)     2706 ms

For me, the big lesson learned out of this was this:

Always run UPDATE STATISTICS before trying to optimize a query!

The second thing I learned was that the fastest query is often unexpected. I would have expected the
EXISTS condition (case 5) to be optimal for a simple boolean result but instead the fastest query
was consistently case 1 – the SELECT primary key method.