Category Archives: Unicode

RIGHTeously tripping over T-SQL’s LEN function

We tripped over recently on our understanding of Microsoft SQL Server’s T-SQL LEN function.  The following conversation encapsulates in a nutshell what any sensible developer would assume about the function.

@marcdurdin I guess the answer is, removing the line would give the same result. Right? 🙂

— S Krupa Shankar (@tamil) April 23, 2013

Now, I wouldn’t be writing this blog post if that was the whole story.  Because, like so many of these things, it’s not quite that simple.

Originally, we were using RIGHT(string, LEN(string) – 2) to trim two characters off the front of our string.  Or something like that.  Perfectly legitimate and reasonable, one would think.  But we were getting strange results, trimming more characters than we expected.

It turns out that T-SQL’s LEN function does not return the length of the string.  It returns the length of the string, excluding trailing blanks.  That is, excluding U+0020, 0x20, 32, ‘ ‘, or whatever you want to call it.  But not tabs, new lines, zero width spaces, breaking or otherwise, or any other Unicode space category characters.  Just that one character.  This no doubt comes from the history of the CHAR(n) type, where fields were always padded out to n characters with spaces.  But today, this is not a helpful feature.

Of course, RIGHT(string) does not ignore trailing blanks

But here’s where it gets really fun.  Because a post about strings is never complete until you’ve done it in Unicode.  Some pundits suggest using DATALENGTH to get the actual length of the string.  This returns the length in bytes, not characters (remember that NCHAR is UTF-16, so 2 bytes per character… sorta!).  Starting with SQL Server 2012, UTF-16 supplementary pairs can be treated as a single character, with the _SC collations, so you can’t even use DATALENGTH*2 to get the real length of the string!

OK.  So how do you get the length of a string, blanks included, now that we’ve established that we can’t use DATALENGTH?  Here’s one simple way:

  SET @realLength = LEN(@str + ‘.’) – 1

Just to really do your head in, let’s see what happens when you use LEN with a bunch of garden variety SQL strings.  I should warn you that this is a display artefact involving the decidedly unrecommended use of NUL characters, and no more, but the weird side effects are too fun to ignore.

First, here’s a set of SQL queries for our default collation (Latin1_General_CI_AS):

Note the output.  Not exactly intuitive!  Now we run the Unicode version:

Not quite as many gotchas in that one?  Or are there?  Notice how the first line shows a wide space for the three NUL characters — but not quite as wide as the Ideographic space…

Now here’s where things go really weird.  Running in Microsoft’s SQL Server Management Studio, I switch back to the first window, and, I must emphasise, without running any queries, or making any changes to settings, take a look at how the Messages tab appears now!

That’s right, the first line in the messages has magically changed!  The only thing I can surmise is that switching one window into Unicode output has affected the whole program’s treatment of NUL characters.  Let that be a warning to you (and I haven’t even mentioned consuming said NULs in C programs).

Mixing RTL and LTR: Plaintext vs HTML

Just a very short post today.  We still have some way to go in mixing RTL and LTR text.  For example, the following image, snipped from Outlook 2010, shows the issue:

The subject and body both say the same thing, but the display order is different.  Do you know why?  It’s because the subject is plain text and is assuming that the text is primarily right-to-left, whereas the body is HTML, and is assuming that the text is primarily left-to-right.

Note how the full stop in the subject appears to the left of the English text.  This is because the display renderer has assumed that the whole run of text is right-to-left, so punctuation is treated as right-to-left, and so displays after (in a right-to-left sense) the text.

The question is, of course, how do you determine directionality given an arbitrary plain text string?  It’s not really possible to do so reliably in the absence of other metadata.  The W3C article on directionality is helpful here: http://www.w3.org/TR/html4/struct/dirlang.html

Another view of the message:

Interestingly, Outlook Web Access does not do this, because its UI takes its directionality from the base HTML document:

Indy, TIdURI.PathEncode, URLEncode and ParamsEncode and more

Frequently in Delphi we come across the need to encode a string to stuff into a URL query string parameter (as per web forms).  One would expect that Indy contains well-tested functions to handle this.  Well, Indy contains some functions to help with this, but they may not work quite as you expect.  In fact, they may not be much use at all.

Indy contains a component called TIdURI.  It contains, among other things, the member functions URLEncode, PathEncode, and ParamsEncode. At first glance, these seem to do what you would need.  But in fact, they don’t.

URLEncode will take a full URL, split it into path, document and query components, encode each of those, and return the full string.  PathEncode is intended to handle the nuances of the path and document components of the URL, and ParamsEncode handles query strings.

Sounds great, right?  Well, it works until you have a query parameter that has an ampersand (&) in it.  Say my beloved end user want to search for big&little.  It seems that you could pass the following in:

s := TIdURI.URLEncode('http://www.google.com/search?q='+SearchText);

But then we get no change in our result:

s = 'http://www.google.com/search?q=big&little';

And you can already see the problem: little is now a separate parameter in the query string.  How can we work around this?  Can we pre-encode ampersand to %26 before you pass in the parameters?

s := TIdURI.URLEncode('http://www.google.com/search?q='+ReplaceStr(SearchText, '&', '%26'));

No:

s = 'http://www.google.com/search?q=big%25%26little';

And obviously we can’t do it ourselves afterwards, because we too won’t know which ampersands are which.  You could do correction of ampersand by encoding each parameter component separately and then post-processing the component for ampersand and other characters before final assembly using ParamsEncode. But you’ll soon find that it’s not enough anyway.  =, / and ? are also not encoded, although they should be.  Finally, URLEncode does not support internationalized domain names (IDN).

Given that these functions are not a complete solution, it’s probably best to avoid them altogether.

The problem is analogous to the Javascript encodeURI vs encodeURIComponent issue.

So to write your own…  I haven’t found a good Delphi solution online (and I searched a bit), so here’s a function I’ve cobbled together (use at your own risk!) to encode parameter names and values. You do need to encode each component of the parameter string separately, of course.

function EncodeURIComponent(const ASrc: string): UTF8String;
const
  HexMap: UTF8String = '0123456789ABCDEF';

  function IsSafeChar(ch: Integer): Boolean;
  begin
    if (ch >= 48) and (ch <= 57) then Result := True // 0-9
    else if (ch >= 65) and (ch <= 90) then Result := True // A-Z
    else if (ch >= 97) and (ch <= 122) then Result := True // a-z
    else if (ch = 33) then Result := True // !
    else if (ch >= 39) and (ch <= 42) then Result := True // '()* 
    else if (ch >= 45) and (ch <= 46) then Result := True // -.
    else if (ch = 95) then Result := True // _
    else if (ch = 126) then Result := True // ~
    else Result := False;
  end;
var
  I, J: Integer;
  ASrcUTF8: UTF8String;
begin
  Result := '';    {Do not Localize}

  ASrcUTF8 := UTF8Encode(ASrc);
  // UTF8Encode call not strictly necessary but
  // prevents implicit conversion warning

  I := 1; J := 1;
  SetLength(Result, Length(ASrcUTF8) * 3); // space to %xx encode every byte
  while I <= Length(ASrcUTF8) do
  begin
    if IsSafeChar(Ord(ASrcUTF8[I])) then
    begin
      Result[J] := ASrcUTF8[I];
      Inc(J);
    end
    else
    begin
      Result[J] := '%';
      Result[J+1] := HexMap[(Ord(ASrcUTF8[I]) shr 4) + 1];
      Result[J+2] := HexMap[(Ord(ASrcUTF8[I]) and 15) + 1];
      Inc(J,3);
    end;
    Inc(I);
  end;

  SetLength(Result, J-1);
end;

To use this, do something like the following:

function GetAURL(const param, value: string): UTF8String;
begin
  Result := 'http://www.example.com/search?'+
    EncodeURIComponent(param)+
    '='+
    EncodeURIComponent(value);
end;

Hope this helps. Sorry, I haven’t got an IDN solution in this post!

Updated 15 Nov 2018: Fixed bug with handling of space (should output %20, not +).

Updates to kmwString functions

A couple of weeks ago I published on this blog a set of functions for handling supplementary characters in JavaScript.  We have since fixed a few bugs in the code and have now made the kmwString script available on github at https://github.com/tavultesoft/keyman-tools

This update:

  • Fixes an error which could occur when only 1 parameter was passed to kmwSubstring and that parameter was negative
  • Fixed an incorrect NaN return value for kmwCodePointToCodeUnit when passed an offset at the very end of a string
  • Adds kmwCodeUnitToCodePoint function
  • Adds kmwCharAt function — which is basically just kmwSubstr(index,1)

Again, suggestions, comments, fixes and flames all gratefully received.

A more UTF-32 aware JavaScript String library

One of the hassles I regularly experience with JavaScript is that it does not have native support for supplementary characters.  Internally, JavaScript uses UCS-2 encoding (unlike most of the rest of the web, which uses UTF-8…)  While you can use surrogate pairs to represent Unicode characters between U+10000 and U+10FFFF, this makes string handling with these characters a pain.  In particular, functions such as indexOf and substr have to be very carefully used, both to account for the surrogate pairs in their index parameters, and to avoid cutting them in half when manipulating the string. Of course, when interfacing with third party services, you will need to be aware of how they handle text.  For instance, the Twitter 140 character limit counts Unicode code points, not UCS-2 code units.  But many other products, (for example, Microsoft SQL Server), use UTF-16 or UCS-2 internally and treat supplementary plane characters as 2 code units for the purposes of calculating field size.  Developer beware! Anyway, I have put together a small set of functions that treat surrogate pairs as a single code point, abstracting away surrogate pairs at the basic String level.  Adding support for surrogate pairs is not a complete solution — I haven’t done any work on regular expressions, for example, and this code also does not begin to address more complex requirements around grapheme clusters or normalisation, but this is just one less complexity to worry about. The functions do not replace any existing String functions. This code is not complete: I am missing some boundary conditions and edge cases, and I haven’t yet tested with isolated surrogate code units — but for what it’s worth, here it is.  The kmw prefix refers to KeymanWeb, which will shortly be using the functions (replacing the mishmash of code we currently use…) Some simple examples:

var str="Brave "+String.kmwFromCharCode(0x13027, 0x1314C, 0x1309C)+" world";

alert(str.indexOf("w"));    // Displays 13
alert(str.kmwIndexOf("w")); // Displays 10

alert(str.length);       // Displays 18
alert(str.kmwLength());  // Displays 15

alert(str.kmwSubstr(4,3));  // Displays e U+13027
alert(str.substr(4,3));  // Displays e U+D80C (half a supplementary pair!)

The license on this code is Mozilla Public License 1.1.  A couple of functions were lifted from the proposal Supplementary Characters for ECMAScript by Norbert Lindenberg and tweaked (back) to more closely mimic the functions they replace, warts and all.  These two functions are probably better tested than my ones! Version 1.0 of this library plus a rudimentary test script can be downloaded from http://durdin.net/blog-files/kmwString-0.1.zip. Comments, bug fixes, flames, suggestions much appreciated!

/**
  @preserve (C) 2012 Tavultesoft Pty Ltd
  
  Adds functions to treat supplementary plane characters in the same 
  way as basic multilingual plane characters in JavaScript.
  
  Version 0.1
  
  License
  
  The contents of this file are subject to the Mozilla Public License
  Version 1.1 (the "License"); you may not use this file except in
  compliance with the License. You may obtain a copy of the License at
  http://www.mozilla.org/MPL/

  Software distributed under the License is distributed on an "AS IS"
  basis, WITHOUT WARRANTY OF ANY KIND, either express or implied. See the
  License for the specific language governing rights and limitations
  under the License.

  The Original Code is (C) 2012 Tavultesoft Pty Ltd.

  The Initial Developer of the Original Code is Tavultesoft.
*/

/**
 * Constructs a string from one or more Unicode character codepoint values 
 * passed as integer parameters.
 * 
 * @param  {integer} cp0,...   1 or more Unicode codepoints, e.g. 0x0065, 0x10000
 * @return {String}            The new String object.
 */
String.kmwFromCharCode = function() {
  var chars = [], i;
  for (i = 0; i < arguments.length; i++) {
    var c = Number(arguments[i]);
	if (!isFinite(c) || c < 0 || c > 0x10FFFF || Math.floor(c) !== c) {
	  throw new RangeError("Invalid code point " + c);
	}
	if (c < 0x10000) {
	  chars.push(c);
	} else {
	  c -= 0x10000;
	  chars.push((c >> 10) + 0xD800);
	  chars.push((c % 0x400) + 0xDC00);
	}
  }
  return String.fromCharCode.apply(undefined, chars);
}

/**
 * Returns a number indicating the Unicode value of the character at the given 
 * code point index, with support for supplementary plane characters.
 * 
 * @param  {integer} codePointIndex  The code point index into the string (not 
                                     the code unit index) to return
 * @return {integer}                 The Unicode character value
 */
String.prototype.kmwCharCodeAt = function(codePointIndex) {
  var str = String(this);
  var codeUnitIndex = 0;
  
  if (codePointIndex < 0 || codePointIndex  >= str.length) {
    return NaN;
  }

  for(var i = 0; i < codePointIndex; i++) {
    codeUnitIndex = str.kmwNextChar(codeUnitIndex);
	if(codeUnitIndex == undefined) return NaN;
  }
  
  var first = str.charCodeAt(codeUnitIndex);
  if (first >= 0xD800 && first <= 0xDBFF && str.length > codeUnitIndex + 1) {
    var second = str.charCodeAt(codeUnitIndex + 1);
	if (second >= 0xDC00 && second <= 0xDFFF) {
	  return ((first - 0xD800) << 10) + (second - 0xDC00) + 0x10000;
	}
  }
  return first;  
}

/**
 * Returns the code point index within the calling String object of the first occurrence
 * of the specified value, or -1 if not found.
 * 
 * @param  {string}  searchValue    The value to search for
 * @param  {integer} fromIndex      Optional code point index to start searching from
 * @return {integer}                The code point index of the specified search value
 */
String.prototype.kmwIndexOf = function(searchValue, fromIndex) {
  var str = String(this);
  var codeUnitIndex = str.indexOf(searchValue, fromIndex);
  
  if(codeUnitIndex < 0) {
    return codeUnitIndex;
  }
  
  var codePointIndex = 0;
  for(var i = 0; i < codeUnitIndex; i = str.kmwNextChar(i), codePointIndex++);
  return codePointIndex;
}

/**
 * Returns the code point index within the calling String object of the last occurrence 
 * of the specified value, or -1 if not found.
 * 
 * @param  {string}  searchValue    The value to search for
 * @param  {integer} fromIndex      Optional code point index to start searching from
 * @return {integer}                The code point index of the specified search value
 */
String.prototype.kmwLastIndexOf = function(searchValue, fromIndex)
{
  var str = String(this);
  var codeUnitIndex = str.lastIndexOf(searchValue, fromIndex);
  
  if(codeUnitIndex < 0) {
    return codeUnitIndex;
  }
  
  var codePointIndex = 0;
  for(var i = 0; i < codeUnitIndex; i = str.kmwNextChar(i), codePointIndex++);
  return codePointIndex;
}

/**
 * Returns the length of the string in code points, as opposed to code units.
 * 
 * @return {integer}                The length of the string in code points
 */
String.prototype.kmwLength = function() {
  var str = String(this);
  
  if(str.length == 0) {
    return 0;
  }
  
  for(var i = 0, codeUnitIndex = 0; codeUnitIndex != undefined; i++, 
    codeUnitIndex = str.kmwNextChar(codeUnitIndex));
  return i;
}

/**
 * Extracts a section of a string and returns a new string.
 * 
 * @param  {integer} beginSlice    The start code point index in the string to 
 *                                 extract from
 * @param  {integer} endSlice      Optional end code point index in the string
 *                                 to extract to
 * @return {string}                The substring as selected by beginSlice and
 *                                 endSlice
 */
String.prototype.kmwSlice = function(beginSlice, endSlice) {
  var str = String(this);
  var beginSliceCodeUnit = str.kmwCodePointToCodeUnit(beginSlice);
  var endSliceCodeUnit = str.kmwCodePointToCodeUnit(endSlice);
  return str.slice(beginSliceCodeUnit, endSliceCodeUnit);
}

/**
 * Returns the characters in a string beginning at the specified location through
 * the specified number of characters.
 * 
 * @param  {integer} start         The start code point index in the string to 
 *                                 extract from
 * @param  {integer} length        Optional length to extract
 * @return {string}                The substring as selected by start and length
 */
String.prototype.kmwSubstr = function(start, length)
{
  var str = String(this);
  if(start < 0)
  {
    start = str.kmwLength() + start;
	if(start < 0) {
	  start = 0;
	}
  }
  var startCodeUnit = str.kmwCodePointToCodeUnit(start);
  var endCodeUnit = startCodeUnit;
  
  if(length == undefined) {
    endCodeUnit = str.length;
  } else {
    for(var i = 0; i < length; i++, endCodeUnit = str.kmwNextChar(endCodeUnit));
  }

  return str.substring(startCodeUnit, endCodeUnit);
}

/**
 * Returns the characters in a string between two indexes into the string.
 * 
 * @param  {integer} indexA        The start code point index in the string to 
 *                                 extract from
 * @param  {integer} indexB        The end code point index in the string to 
 *                                 extract to
 * @return {string}                The substring as selected by indexA and indexB
 */
String.prototype.kmwSubstring = function(indexA, indexB)
{
  var str = String(this);
  
  if(indexA > indexB) { var c = indexA; indexA = indexB; indexB = c; }
  
  var indexACodeUnit = str.kmwCodePointToCodeUnit(indexA);
  var indexBCodeUnit = str.kmwCodePointToCodeUnit(indexB);
  if(isNaN(indexBCodeUnit)) indexBCodeUnit = str.length;

  return str.substring(indexACodeUnit, indexBCodeUnit);
}

/*
  Helper functions
*/

/**
 * Returns the code unit index for the next code point in the string, accounting for
 * supplementary pairs 
 *
 * @param  {integer} codeUnitIndex   The code unit position to increment
 * @return {integer}                 The index of the next code point in the string,
 *                                   in code units
*/
String.prototype.kmwNextChar = function(codeUnitIndex) {
  var str = String(this);
  
  if(codeUnitIndex < 0 || codeUnitIndex >= str.length - 1) {
    return undefined;
  }
  
  var first = str.charCodeAt(codeUnitIndex);
  if (first >= 0xD800 && first <= 0xDBFF && str.length > codeUnitIndex + 1) {
    var second = str.charCodeAt(codeUnitIndex + 1);
	if (second >= 0xDC00 && second <= 0xDFFF) {
	  if(codeUnitIndex == str.length - 2) {
	    return undefined;
	  }
	  return codeUnitIndex + 2;
	}
  }
  return codeUnitIndex + 1;
}

/**
 * Returns the code unit index for the previous code point in the string, accounting
 * for supplementary pairs 
 *
 * @param  {integer} codeUnitIndex   The code unit position to decrement
 * @return {integer}                 The index of the previous code point in the
 *                                   string, in code units
*/
String.prototype.kmwPrevChar = function(codeUnitIndex) {
  var str = String(this);

  if(codeUnitIndex <= 0 || codeUnitIndex > str.length) {
    return undefined;
  }
  
  var second = str.charCodeAt(codeUnitIndex - 1);
  if (second >= 0xDC00 && first <= 0xDFFF && codeUnitIndex > 1) {
    var first = str.charCodeAt(codeUnitIndex - 2);
	if (first >= 0xD800 && second <= 0xDBFF) {
	  return codeUnitIndex - 2;
	}
  }
  return codeUnitIndex - 1;
}

/**
 * Returns the corresponding code unit index to the code point index passed
 *
 * @param  {integer} codePointIndex  A code point index in the string
 * @return {integer}                 The corresponding code unit index
*/
String.prototype.kmwCodePointToCodeUnit = function(codePointIndex) {
  var str = String(this);
  
  var codeUnitIndex = 0;

  if(codePointIndex < 0) {
    codeUnitIndex = str.length;
    for(var i = 0; i > codePointIndex; i--, 
	  codeUnitIndex = str.kmwPrevChar(codeUnitIndex));	
    return codeUnitIndex;
  }

  for(var i = 0; i < codePointIndex; i++,
    codeUnitIndex = str.kmwNextChar(codeUnitIndex));
  return codeUnitIndex;
}

Updated 11 May 2012: Removed script formatting code as site hosting it was unreliable. Back to good old <pre> for now.
Updated 29 May 2014: Fixed broken script text, damaged when inserted previously.