Category Archives: Unicode

Notes on a Khmer mobile keyboard for Keyman

While other Khmer keyboards exist for iOS and Android, I wanted to try playing with one myself, given I am currently learning Khmer. Creating a keyboard layout is a great way to rapidly become very familiar with a script!

Keyman Developer makes it easy to play around with the layout of a keyboard and rapidly iterate the design. I was able to turn the original desktop layout into a mobile-optimised layout in under two hours, using only the visual editor.

keyman-developer

The base keyboard comes from the khmer10 Keyman keyboard created by Andrew Cunningham, which is based on the NiDA Khmer keyboard layout. This is a desktop keyboard, which follows a phonetic-style input, with letters placed as far as possible on keys with a similar sound in the English alphabet. For example, ក is on the [k] key.

Design principles

I had a number of goals I wanted to achieve with this keyboard.

A design good for a language learner

I wanted the design to help me remember the script, the sounds and related letters. This may not be optimal for a person fluent in the language, but for me, the NiDA layout’s phonetic-style layout was a good starting point.

Reduce the number of keys …

As the original design is for a desktop keyboard, there are too many keys to fit on a normal mobile layout. As a mobile keyboard should ideally have no more than ten keys in a row, I started with reducing that as one goal.

… But don’t lose all relationship with the desktop keyboard

I tried to avoid moving keys around on the keyboard, or removing keys other than the ones described below. This way, once I do become familiar with the mobile layout, it is not a difficult transition to using the NiDA layout on a desktop computer.

Move symbols and numbers off base layers

A number of non-alphabetic symbols are placed in a seemingly haphazard fashion on both the unshifted and Shift layers. The position of these symbols is probably pragmatic – the keys were available and not used for any other purpose. On a touch layout, we don’t need to maintain this because we can have as many or as few keys as we wish.

I moved all non-alphabetic symbols and numbers to a numeric/symbol layer.

symbol-numeric-layer

Relate sub consonants to base consonants

ក្ក  ក្ខ  ក្គ  ក្ឃ  ក្ង  ក្ច  ក្ឆ  ក្ជ  ក្ឈ  ក្ញ  ក្ដ  ក្ឋ  ក្ឌ  ក្ឍ  ក្ណ  ក្ត  ក្ថ  ក្ទ  ក្ធ  ក្ន  ក្ប  ក្ផ  ក្ព  ក្ភ  ក្ម  ក្យ  ក្រ  ក្ល  ក្វ  ក្ឝ  ក្ឞ  ក្ស  ក្ហ  ក្ឡ  ក្អ

The original keyboard uses the key [j] as a prefix to create a sub consonant, by emitting the Unicode U+17D2 sub consonant marker character. This is a little obscure, and meant that the shapes of sub consonants were never visible on the keyboard.

I wanted to hide this encoding-based knowledge. I have added the sub consonants to the keyboard under a longpress menu for each base consonant, and added a rule to delete both the consonant and the prefix U+17D2 marker together when backspace is pressed. Thus, the average keyboard user need never know about the existence of U+17D2.

longpress-sub-consonant

Independent vowels

As these are less frequently typed than the dependent vowels, they really didn’t need their own key on the keyboard. Again, from a language learner point of view, placing the independent vowels under the related dependent vowel symbol, as longpress menus, helps me to learn the shapes more rapidly. It also means that I don’t confuse the independent vowel shapes with similarly shaped consonants on the layout.

Adding missing characters

The Khmer digits were already on the keyboard, but the Hindu-Arabic numerals were not. I added these as long-press under the Khmer digit keys on the numeric/symbol layer.

Things that are not yet right

This keyboard is nowhere near finished, but it’s now at a good point to start using it, before making further optimisations. After using it for a while, I will have a clearer understanding of what is uncomfortable or awkward to use, and will also probably have better ideas of how to resolve the issues.

Some of the issues I already know about are:

  • Still too many keys per row: some rows have 12 keys. I should try to reduce this to 10 keys.
  • There are vowel combinations that may or may not be necessary. These do not render correctly on most operating systems on the keyboard, but do when used in writing.
  • The backspace key should be either on the top row, or on the third row. It’s only on the second row at present because that was where there was space!
  • The keyboard is missing a number of symbols that I probably won’t be using for now, but should be available for a complete solution, such as the Lek Attak numeric divination symbols. Some archaic letters are also not yet present.
  • The Khmer OS fonts do not use the dotted circle for isolated diacritics, which makes them hard to read on the keyboard.
  • The layers are not precisely the same width, which leads to slightly disconcerting movement on layer switches.
  • The shift key on the shift layer is missing its icon.
  • I just realised the independent vowels are currently missing – I need to re-add those soon!

Issues with using the keyboard include:

  • iOS has rendering bugs with Khmer, for example ខ្ញុំ on iOS overlaps the two subscript marks.

khmer-on-ios-khnyom-render-bug

 

  • On Android, there are different rendering issues, for example the font is too large for the keys by default (showing the original keyboard as I don’t have an Android device handy for an up-to-date screenshot).

khmer-on-android-render-bug

Get the keyboard + source

Despite these limitations, the keyboard should be usable for typing most Khmer text today, as least on iOS. It certainly covers most of my needs as a language learner today. As such, I’ve made it available for install into Keyman for iPhone and iPad or Keyman for Android.

First, install the app from the appropriate link above. Then click the link below to add the keyboard to your device. I haven’t included a font in the keyboard, so you’ll need a device which already supports rendering the script.

Install the keyboard

I welcome any feedback, of course!

The source of the keyboard is available on GitHub at https://github.com/mcdurdin/experimental-keyboards.

An update for EncodeURIComponent

Way way back in the dark ages of Delphi XE2, I wrote a function to encode components of a URI. Now, this function has been updated for use on mobile platforms, by Nicolas Dusart, and I quote Nicolas:

I had to make some modifications on it to compile for the mobile platforms, as the strings are 0-based on these platforms.

I also modified it to escape non-ASCII characters using their UTF-8 encoding as the standards advices. For multi-bytes characters, each byte is percent-encoded as usual.

Here’s the code, maybe it could interests you and the future readers of that article 🙂

And here’s Nicolas’s updated function in all its glory:

function EncodeURIComponent(const ASrc: string): string;
const
  HexMap: string = '0123456789ABCDEF';

  function IsSafeChar(ch: Byte): Boolean;
  begin
    if (ch >= 48) and (ch <= 57) then Result := True    // 0-9
    else if (ch >= 65) and (ch <= 90) then Result := True  // A-Z
    else if (ch >= 97) and (ch <= 122) then Result := True  // a-z
    else if (ch = 33) then Result := True // !
    else if (ch >= 39) and (ch <= 42) then Result := True // '()*
    else if (ch >= 45) and (ch <= 46) then Result := True // -.
    else if (ch = 95) then Result := True // _
    else if (ch = 126) then Result := True // ~
    else Result := False;
  end;

var
  I, J: Integer;
  Bytes: TBytes;
begin
  Result := '';
    
  Bytes := TEncoding.UTF8.GetBytes(ASrc);
    
  I := 0;
  J := Low(Result);

  SetLength(Result, Length(Bytes) * 3); // space to %xx encode every byte

  while I < Length(Bytes) do
  begin
    if IsSafeChar(Bytes[I]) then
    begin
      Result[J] := Char(Bytes[I]);
      Inc(J);
    end
    else
    begin
      Result[J] := '%';
      Result[J+1] := HexMap[(Bytes[I] shr 4) + Low(ASrc)];
      Result[J+2] := HexMap[(Bytes[I] and 15) + Low(ASrc)];
      Inc(J,3);
    end;
    Inc(I);
  end;
    
  SetLength(Result, J-Low(ASrc));
end;

Many thanks, Nicolas 🙂

How to render combining marks consistently across platforms: a long story

I read today Richard Ishida’s notes on his experiences displaying combining characters in a consistent way across browsers, and recalled with fondness the pain we have experienced with this, both in browsers and in apps, over many years.

When we design a keyboard layout for Keyman, we will often want to show a diacritic mark or combining character from a complex script by itself on a key cap, or show the mark with a substituted placeholder symbol that is distinct from the script itself, to show how the combining character fits with the base letter.

A commonly used placeholder symbol is ◌ U+25CC DOTTED CIRCLE, but there are others that are preferred for some languages.

U+25CC

I show below three attempts at presenting the combining marks on our Open Source KeymanWeb platform keyboards, running in Chrome on Windows 8.1. The first shows a Lao keyboard, with each of the combining characters presented around a base  U+0E81 LAO LETTER KO on the keyboard. As you can see, this is not wonderful because the Ko Kai letter adds noise and makes it hard to pick out the combining characters.

Lao keyboard on keymanweb.com
Lao keyboard on keymanweb.com

The second keyboard is Thai. This relies on the system renderer and does not include a base character for the combining characters. In this case, that means that no base letter is shown and the marks show reasonably clearly. Unfortunately, this is not reliable across platforms, browsers or languages — it happens to work okay for Thai in Chrome on Windows 8.1.  On some other systems, the combining marks are shifted to the far left of the key cap, or even off the key entirely.

Thai keyboard on keymanweb.com
Thai keyboard on keymanweb.com

The third keyboard example is for Gujarati, and again relies on the system renderer and does not include a base character for the combining characters. Now this time you see that the Uniscribe renderer on Windows 8.1 automatically inserts a dotted circle glyph — this may or may not be the same as the U+25CC character. Again, this behaviour differs across platforms.

Gujarati keyboard on keymanweb.com
Gujarati keyboard on keymanweb.com

The Story Today

The story today is pretty unfortunate. Display of isolated combining marks is not implemented at all consistently across the plethora of rendering engines, operating systems, browsers and fonts. Some systems will automatically insert a placeholder glyph if no conforming base character is found prior to the combining character. Others don’t. What do I mean by a conforming base character? Well, that’s up to the implementation of the renderer. Some renderers will treat any base character that is not in the same script as the combining character as part of a separate run, and won’t attempt to combine. Others will combine with some outside-script characters but not others.

Things get even worse when working with multiple combining characters without a base character. Unless this has been specifically handled in the renderer, you will typically see poor layout or rendering of the characters that may not be representative of how they form above normal base characters.

The image below shows examples of the marks U+0EB5 LAO VOWEL SIGN II, U+0EB5 LAO VOWEL SIGN II + U+0EC8 LAO TONE MAI EK and U+0EB3 LAO VOWEL SIGN AM, rendered with and without U+25CC DOTTED CIRCLE, on various systems. A more comprehensive set of examples is included later in this blog.

Sample Lao combining character renders
Sample Lao combining characters rendered

You can see, in the image above, examples of:

  • Repeated dotted circles
  • Diacritics that don’t align over their base character correctly
  • Rendering errors leading to tofu
  • Diacritics that don’t stack
  • Marks that overlap incorrectly

None of the basic text-only solutions work consistently across all systems.

It’s like stuffing a live octopus into a box

Attempting to address the problem is not fun. Fixing the problem on one browser/renderer/OS/font system invariably creates a new problem on another system. It’s like stuffing a live octopus into a box: each time you manage to get one of those pesky tentacles into the box, two others have snuck out the other side.

A Solution

We worked hard to find a consistent cross-browser solution. We started with Lao, but we believe that the same principles can be applied to other scripts. The solution relies on web fonts, but falls back to an acceptable solution in the shrinking subset of systems where web fonts are not supported.

We started by creating a font that is designed for use specifically with the On Screen Keyboard — it would never be used for presentation or editing. While at first glance this meant we could easily solve the problem by placing all the required marks into the Private Use Area, doing so would mean that on systems where the font was not available, the keyboard would be unusable.

Our final solutionhack uses the OpenType GSUB feature to replace the  U+0E81 LAO LETTER KO with a placeholder mark when, and only when, it is used as a base character for a combining mark. The keyboard presentation code stores the Ko Kai letter as a base character for each combining character (or characters), and at render time the font substitutes its placeholder mark. When a key with this pair of characters is clicked, only the combining mark is displayed.

Lao keyboard with substituted base
Lao keyboard with substituted base

Doing this means we can use the Ko Kai letter by itself on the keyboard (on the D key) without a problem (for example on its own key), and in the case where the special font is not available, the user will see an acceptable fallback with the Ko Kai letter as the base character for combining characters.

It’s worth noting that many renderers require that the placeholder character be the same font and style as the combining characters. Today, this sadly precludes using a different colour for the placeholder character.

Examples

I have collected screenshots from a variety of browsers that illustrate some of the inconsistencies and frustrations, as well as the happily correct rendering. You can also visit the sample page to see how well the rendering works in your system.

The screenshots include three combinations:

  • A simple single U+0EB5 LAO VOWEL SIGN II
  • A pair of diacritic marks U+0EB5 LAO VOWEL SIGN II + U+0EC8 LAO TONE MAI EK
  • A third combining character U+0EB3 LAO VOWEL SIGN AM

The sample shows these three combinations with the default font for the system, a Lao OpenType font “Saysettha Web”, and the special font “SaysetthaX Web” which has our hack in it, along with three different base characters: a hard-coded U+25CC DOTTED CIRCLE, a U+00A0 NO-BREAK SPACE, and a U+0E81 LAO LETTER KO.

Initially, the dotted circle with the Saysettha Web font looks hopeful.  But in the end you will see that only one solution works on all browsers tested: the bottom right cell with the custom font and the Lao Letter Ko. The “hollow x” glyph that we use in this font is similar to the placeholder used in Lao language primers. However, any shape can of course be used here if you are creating your own font.

Windows 8.1 – Internet Explorer

Windows 8.1 - Internet Explorer 11

 

Internet Explorer 11 on Windows 8.1 does surprisingly badly here — worse than earlier versions of Windows.  Notice how the default Lao font (DokChampa) does not combine with dotted circle, nor does it combine the pair of diacritics correctly with no-break space.

Windows 8.1 – Chrome

Windows 8.1 - Chrome

 

While Chrome stacks the pair of diacritics correctly in all cases, the positioning for Dok Champa relative to the base character is incorrect for both dotted circle and no-break space. You can also see how the Saysettha Web and SaysetthaX Web fonts position their diacritics to the left of the cell in the case of the no-break space.

Windows 8.1 – Firefox

Windows 8.1 - Firefox

 

Firefox renderers similarly to Chrome. Note that it appears to be using Lao UI instead of DokChampa for the default font.

Windows 7 – Internet Explorer

Windows 7 - Internet Explorer 9

Windows 7 – Chrome

Windows 7 - Chrome

 

(Font smoothing was not enabled on the remote connection when I captured the screenshot)

Windows 7 – Firefox

Windows 7 - Firefox

(Font smoothing was not enabled on the remote connection when I captured the screenshot)

Windows XP – Firefox

Windows XP - Firefox

 

(Font smoothing was not enabled on the remote connection when I captured the screenshot)

Mac OS X – Safari

Mac OS X - Safari

 

Here we see the Saysettha Web solution with dotted circle, which worked consistently in Windows, fails in Mac OS X. Cross-platform support comes back to haunt us. Get back into that box, O tentacle!

Mac OS X – Chrome

Mac OS X - Chrome

Mac OS X – Firefox

Mac OS X - Firefox

It appears that Firefox is having trouble loading the Saysettha Web font, although the SaysetthaX Web font is loading, so this problem is probably resolvable.

iOS 8.1 – Safari

iOS 8.1 - Safari

 

This is at least consistent with the situation in Safari on Mac OS X 10.8!

 

Windows Phone 8.1 – Internet Explorer

Windows Phone 8.1 - Internet Explorer 11

 

This is similar, but not identical to Internet Explorer 9 on Windows 7. That would be too easy!

 

Android 4.3 – Android Browser

Android - Android Browser

 

Clearly, no Lao font was available on the system, so the web fonts were the only ones that worked.

Android 4.3 – Chrome

Android - Chrome

Clearly, no Lao font was available on the system, so the web fonts were the only ones that worked.

Android 4.3 – Firefox

Android - Firefox

Clearly, no Lao font was available on the system. It also appears that Firefox is having trouble loading the Saysettha Web font, although the SaysetthaX Web font is loading, so this problem is probably resolvable.

Conclusion

In summary, the only solution today for on screen keyboards (or character pickers) that works consistently across all browsers is to create a custom font with what can only be classed as a hack, and supply that font for use within those pickers / on screen keyboards.

I certainly agree with Richard that a more formalised solution based on U+00A0 no-break space or U+25CC dotted circle would be fantastic.

iOS 8 beta 1 — first bug reports

Like every other iOS developer, I have already downloaded and installed XCode 6 and the first beta 8.0 of iOS onto one of my test iDevices. And, like every other IOS developer, I immediately went to go and test one of my apps on the new build. And, unfortunately, as can be expected with a beta, I found a bug. I have dutifully filed a bug report via Apple’s bugreport.apple.com!

Given that bug reports are private, I have opted to make information public here because I have had many, many of my product users ask me about it: the bug first arose with iOS 7.1 and I had hoped that it had been addressed in 8.0. Most of my users are not technical enough to be able to navigate the bugreport.apple.com interface, so their only recourse is to complain to us!

Bug #1: Custom font profiles fail to register and work correctly after device restart

We have developed a number of custom font profiles for various languages, following the documentation on creating font profiles for iOS 7+ at https://developer.apple.com/library/ios/featuredarticles/iPhoneConfigurationProfileRef/iPhoneConfigurationProfileRef.pdf. Each of these profiles exhibits the same problem: after the font profile is installed, the specific language text usually displays in all apps, including Notes, Mail and more. However, as soon as the device is restarted, the font fails to display in any apps. In some cases, residual display of the font continues after the restart, but any edit to the text causes the display to revert to .notdef glyphs or similar.

Amharic text -- square boxes
Amharic text before the font profile is installed — square boxes
Amharic-Text-Notes-Success
Amharic text after the font profile is installed: now readable.  But not for long.

Even before the device is restarted, font display is sometimes inconsistent. For example, if you shutdown mail and restart it, fonts will sometimes display correctly and sometimes incorrectly.

The samples given are using the language Amharic.  The font profile can be installed through my Keyman app, available at http://keyman.com/iphone.

A sample of text in Amharic is ጤና ይስጥልን (U+1324 U+1293 U+0020 U+12ED U+1235 U+1325 U+120D U+1295).  This text displays correctly when the font profile is first installed, in some situations, and always displayed correctly in iOS 7.0.   The issue first arose in iOS 7.1 and has continued into the iOS 8.0 beta.

References:

Bug #2: Touches on fixed elements in Safari are offset vertically

In Safari in iOS 8.0 beta 1, I have found that touching fixed elements often results in a touch which is 200-odd pixels north of the actual location I touch.  No doubt plenty of people will report this one!

Even charset geeks can be fooled by character spoofing

I was preparing a new git repository today for a website, on my Windows machine, and moving a bunch of existing files over for addition.  When I ran git add ., I ran into a weird error:

C:\tavultesoft\website\help.keyman.com> git add .
fatal: unable to stat 'desktop/docs/desktop_images/usage-none.PNG': No such file or directory

How could a file be there — and not there?  I fired up Explorer to find the file and there it was, looked fine.  I’d just copied there, so of course it was there!

usage.png seems to be there just fine

For a moment, I scratched my head, trying to figure out what could be wrong.  The file looked fine.  It was in alphabetical order, so it seemed that the letters were of the correct script.

Being merely a bear of little brain, it took me some time to realise that I could just examine the character codepoints in the filename.  When this finally sunk in, I quickly pulled out my handy charident tool and copied the filename text to the clipboard:

usage-none-selection

And pasted it into the Character Identifier:

usage-none-charident

With a quick scan of the Unicode code points, I quickly noticed that, sure enough, the letter ‘g‘ (highlighted) was not what was expected.  It turns out that U+0261 is LATIN SMALL LETTER SCRIPT G, not quite what was anticipated (U+0067 LATIN SMALL LETTER G).  And in the Windows 8.1 fonts used in Explorer, the ‘ɡ‘ and ‘g‘ characters look identical!

g-g
I checked some of the surrounding files as well.  And looking at usage-help.PNG, I could see no problems with it:

usage-help-charident

So why did git get so confused?  OK, so git is a tool ported from the another world (“Linux”).  It doesn’t quite grok Windows character set conventions for filenames.  This is kinda what it saw when looking at the file (yes, that’s from a dir command):

usa[]e

But then somewhere in the process, a normalisation was done on the original filename, converting ɡ to g, and thus it found a mismatch, and reported a missing usage-none.PNG.

Windows does a similar compatibility normalisation and so confuses the user with seemingly sensible sort orders.  But it doesn’t prevent you from creating two files with visually identical names, thus:

double-usage-none

I’m sure there’s a security issue there somewhere…

Delphi’s TJSONString.ToString is broken, and how to fix it

As per several QC reports, Data.DBXJSON.TJSONString.ToString is still very broken. Which means, for all intents and purposes, TJSONAnything.ToString is also broken. Fortunately, you can just use TJSONAnything.ToBytes for a happy JSON outcome.

The following function will take any Delphi JSON object and convert it to a string:

function JSONToString(obj: TJSONAncestor): string;
var
  bytes: TBytes;
  len: Integer;
begin
  SetLength(bytes, obj.EstimatedByteSize);
  len := obj.ToBytes(bytes, 0);
  Result := TEncoding.ANSI.GetString(bytes, 0, len);
end;

Because TJSONString.ToBytes escapes all characters outside U+0020-U+007F, we can assume that the end result is 7-bit clean, so we can use TEncoding.ANSI.  You could instead stream the TBytes to a file or do other groovy things with it.

RIGHTeously tripping over T-SQL’s LEN function

We tripped over recently on our understanding of Microsoft SQL Server’s T-SQL LEN function.  The following conversation encapsulates in a nutshell what any sensible developer would assume about the function.

@marcdurdin I guess the answer is, removing the line would give the same result. Right? 🙂

— S Krupa Shankar (@tamil) April 23, 2013

Now, I wouldn’t be writing this blog post if that was the whole story.  Because, like so many of these things, it’s not quite that simple.

Originally, we were using RIGHT(string, LEN(string) – 2) to trim two characters off the front of our string.  Or something like that.  Perfectly legitimate and reasonable, one would think.  But we were getting strange results, trimming more characters than we expected.

It turns out that T-SQL’s LEN function does not return the length of the string.  It returns the length of the string, excluding trailing blanks.  That is, excluding U+0020, 0x20, 32, ‘ ‘, or whatever you want to call it.  But not tabs, new lines, zero width spaces, breaking or otherwise, or any other Unicode space category characters.  Just that one character.  This no doubt comes from the history of the CHAR(n) type, where fields were always padded out to n characters with spaces.  But today, this is not a helpful feature.

Of course, RIGHT(string) does not ignore trailing blanks

But here’s where it gets really fun.  Because a post about strings is never complete until you’ve done it in Unicode.  Some pundits suggest using DATALENGTH to get the actual length of the string.  This returns the length in bytes, not characters (remember that NCHAR is UTF-16, so 2 bytes per character… sorta!).  Starting with SQL Server 2012, UTF-16 supplementary pairs can be treated as a single character, with the _SC collations, so you can’t even use DATALENGTH*2 to get the real length of the string!

OK.  So how do you get the length of a string, blanks included, now that we’ve established that we can’t use DATALENGTH?  Here’s one simple way:

  SET @realLength = LEN(@str + ‘.’) – 1

Just to really do your head in, let’s see what happens when you use LEN with a bunch of garden variety SQL strings.  I should warn you that this is a display artefact involving the decidedly unrecommended use of NUL characters, and no more, but the weird side effects are too fun to ignore.

First, here’s a set of SQL queries for our default collation (Latin1_General_CI_AS):

Note the output.  Not exactly intuitive!  Now we run the Unicode version:

Not quite as many gotchas in that one?  Or are there?  Notice how the first line shows a wide space for the three NUL characters — but not quite as wide as the Ideographic space…

Now here’s where things go really weird.  Running in Microsoft’s SQL Server Management Studio, I switch back to the first window, and, I must emphasise, without running any queries, or making any changes to settings, take a look at how the Messages tab appears now!

That’s right, the first line in the messages has magically changed!  The only thing I can surmise is that switching one window into Unicode output has affected the whole program’s treatment of NUL characters.  Let that be a warning to you (and I haven’t even mentioned consuming said NULs in C programs).

Mixing RTL and LTR: Plaintext vs HTML

Just a very short post today.  We still have some way to go in mixing RTL and LTR text.  For example, the following image, snipped from Outlook 2010, shows the issue:

The subject and body both say the same thing, but the display order is different.  Do you know why?  It’s because the subject is plain text and is assuming that the text is primarily right-to-left, whereas the body is HTML, and is assuming that the text is primarily left-to-right.

Note how the full stop in the subject appears to the left of the English text.  This is because the display renderer has assumed that the whole run of text is right-to-left, so punctuation is treated as right-to-left, and so displays after (in a right-to-left sense) the text.

The question is, of course, how do you determine directionality given an arbitrary plain text string?  It’s not really possible to do so reliably in the absence of other metadata.  The W3C article on directionality is helpful here: http://www.w3.org/TR/html4/struct/dirlang.html

Another view of the message:

Interestingly, Outlook Web Access does not do this, because its UI takes its directionality from the base HTML document:

Indy, TIdURI.PathEncode, URLEncode and ParamsEncode and more

Frequently in Delphi we come across the need to encode a string to stuff into a URL query string parameter (as per web forms).  One would expect that Indy contains well-tested functions to handle this.  Well, Indy contains some functions to help with this, but they may not work quite as you expect.  In fact, they may not be much use at all.

Indy contains a component called TIdURI.  It contains, among other things, the member functions URLEncode, PathEncode, and ParamsEncode. At first glance, these seem to do what you would need.  But in fact, they don’t.

URLEncode will take a full URL, split it into path, document and query components, encode each of those, and return the full string.  PathEncode is intended to handle the nuances of the path and document components of the URL, and ParamsEncode handles query strings.

Sounds great, right?  Well, it works until you have a query parameter that has an ampersand (&) in it.  Say my beloved end user want to search for big&little.  It seems that you could pass the following in:

s := TIdURI.URLEncode('http://www.google.com/search?q='+SearchText);

But then we get no change in our result:

s = 'http://www.google.com/search?q=big&little';

And you can already see the problem: little is now a separate parameter in the query string.  How can we work around this?  Can we pre-encode ampersand to %26 before you pass in the parameters?

s := TIdURI.URLEncode('http://www.google.com/search?q='+ReplaceStr(SearchText, '&', '%26'));

No:

s = 'http://www.google.com/search?q=big%25%26little';

And obviously we can’t do it ourselves afterwards, because we too won’t know which ampersands are which.  You could do correction of ampersand by encoding each parameter component separately and then post-processing the component for ampersand and other characters before final assembly using ParamsEncode. But you’ll soon find that it’s not enough anyway.  =, / and ? are also not encoded, although they should be.  Finally, URLEncode does not support internationalized domain names (IDN).

Given that these functions are not a complete solution, it’s probably best to avoid them altogether.

The problem is analogous to the Javascript encodeURI vs encodeURIComponent issue.

So to write your own…  I haven’t found a good Delphi solution online (and I searched a bit), so here’s a function I’ve cobbled together (use at your own risk!) to encode parameter names and values. You do need to encode each component of the parameter string separately, of course.

function EncodeURIComponent(const ASrc: string): UTF8String;
const
  HexMap: UTF8String = '0123456789ABCDEF';

  function IsSafeChar(ch: Integer): Boolean;
  begin
    if (ch >= 48) and (ch <= 57) then Result := True    // 0-9
    else if (ch >= 65) and (ch <= 90) then Result := True  // A-Z
    else if (ch >= 97) and (ch <= 122) then Result := True  // a-z
    else if (ch = 33) then Result := True // !
    else if (ch >= 39) and (ch <= 42) then Result := True // '()*
    else if (ch >= 45) and (ch <= 46) then Result := True // -.
    else if (ch = 95) then Result := True // _
    else if (ch = 126) then Result := True // ~
    else Result := False;
  end;
var
  I, J: Integer;
  ASrcUTF8: UTF8String;
begin
  Result := '';    {Do not Localize}

  ASrcUTF8 := UTF8Encode(ASrc);
  // UTF8Encode call not strictly necessary but
  // prevents implicit conversion warning

  I := 1; J := 1;
  SetLength(Result, Length(ASrcUTF8) * 3); // space to %xx encode every byte
  while I <= Length(ASrcUTF8) do
  begin
    if IsSafeChar(Ord(ASrcUTF8[I])) then
    begin
      Result[J] := ASrcUTF8[I];
      Inc(J);
    end
    else if ASrcUTF8[I] = ' ' then
    begin
      Result[J] := '+';
      Inc(J);
    end
    else
    begin
      Result[J] := '%';
      Result[J+1] := HexMap[(Ord(ASrcUTF8[I]) shr 4) + 1];
      Result[J+2] := HexMap[(Ord(ASrcUTF8[I]) and 15) + 1];
      Inc(J,3);
    end;
    Inc(I);
  end;

  SetLength(Result, J-1);
end;

To use this, do something like the following:

function GetAURL(const param, value: string): UTF8String;
begin
  Result := 'http://www.example.com/search?'+
    EncodeURIComponent(param)+
    '='+
    EncodeURIComponent(value);
end;

Hope this helps. Sorry, I haven't got an IDN solution in this post!

Updates to kmwString functions

A couple of weeks ago I published on this blog a set of functions for handling supplementary characters in JavaScript.  We have since fixed a few bugs in the code and have now made the kmwString script available on github at https://github.com/tavultesoft/keyman-tools

This update:

  • Fixes an error which could occur when only 1 parameter was passed to kmwSubstring and that parameter was negative
  • Fixed an incorrect NaN return value for kmwCodePointToCodeUnit when passed an offset at the very end of a string
  • Adds kmwCodeUnitToCodePoint function
  • Adds kmwCharAt function — which is basically just kmwSubstr(index,1)

Again, suggestions, comments, fixes and flames all gratefully received.