Category Archives: Bugs

When ញ៉ាំ meets ញ៉ំា

The Khmer script was added to the Unicode standard in September 1999. Today, nearly 18 years later, operating system renderers still get it wrong.

This is a quick post to document the difference in how several Khmer words are wrongly rendered on different current operating systems. I ran these tests on Windows 10 (10.0.14393), Android 7.1.1 Nougat, iOS 10.2.1, Mac OS X Sierra <> and Ubuntu 16.04 LTS with Firefox 47. The good news is that Windows 10 and Ubuntu passed all the tests (bar a font style issue with Leelawadee UI). Android passed nearly everything, except the bad encoding test.

Now, admittedly, the rules around triisap (U+17CA) and muusikatoan (U+17C9) are very complex. The Unicode standard description covers most of the difficulties, but not all of them.

Muusiaktoan is also sometimes called ធ្មេញកណ្ដុរ /tmɨɲ kɑndao/ – rat’s teeth, which is a fun name.

On to the words. In every case, the DauhPenh rendering is correct.

ញ៉ាំ /ɲam/ To eat
U+1789 U+17C9 U+17B6 U+17C6

ស៊ី /sii/ To eat (for young)
U+179F U+17CA U+17B8

As of Mac OS X Sierra, /sii/ now displays correctly. But contrast with /ʔum/, /ʔom/ below.

អ៊ំ /ʔum/, /ʔom/ Uncle, aunt
U+17A2 U+17CA U+17C6

Note how Leelawadee UI renders this wrongly; but that is a font rather than a renderer bug.

ប៊ី /bii/ A type of egg roll
U+1794 U+17CA U+17B8

ប៉ី /pəy/ A type of wind instrument
U+1794 U+17C9 U+17B8

As of Mac OS X Sierra, /pəy/ now displays correctly. But contrast with /bii/ above!

Yum yum yum

ញ៉ាំ /ɲam/ To eat

I’d like to pull out the word ញ៉ាំ for further analysis. Every operating system has some trouble with this word, because it could be encoded in several different ways. The correct way works on everything except iOS and Mac OS X. The incorrect encodings should really display wrongly, but none of the renderers complain about both invalid forms!

Correct order (ញ៉ាំ)

U+1789 U+17C9 U+17B6 U+17C6

Incorrect order (ញ៉ំា)

U+1789 U+17C9 U+17C6 U+17B6

Incorrect vowel (ញុំា)

U+1789 U+17BB U+17C6 U+17B6

In this instance, The DauhPenh rendering is appropriate for the first and second lines; the Apple rendering is ironically most appropriate for the third line!

Many thanks to Makara for his suggestion on the second incorrect rendering; I updated this post shortly after initial posting to include the extra example. There are other possible letter orders which may or may not display “correctly”; I will leave finding these as an exercise for the reader.

ZWNJ FTW

Here’s one I’ll examine in detail another time. Some words can be written in two different ways, neither really incorrect. The Unicode standard caters for these by allowing for insertion of a Zero Width Non Joiner (U+200C) to force the superscripted form of triisap (៊) or muusikatoan (៉). Windows 10’s Leelawadee UI font gets this one wrong (but its DauhPenh font doesn’t).

អ៊‌ី or អ៊ី /ʔii/ An exclamation of surprise
U+17A2 U+17CA U+17B8
ZWNJ
U+17A2 U+17CA U+200C U+17B8

Don’t forget to navigate to about:blank when embedding IWebBrowser2

Today I spent several hours trying to figure out why an embedded web browser component (in this case TEmbeddedWB) in a Delphi test app never received the appropriate IHttpSecurity and IWindowForBindingUI QueryService requests.

I was doing this in order to provide more nuanced handling of self-signed certificates in an intranet context. We all do this, right? Here the term “nuanced” means “Of course I trust self signed certificates on my intranet, don’t you?” Feel free to rant and rave on this. 😉

But no matter what I did, what incantations I tried, or what StackOverflow posts I perused, I was unable to find an answer. Until finally I stumbled on a side comment in a thread from 2010. Igor Tandetnik notes that:

Right after creating the control, navigate it to about:blank. Right after that, navigate it to the page you wanted to go to. It’s a known problem that IServiceProvider doesn’t work for the very first navigation.

And this was something that I kinda knew in the back of my head, but of course had forgotten. Thank you Igor.

This post would not be complete without some splendiferous code. Just for reference, it’s so simple if you don’t blank out and forget about:blank.

unit InsecureBrowser;

interface

uses
  Winapi.Windows,
  Winapi.Messages,
  Winapi.Urlmon,
  Winapi.WinInet,
  System.SysUtils,
  System.Variants,
  System.Classes,
  Vcl.Graphics,
  Vcl.Controls,
  Vcl.Forms,
  Vcl.Dialogs,
  Vcl.OleCtrls,
  Vcl.StdCtrls,
  SHDocVw_EWB,
  EwbCore,
  EmbeddedWB;

type
  TInsecureBrowserForm = class(TForm, IHttpSecurity, IWindowForBindingUI)
    web: TEmbeddedWB;
    cmdGoInsecure: TButton;
    procedure webQueryService(Sender: TObject; const [Ref] rsid,
      iid: TGUID; var Obj: IInterface);
    procedure FormCreate(Sender: TObject);
    procedure cmdGoInsecureClick(Sender: TObject);
  private
    { IWindowForBindingUI }
    function GetWindow(const guidReason: TGUID; out hwnd): HRESULT; stdcall;

    { IHttpSecurity }
    function OnSecurityProblem(dwProblem: Cardinal): HRESULT; stdcall;
  end;

var
  InsecureBrowserForm: TInsecureBrowserForm;

implementation

{$R *.dfm}

function TInsecureBrowserForm.GetWindow(const guidReason: TGUID;
  out hwnd): HRESULT;
begin
  Result := S_FALSE;
end;

function TInsecureBrowserForm.OnSecurityProblem(dwProblem: Cardinal): HRESULT;
begin
  if (dwProblem = ERROR_INTERNET_INVALID_CA) or
     (dwProblem = ERROR_INTERNET_SEC_CERT_CN_INVALID)
    then Result := S_OK
    else Result := E_ABORT;
end;

procedure TInsecureBrowserForm.webQueryService(Sender: TObject;
  const [Ref] rsid, iid: TGUID; var Obj: IInterface);
begin
  if IsEqualGUID(IID_IWindowForBindingUI, iid) then
    Obj := Self as IWindowForBindingUI
  else if IsEqualGUID(IID_IHttpSecurity, iid) then
    Obj := Self as IHttpSecurity;
end;

procedure TInsecureBrowserForm.cmdGoInsecureClick(Sender: TObject);
begin
  web.Navigate('https://evil.intranet.site/');
end;

procedure TInsecureBrowserForm.FormCreate(Sender: TObject);
begin
  web.Navigate('about:blank');
end;

end.

 

My favourite debugging story

This was some years ago, when I was living in Vientiane, the capital of the Lao Peoples’ Democratic Republic. It was 1994 or thereabouts – just prior to the release of Windows 95. I had written a piece of software called “Keyman” which was being increasingly used to type in Lao in Windows 3.1, overloading characters in the 128-255 range of the standard US English character set at the time (I don’t want to be too technical here). Before Unicode.

The owner of a local computer store had had reports of an issue with Keyman from one of their clients in the provincial capital of Savannakhet, about a one hour flight from Vientiane in a small plane. The technical minutiae of the problem escape me now, but it was something to do with a certain set of keystrokes which gave the wrong output in some applications – I think Excel. The report had been communicated over the telephone to the computer shop technical staff, and then translated into English for my benefit – as my Lao was probably not good enough to really get the detail. So as you can imagine, Chinese Whispers is a good way to describe the final report I received.

I tried to diagnose the problem from the description, and tried to reproduce it on my computer, but could not figure it out.

Now it is important to remember that Laos in 1994 was still pretty much unknown to the outside world. There were few tourists; it was (and is) a communist country, at least in principle. Things there didn’t work quite the same as in Australia. There was no Internet access in Laos at that time, telephones were unreliable and the use of modems was technically illegal. This meant that remote diagnosis was only possible by means of telephoned conversations over noisy phone lines, by fax, or with posted letters. The post often took weeks, even in-country.

So after a few days of fruitless telephoning back and forth, the owner of the computer shop suggested I accompany him on a trip down to Savannakhet. (From memory, he was already planning to visit). I was but a callow youth, of 17, and so this was a fantastic opportunity!

When we met up at the airport, the first thing I remember was standing in the security line behind a Lao businessman who caused a bit of a ruckus at the hand luggage screening, because his briefcase had two pistols in it. This seemed a little unusual, even in Laos. After some discussion, his pistols were removed from his hand luggage and given to a guard, who told him he could not take them on the plane because there was not a separate luggage hold. I don’t know what happened to them after that as we were ushered through the security.

When our plane started boarding, a second problem arose. It appeared that the flight had been overbooked, or perhaps they’d substituted a smaller aircraft. The plane we could see was a Xian Y-7, a Chinese clone of a Russian Antonov An-24. The Wikipedia page linked above shows a picture, coincidentally enough of a Lao Aviation plane, perhaps the very plane we were to fly on (they only had 4).

But, as I said, the plane was overbooked, and we ended up in the group of about 10 that didn’t get onto the plane. Pistol-man was in the group that boarded the Y-7, and we didn’t see him again.

Fortunately for us, Lao Aviation had a solution to the problem. They simply rolled a second plane out of the hanger, a Harbin Y-12 this time, and fired it up.

Well, they tried to fire it up. It coughed and spluttered, and lots of black smoke poured out of the engines, but it didn’t start. Boh pen nyang. They pushed it back into the hanger, and rolled out yet another Y-12.

At this point I was feeling a little nervous.

The thired plane coughed and spluttered, poured out lots of black smoke, but it started! After a minute, they shut off the engines and asked us to board.

You can see in the picture below how part of the engine cowling is painted black. You can also see, if you look closely, how there are black smudge marks around that black painted area. Yeah. Smoke. I guess that the smoke mustn’t be a big problem, but it wasn’t inspiring at the time.

Lao_Aviation_Harbin_Y-12_Sibille-1

https://upload.wikimedia.org/wikipedia/commons/4/4b/Lao_Aviation_Harbin_Y-12_Sibille-1.jpg

Image © 2000 Regis Sibille, used under CCSA. (Another picture: http://www.airliners.net/photo/Iran—Revolutionary/Harbin-Y12-II/1503896/L/)

We rolled out and took off moments after the larger first plane. For a while, we could see the larger plane ahead and slightly above us – I don’t know why they didn’t go straight up to cruising altitude as the Y-7 is a lot faster than the Y-12. But eventually the Y-7 was out of sight. The scenery was in places spectacular. As I recall, the plane stayed in Lao airspace for the whole flight, despite this making the flight significantly longer.

Arriving in Savannakhet, we first travelled to the house of a friend of the computer store owner. This man happened to be one of the richest men in southern Laos. He had a beautiful house, filled with beautifully carved tables, paintings and collected antiques. After a brief meeting there, we were escorted by this man to a café in the city for a coffee. Well, some of us drank coffee; I didn’t. I was but 17 and at that age drank far too much Pepsi.

At this sidewalk café, an interesting encounter occurred, which has stuck with me. A street sweeper stopped and ordered a drink, and sat at the same table as this very wealthy man, and they struck up conversation. For some time, all at the table talked. The friendly interactions between two very different social classes was remarkable to me at the time – especially coming from Thailand where the social strata were clearly delineated.

Finally, social requirements met, we made our way to the computer with the problem. It was about 3 or 4 flights up dusty stairs in one of the tallest buildings in the city. There was a lift, but use of it was definitely not recommended. The problem was demonstrated to me, and I was able to observe there what had flummoxed me from afar. After just a few minutes, I realised what the problem was, and had enough information to fix it. I didn’t have my laptop with me so I had to write down some notes – and then we left them, a little sad that we couldn’t fix the problem immediately.

What was the problem? I actually don’t remember the detail. I just remember how we got there and back again!

We took a ferry across the river to Thailand, and took a bus to Nakhon Phanom. There were two reasons: first, my friend the computer shop owner wanted to visit some relatives there, and second, the roads in Laos were at the time in such poor condition that travel on them was best avoided if an alternative was available. Thai roads were busy but generally in excellent condition.

Once in Nakhon Phanom, it was a short motorcycle taxi ride to the relatives’ house by the river. We stopped there for a couple of hours and drank tea with them (yes, even I, Mr Pepsi Boy drank tea).

But when we tootled back to the bus stop, we found our bus had left a few minutes earlier than we had planned!

This was not a big problem. A bystander offered to chase down the bus in his pickup truck. This pickup was sparkling clean, had bright alloy wheels with a thin slick of rubber spray painted on, and a lowered chassis, so much so that the wheels would have been superfluous if they hadn’t been required for their locomotive capability.

pickup

It wasn’t this car, but it could well have been its older brother. Now in Thailand, the buses moved fast. Especially once they got out onto the open road. I’ve been on Thai buses doing 150 km/h or more.

So our intrepid young driver started chasing down the bus, flying down the city streets at over double the speed limit, braking hard for corners and easily avoiding the (fortunately) light traffic, and once he got out onto the open road he was eager to show us what his car would do. So here we were, flying down a Thai highway in a stranger’s car doing a ridiculous speed, chasing down a bus driver that didn’t know we existed. I’m pretty sure my parents would not have been pleased.

At the speed we were doing, we caught up to the bus pretty easily. After some vigorous flashing of the headlights, the bus’s left indicator came on, and it slowed and finally stopped. My friend paid the pickup race driver a token of our appreciation, and with slightly wobbly legs we climbed onto the bus, and off we went to Nong Khai.

We arrived in Nong Khai well after midnight, and ended up at a noisy Thai transit hotel, in a room without windows a couple of floors above the nightclub. One bed: a queen bed.

Now, remember, I was a callow 17 year old youth. The idea of sharing a bed with any man was pretty terrifying. But my friend, who I guess was in his 40s at that stage, kindly noted my abject and unwarranted fear, and gave me the whole bed – I think he sat in the chair and dozed. Not fair, I know.

At about 4am the nightclub finally quietened down, and at about 6am we were awoken by the daily noise that accompanies the start of every day – car doors slamming, shouts, trucks reversing. So we gave up on sleep, got up and made our way back across the border to Vientiane.

Upon returning to Vientiane, I fixed the problem in the Keyman code in a few minutes, prepared a new version on a floppy disk, and rode my bicycle over to the computer shop to deliver it. The computer shop sent the disk on to the user in Savannakhet, and as far as I know, that was that.

That’s how debugging used to work in the nearly olden days. None of these fancy remote desktop VPN SSH thingies. It was a lot more fun.

Windows 10 shows a blank screen after login

Quick post for future reference and in case it helps other users.

Got to work this morning, and couldn’t login to my Windows 10 machine. Windows Updates had been installed overnight (oh dear).

The symptoms were slightly different to most of the other reports I’ve seen online (Google Windows 10 blank screen after login). I could login on one account all fine, but my primary account would just go to a black screen (with cursor). Waited quite some time (10 minutes or more) with no change. Rebooted. No change.

Pressing Ctrl+Alt+Del would bring up Windows Security, and from there I could switch user or go back to the lock screen, but attempting to log out would fail: the “wait for logout” spin appeared but never logged out. Other options from that screen had no effect.

I vaguely tried multi monitor options (Windows+P key combination), typing password into the blank screen, and other magic mummery without effect (and without expectation of success).

Eventually I logged in on the working account, turned off Windows firewall, enabled Remote Registry, and got to work with SysInternals tools. I logged back out of that account and used the command pslist \\machine -t from a working computer to see what processes were running.


pslist v1.3 - Sysinternals PsList
Copyright (C) 2000-2012 Mark Russinovich
Sysinternals - www.sysinternals.com

Process information for <machine>:

Name                             Pid Pri Thd  Hnd      VM      WS    Priv
Idle                               0   0   4    0      64       4       0
  System                           4   8 141 1217  393012  271644    1560
    smss                         316  11   3   49    4864     592     376
csrss                            428  13  11  415  163348    3240    1620
wininit                          500  13   4   91   43984    2792    1232
  services                       572   9   9  294   35020    7900    4648
    svchost                      304   8  35  956  184400   23016   13240
    svchost                      356   8  17  321   82168    6432    3248
    svchost                      496   8  26 1043  204540   24268   20616
    svchost                      748   8  30  773   87432   16040   10172
      ShellExperienceHost       7032   8  44  876  369236   35880   58392
      WmiPrvSE                  8876   8   6  144   33264    8172    1880
      SearchUI                  9820   8  27  900  362948   37788   48896
    svchost                      812   8  18  528   59364   11544    9764
    svchost                      936   8  52 1346 1417096   26592   22524
    svchost                      968   8  78 4722 1321944   71912  797808
      taskhostw                 3952   8   8  291  275572   10268    6492
    svchost                     1064   8  37 1130  719392   25880   12960
      WUDFHost                  1816   8   8  476   38964    3852    2212
      dasHost                   2084   8  13  302   37880    7968    3764
    mvbtrcsvcx64                1108   8   3  107   58624    1608    1124
    svchost                     1344   8  14  527  217200   29352   56716
    svchost                     1348   8  29  605  167772   23016   18484
    spoolsv                     1792   8  24  576   91808   13972   11384
    svchost                     2156   8  48  485 1200004   19240    9308
    MsMpEng                     2268   8  45 1122  523848  127360  144964
    svchost                     2684   8  15 3980  303696   53068   95100
    NisSrv                      3084   8  10  287   71416   10276   23812
    SearchIndexer               5800   8  50 1045  434516   77688   95752
      SearchProtocolHost        9568   4  10  353   74552   12456    2240
      SearchFilterHost          9856   4   4  114   35096    6560    1208
    officeclicktorun            8500   8  18 3570  213028   27932   48116
    svchost                    11928   8   5  128   23372    6500    1276
  lsass                          580   9  14 1272   58540   17008    9876
csrss                           3356  13  15  634  160220   16276    2520
winlogon                        3396  13   6  229   65460   12380    2472
  dwm                           3520  13  12  332  282440   34416   62088

Nothing obviously wrong that I could see. I then ran pskill \\machine 3396 to kill winlogon, in the hope that at least that would cause the account to log out.

And of course it did.

The surprise was that after that I was able to login…

Unfortunately, I have not yet traced what caused the lock but at least this was an answer for me, which hopefully may help someone else one day.

Possibly related: I did find this in the Event Log:

Application pop-up: ShellExperienceHost.exe – Application Error : The instruction at 0x00007FFE569D50CB referenced memory at 0x0000000000000000. The memory could not be read.

And, an unhelpful error from win32k:

The description for Event ID 267 from source Win32k cannot be found. Either the component that raises this event is not installed on your local computer or the installation is corrupted. You can install or repair the component on the local computer.

If the event originated on another computer, the display information had to be saved with the event.

The following information was included with the event:

The specified resource type cannot be found in the image file

Fixing the incorrect client size for Delphi VCL Forms that use styles

Delphi XE2 and later versions have a robust theming system that has a frustrating flaw: the client width and height are not reliably preserved when the theme changes the border widths for dialog boxes.

For forms that are sizeable this is not typically a problem, but for dialogs laid out statically this can look really ugly, as shown in this Stack Overflow question.

The problem in pictures

Here’s a little form, shown in the Delphi form designer. I’ve placed 4 buttons right in the corners of the form. I’m going to populate the Memo with notes on the form size at runtime.

Design time form with four buttons at corners

When I have no custom style set to the project (i.e. “Windows” style), I can run on a variety of platforms and see the buttons are where they should be. Shown here on Windows 10, Windows 7 and Windows XP (just because):
Windows theme form on Windows 10Windows theme form on Windows 7

Windows theme form on Windows XP

But when I apply a custom style to the project — I chose “Glossy” — then my dialog appears like so, instead:

Glow theme form on Windows 7

You’ll note that the vertical is adjusted but the horizontal is not: Button2 and Button4 are now chopped off on the right. Because we are using themes, the form looks identical on all platforms.

This problem has not been addressed as of Delphi XE8.

The workaround

For my needs, I found a workaround using a class helper, which can be applied to the forms which need to maintain their design-time ClientWidth and ClientHeight. This is typically the case for dialog boxes.

This workaround should be used with care as it has been designed to address a single issue and may have side effects.

  • It will trigger resize events at load time
  • Setting AutoScroll = true means that ClientWidth and ClientHeight are not stored in the form .dfm, and so this does not work.
  • It may not address other layout issues such as scaled elements scaling wrongly (I haven’t tested this).
type
  TFormHelper = class helper for Vcl.Forms.TCustomForm
  private
    procedure RestoreDesignClientSize;
  end;

procedure TfrmTestSize.FormCreate(Sender: TObject);
begin
  RestoreDesignClientSize;
end;

{ TFormHelper }

procedure TFormHelper.RestoreDesignClientSize;
begin
  if BorderStyle in [bsSingle, bsDialog] then
  begin
    if Self.FClientWidth > 0 then ClientWidth := Self.FClientWidth;
    if Self.FClientHeight > 0 then ClientHeight := Self.FClientHeight;
  end;
end;

After adding in this little snippet, the form is now restored to its design-time size, like thus:

Fixed glow theme form on Windows 7

Success 🙂

Concatenating strings in SQL Server, or undefined behaviour by design

We just ran into a funny problem here, using a “tried and true” technique in SQL Server to concatenate strings. I use the quotes advisedly. This technique is often suggested on blogs and sites such as Stack Overflow, but we found out (by painful experience) that it is not to be relied on.

Update, 9 Mar 2016: Bruce Gordon from Webucator has turned this into a great little 5 minute video. Thanks Bruce! I don’t know anything much about Webucator, but they are doing some good stuff with creating well-attributed videos about blog posts such as this one and apparently they do SQL Server training.

The problem

So, given the following setup:

CREATE TABLE BadConcat (
  BadConcatID INT NOT NULL,
  Description NVARCHAR(100) NOT NULL,
  SortIndex INT NOT NULL
  CONSTRAINT PK_BadConcat PRIMARY KEY CLUSTERED (BadConcatID)
)
GO

INSERT BadConcat 
  SELECT 1, 'First Item', 1 union all
  SELECT 2, 'Second Item', 2 union all
  SELECT 3, 'Third Item', 3
GO

We need to concatenate those Descriptions. I have avoided fine tuning such as dropping the final comma or handling NULLs for the purpose of this example. This example shows one of the most commonly given answers to the problem:

DECLARE @Summary NVARCHAR(100) = ''

SELECT @Summary = @Summary + ec.Description + ', '
FROM BadConcat ec
ORDER BY ec.SortIndex 

PRINT @Summary

And we get the following:

First Item, Second Item, Third Item, 

And that works fine. However, if we want to include a WHERE clause, even if that clause still selects everything, then we suddenly get something weird:

SET @Summary = ''

SELECT @Summary = @Summary + ec.Description + ', '
FROM BadConcat ec
WHERE ec.BadConcatID in (1,2,3)
ORDER BY ec.SortIndex 

PRINT @Summary

Now we get the following:

Third Item, 

What? What has SQL Server done? What’s happened to the first two items?

You’ll probably do what we did, which is to go through and make sure that you are selecting everything properly, which we are, and eventually come to the conclusion that “there must be a bug in SQL Server”.

The answer

It turns out that this iterative concatenation is unsupported functionality. Microsoft Knowledge Base article 287515 states:

You may encounter unexpected results when you apply any operators or expressions to the ORDER BY clause of aggregate concatenation queries.

Now, at first glance that does not directly apply. But we can extrapolate from that, as Microsoft developer support have done, in response to a bug report on SQL Server, to learn that:

The variable assignment with SELECT statement is a proprietary syntax (T-SQL only) where the behavior is undefined or plan dependent if multiple rows are produced

And again, in response to another bug report:

Using assignment operations (concatenation in this example) in queries with ORDER BY clause has undefined behavior. This can change from release to release or even within a particular server version due to changes in the query plan. You cannot rely on this behavior even if there are workarounds.

Some alternative solutions are given, also, in that second report:

The ONLY guaranteed mechanism are the following:

1. Use cursor to loop through the rows in specific order and concatenate the values
2. Use for xml query with ORDER BY to generate the concatenated values
3. Use CLR aggregate (this will not work with ORDER BY clause)

And the article “Concatenating Row Values in Transact-SQL” by Anith Sen goes through some of those solutions in detail. Sadly, none of them are as clean or as easy to understand as that original example.

Another example is given on Stack Overflow, which details how to safely use XML PATH to concatenate, without breaking on the XML special characters &, < and >. Applying that example into my problem code given above, we should use the following:

SELECT @Summary = (
  SELECT ec.Description + ', ' 
  FROM BadConcat ec 
  WHERE ec.BadConcatID in (1,2,3)
  ORDER BY ec.SortIndex 
  FOR XML PATH(''), TYPE
).value('.','varchar(max)')

PRINT @Summary

Voilà.

First Item, Second Item, Third Item, 

The case of the UAC that Just Wouldn’t

One of my dev machines has long had a weird anomaly where file operations in Explorer that should prompt for UAC, such as copying a file into C:\Program Files, would instead silently fail.

This led to all sorts of issues, from being unable to delete certain files — they’d just obstinately sit there, no matter how much I pressed that Del key — to trying to move folders containing a hidden Thumbs.db file and being unable to move the folder.

My UAC settings were the Windows defaults. Nothing weird these. So I’d always treated put this issue into my “too busy to solve this now” basket. The classic basket case. But today I finally got fed up.

After a quick search for the symptoms on Dr Google returned no results of significance, I decided I needed to trace the cause myself.

Process Monitor to the Rescue

It was time to pull out Process Monitor out of my toolbox again! Process Monitor is a tool from the SysInternals Suite by Microsoft that monitors and logs details on a bunch of different operations on your computer. I use Process Monitor, Procmon for short, all the time to solve problems big and little. But for some reason, it hadn’t crossed my mind until today that I could apply Procmon to this problem.

First, I configured Procmon to filter all events except for those generated by Explorer.exe and Consent.exe. I wasn’t sure if Consent.exe was involved in the problem (Consent.exe being the UAC elevation prompter), but it wouldn’t hurt to include it to start with. Note that all those Exclude filters are default filters setup by Procmon to exclude itself and its friends, removing that confusion from the logs.

Procmon filter

Then I went ahead and tried to copy a file into C:\Program Files (x86). It was just an innocent little text file, but Explorer of course acted like a Buckingham Palace Guard and silently and stolidly ignored its existence.

Source folder  ➔  Dest folder

I used the clipboard Ctrl+C and Ctrl+V to copy and paste (or attempt to paste) the file. I didn’t think the clipboard was at fault because all other UAC-required file operations also failed silently. I could have dragged and dropped, it would have had the same effect.

But now, with procmon, I had captured the communication that went on behind the scenes. All those secret coded winks and nose scratches that told Explorer to fob off any attempts to trigger a UAC prompt. Here’s what I was presented with in the Procmon log.

Procmon start

I searched for the name of my text file (test.txt), and used Procmon’s Highlight tool to highlight every reference to it in the Path column. This made it easy to spot nearby interactions that may have been related, even if they didn’t directly reference the test.txt file itself. You can see below two of the highlighted test.txt lines.

Procmon highlighting

Because there was a lot going on, I filtered out a lot of Operations that I thought were not relevant, such as CloseFile, RegCloseKey, RegQueryKey, ReadFile and WriteFile, among others. This reduced the log considerably and made it easier to spot differences (my screen capture below shows the filtering after it was reset, however — I forgot to capture the filtered trace, sorry).

I decided to also capture a trace on a machine where UAC prompting worked. I then compared the two logs. After scrolling back and forth around the many references to test.txt, I saw that on my dev machine, there was an additional interaction, right before the point where the prompt dialog was presented:

TortoiseShell in Procmon

That’s right, I had a program called TortoiseCVS installed on this machine which hooked into Explorer in a variety of ways. After the FileOperationPrompt references on my second machine, there was no reference to TortoiseCVS. Here’s what it looked like on the other machine:

Procmon on clean machine without TortoiseShell

That was the only visible difference of significance in the logs.

Now for those of you who just knee-jerked into “why on earth are you using CVS?!?”, calm down! This is a story, and I’m telling the story.

I decided that I didn’t really need TortoiseCVS installed and decided to try uninstalling it.

Uninstall Tortoise CVS

Sadly, uninstalling it required a reboot, no doubt to remove its old fashioned hooks into Explorer.

After the reboot, I tried to copy my innocent little text file again.

CopyWorks

Success! I was now presented with the prompt I wanted!

Another case closed thanks to SysInternals Suite and Mark Russinovich!

 

 

A workaround for a record.property getter bug in the Delphi XE2 compiler

We recently ran into a nasty little bug in the Delphi XE2 compiler, that arises with a complex set of conditions:

  1. A record with a string property that has a get function;
  2. A call to this getter that has a constant string appended to the result;
  3. This result passed directly to another arbitrary function.

When these conditions are met (see code below for examples), the compiler generates code that crashes.

Reproducing the bug

The following program is enough to reproduce the bug.

program ustrcatbug;

type
  TWrappedString = record
  private
    FValue: string;
    function GetValue: string;
  public
    property Value: string read GetValue;
  end;

{ TWrappedString }

function TWrappedString.GetValue: string;
begin
  Result := FValue;
end;

function GetWrappedString: TWrappedString;
begin
  Result.FValue := 'Something';
end;

begin
  writeln(GetWrappedString.Value + ' Else');
end.

Looking at the last line of code from that sample in the disassembler, we see the following code:

ustrcatbug.dpr.28: writeln(GetWrappedString.Value + ' Else');
0040712A 8D45EC           lea eax,[ebp-$14]
0040712D E816E9FFFF       call GetWrappedString
00407132 8D55EC           lea edx,[ebp-$14]
00407135 B8C0BB4000       mov eax,$0040bbc0
0040713A 8B0DE8594000     mov ecx,[$004059e8]
00407140 E883D8FFFF       call @CopyRecord
00407145 8D55E8           lea edx,[ebp-$18]
00407148 B8C0BB4000       mov eax,$0040bbc0
0040714D E8D6E8FFFF       call TWrappedString.GetValue
00407152 8D45E8           lea eax,[ebp-$18]
00407155 BAB4714000       mov edx,$004071b4
0040715A E899D6FFFF       call @UStrCat
0040715F 8B10             mov edx,[eax]
00407161 A12C884000       mov eax,[$0040882c]
00407166 E86DC6FFFF       call @Write0UString
0040716B E868C7FFFF       call @WriteLn
00407170 E867BCFFFF       call @_IOTest
ustrcatbug.dpr.29: end.
00407175 33C0             xor eax,eax

The problem arises with these two lines:

0040715A E899D6FFFF       call @UStrCat
0040715F 8B10             mov edx,[eax]

The problem is that eax is not preserved through function calls with Delphi’s default register calling convention. And, as _UStrCat is a procedure, we can make no assumptions about the return value (which is passed in eax):

procedure _UStrCat(var Dest: UnicodeString; const Source: UnicodeString);

The issue does not arise if we avoid using the property:

  writeln(GetWrappedString.GetValue + ' Else');

From this, the compiler generates:

ustrcatbug.dpr.28: writeln(GetWrappedString.GetValue + ' Else');
0040712A 8D45E8           lea eax,[ebp-$18]
0040712D E816E9FFFF       call GetWrappedString
00407132 8D55E8           lea edx,[ebp-$18]
00407135 B8C0BB4000       mov eax,$0040bbc0
0040713A 8B0DE8594000     mov ecx,[$004059e8]
00407140 E883D8FFFF       call @CopyRecord
00407145 B8C0BB4000       mov eax,$0040bbc0
0040714A 8D55EC           lea edx,[ebp-$14]
0040714D E8D6E8FFFF       call TWrappedString.GetValue
00407152 8D45EC           lea eax,[ebp-$14]
00407155 BAB4714000       mov edx,$004071b4
0040715A E899D6FFFF       call @UStrCat
0040715F 8B55EC           mov edx,[ebp-$14]
00407162 A12C884000       mov eax,[$0040882c]
00407167 E86CC6FFFF       call @Write0UString
0040716C E867C7FFFF       call @WriteLn
00407171 E866BCFFFF       call @_IOTest
ustrcatbug.dpr.29: end.
00407176 33C0             xor eax,eax

Where we now see that edx is reloaded from the stack, as it should be:

0040715A E899D6FFFF       call @UStrCat
0040715F 8B55EC           mov edx,[ebp-$14]

Workarounds

There are a number of possible workarounds:

  1. Upgrade to Delphi XE7 – this problem appears to be resolved, though I could not find a QC or RSP report relating to the bug when I searched. Moving to XE7 is not an option for us in the short term: too many code changes, too many unknowns.
  2. Don’t use properties in records. This means changing code to call functions instead: no big deal for the Get, but annoying for the Set half of the function pair.
  3. Patch around the problem by modifying UStrCat to return the address of the Dest parameter.

In the end, I wrote option (3) but we went with option (2) in our code base.

Because UStrCat is implemented in System.pas, it’s difficult to build your own version of the unit. One way to skin the option 3 UStrCat is to copy the implementation of UStrCat and its dependencies (a bunch of memory and string manipulation functions), and monkeypatch at runtime.

In the copied functions, we need to preserve eax through the function calls. This results in the addition of 4 lines of assembly to the UStrCat and UStrAsg functions, pushing the eax register onto the stack and popping it before exit. I haven’t reproduced the code here because the original is copyrighted to Embarcadero, but here are the changes required:

  1. In UStrCat, Add push eax just after the conditional jump to UStrAsg, and pop eax just before the ret instruction.
  2. In UStrAsg, wrap the call to FreeMem with a push eax and pop eax.

The patch function is also pretty straightforward:

procedure MonkeyPatch(OldProc, NewProc: PBYTE);
var
  pBase, p: PBYTE;
  oldProtect: Cardinal;
begin
  p := OldProc;
  pBase := p;

  // Allow writes to this small bit of the code section
  VirtualProtect(pBase, 5, PAGE_EXECUTE_WRITECOPY, oldProtect);

  // First write the long jmp instruction.
  p := pBase;
  p^ := $E9;  // long jmp opcode
  Inc(p);
  PDWord(p)^ := DWORD(NewProc) - DWORD(p) - 4;  // address to jump to, relative to EIP

  // Finally, protect that memory again now that we are finished with it
  VirtualProtect(pBase, 5, oldProtect, oldProtect);
end;

function GetUStrCatAddr: Pointer; assembler;
asm
  lea  eax,System.@UStrCat
end;

initialization
  MonkeyPatch(GetUStrCatAddr, @_UStrCatMonkey);
end.

This solution does give me the heebie jeebies, because we are patching the symptoms of the problem as we’ve seen them arise, without being able to either understand or address the root cause within the compiler. It’s not really possible to guarantee that there won’t be some other code path that causes this solution to come unstuck without really digging deep into the compiler’s code generation.

As noted, this problem appears to be resolved in Delphi XE5 or possibly earlier; however it is unclear if the root cause of the problem has been addressed, or we just got lucky. The issue has been reported as RSP-10255.

IE11, Windbg, JavaScript = joy

I was trying to debug a web application today, which is running in an Internet Explorer MSHTML window embedded in a thick client application. Weirdly, the app would fail, silently, without any script errors, on the sixth refresh — pressing F5 six times in a row would crash it each and every time.  Ctrl+F5 or F5, either would do it.

I was going slightly mad trying to trace this, with no obvious solutions. Finally I loaded the process up in Windbg, not expecting much, and found that three (handled) C++ exceptions were thrown for each of the first 6 loads (initial load, + 5 refreshes):

(958.17ac): C++ EH exception - code e06d7363 (first chance)
(958.17ac): C++ EH exception - code e06d7363 (first chance)
(958.17ac): C++ EH exception - code e06d7363 (first chance)

However, on the sixth refresh, this exception was happening four times:

(958.17ac): C++ EH exception - code e06d7363 (first chance)
(958.17ac): C++ EH exception - code e06d7363 (first chance)
(958.17ac): C++ EH exception - code e06d7363 (first chance)
(958.17ac): C++ EH exception - code e06d7363 (first chance)

This gave me something to look at! At the very least there was an observable difference between the refreshes within the debugger. A little more spelunking revealed that the very last exception thrown was the relevant one, and it returned the following trace (snipped):

(958.17ac): C++ EH exception - code e06d7363 (first chance)
ChildEBP RetAddr  
001875e8 75d3359c KERNELBASE!RaiseException+0x58
00187620 72ba1df2 msvcrt!_CxxThrowException+0x48
00187664 72d659cb jscript9!Js::JavascriptExceptionOperators::ThrowExceptionObjectInternal+0xb4
0018767c 72ba1cda jscript9!Js::JavascriptExceptionOperators::ThrowExceptionObject+0x15
001876a8 72bb5175 jscript9!Js::JavascriptExceptionOperators::Throw+0x6e
001876f4 6e9acd03 jscript9!CJavascriptOperations::ThrowException+0x9c
0018771c 6f3029f5 MSHTML!CFastDOM::ThrowDOMError+0x70
00187750 72b68e9d MSHTML!CFastDOM::CWebSocket::DefaultEntryPoint+0xe2
001877b8 72cff6a2 jscript9!Js::JavascriptExternalFunction::ExternalFunctionThunk+0x165
00187810 72bb849c jscript9!Js::JavascriptFunction::CallAsConstructor+0xc7
00187830 72bb8537 jscript9!Js::InterpreterStackFrame::NewScObject_Helper+0x39
0018784c 72bb858c jscript9!Js::InterpreterStackFrame::ProfiledNewScObject_Helper+0x53
0018786c 72bb855e jscript9!Js::InterpreterStackFrame::OP_NewScObject_Impl<Js::OpLayoutCallI_OneByte,1>+0x24
00187ba8 72b66548 jscript9!Js::InterpreterStackFrame::Process+0x3c44
00187dbc 132f1ae1 jscript9!Js::InterpreterStackFrame::InterpreterThunk<1>+0x1e8
WARNING: Frame IP not in any known module. Following frames may be wrong.
00187dc8 72c372dd 0x132f1ae1
00187ed0 138be456 jscript9!Js::JavascriptFunction::EntryApply+0x265
00187f38 72b6669e 0x138be456
00188288 72b66548 jscript9!Js::InterpreterStackFrame::Process+0xbd7
001883bc 132f0681 jscript9!Js::InterpreterStackFrame::InterpreterThunk<1>+0x1e8
001883c8 72b6669e 0x132f0681
00188718 72b66548 jscript9!Js::InterpreterStackFrame::Process+0xbd7
0018887c 132f1b41 jscript9!Js::InterpreterStackFrame::InterpreterThunk<1>+0x1e8
00188888 72b6669e 0x132f1b41
00188bd8 72b66548 jscript9!Js::InterpreterStackFrame::Process+0xbd7
00188d1c 132f1c21 jscript9!Js::InterpreterStackFrame::InterpreterThunk<1>+0x1e8
00188d28 72bb7642 0x132f1c21
00189080 72bd501f jscript9!Js::InterpreterStackFrame::Process+0x2175
001890b8 72bd507e jscript9!Js::InterpreterStackFrame::OP_TryCatch+0x49
001893f8 72b66548 jscript9!Js::InterpreterStackFrame::Process+0x4e84
00189594 132f0081 jscript9!Js::InterpreterStackFrame::InterpreterThunk<1>+0x1e8
00189624 72d0134c 0x132f0081
00189650 72bf3de5 jscript9!Js::JavascriptOperators::GetProperty_Internal<0>+0x48
00189678 72b677a7 jscript9!Js::InterpreterStackFrame::OP_ProfiledLoopBodyStart<0,1>+0xa2
001899b8 72b66548 jscript9!Js::InterpreterStackFrame::Process+0x24f7
00189b34 132f0089 jscript9!Js::InterpreterStackFrame::InterpreterThunk<1>+0x1e8
00189b40 72b6669e 0x132f0089
00189e98 72b66548 jscript9!Js::InterpreterStackFrame::Process+0xbd7
0018a004 132f0099 jscript9!Js::InterpreterStackFrame::InterpreterThunk<1>+0x1e8
0018a010 72b6669e 0x132f0099
0018a368 72b66548 jscript9!Js::InterpreterStackFrame::Process+0xbd7
0018a4bc 132f1c69 jscript9!Js::InterpreterStackFrame::InterpreterThunk<1>+0x1e8
0018a4c8 72b6669e 0x132f1c69
0018a818 72b66548 jscript9!Js::InterpreterStackFrame::Process+0xbd7
0018a95c 132f1c71 jscript9!Js::InterpreterStackFrame::InterpreterThunk<1>+0x1e8
0018a968 72b6669e 0x132f1c71
0018acb8 72b66548 jscript9!Js::InterpreterStackFrame::Process+0xbd7
0018ade4 132f1fb1 jscript9!Js::InterpreterStackFrame::InterpreterThunk<1>+0x1e8
0018adf0 72c372dd 0x132f1fb1
0018ae90 72b60685 jscript9!Js::JavascriptFunction::EntryApply+0x265
0018aee0 72bd4fcc jscript9!Js::JavascriptFunction::CallFunction<1>+0x88
0018b220 72bd501f jscript9!Js::InterpreterStackFrame::Process+0x442e
0018b258 72bd507e jscript9!Js::InterpreterStackFrame::OP_TryCatch+0x49
0018b598 72b66548 jscript9!Js::InterpreterStackFrame::Process+0x4e84
0018b71c 132f1ed9 jscript9!Js::InterpreterStackFrame::InterpreterThunk<1>+0x1e8
0018b728 72b6669e 0x132f1ed9
0018ba78 72b66548 jscript9!Js::InterpreterStackFrame::Process+0xbd7
0018bbac 132f1ca1 jscript9!Js::InterpreterStackFrame::InterpreterThunk<1>+0x1e8
0018bbb8 72bb7642 0x132f1ca1
0018bf00 72bd501f jscript9!Js::InterpreterStackFrame::Process+0x2175
0018bf38 72bd507e jscript9!Js::InterpreterStackFrame::OP_TryCatch+0x49
0018c278 72b66548 jscript9!Js::InterpreterStackFrame::Process+0x4e84
0018c3ac 132f1e21 jscript9!Js::InterpreterStackFrame::InterpreterThunk<1>+0x1e8
0018c3b8 72bb7642 0x132f1e21
0018c700 72bd501f jscript9!Js::InterpreterStackFrame::Process+0x2175
0018c738 72bd507e jscript9!Js::InterpreterStackFrame::OP_TryCatch+0x49
0018ca78 72b66548 jscript9!Js::InterpreterStackFrame::Process+0x4e84
0018cbb4 132f1e21 jscript9!Js::InterpreterStackFrame::InterpreterThunk<1>+0x1e8
0018cbc0 72b6669e 0x132f1e21
0018cf08 72b66548 jscript9!Js::InterpreterStackFrame::Process+0xbd7
0018d03c 132f1e51 jscript9!Js::InterpreterStackFrame::InterpreterThunk<1>+0x1e8
0018d048 72b6669e 0x132f1e51
0018d398 72b66548 jscript9!Js::InterpreterStackFrame::Process+0xbd7
0018d4bc 132f0341 jscript9!Js::InterpreterStackFrame::InterpreterThunk<1>+0x1e8
0018d4c8 72bb7642 0x132f0341
0018d810 72bd501f jscript9!Js::InterpreterStackFrame::Process+0x2175
0018d848 72bd507e jscript9!Js::InterpreterStackFrame::OP_TryCatch+0x49
0018db88 72b66548 jscript9!Js::InterpreterStackFrame::Process+0x4e84
0018dd44 132f1f99 jscript9!Js::InterpreterStackFrame::InterpreterThunk<1>+0x1e8
0018dd50 72b60685 0x132f1f99
0018dd98 72bd4fcc jscript9!Js::JavascriptFunction::CallFunction<1>+0x88
0018e0d8 72bd501f jscript9!Js::InterpreterStackFrame::Process+0x442e
0018e110 72bd507e jscript9!Js::InterpreterStackFrame::OP_TryCatch+0x49
0018e450 72c5bfce jscript9!Js::InterpreterStackFrame::Process+0x4e84
0018e488 72c5bf0f jscript9!Js::InterpreterStackFrame::OP_TryFinally+0xb1
0018e7c8 72b66548 jscript9!Js::InterpreterStackFrame::Process+0x68eb
0018e8f4 132f0351 jscript9!Js::InterpreterStackFrame::InterpreterThunk<1>+0x1e8
0018e900 72b6669e 0x132f0351
0018ec48 72b66548 jscript9!Js::InterpreterStackFrame::Process+0xbd7
0018ed94 132f1cf9 jscript9!Js::InterpreterStackFrame::InterpreterThunk<1>+0x1e8
0018eda0 72b66ce9 0x132f1cf9
0018f0f8 72b66548 jscript9!Js::InterpreterStackFrame::Process+0x1cd7
0018f264 132f1d01 jscript9!Js::InterpreterStackFrame::InterpreterThunk<1>+0x1e8
0018f270 72b66ce9 0x132f1d01
0018f5c8 72b66548 jscript9!Js::InterpreterStackFrame::Process+0x1cd7
0018f70c 132f1d49 jscript9!Js::InterpreterStackFrame::InterpreterThunk<1>+0x1e8
0018f718 72b60685 0x132f1d49
0018f760 72b6100e jscript9!Js::JavascriptFunction::CallFunction<1>+0x88
0018f7cc 72b60f60 jscript9!Js::JavascriptFunction::CallRootFunction+0x93
0018f814 72b60ee7 jscript9!ScriptSite::CallRootFunction+0x42
0018f83c 72b6993c jscript9!ScriptSite::Execute+0x6c
0018f898 72b69878 jscript9!ScriptEngineBase::ExecuteInternal<0>+0xbb
0018f8b0 6eaab78f jscript9!ScriptEngineBase::Execute+0x1c
0018f964 6eaab67c MSHTML!CListenerDispatch::InvokeVar+0x102
0018f990 6eaab21e MSHTML!CListenerDispatch::Invoke+0x61
0018fa28 6eaab385 MSHTML!CEventMgr::_InvokeListeners+0x1a2
0018fb98 6e8a98bb MSHTML!CEventMgr::Dispatch+0x35a
0018fbc0 6e92d0c0 MSHTML!CEventMgr::DispatchEvent+0x8c
0018fd18 6e92da30 MSHTML!CXMLHttpRequest::Fire_onreadystatechange+0x8b
0018fd2c 6e9343ea MSHTML!CXMLHttpRequest::DeferredFire_onreadystatechange+0x52
0018fd5c 6e8a9472 MSHTML!CXMLHttpRequest::DeferredFire_progressEvent+0x10
0018fdac 773f62fa MSHTML!GlobalWndProc+0x1cd
0018fdd8 773f6d3a USER32!InternalCallWinProc+0x23
0018fe50 773f77c4 USER32!UserCallWinProcCheckWow+0x109
0018feb0 773f788a USER32!DispatchMessageWorker+0x3bc

So now I knew that somewhere deep in my Javascript code there was something happening that was bad. Okay… I guess that helps, maybe?

But then, after some more googling, I discovered the following blog post, published a mere 9 days ago: Finally… JavaScript source line info in a dump. Magic! (In fact, I’m just writing this blog post for my future self, so I remember where to find this info in the future.)

After following the specific incantations outlined within that post, I was able to get the exact line of Javascript that was causing my weird error. All of a sudden, things made sense.

0:000> k
ChildEBP RetAddr  
00186520 76bd22b4 KERNELBASE!RaiseException+0x48
00186558 5900775d msvcrt!_CxxThrowException+0x59
00186588 590076a3 jscript9!Js::JavascriptExceptionOperators::ThrowExceptionObjectInternal+0xb7
001865b8 5901503d jscript9!Js::JavascriptExceptionOperators::Throw+0x77
00186604 6365e79b jscript9!CJavascriptOperations::ThrowException+0x9c
0018662c 63e609aa mshtml!CFastDOM::ThrowDOMError+0x70
00186660 58f1f9a3 mshtml!CFastDOM::CWebSocket::DefaultEntryPoint+0xe2
001866c8 58f29994 jscript9!Js::JavascriptExternalFunction::ExternalFunctionThunk+0x165
00186724 58f2988c jscript9!Js::JavascriptFunction::CallAsConstructor+0xc7
00186744 58f29a4a jscript9!Js::InterpreterStackFrame::NewScObject_Helper+0x39
00186760 58f29a7a jscript9!Js::InterpreterStackFrame::ProfiledNewScObject_Helper+0x53
00186780 58f29aa9 jscript9!Js::InterpreterStackFrame::OP_NewScObject_Impl<Js::OpLayoutCallI_OneByte,1>+0x24
00186b58 58f26510 jscript9!Js::InterpreterStackFrame::Process+0x402b
00186d6c 14ea1ae1 jscript9!Js::InterpreterStackFrame::InterpreterThunk<1>+0x1e8
WARNING: Frame IP not in any known module. Following frames may be wrong.
00186d78 58faa407 js!Anonymous function [http://marcdev2010:4131/app/main/main-controller.js @ 251,7]
00186e70 15511455 jscript9!Js::JavascriptFunction::EntryApply+0x267
00186ed8 58f268e6 js!d [http://marcdev2010:4131/lib/angular/angular.min.js @ 34,479]
001872c8 58f26510 jscript9!Js::InterpreterStackFrame::Process+0x899
001873f4 14ea0681 jscript9!Js::InterpreterStackFrame::InterpreterThunk<1>+0x1e8
00187400 58f268e6 js!instantiate [http://marcdev2010:4131/lib/angular/angular.min.js @ 35,101]
001877e8 58f26510 jscript9!Js::InterpreterStackFrame::Process+0x899
00187944 14ea1b41 jscript9!Js::InterpreterStackFrame::InterpreterThunk<1>+0x1e8
00187950 58f268e6 js!Anonymous function [http://marcdev2010:4131/lib/angular/angular.min.js @ 67,280]
00187d38 58f26510 jscript9!Js::InterpreterStackFrame::Process+0x899
00187e74 14ea1c21 jscript9!Js::InterpreterStackFrame::InterpreterThunk<1>+0x1e8
...

When I looked at the line in question, line 251 of main-controller.js, I found:

// TODO: We need to handle the web socket connection going down -- it really should be wrappered
//  try {
      $scope.watcher = new WebSocket("ws://"+location.host+"/");
//  } catch(e) {
//    $scope.watcher = null;
//  }
    
    $scope.$on('$destroy', function() {
      if($scope.watcher) {
        $scope.watcher.close();
        $scope.watcher = null;
      }
    });

It turns out that if I press F5, or call browser.navigate from the container app, the $destroy event was never called, and what’s worse, MSHTML wasn’t shutting down the web socket, either.  And by default, Internet Explorer has a maximum of six concurrent WebSocket connections. So six page loads and we’re done. The connection error is not triggering a script error in the MSHTML host, sadly, which is why it was so hard to track down.

Yeah, if I had already written the error handling for the WebSocket connection in the JavaScript code, this issue wouldn’t have been so hard to trace. But at least I’ve learned a useful new debugging technique!

Speculation and further investigation:

  • I’m still working on why the web socket is staying open between page loads when embedded.  It’s not straightforward, and while onbeforeunload is a workaround that, well, works, I hate workarounds when I don’t have a clear answer as to why they are needed.
  • It is possible that there is a circular reference leak or similar, but I did kinda think MS had sorted most of those with recent releases of IE.
  • This problem is only reproducible within the MSHTML embedded browser, and not within Internet Explorer itself. Changing $scope.watcher to window.watcher made no difference.
  • TWebBrowser (Delphi) and TEmbeddedWB (Delphi) both exhibited the same symptoms.
  • A far simpler document with a web socket call did not cause the problem.

Updated a few minutes later:

Yes, I have found the cause. It’s a classic Internet Explorer leak: a function within a closure that references the DOM. Simplified, it looks something like:

  window.addEventListener( "load", function(event) {
    var aa = document.getElementById('aa');
    var watcher = new WebSocket("ws://"+location.host+"/");
    watcher.onmessage = function() {
      aa.innerHTML = 'message received';
    }
    aa.innerHTML = 'WebSocket started';
  }, false);

This will persist the connection between  page loads, and quickly eat up your available WebSocket connections.  But it still only happens in the embedded web browser. Why that is, I am still exploring.

Updated 30 Nov 2015:

The reason this happens only in the embedded web browser is that by default, the embedded web browser runs in an emulation mode for IE7. Even with the X-UA-Compatible meta tag, you still need to add your executable to the FEATURE_BROWSER_EMULATION registry key.

Risks with third party scripts on Internet Banking sites

This morning, Firefox stalled while loading the login page for the ANZ Internet Banking website. Looking at the status bar, I could see that Firefox was attempting to connect to a website, australianewzealandb.tt.omtrdc.net. This raised immediate alarm bells, because I didn’t recognise the website address, and it certainly wasn’t an official anz.com sub-domain.

ANZ login delay - note the status message at the bottom of the window
ANZ login delay – note the status message at the bottom of the window

The connection eventually started, and the page finished loading — just one of those little glitches loading web pages that everyone encounters all the time, right? But before I entered my ID and password, I decided I wasn’t comfortable to continue without knowing what that website was, and what resources it was providing to the ANZ site.

And here’s where things got a little scary.

It turns out that australianewzealandb.tt.omtrdc.net is a user tracking web site run by marketing firm Omniture, now a part of Adobe.  The ANZ Internet Banking login page is requesting a Javascript script from the server, which currently returns the following tiny piece of code:

if (typeof(mboxFactories) !== 'undefined') {mboxFactories.get('default').get('SiteCatalyst: event', 0).setOffer(new mboxOfferDefault()).loaded();}

The Scare Factor

This script is run within the context of the Internet Banking login page. What can scripts that run within that context do? At worst, a script can be used to watch your login and password and send them (pretty silently) to a malicious host. This interaction may even be undetectable by the bank, and it would be up to you and your computer to be aware of this and block it — a big ask!

At worst, a script can be used to watch your login and password and send them to a malicious host.

The Relief

Now, this particular script is fortunately not malicious! In fact, as the mboxFactories variable actually is undefined on this page, this script does nothing at all. In other words, it’s useless and doesn’t even need to be there!  (It’s defintely possible that the request for the script is being used on the server side to log client statistics, given the comprehensive parameters that are passed in the HTTPS request for the script.)

What are the risks?

So what’s the big deal with running third party script on a website?

The core issue is that scripts from third party sites can be changed at any time, without the knowledge of the ANZ Internet Banking team. In fact, different scripts can be served for different clients — a smart hacker would serve the original script for IP addresses owned by ANZ Bank, and serve a malicious script only to specific targeted clients. There would be no reliable way for the ANZ Internet Banking security team to detect this.

Scripts from third party sites can be changed at any time, without the knowledge of the Internet Banking team.

Another way of looking at this: it’s standard practice in software development to include code developed by other organisations in applications or websites. This is normal, sensible, and in fact unavoidable. The key here is that any code must be vetted and validated by a security team before deployment. If the bank hosts this code on their own servers, this is a straightforward part of the deployment process. When the bank instead references a third party site, this crucial security step is impossible.

Banking websites are among the most targeted sites online, for obvious reasons. I would expect their security policies to be conservative and robust. My research today surprised me.

Trust

How could third party scripts go wrong?

First, australianewzealandb.tt.omtrdc.net is not controlled by ANZ Bank.  It’s controlled by a marketing organisation in another country.  We don’t know how much emphasis they place on security. We are required to trust a third party from another country in order to login to our Internet Banking.

This means we need to trust that none of their employees are malicious, that they have strong procedures in place for managing updates to the site, the servers and infastructure, and that their organisation’s aims are coincident with the tight security requirements of Internet Banking. They need to have the same commitment to security that you would expect your bank to have. That’s a big ask for a marketing firm.

Security

The ANZ Internet Banking website is of course encrypted, served via HTTPS, the industry standard method of serving encrypted web pages.

We can tell, just by looking at the address bar, that anz.com uses an Extended Validation certificate.

With a little simple detective work, we can also see that anz.com serves those pages using the TLS_RSA_WITH_AES_256_CBC_SHA encryption suite, using 256-bit keys.  This is a good strong level of encryption, today.

However, australianewzealandb.tt.omtrdc.net does not measure up. In fact, this site uses 128-bit RC4+SHA encryption and integrity and does not have an Extended Validation certificate. RC4 is not a good choice today, and neither is SHA. This suggests immediately that security is not their top concern, which should then be an immediate concern to us!

ANZ vs Omtrdc Security
ANZ vs Omtrdc Security

I should qualify this a little: Extended Validation certificates are not available for wildcard domains, which is the type of certificate that tt.omtrdc.net is using. This is for a good reason: “in order to ensure that EV SSL Certificates are not issued fraudulently or misused after issuance.” It’s worth thinking through that reason and seeing how it applies to this context.

Malicious Actors

So how could a nasty person steal your money?

In theory, if a nasty person managed to hack into the Adobe server, they could simply supply a script to the Internet Banking login page that captures your login details and sends them to a server, somewhere, anywhere, on the Internet. This means that we have to trust (there’s that word again) that the marketing firm will be proactive in updating and patching not only their Internet-facing servers, but their infrastructure behind those servers as well.

If a bad actor has compromised a certificate authority, as has happened several times recently, they can target these third party servers . Together with a DNS cache poisoning or Man-In-The-Middle (MITM) attack, even security-savvy users will be unlikely to notice fraudulent certificates on the script servers.

heartbleedSecurity flaws like Heartbleed are exacerbated by this setup. Not only do the bank security team have to patch their own servers, they also have to push the third party vendors to patch theirs as a priority.

Protect Yourself

As a user, run security software. That’s an important first step. Security software is regularly refreshed with blacklists of known malicious sites, and this will hopefully minimize any window of opportunity that an untargeted attack may have. I’m not going to recommend any particular brand, because I pretty much hate them all.

If you want to unleash your inner geek and be aware of how sites are using third party script servers, you can use Developer Tools included in your browser — press F12 in Internet Explorer, Chrome or Firefox, and look for the Network tab to see a list of all resources referenced by the site. You may need to press Ctrl+F5 to trigger a ‘hard’ refresh before the list is fully populated.

I’ve shown below the list of resources, filtered for Javascript, for the National Australia Bank Internet Banking site.  You can see two scripts are loaded from one site — again, a market research firm.

nab-resources

Simplistic Advice for Banks

Specifically to mitigate this risk, banks should consider the following:

  • Serve all scripts from your own domain and vet any third party scripts that you serve before deployment.
  • In particular, check third party scripts for back end communication, via AJAX or other channels.
  • Minimize the number of third party scripts anywhere that secure content must be presented.
  • Use the Content-Security-Policy HTTP header to prevent third party scripts on supported browsers (most browsers today support this).

There are of course other mitigations, such as Two Factor Authentication (2FA), which do reduce risk. However, even 2FA should not be considered a silver bullet: it is certainly possible to modify the login page to take over your current login in real time — all the user would see is a message that they’d mistyped their password, and as they login again, the malicious hacker is actively draining money from their account.

A final thought on 2FA: do you really want a hacker to have your banking password, even if they don’t have access to your phone? Why do we have these passwords in the first place?

Browser Developers

I believe that browser vendors could mitigate the situation somewhat by warning users if secure sites reference third party sites for resources, in particular where these secure sites have lower quality protection than the first party site. This protection is already in place where content is requested over HTTP from a HTTPS site, known as mixed content warnings.

There is no value in an Extended Validation certificate if any of the resources requested by the site are served from a site with lower quality encryption! Similarly, if a bank believes that 256-bit AES encryption is needed for their banking website, a browser could easily warn the user that resources are being served with lower quality 128-bit RC4 encryption.

Australian Banks

After this little investigation, I took a quick look at the big four Australian banking sites — ANZ, Commonwealth Bank, National Australia Bank, and Westpac.  Here’s what I found; this is a very high-level overview and contains only information provided by any web browser!

Bank Bank site security # 3rd party scripts Third party sites Third party security
ANZ Bank 256-bit AES (EV certificate) 1 australianewzealandb.tt.omtrdc.net 128-bit RC4
NAB 256-bit AES (EV certificate) 2 survey.112.2o7.net 256-bit AES
Westpac 128-bit AES (EV certificate) None!
Commonwealth Bank 128-bit AES (EV certificate) 9! ssl.google-analytics.com 128-bit AES
commonwealthbankofau.tt.omtrdc.net 128-bit RC4
google-analytics.com 128-bit AES
d1wscoizcbxzhp.cloudfront.net 128-bit AES
cba.demdex.net 128-bit AES

Do you see how the *.tt.omtrdc.net subdomains are used by two different banking sites? In fact, this domain is used by a large number of banking websites. That would make it an attractive target, wouldn’t you think?

I reached out to all 4 banks via Twitter (yeah, I know, I know, “reached out”, “Twitter”, I apologise already), and NAB was the first, and so far only, bank to respond:

Kudos are due to NAB and Westpac — NAB for responding so promptly, and Westpac, for not having the issue in the first place!

Updates (6:10am, 9 Sep 2014), with thanks, in no particular order:

Many thanks to Troy Hunt for suggesting I write this, then tweeting it — and for his continual and tireless work in websec!

Stefano Di Paola mentioned a previous Omniture vulnerability and referenced 3rd party script risks in his blog:

hillbrad⚡ mentioned a W3C project to make validation of sub-resource integrity possible:

Erlend Oftedal reminded me that this is not a new issue and mentioned his blog post from 2009: