Category Archives: Bugs

The Fragile Abstract Factory Delphi Pattern

When refactoring a large monolithic executable written in Delphi into several executables, we ran into an unanticipated issue with what I am calling the fragile abstract factory (anti) pattern, although the name is possibly not a perfect fit.  To get started, have a look at the following small program that illustrates the issue.

program FragileFactory;

uses
  MyClass in 'MyClass.pas',
  GreenClass in 'GreenClass.pas',
  //RedClass in 'RedClass.pas',
  SomeProgram in 'SomeProgram.pas';

var
  C: TMyClass;
begin
  C := TMyClass.GetObject('TGreenClass');
  writeln(C.ToString);
  C.Free;

  C := TMyClass.GetObject('TRedClass');  // oh dear… that’s going to fail
  writeln(C.ToString);
  C.Free;
end.
unit MyClass;

interface

type
  TMyClass = class
  protected
    class procedure Register;
  public
    class function GetObject(ClassName: string): TMyClass;
  end;

implementation

uses
  System.Contnrs,
  System.SysUtils;

var
  FMyClasses: TClassList = nil;

{ TMyObjectBase }

class procedure TMyClass.Register;
begin
  if not Assigned(FMyClasses) then
    FMyClasses := TClassList.Create;
  FMyClasses.Add(Self);
end;

class function TMyClass.GetObject(ClassName: string): TMyClass;
var
  i: Integer;
begin
  for i := 0 to FMyClasses.Count – 1 do
    if FMyClasses[i].ClassNameIs(ClassName) then
    begin
      Result := FMyClasses[i].Create;
      Exit;
    end;

  Result := nil;
end;

initialization
finalization
  FreeAndNil(FMyClasses);
end.
unit GreenClass;

interface

uses
  MyClass;

type
  TGreenClass = class(TMyClass)
  public
    function ToString: string; override;
  end;

implementation

{ TGreenClass }

function TGreenClass.ToString: string;
begin
  Result := 'I am Green';
end;

initialization
  TGreenClass.Register;
end.

What happens when we run this?

C:\src\fragilefactory>Win32\Debug\FragileFactory.exe
I am Green
Exception EAccessViolation in module FragileFactory.exe at 000495A4.
Access violation at address 004495A4 in module ‘FragileFactory.exe’. Read of address 00000000.

Note the missing TRedClass in the source.  We don’t discover until runtime that this class is missing.  In a project of this scale, it is pretty obvious that we haven’t linked in the relevant unit, but once you get a large project (think hundreds or even thousands of units), it simply isn’t possible to manually validate that the classes you need are going to be there.

There are two problems with this fragile factory design pattern:

  1. Delphi use clauses are fragile (uh, hence the name of the pattern).  The development environment frequently updates them, typically they are not organised, sorted, or even formatted neatly.  This makes validating changes to them difficult and error-prone.  Merging changes in version control systems is a frequent cause of errors.
  2. When starting a new Delphi project that utilises your class libraries, ensuring you use all the required units is a hard problem to solve.

Typically this pattern will be used with in a couple of ways, somewhat less naively than the example above:

  1. There will be an external source for the identifiers.  In the project in question, these class names were retrieved from a database, or from linked resources.
  2. The registered classes will be iterated over and each called in turn to perform some function.

Of course, this is not a problem restricted to Delphi or even this pattern.  Within Delphi, any unit that does work in its initialization section is prone to this problem.  More broadly, any dynamically linking registry, such as COM, will have similar problems.  The big gotcha with Delphi is that the problem can only be resolved by rebuilding the project, which necessitates rollout of an updated executable — much harder than just re-registering a COM object on a client site for example.

How then do we solve this problem?  Well, I have not identified a simple, good clean fix.  If you have one, please tell me!  But here are a few things that can help.

  1. Where possible, use a reference to the class itself, such as by calling the class’s ClassName function, to enforce linking the identified class in.  For example:
    C := TMyClass.GetObject(TGreenClass.ClassName);
    C := TMyClass.GetObject(TRedClass.ClassName);
  2. When the identifiers are pulled from an external resource, such as a database, you have no static reference to the class.  In this case, consider building a registry of units, automatically generated during your build if possible.  For example, we automatically generate a registry during build that looks like this:
    unit MyClassRegistry;
    
    // Do not modify; automatically generated 
    
    initialization
    procedure AssertMyClassRegistry;
    
    implementation
    
    uses
      GreenClass,
      RedClass;
    
    procedure AssertMyClassRegistry;
    begin
      Assert(TGreenClass.InheritsFrom(TMyClass));
      Assert(TRedClass.InheritsFrom(TMyClass));
    end; 
    
    end.

    These are not assertions that are likely to fail but they do serve to ensure that the classes are linked in.  The AssertMyClassRegistry function is called in the constructor of our main form, which is safer than relying on a use clause to link it in.

  3. Units that can cause this problem can be identified by searching for units with initialization sections in your project (don’t forget units that also use the old style begin instead of initialization — a helpful grep regular expression for finding these units is (?s)begin(?:.(?!end;))+\bend\.).  This at least gives you a starting point for making sure you test every possible case.  Static analysis tools are very helpful here.
  4. Even though it goes against every encapsulation design principle I’ve ever learned, referencing the subclass units in the base class unit is perhaps a sensible solution with a minimal cost.  We’ve used this approach in a number of places.
  5. Format your use clauses, even sorting them if possible, with a single unit reference on each line.  This is a massive boon for source control.  We went as far as building a tool to clean our use clauses, and found that this helped greatly.

In summary, then, the fragile abstract factory (anti) pattern is just something to be aware of when working on large Delphi projects, mainly because it is hard to test for: the absence of a unit only comes to light when you actually call upon that unit, and due to the fragility of Delphi’s use clauses, unrelated changes are likely to trigger the issue.

How not to re-raise an exception in Delphi

Just a quick exception management tip for today.

I was debugging a weird cascade of exceptions in an application today — it started with an EOleException from a database connection issue, and rapidly degenerated into a series of EAccessViolation and EInvalidPointer exceptions: often a good sign of Use-After-Free or Double-Free scenarios.  Problem was, I could not see any place where we could be using a object after freeing it, even in the error case.

Here’s a shortened version of the code:

procedure ConnectToDatabase;
begin
  try
    CauseADatabaseException;  // yeah... bear with me here.
  except
    on E: EDatabaseError do
    begin
      E.Message := 'Failed to connect to database; the '+ 
                   'error message was: ' + E.Message;
      raise(E);
    end;
  end;
end;

Can you spot the bug?

I must admit I read through the code quite a few times before I spotted it.  It’s not an in-your-face-look-at-me type of bug!  In fact, the line in question appears to be completely logical and plausible.

Okay, enough waffle.  The bug is in the last line of the exception handler:

      raise(E);

When we call raise(E) we are telling Delphi that here is a shiny brand new exception object that we want to raise.  After Delphi raises the exception object, the original exception object is freed, and … wait a minute, that’s the exception object we were raising!  One delicious serve of Use-After-Free or Double-Free coming right up!

We should be doing this instead:

      raise;  // don't reference E here!

Another thing to take away from this is: remember that you don’t control the lifetime of the exception object.  Don’t store references to it and expect it to survive.  If you want to maintain knowledge of an inner exception object, use Delphi’s own nested exception management, available since Delphi 2009.

The Southern Ocean… According to Apple

Just like everyone else, I found Apple’s new train wreck release of iOS 6 maps bizarre. It’s rather unlike Apple to publish such an unpolished feature. The straight-line vector graphics look amateur when zoomed in and generally a step back from the Google maps. But worse than the lack of polish is of course the outright data errors. I can understand a misplaced town in an insignificant country such as the United Kingdom. But to misplace a whole ocean? I mean, it’s not like there’s that many of them to keep track of.

Here’s the evidence. You’ll also note the comparatively minor issue of a missing river. Minor compared to an ocean, anyway (on Google Maps):




However, before you hanker for the good old Google Maps, let’s take a quick look at the same location:

Looks good, river staying right where it belongs like a good little river, and the Southern Ocean has decided to skip town.  But there’s something different about that Derwent Park Rd: it seems to be a major through-road on Google Maps but is a dead end on Apple’s map:

Google should know it’s not a major through road.  Their Street View car even found the evidence. It’s a private road.  With a gate.

What’s the moral of the story?  Use Bing Maps?  Nope.  They include our little dried-up ocean, fortunately nameless, but also perpetuate that little issue with the closed road (and yes, both Bing and Google will happily direct you to crash through the gate, if you ask them for directions).

Still, I think Apple’s maps take home the prize!

Delphi, Win32 and leaky exceptions

Have you ever written any Delphi code like the following snippet?  That is, have you ever raised a Delphi exception in a Win32 callback?  Or even just failed to handle potential exceptions in a callback?

    function MyWndEnumFunc(hwnd: HWND; lParam: LPARAM): BOOL; stdcall;
    begin
      if hwnd = TForm4(lParam).Handle then
        raise Exception.Create('Raise an exception just to demonstrate the issue');
      Result := True;
    end;

    procedure TForm4.FormCreate(Sender: TObject);
    begin
      try
        EnumWindows(@MyWndEnumFunc, NativeInt(Self));
      except
        on E:Exception do
          ShowMessage('Well, we unwound that exception.  But does Win32 agree?');
      end;
    end;

Raymond Chen has posted a great blog today about why that approach will eventually end in tears.  You may get away with it for a while, or you may end up with horrific stack or heap corruption. The moral of the story is, any time you have a Win32 callback function, you need to make sure no exceptions leak.  Like this:

function MyWndEnumFunc(hwnd: HWND; lParam: LPARAM): BOOL; stdcall;
begin
  try
    if hwnd = TForm4(lParam).Handle then
      raise Exception.Create('Raise an exception just to demonstrate the issue');
    Result := True;
  except
    // Handle the exception, perhaps pass it to Application.HandleException,
    // or log it, or abort your app.  Just don’t let it bubble through any
    // Win32 code.  You need to write the HandleAllExceptions function!
    HandleAllExceptions;
    Result := False;
  end;
end;

If you use Delphi’s AllocateHwnd procedure, remember that it also does not handle exceptions for you (I’ve just reported this in QualityCentral as QC108653 as this caveat should at least be documented).  So you need to do it:

procedure TForm4.MyAllocatedWindowProc(var Message: TMessage);
begin
  try
    if Message.Msg = WM_USER then
      raise Exception.Create('Go wild!');
  except
    Application.HandleException(Self);  // Or whatever, just don’t let it leak
  end;
  with Message do Result := DefWindowProc(FMyHandle, Msg, wParam, lParam);
end;

procedure TForm4.FormCreate(Sender: TObject);
begin
  FMyHandle := AllocateHwnd(MyAllocatedWindowProc);
  SendMessage(FMyHandle, WM_USER, 0, 0);
end;

Why does switching languages hang my app?

This is a very short follow-up to yesterday’s blog about WM_INPUTLANGCHANGEREQUEST, WaitForSingleObject and applications hanging.

While I thought, yesterday, that this bug was no longer such a problem on Windows Vista and Windows 7, it turns out that if you assign a hotkey to a language (e.g. Alt+Shift+2), the KLF_SETFORPROCESS flag will be set when that hotkey is pressed.  This means that this issue is still a problem today.

This is not just a problem for Delphi.  I’ve found references to Outlook hanging, .NET 4.5 WPF applications hanging, and VC++ applications hanging.

Moral of the story (again): thread-safe code is hard.  Don’t believe that your code suddenly becomes thread-safe because you synchronize it to the main thread (no matter what the Delphi documentation tells you!)

WaitForSingleObject. Why you should never use it.

Updated 30/1/2013: Fixed a typo and formatting issue or two.  Added a sentence for clarity.  See also the short follow-up post.

I’ve really struggled with how to frame this post.  It could be about the dangers of WaitForSingleObject and WaitForMultipleObjects.  Or about how Delphi’s TThread.Synchronize seems so handy, and yet because it must use WaitForSingleObject, is so fraught with complications.  Or yet, about how pressing Alt+Left Shift to switch languages could hang an application.  In the end, it’s about all three of these things.

Some people, when confronted with a problem, think, “I know, I’ll use threads,” and then two they hav erpoblesms.

Let’s start with TThread.Synchronize.  It seems a handy little function: it will, in a thread-safe manner, call a procedure in the context of the main VCL thread (the thread that owns the Delphi VCL GUI windows), and wait, using the Windows API WaitForSingleObject(INFINITE), for the procedure to return.  Simpler than fiddling with synchronisation primitives, right?  Everyone knows threading is hard, so use the well-tested thread utilities where you can?

Except that WaitForSingleObject and its big brother WaitForMultipleObjects are dangerous.  The basic problem is that these calls can cause deadlocks, if you ever call them from a thread that has its own message loop and windows.  That’s okay, you say, I don’t have any UI except in my main thread.  But any thread that uses COM can have hidden COM helper windows when doing RPC (and see below for more on this).  And other libraries can create their own windows as well, such as the ADO libraries.

So what causes the deadlock? Well, I’ll illustrate with the scenario we ran into.  Some old code (that I wrote, okay, okay) created a thread (we’ll call it BackgroundThread).  BackgroundThread used TThread.Synchronize to periodically update the UI status about a background database process it was running.  It doesn’t really matter what it was doing, but the use of Microsoft’s ADO database library meant that this thread was creating a hidden window, with the class ADODB.AsyncEventMessenger.  Behind the scenes, a second window was automatically created by Windows once we had the first window in the thread, and this one had the class name IME.

Every now and then, our BackgroundThread would call Synchronize(RefreshStatus).  This would signal an event which the main thread would check periodically from its message loop.  Eventually it would call the RefreshStatus procedure.  BackgroundThread would in the meantime have called WaitForSingleObject(INFINITE) to wait for an event to be signaled by the main thread indicating that the RefreshStatus procedure had finished.

Where are we going?  Well, if the main thread receives a message that it then decides to send on to other windows in the process, while BackgroundThread is getting ready to synchronize, we can end up in a deadlock.  And, it turns out that in Windows XP, this can happen when WM_INPUTLANGCHANGEREQUEST (0x50) is received, e.g. when the user presses Alt+Left Shift.  (Note, for this scenario to play, the Input Method Manager must be enabled — install Far Eastern language support in Windows XP).  Remember, this is just one possible scenario which can trigger a deadlock.

Let’s pull this apart.  I’ve loaded the stalled process into WinDbg, and am now looking at two call stacks which have deadlocks, in user mode.  First the main VCL GUI thread:

   0  Id: bf8.cb8 Suspend: 1 Teb: 7ffde000 Unfrozen
ChildEBP RetAddr
0012fadc 7e4194be ntdll!KiFastSystemCallRet
0012fb30 7e43652f USER32!NtUserMessageCall+0xc
0012fb50 7e418734 USER32!EditWndProcW+0x5d
0012fb7c 7e418816 USER32!InternalCallWinProc+0x28
0012fbe4 7e42a013 USER32!UserCallWinProcCheckWow+0x150
0012fc14 7e42a039 USER32!CallWindowProcAorW+0x98
0012fc34 004c0e7d USER32!CallWindowProcW+0x1b
0012fda4 004c0d80 audit4_home!Vcl.Controls.TWinControl.DefaultHandler+0xdd
0012fdf0 004c03d3 audit4_home!Vcl.Controls.TWinControl.WndProc+0x5b8
0012fe20 00467b3e audit4_home!Vcl.Controls.TWinControl.MainWndProc+0x2f
0012fe38 7e418734 audit4_home!System.ClassesStdWndProc+0x16
0012fe64 7e418816 USER32!InternalCallWinProc+0x28
0012fecc 7e4189cd USER32!UserCallWinProcCheckWow+0x150
0012ff2c 7e418a10 USER32!DispatchMessageWorker+0x306
0012ff3c 005a6980 USER32!DispatchMessageW+0xf
0012ff58 005a69c3 audit4_home!Vcl.Forms.TApplication.ProcessMessage+0xf8
0012ff7c 00d26d60 audit4_home!Vcl.Forms.TApplication.HandleMessage+0xf
0012ff9c 016e6903 audit4_home!S4s.Ui.Session.Appsession_main.TAppSession_Main.Run+0xcc
0012ffc0 7c817077 audit4_home!Audit4_home.initialization+0xc3
0012fff0 00000000 kernel32!BaseProcessStart+0x23

And then our BackgroundThread:

   9  Id: bf8.f30 Suspend: 1 Teb: 7ffd5000 Unfrozen
ChildEBP RetAddr
0494fdb8 7c90df5a ntdll!KiFastSystemCallRet
0494fdbc 7c8025db ntdll!NtWaitForSingleObject+0xc
0494fe20 7c802542 kernel32!WaitForSingleObjectEx+0xa8
0494fe34 0042d4b3 kernel32!WaitForSingleObject+0x12
0494fe80 00408a19 audit4_home!System.Sysutils.WaitForSyncWaitObj+0x7
0494fed0 004654f2 audit4_home!System.TMonitor.Wait+0x25
0494fedc 00c9f3a2 audit4_home!System.Classes.TThread.Synchronize+0x2e
0494fef8 00c9ef29 audit4_home!S4s.Br.Backgroundclasses.Backgroundclass.TBackgroundClass.SetStatus+0x8a
0494ff70 00464b11 audit4_home!S4s.Br.Backgroundclasses.Backgroundclass.TBackgroundClass.Execute+0x18d
0494ffa0 00409752 audit4_home!System.Classes.ThreadProc+0x45
0494ffb4 7c80b729 audit4_home!SystemThreadWrapper+0x2a
0494ffec 00000000 kernel32!BaseThreadStart+0x37

We can see that the main thread has sent a message somewhere.  It turns out it has sent a message to a window in the same thread (The window handle 0006040c is just the Edit window):

0012fb30 7e43652f 0006040c 00000050 00000001 USER32!NtUserMessageCall+0xc
0012fb50 7e418734 0006040c 00000050 00000001 USER32!EditWndProcW+0x5d

So why is it stalling?  It’s hard to determine here, because everything bad is happening in kernel mode, behind that KiFastSystemCallRet call. That must mean it’s time to step into kernel mode!  In WinDbg, press Ctrl+K, select Local.  I’m learning here, so this is as much for my own documentation as to explain to anyone else (in other words, if it doesn’t make sense to you, it won’t make sense to me, either, in 6 weeks time).  First, we find the process details:

lkd> !process 0 0 audit4_home.exe
PROCESS 894c3710  SessionId: 0  Cid: 0bf8    Peb: 7ffdf000  ParentCid: 0e0c
DirBase: 0a5c0700  ObjectTable: 00000000  HandleCount:   0.
Image: audit4_home.exe

PROCESS 893277e0  SessionId: 0  Cid: 0d94    Peb: 7ffd8000  ParentCid: 0e0c
DirBase: 0a5c0720  ObjectTable: 00000000  HandleCount:   0.
Image: audit4_home.exe

PROCESS 8950d9a8  SessionId: 0  Cid: 0f10    Peb: 7ffdb000  ParentCid: 0e0c
DirBase: 0a5c0740  ObjectTable: 00000000  HandleCount:   0.
Image: audit4_home.exe

PROCESS 89566758  SessionId: 0  Cid: 0520    Peb: 7ffde000  ParentCid: 0e0c
DirBase: 0a5c0840  ObjectTable: e127a330  HandleCount: 467.
Image: audit4_home.exe

We’ll look at the last process listed, being the one with the problem (the other three are defunct):

lkd> !process 89566758 2
PROCESS 89566758  SessionId: 0  Cid: 0520    Peb: 7ffde000  ParentCid: 0e0c
DirBase: 0a5c0840  ObjectTable: e127a330  HandleCount: 467.
Image: audit4_home.exe

THREAD 8953a920  Cid 0520.009c  Teb: 7ffdd000 Win32Thread: e1210eb0 WAIT: (Suspended) KernelMode Non-Alertable
SuspendCount 1
FreezeCount 1
8953aabc  Semaphore Limit 0x2

THREAD 8963bda0  Cid 0520.0a98  Teb: 7ffd9000 Win32Thread: 00000000 WAIT: (Suspended) KernelMode Non-Alertable
SuspendCount 1
FreezeCount 1
8963bf3c  Semaphore Limit 0x2

THREAD 8953b9a0  Cid 0520.03c8  Teb: 7ffd8000 Win32Thread: 00000000 WAIT: (Suspended) KernelMode Non-Alertable
SuspendCount 1
FreezeCount 1
8953bb3c  Semaphore Limit 0x2

THREAD 895a2c08  Cid 0520.021c  Teb: 7ffd7000 Win32Thread: 00000000 WAIT: (Suspended) KernelMode Non-Alertable
SuspendCount 1
FreezeCount 1
895a2da4  Semaphore Limit 0x2

THREAD 898f8cc8  Cid 0520.0bd8  Teb: 7ffd4000 Win32Thread: e2d75870 WAIT: (Suspended) KernelMode Non-Alertable
SuspendCount 1
FreezeCount 1
898f8e64  Semaphore Limit 0x2

THREAD 893287f8  Cid 0520.0eec  Teb: 7ff4f000 Win32Thread: 00000000 WAIT: (Suspended) KernelMode Non-Alertable
SuspendCount 1
FreezeCount 1
89328994  Semaphore Limit 0x2

THREAD 89308c30  Cid 0520.00f0  Teb: 7ff4e000 Win32Thread: 00000000 WAIT: (Suspended) KernelMode Non-Alertable
SuspendCount 1
FreezeCount 1
89308dcc  Semaphore Limit 0x2

THREAD 89ab4020  Cid 0520.01b0  Teb: 7ff4d000 Win32Thread: e2ed4260 WAIT: (Suspended) KernelMode Non-Alertable
SuspendCount 1
FreezeCount 1
89ab41bc  Semaphore Limit 0x2

THREAD 893269b0  Cid 0520.0830  Teb: 7ff4c000 Win32Thread: e23bf2c0 WAIT: (Suspended) KernelMode Non-Alertable
SuspendCount 1
FreezeCount 1
89326b4c  Semaphore Limit 0x2

THREAD 89895778  Cid 0520.0ce0  Teb: 7ff4b000 Win32Thread: e2fa06e8 WAIT: (Suspended) KernelMode Non-Alertable
SuspendCount 1
FreezeCount 1
89895914  Semaphore Limit 0x2

THREAD 8933ba80  Cid 0520.0e08  Teb: 7ff4a000 Win32Thread: 00000000 WAIT: (Suspended) KernelMode Non-Alertable
SuspendCount 1
FreezeCount 1
8933bc1c  Semaphore Limit 0x2

THREAD 89322728  Cid 0520.0f28  Teb: 7ff49000 Win32Thread: 00000000 WAIT: (Suspended) KernelMode Non-Alertable
SuspendCount 1
FreezeCount 1
893228c4  Semaphore Limit 0x2

THREAD 89449960  Cid 0520.0ab0  Teb: 7ff48000 Win32Thread: e12d5590 WAIT: (Suspended) KernelMode Non-Alertable
SuspendCount 1
FreezeCount 1
89449afc  Semaphore Limit 0x2

THREAD 8952d020  Cid 0520.09dc  Teb: 7ffdc000 Win32Thread: 00000000 WAIT: (Executive) KernelMode Non-Alertable
SuspendCount 1
b8c197d4  SynchronizationEvent

And we can see the first thread has the address 8953a920.  So let’s look at that (flag 16 means show the full stack in the process context, with parameters).

lkd> !thread 8953a920 16
THREAD 8953a920  Cid 0520.009c  Teb: 7ffdd000 Win32Thread: e1210eb0 WAIT: (Suspended) KernelMode Non-Alertable
SuspendCount 1
FreezeCount 1
8953aabc  Semaphore Limit 0x2
Not impersonating
DeviceMap                 e1050c08
Owning Process            0       Image:
Attached Process          89566758       Image:         audit4_home.exe
Wait Start TickCount      213083         Ticks: 67800 (0:00:17:39.375)
Context Switch Count      20200  NoStackSwap    LargeStack
UserTime                  00:00:00.781
KernelTime                00:00:01.484
Win32 Start Address audit4_home!Audit4_home.initialization (0x016e6840)
Start Address kernel32!BaseProcessStartThunk (0x7c810705)
Stack Init b7c1f000 Current b7c1e840 Base b7c1f000 Limit b7c19000 Call 0
Priority 10 BasePriority 8 PriorityDecrement 0 DecrementCount 16
ChildEBP RetAddr  Args to Child
b7c1e858 80503864 8953a990 8953a920 804fb094 nt!KiSwapContext+0x2f (FPO: [Uses EBP] [0,0,4])
b7c1e864 804fb094 8953aa8c 8953a920 8953a954 nt!KiSwapThread+0x8a (FPO: [0,0,0])
b7c1e88c 80502fa0 00000000 00000005 00000000 nt!KeWaitForSingleObject+0x1c2 (FPO: [5,5,4])
b7c1e8a4 804ff8e0 00000000 00000000 00000000 nt!KiSuspendThread+0x18 (FPO: [3,0,0])
b7c1e8ec 80503882 00000000 00000000 00000000 nt!KiDeliverApc+0x124 (FPO: [3,10,0])
b7c1e904 804fb094 00000240 e1210eb0 00000000 nt!KiSwapThread+0xa8 (FPO: [0,0,0])
b7c1e92c bf802f15 00000000 0000000d 00000001 nt!KeWaitForSingleObject+0x1c2 (FPO: [5,5,4])
b7c1e968 bf835eb7 00000200 00000000 00000000 win32k!xxxSleepThread+0x192 (FPO: [3,5,4])
b7c1ea04 bf8141d2 bbe83720 00000287 00000019 win32k!xxxInterSendMsgEx+0x7f6 (FPO: [Non-Fpo])
b7c1ea50 bf80ecd9 bbe83720 00000287 00000019 win32k!xxxSendMessageTimeout+0x11f (FPO: [7,7,0])
b7c1ea74 bf92b42e bbe83720 00000287 00000019 win32k!xxxSendMessage+0x1b (FPO: [4,0,0])
b7c1eaa4 bf92c675 e2d75870 e13f10e0 e13f10e0 win32k!xxxImmActivateLayout+0x5b (FPO: [2,3,4])
b7c1ec08 bf8696d9 00000004 00000000 e13f10e0 win32k!xxxImmActivateThreadsLayout+0x10c (FPO: [3,82,4])
b7c1ec48 bf86862f e13f10e0 00000100 bbe88228 win32k!xxxInternalActivateKeyboardLayout+0xb7 (FPO: [3,8,4])
b7c1ec70 bf80b5a7 8989a2d0 04090c09 00000100 win32k!xxxActivateKeyboardLayout+0x4c (FPO: [4,3,0])
b7c1ecd4 bf80ec9f bbe88228 00000050 00000001 win32k!xxxRealDefWindowProc+0x56d (FPO: [4,16,0])
b7c1ecec bf81c176 bbe88228 00000050 00000001 win32k!xxxWrapRealDefWindowProc+0x16 (FPO: [5,0,0])
b7c1ed08 bf80eee6 bbe88228 00000050 00000001 win32k!NtUserfnNCDESTROY+0x27 (FPO: [7,0,0])
b7c1ed40 8054168c 0006040c 00000050 00000001 win32k!NtUserMessageCall+0xae (FPO: [7,3,0])
b7c1ed40 7c90e514 0006040c 00000050 00000001 nt!KiFastCallEntry+0xfc (FPO: [0,0] TrapFrame @ b7c1ed64)
0012fadc 7e4194be 7e428e0d 0006040c 00000050 ntdll!KiFastSystemCallRet (FPO: [0,0,0])
0012fb30 7e43652f 0006040c 00000050 00000001 USER32!NtUserMessageCall+0xc
0012fb50 7e418734 0006040c 00000050 00000001 USER32!EditWndProcW+0x5d (FPO: [4,0,4])
0012fb7c 7e418816 7e4364cf 0006040c 00000050 USER32!InternalCallWinProc+0x28
0012fbe4 7e42a013 00000000 7e4364cf 0006040c USER32!UserCallWinProcCheckWow+0x150 (FPO: [Non-Fpo])
0012fc14 7e42a039 7e4364cf 0006040c 00000050 USER32!CallWindowProcAorW+0x98 (FPO: [6,0,0])
0012fc34 004c0e7d 7e4364cf 0006040c 00000050 USER32!CallWindowProcW+0x1b (FPO: [5,0,0])
0012fda4 004c0d80 02940d99 02bb9430 fffffffe audit4_home!Vcl.Controls.TWinControl.DefaultHandler+0xdd
0012fdf0 004c03d3 0012fe04 004c03eb 0012fe20 audit4_home!Vcl.Controls.TWinControl.WndProc+0x5b8
0012fe20 00467b3e 00000050 00000001 04090c09 audit4_home!Vcl.Controls.TWinControl.MainWndProc+0x2f
0012fe38 7e418734 0006040c 00000050 00000001 audit4_home!System.ClassesStdWndProc+0x16
0012fe64 7e418816 02940d99 0006040c 00000050 USER32!InternalCallWinProc+0x28
0012fecc 7e4189cd 00000000 02940d99 0006040c USER32!UserCallWinProcCheckWow+0x150 (FPO: [Non-Fpo])
0012ff2c 7e418a10 0012ff60 00000000 0006040c USER32!DispatchMessageWorker+0x306 (FPO: [Non-Fpo])
0012ff3c 005a6980 0012ff60 00120100 0012ff9c USER32!DispatchMessageW+0xf (FPO: [1,0,0])
0012ff4c 7c910222 0000000f 028458e0 005a69c3 audit4_home!Vcl.Forms.TApplication.ProcessMessage+0xf8
0012ff9c 016e6903 0012ffb0 016e6916 0012ffc0 ntdll!RtlpAllocateFromHeapLookaside+0x42 (FPO: [Non-Fpo])
0012ffc0 7c817077 7c910222 0000000f 7ffde000 audit4_home!Audit4_home.initialization+0xc3
0012fff0 00000000 016e6840 00000000 00000018 kernel32!BaseProcessStart+0x23 (FPO: [Non-Fpo])

Highlighted above there is a call to win32k!xxxSendMessage.  This is passed the address of a window at bbe83720.  And when we look at that, we see the following:

lkd> dd bbe83720
bbe83720  000d01fe 00000006 e2d75870 8998ae50

And window 000d01fe is the IME window for the other, deadlocked thread!  Why is the kernel sending a message to another thread here and now?  And without a timeout?  That’s our deadlock.  We know how it happens, but I at least still don’t know why!

Still, that’s a step forward.  Next was to figure out why we were getting that.  After a fair bit of exploration, I looked at little harder at the call stack, and decided to investigate the parameters for xxxActivateKeyboardLayout:

8989a2d0 04090c09 00000100 win32k!xxxActivateKeyboardLayout+0x4c 

That third parameter in xxxActivateKeyboardLayout corresponds to the Flags parameter for ActivateKeyboardLayout.  Perusing the documentation for the Windows ActivateKeyboardLayout function, we can see that 0x100 is KLF_SETFORPROCESS.  Bingo!  That sounds pretty suspicious!  I worked through the assembly code for xxxActivateKeyboardLayout and xxxRealDefWindowProc, and sure enough, that was it: when xxxRealDefWindowProc processes WM_INPUTLANGCHANGEREQUEST, it sets the Flags parameter to 0x0100:

win32k!xxxRealDefWindowProc+0x44d:
8211f1dd 56              push    esi
8211f1de 6800010000      push    100h
8211f1e3 ff7514          push    dword ptr [ebp+14h]
8211f1e6 6a00            push    0
8211f1e8 e8dfe40100      call    win32k!_GetProcessWindowStation (8213d6cc)
8211f1ed 50              push    eax
8211f1ee e838bdf9ff      call    win32k!xxxActivateKeyboardLayout (820baf2b)

In Windows Vista and later versions, this problem does not arise as frequently, because the Text Services Framework takes over the Alt+Left Shift command (before DefWindowProc gets a look-in) and calls ActivateKeyboardLayout without the KLF_SETFORPROCESS flag set.  However, the issue can still arise if you ever use that flag in your own code when calling ActivateKeyboardLayout.

To summarize, it’s helpful to remember that technically, we were using the functions as designed.  The primary issue is combination of design flaws, first, with how that KLF_SETFORPROCESS flag is handled: either the kernel code should be using a timeout when it sends the message to the other thread, or it should be queuing an event for the other thread to handle when it gets around to it; and second, with TThread.Synchronize, which unfortunately by design cannot be robust against deadlocks.

In any case, to make your own code more robust:

  1. Determine, when using synchronisation calls, if there is any chance that windows could be created by your thread, and if so, use MsgWaitForMultipleObjects and a message loop instead of WaitForSingleObject or WaitForMultipleObjects.  Check the libraries you are using, and if you are using COM, it’s safest to assume that windows will be created.
  2. Don’t use the TThread.Synchronize procedure.  Recent versions of Delphi include TThread.Queue, which is asynchronous, and so avoids this deadlock.
  3. Think carefully about whether a thread is the right solution to the problem.
  4. Don’t use the KLF_SETFORPROCESS flag!

Mixing RTL and LTR: Plaintext vs HTML

Just a very short post today.  We still have some way to go in mixing RTL and LTR text.  For example, the following image, snipped from Outlook 2010, shows the issue:

The subject and body both say the same thing, but the display order is different.  Do you know why?  It’s because the subject is plain text and is assuming that the text is primarily right-to-left, whereas the body is HTML, and is assuming that the text is primarily left-to-right.

Note how the full stop in the subject appears to the left of the English text.  This is because the display renderer has assumed that the whole run of text is right-to-left, so punctuation is treated as right-to-left, and so displays after (in a right-to-left sense) the text.

The question is, of course, how do you determine directionality given an arbitrary plain text string?  It’s not really possible to do so reliably in the absence of other metadata.  The W3C article on directionality is helpful here: http://www.w3.org/TR/html4/struct/dirlang.html

Another view of the message:

Interestingly, Outlook Web Access does not do this, because its UI takes its directionality from the base HTML document:

When characters go astray: diagnosing missing characters when printing with IE9

I was recently landed with a real doozy of a support case: on some systems, under some circumstances, print jobs would be missing random characters.  Like the following image:

print-1

As opposed to how it should appear:

print-2

This example is missing the following letters:  Å ê Ö q Q

In this particular example, you may be misled into thinking the problem is just with non-ASCII characters.  But note that both upper case Q and lower case q are missing as well.  Other example documents that we collected had no non-ASCII characters in them at all.  I just used this example because it was one of the few we captured with fake data as opposed to real customer data.

This problem was not particularly printer dependent, and could be replicated (thank goodness!) with a PDF print driver.  But the circumstances behind replicating the problem were a little strange: it only occurred, in the second document printed, when more than one print job was sent to the printer in quick succession.  Some combinations of documents did not reflect the issue.  If we sent the documents in the opposite order, they would often succeed.  We were unable to replicate the problem on the printer in our office.

After many tries, I finally managed to replicate the issue on my development machine.  Now I could break out the debugging tools and start some tracing.

I first used Process Monitor to capture details about temporary files in the print process.  A lot of temporary files were generated.  I examined many of them, but the files appeared intact: the problem was not reflected at this point in the print job.

print-3

Next I attached a debugger to the process to try and figure out if I could spot any issues during the print itself.  I put a breakpoint on GDI32!StartDocW and when that fired, enabled a breakpoint on GDI32!ExtTextOutW.  This showed me all the strings that were being sent to the page during the print job.

Unfortunately, Internet Explorer 9 uses a process of printing to XPS and then converting that output to GDI commands for printing.  The library which does the conversion from XPS to GDI used the flag ETO_GLYPH_INDEX when calling ExtTextOutW.  This meant that examining the data in the debugger was pretty much a non-starter, without any easy to spot strings.  So eventually I decided to log the whole lot to a text file, and pull the TrueType font glyph indices in the hope that the information in the TTF file would be sufficient to reconstruct the original text strings.

As it turned out, I was trivially able to match that data up to the XPS data by generating the report as an XPS.  These are just a ZIP file with fonts, images and XML rendering data embedded.  Easy to spelunk.  It certainly seems that the correct glyph indices were being sent to the print engine.  I matched these up between the data captured in the debugger as follows.

The text string:

Dr Åbrique T Ömbê

The XPS element:

Pulling out just the Glyph indices from the XPS element:

39;85;3;99;69;85;76;84;88;72;3;55;3;103;80;69;114

Converting those glyph indices to hexadecimal:

0027 0055 0003 0063 0045 0055 004c 0054
0058 0048 0003 0037 0003 0067 0050 0045
0072

Matching the data captured from ExtTextOutW:

233a4fe0  00550027 00630003 00550045 0054004c
233a4ff0  00480058 00370003 00670003 00450050
233a5000  abab0072

Note that the DWORDs are reversed because Intel uses little endian byte order (and abab is just garbage past the end of the array).

This suggested that the font data itself is being corrupted.  Next, to check the font data in the .ps file as it is generated!  Helpfully, PrimoPDF stores a .ps file in a temporary folder during the conversion to PDF.  So it was just a matter of grabbing that file and looking at that.  Well, of course PostScript and RTF are near the top of the list of painfully unreadable file formats, but a little trial and error (and GhostScript) found me the following:

779 1636 M [66 31 23 66 46 30 26 46 45 41 23 56 23 66 71 46  0]xS

Of which, 080B03380E0B1B3905090322033A0D0E3B should translate to Dr Åbrique T Ömbê:

08 0B 03 38 0E 0B 1B 39 05 09 03 22 03 3A 0D 0E 3B
D  r     Å  b  r  i  q  u  e     T     Ö  m  b  ê

Interestingly, the missing glyphs are all in the range 38-3B in this example.  This line is identical in a version of the file that printed without errors, but the font data is quite different.

Right, so we know data corruption is occurring in the font data.  But we knew that data corruption was the issue already.  So where are we now?  Have I accomplished anything by going down this track?  Yes!  I know now that it is happening in the conversion of an XPS document to GDI, in the print process.  Given this and that it happens only when 2 documents are printed, it is almost certainly a threading issue in the XPS to GDI library.

Armed with that new-found knowledge, I set out to build a minimal test application.  Here’s the very hacky test case I ended up with:

#include "stdafx.h"

int Print(LPWSTR inFile, LPWSTR outFile, LPWSTR printer);

int _tmain(int argc, _TCHAR* argv[]) {
  if(argc < 2) {
    printf("Usage: XpsThread Apartment|Multi [PathToXPSFiles] [Printer]\n");
    return 1;
  }

  if(_wcsicmp(argv[1], L"Apartment") == 0)
    CoInitializeEx(0, COINIT_APARTMENTTHREADED);
  else
    CoInitializeEx(0, COINIT_MULTITHREADED);

  /* This assumes 2 files named 0.xps and 1.xps.  No need to make more complex at this stage. */

  WCHAR inFile[MAX_PATH], outFile[MAX_PATH];

  PWCHAR printer = argc > 3 ? argv[3] : L"PrimoPDF";

  for(int i = 0; i < 2; i++)
  {
    wsprintf(inFile, L"%s%s%d.xps", argc > 2 ? argv[2] : L"", argc > 2 ? (*(wcschr(argv[2],0)-1) == '\\' ? L"" : L"\\") : L"", i);
    wsprintf(outFile, L"%s%s%d.ps", argc > 2 ? argv[2] : L"", argc > 2 ? (*(wcschr(argv[2],0)-1) == '\\' ? L"" : L"\\") : L"", i);

    Print(inFile, outFile, printer);
  }

  getchar();  // Manually wait for print jobs to finish.

  CoUninitialize();
  // Examine the .ps files in GSview or whatever takes your fancy.  The second file generated will often be corrupt if COINIT_MULTITHREADED is used.
  return 0;
}

int Print(LPWSTR inFile, LPWSTR outFile, LPWSTR printer) {
  IXpsPrintJobStream *docStream = NULL;
  HRESULT hr;

  hr = StartXpsPrintJob(printer, NULL, outFile, 0, 0, NULL, 0, NULL, &docStream, NULL);
  if(SUCCEEDED(hr)) {
    FILE *fp = _wfopen(inFile, L"rb");
    BYTE buf[512];
    int n = fread(buf, 1, 512, fp);
    ULONG sz;
    while(n > 0) {
      docStream->Write(buf, n, &sz);
      n = fread(buf, 1, 512, fp);
    }
    fclose(fp);

    hr = docStream->Close();
    if(SUCCEEDED(hr)) {
      wprintf(L"Succeeded: %s\n", inFile);
      return 0;
    }
  }

  wprintf(L"Failed to print %s: %x\n", inFile, hr);
  return 1;
}

It may have been a little hacky, but it proved the bug.  It turns out, printing multiple documents at once with COM initialised with the flag COINIT_MULTITHREADED will consistently fail, even though that is the mode used in the example in MSDN.

Okay, so let’s take that information back to our application in Windbg, and see if we can find anything in MSHTML’s printing.  I popped in a couple of breakpoints:

bp coinitializeex ".echo ---CoInitializeEx---; ~.; dd esp+8 L1; gc"
bp gdi32!startdocw ".echo ---StartDocW---; ~.;"

The first would show me the thread ID and the COINIT flags issued every time CoInitializeEx was called.  The second would break when StartDoc was called to start a print job, from the XPS printing code (XpsGdiConverter.dll).  This would tell me the threading mode for the thread.  And as I was now expecting, MSHTML uses COINIT_MULTITHREADED (= 00000000).

---CoInitializeEx---
.  2  Id: 1464.f80 Suspend: 1 Teb: 7efd7000 Unfrozen
Start: mshtml!ShowModelessHTMLDialog+0x5460 (6dc0d50e)
Priority: 0  Priority class: 32  Affinity: f
189ef0b4  00000000
---StartDocW---
.  2  Id: 1464.f80 Suspend: 1 Teb: 7efd7000 Unfrozen
Start: mshtml!ShowModelessHTMLDialog+0x5460 (6dc0d50e)
Priority: 0  Priority class: 32  Affinity: f

Interestingly, when I was researching XPS printing in MSDN, I came across a page in the .NET Framework documentation which contains the following detail:

The three-parameter AddJob(String, String, Boolean) overload of AddJob must run in a single thread apartment whenever the Boolean parameter is false, which it must be when a non-XPSDrv printer is being used. However, the default apartment state for Microsoft .NET is multiple thread. This default must be reversed since the example assumes a non-XPSDrv printer

There are two ways to change the default. One way is to simply add the STAThreadAttribute (that is, “[System.STAThreadAttribute()]”) just above the first line of the application’s Main method (usually “static void Main(string[] args)”). However, many applications require that the Main method have a multi-threaded apartment state, so there is a second method: put the call to AddJob(String, String, Boolean) in a separate thread whose apartment state is set to STA with SetApartmentState. The example below uses this second technique.

This little detail is not in the Win32 XPS documentation. I can’t prove that this is the cause of the problem but it certainly seems suspicious that this goes wrong with MSHTML.  I tried a very naughty test whereby I overrode the COINIT flags whenever CoInitializeEx was called (WinDbg: bp ole32!coinitializeex “ed esp+8 2; gc”), and the problem “went away” … it’s a nice finger-in-the-wind test but certainly not conclusive!

The problem does not occur on the printer in our office, because it is an XPSDrv printer.  The problem only occurs with drivers that do not support XPS, because XPSDrv does not require apartment threading.

Anyway, I think it’s time to take this case to Microsoft.  In the meantime, the safest workaround is to wait for the first print job to finish before starting a subsequent one.  We’ll probably put that fix in for now, even though it makes the print tediously slow!

Second Rant: Why oh why does Adobe Reader need to restart?

Updated: 3:50pm — It gets worse — see below!
 
Have you ever restarted Windows (e.g. after installing Windows updates) and then been presented with the following dialog:

 And of course, once you click that Install button, you eventually end up with the following:

(For the pedantic reader, yes, I captured different versions of Adobe Reader Updater on different machines.  I’ve had to deal with this on 3 machines so far today. Get over it!)

Here’s why Adobe Reader needs to restart, even if you have nothing apparently running: the installer does not shut down the background processes AdobeARM.exe and reader_sl.exe (and possibly others).

For example on my Windows 7 x64 test machine, in Event Viewer, you can see the following events.

Product: Adobe Reader 9.5.2. The file C:\Program Files (x86)\Adobe\Reader 9.0\Reader\reader_sl.exe is being used by the following process: Name: reader_sl , Id 3084.

Product: Adobe Reader 9.5.2. The file C:\Program Files (x86)\Common Files\Adobe\ARM\1.0\AdobeARM.exe is being used by the following process: Name: AdobeARM , Id 3096.

Both these processes are part of Adobe Reader, and they should be shut down by the installer.  But the updater does not do this.  So, you are forced to restart.  This happens even if you don’t open your web browser or any other applications!

So before clicking that fateful Install button, go and kill those Adobe processes hanging around so you can avoid that restart.  Also, you will need to close your web browser(s), Office applications, and anything else that you can think of that may also be involved.  It’s a pain, but it’s still better than restarting.

Update: Killing AdobeARM.exe at the wrong time is a bit of an issue.  Because AdobeARM.exe presents the front end to the updater!  This means that if you kill AdobeARM.exe, the update dialog also disappears.  The tricky fix then, is to kill AdobeARM.exe just after clicking the Install button.  You get no feedback on the install, but you can see msiexec.exe (up to 3 instances during the install) running in Task Manager or Process Explorer.  When it finishes, two of those will close down (the last instance hangs around for a while).  It will usually leave a happy message in the Event Log.

A little extra rant?  How poor is that design? Pushing updates for the updater in such a way that it forces a restart…  A little bit of forethought could have avoided that one, Adobe.

Detecting the Citadel Trojan with an Application Failure

An application that I do some development on was having trouble starting on a Windows XP machine.  In this application, we hook some of the Windows API functions internally to extend functionality in print preview (long story).  However, on this machine, the hook was failing.

The machine had fully up-to-date antivirus software from a major vendor; they’d run various anti-malware programs, but nothing was coming up.

No weird looking processes, DLLs or drivers were visible using Process Explorer; it looked like a pretty clean machine.  So after doing the usual diagnostics, we decided to take a full dump of the process when it threw up its error message.  A debug message told us that the hook function was failing to hook RegisterClassW, RegisterClassExW, RegisterClassA and RegisterClassExA.

I loaded the dump up in windbg and took a look at the functions.

0:000> u registerclassw
user32!RegisterClassW:
7e41a39a 6849d31300      push    13D349h
7e41a39f c3              ret
7e41a3a0 ec              in      al,dx
7e41a3a1 308b45085657    xor     byte ptr [ebx+57560845h],cl
7e41a3a7 6a09            push    9
7e41a3a9 59              pop     ecx
7e41a3aa 8d7004          lea     esi,[eax+4]
7e41a3ad 8b00            mov     eax,dword ptr [eax]

Whoa.  Push what?  That’s definitely not what we normally see!  Here’s what we’d expect to see:

USER32!RegisterClassW:
7e41a39a 8bff            mov     edi,edi
7e41a39c 55              push    ebp
7e41a39d 8bec            mov     ebp,esp
7e41a39f 83ec30          sub     esp,30h
7e41a3a2 8b4508          mov     eax,dword ptr [ebp+8]
7e41a3a5 56              push    esi
7e41a3a6 57              push    edi
7e41a3a7 6a09            push    9

So that push followed by ret is going to jump to address 13D349.  Let’s look to see which module owns that address space:

0:000> !address 13D349
Usage:
Allocation Base:        00000000
Base Address:           00130000
End Address:            00169000
Region Size:            00039000
Type:                   00000000
State:                  00000000
Protect:                00000000

So, not in the image space of a normally loaded module then.  So who is allocating that memory?  Let’s look for some strings longer than 3 characters in that block:

0:000> s -sa 00130000 00169000

This returned a stack of garbage data , but some strings stood out:

00131fb8  "Coded by BRIAN KREBS for persona"
00131fd8  "l use only. I love my job & wife"
00131ff8  "."

...

00136f08  "http://%02x%02x%02x%02x%02x%02x%"
00136f28  "02x%02x.com/%02x%02x%02x%02x/%02"
00136f48  "x%02x%02x%02x.php"

...

00138eac  "http://www.google.com/webhp"

...

00139424  "facebook.com"
00139438  "%BOTID%"
00139440  "%BOTNET%"

...

That first one was a dead giveaway.  Brian Krebs is a security researcher.  A Google search found that the Citadel Trojan embedded that string.

To help us diagnose similar situations more rapidly in the future, we captured a few other visible details on the file system and in the registry.  Images below for your entertainment!

There was not a lot of point going any further.  We understood why the problem was occurring, and how to resolve it (rebuild!).  Given that no antivirus vendors appear to support removal of this trojan, we advised that the client rebuild the computer as the safest and most cost-effective way forward.