Category Archives: Windows

The case of the stack trace that wouldn’t (trace)

One of my favourite Windows tools is Procmon.  I pull it out regularly, often as a first port of call when diagnosing complicated and opaque problems in the software I develop.  Or in anyone’s software, really.

Procmon captures a trace of key I/O activities on your computer, including file, registry and network activity, and makes it really easy to spot operations that have failed or that may be causing problems.  It’s great for spotting authentication problems, sharing violations, missing files and more (… malware).  Procmon goes as far as recording a stack trace for nearly every operation it captures!

Today, we were trying to diagnose a problem with a process that was taking 15 seconds or longer to start on a Windows XP computer.  The normal start time for that process should have been 1-2 seconds.  None of the usual culprits came forward and admitted fault, so it was time to pull out Procmon again.

We quickly spotted a big fat delay in the trace.  Note the time stamps in the two selected rows  in the screen capture below.

A big time gap between 4:10:04 and 4:10:17

Now it was time to try and find out what was causing this.  So we examined the stack trace for each entry, except … there were no symbols.  Easy enough to fix — copy dbghelp.dll from a version of Microsoft’s Debugging Tools for Windows onto the system temporarily, fixup the symbol path in Procmon’s options, and … nope, still no symbols.  Now this is one area where Procmon falls down a little bit.  If symbol loading fails, it just silently fails.  No warnings, errors or hints as to what might be going on.

This issue was occurring on a client’s computer, so it was time to take the investigation elsewhere for examination.  Before we could really examine the captured trace, we needed to get symbols going.  But how?

Procmon to the rescue!

That’s right, we realised we can use Procmon to diagnose itself!  I booted up a clean new Windows XP virtual machine, loaded Procmon onto it, ran a basic capture of some random events.

A trace on XP
A trace on XP
No symbols showing, only exports
No symbols showing, only exports
Configuring symbols for Procmon
Configuring symbols for Procmon

Even after configuring symbols, they still silently failed to load.  So I stopped the capture, saved it and immediately opened the saved capture, to stop this instance of Procmon from capturing events on the local computer.  I then started a second instance of Procmon, removed the Procmon exclusion from the filtering, and instead, added a filter to include Procmon (I also filtered specifically for the PID of the original Procmon, later):

Configuring Procmon to watch itself
Configuring Procmon to watch itself

Then I started the trace, switched back to the first Procmon, and tried to examine the stack.  Of course, still no symbols, but now it was time to switch back to the second, active Procmon process and see what we found.

And what we found was that dbghelp.dll was looking for symsrv.dll in order to download its symbols.  So we copied that also into the folder with procmon.exe  and suddenly everything worked!

dbghelp wants symsrv to help it as well
dbghelp wants symsrv to help it as well
The undecorated stack for the symsrv load request
The undecorated stack for the symsrv load request

Update 19 Sep 2013: Oops.  Forgot to attach the decorated stack (sorry!):

The call stack with symbols
The call stack with symbols loaded.  Note how the function names differ from the original stack.

So that’s the first takeaway from this story: when you want symbols, copy both dbghelp.dll and symsrv.dll from your copy of Debugging Tools for Windows.  We found no other dependencies, even with the latest version of these files.

A diversion

One curious anomaly we spotted: Procmon (or possibly Dbghelp) is looking in some strange places for debug symbols, including appending a SRV*path*url style symbol path to procmon’s parent path, and looking there, without much success:

Some weirdness in symbol loading
Some weirdness in symbol loading

I leave that one for you to solve.

Backtrack to the stack (trace)

Back to the original trace.  We loaded up the saved trace, and found that we now got kernel mode symbols just fine, but no user mode symbols would load.  In fact, Procmon doesn’t even appear to be looking for symbols for these user mode frames — either on the local drive or on the network.  And this time Procmon isn’t able to give us any more detail.  However, when we debugged the call that Procmon made to SymFindFileInPath when viewing a call stack in this log vs another new log, we found that Procmon wasn’t even providing the necessary identifying information.

What information is this?  The identifying information that the symbol servers use is the TimeDateStamp and the SizeOfImage fields from the PE header of the executable file (slightly different for .pdb files).

I surmise that this identifying information is missing from our original trace because this trace was captured before we copied version 6.0 or later of dbghelp.dll onto the client’s computer — meaning that the version that Procmon used when capturing the trace did not record this identifying information.

Therefore, the second takeaway of the story is: always copy a recent version of dbghelp.dll and symsrv.dll into the folder with procmon.exe, before starting a trace.  Even if you intend to analyze the trace later, you’ll find that without these, you won’t get full stack traces.

(Dear Microsoft, please can you consider including these in the Procmon and Procexp downloads, given that you now own Sysinternals?  Saves a lot of hassle!)

I am happy to announce that the missing characters have been found!

Last year I published a blog post about characters missing from print jobs with Internet Explorer 9 and my adventures in tracing and reporting the issue to Microsoft.

We saw situations where a letter would be printed with random letters missing, as per the following example:

Here’s how it should have been printed (oh yes, that’s a completely fictional name we used for testing non-English Latin script letters in printing):

Today, I was notified that Microsoft have finally publicly published a hotfix!  We received the hotfix from them a few weeks ago, before it was publicly available, and it certainly solved the problem on our test machines and end user computers.

Download the hotfix from:

http://support.microsoft.com/kb/2853777 (Windows 7 SP1, Windows Server 2008 SP1) 

http://support.microsoft.com/kb/2855336 (Windows 8, Windows Server 2012)

(The first link notes that the hotfix is included in the rollup in the second link, though the second link doesn’t mention the hotfix!)

Sadly, they’ve only published a solution for Windows 7 Service Pack 1 onwards, and not for Vista, which could be an issue for some users. 

It was interesting to see that the issue was indeed a race condition (two threads ending up with the same random seed, because, and I quote:

This issue occurs because a conflict causes the text that uses the font to become corrupted when the two threads try to install the font at the same time. The name of the font is generated by the RAND function together with a SEEDvalue whose time value is set to srand(time(NULL)). When the two documents are printed at the same time, the SEEDvalue for the font is the same in both documents. Therefore, the conflict occurs.

So not related to the threading model, which was a little bit of misdirection on my part 😉  That’s par for the course though when debugging complex issues without source…

Fantastic to finally have the fix 🙂

RSA Key Exchange between CryptoAPI and CNG (BCrypt)

Microsoft, a few years ago, wrote a new cryptography API for Windows Vista called Cryptography Next Generation (CNG).  This replaces the existing cryptography API in Windows XP and earlier versions which was simply called CryptoAPI.

Now, compatibility between the two APIs is touted as a feature of the new CNG API, but a somewhat more sombre reality tends to set in pretty quickly once you start working with them.  A motley collection of loosely documented data structures, endianness differences, and “Not Supported” footnotes pile up as quickly as you can open MSDN documentation tabs in your browser.  In my research, I also found a dearth of concrete examples of how to interoperate between the two APIs online.  In fact, there was only one page which really helped, buried in Microsoft’s Output Protection Manager documentation.

There is something magical about all your bits lining up in a row when mixing crypto libraries of any ilk, and I wanted to share!  Hence this story.

My requirement was pretty simple: I wanted to generate an RSA key pair using CryptoAPI in my client application, hand off the public key to another application, which happens to use CNG, wave my hands a little, and magically transact a key exchange.  Here’s how to do it (with some error handling elided for your own reading sanity).  You should be able to load this into any recent version of Visual Studio, add bcrypt.lib to your library includes and build.  The code should be reasonably self-explanatory, so I’ll leave it there.

The usual disclaimers apply. Source on GitHub.

Delphi, Win32 and leaky exceptions

Have you ever written any Delphi code like the following snippet?  That is, have you ever raised a Delphi exception in a Win32 callback?  Or even just failed to handle potential exceptions in a callback?

    function MyWndEnumFunc(hwnd: HWND; lParam: LPARAM): BOOL; stdcall;
    begin
      if hwnd = TForm4(lParam).Handle then
        raise Exception.Create('Raise an exception just to demonstrate the issue');
      Result := True;
    end;

    procedure TForm4.FormCreate(Sender: TObject);
    begin
      try
        EnumWindows(@MyWndEnumFunc, NativeInt(Self));
      except
        on E:Exception do
          ShowMessage('Well, we unwound that exception.  But does Win32 agree?');
      end;
    end;

Raymond Chen has posted a great blog today about why that approach will eventually end in tears.  You may get away with it for a while, or you may end up with horrific stack or heap corruption. The moral of the story is, any time you have a Win32 callback function, you need to make sure no exceptions leak.  Like this:

function MyWndEnumFunc(hwnd: HWND; lParam: LPARAM): BOOL; stdcall;
begin
  try
    if hwnd = TForm4(lParam).Handle then
      raise Exception.Create('Raise an exception just to demonstrate the issue');
    Result := True;
  except
    // Handle the exception, perhaps pass it to Application.HandleException,
    // or log it, or abort your app.  Just don’t let it bubble through any
    // Win32 code.  You need to write the HandleAllExceptions function!
    HandleAllExceptions;
    Result := False;
  end;
end;

If you use Delphi’s AllocateHwnd procedure, remember that it also does not handle exceptions for you (I’ve just reported this in QualityCentral as QC108653 as this caveat should at least be documented).  So you need to do it:

procedure TForm4.MyAllocatedWindowProc(var Message: TMessage);
begin
  try
    if Message.Msg = WM_USER then
      raise Exception.Create('Go wild!');
  except
    Application.HandleException(Self);  // Or whatever, just don’t let it leak
  end;
  with Message do Result := DefWindowProc(FMyHandle, Msg, wParam, lParam);
end;

procedure TForm4.FormCreate(Sender: TObject);
begin
  FMyHandle := AllocateHwnd(MyAllocatedWindowProc);
  SendMessage(FMyHandle, WM_USER, 0, 0);
end;

Why does switching languages hang my app?

This is a very short follow-up to yesterday’s blog about WM_INPUTLANGCHANGEREQUEST, WaitForSingleObject and applications hanging.

While I thought, yesterday, that this bug was no longer such a problem on Windows Vista and Windows 7, it turns out that if you assign a hotkey to a language (e.g. Alt+Shift+2), the KLF_SETFORPROCESS flag will be set when that hotkey is pressed.  This means that this issue is still a problem today.

This is not just a problem for Delphi.  I’ve found references to Outlook hanging, .NET 4.5 WPF applications hanging, and VC++ applications hanging.

Moral of the story (again): thread-safe code is hard.  Don’t believe that your code suddenly becomes thread-safe because you synchronize it to the main thread (no matter what the Delphi documentation tells you!)

WaitForSingleObject. Why you should never use it.

Updated 30/1/2013: Fixed a typo and formatting issue or two.  Added a sentence for clarity.  See also the short follow-up post.

I’ve really struggled with how to frame this post.  It could be about the dangers of WaitForSingleObject and WaitForMultipleObjects.  Or about how Delphi’s TThread.Synchronize seems so handy, and yet because it must use WaitForSingleObject, is so fraught with complications.  Or yet, about how pressing Alt+Left Shift to switch languages could hang an application.  In the end, it’s about all three of these things.

Some people, when confronted with a problem, think, “I know, I’ll use threads,” and then two they hav erpoblesms.

Let’s start with TThread.Synchronize.  It seems a handy little function: it will, in a thread-safe manner, call a procedure in the context of the main VCL thread (the thread that owns the Delphi VCL GUI windows), and wait, using the Windows API WaitForSingleObject(INFINITE), for the procedure to return.  Simpler than fiddling with synchronisation primitives, right?  Everyone knows threading is hard, so use the well-tested thread utilities where you can?

Except that WaitForSingleObject and its big brother WaitForMultipleObjects are dangerous.  The basic problem is that these calls can cause deadlocks, if you ever call them from a thread that has its own message loop and windows.  That’s okay, you say, I don’t have any UI except in my main thread.  But any thread that uses COM can have hidden COM helper windows when doing RPC (and see below for more on this).  And other libraries can create their own windows as well, such as the ADO libraries.

So what causes the deadlock? Well, I’ll illustrate with the scenario we ran into.  Some old code (that I wrote, okay, okay) created a thread (we’ll call it BackgroundThread).  BackgroundThread used TThread.Synchronize to periodically update the UI status about a background database process it was running.  It doesn’t really matter what it was doing, but the use of Microsoft’s ADO database library meant that this thread was creating a hidden window, with the class ADODB.AsyncEventMessenger.  Behind the scenes, a second window was automatically created by Windows once we had the first window in the thread, and this one had the class name IME.

Every now and then, our BackgroundThread would call Synchronize(RefreshStatus).  This would signal an event which the main thread would check periodically from its message loop.  Eventually it would call the RefreshStatus procedure.  BackgroundThread would in the meantime have called WaitForSingleObject(INFINITE) to wait for an event to be signaled by the main thread indicating that the RefreshStatus procedure had finished.

Where are we going?  Well, if the main thread receives a message that it then decides to send on to other windows in the process, while BackgroundThread is getting ready to synchronize, we can end up in a deadlock.  And, it turns out that in Windows XP, this can happen when WM_INPUTLANGCHANGEREQUEST (0x50) is received, e.g. when the user presses Alt+Left Shift.  (Note, for this scenario to play, the Input Method Manager must be enabled — install Far Eastern language support in Windows XP).  Remember, this is just one possible scenario which can trigger a deadlock.

Let’s pull this apart.  I’ve loaded the stalled process into WinDbg, and am now looking at two call stacks which have deadlocks, in user mode.  First the main VCL GUI thread:

   0  Id: bf8.cb8 Suspend: 1 Teb: 7ffde000 Unfrozen
ChildEBP RetAddr
0012fadc 7e4194be ntdll!KiFastSystemCallRet
0012fb30 7e43652f USER32!NtUserMessageCall+0xc
0012fb50 7e418734 USER32!EditWndProcW+0x5d
0012fb7c 7e418816 USER32!InternalCallWinProc+0x28
0012fbe4 7e42a013 USER32!UserCallWinProcCheckWow+0x150
0012fc14 7e42a039 USER32!CallWindowProcAorW+0x98
0012fc34 004c0e7d USER32!CallWindowProcW+0x1b
0012fda4 004c0d80 audit4_home!Vcl.Controls.TWinControl.DefaultHandler+0xdd
0012fdf0 004c03d3 audit4_home!Vcl.Controls.TWinControl.WndProc+0x5b8
0012fe20 00467b3e audit4_home!Vcl.Controls.TWinControl.MainWndProc+0x2f
0012fe38 7e418734 audit4_home!System.ClassesStdWndProc+0x16
0012fe64 7e418816 USER32!InternalCallWinProc+0x28
0012fecc 7e4189cd USER32!UserCallWinProcCheckWow+0x150
0012ff2c 7e418a10 USER32!DispatchMessageWorker+0x306
0012ff3c 005a6980 USER32!DispatchMessageW+0xf
0012ff58 005a69c3 audit4_home!Vcl.Forms.TApplication.ProcessMessage+0xf8
0012ff7c 00d26d60 audit4_home!Vcl.Forms.TApplication.HandleMessage+0xf
0012ff9c 016e6903 audit4_home!S4s.Ui.Session.Appsession_main.TAppSession_Main.Run+0xcc
0012ffc0 7c817077 audit4_home!Audit4_home.initialization+0xc3
0012fff0 00000000 kernel32!BaseProcessStart+0x23

And then our BackgroundThread:

   9  Id: bf8.f30 Suspend: 1 Teb: 7ffd5000 Unfrozen
ChildEBP RetAddr
0494fdb8 7c90df5a ntdll!KiFastSystemCallRet
0494fdbc 7c8025db ntdll!NtWaitForSingleObject+0xc
0494fe20 7c802542 kernel32!WaitForSingleObjectEx+0xa8
0494fe34 0042d4b3 kernel32!WaitForSingleObject+0x12
0494fe80 00408a19 audit4_home!System.Sysutils.WaitForSyncWaitObj+0x7
0494fed0 004654f2 audit4_home!System.TMonitor.Wait+0x25
0494fedc 00c9f3a2 audit4_home!System.Classes.TThread.Synchronize+0x2e
0494fef8 00c9ef29 audit4_home!S4s.Br.Backgroundclasses.Backgroundclass.TBackgroundClass.SetStatus+0x8a
0494ff70 00464b11 audit4_home!S4s.Br.Backgroundclasses.Backgroundclass.TBackgroundClass.Execute+0x18d
0494ffa0 00409752 audit4_home!System.Classes.ThreadProc+0x45
0494ffb4 7c80b729 audit4_home!SystemThreadWrapper+0x2a
0494ffec 00000000 kernel32!BaseThreadStart+0x37

We can see that the main thread has sent a message somewhere.  It turns out it has sent a message to a window in the same thread (The window handle 0006040c is just the Edit window):

0012fb30 7e43652f 0006040c 00000050 00000001 USER32!NtUserMessageCall+0xc
0012fb50 7e418734 0006040c 00000050 00000001 USER32!EditWndProcW+0x5d

So why is it stalling?  It’s hard to determine here, because everything bad is happening in kernel mode, behind that KiFastSystemCallRet call. That must mean it’s time to step into kernel mode!  In WinDbg, press Ctrl+K, select Local.  I’m learning here, so this is as much for my own documentation as to explain to anyone else (in other words, if it doesn’t make sense to you, it won’t make sense to me, either, in 6 weeks time).  First, we find the process details:

lkd> !process 0 0 audit4_home.exe
PROCESS 894c3710  SessionId: 0  Cid: 0bf8    Peb: 7ffdf000  ParentCid: 0e0c
DirBase: 0a5c0700  ObjectTable: 00000000  HandleCount:   0.
Image: audit4_home.exe

PROCESS 893277e0  SessionId: 0  Cid: 0d94    Peb: 7ffd8000  ParentCid: 0e0c
DirBase: 0a5c0720  ObjectTable: 00000000  HandleCount:   0.
Image: audit4_home.exe

PROCESS 8950d9a8  SessionId: 0  Cid: 0f10    Peb: 7ffdb000  ParentCid: 0e0c
DirBase: 0a5c0740  ObjectTable: 00000000  HandleCount:   0.
Image: audit4_home.exe

PROCESS 89566758  SessionId: 0  Cid: 0520    Peb: 7ffde000  ParentCid: 0e0c
DirBase: 0a5c0840  ObjectTable: e127a330  HandleCount: 467.
Image: audit4_home.exe

We’ll look at the last process listed, being the one with the problem (the other three are defunct):

lkd> !process 89566758 2
PROCESS 89566758  SessionId: 0  Cid: 0520    Peb: 7ffde000  ParentCid: 0e0c
DirBase: 0a5c0840  ObjectTable: e127a330  HandleCount: 467.
Image: audit4_home.exe

THREAD 8953a920  Cid 0520.009c  Teb: 7ffdd000 Win32Thread: e1210eb0 WAIT: (Suspended) KernelMode Non-Alertable
SuspendCount 1
FreezeCount 1
8953aabc  Semaphore Limit 0x2

THREAD 8963bda0  Cid 0520.0a98  Teb: 7ffd9000 Win32Thread: 00000000 WAIT: (Suspended) KernelMode Non-Alertable
SuspendCount 1
FreezeCount 1
8963bf3c  Semaphore Limit 0x2

THREAD 8953b9a0  Cid 0520.03c8  Teb: 7ffd8000 Win32Thread: 00000000 WAIT: (Suspended) KernelMode Non-Alertable
SuspendCount 1
FreezeCount 1
8953bb3c  Semaphore Limit 0x2

THREAD 895a2c08  Cid 0520.021c  Teb: 7ffd7000 Win32Thread: 00000000 WAIT: (Suspended) KernelMode Non-Alertable
SuspendCount 1
FreezeCount 1
895a2da4  Semaphore Limit 0x2

THREAD 898f8cc8  Cid 0520.0bd8  Teb: 7ffd4000 Win32Thread: e2d75870 WAIT: (Suspended) KernelMode Non-Alertable
SuspendCount 1
FreezeCount 1
898f8e64  Semaphore Limit 0x2

THREAD 893287f8  Cid 0520.0eec  Teb: 7ff4f000 Win32Thread: 00000000 WAIT: (Suspended) KernelMode Non-Alertable
SuspendCount 1
FreezeCount 1
89328994  Semaphore Limit 0x2

THREAD 89308c30  Cid 0520.00f0  Teb: 7ff4e000 Win32Thread: 00000000 WAIT: (Suspended) KernelMode Non-Alertable
SuspendCount 1
FreezeCount 1
89308dcc  Semaphore Limit 0x2

THREAD 89ab4020  Cid 0520.01b0  Teb: 7ff4d000 Win32Thread: e2ed4260 WAIT: (Suspended) KernelMode Non-Alertable
SuspendCount 1
FreezeCount 1
89ab41bc  Semaphore Limit 0x2

THREAD 893269b0  Cid 0520.0830  Teb: 7ff4c000 Win32Thread: e23bf2c0 WAIT: (Suspended) KernelMode Non-Alertable
SuspendCount 1
FreezeCount 1
89326b4c  Semaphore Limit 0x2

THREAD 89895778  Cid 0520.0ce0  Teb: 7ff4b000 Win32Thread: e2fa06e8 WAIT: (Suspended) KernelMode Non-Alertable
SuspendCount 1
FreezeCount 1
89895914  Semaphore Limit 0x2

THREAD 8933ba80  Cid 0520.0e08  Teb: 7ff4a000 Win32Thread: 00000000 WAIT: (Suspended) KernelMode Non-Alertable
SuspendCount 1
FreezeCount 1
8933bc1c  Semaphore Limit 0x2

THREAD 89322728  Cid 0520.0f28  Teb: 7ff49000 Win32Thread: 00000000 WAIT: (Suspended) KernelMode Non-Alertable
SuspendCount 1
FreezeCount 1
893228c4  Semaphore Limit 0x2

THREAD 89449960  Cid 0520.0ab0  Teb: 7ff48000 Win32Thread: e12d5590 WAIT: (Suspended) KernelMode Non-Alertable
SuspendCount 1
FreezeCount 1
89449afc  Semaphore Limit 0x2

THREAD 8952d020  Cid 0520.09dc  Teb: 7ffdc000 Win32Thread: 00000000 WAIT: (Executive) KernelMode Non-Alertable
SuspendCount 1
b8c197d4  SynchronizationEvent

And we can see the first thread has the address 8953a920.  So let’s look at that (flag 16 means show the full stack in the process context, with parameters).

lkd> !thread 8953a920 16
THREAD 8953a920  Cid 0520.009c  Teb: 7ffdd000 Win32Thread: e1210eb0 WAIT: (Suspended) KernelMode Non-Alertable
SuspendCount 1
FreezeCount 1
8953aabc  Semaphore Limit 0x2
Not impersonating
DeviceMap                 e1050c08
Owning Process            0       Image:
Attached Process          89566758       Image:         audit4_home.exe
Wait Start TickCount      213083         Ticks: 67800 (0:00:17:39.375)
Context Switch Count      20200  NoStackSwap    LargeStack
UserTime                  00:00:00.781
KernelTime                00:00:01.484
Win32 Start Address audit4_home!Audit4_home.initialization (0x016e6840)
Start Address kernel32!BaseProcessStartThunk (0x7c810705)
Stack Init b7c1f000 Current b7c1e840 Base b7c1f000 Limit b7c19000 Call 0
Priority 10 BasePriority 8 PriorityDecrement 0 DecrementCount 16
ChildEBP RetAddr  Args to Child
b7c1e858 80503864 8953a990 8953a920 804fb094 nt!KiSwapContext+0x2f (FPO: [Uses EBP] [0,0,4])
b7c1e864 804fb094 8953aa8c 8953a920 8953a954 nt!KiSwapThread+0x8a (FPO: [0,0,0])
b7c1e88c 80502fa0 00000000 00000005 00000000 nt!KeWaitForSingleObject+0x1c2 (FPO: [5,5,4])
b7c1e8a4 804ff8e0 00000000 00000000 00000000 nt!KiSuspendThread+0x18 (FPO: [3,0,0])
b7c1e8ec 80503882 00000000 00000000 00000000 nt!KiDeliverApc+0x124 (FPO: [3,10,0])
b7c1e904 804fb094 00000240 e1210eb0 00000000 nt!KiSwapThread+0xa8 (FPO: [0,0,0])
b7c1e92c bf802f15 00000000 0000000d 00000001 nt!KeWaitForSingleObject+0x1c2 (FPO: [5,5,4])
b7c1e968 bf835eb7 00000200 00000000 00000000 win32k!xxxSleepThread+0x192 (FPO: [3,5,4])
b7c1ea04 bf8141d2 bbe83720 00000287 00000019 win32k!xxxInterSendMsgEx+0x7f6 (FPO: [Non-Fpo])
b7c1ea50 bf80ecd9 bbe83720 00000287 00000019 win32k!xxxSendMessageTimeout+0x11f (FPO: [7,7,0])
b7c1ea74 bf92b42e bbe83720 00000287 00000019 win32k!xxxSendMessage+0x1b (FPO: [4,0,0])
b7c1eaa4 bf92c675 e2d75870 e13f10e0 e13f10e0 win32k!xxxImmActivateLayout+0x5b (FPO: [2,3,4])
b7c1ec08 bf8696d9 00000004 00000000 e13f10e0 win32k!xxxImmActivateThreadsLayout+0x10c (FPO: [3,82,4])
b7c1ec48 bf86862f e13f10e0 00000100 bbe88228 win32k!xxxInternalActivateKeyboardLayout+0xb7 (FPO: [3,8,4])
b7c1ec70 bf80b5a7 8989a2d0 04090c09 00000100 win32k!xxxActivateKeyboardLayout+0x4c (FPO: [4,3,0])
b7c1ecd4 bf80ec9f bbe88228 00000050 00000001 win32k!xxxRealDefWindowProc+0x56d (FPO: [4,16,0])
b7c1ecec bf81c176 bbe88228 00000050 00000001 win32k!xxxWrapRealDefWindowProc+0x16 (FPO: [5,0,0])
b7c1ed08 bf80eee6 bbe88228 00000050 00000001 win32k!NtUserfnNCDESTROY+0x27 (FPO: [7,0,0])
b7c1ed40 8054168c 0006040c 00000050 00000001 win32k!NtUserMessageCall+0xae (FPO: [7,3,0])
b7c1ed40 7c90e514 0006040c 00000050 00000001 nt!KiFastCallEntry+0xfc (FPO: [0,0] TrapFrame @ b7c1ed64)
0012fadc 7e4194be 7e428e0d 0006040c 00000050 ntdll!KiFastSystemCallRet (FPO: [0,0,0])
0012fb30 7e43652f 0006040c 00000050 00000001 USER32!NtUserMessageCall+0xc
0012fb50 7e418734 0006040c 00000050 00000001 USER32!EditWndProcW+0x5d (FPO: [4,0,4])
0012fb7c 7e418816 7e4364cf 0006040c 00000050 USER32!InternalCallWinProc+0x28
0012fbe4 7e42a013 00000000 7e4364cf 0006040c USER32!UserCallWinProcCheckWow+0x150 (FPO: [Non-Fpo])
0012fc14 7e42a039 7e4364cf 0006040c 00000050 USER32!CallWindowProcAorW+0x98 (FPO: [6,0,0])
0012fc34 004c0e7d 7e4364cf 0006040c 00000050 USER32!CallWindowProcW+0x1b (FPO: [5,0,0])
0012fda4 004c0d80 02940d99 02bb9430 fffffffe audit4_home!Vcl.Controls.TWinControl.DefaultHandler+0xdd
0012fdf0 004c03d3 0012fe04 004c03eb 0012fe20 audit4_home!Vcl.Controls.TWinControl.WndProc+0x5b8
0012fe20 00467b3e 00000050 00000001 04090c09 audit4_home!Vcl.Controls.TWinControl.MainWndProc+0x2f
0012fe38 7e418734 0006040c 00000050 00000001 audit4_home!System.ClassesStdWndProc+0x16
0012fe64 7e418816 02940d99 0006040c 00000050 USER32!InternalCallWinProc+0x28
0012fecc 7e4189cd 00000000 02940d99 0006040c USER32!UserCallWinProcCheckWow+0x150 (FPO: [Non-Fpo])
0012ff2c 7e418a10 0012ff60 00000000 0006040c USER32!DispatchMessageWorker+0x306 (FPO: [Non-Fpo])
0012ff3c 005a6980 0012ff60 00120100 0012ff9c USER32!DispatchMessageW+0xf (FPO: [1,0,0])
0012ff4c 7c910222 0000000f 028458e0 005a69c3 audit4_home!Vcl.Forms.TApplication.ProcessMessage+0xf8
0012ff9c 016e6903 0012ffb0 016e6916 0012ffc0 ntdll!RtlpAllocateFromHeapLookaside+0x42 (FPO: [Non-Fpo])
0012ffc0 7c817077 7c910222 0000000f 7ffde000 audit4_home!Audit4_home.initialization+0xc3
0012fff0 00000000 016e6840 00000000 00000018 kernel32!BaseProcessStart+0x23 (FPO: [Non-Fpo])

Highlighted above there is a call to win32k!xxxSendMessage.  This is passed the address of a window at bbe83720.  And when we look at that, we see the following:

lkd> dd bbe83720
bbe83720  000d01fe 00000006 e2d75870 8998ae50

And window 000d01fe is the IME window for the other, deadlocked thread!  Why is the kernel sending a message to another thread here and now?  And without a timeout?  That’s our deadlock.  We know how it happens, but I at least still don’t know why!

Still, that’s a step forward.  Next was to figure out why we were getting that.  After a fair bit of exploration, I looked at little harder at the call stack, and decided to investigate the parameters for xxxActivateKeyboardLayout:

8989a2d0 04090c09 00000100 win32k!xxxActivateKeyboardLayout+0x4c 

That third parameter in xxxActivateKeyboardLayout corresponds to the Flags parameter for ActivateKeyboardLayout.  Perusing the documentation for the Windows ActivateKeyboardLayout function, we can see that 0x100 is KLF_SETFORPROCESS.  Bingo!  That sounds pretty suspicious!  I worked through the assembly code for xxxActivateKeyboardLayout and xxxRealDefWindowProc, and sure enough, that was it: when xxxRealDefWindowProc processes WM_INPUTLANGCHANGEREQUEST, it sets the Flags parameter to 0x0100:

win32k!xxxRealDefWindowProc+0x44d:
8211f1dd 56              push    esi
8211f1de 6800010000      push    100h
8211f1e3 ff7514          push    dword ptr [ebp+14h]
8211f1e6 6a00            push    0
8211f1e8 e8dfe40100      call    win32k!_GetProcessWindowStation (8213d6cc)
8211f1ed 50              push    eax
8211f1ee e838bdf9ff      call    win32k!xxxActivateKeyboardLayout (820baf2b)

In Windows Vista and later versions, this problem does not arise as frequently, because the Text Services Framework takes over the Alt+Left Shift command (before DefWindowProc gets a look-in) and calls ActivateKeyboardLayout without the KLF_SETFORPROCESS flag set.  However, the issue can still arise if you ever use that flag in your own code when calling ActivateKeyboardLayout.

To summarize, it’s helpful to remember that technically, we were using the functions as designed.  The primary issue is combination of design flaws, first, with how that KLF_SETFORPROCESS flag is handled: either the kernel code should be using a timeout when it sends the message to the other thread, or it should be queuing an event for the other thread to handle when it gets around to it; and second, with TThread.Synchronize, which unfortunately by design cannot be robust against deadlocks.

In any case, to make your own code more robust:

  1. Determine, when using synchronisation calls, if there is any chance that windows could be created by your thread, and if so, use MsgWaitForMultipleObjects and a message loop instead of WaitForSingleObject or WaitForMultipleObjects.  Check the libraries you are using, and if you are using COM, it’s safest to assume that windows will be created.
  2. Don’t use the TThread.Synchronize procedure.  Recent versions of Delphi include TThread.Queue, which is asynchronous, and so avoids this deadlock.
  3. Think carefully about whether a thread is the right solution to the problem.
  4. Don’t use the KLF_SETFORPROCESS flag!

Mixing RTL and LTR: Plaintext vs HTML

Just a very short post today.  We still have some way to go in mixing RTL and LTR text.  For example, the following image, snipped from Outlook 2010, shows the issue:

The subject and body both say the same thing, but the display order is different.  Do you know why?  It’s because the subject is plain text and is assuming that the text is primarily right-to-left, whereas the body is HTML, and is assuming that the text is primarily left-to-right.

Note how the full stop in the subject appears to the left of the English text.  This is because the display renderer has assumed that the whole run of text is right-to-left, so punctuation is treated as right-to-left, and so displays after (in a right-to-left sense) the text.

The question is, of course, how do you determine directionality given an arbitrary plain text string?  It’s not really possible to do so reliably in the absence of other metadata.  The W3C article on directionality is helpful here: http://www.w3.org/TR/html4/struct/dirlang.html

Another view of the message:

Interestingly, Outlook Web Access does not do this, because its UI takes its directionality from the base HTML document:

When characters go astray: diagnosing missing characters when printing with IE9

I was recently landed with a real doozy of a support case: on some systems, under some circumstances, print jobs would be missing random characters.  Like the following image:

print-1

As opposed to how it should appear:

print-2

This example is missing the following letters:  Å ê Ö q Q

In this particular example, you may be misled into thinking the problem is just with non-ASCII characters.  But note that both upper case Q and lower case q are missing as well.  Other example documents that we collected had no non-ASCII characters in them at all.  I just used this example because it was one of the few we captured with fake data as opposed to real customer data.

This problem was not particularly printer dependent, and could be replicated (thank goodness!) with a PDF print driver.  But the circumstances behind replicating the problem were a little strange: it only occurred, in the second document printed, when more than one print job was sent to the printer in quick succession.  Some combinations of documents did not reflect the issue.  If we sent the documents in the opposite order, they would often succeed.  We were unable to replicate the problem on the printer in our office.

After many tries, I finally managed to replicate the issue on my development machine.  Now I could break out the debugging tools and start some tracing.

I first used Process Monitor to capture details about temporary files in the print process.  A lot of temporary files were generated.  I examined many of them, but the files appeared intact: the problem was not reflected at this point in the print job.

print-3

Next I attached a debugger to the process to try and figure out if I could spot any issues during the print itself.  I put a breakpoint on GDI32!StartDocW and when that fired, enabled a breakpoint on GDI32!ExtTextOutW.  This showed me all the strings that were being sent to the page during the print job.

Unfortunately, Internet Explorer 9 uses a process of printing to XPS and then converting that output to GDI commands for printing.  The library which does the conversion from XPS to GDI used the flag ETO_GLYPH_INDEX when calling ExtTextOutW.  This meant that examining the data in the debugger was pretty much a non-starter, without any easy to spot strings.  So eventually I decided to log the whole lot to a text file, and pull the TrueType font glyph indices in the hope that the information in the TTF file would be sufficient to reconstruct the original text strings.

As it turned out, I was trivially able to match that data up to the XPS data by generating the report as an XPS.  These are just a ZIP file with fonts, images and XML rendering data embedded.  Easy to spelunk.  It certainly seems that the correct glyph indices were being sent to the print engine.  I matched these up between the data captured in the debugger as follows.

The text string:

Dr Åbrique T Ömbê

The XPS element:

Pulling out just the Glyph indices from the XPS element:

39;85;3;99;69;85;76;84;88;72;3;55;3;103;80;69;114

Converting those glyph indices to hexadecimal:

0027 0055 0003 0063 0045 0055 004c 0054
0058 0048 0003 0037 0003 0067 0050 0045
0072

Matching the data captured from ExtTextOutW:

233a4fe0  00550027 00630003 00550045 0054004c
233a4ff0  00480058 00370003 00670003 00450050
233a5000  abab0072

Note that the DWORDs are reversed because Intel uses little endian byte order (and abab is just garbage past the end of the array).

This suggested that the font data itself is being corrupted.  Next, to check the font data in the .ps file as it is generated!  Helpfully, PrimoPDF stores a .ps file in a temporary folder during the conversion to PDF.  So it was just a matter of grabbing that file and looking at that.  Well, of course PostScript and RTF are near the top of the list of painfully unreadable file formats, but a little trial and error (and GhostScript) found me the following:

779 1636 M [66 31 23 66 46 30 26 46 45 41 23 56 23 66 71 46  0]xS

Of which, 080B03380E0B1B3905090322033A0D0E3B should translate to Dr Åbrique T Ömbê:

08 0B 03 38 0E 0B 1B 39 05 09 03 22 03 3A 0D 0E 3B
D  r     Å  b  r  i  q  u  e     T     Ö  m  b  ê

Interestingly, the missing glyphs are all in the range 38-3B in this example.  This line is identical in a version of the file that printed without errors, but the font data is quite different.

Right, so we know data corruption is occurring in the font data.  But we knew that data corruption was the issue already.  So where are we now?  Have I accomplished anything by going down this track?  Yes!  I know now that it is happening in the conversion of an XPS document to GDI, in the print process.  Given this and that it happens only when 2 documents are printed, it is almost certainly a threading issue in the XPS to GDI library.

Armed with that new-found knowledge, I set out to build a minimal test application.  Here’s the very hacky test case I ended up with:

#include "stdafx.h"

int Print(LPWSTR inFile, LPWSTR outFile, LPWSTR printer);

int _tmain(int argc, _TCHAR* argv[]) {
  if(argc < 2) {
    printf("Usage: XpsThread Apartment|Multi [PathToXPSFiles] [Printer]\n");
    return 1;
  }

  if(_wcsicmp(argv[1], L"Apartment") == 0)
    CoInitializeEx(0, COINIT_APARTMENTTHREADED);
  else
    CoInitializeEx(0, COINIT_MULTITHREADED);

  /* This assumes 2 files named 0.xps and 1.xps.  No need to make more complex at this stage. */

  WCHAR inFile[MAX_PATH], outFile[MAX_PATH];

  PWCHAR printer = argc > 3 ? argv[3] : L"PrimoPDF";

  for(int i = 0; i < 2; i++)
  {
    wsprintf(inFile, L"%s%s%d.xps", argc > 2 ? argv[2] : L"", argc > 2 ? (*(wcschr(argv[2],0)-1) == '\\' ? L"" : L"\\") : L"", i);
    wsprintf(outFile, L"%s%s%d.ps", argc > 2 ? argv[2] : L"", argc > 2 ? (*(wcschr(argv[2],0)-1) == '\\' ? L"" : L"\\") : L"", i);

    Print(inFile, outFile, printer);
  }

  getchar();  // Manually wait for print jobs to finish.

  CoUninitialize();
  // Examine the .ps files in GSview or whatever takes your fancy.  The second file generated will often be corrupt if COINIT_MULTITHREADED is used.
  return 0;
}

int Print(LPWSTR inFile, LPWSTR outFile, LPWSTR printer) {
  IXpsPrintJobStream *docStream = NULL;
  HRESULT hr;

  hr = StartXpsPrintJob(printer, NULL, outFile, 0, 0, NULL, 0, NULL, &docStream, NULL);
  if(SUCCEEDED(hr)) {
    FILE *fp = _wfopen(inFile, L"rb");
    BYTE buf[512];
    int n = fread(buf, 1, 512, fp);
    ULONG sz;
    while(n > 0) {
      docStream->Write(buf, n, &sz);
      n = fread(buf, 1, 512, fp);
    }
    fclose(fp);

    hr = docStream->Close();
    if(SUCCEEDED(hr)) {
      wprintf(L"Succeeded: %s\n", inFile);
      return 0;
    }
  }

  wprintf(L"Failed to print %s: %x\n", inFile, hr);
  return 1;
}

It may have been a little hacky, but it proved the bug.  It turns out, printing multiple documents at once with COM initialised with the flag COINIT_MULTITHREADED will consistently fail, even though that is the mode used in the example in MSDN.

Okay, so let’s take that information back to our application in Windbg, and see if we can find anything in MSHTML’s printing.  I popped in a couple of breakpoints:

bp coinitializeex ".echo ---CoInitializeEx---; ~.; dd esp+8 L1; gc"
bp gdi32!startdocw ".echo ---StartDocW---; ~.;"

The first would show me the thread ID and the COINIT flags issued every time CoInitializeEx was called.  The second would break when StartDoc was called to start a print job, from the XPS printing code (XpsGdiConverter.dll).  This would tell me the threading mode for the thread.  And as I was now expecting, MSHTML uses COINIT_MULTITHREADED (= 00000000).

---CoInitializeEx---
.  2  Id: 1464.f80 Suspend: 1 Teb: 7efd7000 Unfrozen
Start: mshtml!ShowModelessHTMLDialog+0x5460 (6dc0d50e)
Priority: 0  Priority class: 32  Affinity: f
189ef0b4  00000000
---StartDocW---
.  2  Id: 1464.f80 Suspend: 1 Teb: 7efd7000 Unfrozen
Start: mshtml!ShowModelessHTMLDialog+0x5460 (6dc0d50e)
Priority: 0  Priority class: 32  Affinity: f

Interestingly, when I was researching XPS printing in MSDN, I came across a page in the .NET Framework documentation which contains the following detail:

The three-parameter AddJob(String, String, Boolean) overload of AddJob must run in a single thread apartment whenever the Boolean parameter is false, which it must be when a non-XPSDrv printer is being used. However, the default apartment state for Microsoft .NET is multiple thread. This default must be reversed since the example assumes a non-XPSDrv printer

There are two ways to change the default. One way is to simply add the STAThreadAttribute (that is, “[System.STAThreadAttribute()]”) just above the first line of the application’s Main method (usually “static void Main(string[] args)”). However, many applications require that the Main method have a multi-threaded apartment state, so there is a second method: put the call to AddJob(String, String, Boolean) in a separate thread whose apartment state is set to STA with SetApartmentState. The example below uses this second technique.

This little detail is not in the Win32 XPS documentation. I can’t prove that this is the cause of the problem but it certainly seems suspicious that this goes wrong with MSHTML.  I tried a very naughty test whereby I overrode the COINIT flags whenever CoInitializeEx was called (WinDbg: bp ole32!coinitializeex “ed esp+8 2; gc”), and the problem “went away” … it’s a nice finger-in-the-wind test but certainly not conclusive!

The problem does not occur on the printer in our office, because it is an XPSDrv printer.  The problem only occurs with drivers that do not support XPS, because XPSDrv does not require apartment threading.

Anyway, I think it’s time to take this case to Microsoft.  In the meantime, the safest workaround is to wait for the first print job to finish before starting a subsequent one.  We’ll probably put that fix in for now, even though it makes the print tediously slow!

Second Rant: Why oh why does Adobe Reader need to restart?

Updated: 3:50pm — It gets worse — see below!
 
Have you ever restarted Windows (e.g. after installing Windows updates) and then been presented with the following dialog:

 And of course, once you click that Install button, you eventually end up with the following:

(For the pedantic reader, yes, I captured different versions of Adobe Reader Updater on different machines.  I’ve had to deal with this on 3 machines so far today. Get over it!)

Here’s why Adobe Reader needs to restart, even if you have nothing apparently running: the installer does not shut down the background processes AdobeARM.exe and reader_sl.exe (and possibly others).

For example on my Windows 7 x64 test machine, in Event Viewer, you can see the following events.

Product: Adobe Reader 9.5.2. The file C:\Program Files (x86)\Adobe\Reader 9.0\Reader\reader_sl.exe is being used by the following process: Name: reader_sl , Id 3084.

Product: Adobe Reader 9.5.2. The file C:\Program Files (x86)\Common Files\Adobe\ARM\1.0\AdobeARM.exe is being used by the following process: Name: AdobeARM , Id 3096.

Both these processes are part of Adobe Reader, and they should be shut down by the installer.  But the updater does not do this.  So, you are forced to restart.  This happens even if you don’t open your web browser or any other applications!

So before clicking that fateful Install button, go and kill those Adobe processes hanging around so you can avoid that restart.  Also, you will need to close your web browser(s), Office applications, and anything else that you can think of that may also be involved.  It’s a pain, but it’s still better than restarting.

Update: Killing AdobeARM.exe at the wrong time is a bit of an issue.  Because AdobeARM.exe presents the front end to the updater!  This means that if you kill AdobeARM.exe, the update dialog also disappears.  The tricky fix then, is to kill AdobeARM.exe just after clicking the Install button.  You get no feedback on the install, but you can see msiexec.exe (up to 3 instances during the install) running in Task Manager or Process Explorer.  When it finishes, two of those will close down (the last instance hangs around for a while).  It will usually leave a happy message in the Event Log.

A little extra rant?  How poor is that design? Pushing updates for the updater in such a way that it forces a restart…  A little bit of forethought could have avoided that one, Adobe.

Detecting the Citadel Trojan with an Application Failure

An application that I do some development on was having trouble starting on a Windows XP machine.  In this application, we hook some of the Windows API functions internally to extend functionality in print preview (long story).  However, on this machine, the hook was failing.

The machine had fully up-to-date antivirus software from a major vendor; they’d run various anti-malware programs, but nothing was coming up.

No weird looking processes, DLLs or drivers were visible using Process Explorer; it looked like a pretty clean machine.  So after doing the usual diagnostics, we decided to take a full dump of the process when it threw up its error message.  A debug message told us that the hook function was failing to hook RegisterClassW, RegisterClassExW, RegisterClassA and RegisterClassExA.

I loaded the dump up in windbg and took a look at the functions.

0:000> u registerclassw
user32!RegisterClassW:
7e41a39a 6849d31300      push    13D349h
7e41a39f c3              ret
7e41a3a0 ec              in      al,dx
7e41a3a1 308b45085657    xor     byte ptr [ebx+57560845h],cl
7e41a3a7 6a09            push    9
7e41a3a9 59              pop     ecx
7e41a3aa 8d7004          lea     esi,[eax+4]
7e41a3ad 8b00            mov     eax,dword ptr [eax]

Whoa.  Push what?  That’s definitely not what we normally see!  Here’s what we’d expect to see:

USER32!RegisterClassW:
7e41a39a 8bff            mov     edi,edi
7e41a39c 55              push    ebp
7e41a39d 8bec            mov     ebp,esp
7e41a39f 83ec30          sub     esp,30h
7e41a3a2 8b4508          mov     eax,dword ptr [ebp+8]
7e41a3a5 56              push    esi
7e41a3a6 57              push    edi
7e41a3a7 6a09            push    9

So that push followed by ret is going to jump to address 13D349.  Let’s look to see which module owns that address space:

0:000> !address 13D349
Usage:
Allocation Base:        00000000
Base Address:           00130000
End Address:            00169000
Region Size:            00039000
Type:                   00000000
State:                  00000000
Protect:                00000000

So, not in the image space of a normally loaded module then.  So who is allocating that memory?  Let’s look for some strings longer than 3 characters in that block:

0:000> s -sa 00130000 00169000

This returned a stack of garbage data , but some strings stood out:

00131fb8  "Coded by BRIAN KREBS for persona"
00131fd8  "l use only. I love my job & wife"
00131ff8  "."

...

00136f08  "http://%02x%02x%02x%02x%02x%02x%"
00136f28  "02x%02x.com/%02x%02x%02x%02x/%02"
00136f48  "x%02x%02x%02x.php"

...

00138eac  "http://www.google.com/webhp"

...

00139424  "facebook.com"
00139438  "%BOTID%"
00139440  "%BOTNET%"

...

That first one was a dead giveaway.  Brian Krebs is a security researcher.  A Google search found that the Citadel Trojan embedded that string.

To help us diagnose similar situations more rapidly in the future, we captured a few other visible details on the file system and in the registry.  Images below for your entertainment!

There was not a lot of point going any further.  We understood why the problem was occurring, and how to resolve it (rebuild!).  Given that no antivirus vendors appear to support removal of this trojan, we advised that the client rebuild the computer as the safest and most cost-effective way forward.