Category Archives: Bugs

The case of the terribly slow PayPal emails

Every now and then I receive a payment via PayPal.  That’s not unusual, right?  PayPal would send me an email notifying me of the payment, and I’d open up Outlook to take a look at it.  All well and good.  In the last week, however, something changed.  When I clicked on any PayPal email, Outlook would take the best part of a minute to open the email, and what’s more, would freeze its user interface entirely while doing this.  But this was only happening with emails from PayPal — everything else was fine.

Not good.  At first I suspected an addin was mucking things up, so I disabled all the Outlook addins and restarted Outlook for good measure.  No difference.  Now I was getting worried — what if this was some unintended side-effect of some malware that had somehow got onto my machine, and it was targeting PayPal emails?

So I decided to do some research.  I fired up SysInternals’ Process Monitor, set it up to show only Outlook in its filtering, and turned on the Process Monitor trace.

Process Monitor filter window – filtering for OUTLOOK.EXE

Then I went and clicked on a PayPal email.  Waited the requisite time for the email to display, then went back to Process Monitor and turned off the trace.  I added the Duration column to make it easier to spot an anomalous entry.  This doesn’t always help but given the long delay, I was expecting some file or network activity to be taking a long time to run.

Adding the Duration column

Then scrolling up the log I quickly spotted the following entry.  It had a duration of nearly 3 seconds which stood out like a sore thumb.

The first offending entry

This entry was a Windows Networking connection to connect to a share on the remote host \\102.112.2o7.net.  This came back, nearly 3 seconds later, with ACCESS DENIED.  Then there were a bunch of follow up entries that related to this, all in all taking over 30 seconds to complete.  A quick web search revealed that this domain with its dubious looking name 102.112.2o7.net belongs to a well known web statistics company called Omniture.  That took some of the load off my mind, but now I was wondering how on earth opening a PayPal email could result in Internet access when I didn’t have automatic downloads of pictures switched on.

One of the emails that caused the problem, redacted of course 🙂

I opened the source of the PayPal HTML email and searched for “102.112“.  And there it was.

The HTML email in notepad

That’s a classic web bug.  Retrieving that image of a 1×1 pixel size, no doubt transparent, with some details encoded in the URL to record my visit to the web page (or in this case, opening of the email):

What was curious about this web bug was the use of the “//” shorthand to imply the same protocol (e.g. HTTP or HTTPS) as the base page.  That’s all well and good in a web browser, but in Outlook, the email is not being retrieved over HTTP.  So Outlook interprets this as a Windows Networking address, and attempts a Windows Networking connection to the host instead, \\102.112.2o7.net\b….

At this point I realised this could be viewed as a security flaw in Outlook.  So I wrote to Microsoft instead of publishing this post immediately.  Microsoft responded that they did not view this as a vulnerability (as it does not result in system compromise), but that they’d pass it on to the Outlook team as a bug.

Nevertheless, this definitely has the potential of being exploited for tracking purposes.  One important reason that external images are not loaded by default is to prevent third parties from being notified that you are reading a particular email.  While this little issue does not actually cause the image to load (it is still classified as an external image), it does cause a network connection to the third party server which could easily be logged for tracking purposes.  This network connection should not be happening.

So what was my fix?  Well, I don’t really care about Omniture or whether PayPal get their statistics, so I added an entry in my hosts file to block this domain.  Using an invalid IP address made it fail faster than the traditional use of 127.0.0.1:

0.0.0.0    102.112.2o7.net

And now my PayPal emails open quickly again.

Why you should not use MoveFileEx with MOVEFILE_DELAY_UNTIL_REBOOT

A common problem that you, as a developer, may run into on Windows is the need to replace a file that is in use. This commonly happens with installations and upgrades, but can of course also happen in general use.

In earlier versions of Windows, when most users worked in full administrator mode, the MoveFileEx function with the MOVEFILE_DELAY_UNTIL_REBOOT flag was suggested by Microsoft as a great approach for updating files that were in use.  This flag would, as it sounds, allow you to schedule the move or deletion of a file at a time when it was (pretty much) guaranteed to succeed.

For example:

// This will delete c:\temp\coolcorelibrary.dll on the next reboot
MoveFileEx(“c:\\temp\\coolcorelibrary.dll”, NULL, MOVEFILE_DELAY_UNTIL_REBOOT);

Nowadays, of course, this flag does not work unless you are running in the context of an administrative user.  That’s great, you think, this will still work for my install or upgrade program.

But don’t trust that feeling of security.  Things are never as easy as they seem.  I first realised that this was a problem when researching issues occasionally reported by some of our Keyman Desktop users.

Take this scenario:

  1. A user, Joe, decides to uninstall your awesome app CoolCoreProgram.
  2. The uninstaller finds that a critical file (let’s call it coolcorelibrary.dll) is in use and can’t delete it
  3. Installer calls a MoveFileEx with MOVEFILE_DELAY_UNTIL_REBOOT to schedule deletion of coolcorelibrary.dll.
  4. Would you click Restart Now?  Why not leave it till later?

    The uninstall completes and presents Joe with the dreaded “Hey, you need to restart Windows now” dialog.

  5. Poor unhappy Joe swears and cancels the restart, and continues his work.  He can’t see any good reason to restart Windows…
  6. A short while later, Joe realises that he actually loves CoolCoreProgram and so he downloads and reinstalls the latest, greatest version (now with extra shiny!)
  7. Shortly thereafter, Joe finishes up for the day and turns off his computer.
  8. The next morning, after Joe starts up his computer, Windows notes its instructions from the previous day, and obediently deletes coolcorelibrary.dll.
  9. And now Joe is now really, really unhappy when he tries to start CoolCoreProgram and he gets a bizarre coolcorelibrary.dll missing error.

Who is Joe going to blame?  Is it his fault for not restarting?  Is it yours for using cool APIs such as MoveFileEx?  Or is it a woeful confluence of unintended consequences?

This is probably one of the simplest scenarios in which this problem can crop up.  Things get much worse when you talk about shared libraries, fonts, or other resources which may be in use by the system, or multi-user systems.

Some reasons I have encountered for files in use:

  1. Program is still running (duh!)
  2. System has locked the file (common with fonts, hook libraries)
  3. Antivirus or security software is scanning the file
  4. Another application is trying to update the file (i.e. don’t run multiple installers at once)

One possible fix would be to check and update the registry key
HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Session Manager\PendingFileRenameOperations for listed file before creating or updating them.  But that’s a big performance and reliability hit.  I can’t say that solution sits well with me.

Another idea is to block further install and upgrade tasks until Joe actually does restart his computer, for example with a registry setting or a RunOnce entry.  But that’s going to make Joe hate me even more than he already does!

Some scenarios can be fixed with the Windows Installer automatic repair functionality.  But that assumes access to the original source files is always possible.  So that’s not a general solution.

I don’t really have a good general solution to this problem.  Any ideas?

Reproducing the TrustedInstaller.exe leak

Earlier today I blogged about the extraordinary amount of time it was taking to install updates on a Windows 7 virtual machine.  I believed that the issue came down to TrustedInstaller.exe taking what seemed to be an ever increasing amount of RAM and opening an inordinate number of handles.

After writing this blog post, I wondered who at Microsoft might be interested in the issue.  I ended up tweeting a link to Mark Russinovich in the hope that it would whet his interest (the blog did have pictures of Process Explorer, one of Mark’s great Sysinternals utilities ☺):

And verily, Mark did reply:

As Mark was kind enough to respond I knew I had to go and reproduce the issue!  So I built a new VM of Windows 7 Ultimate x64 with stock standard settings, but only 1GB of RAM (oops):

Stock Windows 7 Ultimate x64 install on a VM

I configured Performance Monitor to trace Working Set and Handle counters on the TrustedInstaller.exe process, and saved the settings into a Data Collector Set so that I could capture the whole update process.
 
Then I started the Windows Update:

Windows Update: I selected the additional important update, but none of the optional updates

Three hours later (I did say whoops about the RAM, didn’t I?) Windows finished installing updates, and I was able to generate a report from the Data Collector Set that showed the following graph:

Graph of Working set and Handle Count + summary of Working Set detail

and:

Handle Count detail

At the end of the process we see 34,500 open handles, and a peak of 600MB memory usage.  As one would expect, the RAM usage went up and down quite a bit as updates were installed.  More disturbingly, the Handle Count really did not decrease significantly through the process.  So the real message of the graph is the trend for both the green Handle Count, and the red Working Set, which are both clearly heading the wrong way: up.

There were fewer updates to install than with the other VM, which was also installing Microsoft Office Updates; I’m guessing this had an impact on the overall numbers.

Minor note: I didn’t fix the time zone for the VM.  I wasn’t really doing this at 3AM!

Does Microsoft’s TrustedInstaller.exe have a leak?

When building a virtual machine today, I was curious as to why the Windows Update install process was taking such a long time.  The initial run had over 100 different updates so I expected it to take a while, but it was even slower than expected. So I fired up Process Explorer and discovered that TrustedInstaller.exe had allocated over 1GB of RAM to do its updates.  As I watched, this number kept ticking up until all the updates were installed, with a maximum memory usage of roughly 2GB.  The handle count tended to increase over time as well, hitting a maximum of 50500 handles as I watched.

This VM was setup with 2GB RAM so this was causing lots of swap activity, which I guess is what was slowing down the update process so much.

I grabbed a screenshot at Update 94 when I decided that something was not right with this:

And another shot with the final update, Update 103.  50,484 handles is a whole lotta handles.

As each update finished, both the handle count and the memory usage would drop, but never back to the level before the update.

This really does suggest that TrustedInstaller.exe is leaking memory.  So is this intentional?  Whether by design, or by accident, it’s horrible.  Microsoft?

Update: Mark asked me why one would leak by design.  I suggested that one possibility might be that TrustedInstaller locks files or objects to prevent other processes from making changes to them before Windows is restarted, as the update process does not complete until Windows restarts.

Debugging an application that would not behave with WM_QUERYENDSESSION

I was recently tasked with addressing a problem with an application that would not shut down when given a WM_QUERYENDSESSION message. On the surface, this seemed easy enough to fix — just make sure that the WM_QUERYENDSESSION message was appropriately handled by all the top level windows in the application.

Step 1 was to add a handler into the main top level window that avoided the normal shutdown dialog boxes that warned of backup and similar tasks. Where possible, Microsoft advises, you should not block the shutdown of Windows with dialog boxes, etc.

That done, I tested and found that the application was still not shutting down when it was supposed to. So I fired up Spy++ and got started on watching the WM_QUERYENDSESSION chatter in the application. Here I ran into my first little catch-22. I decided that I wanted to do a real shutdown, not a simulated one, to make sure that I was testing the correct scenario.

Problem was, Spy++ was always being shut down before I could capture the messages needed in my target application. Eventually I realised that I needed to start Spy++ after my application, because Windows would ask each application in turn to WM_QUERYENDSESSION, and as we know, my application would cancel the process because it wasn’t accepting a shutdown! Then I realised I could also make sure the shutdown was cancelled even when my application behaved properly, simply by loading Notepad after my application and typing a character or two into it. Then during the shutdown sequence, Iwould be asked if I want to save changes to the text document, and I could just click Cancel. Crude but effective.

With that out of the way, I ran through a Spy++ session and was able to spot an offending window that returned 0 to WM_QUERYENDSESSION. A little tracing and I found a utility window procedure that failed to call DefWindowProc. Naughty. Easily solved.

Well so I thought. Started testing again. The application was still failing to shutdown. I ran through Spy++, traced all the WM_QUERYENDSESSION messages for the process, and every single window procedure responded to the message, virtually instantly, with a nice clean 1, indicating that the window was happy to end the session. After 5 seconds, Windows would show its “Application is not responding” message, indicating that it had not received a response from one of the top level windows.

This was rather puzzling. I threw in a bit of logging into a dll using SetWindowsHookEx for WH_CALLWNDPROC and WH_CALLWNDPROCRET, just in case I was missing something with Spy++. No difference, except that it was somewhat easier to read.

So then I got thinking. This application runs quite a few background threads for loading data from a database. What if, I wondered, one of those threads had a window created but was not running a message loop? It sounded logical but I wasn’t very confident that that was the cause. Problem was, how to tell? I looked for ways to check the queue status for a thread in Windbg but could not find an easy solution. Eventually I decided to add some debugging into the common ThreadProc wrapper for the application, called GetQueueStatus(QS_SENDMESSAGE), and logged the response. Re-ran the application, and bingo, there was a thread closing down with a message in its queue. No prizes if you guess what that message was!

After that, it was simple. Reviewing the thread’s main function I quickly discovered a call to WaitForSingleObject. The thread did not appear to create any windows in any of our code, but it did use COM in its database connectivity, which meant that a helper window was created behind the scenes. I changed the WaitForSingleObject to a MsgWaitForMultipleObjects, and processed any outstanding messages when signalled with a PeekMessage loop.

And then it finally worked (and I forgot to start Notepad, so Windows restarted…) But it really stumped me for a bit today.

How to connect to a Netgear DG834G router in Windows 7

Trying to connect my Windows 7 beta AspireOne to our wireless network (with a Netgear DG834G router), was receiving an unspecified error – “Windows failed to connect”. After playing with a multitude of settings, including security, access control, manual configuration and more, I discovered that changing wireless mode on the router from “g only” to “g & b” solved all my problems…

After that I went back and turned on all the security settings on the router again!