Category Archives: Development

The case of the stack trace that wouldn’t (trace)

One of my favourite Windows tools is Procmon.  I pull it out regularly, often as a first port of call when diagnosing complicated and opaque problems in the software I develop.  Or in anyone’s software, really.

Procmon captures a trace of key I/O activities on your computer, including file, registry and network activity, and makes it really easy to spot operations that have failed or that may be causing problems.  It’s great for spotting authentication problems, sharing violations, missing files and more (… malware).  Procmon goes as far as recording a stack trace for nearly every operation it captures!

Today, we were trying to diagnose a problem with a process that was taking 15 seconds or longer to start on a Windows XP computer.  The normal start time for that process should have been 1-2 seconds.  None of the usual culprits came forward and admitted fault, so it was time to pull out Procmon again.

We quickly spotted a big fat delay in the trace.  Note the time stamps in the two selected rows  in the screen capture below.

A big time gap between 4:10:04 and 4:10:17

Now it was time to try and find out what was causing this.  So we examined the stack trace for each entry, except … there were no symbols.  Easy enough to fix — copy dbghelp.dll from a version of Microsoft’s Debugging Tools for Windows onto the system temporarily, fixup the symbol path in Procmon’s options, and … nope, still no symbols.  Now this is one area where Procmon falls down a little bit.  If symbol loading fails, it just silently fails.  No warnings, errors or hints as to what might be going on.

This issue was occurring on a client’s computer, so it was time to take the investigation elsewhere for examination.  Before we could really examine the captured trace, we needed to get symbols going.  But how?

Procmon to the rescue!

That’s right, we realised we can use Procmon to diagnose itself!  I booted up a clean new Windows XP virtual machine, loaded Procmon onto it, ran a basic capture of some random events.

A trace on XP
A trace on XP
No symbols showing, only exports
No symbols showing, only exports
Configuring symbols for Procmon
Configuring symbols for Procmon

Even after configuring symbols, they still silently failed to load.  So I stopped the capture, saved it and immediately opened the saved capture, to stop this instance of Procmon from capturing events on the local computer.  I then started a second instance of Procmon, removed the Procmon exclusion from the filtering, and instead, added a filter to include Procmon (I also filtered specifically for the PID of the original Procmon, later):

Configuring Procmon to watch itself
Configuring Procmon to watch itself

Then I started the trace, switched back to the first Procmon, and tried to examine the stack.  Of course, still no symbols, but now it was time to switch back to the second, active Procmon process and see what we found.

And what we found was that dbghelp.dll was looking for symsrv.dll in order to download its symbols.  So we copied that also into the folder with procmon.exe  and suddenly everything worked!

dbghelp wants symsrv to help it as well
dbghelp wants symsrv to help it as well
The undecorated stack for the symsrv load request
The undecorated stack for the symsrv load request

Update 19 Sep 2013: Oops.  Forgot to attach the decorated stack (sorry!):

The call stack with symbols
The call stack with symbols loaded.  Note how the function names differ from the original stack.

So that’s the first takeaway from this story: when you want symbols, copy both dbghelp.dll and symsrv.dll from your copy of Debugging Tools for Windows.  We found no other dependencies, even with the latest version of these files.

A diversion

One curious anomaly we spotted: Procmon (or possibly Dbghelp) is looking in some strange places for debug symbols, including appending a SRV*path*url style symbol path to procmon’s parent path, and looking there, without much success:

Some weirdness in symbol loading
Some weirdness in symbol loading

I leave that one for you to solve.

Backtrack to the stack (trace)

Back to the original trace.  We loaded up the saved trace, and found that we now got kernel mode symbols just fine, but no user mode symbols would load.  In fact, Procmon doesn’t even appear to be looking for symbols for these user mode frames — either on the local drive or on the network.  And this time Procmon isn’t able to give us any more detail.  However, when we debugged the call that Procmon made to SymFindFileInPath when viewing a call stack in this log vs another new log, we found that Procmon wasn’t even providing the necessary identifying information.

What information is this?  The identifying information that the symbol servers use is the TimeDateStamp and the SizeOfImage fields from the PE header of the executable file (slightly different for .pdb files).

I surmise that this identifying information is missing from our original trace because this trace was captured before we copied version 6.0 or later of dbghelp.dll onto the client’s computer — meaning that the version that Procmon used when capturing the trace did not record this identifying information.

Therefore, the second takeaway of the story is: always copy a recent version of dbghelp.dll and symsrv.dll into the folder with procmon.exe, before starting a trace.  Even if you intend to analyze the trace later, you’ll find that without these, you won’t get full stack traces.

(Dear Microsoft, please can you consider including these in the Procmon and Procexp downloads, given that you now own Sysinternals?  Saves a lot of hassle!)

I am happy to announce that the missing characters have been found!

Last year I published a blog post about characters missing from print jobs with Internet Explorer 9 and my adventures in tracing and reporting the issue to Microsoft.

We saw situations where a letter would be printed with random letters missing, as per the following example:

Here’s how it should have been printed (oh yes, that’s a completely fictional name we used for testing non-English Latin script letters in printing):

Today, I was notified that Microsoft have finally publicly published a hotfix!  We received the hotfix from them a few weeks ago, before it was publicly available, and it certainly solved the problem on our test machines and end user computers.

Download the hotfix from:

http://support.microsoft.com/kb/2853777 (Windows 7 SP1, Windows Server 2008 SP1) 

http://support.microsoft.com/kb/2855336 (Windows 8, Windows Server 2012)

(The first link notes that the hotfix is included in the rollup in the second link, though the second link doesn’t mention the hotfix!)

Sadly, they’ve only published a solution for Windows 7 Service Pack 1 onwards, and not for Vista, which could be an issue for some users. 

It was interesting to see that the issue was indeed a race condition (two threads ending up with the same random seed, because, and I quote:

This issue occurs because a conflict causes the text that uses the font to become corrupted when the two threads try to install the font at the same time. The name of the font is generated by the RAND function together with a SEEDvalue whose time value is set to srand(time(NULL)). When the two documents are printed at the same time, the SEEDvalue for the font is the same in both documents. Therefore, the conflict occurs.

So not related to the threading model, which was a little bit of misdirection on my part 😉  That’s par for the course though when debugging complex issues without source…

Fantastic to finally have the fix 🙂

Giro-Your-Strava updated to give Le Tour treatment to the climbs!

Update May 2014: As Strava’s ride pages have changed format significantly, Giro-Your-Strava no longer works. The good news is that Mesmeride does the same and more!

After the Strava API debacle, my little Tour Segment Gradient tool no longer works, which is sad.  I’d put together a number of other Strava API-based widgets, but this was the only one which was really at all popular.  Yesterday, DC Rainmaker himself mentioned (thank you Ray!) the Giro-Your-Strava elevation graph tool (which does still work) on his blog, so what better time to update the Tour Segment Gradient tool?

In short, what I have done is to dump both the Giro and Le Tour gradient mashups into the same bookmarklet.  One click and you get beautiful isometric graphs for your ride (in Giro style) and your efforts (in Le Tour style).  Yes, I get the inconsistency, but what would life be without idiosyncracies?

1. Install the bookmarklet.

Here’s the bookmarklet.  Just drag it to your Bookmarks toolbar to install it:

Giro-Your-Strava

2. Load a favourite Strava ride and click the bookmark.

(It’s best to wait for the page to load completely before clicking the bookmark.)

Presto, you’ll get two spiffy new buttons, one for your ride:

And one for the segment view:

So go ahead and click the Giro button, and you’ll see:

Click the Le Tour button for the new elevation profile for a segment:

Have fun!

One danger with bookmarklets that fiddle with an existing site in this way is that they will tend to break when the site updates.  There are no stability guarantees that APIs (ususally!) provide, so YMMV.  However, if the Strava site layout changes, it’s probably only a simple tweak to the code to get it working again.

There’s nothing beautiful about the code on the backend.  It really needs rewriting and modularisation etc etc etc but hey, it works 😉  Do what you want with the code, just share it with us all if you improve things!

Giro-style elevation graphs for Strava

Update May 2014: As Strava’s ride pages have changed format significantly, Giro-Your-Strava no longer works. The good news is that Mesmeride does the same and more!
Update 13 July 2013: An updated version of Giro-Your-Strava is now available here.

Just in time for the last few days of the Giro, I’ve finished a little after-hours Strava mashup project that builds on the segment graphs that I originally created for the Hobart 10,000.

I’m sure you’ve seen some of the Giro elevation graphs. Here’s one, from Stage 11:

Now, here’s a bookmarklet.  Drag it to your bookmarks toolbar, load up a favourite hilly ride on Strava, and click the bookmarklet.

Giro-Your-Strava

A mysterious new button will appear in the graph menu.  Go, on click it!

And up pops an elevation graph that makes it look like you’ve been riding the Giro!

The algorithm picks the categorised climbs from your ride (and tries to figure out the most appropriate segment where multiple segments finish at the top of a hill). Non-categorised segments are currently ignored.  The whole project is published on GitHub, so you can tweak it and improve it to your heart’s content.  Your first step should be to tidy up the mess I’ve left you 🙂

RSA Key Exchange between CryptoAPI and CNG (BCrypt)

Microsoft, a few years ago, wrote a new cryptography API for Windows Vista called Cryptography Next Generation (CNG).  This replaces the existing cryptography API in Windows XP and earlier versions which was simply called CryptoAPI.

Now, compatibility between the two APIs is touted as a feature of the new CNG API, but a somewhat more sombre reality tends to set in pretty quickly once you start working with them.  A motley collection of loosely documented data structures, endianness differences, and “Not Supported” footnotes pile up as quickly as you can open MSDN documentation tabs in your browser.  In my research, I also found a dearth of concrete examples of how to interoperate between the two APIs online.  In fact, there was only one page which really helped, buried in Microsoft’s Output Protection Manager documentation.

There is something magical about all your bits lining up in a row when mixing crypto libraries of any ilk, and I wanted to share!  Hence this story.

My requirement was pretty simple: I wanted to generate an RSA key pair using CryptoAPI in my client application, hand off the public key to another application, which happens to use CNG, wave my hands a little, and magically transact a key exchange.  Here’s how to do it (with some error handling elided for your own reading sanity).  You should be able to load this into any recent version of Visual Studio, add bcrypt.lib to your library includes and build.  The code should be reasonably self-explanatory, so I’ll leave it there.

The usual disclaimers apply. Source on GitHub.

RIGHTeously tripping over T-SQL’s LEN function

We tripped over recently on our understanding of Microsoft SQL Server’s T-SQL LEN function.  The following conversation encapsulates in a nutshell what any sensible developer would assume about the function.

@marcdurdin I guess the answer is, removing the line would give the same result. Right? 🙂

— S Krupa Shankar (@tamil) April 23, 2013

Now, I wouldn’t be writing this blog post if that was the whole story.  Because, like so many of these things, it’s not quite that simple.

Originally, we were using RIGHT(string, LEN(string) – 2) to trim two characters off the front of our string.  Or something like that.  Perfectly legitimate and reasonable, one would think.  But we were getting strange results, trimming more characters than we expected.

It turns out that T-SQL’s LEN function does not return the length of the string.  It returns the length of the string, excluding trailing blanks.  That is, excluding U+0020, 0x20, 32, ‘ ‘, or whatever you want to call it.  But not tabs, new lines, zero width spaces, breaking or otherwise, or any other Unicode space category characters.  Just that one character.  This no doubt comes from the history of the CHAR(n) type, where fields were always padded out to n characters with spaces.  But today, this is not a helpful feature.

Of course, RIGHT(string) does not ignore trailing blanks

But here’s where it gets really fun.  Because a post about strings is never complete until you’ve done it in Unicode.  Some pundits suggest using DATALENGTH to get the actual length of the string.  This returns the length in bytes, not characters (remember that NCHAR is UTF-16, so 2 bytes per character… sorta!).  Starting with SQL Server 2012, UTF-16 supplementary pairs can be treated as a single character, with the _SC collations, so you can’t even use DATALENGTH*2 to get the real length of the string!

OK.  So how do you get the length of a string, blanks included, now that we’ve established that we can’t use DATALENGTH?  Here’s one simple way:

  SET @realLength = LEN(@str + ‘.’) – 1

Just to really do your head in, let’s see what happens when you use LEN with a bunch of garden variety SQL strings.  I should warn you that this is a display artefact involving the decidedly unrecommended use of NUL characters, and no more, but the weird side effects are too fun to ignore.

First, here’s a set of SQL queries for our default collation (Latin1_General_CI_AS):

Note the output.  Not exactly intuitive!  Now we run the Unicode version:

Not quite as many gotchas in that one?  Or are there?  Notice how the first line shows a wide space for the three NUL characters — but not quite as wide as the Ideographic space…

Now here’s where things go really weird.  Running in Microsoft’s SQL Server Management Studio, I switch back to the first window, and, I must emphasise, without running any queries, or making any changes to settings, take a look at how the Messages tab appears now!

That’s right, the first line in the messages has magically changed!  The only thing I can surmise is that switching one window into Unicode output has affected the whole program’s treatment of NUL characters.  Let that be a warning to you (and I haven’t even mentioned consuming said NULs in C programs).

The Fragile Abstract Factory Delphi Pattern

When refactoring a large monolithic executable written in Delphi into several executables, we ran into an unanticipated issue with what I am calling the fragile abstract factory (anti) pattern, although the name is possibly not a perfect fit.  To get started, have a look at the following small program that illustrates the issue.

program FragileFactory;

uses
  MyClass in 'MyClass.pas',
  GreenClass in 'GreenClass.pas',
  //RedClass in 'RedClass.pas',
  SomeProgram in 'SomeProgram.pas';

var
  C: TMyClass;
begin
  C := TMyClass.GetObject('TGreenClass');
  writeln(C.ToString);
  C.Free;

  C := TMyClass.GetObject('TRedClass');  // oh dear… that’s going to fail
  writeln(C.ToString);
  C.Free;
end.
unit MyClass;

interface

type
  TMyClass = class
  protected
    class procedure Register;
  public
    class function GetObject(ClassName: string): TMyClass;
  end;

implementation

uses
  System.Contnrs,
  System.SysUtils;

var
  FMyClasses: TClassList = nil;

{ TMyObjectBase }

class procedure TMyClass.Register;
begin
  if not Assigned(FMyClasses) then
    FMyClasses := TClassList.Create;
  FMyClasses.Add(Self);
end;

class function TMyClass.GetObject(ClassName: string): TMyClass;
var
  i: Integer;
begin
  for i := 0 to FMyClasses.Count – 1 do
    if FMyClasses[i].ClassNameIs(ClassName) then
    begin
      Result := FMyClasses[i].Create;
      Exit;
    end;

  Result := nil;
end;

initialization
finalization
  FreeAndNil(FMyClasses);
end.
unit GreenClass;

interface

uses
  MyClass;

type
  TGreenClass = class(TMyClass)
  public
    function ToString: string; override;
  end;

implementation

{ TGreenClass }

function TGreenClass.ToString: string;
begin
  Result := 'I am Green';
end;

initialization
  TGreenClass.Register;
end.

What happens when we run this?

C:\src\fragilefactory>Win32\Debug\FragileFactory.exe
I am Green
Exception EAccessViolation in module FragileFactory.exe at 000495A4.
Access violation at address 004495A4 in module ‘FragileFactory.exe’. Read of address 00000000.

Note the missing TRedClass in the source.  We don’t discover until runtime that this class is missing.  In a project of this scale, it is pretty obvious that we haven’t linked in the relevant unit, but once you get a large project (think hundreds or even thousands of units), it simply isn’t possible to manually validate that the classes you need are going to be there.

There are two problems with this fragile factory design pattern:

  1. Delphi use clauses are fragile (uh, hence the name of the pattern).  The development environment frequently updates them, typically they are not organised, sorted, or even formatted neatly.  This makes validating changes to them difficult and error-prone.  Merging changes in version control systems is a frequent cause of errors.
  2. When starting a new Delphi project that utilises your class libraries, ensuring you use all the required units is a hard problem to solve.

Typically this pattern will be used with in a couple of ways, somewhat less naively than the example above:

  1. There will be an external source for the identifiers.  In the project in question, these class names were retrieved from a database, or from linked resources.
  2. The registered classes will be iterated over and each called in turn to perform some function.

Of course, this is not a problem restricted to Delphi or even this pattern.  Within Delphi, any unit that does work in its initialization section is prone to this problem.  More broadly, any dynamically linking registry, such as COM, will have similar problems.  The big gotcha with Delphi is that the problem can only be resolved by rebuilding the project, which necessitates rollout of an updated executable — much harder than just re-registering a COM object on a client site for example.

How then do we solve this problem?  Well, I have not identified a simple, good clean fix.  If you have one, please tell me!  But here are a few things that can help.

  1. Where possible, use a reference to the class itself, such as by calling the class’s ClassName function, to enforce linking the identified class in.  For example:
    C := TMyClass.GetObject(TGreenClass.ClassName);
    C := TMyClass.GetObject(TRedClass.ClassName);
  2. When the identifiers are pulled from an external resource, such as a database, you have no static reference to the class.  In this case, consider building a registry of units, automatically generated during your build if possible.  For example, we automatically generate a registry during build that looks like this:
    unit MyClassRegistry;
    
    // Do not modify; automatically generated 
    
    initialization
    procedure AssertMyClassRegistry;
    
    implementation
    
    uses
      GreenClass,
      RedClass;
    
    procedure AssertMyClassRegistry;
    begin
      Assert(TGreenClass.InheritsFrom(TMyClass));
      Assert(TRedClass.InheritsFrom(TMyClass));
    end; 
    
    end.

    These are not assertions that are likely to fail but they do serve to ensure that the classes are linked in.  The AssertMyClassRegistry function is called in the constructor of our main form, which is safer than relying on a use clause to link it in.

  3. Units that can cause this problem can be identified by searching for units with initialization sections in your project (don’t forget units that also use the old style begin instead of initialization — a helpful grep regular expression for finding these units is (?s)begin(?:.(?!end;))+\bend\.).  This at least gives you a starting point for making sure you test every possible case.  Static analysis tools are very helpful here.
  4. Even though it goes against every encapsulation design principle I’ve ever learned, referencing the subclass units in the base class unit is perhaps a sensible solution with a minimal cost.  We’ve used this approach in a number of places.
  5. Format your use clauses, even sorting them if possible, with a single unit reference on each line.  This is a massive boon for source control.  We went as far as building a tool to clean our use clauses, and found that this helped greatly.

In summary, then, the fragile abstract factory (anti) pattern is just something to be aware of when working on large Delphi projects, mainly because it is hard to test for: the absence of a unit only comes to light when you actually call upon that unit, and due to the fragility of Delphi’s use clauses, unrelated changes are likely to trigger the issue.

How not to re-raise an exception in Delphi

Just a quick exception management tip for today.

I was debugging a weird cascade of exceptions in an application today — it started with an EOleException from a database connection issue, and rapidly degenerated into a series of EAccessViolation and EInvalidPointer exceptions: often a good sign of Use-After-Free or Double-Free scenarios.  Problem was, I could not see any place where we could be using a object after freeing it, even in the error case.

Here’s a shortened version of the code:

procedure ConnectToDatabase;
begin
  try
    CauseADatabaseException;  // yeah... bear with me here.
  except
    on E: EDatabaseError do
    begin
      E.Message := 'Failed to connect to database; the '+ 
                   'error message was: ' + E.Message;
      raise(E);
    end;
  end;
end;

Can you spot the bug?

I must admit I read through the code quite a few times before I spotted it.  It’s not an in-your-face-look-at-me type of bug!  In fact, the line in question appears to be completely logical and plausible.

Okay, enough waffle.  The bug is in the last line of the exception handler:

      raise(E);

When we call raise(E) we are telling Delphi that here is a shiny brand new exception object that we want to raise.  After Delphi raises the exception object, the original exception object is freed, and … wait a minute, that’s the exception object we were raising!  One delicious serve of Use-After-Free or Double-Free coming right up!

We should be doing this instead:

      raise;  // don't reference E here!

Another thing to take away from this is: remember that you don’t control the lifetime of the exception object.  Don’t store references to it and expect it to survive.  If you want to maintain knowledge of an inner exception object, use Delphi’s own nested exception management, available since Delphi 2009.

Delphi, Win32 and leaky exceptions

Have you ever written any Delphi code like the following snippet?  That is, have you ever raised a Delphi exception in a Win32 callback?  Or even just failed to handle potential exceptions in a callback?

    function MyWndEnumFunc(hwnd: HWND; lParam: LPARAM): BOOL; stdcall;
    begin
      if hwnd = TForm4(lParam).Handle then
        raise Exception.Create('Raise an exception just to demonstrate the issue');
      Result := True;
    end;

    procedure TForm4.FormCreate(Sender: TObject);
    begin
      try
        EnumWindows(@MyWndEnumFunc, NativeInt(Self));
      except
        on E:Exception do
          ShowMessage('Well, we unwound that exception.  But does Win32 agree?');
      end;
    end;

Raymond Chen has posted a great blog today about why that approach will eventually end in tears.  You may get away with it for a while, or you may end up with horrific stack or heap corruption. The moral of the story is, any time you have a Win32 callback function, you need to make sure no exceptions leak.  Like this:

function MyWndEnumFunc(hwnd: HWND; lParam: LPARAM): BOOL; stdcall;
begin
  try
    if hwnd = TForm4(lParam).Handle then
      raise Exception.Create('Raise an exception just to demonstrate the issue');
    Result := True;
  except
    // Handle the exception, perhaps pass it to Application.HandleException,
    // or log it, or abort your app.  Just don’t let it bubble through any
    // Win32 code.  You need to write the HandleAllExceptions function!
    HandleAllExceptions;
    Result := False;
  end;
end;

If you use Delphi’s AllocateHwnd procedure, remember that it also does not handle exceptions for you (I’ve just reported this in QualityCentral as QC108653 as this caveat should at least be documented).  So you need to do it:

procedure TForm4.MyAllocatedWindowProc(var Message: TMessage);
begin
  try
    if Message.Msg = WM_USER then
      raise Exception.Create('Go wild!');
  except
    Application.HandleException(Self);  // Or whatever, just don’t let it leak
  end;
  with Message do Result := DefWindowProc(FMyHandle, Msg, wParam, lParam);
end;

procedure TForm4.FormCreate(Sender: TObject);
begin
  FMyHandle := AllocateHwnd(MyAllocatedWindowProc);
  SendMessage(FMyHandle, WM_USER, 0, 0);
end;

Why does switching languages hang my app?

This is a very short follow-up to yesterday’s blog about WM_INPUTLANGCHANGEREQUEST, WaitForSingleObject and applications hanging.

While I thought, yesterday, that this bug was no longer such a problem on Windows Vista and Windows 7, it turns out that if you assign a hotkey to a language (e.g. Alt+Shift+2), the KLF_SETFORPROCESS flag will be set when that hotkey is pressed.  This means that this issue is still a problem today.

This is not just a problem for Delphi.  I’ve found references to Outlook hanging, .NET 4.5 WPF applications hanging, and VC++ applications hanging.

Moral of the story (again): thread-safe code is hard.  Don’t believe that your code suddenly becomes thread-safe because you synchronize it to the main thread (no matter what the Delphi documentation tells you!)