Today I’ve got a process on my machine that is supposed to be exiting, but it has hung. Let’s load it up in Windbg and find what’s up. The program in question was built in Delphi XE2, and symbols were generated by our internal tds2dbg tool (but there are other tools online which create similar .dbg files). As usual, I am writing this up for my own benefit as much as anyone else’s, but if I put it on my blog, it forces me to put in enough detail that even I can understand it when I come back to it!
Looking at the main thread, we can see unit finalizations are currently being called, but the specific unit finalization section and functions which are being called are not immediately visible in the call stack, between InterlockedCompareExchange
and FinalizeUnits
:
0:000> kb ChildEBP RetAddr Args to Child 0018ff3c 0040908c 0018ff78 0040909a 0018ff5c audit4_patient!System.Sysutils.InterlockedCompareExchange+0x5 [C:\Program Files (x86)\Embarcadero\RAD Studio\9.0\source\rtl\sys\InterlockedAPIs.inc @ 23] 0018ff5c 004094b2 0018ff88 00000000 00000000 audit4_patient!System.FinalizeUnits+0x40 [C:\Program Files (x86)\Embarcadero\RAD Studio\9.0\source\rtl\sys\System.pas @ 17473] 0018ff88 74b7336a 7efde000 0018ffd4 76f99f72 audit4_patient!System..Halt0+0xa2 [C:\Program Files (x86)\Embarcadero\RAD Studio\9.0\source\rtl\sys\System.pas @ 18599] 0018ff94 76f99f72 7efde000 35648d3c 00000000 kernel32!BaseThreadInitThunk+0xe 0018ffd4 76f99f45 01a5216c 7efde000 00000000 ntdll!__RtlUserThreadStart+0x70 0018ffec 00000000 01a5216c 7efde000 00000000 ntdll!_RtlUserThreadStart+0x1b
So, the simplest way to find out where we were was to step out of the InterlockedCompareExchange
call. I found myself in System.SysUtils.DoneMonitorSupport
(specifically, the CleanEventList
subprocedure):
0:000> p eax=01a8ee70 ebx=01a8ee70 ecx=01a8ee70 edx=00000001 esi=00000020 edi=01a26e80 eip=0042dcb1 esp=0018ff20 ebp=0018ff3c iopl=0 nv up ei pl nz na po nc cs=0023 ss=002b ds=002b es=002b fs=0053 gs=002b efl=00200202 audit4_patient!CleanEventList+0xd: 0042dcb1 33c9 xor ecx,ecx
After a little more spelunking, and a review of the Delphi source around this function, I found that this was a part of the System.TMonitor
support. Specifically, there was a locked TMonitor
somewhere that had not been destroyed. I stepped through a loop that was spinning, waiting for the object to be unlocked so its handle could be destroyed, and found a reference to the data in question here:
0:000> p eax=00000001 ebx=01a8ee70 ecx=01a8ee70 edx=00000001 esi=00000020 edi=01a26e80 eip=0042dcaf esp=0018ff20 ebp=0018ff3c iopl=0 nv up ei pl nz na po nc cs=0023 ss=002b ds=002b es=002b fs=0053 gs=002b efl=00200202 audit4_patient!CleanEventList+0xb: 0042dcaf 8bc3 mov eax,ebx
Looking at the record pointed to by ebx
, we had a reference to an event handle handy:
0:000> dd ebx L2 01a8ee70 00000001 00000928 TSyncEventItem = record Lock: Integer; Event: Pointer; end;
Although Event
is a Pointer
, internally it’s just cast from an event handle. So I guess that we can probably find another reference to that handle somewhere in memory, corresponding to a TMonitor
record:
TMonitor = record strict private // ... snip ... var FLockCount: Integer; FRecursionCount: Integer; FOwningThread: TThreadID; FLockEvent: Pointer; FSpinCount: Integer; FWaitQueue: PWaitingThread; FQueueLock: TSpinLock;
And if we search for that event handle:
0:000> s -[w]d 00400000 L?F000000 00000928 01a8ee74 00000928 00000000 0000092c 00000000 (.......,....... 07649334 00000928 002c0127 0012ccb6 0000002e (...'.,......... 0764aa14 00000928 01200125 004b8472 000037ce (...%. .r.K..7.. 07651e24 00000928 05f60125 01101abc 00000e86 (...%........... 08a47544 00000928 00000000 00000000 00000000 (...............
Now one of these should correspond to a TMonitor
record. The first entry (01a8ee74
) is just part of our TSyncEventItem
record, and the next three don’t make sense given that the FSpinCount
(the next value in the memory dump) would be invalid. So let’s look at the last one. Counting quickly on all my fingers and toes, I establish that that makes 08a47538
the start of the TMonitor
record. And… so we search for a pointer to that.
0:000> s -[w]d 00400000 L?F5687000 08a47538 076b1d24 08a47538 08aa3e40 076b1db1 0122fe50 8u..@>....k.P.".
Just one! But here it gets a little tricky, because the PMonitor
pointer is in a ‘hidden’ field at the end of the object. So we need to locate the start of the object.
0:000> dd 076b1d00 076b1d00 0122fe50 00000000 00000000 076b1df1 076b1d10 0122fe50 00000000 00000000 076b0ce0 076b1d20 004015c8 08a47538 08aa3e40 076b1db1 076b1d30 0122fe50 00000000 00000000 076b1d51 ... snip ...
I’m just stabbing in the dark here, but that 004015c8
that’s just four bytes back smells suspiciously like an object class pointer. Let’s see:
0:000> da poi(4015c8-38)+1 004016d7 "TObject&"
Ta da! That all fits. A TObject
has no data members, so the next 4 bytes should be the TMonitor
(search for hfMonitorOffset
in the Delphi source to learn more). So we have a TObject
being used as a TMonitor
lock reference. (Learn about that poi(address-38)+1
magic). But what other naughty object is hanging about, using this TObject
as its lock?
0:000> s -[w]d 00400000 L?F5687000 076b1d20 098194b0 076b1d20 00000000 00000000 09819449 .k.........I...
Just one. Let’s trawl back in memory just a little bit and have a look at this one.
0:000> dd 09819480 09819480 00f0f0f0 00ffffff 00000000 09819641 09819490 00447b7c 00000000 00000000 00000000 098194a0 00000000 098150c0 00448338 098195c8 098194b0 076b1d20 00000000 00000000 09819449 ... snip ... 0:000> da poi(00448338-38)+1 004483c0 "TThreadList&"
And what does a TThreadList
look like?
TThreadList = class private FList: TList; FLock: TObject; FDuplicates: TDuplicates;
Yes, that definitely looks hopeful! That FLock
is pointing to our lock TObject
… I believe that’s called a Quality Match.
This is still a little bit too generic for me, though. TThreadList
is a standard Delphi class used by the bucketload. Let’s try and identify who is using this list and leaving it lying about. First, we’ll quickly have a look at that TThreadList.FList
to see if it has anything of interest — that’s the first data member in the object == object+4.
0:000> dd poi(098194ac) 098195c8 00447b7c 00000000 00000000 00000000 ... snip ... 0:000> da poi(447b7c-38)+1 00447cad "TList'"
Yep, it’s a TList
. Just making sure. It’s empty, what a shame (TList.FCount
is the second data member in the object == 00000000, as is the list pointer itself).
So how else can we find the usage of that TThreadList
? Is it TThreadList
referenced anywhere then? Break out the search tool again!
0:000> s -[w]d 00400000 L?F5687000 098194a8 076d3410 098194a8 00000000 09819580 00000000 ................
Yes. Just once, again. Again, we scroll back in memory to find the base of that object.
0:000> dd 076d33c0 076d33c0 08abfe60 00000000 00000000 076d4c39 076d33d0 01616964 0000091c 00001850 00000100 076d33e0 00000001 00000000 00000000 00000000 076d33f0 076d33d0 00000000 01616d38 076d33d0 076d3400 00000000 00000000 00000000 00000000 076d3410 098194a8 00000000 09819580 00000000 076d3420 00000000 076d2ea9 0075e518 00000000 076d3430 00401ecc 00000000 00000000 00000000
There was a false positive at 076d33f8
, but then magic happened at 076d33d0
:
0:000> da poi(01616964-38)+1 016169e8 "TAnatomyDiagramTileLoaderThread&"
Wow! Something real! Let’s dissect this a bit. Looking at the definition of TThread
, we have the following data:
076d33d0 01616964 class pointer TAnatomyDiagramTileLoaderThread 076d33d4 0000091c TThread.FHandle 076d33d8 00001850 TThread.FThreadID 076d33dc 00000100 TThread.FCreateSuspended, .FTerminated, .FSuspended, .FFreeOnTerminate (watch that endianness!) 076d33e0 00000001 TThread.FFinished 076d33e4 00000000 TThread.FReturnValue 076d33e8 00000000 TThread.FOnTerminate.Code 076d33ec 00000000 TThread.FOnTerminate.Data 076d33f0 076d33d0 TThread.FSynchronize.TThread (= Self) 076d33f4 00000000 padding 076d33f8 01616d38 TThread.FSynchronize.FMethod.Code (= last Synchronize target method) 076d33fc 076d33d0 TThread.FSynchronize.FMethod.Data (= Self) 076d3400 00000000 TThread.FSynchronize.FProcedure 076d3404 00000000 TThread.FSynchronize.FProcedure 076d3408 00000000 TThread.FFatalException 076d340c 00000000 TThread.FExternalThread
Then, into TAnatomyDiagramTileLoaderThread
:
076d3410 098194a8 TAnatomyDiagramTileLoaderThread.FTiles: TThreadList
So… we can tell that the thread has exited (FFinished == 1
), and verify that by looking at running threads, looking for thread id 1850
:
0:000> ~ . 0 Id: 1c98.1408 Suspend: 1 Teb: 7efdd000 Unfrozen 1 Id: 1c98.16c8 Suspend: 1 Teb: 7efda000 Unfrozen 2 Id: 1c98.11a0 Suspend: 1 Teb: 7ee1c000 Unfrozen 3 Id: 1c98.1bbc Suspend: 1 Teb: 7ee19000 Unfrozen 4 Id: 1c98.10dc Suspend: 1 Teb: 7ee04000 Unfrozen 5 Id: 1c98.12f0 Suspend: 1 Teb: 7edfb000 Unfrozen 6 Id: 1c98.1d38 Suspend: 1 Teb: 7efaf000 Unfrozen 7 Id: 1c98.1770 Suspend: 1 Teb: 7edec000 Unfrozen 8 Id: 1c98.1044 Suspend: 1 Teb: 7ede3000 Unfrozen 9 Id: 1c98.bf4 Suspend: 1 Teb: 7ede0000 Unfrozen 10 Id: 1c98.b3c Suspend: 1 Teb: 7efd7000 Unfrozen
Furthermore, the handle is invalid:
0:000> !handle 91c Could not duplicate handle 91c, error 6
That suggests that the object has already been destroyed. But that the TThreadList
hasn’t.
And sure enough, when I looked at the destructor for TAnatomyDiagramTileLoadThread
, we clear the TThreadList
, but we never free it!
Now, another way we could have caught this was to turn on leak detection. But leak detection is not always perfect, especially when you get some libraries that *cough* have a lot of false positives. And of course while we could have switched on heap leak detection, that involves rebuilding and restarting the process and losing the context along the way, with no guarantee we’ll be able to reproduce it again!
While this approach does feel a little tedious, and we did have some luck in this instance with freed objects not being overwritten, and the values we were searching for being relatively unique, it does nevertheless feel pretty deterministic, which must be better than the old “try-it-and-hope” debugging technique.
Thanks to your article I now know about tds2dbg – I’d been trying to use map2dbg before, but without much success.
When I debug using the dbg files generated by tds2dbg, windbg doesn’t show local variables. I understand (this might be wrong) that the dbg files do contain the correct info but it’s not used by windbg. How do get around this limitation? Do you have a tds2dbg that generates pdb files that do permit local variables to be displayed? Or do you not find it a limitation that local variables aren’t visible?
Keep up the good work.
Currently local variables are not exported to the .dbg file. In theory .dbg supports local variable debug info but it’s not trivial to convert from the tds variable data to the dbg, which is why it hasn’t been done yet. .pdb is not formally documented by Microsoft. Frankly, I find it frustrating not having named local variables in windbg but the effort required to make that possible has meant it has not been a priority.
Many thanks for such a prompt and useful reply.
Most welcome 🙂
Hello,
You mention an internal tds2dbg tool – I’ve been trying to get other publicly available tools for TDS parsing to work to no avail. Would you be able to share it?
You can find tds2dbg now at https://github.com/keymanapp/keyman/tree/master/windows/src/ext/tds2dbg
Hello,
how is this version any different from the original project version?
http://tds2dbg.sourceforge.net/
Well, the source is there in both cases, so you could answer that question yourself 😉
But it no longer has any significant differences — we maintained our own version internally years ago, but that is no longer necessary; the code is copied into the Keyman repo now to simplify dependency management.
Mind you, I’m not a big fan of pointing people to SourceForge any more, given the rabbit warren of misleading advertising plastering the site. 🙂
Very impressive blog posts about using Windbg with Delphi apps! Don’t understand how I missed it for some many years.
Do you have any experience with getting Taskmgr/VS/ProcessExplorer-style dump files from Delphi processes (from a customer) and use that as a starting point for Windbg debugging?
We have some customers that have hard-to-reproduce issues, and being able to attach a debugger (even a low-level one like Windbg) could be very helpful.
Hi Hallvard,
Thanks for reading! Yes, most of the techniques I describe with Windbg are equally applicable to a static crash dump file. A normal minidump is usually sufficient; I haven’t found that I need other parameters for stack-based analysis. Hardest part is often matching up symbol .dbg files correctly.
Also interesting now is Windbg Preview’s Time Travel Debugging. I’ve used that on a particularly gnarly Delphi app bug successfully. Takes a little bit of practice to get used to it but it’s kinda cool being able replay the entire lifecycle of an app!