Lynx Roundup, September 10th

Bayesian view on Neural Networks! How Chrome can still freeze even on a nigh-supercomputer! A tour inside a chip that does Floating Point calcs!

Resident Scientist Snkia works tirelessly towards robot utopia. These are his findings.

I wasn’t looking for trouble. I wasn’t trying to compile a huge project in the background (24-core CPU and I can’t move my mouse); I was just engaging in that most mundane of 21st-century tasks, writing an email at 10:30 am. And suddenly gmail hung. I kept typing, but for several seconds no characters appeared on screen. Then gmail suddenly caught up and I resumed my very important email. Then it happened again, only this time gmail went unresponsive for even longer. Well, that’s funny…

I have trouble resisting a good performance mystery but in this case the draw was particularly strong. I work at Google, making Chrome, for Windows, focused on performance. Investigating this hang was actually my job. And after a lot of false starts and some hard work I figured out how Chrome, gmail, Windows, and our IT department were working together to prevent me from typing an email, and in the process I found a way to save a significant amount of memory for some web pages in Chrome.

This investigation had so many rabbit holes that I’m going to save some of the digressions for a follow-on post (which is here), but this one will entirely explain the hangs.

As usual I had UIforETW running in the background, tracing to circular buffers, so I just had to type Ctrl+Win+R and the buffers, representing the last thirty seconds or so of system activity, were saved to disk. I loaded it into Windows Performance Analyzer (WPA), but I couldn’t definitively find the hang.

When a Windows program stops pumping messages there will be ETW events emitted to indicate exactly where this happened, so those types of hangs are trivial to find. But apparently Chrome kept on pumping messages. I looked for a time when one of Chrome’s crucial threads either went into a busy loop or went completely idle, but I couldn’t find anything conclusive. There were some spots where Chrome went mostly idle, but even then all of the key threads kept chugging along, so I couldn’t be sure where the hang was – Chrome might have been going idle because nothing was happening.

UIforETW has a built-in key logger which is often useful when identifying key points in a trace. However, for obvious security reasons it defaults to anonymizing the keystrokes, treating every number typed as a ‘1’ and every letter as an ‘A’. This was making it hard to find the exact moment of the hang, so I turned the input tracing from “Private” to “Full” and waited for the hang. At around 10:30 the next morning the hang happened again. I saved the trace buffers, and put these notes into UIforETW’s trace information field:

Was typing “defer to those with more scuba experience” and gmail hung around the end of “those” and then resumed partway through “experience”. gmail tab was PID 27368.

Clearly just a normal work-related discussion of commute methods, but the important thing is that now I had a way of locating the hang in the ETW trace. I loaded the trace, looked at the keyboard data in the Generic Events field (these events are emitted by UIforETW itself), and suddenly I could see exactly where the hang was – and the dip in CPU usage was 100% correlated with it.

Okay, but why is Chrome stopping? Here’s some foreshadowing: not shown in these screenshots is that WmiPrvSE.exe was consuming an entire CPU hyperthread whenever this happened. But that shouldn’t matter. I have a 24-core/48-thread machine so consuming one hyperthread means my machine is still about 98% idle.

I then zoomed in to the clearly-significant period where Chrome was idle and looked in particular at CrRendererMain in chrome.exe (27368), the gmail tab process.

Aside: I just want to say thank you to 2015 me for asking Microsoft to improve their thread naming mechanisms, and thank you to Microsoft for doing everything I suggested – thread names in WPA are great!

The problem suddenly became clearer. During the 2.81 s hang this thread was scheduled and ran 440 separate times. Normally running every 6 ms is enough to make a program responsive, but for some reason it wasn’t making any progress. I noticed that every time it woke up it was on the same stack. Simplified down it was:

chrome_child.dll (stack base)
KernelBase.dll!VirtualAlloc
ntoskrnl.exe!MiCommitVadCfgBits
ntoskrnl.exe!MiPopulateCfgBitMap
ntoskrnl.exe!ExAcquirePushLockExclusiveEx
ntoskrnl.exe!KeWaitForSingleObject (stack leaf)

Chrome is calling VirtualAlloc, which tries to update some “CfgBits” and needs to acquire a lock. At first I assumed that Chrome was calling VirtualAlloc 440 times, which seemed like a weird thing to do, but the reality was perhaps weirder. Chrome called VirtualAlloc once, and as part of that call a lock needed to be acquired. Chrome was signaled when the lock became available but – 439 times in a row – by the time Chrome woke up and tried to acquire the lock it was unavailable again. The lock had been reacquired by the process that had just released it.

This is because many Windows locks are, by design, not fair: if a thread releases a lock and then immediately tries to reacquire it, it can, in cases like this, win every single time. Starvation. More on this next time.
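To make the unfairness concrete, here is a minimal C++ sketch of my own (not code from this investigation) using SRWLOCK, which – like the kernel push locks above – makes no fairness guarantee. A greedy thread that releases and immediately reacquires the lock can keep winning, while another thread that only wants the lock once may sit through many release/reacquire cycles; exactly how long it waits depends on scheduling.

// lock_starvation.cc - sketch of unfair-lock starvation using SRWLOCK.
// Assumption: SRWLOCK's lack of fairness stands in for the kernel push lock
// that Chrome was waiting on.
#include <windows.h>
#include <atomic>
#include <cstdio>

SRWLOCK g_lock = SRWLOCK_INIT;
std::atomic<bool> g_done{false};

DWORD WINAPI GreedyThread(void*) {
  // Hold the lock almost continuously, like the long-running memory scan.
  while (!g_done) {
    AcquireSRWLockExclusive(&g_lock);
    Sleep(1);  // "work" done while holding the lock
    ReleaseSRWLockExclusive(&g_lock);
    // No pause here, so the very next acquire usually wins again.
  }
  return 0;
}

int main() {
  HANDLE thread = CreateThread(nullptr, 0, GreedyThread, nullptr, 0, nullptr);
  Sleep(10);  // let the greedy thread get going

  // The victim - analogous to Chrome's single VirtualAlloc call.
  ULONGLONG start = GetTickCount64();
  AcquireSRWLockExclusive(&g_lock);
  ULONGLONG waitedMs = GetTickCount64() - start;
  ReleaseSRWLockExclusive(&g_lock);
  printf("Waited %llu ms for the lock\n", waitedMs);

  g_done = true;
  WaitForSingleObject(thread, INFINITE);
  CloseHandle(thread);
  return 0;
}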

The Readying Process – the one that signaled Chrome to go grab the lock – in every case was WmiPrvSE.exe, releasing the lock on this stack:

ntoskrnl.exe!KiSystemServiceCopyEnd (stack base)
ntoskrnl.exe!NtQueryVirtualMemory
ntoskrnl.exe!MmQueryVirtualMemory
ntoskrnl.exe!MiUnlockAndDereferenceVad
ntoskrnl.exe!ExfTryToWakePushLock (stack leaf)

I got distracted by WMI for a while (more on that next time) but eventually I wrote a program to recreate WMI’s behavior. I had CPU sampling data that showed me where WmiPrvSE.exe was spending its time (some edits and simplifications made):

WmiPerfClass.dll!EnumSelectCounterObjects (stack base)
WmiPerfClass.dll!ConvertCounterPath
pdh.dll!PdhiTranslateCounter
pdh.dll!GetSystemPerfData
KernelBase.dll!blah-blah-blah
advapi32.dll!blah-blah-blah
perfproc.dll!blah-blah-blah
perfproc.dll!GetProcessVaData
ntdll.dll!NtQueryVirtualMemory
ntoskrnl.exe!NtQueryVirtualMemory
ntoskrnl.exe!MmQueryVirtualMemory
ntoskrnl.exe!MiQueryAddressSpan
ntoskrnl.exe!MiQueryAddressState
ntoskrnl.exe!MiGetNextPageTable (stack leaf)

It turns out that reproducing the slow scanning from the sampling data was quite easy. The interesting part of the call stack above is NtQueryVirtualMemory, which is used to scan process memory and is called by the suggestively named GetProcessVaData, where Va presumably stands for Virtual Address. My VirtualScan program just called NtQueryVirtualMemory in the obvious loop to scan the address space of a specified process. The code worked, it took a really long time to scan the gmail process (10-15 seconds), and it triggered the hang. But why?

The advantage of writing the scanning loop myself was that I could start generating statistics. NtQueryVirtualMemory returns data on each contiguous range of address space that has matching attributes. That might mean all reserved, all committed with particular protection settings, etc. The gmail process had about 26,000 of these blocks but I found another process (WPA, as it turns out) that had 16,000 memory blocks but could be scanned very quickly.
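For flavor, here is a stripped-down sketch of that kind of scanning loop – my own code, using the documented VirtualQueryEx wrapper rather than calling NtQueryVirtualMemory directly, so the details differ from the real VirtualScan:

// virtual_scan_sketch.cc - walk a process's address space and count blocks.
// Build as 64-bit so the whole address space of a 64-bit target is visible.
#include <windows.h>
#include <cstdio>
#include <cstdlib>

int main(int argc, char** argv) {
  DWORD pid = (argc > 1) ? strtoul(argv[1], nullptr, 10) : GetCurrentProcessId();
  HANDLE process = OpenProcess(PROCESS_QUERY_INFORMATION, FALSE, pid);
  if (!process) {
    printf("OpenProcess(%lu) failed: %lu\n", pid, GetLastError());
    return 1;
  }

  ULONGLONG start = GetTickCount64();
  size_t blocks = 0, committedBlocks = 0;
  unsigned long long committedBytes = 0;

  // Each successful query returns one contiguous range with matching attributes.
  MEMORY_BASIC_INFORMATION mbi;
  for (char* address = nullptr;
       VirtualQueryEx(process, address, &mbi, sizeof(mbi)) == sizeof(mbi);
       address = static_cast<char*>(mbi.BaseAddress) + mbi.RegionSize) {
    if (mbi.State == MEM_FREE)
      continue;
    ++blocks;
    if (mbi.State == MEM_COMMIT) {
      ++committedBlocks;
      committedBytes += mbi.RegionSize;
    }
  }

  printf("%zu blocks, %zu committed (%.1f MiB), scanned in %.3f s\n",
         blocks, committedBlocks, committedBytes / (1024.0 * 1024.0),
         (GetTickCount64() - start) / 1000.0);
  CloseHandle(process);
  return 0;
}

Against an ordinary process a walk like this finishes almost instantly; the whole point of the investigation is that the same walk over the CFG-fragmented gmail process took tens of seconds on pre-RS4 Windows 10.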

At some point I looked at the gmail process with vmmap and noticed that the gmail process had a fair amount of memory (361,836 KiB) and a lot of separate blocks (49,719) coming from a single reservation in the Shareable category – a reservation with a size of 2,147,483,648 KiB, aka 2 TiB. Wot?

I happened to know that the 2 TiB reservation is used for Control Flow Guard (CFG), and I remembered that “CFG” showed up on the call stacks where the gmail Chrome process was waiting – MiCommitVadCfgBits was on the stack. Maybe the large number of blocks in the CFG region were the problem!

Control Flow Guard (CFG) is used to try to stop exploits. Its 2 TiB reservation is a sparse bitmap which indicates which addresses (across 128 TiB of user-mode address space) are valid indirect branch targets. So I modified my virtual memory scanner to count how many blocks were in the CFG region (just look for the 2 TiB reservation) and how many blocks were tagged as executable. Since CFG memory is used to describe executable memory I expected to see one block of CFG memory for each block of executable memory in the process. Instead I saw 98 blocks of executable memory and 24,866 blocks of committed CFG memory. A significant discrepancy:
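Here is roughly how that counting can be done on top of the same kind of loop – again my own sketch, not the actual VirtualScan code. In particular, spotting the CFG region by looking for “the one reservation whose total size is enormous” is just the simple heuristic suggested by the 2,147,483,648 KiB reservation above:

// cfg_count_sketch.cc - count executable blocks vs. committed CFG blocks.
#include <windows.h>
#include <cstdio>
#include <cstdlib>
#include <map>

int main(int argc, char** argv) {
  DWORD pid = (argc > 1) ? strtoul(argv[1], nullptr, 10) : GetCurrentProcessId();
  HANDLE process = OpenProcess(PROCESS_QUERY_INFORMATION, FALSE, pid);
  if (!process)
    return 1;

  const DWORD kExecuteMask = PAGE_EXECUTE | PAGE_EXECUTE_READ |
                             PAGE_EXECUTE_READWRITE | PAGE_EXECUTE_WRITECOPY;
  std::map<void*, unsigned long long> spanByAllocation;   // reserved + committed size
  std::map<void*, size_t> committedBlocksByAllocation;
  size_t executableBlocks = 0;

  MEMORY_BASIC_INFORMATION mbi;
  for (char* address = nullptr;
       VirtualQueryEx(process, address, &mbi, sizeof(mbi)) == sizeof(mbi);
       address = static_cast<char*>(mbi.BaseAddress) + mbi.RegionSize) {
    if (mbi.State == MEM_FREE)
      continue;
    spanByAllocation[mbi.AllocationBase] += mbi.RegionSize;
    if (mbi.State == MEM_COMMIT) {
      ++committedBlocksByAllocation[mbi.AllocationBase];
      if (mbi.Protect & kExecuteMask)
        ++executableBlocks;
    }
  }

  // Only the CFG bitmap reservation is anywhere near this big (~2 TiB).
  const unsigned long long kHuge = 1ull << 40;  // 1 TiB threshold
  for (const auto& entry : spanByAllocation) {
    if (entry.second >= kHuge)
      printf("Huge reservation at %p, %zu committed blocks (presumably CFG)\n",
             entry.first, committedBlocksByAllocation[entry.first]);
  }
  printf("%zu executable committed blocks\n", executableBlocks);
  CloseHandle(process);
  return 0;
}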

        Scan time   Committed    Page tables   Committed blocks
Total:  41.763 s    1457.7 MiB   67.7 MiB      32,112 (98 of them code blocks)
CFG:    41.759 s    353.3 MiB    59.2 MiB      24,866

vmmap shows reservations and committed memory as blocks while my scanning tool only counts committed blocks – that’s why vmmap shows 49,684 blocks where my scanning tool reports 24,866.

In hindsight it is obvious, but what if the CFG bitmap never gets cleaned up? What if memory gets allocated in the CFG bitmap when you allocate executable memory, but doesn’t get freed when you free the executable memory? That would explain this behavior.

End-to-end repro

The next step was to write a VAllocStress program that allocates and then frees thousands of blocks of executable memory at random addresses. This executable needs to have CFG enabled and needs to be 64-bit; I hit some roadblocks because I didn’t really know what I was trying to do, but it worked. After allocating and freeing that many blocks of executable memory the program should sit in a loop where it occasionally tries to allocate/free more executable memory, and notes when this is slow. Here is my basic algorithm for VAllocStress (a rough code sketch follows the list):

  1. Spin in a loop a lot of times:
    1. Allocate some executable memory with VirtualAlloc at a random address
    2. Free the memory
  2. Then, spin in an infinite loop:
    1. Sleep for 500 ms (don’t want to be a CPU hog)
    2. Allocate some executable memory with VirtualAlloc at a fixed location
    3. Print a message if the VirtualAlloc call takes more than ~500 ms
    4. Free the memory
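The real VAllocStress is on github (see the Source code section below); what follows is just a compressed sketch of the same idea, with my own guesses filled in for block size and iteration count. It needs to be built as a 64-bit binary with CFG enabled (/guard:cf) for the CFG bitmap to come into play.

// valloc_stress_sketch.cc - fragment the CFG bitmap, then time allocations.
#include <windows.h>
#include <cstdio>
#include <cstdlib>

int main() {
  const size_t kBlockSize = 64 * 1024;  // one allocation granularity
  const int kIterations = 20000;        // assumption: "a lot of times"

  // Step 1: allocate and free executable memory at random 47-bit addresses,
  // leaving behind committed islands in the CFG bitmap.
  for (int i = 0; i < kIterations; ++i) {
    unsigned long long address =
        (((unsigned long long)rand() << 32) | ((unsigned long long)rand() << 17)) &
        0x00007FFFFFFF0000ull;
    void* p = VirtualAlloc(reinterpret_cast<void*>(address), kBlockSize,
                           MEM_RESERVE | MEM_COMMIT, PAGE_EXECUTE_READWRITE);
    if (p)
      VirtualFree(p, 0, MEM_RELEASE);
  }

  // Step 2: spin forever, timing an occasional executable allocation. While a
  // scanner (WMI, VirtualScan, ...) holds the relevant lock this call hangs.
  // (The original allocates at a fixed address; letting the OS choose is close
  // enough for a sketch.)
  for (;;) {
    Sleep(500);  // don't want to be a CPU hog
    ULONGLONG start = GetTickCount64();
    void* p = VirtualAlloc(nullptr, kBlockSize, MEM_RESERVE | MEM_COMMIT,
                           PAGE_EXECUTE_READWRITE);
    ULONGLONG elapsedMs = GetTickCount64() - start;
    if (elapsedMs > 500)
      printf("VirtualAlloc took %llu ms!\n", elapsedMs);
    if (p)
      VirtualFree(p, 0, MEM_RELEASE);
  }
}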

That’s it. It’s super simple. And it was so satisfying when it worked. I ran it and then got my VirtualScan program to scan my VAllocStress process. I quickly got confirmation that the CFG block was horribly fragmented, and the scanning took a long time. And my VAllocStress program hung for the duration of the scan!

At this point I had simulated the CFG fragmentation, the long scanning times, and the hang. Huzzah!

Root-cause

It turns out that v8 (Chrome’s JavaScript engine) has CodeRange objects for managing code-gen, and each CodeRange object is constrained to a 128 MiB address range. This should be small enough to avoid the rampant CFG allocation I was seeing.

But what if you have multiple CodeRange objects, and what if those get allocated at random addresses and then freed? I instrumented the CodeRange constructor, left gmail running, and found the smoking gun. Every couple of minutes another CodeRange object was created (and destroyed). Being in the debugger made it easy to find that WorkerThread::Start was allocating these objects, and suddenly everything made sense:

  1. gmail uses service workers, probably for implementing offline mode
  2. These appear and disappear every few minutes because that’s how service workers behave
  3. Each worker thread gets a temporary CodeRange object which allocates some executable pages for JITted JavaScript code from a random location in the 47-bit process address space
  4. The 2 TiB CFG memory reservation gets some entries every time new code pages are allocated
  5. The CFG allocations are never freed
  6. NtQueryVirtualMemory is painfully slow when scanning CFG memory (about 1 ms per block) for reasons that I don’t understand

The slow scanning of CFG memory is actually fixed in Windows 10 RS4 (April 2018 Update), which briefly had me wondering if all of this investigation was pointless. But it’s not.

Memory

The CFG reservation starts out as just an address reservation – no memory is actually committed. As executable pages are allocated, however, parts of the CFG reservation are turned into committed memory, using real memory pages. And those pages never get freed. If you keep allocating and freeing randomly located blocks of executable memory then the CFG region will grow arbitrarily large! Well, that’s not quite true. The CFG memory block is best thought of as a cache with a bounded size. However that is little consolation when the bounded size is 2 TiB, per process!

The worst I’ve seen so far is when my gmail tab had been running for eight days and had accumulated 353.3 MiB of CFG memory, and 59.2 MiB of page-tables to map that memory, for a total waste of about 400 MiB. For some reason most people see either no symptoms or much milder symptoms than I do.

 

Resolution

The v8 (Chrome’s JavaScript engine) team has updated their code so that it reuses addresses for CodeRange objects, which will squish this bug quite nicely. Microsoft already fixed their CFG memory scanning performance problems. Maybe some day Microsoft will free CFG regions when executable memory is freed, at least in the easy cases where large address ranges are freed. And a vmmap bug was reported.

Most of my coworkers and our customers never noticed this problem. I only noticed it because:

  • I had gmail’s offline mode enabled
  • I was running on an old version of Windows 10
  • Our IT department was running regular WMI scans of our computers
  • I was paying attention
  • I got lucky

It seems highly improbable that, as one of the most qualified people to diagnose this bug, I was the first to notice it. If any of the factors above had not been true then I would not have experienced this hang, and the associated memory waste might never have been found.

By the way, the reason that the hangs kept happening at 10:30 am is just because that’s when our IT team ran their inventory scans. If you want to trigger a scan manually, you can go to Control Panel | Configuration Manager | Actions | Select ‘Hardware Inventory Cycle’ and then Run Now.

Source code

If you want to play along at home you can grab the source code for VAllocStress and VirtualScan from github, here.

Rabbit holes

This investigation included a lot of rabbit holes. Some were important (page tables), some were instructive (locks), and some were purely a waste of time (WMI details). I’ll talk more about these, and a vmmap bug, in a follow-up post. If you want more gory details – some of which ended up being wrong or irrelevant – you can follow the whole investigation at crbug.com/870054.

Hacker news discussion is here.

Reddit discussion is here.

Russian translation is here.

Japanese translation is here.

Zombies probably won’t consume 32 GB of your memory like they did to me, but zombie processes do exist, and I can help you find them and make sure that developers fix them. Tool source link is at the bottom.

Is it just me, or do Windows machines that have been up for a while seem to lose memory? After a few weeks of use (or a long weekend of building Chrome over 300 times) I kept noticing that Task Manager showed me running low on memory, but it didn’t show the memory being used by anything. In one example Task Manager showed 49.8 GB of RAM in use, plus 4.4 GB of compressed memory, and yet only 5.8 GB of paged/non-paged pool, few processes running, and no process using anywhere near enough to explain where the memory had gone.

My machine has 96 GB of RAM – lucky me – and when I don’t have any programs running I think it’s reasonable to hope that I’d have at least half of it available.

Sometimes I have dealt with this by rebooting but that should never be necessary. The Windows kernel is robust and well implemented so this memory disappearing shouldn’t happen, and yet…

The first clue came when I remembered that a coworker of mine had complained of zombie processes being left behind – processes that had shut down but not been cleaned up by the kernel. He’d even written a tool that would dump a list of zombie processes – their names and counts. His original complaint was of hundreds of zombies. I ran his tool and it showed 506,000 zombie processes!

It occurred to me that one cause of zombie processes could be one process failing to close the handles to other processes. And the great thing about having a huge number of zombies is that they are harder to hide. So, I went to Task Manager’s Details tab, added the Handles column, and sorted by it. Voila. I immediately saw that CcmExec.exe (part of Microsoft’s System Management Server) had 508,000 handles open, which is both a lot and amazingly close to my zombie count.

I held my breath and killed CcmExec.exe, unsure of what would happen.

The results were as dramatic as I could imagine. As I said earlier, the Windows kernel is well written and when a process is killed then all of its resources are freed. So, those 508,000 handles that were owned by CcmExec.exe were abruptly closed and my available memory went up by 32 GB! Mystery solved!

What is a zombie process?

Until this point we weren’t entirely sure what was causing these processes to hang around. In hindsight it’s obvious that these zombies were caused by a trivial user-space coding bug. The rule is that when you create a process you need to call CloseHandle on its process handle and its thread handle. If you don’t care about the process then you should close the handles immediately. If you do care – if you want to wait for the process to quit – WaitForSingleObject(hProcess, INFINITE); – or query its exit code – GetExitCodeProcess(hProcess, &exitCode); – then you need to remember to close the handles after that. Similarly, if you open an existing process with OpenProcess you need to close that handle when you are done.
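As a concrete illustration of the rule – my own example, not code from the original post – here is the pattern done correctly for a child process:

// close_handles_example.cc - create a process, use its handles, close them.
#include <windows.h>
#include <cstdio>

int main() {
  STARTUPINFOA si = { sizeof(si) };
  PROCESS_INFORMATION pi = {};
  char commandLine[] = "cmd.exe /c exit 7";  // any short-lived child will do
  if (!CreateProcessA(nullptr, commandLine, nullptr, nullptr, FALSE, 0,
                      nullptr, nullptr, &si, &pi)) {
    printf("CreateProcess failed: %lu\n", GetLastError());
    return 1;
  }

  // If we care about the result we can wait and query the exit code...
  WaitForSingleObject(pi.hProcess, INFINITE);
  DWORD exitCode = 0;
  GetExitCodeProcess(pi.hProcess, &exitCode);
  printf("Child exited with code %lu\n", exitCode);

  // ...but either way both handles must be closed. Skipping these two lines
  // is exactly the bug that leaves dead processes lingering as zombies.
  CloseHandle(pi.hThread);
  CloseHandle(pi.hProcess);
  return 0;
}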

If the process that holds on to the handles is a system process then it will even continue holding those handles after you log out and log back in – another source of confusion during our investigation last year.

So, a zombie process is a process that has shut down but is kept around because some other still-running process holds a handle to it. It’s okay for a process to do this briefly, but it is bad form to leave a handle unclosed for long.

Where is that memory going?

Another thing I’d done during the investigation was to run RamMap. This tool attempts to account for every page of memory in use. Its Process Memory tab had shown hundreds of thousands of processes that were each using 32 KB of RAM and presumably those were the zombies. But ~500,000 times 32 KB only equals ~16 GB – where did the rest of the freed-up memory come from? Comparing the before and after Use Counts pages in RamMap explained it.

We can plainly see the ~16 GB drop in Process Private memory. We can also see a 16 GB drop in Page Table memory. Apparently a zombie process consumes ~32 KB of page tables, in addition to its ~32 KB of process private memory, for a total cost of ~64 KB – and ~500,000 zombies times ~64 KB matches the ~32 GB that was freed. I don’t know why zombie processes consume that much RAM, but it’s probably because there should never be enough of them for it to matter.

A few types of memory actually increased after killing CcmExec.exe, mostly Mapped File and Metafile. I don’t know what that means but my guess would be that that indicates more data being cached, which would be a good thing. I don’t necessarily want memory to be unused, but I do want it to be available.

Trivia: RamMap opens all processes, including zombies, so it needs to be closed before the zombies will go away.

I tweeted about my discovery and the investigation was picked up by another software developer, who reproed the bug using my ProcessCreateTests tool. They also passed the information to a developer at Microsoft, who said it was a known issue that “happens when many processes are opened and closed very quickly”.

Windows has a reputation for not handling process creation as well as Linux and this investigation, and one of my previous ones, suggest that that reputation is well earned. I hope that Microsoft fixes this bug – it’s sloppy.

Why do I hit so many crazy problems?

I work on the Windows version of Chrome, and one of my tasks is optimizing its build system, which requires doing a lot of test builds. Building Chrome involves creating between 28,000 and 37,000 processes, depending on build settings. When using our distributed build system (goma) these processes are created and destroyed very quickly – my fastest full build ever took about 200 seconds. This aggressive process creation has revealed a number of interesting bugs, mostly in Windows or its components.

What now?

If you aren’t on a corporate managed machine then you probably don’t run CcmExec.exe and you will avoid this particular bug. And if you don’t build Chrome or something equivalent then you will probably avoid this bug. But!

CcmExec is not the only program that leaks process handles. I have found many others leaking modest numbers of handles and there are certainly more.

The bitter reality, as all experienced programmers know, is that any mistake that is not explicitly prevented will be made. Simply writing “This handle must be closed” in the documentation is insufficient. So, here is my contribution towards making this something detectable, and therefore preventable. FindZombieHandles is a tool, based on NtApiDotNet and sample code from @tiraniddo, that prints a list of zombies and who is keeping them alive. Here is sample output from my home laptop:

274 total zombie processes.
249 zombies held by IntelCpHeciSvc.exe(9428)
    249 zombies of Video.UI.exe
14 zombies held by RuntimeBroker.exe(10784)
    11 zombies of MicrosoftEdgeCP.exe
    3 zombies of MicrosoftEdge.exe
8 zombies held by svchost.exe(8012)
    4 zombies of ServiceHub.IdentityHost.exe
    2 zombies of cmd.exe
    2 zombies of vs_installerservice.exe
3 zombies held by explorer.exe(7908)
    3 zombies of MicrosoftEdge.exe
1 zombie held by devenv.exe(24284)
    1 zombie of MSBuild.exe
1 zombie held by SynTPEnh.exe(10220)
    1 zombie of SynTPEnh.exe
1 zombie held by tphkload.exe(5068)
    1 zombie of tpnumlkd.exe
1 zombie held by svchost.exe(1872)
    1 zombie of userinit.exe

274 zombies isn’t too bad, but it represents some bugs that should be fixed. The IntelCpHeciSvc.exe one is the worst, as it seems to leak a process handle every time I play a video from Windows Explorer.

Visual Studio leaks handles to at least two processes and one of these is easy to reproduce. Just fire off a build and wait ~15 minutes for MSBuild.exe to go away. Or, if you “set MSBUILDDISABLENODEREUSE=1” then MSBuild.exe goes away immediately and every build leaks a process handle. Unfortunately some jerk at Microsoft fixed this bug the moment I reported it, and the fix may ship in VS 15.6, so you’ll have to act quickly to see this (and no, I don’t really think he’s a jerk).

You can also see leaked processes using Process Explorer, by configuring the lower pane to show handles – in the case I looked at, both the process and thread handles were leaked.

Just a few of the bugs found, not all reported

Process handles aren’t the only kind that can be leaked. For instance, the “Intel(R) Online Connect Access service” (IntelTechnologyAccessService.exe) only uses 4 MB of RAM, but after 30 days of uptime had created 27,504 (!!!) handles. I diagnosed this leak using just Task Manager and reported it here. I also used the awesome !htrace command in windbg to get stacks for the CreateEventW calls from Intel’s code. Think they’ll fix this?

Using Process Explorer I could see that NVDisplay.Container.exe from NVIDIA had ~5,000 handles to the \BaseNamedObjects\NvXDSyncStop-61F8EBFF-D414-46A7-90AE-98DD58E4BC99 event, creating a new one about every two minutes. I guess they want to be really sure that they can stop NvXDSync? Reported, and a fix has been checked in.

Apparently Corsair Link Service leaks ~15 token handles per second. Reported here.

Apparently Adobe’s Creative Cloud leaks tens of thousands of handles – ~6,500 a day? Reported here.

Apparently Razer Chroma SDK Service leaks a lot of handles – 150,000 per hour? Reported here.

Apparently ETDCtrl.exe (11.x), some app associated with ELANTech/Synaptics trackpads, leaks handles to shared memory. The process accumulated about 16,000 handles and when the process was killed about 3 GB of missing RAM was returned to the system – quite noticeable on an 8 GB laptop with no swap.

Apparently nobody has been paying attention to this for a while – hey Microsoft, maybe start watching for handle leaks so that Windows runs better? And Intel and NVIDIA? Take a look at your code. I’ll be watching you.

So, grab FindZombieHandles, run it on your machine, and report or fix what you find, and use Task Manager and Process Explorer as well.

Twitter announcement is here, Hacker News discussion is here, reddit discussion is here.

Updates: Microsoft recommended disabling the feature that leaks handles and doing so has resolved the issue for me (and they are fixing the leaks). It’s an expensive feature and it turns out we were ignoring the data anyway! Also, all Windows 10 PIDs are multiples of four, which explains why ~500,000 zombies led to PIDs in the 2,000,000+ range.

http://www.latimes.com/opinion/op-ed/la-oe-dubal-gig-companies-undercut-california-wage-law-20180814-story.html

http://www.righto.com/2018/08/inside-die-of-intels-8087-coprocessor.html

Recommender Systems: Exploring the Unknown Using Uncertainty

Neural Networks from a Bayesian Perspective

Center of the Universe

Super villain in somebody's action hero movie. Experienced a radioactive freak accident at a young age, which rendered him part-snake and strangely adept at Python.