It was a quiet Sunday evening...
At my university's rocketry club we use a particular piece of simulation software. Let's call it ContosoSim. You enter the various parameters of the rocket into it (mass, shape, engine power, etc.), and the software simulates the rocket's flight. Pretty neat.
At one point we decided to install the simulator on the workstation we have at the lab, so that everybody would be able to use it [2]. Except when we did... the program failed to launch. As in, you double-click the shortcut and nothing happens.
At first, we thought it was a bug somewhere in the software (which it was, sorta), so we contacted customer service. After some back-and-forth the service rep asked us to send the application's crash log from the Reliability Monitor.
The log clearly showed that the application crashed with an access violation. At this point my curiosity was piqued. Either our lab workstation was configured not to display the usual "send error report" dialog box, or the application suppressed it itself. In any case, this had the makings of an interesting bug.
It was time to fire up WinDbg.
An initial state of failure
Running the simulator under a debugger yields the following call stack:
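[Listing not preserved: call stack leading through SetupPixelFormatPaletteAndContext and OpenGLInitState in vtkRenderingOpenGL2-7.1.dll]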
So the offending code is not in the simulator itself, but rather in some external library (vtkRenderingOpenGL2-7.1.dll). Throwing it into Ghidra reveals the following:
[Listing not preserved: Ghidra decompilation of the crashing function]
The crash occurs at the highlighted line, upon calling __glewBlendFuncSeparate, which Ghidra shows to be exported from vtkglew-7.1.dll. The unusual thing here is that __glewBlendFuncSeparate is not exported as a function, but rather as a function pointer. Since this pointer is NULL for some reason, the whole thing crashes.
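To make the failure mode concrete, here is a minimal sketch of an entry point exposed as a function pointer rather than as a function. Every name in it (BlendFuncSeparateFn, g_blendFuncSeparate, blendFuncSeparate, call_blend_func_separate) is made up for illustration and is not GLEW's or VTK's actual declaration. The only point is that a call through a pointer nobody filled in jumps to address zero, unless somebody checks first.

```cpp
#include <cassert>
#include <cstdio>

// Hypothetical signature standing in for the real OpenGL entry point.
using BlendFuncSeparateFn = void (*)(unsigned, unsigned, unsigned, unsigned);

// The "export" is a variable holding a function pointer, not a function.
// It stays null until some loader code fills it in.
BlendFuncSeparateFn g_blendFuncSeparate = nullptr;

// A macro makes call sites look like ordinary function calls.
#define blendFuncSeparate g_blendFuncSeparate

// Stand-in implementation that a loader might install; it only counts calls.
static int g_calls = 0;
void fake_blend_func_separate(unsigned, unsigned, unsigned, unsigned) { ++g_calls; }

// Guarded call: reports the problem and returns false instead of
// calling through a null pointer.
bool call_blend_func_separate() {
    if (blendFuncSeparate == nullptr) {
        std::fprintf(stderr, "blendFuncSeparate was never loaded\n");
        return false;
    }
    blendFuncSeparate(1, 2, 3, 4);
    return true;
}
```

Without the check inside call_blend_func_separate, the first call would fault in the same way the simulator does.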
Digging further, we find that the troublesome pointer is initialized in _glewInit_GL_VERSION_1_4 (vtkglew-7.1.dll), via a call to wglGetProcAddress (opengl32.dll). Unfortunately, the decompilation output for _glewInit_GL_VERSION_1_4 is not exactly readable, so it's time to look for another way.
The power of open source
A quick search reveals that all those VTK libraries are actually part of The Visualization Toolkit, which is open source under the BSD license! Great, that simplifies things. After cloning the repo and checking out the v7.1.0 tag (which should correspond to the version ContosoSim is using) we can search for the crashing code.
From the call stack above we know that the function called just before OpenGLInitState is SetupPixelFormatPaletteAndContext. In the source it can be found in Rendering/OpenGL2/vtkWin32OpenGLRenderWindow.cxx. Here's the relevant part:
[Listing not preserved: excerpt from SetupPixelFormatPaletteAndContext]
And this is OpenGLInit, inside Rendering/OpenGL2/vtkOpenGLRenderWindow.cxx:
[Listing not preserved: source of OpenGLInit]
Looking in Ghidra confirms that there is a tail-call optimization in OpenGLInit, which is why the call stack doesn't show this function, only OpenGLInitState. Here's the relevant part from its code:
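[Listing not preserved: excerpt from OpenGLInitState calling glDepthFunc, glEnable and glBlendFuncSeparate]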
From our reversing we already know that glDepthFunc and glEnable are plain function exports (from opengl32.dll). On the other hand, glBlendFuncSeparate is a macro that expands to the __glewBlendFuncSeparate pointer we have seen previously (with __declspec(dllimport)). It's clear now that this code assumes __glewBlendFuncSeparate to be properly initialized, since there's no NULL check here.
Right, so we have thus confirmed our reversing findings. Time to finish this.
What happens inside OpenGLInitContext? The (almost) first thing this function does is call glewInit inside vtkglew-7.1.dll, which calls glewContextInit:
[Listing not preserved: source of glewContextInit]
What this does is query the OpenGL version, parse the returned string to determine the major and minor version, then call _glewInit_GL_VERSION_1_4 if the version is at least 1.4. And this last function is what initializes __glewBlendFuncSeparate, as we have seen in Ghidra (and confirmed by the source).
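The version gate can be sketched roughly as follows. This is not GLEW's actual parsing code; GlVersion, parse_gl_version, at_least and should_load_gl_1_4 are hypothetical names, and the parser simply assumes the string starts with "major.minor", as both version strings seen in this investigation do.

```cpp
#include <cassert>
#include <cctype>
#include <string>

struct GlVersion {
    int major;
    int minor;
};

// Parses the leading "<major>.<minor>" of a version string and ignores
// whatever follows. Returns false if the string does not start that way.
bool parse_gl_version(const std::string& s, GlVersion& out) {
    std::size_t i = 0;
    // Reads a run of decimal digits starting at position i.
    auto read_number = [&](int& value) {
        if (i >= s.size() || !std::isdigit(static_cast<unsigned char>(s[i]))) {
            return false;
        }
        value = 0;
        while (i < s.size() && std::isdigit(static_cast<unsigned char>(s[i]))) {
            value = value * 10 + (s[i] - '0');
            ++i;
        }
        return true;
    };
    if (!read_number(out.major)) return false;
    if (i >= s.size() || s[i] != '.') return false;
    ++i;  // skip the dot
    return read_number(out.minor);
}

// True if v is at least major.minor.
bool at_least(const GlVersion& v, int major, int minor) {
    return v.major > major || (v.major == major && v.minor >= minor);
}

// Hypothetical gate mirroring the "1.4 or newer" decision described above.
bool should_load_gl_1_4(const std::string& version_string) {
    GlVersion v{};
    return parse_gl_version(version_string, v) && at_least(v, 1, 4);
}
```

With this gate, a version string of 1.1.0 means the 1.4 entry points are never looked up at all, so their pointers keep whatever value they started with.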
Case closed?
No matter how we look at it, this is clearly a bug in VTK. OpenGLInitState assumes that glBlendFuncSeparate is available, but that depends on the OpenGL version. Indeed, even if the version is at least 1.4, wglGetProcAddress can still technically return NULL, but in that case that would be a bug in the OpenGL implementation.
Setting a breakpoint on glGetString, we can see that on our lab workstation (where ContosoSim crashes) it returns a version number of 1.1.0, which explains why glBlendFuncSeparate remains NULL.
Case closed. Send an email to the vendor telling them to upgrade VTK [3] and wait for a fix. After the fix we may just get a message telling us our OpenGL version is too old, but at least that's progress.
Except... The simulator does work on other machines. So what gives?
When a DLL is more than the sum of its exports
Maybe on the machines where the simulator doesn't crash it simply goes through a different code path, bypassing the NULL dereference? A plausible hypothesis, but upon closer inspection it can be quickly tossed out: on my machine __glewBlendFuncSeparate is not NULL, and is indeed called from the same flow.
Okay, so maybe we just have different versions of OpenGL? Nope, again. Both systems have the same opengl32.dll. Identical.
Alright, this is not funny anymore.
Taking another look at __glewBlendFuncSeparate on my machine, we see that it's not NULL, and it's also not pointing inside opengl32.dll. In fact, it points into ig8icd32.dll, which is the "OpenGL(R) Driver for Intel(R) Graphics Accelerator".
But surely, glGetString should still return 1.1.0, right? Right?! It's the same DLL!
I don't know why it surprised me, but sure enough, the version string returned was 4.4.0 - Build 20.19.15.4624.
Somehow, this Intel DLL manages to override a legitimate Windows one. My immediate thought was that Intel hooked it somehow [4], but the truth is more prosaic.
Setting a breakpoint on the load of this DLL (sxe ld ig8icd32.dll), we can see that there is a function in opengl32.dll, LoadAvailableDrivers, which is responsible for loading it.
And so, opengl32.dll loads a GPU vendor's OpenGL implementation and delegates to it. If the vendor's implementation supports OpenGL 1.4 or later, VTK will work as expected. Otherwise, it'll crash.
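As a toy model of that delegation (and nothing like the real internals of opengl32.dll), the behaviour observed on the two machines can be sketched like this. VendorDriver, FrontEnd and vendor_blend_func_separate are hypothetical names; the two version strings are the ones seen in the debugger.

```cpp
#include <cassert>
#include <map>
#include <string>

// Dummy entry-point type; real OpenGL functions have varied signatures.
using Proc = void (*)();

// Stand-in for a function implemented inside a vendor driver.
inline void vendor_blend_func_separate() {}

// Hypothetical model of a vendor driver: a version string plus the
// entry points it can hand out by name.
struct VendorDriver {
    std::string version;
    std::map<std::string, Proc> procs;
};

// Hypothetical model of the system front end. With no driver loaded it
// reports the baseline version and resolves no additional entry points.
struct FrontEnd {
    const VendorDriver* driver;  // nullptr models a machine without a GPU driver

    std::string version() const {
        return driver ? driver->version : "1.1.0";
    }

    // Returns nullptr when the name cannot be resolved.
    Proc get_proc_address(const std::string& name) const {
        if (driver == nullptr) return nullptr;
        auto it = driver->procs.find(name);
        return it == driver->procs.end() ? nullptr : it->second;
    }
};
```

The same front end gives two very different answers depending on whether a driver is attached, which matches what the two machines showed: identical opengl32.dll, different version string, different pointer.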
Lovely.
But why doesn't this delegation happen on our lab workstation? Well... Because it doesn't have a GPU. At all.
Who to blame and what to do?
And so, a combination of an old PC with a buggy version of VTK means we can't use ContosoSim. Sure, we can wait for the simulator vendor to update VTK on their end, but we need the software now! And, as stated previously, maybe we do actually need a GPU with modern OpenGL support. Unfortunately, upgrading the workstation is not exactly an option.
Perhaps there is a way...
During my wanderings through the interwebs, I noticed that Qt has the ability to emulate OpenGL in software. Although the simulator does use Qt, setting the environment variables mentioned here does not help with the crash. But it does suggest that if we could find a software implementation of OpenGL, we might be able to fool it...
As luck would have it, there is such a thing: Mesa3D. And, what's even better, there is a Windows build. Just dropping it in ContosoSim's installation directory makes it launch. Granted, it's bound to be slower than GPU-assisted OpenGL, and perhaps it's even going to crash because of implementation issues.
But for now — it works.
[2] Until that point it was installed only on the personal computers of some of the aerodynamics team members.
[3] AFAICT the issue has been fixed somewhere around commit 6498240fd590654cc9f7dd9aedc17c0dbc867c2b, but I kinda lost myself in the commit history so this might not be an accurate estimate. In any event, as of this writing the latest version of VTK is 9.0.1, so it's a safe bet they've fixed it.
[4] Hey, why should AV vendors have all the fun? 😈