Biko's House of Horrors

Virtual Trouble Kit

It was a quiet Sunday evening...

At my university's rocketry club we're using a particular piece of simulation software. Let's call it ContosoSim[1]. You enter the various parameters of the rocket into it (mass, shape, engine power, etc.), and the software simulates the rocket's flight. Pretty neat.

At one point we decided to install the simulator on the workstation we have at the lab, so that everybody would be able to use it[2]. Except when we did... the program failed to launch. As in, you double-click the shortcut and nothing happens.

At first, we thought it was a bug somewhere in the software (which it was, sorta), so we contacted customer service. After some back-and-forth the service rep asked us to send the application's crash log from the Reliability Monitor.

The log clearly showed that the application crashed with an access violation. At this point my curiosity was piqued. Either our lab workstation was configured not to display the usual "send error report" dialog box, or the application suppressed it itself. In any case, this had the makings of an interesting bug.

It was time to fire up WinDbg.

An initial state of failure

Running the simulator under a debugger yields the following call stack:

 # ChildEBP RetAddr
WARNING: Frame IP not in any known module. Following frames may be wrong.
00 018ff880 5d8df5da 0x0
01 018ff8b4 5d90e88c vtkRenderingOpenGL2_7_1!vtkOpenGLRenderWindow::OpenGLInitState+0x41
02 018ffa3c 5d90d07e vtkRenderingOpenGL2_7_1!vtkWin32OpenGLRenderWindow::SetupPixelFormatPaletteAndContext+0x1bc
03 018ffaa0 5d90f20f vtkRenderingOpenGL2_7_1!vtkWin32OpenGLRenderWindow::CreateAWindow+0x2ae
04 018ffaa8 002a14a6 vtkRenderingOpenGL2_7_1!vtkWin32OpenGLRenderWindow::Start+0x14
05 018ffad0 0029f273 ContosoSim+0xd14a6
06 018ffae4 0023565a ContosoSim+0xcf273
07 018ffb70 001d31d9 ContosoSim+0x6565a
08 018ffc08 0053ed64 ContosoSim+0x31d9
09 018ffc3c 0053e1e0 ContosoSim+0x36ed64
0a 018ffc88 765efa29 ContosoSim+0x36e1e0
0b 018ffc98 774c7a7e kernel32!BaseThreadInitThunk+0x19
0c 018ffcf4 774c7a4e ntdll!__RtlUserThreadStart+0x2f
0d 018ffd04 00000000 ntdll!_RtlUserThreadStart+0x1b

So the offending code is not in the simulator itself, but rather in some external library (vtkRenderingOpenGL2-7.1.dll). Throwing it into Ghidra reveals the following:

void __thiscall vtkOpenGLRenderWindow::OpenGLInitState(vtkOpenGLRenderWindow *this)
{
  undefined extraout_DL;
  undefined uVar1;
  undefined local_18 [12];
  undefined4 local_c;
  uint local_8;

  local_8 = DAT_100f49ac ^ (uint)&stack0xfffffffc;
  glDepthFunc(0x203);
  glEnable(0xb71);
  uVar1 = 2;
  (**(code **)__glewBlendFuncSeparate_exref)(0x302,0x303,1,0x303);  // <-- CRASH!
  glEnable(0xbe2);
  if (*(int *)(this + 0xa8) == 0) {
    glDisable(0xb20);
  }
  else {
    glEnable(0xb20);
  }
  if (*(int *)(this + 0xac) == 0) {
    glDisable(0xb41);
  }
  else {
    glEnable(0xb41);
  }
  glPixelStorei(0xcf5,1);
  glPixelStorei(0xd05,1);
  (**(code **)(*(int *)this + 0x2f0))(local_18);
  (**(code **)(*(int *)this + 0x194))(local_c);
  InitializeTextureInternalFormats(this);
  FUN_100819cb(local_8 ^ (uint)&stack0xfffffffc,extraout_DL,uVar1);
  return;
}

The crash occurs at the marked line, when calling through __glewBlendFuncSeparate, which Ghidra shows to be exported from vtkglew-7.1.dll. The unusual thing here is that __glewBlendFuncSeparate is not exported as a function, but rather as a function pointer. This pointer is NULL for some reason, so the call jumps to address 0 and the whole thing crashes; that's the 0x0 frame at the top of the call stack.

Digging further, we find that the troublesome pointer is initialized in _glewInit_GL_VERSION_1_4 (vtkglew-7.1.dll), via a call to wglGetProcAddress (opengl32.dll). Unfortunately, the decompilation output for _glewInit_GL_VERSION_1_4 is not exactly readable, so it's time to look for another way.
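Since the decompiled loader is hard to read, here is a minimal sketch of the general shape of such an initializer: each entry point is looked up by name, stored in a global function pointer, and the function reports failure if any lookup comes back NULL. This is an illustrative model, not GLEW's actual code. fake_get_proc_address is a hypothetical stand-in for wglGetProcAddress (which needs a live OpenGL context), and the "driver" either provides the 1.4 entry point or it doesn't.

```c
#include <stddef.h>
#include <string.h>

/* Pointer type for a glBlendFuncSeparate-like entry point (GLenum args). */
typedef void (*blend_func_separate_fn)(unsigned int, unsigned int,
                                       unsigned int, unsigned int);
/* Generic pointer type that a get-proc-address style lookup hands back. */
typedef void (*generic_fn)(void);

/* Global slot, playing the role of the exported __glewBlendFuncSeparate. */
blend_func_separate_fn blend_func_separate_ptr = NULL;

/* Hypothetical driver implementation of the entry point. */
static void driver_blend_func_separate(unsigned int src_rgb, unsigned int dst_rgb,
                                       unsigned int src_a, unsigned int dst_a)
{
    (void)src_rgb; (void)dst_rgb; (void)src_a; (void)dst_a;
}

/* Hypothetical stand-in for wglGetProcAddress: a "driver" without
   OpenGL 1.4 support has no such symbol and returns NULL. */
static generic_fn fake_get_proc_address(const char *name, int driver_has_1_4)
{
    if (driver_has_1_4 && strcmp(name, "glBlendFuncSeparate") == 0)
        return (generic_fn)driver_blend_func_separate;
    return NULL;
}

/* Resolve every 1.4 entry point (just one here). Returns nonzero on
   failure, which is why glewContextInit negates the result. */
static int init_gl_version_1_4(int driver_has_1_4)
{
    int failed = 0;
    blend_func_separate_ptr = (blend_func_separate_fn)
        fake_get_proc_address("glBlendFuncSeparate", driver_has_1_4);
    failed = (blend_func_separate_ptr == NULL) || failed;
    return failed;
}
```

Note that nothing stops a caller from using blend_func_separate_ptr after a failed initialization; the pointer simply stays NULL.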

The power of open source

A quick search reveals that all those VTK libraries are actually part of The Visualization Toolkit, which is open source under the BSD license! Great, that simplifies things. After cloning the repo and checking out the v7.1.0 tag (which should correspond to the version ContosoSim is using) we can search for the crashing code.

From the call stack above we know that the function called just before OpenGLInitState is SetupPixelFormatPaletteAndContext. In the source it can be found in Rendering/OpenGL2/vtkWin32OpenGLRenderWindow.cxx. Here's the relevant part:

void vtkWin32OpenGLRenderWindow::SetupPixelFormatPaletteAndContext(
  HDC hDC, DWORD dwFlags,
  int debug, int bpp,
  int zbpp)
{
  // ... Snip ...

  // make sure glew is initialized with fake window
  this->OpenGLInit();

  // ... Snip ...
}

And this is OpenGLInit, inside Rendering/OpenGL2/vtkOpenGLRenderWindow.cxx:

void vtkOpenGLRenderWindow::OpenGLInit()
{
  OpenGLInitContext();
  OpenGLInitState();
}

Looking in Ghidra confirms that there is a tail-call optimization in OpenGLInit, which is why the call stack doesn't show this function, only OpenGLInitState. Here's the relevant part of OpenGLInitState's source:

void vtkOpenGLRenderWindow::OpenGLInitState()
{
  glDepthFunc( GL_LEQUAL );
  glEnable( GL_DEPTH_TEST );

  // initialize blending for transparency
  glBlendFuncSeparate(GL_SRC_ALPHA, GL_ONE_MINUS_SRC_ALPHA,
                      GL_ONE,GL_ONE_MINUS_SRC_ALPHA);

  // ... Snip ...
}

From our reversing we already know that glDepthFunc and glEnable are plain function exports (from opengl32.dll). On the other hand, glBlendFuncSeparate is a macro that expands to the __glewBlendFuncSeparate pointer we have seen previously (with __declspec(dllimport)). It's clear now that this code assumes __glewBlendFuncSeparate to be properly initialized, since there's no NULL check here.
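To see why this call site compiles to an indirect call, here is a minimal sketch of the macro trick. The names blend_func_separate_ptr and blendFuncSeparate are hypothetical stand-ins for __glewBlendFuncSeparate and glBlendFuncSeparate; the sketch assumes, as in GLEW's non-MX builds, that the macro simply expands to the pointer variable, and it omits the __declspec(dllimport) part so it builds anywhere.

```c
#include <stddef.h>

/* Pointer type for a glBlendFuncSeparate-like entry point (GLenum args). */
typedef void (*blend_func_separate_fn)(unsigned int, unsigned int,
                                       unsigned int, unsigned int);

/* Hypothetical stand-in for the exported __glewBlendFuncSeparate. */
blend_func_separate_fn blend_func_separate_ptr = NULL;

/* Hypothetical stand-in for the glBlendFuncSeparate macro: it expands
   to the pointer, so what looks like a function call is an indirect one. */
#define blendFuncSeparate blend_func_separate_ptr

static int driver_calls = 0;

/* Hypothetical driver implementation; counts how often it is reached. */
static void driver_blend_func_separate(unsigned int src_rgb, unsigned int dst_rgb,
                                       unsigned int src_a, unsigned int dst_a)
{
    (void)src_rgb; (void)dst_rgb; (void)src_a; (void)dst_a;
    driver_calls++;
}

/* Mirrors the VTK call site: no NULL check before the call. */
static void init_state(void)
{
    /* GL_SRC_ALPHA, GL_ONE_MINUS_SRC_ALPHA, GL_ONE, GL_ONE_MINUS_SRC_ALPHA */
    blendFuncSeparate(0x302, 0x303, 1, 0x303);
}
```

If blend_func_separate_ptr is never filled in, init_state calls through a NULL pointer. That is undefined behavior in C, and on Windows it in practice means a jump to address 0 and an access violation.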

Right, so the source confirms our reversing findings. Time to finish this.

What happens inside OpenGLInitContext? The (almost) first thing this function does is call glewInit inside vtkglew-7.1.dll, which calls glewContextInit:

GLenum GLEWAPIENTRY glewContextInit (GLEW_CONTEXT_ARG_DEF_LIST)
{
  // ... Snip ...

  /* query opengl version */
  s = glGetString(GL_VERSION);
  dot = _glewStrCLen(s, '.');
  if (dot == 0)
    return GLEW_ERROR_NO_GL_VERSION;

  major = s[dot-1]-'0';
  minor = s[dot+1]-'0';

  if (minor < 0 || minor > 9)
    minor = 0;
  if (major<0 || major>9)
    return GLEW_ERROR_NO_GL_VERSION;


  if (major == 1 && minor == 0)
  {
    return GLEW_ERROR_GL_VERSION_10_ONLY;
  }
  else
  {
    // ... Snip ...

    GLEW_VERSION_1_4   = GLEW_VERSION_1_5   == GL_TRUE || ( major == 1 && minor >= 4 ) ? GL_TRUE : GL_FALSE;

    // ... Snip ...
  }

  // ... Snip ...

#ifdef GL_VERSION_1_4
  if (glewExperimental || GLEW_VERSION_1_4) GLEW_VERSION_1_4 = !_glewInit_GL_VERSION_1_4(GLEW_CONTEXT_ARG_VAR_INIT);
#endif /* GL_VERSION_1_4 */

  // ... Snip ...
}

What this does is query the OpenGL version, parse the returned string to determine the major and minor version, and then call _glewInit_GL_VERSION_1_4 if the version is at least 1.4 (or if glewExperimental is set). And this last function is what initializes __glewBlendFuncSeparate, as we have seen in Ghidra (and confirmed by the source).
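The version check is easy to try out in isolation. Below is a standalone re-implementation of the parsing logic shown above, with strchr standing in for GLEW's _glewStrCLen helper (so edge cases such as a missing dot may not behave identically), plus a helper that answers the "at least 1.4" question encoded by the GLEW_VERSION_1_4 line. The struct and function names are hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Result of parsing a GL_VERSION string the way the GLEW snippet does. */
struct gl_version {
    int ok;     /* 0 if the string could not be parsed */
    int major;
    int minor;
};

static struct gl_version parse_gl_version(const char *s)
{
    struct gl_version v = {0, 0, 0};
    const char *dot = (s != NULL) ? strchr(s, '.') : NULL;

    /* No dot, or a dot in the very first position: nothing to parse. */
    if (dot == NULL || dot == s)
        return v;

    /* Only the single digits on either side of the first dot are read. */
    v.major = dot[-1] - '0';
    v.minor = dot[1] - '0';

    if (v.minor < 0 || v.minor > 9)
        v.minor = 0;
    if (v.major < 0 || v.major > 9)
        return v;   /* ok stays 0, like GLEW_ERROR_NO_GL_VERSION */

    v.ok = 1;
    return v;
}

/* Nonzero when the parsed version is at least want_major.want_minor. */
static int version_at_least(struct gl_version v, int want_major, int want_minor)
{
    return v.ok && (v.major > want_major ||
                    (v.major == want_major && v.minor >= want_minor));
}
```

For example, "1.3" fails the 1.4 check while "2.1" passes. Like the original, it reads only one digit on each side of the first dot.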

Case closed?

No matter how we look at it, this is clearly a bug in VTK. OpenGLInitState assumes that glBlendFuncSeparate is available, but that depends on the OpenGL version. Strictly speaking, wglGetProcAddress could still return NULL even if the version is at least 1.4, but that would be a bug in the OpenGL implementation.

Setting a breakpoint on glGetString, we can see that on our lab workstation (where ContosoSim crashes) it returns a version number of 1.1.0, which explains why glBlendFuncSeparate remains NULL.

Case closed. Send an email to the vendor telling them to upgrade VTK[3] and wait for a fix. After the fix we may just get a message telling us our OpenGL version is too old, but at least that's progress.
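Such a fix could be as simple as checking the pointer before use and failing loudly. The sketch below is hypothetical: it is not VTK's actual fix, and init_blending, INIT_GL_TOO_OLD and the driver stub are invented for illustration. The hex constants are the GL_SRC_ALPHA, GL_ONE_MINUS_SRC_ALPHA and GL_ONE values visible in the Ghidra output.

```c
#include <stddef.h>
#include <stdio.h>

/* Pointer type for a glBlendFuncSeparate-like entry point (GLenum args). */
typedef void (*blend_func_separate_fn)(unsigned int, unsigned int,
                                       unsigned int, unsigned int);

/* Hypothetical stand-in for the loader-filled __glewBlendFuncSeparate. */
blend_func_separate_fn blend_func_separate_ptr = NULL;

enum init_result { INIT_OK = 0, INIT_GL_TOO_OLD = 1 };

/* Hypothetical guarded blending setup: report an old OpenGL instead of
   calling through a NULL pointer. */
static enum init_result init_blending(const char *gl_version)
{
    if (blend_func_separate_ptr == NULL) {
        fprintf(stderr,
                "OpenGL 1.4 or later is required, but the driver reports %s\n",
                gl_version != NULL ? gl_version : "(unknown)");
        return INIT_GL_TOO_OLD;
    }
    /* GL_SRC_ALPHA, GL_ONE_MINUS_SRC_ALPHA, GL_ONE, GL_ONE_MINUS_SRC_ALPHA */
    blend_func_separate_ptr(0x302, 0x303, 1, 0x303);
    return INIT_OK;
}

/* Hypothetical driver implementation, used to exercise the happy path. */
static void driver_blend_func_separate(unsigned int src_rgb, unsigned int dst_rgb,
                                       unsigned int src_a, unsigned int dst_a)
{
    (void)src_rgb; (void)dst_rgb; (void)src_a; (void)dst_a;
}
```

An alternative design would be to fall back to the OpenGL 1.0 glBlendFunc when the separate variant is missing, at the cost of slightly different alpha blending.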

Except... The simulator does work on other machines. So what gives?

When a DLL is more than the sum of its exports

Maybe on the machines where the simulator doesn't crash it simply goes through a different code path, bypassing the NULL dereference? A plausible hypothesis, but upon closer inspection it's quickly tossed out: on my machine __glewBlendFuncSeparate is not NULL, and it is indeed called from the same flow.

Okay, so maybe we just have different versions of OpenGL? Nope, again. Both systems have the same opengl32.dll. Identical.

Alright, this is not funny anymore.

Taking another look at __glewBlendFuncSeparate, we see that it's not NULL, and it's also not inside opengl32.dll. In fact, it points to ig8icd32.dll, which is the "OpenGL(R) Driver for Intel(R) Graphics Accelerator".

But surely, glGetString should still return 1.1.0, right? Right?! It's the same DLL! I don't know why it surprised me, but sure enough, the version string returned was 4.4.0 - Build 20.19.15.4624.

Somehow, this Intel DLL manages to override a legitimate Windows one. My immediate thought was that Intel hooked it somehow[4], but the truth is more prosaic. Setting a breakpoint on the load of this DLL (sxe ld ig8icd32.dll), we can see that a function in opengl32.dll, LoadAvailableDrivers, is responsible for loading it.

And so, opengl32.dll loads a GPU vendor's OpenGL implementation (an installable client driver, or ICD, hence the "icd" in ig8icd32.dll) and delegates to it. If the vendor's implementation supports OpenGL 1.4 or later, VTK will work as expected. Otherwise, it'll crash.

Lovely.

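But why doesn't this delegation happen on our lab workstation? Well... Because it doesn't have a GPU. At all. With no vendor driver to delegate to, opengl32.dll falls back to Microsoft's own software implementation, which only goes up to OpenGL 1.1; hence the 1.1.0 version string.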
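The delegation can be pictured with a deliberately simplified model: the system DLL ends up with either a loaded vendor driver or a built-in fallback, and every query goes to whichever one it picked. The struct, the selection function and the stub are hypothetical, and the real driver interface is considerably more involved; the version strings are the ones glGetString reported on the two machines.

```c
#include <stddef.h>

/* Pointer type for a glBlendFuncSeparate-like entry point (GLenum args). */
typedef void (*blend_func_separate_fn)(unsigned int, unsigned int,
                                       unsigned int, unsigned int);

/* Hypothetical summary of an OpenGL implementation. */
struct gl_impl {
    const char *version;                         /* what glGetString reports */
    blend_func_separate_fn blend_func_separate;  /* NULL if unsupported */
};

/* Hypothetical vendor entry point. */
static void vendor_blend(unsigned int src_rgb, unsigned int dst_rgb,
                         unsigned int src_a, unsigned int dst_a)
{
    (void)src_rgb; (void)dst_rgb; (void)src_a; (void)dst_a;
}

/* Built-in software fallback: OpenGL 1.1 only, no glBlendFuncSeparate. */
static const struct gl_impl builtin_software = { "1.1.0", NULL };

/* Stand-in for a vendor driver such as ig8icd32.dll. */
static const struct gl_impl vendor_icd = {
    "4.4.0 - Build 20.19.15.4624", vendor_blend
};

/* Delegate to the vendor driver when one was loaded, else fall back. */
static const struct gl_impl *select_impl(const struct gl_impl *loaded_driver)
{
    return loaded_driver != NULL ? loaded_driver : &builtin_software;
}
```

On the GPU-less workstation, loaded_driver is effectively NULL, so everything VTK asks for comes from the 1.1 fallback.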

Who to blame and what to do?

And so, a combination of an old PC with a buggy version of VTK means we can't use ContosoSim. Sure, we can wait for the simulator vendor to update VTK on their end, but we need the software now! And, as stated previously, maybe we do actually need a GPU with modern OpenGL support. Unfortunately, upgrading the workstation is not exactly an option.

Perhaps there is a way...

During my wanderings through the interwebs, I noticed that Qt has the ability to emulate OpenGL in software. Although the simulator does use Qt, setting the environment variables that enable this does not help with the crash. But it does suggest that if we could find a software implementation of OpenGL, we might be able to fool the simulator...

As luck would have it, there is such a thing: Mesa3D. And, what's even better, there is a Windows build. Just dropping its opengl32.dll into ContosoSim's installation directory makes the simulator launch. Granted, it's bound to be slower than GPU-assisted OpenGL, and it might even crash because of implementation issues.
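Dropping a replacement opengl32.dll next to the executable works because of the loader's search order: for a desktop application, the directory the application was loaded from is searched before the system directory. The sketch below is a deliberately simplified model of that rule, assuming the Mesa build is used as a drop-in opengl32.dll and that opengl32.dll is not one of the KnownDLLs (which are always taken from the system directory). It leaves out the rest of the real search order, and the directory names are placeholders.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical "file system": which DLL exists in which directory. */
struct dll_file {
    const char *dir;
    const char *name;
};

/* Return the first directory, in search order, that contains `name`,
   or NULL if none does. */
static const char *resolve_dll(const char *name,
                               const char *const *search_dirs, size_t n_dirs,
                               const struct dll_file *files, size_t n_files)
{
    for (size_t i = 0; i < n_dirs; i++) {
        for (size_t j = 0; j < n_files; j++) {
            if (strcmp(files[j].dir, search_dirs[i]) == 0 &&
                strcmp(files[j].name, name) == 0)
                return search_dirs[i];
        }
    }
    return NULL;
}
```

With only the system copy present, the lookup lands in the system directory; once a copy sits in the application directory, that one wins because it is checked first.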

But for now — it works.


  1. With apologies to Microsoft 😎.

  2. Until that point it was installed only on the personal computers of some of the aerodynamics team members.

  3. AFAICT the issue has been fixed somewhere around commit 6498240fd590654cc9f7dd9aedc17c0dbc867c2b, but I kinda lost myself in the commit history, so this might not be an accurate estimate. In any event, as of this writing the latest version of VTK is 9.0.1, so it's a safe bet they've fixed it.

  4. Hey, why should AV vendors have all the fun? 😈