alltom.com

Thoughts On: Instrumenting JavaScript in JavaScript

For my Theseus research project, I needed to be able to store execution traces of JavaScript on the web. I wanted it to work on any site with any browser, so I started by making a proxy server that would inject the debugger into every web page. It would rewrite all JavaScript code to save the execution trace to a global, in-memory object as it ran.

The implementation was fairly straightforward, though I ran into plenty of surprises, which I've written about here.

The finished instrumentation library is called fondue. I also packaged the proxy server as generic middleware called fondue-middleware.

# Implementation Overview

I parsed and rewrote JavaScript using the Falafel library, which is based on the Esprima parser. I also used a version of JSHint that's modified to provide information about the scopes of all variables.

The proxy is written in Node.js, as a Connect middleware. When the middleware detects the application/javascript MIME type, it instruments the entire response (assuming it's JavaScript). When the MIME type is text/html, it hunts for <script> tags and instruments their contents.
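As a rough illustration of that dispatch (not fondue-middleware's actual code), the sketch below picks a strategy from the MIME type. The names `instrumentResponse` and `instrument` are hypothetical, and the regular expression is a simplifying assumption; a robust version would use a real HTML parser.

```javascript
// Hypothetical helper: `instrument` stands in for the real source rewriter.
function instrumentResponse(contentType, body, instrument) {
  // Strip parameters such as "; charset=utf-8" before comparing.
  var mimeType = (contentType || "").split(";")[0].trim().toLowerCase();

  if (mimeType === "application/javascript") {
    // The whole response is assumed to be JavaScript.
    return instrument(body);
  }

  if (mimeType === "text/html") {
    // Naive assumption: a regex is good enough to find inline <script> tags.
    return body.replace(
      /(<script\b[^>]*>)([\s\S]*?)(<\/script>)/gi,
      function (match, open, code, close) {
        // Tags with an empty body (external scripts) are left untouched.
        return code.trim() === "" ? match : open + instrument(code) + close;
      }
    );
  }

  return body; // anything else passes through unchanged
}

// Usage with a stand-in rewriter that just marks the code it was given.
function mark(code) { return "/*instrumented*/" + code; }
var jsOut = instrumentResponse("application/javascript; charset=utf-8", "x()", mark);
var htmlOut = instrumentResponse(
  "text/html",
  '<p>a</p><script>x()</script><script src="a.js"></script>',
  mark
);
```

Passing the rewriter in as a parameter keeps the MIME-type logic separate from the instrumentation itself.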

When the instrumented JavaScript runs, the trace data gets saved in a global object so it can be accessed directly by the debugger JavaScript that also executes on the page. When using Chrome, that global object can also be accessed from another application using the remote debugging protocol with Runtime.evaluate.

# Global Object

The basic idea is to call a function at every key point in the JavaScript execution, such as when entering or exiting a function. The functions that get called to record that data all exist on a global object called window.theseus. Code like this gets injected at the top of every page:

<script src="/jquery.min.js?theseus=0"></script>
<script>
if (!window.theseus) {
  window.theseus = (function ($) {
    var invocationsById = {};
    var topLevelInvocations = [];
    ...

    return {
      traceEnter: function (...) {
        ...
      },

      traceExit: function (...) {
        ...
      },

      ...
    };
  }($));
}
jQuery.noConflict(true);
</script>

Instrumentation data is stored in the local variables (invocationsById, topLevelInvocations, etc).

# Instrumenting Function Entry and Exit

The first information I wanted to gather was the call graph: how many times every function was called, and who called it. I wanted to rewrite every function like this:

                                   function foo() {
function foo() {                     theseus.traceEnter("function-id", ...);
  ... original code ...     -->      ... original code ...
}                                    theseus.traceExit("function-id", ...);
                                   }

That would allow me to create the graph by pushing function identifiers onto a stack in traceEnter and popping them in traceExit. Just before pushing, the caller would be at the top of the stack.
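A minimal sketch of that stack idea follows. Only traceEnter and traceExit come from the text above; `callStack`, `callCounts`, and `callerCounts` are hypothetical names, and the example functions are instrumented by hand.

```javascript
var callStack = [];     // ids of the functions currently executing
var callCounts = {};    // function id -> number of calls
var callerCounts = {};  // "caller -> callee" -> number of calls

function traceEnter(functionId) {
  // Just before pushing, the top of the stack is the caller (if any).
  var caller = callStack.length ? callStack[callStack.length - 1] : "(top level)";
  var edge = caller + " -> " + functionId;
  callCounts[functionId] = (callCounts[functionId] || 0) + 1;
  callerCounts[edge] = (callerCounts[edge] || 0) + 1;
  callStack.push(functionId);
}

function traceExit(functionId) {
  callStack.pop();
}

// Hand-instrumented example functions.
function bar() {
  traceEnter("bar");
  traceExit("bar");
}

function foo() {
  traceEnter("foo");
  bar();
  bar();
  traceExit("foo");
}

foo();
// callCounts is now { foo: 1, bar: 2 }
// callerCounts is now { "(top level) -> foo": 1, "foo -> bar": 2 }
```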

Reality is a little more complicated: functions can exit from anywhere with a return statement or an exception. I use a finally block to ensure that traceExit is called even in those cases. So the instrumented code actually looks more like this:

                                   function foo() {
                                     theseus.traceEnter("function-id", ...);
                                     try {
function foo() {                       ... original code ...
  ... original code ...     -->      } catch (e) {
}                                      theseus.traceException("function-id", e);
                                       throw e;
                                     } finally {
                                       theseus.traceExit("function-id", ...);
                                     }
                                   }

Note that I don't strictly need to pass function-id to traceException or traceExit, but doing so makes it possible to perform sanity checks.
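To see why the finally block matters, here is a small hand-instrumented function with both an early return and a thrown exception. The `tracer` object and `events` log are hypothetical stand-ins for the real trace store, and the re-throw in the catch block is an assumption of this sketch, added so that the exception still reaches the caller.

```javascript
var events = []; // hypothetical log standing in for the trace store
var tracer = {
  traceEnter: function (id) { events.push("enter " + id); },
  traceException: function (id, e) { events.push("exception " + id + ": " + e.message); },
  traceExit: function (id) { events.push("exit " + id); }
};

function parsePositive(text) {
  tracer.traceEnter("parsePositive");
  try {
    var n = Number(text);
    if (isNaN(n)) { throw new Error("not a number"); }
    if (n > 0) { return n; } // early return: finally still runs
    return 0;
  } catch (e) {
    tracer.traceException("parsePositive", e);
    throw e; // re-throw so the caller still sees the exception
  } finally {
    tracer.traceExit("parsePositive");
  }
}

var result = parsePositive("42");

var caught = null;
try {
  parsePositive("abc");
} catch (e) {
  caught = e;
}
// events now holds, in order: enter, exit, enter, exception, exit
```

Every enter is paired with an exit, whichever way the function leaves.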

# Capturing Variable Values

A function's behavior is determined by the state that's available to it: arguments, accessible local variables, and global variables. There's not enough memory to snapshot all of the global state at every function call (have you ever looked at how much data there is in window?), but storing local state is feasible. Each function's arguments object gets passed into traceEnter and traceExit to be saved.
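One way to save the arguments is to copy them into a plain array on entry, sketched below. The `savedArguments` store and the record shape are hypothetical. Note that this is a shallow copy, so objects are kept by reference and later mutations will show up in the saved trace.

```javascript
var savedArguments = []; // hypothetical store of per-call argument snapshots

function traceEnter(functionId, args) {
  // `arguments` is array-like, not a real array, so copy it with slice.
  savedArguments.push({
    id: functionId,
    args: Array.prototype.slice.call(args)
  });
}

function add(a, b) {
  traceEnter("add", arguments);
  return a + b;
}

add(1, 2);
add(3, 4, 5); // extra arguments are captured too
```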

# Instrumenting Function Call Sites

Using only the code from the previous section, you can't tell what line a function was called from. Consider code like this:

function foo() {
  bar();
  bar();
}

I wanted to be able to tag each invocation of bar with information about which of those places in the source code it was called from. To do this, I instrument every function call site as well.

My technique is to leave the call site information in Theseus's global state, to be picked up by the next call to traceEnter. The tricky part is doing so after the arguments are evaluated but before the function is invoked. I do that by replacing every function call with a call to an instrumentation function. That function gets the original arguments, and a pointer to the function that would have been called:

foo(a, b) -->
  theseus.traceFunCall("call-site-id", { func: foo }, [a, b])

xxx.foo(a, b) -->
  theseus.traceFunCall("call-site-id", { this: xxx, property: "foo" }, [a, b])

There are two forms because traceFunCall calls the original function with apply, which requires passing the value of this to use. That information is (unfortunately!) not embedded in the function pointer.
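A sketch of how traceFunCall could handle both forms is shown below. This is an illustration under assumptions, not fondue's implementation: `pendingCallSite` is a hypothetical name for the global state that traceEnter picks up, and the example functions are instrumented by hand.

```javascript
var pendingCallSite = null; // hypothetical: read by the next traceEnter
var log = [];

function traceFunCall(callSiteId, target, args) {
  var func, thisValue;
  if ("func" in target) {
    // Plain call, foo(a, b): there is no receiver.
    func = target.func;
    thisValue = undefined;
  } else {
    // Method call, xxx.foo(a, b): look the method up on the receiver.
    func = target.this[target.property];
    thisValue = target.this;
  }
  // The arguments were already evaluated by the time we get here,
  // so the call site is recorded just before the real invocation.
  pendingCallSite = callSiteId;
  return func.apply(thisValue, args);
}

function traceEnter(functionId) {
  log.push(functionId + " called from " + pendingCallSite);
  pendingCallSite = null;
}

// Hand-instrumented examples of both call forms.
function double(x) {
  traceEnter("double");
  return x * 2;
}

var counter = {
  count: 0,
  increment: function (by) {
    traceEnter("increment");
    this.count += by;
    return this.count;
  }
};

var a = traceFunCall("site-1", { func: double }, [21]);
var b = traceFunCall("site-2", { this: counter, property: "increment" }, [5]);
```

Without the receiver passed separately, `increment` would run with the wrong `this` and fail to update `counter.count`.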

# Instrumenting Function Creation

Remember that I did all of this to create a debugging tool to help understand asynchronous code. A very important relationship I wanted to retain was that between a callback function and the place in the call graph it came from. Consider this code:

function download(url) {
  $.get(url, function(data) {
    $('.result').html(data);
    alert('Load was performed.');
  });
}

It would be useful to tie any invocations of the callback function to the invocation of download which spawned them. Cases like this, where the callback function is defined inline, are easy to instrument:

function () { ... } -->
  (theseus.traceFunCreate("function-id", function () { ... }))

traceFunCreate stores a pointer to the invocation at the top of the call stack (in the example above, that would be an invocation of download) and returns a wrapped version of the function that associates its invocations with the invocation of the creator.
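The sketch below shows one way such a wrapper could work. It simplifies things by having the wrapper record the invocation itself; `beginInvocation`, `endInvocation`, `fakeGet`, and the record shape are all hypothetical, with `fakeGet` standing in for an asynchronous call like $.get.

```javascript
var invocationStack = []; // invocation records currently executing
var allInvocations = [];  // every invocation recorded so far

function beginInvocation(functionId, creator) {
  var invocation = { functionId: functionId, creator: creator || null };
  allInvocations.push(invocation);
  invocationStack.push(invocation);
  return invocation;
}

function endInvocation() {
  invocationStack.pop();
}

function traceFunCreate(functionId, func) {
  // Remember which invocation was running when the function was created.
  var creator = invocationStack.length
    ? invocationStack[invocationStack.length - 1]
    : null;
  return function () {
    beginInvocation(functionId, creator);
    try {
      return func.apply(this, arguments);
    } finally {
      endInvocation();
    }
  };
}

// A fake asynchronous `get`: it queues the callback to run later.
var queued = [];
function fakeGet(url, callback) {
  queued.push(function () { callback("data for " + url); });
}

var received = [];
function download(url) {
  beginInvocation("download");
  try {
    fakeGet(url, traceFunCreate("callback", function (data) {
      received.push(data);
    }));
  } finally {
    endInvocation();
  }
}

download("/a");
// download has returned; now simulate the response arriving.
queued.forEach(function (run) { run(); });
```

Even though download is no longer on the stack when the callback runs, the callback's record still points back at the download invocation that created it.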

Unfortunately, callback functions are often passed by name:

var callback = function () { ... }
$.get(url, callback);

Though it would be possible to detect when a function pointer is passed as an argument in a function call, it wouldn't be feasible to wrap it because function identity is important for things like unregistering event handlers:

$("button").on("click", clickHandler);
...
$("button").off("click", clickHandler);

So I'm punting on the case where functions are passed by name.
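The identity problem is easy to reproduce. In this sketch, the standard EventTarget API stands in for the jQuery calls above, and `wrap` is a hypothetical stand-in for what a rewriter would do to a function passed by name.

```javascript
// Hypothetical wrapper: returns a new function object on every call.
function wrap(func) {
  return function () { return func.apply(this, arguments); };
}

var button = new EventTarget(); // stands in for a DOM button
var clicks = 0;
function clickHandler() { clicks += 1; }

// What the rewritten code would effectively do:
button.addEventListener("click", wrap(clickHandler));
// This wrapper is a different function object, so nothing is removed.
button.removeEventListener("click", wrap(clickHandler));
button.dispatchEvent(new Event("click"));
// clicks is 1 here: the handler is still registered

// For comparison, the unwrapped version removes the handler correctly.
var otherButton = new EventTarget();
var otherClicks = 0;
function otherHandler() { otherClicks += 1; }

otherButton.addEventListener("click", otherHandler);
otherButton.removeEventListener("click", otherHandler);
otherButton.dispatchEvent(new Event("click"));
// otherClicks is 0 here
```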

# Performance

I don't notice a difference in the performance of regular web sites, but I've also attempted to test the limits of my current implementation.

I instrumented the entire Brackets editor, which is about 80,000 lines of JavaScript. The launch went from a few hundred milliseconds to about 30 seconds, but I believe most of that time was spent parsing and rewriting JavaScript, and the app was mostly usable after that. If I only instrumented the code outside the thirdparty directory (which contained about 50,000 of those lines, including jQuery and CodeMirror), start-up time went back down to a second or two and run-time performance was great.

I instrumented Infinite Mario and it became jumpy, but playable, for about a minute. After that minute the game froze hard, presumably from a garbage collection cycle that got out of hand. Eventually Chrome gave up and bailed without reporting an error.