Summary of our CCS paper on DOM-based XSS

Since the traffic on my server has gone up after Sebastian linked the paper on Twitter, I thought I'd write a short summary of the paper itself.

So, what is DOM-based XSS?

In contrast to the quite well-known types of Cross-Site Scripting, namely persistent and reflected XSS, DOM-based XSS is caused not by insecure server-side code but by vulnerable client-side code. The basic problem behind all flavors of XSS is that attacker-provided content is not properly filtered or encoded and thus ends up in the rendering engine in its malicious form. In the case of DOM-based XSS, the sources of such attacks stem from the Document Object Model (DOM). The DOM provides access to the HTML document as well as to additional information like cookies, the URL or Web Storage.
I guess the most common pattern we saw in our study was code similar to the following:

<script>
  document.write("<iframe src='http://adserver/ad.html?referer="
      + document.location + "'></iframe>");
</script>

The page tries to build an iframe with an advertisement inside. The ad provider wants to know the URL of the page that included it and therefore writes the complete location into the src attribute. However, this is a vulnerability, since the attacker might be able to control the URL completely. At this point, it is necessary to explain how browsers encode data coming from the DOM. While Firefox encodes everything coming from the document.location object (the complete location, the query parameters and the URL fragment, i.e. the part after the # sign), Internet Explorer does not encode any part of it. Chrome is somewhere in the middle, encoding the query parameters but not the fragment. Thus, an attacker could easily lure his victim (using IE or Chrome) to a URL such as http://vulnerab.le/#'/><script>alert(1)</script>. With the code above, this leads to the following being written to the document:

<iframe src='http://adserver/ad.html?referer=http://vulnerab.le/#'/><script>alert(1)</script>'></iframe>

Obviously, this works in IE and Chrome, whereas it does not work in Firefox, since the ' is encoded as %27 there.
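As an aside, the ad snippet could defend itself by encoding the attacker-controllable data before writing it. A minimal sketch of such a fix (my illustration, not code we encountered in the study); note the double-quoted src attribute, since encodeURIComponent leaves the single quote unencoded:

<script>
  // encodeURIComponent escapes ", <, > and #, but not the single quote,
  // so the attribute is delimited with double quotes here
  document.write('<iframe src="http://adserver/ad.html?referer='
      + encodeURIComponent(document.location) + '"></iframe>');
</script>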

How to detect DOM-based XSS?

One option to detect any kind of vulnerability is static code analysis. For server-side languages like PHP, static code audits are commonly used to find vulnerabilities. JavaScript, however, is hard to analyze statically: eval() is used all over the place, the language relies on prototype chains, and functions can be added to an object at runtime. Therefore, we decided to do a dynamic analysis.

In terms of XSS, we can abstract vulnerabilities as data flows from attacker-controllable sources to security-critical sinks. This already suggests our approach: dynamic data flow tracking, or taint tracking. Taint tracking is a concept in which we mark pieces of data that might come from an attacker as "dirty", or tainted. If such a piece of tainted data ends up in a sink, the flow might be vulnerable. However, flows like the one depicted above, where the whole URL ends up in the sink, are not the ones that happen all the time. Usually, some part of the URL is extracted, or tainted data is concatenated with untainted (benign) data. Thus, it is not enough to taint a string once; the taint also has to be passed on by all string operations.

Our first thought was to implement this in a headless browsing engine like HtmlUnit. However, these engines usually are not as up to date as real browsers and do not implement certain edge cases. Thus, we opted to implement our approach within Chromium, the open-source counterpart of Google Chrome.
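To illustrate the concept (this is just a sketch with hypothetical names, not our actual in-browser implementation, which lives in C++), a taint-aware string could look like this:

class TaintedString {
  constructor(value, tainted) {
    this.value = value;     // the character data
    this.tainted = tainted; // does any part stem from an attacker-controllable source?
  }
  // string operations pass the taint of their inputs on to the result
  concat(other) {
    return new TaintedString(this.value + other.value,
                             this.tainted || other.tainted);
  }
  substring(from, to) {
    return new TaintedString(this.value.substring(from, to), this.tainted);
  }
}

// a sink wrapper can then flag potentially vulnerable flows
function checkedWrite(str) {
  if (str.tainted) {
    console.warn("tainted data reaches document.write: " + str.value);
  }
  document.write(str.value);
}

In reality, the taint is tracked per character and inside the engine itself, as described next.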

To be able to automatically verify that a given flow is exploitable, we decided to implement our taint tracking at the level of single characters. Thus, we need to store the source of each and every character of a string that might contain tainted data. To do so, we store the taint information for each character in one additional byte. In this byte, we store the numerical identifier of the source. Since only 14 sources are relevant to our approach, this identifier fits into the lower four bits of the byte. If a character does not stem from a DOMXSS source, we set the source identifier to 0.

JavaScript has three built-in encoding functions, namely escape, encodeURI and encodeURIComponent. If used properly, these can stop DOMXSS attacks. Therefore, we want the taint information to reflect the fact that a string passed through one of these functions. Since we still had four bits left in our taint byte, we use bits 5 to 7 to encode which of these functions were applied (as a bitmask).
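In JavaScript notation, the byte layout can be sketched as follows (the constants are made up for illustration; the real identifiers live in our browser patch):

// sketch of the per-character taint byte
var SOURCE_BENIGN        = 0;  // character does not stem from a DOMXSS source
var SOURCE_LOCATION_HASH = 5;  // example identifier; 14 sources in total

var ENC_ESCAPE               = 1 << 4;  // bit 5: passed through escape()
var ENC_ENCODE_URI           = 1 << 5;  // bit 6: passed through encodeURI()
var ENC_ENCODE_URI_COMPONENT = 1 << 6;  // bit 7: passed through encodeURIComponent()

function makeTaintByte(sourceId, encodingMask) {
  return (sourceId & 0x0f) | encodingMask;  // source id in the lower four bits
}

function sourceOf(taintByte)   { return taintByte & 0x0f; }
function wasEncoded(taintByte) { return (taintByte & 0xf0) !== 0; }

// example: a character from location.hash that passed encodeURIComponent
var t = makeTaintByte(SOURCE_LOCATION_HASH, ENC_ENCODE_URI_COMPONENT);
// sourceOf(t) === 5, wasEncoded(t) === true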

Our tainting needed to work both inside the V8 JavaScript engine and inside the WebKit (now Blink) rendering engine. V8 is highly optimized: its objects do not have actual member variables but consist of a header and (for strings) an allocated piece of memory that stores the characters. Inside the header, the different "variables" are identified by their offset, so simply adding a new piece of information is quite hard. However, in V8, objects of the same built-in type (such as strings, integers or the different kinds of numbers) share a so-called "map" that identifies the type of the object. Inside this map (a C++ class), we found a bitmap storing information on the object's type, with 3 bits still unused. Thus, we chose to add new types to V8, namely one kind of tainted string for each string type already present in V8. Whenever data comes from a DOM source, we change the type of the resulting string to the tainted one. For such strings, we allocate an additional length bytes right behind the actual character data and fill them with the taint information. For WebKit, the task is easier, since its classes have member variables: we simply added a vector to the string class and store the taint information there.

On access to a sink, either in the DOM (e.g. document.write) or in JavaScript (e.g. eval), we want to report the flow. To keep the changes to the browser as small as possible (they already amount to roughly 4k lines of code), we opted to implement the analysis of the taint information in a Chrome extension. Whenever we see access to a sink, we call a function (reportFlow) that our extension injects into every page we load. The flow information is passed on to the extension's backend and parsed there. The following figure shows the complete system.

[Figure: overview of the complete analysis system]
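The injected hook itself can be sketched roughly as follows (everything apart from the reportFlow name is a hypothetical simplification):

// injected into every page; the patched browser calls it on sink access
window.reportFlow = function (sink, value, taint) {
  // relay the flow to the extension's backend for parsing
  chrome.runtime.sendMessage({
    sink: sink,    // e.g. "document.write" or "eval"
    value: value,  // the complete string that reached the sink
    taint: taint,  // the per-character taint information
    url: document.location.href
  });
};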

Large-scale study

After we implemented the tainting and the Chrome extension, we started a shallow crawl of the Alexa Top 5000. The extension allowed us to control multiple computers running multiple tabs each, visiting a total of 504,275 URLs. Since a lot of pages include additional frames (mostly advertising), we gathered data for a total of 4,358,031 documents. In these, we discovered a grand total of 24,474,306 data flows. This process took us roughly 5 days on 6 really old machines. We have since improved the performance of the extension and got new hardware, which now allows us to crawl roughly 60,000 URLs per day per machine. The following table shows the amount of data we gathered; the rows depict the sinks, the columns the sources.

[Table: number of data flows, with sinks as rows and sources as columns]

Validating vulnerabilities: exploit generation

At this point, it is important to understand that a flow as such does not necessarily constitute a vulnerability. As mentioned before, encoding the data may help (if the right function is used for the right context). Also, if the URL is changed, the server might not return the same vulnerable code. A third option is that custom filters only let certain kinds of data through (e.g. only integers). To actually validate that a flow is exploitable, we decided to implement an automated exploit generator. Since we have detailed information on the sink context (HTML or JavaScript) as well as the exact location of the tainted data, we were able to build an exploit generator capable of producing precise payloads that trigger the vulnerabilities. Since this part was mainly done by Sebastian, I refer you to the paper or to our talk at CCS in November for details on the implementation.
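To give a rough idea of what such a generator does (a heavily simplified illustration, not the actual implementation from the paper):

// build a breakout payload depending on the sink context
function buildPayload(context) {
  var proof = "reportExploit()";  // executing this function verifies the exploit
  switch (context.type) {
    case "html-attribute":
      // close the attribute and the tag, then inject a script element
      return context.quote + "/><script>" + proof + "</script>";
    case "js-string":
      // break out of the string literal and terminate the statement
      return context.quote + ";" + proof + ";//";
    default:
      return null;
  }
}

// for the iframe example above: buildPayload({type: "html-attribute", quote: "'"})
// yields '/><script>reportExploit()</script>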

As a first approach, we chose to generate exploits only for sinks that allow direct code execution (HTML and JavaScript contexts), and not for second-order vulnerabilities via cookies or Web Storage (depicted in yellow in the table). Also, we decided to only use sources that are easily controllable by an attacker, mainly the URL and the referrer (depicted in green). In total, we looked at 313,794 potentially security-critical flows. From these, the exploit generator built 181,238 unique exploit test cases; the number is lower because, for example, if the same iframe is written to the document in the same manner twice, the payload breaking out of the context and triggering our verification function is identical. Out of these test cases, 69,987 actually triggered our reporting function. However, this obviously does not mean that there are 70k vulnerabilities in the Alexa Top 5000.

To zero in on the actual number of vulnerable pieces of code, we designed a uniqueness criterion. Our taint analysis also gives us the exact position of the sink access in the code. Here, we need to distinguish three types of locations: an inline script inside the page, an external JavaScript file, or code inside eval(). For inline and external scripts, we get the URL of the containing file as well as the line and the offset within the line; inside an eval, we cannot determine a file name, only the line and offset. Thus, as the criterion for external scripts, we use the complete URL including line and line offset. For inline scripts, since we don't know whether, e.g., a CMS will always place the script at the same line, we only use the line offset. For eval, we use line and line offset. All of these are combined on a per-domain basis, where "domain" means the normalized second-level domain (thus www.foo.bar and test.foo.bar share the same domain).
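Expressed as code, the criterion might look roughly like this (field names are hypothetical):

// sketch of the uniqueness criterion
function uniqueKey(flow) {
  // normalize to the second-level domain, so www.foo.bar and
  // test.foo.bar end up in the same group
  var domain = flow.domain.split(".").slice(-2).join(".");
  switch (flow.location) {
    case "external": // full script URL plus line and offset within the line
      return [domain, flow.scriptUrl, flow.line, flow.offset].join("|");
    case "inline":   // the same script may move between lines, so only the offset is used
      return [domain, "inline", flow.offset].join("|");
    case "eval":     // no file name available, use line and offset
      return [domain, "eval", flow.line, flow.offset].join("|");
  }
}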

Results

With this criterion, we found that the exploits triggered 8,163 unique vulnerabilities. However, as already discussed, pages often include third-party content, in this case also from domains outside the Top 5000. Filtering out all domains not inside the Top 5000, we found a total of 6,167 unique vulnerabilities on 480 different domains. Note that due to our shallow crawl, this number is surely a lower bound.

In conclusion, if you would like a more detailed presentation of the paper, please see our talk at CCS.

5 comments on "Summary of our CCS paper on DOM-based XSS"
  1. Travis says:

Are you planning on making your Chromium code available? Automated detection of DOM-based XSS is tricky, nice work on the research!

    • ben says:

      Hi Travis,
at this point we will not make our work public (yet). There is some more work to be done on the whole thing before we can release it. I'll surely post it here if we publish it :-)
      Cheers
      Ben

  2. Inian says:

    Great article :)

    I have a couple of questions.
    Are the exploits you generated exploitable by a remote attacker? Or does the attacker need to be on the same network as the victim?
    With respect to the URL as a source, the attacker can just control the attributes like location.hash, location.search, etc. right? Since modification of the other attributes (like location.href) would redirect the page to a different location.

    • ben says:

      Hi Inian,

      first and foremost – hash and search are parts of href :-)
All the vulnerabilities we found can be abused if the attacker gets a victim to visit the page with his payload. Although this might sound difficult at first, it's easy to send the victim a link to a page with cute kitten pics and to embed an iframe with the vulnerable page in there.

For the CCS paper, we mainly utilized the hash; in current work, we also look at e.g. the search params. However, most often pages use something like the following:

      // Get query string following the question mark
var qs = location.href.substring(location.href.indexOf("?")+1);

This tries to extract the query string (which could also be done using location.search). Since indexOf("?") finds the first question mark anywhere in the complete URL, including the fragment, the attacker may use something like
http://example.org#?my_unencoded_payload_here
and everything after that question mark, i.e. the unencoded payload, ends up in qs.

If you have any more questions, feel free to drop me an email. I had to deactivate the mail notification for comments due to approx. 100 spambot comments per day…

      Cheers,
      Ben

      • inian says:

        Hi Ben,

        Thanks for your reply :)
It is quite amazing how many websites are vulnerable when just considering the location hash as a source!
        I will email you if I have any further queries.

        Cheers,
        Inian

