Since the traffic on my server has gone up after Sebastian linked my paper on Twitter, I thought I would write a short summary of the paper itself.
So, what is DOM-based XSS?
In contrast to the two well-known types of Cross-Site Scripting, namely persistent and reflected XSS, DOM-based XSS is not caused by insecure server-side code but by vulnerable client-side code. The basic problem of XSS is that attacker-provided content is not properly filtered or encoded and thus ends up in the rendering engine in its malicious form. In the case of DOM-based XSS, the sources of these attacks stem from the Document Object Model (DOM). The DOM provides access to the HTML document as well as additional information such as cookies, the URL, or Web Storage.
I guess the most common thing we saw in our study was code similar to the following:
<script>document.write("<iframe src='http://adserver/ad.html?referer=" + document.location + "'></iframe>")</script>
The page tries to build an iframe containing an advertisement. The ad provider wants to know the URL of the including page and therefore writes the complete location into the src attribute. Nevertheless, this is a vulnerability, since the attacker might be able to control the URL completely. At this point it is necessary to explain how browsers encode data coming from the DOM. While Firefox encodes everything coming from the document.location object (the complete location, the query parameters and the URL fragment, denoted by the # mark), Internet Explorer does not encode any part of the URL. Chrome is somewhere in the middle, encoding the query parameters but not the fragment. Thus, an attacker could easily lure a victim (using IE or Chrome) to a URL such as http://vulnerab.le/#'/><script>alert(1)</script>. Looking at the above example, this would lead to the following being written to the document:

<iframe src='http://adserver/ad.html?referer=http://vulnerab.le/#'/><script>alert(1)</script>'></iframe>
Obviously, this will work in IE and Chrome, whereas it will not work in Firefox, since the ' is encoded as %27.
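A straightforward fix for the ad snippet above is to encode the location explicitly rather than relying on browser-specific URL encoding. A minimal sketch (the buildAdFrame helper is mine, not from the paper; the adserver URL is the placeholder from the example): encodeURIComponent escapes angle brackets, slashes, equals signs and spaces (though notably not the single quote itself), which is enough to prevent the breakout shown above.

```javascript
// Build the iframe markup with the attacker-controllable part encoded.
// With encodeURIComponent, a fragment like #'/><script>... cannot
// introduce new tags into the document.
function buildAdFrame(loc) {
  return "<iframe src='http://adserver/ad.html?referer=" +
         encodeURIComponent(loc) + "'></iframe>";
}

// In the page, this would be used as:
//   document.write(buildAdFrame(document.location.toString()));
```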
How to detect DOM-based XSS?
To be able to automatically verify that a given flow is vulnerable, we decided to implement our taint tracking on a per-character level. Thus, we need to store the source of each and every character in strings that might contain any piece of tainted data. To allow for this, we opted to store the taint information for each character in an additional byte. In this byte, we store the numerical identifier of the source. Since we only have 14 sources relevant to our approach, we can store this information in the lower four bits of the allocated byte. If a character does not stem from a DOM-based XSS source, we set the source identifier to 0.
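The per-character taint storage can be sketched as follows. This is a simplified JavaScript model of the byte layout, not the actual patch to the rendering engine, and the source identifiers are illustrative:

```javascript
// One taint byte per character: the lower four bits hold the numerical
// source identifier (0 = untainted, 1..14 = the DOM sources), leaving
// the upper bits free for additional flags.
var SOURCE_NONE = 0;
var SOURCE_LOCATION = 1; // illustrative id for document.location

function taintedString(str, sourceId) {
  var taint = new Uint8Array(str.length); // one byte per character
  taint.fill(sourceId & 0x0F);            // keep only the lower four bits
  return { chars: str, taint: taint };
}

// String operations propagate the taint bytes alongside the characters.
function concatTainted(a, b) {
  var taint = new Uint8Array(a.taint.length + b.taint.length);
  taint.set(a.taint, 0);
  taint.set(b.taint, a.taint.length);
  return { chars: a.chars + b.chars, taint: taint };
}
```

A string such as `"<iframe src='" + document.location` would then carry taint 0 for the static markup and the location's source identifier for every character that came from the URL, which is what lets us later pinpoint exactly which part of a sink argument the attacker controls.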
After we implemented the tainting and the Chrome extension, we started a shallow crawl of the Alexa Top 5000. The extension allowed us to control multiple computers running multiple tabs to visit a total of 504,275 URLs. Since a lot of pages include other frames (mostly advertising), we gathered data for 4,358,031 documents in total. In these, we discovered a grand total of 24,474,306 data flows. This process took us roughly 5 days on 6 really old machines. We have since improved the performance of the extension and got new hardware, which now allows us to crawl roughly 60,000 URLs per day per machine. The following table shows the amount of data we gathered – the rows depict the sinks, the columns the sources.
Validating vulnerabilities: exploit generation
In this validation step, we found that the generated exploits triggered 8,163 unique vulnerabilities. However, as already discussed, pages often include third-party content – in this case also coming from outside the Top 5000. Filtering out all domains not inside the Top 5000, we found a total of 6,167 unique vulnerabilities on 480 different domains. Notably, due to our shallow crawl, this number is surely a lower bound.
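The core idea of the exploit generation can be sketched roughly like this. This is heavily simplified and the context names are mine, not the paper's terminology: from the taint information we know which sink the data reaches and in which syntactic context it lands, so we pick a breakout sequence for that context and append a verification payload.

```javascript
// Heavily simplified sketch of context-aware payload generation:
// choose a breakout sequence matching the syntactic context the tainted
// data ends up in, then append the actual test payload.
function generatePayload(context) {
  var breakouts = {
    "attribute-single-quoted": "'/>", // close the attribute and the tag
    "attribute-double-quoted": '"/>',
    "html-text": ""                   // already in a markup context
  };
  var breakout = breakouts[context];
  if (breakout === undefined) return null; // unsupported context
  return breakout + "<script>alert(1)</script>";
}
```

For the ad example above, the tainted data sits inside a single-quoted attribute, so the generated payload starts with '/> – exactly the fragment used in the exploit URL earlier.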
In conclusion, if you want a more detailed presentation of the paper, please see our talk at CCS.