On Stripping Styles For Security
By Adrian Sutton
A while back people discovered that many RSS readers, and all online RSS aggregators, didn’t sandbox content from different sites, so malicious HTML could cause cross-site scripting (XSS) attacks and general nastiness. As a result, most feed readers filter HTML through a seriously restrictive whitelist, which includes removing all CSS information. I’ve reached the point where I’ve simply had enough of this. CSS is a vital part of the internet and if feeds are going to be useful, we need them to work with CSS properly. So let’s take a look at what’s really going wrong:
Taking The Easy Option Instead Of The Right Option
Filtering down to the most basic HTML standards is easy, but it doesn’t deliver the experience that end users should be expecting. There’s no reason you can’t support the vast majority of the HTML standard and still protect your users from XSS attacks. You do have to invest a lot more time: fully parse the HTML into a model that only allows valid options, drop anything invalid, unrecognized or unsafe, and then serialize the result back out. That takes a lot more work than just applying string matching to strip things out, but it is far less likely to result in data loss (formatting is data too).
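To make the parse, filter and serialize idea concrete, here is a minimal sketch using only Python’s standard `html.parser` module. The whitelist, the `WhitelistSanitizer` class and the `sanitize` function are all names invented for this illustration, and the whitelist is deliberately tiny; a real reader would need to cover far more of HTML and should be reviewed carefully before being trusted.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical whitelist for illustration: tag -> attributes allowed on it.
ALLOWED_TAGS = {
    "p": set(),
    "em": set(),
    "strong": set(),
    "ins": set(),
    "del": set(),
    "a": {"href"},
}
SAFE_URL_PREFIXES = ("http://", "https://")


class WhitelistSanitizer(HTMLParser):
    """Rebuilds HTML from parser events, keeping only whitelisted markup."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.open_tags = []  # allowed tags that are currently open
        self.skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1
            return
        if self.skip_depth or tag not in ALLOWED_TAGS:
            return  # drop the tag itself; its text content is still kept
        kept = []
        for name, value in attrs:
            if name not in ALLOWED_TAGS[tag] or value is None:
                continue
            if name == "href" and not value.strip().lower().startswith(SAFE_URL_PREFIXES):
                continue
            kept.append(f' {name}="{escape(value, quote=True)}"')
        self.out.append(f"<{tag}{''.join(kept)}>")
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        if self.skip_depth or tag not in self.open_tags:
            return
        # Close anything left open inside this tag so the output stays well-formed.
        while self.open_tags:
            top = self.open_tags.pop()
            self.out.append(f"</{top}>")
            if top == tag:
                break

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(escape(data, quote=False))


def sanitize(html):
    """Parse untrusted HTML and serialize only the whitelisted parts."""
    parser = WhitelistSanitizer()
    parser.feed(html)
    parser.close()
    while parser.open_tags:  # close whatever the input left open
        parser.out.append(f"</{parser.open_tags.pop()}>")
    return "".join(parser.out)
```

The point of the design is that nothing from the input is ever copied through verbatim: every tag, attribute and piece of text in the output is re-emitted by the serializer, so anything the model doesn’t recognize simply has no way to appear.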
Put Security In The Right Place
For people writing browser-based readers or combining feeds inside a browser, sanitizing the HTML is really the only option they can take. Rich client feed readers, however, have a second, much better solution: isolate each entry. Let’s face it, every day we browse to a whole heap of websites that we know nothing about, and those sites are allowed to run any JavaScript they like. That’s not a major issue because browsers sandbox content with a range of rules. Why is it that feed readers don’t do this?
For instance, NetNewsWire can allow JavaScript to execute in any feed entry without any security risk because it only shows one entry at a time¹. The entry is sandboxed, so viewing it in the feed reader is just like viewing it in your browser². Even if you want to present a combined view, why aren’t these readers making the effort to confine each entry to the area of that view it is allowed to affect? Again, it’s more work than just creating a layout in HTML and throwing it into a browser view, but it gives users full functionality without the security risks.
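One way a combined view could keep entries apart is to give each one its own sandboxed frame. The sketch below assumes the reader’s rendering engine supports the HTML `sandbox` and `srcdoc` attributes on `iframe`; the `combined_view` function and the `(title, html)` entry shape are made up for this example, and this is not a description of how NetNewsWire or any other reader actually works.

```python
from html import escape


def combined_view(entries):
    """Build one page in which every feed entry lives in its own sandboxed frame.

    `entries` is a list of (title, html) pairs; that shape is an assumption
    made for this sketch.
    """
    frames = []
    for title, html in entries:
        frames.append(
            "<section>"
            f"<h2>{escape(title)}</h2>"
            # sandbox without allow-same-origin: scripts may run, but in a
            # unique origin that cannot reach the parent page or other entries
            f'<iframe sandbox="allow-scripts" srcdoc="{escape(html, quote=True)}"></iframe>'
            "</section>"
        )
    return "<!DOCTYPE html><html><body>" + "".join(frames) + "</body></html>"
```

The entry’s markup is escaped into the `srcdoc` attribute rather than pasted into the page, so it is only ever interpreted inside its own frame, with its own styles and scripts intact.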
What About Browser Bugs?
If you split the content into separate browser views, there’s still the possibility that a browser bug will open up a security hole. Of course, if that’s the case, your feed reader is only a small part of the problem: any website you visit with your browser could attack you in the same way. And even if your feed reader sanitizes HTML, it would only take a little bit of social engineering to make you open the entry in your browser, where the exploit would occur. The simplest way to achieve this is to design the entry content so that it makes a lot more sense when viewed with CSS styles (or just add a note at the top to say so).
Why Do You Need Formatting Anyway?
Formatting is data. In many cases the way content looks is just as important as what the content says. The specific example that’s driving me crazy at the moment is the recent changes feed from our internal wiki. In NetNewsWire it’s the most fantastic way to follow a wiki I’ve ever come across – each change turns up in the feed reader with the complete page content and the changes highlighted either green or red so you can quickly see what changed in context. The markup is actually nice and semantic – the changes are marked up using ins and del tags – but the default rendering of these tags is next to useless as it doesn’t stand out from the rest of the content enough. So for those poor people using FeedBurner (and any other reader that strips the background-color style on the ins and del tags) the feed is nearly useless because it’s too hard to see what changed. The formatting of the content is critical to being able to extract the information – falling back to just the semantic information renders the content almost useless.
The great irony in this situation is that font tags would have worked perfectly even though they don’t include the right semantic information and make the feed content far less accessible.
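A reader doesn’t have to choose between stripping every `style` attribute and passing it through untouched. As a rough sketch, it could keep only declarations whose property and value are both on a whitelist, which would be enough to preserve the `background-color` on `ins` and `del`. The property list, the value pattern and the `filter_style` name are all assumptions for illustration, and the value pattern is intentionally strict (plain keywords and hex colours only).

```python
import re

# Hypothetical whitelist: properties that only change how the element itself looks.
SAFE_PROPERTIES = {"background-color", "color", "font-weight", "font-style", "text-decoration"}
# Plain keywords and hex colours only; this rejects url(...), expression(...) and similar.
SAFE_VALUE = re.compile(r"#[0-9a-fA-F]{3,8}|[a-zA-Z-]+")


def filter_style(style):
    """Return the style attribute with only whitelisted declarations kept."""
    kept = []
    for declaration in style.split(";"):
        name, sep, value = declaration.partition(":")
        name, value = name.strip().lower(), value.strip()
        if sep and name in SAFE_PROPERTIES and SAFE_VALUE.fullmatch(value):
            kept.append(f"{name}: {value}")
    return "; ".join(kept)
```

For example, `filter_style("background-color: #cfc; position: fixed")` keeps the highlight colour and drops the positioning that could let an entry draw over the rest of the reader.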
Conclusion
There has been a huge push in recent years to move away from the old habits of early HTML and to leverage CSS for presentation – the fact that it doesn’t work in feed readers is a major pain for people trying to do the right thing. It’s good that we identified a security threat and dealt with it quickly – but it’s not acceptable to stop there. We need to work to get the functionality that we used to have back without reintroducing the security risks. It’s not simple, but it is important.
Let’s stop neutering the web.
1 – At least with the way I configure it. I'm not sure if it has a combined view that throws everything into one browser instance.
2 – Very much like it, given that NetNewsWire uses WebKit for its rendering.