May 25, 2013

Should String Be An Abstract Class?

Why are HTTP headers handled as plain strings in programming?

Is there anything in software engineering that is just a string? If not, shouldn't String be an abstract class, forcing developers to subtype and at least name datatypes?

Domain-Driven Security

Former colleague Dan Bergh Johnsson, application security expert Erlend Oftedal, and I have been evangelizing the idea of Domain-Driven Security. We truly believe proper domain and data modeling will kill many of the standard security bugs such as SQL injection and cross-site scripting.

This blog post is a case for Domain-Driven Security and a case against strings.

The addHeader() Method in Java

Let's be concrete and dive directly into programming with HTTP headers.

In Java EE's interface HttpServletResponse we find the following method (ref):

void addHeader(java.lang.String name,
               java.lang.String value)

Not a heavily debated method as far as I know. On the contrary it looks like most such interfaces do. An implementation of the interface may look like this (ref):

public void addHeader(String name, String value) {
  if (isCommitted())
    return;

  if (included)
    return;     // Ignore any call from an included servlet

  synchronized (headers) {
    ArrayList values = (ArrayList) headers.get(name);
    if (values == null) {
      values = new ArrayList();
      headers.put(name, values);
    }
    values.add(value);
  }
}

It shows we can really set any string as an HTTP header. And that's convenient, right?

The Ubiquitous String

java.lang.String is the ubiquitous datatype that solves all our problems. It can contain anything and nothing and of course it has its sibling in any popular programming language out there. Let's have a look at what a string is.

Java uses Unicode strings in UTF-16 code units which handle over 100,000 characters. As far as I know C# and JavaScript does the same. The max size of strings is often limited by the max size of integers, typically 2^31 - 1 which is just over 2 billion.

So, a string …
  • is anything between 0 and 2 billion in length, 
  • can contain 100,000 different characters, and 
  • can be null.
Hardly a good spec for HTTP headers.

HTTP Headers By the Spec

RFC 2047 gives us the formal specification of how HTTP headers should look. An excerpt will suffice for our discussion.

message-header = field-name ":" [ field-value ]
       field-name     = token
       field-value    = *( field-content | LWS )
       field-content  = <the OCTETs making up the field-value
                        and consisting of either *TEXT or
                        combinations 
of token, separators, and
                        quoted-string>

token          = 1*<any CHAR except CTLs or separators>

CHAR           = <any US-ASCII character (octets 0 - 127)>

CTL            = <any US-ASCII control character
                        (octets 0 - 31) and DEL (127)>

separators     = "(" | ")" | "<" | ">" | "@" |
                 "," | ";" | ":" | "\" | <"> |
                 "/" | "[" | "]" | "?" | "=" |
                 "{" | "}" | SP  | HT

LWS            = [CRLF] 1*( SP | HT )

CRLF            = CR LF

OCTET          = <any 8-bit sequence of data>

TEXT           = <any OCTET except CTLs,
                        but including LWS>

Let's summarize.
  • HTTP header names can consist of ASCII chars 32-126 except 19 chars called separators.
  • Then there shall be a colon.
  • Finally the header value can consist of any ASCII chars 9, 32-126 except 19 chars called separators … or a mix of tokens, separators, and quoted strings.
  • On top of this web servers such as Apache impose length constraints on headers, somewhere around 10,000 chars.
There's clearly a huge difference between just a string and RFC 2047.

The Dangers of Unvalidated HTTP Headers

Can this go wrong? Is there any real danger in using plain strings for setting HTTP headers? Yes. Let's look at HTTP response splitting as an example.

We have built a site where an optional URL parameter tells the server which language to use.

www.example.com/?lang=Swedish

… redirects to …

www.example.com/

… with a custom header telling the web client to use Swedish. After all, we don't want that language parameter pestering our beautiful URL the rest of the session.

So in the redirect response we do the following:

response.addHeader("Custom-Language",
                   request.getParameter("lang"));

The result is an HTTP response like this:

HTTP/1.1 302 Moved Temporarily
Date: Wed, 24 Dec 2013 12:53:28 GMT
Location: http://www.example.com/
Set-Cookie: 
JSESSIONID=1pMRZOiOQzZiE6Y6iivsREg82pq9Bo1ape7h4YoHZ62RXj
ApqwBE!-1251019693; path=/
Custom-Language: Swedish
Connection: Close

But what if the request looks like this (%0d is carriage return, %0a is linefeed):

www.example.com/?lang=foobar%0d%0aContent-Length:%200%0d%0a%0d%0aHTTP/1.1%20200%20OK%0d%0aContent-Type:%20text/html%0d%0aContent-Length:%2019%0d%0a%0d%0a<html>Well, hello!</html>

That would generate the following HTTP response (linefeeds included):

HTTP/1.1 302 Moved Temporarily
Date: Wed, 24 Dec 2013 15:26:41 GMT
Location: http://www.example.com/
Set-Cookie: 
JSESSIONID=1pwxbgHwzeaIIFyaksxqsq92Z0VULcQUcAanfK7In7IyrCST
9UsS!-1251019693; path=/
Custom-Language: foobar
Content-Length: 0

HTTP/1.1 200 OK
Content-Type: text/html
Content-Length: 19 
<html>Well, hello!</html>
Content-Type: text/html

… which will be interpreted as two responses by the web browser. This is an example of the security attack called HTTP response splitting (link to WASC from where I've adapted my example). And that's just one of the dangers of letting users mess with headers. Setting or deleting cookies is another. In fact, the whole header section is in danger.

The HTTP splitting vulnerability has been fixed under the hood in at least Tomcat 6+, Glassfish 2.1.1+, Jetty 7+, JBoss 3.2.7+. (Thanks for that info, Jeff Williams.)

Should We Fix the addHeader() API?

Now we can ask ourselves two different things. The first is – should we fix the addHeader() and related APIs? Yes. They should look something like this:

void addHeader(javax.servlet.http.HttpHeaderName name,
               javax.servlet.http.HttpHeaderValue value)

… where the two domain classes HttpHeaderName and HttpHeaderValue accept strings to their constructors and validate that the strings adhere to the RFC 2047 specification. In one blow all Java developers are relieved of the burden to write that validation code themselves and relieved of always having to remember running it.

Should String Be An Abstract Class?

The larger question is about strings in general. Yes, they are super convenient. But we're fooling ourselves. We think the time we save by not modeling our domain, by not writing that validation code, by not narrowing down our APIs to do exactly what they're supposed to, we think that time is better spent on other activities. It's not.

I truly believe nothing is just a string. Nothing is any of 100,000 characters and anything between 0 and 2 billion in length.

Therefore String should be an abstract class, forcing us developers to subtype and think about what we're really handling.

Even better, why not have a way to declare that a class can only be used in object composition? That way programmers could choose if an "is-a" relation or a "has-a" relation is most suitable for narrowing down the String class.

May 13, 2013

Introduction to Software Security

April 22, 2013 I successfully defended my PhD in computer science, more specifically in the area of software security [fulltext pdf]. I thought I'd share some parts of the thesis in a more digestible format and allow myself to augment our results, comment, and have opinions, things you typically don't see in academic publications.

Let's start with my introductory chapter …


The cover.

``To put it quite bluntly: as long as there were no machines, programming was no problem at all; when we had a few weak computers, programming became a mild problem, and now we have gigantic computers, programming has become an equally gigantic problem. In this sense the electronic industry has not solved a single problem, it has only created them, it has created the problem of using its products.''
–Edsger W.Dijkstra, The Humble Programmer, 1972

Computer software products are among the most complex artifacts, if not the most complex artifacts mankind has created (see Dijkstra's quote above). Securing those artifacts against intelligent attackers who try to exploit flaws in software design and construct is a great challenge too.

Our research contributes to the field of software security. Software as an artifact meant to interact with its environment including humans. Security in the sense of withstanding active intrusion attempts against benign software.

Software Vulnerabilities

Software can be intentionally malicious such as viruses (programs that replicate and spread from one computer to another and cause harm to infected ones), trojans (malicious programs that masquerade as benign) and software containing logic bombs (malicious functions set off when specified conditions are met).

However, attacks against computer systems are not limited to intentionally malicious software. Benign software can contain vulnerabilities and such vulnerabilities can be exploited to make the benign software do malicious things. A successful exploit has traditionally been the same as an intrusion. But in the era of web application vulnerabilities that term is not used as often. Nevertheless, a successful cross-site scripting attack (XSS) can be seen as executing arbitrary code inside the web application. And arbitrary code execution in a web application may very well be of high impact if the application handles sensitive information (password fields, credit card numbers etc) or is authorized to do sensitive state changes on the server (money transfers, profile updates, message posting etc). I would therefore argue that XSS is an intrusion attack.

Vulnerabilities can be responsibly reported to the public by creating a so called CVE Identifier – a unique, common identifier for a publicly known information security vulnerability. Identifiers are created by CVE Numbering Authorities for acknowledged vulnerabilities. Larger software vendors typically handle identifiers for their own products. Some of these participating vendors are Apple, Oracle, Ubuntu Linux, Microsoft, Google, and IBM.

The National Institute of Standards and Technology (NIST) has a statistical database over reported software vulnerabilities with a publicly accessible search interface. Two specific types of vulnerabilities are of specific interest in the context of our research, namely buffer overflows and format string vulnerabilities in software written in the programming language C. The statistics for Buffer Errors and Format String Vulnerabilities are shown below.



Reported software vulnerabilities due to buffer errors have increased significantly since 2002. Their percentage of the total number of reported vulnerabilities has also increased from 1-4 % between 2002 and 2006 to 10-16 % between 2008 and 2012. These statistics are in stark contrast to the statistics from CERT that Wagner et al used to show that buffer overflows represented 50 % of all reported vulnerabilities in 1999 [pdf]. We have not investigated if there are significant differences in how the two statistics were produced. Still, up to 16 % of all reported vulnerabilities is a significant number.

The reported format string vulnerabilities peaked between 2007 and 2009 but have never reached 0.5 % of the total. Our experience is that format string vulnerabilities are less prevalent, easier to fix, and harder to exploit than buffer overflow vulnerabilities. Nevertheless format string vulnerabilities are still being used for exploitation such as the Corona iOS Jailbreak Tool.

Avoiding Software Intrusions

Intrusion attempts or attacks are made by malicious users or attackers against victims. A victim can be either a machine holding valuable assets or another human computer user. Securing software against intrusions calls for anti-intrusion techniques as defined by Halme and Bauer. We have taken the liberty of adapting and reproducing Halme and Bauer's figure showing anti-intrusion approaches, see below.


  1. Preempt – strike offensively against likely threat agents prior to an intrusion attempt. May affect innocents.
  2. Prevent – severely handicap the likelihood of a particular intrusion’s success.
  3. Deter – increase the necessary effort for an intrusion to succeed, increase the risk associated with an attempt, and/or devalue the perceived gain that would come with success.
  4. Deflect – leads an intruder to believe that he or she has succeeded in an intrusion attempt, whereas in fact the intrusion was redirected to where harm is minimized.
  5. Detect – discriminate intrusion attempts and intrusion preparation from normal activity and alert the operations. Detection can also be done in a post mortem analysis.
  6. Actively countermeasure – counter an intrusion as it is being attempted.

Avoiding the Vulnerabilities

There are many ways to achieve more secure software, i.e. avoiding to have vulnerabilities. Microsoft's Security Development Lifecycle (SDL) defines seven phases where security enhancing activities and technologies apply:

  1. Training
  2. Requirements
  3. Design
  4. Implementation
  5. Verification
  6. Release
  7. Response

Further things can be done in an even wider scope. Programming languages can be constructed with security primitives which allow programmers to express security properties of the system they are writing – so called security-typed languages, a part of language-based security [pdf]. Operating systems and deployment platforms can be hardened and secured both in construction and configuration.

Our research objectives have been on the Requirements and Implementation phases of Microsoft's SDL and on hardening of the runtime environment for software applications. Want to know what we found out? Stay tuned for upcoming posts where we dive into the details of our studies.

Jan 6, 2013

Review of The Tangled Web by Michal Zalewski


My family and I spent Christmas and New Year abroad so I got the chance to do some reading. Sitting in my bookshelf for far too long it was time to take on The Tangled Web by Michal Zalewski.


The Best Book on Web Application Security

Let me start by saying The Tangled Web is the best text I've read on web application security. Yes, that includes blog posts, articles, and papers. This is a must-read for any technical OWASPer or the like.

The book covers a lot of ground, serves nice examples, and shows why Michal is one of the most highly regarded experts in the app sec field. Additionally Chris Evans served as technical reviewer and people like Adam Barth and Tavis Ormandy are frequently referenced in the text which means much of Google's best security people made this book what it is.

Detailed But Soon Dated

Most of The Tangled Web is spent on browser security, meaning how browsers implement web standards and de facto standards. Michal has an in-depth knowledge in both the history and the current state of how browsers work and why. This means you'll get plenty of browser quirks and curious differences between browsers and browser versions as well as up and coming features.

The downside of this is that the book quickly gets dated. Several of the features presented as forthcoming or WebKit-only are in fact available and in use today, a year after the book was published. As a reader you're either aware of this or will have to check what has changed.

For this book to be the reference I'd love to repeat, Michal will have to update it. And if he didn't plan for that ahead I'm guessing a second edition will be a major undertaking and not as rewarding for him as the first edition.

So my advice is to read it now while it's still highly relevant. Then hope for new editions with release notes on what has changed. Otherwise The Tangled Web will become a historic exposé of web security.

More For Security Pros Than For Developers

While the subtitle of the book says "A Guide To Securing Modern Web Applications" it's more geared toward security professionals and pentesters than toward developers. And in my experience the developers are the ones who need to secure the applications where security professionals either review proposals or prove that the system is not secure enough.

I'm not saying The Tangled Web is a guide on how to break web apps. On the contrary every chapter has a "Security Engineering Cheat Sheet" and the text is mostly about how to avoid pitfalls. But developers (such as myself) always look for concrete ways to test our code and Michal does not provide guidance on how to do that nor suggests data sets for security testing. If you read between the lines you start getting ideas of good test data yourself so the information is probably in Michal's head. I'm not looking for pen test guidance but for security unit and integration testing guidance.

To be concrete – I would have liked a description and an online reference on a data set for unit testing my URL parser(s). Michal's coverage of problems in parsing URLs is great so you are totally ready to put your code to the test after reading it.

From a developer's perspective The Tangled Web is more of an exposé of web security problems than a guide on how to secure your apps.

My Technical Takeaways

Here's a glimpse of what I underlined and took notes of in the book.

  • Newline handling quirks. Michal's coverage gives new life to header injection or response splitting, a subject not discussed too often these days. Especially worrying is the discrepancy between how Apache and IIS handle a lone CR and how browsers handle it.
  • Attacker-controlled 401 Unauthorized responses. Any resource hosted by the attacker (images etc) can in an instance be configured to respond with 401 and provoke a Basic Authentication dialog to appear in the browser. The victim(s) have no chance of seeing which origin is asking for credentials and thus will believe it's the main site.
  • HTML parsing behavior. Although people like Mario Heiderich and Gareth Heyes regularly treat me to new parsing flaws in browsers I really liked Michal's coverage of the subject.
  • Multiple parsing steps of inline JavaScript. User input in nested JavaScript such as event handlers or setTimeouts is almost impossible to get right in a complex application. Michal shows why with an example where the user input has to first be double encoded using JavaScript backslash sequences and then encoded again using HTML entities. In that exact order.
  • Cookie problems with SOP and country-level TLDs. Some countries require example.co.[country code] domains for businesses. Others allow example.[country code]. Japans allows both. This messes up the restrictions for how host can be set for cookies. They aren't allowed to be set for *.com but what about *.com.pl? With the idea of arbitrary generic TLDs we will head back into the dark ages with cookie leakage and overwriting.
  • The X-Content-Type-Options: nosniff response header. Should be default on all your content HTTP responses unless you actually want browsers to try to sniff content type and do all kinds of dangerous interpretation of untyped responses. As of 2011 only 0.6% of top 10,000 sites use this header.

Things I Miss in the Book

While weighing in on 268 pages there are some things I expected to see in there but didn't.

Parameter pollution

It is mentioned but no proper coverage is provided around parameter pollution. Both HTTP and JSON are susceptible to multiple instances of parameters. HTTP parameters, HTTP cookies, and JSON padding (not JSONP) are examples. I would like a general discussion on this topic and its implications for the server-side.

DOM clobbering

Again some parts of this topic are mentioned but there is no real coverage. I would like to read Michal's take on global DOM ids and CSS classes being generated, injected, or mistakingly duplicated in mashups. He does reference Heiderich et al's "Web Application Obfuscation" though.

CSRF in-depth

Cross-site request forgeries are mentioned very briefly. I believe there's room for a few pages on the many nuances such as GET vs POST, Referer header reliance, blindness of the attack, multi-step CSRF, and how CORS eases the attack scenarios.

Script inclusion from outside the app

Working with CSP you quickly see that JavaScript gets included from several places outside the application. ISPs, hotels, and venues add their stuff via proxies. And browser plugins add JavaScript in the browser which means not even HTTPS will help you avoid it. The dangers and diversity of this situation is not covered in the book.

Handling of legacy web

Many organizations are in a legacy state with ten year old asp, jsp, or php sites. There are techniques for moving such applications forward but the book doesn't cover them at all. On the contrary Michal specifically advices against a technique that works wonders in my experience – sandboxing same-origin legacy content in iframes to allow for CSP and clean global footprint in new code.

Security of infrastructure and frameworks

The frontend technology stack is overflowing of frameworks and micro frameworks such as jQuery, Bootstrap, Backbone, Ext JS, and Dojo. Many of them offer flawed security controls. Even more are plagued with insecure defaults. Some are making steady progress in security whereas others bluntly ignore even reported flaws. This whole situation is very tangible on the web today but not covered in the book. A detailed guide on top frameworks is probably out of scope but given the prevalent use of frameworks something should be written on the topic.

Architectural considerations

The web has interesting aspects such as the typical mix of programming languages within the system (fronted and backend), the loose and untyped coupling between server and client, and problems of mixing code, content, and style. This has bearing on security. System-wide static analysis is unheard of for one. Versioning and validity checksums are super hard to get right. I would love to read Michal's thoughts on where we are and where we should be headed in terms of secure architecture on the web.

The Best Part – The Epilogue

While the technical parts of this book (95 % that is) are really great I cannot help but think Michal's epilogue was the best part. It's short and not in the least the kind of self-indulging stuff you typically come across. Instead Michal challenges the whole security industry and academia. Are we really helping society with our paranoia and foil hats? Or are we a breed of IT pros about to be extinct? After all, no other part of mankind or society is "secure". It's all about trust and the tradeoff between development and risk.

I think Michal is on to something important and it makes me happy I decided long ago to go 70 % development and 30 % security. That's the app sec productivity sweet spot in my opinion.


Disclaimer: I got a free copy of the book from the publisher.