Home

Handling Relative URLs for Redirects / Forwards

☕️ 7 min read

Introduction

The purpose of this post is to explore an approach to handling relative URLs safely for redirects and forwards. Many web security vulnerabilities that originate from unvalidated redirects and forwards are often remediated by restricting URLs. This restriction usually takes the form of an allow-list of known good absolute URLs in some capacity. See OWASP Validating URLs or Google Open Redirect for examples of this. Unfortunately, not all applications can adopt default allow-listing approach because the absolute URL may not be known ahead of time. This can cause friction as the one-size-fits all approach does not always work.

Notably, all of the code examples in this post can be found in the supporting url-parsing github repo.

Background

Objectively, URL parsing is difficult. There are many individual components that comprise a URL, and how each component interacts with one another can be confusing. For example, authority delegation in a URL goes way outside the scope of the average user. Orange Tsai presented A New Era of SSRF at Black Hat USA 2017 highlighting some of the problems that can arise, 10/10 research would recommend reading.

Basic URL Structure

The syntax and semantics of a URI are intentionally broad to create an extensible means for identifying resources. This introduces ambiguity as there are inconsistencies between URL parsers and the RFC2396 / RFC3986 specifications. WHATWG defined a contemporary implementation based on these specifications forming a standard. The following comporises URL Strings and URL Objects in JavaScript.

┌─────────────────────────────────────────────────────────────────────────────────────────────┐
│                                            href                                             │
├──────────┬──┬─────────────────────┬─────────────────────┬───────────────────────────┬───────┤
│ protocol │  │        auth         │        host         │           path            │ hash  │
│          │  │                     ├──────────────┬──────┼──────────┬────────────────┤       │
│          │  │                     │   hostname   │ port │ pathname │     search     │       │
│          │  │                     │              │      │          ├─┬──────────────┤       │
│          │  │                     │              │      │          │ │    query     │       │
"  https:   //    user   :   pass   @ sub.host.com : 8080   /p/a/t/h  ?  query=string   #hash "
│          │  │          │          │   hostname   │ port │          │                │       │
│          │  │          │          ├──────────────┴──────┤          │                │       │
│ protocol │  │ username │ password │        host         │          │                │       │
├──────────┴──┼──────────┴──────────┼─────────────────────┤          │                │       │
│   origin    │                     │       origin        │ pathname │     search     │ hash  │
├─────────────┴─────────────────────┴─────────────────────┴──────────┴────────────────┴───────┤
│                                            href                                             │
└─────────────────────────────────────────────────────────────────────────────────────────────┘

To demonstrate this at a code level, a URL can be parsed and accessed through a convinient object as seen below:

const { URL } = require('url');
var url = 'https://user:pass@sub.host.com:8080/p/a/t/h?query=string#has'
var newUrl = new URL(url);
console.log(newUrl)

Printing the URL object as was done above gives you an easy structure to access individual parsed URL components. Fortuantely this handles almost all of the heavy lifting for you.

URL {
  href: 'https://user:pass@sub.host.com:8080//p/a/t/h?query=string#has',
  origin: 'https://sub.host.com:8080',
  protocol: 'https:',
  username: 'user',
  password: 'pass',
  host: 'sub.host.com:8080',
  hostname: 'sub.host.com',
  port: '8080',
  pathname: '//p/a/t/h',
  search: '?query=string',
  searchParams: URLSearchParams { 'query' => 'string' },
  hash: '#has'
}

Relative URL Structure

Now that we have had a brief refresher on generic URL structure lets drill into the relative portion of the URL. The RFC3986 - URI Genric Syntax defines a Relative Reference as:

A relative reference takes advantage of the hierarchical syntax to 
express a URI reference relative to the name space of another 
hierarchical URI.

relative-ref  = relative-part [ "?" query ] [ "#" fragment ]

relative-part = "//" authority path-abempty
                / path-absolute
                / path-noscheme
                / path-empty

The URI referred to by a relative reference, also known as the target
URI

A relative reference that begins with two slash characters is termed
a network-path reference; such references are rarely used.  A
relative reference that begins with a single slash character is
termed an absolute-path reference.  A relative reference that does
not begin with a slash character is termed a relative-path reference.

For clarification, the “authority” part above refers to RFC3986 - Authority. As seen below, the authority is defined as:

   Many URI schemes include a hierarchical element for a naming
   authority so that governance of the name space defined by the
   remainder of the URI is delegated to that authority (which may, in
   turn, delegate it further).  The generic syntax provides a common
   means for distinguishing an authority based on a registered name or
   server address, along with optional port and user information.

   The authority component is preceded by a double slash ("//") and is
   terminated by the next slash ("/"), question mark ("?"), or number
   sign ("#") character, or by the end of the URI.

      authority   = [ userinfo "@" ] host [ ":" port ]

   URI producers and normalizers should omit the ":" delimiter that
   separates host from port if the port component is empty.  Some
   schemes do not allow the userinfo and/or port subcomponents.

   If a URI contains an authority component, then the path component
   must either be empty or begin with a slash ("/") character.  Non-
   validating parsers (those that merely separate a URI reference into
   its major components) will often ignore the subcomponent structure of
   authority, treating it as an opaque string from the double-slash to
   the first terminating delimiter, until such time as the URI is
   dereferenced.

To demonstrate the text above, the following code snippet from Mozilla demonstrates the concepts well.

Full URL

https://developer.mozilla.org/en-US/docs/Learn

Implicit protocol

//developer.mozilla.org/en-US/docs/Learn

In this case, the browser will call that URL with the same protocol as the 
one used to load the document hosting that URL.

Implicit domain name

/en-US/docs/Learn

This is the most common use case for an absolute URL within an HTML document. 
The browser will use the same protocol and the same domain name as the one 
used to load the document hosting that URL. Note: it isn't possible to omit 
the domain name without omitting the protocol as well.

Attacking Implementations

If this problem wasn’t complicated enough, browsers also add some additional complexity. Modern browsers automatically convert back slashes (\) into forward slashes (/) despite this being against RFC3986 - URI Genric Syntax. In addition, the @ character can be used to define a target host redirecting the victim to a new domain, this type of attack is defined as Semantic Attacks.

The dangerous characters and encoded versions can be seen below:

127.0.0.1:3000?nextUrl=//nikola.dev
127.0.0.1:3000?nextUrl=/%2Fnikola.dev
127.0.0.1:3000?nextUrl=%2F%2Fnikola.dev
127.0.0.1:3000?nextUrl=\\nikola.dev
127.0.0.1:3000?nextUrl=\%5Cnikola.dev
127.0.0.1:3000?nextUrl=%5C%5Cnikola.dev

Interestingly, the \ and / characters (and URL encoded equivalents) can repeat and are interchangable. The following is a valid payload:

http://127.0.0.1:3000/?nextUrl=/%5C/%5C/\%2F\/\%2F\/\%2F\/nikola.dev

Attackers can use this to bypass filters also depending on the underlying logic, for example if the nextUrl must have example.com this can be bypassed:

127.0.0.1:3000?nextUrl=//example.com%40nikola.dev
127.0.0.1:3000?nextUrl=//example.com@nikola.dev

Fortunately, as defined in the WHATWG Goals, if a url contains percent-encoded bytes it returns percent-decode. This means we do not need to worry about the percent encoded versions as they are canonicalised by the library on our behalf. An example of this can be seen below:

node app.js
Server running at http://127.0.0.1:3000/

URL Requested
Raw url: /?nextUrl=/nikola.dev
Parsed nextUrl parameter: /nikola.dev

URL Requested
Raw url: /?nextUrl=%2Fnikola.dev
Parsed nextUrl parameter: /nikola.dev

Recommended Approach

Much like any untrusted user input, relative URLs should be canonicalised, sanitised, and then validated - in that order. Canonicalisation and sanitation should be done through established URL parsing libraries such as URL Node package that follow the WHATWG standard. The output of these operations should then be validated using a strict pattern, only allowing required characters. Dangerous characters such as @, # and multiple / characters should not be on the allow list. An example of such validation can be seen below:

exports.isUrlValid = function (relUrl) {

    console.log("Handling: " + relUrl)

    var re = new RegExp("^(\/[a-zA-Z0-9]+){0,}$");

    if (re.test(relUrl)) {
      return true
    } else {
      return false
    }
  };

Where possible handle absolute URLs to avoid introducing unnecessary complexity, OWASP Validating URLs is a great resource on such solutions.

Conclusion

This post presented a novel approach to handling relative URLs for redirects and forwards. The examples presented are by no means comprehensive, this is just touching the surface of the problem. However, the examples demonstrated did give insight into the approach attackers take when examining forward / redirect logic. All of the code examples in this post can be found in the supporting url-parsing github repo.

References: