September 10, 2011

SSL vs. "Referers": Friend or Foe?

In a recent posting in some other venues, I noted with pleasure that Google is now testing the use of "SSL by default" for Google Search.

In passing, I very briefly touched on the implications of SSL for "referer" data that is traditionally passed along to Web sites when a user clicks a link.

I received a surprisingly high level of diametrically opposed reactions. On one side, people were saying, "Good riddance! Referers are privacy invasive and never should have been implemented in the first place!"

On the other hand, I also got many messages with claims along the lines of, "This is just Google's attempt to ruin my analytics -- they don't really care about privacy."

The latter assertion is the easier to address. I've been talking to Google folks for years about SSL issues, and there has been a consistent desire to move their services toward this protection on a default basis (as they've already done with Gmail and Google+). The collateral impact on referers has been an issue of concern all along, and possible workarounds such as enhanced Webmaster Tools data and other techniques have always been part of the discussions.

But the still largely status quo of "postcard security" data on the Internet, where any entity -- commercial, government, or other -- that has access to a data stream can read most information in the clear, has become intolerable, and securing these paths to the extent practicable must be viewed as an important priority. For now, SSL is a practical means to that end.

The "Good Riddance" reaction probably needs a bit more exploration.

Let's remember what "referers" (typically misspelled in this manner due to an original misspelling in the HTTP specifications) really do.

When a user views info on a Web site, the associated site's logs will typically record a variety of data regarding the connection, including source IP address, various browser-related configuration information, and other information -- most notably for our discussion the referer.

The referer is the URL of the page that contained the link that the user clicked to reach the destination site -- the page that "referred" the user. In the case of a search results page, that referer will usually include the user's search query as embedded in the URL itself.
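To make the "embedded in the URL" point concrete, here is a minimal sketch of how a destination site might pull search terms out of a referer. It assumes the search engine carries the query in a URL parameter named "q" (other engines may use a different name), and the function name and sample URL are purely illustrative.

```python
from urllib.parse import urlparse, parse_qs


def search_query_from_referer(referer, param="q"):
    """Return the search terms embedded in a referer URL, or None.

    Assumption: the query lives in the URL parameter named by `param`
    ("q" here); this is illustrative, not a universal convention.
    """
    if not referer:
        return None
    # parse_qs decodes "+" and percent-escapes and returns lists of values.
    params = parse_qs(urlparse(referer).query)
    values = params.get(param)
    return values[0] if values else None


# Hypothetical referer as it might appear in a site's log.
example = "http://www.google.com/search?q=ssl+referer&hl=en"
print(search_query_from_referer(example))
```

If the referer is missing, or has no such parameter, the function simply returns None, which is what a site would see for clicks arriving without referer data.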

However, when a user click arrives via a site that was viewed through SSL, the information that would otherwise normally have been relayed (like the referer) will usually no longer appear. Note however that the IP address of the user will still be present.

The passing of referer information is a function not only of the sites involved but also of the user's browser. Various browser extensions and plugins have long existed that allow users to optionally block referers if they wish.

There are various reasons why referers were originally implemented. One important one was to aid in session sequencing, since knowing the full URL of the previous page -- that referring page -- could be useful to maintaining session transactional states, especially in the absence of more advanced methodologies that would further evolve later.

Some critics of referers make the claim that only "snooping businesses" are interested in such data, and so cutting it off would harm nobody of real merit.

But this really is not true. I believe if you took a poll, you'd find that the vast majority of Web site operators -- including nonprofits, individuals, and so on, not just commercial enterprises -- use referer data to better understand what people find to be of interest on their sites, and to have some sense of how their sites are being referenced by the broader world.

I know that I find this data to be of significant interest, and I don't run any ads or other monetizing elements on my blog. While there are other ways to discover relevant links over time, being able to see immediately when there's a "flood" of hits referring from a particular site (e.g., a Slashdot posting!) can be very important not just as a point of knowledge but from a site management standpoint as well. Visible search terms in referers tell me what issues from my postings are of particular worth to readers, and help me determine followups and future emphasis.
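As a rough illustration of spotting such a "flood," the sketch below tallies referring hostnames from a list of referer strings. It assumes the referers have already been extracted from the logs, and that a missing referer is recorded as an empty string or "-"; the function name and sample data are hypothetical.

```python
from collections import Counter
from urllib.parse import urlparse


def top_referring_hosts(referers, limit=3):
    """Count hits per referring hostname, most frequent first."""
    # Skip entries with no referer (assumed to be "" or "-").
    hosts = (urlparse(r).hostname for r in referers if r and r != "-")
    return Counter(h for h in hosts if h).most_common(limit)


# Hypothetical sample of referers from a log.
sample = [
    "http://slashdot.org/story/1",
    "http://slashdot.org/story/1",
    "-",
    "http://example.com/a",
    "",
]
print(top_referring_hosts(sample))
```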

Could I continue posting new items if all log referers suddenly vanished? Sure. It would mean switching to more limited tools that were less real-time in nature, like retrospective searching and such, to try to understand the dynamics of users viewing my site, but the fundamental ability to run my blog would of course not be significantly undermined.

But there would be a notable diminishing of the "value proposition" between readers and the site.

While you may never have thought of them in this way, referers can be viewed as something of an "equalizing" agent between large and small Web sites.

When you conduct a search on a search engine, that site obviously knows your query, so that they can provide you with a list of results. You then usually visit sites based on that list, and (hopefully) obtain the information of interest. This transaction -- that typically occurs without your being charged any fee by either party -- still has real value.

Questions: Is it unreasonable for the site that actually provides the information that answers your query, to see the same data (the search query itself) that the search engine itself had? The search engine must have the query to process your request, and can use this information to improve its search results over time. Is it reasonable to argue that the actual content site should have the same opportunity to improve its services through the use of this data?

These questions can certainly be argued either way. I personally come down on the side of best possible use of data in a responsible and egalitarian manner whenever possible.

In any case, the increasingly routine and default use of SSL, with the many important benefits it brings, is likely moving the era of traditional referers toward a gradual diminution and ultimately an effective closure in many respects. Other analytical mechanisms (either existing or yet to be developed and deployed) will likely take up some of the slack, and in some cases provide even greater insights.

But perhaps of even greater importance in the long run is the reality that questions surrounding the collection and use of transactional data, even related to relatively routine operations on the Internet, can be much more complex than they might appear at first glance, and that seemingly obvious "simple" solutions (such as blanket restrictions) may actually create or exacerbate far more problems than they might solve.

This is true regardless of who is referering to ... I mean referring to ... uh, talking about these issues!


Update (9:00 PM):

When I wrote the above posting text earlier today, my intention was to highlight the complexity of these issues from a "philosophical" standpoint, not to get at all into the technical details of SSL and browsers. But some queries I've received since I posted suggest that a few more words are in order.

I'm simplifying somewhat, but the decision to send (or not send) the current referer onward with a user click is made by the user's browser itself. That is why existing browser options and extensions to control referers can function. The SSL referer pass-along prohibition exists to prevent a leak: without it, a URL obtained via an SSL connection (e.g., SSL to a search engine) would be sent as the referer on a subsequent click (like from search results) to a site that is not using SSL, exposing that URL in unencrypted ("in the clear") form.

If a "clicked-to" site (e.g., clicked from search results generated via an SSL connection to a search engine) is also using SSL, the requirement for "end-to-end" encryption is met, and a browser may (subject to any other restrictive settings or options at the browser) pass along a referer as usual.
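The rule described in the last two paragraphs can be sketched as a small decision function. This is a deliberate simplification and not any particular browser's implementation: it looks only at the URL schemes of the two pages and ignores user settings, extensions, and any other restrictions a browser may apply. The function name and URLs are hypothetical.

```python
from urllib.parse import urlparse


def would_send_referer(from_url, to_url):
    """Simplified model of the SSL referer rule.

    Withhold the referer only when the referring page was fetched over
    SSL (https) and the destination is not; otherwise send as usual.
    """
    from_secure = urlparse(from_url).scheme.lower() == "https"
    to_secure = urlparse(to_url).scheme.lower() == "https"
    return not (from_secure and not to_secure)


# SSL search page to a non-SSL site: the referer is dropped.
print(would_send_referer("https://search.example/?q=ssl", "http://blog.example/"))
# SSL search page to an SSL site: "end-to-end" encryption is met.
print(would_send_referer("https://search.example/?q=ssl", "https://blog.example/"))
```

Only the first case (SSL to non-SSL) suppresses the referer, which is why a search engine's move to default SSL affects mostly those destination sites that do not use SSL themselves.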

So we have yet another irony. As major sites convert to default SSL, especially search engines, there will be a dramatic drop-off in referers, all else being equal, since most sites don't use SSL, and appropriately deploying SSL on complex and busy sites can be a nontrivial task in various respects.

If we could flip a switch and make every site on the Internet SSL at once, the "SSL to non-SSL" ("no referer") issue essentially would not exist.

In reality though, at least for the foreseeable future, there will likely be a widening gap between major sites supporting default SSL and the vast numbers of "referred-to" smaller sites that don't. Combine this with the (in my opinion inappropriate) "demonization" of referers by various parties -- likely to affect browser defaults in this context -- and you can see why I suspect that traditional referers will be in a downward accessibility spiral, as I discussed in the main blog entry above.

I hope that this clarifies the issues at least a wee bit.


Posted by Lauren at September 10, 2011 01:30 PM | Permalink
Twitter: @laurenweinstein
Google+: Lauren Weinstein