The Well-Formed Web

Should you use Content Negotiation in your Web Services?

2003-09-06T21:54:43-05:00

Should you use Content Negotiation when building your web service? The short answer is no. There are definite problems with conneg, and I can give some examples of problems I have run into and also point to problems others have run into.

First let's back up and explain Content Negotiation. Your browser is a generic display program and can take in various kinds of media, such as HTML, JPEGs, CSS, Flash, etc., and display them for you. The first thing to note is that each of those kinds of media has a different mime-type. Each format has its own registered mime-type, and when a client does a GET on a URL it gets back not only the content but also a Content-Type: header, which lists the mime-type of what is in the body.

One of the interesting things about HTTP is that it allows the same URI to have multiple representations. For example, I could have a URL that had both text/plain and text/html representations. Now that leads to two obvious questions.

How does the server know which representation to serve?
How can the browser influence the server's choice to get something it can handle?

Let's start by answering question two first. The browser uses the Accept: header to list out the mime-types that it is willing to accept. There is also a weighting scheme that allows the client to specify a preference for one media type over another. For example, here is the capture of some of the headers, including the Accept: header, sent by Mozilla when it does a GET on a URI:

Accept: text/xml,application/xml,application/xhtml+xml,\
    text/html;q=0.9,text/plain;q=0.8,video/x-mng,\
    image/png,image/jpeg,image/gif;q=0.2,*/*;q=0.1
Accept-Language: en-us,en;q=0.5
Accept-Encoding: gzip,deflate,compress;q=0.9
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7

The Accept: header lists the mime-types that the browser can handle along with weights of the form q=, where the argument is a floating point number between 0 and 1. The weights indicate a preference for that media type, with a higher number indicating a higher preference. A type listed without a q= weight gets the default weight of 1. Note that there are several bits of complexity I am going to ignore for now. The first is the last type the Mozilla browser says it can accept, */*;q=0.1. This is a wildcard match, which will match any mime-type that the server could want to serve up. The second is that there are multiple Accept headers: one for language, one for encoding, another for charset. How these overlap and influence the response sent won't be covered here.
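To make the weighting concrete, here is a minimal Python sketch that splits an Accept: header into media types and q values. The function name parse_accept is made up for this illustration, and the parsing is deliberately simplified (it ignores parameters other than q and does no validation), so treat it as a sketch rather than a complete implementation of the HTTP rules.

```python
def parse_accept(header):
    """Return (media_type, q) pairs sorted from most to least preferred."""
    prefs = []
    for item in header.split(","):
        parts = [p.strip() for p in item.split(";")]
        media_type = parts[0]
        if not media_type:
            continue
        q = 1.0  # a missing q parameter means the default weight of 1
        for param in parts[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # assumption: treat an unparsable weight as 0
        prefs.append((media_type, q))
    # sorted() is stable, so ties keep the order the client sent them in
    return sorted(prefs, key=lambda pair: pair[1], reverse=True)


# The Accept: header from the Mozilla capture above, joined onto one line.
MOZILLA_ACCEPT = (
    "text/xml,application/xml,application/xhtml+xml,"
    "text/html;q=0.9,text/plain;q=0.8,video/x-mng,"
    "image/png,image/jpeg,image/gif;q=0.2,*/*;q=0.1"
)

for media_type, q in parse_accept(MOZILLA_ACCEPT):
    print(media_type, q)
```

Note that in this header text/html ranks below text/xml and application/xml, because only the types carrying an explicit q= drop below the default weight of 1.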

Now to answer the first question. The server looks at the available representations it has and serves up the one with the highest preference to the client. Based on the Accept: header it sends an appropriate representation back and indicates the type it chose using the Content-Type: header.
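The selection step might look roughly like the following Python sketch. It assumes the Accept: header has already been reduced to a dictionary of media ranges and weights; choose_representation is a hypothetical name, and real servers apply more rules than this (language, encoding, server-side quality values).

```python
def choose_representation(preferences, available):
    """Pick the available media type the client weights highest.

    preferences: dict of media range -> q value, as parsed from Accept.
    available:   list of media types the server can produce, in the
                 server's own order of preference (used to break ties).
    Returns None when nothing is acceptable.
    """
    def weight(media_type):
        major = media_type.split("/")[0]
        # Most specific match wins: exact type, then type/*, then */*.
        for media_range in (media_type, major + "/*", "*/*"):
            if media_range in preferences:
                return preferences[media_range]
        return 0.0

    best, best_q = None, 0.0
    for media_type in available:
        q = weight(media_type)
        if q > best_q:  # strict comparison: the first of equal weights wins
            best, best_q = media_type, q
    return best


prefs = {"text/html": 0.9, "text/plain": 0.8, "*/*": 0.1}
print(choose_representation(prefs, ["text/plain", "text/html"]))
# A client that only sends */* gets whatever the server lists first.
print(choose_representation({"*/*": 0.1},
                            ["application/rss+xml", "application/xhtml+xml"]))
```

The second call shows how little say a client has when it sends only a wildcard: the tie is broken entirely by the server's ordering.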

This seems like a really cool and vastly underutilized feature of HTTP. It also seems particularly intriguing for web services. You could return JPEGs from that mapping service for the older client platforms, but also serve up SVG for the newer clients so they can scale and rotate their maps. What could possibly go wrong?

The first thing that could go wrong is a bug or mis-configuration on the client or the server. This has happened to me in the past. The W3C does conneg on some of their recommendations, returning either HTML or plain text based on the client's capabilities. This is fine, but one day their server was either confused or mis-configured, because it would only serve the recommendation as text/plain. I really needed the HTML form, but after trying multiple browsers from multiple locations I could only retrieve the text format. I ended up pulling the HTML version out of the Google cache.

The second problem that I ran across highlights the real core problem with conneg. I was trying to use the W3C XSLT service to do some transformations on my web pages. Now the server-side software I use to run Well-Formed Web does conneg and can return either HTML or an RSS item fragment for each URI. At the time I was serving up XHTML 1.0, which is valid XML and thus good input into an XSLT service. The way the XSLT service works is that you enter two URIs, one for the source content and the other for the XSLT sheet to apply to the source content. My transformation kept failing, and it was because of the Accept headers that the XSLT service sent when it went to retrieve the source content. My server kept returning the RSS item fragment and not the XHTML. Now this would have been fine if I wanted to apply an XSLT sheet to my RSS item fragment, but in this case I wanted it to apply to the XHTML. Note that the problem could have been completely reversed: I could have been trying to apply the XSLT to the RSS item and not to the XHTML, and my server could have returned the XHTML all the time. The crux of the problem is that when I gave the URI to the XSLT transformation service I had no way of specifying what mime-type to request. I got no chance to tweak the service's Accept: header.

Let's cover that again to clarify. If I hand you a URI only, and that URI supports conneg, then I get no control over which representation you retrieve. In the cases where you are passing a URI into a service that is later going to retrieve a representation from that URI, you really have no idea which representation it's going to get. That could mean that you end up passing your RSS feed to the W3C HTML validator, or you end up passing XHTML instead of RSS into an XSLT translator service, or you end up passing a 12MB PNG to a handheld instead of that 20KB SVG file. You end up with a problem that is hard to debug and one that wouldn't exist if each URI had only one mime-type.
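The root of this is that the choice of representation lives in a request header, not in the URI. The Python sketch below builds a request that asks for one specific mime-type; the URI is a placeholder and build_request is a name invented for the example. Whoever constructs the request can do this, but someone who only hands over a URI cannot.

```python
import urllib.request

# Placeholder URI standing in for a resource that does conneg.
URI = "http://example.org/some/entry"


def build_request(uri, media_type):
    """Build a GET request that asks for one specific representation."""
    return urllib.request.Request(uri, headers={"Accept": media_type})


request = build_request(URI, "application/xhtml+xml")
print(request.get_method(), request.full_url)
print("Accept:", request.get_header("Accept"))
# Actually sending it would be: urllib.request.urlopen(request)
```

Nothing in the string http://example.org/some/entry records that XHTML was wanted, so a third-party service given only that string falls back on its own Accept: header.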

Google2Atom

2003-11-22T01:18:42-05:00

Welcome to the Google2Atom web service. Just enter your search and your Google key below. Once you press "Search" you will get an Atom feed of the search results.

Note: The Google Key is no longer mandatory; if it's not supplied, the service will use my own key. In light of that, please feel free to use my key for experimentation, but if you start making heavy use of this service please get your own Google API Key to avoid limiting others' use of this service.

This is a REST-based reformulation of the Google API. As such it uses query parameters in a GET-based HTTP request to do the search. That is, it works just like the regular Google web page, but this form returns a well-formed XML document instead of a web page. Why is this better?

Simplicity: It works just like the Google web page, so it is conceptually easier to understand.
Composability: Since the request is just a simple GET the results of a query can be composed with other web services. For example, the results could be transformed using XSLT or fed into a validator.
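Because the whole request is a URI, building one takes only a few lines. The Python sketch below is illustrative only: the endpoint URL and the parameter names q and key are hypothetical, since the real ones are defined by the form on this page.

```python
from urllib.parse import urlencode

# Hypothetical endpoint and parameter names, for illustration only.
BASE_URL = "http://example.org/cgi-bin/google2atom"


def search_url(query, key=None):
    """Return a GET URL for a search; the key is optional."""
    params = {"q": query}
    if key:
        params["key"] = key
    return BASE_URL + "?" + urlencode(params)


print(search_url("well-formed web"))
```

The resulting URI is all that another service, such as an XSLT transformer or a validator, needs in order to consume the results.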

Bonus Features

One feature found in this interface that is not found in the original Google API is the well-formedness of the results content. PyTidy is used to transform the HTML snippets from the Google API into well-formed XML and place those into 'content' elements with type='text/html' and mode='xml'.

Colophon

Google2Atom is written in Python and uses both the pyTidy and pyGoogle libraries.

wfw namespace elements

2003-10-10T13:11:46-05:00

The wfw namespace, http://wellformedweb.org/CommentAPI/, contains multiple elements. As more are added in various places I will endeavor to keep the list here updated.

wfw:comment: The first element to appear in this namespace is comment. This element appears in RSS feeds and contains the URI that comment entries are to be POSTed to. The details of this are outlined in the CommentAPI Specification.
wfw:commentRss: The second element to appear in the wfw namespace is commentRss. This element also appears in RSS feeds and contains the URI of the RSS feed for comments on that Item. This is documented in Chris Sells' Specification. Note that for quite a while this page has had a typo and erroneously referred to this element as 'commentRSS' as opposed to the correct 'commentRss'. Feed consumers should be aware that they may run into both spellings in the wild. Please see this page for more information.
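A feed consumer can cope with both spellings by checking for each in turn. The following Python sketch uses a made-up RSS item with placeholder example.org URIs; the function name comment_uris is likewise just for illustration.

```python
import xml.etree.ElementTree as ET

WFW = "http://wellformedweb.org/CommentAPI/"

# Made-up RSS item; the example.org URIs are placeholders.
ITEM = """<item xmlns:wfw="http://wellformedweb.org/CommentAPI/">
  <title>Sample entry</title>
  <wfw:comment>http://example.org/comments/1</wfw:comment>
  <wfw:commentRSS>http://example.org/comments/1/feed</wfw:commentRSS>
</item>"""


def comment_uris(item):
    """Return (post_uri, feed_uri) for an RSS item element; None if absent."""
    post_uri = item.findtext("{%s}comment" % WFW)
    feed_uri = None
    # Try the correct spelling first, then the misspelled variant.
    for name in ("commentRss", "commentRSS"):
        feed_uri = item.findtext("{%s}%s" % (WFW, name))
        if feed_uri is not None:
            break
    return post_uri, feed_uri


print(comment_uris(ET.fromstring(ITEM)))
```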

The HTTP verb PUT under Apache: Safe or Dangerous?

2003-08-23T00:45:25-05:00

"Is the HTTP verb PUT under Apache safe or dangerous?" This is a question I come across often, and have now run into it twice in the work on Atom. So is it safe? The answer is maybe.

Here are two such examples:

Using DELETE and PUT may be the "right thing to do" in an ideal world, but the fact of the matter is that a lot -- if not the vast majority -- of webservers do not allow these operations.

If anyone knows of a newer article describing HTTP PUT with apache, I would be very interested in seeing it. Because, due to my experience with PUT, you have to define a single PUTScript in httpd.conf, and if you PUT something to an apache server at the URI www.example.com/blog/entries/1 or something similar, apache passes all of the information to the PUTScript, not to anything else.

Both of the above quotes are from the Atom Wiki discussion of the use of PUT. A little digging reveals that the ApacheWeek article Publishing Pages with PUT is referenced most often when the danger of PUT is raised.

That ApacheWeek article does talk about the dangers of PUT and the cautions you need to follow when writing a script that does content publishing via PUT. The key part of that phrase is content publishing. That means that PUT is being used to upload arbitrary content to the server, and the client is determining via the URI where the content should be stored. Now you can imagine how this might be dangerous: for example, not correctly checking URI paths that include ../.. could let a malicious agent re-write your .bashrc.
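The kind of check a content-publishing script needs can be sketched in Python as follows. This is not taken from the ApacheWeek article; resolve_put_target and the /srv/site root are invented for the example, and a real script would also have to URL-decode the path first and deal with permissions.

```python
import os.path


def resolve_put_target(document_root, uri_path):
    """Map a URI path to a file path, refusing anything outside the root."""
    root = os.path.realpath(document_root)
    target = os.path.realpath(os.path.join(root, uri_path.lstrip("/")))
    # After normalization the target must still live under the root.
    if os.path.commonpath([root, target]) != root:
        raise ValueError("path escapes document root: %r" % uri_path)
    return target


# /srv/site is a hypothetical document root.
for path in ("/blog/entries/1", "/../../home/user/.bashrc"):
    try:
        print("ok:", resolve_put_target("/srv/site", path))
    except ValueError as err:
        print("rejected:", err)
```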

Implementing a PUT script can be difficult and a security hazard in the context of content publishing, but that's the case because the client is choosing the target URI and the client could upload any content type. In the case of Web Services in general, and the AtomAPI in particular, PUT is used in a much narrower manner and avoids those potential security problems.

In the case of the AtomAPI, PUT is only allowed on URIs that point to a pre-existing resource. The AtomAPI follows a general idiom for editing resources: do a GET to retrieve the original XML, then a PUT on the same URI to update that resource with the edited XML. No URIs are created by doing a PUT. PUT is not accepted on arbitrary URIs. This makes the use of PUT in the context of the AtomAPI just as safe as POST.
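The rule that PUT never creates anything is simple to enforce. Here is a rough Python sketch in which a dictionary stands in for real storage; the function name and the choice of a 404 status for unknown URIs are assumptions of this example, not something the AtomAPI is quoted as specifying.

```python
def handle_put(entries, uri, body):
    """Update an existing entry; never create one.

    entries: dict mapping URI -> stored XML (a stand-in for real storage).
    Returns an HTTP status code.
    """
    if uri not in entries:
        # Assumption: unknown URIs are answered with 404. The point is
        # only that PUT never creates a new resource.
        return 404
    entries[uri] = body
    return 200


entries = {"/blog/entries/1": "<entry>original</entry>"}
original = entries["/blog/entries/1"]  # stands in for the GET
edited = original.replace("original", "edited")
print(handle_put(entries, "/blog/entries/1", edited))
print(handle_put(entries, "/blog/entries/99", edited))
```

Since the client never gets to pick a new storage location, the path-checking hazards of content publishing do not arise.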

There are quite a few ways to configure Apache to process incoming requests. In particular, it is possible to have a single script that handles all PUT requests below a chosen directory. This strategy, and all of the security concerns associated with it, are covered fully in the ApacheWeek article Publishing Pages with PUT.

When processing requests with a CGI script, all the PUT requests will come through. The verb is passed to the CGI program via the REQUEST_METHOD environment variable, and the program decides what to do with the content.
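A CGI program that decides what to do based on the verb could be structured like this Python sketch. The handler names and response bodies are placeholders, and answering unsupported verbs with 405 is a choice made for the example.

```python
import os


def handle_get(environ):
    return "200 OK", "current representation\n"


def handle_put(environ):
    return "200 OK", "resource updated\n"


HANDLERS = {"GET": handle_get, "PUT": handle_put}


def dispatch(environ):
    """Choose a handler from REQUEST_METHOD; reject other verbs with 405."""
    method = environ.get("REQUEST_METHOD", "GET").upper()
    handler = HANDLERS.get(method)
    if handler is None:
        return "405 Method Not Allowed", "unsupported method\n"
    return handler(environ)


if __name__ == "__main__":
    status, body = dispatch(os.environ)
    # CGI output: headers, a blank line, then the body.
    print("Status: %s" % status)
    print("Content-Type: text/plain")
    print()
    print(body, end="")
```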

Using PUT properly has advantages in Web Service development. First, Apache lets you control security based on the verb using the Limit and LimitExcept directives, which apply access controls to specific verbs. Here is a sample of one of my .htaccess files that requires authentication for every verb except GET on the CGI program Bulu.cgi.

<Files Bulu.cgi>
AuthType Basic
AuthName myrealm
AuthUserFile /path/to/my/password/file
  <LimitExcept GET>
  Require valid-user
  </LimitExcept>
</Files>

In addition, the Script directive can be used to dispatch to a CGI program based on the verb used:

Script PUT /cgi-bin/put.cgi

The second advantage that using PUT brings is clarity. Given the idiom of using GET/PUT in tandem on a URI to edit resources, PUT clearly signals what the interface is doing.

Resources

ApacheWeek: Publishing Pages with PUT

RestEchoApiPutAndDelete: Discussion on the use of PUT and DELETE in the AtomAPI.

mod_actions: An Apache module for controlling dispatching based on verb or content-type.

Configuring your WWW server to understand the PUT method, from the W3C's Amaya project documentation.

WebDAV is also something you may be interested in if you are looking for ways to publish your content using HTTP. WebDAV stands for "Web-based Distributed Authoring and Versioning". It is a set of extensions to the HTTP protocol which allows users to collaboratively edit and manage files on remote web servers. Mod_dav is an Apache module that implements WebDAV.

Six Plus One

2003-08-03T01:34:49-05:00

Previously I talked about the six different places there are to store information in an HTTP transaction. This is slightly misleading.

To review, the six places are:

Request URI
Request Headers
Request Content
Response Status Code
Response Headers
Response Content

This is slightly misleading because the URI is listed as a single storage location. This isn't the best characterization, as it really contains two different sets of information: the path, and the query parameters.
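The split is easy to see with Python's standard URL parser. The URI below is a made-up example.

```python
from urllib.parse import urlsplit

# Made-up URI carrying information in both the path and the query.
parts = urlsplit("http://example.org/cgi-bin/test.py/fred/12?id=234454")
print(parts.path)   # the path component
print(parts.query)  # the query parameters, without the leading "?"
```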

Now the path part of a URI usually corresponds to the directory structure on the server. But remember that the path structure of a server is completely controlled by that server, and it need not correspond to any file or directory structure. While it is at times convenient to map it to a directory structure, this isn't required, and it is possible to pass path information to a CGI program. For example, if you do a GET on the following URL:

http://example.org/cgi-bin/test.py/fred/12

and there exists a program named test.py in the cgi-bin directory then that program will be executed. The remaining path after the program is passed to the CGI program in the PATH_INFO environment variable. In contrast, if query parameters are passed in, they are passed to the CGI program via the QUERY_STRING environment variable.

For example, if this is the script test.py:

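import os
# print() adds a newline, so this emits the header plus the blank line that ends the CGI headers.
print("Content-Type: text/plain\n")
# .get() avoids a KeyError when the server leaves a variable unset.
print("PATH_INFO = %s" % os.environ.get("PATH_INFO", ""))
print("QUERY_STRING = %s" % os.environ.get("QUERY_STRING", ""))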

And it handles the GET for this URI:

http://localhost/cgi-bin/test.py/reilly/12?id=234454

It will display:

PATH_INFO = /reilly/12
QUERY_STRING = id=234454

Note how the piece of the path below test.py has been stripped off and made available via PATH_INFO, while the query parameters are stored in the QUERY_STRING environment variable.
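Inside the CGI program the two variables are usually broken down further. A minimal Python sketch, with request_parameters being a name chosen for this example:

```python
from urllib.parse import parse_qs


def request_parameters(environ):
    """Split PATH_INFO into segments and QUERY_STRING into a dict of lists."""
    segments = [s for s in environ.get("PATH_INFO", "").split("/") if s]
    query = parse_qs(environ.get("QUERY_STRING", ""))
    return segments, query


# Values matching the GET shown above.
environ = {"PATH_INFO": "/reilly/12", "QUERY_STRING": "id=234454"}
print(request_parameters(environ))
```

parse_qs returns a list for each name because a query string may repeat a parameter.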

So HTTP, via the structure of a URI, gives you two distinct places to store information: one in the path and the second in the query parameters. This isn't even the full story. If you are running Apache and have the ability to use .htaccess files, you can use mod_rewrite to map URIs so that they appear as paths but show up in the CGI as query parameters, but we won't cover that now.