Automatic proxy HTTP server configuration in web browsers

You've come to this page because you've asked questions similar to the following:

How does automatic proxy HTTP server configuration in web browsers work? How can I support it?

This is the Frequently Given Answer to such questions.

Automatic proxy HTTP server configuration involves three things:

Proxy Auto-Configuration files

Most web browsers may be configured manually, with a single, fixed, proxy HTTP server. However, Netscape Navigator version 2.0 and later and Microsoft's Internet Explorer version 3.0a and later may instead be configured to use Proxy Auto-Configuration (PAC) files.

PAC files are files that contain the text of a single JavaScript function, FindProxyForURL(). In theory, every time that a web object is about to be fetched, the JavaScript function is invoked (by the web browser) with two arguments: the URL of the object and the hostname derived from that URL. The result of the function is a string comprising a semi-colon-separated sequence of one or more instructions that determine whence the web browser is to fetch the object:

Instruction        Meaning
DIRECT             Fetch the object directly from the content HTTP server denoted by its URL
PROXY name:port    Fetch the object via the proxy HTTP server at the given location (name and port)
SOCKS name:port    Fetch the object via the SOCKS server at the given location (name and port)

The Netscape 2.0 documentation for PAC scripts describes in detail the JavaScript facilities that are available for use in the FindProxyForURL() function.
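
As a minimal sketch of what such a file might contain, the following FindProxyForURL() fetches objects from hosts inside an internal domain directly, and everything else via two proxy HTTP servers in turn, falling back to a direct fetch. The domain name, the proxy server names, and the port are all assumptions made up for illustration. The sketch uses only plain JavaScript string operations, rather than the helper facilities that the Netscape documentation describes.

```javascript
// Sketch of a PAC file. Every name and port below is hypothetical.
function FindProxyForURL(url, host) {
    // Lower-case the hostname and drop the trailing dot of a fully-qualified name.
    var h = host.toLowerCase().replace(/\.$/, "");

    // Unqualified names, and names within the hypothetical internal domain,
    // are fetched directly from the content HTTP server.
    var internal = ".internal.example";
    if (h.indexOf(".") === -1 ||
        h === "internal.example" ||
        h.slice(-internal.length) === internal) {
        return "DIRECT";
    }

    // Everything else: try the first proxy, then the second, then go direct.
    return "PROXY proxy1.internal.example:3128; " +
           "PROXY proxy2.internal.example:3128; " +
           "DIRECT";
}
```

The returned string for an external host contains three instructions, which is how a PAC script asks for fallback from one proxy HTTP server to another.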

Proxy caching in Microsoft's Internet Explorer

In theory, the FindProxyForURL() function is invoked every time that an object is about to be fetched by the web browser.

In practice, however, Microsoft's Internet Explorer has what Microsoft terms an "Automatic Proxy Result Cache". Whenever a proxy HTTP server (located using the results of a call to the FindProxyForURL() function or otherwise) is successfully contacted to fetch an object, the APR cache is updated to contain that <hostname,server> pair. If, when about to call the FindProxyForURL() function, Internet Explorer finds the host already listed in the APR cache, it uses the proxy HTTP server listed in the APR cache entry instead of calling the FindProxyForURL() function again for the same host. (The intent of the APR cache is to attempt to reduce the number of times that the JavaScript function has to be run, and thus reduce the overhead of fetching objects.)

Because Internet Explorer's APR cache is indexed by hostname, it is impossible for a PAC script to reliably yield different results according to any part of a URL other than the hostname. It is impossible, for example, to provide different proxy configurations according to the path portions of URLs on a single host.

Because Internet Explorer's APR cache caches the proxy HTTP server rather than the full results of the FindProxyForURL() function, fallback from one proxy HTTP server to another does not occur in the event of a problem, even if the FindProxyForURL() function returned a list of several proxy HTTP servers.
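
Both consequences can be seen in a toy model. The following is not Internet Explorer's actual code; it is a sketch that merely mimics the behaviour described above, on the simplifying assumptions that the first proxy HTTP server in the list is always contacted successfully and that only PROXY results are cached. The hostnames, ports, and the path-based rule are hypothetical.

```javascript
// Toy model of a cache indexed by hostname. NOT Internet Explorer's real code.
function makeCachingResolver(findProxyForURL) {
    var cache = {};   // hostname -> the single proxy that was last used for it
    return function (url, host) {
        if (Object.prototype.hasOwnProperty.call(cache, host)) {
            return cache[host];   // the PAC function is not called again
        }
        var entries = findProxyForURL(url, host).split(";");
        var first = entries[0].trim();
        // Assumption: the first proxy is contacted successfully, so it alone
        // is remembered. The rest of the list is discarded.
        if (first.indexOf("PROXY ") === 0) {
            cache[host] = first;
        }
        return first;
    };
}

// A hypothetical PAC function that tries to choose a proxy by URL path.
function FindProxyForURL(url, host) {
    if (url.indexOf("/video/") !== -1) {
        return "PROXY video-proxy.example:8080; DIRECT";
    }
    return "PROXY web-proxy.example:8080; PROXY backup-proxy.example:8080";
}

var resolve = makeCachingResolver(FindProxyForURL);
var a = resolve("http://www.example.com/index.html", "www.example.com");
var b = resolve("http://www.example.com/video/clip", "www.example.com");
// a and b are both "PROXY web-proxy.example:8080": the path-based rule is never
// consulted for the second URL, and the backup proxy has been forgotten.
```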

Microsoft's KnowledgeBase article #271361 summarizes these problems and describes how to turn Internet Explorer's APR cache off.

Microsoft's Internet Explorer also caches information about "bad" proxy HTTP servers for 30 minutes. This has no direct bearing upon PAC scripts, except that it often causes confusion when people are setting up a proxy HTTP server and creating a PAC script at the same time, and a problem with the proxy HTTP server, causing it to be cached as "bad" for 30 minutes, is misdiagnosed as a problem with the PAC script.

Examples of PAC scripts

This article from Microsoft Internet Developer contains a few examples of PAC scripts, as does the Microsoft Internet Explorer 6 documentation.

John R. LoVerso has created a PAC script that recognizes the URLs of many advertisement publishing services and redirects them, effectively removing banner advertisements by stopping the web browser from even trying to contact the advertisement publishing service.

"Bruce" has created a similar, shorter and less comprehensive, PAC script for the same thing.

Publishing PAC scripts for web browsers to download

Web browsers download PAC scripts once, either at program startup or (as is the case with Mozilla) when the web browser component is first invoked, and again when explicitly instructed to "re-load" them by the user. (Users thus have to re-load PAC scripts manually when their view of proxy HTTP services changes, for example when they reconnect a machine via a different ISP.)

Officially, web browsers obtain PAC scripts via HTTP, requiring that PAC scripts be published by content HTTP servers that are directly reachable by the web browsers. (In practice, Mozilla, for one, is also happy to read PAC scripts directly from a file on the machine or to use other protocols. Microsoft's Internet Explorer versions 5 and later are similarly permissive, although version 4 is strict about the HTTP requirement.)

PAC scripts are not required to have any particular names. However, they are officially required to have the application/x-ns-proxy-autoconfig MIME type. (In practice, Mozilla, for one, is lax about the MIME type, and will have no problems with PAC scripts that have other MIME types such as text/plain. Netscape Navigator is reported to be somewhat stricter about the MIME type, however.)

Since it is easiest with most content HTTP server software to configure MIME types to be automatically determined from filename extensions, the common convention is for PAC scripts to have names that end in .pac, and for the content HTTP server software to associate the .pac extension with the application/x-ns-proxy-autoconfig MIME type.

Of course, the .pac filename extension is only a convention. With some content HTTP server software the application/x-ns-proxy-autoconfig MIME type can be directly associated with the file, and the name can be anything one chooses.
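
The extension-to-MIME-type convention might be sketched as follows, as a request handler of the shape that Node.js's http.createServer() accepts. This is an illustration under assumptions, not a recommendation of any particular server software: the /proxy.pac path, the PAC script body, and the function names are all hypothetical.

```javascript
var PAC_MIME_TYPE = "application/x-ns-proxy-autoconfig";

// Determine the MIME type from the filename extension, per the .pac convention.
function mimeTypeFor(filename) {
    return filename.toLowerCase().slice(-4) === ".pac"
        ? PAC_MIME_TYPE
        : "application/octet-stream";
}

// A hypothetical PAC script to publish.
var PAC_BODY = 'function FindProxyForURL(url, host) { return "DIRECT"; }\n';

// Request handler: serves the PAC script at one hypothetical path.
function handler(request, response) {
    if (request.url === "/proxy.pac") {
        response.writeHead(200, { "Content-Type": mimeTypeFor(request.url) });
        response.end(PAC_BODY);
    } else {
        response.writeHead(404, { "Content-Type": "text/plain" });
        response.end("Not found\n");
    }
}
```

Under Node.js, the handler could be put into service with something like require("http").createServer(handler).listen(8080), the port number again being an arbitrary choice.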

The effects of problems with the content HTTP service publishing the PAC script

The content HTTP server publishing a PAC script (assuming that the web browser is using HTTP to obtain it) should be reliable, directly accessible, and continuously available whilst web browsers configured to download the PAC script from it may be operating. This is because web browsers do not have good failure modes if they are unable to download PAC scripts:

One example of a situation where such problems will become visible is attempting to use a web browser to view a local HTML document whilst the machine is disconnected from the network.

The squid FAQ document's recommendation for dealing with this is wrong

The example of providing redundant content HTTP servers that is given in § 5.4 of the squid FAQ document is wrong. It relies upon the notion of "multiple CNAMEs", which is contrary to the DNS paradigm. It only ever "worked" in one particular DNS server software (ISC's BIND), because of a bug that we were warned many years ago would be fixed and that has now been fixed. It has never worked with any other DNS server software. Client-side aliases in the DNS are one-to-one mappings. To provide one-to-many mappings, use multiple A resource records.

Avoiding promiscuous proxy HTTP service

Best practice is not to provide any sort of promiscuous proxy service to the whole of Internet. This includes proxy HTTP service. Best practice is to have all of one's proxy servers (proxy HTTP servers, proxy DNS servers, and so forth) listening on IP addresses that are not reachable by the rest of Internet.

As such, one might find oneself in the situation where a "roaming" user has left xyr web browser configured to use one's PAC script (served up by one's content HTTP server, of course, and which thus may be publicly reachable), which is directing it to use a proxy HTTP server that the user, not being "internally" connected to one's organisation, has no actual access to.

One way to avoid this problem is to employ "split horizon" HTTP service, with one version of the PAC script, containing the real proxy information, being published to "internal" users, and another version of the PAC script, containing a FindProxyForURL() function that always returns "DIRECT", being published to the rest of Internet.
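
The version of the PAC script that is published to the rest of Internet need be no more than the following sketch:

```javascript
// "External" version of the PAC script: never directs anyone to a proxy.
function FindProxyForURL(url, host) {
    return "DIRECT";
}
```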

Another way to avoid this problem is to use Web Proxy Auto-Discovery via DHCP, so that "roaming" users are only configured to use one's PAC script when they have actually obtained a lease for an IP address off one's DHCP server and are thus "internally" connected to one's organisation.

Configuring web browsers manually with the locations of PAC scripts

The simplest way to configure web browsers to download and use PAC scripts is manually. The service provider publishes the PAC script on a suitable content HTTP server with the appropriate MIME type, and informs web browser users of its URL. Web browser users then enter that URL into their web browser configuration settings.

Web Proxy Auto-Discovery protocol

Microsoft's Internet Explorer supports two mechanisms for automatically configuring it to download PAC scripts, under the banner of the Web Proxy Auto-Discovery (WPAD) protocol, which is described in detail here. With both mechanisms, Internet Explorer automatically determines the URL of the PAC script, without the user having to enter it manually.

WPAD mechanism 1: "DNS based"

The "DNS based" WPAD mechanism simply constructs a series of "well-known" URLs, starting with the machine's full primary domain name sans the initial label and proceeding to progressively shorter suffixes thereof until only a single label is left, as follows:

So, for example, if the machine's full primary domain name were workstation.division.country.example.com., the URLs would be

  1. http://wpad.division.country.example.com./wpad.dat
  2. http://wpad.country.example.com./wpad.dat
  3. http://wpad.example.com./wpad.dat

The web browser attempts to download a PAC script from each "well-known" URL in turn until it either succeeds or runs out of URLs.
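
The construction of the candidate URLs can be sketched as a short function. This is an illustration, not Internet Explorer's actual algorithm, and the function and parameter names are hypothetical. The description above says that the search proceeds until a single label is left, whereas the worked example stops at a two-label suffix, so the sketch takes the stopping point as a parameter and defaults to the behaviour of the worked example.

```javascript
// Build the list of "well-known" WPAD URLs for a machine's full domain name.
// minLabels is the shortest suffix (in labels) to try; 2 matches the worked example.
function wpadCandidateUrls(fqdn, minLabels) {
    if (minLabels === undefined) {
        minLabels = 2;
    }
    // Drop the trailing dot, then split into labels.
    var labels = fqdn.replace(/\.$/, "").split(".");
    var urls = [];
    // Start at 1 to skip the machine's own initial label.
    for (var i = 1; labels.length - i >= minLabels; i++) {
        urls.push("http://wpad." + labels.slice(i).join(".") + "./wpad.dat");
    }
    return urls;
}

var urls = wpadCandidateUrls("workstation.division.country.example.com.");
// urls is:
//   http://wpad.division.country.example.com./wpad.dat
//   http://wpad.country.example.com./wpad.dat
//   http://wpad.example.com./wpad.dat
```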

It is thus necessary for the proxy server administrator to do the following:

WPAD mechanism 2: "DHCP based"

The "DHCP based" WPAD mechanism simply passes the URL of the PAC script as option number 252 in the DHCP lease granted to the machine. The web browser obtains the URL from the lease, and simply downloads the PAC script from there.

It is thus necessary for the proxy server administrator to ensure that the DHCP Server is configured to hand out option 252 in the leases that it grants, containing the URL of the PAC script.

One caveat: Microsoft's Internet Explorer version 6.01 expects the string in option 252 to be NUL-terminated. As such, it unconditionally strips off the final octet of the string before using it. Earlier versions of Microsoft's Internet Explorer do not do this. To satisfy all versions, simply explicitly include a NUL as the last octet of the string.
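
As a sketch of what the NUL-terminated option looks like on the wire, the following builds the octets of option 252 in the usual DHCP option layout of a code octet, a length octet, and then the data. The function name and the URL are hypothetical, and how one actually enters the value depends upon one's DHCP server software.

```javascript
// Encode a PAC script URL as DHCP option 252, with an explicit trailing NUL.
function encodeOption252(pacUrl) {
    var text = new TextEncoder().encode(pacUrl);   // URLs are expected to be plain ASCII
    var length = text.length + 1;                  // the data plus the trailing NUL
    if (length > 255) {
        throw new RangeError("URL too long for a single DHCP option");
    }
    var option = new Uint8Array(2 + length);
    option[0] = 252;                    // option code
    option[1] = length;                 // number of data octets that follow
    option.set(text, 2);                // the URL itself
    option[option.length - 1] = 0;      // the NUL that version 6.01 strips off
    return option;
}

var option = encodeOption252("http://www.example.com/proxy.pac");
```

A web browser that strips the final octet is then left with exactly the URL.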

Security considerations

Web browsers have been a major source of security headaches over the years. Many of these headaches have been attributable to the simple bad design of having a web browser download from somewhere else on the network a program, whose code is written by and whose actions are thus determined by someone else, and then automatically run it on the local machine.

Unfortunately, PAC scripts, being JavaScript programs that are downloaded and automatically run by web browsers (every time that they wish to fetch web objects), employ exactly that bad design. Essentially: By employing a PAC script, a web browser user is running a program, written by a third party and downloaded from a web site on the network, on xyr machine under the aegis of xyr user account, allowing it to do everything that xe can do.

This is a shame. Most PAC scripts are little more than long lists of if statements ("If the URL matches this pattern, return this result."), and are little more than glorified sequential lookup tables. A far better design would have had PAC scripts be only data, not executable code, comprising just the lookup table and not the code to search it. (The access control rules database in UCSPI-TCP is a good example of a simple ruleset database design that could have been followed.) The possibilities for malicious use would have been far fewer.
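
To make the contrast concrete, here is a sketch of what a data-only design might have looked like. It is purely hypothetical: no web browser supports such a format, and the rule fields and function name are invented for illustration. The point is that only the table would be downloaded, whilst the code that searches it would be part of the web browser.

```javascript
// Hypothetical data-only ruleset: the first rule whose suffix matches wins.
// An empty suffix matches every hostname.
var rules = [
    { hostSuffix: ".internal.example", result: "DIRECT" },
    { hostSuffix: "",                  result: "PROXY proxy.internal.example:3128" }
];

// The search code, which would live in the web browser and not in the download.
function lookUp(rules, host) {
    for (var i = 0; i < rules.length; i++) {
        var suffix = rules[i].hostSuffix;
        if (host.slice(host.length - suffix.length) === suffix) {
            return rules[i].result;
        }
    }
    return "DIRECT";   // no rule matched
}
```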

PAC scripts are a security nightmare. This bad design combines with two search behaviours: the quirky "search path" behaviour of Netscape Navigator when it fails to locate a PAC script at the URL that it is given, and the automated search behaviour of "DNS based" Web Proxy Auto-Discovery. The latter allows anyone who can set up a content HTTP server for a suitable hostname to give Microsoft's Internet Explorer an arbitrary JavaScript program to run (and which will be run even if JavaScript is otherwise turned off).

Microsoft keeps trying to fix the security problems with "DNS based" WPAD and missing. (One of its purported fixes actually makes the problem worse.) In part, this is because it keeps fixing the wrong thing. Microsoft, the problem is in Internet Explorer, and that is what needs fixing. Stop fixing the wrong components. This isn't a DNS Client or a DNS Server flaw. It's a web browser flaw. Fix your web browser.

Some security advice:


© Copyright 2004,2009 Jonathan de Boyne Pollard. "Moral" rights asserted.
Permission is hereby granted to copy and to distribute this web page in its original, unmodified form as long as its last modification datestamp is preserved.