HTTP
July 10, 2016 ยท View on GitHub
HTTP is the protocol which browsers use to request pages from servers.
HTTP is part of the application layer.
The port is 80/TCP.
It is the main way that applications talk to servers.
HTTP does not find a server: it only determines exactly which characters must be passed to a server to get the data what one wants.
A major application is to request that a server send a web page. Such requests are made by browsers whenever you open a web page.
There are however many other applications outside browsers: any program can send an HTTP request. One important example are REST interfaces, which allows programs to talk to servers. One example of a REST interface is the GitHub API https://developer.github.com/v3, which allows programs to do anything that can be done through the browser on GitHub. This can be used to create applications that interact with GitHub's data such as third party analytics tools like http://osrc.dfm.io/cirosantilli/.
You can use the nc utility to both send and receive low level HTTP requests to understand what is going on.
Standards
HTTP is specified by IETF.
There are currently two main versions HTTP/1.0 and HTTP/1.1 which is the most popular one today.
The HTTP/1.1 specification can be found at RFC 2616 http://tools.ietf.org/html/rfc2616, http://www.w3.org/Protocols/rfc2616/rfc2616.html (1999).
Many modifications have since been made through other RFCs:
- https://tools.ietf.org/html/rfc4918 extensions, like 422.
- PATCH method https://tools.ietf.org/html/rfc5789
multipart/form-datacontent-typehttp://www.ietf.org/rfc/rfc2388
Example
TODO simple concrete example to tie in all concepts for a beginner.
A sample GET request that a browser can send to a server looks like:
TODO
A sample POST request that a browser can send to a server looks like:
TODO
Minimal HTTP request
GET / HTTP/1.1
Host: example.com
Or more precisely:
printf 'GET / HTTP/1.1\r\nHost: example.com\r\n\r\n' | nc example.com 80
The Host is required, although some permissive servers may ignore it. example.com does not accept it.
Minimal HTTP response
http://stackoverflow.com/questions/4726515/what-http-response-headers-are-required
No headers are required in all situations, although some headers are required in some situations.
Transport
By far the most common is TCP, not UDP: http://stackoverflow.com/questions/323351/does-http-use-udp
Quoting the spec:
This does not preclude HTTP from being implemented on top of any other protocol on the Internet, or on other networks. HTTP only presumes a reliable transport
Since UDP is not "reliable transport", it should not be used.
HTTP is a synchronous request-response protocol https://en.wikipedia.org/wiki/Request-response. A simplified transaction goes like this:
- client opens the connection with 3-way handshake
- client sends the request
- server ACKs the request
- server send the reply
- client ACKs the reply
- client makes further requests (images, CSS, etc.) until satisfied, and server replies
- client starts the closing handshake
To observe a transaction in detail, use Wireshark:
- set a Wireshark display filter by address
ip.host == example.com curl example.com
It says that the connection went like this:
192.168.0.10 example.com TCP 47996 > http [SYN] Seq=0
example.com 192.168.0.10 TCP http > 47996 [SYN, ACK] Seq=0 Ack=1
192.168.0.10 example.com TCP 47996 > http [ACK] Seq=1 Ack=1
192.168.0.10 example.com HTTP GET / HTTP/1.1
example.com 192.168.0.10 TCP http > 47996 [ACK] Seq=1 Ack=76
example.com 192.168.0.10 HTTP HTTP/1.1 200 OK (text/html)
192.168.0.10 example.com TCP 47996 > http [ACK] Seq=76 Ack=1592
192.168.0.10 example.com TCP 47996 > http [FIN, ACK] Seq=76 Ack=1592
example.com 192.168.0.10 TCP http > 47996 [FIN, ACK] Seq=1592 Ack=77
192.168.0.10 example.com TCP 47996 > http [ACK] Seq=77 Ack=1593
Connection header
Keep-Alive
Persistent connection
In HTTP 1.1, multiple HTTP requests are done by default on a single connection, since almost all pages require the loading of multiple resources from a single server: image, CSS, etc. In HTTP 1.0, this had to be indicated through the Connection: Keep-Alive header, which most browsers still send as of 2014.
Demo:
Two replies, not mandatory on 1.1:
(
printf 'GET / HTTP/1.1\r\nHost: example.com\r\n\r\n';
printf 'GET / HTTP/1.1\r\nHost: example.com\r\n\r\n';
) | nc example.com 80 | grep HTTP
Output:
HTTP/1.1 200 OK
HTTP/1.1 200 OK
And Wireshark tells me the connection went like this:
192.168.0.10 example.com TCP 47741 > http [SYN] Seq=0
example.com 192.168.0.10 TCP http > 47741 [SYN, ACK] Seq=0 Ack=1
192.168.0.10 example.com TCP 47741 > http [ACK] Seq=1 Ack=1
192.168.0.10 example.com HTTP GET / HTTP/1.1 GET / HTTP/1.1
192.168.0.10 example.com TCP 47741 > http [FIN, ACK] Seq=75 Ack=1
example.com 192.168.0.10 TCP http > 47741 [ACK] Seq=1 Ack=75
example.com 192.168.0.10 HTTP HTTP/1.1 200 OK (text/html)
192.168.0.10 example.com TCP 47741 > http [ACK] Seq=76 Ack=1592
example.com 192.168.0.10 HTTP HTTP/1.1 200 OK (text/html)
192.168.0.10 example.com TCP 47741 > http [ACK] Seq=76 Ack=3183
example.com 192.168.0.10 TCP http > 47741 [FIN, ACK] Seq=3183 Ack=76
192.168.0.10 example.com TCP 47741 > http [ACK] Seq=76 Ack=3184
so single SYN and thus single connection.
On HTTP 1.0, Keep-Alive is mandatory:
(
printf 'GET / HTTP/1.0\r\nHost: example.com\r\n\r\n';
printf 'GET / HTTP/1.0\r\nHost: example.com\r\n\r\n';
) | nc example.com 80 | grep HTTP
Output:
HTTP/1.1 200 OK
Two replies:
(
printf 'GET / HTTP/1.0\r\nHost: example.com\r\nConnection: keep-alive\r\n\r\n';
printf 'GET / HTTP/1.0\r\nHost: example.com\r\nConnection: keep-alive\r\n\r\n';
) | nc example.com 80 | grep HTTP
Output:
HTTP/1.1 200 OK
HTTP/1.1 200 OK
Close
In HTTP, Connection: Close on the request or response
says that the sender does not want a persistent connection.
This is may be necessary because in HTTP 1.1 the connection is persistent by default.
Newlines
Every newline is a CRLF. Tools such as curl convert \n to \r\n.
First line
The first line is different for requests and responses.
Initial request line
An initial request line looks something like:
GET /path/to/file.html HTTP/1.0
1 2 3
Or:
POST /path/to/resource HTTP/1.1
1 2 3
Where:
- method
- path
- HTTP version
Initial response line
HTTP/1.0 200 OK
^ ^ ^
1 2 3
-
protocol and version
-
status code. Programmatic use, so get it right.
-
Reason Phrase. Each status code has a default reason phrase, but any phrase may be used, the value is not fixed by the protocol.
Method
Determines in general terms what the request is about.
RFC 2616 specifies the following methods:
-
GET: get information from serverCan contain a body, but it should be ignored by servers: http://stackoverflow.com/questions/978061/http-get-with-request-body
-
HEAD: only get header information from server -
POST: send data to server to create new objects. E.g.: you click on the submit button of an HTMLformwithmethod="post". -
PUT: update or create entire objects on server. Both PUT and POST can be used to create object, butPUTis idempotent. -
DELETE: remove objects from server. Idempotent. -
TRACE: TODO -
CONNECT: TODO
There are also proposed methods in other RFCs:
-
PATCHin RFC 5789 https://tools.ietf.org/html/rfc5789Update object on server.
Vs
PUT: only attributes of the objects which are sent are modified.In
PUT, attributes not given are set to default values.In
PATCH, attributes not given are not modified.
Safe Methods
Methods that shouldn't change the server state significantly. E.g.: GET, HEAD.
In practice however, many GET requests do change the server state: a common example being page view count. However this operation is not considered drastic, and it is generally accepted to be done via GET.
Use safe methods whenever possible.
Idempotent Method
If a request is sent twice with an idempotent method, it should have the same effect as if it were sent once.
E.g.: PUT, DELETE and all safe methods.
Both PUT and POST can be used for object creation: the difference often comes down to: does the user control the unique ID that identifies the created object, often the one shown on the URL to the object? Or is it the server that automatically generates the ID? If the user controls it, a double request is idempotent because the second would be an update with the same data as the first, so PUT is the best choice.
Status line
An initial response line looks something like:
HTTP/1.0 200 OK
1 2 3
Or:
HTTP/1.1 404 Not Found
1 2 3
Where:
- HTTP version
- status code
- status code name. There is only one possible name for every status code.
Status code
All status codes can be found here: http://en.wikipedia.org/wiki/List_of_HTTP_status_codes
Some of the more interesting ones are commented here.
Redirect
3xx
Very readable spec: http://tools.ietf.org/html/rfc2616#section-10.3
Redirect to another page for a given reason.
All of those statuses require the Location header that indicates where to redirect to.
The spec recommends that an HTML redirect page be given in the body in case the user agent does not follow redirects. Most browsers do so by default, and don't ever show Moved pages.
HTTP/1.1 301 Moved Permanently
Location: http://www.example.org/
Content-Type: text/html
Content-Length: 174
<html>
<head>
<title>Moved</title>
</head>
<body>
<h1>Moved</h1>
<p>This page has moved to <a href="http://www.example.org/">http://www.example.org/</a>.</p>
</body>
</html>
By default, the new request should not change the initial method, unless stated otherwise as in 303.
If the new method would not be a GET or HEAD request, the UA can only carry it out if it is possible to ask for confirmation from the user, since something like POST may have unwanted consequences.
User agents should detect infinite redirect loops.
Post Redirect Get pattern
http://en.wikipedia.org/wiki/Post/Redirect/Get
301
Permanent redirect: the page has been moved forever.
Next time, UAs can cache this redirect, and may not even bother sending a request to the first location.
Only works for GET requests.
TODO how to revert this if you change your mind later? http://getluky.net/2010/12/14/301-redirects-cannot-be-undon/ suggests it is impossible.
302
Hell.
In the spec, exact same as 307.
In practice, implemented exactly as 303, even on 2014 browsers, and nothing can be done to change that without breaking things.
Still the most common response promoted by 2014 web frameworks. E.g., Rail's most common redirection method redirect_to uses it. TODO why? Compatibility with HTTP 1.0 which does not have 303 and 307, even if 1.1 is very widely implemented?
303
Redirect to Location, and use GET whatever the first method was.
Introduced in HTTP 1.1 do disambiguate 302. Counterpart of 307.
In practice, this is what all 302 implementations do.
This is the most correct response after an user submitted a POST form and you want him to see the object he created, although in practice 302 still is more common.
Perhaps someday we will be able to use XHR and Javascript pushState for form submissions, thus saving one extra request.
304
Not modified.
Application:
The user pressed the refresh key on an open browser page.
It would be wasteful to refecth the current page if it was not modified since it last fetched. 304 exists to avoid just that.
If the client already has an older version of the resource cached, it can send in the request one of the fields If-Modified-Since or If-Match containing the date at which the resource was obtained.
The server sees if the resource has been updated since that date, and if not can return a 304.
307
In the spec, exact same as 302.
Introduced in HTTP 1.1 with 303 because all browsers treat 302 as 303, and nothing can be done to change it now.
This time we hope browsers will follow the spec.
4xx
400
Bad syntax. E.g., malformed JSON.
In theory, not for invalid values, e.g., invalid email address. Use 422 for that if supported.
401
Server should include a WWW-Authenticate field specifying what kind of authentication is required.
Try this with:
curl -I localhost/location/that/requires/auth
I get for example:
WWW-Authenticate: Basic realm="AuthName value"
so the type is Basic
AuthName value is a any descriptive string set by the server operators. In Apache it is given by the AuthName directive.
403
Vs. 401: http://stackoverflow.com/questions/3297048/403-forbidden-vs-401-unauthorized-http-responses
422
WebDAV extension: https://tools.ietf.org/html/rfc4918#section-11.2
Returned for invalid parameters sent, e.g., invalid email address sent.
Returned by Rails 4 on form validation.
Headers
Certain headers can only be used on requests, other only on responses, and a few on both.
Headers are case insensitive. A common style is to capitalize the first letter of each word.
Headers that can be used on both request and response
Content-Type
http://tools.ietf.org/html/rfc2616#section-14.17
The MIME type of the data being sent on the body.
Not mandatory, but almost always provided since it is very helpful to the client.
If not present, the client can read the data and try to decide the content type.
Large list with simple explanations: http://en.wikipedia.org/wiki/Internet_media_type#Type_text
-
text/html: HTML document. Browsers interpret body as HTML and renders it. -
application/xhtml+xml: XHTML. -
text/plain: browser pastes to screen, no HTML rendering. So you will see tags like<h1>on screen. -
application/x-www-form-urlencoded: key value pairs in the same format that can be given on an URL.The default
content-typeformethod="post"on HTML forms.The
content-typecan be modified via theenctypeformattribute. -
application/json: popular choice for rest APIs. Used on the GitHub API. -
text/css -
application/pdf -
application/javascript -
application/octet-stream: TODO
multipart/form-data
Specified at RFC 2388 http://www.ietf.org/rfc/rfc2388.
Encapsulates multiple header / body pairs into one body HTTP body.
Advantage of multipart/form-data:
-
Huge memory savings for binary files, in which:
-
many bytes would have to be URL encoded for
application/x-www-form-urlencoded(3 bytes per encoded byte). -
the size would be 33% larger with
application/x-www-form-urlencoded+base64encode.
-
Disadvantages of multipart/form-data:
-
Each field has a data overhead for the boundary and the sub headers.
This is easily overcome by memory gains of using it if there is a file.
Therefore: use it iff upload of a binary file is possible on the request.
Sample request: http://stackoverflow.com/a/23518189/895245
boundary
This is the only type of request does not have an empty line after the headers: boundary comes directly.
boundary specifies a sequence of bytes which separates each of the bodies.
The boundary is always surrounded by CRLFs which are not part of the data.
The boundary cannot appear inside the data: the user agent must chose it appropriately.
The trailing hyphens of the boundary are often added for partial backward compatibility with older multipart RFCs, and to improve readability. TODO are they mandatory?
Content-Length
http://tools.ietf.org/html/rfc2616#section-14.13
Length of the body in bytes.
Very important when body is present because it allows the receiver to allocate memory at once.
Was mandatory on HTTP 1.0, but not on HTTP 1.1, where Transfer-Encoding: chunked was added, and it is specified that clients can simply read up to the connection close if it is not given.
Content-Encoding
Must be one of the request Accept-Encoding values.
-
gzip: the content was gzipped before sending it to the browser.Supported by almost all modern browsers: Firefox sends it by default, Apache has
mod_gzipwhich gzips everything when possible.
Transfer-Encoding
http://tools.ietf.org/html/rfc2616#section-14.41
Vs Content-Encoding: Transfer-Encoding is a property of the way the message is transmitted,
Content-Encoding is metadata about the message.
For example, you could use both a transfer encoding and a content encoding on a single request.
Chunked
Transfer-Encoding: chunked encodes the body as follows:
printf 'HTTP/1.1 200 OK
Transfer-Encoding: chunked
4
some
8
encoded
15
text with
a newline
0
' | sed -E 's/$/\r/' | nc -l 8000
Then:
curl localhost:8000
Outputs:
some encoded text with
a newline
Each line of the body gets prefixed by the length of the line in hexadecimal on it's own line.
The last line must have length 0, followed by two CRLFs.
Advantage: allows the server to respond progressively if it does not know the content length already for the Content-Length header.
Request headers
Host
The only mandatory header in HTTP 1.1.
With curl http://localhost:8000:
Host: localhost:8000
The port should be included if it's not the default 80: http://stackoverflow.com/questions/3364144/is-port-number-required-in-http-host-header-parameter
With curl 127.0.0.1:
Host: 127.0.0.1
With curl http://google.com:
Host: google.com
Some hosts are very picky about the Host header. E.g., example.com give a 404 if you don't set it, or set it wrong:
curl `dig +short example.com`
results in 404, while:
curl example.com
works.
The purpose of this annoying mandatory header is to allow for name based virtual hosting: http://en.wikipedia.org/wiki/Virtual_hosting, i.e., multiple hostnames on a single IP.
TODO: how are browsers able to decide this? If you click on a link, browsers send the same Host as the hostname. Is it possible to change that?
User-Agent
Description of the user agent that sent the request.
Request counterpart of Server.
Firefox 29:
User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:29.0) Gecko/20100101 Firefox/29.0
Curl 7.22:
User-Agent: curl/7.22.0 (i686-pc-linux-gnu) libcurl/7.22.0 OpenSSL/1.0.1 zlib/1.2.3.4 libidn/1.23 librtmp/2.3
Accept
A comma separated list of MIME type that the client wants as a response.
Specified on the response by the Content-Type header.
It is possible that a single URL is able to return several types.
In this case this field can be used by the server to determine the type to serve.
Firefox 29:
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
The q= are the quality factors, that indicate the level of preference for each listed media type: http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.1
Accept-Language
Comma separated list of accepted Content-Language for the response.
Firefox 29:
Accept-Language: en-US,en;q=0.5
Accept-Encoding
Comma separated list of accepted Content-Encoding for the response.
Firefox 29:
Accept-Encoding: gzip, deflate
Extensions
Forwarded-For
http://en.wikipedia.org/wiki/X-Forwarded-For
Usually set by reverse proxies.
Holds the IP of the original request.
Response headers
Content-Disposition
Suggests to the browser what to do to certain types of data
attachment suggests to the browser to download the file with given filename:
Content-Disposition: attachment; filename=genome.jpeg;
inline suggest to the browser to show the content inline if it has that capability:
Content-Disposition: inline
In Firefox, the browser preferences under Edit > Preferences > Application determine what to do for each MIME type, and overrides this header.
http://i8jesus.com/?p=64 suggests that Content-Disposition is not sufficient protection from unrestricted file upload XSS.
Server
Software that sent the response.
google.com:gws. Google Web Server, closed source.twitter.com:tfe
Not mandatory: Facebook does not return it.
Referer
Sent by the UA containing the page from which the request was sent, so in the case of hyperlinks the page who refered the user to the new one.
Optional, so don't rely on it: http://stackoverflow.com/questions/6023941/how-reliable-is-http-referer
Was originally misspelled as referrer, with two R's on older specs, but this has been corrected in newer specs.
Used by Rails redirect_to :back shortcut.
Extensions
Sendfile
Accel
https://tn123.org/mod_xsendfile/
If the response contains this header, the rest of the response is ignored and the file is served instead, and the server uses all of it's optimizations.
Normally this the reverse proxy is configured to consider this header only on the response of the server.
This header is useful for serving static files for which the server has to control access: there fore the request must pass through the server first. But it is inefficient to serve files from the backend server, so the backend server passes this header and delegates to the reverse proxy.
CORS headers
The following headers are used for CORS requests and responses:
Origin
Access-Control-Allow-Origin: http://api.bob.com
Access-Control-Allow-Credentials: true
Access-Control-Expose-Headers: FooBar
Custom headers
http://stackoverflow.com/questions/3561381/custom-http-headers-naming-conventions
In the past, prefix by X-.
After 2012: cross your fingers and pick a name.
Body
TODO
HTTPS
HTTP over TLS.
Encrypts both body and headers.
Downside: encrypt/decrypt costs time.
HTTPS runs on port 443 instead of 80.
HTTP authentication
Authentication that is sent over the HTTP header.
Sources
http://www.httpwatch.com/httpgallery/authentication/
Comparison to form auth, nice diagrams: http://docs.oracle.com/javaee/1.4/tutorial/doc/Security5.html
Great post: http://stackoverflow.com/questions/549/the-definitive-guide-to-forms-based-website-authentication
Downsides of HTTP auth
Parameters are left to the browser:
- the appearance of the login page
- the time for which the user stays authenticated (time for which browser keeps resending
user:passautomatically).
You might have seen this on a website in which your browser just opens up a weird looking window and asks you for username / password.
Therefore, you cannot customize them and users will get different interfaces on different browsers, bad user interface consistency.
For those reasons, form authentication is used on most large sites today.
Upside of HTTP auth
Simple.
Basic authentication
Authentication is sent on the header unencrypted!
Example:
curl -vu u:p google.com
You see the header line:
Authorization: Basic dTpw
^^^^^ ^^^^
1 2
where:
- 1: authentication type
- 2: base 64 of u:p. not encryption!
Just checking:
assert [ "`echo dTpw | base64 -d`" = "u:p" ]
URL convention
Many programs accept URLs strings with user/pass included:
curl -v u:p@google.com
This is however just a convention, since programs that accept it parse the string to extract the u:p part, and then send it on the header.
Digest authentication
Pretty cool concept
See: http://en.wikipedia.org/wiki/Digest_access_authentication
Authentication is sent on the header md5 hashed:
curl --digest -vu u:p google.com
Why it works
Data is appended to the authentication with : before hashing:
- domain (
www.google.com) - method (GET, POST, etc.)
- nonce
- nonce is sent to client from server.
- nonces can only be used once per client!!
- nonce prevents requests from being repeated with an old captured hashed string!
- also increases the difficulty of cracking each user/pass
This way, the unknown user and pass get mixed up with the extra data in the hash and it is very hard to separate them. and the nonce makes sure requests cannot be remade by resending the hash.
Merits:
- simpler than a full SSL
NTML
Safer than digest: replay attacks impossible.
Requires server state, so HTTP 1.1 only.
Little current support/usage.
CSRF
Cross site request forgery
The attacker Bob posts on a website:
http://bank.com?transfer-ammount=10000&to=bob
Alice clicks on the disguised link, her browser sends the authentication cookies, authenticates and makes the request.
How to prevent it: https://www.owasp.org/index.php/Cross-Site_Request_Forgery_%28CSRF%29_Prevention_Cheat_Sheet
Synchronizer Token Pattern.
The most popular prevention mechanism.
Send extra randomly generated data with valid forms.
Generation can be per session or per form, which is more secure but less efficient and may have usability impact.
The method relies on the same origin policy: if it did not exist the attacker would be able to obtain the token.
This method is implemented by default in most web frameworks, either by adding a hidden form field (Django) or meta elements like Rails:
<meta content="authenticity_token" name="csrf-param" />
<meta content="asdfqwerASDFQWER12345678" name="csrf-token" />
Reauthenticate
Ask user for password again before critical operations.
Used in all online banking systems.
Inconvenient for users since an extra action is required before an authentic request.
Sources
Good intro tutorial: http://www.jmarshall.com/easy/http/#structure