/cvs/AnyEvent-HTTP/README
Revision: 1.21
Committed: Tue Jun 14 05:23:12 2011 UTC (12 years, 11 months ago) by root
Branch: MAIN
CVS Tags: rel-2_12
Changes since 1.20: +1 -1 lines
Log Message:
2.12

NAME
    AnyEvent::HTTP - simple but non-blocking HTTP/HTTPS client

SYNOPSIS
       use AnyEvent::HTTP;

       http_get "http://www.nethype.de/", sub { print $_[1] };

       # ... do something else here

11 root 1.1 DESCRIPTION
12     This module is an AnyEvent user, you need to make sure that you use and
13     run a supported event loop.
14    
15 root 1.2 This module implements a simple, stateless and non-blocking HTTP client.
16     It supports GET, POST and other request methods, cookies and more, all
17 root 1.17 on a very low level. It can follow redirects, supports proxies, and
18 root 1.2 automatically limits the number of connections to the values specified
19     in the RFC.
20    
21     It should generally be a "good client" that is enough for most HTTP
22     tasks. Simple tasks should be simple, but complex tasks should still be
23     possible as the user retains control over request and response headers.
24    
25     The caller is responsible for authentication management, cookies (if the
26     simplistic implementation in this module doesn't suffice), referer and
27     other high-level protocol details for which this module offers only
28     limited support.
29    
30     METHODS
31     http_get $url, key => value..., $cb->($data, $headers)
32     Executes an HTTP-GET request. See the http_request function for
33 root 1.5 details on additional parameters and the return value.
34 root 1.2
35     http_head $url, key => value..., $cb->($data, $headers)
36     Executes an HTTP-HEAD request. See the http_request function for
37 root 1.5 details on additional parameters and the return value.
38 root 1.2
39     http_post $url, $body, key => value..., $cb->($data, $headers)
40 root 1.4 Executes an HTTP-POST request with a request body of $body. See the
41 root 1.5 http_request function for details on additional parameters and the
42     return value.
43 root 1.2
44     http_request $method => $url, key => value..., $cb->($data, $headers)
        Executes an HTTP request of type $method (e.g. "GET", "POST"). The
        URL must be an absolute http or https URL.

        When called in void context, nothing is returned. In other contexts,
        "http_request" returns a "cancellation guard" - you have to keep the
        object alive at least until the callback gets called. If the object
        gets destroyed before the callback is called, the request will be
        cancelled.

        The callback will be called with the response body data as first
        argument (or "undef" if an error occurred), and a hash-ref with
        response headers (and trailers) as second argument.
57 root 1.2
58     All the headers in that hash are lowercased. In addition to the
59 root 1.13 response headers, the "pseudo-headers" (uppercase to avoid clashing
60     with possible response headers) "HTTPVersion", "Status" and "Reason"
61 root 1.14 contain the three parts of the HTTP Status-Line of the same name. If
62     an error occurs during the body phase of a request, then the
63     original "Status" and "Reason" values from the header are available
64     as "OrigStatus" and "OrigReason".
65 root 1.13
66     The pseudo-header "URL" contains the actual URL (which can differ
67     from the requested URL when following redirects - for example, you
68     might get an error that your URL scheme is not supported even though
69     your URL is a valid http URL because it redirected to an ftp URL, in
70     which case you can look at the URL pseudo header).
71    
72     The pseudo-header "Redirect" only exists when the request was a
73     result of an internal redirect. In that case it is an array
74     reference with the "($data, $headers)" from the redirect response.
75     Note that this response could in turn be the result of a redirect
76     itself, and "$headers->{Redirect}[1]{Redirect}" will then contain
77     the original response, and so on.
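
        The nesting of "Redirect" entries can be unrolled with a short
        loop. The following is only a sketch: the helper name
        "redirect_chain" and the example.com URLs are made up for
        illustration, and the header hashes are built by hand instead of
        coming from a real request.

```perl
use strict;
use warnings;

# Walk the nested "Redirect" pseudo-headers and return one
# [status, url] pair per response, oldest response first.
# $headers is the hash-ref passed to the completion callback.
sub redirect_chain {
    my ($headers) = @_;

    my @chain;
    while ($headers) {
        unshift @chain, [$headers->{Status}, $headers->{URL}];

        # {Redirect} holds [$data, $headers] of the previous response, if any
        my $prev = $headers->{Redirect};
        $headers = $prev ? $prev->[1] : undef;
    }

    return @chain;
}

# Hand-made header hashes standing in for real responses (hypothetical URLs).
my $first  = { Status => 301, URL => "http://example.com/a" };
my $second = { Status => 302, URL => "http://example.com/b", Redirect => [undef, $first] };
my $final  = { Status => 200, URL => "http://example.com/c", Redirect => ["", $second] };

printf "%s %s\n", @$_ for redirect_chain($final);
```

        Inside a completion callback the same helper would be called as
        "redirect_chain($_[1])".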
78 root 1.3
79 root 1.6 If the server sends a header multiple times, then their contents
80     will be joined together with a comma (","), as per the HTTP spec.
81 root 1.2
        If an internal error occurs, such as not being able to resolve a
        hostname, then $data will be "undef", "$headers->{Status}" will be
        590-599 and the "Reason" pseudo-header will contain an error
        message. Currently the following status codes are used:

        595 - errors during connection establishment, proxy handshake.
        596 - errors during TLS negotiation, request sending and header
        processing.
        597 - errors during body receiving or processing.
        598 - user aborted request via "on_header" or "on_body".
        599 - other, usually nonretryable, errors (garbled URL etc.).
93 root 1.2
        A typical callback might look like this:

           sub {
              my ($body, $hdr) = @_;

              if ($hdr->{Status} =~ /^2/) {
                 # ... everything should be ok
              } else {
                 print "error, $hdr->{Status} $hdr->{Reason}\n";
              }
           }
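
        When a callback needs to tell the internal 59x codes apart from
        ordinary HTTP errors, a small lookup helper can keep it readable.
        This is a sketch only: "status_category" and the category names
        it returns are invented here and are not part of AnyEvent::HTTP.

```perl
use strict;
use warnings;

# Map the "Status" pseudo-header to a coarse category, following the
# list of internal status codes given in this document.
sub status_category {
    my ($hdr) = @_;
    my $status = $hdr->{Status};

    return "ok"       if $status =~ /^2/;
    return "connect"  if $status == 595;   # connection establishment, proxy handshake
    return "request"  if $status == 596;   # TLS negotiation, request sending, headers
    return "body"     if $status == 597;   # body receiving or processing
    return "aborted"  if $status == 598;   # on_header or on_body returned false
    return "internal" if $status >= 590 && $status <= 599;
    return "http-error";
}

# Hand-made header hashes standing in for real responses.
for my $hdr (
    { Status => 200, Reason => "OK" },
    { Status => 595, Reason => "Connection refused" },
    { Status => 404, Reason => "Not Found" },
) {
    printf "%s %s => %s\n", $hdr->{Status}, $hdr->{Reason}, status_category($hdr);
}
```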
105    
106     Additional parameters are key-value pairs, and are fully optional.
107     They include:
108    
109     recurse => $count (default: $MAX_RECURSE)
110     Whether to recurse requests or not, e.g. on redirects,
111     authentication retries and so on, and how often to do so.
112    
113     headers => hashref
114     The request headers to use. Currently, "http_request" may
115     provide its own "Host:", "Content-Length:", "Connection:" and
116 root 1.15 "Cookie:" headers and will provide defaults at least for "TE:",
117     "Referer:" and "User-Agent:" (this can be suppressed by using
118     "undef" for these headers in which case they won't be sent at
119     all).
120    
121     You really should provide your own "User-Agent:" header value
122     that is appropriate for your program - I wouldn't be surprised
123     if the default AnyEvent string gets blocked by webservers sooner
124     or later.
125 root 1.2
            Also, make sure that your header names and values do not
            contain any embedded newlines.
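
            A check for embedded newlines is easy to run before a
            request is made. The sketch below is one possible way to do
            it: the helper name "bad_headers", the client name and the
            "x-note" header are all made up for illustration.

```perl
use strict;
use warnings;

# Return the (sorted) names of all headers whose name or value contains
# a CR or LF. Headers set to undef, which is how a default header is
# suppressed, are only checked by name.
sub bad_headers {
    my ($headers) = @_;

    my @bad = grep {
        my $value = $headers->{$_};
        /[\r\n]/ or (defined $value and $value =~ /[\r\n]/)
    } keys %$headers;

    return sort @bad;
}

my %hdr = (
    "user-agent" => "MyClient/1.0",                  # hypothetical client name
    "referer"    => undef,                           # suppress the default Referer
    "x-note"     => "line one\r\nInjected: yes",     # must not be sent like this
);

print "unsafe header: $_\n" for bad_headers(\%hdr);
```

            A hash that passes the check can then be handed to
            "http_request" as "headers => \%hdr".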
128    
129 root 1.2 timeout => $seconds
130     The time-out to use for various stages - each connect attempt
131 root 1.11 will reset the timeout, as will read or write activity, i.e.
132     this is not an overall timeout.
133    
134     Default timeout is 5 minutes.
135 root 1.2
136     proxy => [$host, $port[, $scheme]] or undef
137 root 1.19 Use the given http proxy for all requests, or no proxy if
138     "undef" is used.
139 root 1.2
140 root 1.15 $scheme must be either missing or must be "http" for HTTP.
141 root 1.2
142 root 1.19 If not specified, then the default proxy is used (see
143     "AnyEvent::HTTP::set_proxy").
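
            If the proxy is known only as a URL, it has to be split into
            the array form expected by this parameter. The following is a
            minimal sketch under the assumption that the URL always has
            the form "http://host:port"; "proxy_from_url" and the proxy
            host are hypothetical and not part of the module.

```perl
use strict;
use warnings;

# Turn "http://host:port" into [$host, $port, $scheme], the form taken by
# the "proxy" parameter. Returns undef (or an empty list) if the URL does
# not have that shape.
sub proxy_from_url {
    my ($url) = @_;

    my ($scheme, $host, $port) = $url =~ m{^(http)://([^:/\s]+):(\d+)}i
        or return;

    return [$host, $port + 0, lc $scheme];
}

my $proxy = proxy_from_url("http://proxy.example:3128");   # hypothetical proxy
print join(", ", @$proxy), "\n" if $proxy;
```

            The result would be passed along as "proxy => $proxy".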
144    
145 root 1.2 body => $string
146 root 1.15 The request body, usually empty. Will be sent as-is (future
147 root 1.2 versions of this module might offer more options).
148    
149     cookie_jar => $hash_ref
150     Passing this parameter enables (simplified) cookie-processing,
151     loosely based on the original netscape specification.
152    
153     The $hash_ref must be an (initially empty) hash reference which
154     will get updated automatically. It is possible to save the
155 root 1.15 cookie jar to persistent storage with something like JSON or
156     Storable - see the "AnyEvent::HTTP::cookie_jar_expire" function
157     if you wish to remove expired or session-only cookies, and also
158     for documentation on the format of the cookie jar.
159    
160     Note that this cookie implementation is not meant to be
161     complete. If you want complete cookie management you have to do
162     that on your own. "cookie_jar" is meant as a quick fix to get
163     most cookie-using sites working. Cookies are a privacy disaster,
164     do not use them unless required to.
165    
166     When cookie processing is enabled, the "Cookie:" and
167     "Set-Cookie:" headers will be set and handled by this module,
168     otherwise they will be left untouched.
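
            Saving and restoring the jar with Storable could look like
            the sketch below. The helper names "load_jar" and "save_jar"
            and the file name in the comments are assumptions made for
            this example; only Storable's "nstore" and "retrieve" are
            used for the actual work.

```perl
use strict;
use warnings;
use Storable qw(nstore retrieve);

# Load a cookie jar from $file, or start with an empty one if the file
# does not exist yet.
sub load_jar {
    my ($file) = @_;
    return -e $file ? retrieve($file) : {};
}

# Save the cookie jar hash-ref to $file in network byte order.
sub save_jar {
    my ($jar, $file) = @_;
    nstore($jar, $file);
}

# Intended use together with AnyEvent::HTTP (not run here):
#
#   my $jar = load_jar("cookies.storable");
#   AnyEvent::HTTP::cookie_jar_expire $jar;      # drop expired cookies after loading
#   http_get $url, cookie_jar => $jar, sub { ... };
#   ...
#   AnyEvent::HTTP::cookie_jar_expire $jar, 1;   # also drop session cookies
#   save_jar($jar, "cookies.storable");
```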
169 root 1.2
170 root 1.8 tls_ctx => $scheme | $tls_ctx
171     Specifies the AnyEvent::TLS context to be used for https
172     connections. This parameter follows the same rules as the
173     "tls_ctx" parameter to AnyEvent::Handle, but additionally, the
174     two strings "low" or "high" can be specified, which give you a
175     predefined low-security (no verification, highest compatibility)
176     and high-security (CA and common-name verification) TLS context.
177    
178     The default for this option is "low", which could be interpreted
179     as "give me the page, no matter what".
180    
181 root 1.15 See also the "sessionid" parameter.
182    
183     session => $string
184     The module might reuse connections to the same host internally.
185     Sometimes (e.g. when using TLS), you do not want to reuse
186     connections from other sessions. This can be achieved by setting
187     this parameter to some unique ID (such as the address of an
188     object storing your state data, or the TLS context) - only
189     connections using the same unique ID will be reused.
190    
        on_prepare => $callback->($fh)
            In rare cases you need to "tune" the socket before it is used to
            connect (for example, to bind it on a given IP address). This
            parameter overrides the prepare callback passed to
            "AnyEvent::Socket::tcp_connect" and behaves exactly the same way
            (e.g. it has to provide a timeout). See the description for the
            $prepare_cb argument of "AnyEvent::Socket::tcp_connect" for
            details.
199    
200 root 1.14 tcp_connect => $callback->($host, $service, $connect_cb,
201     $prepare_cb)
202     In even rarer cases you want total control over how
203     AnyEvent::HTTP establishes connections. Normally it uses
204     AnyEvent::Socket::tcp_connect to do this, but you can provide
205     your own "tcp_connect" function - obviously, it has to follow
206     the same calling conventions, except that it may always return a
207     connection guard object.
208    
209     There are probably lots of weird uses for this function,
210     starting from tracing the hosts "http_request" actually tries to
211     connect, to (inexact but fast) host => IP address caching or
212     even socks protocol support.
213    
214 root 1.8 on_header => $callback->($headers)
215     When specified, this callback will be called with the header
216     hash as soon as headers have been successfully received from the
217     remote server (not on locally-generated errors).
218    
219     It has to return either true (in which case AnyEvent::HTTP will
220     continue), or false, in which case AnyEvent::HTTP will cancel
221     the download (and call the finish callback with an error code of
222     598).
223    
224     This callback is useful, among other things, to quickly reject
225     unwanted content, which, if it is supposed to be rare, can be
226     faster than first doing a "HEAD" request.
227    
228 root 1.15 The downside is that cancelling the request makes it impossible
229     to re-use the connection. Also, the "on_header" callback will
230     not receive any trailer (headers sent after the response body).
231    
            Example: cancel the request unless the content-type is
            "text/html".

               on_header => sub {
                  $_[0]{"content-type"} =~ /^text\/html\s*(?:;|$)/
               },
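
            The same idea extends to other headers. As a sketch, the
            factory below builds an "on_header" callback that also
            rejects responses with a too large declared size; the name
            "html_up_to" and the 1024 byte limit are chosen only for
            this example.

```perl
use strict;
use warnings;

# Build an "on_header" callback that accepts only text/html responses
# whose declared Content-Length does not exceed $max_bytes. Responses
# without a usable Content-Length header are let through.
sub html_up_to {
    my ($max_bytes) = @_;

    return sub {
        my ($hdr) = @_;

        my $type = $hdr->{"content-type"};
        return 0 unless defined $type && $type =~ /^text\/html\s*(?:;|$)/;

        my $len = $hdr->{"content-length"};
        return 0 if defined $len && $len =~ /^\d+$/ && $len > $max_bytes;

        return 1;
    };
}

# Try the callback on a hand-made header hash.
my $check = html_up_to(1024);
my %sample = ("content-type" => "text/html; charset=utf-8", "content-length" => 512);
print $check->(\%sample) ? "continue\n" : "cancel\n";
```

            With a real request this would be passed as
            "on_header => html_up_to(1024)".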
238    
239     on_body => $callback->($partial_body, $headers)
240     When specified, all body data will be passed to this callback
241     instead of to the completion callback. The completion callback
242     will get the empty string instead of the body data.
243    
244     It has to return either true (in which case AnyEvent::HTTP will
245     continue), or false, in which case AnyEvent::HTTP will cancel
246     the download (and call the completion callback with an error
247     code of 598).
248    
249 root 1.15 The downside to cancelling the request is that it makes it
250     impossible to re-use the connection.
251    
252 root 1.8 This callback is useful when the data is too large to be held in
253     memory (so the callback writes it to a file) or when only some
254     information should be extracted, or when the body should be
255     processed incrementally.
256    
257     It is usually preferred over doing your own body handling via
258 root 1.9 "want_body_handle", but in case of streaming APIs, where HTTP is
259     only used to create a connection, "want_body_handle" is the
260     better alternative, as it allows you to install your own event
261     handler, reducing resource usage.
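
            Writing the body to a file is probably the most common use
            of "on_body". The sketch below assumes an already opened
            filehandle; the factory name "body_to_fh" is made up for
            this example.

```perl
use strict;
use warnings;

# Build an "on_body" callback that appends every chunk of a 2xx response
# to the already opened filehandle $fh. It returns false on a failed or
# short write, which makes AnyEvent::HTTP cancel the request (status 598).
sub body_to_fh {
    my ($fh) = @_;

    return sub {
        my ($chunk, $hdr) = @_;

        # do not store the bodies of error responses
        return 1 unless $hdr->{Status} =~ /^2/;

        my $written = syswrite $fh, $chunk;
        return defined $written && $written == length $chunk;
    };
}

# Intended use together with AnyEvent::HTTP (not run here):
#
#   open my $fh, ">", $file or die "$file: $!";
#   http_get $url, on_body => body_to_fh($fh), sub {
#      my (undef, $hdr) = @_;   # the body argument is the empty string here
#      close $fh;
#   };
```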
262 root 1.8
263     want_body_handle => $enable
264     When enabled (default is disabled), the behaviour of
265     AnyEvent::HTTP changes considerably: after parsing the headers,
266     and instead of downloading the body (if any), the completion
267     callback will be called. Instead of the $body argument
268     containing the body data, the callback will receive the
269     AnyEvent::Handle object associated with the connection. In error
270     cases, "undef" will be passed. When there is no body (e.g.
271     status 304), the empty string will be passed.
272    
273     The handle object might or might not be in TLS mode, might be
274 root 1.15 connected to a proxy, be a persistent connection, use chunked
275     transfer encoding etc., and configured in unspecified ways. The
276     user is responsible for this handle (it will not be used by this
277     module anymore).
278 root 1.8
279     This is useful with some push-type services, where, after the
280     initial headers, an interactive protocol is used (typical
281     example would be the push-style twitter API which starts a
282     JSON/XML stream).
283    
284     If you think you need this, first have a look at "on_body", to
285 root 1.9 see if that doesn't solve your problem in a better way.
286 root 1.8
287 root 1.15 persistent => $boolean
288     Try to create/reuse a persistent connection. When this flag is
289     set (default: true for idempotent requests, false for all
290     others), then "http_request" tries to re-use an existing
291     (previously-created) persistent connection to the host and,
292     failing that, tries to create a new one.
293    
294     Requests failing in certain ways will be automatically retried
295     once, which is dangerous for non-idempotent requests, which is
296     why it defaults to off for them. The reason for this is because
297     the bozos who designed HTTP/1.1 made it impossible to
298     distinguish between a fatal error and a normal connection
299     timeout, so you never know whether there was a problem with your
300     request or not.
301    
            When reusing an existing connection, many parameters (such as
            TLS context) will be ignored. See the "session" parameter for a
            workaround.
305    
306     keepalive => $boolean
307     Only used when "persistent" is also true. This parameter decides
308     whether "http_request" tries to handshake a HTTP/1.0-style
309     keep-alive connection (as opposed to only a HTTP/1.1 persistent
310     connection).
311    
312     The default is true, except when using a proxy, in which case it
313     defaults to false, as HTTP/1.0 proxies cannot support this in a
314     meaningful way.
315    
316     handle_params => { key => value ... }
317     The key-value pairs in this hash will be passed to any
318     AnyEvent::Handle constructor that is called - not all requests
319     will create a handle, and sometimes more than one is created, so
320     this parameter is only good for setting hints.
321    
            Example: set the maximum read size to 4096, to potentially
            conserve memory at the cost of speed.

               handle_params => {
                  max_read_size => 4096,
               },
328    
        Example: do a simple HTTP GET request for http://www.nethype.de/ and
        print the response body.

           http_request GET => "http://www.nethype.de/", sub {
              my ($body, $hdr) = @_;
              print "$body\n";
           };
336    
        Example: do an HTTP HEAD request on https://www.google.com/, use a
        timeout of 30 seconds.

           http_request
              HEAD    => "https://www.google.com/",
              headers => { "user-agent" => "MySearchClient 1.0" },
              timeout => 30,
              sub {
                 my ($body, $hdr) = @_;
                 use Data::Dumper;
                 print Dumper $hdr;
              }
           ;
350    
        Example: do another simple HTTP GET request, but immediately try to
        cancel it.

           my $request = http_request GET => "http://www.nethype.de/", sub {
              my ($body, $hdr) = @_;
              print "$body\n";
           };

           undef $request;
360    
361 root 1.13 DNS CACHING
362     AnyEvent::HTTP uses the AnyEvent::Socket::tcp_connect function for the
363     actual connection, which in turn uses AnyEvent::DNS to resolve
364     hostnames. The latter is a simple stub resolver and does no caching on
365     its own. If you want DNS caching, you currently have to provide your own
366     default resolver (by storing a suitable resolver object in
367 root 1.15 $AnyEvent::DNS::RESOLVER) or your own "tcp_connect" callback.
368 root 1.13
369 root 1.2 GLOBAL FUNCTIONS AND VARIABLES
370     AnyEvent::HTTP::set_proxy "proxy-url"
371     Sets the default proxy server to use. The proxy-url must begin with
372 root 1.15 a string of the form "http://host:port", croaks otherwise.
373 root 1.12
374     To clear an already-set proxy, use "undef".
375 root 1.2
        When AnyEvent::HTTP is loaded for the first time it will query the
        default proxy from the operating system, currently by looking at
        "$ENV{http_proxy}".
379    
380 root 1.15 AnyEvent::HTTP::cookie_jar_expire $jar[, $session_end]
        Remove all cookies from the cookie jar that have expired. If
        $session_end is given and true, then additionally remove all session
        cookies.

        You should call this function (with a true $session_end) before you
        save cookies to disk, and you should call this function after
        loading them again. If you have a long-running program you can
        additionally call this function from time to time.

        A cookie jar is initially an empty hash-reference that is managed by
        this module. Its format is subject to change, but currently it is
        like this:

        The key "version" has to contain 1, otherwise the hash gets emptied.
        All other keys are hostnames or IP addresses pointing to
        hash-references. The key for these inner hash references is the
        server path for which this cookie is meant, and the values are again
        hash-references. The keys of those hash-references are the cookie
        names, and the values, you guessed it, are further hash-references,
        this time with the key-value pairs from the cookie, except for
        "expires" and "max-age", which have been replaced by a "_expires"
        key that contains the cookie expiry timestamp.
403    
        Here is an example of a cookie jar with a single cookie, so you have
        a chance of understanding the above paragraph:

           {
              version    => 1,
              "10.0.0.1" => {
                 "/" => {
                    "mythweb_id" => {
                       _expires => 1293917923,
                       value    => "ooRung9dThee3ooyXooM1Ohm",
                    },
                 },
              },
           }
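
        Walking such a jar takes three nested loops, one per level. The
        sketch below assumes the version 1 layout just described (which,
        as noted, is subject to change); the helper name "list_cookies"
        is made up for this example.

```perl
use strict;
use warnings;

# Return one "host path name=value" line per cookie stored in $jar,
# assuming the layout host => path => name => { value => ..., ... }.
sub list_cookies {
    my ($jar) = @_;

    my @lines;
    for my $host (grep { $_ ne "version" } sort keys %$jar) {
        my $paths = $jar->{$host};

        for my $path (sort keys %$paths) {
            my $cookies = $paths->{$path};

            for my $name (sort keys %$cookies) {
                push @lines, "$host $path $name=$cookies->{$name}{value}";
            }
        }
    }

    return @lines;
}

# The single-cookie jar from the example in this section.
my $jar = {
    version    => 1,
    "10.0.0.1" => {
        "/" => {
            "mythweb_id" => {
                _expires => 1293917923,
                value    => "ooRung9dThee3ooyXooM1Ohm",
            },
        },
    },
};

print "$_\n" for list_cookies($jar);
```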
418    
419 root 1.14 $date = AnyEvent::HTTP::format_date $timestamp
420     Takes a POSIX timestamp (seconds since the epoch) and formats it as
421     a HTTP Date (RFC 2616).
422    
423     $timestamp = AnyEvent::HTTP::parse_date $date
424 root 1.15 Takes a HTTP Date (RFC 2616) or a Cookie date (netscape cookie spec)
425     or a bunch of minor variations of those, and returns the
426     corresponding POSIX timestamp, or "undef" if the date cannot be
427     parsed.
428 root 1.14
429 root 1.2 $AnyEvent::HTTP::MAX_RECURSE
430     The default value for the "recurse" request parameter (default: 10).
431    
    $AnyEvent::HTTP::TIMEOUT
        The default timeout for connection operations (default: 300).
434    
435 root 1.2 $AnyEvent::HTTP::USERAGENT
436     The default value for the "User-Agent" header (the default is
437 root 1.8 "Mozilla/5.0 (compatible; U; AnyEvent-HTTP/$VERSION;
438 root 1.2 +http://software.schmorp.de/pkg/AnyEvent)").
439    
440 root 1.8 $AnyEvent::HTTP::MAX_PER_HOST
441 root 1.10 The maximum number of concurrent connections to the same host
442 root 1.8 (identified by the hostname). If the limit is exceeded, then the
443     additional requests are queued until previous connections are
444 root 1.15 closed. Both persistent and non-persistent connections are counted
445     in this limit.
446 root 1.2
447 root 1.8 The default value for this is 4, and it is highly advisable to not
448 root 1.15 increase it much.
449    
        For comparison: the RFCs recommend 4 non-persistent or 2 persistent
        connections, older browsers used 2, newer ones (such as firefox 3)
        typically use 6, and Opera uses 8 because like, they have the
        fastest browser and give a shit for everybody else on the planet.
454    
    $AnyEvent::HTTP::PERSISTENT_TIMEOUT
        The time after which idle persistent connections get closed by
        AnyEvent::HTTP (default: 3).
458 root 1.2
459     $AnyEvent::HTTP::ACTIVE
460     The number of active connections. This is not the number of
461     currently running requests, but the number of currently open and
462 root 1.15 non-idle TCP connections. This number can be useful for
463 root 1.2 load-leveling.
464 root 1.1
SHOWCASE
    This section contains some more elaborate "real-world" examples or code
    snippets.
468    
469     HTTP/1.1 FILE DOWNLOAD
470 root 1.18 Downloading files with HTTP can be quite tricky, especially when
471 root 1.19 something goes wrong and you want to resume.
472 root 1.16
473     Here is a function that initiates and resumes a download. It uses the
474     last modified time to check for file content changes, and works with
475     many HTTP/1.0 servers as well, and usually falls back to a complete
476     re-download on older servers.
477    
    It calls the completion callback with "undef" when a nonretryable
    error occurred, 0 when the download was partial and should be retried,
    and 1 if it was successful.
481    
       use AnyEvent::HTTP;

       sub download($$$) {
          my ($url, $file, $cb) = @_;

          # "+<" opens read/write without truncating, so $file must already exist
          open my $fh, "+<", $file
             or die "$file: $!";

          my %hdr;
          my $ofs = 0;

          # a non-empty file means a previous attempt: ask for the rest only
          if (stat $fh and -s _) {
             $ofs = -s _;
             $hdr{"if-unmodified-since"} = AnyEvent::HTTP::format_date +(stat _)[9];
             $hdr{"range"} = "bytes=$ofs-";
          }

          http_get $url,
             headers   => \%hdr,
             on_header => sub {
                my ($hdr) = @_;

                if ($hdr->{Status} == 200 && $ofs) {
                   # resume failed
                   truncate $fh, $ofs = 0;
                }

                sysseek $fh, $ofs, 0;

                1
             },
             on_body   => sub {
                my ($data, $hdr) = @_;

                if ($hdr->{Status} =~ /^2/) {
                   length $data == syswrite $fh, $data
                      or return; # abort on write errors
                }

                1
             },
             sub {
                my (undef, $hdr) = @_;

                my $status = $hdr->{Status};

                if (my $time = AnyEvent::HTTP::parse_date $hdr->{"last-modified"}) {
                   # utime takes atime and mtime first, then the file(handle)
                   utime $time, $time, $fh;
                }

                if ($status == 200 || $status == 206 || $status == 416) {
                   # download ok || resume ok || file already fully downloaded
                   $cb->(1, $hdr);

                } elsif ($status == 412) {
                   # file has changed while resuming, delete and retry
                   unlink $file;
                   $cb->(0, $hdr);

                } elsif ($status == 500 or $status == 503 or $status =~ /^59/) {
                   # retry later
                   $cb->(0, $hdr);

                } else {
                   $cb->(undef, $hdr);
                }
             }
          ;
       }

       download "http://server/somelargefile", "/tmp/somelargefile", sub {
          if ($_[0]) {
             print "OK!\n";
          } elsif (defined $_[0]) {
             print "please retry later\n";
          } else {
             print "ERROR\n";
          }
       };
564    
565     SOCKS PROXIES
566 root 1.14 Socks proxies are not directly supported by AnyEvent::HTTP. You can
567     compile your perl to support socks, or use an external program such as
568     socksify (dante) or tsocks to make your program use a socks proxy
569     transparently.
570    
571     Alternatively, for AnyEvent::HTTP only, you can use your own
572     "tcp_connect" function that does the proxy handshake - here is an
573     example that works with socks4a proxies:
574    
       use Errno;
       use AnyEvent::Util;
       use AnyEvent::Socket;
       use AnyEvent::Handle;

       # host, port and username of/for your socks4a proxy
       my $socks_host = "10.0.0.23";
       my $socks_port = 9050;
       my $socks_user = "";

       sub socks4a_connect {
          my ($host, $port, $connect_cb, $prepare_cb) = @_;

          my $hdl = new AnyEvent::Handle
             connect    => [$socks_host, $socks_port],
             on_prepare => sub { $prepare_cb->($_[0]{fh}) },
             on_error   => sub { $connect_cb->() },
          ;

          # request: version 4, command 1 (connect), port, the placeholder
          # address 0.0.0.1, then NUL-terminated user name and hostname
          $hdl->push_write (pack "CCnNZ*Z*", 4, 1, $port, 1, $socks_user, $host);

          # the reply is always 8 bytes long
          $hdl->push_read (chunk => 8, sub {
             my ($hdl, $chunk) = @_;
             my ($status, $port, $ipn) = unpack "xCna4", $chunk;

             if ($status == 0x5a) {
                # request granted
                $connect_cb->($hdl->{fh}, (format_address $ipn) . ":$port");
             } else {
                $! = Errno::ENXIO; $connect_cb->();
             }
          });

          $hdl
       }
609    
610     Use "socks4a_connect" instead of "tcp_connect" when doing
611     "http_request"s, possibly after switching off other proxy types:
612    
       AnyEvent::HTTP::set_proxy undef; # usually you do not want other proxies

       http_get 'http://www.google.com', tcp_connect => \&socks4a_connect, sub {
          my ($data, $headers) = @_;
          ...
       };
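
    The request and reply formats used by "socks4a_connect" can be tried
    out without any network access. The sketch below reuses the same
    "pack" and "unpack" templates; the helper names "socks4a_request" and
    "socks4a_reply" and the example.com hostname are made up for
    illustration.

```perl
use strict;
use warnings;

# Build a SOCKS4a CONNECT request: version 4, command 1 (connect), the
# target port, the placeholder address 0.0.0.1 (which tells the proxy that
# a hostname follows), then the NUL-terminated user name and hostname.
sub socks4a_request {
    my ($host, $port, $user) = @_;
    return pack "CCnNZ*Z*", 4, 1, $port, 1, $user, $host;
}

# Decode the 8-byte reply into (granted, port, dotted-quad address).
# A status byte of 0x5a means that the request was granted.
sub socks4a_reply {
    my ($chunk) = @_;
    my ($status, $port, $ipn) = unpack "xCna4", $chunk;
    return ($status == 0x5a, $port, join(".", unpack("C4", $ipn)));
}

my $request = socks4a_request("example.com", 80, "");
printf "request is %d bytes: %s\n", length($request), unpack("H*", $request);

# A hand-made "granted" reply, as a proxy might send it.
my ($granted, $port, $ip) = socks4a_reply(pack("CCna4", 0, 0x5a, 80, pack("C4", 10, 0, 0, 1)));
print $granted ? "granted, bound to $ip:$port\n" : "rejected\n";
```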
619    
620 root 1.1 SEE ALSO
621 root 1.2 AnyEvent.
622 root 1.1
623     AUTHOR
624 root 1.3 Marc Lehmann <schmorp@schmorp.de>
625     http://home.schmorp.de/
626 root 1.1
627 root 1.7 With many thanks to Дмитрий Шалашов, who provided
628     countless testcases and bugreports.
629