/cvs/AnyEvent-HTTP/README
Revision: 1.28
Committed: Mon Apr 27 12:14:12 2020 UTC (4 years ago) by root
Branch: MAIN
CVS Tags: rel-2_25, HEAD
Changes since 1.27: +29 -13 lines
Log Message:
2.25

1 NAME
2 AnyEvent::HTTP - simple but non-blocking HTTP/HTTPS client
3
4 SYNOPSIS
5 use AnyEvent::HTTP;
6
7 http_get "http://www.nethype.de/", sub { print $_[1] };
8
9 # ... do something else here
10
11 DESCRIPTION
12 This module is an AnyEvent user, so you need to make sure that you use
13 and run a supported event loop.
14
15 This module implements a simple, stateless and non-blocking HTTP client.
16 It supports GET, POST and other request methods, cookies and more, all
17 on a very low level. It can follow redirects, supports proxies, and
18 automatically limits the number of connections to the values specified
19 in the RFC.
20
21 It should generally be a "good client" that is sufficient for most HTTP
22 tasks. Simple tasks should be simple, but complex tasks should still be
23 possible as the user retains control over request and response headers.
24
25 The caller is responsible for authentication management, cookies (if the
26 simplistic implementation in this module doesn't suffice), referer and
27 other high-level protocol details for which this module offers only
28 limited support.
29
30 METHODS
31 http_get $url, key => value..., $cb->($data, $headers)
32 Executes an HTTP-GET request. See the http_request function for
33 details on additional parameters and the return value.
34
35 http_head $url, key => value..., $cb->($data, $headers)
36 Executes an HTTP-HEAD request. See the http_request function for
37 details on additional parameters and the return value.
38
39 http_post $url, $body, key => value..., $cb->($data, $headers)
40 Executes an HTTP-POST request with a request body of $body. See the
41 http_request function for details on additional parameters and the
42 return value.
43
44 http_request $method => $url, key => value..., $cb->($data, $headers)
45 Executes an HTTP request of type $method (e.g. "GET", "POST"). The
46 URL must be an absolute http or https URL.
47
48 When called in void context, nothing is returned. In other contexts,
49 "http_request" returns a "cancellation guard" - you have to keep the
50 object alive at least until the callback gets called. If the object
51 gets destroyed before the callback is called, the request will be
52 cancelled.
53
54 The callback will be called with the response body data as first
55 argument (or "undef" if an error occurred), and a hash-ref with
56 response headers (and trailers) as second argument.
57
58 All the headers in that hash are lowercased. In addition to the
59 response headers, the "pseudo-headers" (uppercase to avoid clashing
60 with possible response headers) "HTTPVersion", "Status" and "Reason"
61 contain the three parts of the HTTP Status-Line of the same name. If
62 an error occurs during the body phase of a request, then the
63 original "Status" and "Reason" values from the header are available
64 as "OrigStatus" and "OrigReason".
65
66 The pseudo-header "URL" contains the actual URL (which can differ
67 from the requested URL when following redirects - for example, you
68 might get an error that your URL scheme is not supported even though
69 your URL is a valid http URL because it redirected to an ftp URL, in
70 which case you can look at the URL pseudo header).
71
72 The pseudo-header "Redirect" only exists when the request was a
73 result of an internal redirect. In that case it is an array
74 reference with the "($data, $headers)" from the redirect response.
75 Note that this response could in turn be the result of a redirect
76 itself, and "$headers->{Redirect}[1]{Redirect}" will then contain
77 the original response, and so on.
78
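The nesting of the "Redirect" pseudo-header can be unwound with a short loop. The following is a minimal sketch, not part of AnyEvent::HTTP: the helper name redirect_chain is made up for illustration, and it assumes that every header hash in the chain carries the "URL" pseudo-header described above.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of AnyEvent::HTTP): returns the URLs of all
# responses involved in a request, oldest first, by following the nested
# "Redirect" pseudo-headers.
sub redirect_chain {
    my ($headers) = @_;
    my @urls;

    while ($headers) {
        unshift @urls, $headers->{URL};

        # Redirect is [$data, $headers] of the previous (redirecting) response
        my $prev = $headers->{Redirect};
        $headers = $prev ? $prev->[1] : undef;
    }

    return @urls;
}
```

Inside a completion callback this could be called as redirect_chain($_[1]) to log how a request ended up at its final URL.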
79 If the server sends a header multiple times, then their contents
80 will be joined together with a comma (","), as per the HTTP spec.
81
82 If an internal error occurs, such as not being able to resolve a
83 hostname, then $data will be "undef", "$headers->{Status}" will be
84 590-599 and the "Reason" pseudo-header will contain an error
85 message. Currently the following status codes are used:
86
87 595 - errors during connection establishment, proxy handshake.
88 596 - errors during TLS negotiation, request sending and header
89 processing.
90 597 - errors during body receiving or processing.
91 598 - user aborted request via "on_header" or "on_body".
92 599 - other, usually nonretryable, errors (garbled URL etc.).
93
94 A typical callback might look like this:
95
    sub {
       my ($body, $hdr) = @_;

       if ($hdr->{Status} =~ /^2/) {
          # ... everything should be ok
       } else {
          print "error, $hdr->{Status} $hdr->{Reason}\n";
       }
    }
105
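The internal status codes listed above can also be folded into a coarse category before deciding what to do next. This is only a sketch: the function name classify_status and the category names are invented here, and treating 595-597 as worth retrying is an assumption (the text above only says that 599 is usually nonretryable).

```perl
use strict;
use warnings;

# Hypothetical helper: maps the "Status" pseudo-header to a coarse category.
# Assumption of this sketch: 595-597 are transient and may be retried.
sub classify_status {
    my ($hdr) = @_;
    my $status = $hdr->{Status};

    return "ok"      if $status =~ /^2/;
    return "aborted" if $status == 598;    # on_header/on_body returned false
    return "retry"   if $status >= 595 && $status <= 597;
    return "fatal"   if $status >= 590 && $status <= 599;
    return "http";                         # a real non-2xx response from the server
}
```

A completion callback would call classify_status($hdr) and, for example, schedule another attempt only for the "retry" category.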
106 Additional parameters are key-value pairs, and are fully optional.
107 They include:
108
109 recurse => $count (default: $MAX_RECURSE)
110 Whether to recurse requests or not, e.g. on redirects,
111 authentication and other retries and so on, and how often to do
112 so.
113
114 Only redirects to http and https URLs are supported. While most
115 common redirection forms are handled entirely within this
116 module, some require the use of the optional URI module. If it
117 is required but missing, then the request will fail with an
118 error.
119
120 headers => hashref
121 The request headers to use. Currently, "http_request" may
122 provide its own "Host:", "Content-Length:", "Connection:" and
123 "Cookie:" headers and will provide defaults at least for "TE:",
124 "Referer:" and "User-Agent:" (this can be suppressed by using
125 "undef" for these headers in which case they won't be sent at
126 all).
127
128 You really should provide your own "User-Agent:" header value
129 that is appropriate for your program - I wouldn't be surprised
130 if the default AnyEvent string gets blocked by webservers sooner
131 or later.
132
133 Also, make sure that your header names and values do not
134 contain any embedded newlines.
135
136 timeout => $seconds
137 The time-out to use for various stages - each connect attempt
138 will reset the timeout, as will read or write activity, i.e.
139 this is not an overall timeout.
140
141 Default timeout is 5 minutes.
142
143 proxy => [$host, $port[, $scheme]] or undef
144 Use the given http proxy for all requests, or no proxy if
145 "undef" is used.
146
147 $scheme must be either missing or must be "http" for HTTP.
148
149 If not specified, then the default proxy is used (see
150 "AnyEvent::HTTP::set_proxy").
151
152 Currently, if your proxy requires authorization, you have to
153 specify an appropriate "Proxy-Authorization" header in every
154 request.
155
156 Note that this module will prefer an existing persistent
157 connection, even if that connection was made using another
158 proxy. If you need to ensure that a new connection is made in
159 this case, you can either force "persistent" to false or e.g.
160 use the proxy address in your "sessionid".
161
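Since the "Proxy-Authorization" header has to be supplied with every request, it can help to build the proxy parameters in one place. The sketch below assumes the proxy expects HTTP Basic credentials, which the text above does not state; the helper name proxy_params and the host, port and credentials in the usage comment are placeholders.

```perl
use strict;
use warnings;
use MIME::Base64 qw(encode_base64);

# Hypothetical helper: builds the extra http_request parameters for a proxy
# that expects HTTP Basic credentials (an assumption, other schemes differ).
sub proxy_params {
    my ($host, $port, $user, $pass) = @_;

    # The second argument "" suppresses the trailing newline that
    # encode_base64 would otherwise add; header values must not contain one.
    my $token = encode_base64("$user:$pass", "");

    return (
        proxy   => [$host, $port],
        headers => { "proxy-authorization" => "Basic $token" },
    );
}

# Usage (not executed here, needs AnyEvent::HTTP and a running event loop):
#   http_get $url, proxy_params("proxy.example", 3128, "user", "secret"), sub { ... };
```

Note that the returned list contains its own "headers" key, so a caller that needs further request headers would have to merge them into that hash rather than pass a second "headers" parameter.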
162 body => $string
163 The request body, usually empty. Will be sent as-is (future
164 versions of this module might offer more options).
165
166 cookie_jar => $hash_ref
167 Passing this parameter enables (simplified) cookie-processing,
168 loosely based on the original netscape specification.
169
170 The $hash_ref must be an (initially empty) hash reference which
171 will get updated automatically. It is possible to save the
172 cookie jar to persistent storage with something like JSON or
173 Storable - see the "AnyEvent::HTTP::cookie_jar_expire" function
174 if you wish to remove expired or session-only cookies, and also
175 for documentation on the format of the cookie jar.
176
177 Note that this cookie implementation is not meant to be
178 complete. If you want complete cookie management you have to do
179 that on your own. "cookie_jar" is meant as a quick fix to get
180 most cookie-using sites working. Cookies are a privacy disaster,
181 do not use them unless required to.
182
183 When cookie processing is enabled, the "Cookie:" and
184 "Set-Cookie:" headers will be set and handled by this module,
185 otherwise they will be left untouched.
186
187 tls_ctx => $scheme | $tls_ctx
188 Specifies the AnyEvent::TLS context to be used for https
189 connections. This parameter follows the same rules as the
190 "tls_ctx" parameter to AnyEvent::Handle, but additionally, the
191 two strings "low" or "high" can be specified, which give you a
192 predefined low-security (no verification, highest compatibility)
193 and high-security (CA and common-name verification) TLS context.
194
195 The default for this option is "low", which could be interpreted
196 as "give me the page, no matter what".
197
198 See also the "sessionid" parameter.
199
200 sessionid => $string
201 The module might reuse connections to the same host internally
202 (regardless of other settings, such as "tcp_connect" or
203 "proxy"). Sometimes (e.g. when using TLS or a specific proxy),
204 you do not want to reuse connections from other sessions. This
205 can be achieved by setting this parameter to some unique ID
206 (such as the address of an object storing your state data or the
207 TLS context, or the proxy IP) - only connections using the same
208 unique ID will be reused.
209
210 on_prepare => $callback->($fh)
211 In rare cases you need to "tune" the socket before it is used to
212 connect (for example, to bind it on a given IP address). This
213 parameter overrides the prepare callback passed to
214 "AnyEvent::Socket::tcp_connect" and behaves exactly the same way
215 (e.g. it has to provide a timeout). See the description for the
216 $prepare_cb argument of "AnyEvent::Socket::tcp_connect" for
217 details.
218
219 tcp_connect => $callback->($host, $service, $connect_cb,
220 $prepare_cb)
221 In even rarer cases you want total control over how
222 AnyEvent::HTTP establishes connections. Normally it uses
223 AnyEvent::Socket::tcp_connect to do this, but you can provide
224 your own "tcp_connect" function - obviously, it has to follow
225 the same calling conventions, except that it may always return a
226 connection guard object.
227
228 The connections made by this hook will be treated as equivalent
229 to connections made the built-in way, specifically, they will be
230 put into and taken from the persistent connection cache. If your
231 $tcp_connect function is incompatible with this kind of re-use,
232 consider switching off "persistent" connections and/or providing
233 a "sessionid" identifier.
234
235 There are probably lots of weird uses for this function,
236 starting from tracing the hosts "http_request" actually tries to
237 connect, to (inexact but fast) host => IP address caching or
238 even socks protocol support.
239
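As an illustration of the tracing use mentioned above, a "tcp_connect" replacement can simply wrap another connect function. This is a sketch under assumptions: make_tracing_connect is a made-up name, and the connect function is passed in so that the wrapper can be tried without a network; when omitted it falls back to AnyEvent::Socket::tcp_connect, which must then be loaded by the caller.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a tcp_connect-compatible function that reports
# every connection attempt to $log and then delegates to $connect.
sub make_tracing_connect {
    my ($log, $connect) = @_;

    # Default to the real function; the reference is only resolved when called.
    $connect ||= \&AnyEvent::Socket::tcp_connect;

    return sub {
        my ($host, $service, $connect_cb, $prepare_cb) = @_;

        $log->("connecting to $host:$service");

        # Return whatever the wrapped function returns (normally the guard).
        return $connect->($host, $service, $connect_cb, $prepare_cb);
    };
}

# Usage (not executed here):
#   http_get $url, tcp_connect => make_tracing_connect(sub { warn "$_[0]\n" }), sub { ... };
```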
240 on_header => $callback->($headers)
241 When specified, this callback will be called with the header
242 hash as soon as headers have been successfully received from the
243 remote server (not on locally-generated errors).
244
245 It has to return either true (in which case AnyEvent::HTTP will
246 continue), or false, in which case AnyEvent::HTTP will cancel
247 the download (and call the finish callback with an error code of
248 598).
249
250 This callback is useful, among other things, to quickly reject
251 unwanted content, which, if it is supposed to be rare, can be
252 faster than first doing a "HEAD" request.
253
254 The downside is that cancelling the request makes it impossible
255 to re-use the connection. Also, the "on_header" callback will
256 not receive any trailer (headers sent after the response body).
257
258 Example: cancel the request unless the content-type is
259 "text/html".
260
    on_header => sub {
       $_[0]{"content-type"} =~ /^text\/html\s*(?:;|$)/
    },
264
265 on_body => $callback->($partial_body, $headers)
266 When specified, all body data will be passed to this callback
267 instead of to the completion callback. The completion callback
268 will get the empty string instead of the body data.
269
270 It has to return either true (in which case AnyEvent::HTTP will
271 continue), or false, in which case AnyEvent::HTTP will cancel
272 the download (and call the completion callback with an error
273 code of 598).
274
275 The downside to cancelling the request is that it makes it
276 impossible to re-use the connection.
277
278 This callback is useful when the data is too large to be held in
279 memory (so the callback writes it to a file) or when only some
280 information should be extracted, or when the body should be
281 processed incrementally.
282
283 It is usually preferred over doing your own body handling via
284 "want_body_handle", but in case of streaming APIs, where HTTP is
285 only used to create a connection, "want_body_handle" is the
286 better alternative, as it allows you to install your own event
287 handler, reducing resource usage.
288
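A callback for "on_body" that writes each chunk to a file and gives up once a size limit is exceeded might look like the following sketch. The factory name make_body_writer and the size cap are assumptions made for illustration; the true/false return convention is the one described above.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a callback suitable for on_body that appends
# each chunk to $fh and cancels the download once more than $max_bytes
# have arrived.
sub make_body_writer {
    my ($fh, $max_bytes) = @_;
    my $seen = 0;

    return sub {
        my ($chunk, $hdr) = @_;

        $seen += length $chunk;
        return 0 if $seen > $max_bytes;    # false: cancel, status becomes 598

        print {$fh} $chunk
            or return 0;                   # also cancel on write errors

        return 1;                          # true: keep downloading
    };
}

# Usage (not executed here):
#   http_get $url, on_body => make_body_writer($fh, 10_000_000), sub { ... };
```

The chunk that crosses the limit is not written, so the file never grows beyond $max_bytes.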
289 want_body_handle => $enable
290 When enabled (default is disabled), the behaviour of
291 AnyEvent::HTTP changes considerably: after parsing the headers,
292 and instead of downloading the body (if any), the completion
293 callback will be called. Instead of the $body argument
294 containing the body data, the callback will receive the
295 AnyEvent::Handle object associated with the connection. In error
296 cases, "undef" will be passed. When there is no body (e.g.
297 status 304), the empty string will be passed.
298
299 The handle object might or might not be in TLS mode, might be
300 connected to a proxy, be a persistent connection, use chunked
301 transfer encoding etc., and configured in unspecified ways. The
302 user is responsible for this handle (it will not be used by this
303 module anymore).
304
305 This is useful with some push-type services, where, after the
306 initial headers, an interactive protocol is used (typical
307 example would be the push-style twitter API which starts a
308 JSON/XML stream).
309
310 If you think you need this, first have a look at "on_body", to
311 see if that doesn't solve your problem in a better way.
312
313 persistent => $boolean
314 Try to create/reuse a persistent connection. When this flag is
315 set (default: true for idempotent requests, false for all
316 others), then "http_request" tries to re-use an existing
317 (previously-created) persistent connection to the same host (i.e.
318 identical URL scheme, hostname, port and sessionid) and, failing
319 that, tries to create a new one.
320
321 Requests failing in certain ways will be automatically retried
322 once, which is dangerous for non-idempotent requests, which is
323 why it defaults to off for them. The reason for this is because
324 the bozos who designed HTTP/1.1 made it impossible to
325 distinguish between a fatal error and a normal connection
326 timeout, so you never know whether there was a problem with your
327 request or not.
328
329 When reusing an existing connection, many parameters (such as
330 TLS context) will be ignored. See the "sessionid" parameter for
331 a workaround.
332
333 keepalive => $boolean
334 Only used when "persistent" is also true. This parameter decides
335 whether "http_request" tries to handshake a HTTP/1.0-style
336 keep-alive connection (as opposed to only a HTTP/1.1 persistent
337 connection).
338
339 The default is true, except when using a proxy, in which case it
340 defaults to false, as HTTP/1.0 proxies cannot support this in a
341 meaningful way.
342
343 handle_params => { key => value ... }
344 The key-value pairs in this hash will be passed to any
345 AnyEvent::Handle constructor that is called - not all requests
346 will create a handle, and sometimes more than one is created, so
347 this parameter is only good for setting hints.
348
349 Example: set the maximum read size to 4096, to potentially
350 conserve memory at the cost of speed.
351
    handle_params => {
       max_read_size => 4096,
    },
355
356 Example: do a simple HTTP GET request for http://www.nethype.de/ and
357 print the response body.
358
    http_request GET => "http://www.nethype.de/", sub {
       my ($body, $hdr) = @_;
       print "$body\n";
    };
363
364 Example: do a HTTP HEAD request on https://www.google.com/, use a
365 timeout of 30 seconds.
366
    http_request
       HEAD    => "https://www.google.com",
       headers => { "user-agent" => "MySearchClient 1.0" },
       timeout => 30,
       sub {
          my ($body, $hdr) = @_;
          use Data::Dumper;
          print Dumper $hdr;
       }
    ;
377
378 Example: do another simple HTTP GET request, but immediately try to
379 cancel it.
380
    my $request = http_request GET => "http://www.nethype.de/", sub {
       my ($body, $hdr) = @_;
       print "$body\n";
    };

    undef $request;
387
388 DNS CACHING
389 AnyEvent::HTTP uses the AnyEvent::Socket::tcp_connect function for the
390 actual connection, which in turn uses AnyEvent::DNS to resolve
391 hostnames. The latter is a simple stub resolver and does no caching on
392 its own. If you want DNS caching, you currently have to provide your own
393 default resolver (by storing a suitable resolver object in
394 $AnyEvent::DNS::RESOLVER) or your own "tcp_connect" callback.
395
396 GLOBAL FUNCTIONS AND VARIABLES
397 AnyEvent::HTTP::set_proxy "proxy-url"
398 Sets the default proxy server to use. The proxy-url must begin with
399 a string of the form "http://host:port", croaks otherwise.
400
401 To clear an already-set proxy, use "undef".
402
403 When AnyEvent::HTTP is loaded for the first time it will query the
404 default proxy from the operating system, currently by looking at
405 "$ENV{http_proxy}".
406
407 AnyEvent::HTTP::cookie_jar_expire $jar[, $session_end]
408 Remove all cookies from the cookie jar that have expired. If
409 $session_end is given and true, then additionally remove all session
410 cookies.
411
412 You should call this function (with a true $session_end) before you
413 save cookies to disk, and you should call this function after
414 loading them again. If you have a long-running program you can
415 additionally call this function from time to time.
416
417 A cookie jar is initially an empty hash-reference that is managed by
418 this module. Its format is subject to change, but currently it is as
419 follows:
420
421 The key "version" has to contain 2, otherwise the hash gets cleared.
422 All other keys are hostnames or IP addresses pointing to
423 hash-references. The key for these inner hash references is the
424 server path for which this cookie is meant, and the values are again
425 hash-references. Each key of those hash-references is a cookie name,
426 and the value, you guessed it, is another hash-reference, this time
427 with the key-value pairs from the cookie, except for "expires" and
428 "max-age", which have been replaced by a "_expires" key that
429 contains the cookie expiry timestamp. Session cookies are indicated
430 by not having an "_expires" key.
431
432 Here is an example of a cookie jar with a single cookie, so you have
433 a chance of understanding the above paragraph:
434
    {
       version    => 2,
       "10.0.0.1" => {
          "/" => {
             "mythweb_id" => {
                _expires => 1293917923,
                value    => "ooRung9dThee3ooyXooM1Ohm",
             },
          },
       },
    }
446
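To make the nesting concrete, the following sketch walks a jar in the format described above and flattens it into one line per cookie. The helper name list_cookies and the output format are invented for this example, and since the jar format is subject to change, this should be read as an illustration of the current layout only.

```perl
use strict;
use warnings;

# Hypothetical helper: flattens a cookie jar (host => path => name => cookie)
# into a sorted list of "host path name=value" strings. Session cookies,
# i.e. those without an "_expires" key, get a trailing " (session)".
sub list_cookies {
    my ($jar) = @_;
    my @out;

    # every top-level key except "version" is a hostname or IP address
    for my $host (sort grep { $_ ne "version" } keys %$jar) {
        my $paths = $jar->{$host};

        for my $path (sort keys %$paths) {
            my $cookies = $paths->{$path};

            for my $name (sort keys %$cookies) {
                my $cookie = $cookies->{$name};
                my $line   = "$host $path $name=$cookie->{value}";
                $line .= " (session)" unless exists $cookie->{_expires};
                push @out, $line;
            }
        }
    }

    return @out;
}
```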
447 $date = AnyEvent::HTTP::format_date $timestamp
448 Takes a POSIX timestamp (seconds since the epoch) and formats it as
449 a HTTP Date (RFC 2616).
450
451 $timestamp = AnyEvent::HTTP::parse_date $date
452 Takes a HTTP Date (RFC 2616) or a Cookie date (netscape cookie spec)
453 or a bunch of minor variations of those, and returns the
454 corresponding POSIX timestamp, or "undef" if the date cannot be
455 parsed.
456
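For readers unfamiliar with the RFC 2616 date format, the sketch below shows what such a date looks like for a given timestamp. It is an illustration only and not the implementation used by "AnyEvent::HTTP::format_date"; the function name http_date_sketch is made up.

```perl
use strict;
use warnings;

my @DAY = qw(Sun Mon Tue Wed Thu Fri Sat);
my @MON = qw(Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec);

# Illustration only: renders a POSIX timestamp in the fixed-length
# RFC 2616 date format, which is always expressed in GMT.
sub http_date_sketch {
    my ($timestamp) = @_;
    my ($sec, $min, $hour, $mday, $mon, $year, $wday) = gmtime $timestamp;

    return sprintf "%s, %02d %s %04d %02d:%02d:%02d GMT",
        $DAY[$wday], $mday, $MON[$mon], $year + 1900, $hour, $min, $sec;
}

print http_date_sketch(784111777), "\n";    # Sun, 06 Nov 1994 08:49:37 GMT
```

The day and month names are spelled out in the code instead of using strftime, so the result does not depend on the current locale.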
457 $AnyEvent::HTTP::MAX_RECURSE
458 The default value for the "recurse" request parameter (default: 10).
459
460 $AnyEvent::HTTP::TIMEOUT
461 The default timeout for connection operations (default: 300).
462
463 $AnyEvent::HTTP::USERAGENT
464 The default value for the "User-Agent" header (the default is
465 "Mozilla/5.0 (compatible; U; AnyEvent-HTTP/$VERSION;
466 +http://software.schmorp.de/pkg/AnyEvent)").
467
468 $AnyEvent::HTTP::MAX_PER_HOST
469 The maximum number of concurrent connections to the same host
470 (identified by the hostname). If the limit is exceeded, then
471 additional requests are queued until previous connections are
472 closed. Both persistent and non-persistent connections are counted
473 in this limit.
474
475 The default value for this is 4, and it is highly advisable to not
476 increase it much.
477
478 For comparison: the RFCs recommend 4 non-persistent or 2 persistent
479 connections, older browsers used 2, newer ones (such as firefox 3)
480 typically use 6, and Opera uses 8 because like, they have the
481 fastest browser and give a shit for everybody else on the planet.
482
483 $AnyEvent::HTTP::PERSISTENT_TIMEOUT
484 The time after which idle persistent connections get closed by
485 AnyEvent::HTTP (default: 3).
486
487 $AnyEvent::HTTP::ACTIVE
488 The number of active connections. This is not the number of
489 currently running requests, but the number of currently open and
490 non-idle TCP connections. This number can be useful for
491 load-leveling.
492
493 SHOWCASE
494 This section contains some more elaborate "real-world" examples or code
495 snippets.
496
497 HTTP/1.1 FILE DOWNLOAD
498 Downloading files with HTTP can be quite tricky, especially when
499 something goes wrong and you want to resume.
500
501 Here is a function that initiates and resumes a download. It uses the
502 last modified time to check for file content changes, and works with
503 many HTTP/1.0 servers as well, and usually falls back to a complete
504 re-download on older servers.
505
506 It calls the completion callback with "undef", which means a
507 nonretryable error occurred, 0 when the download was partial and should
508 be retried, or 1 if it was successful.
509
    use AnyEvent::HTTP;
    use Fcntl qw(O_RDWR O_CREAT);

    sub download($$$) {
       my ($url, $file, $cb) = @_;

       # open read/write without truncating, creating the file on first use
       # (a plain "+<" open would fail when the file does not exist yet)
       sysopen my $fh, $file, O_RDWR | O_CREAT
          or die "$file: $!";

       my %hdr;
       my $ofs = 0;

       # a non-empty file means a previous partial download: ask for the rest
       if (stat $fh and -s _) {
          $ofs = -s _;
          warn "-s is ", $ofs;
          $hdr{"if-unmodified-since"} = AnyEvent::HTTP::format_date +(stat _)[9];
          $hdr{"range"} = "bytes=$ofs-";
       }

       http_get $url,
          headers   => \%hdr,
          on_header => sub {
             my ($hdr) = @_;

             if ($hdr->{Status} == 200 && $ofs) {
                # resume failed
                truncate $fh, $ofs = 0;
             }

             sysseek $fh, $ofs, 0;

             1
          },
          on_body   => sub {
             my ($data, $hdr) = @_;

             if ($hdr->{Status} =~ /^2/) {
                length $data == syswrite $fh, $data
                   or return; # abort on write errors
             }

             1
          },
          sub {
             my (undef, $hdr) = @_;

             my $status = $hdr->{Status};

             if (my $time = AnyEvent::HTTP::parse_date $hdr->{"last-modified"}) {
                utime $time, $time, $fh;
             }

             if ($status == 200 || $status == 206 || $status == 416) {
                # download ok || resume ok || file already fully downloaded
                $cb->(1, $hdr);

             } elsif ($status == 412) {
                # file has changed while resuming, delete and retry
                unlink $file;
                $cb->(0, $hdr);

             } elsif ($status == 500 or $status == 503 or $status =~ /^59/) {
                # retry later
                $cb->(0, $hdr);

             } else {
                $cb->(undef, $hdr);
             }
          }
       ;
    }

    download "http://server/somelargefile", "/tmp/somelargefile", sub {
       if ($_[0]) {
          print "OK!\n";
       } elsif (defined $_[0]) {
          print "please retry later\n";
       } else {
          print "ERROR\n";
       }
    };
590
591 SOCKS PROXIES
592 Socks proxies are not directly supported by AnyEvent::HTTP. You can
593 compile your perl to support socks, or use an external program such as
594 socksify (dante) or tsocks to make your program use a socks proxy
595 transparently.
596
597 Alternatively, for AnyEvent::HTTP only, you can use your own
598 "tcp_connect" function that does the proxy handshake - here is an
599 example that works with socks4a proxies:
600
    use Errno;
    use AnyEvent::Util;
    use AnyEvent::Socket;
    use AnyEvent::Handle;

    # host, port and username of/for your socks4a proxy
    my $socks_host = "10.0.0.23";
    my $socks_port = 9050;
    my $socks_user = "";

    sub socks4a_connect {
       my ($host, $port, $connect_cb, $prepare_cb) = @_;

       my $hdl = new AnyEvent::Handle
          connect    => [$socks_host, $socks_port],
          on_prepare => sub { $prepare_cb->($_[0]{fh}) },
          on_error   => sub { $connect_cb->() },
       ;

       $hdl->push_write (pack "CCnNZ*Z*", 4, 1, $port, 1, $socks_user, $host);

       $hdl->push_read (chunk => 8, sub {
          my ($hdl, $chunk) = @_;
          my ($status, $port, $ipn) = unpack "xCna4", $chunk;

          if ($status == 0x5a) {
             $connect_cb->($hdl->{fh}, (format_address $ipn) . ":$port");
          } else {
             $! = Errno::ENXIO; $connect_cb->();
          }
       });

       $hdl
    }
635
636 Use "socks4a_connect" instead of "tcp_connect" when doing
637 "http_request"s, possibly after switching off other proxy types:
638
    AnyEvent::HTTP::set_proxy undef; # usually you do not want other proxies

    http_get 'http://www.google.com', tcp_connect => \&socks4a_connect, sub {
       my ($data, $headers) = @_;
       ...
    };
645
646 SEE ALSO
647 AnyEvent.
648
649 AUTHOR
650 Marc Lehmann <schmorp@schmorp.de>
651 http://home.schmorp.de/
652
653 With many thanks to Дмитрий Шалашов, who provided countless testcases
654 and bugreports.
655