/cvs/AnyEvent-HTTP/HTTP.pm

Comparing AnyEvent-HTTP/HTTP.pm (file contents):
Revision 1.9 by root, Wed Jun 4 13:51:53 2008 UTC vs.
Revision 1.20 by root, Mon Jun 9 13:04:23 2008 UTC

3AnyEvent::HTTP - simple but non-blocking HTTP/HTTPS client 3AnyEvent::HTTP - simple but non-blocking HTTP/HTTPS client
4 4
5=head1 SYNOPSIS 5=head1 SYNOPSIS
6 6
7 use AnyEvent::HTTP; 7 use AnyEvent::HTTP;
8
9 http_get "http://www.nethype.de/", sub { print $_[1] };
10
11 # ... do something else here
8 12
9=head1 DESCRIPTION 13=head1 DESCRIPTION
10 14
11This module is an L<AnyEvent> user, you need to make sure that you use and 15This module is an L<AnyEvent> user, you need to make sure that you use and
12run a supported event loop. 16run a supported event loop.
17
18This module implements a simple, stateless and non-blocking HTTP
19client. It supports GET, POST and other request methods, cookies and more,
20all on a very low level. It can follow redirects, supports proxies and
21automatically limits the number of connections to the values specified in
22the RFC.
23
24It should generally be a "good client" that is enough for most HTTP
25tasks. Simple tasks should be simple, but complex tasks should still be
26possible as the user retains control over request and response headers.
27
28The caller is responsible for authentication management, cookies (if
29the simplistic implementation in this module doesn't suffice), referer
30and other high-level protocol details for which this module offers only
31limited support.
13 32
14=head2 METHODS 33=head2 METHODS
15 34
16=over 4 35=over 4
17 36
29use AnyEvent::Socket (); 48use AnyEvent::Socket ();
30use AnyEvent::Handle (); 49use AnyEvent::Handle ();
31 50
32use base Exporter::; 51use base Exporter::;
33 52
34our $VERSION = '1.0'; 53our $VERSION = '1.01';
35 54
36our @EXPORT = qw(http_get http_request); 55our @EXPORT = qw(http_get http_post http_head http_request);
37 56
38our $USERAGENT = "Mozilla/5.0 (compatible; AnyEvent::HTTP/$VERSION; +http://software.schmorp.de/pkg/AnyEvent)"; 57our $USERAGENT = "Mozilla/5.0 (compatible; AnyEvent::HTTP/$VERSION; +http://software.schmorp.de/pkg/AnyEvent)";
39our $MAX_RECURSE = 10; 58our $MAX_RECURSE = 10;
40our $MAX_PERSISTENT = 8; 59our $MAX_PERSISTENT = 8;
41our $PERSISTENT_TIMEOUT = 2; 60our $PERSISTENT_TIMEOUT = 2;
42our $TIMEOUT = 300; 61our $TIMEOUT = 300;
43 62
44# changing these is evil 63# changing these is evil
45our $MAX_PERSISTENT_PER_HOST = 2; 64our $MAX_PERSISTENT_PER_HOST = 2;
46our $MAX_PER_HOST = 4; # not respected yet :( 65our $MAX_PER_HOST = 4;
47 66
48our $PROXY; 67our $PROXY;
68our $ACTIVE = 0;
49 69
50my %KA_COUNT; # number of open keep-alive connections per host 70my %KA_COUNT; # number of open keep-alive connections per host
71my %CO_SLOT; # number of open connections, and wait queue, per host
51 72
52=item http_get $url, key => value..., $cb->($data, $headers) 73=item http_get $url, key => value..., $cb->($data, $headers)
53 74
54Executes an HTTP-GET request. See the http_request function for details on 75Executes an HTTP-GET request. See the http_request function for details on
55additional parameters. 76additional parameters.
72The callback will be called with the response data as first argument 93The callback will be called with the response data as first argument
73(or C<undef> if it wasn't available due to errors), and a hash-ref with 94(or C<undef> if it wasn't available due to errors), and a hash-ref with
74response headers as second argument. 95response headers as second argument.
75 96
76All the headers in that hash are lowercased. In addition to the response 97All the headers in that hash are lowercased. In addition to the response
77headers, the three "pseudo-headers" C<HTTPVersion>, C<Status> and 98headers, the "pseudo-headers" C<HTTPVersion>, C<Status> and C<Reason>
78C<Reason> contain the three parts of the HTTP Status-Line of the same 99contain the three parts of the HTTP Status-Line of the same name. The
79name. 100pseudo-header C<URL> contains the actual URL (which can differ from the
101requested URL when following redirects).
102
103If the server sends a header multiple times, then their contents will be
104joined together with C<\x00>.
80 105
81If an internal error occurs, such as not being able to resolve a hostname, 106If an internal error occurs, such as not being able to resolve a hostname,
82then C<$data> will be C<undef>, C<< $headers->{Status} >> will be C<599> 107then C<$data> will be C<undef>, C<< $headers->{Status} >> will be C<599>
83and the C<Reason> pseudo-header will contain an error message. 108and the C<Reason> pseudo-header will contain an error message.
84 109
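A minimal sketch of how a callback might read the `$headers` hash described in the hunk above. The hash is built by hand to mimic what revision 1.20 passes as the second callback argument; `header_values` and `describe_response` are hypothetical helper names for illustration and are not part of AnyEvent::HTTP.

```perl
use strict;
use warnings;

# Hypothetical helper: split a header the server sent several times.
# Revision 1.20 joins repeated headers with "\x00".
sub header_values {
    my ($headers, $name) = @_;
    return () unless defined $headers->{lc $name};
    return split /\x00/, $headers->{lc $name};
}

# Hypothetical helper: summarise a response the way a callback might.
# Internal errors arrive as undef data plus the pseudo-status 599.
sub describe_response {
    my ($data, $headers) = @_;
    return "internal error: $headers->{Reason}"
        if !defined $data && $headers->{Status} == 599;
    return "$headers->{Status} $headers->{Reason} from $headers->{URL}";
}

# Hand-built stand-in for the hash-ref passed to the callback.
my $headers = {
    HTTPVersion  => "1.0",
    Status       => 200,
    Reason       => "OK",
    URL          => "http://www.nethype.de/",
    "set-cookie" => "a=1; path=/\x00b=2; path=/",
};

my @cookies = header_values($headers, "Set-Cookie");
print scalar(@cookies), " Set-Cookie headers\n";
print describe_response("<html/>", $headers), "\n";
print describe_response(undef, { Status => 599, Reason => "unparsable URL", URL => "foo" }), "\n";
```

Real header names are stored lowercased while the pseudo-headers keep their capitals, so the two cannot collide.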
104Whether to recurse requests or not, e.g. on redirects, authentication 129Whether to recurse requests or not, e.g. on redirects, authentication
105retries and so on, and how often to do so. 130retries and so on, and how often to do so.
106 131
107=item headers => hashref 132=item headers => hashref
108 133
109The request headers to use. 134The request headers to use. Currently, C<http_request> may provide its
135own C<Host:>, C<Content-Length:>, C<Connection:> and C<Cookie:> headers
136and will provide defaults for C<User-Agent:> and C<Referer:>.
110 137
111=item timeout => $seconds 138=item timeout => $seconds
112 139
113The time-out to use for various stages - each connect attempt will reset 140The time-out to use for various stages - each connect attempt will reset
114the timeout, as will read or write activity. Default timeout is 5 minutes. 141the timeout, as will read or write activity. Default timeout is 5 minutes.
123 150
124=item body => $string 151=item body => $string
125 152
126The request body, usually empty. Will be sent as-is (future versions of 153The request body, usually empty. Will be sent as-is (future versions of
127this module might offer more options). 154this module might offer more options).
155
156=item cookie_jar => $hash_ref
157
158Passing this parameter enables (simplified) cookie-processing, loosely
159based on the original netscape specification.
160
161The C<$hash_ref> must be an (initially empty) hash reference which will
162get updated automatically. It is possible to save the cookie_jar to
163persistent storage with something like JSON or Storable, but this is not
164recommended, as expire times are currently being ignored.
165
166Note that this cookie implementation is not of very high quality, nor
167meant to be complete. If you want complete cookie management you have to
168do that on your own. C<cookie_jar> is meant as a quick fix to get some
169cookie-using sites working. Cookies are a privacy disaster, do not use
170them unless required to.
128 171
129=back 172=back
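The `cookie_jar` layout is not spelled out in the documentation, but the code added further down in this diff stores cookies as `$jar->{".domain"}{"/path"}{name} = { value => ... }`. The following is a sketch of the lookup under that assumption; `cookie_header` is a hypothetical name, and the length check and the sorted key order are additions made here to keep the example deterministic.

```perl
use strict;
use warnings;

# Hypothetical helper: build a Cookie header value from a jar laid out as
# $jar->{".domain"}{"/path"}{name} = { value => ..., secure => ... }.
sub cookie_header {
    my ($jar, $scheme, $host, $path) = @_;
    my @cookie;

    for my $cdom (sort keys %$jar) {
        next unless $cdom =~ /^\./;    # also skips the "version" key
        next unless length($cdom) <= length($host)
                 && $cdom eq substr($host, -length($cdom));   # domain suffix match

        for my $cpath (sort keys %{ $jar->{$cdom} }) {
            next unless $cpath eq substr($path, 0, length($cpath));   # path prefix match

            for my $name (sort keys %{ $jar->{$cdom}{$cpath} }) {
                my $c = $jar->{$cdom}{$cpath}{$name};
                next if $scheme ne "https" && exists $c->{secure};    # secure cookies need https
                push @cookie, "$name=$c->{value}";
            }
        }
    }

    return join "; ", @cookie;
}

# Hand-built jar for illustration only.
my $jar = {
    version       => 1,
    ".nethype.de" => {
        "/"      => { session => { value => "abc" } },
        "/admin" => { token   => { value => "xyz", secure => undef } },
    },
};

print cookie_header($jar, "http",  "www.nethype.de", "/index.html"), "\n";
print cookie_header($jar, "https", "www.nethype.de", "/admin/"), "\n";
```

Because expiry times are ignored, a jar like this only shrinks when the caller prunes it.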
130 173
131Example: make a simple HTTP GET request for http://www.nethype.de/ 174Example: make a simple HTTP GET request for http://www.nethype.de/
132 175
148 } 191 }
149 ; 192 ;
150 193
151=cut 194=cut
152 195
196sub _slot_schedule;
197sub _slot_schedule($) {
198 my $host = shift;
199
200 while ($CO_SLOT{$host}[0] < $MAX_PER_HOST) {
201 if (my $cb = shift @{ $CO_SLOT{$host}[1] }) {
202 # somebody wants that slot
203 ++$CO_SLOT{$host}[0];
204 ++$ACTIVE;
205
206 $cb->(AnyEvent::Util::guard {
207 --$ACTIVE;
208 --$CO_SLOT{$host}[0];
209 _slot_schedule $host;
210 });
211 } else {
212 # nobody wants the slot, maybe we can forget about it
213 delete $CO_SLOT{$host} unless $CO_SLOT{$host}[0];
214 last;
215 }
216 }
217}
218
219# wait for a free slot on host, call callback
220sub _get_slot($$) {
221 push @{ $CO_SLOT{$_[0]}[1] }, $_[1];
222
223 _slot_schedule $_[0];
224}
225
153sub http_request($$$;@) { 226sub http_request($$@) {
154 my $cb = pop; 227 my $cb = pop;
155 my ($method, $url, %arg) = @_; 228 my ($method, $url, %arg) = @_;
156 229
157 my %hdr; 230 my %hdr;
158 231
164 } 237 }
165 } 238 }
166 239
167 my $recurse = exists $arg{recurse} ? $arg{recurse} : $MAX_RECURSE; 240 my $recurse = exists $arg{recurse} ? $arg{recurse} : $MAX_RECURSE;
168 241
169 return $cb->(undef, { Status => 599, Reason => "recursion limit reached" }) 242 return $cb->(undef, { Status => 599, Reason => "recursion limit reached", URL => $url })
170 if $recurse < 0; 243 if $recurse < 0;
171 244
172 my $proxy = $arg{proxy} || $PROXY; 245 my $proxy = $arg{proxy} || $PROXY;
173 my $timeout = $arg{timeout} || $TIMEOUT; 246 my $timeout = $arg{timeout} || $TIMEOUT;
174 247
175 $hdr{"user-agent"} ||= $USERAGENT; 248 $hdr{"user-agent"} ||= $USERAGENT;
176 249
177 my ($host, $port, $path, $scheme); 250 my ($scheme, $authority, $upath, $query, $fragment) =
251 $url =~ m|(?:([^:/?#]+):)?(?://([^/?#]*))?([^?#]*)(?:\?([^#]*))?(?:#(.*))?|;
252
253 $scheme = lc $scheme;
254
255 my $uport = $scheme eq "http" ? 80
256 : $scheme eq "https" ? 443
257 : return $cb->(undef, { Status => 599, Reason => "only http and https URL schemes supported", URL => $url });
258
259 $hdr{referer} ||= "$scheme://$authority$upath"; # leave out fragment and query string, just a heuristic
260
261 $authority =~ /^(?: .*\@ )? ([^\@:]+) (?: : (\d+) )?$/x
262 or return $cb->(undef, { Status => 599, Reason => "unparsable URL", URL => $url });
263
264 my $uhost = $1;
265 $uport = $2 if defined $2;
266
267 $uhost =~ s/^\[(.*)\]$/$1/;
268 $upath .= "?$query" if length $query;
269
270 $upath =~ s%^/?%/%;
271
272 # cookie processing
273 if (my $jar = $arg{cookie_jar}) {
274 %$jar = () if $jar->{version} < 1;
275
276 my @cookie;
277
278 while (my ($chost, $v) = each %$jar) {
279 next unless $chost eq substr $uhost, -length $chost;
280 next unless $chost =~ /^\./;
281
282 while (my ($cpath, $v) = each %$v) {
283 next unless $cpath eq substr $upath, 0, length $cpath;
284
285 while (my ($k, $v) = each %$v) {
286 next if $scheme ne "https" && exists $v->{secure};
287 push @cookie, "$k=$v->{value}";
288 }
289 }
290 }
291
292 $hdr{cookie} = join "; ", @cookie
293 if @cookie;
294 }
295
296 my ($rhost, $rport, $rpath); # request host, port, path
178 297
179 if ($proxy) { 298 if ($proxy) {
180 ($host, $port, $scheme) = @$proxy; 299 ($rhost, $rport, $scheme) = @$proxy;
181 $path = $url; 300 $rpath = $url;
182 } else { 301 } else {
183 ($scheme, my $authority, $path, my $query, my $fragment) = 302 ($rhost, $rport, $rpath) = ($uhost, $uport, $upath);
184 $url =~ m|(?:([^:/?#]+):)?(?://([^/?#]*))?([^?#]*)(?:\?([^#]*))?(?:#(.*))?|;
185
186 $port = $scheme eq "http" ? 80
187 : $scheme eq "https" ? 443
188 : return $cb->(undef, { Status => 599, Reason => "$url: only http and https URLs supported" });
189
190 $authority =~ /^(?: .*\@ )? ([^\@:]+) (?: : (\d+) )?$/x
191 or return $cb->(undef, { Status => 599, Reason => "$url: unparsable URL" });
192
193 $host = $1;
194 $port = $2 if defined $2;
195
196 $host =~ s/^\[(.*)\]$/$1/;
197 $path .= "?$query" if length $query;
198
199 $path = "/" unless $path;
200
201 $hdr{host} = $host = lc $host; 303 $hdr{host} = $uhost;
202 } 304 }
203 305
204 $scheme = lc $scheme;
205
206 my %state;
207
208 $hdr{"content-length"} = length $arg{body}; 306 $hdr{"content-length"} = length $arg{body};
209 307
308 my %state = (connect_guard => 1);
309
310 _get_slot $uhost, sub {
311 $state{slot_guard} = shift;
312
313 return unless $state{connect_guard};
314
210 $state{connect_guard} = AnyEvent::Socket::tcp_connect $host, $port, sub { 315 $state{connect_guard} = AnyEvent::Socket::tcp_connect $rhost, $rport, sub {
211 $state{fh} = shift 316 $state{fh} = shift
212 or return $cb->(undef, { Status => 599, Reason => "$!" }); 317 or return $cb->(undef, { Status => 599, Reason => "$!", URL => $url });
213 318
214 delete $state{connect_guard}; # reduce memory usage, save a tree 319 delete $state{connect_guard}; # reduce memory usage, save a tree
215 320
216 # get handle 321 # get handle
217 $state{handle} = new AnyEvent::Handle 322 $state{handle} = new AnyEvent::Handle
218 fh => $state{fh}, 323 fh => $state{fh},
219 ($scheme eq "https" ? (tls => "connect") : ()); 324 ($scheme eq "https" ? (tls => "connect") : ());
220 325
221 # limit the number of persistent connections 326 # limit the number of persistent connections
222 if ($KA_COUNT{$_[1]} < $MAX_PERSISTENT_PER_HOST) { 327 if ($KA_COUNT{$_[1]} < $MAX_PERSISTENT_PER_HOST) {
223 ++$KA_COUNT{$_[1]}; 328 ++$KA_COUNT{$_[1]};
224 $state{handle}{ka_count_guard} = AnyEvent::Util::guard { --$KA_COUNT{$_[1]} }; 329 $state{handle}{ka_count_guard} = AnyEvent::Util::guard { --$KA_COUNT{$_[1]} };
225 $hdr{connection} = "keep-alive"; 330 $hdr{connection} = "keep-alive";
226 delete $hdr{connection}; # keep-alive not yet supported 331 delete $hdr{connection}; # keep-alive not yet supported
227 } else { 332 } else {
228 delete $hdr{connection}; 333 delete $hdr{connection};
229 } 334 }
230 335
231 # (re-)configure handle 336 # (re-)configure handle
232 $state{handle}->timeout ($timeout); 337 $state{handle}->timeout ($timeout);
233 $state{handle}->on_error (sub { 338 $state{handle}->on_error (sub {
339 my $errno = "$!";
234 %state = (); 340 %state = ();
235 $cb->(undef, { Status => 599, Reason => "$!" }); 341 $cb->(undef, { Status => 599, Reason => $errno, URL => $url });
236 }); 342 });
237 $state{handle}->on_eof (sub { 343 $state{handle}->on_eof (sub {
238 %state = (); 344 %state = ();
239 $cb->(undef, { Status => 599, Reason => "unexpected end-of-file" }); 345 $cb->(undef, { Status => 599, Reason => "unexpected end-of-file", URL => $url });
240 }); 346 });
241 347
242 # send request 348 # send request
243 $state{handle}->push_write ( 349 $state{handle}->push_write (
244 "$method $path HTTP/1.0\015\012" 350 "$method $rpath HTTP/1.0\015\012"
245 . (join "", map "$_: $hdr{$_}\015\012", keys %hdr) 351 . (join "", map "$_: $hdr{$_}\015\012", keys %hdr)
246 . "\015\012" 352 . "\015\012"
247 . (delete $arg{body}) 353 . (delete $arg{body})
248 );
249
250 %hdr = (); # reduce memory usage, save a kitten
251
252 # status line
253 $state{handle}->push_read (line => qr/\015?\012/, sub {
254 $_[1] =~ /^HTTP\/([0-9\.]+) \s+ ([0-9]{3}) \s+ ([^\015\012]+)/ix
255 or return (%state = (), $cb->(undef, { Status => 599, Reason => "invalid server response ($_[1])" }));
256
257 my %hdr = ( # response headers
258 HTTPVersion => ",$1",
259 Status => ",$2",
260 Reason => ",$3",
261 ); 354 );
262 355
356 %hdr = (); # reduce memory usage, save a kitten
357
358 # status line
359 $state{handle}->push_read (line => qr/\015?\012/, sub {
360 $_[1] =~ /^HTTP\/([0-9\.]+) \s+ ([0-9]{3}) \s+ ([^\015\012]+)/ix
361 or return (%state = (), $cb->(undef, { Status => 599, Reason => "invalid server response ($_[1])", URL => $url }));
362
363 my %hdr = ( # response headers
364 HTTPVersion => "\x00$1",
365 Status => "\x00$2",
366 Reason => "\x00$3",
367 URL => "\x00$url"
368 );
369
263 # headers, could be optimized a bit 370 # headers, could be optimized a bit
264 $state{handle}->unshift_read (line => qr/\015?\012\015?\012/, sub { 371 $state{handle}->unshift_read (line => qr/\015?\012\015?\012/, sub {
265 for ("$_[1]\012") { 372 for ("$_[1]\012") {
266 # we support spaces in field names, as lotus domino 373 # we support spaces in field names, as lotus domino
267 # creates them. 374 # creates them.
268 $hdr{lc $1} .= ",$2" 375 $hdr{lc $1} .= "\x00$2"
269 while /\G 376 while /\G
270 ([^:\000-\037]+): 377 ([^:\000-\037]+):
271 [\011\040]* 378 [\011\040]*
272 ((?: [^\015\012]+ | \015?\012[\011\040] )*) 379 ((?: [^\015\012]+ | \015?\012[\011\040] )*)
273 \015?\012 380 \015?\012
274 /gxc; 381 /gxc;
275 382
276 /\G$/ 383 /\G$/
277 or return $cb->(undef, { Status => 599, Reason => "garbled response headers" }); 384 or return (%state = (), $cb->(undef, { Status => 599, Reason => "garbled response headers", URL => $url }));
278 } 385 }
279 386
280 substr $_, 0, 1, "" 387 substr $_, 0, 1, ""
281 for values %hdr; 388 for values %hdr;
282 389
283 my $finish = sub { 390 my $finish = sub {
391 %state = ();
392
393 # set-cookie processing
394 if ($arg{cookie_jar} && exists $hdr{"set-cookie"}) {
395 for (split /\x00/, $hdr{"set-cookie"}) {
396 my ($cookie, @arg) = split /;\s*/;
397 my ($name, $value) = split /=/, $cookie, 2;
398 my %kv = (value => $value, map { split /=/, $_, 2 } @arg);
399
400 my $cdom = (delete $kv{domain}) || $uhost;
401 my $cpath = (delete $kv{path}) || "/";
402
403 $cdom =~ s/^.?/./; # make sure it starts with a "."
404
405 next if $cdom =~ /\.$/;
406
407 # this is not rfc-like and not netscape-like. go figure.
408 my $ndots = $cdom =~ y/.//;
409 next if $ndots < ($cdom =~ /\.[^.][^.]\.[^.][^.]$/ ? 3 : 2);
410
411 # store it
412 $arg{cookie_jar}{version} = 1;
413 $arg{cookie_jar}{$cdom}{$cpath}{$name} = \%kv;
414 }
415 }
416
284 if ($_[1]{Status} =~ /^30[12]$/ && $recurse) { 417 if ($_[1]{Status} =~ /^30[12]$/ && $recurse) {
418 # microsoft and other assholes don't give a shit for following standards,
419 # try to support a common form of broken Location header.
420 $_[1]{location} =~ s%^/%$scheme://$uhost:$uport/%;
421
285 http_request ($method, $_[1]{location}, %arg, recurse => $recurse - 1, $cb); 422 http_request ($method, $_[1]{location}, %arg, recurse => $recurse - 1, $cb);
423 } else {
424 $cb->($_[0], $_[1]);
425 }
426 };
427
428 if ($hdr{Status} =~ /^(?:1..|204|304)$/ or $method eq "HEAD") {
429 $finish->(undef, \%hdr);
286 } else { 430 } else {
287 $cb->($_[0], $_[1]); 431 if (exists $hdr{"content-length"}) {
432 $_[0]->unshift_read (chunk => $hdr{"content-length"}, sub {
433 # could cache persistent connection now
434 if ($hdr{connection} =~ /\bkeep-alive\b/i) {
435 # but we don't, due to misdesigns, this is annoyingly complex
436 };
437
438 $finish->($_[1], \%hdr);
439 });
440 } else {
441 # too bad, need to read until we get an error or EOF,
442 # no way to detect winged data.
443 $_[0]->on_error (sub {
444 $finish->($_[0]{rbuf}, \%hdr);
445 });
446 $_[0]->on_eof (undef);
447 $_[0]->on_read (sub { });
448 }
288 } 449 }
289 }; 450 });
290
291 if ($hdr{Status} =~ /^(?:1..|204|304)$/ or $method eq "HEAD") {
292 %state = ();
293 $finish->(undef, \%hdr);
294 } else {
295 if (exists $hdr{"content-length"}) {
296 $_[0]->unshift_read (chunk => $hdr{"content-length"}, sub {
297 # could cache persistent connection now
298 if ($hdr{connection} =~ /\bkeep-alive\b/i) {
299 # but we don't, due to misdesigns, this is annoyingly complex
300 };
301
302 %state = ();
303 $finish->($_[1], \%hdr);
304 });
305 } else {
306 # too bad, need to read until we get an error or EOF,
307 # no way to detect winged data.
308 $_[0]->on_error (sub {
309 %state = ();
310 $finish->($_[0]{rbuf}, \%hdr);
311 });
312 $_[0]->on_eof (undef);
313 $_[0]->on_read (sub { });
314 }
315 }
316 }); 451 });
452 }, sub {
453 $timeout
317 }); 454 };
318 }, sub {
319 $timeout
320 }; 455 };
321 456
322 defined wantarray && AnyEvent::Util::guard { %state = () } 457 defined wantarray && AnyEvent::Util::guard { %state = () }
323} 458}
324 459
325sub http_get($$;@) { 460sub http_get($@) {
326 unshift @_, "GET"; 461 unshift @_, "GET";
327 &http_request 462 &http_request
328} 463}
329 464
330sub http_head($$;@) { 465sub http_head($@) {
331 unshift @_, "HEAD"; 466 unshift @_, "HEAD";
332 &http_request 467 &http_request
333} 468}
334 469
335sub http_post($$$;@) { 470sub http_post($$@) {
336 unshift @_, "POST", "body"; 471 unshift @_, "POST", "body";
337 &http_request 472 &http_request
338} 473}
339 474
340=back 475=back
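The per-host connection limit introduced by `_slot_schedule` and `_get_slot` in this hunk can be tried out in isolation. The sketch below follows the same queue-and-guard idea but is not the module's code: `My::Guard` is a hypothetical stand-in for `AnyEvent::Util::guard`, the sub names drop the leading underscore, and the limit is lowered to 2 to keep the demonstration short.

```perl
use strict;
use warnings;

# Hypothetical stand-in for AnyEvent::Util::guard:
# runs a code ref when the object is freed.
package My::Guard;
sub new     { my ($class, $cb) = @_; return bless { cb => $cb }, $class }
sub DESTROY { $_[0]{cb}->() }

package main;

our $MAX_PER_HOST = 2;   # the module defaults to 4
my %slot;                # host => [active count, [waiting callbacks]]

sub slot_schedule {
    my ($host) = @_;

    while (($slot{$host}[0] || 0) < $MAX_PER_HOST) {
        if (my $cb = shift @{ $slot{$host}[1] }) {
            ++$slot{$host}[0];
            # the guard frees the slot and wakes up the next waiter
            $cb->(My::Guard->new(sub {
                --$slot{$host}[0];
                slot_schedule($host);
            }));
        } else {
            # nobody is waiting; forget idle hosts
            delete $slot{$host} unless $slot{$host}[0];
            last;
        }
    }
}

sub get_slot {
    my ($host, $cb) = @_;
    push @{ $slot{$host}[1] }, $cb;
    slot_schedule($host);
}

my (@held, @started);
for my $n (1 .. 3) {
    get_slot("example.org", sub { push @held, shift; push @started, $n });
}
print "started: @started\n";   # two requests hold a slot, the third waits
shift @held;                   # dropping a guard releases its slot
print "started: @started\n";   # the third request has now been scheduled
@held = ();                    # release the remaining slots
```

Tying the slot to a guard object means a request that is cancelled or dies still gives its slot back, without explicit cleanup calls.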
367 502
368The maximum time to cache a persistent connection, in seconds (default: 2). 503The maximum time to cache a persistent connection, in seconds (default: 2).
369 504
370Not implemented currently. 505Not implemented currently.
371 506
507=item $AnyEvent::HTTP::ACTIVE
508
509The number of active connections. This is not the number of currently
510running requests, but the number of currently open and non-idle TCP
511connections. This number can be useful for load-leveling.
512
372=back 513=back
373 514
374=cut 515=cut
375 516
376sub set_proxy($) { 517sub set_proxy($) {
384 525
385L<AnyEvent>. 526L<AnyEvent>.
386 527
387=head1 AUTHOR 528=head1 AUTHOR
388 529
389 Marc Lehmann <schmorp@schmorp.de> 530 Marc Lehmann <schmorp@schmorp.de>
390 http://home.schmorp.de/ 531 http://home.schmorp.de/
391 532
392=cut 533=cut
393 534
3941 5351
395 536
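Revision 1.20 moves URL parsing out of the non-proxy branch so that the target host, port and path are always known. A rough standalone sketch of that parsing step follows, reusing the two regular expressions from the diff (the first is the generic URI pattern from RFC 3986, Appendix B). `split_url` is a hypothetical name, and the `undef` handling is added so the sketch runs cleanly under `warnings`.

```perl
use strict;
use warnings;

# Hypothetical helper mirroring the URL handling added in revision 1.20.
# Returns (scheme, host, port, path), or nothing if the URL is unusable.
sub split_url {
    my ($url) = @_;

    my ($scheme, $authority, $path, $query) =
        $url =~ m|(?:([^:/?#]+):)?(?://([^/?#]*))?([^?#]*)(?:\?([^#]*))?(?:#(.*))?|;

    $scheme = lc($scheme // "");

    my $port = $scheme eq "http"  ? 80
             : $scheme eq "https" ? 443
             : return;    # only http and https are supported

    # optional userinfo, then host, then optional port
    ($authority // "") =~ /^(?: .*\@ )? ([^\@:]+) (?: : (\d+) )?$/x
        or return;

    my $host = $1;
    $port = $2 if defined $2;

    $path .= "?$query" if defined $query && length $query;
    $path =~ s%^/?%/%;    # make sure the path starts with a slash

    return ($scheme, $host, $port, $path);
}

print join(" ", split_url("HTTP://user\@www.nethype.de:8080/a/b?x=1#frag")), "\n";
print join(" ", split_url("https://www.nethype.de")), "\n";
print "unsupported\n" unless split_url("ftp://example.org/");
```

The fragment is parsed but dropped, and an empty path becomes `/`, which matches the request line the module sends.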
