/cvs/AnyEvent-HTTP/HTTP.pm

Comparing AnyEvent-HTTP/HTTP.pm (file contents):
Revision 1.1 by root, Tue Jun 3 16:37:13 2008 UTC vs.
Revision 1.16 by root, Fri Jun 6 12:57:48 2008 UTC

8 8
9=head1 DESCRIPTION 9=head1 DESCRIPTION
10 10
11This module is an L<AnyEvent> user, you need to make sure that you use and 11This module is an L<AnyEvent> user, you need to make sure that you use and
12run a supported event loop. 12run a supported event loop.
13
14This module implements a simple, stateless and non-blocking HTTP
15client. It supports GET, POST and other request methods, cookies and more,
16all on a very low level. It can follow redirects, supports proxies and
17automatically limits the number of connections to the values specified in
18the RFC.
19
20It should generally be a "good client" that is enough for most HTTP
21tasks. Simple tasks should be simple, but complex tasks should still be
22possible as the user retains control over request and response headers.
23
24The caller is responsible for authentication management, cookies (if
25the simplistic implementation in this module doesn't suffice), referer
26and other high-level protocol details for which this module offers only
27limited support.
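
The per-host connection limit mentioned in the description can be illustrated with a small queue of waiting callbacks plus a guard object that frees a slot when it goes out of scope. The following is only a minimal sketch in plain Perl: C<Sketch::Guard> is a hypothetical stand-in for C<AnyEvent::Util::guard>, and the limit of 2 and the host name are illustrative assumptions, not the module's defaults.

```perl
use strict;
use warnings;

# Hypothetical stand-in for AnyEvent::Util::guard: runs a callback on destruction.
package Sketch::Guard {
    sub new     { my ($class, $cb) = @_; bless { cb => $cb }, $class }
    sub DESTROY { $_[0]{cb}->() }
}

my $MAX_PER_HOST = 2;   # illustrative limit (assumption)
my %slot;               # host => [active count, [waiting callbacks]]

sub slot_schedule {
    my ($host) = @_;

    while (($slot{$host}[0] // 0) < $MAX_PER_HOST) {
        if (my $cb = shift @{ $slot{$host}[1] }) {
            ++$slot{$host}[0];
            # dropping the guard frees the slot and wakes the next waiter
            $cb->(Sketch::Guard->new(sub {
                --$slot{$host}[0];
                slot_schedule($host);
            }));
        } else {
            # nobody is waiting; forget the host once it is idle
            delete $slot{$host} unless $slot{$host}[0];
            last;
        }
    }
}

sub get_slot {
    my ($host, $cb) = @_;
    push @{ $slot{$host}[1] }, $cb;
    slot_schedule($host);
}

my (@held, @started);
for my $n (1 .. 3) {
    get_slot("example.org", sub { push @held, shift; push @started, $n });
}
print "started: @started\n";   # the third request is still queued

shift @held;                   # release one slot, the queued request starts
print "started: @started\n";

@held = ();                    # release the remaining slots
```

Tying the slot to the lifetime of an object means a request that is aborted or fails early gives its slot back without any explicit bookkeeping.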
13 28
14=head2 METHODS 29=head2 METHODS
15 30
16=over 4 31=over 4
17 32
29use AnyEvent::Socket (); 44use AnyEvent::Socket ();
30use AnyEvent::Handle (); 45use AnyEvent::Handle ();
31 46
32use base Exporter::; 47use base Exporter::;
33 48
34our $VERSION = '1.0'; 49our $VERSION = '1.01';
35 50
36our @EXPORT = qw(http_get http_request); 51our @EXPORT = qw(http_get http_request);
37 52
38our $MAX_REDIRECTS = 10;
39our $USERAGENT = "Mozilla/5.0 (compatible; AnyEvent::HTTP/$VERSION; +http://software.schmorp.de/pkg/AnyEvent)"; 53our $USERAGENT = "Mozilla/5.0 (compatible; AnyEvent::HTTP/$VERSION; +http://software.schmorp.de/pkg/AnyEvent)";
54our $MAX_RECURSE = 10;
40our $MAX_PERSISTENT = 8; 55our $MAX_PERSISTENT = 8;
41our $PERSISTENT_TIMEOUT = 15; 56our $PERSISTENT_TIMEOUT = 2;
42our $TIMEOUT = 60; 57our $TIMEOUT = 300;
43 58
44# changing these is evil 59# changing these is evil
45our $MAX_PERSISTENT_PER_HOST = 2; 60our $MAX_PERSISTENT_PER_HOST = 2;
46our $MAX_PER_HOST = 4; # not respected yet :( 61our $MAX_PER_HOST = 4;
62
63our $PROXY;
64our $ACTIVE = 0;
47 65
48my %KA_COUNT; # number of open keep-alive connections per host 66my %KA_COUNT; # number of open keep-alive connections per host
67my %CO_SLOT; # number of open connections, and wait queue, per host
49 68
50=item http_get $url, key => value..., $cb->($data, $headers) 69=item http_get $url, key => value..., $cb->($data, $headers)
51 70
52Executes an HTTP-GET request. See the http_request function for details on 71Executes an HTTP-GET request. See the http_request function for details on
53additional parameters. 72additional parameters.
54 73
74=item http_head $url, key => value..., $cb->($data, $headers)
75
76Executes an HTTP-HEAD request. See the http_request function for details on
77additional parameters.
78
79=item http_post $url, $body, key => value..., $cb->($data, $headers)
80
81Executes an HTTP-POST request with a request body of C<$body>. See the
82http_request function for details on additional parameters.
83
55=item http_request $method => $url, key => value..., $cb->($data, $headers) 84=item http_request $method => $url, key => value..., $cb->($data, $headers)
56 85
57Executes an HTTP request of type C<$method> (e.g. C<GET>, C<POST>). The URL 86Executes an HTTP request of type C<$method> (e.g. C<GET>, C<POST>). The URL
58must be an absolute http or https URL. 87must be an absolute http or https URL.
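
To make "absolute http or https URL" concrete, the sketch below splits a URL with the same two patterns that appear in this revision's C<http_request> body (the first closely follows the generic URI pattern from RFC 3986, appendix B). The helper name C<split_url> and the sample URL are assumptions made for illustration only.

```perl
use strict;
use warnings;

# Split a URL into its parts and derive host and port (sketch, not the module's API).
sub split_url {
    my ($url) = @_;

    my ($scheme, $authority, $path, $query, $fragment) =
        $url =~ m|(?:([^:/?#]+):)?(?://([^/?#]*))?([^?#]*)(?:\?([^#]*))?(?:#(.*))?|;

    $scheme = lc($scheme // "");

    # default port per scheme; anything else is rejected
    my $port = $scheme eq "http"  ? 80
             : $scheme eq "https" ? 443
             : return;

    # optional userinfo, then host, then optional port
    ($authority // "") =~ /^(?: .*\@ )? ([^\@:]+) (?: : (\d+) )?$/x
        or return;

    my $host = lc $1;
    $port = $2 if defined $2;

    return ($scheme, $host, $port, $path, $query);
}

my ($scheme, $host, $port, $path, $query) =
    split_url("HTTP://www.example.org:8080/a/b?x=1#top");

print "$scheme $host $port $path $query\n";
```

A relative URL or a non-HTTP scheme makes the helper return an empty list, which corresponds to the case where the module reports a 599 status.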
59 88
89The callback will be called with the response data as first argument
90(or C<undef> if it wasn't available due to errors), and a hash-ref with
91response headers as second argument.
92
93All the headers in that hash are lowercased. In addition to the response
94headers, the three "pseudo-headers" C<HTTPVersion>, C<Status> and
95C<Reason> contain the three parts of the HTTP Status-Line of the same
96name. If the server sends a header multiple times, then their contents
97will be joined together with C<\x00>.
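
As an illustration of the C<\x00> joining described above, a callback can recover the individual values of a repeated header with C<split>. The header hash below is a hand-built, hypothetical example of what such a callback might receive.

```perl
use strict;
use warnings;

# Hypothetical headers hash, shaped as described in the text above.
my %hdr = (
    Status       => 200,
    Reason       => "OK",
    "set-cookie" => join("\x00", "a=1; path=/", "b=2; path=/"),
);

# One list element per header line the server sent.
my @cookies = split /\x00/, $hdr{"set-cookie"};

print scalar(@cookies), " Set-Cookie headers\n";
print "  $_\n" for @cookies;
```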
98
99If an internal error occurs, such as not being able to resolve a hostname,
100then C<$data> will be C<undef>, C<< $headers->{Status} >> will be C<599>
101and the C<Reason> pseudo-header will contain an error message.
102
103A typical callback might look like this:
104
105 sub {
106 my ($body, $hdr) = @_;
107
108 if ($hdr->{Status} =~ /^2/) {
109 ... everything should be ok
110 } else {
111 print "error, $hdr->{Status} $hdr->{Reason}\n";
112 }
113 }
114
60Additional parameters are key-value pairs, and are fully optional. They 115Additional parameters are key-value pairs, and are fully optional. They
61include: 116include:
62 117
63=over 4 118=over 4
64 119
65=item recurse => $boolean (default: true) 120=item recurse => $count (default: $MAX_RECURSE)
66 121
67Whether to recurse requests or not, e.g. on redirects, authentication 122Whether to recurse requests or not, e.g. on redirects, authentication
68retries and so on. 123retries and so on, and how often to do so.
69 124
70=item headers => hashref 125=item headers => hashref
71 126
72The request headers to use. 127The request headers to use. Currently, C<http_request> may provide its
128own C<Host:>, C<Content-Length:>, C<Connection:> and C<Cookie:> headers
129and will provide defaults for C<User-Agent:> and C<Referer:>.
73 130
74=item timeout => $seconds 131=item timeout => $seconds
75 132
76The time-out to use for various stages - each connect attempt will reset 133The time-out to use for various stages - each connect attempt will reset
77the timeout, as will read or write activity. 134the timeout, as will read or write activity. Default timeout is 5 minutes.
135
136=item proxy => [$host, $port[, $scheme]] or undef
137
138Use the given http proxy for all requests. If not specified, then the
139default proxy (as specified by C<$ENV{http_proxy}>) is used.
140
141C<$scheme> must be either missing or C<http> for HTTP, or C<https> for
142HTTPS.
143
144=item body => $string
145
146The request body, usually empty. Will be sent as-is (future versions of
147this module might offer more options).
148
149=item cookie_jar => $hash_ref
150
151Passing this parameter enables (simplified) cookie-processing, loosely
152based on the original netscape specification.
153
154The C<$hash_ref> must be an (initially empty) hash reference which will
155get updated automatically. It is possible to save the cookie_jar to
156persistent storage with something like JSON or Storable, but this is not
157recommended, as expire times are currently being ignored.
158
159Note that this cookie implementation is not of very high quality, nor
160meant to be complete. If you want complete cookie management you have to
161do that on your own. C<cookie_jar> is meant as a quick fix to get some
162cookie-using sites working. Cookies are a privacy disaster, do not use
163them unless required to.
78 164
79=back 165=back
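
For orientation, the cookie jar used by this revision appears to be a nested hash keyed by domain, then path, then cookie name, plus a C<version> marker. The sketch below assumes that layout (inferred from the code in this diff, not from documented API) and shows how cookies could be selected for a request; the helper C<cookies_for>, the sample jar and the way the C<secure> attribute is stored are hypothetical.

```perl
use strict;
use warnings;

# Assumed layout: $jar->{$domain}{$path}{$name} = { value => ..., attributes... }
my $jar = {
    version        => 1,
    ".example.org" => {
        "/"      => { sid  => { value => "abc" } },
        "/admin" => { role => { value => "root", secure => undef } },
    },
};

sub cookies_for {
    my ($jar, $scheme, $host, $path) = @_;
    my @cookie;

    for my $domain (sort grep /^\./, keys %$jar) {       # skips the "version" key
        # the cookie domain must be a suffix of the request host
        next unless length($host) >= length($domain)
                 && $domain eq substr($host, -length($domain));

        for my $cpath (sort keys %{ $jar->{$domain} }) {
            # the cookie path must be a prefix of the request path
            next unless $cpath eq substr($path, 0, length($cpath));

            my $set = $jar->{$domain}{$cpath};
            for my $name (sort keys %$set) {
                # cookies flagged secure are only sent over https
                next if $scheme ne "https" && exists $set->{$name}{secure};
                push @cookie, "$name=$set->{$name}{value}";
            }
        }
    }

    return join "; ", @cookie;
}

print cookies_for($jar, "http",  "www.example.org", "/admin/x"), "\n";
print cookies_for($jar, "https", "www.example.org", "/admin/x"), "\n";
```

Because expiry times are ignored, a jar like this is best treated as per-session state rather than something to persist.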
80 166
81=back 167Example: make a simple HTTP GET request for http://www.nethype.de/
168
169 http_request GET => "http://www.nethype.de/", sub {
170 my ($body, $hdr) = @_;
171 print "$body\n";
172 };
173
174Example: make an HTTP HEAD request on https://www.google.com/, use a
175timeout of 30 seconds.
176
177 http_request
178 HEAD => "https://www.google.com",
179 timeout => 30,
180 sub {
181 my ($body, $hdr) = @_;
182 use Data::Dumper;
183 print Dumper $hdr;
184 }
185 ;
82 186
83=cut 187=cut
84 188
189sub _slot_schedule;
190sub _slot_schedule($) {
191 my $host = shift;
192
193 while ($CO_SLOT{$host}[0] < $MAX_PER_HOST) {
194 if (my $cb = shift @{ $CO_SLOT{$host}[1] }) {
195 # somebody wants that slot
196 ++$CO_SLOT{$host}[0];
197 ++$ACTIVE;
198
199 $cb->(AnyEvent::Util::guard {
200 --$ACTIVE;
201 --$CO_SLOT{$host}[0];
202 _slot_schedule $host;
203 });
204 } else {
205 # nobody wants the slot, maybe we can forget about it
206 delete $CO_SLOT{$host} unless $CO_SLOT{$host}[0];
207 last;
208 }
209 }
210}
211
212# wait for a free slot on host, call callback
213sub _get_slot($$) {
214 push @{ $CO_SLOT{$_[0]}[1] }, $_[1];
215
216 _slot_schedule $_[0];
217}
218
85sub http_request($$$;@) { 219sub http_request($$@) {
86 my $cb = pop; 220 my $cb = pop;
87 my ($method, $url, %arg) = @_; 221 my ($method, $url, %arg) = @_;
88 222
89 my %hdr; 223 my %hdr;
90 224
225 $method = uc $method;
226
91 if (my $hdr = delete $arg{headers}) { 227 if (my $hdr = $arg{headers}) {
92 while (my ($k, $v) = each %$hdr) { 228 while (my ($k, $v) = each %$hdr) {
93 $hdr{lc $k} = $v; 229 $hdr{lc $k} = $v;
94 } 230 }
95 } 231 }
96 232
233 my $recurse = exists $arg{recurse} ? $arg{recurse} : $MAX_RECURSE;
234
235 return $cb->(undef, { Status => 599, Reason => "recursion limit reached" })
236 if $recurse < 0;
237
238 my $proxy = $arg{proxy} || $PROXY;
97 my $timeout = $arg{timeout} || $TIMEOUT; 239 my $timeout = $arg{timeout} || $TIMEOUT;
98 240
99 $hdr{"user-agent"} ||= $USERAGENT; 241 $hdr{"user-agent"} ||= $USERAGENT;
100 242
101 my ($scheme, $authority, $path, $query, $fragment) = 243 my ($scheme, $authority, $upath, $query, $fragment) =
102 $url =~ m|(?:([^:/?#]+):)?(?://([^/?#]*))?([^?#]*)(?:\?([^#]*))?(?:#(.*))?|; 244 $url =~ m|(?:([^:/?#]+):)?(?://([^/?#]*))?([^?#]*)(?:\?([^#]*))?(?:#(.*))?|;
103 245
104 $scheme = lc $scheme; 246 $scheme = lc $scheme;
247
105 my $port = $scheme eq "http" ? 80 248 my $uport = $scheme eq "http" ? 80
106 : $scheme eq "https" ? 443 249 : $scheme eq "https" ? 443
107 : croak "$url: only http and https URLs supported"; 250 : return $cb->(undef, { Status => 599, Reason => "only http and https URL schemes supported" });
251
252 $hdr{referer} ||= "$scheme://$authority$upath"; # leave out fragment and query string, just a heuristic
108 253
109 $authority =~ /^(?: .*\@ )? ([^\@:]+) (?: : (\d+) )?$/x 254 $authority =~ /^(?: .*\@ )? ([^\@:]+) (?: : (\d+) )?$/x
110 or croak "$authority: unparsable URL"; 255 or return $cb->(undef, { Status => 599, Reason => "unparsable URL" });
111 256
112 my $host = $1; 257 my $uhost = $1;
113 $port = $2 if defined $2; 258 $uport = $2 if defined $2;
114 259
115 $host =~ s/^\[(.*)\]$/$1/; 260 $uhost =~ s/^\[(.*)\]$/$1/;
116 $path .= "?$query" if length $query; 261 $upath .= "?$query" if length $query;
117 262
118 $hdr{host} = $host = lc $host; 263 $upath =~ s%^/?%/%;
119 264
120 my %state; 265 # cookie processing
121 266 if (my $jar = $arg{cookie_jar}) {
122 my $body = ""; 267 %$jar = () if $jar->{version} < 1;
123 $state{body} = $body; 268
124 269 my @cookie;
125 $hdr{"content-length"} = length $body; 270
126 271 while (my ($chost, $v) = each %$jar) {
127 $state{connect_guard} = AnyEvent::Socket::tcp_connect $host, $port, sub { 272 next unless $chost eq substr $uhost, -length $chost;
128 $state{fh} = shift 273 next unless $chost =~ /^\./;
129 or return $cb->(undef, { Status => 599, Reason => "$!" }); 274
130 275 while (my ($cpath, $v) = each %$v) {
131 delete $state{connect_guard}; # reduce memory usage, save a tree 276 next unless $cpath eq substr $upath, 0, length $cpath;
132 277
133 # get handle 278 while (my ($k, $v) = each %$v) {
134 $state{handle} = new AnyEvent::Handle 279 next if $scheme ne "https" && exists $v->{secure};
135 fh => $state{fh}, 280 push @cookie, "$k=$v->{value}";
136 ($scheme eq "https" ? (tls => "connect") : ()); 281 }
137 282 }
138 # limit the number of persistent connections
139 if ($KA_COUNT{$_[1]} < $MAX_PERSISTENT_PER_HOST) {
140 ++$KA_COUNT{$_[1]};
141 $state{handle}{ka_count_guard} = AnyEvent::Util::guard { --$KA_COUNT{$_[1]} };
142 $hdr{connection} = "keep-alive";
143 } else {
144 delete $hdr{connection};
145 } 283 }
284
285 $hdr{cookie} = join "; ", @cookie
286 if @cookie;
287 }
146 288
289 my ($rhost, $rport, $rpath); # request host, port, path
290
291 if ($proxy) {
292 ($rhost, $rport, $scheme) = @$proxy;
293 $rpath = $url;
294 } else {
295 ($rhost, $rport, $rpath) = ($uhost, $uport, $upath);
296 $hdr{host} = $uhost;
297 }
298
299 $hdr{"content-length"} = length $arg{body};
300
301 my %state = (connect_guard => 1);
302
303 _get_slot $uhost, sub {
304 $state{slot_guard} = shift;
305
306 return unless $state{connect_guard};
307
308 $state{connect_guard} = AnyEvent::Socket::tcp_connect $rhost, $rport, sub {
309 $state{fh} = shift
310 or return $cb->(undef, { Status => 599, Reason => "$!" });
311
312 delete $state{connect_guard}; # reduce memory usage, save a tree
313
314 # get handle
315 $state{handle} = new AnyEvent::Handle
316 fh => $state{fh},
317 ($scheme eq "https" ? (tls => "connect") : ());
318
319 # limit the number of persistent connections
320 if ($KA_COUNT{$_[1]} < $MAX_PERSISTENT_PER_HOST) {
321 ++$KA_COUNT{$_[1]};
322 $state{handle}{ka_count_guard} = AnyEvent::Util::guard { --$KA_COUNT{$_[1]} };
323 $hdr{connection} = "keep-alive";
324 delete $hdr{connection}; # keep-alive not yet supported
325 } else {
326 delete $hdr{connection};
327 }
328
147 # (re-)configure handle 329 # (re-)configure handle
148 $state{handle}->timeout ($timeout); 330 $state{handle}->timeout ($timeout);
149 $state{handle}->on_error (sub { 331 $state{handle}->on_error (sub {
332 my $errno = "$!";
150 %state = (); 333 %state = ();
151 $cb->(undef, { Status => 599, Reason => "$!" }); 334 $cb->(undef, { Status => 599, Reason => $errno });
152 }); 335 });
153 $state{handle}->on_eof (sub { 336 $state{handle}->on_eof (sub {
154 %state = (); 337 %state = ();
155 $cb->(undef, { Status => 599, Reason => "unexpected end-of-file" }); 338 $cb->(undef, { Status => 599, Reason => "unexpected end-of-file" });
156 }); 339 });
157 340
158 # send request 341 # send request
159 $state{handle}->push_write ( 342 $state{handle}->push_write (
160 "\U$method\E $path HTTP/1.0\015\012" 343 "$method $rpath HTTP/1.0\015\012"
161 . (join "", map "$_: $hdr{$_}\015\012", keys %hdr) 344 . (join "", map "$_: $hdr{$_}\015\012", keys %hdr)
162 . "\015\012" 345 . "\015\012"
163 . (delete $state{body}) 346 . (delete $arg{body})
164 );
165
166 %hdr = (); # reduce memory usage, save a kitten
167
168 # status line
169 $state{handle}->push_read (line => qr/\015?\012/, sub {
170 $_[1] =~ /^HTTP\/([0-9\.]+) \s+ ([0-9]{3}) \s+ ([^\015\012]+)/ix
171 or return (%state = (), $cb->(undef, { Status => 599, Reason => "invalid server response ($_[1])" }));
172
173 my %hdr = ( # response headers
174 HTTPVersion => ",$1",
175 Status => ",$2",
176 Reason => ",$3",
177 ); 347 );
178 348
349 %hdr = (); # reduce memory usage, save a kitten
350
351 # status line
352 $state{handle}->push_read (line => qr/\015?\012/, sub {
353 $_[1] =~ /^HTTP\/([0-9\.]+) \s+ ([0-9]{3}) \s+ ([^\015\012]+)/ix
354 or return (%state = (), $cb->(undef, { Status => 599, Reason => "invalid server response ($_[1])" }));
355
356 my %hdr = ( # response headers
357 HTTPVersion => "\x00$1",
358 Status => "\x00$2",
359 Reason => "\x00$3",
360 );
361
179 # headers, could be optimized a bit 362 # headers, could be optimized a bit
180 $state{handle}->unshift_read (line => qr/\015?\012\015?\012/, sub { 363 $state{handle}->unshift_read (line => qr/\015?\012\015?\012/, sub {
181 for ("$_[1]\012") { 364 for ("$_[1]\012") {
365 # we support spaces in field names, as lotus domino
366 # creates them.
182 $hdr{lc $1} .= ",$2" 367 $hdr{lc $1} .= "\x00$2"
183 while /\G 368 while /\G
184 ([^:\000-\040]+): 369 ([^:\000-\037]+):
185 [\011\040]* 370 [\011\040]*
186 ((?: [^\015\012]+ | \015?\012[\011\040] )*) 371 ((?: [^\015\012]+ | \015?\012[\011\040] )*)
187 \015?\012 372 \015?\012
188 /gxc; 373 /gxc;
189 374
190 /\G$/ 375 /\G$/
191 or return $cb->(undef, { Status => 599, Reason => "garbled response headers" }); 376 or return (%state = (), $cb->(undef, { Status => 599, Reason => "garbled response headers" }));
192 } 377 }
193 378
194 substr $_, 0, 1, "" 379 substr $_, 0, 1, ""
195 for values %hdr; 380 for values %hdr;
196 381
197 if (exists $hdr{"content-length"}) { 382 my $finish = sub {
198 $_[0]->unshift_read (chunk => $hdr{"content-length"}, sub {
199 # could cache persistent connection now
200 if ($hdr{connection} =~ /\bkeep-alive\b/i) {
201 };
202
203 %state = (); 383 %state = ();
384
385 # set-cookie processing
386 if ($arg{cookie_jar} && exists $hdr{"set-cookie"}) {
387 for (split /\x00/, $hdr{"set-cookie"}) {
388 my ($cookie, @arg) = split /;\s*/;
389 my ($name, $value) = split /=/, $cookie, 2;
390 my %kv = (value => $value, map { split /=/, $_, 2 } @arg);
391
392 my $cdom = (delete $kv{domain}) || $uhost;
393 my $cpath = (delete $kv{path}) || "/";
394
395 $cdom =~ s/^.?/./; # make sure it starts with a "."
396
397 next if $cdom =~ /\.$/;
398
399 # this is not rfc-like and not netscape-like. go figure.
400 my $ndots = $cdom =~ y/.//;
401 next if $ndots < ($cdom =~ /\.[^.][^.]\.[^.][^.]$/ ? 3 : 2);
402
403 # store it
404 $arg{cookie_jar}{version} = 1;
405 $arg{cookie_jar}{$cdom}{$cpath}{$name} = \%kv;
406 }
407 }
408
409 if ($_[1]{Status} =~ /^30[12]$/ && $recurse) {
410 # microsoft and other assholes don't give a shit for following standards,
411 # try to support a common form of broken Location header.
412 $_[1]{location} =~ s%^/%$scheme://$uhost:$uport/%;
413
414 http_request ($method, $_[1]{location}, %arg, recurse => $recurse - 1, $cb);
415 } else {
204 $cb->($_[1], \%hdr); 416 $cb->($_[0], $_[1]);
417 }
205 }); 418 };
419
420 if ($hdr{Status} =~ /^(?:1..|204|304)$/ or $method eq "HEAD") {
421 $finish->(undef, \%hdr);
206 } else { 422 } else {
423 if (exists $hdr{"content-length"}) {
424 $_[0]->unshift_read (chunk => $hdr{"content-length"}, sub {
425 # could cache persistent connection now
426 if ($hdr{connection} =~ /\bkeep-alive\b/i) {
427 # but we don't, due to misdesigns, this is annoyingly complex
428 };
429
430 $finish->($_[1], \%hdr);
431 });
432 } else {
207 # too bad, need to read until we get an error or EOF, 433 # too bad, need to read until we get an error or EOF,
208 # no way to detect winged data. 434 # no way to detect winged data.
209 $_[0]->on_error (sub { 435 $_[0]->on_error (sub {
210 %state = ();
211 $cb->($_[0]{rbuf}, \%hdr); 436 $finish->($_[0]{rbuf}, \%hdr);
437 });
438 $_[0]->on_eof (undef);
439 $_[0]->on_read (sub { });
440 }
212 }); 441 }
213 $_[0]->on_eof (undef);
214 $_[0]->on_read (sub { });
215 } 442 });
216 }); 443 });
444 }, sub {
445 $timeout
217 }); 446 };
218 }, sub {
219 $timeout
220 }; 447 };
221 448
222 defined wantarray && AnyEvent::Util::guard { %state = () } 449 defined wantarray && AnyEvent::Util::guard { %state = () }
223} 450}
224 451
225sub http_get($$;@) { 452sub http_get($@) {
226 unshift @_, "GET"; 453 unshift @_, "GET";
227 &http_request 454 &http_request
228} 455}
229 456
457sub http_head($@) {
458 unshift @_, "HEAD";
459 &http_request
460}
461
462sub http_post($$@) {
463 unshift @_, "POST"; splice @_, 2, 0, "body"; # yields POST, $url, body => $body, ...
464 &http_request
465}
466
467=back
468
230=head2 GLOBAL VARIABLES 469=head2 GLOBAL FUNCTIONS AND VARIABLES
231 470
232=over 4 471=over 4
233 472
473=item AnyEvent::HTTP::set_proxy "proxy-url"
474
475Sets the default proxy server to use. The proxy-url must begin with a
476string of the form C<http://host:port> (optionally C<https:...>).
477
234=item $AnyEvent::HTTP::MAX_REDIRECTS 478=item $AnyEvent::HTTP::MAX_RECURSE
235 479
236The default value for the C<max_redirects> request parameter 480The default value for the C<recurse> request parameter (default: C<10>).
237(default: C<10>).
238 481
239=item $AnyEvent::HTTP::USERAGENT 482=item $AnyEvent::HTTP::USERAGENT
240 483
241The default value for the C<User-Agent> header (the default is 484The default value for the C<User-Agent> header (the default is
242C<Mozilla/5.0 (compatible; AnyEvent::HTTP/$VERSION; +http://software.schmorp.de/pkg/AnyEvent)>). 485C<Mozilla/5.0 (compatible; AnyEvent::HTTP/$VERSION; +http://software.schmorp.de/pkg/AnyEvent)>).
243 486
244=item $AnyEvent::HTTP::MAX_PERSISTENT 487=item $AnyEvent::HTTP::MAX_PERSISTENT
245 488
246The maximum number of persistent connections to keep open (default: 8). 489The maximum number of persistent connections to keep open (default: 8).
247 490
491Not implemented currently.
492
248=item $AnyEvent::HTTP::PERSISTENT_TIMEOUT 493=item $AnyEvent::HTTP::PERSISTENT_TIMEOUT
249 494
250The maximum time to cache a persistent connection, in seconds (default: 15). 495The maximum time to cache a persistent connection, in seconds (default: 2).
496
497Not implemented currently.
498
499=item $AnyEvent::HTTP::ACTIVE
500
501The number of active connections. This is not the number of currently
502running requests, but the number of currently open and non-idle TCP
503connections. This number can be useful for load-leveling.
251 504
252=back 505=back
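
The proxy URL accepted by C<set_proxy> maps onto the same C<[$host, $port, $scheme]> triple that the C<proxy> request parameter takes. The following sketch reuses the pattern from this revision's C<set_proxy>, including its fallback port of 3128; the helper name C<parse_proxy> and the sample URLs are assumptions for illustration.

```perl
use strict;
use warnings;

# Turn a proxy URL into [$host, $port, $scheme]; returns nothing if it does not match.
sub parse_proxy {
    my ($url) = @_;

    return unless defined $url
        && $url =~ m%^(https?):// ([^:/]+) (?: : (\d*) )?%ix;

    return [$2, $3 || 3128, $1];
}

my $explicit = parse_proxy("http://proxy.example.org:8080/");
my $default  = parse_proxy("https://proxy.example.org");

print join(" ", @$explicit), "\n";
print join(" ", @$default),  "\n";
```

Checking C<defined> first avoids a warning when the C<http_proxy> environment variable is unset.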
253 506
254=cut 507=cut
508
509sub set_proxy($) {
510 $PROXY = [$2, $3 || 3128, $1] if $_[0] =~ m%^(https?):// ([^:/]+) (?: : (\d*) )?%ix;
511}
512
513# initialise proxy from environment
514set_proxy $ENV{http_proxy};
255 515
256=head1 SEE ALSO 516=head1 SEE ALSO
257 517
258L<AnyEvent>. 518L<AnyEvent>.
259 519
