/cvs/AnyEvent-HTTP/HTTP.pm

Comparing AnyEvent-HTTP/HTTP.pm (file contents):
Revision 1.4 by root, Wed Jun 4 11:59:22 2008 UTC vs.
Revision 1.16 by root, Fri Jun 6 12:57:48 2008 UTC

8 8
9=head1 DESCRIPTION 9=head1 DESCRIPTION
10 10
11This module is an L<AnyEvent> user, you need to make sure that you use and 11This module is an L<AnyEvent> user, you need to make sure that you use and
12run a supported event loop. 12run a supported event loop.
13
14This module implements a simple, stateless and non-blocking HTTP
15client. It supports GET, POST and other request methods, cookies and more,
16all on a very low level. It can follow redirects, supports proxies and
17automatically limits the number of connections to the values specified in
18the RFC.
19
20It should generally be a "good client" that is enough for most HTTP
21tasks. Simple tasks should be simple, but complex tasks should still be
22possible as the user retains control over request and response headers.
23
24The caller is responsible for authentication management, cookies (if
25the simplistic implementation in this module doesn't suffice), referer
26and other high-level protocol details for which this module offers only
27limited support.
13 28
14=head2 METHODS 29=head2 METHODS
15 30
16=over 4 31=over 4
17 32
29use AnyEvent::Socket (); 44use AnyEvent::Socket ();
30use AnyEvent::Handle (); 45use AnyEvent::Handle ();
31 46
32use base Exporter::; 47use base Exporter::;
33 48
34our $VERSION = '1.0'; 49our $VERSION = '1.01';
35 50
36our @EXPORT = qw(http_get http_request); 51our @EXPORT = qw(http_get http_request);
37 52
38our $USERAGENT = "Mozilla/5.0 (compatible; AnyEvent::HTTP/$VERSION; +http://software.schmorp.de/pkg/AnyEvent)"; 53our $USERAGENT = "Mozilla/5.0 (compatible; AnyEvent::HTTP/$VERSION; +http://software.schmorp.de/pkg/AnyEvent)";
39our $MAX_RECURSE = 10; 54our $MAX_RECURSE = 10;
41our $PERSISTENT_TIMEOUT = 2; 56our $PERSISTENT_TIMEOUT = 2;
42our $TIMEOUT = 300; 57our $TIMEOUT = 300;
43 58
44# changing these is evil 59# changing these is evil
45our $MAX_PERSISTENT_PER_HOST = 2; 60our $MAX_PERSISTENT_PER_HOST = 2;
46our $MAX_PER_HOST = 4; # not respected yet :( 61our $MAX_PER_HOST = 4;
47 62
48our $PROXY; 63our $PROXY;
64our $ACTIVE = 0;
49 65
50my %KA_COUNT; # number of open keep-alive connections per host 66my %KA_COUNT; # number of open keep-alive connections per host
67my %CO_SLOT; # number of open connections, and wait queue, per host
51 68
52=item http_get $url, key => value..., $cb->($data, $headers) 69=item http_get $url, key => value..., $cb->($data, $headers)
53 70
54Executes an HTTP-GET request. See the http_request function for details on 71Executes an HTTP-GET request. See the http_request function for details on
55additional parameters. 72additional parameters.
56 73
74=item http_head $url, key => value..., $cb->($data, $headers)
75
76Executes an HTTP-HEAD request. See the http_request function for details on
77additional parameters.
78
57=item http_get $url, $body, key => value..., $cb->($data, $headers) 79=item http_post $url, $body, key => value..., $cb->($data, $headers)
58 80
59Executes an HTTP-POST request with a requets body of C<$bod>. See the 81Executes an HTTP-POST request with a request body of C<$body>. See the
60http_request function for details on additional parameters. 82http_request function for details on additional parameters.
61 83
62=item http_request $method => $url, key => value..., $cb->($data, $headers) 84=item http_request $method => $url, key => value..., $cb->($data, $headers)
63 85
64Executes a HTTP request of type C<$method> (e.g. C<GET>, C<POST>). The URL 86Executes a HTTP request of type C<$method> (e.g. C<GET>, C<POST>). The URL
66 88
67The callback will be called with the response data as first argument 89The callback will be called with the response data as first argument
68(or C<undef> if it wasn't available due to errors), and a hash-ref with 90(or C<undef> if it wasn't available due to errors), and a hash-ref with
69response headers as second argument. 91response headers as second argument.
70 92
71All the headers in that has are lowercased. In addition to the response 93All the headers in that hash are lowercased. In addition to the response
72headers, the three "pseudo-headers" C<HTTPVersion>, C<Status> and 94headers, the three "pseudo-headers" C<HTTPVersion>, C<Status> and
73C<Reason> contain the three parts of the HTTP Status-Line of the same 95C<Reason> contain the three parts of the HTTP Status-Line of the same
74name. 96name. If the server sends a header multiple times, then the values
97will be joined together with C<\x00>.
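
As a minimal sketch of the joining rule just described, the snippet below splits a repeated header back into its individual values. The hash contents are invented for illustration and do not come from a real response; only the "\x00" separator is taken from the text above.

```perl
use strict;
use warnings;

# Hypothetical $headers hash, shaped like the second callback argument.
# Two Set-Cookie headers sent by the server arrive joined by "\x00".
my $hdr = {
    Status       => 200,
    "set-cookie" => join("\x00", "sid=abc; path=/", "lang=en; path=/"),
};

# Recover the individual header values.
my @cookies = split /\x00/, $hdr->{"set-cookie"};

print "$_\n" for @cookies;
```

Headers that were sent only once contain no "\x00", so the same split simply yields a one-element list.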
75 98
76If an internal error occurs, such as not being able to resolve a hostname, 99If an internal error occurs, such as not being able to resolve a hostname,
77then C<$data> will be C<undef>, C<< $headers->{Status} >> will be C<599> 100then C<$data> will be C<undef>, C<< $headers->{Status} >> will be C<599>
78and the C<Reason> pseudo-header will contain an error message. 101and the C<Reason> pseudo-header will contain an error message.
79 102
103A typical callback might look like this:
104
105 sub {
106 my ($body, $hdr) = @_;
107
108 if ($hdr->{Status} =~ /^2/) {
109 ... everything should be ok
110 } else {
111 print "error, $hdr->{Status} $hdr->{Reason}\n";
112 }
113 }
114
80Additional parameters are key-value pairs, and are fully optional. They 115Additional parameters are key-value pairs, and are fully optional. They
81include: 116include:
82 117
83=over 4 118=over 4
84 119
87Whether to recurse requests or not, e.g. on redirects, authentication 122Whether to recurse requests or not, e.g. on redirects, authentication
88retries and so on, and how often to do so. 123retries and so on, and how often to do so.
89 124
90=item headers => hashref 125=item headers => hashref
91 126
92The request headers to use. 127The request headers to use. Currently, C<http_request> may provide its
128own C<Host:>, C<Content-Length:>, C<Connection:> and C<Cookie:> headers
129and will provide defaults for C<User-Agent:> and C<Referer:>.
93 130
94=item timeout => $seconds 131=item timeout => $seconds
95 132
96The time-out to use for various stages - each connect attempt will reset 133The time-out to use for various stages - each connect attempt will reset
97the timeout, as will read or write activity. Default timeout is 5 minutes. 134the timeout, as will read or write activity. Default timeout is 5 minutes.
107=item body => $string 144=item body => $string
108 145
109The request body, usually empty. Will be sent as-is (future versions of 146The request body, usually empty. Will be sent as-is (future versions of
110this module might offer more options). 147this module might offer more options).
111 148
149=item cookie_jar => $hash_ref
150
151Passing this parameter enables (simplified) cookie-processing, loosely
152based on the original netscape specification.
153
154The C<$hash_ref> must be an (initially empty) hash reference which will
155get updated automatically. It is possible to save the cookie_jar to
156persistent storage with something like JSON or Storable, but this is not
157recommended, as expire times are currently being ignored.
158
159Note that this cookie implementation is not of very high quality, nor
160meant to be complete. If you want complete cookie management you have to
161do that on your own. C<cookie_jar> is meant as a quick fix to get some
162cookie-using sites working. Cookies are a privacy disaster, do not use
163them unless required to.
164
112=back 165=back
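
The cookie_jar description can be made more concrete with a small standalone sketch. It assumes the jar layout that the new revision's code appears to use (domain, then path, then cookie name, each cookie being a hash with a "value" key) and mimics the matching rules in simplified form. The helper name cookies_for and the sample jar are hypothetical; this is not the module's own code.

```perl
use strict;
use warnings;

# Hypothetical jar contents, laid out as domain => path => name => attributes.
my $jar = {
    version        => 1,
    ".example.org" => {
        "/"      => { sid  => { value => "abc" } },
        "/admin" => { role => { value => "root", secure => 1 } },
    },
};

# Build the value of a Cookie header for one request (illustrative helper).
sub cookies_for {
    my ($jar, $scheme, $host, $path) = @_;
    my @cookie;

    for my $cdom (sort keys %$jar) {
        next unless ref($jar->{$cdom}) eq "HASH";    # skip the "version" entry
        next unless $cdom =~ /^\./;
        # the stored domain must be a suffix of the request host
        next unless length($cdom) <= length($host)
                 && $cdom eq substr($host, -length($cdom));

        for my $cpath (sort keys %{ $jar->{$cdom} }) {
            # the stored path must be a prefix of the request path
            next unless $cpath eq substr($path, 0, length($cpath));

            my $named = $jar->{$cdom}{$cpath};
            for my $name (sort keys %$named) {
                # cookies flagged "secure" are only sent over https
                next if $scheme ne "https" && exists $named->{$name}{secure};
                push @cookie, "$name=$named->{$name}{value}";
            }
        }
    }

    return join "; ", @cookie;
}

print cookies_for($jar, "http",  "www.example.org", "/admin/x"), "\n";
print cookies_for($jar, "https", "www.example.org", "/admin/x"), "\n";
```

Because the jar is a plain nested hash, it can be inspected or pruned with ordinary hash operations between requests.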
113 166
114=back 167Example: make a simple HTTP GET request for http://www.nethype.de/
168
169 http_request GET => "http://www.nethype.de/", sub {
170 my ($body, $hdr) = @_;
171 print "$body\n";
172 };
173
174Example: make a HTTP HEAD request on https://www.google.com/, use a
175timeout of 30 seconds.
176
177 http_request
178 HEAD => "https://www.google.com",
179 timeout => 30,
180 sub {
181 my ($body, $hdr) = @_;
182 use Data::Dumper;
183 print Dumper $hdr;
184 }
185 ;
115 186
116=cut 187=cut
117 188
189sub _slot_schedule;
190sub _slot_schedule($) {
191 my $host = shift;
192
193 while ($CO_SLOT{$host}[0] < $MAX_PER_HOST) {
194 if (my $cb = shift @{ $CO_SLOT{$host}[1] }) {
195 # somebody wants that slot
196 ++$CO_SLOT{$host}[0];
197 ++$ACTIVE;
198
199 $cb->(AnyEvent::Util::guard {
200 --$ACTIVE;
201 --$CO_SLOT{$host}[0];
202 _slot_schedule $host;
203 });
204 } else {
205 # nobody wants the slot, maybe we can forget about it
206 delete $CO_SLOT{$host} unless $CO_SLOT{$host}[0];
207 last;
208 }
209 }
210}
211
212# wait for a free slot on host, call callback
213sub _get_slot($$) {
214 push @{ $CO_SLOT{$_[0]}[1] }, $_[1];
215
216 _slot_schedule $_[0];
217}
218
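
The slot handling introduced above (_slot_schedule and _get_slot) can be tried without any event loop. The following is a simplified sketch, not the module's code: Local::Guard is a hypothetical stand-in for AnyEvent::Util::guard, the limit is lowered to 2 for the demonstration, and the host name is made up.

```perl
use strict;
use warnings;

my $MAX_PER_HOST = 2;   # illustrative limit, lower than the module's default of 4
my %slot;               # host => [active count, [waiting callbacks]]

# Minimal stand-in for AnyEvent::Util::guard: runs a code ref when destroyed.
package Local::Guard {
    sub new     { my ($class, $cb) = @_; bless { cb => $cb }, $class }
    sub DESTROY { $_[0]{cb}->() }
}

# Hand out free slots to waiting callbacks, oldest first.
sub schedule {
    my ($host) = @_;

    while (($slot{$host}[0] // 0) < $MAX_PER_HOST) {
        my $cb = shift @{ $slot{$host}[1] } or last;
        ++$slot{$host}[0];

        # The guard frees the slot and reschedules once it is dropped.
        $cb->(Local::Guard->new(sub {
            --$slot{$host}[0];
            schedule($host);
        }));
    }
}

# Queue a callback; it runs as soon as a slot for the host is free.
sub get_slot {
    my ($host, $cb) = @_;
    push @{ $slot{$host}[1] }, $cb;
    schedule($host);
}

my (@held, @order);
for my $n (1 .. 3) {
    get_slot("example.org", sub { push @held, shift; push @order, $n });
}

# Only two callbacks have run so far; dropping one guard starts the third.
my $first = shift @held;
undef $first;

print "started: @order\n";
```

Tying the slot to a guard object means a request frees its slot simply by letting go of its state, which is how the diff clears %state on completion or error.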
118sub http_request($$$;@) { 219sub http_request($$@) {
119 my $cb = pop; 220 my $cb = pop;
120 my ($method, $url, %arg) = @_; 221 my ($method, $url, %arg) = @_;
121 222
122 my %hdr; 223 my %hdr;
123 224
124 $method = uc $method; 225 $method = uc $method;
125 226
126 if (my $hdr = delete $arg{headers}) { 227 if (my $hdr = $arg{headers}) {
127 while (my ($k, $v) = each %$hdr) { 228 while (my ($k, $v) = each %$hdr) {
128 $hdr{lc $k} = $v; 229 $hdr{lc $k} = $v;
129 } 230 }
130 } 231 }
131 232
233 my $recurse = exists $arg{recurse} ? $arg{recurse} : $MAX_RECURSE;
234
235 return $cb->(undef, { Status => 599, Reason => "recursion limit reached" })
236 if $recurse < 0;
237
132 my $proxy = $arg{proxy} || $PROXY; 238 my $proxy = $arg{proxy} || $PROXY;
133 my $timeout = $arg{timeout} || $TIMEOUT; 239 my $timeout = $arg{timeout} || $TIMEOUT;
134 my $recurse = exists $arg{recurse} ? $arg{recurse} : $MAX_RECURSE;
135 240
136 $hdr{"user-agent"} ||= $USERAGENT; 241 $hdr{"user-agent"} ||= $USERAGENT;
137 242
138 my ($host, $port, $path, $scheme); 243 my ($scheme, $authority, $upath, $query, $fragment) =
244 $url =~ m|(?:([^:/?#]+):)?(?://([^/?#]*))?([^?#]*)(?:\?([^#]*))?(?:#(.*))?|;
245
246 $scheme = lc $scheme;
247
248 my $uport = $scheme eq "http" ? 80
249 : $scheme eq "https" ? 443
250 : return $cb->(undef, { Status => 599, Reason => "only http and https URL schemes supported" });
251
252 $hdr{referer} ||= "$scheme://$authority$upath"; # leave out fragment and query string, just a heuristic
253
254 $authority =~ /^(?: .*\@ )? ([^\@:]+) (?: : (\d+) )?$/x
255 or return $cb->(undef, { Status => 599, Reason => "unparsable URL" });
256
257 my $uhost = $1;
258 $uport = $2 if defined $2;
259
260 $uhost =~ s/^\[(.*)\]$/$1/;
261 $upath .= "?$query" if length $query;
262
263 $upath =~ s%^/?%/%;
264
265 # cookie processing
266 if (my $jar = $arg{cookie_jar}) {
267 %$jar = () if $jar->{version} < 1;
268
269 my @cookie;
270
271 while (my ($chost, $v) = each %$jar) {
272 next unless $chost eq substr $uhost, -length $chost;
273 next unless $chost =~ /^\./;
274
275 while (my ($cpath, $v) = each %$v) {
276 next unless $cpath eq substr $upath, 0, length $cpath;
277
278 while (my ($k, $v) = each %$v) {
279 next if $scheme ne "https" && exists $v->{secure};
280 push @cookie, "$k=$v->{value}";
281 }
282 }
283 }
284
285 $hdr{cookie} = join "; ", @cookie
286 if @cookie;
287 }
288
289 my ($rhost, $rport, $rpath); # request host, port, path
139 290
140 if ($proxy) { 291 if ($proxy) {
141 ($host, $port, $scheme) = @$proxy; 292 ($rhost, $rport, $scheme) = @$proxy;
142 $path = $url; 293 $rpath = $url;
143 } else { 294 } else {
144 ($scheme, my $authority, $path, my $query, my $fragment) = 295 ($rhost, $rport, $rpath) = ($uhost, $uport, $upath);
145 $url =~ m|(?:([^:/?#]+):)?(?://([^/?#]*))?([^?#]*)(?:\?([^#]*))?(?:#(.*))?|;
146
147 $port = $scheme eq "http" ? 80
148 : $scheme eq "https" ? 443
149 : croak "$url: only http and https URLs supported";
150
151 $authority =~ /^(?: .*\@ )? ([^\@:]+) (?: : (\d+) )?$/x
152 or croak "$authority: unparsable URL";
153
154 $host = $1;
155 $port = $2 if defined $2;
156
157 $host =~ s/^\[(.*)\]$/$1/;
158 $path .= "?$query" if length $query;
159
160 $path = "/" unless $path;
161
162 $hdr{host} = $host = lc $host; 296 $hdr{host} = $uhost;
163 } 297 }
164 298
165 $scheme = lc $scheme;
166
167 my %state;
168
169 $state{body} = delete $arg{body};
170
171 $hdr{"content-length"} = length $state{body}; 299 $hdr{"content-length"} = length $arg{body};
172 300
301 my %state = (connect_guard => 1);
302
303 _get_slot $uhost, sub {
304 $state{slot_guard} = shift;
305
306 return unless $state{connect_guard};
307
173 $state{connect_guard} = AnyEvent::Socket::tcp_connect $host, $port, sub { 308 $state{connect_guard} = AnyEvent::Socket::tcp_connect $rhost, $rport, sub {
174 $state{fh} = shift 309 $state{fh} = shift
175 or return $cb->(undef, { Status => 599, Reason => "$!" }); 310 or return $cb->(undef, { Status => 599, Reason => "$!" });
176 311
177 delete $state{connect_guard}; # reduce memory usage, save a tree 312 delete $state{connect_guard}; # reduce memory usage, save a tree
178 313
179 # get handle 314 # get handle
180 $state{handle} = new AnyEvent::Handle 315 $state{handle} = new AnyEvent::Handle
181 fh => $state{fh}, 316 fh => $state{fh},
182 ($scheme eq "https" ? (tls => "connect") : ()); 317 ($scheme eq "https" ? (tls => "connect") : ());
183 318
184 # limit the number of persistent connections 319 # limit the number of persistent connections
185 if ($KA_COUNT{$_[1]} < $MAX_PERSISTENT_PER_HOST) { 320 if ($KA_COUNT{$_[1]} < $MAX_PERSISTENT_PER_HOST) {
186 ++$KA_COUNT{$_[1]}; 321 ++$KA_COUNT{$_[1]};
187 $state{handle}{ka_count_guard} = AnyEvent::Util::guard { --$KA_COUNT{$_[1]} }; 322 $state{handle}{ka_count_guard} = AnyEvent::Util::guard { --$KA_COUNT{$_[1]} };
188 $hdr{connection} = "keep-alive"; 323 $hdr{connection} = "keep-alive";
189 delete $hdr{connection}; # keep-alive not yet supported 324 delete $hdr{connection}; # keep-alive not yet supported
190 } else { 325 } else {
191 delete $hdr{connection}; 326 delete $hdr{connection};
192 } 327 }
193 328
194 # (re-)configure handle 329 # (re-)configure handle
195 $state{handle}->timeout ($timeout); 330 $state{handle}->timeout ($timeout);
196 $state{handle}->on_error (sub { 331 $state{handle}->on_error (sub {
332 my $errno = "$!";
197 %state = (); 333 %state = ();
198 $cb->(undef, { Status => 599, Reason => "$!" }); 334 $cb->(undef, { Status => 599, Reason => $errno });
199 }); 335 });
200 $state{handle}->on_eof (sub { 336 $state{handle}->on_eof (sub {
201 %state = (); 337 %state = ();
202 $cb->(undef, { Status => 599, Reason => "unexpected end-of-file" }); 338 $cb->(undef, { Status => 599, Reason => "unexpected end-of-file" });
203 }); 339 });
204 340
205 # send request 341 # send request
206 $state{handle}->push_write ( 342 $state{handle}->push_write (
207 "$method $path HTTP/1.0\015\012" 343 "$method $rpath HTTP/1.0\015\012"
208 . (join "", map "$_: $hdr{$_}\015\012", keys %hdr) 344 . (join "", map "$_: $hdr{$_}\015\012", keys %hdr)
209 . "\015\012" 345 . "\015\012"
210 . (delete $state{body}) 346 . (delete $arg{body})
211 );
212
213 %hdr = (); # reduce memory usage, save a kitten
214
215 # status line
216 $state{handle}->push_read (line => qr/\015?\012/, sub {
217 $_[1] =~ /^HTTP\/([0-9\.]+) \s+ ([0-9]{3}) \s+ ([^\015\012]+)/ix
218 or return (%state = (), $cb->(undef, { Status => 599, Reason => "invalid server response ($_[1])" }));
219
220 my %hdr = ( # response headers
221 HTTPVersion => ",$1",
222 Status => ",$2",
223 Reason => ",$3",
224 ); 347 );
225 348
349 %hdr = (); # reduce memory usage, save a kitten
350
351 # status line
352 $state{handle}->push_read (line => qr/\015?\012/, sub {
353 $_[1] =~ /^HTTP\/([0-9\.]+) \s+ ([0-9]{3}) \s+ ([^\015\012]+)/ix
354 or return (%state = (), $cb->(undef, { Status => 599, Reason => "invalid server response ($_[1])" }));
355
356 my %hdr = ( # response headers
357 HTTPVersion => "\x00$1",
358 Status => "\x00$2",
359 Reason => "\x00$3",
360 );
361
226 # headers, could be optimized a bit 362 # headers, could be optimized a bit
227 $state{handle}->unshift_read (line => qr/\015?\012\015?\012/, sub { 363 $state{handle}->unshift_read (line => qr/\015?\012\015?\012/, sub {
228 for ("$_[1]\012") { 364 for ("$_[1]\012") {
229 # we support spaces in field names, as lotus domino 365 # we support spaces in field names, as lotus domino
230 # creates them. 366 # creates them.
231 $hdr{lc $1} .= ",$2" 367 $hdr{lc $1} .= "\x00$2"
232 while /\G 368 while /\G
233 ([^:\000-\037]+): 369 ([^:\000-\037]+):
234 [\011\040]* 370 [\011\040]*
235 ((?: [^\015\012]+ | \015?\012[\011\040] )*) 371 ((?: [^\015\012]+ | \015?\012[\011\040] )*)
236 \015?\012 372 \015?\012
237 /gxc; 373 /gxc;
238 374
239 /\G$/ 375 /\G$/
240 or return $cb->(undef, { Status => 599, Reason => "garbled response headers" }); 376 or return (%state = (), $cb->(undef, { Status => 599, Reason => "garbled response headers" }));
241 } 377 }
242 378
243 substr $_, 0, 1, "" 379 substr $_, 0, 1, ""
244 for values %hdr; 380 for values %hdr;
245 381
246 if ($method eq "HEAD") { 382 my $finish = sub {
247 %state = (); 383 %state = ();
248 $cb->(undef, \%hdr); 384
249 } else { 385 # set-cookie processing
250 if (exists $hdr{"content-length"}) { 386 if ($arg{cookie_jar} && exists $hdr{"set-cookie"}) {
251 $_[0]->unshift_read (chunk => $hdr{"content-length"}, sub { 387 for (split /\x00/, $hdr{"set-cookie"}) {
252 # could cache persistent connection now 388 my ($cookie, @arg) = split /;\s*/;
253 if ($hdr{connection} =~ /\bkeep-alive\b/i) { 389 my ($name, $value) = split /=/, $cookie, 2;
254 # but we don't, due to misdesigns, this is annoyingly complex 390 my %kv = (value => $value, map { split /=/, $_, 2 } @arg);
391
392 my $cdom = (delete $kv{domain}) || $uhost;
393 my $cpath = (delete $kv{path}) || "/";
394
395 $cdom =~ s/^.?/./; # make sure it starts with a "."
396
397 next if $cdom =~ /\.$/;
398
399 # this is not rfc-like and not netscape-like. go figure.
400 my $ndots = $cdom =~ y/.//;
401 next if $ndots < ($cdom =~ /\.[^.][^.]\.[^.][^.]$/ ? 3 : 2);
402
403 # store it
404 $arg{cookie_jar}{version} = 1;
405 $arg{cookie_jar}{$cdom}{$cpath}{$name} = \%kv;
255 }; 406 }
256
257 %state = ();
258 $cb->($_[1], \%hdr);
259 }); 407 }
408
409 if ($_[1]{Status} =~ /^30[12]$/ && $recurse) {
410 # microsoft and other assholes don't give a shit for following standards,
411 # try to support a common form of broken Location header.
412 $_[1]{location} =~ s%^/%$scheme://$uhost:$uport/%;
413
414 http_request ($method, $_[1]{location}, %arg, recurse => $recurse - 1, $cb);
415 } else {
416 $cb->($_[0], $_[1]);
417 }
418 };
419
420 if ($hdr{Status} =~ /^(?:1..|204|304)$/ or $method eq "HEAD") {
421 $finish->(undef, \%hdr);
260 } else { 422 } else {
423 if (exists $hdr{"content-length"}) {
424 $_[0]->unshift_read (chunk => $hdr{"content-length"}, sub {
425 # could cache persistent connection now
426 if ($hdr{connection} =~ /\bkeep-alive\b/i) {
427 # but we don't, due to misdesigns, this is annoyingly complex
428 };
429
430 $finish->($_[1], \%hdr);
431 });
432 } else {
261 # too bad, need to read until we get an error or EOF, 433 # too bad, need to read until we get an error or EOF,
262 # no way to detect winged data. 434 # no way to detect winged data.
263 $_[0]->on_error (sub { 435 $_[0]->on_error (sub {
264 %state = ();
265 $cb->($_[0]{rbuf}, \%hdr); 436 $finish->($_[0]{rbuf}, \%hdr);
266 }); 437 });
267 $_[0]->on_eof (undef); 438 $_[0]->on_eof (undef);
268 $_[0]->on_read (sub { }); 439 $_[0]->on_read (sub { });
440 }
269 } 441 }
270 } 442 });
271 }); 443 });
444 }, sub {
445 $timeout
272 }); 446 };
273 }, sub {
274 $timeout
275 }; 447 };
276 448
277 defined wantarray && AnyEvent::Util::guard { %state = () } 449 defined wantarray && AnyEvent::Util::guard { %state = () }
278} 450}
279 451
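
The URL handling that the new revision performs at the top of http_request can be exercised in isolation. The sketch below reuses the two regular expressions from the diff inside a hypothetical helper named split_url. It returns an empty list where the module would invoke the callback with status 599, and it leaves out the IPv6 bracket stripping.

```perl
use strict;
use warnings;

# Split a URL into (scheme, host, port, path); returns () if unsupported.
sub split_url {
    my ($url) = @_;

    my ($scheme, $authority, $path, $query, $fragment) =
        $url =~ m|(?:([^:/?#]+):)?(?://([^/?#]*))?([^?#]*)(?:\?([^#]*))?(?:#(.*))?|;

    $scheme = lc($scheme // "");

    # default port by scheme; anything else is rejected
    my $port = $scheme eq "http"  ? 80
             : $scheme eq "https" ? 443
             : return;

    # optional userinfo, host, optional port
    ($authority // "") =~ /^(?: .*\@ )? ([^\@:]+) (?: : (\d+) )?$/x
        or return;

    my $host = $1;
    $port = $2 if defined $2;

    $path .= "?$query" if defined $query && length $query;
    $path =~ s%^/?%/%;    # make sure the path starts with a slash

    return ($scheme, $host, $port, $path);
}

print join(" ", split_url('http://user@www.example.org:8080/a/b?x=1#frag')), "\n";
print join(" ", split_url('HTTPS://example.org')), "\n";
```

The fragment is parsed but never used, matching the diff, where only the path and query end up in the request line.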
280sub http_get($$;@) { 452sub http_get($@) {
281 unshift @_, "GET"; 453 unshift @_, "GET";
282 &http_request 454 &http_request
283} 455}
284 456
285sub http_head($$;@) { 457sub http_head($@) {
286 unshift @_, "HEAD"; 458 unshift @_, "HEAD";
287 &http_request 459 &http_request
288} 460}
289 461
290sub http_post($$$;@) { 462sub http_post($$@) {
291 unshift @_, "POST", "body"; 463 unshift @_, "POST", "body";
292 &http_request 464 &http_request
293} 465}
294 466
467=back
468
295=head2 GLOBAL FUNCTIONS AND VARIABLES 469=head2 GLOBAL FUNCTIONS AND VARIABLES
296 470
297=over 4 471=over 4
298 472
299=item AnyEvent::HTTP::set_proxy "proxy-url" 473=item AnyEvent::HTTP::set_proxy "proxy-url"
320 494
321The maximum time to cache a persistent connection, in seconds (default: 2). 495The maximum time to cache a persistent connection, in seconds (default: 2).
322 496
323Not implemented currently. 497Not implemented currently.
324 498
499=item $AnyEvent::HTTP::ACTIVE
500
501The number of active connections. This is not the number of currently
502running requests, but the number of currently open and non-idle TCP
503connections. This number can be useful for load-leveling.
504
325=back 505=back
326 506
327=cut 507=cut
328 508
329sub set_proxy($) { 509sub set_proxy($) {
