/cvs/AnyEvent-HTTP/HTTP.pm

Comparing AnyEvent-HTTP/HTTP.pm (file contents):
Revision 1.5 by root, Wed Jun 4 12:03:47 2008 UTC vs.
Revision 1.13 by root, Thu Jun 5 16:43:45 2008 UTC

8 8
9=head1 DESCRIPTION 9=head1 DESCRIPTION
10 10
11This module is an L<AnyEvent> user, you need to make sure that you use and 11This module is an L<AnyEvent> user, you need to make sure that you use and
12run a supported event loop. 12run a supported event loop.
13
14This module implements a simple, stateless and non-blocking HTTP
15client. It supports GET, POST and other request methods, cookies and more,
16all on a very low level. It can follow redirects, supports proxies and
17automatically limits the number of connections to the values specified in
18the RFC.
19
20It should generally be a "good client" that is enough for most HTTP
21tasks. Simple tasks should be simple, but complex tasks should still be
22possible as the user retains control over request and response headers.
23
24The caller is responsible for authentication management, cookies (if
25the simplistic implementation in this module doesn't suffice), referer
26and other high-level protocol details for which this module offers only
27limited support.
13 28
14=head2 METHODS 29=head2 METHODS
15 30
16=over 4 31=over 4
17 32
41our $PERSISTENT_TIMEOUT = 2; 56our $PERSISTENT_TIMEOUT = 2;
42our $TIMEOUT = 300; 57our $TIMEOUT = 300;
43 58
44# changing these is evil 59# changing these is evil
45our $MAX_PERSISTENT_PER_HOST = 2; 60our $MAX_PERSISTENT_PER_HOST = 2;
46our $MAX_PER_HOST = 4; # not respected yet :( 61our $MAX_PER_HOST = 4;
47 62
48our $PROXY; 63our $PROXY;
49 64
50my %KA_COUNT; # number of open keep-alive connections per host 65my %KA_COUNT; # number of open keep-alive connections per host
66my %CO_SLOT; # number of open connections, and wait queue, per host
51 67
52=item http_get $url, key => value..., $cb->($data, $headers) 68=item http_get $url, key => value..., $cb->($data, $headers)
53 69
54Executes an HTTP-GET request. See the http_request function for details on 70Executes an HTTP-GET request. See the http_request function for details on
55additional parameters. 71additional parameters.
59Executes an HTTP-HEAD request. See the http_request function for details on 75Executes an HTTP-HEAD request. See the http_request function for details on
60additional parameters. 76additional parameters.
61 77
62=item http_post $url, $body, key => value..., $cb->($data, $headers) 78=item http_post $url, $body, key => value..., $cb->($data, $headers)
63 79
64Executes an HTTP-POST request with a requets body of C<$body>. See the 80Executes an HTTP-POST request with a request body of C<$body>. See the
65http_request function for details on additional parameters. 81http_request function for details on additional parameters.
66 82
67=item http_request $method => $url, key => value..., $cb->($data, $headers) 83=item http_request $method => $url, key => value..., $cb->($data, $headers)
68 84
69Executes a HTTP request of type C<$method> (e.g. C<GET>, C<POST>). The URL 85Executes a HTTP request of type C<$method> (e.g. C<GET>, C<POST>). The URL
71 87
72The callback will be called with the response data as first argument 88The callback will be called with the response data as first argument
73(or C<undef> if it wasn't available due to errors), and a hash-ref with 89(or C<undef> if it wasn't available due to errors), and a hash-ref with
74response headers as second argument. 90response headers as second argument.
75 91
76All the headers in that has are lowercased. In addition to the response 92All the headers in that hash are lowercased. In addition to the response
77headers, the three "pseudo-headers" C<HTTPVersion>, C<Status> and 93headers, the three "pseudo-headers" C<HTTPVersion>, C<Status> and
78C<Reason> contain the three parts of the HTTP Status-Line of the same 94C<Reason> contain the three parts of the HTTP Status-Line of the same
79name. 95name. If the server sends a header multiple times, then their contents
96will be joined together with C<\x00>.
80 97
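The joining rule described above can be illustrated with a short sketch. It assumes a `$hdr` hash shaped like the second callback argument in this revision; the sample values and the `header_values` helper are made up for illustration and are not part of the module.

```perl
use strict;
use warnings;

# Hypothetical headers hash, shaped like the second callback argument.
# Repeated response headers arrive joined with "\x00" in this revision.
my $hdr = {
   Status       => 200,
   "set-cookie" => "a=1; path=/\x00b=2; path=/",
};

# Split a possibly repeated header back into its individual values.
# Response header names are lowercased, so the lookup lowercases too
# (this means the capitalised pseudo-headers are not covered here).
sub header_values {
   my ($hdr, $name) = @_;
   return () unless defined $hdr->{lc $name};
   return split /\x00/, $hdr->{lc $name};
}

my @cookies = header_values ($hdr, "Set-Cookie");
print scalar @cookies, " values\n";
print "  $_\n" for @cookies;
```

Splitting on the separator keeps each value intact even when the values themselves contain commas or semicolons.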
81If an internal error occurs, such as not being able to resolve a hostname, 98If an internal error occurs, such as not being able to resolve a hostname,
82then C<$data> will be C<undef>, C<< $headers->{Status} >> will be C<599> 99then C<$data> will be C<undef>, C<< $headers->{Status} >> will be C<599>
83and the C<Reason> pseudo-header will contain an error message. 100and the C<Reason> pseudo-header will contain an error message.
84 101
102A typical callback might look like this:
103
104 sub {
105 my ($body, $hdr) = @_;
106
107 if ($hdr->{Status} =~ /^2/) {
108 ... everything should be ok
109 } else {
110 print "error, $hdr->{Status} $hdr->{Reason}\n";
111 }
112 }
113
85Additional parameters are key-value pairs, and are fully optional. They 114Additional parameters are key-value pairs, and are fully optional. They
86include: 115include:
87 116
88=over 4 117=over 4
89 118
92Whether to recurse requests or not, e.g. on redirects, authentication 121Whether to recurse requests or not, e.g. on redirects, authentication
93retries and so on, and how often to do so. 122retries and so on, and how often to do so.
94 123
95=item headers => hashref 124=item headers => hashref
96 125
97The request headers to use. 126The request headers to use. Currently, C<http_request> may provide its
127own C<Host:>, C<Content-Length:>, C<Connection:> and C<Cookie:> headers
128and will provide defaults for C<User-Agent:> and C<Referer:>.
98 129
99=item timeout => $seconds 130=item timeout => $seconds
100 131
101The time-out to use for various stages - each connect attempt will reset 132The time-out to use for various stages - each connect attempt will reset
102the timeout, as will read or write activity. Default timeout is 5 minutes. 133the timeout, as will read or write activity. Default timeout is 5 minutes.
112=item body => $string 143=item body => $string
113 144
114The request body, usually empty. Will be sent as-is (future versions of 145The request body, usually empty. Will be sent as-is (future versions of
115this module might offer more options). 146this module might offer more options).
116 147
148=item cookie_jar => $hash_ref
149
150Passing this parameter enables (simplified) cookie-processing, loosely
151based on the original Netscape specification.
152
153The C<$hash_ref> must be an (initially empty) hash reference which will
154get updated automatically. It is possible to save the cookie_jar to
155persistent storage with something like JSON or Storable, but this is not
156recommended, as expire times are currently being ignored.
157
158Note that this cookie implementation is not of very high quality, nor
159meant to be complete. If you want complete cookie management you have to
160do that on your own. C<cookie_jar> is meant as a quick fix to get some
161cookie-using sites working. Cookies are a privacy disaster, do not use
162them unless required to.
163
117=back 164=back
118 165
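As a rough illustration of the C<cookie_jar> parameter, the sketch below shows the nested layout the code in this diff appears to use (domain, then path, then cookie name) and a simplified version of its matching rules. The jar contents and the `cookies_for` helper are hypothetical; sorting the keys is an addition made here only to get a stable output order.

```perl
use strict;
use warnings;

# Hypothetical jar contents, following the layout used in this diff:
# $jar->{$domain}{$path}{$name} = { value => ..., ... }.
my $jar = {
   version => 1,
   ".example.com" => {
      "/"      => { session => { value => "abc" } },
      "/admin" => { token   => { value => "xyz", secure => undef } },
   },
};

# Collect the cookies that apply to a request (simplified matching).
sub cookies_for {
   my ($jar, $scheme, $host, $path) = @_;
   my @cookie;

   for my $chost (sort keys %$jar) {
      next unless $chost =~ /^\./;                         # skips "version"
      next unless $chost eq substr $host, -length $chost;  # domain suffix

      for my $cpath (sort keys %{ $jar->{$chost} }) {
         next unless $cpath eq substr $path, 0, length $cpath; # path prefix

         my $kv = $jar->{$chost}{$cpath};
         for my $name (sort keys %$kv) {
            # cookies marked secure are only sent over https
            next if $scheme ne "https" && exists $kv->{$name}{secure};
            push @cookie, "$name=$kv->{$name}{value}";
         }
      }
   }

   join "; ", @cookie
}

my $plain  = cookies_for ($jar, "http",  "www.example.com", "/admin/users");
my $secure = cookies_for ($jar, "https", "www.example.com", "/admin/users");
print "http:  $plain\n";
print "https: $secure\n";
```

Expiry is not modelled here, matching the note above that expire times are currently ignored.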
119=back 166Example: make a simple HTTP GET request for http://www.nethype.de/
167
168 http_request GET => "http://www.nethype.de/", sub {
169 my ($body, $hdr) = @_;
170 print "$body\n";
171 };
172
173Example: make a HTTP HEAD request on https://www.google.com/, use a
174timeout of 30 seconds.
175
176 http_request
177 HEAD => "https://www.google.com",
178 timeout => 30,
179 sub {
180 my ($body, $hdr) = @_;
181 use Data::Dumper;
182 print Dumper $hdr;
183 }
184 ;
120 185
121=cut 186=cut
187
188sub _slot_schedule;
189sub _slot_schedule($) {
190 my $host = shift;
191
192 while ($CO_SLOT{$host}[0] < $MAX_PER_HOST) {
193 if (my $cb = shift @{ $CO_SLOT{$host}[1] }) {
194 # somebody wants that slot
195 ++$CO_SLOT{$host}[0];
196
197 $cb->(AnyEvent::Util::guard {
198 --$CO_SLOT{$host}[0];
199 _slot_schedule $host;
200 });
201 } else {
202 # nobody wants the slot, maybe we can forget about it
203 delete $CO_SLOT{$host} unless $CO_SLOT{$host}[0];
204 last;
205 }
206 }
207}
208
209# wait for a free slot on host, call callback
210sub _get_slot($$) {
211 push @{ $CO_SLOT{$_[0]}[1] }, $_[1];
212
213 _slot_schedule $_[0];
214}
122 215
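The per-host slot queue added in this revision can be tried out in isolation. The following is a minimal sketch, not the module's code: `My::Guard` is a hypothetical stand-in for `AnyEvent::Util::guard` so the example runs without AnyEvent, and the limit is lowered to 2 to make the queueing visible.

```perl
use strict;
use warnings;

# Tiny stand-in for AnyEvent::Util::guard: runs a code block when the
# object is destroyed. Hypothetical helper, for illustration only.
package My::Guard {
   sub new     { my ($class, $cb) = @_; bless { cb => $cb }, $class }
   sub DESTROY { $_[0]{cb}->() }
}

package main;

my $MAX_PER_HOST = 2;   # small limit to make the queueing visible
my %slot;               # host => [active count, [waiting callbacks]]

sub slot_schedule {
   my ($host) = @_;

   while (($slot{$host}[0] || 0) < $MAX_PER_HOST) {
      if (my $cb = shift @{ $slot{$host}[1] }) {
         ++$slot{$host}[0];
         # hand out a guard; destroying it frees the slot again
         $cb->(My::Guard->new (sub {
            --$slot{$host}[0];
            slot_schedule ($host);
         }));
      } else {
         # nobody is waiting; forget the host once it is idle
         delete $slot{$host} unless $slot{$host}[0];
         last;
      }
   }
}

sub get_slot {
   my ($host, $cb) = @_;
   push @{ $slot{$host}[1] }, $cb;
   slot_schedule ($host);
}

my (@started, %guard);
for my $id (1 .. 3) {
   get_slot ("example.com", sub { $guard{$id} = shift; push @started, $id });
}

print "running: @started\n";   # request 3 is still queued here
delete $guard{1};              # finishing request 1 frees a slot
print "running: @started\n";
%guard = ();                   # release the remaining slots
```

Tying the slot to the lifetime of a guard object means a request releases its slot however it ends, including when its state hash is simply cleared.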
123sub http_request($$$;@) { 216sub http_request($$$;@) {
124 my $cb = pop; 217 my $cb = pop;
125 my ($method, $url, %arg) = @_; 218 my ($method, $url, %arg) = @_;
126 219
127 my %hdr; 220 my %hdr;
128 221
129 $method = uc $method; 222 $method = uc $method;
130 223
131 if (my $hdr = delete $arg{headers}) { 224 if (my $hdr = $arg{headers}) {
132 while (my ($k, $v) = each %$hdr) { 225 while (my ($k, $v) = each %$hdr) {
133 $hdr{lc $k} = $v; 226 $hdr{lc $k} = $v;
134 } 227 }
135 } 228 }
136 229
230 my $recurse = exists $arg{recurse} ? $arg{recurse} : $MAX_RECURSE;
231
232 return $cb->(undef, { Status => 599, Reason => "recursion limit reached" })
233 if $recurse < 0;
234
137 my $proxy = $arg{proxy} || $PROXY; 235 my $proxy = $arg{proxy} || $PROXY;
138 my $timeout = $arg{timeout} || $TIMEOUT; 236 my $timeout = $arg{timeout} || $TIMEOUT;
139 my $recurse = exists $arg{recurse} ? $arg{recurse} : $MAX_RECURSE;
140 237
141 $hdr{"user-agent"} ||= $USERAGENT; 238 $hdr{"user-agent"} ||= $USERAGENT;
142 239
143 my ($host, $port, $path, $scheme); 240 my ($scheme, $authority, $upath, $query, $fragment) =
241 $url =~ m|(?:([^:/?#]+):)?(?://([^/?#]*))?([^?#]*)(?:\?([^#]*))?(?:#(.*))?|;
242
243 $scheme = lc $scheme;
244
245 my $uport = $scheme eq "http" ? 80
246 : $scheme eq "https" ? 443
247 : return $cb->(undef, { Status => 599, Reason => "only http and https URL schemes supported" });
248
249 $hdr{referer} ||= "$scheme://$authority$upath"; # leave out fragment and query string, just a heuristic
250
251 $authority =~ /^(?: .*\@ )? ([^\@:]+) (?: : (\d+) )?$/x
252 or return $cb->(undef, { Status => 599, Reason => "unparsable URL" });
253
254 my $uhost = $1;
255 $uport = $2 if defined $2;
256
257 $uhost =~ s/^\[(.*)\]$/$1/;
258 $upath .= "?$query" if length $query;
259
260 $upath =~ s%^/?%/%;
261
262 # cookie processing
263 if (my $jar = $arg{cookie_jar}) {
264 %$jar = () if $jar->{version} < 1;
265
266 my @cookie;
267
268 while (my ($chost, $v) = each %$jar) {
269 next unless $chost eq substr $uhost, -length $chost;
270 next unless $chost =~ /^\./;
271
272 while (my ($cpath, $v) = each %$v) {
273 next unless $cpath eq substr $upath, 0, length $cpath;
274
275 while (my ($k, $v) = each %$v) {
276 next if $scheme ne "https" && exists $v->{secure};
277 push @cookie, "$k=$v->{value}";
278 }
279 }
280 }
281
282 $hdr{cookie} = join "; ", @cookie
283 if @cookie;
284 }
285
286 my ($rhost, $rport, $rpath); # request host, port, path
144 287
145 if ($proxy) { 288 if ($proxy) {
146 ($host, $port, $scheme) = @$proxy; 289 ($rhost, $rport, $scheme) = @$proxy;
147 $path = $url; 290 $rpath = $url;
148 } else { 291 } else {
149 ($scheme, my $authority, $path, my $query, my $fragment) = 292 ($rhost, $rport, $rpath) = ($uhost, $uport, $upath);
150 $url =~ m|(?:([^:/?#]+):)?(?://([^/?#]*))?([^?#]*)(?:\?([^#]*))?(?:#(.*))?|;
151
152 $port = $scheme eq "http" ? 80
153 : $scheme eq "https" ? 443
154 : croak "$url: only http and https URLs supported";
155
156 $authority =~ /^(?: .*\@ )? ([^\@:]+) (?: : (\d+) )?$/x
157 or croak "$authority: unparsable URL";
158
159 $host = $1;
160 $port = $2 if defined $2;
161
162 $host =~ s/^\[(.*)\]$/$1/;
163 $path .= "?$query" if length $query;
164
165 $path = "/" unless $path;
166
167 $hdr{host} = $host = lc $host; 293 $hdr{host} = $uhost;
168 } 294 }
169 295
170 $scheme = lc $scheme;
171
172 my %state;
173
174 $state{body} = delete $arg{body};
175
176 $hdr{"content-length"} = length $state{body}; 296 $hdr{"content-length"} = length $arg{body};
177 297
298 my %state = (connect_guard => 1);
299
300 _get_slot $uhost, sub {
301 $state{slot_guard} = shift;
302
303 return unless $state{connect_guard};
304
178 $state{connect_guard} = AnyEvent::Socket::tcp_connect $host, $port, sub { 305 $state{connect_guard} = AnyEvent::Socket::tcp_connect $rhost, $rport, sub {
179 $state{fh} = shift 306 $state{fh} = shift
180 or return $cb->(undef, { Status => 599, Reason => "$!" }); 307 or return $cb->(undef, { Status => 599, Reason => "$!" });
181 308
182 delete $state{connect_guard}; # reduce memory usage, save a tree 309 delete $state{connect_guard}; # reduce memory usage, save a tree
183 310
184 # get handle 311 # get handle
185 $state{handle} = new AnyEvent::Handle 312 $state{handle} = new AnyEvent::Handle
186 fh => $state{fh}, 313 fh => $state{fh},
187 ($scheme eq "https" ? (tls => "connect") : ()); 314 ($scheme eq "https" ? (tls => "connect") : ());
188 315
189 # limit the number of persistent connections 316 # limit the number of persistent connections
190 if ($KA_COUNT{$_[1]} < $MAX_PERSISTENT_PER_HOST) { 317 if ($KA_COUNT{$_[1]} < $MAX_PERSISTENT_PER_HOST) {
191 ++$KA_COUNT{$_[1]}; 318 ++$KA_COUNT{$_[1]};
192 $state{handle}{ka_count_guard} = AnyEvent::Util::guard { --$KA_COUNT{$_[1]} }; 319 $state{handle}{ka_count_guard} = AnyEvent::Util::guard { --$KA_COUNT{$_[1]} };
193 $hdr{connection} = "keep-alive"; 320 $hdr{connection} = "keep-alive";
194 delete $hdr{connection}; # keep-alive not yet supported 321 delete $hdr{connection}; # keep-alive not yet supported
195 } else { 322 } else {
196 delete $hdr{connection}; 323 delete $hdr{connection};
197 } 324 }
198 325
199 # (re-)configure handle 326 # (re-)configure handle
200 $state{handle}->timeout ($timeout); 327 $state{handle}->timeout ($timeout);
201 $state{handle}->on_error (sub { 328 $state{handle}->on_error (sub {
202 %state = (); 329 %state = ();
203 $cb->(undef, { Status => 599, Reason => "$!" }); 330 $cb->(undef, { Status => 599, Reason => "$!" });
204 }); 331 });
205 $state{handle}->on_eof (sub { 332 $state{handle}->on_eof (sub {
206 %state = (); 333 %state = ();
207 $cb->(undef, { Status => 599, Reason => "unexpected end-of-file" }); 334 $cb->(undef, { Status => 599, Reason => "unexpected end-of-file" });
208 }); 335 });
209 336
210 # send request 337 # send request
211 $state{handle}->push_write ( 338 $state{handle}->push_write (
212 "$method $path HTTP/1.0\015\012" 339 "$method $rpath HTTP/1.0\015\012"
213 . (join "", map "$_: $hdr{$_}\015\012", keys %hdr) 340 . (join "", map "$_: $hdr{$_}\015\012", keys %hdr)
214 . "\015\012" 341 . "\015\012"
215 . (delete $state{body}) 342 . (delete $arg{body})
216 );
217
218 %hdr = (); # reduce memory usage, save a kitten
219
220 # status line
221 $state{handle}->push_read (line => qr/\015?\012/, sub {
222 $_[1] =~ /^HTTP\/([0-9\.]+) \s+ ([0-9]{3}) \s+ ([^\015\012]+)/ix
223 or return (%state = (), $cb->(undef, { Status => 599, Reason => "invalid server response ($_[1])" }));
224
225 my %hdr = ( # response headers
226 HTTPVersion => ",$1",
227 Status => ",$2",
228 Reason => ",$3",
229 ); 343 );
230 344
345 %hdr = (); # reduce memory usage, save a kitten
346
347 # status line
348 $state{handle}->push_read (line => qr/\015?\012/, sub {
349 $_[1] =~ /^HTTP\/([0-9\.]+) \s+ ([0-9]{3}) \s+ ([^\015\012]+)/ix
350 or return (%state = (), $cb->(undef, { Status => 599, Reason => "invalid server response ($_[1])" }));
351
352 my %hdr = ( # response headers
353 HTTPVersion => "\x00$1",
354 Status => "\x00$2",
355 Reason => "\x00$3",
356 );
357
231 # headers, could be optimized a bit 358 # headers, could be optimized a bit
232 $state{handle}->unshift_read (line => qr/\015?\012\015?\012/, sub { 359 $state{handle}->unshift_read (line => qr/\015?\012\015?\012/, sub {
233 for ("$_[1]\012") { 360 for ("$_[1]\012") {
234 # we support spaces in field names, as lotus domino 361 # we support spaces in field names, as lotus domino
235 # creates them. 362 # creates them.
236 $hdr{lc $1} .= ",$2" 363 $hdr{lc $1} .= "\x00$2"
237 while /\G 364 while /\G
238 ([^:\000-\037]+): 365 ([^:\000-\037]+):
239 [\011\040]* 366 [\011\040]*
240 ((?: [^\015\012]+ | \015?\012[\011\040] )*) 367 ((?: [^\015\012]+ | \015?\012[\011\040] )*)
241 \015?\012 368 \015?\012
242 /gxc; 369 /gxc;
243 370
244 /\G$/ 371 /\G$/
245 or return $cb->(undef, { Status => 599, Reason => "garbled response headers" }); 372 or return (%state = (), $cb->(undef, { Status => 599, Reason => "garbled response headers" }));
246 } 373 }
247 374
248 substr $_, 0, 1, "" 375 substr $_, 0, 1, ""
249 for values %hdr; 376 for values %hdr;
250 377
251 if ($method eq "HEAD") { 378 my $finish = sub {
252 %state = (); 379 %state = ();
253 $cb->(undef, \%hdr); 380
254 } else { 381 # set-cookie processing
255 if (exists $hdr{"content-length"}) { 382 if ($arg{cookie_jar} && exists $hdr{"set-cookie"}) {
256 $_[0]->unshift_read (chunk => $hdr{"content-length"}, sub { 383 for (split /\x00/, $hdr{"set-cookie"}) {
257 # could cache persistent connection now 384 my ($cookie, @arg) = split /;\s*/;
258 if ($hdr{connection} =~ /\bkeep-alive\b/i) { 385 my ($name, $value) = split /=/, $cookie, 2;
259 # but we don't, due to misdesigns, this is annoyingly complex 386 my %kv = (value => $value, map { split /=/, $_, 2 } @arg);
387
388 my $cdom = (delete $kv{domain}) || $uhost;
389 my $cpath = (delete $kv{path}) || "/";
390
391 $cdom =~ s/^.?/./; # make sure it starts with a "."
392
393 next if $cdom =~ /\.$/;
394
395 # this is not rfc-like and not netscape-like. go figure.
396 my $ndots = $cdom =~ y/.//;
397 next if $ndots < ($cdom =~ /\.[^.][^.]\.[^.][^.]$/ ? 3 : 2);
398
399 # store it
400 $arg{cookie_jar}{version} = 1;
401 $arg{cookie_jar}{$cdom}{$cpath}{$name} = \%kv;
260 }; 402 }
261
262 %state = ();
263 $cb->($_[1], \%hdr);
264 }); 403 }
404
405 if ($_[1]{Status} =~ /^30[12]$/ && $recurse) {
406 # microsoft and other assholes don't give a shit for following standards,
407 # try to support a common form of broken Location header.
408 $_[1]{location} =~ s%^/%$scheme://$uhost:$uport/%;
409
410 http_request ($method, $_[1]{location}, %arg, recurse => $recurse - 1, $cb);
411 } else {
412 $cb->($_[0], $_[1]);
413 }
414 };
415
416 if ($hdr{Status} =~ /^(?:1..|204|304)$/ or $method eq "HEAD") {
417 $finish->(undef, \%hdr);
265 } else { 418 } else {
419 if (exists $hdr{"content-length"}) {
420 $_[0]->unshift_read (chunk => $hdr{"content-length"}, sub {
421 # could cache persistent connection now
422 if ($hdr{connection} =~ /\bkeep-alive\b/i) {
423 # but we don't, due to misdesigns, this is annoyingly complex
424 };
425
426 $finish->($_[1], \%hdr);
427 });
428 } else {
266 # too bad, need to read until we get an error or EOF, 429 # too bad, need to read until we get an error or EOF,
267 # no way to detect winged data. 430 # no way to detect winged data.
268 $_[0]->on_error (sub { 431 $_[0]->on_error (sub {
269 %state = ();
270 $cb->($_[0]{rbuf}, \%hdr); 432 $finish->($_[0]{rbuf}, \%hdr);
271 }); 433 });
272 $_[0]->on_eof (undef); 434 $_[0]->on_eof (undef);
273 $_[0]->on_read (sub { }); 435 $_[0]->on_read (sub { });
436 }
274 } 437 }
275 } 438 });
276 }); 439 });
440 }, sub {
441 $timeout
277 }); 442 };
278 }, sub {
279 $timeout
280 }; 443 };
281 444
282 defined wantarray && AnyEvent::Util::guard { %state = () } 445 defined wantarray && AnyEvent::Util::guard { %state = () }
283} 446}
284 447
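The URL handling that this revision moves to the top of C<http_request> can be exercised on its own. The sketch below reuses the two regular expressions from the diff inside a hypothetical `parse_url` helper; returning an empty list on failure is an assumption made here for brevity, whereas the module reports a 599 status through the callback.

```perl
use strict;
use warnings;

# Split a URL into scheme, host, port and request path, loosely
# following the parsing steps in http_request. Hypothetical helper.
sub parse_url {
   my ($url) = @_;

   my ($scheme, $authority, $path, $query, $fragment) =
      $url =~ m|(?:([^:/?#]+):)?(?://([^/?#]*))?([^?#]*)(?:\?([^#]*))?(?:#(.*))?|;

   $scheme = lc ($scheme // "");

   my $port = $scheme eq "http"  ? 80
            : $scheme eq "https" ? 443
            : return;   # unsupported scheme

   ($authority // "") =~ /^(?: .*\@ )? ([^\@:]+) (?: : (\d+) )?$/x
      or return;        # unparsable authority

   my $host = $1;
   $port = $2 if defined $2;

   $path .= "?$query" if defined $query && length $query;
   $path =~ s%^/?%/%;   # make sure the path starts with a slash

   ($scheme, $host, $port, $path)
}

my @full = parse_url ("HTTP://user\@www.example.com:8080/a/b?x=1#frag");
my @bare = parse_url ("https://example.com");
print join (" ", @full), "\n";
print join (" ", @bare), "\n";
```

The fragment is parsed but never sent, and an empty path is normalised to "/" so the request line is always well formed.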
295sub http_post($$$;@) { 458sub http_post($$$;@) {
296 unshift @_, "POST", "body"; 459 unshift @_, "POST", "body";
297 &http_request 460 &http_request
298} 461}
299 462
463=back
464
300=head2 GLOBAL FUNCTIONS AND VARIABLES 465=head2 GLOBAL FUNCTIONS AND VARIABLES
301 466
302=over 4 467=over 4
303 468
304=item AnyEvent::HTTP::set_proxy "proxy-url" 469=item AnyEvent::HTTP::set_proxy "proxy-url"
