NAME
    Convert::UUlib - decode uu/xx/b64/mime/yenc/etc-encoded data from a
    massive number of files

SYNOPSIS
    use Convert::UUlib ':all';

    # read all the files named on the commandline and decode them
    # into the CURRENT directory. See below for a longer example.
    LoadFile $_ for @ARGV;

    for my $uu (GetFileList) {
       if ($uu->state & FILE_OK) {
          $uu->decode;
          print $uu->filename, "\n";
       }
    }

DESCRIPTION
    This module started as an interface to the uulib/uudeview library by
    Frank Pilhofer that can be used to decode all kinds of usenet (and
    other) binary messages.

    After upstream abandoned the project, the library was continuously
    bugfixed and improved in this module, with major focuses on security
    fixes, correctness and speed (that does not mean that this library is
    considered safe with untrusted data, but it surely is safer than the
    original uudeview).

    For in-depth information about the C library used by this interface,
    read the file doc/library.pdf from the distribution, then the rest of
    this document, especially the non-trivial decoder program at the end.

EXPORTED CONSTANTS
  Action code constants
      ACT_IDLE      we don't do anything
      ACT_SCANNING  scanning an input file
      ACT_DECODING  decoding into a temp file
      ACT_COPYING   copying temp to target
      ACT_ENCODING  encoding a file

  Message severity levels
      MSG_MESSAGE   just a message, nothing important
      MSG_NOTE      something that should be noticed
      MSG_WARNING   important msg, processing continues
      MSG_ERROR     processing has been terminated
      MSG_FATAL     decoder cannot process further requests
      MSG_PANIC     recovery impossible, app must terminate

  Options
      OPT_VERSION    version number MAJOR.MINORplPATCH (ro)
      OPT_FAST       assumes only one part per file
      OPT_DUMBNESS   switch off the program's intelligence
      OPT_BRACKPOL   give numbers in [] higher precedence
      OPT_VERBOSE    generate informative messages
      OPT_DESPERATE  try to decode incomplete files
      OPT_IGNREPLY   ignore RE:plies (off by default)
      OPT_OVERWRITE  whether it's OK to overwrite existing files
      OPT_SAVEPATH   prefix to save-files on disk
      OPT_IGNMODE    ignore the original file mode
      OPT_DEBUG      print messages with FILE/LINE info
      OPT_ERRNO      get last error code for RET_IOERR (ro)
      OPT_PROGRESS   retrieve progress information
      OPT_USETEXT    handle text messages
      OPT_PREAMB     handle Mime preambles/epilogues
      OPT_TINYB64    detect short B64 outside of Mime
      OPT_ENCEXT     extension for single-part encoded files
      OPT_REMOVE     remove input files after decoding (dangerous)
      OPT_MOREMIME   strict MIME adherence
      OPT_DOTDOT     ".."-unescaping has not yet been done on input files
      OPT_RBUF       set default read I/O buffer size in bytes
      OPT_WBUF       set default write I/O buffer size in bytes
      OPT_AUTOCHECK  automatically check file list after every LoadFile

  Result/Error codes
      RET_OK      everything went fine
      RET_IOERR   I/O Error - examine errno
      RET_NOMEM   not enough memory
      RET_ILLVAL  illegal value for operation
      RET_NODATA  decoder didn't find any data
      RET_NOEND   encoded data wasn't ended properly
      RET_UNSUP   unsupported function (encoding)
      RET_EXISTS  file exists (decoding)
      RET_CONT    continue -- special from ScanPart
      RET_CANCEL  operation canceled

  File States
    This code is zero, i.e. "false":

      UUFILE_READ   Read in, but not further processed

    The following state codes are or'ed together:

      FILE_MISPART  Missing Part(s) detected
      FILE_NOBEGIN  No 'begin' found
      FILE_NOEND    No 'end' found
      FILE_NODATA   File does not contain valid uudata
      FILE_OK       All Parts found, ready to decode
      FILE_ERROR    Error while decoding
      FILE_DECODED  Successfully decoded
      FILE_TMPFILE  Temporary decoded file exists

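    Because the non-zero state codes are or'ed together, a state is tested
    with a bitwise "&" rather than compared with "==". The following is a
    minimal sketch that lists the names of all flags set on each item. The
    helper "state_names" is made up for this illustration and is not part
    of the module; it assumes that each flag occupies its own bit.

```perl
use strict;
use warnings;
use Convert::UUlib ':all';

# Hypothetical helper (not part of Convert::UUlib): return the names of
# all state flags that are set in $state.
sub state_names {
   my ($state) = @_;

   my @flags = (
      [FILE_MISPART, "FILE_MISPART"],
      [FILE_NOBEGIN, "FILE_NOBEGIN"],
      [FILE_NOEND,   "FILE_NOEND"],
      [FILE_NODATA,  "FILE_NODATA"],
      [FILE_OK,      "FILE_OK"],
      [FILE_ERROR,   "FILE_ERROR"],
      [FILE_DECODED, "FILE_DECODED"],
      [FILE_TMPFILE, "FILE_TMPFILE"],
   );

   return map { $_->[1] } grep { $state & $_->[0] } @flags;
}

LoadFile $_ for @ARGV;

for my $uu (GetFileList) {
   printf "%s: %s\n",
      $uu->filename // "(no filename)",
      join "|", state_names($uu->state);
}
```

    An item that was only read in has no flag set, so the list of names
    is empty for it.
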
  Encoding types
      UU_ENCODED    UUencoded data
      B64_ENCODED   Mime-Base64 data
      XX_ENCODED    XXencoded data
      BH_ENCODED    Binhex encoded
      PT_ENCODED    Plain-Text encoded (MIME)
      QP_ENCODED    Quoted-Printable (MIME)
      YENC_ENCODED  yEnc encoded (non-MIME)

EXPORTED FUNCTIONS
  Initializing and cleanup
    Initialize is automatically called when the module is loaded and
    allocates quite a small amount of memory for today's machines ;)
    CleanUp releases that again.

    On my machine, a fairly complete decode with DBI backend needs about
    10MB RSS to decode 20000 files.

    CleanUp
        Release memory, file items and clean up files. Should be called
        after a decoding run, if you want to start a new one.

  Setting and querying options
    $option = GetOption OPT_xxx
    SetOption OPT_xxx, opt-value

    See the "OPT_xxx" constants above to see which options exist.

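    A minimal sketch of reading and changing options, using only constants
    from the "Options" table. It assumes that a numeric option reads back
    as the value that was set; the choice of options is arbitrary.

```perl
use strict;
use warnings;
use Convert::UUlib ':all';

# OPT_VERSION is marked "(ro)" in the table, so it is only queried.
print "uulib version: ", GetOption(OPT_VERSION), "\n";

# Try to decode incomplete files, but never overwrite existing ones.
SetOption OPT_DESPERATE, 1;
SetOption OPT_OVERWRITE, 0;

print "desperate mode: ", GetOption(OPT_DESPERATE), "\n";
```
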
  Setting various callbacks
    SetMsgCallback [callback-function]
    SetBusyCallback [callback-function]
    SetFileCallback [callback-function]
    SetFNameFilter [callback-function]

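    As an illustration, a message callback can be used to collect
    problems instead of printing everything. This sketch assumes the
    argument order (message text, then severity level) used by the
    example decoder at the end of this document; the @problems array is a
    name chosen for this example.

```perl
use strict;
use warnings;
use Convert::UUlib ':all';

my @problems;

# The callback receives the message text and one of the MSG_xxx levels.
SetMsgCallback sub {
   my ($msg, $level) = @_;

   # Skip purely informational messages, remember everything else.
   return if $level == MSG_MESSAGE || $level == MSG_NOTE;

   push @problems, strmsglevel($level) . ": $msg";
};

LoadFile $_ for @ARGV;

print "$_\n" for @problems;
```

    The levels are compared for equality only, so the sketch does not
    depend on the numeric ordering of the MSG_xxx constants.
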
  Call the currently selected FNameFilter
    $file = FNameFilter $file

  Loading sourcefiles, optionally fuzzy merge and start decoding
    ($retval, $count) = LoadFile $fname, [$id, [$delflag, [$partno]]]
        Load the given file and scan it for encoded contents. Optionally tag
        it with the given id, and if $delflag is true, delete the file after
        it is no longer necessary. If you are certain of the part number,
        you can specify it as the last argument.

        A better (usually faster) way of doing this is using the
        "SetFNameFilter" functionality.

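        A minimal sketch of checking the result of "LoadFile". It assumes
        that $retval is one of the RET_xxx codes and that $count is the
        number of parts found, which is how the example decoder at the end
        of this document prints it.

```perl
use strict;
use warnings;
use Convert::UUlib ':all';

my $total = 0;

for my $file (@ARGV) {
   # Tag each file with its own name as id and keep it on disk.
   my ($retval, $count) = LoadFile ($file, $file, 0);

   if ($retval == RET_OK) {
      $total += $count;
   } else {
      # strerror turns the RET_xxx code into a readable message.
      warn "$file: ", strerror($retval), "\n";
   }
}

print "found $total encoded part(s)\n";
```
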
    $retval = Smerge $pass
        If you are desperate, try to call "Smerge" with increasing $pass
        values, beginning at 0, to try to merge parts that usually would not
        have been merged.

        Most probably this will result in garbled files, so never do this by
        default, except:

        If the "OPT_AUTOCHECK" option has been disabled (by default it is
        enabled) to speed up file loading, then you *have* to call "Smerge
        -1" after loading all files as an additional pre-pass (which is
        normally done by "LoadFile").

    $item = GetFileListItem $item_number
        Return the $item structure for the $item_number'th found file, or
        "undef" if no file with that number exists.

        The first file has number 0, and the series has no holes, so you can
        iterate over all files by starting with zero and incrementing until
        you hit "undef".

        This function has to walk the linear list of files on each access, so
        if you want to iterate over all items, it is usually faster to use
        "GetFileList".

    @items = GetFileList
        Similar to "GetFileListItem", but returns all files in one go, which
        is very much faster for a large number of items, and has no drawbacks
        when used for a small number of items.

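        Both ways of walking the file list can be sketched side by side.
        The counter variable is only there to show that the two approaches
        see the same number of items.

```perl
use strict;
use warnings;
use Convert::UUlib ':all';

LoadFile $_ for @ARGV;

# Index-based iteration: numbering starts at 0 and has no holes, so the
# loop ends at the first undef.
my $count = 0;
for (my $i = 0; defined (my $item = GetFileListItem($i)); $i++) {
   $count++;
}

# Fetching the whole list in one call yields the same items.
my @items = GetFileList;

print "GetFileListItem: $count item(s), GetFileList: ", scalar @items, "\n";
```

        Since "GetFileListItem" walks the list from the start on every
        call, the first loop does quadratic work, which is why
        "GetFileList" is the better default.
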
  Decoding files
    $retval = $item->rename ($newname)
        Change the on-disk filename where the decoded file will be saved.

    $retval = $item->decode_temp
        Decode the file into a temporary location, use "$item->binfile" to
        retrieve the temporary filename.

    $retval = $item->remove_temp
        Remove the temporarily decoded file again.

    $retval = $item->decode ([$target_path])
        Decode the file to its destination, or the given target path.

    $retval = $item->info (callback-function)

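    A minimal sketch of inspecting a decoded file before deciding what to
    do with it. Two assumptions are made here: that a true return value
    from "decode_temp" is a RET_xxx error code (as for "decode" in the
    example decoder at the end of this document), and that the "binfile"
    attribute listed in the next section holds the temporary filename.

```perl
use strict;
use warnings;
use Convert::UUlib ':all';

LoadFile $_ for @ARGV;

for my $uu (GetFileList) {
   next unless $uu->state & FILE_OK;

   # Decode into a temporary file first.
   if (my $err = $uu->decode_temp) {
      warn $uu->filename, ": ", strerror($err), "\n";
      next;
   }

   # Assumption: binfile is the name of the temporary file.
   my $tmp = $uu->binfile;
   printf "%s: %d bytes in %s\n", $uu->filename, (-s $tmp) // 0, $tmp;

   # Throw the temporary copy away again.
   $uu->remove_temp;
}
```
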
  Querying (and setting) item attributes
    $state    = $item->state
    $mode     = $item->mode ([newmode])
    $uudet    = $item->uudet
    $size     = $item->size
    $filename = $item->filename ([newfilename])
    $subfname = $item->subfname
    $mimeid   = $item->mimeid
    $mimetype = $item->mimetype
    $binfile  = $item->binfile

  Information about source parts
    $parts = $item->parts
        Return information about all parts (source files) used to decode the
        file as a list of hashrefs with the following structure:

         {
           partno   => <integer describing the part number, starting with 1>,
           # the following members only exist when they contain useful information
           sfname   => <local pathname of the file where this part is from>,
           filename => <the on-disk filename of the decoded file>,
           subfname => <used to cluster postings, possibly the posting filename>,
           subject  => <the subject of the posting/mail>,
           origin   => <the possible source (From) address>,
           mimetype => <the possible mimetype of the decoded file>,
           mimeid   => <the id part of the Content-Type>,
         }

        Usually you are mostly interested in the "sfname" and possibly the
        "partno" and "filename" members.

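        Because only "partno" is guaranteed to exist, code that consumes
        these hashrefs should check every other member before using it.
        The following sketch runs on made-up sample data in the documented
        structure; "describe_part" is a hypothetical helper, not part of
        the module.

```perl
use strict;
use warnings;

# Hypothetical helper: format one part hashref of the shape returned by
# $item->parts. Optional members are only used when they are present.
sub describe_part {
   my ($part) = @_;

   my $line = "part $part->{partno}";
   $line .= " from $part->{sfname}"        if defined $part->{sfname};
   $line .= " (subject: $part->{subject})" if defined $part->{subject};

   return $line;
}

# Made-up sample data, not real library output.
my @parts = (
   { partno => 1, sfname => "uusrc/msg.001", subject => "archive.rar (1/2)" },
   { partno => 2 },
);

for my $part (@parts) {
   my $line = describe_part($part);
   print "$line\n";
}
```

        With a real item, the loop would run over "$item->parts" instead
        of the sample array.
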
  Functions below are not documented and not very well tested - feedback welcome
      QuickDecode
      EncodeMulti
      EncodePartial
      EncodeToStream
      EncodeToFile
      E_PrepSingle
      E_PrepPartial

EXTENSION FUNCTIONS
    Functions found in this module but not documented in the uulib
    documentation:

    $msg = straction ACT_xxx
        Return a human readable string representing the given action code.

    $msg = strerror RET_xxx
        Return a human readable string representing the given error code.

    $str = strencoding xxx_ENCODED
        Return the name of the encoding type as a string.

    $str = strmsglevel MSG_xxx
        Returns the message level as a string.

    SetFileNameCallback $cb
        Sets (or queries) the FileNameCallback, which is called whenever the
        decoding library can't find a filename and wants to extract a
        filename from the subject line of a posting. The callback will be
        called with two arguments, the subject line and the current
        candidate for the filename. The latter argument can be "undef",
        which means that no filename could be found (and likely none
        exists, so it is safe to also return "undef" in this case). If the
        callback doesn't return anything (not even "undef"!), then nothing
        happens, so this is a no-op callback:

           sub cb {
              return ();
           }

        If it returns "undef", then this indicates that no filename could be
        found. In all other cases, the return value is taken to be the
        filename.

        This is a slightly more useful callback:

           sub cb {
              return unless $_[1]; # skip "Re:"-plies et al.
              my ($subject, $filename) = @_;
              # if we find some *.rar, take it
              return $1 if $subject =~ /(\w+\.rar)/;
              # otherwise just pass what we have
              return ();
           }

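        The three possible outcomes (empty list, "undef", filename) are
        easy to confuse, so it can help to try a callback on a few subject
        lines before registering it. The sketch below does this in plain
        Perl without loading the module; "filename_cb" is the callback
        from above under a different name, while "outcome" and the sample
        subjects are made up for this illustration.

```perl
use strict;
use warnings;

# The "slightly more useful" callback from above, as a named sub.
sub filename_cb {
   return unless $_[1]; # no candidate: return nothing
   my ($subject, $filename) = @_;
   return $1 if $subject =~ /(\w+\.rar)/;
   return ();
}

# Hypothetical helper: call the callback in list context and report
# which of the three outcomes it produced.
sub outcome {
   my @ret = filename_cb(@_);

   return "keep candidate" unless @ret;
   return "no filename"    unless defined $ret[0];
   return "use $ret[0]";
}

for my $case (
   ["holiday pics - backup.rar (1/5)", "backup.r"],
   ["Re: nice post",                   undef],
   ["some text (2/3)",                 "text.txt"],
) {
   my $result = outcome(@$case);
   print "$result\n";
}
```

        Once the callback behaves as intended, it can be registered with
        "SetFileNameCallback \&filename_cb".
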
LARGE EXAMPLE DECODER
    The general workflow for decoding is like this:

    1. Configure options with "SetOption" or "SetXXXCallback".
    2. Load all source files with "LoadFile".
    3. Optionally "Smerge".
    4. Iterate over all "GetFileList" items (i.e. result files).
    5. "CleanUp" to delete files and free items.

    What follows is the file "example-decoder" from the distribution that
    illustrates the above workflow in a non-trivial example.

    #!/usr/bin/perl

    # decode all the files in the directory uusrc/ and copy
    # the resulting files to uudst/

    use Convert::UUlib ':all';

    sub namefilter {
       my ($path) = @_;

       $path=~s/^.*[\/\\]//;

       $path
    }

    sub busycb {
       my ($action, $curfile, $partno, $numparts, $percent, $fsize) = @_;
       $_[0]=straction($action);
       print "busy_callback(", (join ",",@_), ")\n";
       0
    }

    SetOption OPT_RBUF, 128*1024;
    SetOption OPT_WBUF, 1024*1024;
    SetOption OPT_IGNMODE, 1;
    SetOption OPT_VERBOSE, 1;
    SetOption OPT_AUTOCHECK, 0;

    # show the three ways you can set callback functions. I normally
    # prefer the one with the sub inplace.
    SetFNameFilter \&namefilter;

    SetBusyCallback "busycb", 333;

    SetMsgCallback sub {
       my ($msg, $level) = @_;
       print uc strmsglevel $_[1], ": $msg\n";
    };

    # the following non-trivial FileNameCallback takes care
    # of some subject lines not detected properly by uulib:
    SetFileNameCallback sub {
       return unless $_[1]; # skip "Re:"-plies et al.
       local $_ = $_[0];

       # the following rules are rather effective on some newsgroups,
       # like alt.binaries.games.anime, where non-mime, uuencoded data
       # is very common

       # if we find some *.rar, take it as the filename
       return $1 if /(\S{3,}\.(?:[rstuvwxyz]\d\d|rar))\s/i;

       # one common subject format
       return $1 if /- "(.{2,}?\..+?)" (?:yenc )?\(\d+\/\d+\)/i;

       # - filename.par (04/55)
       return $1 if /- "?(\S{3,}\.\S+?)"? (?:yenc )?\(\d+\/\d+\)/i;

       # - (xxx) No. 1 sayuri81.jpg 756565 bytes
       # - (20 files) No.17 Roseanne.jpg [2/2]
       return $1 if /No\.[ 0-9]+ (\S+\....) (?:\d+ bytes )?\[/;

       # try to detect some common forms of filenames
       return $1 if /([a-z0-9_\-+.]{3,}\.[a-z]{3,4}(?:.\d+))/i;

       # otherwise just pass what we have
       ()
    };

    # now read all files in the directory uusrc/*
    for (<uusrc/*>) {
       my ($retval, $count) = LoadFile ($_, $_, 1);
       print "file($_), status(", strerror $retval, ") parts($count)\n";
    }

    Smerge -1;

    SetOption OPT_SAVEPATH, "uudst/";

    # now wade through all files and their source parts
    for my $uu (GetFileList) {
       print "file ", $uu->filename, "\n";
       print " state ", $uu->state, "\n";
       print " mode ", $uu->mode, "\n";
       print " uudet ", strencoding $uu->uudet, "\n";
       print " size ", $uu->size, "\n";
       print " subfname ", $uu->subfname, "\n";
       print " mimeid ", $uu->mimeid, "\n";
       print " mimetype ", $uu->mimetype, "\n";

       # print additional info about all parts
       print " parts";
       for ($uu->parts) {
          for my $k (sort keys %$_) {
             print " $k=$_->{$k}";
          }
          print "\n";
       }

       $uu->remove_temp;

       if (my $err = $uu->decode) {
          print " ERROR ", strerror $err, "\n";
       } else {
          print " successfully saved as uudst/", $uu->filename, "\n";
       }
    }

    print "cleanup...\n";

    CleanUp;

PERLMULTICORE SUPPORT
    This module supports the perlmulticore standard (see
    <http://perlmulticore.schmorp.de/> for more info) for the following
    functions - generally these are functions accessing the disk and/or
    using considerable CPU time:

       LoadFile
       $item->decode
       $item->decode_temp
       $item->remove_temp
       $item->info

    The perl interpreter will be reacquired/released on every callback
    invocation, so for performance reasons, callbacks should be avoided if
    that is costly.

    Future versions might enable multicore support for more functions.

BUGS AND LIMITATIONS
    The original uulib library this module uses was written at a time when
    main memory was measured in megabytes and buffer overflows as a
    security thing didn't exist. While a lot of security fixes have been
    applied over the years (including some defense in depth mechanisms that
    can shield against a lot of as-of-yet undetected bugs), using this
    library for security purposes requires care.

    Likewise, file sizes when the uulib library was written were tiny
    compared to today, so do not expect this library to handle files larger
    than 2GB.

    Lastly, this module uses a very "C-like" interface, which means it
    doesn't protect you from invalid pointers as you might expect from
    "more perlish" modules - for example, accessing a file item object
    after calling "CleanUp" will likely result in crashes, memory
    corruption, or worse.

AUTHOR
    Marc Lehmann <schmorp@schmorp.de>, the original uulib library was
    written by Frank Pilhofer <fp@informatik.uni-frankfurt.de>, and later
    heavily bugfixed by Marc Lehmann.

SEE ALSO
    perl(1), uudeview homepage at <http://www.fpx.de/fp/Software/UUDeview/>.
