/cvs/Convert-UUlib/README
Revision: 1.7
Committed: Thu Dec 17 01:24:59 2020 UTC (3 years, 4 months ago) by root
Branch: MAIN
CVS Tags: rel-1_8, HEAD
Changes since 1.6: +18 -3 lines
Log Message:
1.8

File Contents

# Content
1 NAME
2 Convert::UUlib - decode uu/xx/b64/mime/yenc/etc-encoded data from a
3 massive number of files
4
5 SYNOPSIS
6 use Convert::UUlib ':all';
7
8 # read all the files named on the commandline and decode them
9 # into the CURRENT directory. See below for a longer example.
10 LoadFile $_ for @ARGV;
11
12 for my $uu (GetFileList) {
13 if ($uu->state & FILE_OK) {
14 $uu->decode;
15 print $uu->filename, "\n";
16 }
17 }
18
19 DESCRIPTION
20 This module started as an interface to the uulib/uudeview library by
21 Frank Pilhofer that can be used to decode all kinds of usenet (and
22 other) binary messages.
23
24 After upstream abandoned the project, the library was continuously
25 bugfixed and improved in this module, with major focuses on security
26 fixes, correctness and speed (that does not mean that this library is
27 considered safe with untrusted data, but it surely is safer than the
28 original uudeview).
29
30 Read the file doc/library.pdf from the distribution for in-depth
31 information about the C-library used in this interface, and the rest of
32 this document and especially the non-trivial decoder program at the end.
33
34 EXPORTED CONSTANTS
35 Action code constants
36 ACT_IDLE we don't do anything
37 ACT_SCANNING scanning an input file
38 ACT_DECODING decoding into a temp file
39 ACT_COPYING copying temp to target
40 ACT_ENCODING encoding a file
41
42 Message severity levels
43 MSG_MESSAGE just a message, nothing important
44 MSG_NOTE something that should be noticed
45 MSG_WARNING important msg, processing continues
46 MSG_ERROR processing has been terminated
47 MSG_FATAL decoder cannot process further requests
48 MSG_PANIC recovery impossible, app must terminate
49
50 Options
51 OPT_VERSION version number MAJOR.MINORplPATCH (ro)
52 OPT_FAST assumes only one part per file
53 OPT_DUMBNESS switch off the program's intelligence
54 OPT_BRACKPOL give numbers in [] higher precedence
55 OPT_VERBOSE generate informative messages
56 OPT_DESPERATE try to decode incomplete files
57 OPT_IGNREPLY ignore RE:plies (off by default)
58 OPT_OVERWRITE whether it's OK to overwrite ex. files
59 OPT_SAVEPATH prefix to save-files on disk
60 OPT_IGNMODE ignore the original file mode
61 OPT_DEBUG print messages with FILE/LINE info
62 OPT_ERRNO get last error code for RET_IOERR (ro)
63 OPT_PROGRESS retrieve progress information
64 OPT_USETEXT handle text messages
65 OPT_PREAMB handle Mime preambles/epilogues
66 OPT_TINYB64 detect short B64 outside of Mime
67 OPT_ENCEXT extension for single-part encoded files
68 OPT_REMOVE remove input files after decoding (dangerous)
69 OPT_MOREMIME strict MIME adherence
70 OPT_DOTDOT ".."-unescaping has not yet been done on input files
71 OPT_RBUF set default read I/O buffer size in bytes
72 OPT_WBUF set default write I/O buffer size in bytes
73 OPT_AUTOCHECK automatically check file list after every loadfile
74
75 Result/Error codes
76 RET_OK everything went fine
77 RET_IOERR I/O Error - examine errno
78 RET_NOMEM not enough memory
79 RET_ILLVAL illegal value for operation
80 RET_NODATA decoder didn't find any data
81 RET_NOEND encoded data wasn't ended properly
82 RET_UNSUP unsupported function (encoding)
83 RET_EXISTS file exists (decoding)
84 RET_CONT continue -- special from ScanPart
85 RET_CANCEL operation canceled
86
87 File States
88 This code is zero, i.e. "false":
89
90 UUFILE_READ Read in, but not further processed
91
92 The following state codes are or'ed together:
93
94 FILE_MISPART Missing Part(s) detected
95 FILE_NOBEGIN No 'begin' found
96 FILE_NOEND No 'end' found
97 FILE_NODATA File does not contain valid uudata
98 FILE_OK All Parts found, ready to decode
99 FILE_ERROR Error while decoding
100 FILE_DECODED Successfully decoded
101 FILE_TMPFILE Temporary decoded file exists
102
103 Encoding types
104 UU_ENCODED UUencoded data
105 B64_ENCODED Mime-Base64 data
106 XX_ENCODED XXencoded data
107 BH_ENCODED Binhex encoded
108 PT_ENCODED Plain-Text encoded (MIME)
109 QP_ENCODED Quoted-Printable (MIME)
110 YENC_ENCODED yEnc encoded (non-MIME)
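
Because the state codes are or'ed together, a single "$item->state" value can carry several flags at once. The following is a minimal sketch of turning such a value into readable labels. The helper name "state_names" and the label strings are made up for this illustration and are not part of the module.

```perl
use strict;
use warnings;
use Convert::UUlib ':all';

# Pair each documented state bit with a label (labels are illustrative only).
my @STATE_BITS = (
    [ FILE_MISPART(), "MISPART" ],
    [ FILE_NOBEGIN(), "NOBEGIN" ],
    [ FILE_NOEND(),   "NOEND"   ],
    [ FILE_NODATA(),  "NODATA"  ],
    [ FILE_OK(),      "OK"      ],
    [ FILE_ERROR(),   "ERROR"   ],
    [ FILE_DECODED(), "DECODED" ],
    [ FILE_TMPFILE(), "TMPFILE" ],
);

# Hypothetical helper: list the labels of all bits set in $state.
sub state_names {
    my ($state) = @_;
    return ("READ") unless $state;    # zero means "read, not further processed"
    return map { $_->[1] } grep { $state & $_->[0] } @STATE_BITS;
}

print join("|", state_names(FILE_OK() | FILE_DECODED())), "\n";
```

Testing a single bit, as the SYNOPSIS does with "$uu->state & FILE_OK", stays the simplest check when only one condition matters.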
111
112 EXPORTED FUNCTIONS
113 Initializing and cleanup
114 Initialize is automatically called when the module is loaded and
115 allocates quite a small amount of memory for today's machines ;) CleanUp
116 releases that again.
117
118 On my machine, a fairly complete decode with DBI backend needs about
119 10MB RSS to decode 20000 files.
120
121 CleanUp
122 Release memory, file items and clean up files. Should be called
123 after a decoding run, if you want to start a new one.
124
125 Setting and querying options
126 $option = GetOption OPT_xxx
127 SetOption OPT_xxx, opt-value
128
129 See the "OPT_xxx" constants above to see which options exist.
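
As a rough illustration of the two calls above, the sketch below sets one integer option and one string option and reads them back. The directory name "uudst/" is only an example value borrowed from the decoder at the end of this document.

```perl
use strict;
use warnings;
use Convert::UUlib ':all';

# Integer option: switch informative messages off, then read the value back.
SetOption(OPT_VERBOSE(), 0);
print "verbose: ", GetOption(OPT_VERBOSE()), "\n";

# String option: prefix under which decoded files are saved.
SetOption(OPT_SAVEPATH(), "uudst/");
print "save path: ", GetOption(OPT_SAVEPATH()), "\n";

# Read-only option: the version of the underlying uulib.
print "uulib version: ", GetOption(OPT_VERSION()), "\n";
```

Options marked "(ro)" in the list above are meant to be read with "GetOption" only.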
130
131 Setting various callbacks
132 SetMsgCallback [callback-function]
133 SetBusyCallback [callback-function]
134 SetFileCallback [callback-function]
135 SetFNameFilter [callback-function]
136
137 Call the currently selected FNameFilter
138 $file = FNameFilter $file
139
140 Loading sourcefiles, optionally fuzzy merge and start decoding
141 ($retval, $count) = LoadFile $fname, [$id, [$delflag, [$partno]]]
142 Load the given file and scan it for encoded contents. Optionally tag
143 it with the given id, and if $delflag is true, delete the file after
144 it is no longer necessary. If you are certain of the part number,
145 you can specify it as the last argument.
146
147 A better (usually faster) way of doing this is using the
148 "SetFNameFilter" functionality.
149
150 $retval = Smerge $pass
151 If you are desperate, try to call "Smerge" with increasing $pass
152 values, beginning at 0, to try to merge parts that usually would not
153 have been merged.
154
155 Most probably this will result in garbled files, so never do this by
156 default, except:
157
158 If the "OPT_AUTOCHECK" option has been disabled (by default it is
159 enabled) to speed up file loading, then you *have* to call "Smerge
160 -1" after loading all files as an additional pre-pass (which is
161 normally done by "LoadFile").
162
163 $item = GetFileListItem $item_number
164 Return the $item structure for the $item_number'th found file, or
165 "undef" if no file with that number exists.
166
167 The first file has number 0, and the series has no holes, so you can
168 iterate over all files by starting with zero and incrementing until
169 you hit "undef".
170
171 This function has to walk the linear list of files on each access, so
172 if you want to iterate over all items, it is usually faster to use
173 "GetFileList".
174
175 @items = GetFileList
176 Similar to "GetFileListItem", but returns all files in one go, which
177 is very much faster for a large number of items, and has no drawbacks
178 when used for a small number of items.
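
The two ways of walking the result list can be compared side by side. This is a minimal sketch that assumes the input files are named on the command line. With no arguments both counts are simply zero.

```perl
use strict;
use warnings;
use Convert::UUlib ':all';

LoadFile($_) for @ARGV;

# Index-based walk: stops at the first "undef"; each call rescans the list.
my $by_index = 0;
for (my $i = 0; defined(my $item = GetFileListItem($i)); $i++) {
    $by_index++;
}

# One call returning every item; usually the better choice.
my @items = GetFileList();

print "by index: $by_index, in one go: ", scalar(@items), "\n";
```

Both loops should see the same number of items, since the series of item numbers has no holes.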
179
180 Decoding files
181 $retval = $item->rename ($newname)
182 Change the ondisk filename where the decoded file will be saved.
183
184 $retval = $item->decode_temp
185 Decode the file into a temporary location, use "$item->infile" to
186 retrieve the temporary filename.
187
188 $retval = $item->remove_temp
189 Remove the temporarily decoded file again.
190
191 $retval = $item->decode ([$target_path])
192 Decode the file to its destination, or the given target path.
193
194 $retval = $item->info (callback-function)
195
196 Querying (and setting) item attributes
197 $state = $item->state
198 $mode = $item->mode ([newmode])
199 $uudet = $item->uudet
200 $size = $item->size
201 $filename = $item->filename ([newfilename])
202 $subfname = $item->subfname
203 $mimeid = $item->mimeid
204 $mimetype = $item->mimetype
205 $binfile = $item->binfile
206
207 Information about source parts
208 $parts = $item->parts
209 Return information about all parts (source files) used to decode the
210 file as a list of hashrefs with the following structure:
211
212 {
213 partno => <integer describing the part number, starting with 1>,
214 # the following members only exist when they contain useful information
215 sfname => <local pathname of the file where this part is from>,
216 filename => <the ondisk filename of the decoded file>,
217 subfname => <used to cluster postings, possibly the posting filename>,
218 subject => <the subject of the posting/mail>,
219 origin => <the possible source (From) address>,
220 mimetype => <the possible mimetype of the decoded file>,
221 mimeid => <the id part of the Content-Type>,
222 }
223
224 Usually you are interested mostly in the "sfname" and possibly the
225 "partno" and "filename" members.
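
Since every member except "partno" is optional, code that reads these hashrefs should allow for missing keys. The sketch below shows one way to do that. The helper "describe_parts" and the sample data are invented for illustration, and the sample stands in for the result of "$item->parts" so that the snippet runs without any input files.

```perl
use strict;
use warnings;

# Hypothetical helper: one summary line per part hashref, ordered by "partno".
# Members other than "partno" may be absent, so fall back to a placeholder.
sub describe_parts {
    my (@parts) = @_;
    return map {
        sprintf "part %d from %s", $_->{partno},
            defined $_->{sfname} ? $_->{sfname} : "(unknown source)"
    } sort { $a->{partno} <=> $b->{partno} } @parts;
}

# Hand-made sample data in the documented shape, standing in for $item->parts.
my @sample = (
    { partno => 2, sfname => "uusrc/msg2.txt" },
    { partno => 1, sfname => "uusrc/msg1.txt", filename => "photo.jpg" },
);

print "$_\n" for describe_parts(@sample);
```

With real data, the call would presumably be "describe_parts ($item->parts)", in the same way the example decoder loops over "$uu->parts".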
226
227 Functions below are not documented and not very well tested - feedback welcome
228 QuickDecode
229 EncodeMulti
230 EncodePartial
231 EncodeToStream
232 EncodeToFile
233 E_PrepSingle
234 E_PrepPartial
235
236 EXTENSION FUNCTIONS
237 Functions found in this module but not documented in the uulib
238 documentation:
239
240 $msg = straction ACT_xxx
241 Return a human readable string representing the given action code.
242
243 $msg = strerror RET_xxx
244 Return a human readable string representing the given error code.
245
246 $str = strencoding xxx_ENCODED
247 Return the name of the encoding type as a string.
248
249 $str = strmsglevel MSG_xxx
250 Returns the message level as a string.
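
A short sketch of the four string helpers above, each called with one of the exported constants. The exact strings returned come from the library and are not reproduced here.

```perl
use strict;
use warnings;
use Convert::UUlib ':all';

# Each helper maps a numeric code to a human readable string.
print "action:   ", straction(ACT_DECODING()),   "\n";
print "error:    ", strerror(RET_NODATA()),      "\n";
print "encoding: ", strencoding(YENC_ENCODED()), "\n";
print "level:    ", strmsglevel(MSG_WARNING()),  "\n";
```

These are mostly useful inside callbacks and log output, as the example decoder at the end shows.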
251
252 SetFileNameCallback $cb
253 Sets (or queries) the FileNameCallback, which is called whenever the
254 decoding library can't find a filename and wants to extract a
255 filename from the subject line of a posting. The callback will be
256 called with two arguments, the subject line and the current
257 candidate for the filename. The latter argument can be "undef",
258 which means that no filename could be found (and likely none
259 exists, so it is safe to also return "undef" in this case). If it
260 doesn't return anything (not even "undef"!), then nothing happens,
261 so this is a no-op callback:
262
263 sub cb {
264 return ();
265 }
266
267 If it returns "undef", then this indicates that no filename could be
268 found. In all other cases, the return value is taken to be the
269 filename.
270
271 This is a slightly more useful callback:
272
273 sub cb {
274 return unless $_[1]; # skip "Re:"-plies et al.
275 my ($subject, $filename) = @_;
276 # if we find some *.rar, take it
277 return $1 if $subject =~ /(\w+\.rar)/;
278 # otherwise just pass what we have
279 return ();
280 }
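
The three possible outcomes of such a callback (a filename, "undef", or nothing at all) are easy to mix up, so it can help to exercise the callback on its own before installing it. The sketch below does that in plain Perl, without loading the module. The name "rar_name_cb" and the subject lines are made up for the test.

```perl
use strict;
use warnings;

# Same logic as the callback above, under a hypothetical name.
sub rar_name_cb {
    my ($subject, $candidate) = @_;
    return unless $candidate;                 # empty list: leave things alone
    return $1 if $subject =~ /(\w+\.rar)/;    # a string: use it as filename
    return ();                                # empty list: keep the candidate
}

# Made-up subject lines; list context shows "nothing" versus a value.
my @hit  = rar_name_cb("holiday pics - backup.rar (1/3)", "backup.r00");
my @miss = rar_name_cb("holiday pics - readme.txt (1/1)", "readme.txt");
my @none = rar_name_cb("Re: holiday pics", undef);

printf "hit=%s miss=%d none=%d\n", $hit[0], scalar(@miss), scalar(@none);
```

Once it behaves as intended, it could be installed with "SetFileNameCallback \&rar_name_cb", assuming a code reference is accepted here as it is for the other callback setters in the example decoder.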
281
282 LARGE EXAMPLE DECODER
283 The general workflow for decoding is like this:
284
285 1. Configure options with "SetOption" or "SetXXXCallback".
286 2. Load all source files with "LoadFile".
287 3. Optionally "Smerge".
288 4. Iterate over all "GetFileList" items (i.e. result files).
289 5. "CleanUp" to delete files and free items.
290
291 What follows is the file "example-decoder" from the distribution that
292 illustrates the above workflow in a non-trivial example.
293
294 #!/usr/bin/perl
295
296 # decode all the files in the directory uusrc/ and copy
297 # the resulting files to uudst/
298
299 use Convert::UUlib ':all';
300
301 sub namefilter {
302 my ($path) = @_;
303
304 $path=~s/^.*[\/\\]//;
305
306 $path
307 }
308
309 sub busycb {
310 my ($action, $curfile, $partno, $numparts, $percent, $fsize) = @_;
311 $_[0]=straction($action);
312 print "busy_callback(", (join ",",@_), ")\n";
313 0
314 }
315
316 SetOption OPT_RBUF, 128*1024;
317 SetOption OPT_WBUF, 1024*1024;
318 SetOption OPT_IGNMODE, 1;
319 SetOption OPT_IGNMODE, 1;
320 SetOption OPT_VERBOSE, 1;
321 SetOption OPT_AUTOCHECK, 0;
322
323 # show the three ways you can set callback functions. I normally
324 # prefer the one with the sub inplace.
325 SetFNameFilter \&namefilter;
326
327 SetBusyCallback "busycb", 333;
328
329 SetMsgCallback sub {
330 my ($msg, $level) = @_;
331 print uc strmsglevel $_[1], ": $msg\n";
332 };
333
334 # the following non-trivial FileNameCallback takes care
335 # of some subject lines not detected properly by uulib:
336 SetFileNameCallback sub {
337 return unless $_[1]; # skip "Re:"-plies et al.
338 local $_ = $_[0];
339
340 # the following rules are rather effective on some newsgroups,
341 # like alt.binaries.games.anime, where non-mime, uuencoded data
342 # is very common
343
344 # if we find some *.rar, take it as the filename
345 return $1 if /(\S{3,}\.(?:[rstuvwxyz]\d\d|rar))\s/i;
346
347 # one common subject format
348 return $1 if /- "(.{2,}?\..+?)" (?:yenc )?\(\d+\/\d+\)/i;
349
350 # - filename.par (04/55)
351 return $1 if /- "?(\S{3,}\.\S+?)"? (?:yenc )?\(\d+\/\d+\)/i;
352
353 # - (xxx) No. 1 sayuri81.jpg 756565 bytes
354 # - (20 files) No.17 Roseanne.jpg [2/2]
355 return $1 if /No\.[ 0-9]+ (\S+\....) (?:\d+ bytes )?\[/;
356
357 # try to detect some common forms of filenames
358 return $1 if /([a-z0-9_\-+.]{3,}\.[a-z]{3,4}(?:.\d+))/i;
359
360 # otherwise just pass what we have
361 ()
362 };
363
364 # now read all files in the directory uusrc/*
365 for (<uusrc/*>) {
366 my ($retval, $count) = LoadFile ($_, $_, 1);
367 print "file($_), status(", strerror $retval, ") parts($count)\n";
368 }
369
370 Smerge -1;
371
372 SetOption OPT_SAVEPATH, "uudst/";
373
374 # now wade through all files and their source parts
375 for my $uu (GetFileList) {
376 print "file ", $uu->filename, "\n";
377 print " state ", $uu->state, "\n";
378 print " mode ", $uu->mode, "\n";
379 print " uudet ", strencoding $uu->uudet, "\n";
380 print " size ", $uu->size, "\n";
381 print " subfname ", $uu->subfname, "\n";
382 print " mimeid ", $uu->mimeid, "\n";
383 print " mimetype ", $uu->mimetype, "\n";
384
385 # print additional info about all parts
386 print " parts";
387 for ($uu->parts) {
388 for my $k (sort keys %$_) {
389 print " $k=$_->{$k}";
390 }
391 print "\n";
392 }
393
394 $uu->remove_temp;
395
396 if (my $err = $uu->decode) {
397 print " ERROR ", strerror $err, "\n";
398 } else {
399 print " successfully saved as uudst/", $uu->filename, "\n";
400 }
401 }
402
403 print "cleanup...\n";
404
405 CleanUp;
406
407 PERLMULTICORE SUPPORT
408 This module supports the perlmulticore standard (see
409 <http://perlmulticore.schmorp.de/> for more info) for the following
410 functions - generally these are functions accessing the disk and/or
411 using considerable CPU time:
412
413 LoadFile
414 $item->decode
415 $item->decode_temp
416 $item->remove_temp
417 $item->info
418
419 The perl interpreter will be reacquired/released on every callback
420 invocation, so for performance reasons, callbacks should be avoided if
421 that is costly.
422
423 Future versions might enable multicore support for more functions.
424
425 BUGS AND LIMITATIONS
426 The original uulib library this module uses was written at a time when
427 main memory was measured in megabytes and buffer overflows as a security
428 concern didn't exist. While a lot of security fixes have been applied over
429 the years (including some defense in depth mechanisms that can shield
430 against a lot of as-of-yet undetected bugs), using this library for
431 security purposes requires care.
432
433 Likewise, file sizes when the uulib library was written were tiny
434 compared to today, so do not expect this library to handle files larger
435 than 2GB.
436
437 Lastly, this module uses a very "C-like" interface, which means it
438 doesn't protect you from invalid pointers as you might expect from "more
439 perlish" modules - for example, accessing a file item object after
440 calling "CleanUp" will likely result in crashes, memory corruption, or
441 worse.
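
One way to stay clear of stale item objects is to copy everything that is still needed into plain Perl data before calling "CleanUp". This is a minimal sketch of that pattern. The "@summary" structure is an assumption of this example, not something the module provides, and the input files are taken from the command line.

```perl
use strict;
use warnings;
use Convert::UUlib ':all';

LoadFile($_) for @ARGV;

# Copy what is needed into plain Perl data while the items are still valid.
my @summary = map { +{ filename => $_->filename, state => $_->state } } GetFileList();

CleanUp();

# From here on, use only the copies, never the item objects themselves.
for my $entry (@summary) {
    printf "%s (state %d)\n",
        defined $entry->{filename} ? $entry->{filename} : "(no name)",
        $entry->{state};
}
```

Keeping the item objects themselves in a lexical scope that ends before "CleanUp" is another way to make accidental reuse harder.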
442
443 AUTHOR
444 Marc Lehmann <schmorp@schmorp.de>, the original uulib library was
445 written by Frank Pilhofer <fp@informatik.uni-frankfurt.de>, and later
446 heavily bugfixed by Marc Lehmann.
447
448 SEE ALSO
449 perl(1), uudeview homepage at <http://www.fpx.de/fp/Software/UUDeview/>.
450