/cvs/Convert-UUlib/README
Revision: 1.7
Committed: Thu Dec 17 01:24:59 2020 UTC (3 years, 5 months ago) by root
Branch: MAIN
CVS Tags: rel-1_8, HEAD
Changes since 1.6: +18 -3 lines
Log Message:
1.8

NAME
    Convert::UUlib - decode uu/xx/b64/mime/yenc/etc-encoded data from a
    massive number of files

SYNOPSIS
     use Convert::UUlib ':all';

     # read all the files named on the commandline and decode them
     # into the CURRENT directory. See below for a longer example.
     LoadFile $_ for @ARGV;

     for my $uu (GetFileList) {
        if ($uu->state & FILE_OK) {
           $uu->decode;
           print $uu->filename, "\n";
        }
     }

DESCRIPTION
    This module started as an interface to the uulib/uudeview library by
    Frank Pilhofer that can be used to decode all kinds of usenet (and
    other) binary messages.

    After upstream abandoned the project, the library was continuously
    bugfixed and improved in this module, with major focuses on security
    fixes, correctness and speed (that does not mean that this library is
    considered safe with untrusted data, but it surely is safer than the
    original uudeview).

    Read the file doc/library.pdf from the distribution for in-depth
    information about the C-library used in this interface, and the rest of
    this document and especially the non-trivial decoder program at the end.

EXPORTED CONSTANTS
  Action code constants
      ACT_IDLE      we don't do anything
      ACT_SCANNING  scanning an input file
      ACT_DECODING  decoding into a temp file
      ACT_COPYING   copying temp to target
      ACT_ENCODING  encoding a file

  Message severity levels
      MSG_MESSAGE   just a message, nothing important
      MSG_NOTE      something that should be noticed
      MSG_WARNING   important msg, processing continues
      MSG_ERROR     processing has been terminated
      MSG_FATAL     decoder cannot process further requests
      MSG_PANIC     recovery impossible, app must terminate

  Options
      OPT_VERSION   version number MAJOR.MINORplPATCH (ro)
      OPT_FAST      assumes only one part per file
      OPT_DUMBNESS  switch off the program's intelligence
      OPT_BRACKPOL  give numbers in [] higher precedence
      OPT_VERBOSE   generate informative messages
      OPT_DESPERATE try to decode incomplete files
      OPT_IGNREPLY  ignore RE:plies (off by default)
      OPT_OVERWRITE whether it's OK to overwrite existing files
      OPT_SAVEPATH  prefix to save-files on disk
      OPT_IGNMODE   ignore the original file mode
      OPT_DEBUG     print messages with FILE/LINE info
      OPT_ERRNO     get last error code for RET_IOERR (ro)
      OPT_PROGRESS  retrieve progress information
      OPT_USETEXT   handle text messages
      OPT_PREAMB    handle Mime preambles/epilogues
      OPT_TINYB64   detect short B64 outside of Mime
      OPT_ENCEXT    extension for single-part encoded files
      OPT_REMOVE    remove input files after decoding (dangerous)
      OPT_MOREMIME  strict MIME adherence
      OPT_DOTDOT    ".."-unescaping has not yet been done on input files
      OPT_RBUF      set default read I/O buffer size in bytes
      OPT_WBUF      set default write I/O buffer size in bytes
      OPT_AUTOCHECK automatically check file list after every loadfile

  Result/Error codes
      RET_OK        everything went fine
      RET_IOERR     I/O Error - examine errno
      RET_NOMEM     not enough memory
      RET_ILLVAL    illegal value for operation
      RET_NODATA    decoder didn't find any data
      RET_NOEND     encoded data wasn't ended properly
      RET_UNSUP     unsupported function (encoding)
      RET_EXISTS    file exists (decoding)
      RET_CONT      continue -- special from ScanPart
      RET_CANCEL    operation canceled

  File States
    This code is zero, i.e. "false":

      UUFILE_READ   Read in, but not further processed

    The following state codes are or'ed together:

      FILE_MISPART  Missing Part(s) detected
      FILE_NOBEGIN  No 'begin' found
      FILE_NOEND    No 'end' found
      FILE_NODATA   File does not contain valid uudata
      FILE_OK       All Parts found, ready to decode
      FILE_ERROR    Error while decoding
      FILE_DECODED  Successfully decoded
      FILE_TMPFILE  Temporary decoded file exists

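    Because the non-zero state codes are bit flags, a state value is
    tested with a bitwise AND rather than compared for equality. The
    following is a minimal sketch that uses only the constants listed
    above and the "state" accessor. The helper name "describe_state" and
    its labels are illustrative assumptions, not part of the module.

```perl
use strict;
use warnings;
use Convert::UUlib ':all';

# Hypothetical helper: turn a state bitmask into a list of labels.
sub describe_state {
    my ($state) = @_;
    my @labels;
    push @labels, "missing parts" if $state & FILE_MISPART;
    push @labels, "no begin"      if $state & FILE_NOBEGIN;
    push @labels, "no end"        if $state & FILE_NOEND;
    push @labels, "no data"       if $state & FILE_NODATA;
    push @labels, "ok"            if $state & FILE_OK;
    push @labels, "error"         if $state & FILE_ERROR;
    push @labels, "decoded"       if $state & FILE_DECODED;
    push @labels, "tmpfile"       if $state & FILE_TMPFILE;
    # A state of zero means the file was read but not processed further.
    return @labels ? @labels : ("read only");
}

LoadFile $_ for @ARGV;

for my $uu (GetFileList) {
    printf "%s: %s\n", $uu->filename // "(no name)",
        join ", ", describe_state($uu->state);
}
```

    A single file can carry several flags at once, which is why the
    helper collects labels instead of returning on the first match.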
  Encoding types
      UU_ENCODED    UUencoded data
      B64_ENCODED   Mime-Base64 data
      XX_ENCODED    XXencoded data
      BH_ENCODED    Binhex encoded
      PT_ENCODED    Plain-Text encoded (MIME)
      QP_ENCODED    Quoted-Printable (MIME)
      YENC_ENCODED  yEnc encoded (non-MIME)

EXPORTED FUNCTIONS
  Initializing and cleanup
    Initialize is automatically called when the module is loaded and
    allocates quite a small amount of memory for today's machines ;) CleanUp
    releases that again.

    On my machine, a fairly complete decode with DBI backend needs about
    10MB RSS to decode 20000 files.

    CleanUp
        Release memory, file items and clean up files. Should be called
        after a decoding run, if you want to start a new one.

  Setting and querying options
    $option = GetOption OPT_xxx
    SetOption OPT_xxx, opt-value

    See the "OPT_xxx" constants above to see which options exist.

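    As a minimal sketch of the two calls, the following reads the
    read-only version option and sets two writable options before reading
    them back. The option values are arbitrary examples, and reading a
    value back right after setting it is assumed to return what was set.

```perl
use strict;
use warnings;
use Convert::UUlib ':all';

# Read-only option: the version string of the bundled library.
my $version = GetOption(OPT_VERSION);
print "uulib version: $version\n";

# Writable options: set a value, then read it back.
SetOption(OPT_SAVEPATH, "uudst/");
SetOption(OPT_DESPERATE, 0);

print "save path: ", GetOption(OPT_SAVEPATH), "\n";
print "desperate: ", GetOption(OPT_DESPERATE), "\n";
```

    Options marked "(ro)" in the constants list can only be queried.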
  Setting various callbacks
    SetMsgCallback [callback-function]
    SetBusyCallback [callback-function]
    SetFileCallback [callback-function]
    SetFNameFilter [callback-function]

  Call the currently selected FNameFilter
    $file = FNameFilter $file

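    A filename filter receives a path and returns the cleaned-up name.
    The sketch below exercises a filter of the same shape as the
    "namefilter" sub in the example decoder at the end of this document.
    It is plain Perl and does not involve the library; the sample paths
    are made up for illustration.

```perl
use strict;
use warnings;

# Strip everything up to the last "/" or "\" so only the basename remains.
sub namefilter {
    my ($path) = @_;
    $path =~ s/^.*[\/\\]//;
    return $path;
}

my @samples = ('/tmp/incoming/photo.jpg', 'C:\\Users\\me\\archive.rar', 'plain.txt');
my @cleaned = map { namefilter($_) } @samples;

print "$_\n" for @cleaned;
```

    With the module loaded, such a sub would be installed with
    "SetFNameFilter \&namefilter", as the example decoder does.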
  Loading sourcefiles, optionally fuzzy merge and start decoding
    ($retval, $count) = LoadFile $fname, [$id, [$delflag, [$partno]]]
        Load the given file and scan it for encoded contents. Optionally tag
        it with the given id, and if $delflag is true, delete the file after
        it is no longer necessary. If you are certain of the part number,
        you can specify it as the last argument.

        A better (usually faster) way of doing this is using the
        "SetFNameFilter" functionality.

    $retval = Smerge $pass
        If you are desperate, try to call "Smerge" with increasing $pass
        values, beginning at 0, to try to merge parts that usually would not
        have been merged.

        Most probably this will result in garbled files, so never do this by
        default, except:

        If the "OPT_AUTOCHECK" option has been disabled (by default it is
        enabled) to speed up file loading, then you *have* to call "Smerge
        -1" after loading all files as an additional pre-pass (which is
        normally done by "LoadFile").

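    The interaction between "OPT_AUTOCHECK" and "Smerge -1" can be
    sketched as follows. This is a minimal illustration built only from
    calls documented here; taking the input files from @ARGV and counting
    them in $loaded are assumptions made for the example.

```perl
use strict;
use warnings;
use Convert::UUlib ':all';

# Skip the per-file check while loading many files...
SetOption(OPT_AUTOCHECK, 0);

my $loaded = 0;
for my $file (@ARGV) {
    my ($retval, $count) = LoadFile($file);
    if ($retval == RET_OK) {
        $loaded++;
    } else {
        warn "$file: ", strerror($retval), "\n";
    }
}

# ...then run the pre-pass once, as required when OPT_AUTOCHECK is off.
Smerge(-1);

print "loaded $loaded file(s)\n";
```

    Leaving "OPT_AUTOCHECK" at its default makes the explicit "Smerge"
    call unnecessary, at the cost of slower loading.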
    $item = GetFileListItem $item_number
        Return the $item structure for the $item_number'th found file, or
        "undef" if no file with that number exists.

        The first file has number 0, and the series has no holes, so you can
        iterate over all files by starting with zero and incrementing until
        you hit "undef".

        This function has to walk the linear list of files on each access, so
        if you want to iterate over all items, it is usually faster to use
        "GetFileList".

    @items = GetFileList
        Similar to "GetFileListItem", but returns all files in one go, which
        is very much faster for a large number of items, and has no drawbacks
        when used for a small number of items.

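    Both ways of walking the result list can be sketched side by side.
    The example assumes input files are passed in @ARGV and prints a
    placeholder when an item has no filename; neither is required by the
    module.

```perl
use strict;
use warnings;
use Convert::UUlib ':all';

LoadFile $_ for @ARGV;

# Index-based walk: stops at the first undef, as described above.
my $n = 0;
while (defined(my $item = GetFileListItem($n))) {
    print "item $n: ", $item->filename // "(no name)", "\n";
    $n++;
}

# Same items in one call; usually the better choice.
my @items = GetFileList;
print "GetFileList returned ", scalar(@items), " item(s)\n";
```

    Both loops visit the same items, so $n and the size of @items agree
    at the end.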
  Decoding files
    $retval = $item->rename ($newname)
        Change the ondisk filename where the decoded file will be saved.

    $retval = $item->decode_temp
        Decode the file into a temporary location, use "$item->infile" to
        retrieve the temporary filename.

    $retval = $item->remove_temp
        Remove the temporarily decoded file again.

    $retval = $item->decode ([$target_path])
        Decode the file to its destination, or the given target path.

    $retval = $item->info (callback-function)

  Querying (and setting) item attributes
    $state = $item->state
    $mode = $item->mode ([newmode])
    $uudet = $item->uudet
    $size = $item->size
    $filename = $item->filename ([newfilename])
    $subfname = $item->subfname
    $mimeid = $item->mimeid
    $mimetype = $item->mimetype
    $binfile = $item->binfile

  Information about source parts
    $parts = $item->parts
        Return information about all parts (source files) used to decode the
        file as a list of hashrefs with the following structure:

         {
           partno   => <integer describing the part number, starting with 1>,
           # the following members only exist when they contain useful information
           sfname   => <local pathname of the file where this part is from>,
           filename => <the ondisk filename of the decoded file>,
           subfname => <used to cluster postings, possibly the posting filename>,
           subject  => <the subject of the posting/mail>,
           origin   => <the possible source (From) address>,
           mimetype => <the possible mimetype of the decoded file>,
           mimeid   => <the id part of the Content-Type>,
         }

        Usually you are mostly interested in the "sfname" and possibly the
        "partno" and "filename" members.

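    Since only "partno" is always present, code that reads the other
    members should guard against missing keys. The sketch below assumes
    "parts" can be iterated as a list of hashrefs, the way the example
    decoder at the end of this document does. The $seen counter and the
    placeholder strings are illustrative.

```perl
use strict;
use warnings;
use Convert::UUlib ':all';

LoadFile $_ for @ARGV;

my $seen = 0;
for my $uu (GetFileList) {
    print $uu->filename // "(no name)", "\n";

    # Iterate the parts the same way the example decoder does.
    for my $part ($uu->parts) {
        # Only partno is always present; guard the optional members.
        my $source = exists $part->{sfname} ? $part->{sfname} : "(unknown source)";
        print "  part $part->{partno} from $source\n";
        $seen++;
    }
}

print "$seen part(s) in total\n";
```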
  Functions below are not documented and not very well tested - feedback welcome
      QuickDecode
      EncodeMulti
      EncodePartial
      EncodeToStream
      EncodeToFile
      E_PrepSingle
      E_PrepPartial

EXTENSION FUNCTIONS
    Functions found in this module but not documented in the uulib
    documentation:

    $msg = straction ACT_xxx
        Return a human readable string representing the given action code.

    $msg = strerror RET_xxx
        Return a human readable string representing the given error code.

    $str = strencoding xxx_ENCODED
        Return the name of the encoding type as a string.

    $str = strmsglevel MSG_xxx
        Returns the message level as a string.

    SetFileNameCallback $cb
        Sets (or queries) the FileNameCallback, which is called whenever the
        decoding library can't find a filename and wants to extract a
        filename from the subject line of a posting. The callback will be
        called with two arguments, the subject line and the current
        candidate for the filename. The latter argument can be "undef",
        which means that no filename could be found (and likely none
        exists, so it is safe to also return "undef" in this case). If it
        doesn't return anything (not even "undef"!), then nothing happens,
        so this is a no-op callback:

           sub cb {
              return ();
           }

        If it returns "undef", then this indicates that no filename could be
        found. In all other cases, the return value is taken to be the
        filename.

        This is a slightly more useful callback:

          sub cb {
             return unless $_[1]; # skip "Re:"-plies et al.
             my ($subject, $filename) = @_;
             # if we find some *.rar, take it
             return $1 if $subject =~ /(\w+\.rar)/;
             # otherwise just pass what we have
             return ();
          }

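    The possible outcomes of a filename callback (an empty list, "undef",
    or a filename) are easy to confuse. The sketch below calls a callback
    of the same shape as the one above directly, in list context, with
    made-up subject lines. It does not involve the library; the name
    "filename_cb" and the sample subjects are illustrative assumptions.

```perl
use strict;
use warnings;

# Same logic as the callback above, written as a named sub.
sub filename_cb {
    my ($subject, $filename) = @_;
    return unless $filename;                  # empty list: leave things alone
    return $1 if $subject =~ /(\w+\.rar)/;    # found an archive name
    return ();                                # keep the library's candidate
}

# Call in list context to distinguish "no value" from undef.
my @hit  = filename_cb('new post - backup.rar (1/3)', 'guess.bin');
my @miss = filename_cb('holiday pictures (1/3)',      'guess.bin');
my @none = filename_cb('Re: backup.rar',              undef);

print "hit:  @hit\n";
print "miss: ", scalar(@miss), " value(s)\n";
print "none: ", scalar(@none), " value(s)\n";
```

    Only the first call returns a value. The other two return an empty
    list, which the text above describes as "nothing happens".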
LARGE EXAMPLE DECODER
    The general workflow for decoding is like this:

    1. Configure options with "SetOption" or "SetXXXCallback".
    2. Load all source files with "LoadFile".
    3. Optionally "Smerge".
    4. Iterate over all "GetFileList" items (i.e. result files).
    5. "CleanUp" to delete files and free items.

    What follows is the file "example-decoder" from the distribution that
    illustrates the above workflow in a non-trivial example.

       #!/usr/bin/perl

       # decode all the files in the directory uusrc/ and copy
       # the resulting files to uudst/

       use Convert::UUlib ':all';

       # strip the directory part, keeping only the basename
       sub namefilter {
          my ($path) = @_;

          $path=~s/^.*[\/\\]//;

          $path
       }

       # report progress; the action code is printed in readable form
       sub busycb {
          my ($action, $curfile, $partno, $numparts, $percent, $fsize) = @_;
          $_[0]=straction($action);
          print "busy_callback(", (join ",",@_), ")\n";
          0
       }

       SetOption OPT_RBUF, 128*1024;
       SetOption OPT_WBUF, 1024*1024;
       SetOption OPT_IGNMODE, 1;
       SetOption OPT_VERBOSE, 1;
       SetOption OPT_AUTOCHECK, 0;

       # show the three ways you can set callback functions. I normally
       # prefer the one with the sub inplace.
       SetFNameFilter \&namefilter;

       SetBusyCallback "busycb", 333;

       SetMsgCallback sub {
          my ($msg, $level) = @_;
          print uc strmsglevel $_[1], ": $msg\n";
       };

       # the following non-trivial FileNameCallback takes care
       # of some subject lines not detected properly by uulib:
       SetFileNameCallback sub {
          return unless $_[1]; # skip "Re:"-plies et al.
          local $_ = $_[0];

          # the following rules are rather effective on some newsgroups,
          # like alt.binaries.games.anime, where non-mime, uuencoded data
          # is very common

          # if we find some *.rar, take it as the filename
          return $1 if /(\S{3,}\.(?:[rstuvwxyz]\d\d|rar))\s/i;

          # one common subject format
          return $1 if /- "(.{2,}?\..+?)" (?:yenc )?\(\d+\/\d+\)/i;

          # - filename.par (04/55)
          return $1 if /- "?(\S{3,}\.\S+?)"? (?:yenc )?\(\d+\/\d+\)/i;

          # - (xxx) No. 1 sayuri81.jpg 756565 bytes
          # - (20 files) No.17 Roseanne.jpg [2/2]
          return $1 if /No\.[ 0-9]+ (\S+\....) (?:\d+ bytes )?\[/;

          # try to detect some common forms of filenames
          return $1 if /([a-z0-9_\-+.]{3,}\.[a-z]{3,4}(?:.\d+))/i;

          # otherwise just pass what we have
          ()
       };

       # now read all files in the directory uusrc/*
       for (<uusrc/*>) {
          my ($retval, $count) = LoadFile ($_, $_, 1);
          print "file($_), status(", strerror $retval, ") parts($count)\n";
       }

       Smerge -1;

       SetOption OPT_SAVEPATH, "uudst/";

       # now wade through all files and their source parts
       for my $uu (GetFileList) {
          print "file ", $uu->filename, "\n";
          print " state ", $uu->state, "\n";
          print " mode ", $uu->mode, "\n";
          print " uudet ", strencoding $uu->uudet, "\n";
          print " size ", $uu->size, "\n";
          print " subfname ", $uu->subfname, "\n";
          print " mimeid ", $uu->mimeid, "\n";
          print " mimetype ", $uu->mimetype, "\n";

          # print additional info about all parts
          print " parts";
          for ($uu->parts) {
             for my $k (sort keys %$_) {
                print " $k=$_->{$k}";
             }
             print "\n";
          }

          $uu->remove_temp;

          if (my $err = $uu->decode) {
             print " ERROR ", strerror $err, "\n";
          } else {
             print " successfully saved as uudst/", $uu->filename, "\n";
          }
       }

       print "cleanup...\n";

       CleanUp;

PERLMULTICORE SUPPORT
    This module supports the perlmulticore standard (see
    <http://perlmulticore.schmorp.de/> for more info) for the following
    functions - generally these are functions accessing the disk and/or
    using considerable CPU time:

       LoadFile
       $item->decode
       $item->decode_temp
       $item->remove_temp
       $item->info

    The perl interpreter will be reacquired/released on every callback
    invocation, so for performance reasons, callbacks should be avoided if
    that is costly.

    Future versions might enable multicore support for more functions.

BUGS AND LIMITATIONS
    The original uulib library this module uses was written at a time when
    main memory was measured in megabytes and buffer overflows as a security
    thing didn't exist. While a lot of security fixes have been applied over
    the years (including some defense in depth mechanisms that can shield
    against a lot of as-of-yet undetected bugs), using this library for
    security purposes requires care.

    Likewise, file sizes when the uulib library was written were tiny
    compared to today, so do not expect this library to handle files larger
    than 2GB.

    Lastly, this module uses a very "C-like" interface, which means it
    doesn't protect you from invalid pointers as you might expect from "more
    perlish" modules - for example, accessing a file item object after
    calling "CleanUp" will likely result in crashes, memory corruption, or
    worse.

AUTHOR
    Marc Lehmann <schmorp@schmorp.de>, the original uulib library was
    written by Frank Pilhofer <fp@informatik.uni-frankfurt.de>, and later
    heavily bugfixed by Marc Lehmann.

SEE ALSO
    perl(1), uudeview homepage at <http://www.fpx.de/fp/Software/UUDeview/>.
