=head1 NAME
The Perl Multicore Specification and Implementation
=head1 SYNOPSIS
#include "perlmulticore.h"
// in your XS function:
perlinterp_release ();
do_the_C_thing ();
perlinterp_acquire ();
=head1 DESCRIPTION
This specification describes a simple mechanism for XS modules to allow
re-use of the perl interpreter for other threads while doing some lengthy
operation, such as cryptography, SQL queries, disk I/O and so on.
The design goals for this mechanism were to be simple to use, to have
extremely low overhead when not active, to add only little code and data
size overhead, and to be broadly applicable.
The newest version of this document can be found at
L<http://perlmulticore.schmorp.de/>.
The newest version of the header file that implements this specification
can be downloaded from the perl multicore site at L<http://perlmulticore.schmorp.de/>.
=head2 XS? HOW DO I USE THIS FROM PERL?
This document is only about the XS-level mechanism that defines generic
callbacks - to make use of this, you need a module that provides an
implementation for these callbacks, for example
L<Coro::Multicore>.
=head2 WHICH MODULES SUPPORT IT?
You can check the perl multicore homepage at L<http://perlmulticore.schmorp.de/>
for a list of modules that support this specification.
=head1 HOW DO I USE THIS IN MY MODULES?
The usage is very simple - you include this header file in your XS module. Then, before you
do your lengthy operation, you release the perl interpreter:
perlinterp_release ();
And when you are done with your computation, you acquire it again:
perlinterp_acquire ();
And that's it. This doesn't load any modules and consists of only a few
machine instructions when no module to take advantage of it is loaded.
Here is a simple example, a C<flock> wrapper implemented in XS. Unlike
perl's built-in C<flock>, it allows other threads (for example, those
provided by L<Coro>) to execute, instead of blocking the whole perl
interpreter. For the sake of this example, it requires a file descriptor
instead of a handle.
#include "perlmulticore.h" // this header file
// and in the XS portion
int flock (int fd, int operation)
CODE:
perlinterp_release ();
RETVAL = flock (fd, operation);
perlinterp_acquire ();
OUTPUT:
RETVAL
You can find more examples in the L</"APPENDIX: CASE STUDIES"> appendix.
=head2 HOW ABOUT NOT-SO-LONG WORK?
Sometimes you don't know how long your code will take - in a compression
library, for example, compressing a few hundred kilobytes of data can take
a while, whereas 50 bytes will compress so fast that even attempting to do
something else could be more costly than just doing it.
This is a very hard problem to solve. The best you can do at the moment is
to release the perl interpreter only when you think the work to be done
justifies the expense.
As a rule of thumb, if you expect to need more than a few thousand cycles,
you should release the interpreter, else you shouldn't. When in doubt,
release.
For example, in a compression library, you might want to do this:
if (bytes_to_be_compressed > 2000) perlinterp_release ();
do_compress (...);
if (bytes_to_be_compressed > 2000) perlinterp_acquire ();
Make sure the if conditions are exactly the same and don't change, so you
always call acquire when you release, and vice versa.
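One simple way to guarantee this (a minimal sketch, not something the API
requires) is to store the decision in a local variable and test only that
variable in both places:

    int do_release = bytes_to_be_compressed > 2000;

    if (do_release) perlinterp_release ();
    do_compress (...);
    if (do_release) perlinterp_acquire ();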
When you don't have a handy indicator, you might still do something
useful. For example, if you do some file locking with C<fcntl> and you
expect the lock to be available immediately in most cases, you could first
try C<F_SETLK> (which doesn't wait), and only release/wait/acquire when
the lock couldn't be set:
int res = fcntl (fd, F_SETLK, &flock);
if (res)
{
// error, assume lock is held by another process and do it the slow way
perlinterp_release ();
res = fcntl (fd, F_SETLKW, &flock);
perlinterp_acquire ();
}
=head1 THE HARD AND FAST RULES
As with everything, there are a number of rules to follow.
=over 4
=item I<Never> touch any perl data structures after calling C<perlinterp_release>.
Possibly the most important rule of them all: anything perl is
completely off-limits after C<perlinterp_release>, until you call
C<perlinterp_acquire>, after which you can access perl stuff again.
That includes anything in the perl interpreter that you haven't proven to
be safe, and proven to be safe in older and future versions of perl as
well: global variables, local perl scalars (even if you are sure nobody
else accesses them and you only try to "read" their value), and so on.
If you need to access perl things, do it before releasing the
interpreter with C<perlinterp_release>, or after acquiring it again with
C<perlinterp_acquire>.
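To illustrate the difference, here is a minimal sketch (C<sv> and
C<do_the_C_thing> are just made-up placeholders): fetch whatever you need
from perl while you still hold the interpreter, then pass plain C values
to the released section:

    /* WRONG - SvIV reads a perl scalar after the release */
    perlinterp_release ();
    do_the_C_thing (SvIV (sv));
    perlinterp_acquire ();

    /* RIGHT - extract the value before releasing the interpreter */
    IV timeout = SvIV (sv);

    perlinterp_release ();
    do_the_C_thing (timeout);
    perlinterp_acquire ();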
=item I<Always> call C<perlinterp_release> and C<perlinterp_acquire> in pairs.
For each C<perlinterp_release> call there must be a C<perlinterp_acquire>
call. They don't have to be in the same function, and you can have
multiple calls to them, as long as every C<perlinterp_release> call is
followed by exactly one C<perlinterp_acquire> call.
For example, this would be fine:
perlinterp_release ();
if (!function_that_fails_with_0_return_value ())
{
perlinterp_acquire ();
croak ("error");
// croak doesn't return
}
perlinterp_acquire ();
// do other stuff
=item I<Never> nest calls to C<perlinterp_release> and C<perlinterp_acquire>.
That simply means that after calling C<perlinterp_release>, you must
call C<perlinterp_acquire> before calling C<perlinterp_release>
again. Likewise, after C<perlinterp_acquire>, you can call
C<perlinterp_release> again, but not another C<perlinterp_acquire>.
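In other words, a (sketched) sequence like this violates the rule:

    perlinterp_release ();
    perlinterp_release (); /* WRONG - must call perlinterp_acquire () first */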
=item I<Always> call C<perlinterp_release> first.
Also simple: you I<must not> call C<perlinterp_acquire> without having
called C<perlinterp_release> before.
=item I<Never> underestimate threads.
While it's easy to add parallel execution ability to your XS module, that
doesn't mean it is safe. After you release the perl interpreter, it's
perfectly possible that perl will call your XS function again, in another
thread, even while your original call still executes. In other words: your
C code must be thread-safe, and if you use any library, that library must
be thread-safe, too.
Always assume that the code between C<perlinterp_release> and
C<perlinterp_acquire> is executed in parallel on multiple CPUs at the same
time. If your code can't cope with that, you could consider using a mutex
to only allow one such execution, which is still better than blocking
everybody else from doing anything:
static pthread_mutex_t my_mutex = PTHREAD_MUTEX_INITIALIZER;
perlinterp_release ();
pthread_mutex_lock (&my_mutex);
do_your_non_thread_safe_thing ();
pthread_mutex_unlock (&my_mutex);
perlinterp_acquire ();
=item I<Don't> get confused by having to release first.
In many real world scenarios, you acquire a resource, do something, then
release it again. Don't let this confuse you: here, you already own
the resource (the perl interpreter), so you have to I<release> it first, and
I<acquire> it again later, not the other way around.
=back
=head1 DESIGN PRINCIPLES
This section discusses how the design goals were reached (you be the
judge), how it is implemented, and what overheads this implies.
=over 4
=item Simple to Use
All you have to do is identify the place in your existing code where you
stop touching perl stuff, do your actual work, and start touching perl
stuff again.
Then slap C<perlinterp_release> and C<perlinterp_acquire> around the
actual work code.
You have to include F<perlmulticore.h> and distribute it with your XS
code, but all these things border on the trivial.
=item Very Efficient
The definition of C<perlinterp_release> and C<perlinterp_acquire> is very
short:
#define perlinterp_release() perl_multicore_api->pmapi_release ()
#define perlinterp_acquire() perl_multicore_api->pmapi_acquire ()
Both are macros that read a pointer from memory (C<perl_multicore_api>),
dereference a function pointer stored at that place, and call the
function, which takes no arguments and returns nothing.
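For reference, this is roughly the structure the macros assume - a sketch
only, the authoritative definition is in F<perlmulticore.h>:

    struct perl_multicore_api
    {
      void (*pmapi_release) (void); /* called by perlinterp_release () */
      void (*pmapi_acquire) (void); /* called by perlinterp_acquire () */
    };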
The first call to C<perlinterp_release> will check for the presence
of any supporting module, and if none is loaded, will create a dummy
implementation where both C<perlinterp_release> and C<perlinterp_acquire> execute
this function:
static void perl_multicore_nop (void) { }
So in the case of no magical module being loaded, all calls except the
first are two memory accesses and a predictable function call of an empty
function.
Of course, the overhead is much higher when these functions actually
implement anything useful, but you always get what you pay for.
With L<Coro::Multicore>, every release/acquire involves two pthread
switches, two coro thread switches, a bunch of syscalls, and sometimes
interacting with the event loop.
A dedicated thread pool such as the one L<IO::AIO> uses could reduce
these overheads, and would also reduce the dependencies (L<IO::AIO> is a
smaller and more portable dependency than L<Coro>), but it would require a
lot more work on the side of the module author wanting to support it than
this solution.
=item Low Code and Data Size Overhead
On a 64 bit system, F<perlmulticore.h> uses exactly C<8> octets (one
pointer) of your data segment, to store the C<perl_multicore_api>
pointer. In addition it creates a C<16> octet perl string to store the
function pointers in, and stores it in a hash provided by perl for this
purpose.
This is pretty much the equivalent of executing this code:
$existing_hash{perl_multicore_api} = "123456781234567812345678";
And that's it, which is, I think, indeed very little.
As for code size and speed, on my amd64 system, every call to
C<perlinterp_release> or C<perlinterp_acquire> results in a variation of
the following 9-10 octet sequence, which is easy to predict for modern
CPUs, as the function pointer is constant after initialisation:
150:   mov    0x200f23(%rip),%rax   # <perl_multicore_api>
157:   callq  *0x8(%rax)
The actual function being called when no backend is installed or enabled
looks like this:
1310:   retq
The biggest part is the initialisation code, which consists of 11 lines of
typical XS code. On my system, all the code in F<perlmulticore.h> compiles
to less than 160 octets of read-only data.
=item Broad Applicability
While there are alternative ways to achieve the goal of parallel execution
with threads that might be more efficient, this mechanism was chosen
because it is very simple to retrofit existing modules with it, while
still meeting the design goals: simple to use, very efficient when not
needed, low code and data size overhead, and broad applicability.
=back
=head1 DISABLING PERL MULTICORE AT COMPILE TIME
You can disable the complete perl multicore API by defining the
symbol C<PERL_MULTICORE_DISABLE> to C<1> (e.g. by specifying
C<-DPERL_MULTICORE_DISABLE> as a compiler argument).
This will leave no traces of the API in the compiled code; suitable
"empty" C<perlinterp_release> and C<perlinterp_acquire> definitions will be provided.
This could be added to perl's C<ccflags> when configuring perl on
platforms that do not support threading at all, for example, and would
reduce the overhead to nothing. It is by no means required, though, as the
header will compile and work just fine without any thread support.
=head1 APPENDIX: CASE STUDIES
This appendix contains some case studies on how to patch existing
modules. Unless they are available on CPAN, the patched modules (including
diffs) can be found in the perl multicore repository (see
L<http://perlmulticore.schmorp.de/>).
In addition to the patches shown, the F<perlmulticore.h> header
must be added to the module and included in any XS or C file that uses it.
=head2 Case Study: C<Digest::MD5>
The C<Digest::MD5> module presents some unique challenges because it mixes
perl I/O and CPU-based processing.
So first let's identify the easy cases - setup (in C<new>) and
calculating the final digest are very fast operations that would be unlikely
to profit from running in a separate thread. Which leaves the C<add>
method and the C<md5> (C<md5_hex>, C<md5_base64>) functions.
They are both very easy to update - the C<MD5Update> call
doesn't access any perl data structures, so you can slap
C<perlinterp_release>/C<perlinterp_acquire> around it:
if (len > 8000) perlinterp_release ();
MD5Update(context, data, len);
if (len > 8000) perlinterp_acquire ();
This works for both the C<add> and the C<md5> XS functions. The C<8000> is
somewhat arbitrary.
This leaves C<addfile>, which would normally be the ideal candidate,
because it is often used on large files and needs to wait both for I/O and
the CPU. Unfortunately, it is implemented like this (only the inner loop
is shown):
unsigned char buffer[4096];
while ( (n = PerlIO_read(fh, buffer, sizeof(buffer))) > 0) {
MD5Update(context, buffer, n);
}
That is, it uses a 4KB buffer per C<MD5Update> call. Putting
C<perlinterp_release>/C<perlinterp_acquire> calls around it would be way
too inefficient. Ideally, you would want to put them around the whole
loop.
Unfortunately, C<addfile> uses C<PerlIO_read> for the actual I/O, and
C<PerlIO> is not thread-safe. We can't even use a mutex, as we would have
to protect against all other C<PerlIO> calls.
As a compromise, we can use the C<USE_HEAP_INSTEAD_OF_STACK> option that
the C<Digest::MD5> sources provide, which puts the buffer onto the heap
instead of the stack, and use a far larger buffer:
#define USE_HEAP_INSTEAD_OF_STACK
New(0, buffer, 1024 * 1024, unsigned char); /* heap-allocated, so sizeof(buffer) would only be the pointer size */
while ( (n = PerlIO_read(fh, buffer, 1024 * 1024)) > 0) {
if (n > 8000) perlinterp_release ();
MD5Update(context, buffer, n);
if (n > 8000) perlinterp_acquire ();
}
This will unfortunately still block on I/O, and allocate a large block of
memory, but it is better than nothing.
=head2 Case Study: C<DBD::mysql>
Another example would be to modify C<DBD::mysql> to allow other
threads to execute while executing SQL queries.
The actual code that needs to be patched is not in an F<.xs>
file, but in the F<dbdimp.c> file, which is included by an XS file.
While there are many calls, the most important ones are the statement
execute calls. There are only two in F<dbdimp.c>: one call to
C<mysql_st_internal_execute41>, and one to C<mysql_st_internal_execute>,
both undocumented internal functions of C<DBD::mysql>.
The difference is that the former is used with mysql 4.1+ and prepared
statements.
The call to C<mysql_st_internal_execute> is easy, as it does all the
important work and doesn't access any perl data structures (I checked
its source manually to make sure):
perlinterp_release ();
imp_sth->row_num= mysql_st_internal_execute(
sth,
*statement,
NULL,
DBIc_NUM_PARAMS(imp_sth),
imp_sth->params,
&imp_sth->result,
imp_dbh->pmysql,
imp_sth->use_mysql_use_result
);
perlinterp_acquire ();
Despite the name, C<mysql_st_internal_execute41> isn't actually part of
the mysql client library, but a long function in F<dbdimp.c>. Here is an
abridged version, with C<perlinterp_release>/C<perlinterp_acquire> calls added:
int i;
enum enum_field_types enum_type;
dTHX;
int execute_retval;
my_ulonglong rows=0;
D_imp_xxh(sth);
if (DBIc_TRACE_LEVEL(imp_xxh) >= 2)
PerlIO_printf(DBIc_LOGPIO(imp_xxh),
"\t-> mysql_st_internal_execute41\n");
perlinterp_release ();
if (num_params > 0 && !(*has_been_bound))
{
if (mysql_stmt_bind_param(stmt,bind))
goto error;
}
if (DBIc_TRACE_LEVEL(imp_xxh) >= 2)
{
perlinterp_acquire ();
PerlIO_printf(DBIc_LOGPIO(imp_xxh),
"\t\tmysql_st_internal_execute41 calling mysql_execute with %d num_params\n",
num_params);
perlinterp_release ();
}
execute_retval= mysql_stmt_execute(stmt);
if (execute_retval)
goto error;
/*
This statement does not return a result set (INSERT, UPDATE...)
*/
if (!(*result= mysql_stmt_result_metadata(stmt)))
{
if (mysql_stmt_errno(stmt))
goto error;
rows= mysql_stmt_affected_rows(stmt);
}
/*
This statement returns a result set (SELECT...)
*/
else
{
for (i = mysql_stmt_field_count(stmt) - 1; i >=0; --i) {
enum_type = mysql_to_perl_type(stmt->fields[i].type);
if (enum_type != MYSQL_TYPE_DOUBLE && enum_type != MYSQL_TYPE_LONG)
{
/* mysql_stmt_store_result to update MYSQL_FIELD->max_length */
my_bool on = 1;
mysql_stmt_attr_set(stmt, STMT_ATTR_UPDATE_MAX_LENGTH, &on);
break;
}
}
/* Get the total rows affected and return */
if (mysql_stmt_store_result(stmt))
goto error;
else
rows= mysql_stmt_num_rows(stmt);
}
perlinterp_acquire ();
if (DBIc_TRACE_LEVEL(imp_xxh) >= 2)
PerlIO_printf(DBIc_LOGPIO(imp_xxh),
"\t<- mysql_internal_execute_41 returning %d rows\n",
(int) rows);
return(rows);
error:
if (*result)
{
mysql_free_result(*result);
*result= 0;
}
perlinterp_acquire ();
if (DBIc_TRACE_LEVEL(imp_xxh) >= 2)
PerlIO_printf(DBIc_LOGPIO(imp_xxh),
" errno %d err message %s\n",
mysql_stmt_errno(stmt),
mysql_stmt_error(stmt));
So C<perlinterp_release> is called after some logging, but before the
C<mysql_stmt_bind_param> call.
To make things more interesting, the function has multiple calls to
C<PerlIO_printf> to log things, none of which are thread-safe, so they need
to be surrounded with C<perlinterp_acquire> and C<perlinterp_release> calls
to temporarily re-acquire the interpreter. This is slow, but logging is
normally off:
if (DBIc_TRACE_LEVEL(imp_xxh) >= 2)
{
perlinterp_acquire ();
PerlIO_printf(DBIc_LOGPIO(imp_xxh),
"\t\tmysql_st_internal_execute41 calling mysql_execute with %d num_params\n",
num_params);
perlinterp_release ();
}
The function also has a separate error exit, and each exit needs its own
C<perlinterp_acquire> call. First the normal function exit:
perlinterp_acquire ();
if (DBIc_TRACE_LEVEL(imp_xxh) >= 2)
PerlIO_printf(DBIc_LOGPIO(imp_xxh),
"\t<- mysql_internal_execute_41 returning %d rows\n",
(int) rows);
return(rows);
And this is the error exit:
error:
if (*result)
{
mysql_free_result(*result);
*result= 0;
}
perlinterp_acquire ();
This is enough to run DBI's C<execute> calls in separate threads.
=head3 Interlude: the various C<DBD::mysql> async mechanisms
Here is a short discussion of the four principal ways to run
C<DBD::mysql> SQL queries asynchronously.
=over 4
=item in a separate process
Modules such as C<DBD::Gofer> or C<AnyEvent::DBI> can run C<DBI> calls
in a separate process, and this is not limited to mysql. This has to be
paid for with more complex management, some limitations in what can be
done, and an extra serialisation/deserialisation step for all data.
=item C<DBD::mysql>'s async support
This lets you execute the SQL query, while waiting for the results
via an event loop or similar mechanism. This is reasonably fast and
very compatible, but the disadvantages are that it relies on
undocumented internal functions to do this, and, more importantly, that
it only covers the actual execution phase, not the data transfer phase:
for statements with large results, the program blocks until all of it is
transferred, which can include large amounts of disk I/O.
=item C<Coro::Mysql>
This module actually works quite similarly to perl multicore, but uses
Coro threads exclusively. It shares the advantages of C<DBD::mysql>'s
async mode, but not, at least in theory, its disadvantages. In practice,
the mechanism it uses isn't undocumented, but distributions often don't
come with the correct header file needed to use it, and oracle's mysql
has broken this mechanism multiple times (mariadb supports it), so it's
actually less reliably available than C<DBD::mysql>'s async mode or perl
multicore.
It also requires C<Coro>.
=item perl multicore
This method has all the advantages of C<Coro::Mysql> without most of its
disadvantages, except that it incurs higher overhead due to the extra
thread switching.
=back
Pick your poison.
=head1 AUTHOR
Marc A. Lehmann
http://perlmulticore.schmorp.de/
=head1 LICENSE
The F<perlmulticore.h> header file itself is in the public
domain. Where this is legally not possible, or at your
option, it can be licensed under the Creative Commons CC0
license: L<https://creativecommons.org/publicdomain/zero/1.0/>.
This document is licensed under the GNU General Public License, version
3.0, or any later version.