This is a draft.
MINIX3 now has support for live update and rerandomization of its system services. These features are based on LLVM bitcode compilation and instrumentation in combination with various run-time extensions. Live update and rerandomization support is currently fully functional but in an experimental state, not enabled by default, and available for x86 only. This document describes the basic idea, provides instructions on how to enable and use the functionality, provides more in-depth information for developers, and lists open issues and further reading material.
This section contains a high-level overview of the live update and rerandomization functionality.
A live update is an update to a software component while it is active, allowing the component's code and data to be changed without affecting the environment around it. The MINIX3 live update functionality allows such updates to be applied to its system services: the usermode server and driver processes that, in addition to the microkernel, make up the operating system. As a result, these services can be updated at run time without requiring a system reboot. There is no support for live updating the microkernel or user applications at this time.
The live update procedure can be summarized as follows. The component responsible for orchestrating live updates is the RS (Reincarnation Server) service. When RS applies an update to a particular system service, it first brings that service to a stop in a known state, by exploiting the message-based nature of MINIX3. A new instance of the service is created. This new instance performs its own state transfer, copying and adjusting all the relevant data from the old instance to itself. If the state transfer succeeds, the new instance continues to run, and the old instance is killed. If the state transfer fails, RS performs a rollback: the new instance is killed, and the system resumes execution of the old instance. In order to maintain the illusion to the rest of the system that there only ever was one service process, the process slots of the old and the new instance are swapped before the new instance gets to run, and swapped back upon rollback.
The MINIX3 live update system allows updates to all system services. This includes the RS service itself and the VM (Virtual Memory) service. The VM service can be updated only with severe restrictions, however. The system also supports multicomponent live updates: atomic live updates of several system services at once, possibly including RS and/or VM. In principle, this allows for an atomic live update of the entire MINIX3 service layer.
The state transfer aspect of live update relies heavily on compile-time and in particular link-time instrumentation of system services. This instrumentation is implemented in the form of LLVM “optimization” passes, which operate on LLVM bitcode modules. In most cases, these passes are run after (initial) program linking, by means of the LLVM Link-Time Optimization (LTO) system. Thus, in order to support live update and rerandomization, the system must be compiled using LLVM bitcode and with LTO support. The LLVM pass that performs the static analysis and link-time instrumentation for live update is called the magic pass.
In addition, live updates require runtime support for state transfer in each service. For this reason, system services are relinked with a library that provides all the run-time functionality that ultimately allows a new service instance to perform state transfer from its old instance. This library is called the magic library or libmagic. Together, the magic pass and library make up the magic framework.
Live rerandomization consists of randomizing the internal address space layout of a component at run time. While the concept of ASR or ASLR - Address Space (Layout) Randomization - is well known, most implementations are rather limited: they do such randomization only once, when starting a process; they randomize the base location of entire process regions, for example the process stack; and they apply the concept to user processes only. In contrast, the MINIX3 live rerandomization can randomize the address space layout of system services, as often as desired, and with fine granularity. In order to achieve this, the live rerandomization makes use of live updates.
The fundamental idea is to first generate a new version of the service binary, using link-time randomization of various parts of the binary. Ideally, this would be done at run time; due to various limitations, MINIX3 currently only supports pregenerated randomized binaries of system services. Then, at runtime, the live update system is used to update from one randomized version of each service to another.
The randomization of binaries is done with another link-time pass, called the asr pass. The magic library implements the runtime aspects of ASR rerandomization during live update.
In this section, we explain how to set up a MINIX3 system that supports live update and rerandomization, and we describe how to use these functionalities when running MINIX3.
We cover all the steps to set up a MINIX3 system that is ready for live update and rerandomization. For now, it requires crosscompilation as well as an additional build of the LLVM source code. The procedure is for x86 targets only. The current procedure is not quite ideal, but it is what we have right now, and it should work.
After setting up an initial environment, the MINIX3 update cycle basically consists of four steps: obtaining or updating the MINIX3 source code, building the system, instrumenting the system, and generating a bootable image. We will go through all steps in detail. There is also a summary of commands to issue at the end.
All of the commands in this section are to be performed on the crosscompilation host system rather than on MINIX3. None of the commands, except the Linux-specific sudo apt-get example in the first subsection, require more than ordinary user privileges.
The initial step is to set up a crosscompilation environment. General information about setting up a crosscompilation environment can be found on the crosscompilation page. As one example, the reference platform used to test the instructions in this document was the developer desktop edition of Ubuntu 14.04, a.k.a. ubuntu-14.04.2-desktop-i386.iso, with the following extra packages installed:
$ sudo apt-get install curl clang binutils zlibc zlib1g zlib1g-dev libncurses-dev qemu-system-x86
In terms of directory organization, the idea is that everything will end up in one containing directory. Here we use /home/user/minix-liveupdate as an example, but the location is entirely up to you. This containing directory will end up having one subdirectory for the MINIX3 source code (called minix-src in this document), one subdirectory for the LLVM LTO toolchain (called obj_llvm.i386), and one subdirectory for the crosscompilation tool chain and compiled objects (called obj.i386). Thus, the ultimate directory structure will look like this:
/home/user/minix-liveupdate/minix-src
/home/user/minix-liveupdate/obj_llvm.i386
/home/user/minix-liveupdate/obj.i386
You have to choose a location for the containing directory, and create it yourself. The three subdirectories will be created automatically as part of the following steps. In terms of placement, expect to be needing a bare minimum of 30GB for the combination of these three subdirectories, with a recommended 40GB of available space.
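As a rough sketch, the available space can be checked before starting. The helper names free_gb and has_space are hypothetical and not part of the MINIX3 tooling; the 30GB figure is the bare minimum stated above, and the path is the example location used in this document.

```shell
# Hypothetical helper (not part of MINIX3): print the free space, in whole GB,
# of the file system that holds directory $1.
free_gb() {
    # df -P prints one POSIX-format line per file system; -k reports kilobytes.
    # Column 4 of the second line is the available space.
    df -Pk "$1" | awk 'NR == 2 { print int($4 / 1024 / 1024) }'
}

# Hypothetical helper: succeed if directory $1 has at least $2 GB available.
has_space() {
    [ "$(free_gb "$1")" -ge "$2" ]
}

# Example use:
#   mkdir -p /home/user/minix-liveupdate
#   has_space /home/user/minix-liveupdate 30 || echo "less than 30GB free" >&2
```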
The first real step is then to check out the MINIX3 source code. Other wiki pages cover this in more detail, but the gist is to check out the sources from the main MINIX3 repository using git:
$ cd /home/user/minix-liveupdate
$ git clone git://git.minix3.org/minix minix-src
This will create a minix-src subdirectory containing the latest version of the MINIX3 source code.
Later on, a newer version of the source code can be pulled from the MINIX3 repository:
$ cd /home/user/minix-liveupdate/minix-src
$ git pull
In both cases, the next step is now to build the source code.
The next step consists of building the system. When run for the first time, this step will also build the LLVM LTO infrastructure, the crosscompilation tools, and the instrumentation. The first run may take several hours.
The center of all the instrumentation activities is the minix/llvm subdirectory of the MINIX3 source tree. This directory contains the instrumentation passes, runtime library, and supporting scripts. This step and the next steps therefore assume this subdirectory as the current directory:
$ cd minix-src/minix/llvm
It may be necessary to ensure that clang is used as the compiler, by exporting the following shell variables. GCC should work as well, but has not been tested as thoroughly.
$ export CC=clang CXX=clang++
Then, the system can be built with support for instrumentation by running the configure.llvm script in the current directory, with the MKMAGIC build variable set to yes. To build the infrastructure and system without parallel compilation, simply run the script:
$ BUILDVARS="-V MKMAGIC=yes" ./configure.llvm
Alternatively, a number of parallel jobs may be supplied. It is typically advisable to use as many jobs as there are hardware threads of execution (i.e., CPUs or hyperthreads) in the system:
$ JOBS=8 BUILDVARS="-V MKMAGIC=yes" ./configure.llvm
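Rather than hardcoding the job count, it can be derived from the host. The sketch below assumes a host where getconf supports the _NPROCESSORS_ONLN name (as on Linux and the BSDs); the helper name hw_threads is hypothetical.

```shell
# Hypothetical helper: print the number of online hardware threads,
# or 1 if it cannot be determined.
hw_threads() {
    n=$(getconf _NPROCESSORS_ONLN 2>/dev/null) || n=
    case "$n" in
        ''|*[!0-9]*) n=1 ;;   # empty or non-numeric: fall back to one job
    esac
    echo "$n"
}

# Example use, from minix-src/minix/llvm:
#   JOBS=$(hw_threads) BUILDVARS="-V MKMAGIC=yes" ./configure.llvm
```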
After the first run, the configure.llvm script will recompile only those parts of the source code that have changed, and should not take nearly as long to run as the first time. In case of unexpected problems when rebuilding, it may be necessary to throw away the previously generated objects and rebuild the MINIX3 source code in its entirety. This can be done by going to the top-level obj.i386 directory and deleting all files and directories except the tooldir.{yourplatform} subdirectory in there. Fully rebuilding the MINIX3 source code will take longer than an incremental rebuild, but since the crosscompilation toolchain is left as is, it will still take nowhere near as long as the first run.
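The cleanup described above could be scripted along the following lines. This is a sketch only: the helper name clean_objdir is hypothetical, and it relies on the -mindepth and -maxdepth extensions of GNU and BSD find, which are available on the Ubuntu reference platform.

```shell
# Hypothetical helper: delete everything directly inside the object directory
# $1 except entries named tooldir.*, so that the crosscompilation toolchain
# is left as is.
clean_objdir() {
    [ -d "$1" ] || { echo "not a directory: $1" >&2; return 1; }
    find "$1" -mindepth 1 -maxdepth 1 ! -name 'tooldir.*' -exec rm -rf {} +
}

# Example use:
#   clean_objdir /home/user/minix-liveupdate/obj.i386
```

Since the deletion is irreversible, it is worth double-checking the path argument before running it.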
As explained in more detail on the crosscompilation page, it is also possible to rebuild particular parts of the system without going through the entire “make build” process. This involves the use of the nbmake-i386 tool and generally requires a good understanding of the compilation process. It may be worth mentioning that the first configure.llvm run saves the MKMAGIC value, so this variable need not be passed to nbmake-i386 each time.
When building the system for the first time, this step may be skipped, as it is performed automatically. However, when the source code is changed for any of the LLVM passes or the magic library, that is, the source code in minix/llvm, the changed component must be recompiled. Warning: updating the MINIX3 source code with git pull may also upgrade any of these components, in which case it is the responsibility of the user (you) to recompile and reinstall them!
Once we properly integrate the LLVM LTO infrastructure into the MINIX3 build system, this step should disappear altogether.
This substep must be performed whenever the source code of libmagic changes. This is because dependency tracking is not working correctly for libmagic, which means the automated step in configure.llvm may not recompile the library properly.
The source code of libmagic is located in the minix/llvm/static/magic subdirectory of the MINIX3 source code. To (re)compile and install libmagic, go to its source directory and issue a make clean and a make install:
$ cd static/magic
$ make clean install
The library is installed to minix/llvm/bin. In a later step, the relink.llvm script will pick it up from there.
This substep is also performed automatically for the first time, by the generate_gold_plugin.sh script invoked from configure.llvm. However, whenever the source code of any of the LLVM instrumentation passes changes, that pass must be recompiled and installed.
The source code of the passes is located in the minix/llvm/passes subdirectory of the MINIX3 source code. A pass can be compiled and installed by going to its minix/llvm/passes/{pass} subdirectory, and issuing make install.
For example, to recompile and install the magic pass:
$ cd passes/magic
$ make install
The passes are installed to minix/llvm/bin. In a later step, the build.llvm script will pick them up from there.
After building the system, two more steps need to be performed: instrumentation of system services, and generation of a bootable hard disk image. These steps must be performed every time the system is built, including the first time. In particular: every time a system service is (re)compiled, it must be (re)instrumented afterwards. Every time any part of the compiled MINIX3 installation is changed, a new image must be built.
In order to generate a fully instrumented system image with a number of pregenerated ASR binaries for all services, one can run a command that automates both steps. This is covered in the first subsection. Alternatively, the details of manual instrumentation and image building are covered in the two subsections after.
The clientctl script in minix/llvm provides a convenient way to instrument all services for live update and rerandomization, generate a number of rerandomized versions of each service, and build a hard disk image. The command has the following syntax:
$ ./clientctl buildasr [N]
Here, N is an optional parameter specifying the number of sets of rerandomized binaries that should be generated in addition to the standard set of randomized binaries. N defaults to 1. For example, the following command will produce a system with four randomized sets of service binaries: one set of ASR-randomized services that are used by default, and three extra sets of rerandomized binaries to which the system can switch at run time:
$ ./clientctl buildasr 3
The result is a MINIX3 hard disk image file which can be booted in (for example) qemu; see further below.
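Instrumentation takes place at the granularity of individual system services. The minix/llvm directory contains scripts that allow for relinking services against runtime libraries, and instrumenting services with LLVM passes. The general procedure consists of three steps: (1) compile the service to LLVM bitcode, (2) relink the service against the required runtime libraries, and (3) instrument the service with the desired LLVM passes. Each step also (re)generates a ready-to-execute machine code version of the service.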
Step 1 happens in the “building the system” step, using configure.llvm, as explained before.
Step 2 is done with the relink.llvm script in minix/llvm. This script will relink services against a space-separated list of libraries. For live update, only the magic library is relevant:
$ ./relink.llvm magic
This command will relink all services against libmagic, thus providing them with runtime support for live update.
Step 3 is done with the build.llvm script in minix/llvm. This script will instrument services with a space-separated list of LLVM passes. For live update, the magic pass should be used:
$ ./build.llvm magic
This command will instrument all services with the magic pass, performing static analysis and changing the service to include the information necessary for libmagic to perform live updates at runtime.
For live rerandomization support, one must apply not only the magic pass, but also the asr pass:
$ ./build.llvm magic asr
The resulting service will not only be ready for live update, but also be subjected to fine-grained randomization, as well as be supplied with parameters to perform the runtime component of rerandomization during live updates.
For reference, the clientctl buildasr command shown above performs this step multiple times to generate different rerandomized versions of each service, storing each in a different location.
Some details that might be useful to know about relinking and applying passes:
relink.llvm and build.llvm perform their respective actions on all system services by default. It is possible to instrument only a subset of services, leaving the other services as is. This can be done by passing a C shell variable with a comma-separated list of individual services. For example, the following command relinks the PM (Process Manager) service against the magic library:
$ C=pm ./relink.llvm magic
The pseudo-targets servers, fs, net, and drivers will perform actions on the services in the corresponding subdirectories in the MINIX3 source tree. The rd pseudo-target regenerates the ramdisk, which must be redone after changing any service on the ramdisk. For example, the following command instruments core servers and file system services with the magic and asr passes, and rebuilds the ramdisk:
$ C=servers,fs,rd ./build.llvm magic asr
The clientctl buildasr command accepts this optional C shell variable as well.
A single build.llvm invocation must be used to apply all passes at once.
Finally, a MINIX3 image can be built from the compiled MINIX3 code using the clientctl buildimage command:
$ ./clientctl buildimage
This command produces a bootable MINIX3 hard disk image file. The generated image file is called minix_x86.img and is located in the root of the MINIX3 source tree - minix-src in our examples.
This command is called automatically as part of clientctl buildasr.
Once a hard disk image has been generated, it can be run. The most convenient way to run the image is to use qemu. For convenience, the clientctl script in minix/llvm has a run command to run the image in qemu without further effort:
$ OUT=F ./clientctl run
The OUT shell variable can be set to other values to control what to do with serial output. The F value specifies that the serial output will be redirected to a file, namely serial.out. The other supported settings are S (stdout), C (console), and P (pty).
Extra boot options can be supplied through the APPEND variable:
$ OUT=F APPEND="rs_verbose=1" ./clientctl run
This example will enable verbose output in the RS service, which is highly useful for debugging issues with live update.
The following commands can be used to obtain, build, instrument, and start a MINIX3 system that supports live update and live rerandomization, including three alternative rerandomized versions, in addition to the standard ones, of all system services:
$ git clone git://git.minix3.org/minix minix-src
$ cd minix-src/minix/llvm
$ export CC=clang CXX=clang++
$ JOBS=8 BUILDVARS="-V MKMAGIC=yes" ./configure.llvm
$ ./clientctl buildasr 3
$ OUT=F ./clientctl run
The entire procedure will typically take about 30GB of disk space and several hours of time.
Sometime later, the following steps can be used to update the installation to a newer MINIX3 version:
$ cd minix-src/minix/llvm
$ git pull
$ export CC=clang CXX=clang++
$ JOBS=8 BUILDVARS="-V MKMAGIC=yes" ./configure.llvm
$ for pass in WeakAliasModuleOverride sectionify magic asr; do (cd passes/$pass && make clean install); done
$ (cd static/magic && make clean install)
$ ./clientctl buildasr 3
$ OUT=F ./clientctl run
In contrast to the initial run, the entire update procedure should take no more than an hour.
Instead of the ./clientctl buildasr 3 step in the above two examples, one can also, for example, instrument the system for live update but not live rerandomization, using the following three replacement steps:
$ ./relink.llvm magic
$ ./build.llvm magic
$ ./clientctl buildimage
Once an instrumented MINIX3 system has been built and started, it should be ready for live updates. MINIX3 offers two scripts that make use of the live update functionality: one for testing the infrastructure, and one for performing runtime ASR rerandomization. In addition, the user may let the system perform live updates explicitly. In this section, we cover both parts.
The commands in this section are to be run within MINIX3, rather than on the host system. They must be run as root, because performing a live update of a system service requires superuser privileges. These two things are reflected by the minix# prompt used in the examples below.
The MINIX3 distribution comes with two scripts that can be used to test and use the live update and rerandomization functionality. The first one is testrelpol. This script may be used for basic regression testing on the MINIX3 live update infrastructure. The second one is update_asr. This command performs live rerandomization of system services at runtime.
The MINIX3 test suite has a test set script that tests the basic MINIX3 crash recovery and live update functionality. The script is called testrelpol and can be found in /usr/tests/minix-posix:
minix# cd /usr/tests/minix-posix
minix# ./testrelpol
For its live update tests, this script does not use the magic framework for state transfer at all. Instead, it uses identity transfer, which basically just performs a memory copy between the old and the new instance. As a result, the testrelpol script should work whether or not services are instrumented. However, it may not work reliably on MINIX3 systems that are not built for magic instrumentation at all (i.e., built without MKMAGIC=yes).
As we have shown before, the clientctl buildasr host-side command can perform the build-time preparation of a MINIX3 system for live rerandomization. Complementing this, the run-time side of the live rerandomization is provided by means of the update_asr command. The update_asr command will update system services to their next pregenerated rerandomized version, using a cyclic system. Live rerandomization is not automatic, and thus, the MINIX3 system administrator is responsible for running the update_asr command at appropriate times.
By default, the update_asr command performs one round of ASR rerandomization, updating each service to its next version:
minix# update_asr
By default, this command will report errors only. More verbose information can be shown using the -v switch:
minix# update_asr -v
For further details about this command, see the update_asr(8) manual page.
Aside from providing actual security benefits, the update_asr script is the most complete test of the live update and rerandomization functionality at this time. It uses the magic framework for state transfer, with full-scale relocation of all state, and it applies the runtime ASR features. As of writing, it runs in the default qemu environment without any errors or subsequent issues.
The only aspect that is not tested with this command is whether ASR rerandomization is effective, that is, whether all parts of a service's address space were properly randomized by the asr pass. After all, ASR rerandomization between identical service copies works just as well, but provides substantially fewer security guarantees. Developers working on the asr pass are encouraged to check its effectiveness manually, for example using nm(1) on generated service binaries on the host side.
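One possible manual check, sketched here, compares the symbol addresses of two builds of the same service. It assumes that the nm(1) output of each binary has first been saved to a text file; the helper name and the file names are hypothetical. Symbols with duplicate names, such as identically named static functions, are not distinguished, so the result is only a rough indication.

```shell
# Hypothetical helper: given two files with nm(1) output ("address type name"
# per line), report how many symbols present in both have different addresses.
count_moved_symbols() {
    awk '
        # First file: remember the address of every defined symbol.
        NF == 3 && FNR == NR { base[$3] = $1; next }
        # Second file: compare symbols that also exist in the first file.
        NF == 3 && ($3 in base) { total++; if (base[$3] != $1) moved++ }
        END { printf "%d of %d symbols moved\n", moved, total }
    ' "$1" "$2"
}

# Example use on the host:
#   nm <base binary of pm> > pm.base.nm
#   nm <rerandomized binary of pm> > pm.asr1.nm
#   count_moved_symbols pm.base.nm pm.asr1.nm
```

Undefined symbols have no address column in nm output and are therefore ignored by the NF == 3 condition.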
RS can be instructed to perform live updates through the service(8) command, specifically through its service update subcommand. This command is also used by the automated scripts. For a full overview of the command's functionality, please see the service(8) manual page as well as the command's output when it is run with no parameters.
In its most fundamental form, the service update command will update a running service, identified by its label, to a new on-disk binary file. It is however possible to tell RS to update the service into a copy of itself, and to influence the process using various flags and options. The basic syntax to perform a live update on a single system service is as follows:
minix# service [flags] update [self|<binary>] -label <label> [options]
Through various combinations of this command's parameters, MINIX3 basically supports four types of updates, representing increasingly challenging conditions for the overall live update infrastructure in general, and state transfer in particular. We will now go through all of them, and explain how they can be performed. Later on, the developers guide will provide a more in-depth explanation of the four types of updates.
The first update type is identity transfer. In this case, the service is updated to an identical copy of itself, with all functions and static data in the new instance located at the exact same addresses as the old instance. Identity transfer bluntly copies over entire sections at once, thus requiring no instrumentation at all. This makes it suitable for testing of the MINIX3-specific side of the live update infrastructure, hence its use in the testrelpol script. Identity transfer is the default of the service(8) command when “self” is given instead of a path to a new binary:
minix# service update self -label pm
This will perform an identity transfer of the PM service. Identity transfer should work for literally all MINIX3 system services. As mentioned, it is guaranteed to work only when the system was built with MKMAGIC=yes, although it will mostly work on systems built without magic support as well. It works regardless of whether the target service was instrumented with the magic framework (or ASR).
If the live update is successful, the service(8) command will be silent, but RS will print a system message that the update succeeded:
RS: update succeeded
If the system was started on qemu with OUT=F, this message will end up in serial.out. Otherwise, the message should show up in the system log (/var/log/messages) and possibly on the first console.
If the live update fails, RS should print an error to the system log, and service(8) will complain. In order to debug such failures, it may be useful to enable verbose mode in RS, by starting the system with rs_verbose=1 as shown earlier.
The second update type is self state transfer. Self state transfer also performs an update of a service into an identical copy of itself, but instead uses the state transfer functionality of the magic framework. Thus, self state transfer requires that the service be instrumented properly, and the update type can be used to test whether a service's state can be transferred without problems. Many of the things explained here also apply to the remaining two update types, as all three are using the state transfer of the magic framework.
Self state transfer is performed by supplying the -t flag along with “self” to the service update command:
minix# service -t update self -label pm
This command will perform self state transfer of the PM service. The libmagic state transfer routine in the new service instance will print additional system messages while it is running. Upon success, the system output will look somewhat like this:
total remote functions: 57. relocated: 54
total remote sentries: 186. relocated normal: 84 relocated string: 101
total remote dsentries: 5
st_data_transfer: processing sentries
st_data_transfer: processing dsentries
st_data_transfer: processing sentries
st_data_transfer: processing dsentries
st_state_transfer: state transfer is done, num type transformations: 0
RS: update succeeded
If the state transfer routine is not able to perform state transfer successfully, it will print messages that start with [ERROR]. RS will then roll back the service to the old instance, and both RS and service(8) will report failure. Self state transfer should succeed for all MINIX3 system services that have been built with bitcode and instrumented with libmagic and the magic pass. As of writing, there are no system services for which self state transfer is known to result in [ERROR] lines and subsequent live update failure. However:
Some services must be updated in a specific quiescence state, which is specified with the -state option. This currently applies to all services that make use of usermode threads, namely the VFS, ahci, and virtio_blk services. They must be updated using quiescence state 2 (request free) rather than state 1 (work free):
minix# service -t update self -label vfs -state 2
Omitting the appropriate state parameter may result in a crash of the service after live update. At the moment, the update_asr script has hardcoded knowledge about these necessary states. None of this is great, and we will be working towards a situation where the default state will not result in a crash.
The maximum time allowed for a live update can be set with the -maxtime option to service(8):
minix# service -t update self -label vfs -state 2 -maxtime 120HZ
The maximum time is specified in clock ticks by default, but may be given in seconds by appending “HZ” to the timeout. The latter may sound confusing and it is, but the original idea was supposedly that the number of seconds is multiplied by the system's clock frequency, also known as its HZ setting. The above command allows the live update of VFS to take up to two minutes.
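The relationship between the two notations can be made concrete with a little arithmetic. The sketch below converts a -maxtime argument to clock ticks; the helper name maxtime_ticks is hypothetical, and the clock frequency must be supplied by the caller, because this document does not state the HZ value of any particular system. The value 60 in the example is purely illustrative.

```shell
# Hypothetical helper: convert a -maxtime argument $1 to clock ticks,
# given the clock frequency $2 in ticks per second.
maxtime_ticks() {
    case "$1" in
        *HZ) echo $(( ${1%HZ} * $2 )) ;;  # "<seconds>HZ": seconds times HZ
        *)   echo "$1" ;;                 # plain number: already in ticks
    esac
}

# Example, with an assumed frequency of 60 ticks per second:
#   maxtime_ticks 120HZ 60    prints 7200
#   maxtime_ticks 500 60      prints 500
```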
The third update type is ASR rerandomization. Like self state transfer, ASR rerandomization uses the magic framework to perform state transfer. In this case, the service performs state transfer into a rerandomized version of the same service. This involves specifying the path to a rerandomized ASR binary to the service(8) command, as well as the -a flag. The -a flag tells the new instance to enable the run-time parts of rerandomization during the live update.
minix# service -a update /service/asr/1/pm -label pm
When a system has been built with ASR rerandomization, the (randomized) base service binaries are located in /service and the (randomized) alternative service binaries are located in numbered subdirectories in /service/asr. As mentioned before, the update_asr(8) command can be used to perform these updates semi-automatically.
ASR rerandomization comes with one extra restriction: the VM service cannot be subjected to more complicated forms of state transfer than self state transfer. It is also skipped by the update_asr(8) command for this reason. We will explain the restrictions regarding the VM service in the developer section.
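To illustrate how these rules combine, the following sketch prints (but does not run) the service(8) command for rerandomizing one service. It is not the actual update_asr implementation. The helper name asr_update_cmd is hypothetical, and the sketch assumes that a service's label equals the name of its binary, which need not hold for every service.

```shell
# Hypothetical helper: print the service(8) command that would update the
# service labeled $1 to its rerandomized binary in /service/asr/$2.
asr_update_cmd() {
    label=$1
    round=$2
    case "$label" in
        vm) return 0 ;;                            # VM is skipped entirely
        vfs|ahci|virtio_blk) state=" -state 2" ;;  # users of usermode threads
        *) state= ;;
    esac
    echo "service -a update /service/asr/$round/$label -label $label$state"
}

# Example use:
#   for s in pm vfs vm; do asr_update_cmd "$s" 1; done
# prints:
#   service -a update /service/asr/1/pm -label pm
#   service -a update /service/asr/1/vfs -label vfs -state 2
```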
The final update type is a functional update. Compared to self state transfer, ASR rerandomization relocates code and more data. However, there are fundamentally no differences between the old and the new version of the service. In the case of a functional update, the service performs state transfer into a new program. While typically highly similar, the new program may be different from the running service in various ways.
In terms of the service(8) command, such functional updates can be performed by simply using service update with a new binary. For example, one could test a new version of the UDS (UNIX Domain Sockets) service, without installing it into /service yet, and without affecting its open sockets:
minix# service update /usr/src/minix/net/uds/uds -label uds
The fact that this time there may be actual differences between the old and new versions of the services adds an extra dimension to the state transfer issue. Additional state transfer failures can be expected in this case, and must be dealt with accordingly. The developers guide will eventually elaborate on this point.
Similarly, depending on the nature of the update, the update action may require a specific state of quiescence. Taking UDS as an example, an update may change file descriptor transfers over sockets, in which case the update may impose that no file descriptors are being transferred at the time of the update. The old instance of the service must support this as a quiescence state. This state can then be specified through the -state option of the service update command.
Since the live update functionality is relatively new for MINIX3, we do not yet have much experience with the practical side of performing functional updates to services. This document will be expanded as we gain more insight into the common usage patterns of live update. Stay tuned!
From the user's perspective, updating multiple services at once is not much more complex than updating a single service. First, a number of service update commands should be issued, each with the -q flag:
minix# service -q -t update /service/pm -label pm
minix# service -q -t update /service/vfs -label vfs -state 2
Then, the entire update can be launched with the service sysctl upd_run command:
minix# service sysctl upd_run
The RS output will be much more verbose in this case. Timeouts are still to be specified on a per-service basis, rather than for the entire update at once. If necessary, any queued service update commands may be canceled with the upd_stop sysctl subcommand:
minix# service sysctl upd_stop
This will cancel the entire multicomponent live update action.
The host-side clientctl script in minix/llvm offers a number of additional convenient commands, mainly for developers. We list some of them here.
The buildboot command installs just the services that are part of the boot image. It can be used instead of clientctl buildimage when only boot-image services have been changed, thus speeding up the development cycle:
$ ./clientctl buildboot
Using this command, it is possible to make and test changes to boot system services fairly quickly. As an example, the following set of steps suffices to make and test changes to the PM service:
$ export PATH=$PATH:/home/user/minix-liveupdate/obj.i386/tooldir.{platform}/bin
$ cd minix-src/minix/servers/pm
[make changes to the PM source code]
$ nbmake-i386 all install
$ cd ../../llvm
$ C=pm ./relink.llvm magic
$ C=pm ./build.llvm magic
$ ./clientctl buildboot
$ OUT=F ./clientctl run
The unstack command shows a stacktrace of pretty much any MINIX3 binary in human-readable form:
$ ./clientctl unstack <name> [address [address ..]]
For example, to show a stack trace of the PM service in a human-readable form:
$ ./clientctl unstack pm 0x805a7fd 0x80492a5 0x8048050
Note that on ASR-enabled installations, the unstack command works only on the base versions of system services: there is currently no way to unstack a stacktrace for any of the ASR-rerandomized service binaries. On one occasion, the author of this document has done that process by hand, by finding the matching assembly code of an ASR-rerandomized service's crash site in the service's base version.
This part of the document provides in-depth information for developers. We start with a quick summary for system service developers, explaining how to support live update for a newly written service.
The rest of the developers guide is targeted towards people who maintain the live update infrastructure. We first cover some of the theoretical and practical aspects of the live update approach and infrastructure on MINIX3, and then elaborate on several practical aspects related to state transfer using the magic framework, including how to prevent and resolve state transfer issues.
In the vast majority of cases, allowing live update on a new service requires no action at all from the service developer. That is, if the service has been written properly, it can also be updated. Specifically, a service can be updated if it meets these conditions:
The first three points are required for all services in any case, and are not specific to live update. These points are therefore covered better on other pages, in particular those on programming device drivers on MINIX3 and the System Event Framework API (warning: currently outdated). We do explain the reason behind these three points in detail later on.
Only the fourth point is specific to live update, and is relevant only to a small subset of services. This point is covered in more detail in the “State transfer in practice” section below. Specifically, as a service developer, you will want to verify that your new service does not suffer from potential issues with long-running memory grants, userspace threads, and physically unmovable memory. Then, you will want to test self state transfer on your service, and resolve any state transfer errors that come up. Only in these cases does the SEF live-update API (that is, the sef_*_lu_*(3) calls) become relevant. We do not elaborate on the SEF API in this document.
We now describe the live update procedure in more depth. This overview forms the basis for the subsequent sections, which address particular aspects of the procedure in more detail.
A fundamental design property of any live update system is the granularity at which updates are applied. There is a spectrum of approaches here, ranging from directly changing bits in the memory image of the running component, to controlled state saving and shutdown of the entire component followed by the update and state restoration. The second approach requires more infrastructure to be in place, in particular to handle state transfer: transfer of all the internal state of the old version to the new version, possibly changing state on-the-fly as required for the live update. However, the first approach requires a more elaborate system to ensure quiescence, meaning that the component being updated is in an execution state that cannot lead to interference by the live update. For example, if a live update changes the implementation of a particular function, the component being updated must not be executing that function at the time of the live update. If it is, the live update will most likely be followed by a crash of the component. A typical live update system has some degree of support for both state transfer and quiescence.
In MINIX3, where services are processes, the live update system applies updates at the granularity of those (entire) processes. In order to allow for rollback in case the update fails, live updates are not in-place: the live update system effectively updates the current process instance of a service into a new process instance, while maintaining the illusion to the rest of the system that the service is really only one process. This approach to live updates requires state transfer between the two instances, and involves both link-time instrumentation and run-time tracking in order to identify the process state to transfer. State transfer is a complicated subject which we cover later in this document. However, the quiescence issue can be resolved fairly easily, by exploiting MINIX3's message-based nature. In essence, all the MINIX3 services consist of a main message loop that repeatedly receives a message and processes it. MINIX3 supports no kernel threads, and thus, the MINIX3 services have no internal CPU-level concurrency. As a result, a message can be used to enforce quiescence. We will now elaborate on this by describing the general procedure of a live update.
MINIX3 live updates are orchestrated by the RS (Reincarnation Server) service. The administrator of the system first compiles a new version of the service into an executable on disk, and then instructs RS to update a particular running system service into the new version, through the service(8) utility. RS starts by loading the new version of the service as a new service process, without letting it run. Thus, there are temporarily two instances of the service: the 'old' instance, which is still running, and the 'new' instance, which contains the new code but not yet any of the necessary state.
RS then asks the old instance of the service to prepare to be updated, by sending a prepare request message to it. At the moment that the service receives and processes the preparation message, it is by definition in a known state, as it cannot also be doing something else at the same time. While this is a good start for quiescence, the service may have to meet additional requirements regarding its current activity, depending on the service and the type of live update. For example, a service that uses user-level threads may have to ensure that all those threads are quiescent as well. As another example, an update that affects message protocols may have to ensure that the service has no outstanding requests to other services using that protocol. The administrator provides the intended quiescence state for the live update when starting the update, and the service itself determines whether or not it is ready when handling the prepare message. If the service decides that it does not meet the given quiescence requirements, the live update is aborted.
However, if the old instance does meet the requirements, it acknowledges that it is ready by sending a ready message to RS, blocking on receipt of a reply from RS. Thus, the old instance is effectively stopped in a known state. In order to maintain the externally visible state (most importantly, the communication endpoint) of the service being updated, the process slots of the old and the new service instances are swapped. The new instance, now in the original process slot, is then allowed to run. Upon startup, the new instance finds out from RS that it is the new instance of an old, stopped process, and attempts to perform state transfer from this old process into itself.
State transfer requires transfer of all individual pieces of data from the old to the new process, possibly to a new location. In a nutshell, the MINIX3 state transfer approach relies on having a full view of all the individual pieces of data that make up the process, along with type information about the data, including for example structure layouts and types of pointers. For static data, this information is generated through static analysis performed at compile time, and included with the service binary. For dynamic data, the information is collected and maintained by the magic library which is attached to the service. The end result is that the state transfer framework knows about all global variables and functions, and at least in theory, for each pointer, what type of data the pointer points to.
This knowledge, in addition to full access to the memory of the old instance, allows the runtime state transfer procedure in the new instance to iterate over all data of the old process, recursively following any pointers it encounters, pairing each piece of data with the corresponding piece of data in the new process, and copying over the data while adjusting it as necessary for the new layout. In practice, the state transfer system may not be able to pair all pieces of data, or deal with all pointers. In that case, state transfer fails. Annotations in the service source code, as well as custom data transfer methods, can be provided in order to aid the state transfer process.
Regardless of whether state transfer succeeded or failed, the new instance sends the result of the state transfer to RS using an init request message. If state transfer succeeded, RS lets the new instance continue to run, and kills the process of the old instance. If the state transfer fails, RS again swaps the process slots of the old and the new instance, lets the old instance continue to run, and kills the new instance. In both cases, RS communicates the result to the service(8) utility as well, ultimately letting the system administrator know about the outcome of the live update.
For multicomponent live updates, all affected services are first brought into the ready state, after which they are all updated. Any service failing to get ready in the preparation phase will cause an abort of the entire update, and any service failing the state transfer phase causes a rollback of the entire update.
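The two-phase outcome described above can be summarized in a few lines of C. The following is only a minimal sketch of the decision logic, not RS code: the structure, the enumeration, and the function name are hypothetical and exist purely for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-service outcome flags; not RS's real data structures. */
struct svc_update {
    int ready_ok;     /* old instance reached the requested quiescence state */
    int transfer_ok;  /* new instance reported a successful state transfer */
};

enum update_result { UPDATE_OK, UPDATE_ABORTED, UPDATE_ROLLED_BACK };

/* Decide the outcome of a (multicomponent) update over n services. */
enum update_result decide_update(const struct svc_update *svc, size_t n)
{
    size_t i;

    assert(svc != NULL || n == 0);

    /* Preparation phase: every old instance must report that it is ready. */
    for (i = 0; i < n; i++)
        if (!svc[i].ready_ok)
            return UPDATE_ABORTED;      /* no process slots were swapped yet */

    /* Update phase: slots are swapped and new instances transfer state. */
    for (i = 0; i < n; i++)
        if (!svc[i].transfer_ok)
            return UPDATE_ROLLED_BACK;  /* swap back, kill the new instances */

    return UPDATE_OK;                   /* kill the old instances */
}
```

A single-service update is simply the case where n is 1.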
Updating the RS and VM components requires various deviations from the procedure sketched above. In addition, support for live updating the VM service is limited. We elaborate on these points later on.
For many parts of the story, it may be useful to take a look at the actual source code as well. In this section we give a quick overview of what parts of the source code are where. The LLVM instrumentation code is located in minix/llvm of the MINIX3 source code, along with the supporting scripts described in the users guide. The following relevant LLVM passes are located in minix/llvm/passes:
In addition to the passes, the following pieces of system functionality are especially important for live update:
minix/llvm/static/magic.
minix/lib/libsys/sef*.c files.
minix/servers/rs. RS uses live update functionality implemented in the kernel, located in minix/kernel, and VM, located in minix/servers/vm.
We now elaborate on various MINIX3-specific aspects that are important to understand regarding live update. We elaborate on the consequences of the quiescence approach for state transfer, the role of various process memory sections for state transfer, the two supported types of state transfer, and the exceptions to the general model for various core system services.
We describe the quiescence model in a bit more detail, in order to make two points: 1) the implementation of system services must follow a basic standard structure in order to allow for live update, and 2) the process stack is and can be disregarded for the purpose of state transfer. The following piece of pseudocode represents a simplified and flattened version of the general structure of each system service:
main:
    # initialization
    receive INIT message from RS
    if INIT message requests a FRESH start:
        perform fresh initialization
    if INIT message requests a LIVE UPDATE start:
        perform state transfer
    send result of performed action to RS

    # there should be nothing else here

    # the main message loop
    while true:
        receive message
        if message from RS and message is PREPARE:
            # for simplicity, we are always ready
            send READY message to RS
            receive response message from RS
            # if we get here, the live update has failed
            continue
        handle message
As can be seen, the service's initialization code starts by learning from RS what type of initialization it should perform. This can be either fresh initialization of the system service, or state transfer for the purpose of live update (for simplicity we disregard crash recovery). If the service is started initially, typically during system boot, it will perform the fresh initialization, thereby initializing global variables, performing initialization-only procedures, etcetera. In contrast, if a new service instance is started for the purpose of live update, it will skip the fresh initialization and instead perform state transfer from the old instance.
In practice, all interaction with RS is implemented in the System Event Framework (SEF) library that must be used by all system services. The service-specific actions such as the fresh initialization action are implemented as callbacks from SEF.
Thus, if the service has initialization code that is called outside of the “fresh initialization” procedure, for example at the “there should be nothing else here” point, then this code will also be called in case of a live update, possibly undoing the effects of the state transfer. Therefore, services must perform initialization only from the designated initialization routines; in the case of fresh initialization, this is a callback function provided to SEF using the sef_setcb_init_fresh(3) call. This covers point #1 from the start of this section.
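To make the pitfall concrete, the following sketch models a service with one piece of global state. It is a plain C simulation rather than SEF code: the variable, the functions, and the stray_init switch are all hypothetical, and state transfer is reduced to copying a single integer.

```c
#include <assert.h>

/* Hypothetical service state; in a real service this is a global variable. */
static int open_sockets;

enum init_type { INIT_FRESH, INIT_LIVE_UPDATE };

/* Fresh initialization: the only place that may set initial values. */
static void init_fresh(void)
{
    open_sockets = 0;
}

/* Stand-in for state transfer: take over the old instance's value. */
static void init_live_update(int old_open_sockets)
{
    open_sockets = old_open_sockets;
}

/*
 * Start a service instance. When stray_init is nonzero, the service also
 * initializes state outside the designated routine, which is the mistake
 * described above. Returns the state the main loop would start with.
 */
int start_instance(enum init_type type, int old_open_sockets, int stray_init)
{
    assert(old_open_sockets >= 0);

    if (type == INIT_FRESH)
        init_fresh();
    else
        init_live_update(old_open_sockets);

    if (stray_init)
        open_sockets = 0;   /* runs on live update too: undoes the transfer */

    return open_sockets;
}
```

With stray_init set, a live update start returns 0 even though the old instance had sockets open, which is exactly the silent loss of state that the designated initialization routines avoid.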
After either type of initialization, the service will enter the main message loop, where it will repeatedly receive a message and handle that message. If the received message is a prepare request from RS, then the service is about to be updated, and it sends a ready message to RS, blocking until it gets a response. If the live update is successful, this old instance will never get a response, and instead be killed.
As can be seen, in terms of stack usage, the execution path from main() to the point where the service gets blocked receiving the ready response from RS (let's call it the quiescence point) is short and simple. Even if the state transfer procedure modified the new instance's stack and program counter to continue from the quiescence point, the result would essentially be the same as not doing it: in both cases, the new service would end up at the start of the message loop. Therefore, the MINIX3 state transfer approach chooses to disregard the execution context of the old process, thus obviating the need to transfer the stack. This is viable only due to the well defined quiescence model.
However, it is possible that the functions leading up to the quiescence point, including the main message loop, have local variables on the stack that maintain long-running state. For example, the main() function could maintain a counter for the number of messages received so far. The values of such variables will be lost during the live update. If this were a major issue, the live update framework could be made to instrument the stack as well, but this could come at great cost since instrumenting only the stack of functions leading up to the quiescence point would be difficult. In practice, not having essential long-running variables in main() is rather simple, and we have not seen problems so far. This covers point #2 from the start of this section.
The address space of a process is typically made up of various memory sections with different purposes, and MINIX3 system services are no different. There are important differences between various sections when it comes to state transfer.
For a live update of the VM service, the last two points are different. We describe the exceptions for VM in a later section.
With this situation as a given, MINIX3 allows for two forms of state transfer: identity transfer, and state transfer by the magic framework. These forms of state transfer are covered in the next two sections.
The simple case is identity transfer. Identity transfer is a minimal approach to transfer state from an old instance to a new instance of exactly the same service, that is, a process with exactly the same address space layout and functionality. Identity transfer is also supported when the target service has not been instrumented, and in fact even when the system has not been compiled using LLVM bitcode at all.
Since the new instance is a newly started copy of the same service, it already has a text section that is identical to that of the old instance. As described, the stack section need not be transferred, and the mmap section is inherited automatically.
Therefore, the identity transfer is concerned with the data and heap sections only. The identity transfer procedure starts by copying over the old instance's entire data section to itself. This includes the variable that contains the size of the old instance's heap (_brksize). The identity transfer procedure then calls brk(2) to allocate a heap for itself which is just as large, and copies over the old instance's entire heap section to itself as well.
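The order of these steps matters: the heap size is only known after the data section has been copied. The sketch below models this in plain C. It is not the SEF implementation; the structures are hypothetical, realloc(3) stands in for brk(2), and the brksize field merely models the role of _brksize.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical model of a service instance; not the SEF implementation. */
struct data_section {
    size_t brksize;        /* models the heap-size variable in the data section */
    int    counters[4];    /* some other global state */
};

struct instance {
    struct data_section data;
    unsigned char *heap;   /* models the heap section */
};

/* Returns 0 on success, -1 if the new heap could not be allocated. */
int identity_transfer(struct instance *new_inst, const struct instance *old_inst)
{
    unsigned char *heap;

    assert(new_inst != NULL && old_inst != NULL);

    /* Step 1: copy the entire data section, including the heap size. */
    memcpy(&new_inst->data, &old_inst->data, sizeof(new_inst->data));

    if (new_inst->data.brksize == 0)
        return 0;          /* the old instance had no heap to copy */

    /* Step 2: grow the new heap to the same size (stands in for brk(2)). */
    heap = realloc(new_inst->heap, new_inst->data.brksize);
    if (heap == NULL)
        return -1;
    new_inst->heap = heap;

    /* Step 3: copy the entire heap section as is, without any analysis. */
    memcpy(new_inst->heap, old_inst->heap, new_inst->data.brksize);
    return 0;
}
```

No pointer is inspected or adjusted anywhere, which is why this approach only works when both instances have exactly the same address space layout.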
If the system is not built with MKMAGIC=yes, which means that _MINIX_MAGIC is not defined, then the mmap section of the process is not well delineated and may in fact overlap with other memory areas. This is intentional, as it ensures that for such a set-up, the address space layout of services is not unnecessarily restricted and services can use the full address space for, say, a page cache. However, as a result, some memory-mapped areas may not be mapped into the new process, possibly leading to segmentation faults. Therefore, even identity transfer is not expected to be reliable on a system not built with MKMAGIC=yes. Eventually, MINIX3 should be changed to use another approach for transferring memory-mapped regions to the new process altogether, which is either not based on ranges or not the default at all.
Identity transfer is implemented in the System Event Framework (SEF) as part of libsys.
The other case is state transfer by the magic framework. This type of state transfer is used by three of the update types covered in the users guide: self state transfer, ASR rerandomization, and functional update. This form of state transfer relies on the magic pass and library to implement instrumentation and runtime support for state transfer.
The magic framework's state transfer procedure transfers static data objects to the new instance one by one. In this context, an object may for example be one global variable. The actual transfer of an object is not a simple memory copy; it involves analyzing any pointers in the object and adjusting these pointers as appropriate to match the address layout of the new instance.
The state transfer procedure also transfers dynamic data objects (in the heap and mmap sections) of the old instance to the same location in the new instance one by one. In essence, the procedure recreates the heap and mmap sections during the state transfer, by allocating new heap or mapped memory for each dynamic object, and then transferring its actual contents, again including pointer analysis and adjustment. Here, one object is one piece of memory created by a call to malloc(3) or mmap(2), for example.
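The pointer adjustment step can be illustrated with a small sketch. It assumes that each object in the old instance has already been paired with an object in the new instance; the pairing structure and the function are hypothetical and much simpler than what libmagic does, since type information is left out entirely.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical pairing of one object in the old and the new instance. */
struct obj_pair {
    uintptr_t old_base;   /* address of the object in the old instance */
    uintptr_t new_base;   /* address of the paired object in the new instance */
    size_t    size;       /* object size in bytes */
};

/*
 * Translate a pointer value found in the old instance. Returns 0 and stores
 * the adjusted value if the pointer falls inside a known object; returns -1
 * for an unknown pointer, which would make state transfer fail.
 */
int relocate_pointer(const struct obj_pair *pairs, size_t n,
                     uintptr_t old_ptr, uintptr_t *new_ptr)
{
    size_t i;

    assert(new_ptr != NULL);

    for (i = 0; i < n; i++) {
        if (old_ptr >= pairs[i].old_base &&
            old_ptr - pairs[i].old_base < pairs[i].size) {
            /* Keep the offset within the object, change only the base. */
            *new_ptr = pairs[i].new_base + (old_ptr - pairs[i].old_base);
            return 0;
        }
    }
    return -1;
}
```

A pointer into the middle of an object keeps its offset, so interior pointers (for example, to a structure field) survive relocation as long as the containing object is known.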
Since MINIX3 already transfers the mmap section to the new instance automatically, the state transfer framework starts by unmapping all memory-mapped areas that it knows it will recreate. However, since some memory areas (the aforementioned memory-mapped I/O and DMA memory) cannot be recreated by the magic framework, these are not destroyed and recreated. These areas are called special, out-of-band memory.
Out-of-band memory is seen as opaque, unmovable memory, and ignored entirely for the purpose of state transfer. Thus, if a piece of out-of-band memory contains a pointer to a piece of memory that is not marked as out-of-band, this pointer will be missed during state transfer. For the aforementioned (MMIO and DMA) memory types, this is not a problem. The service has to tell the magic library about special memory areas; for the two common ways of allocating such memory, alloc_contig(3) and vm_map_phys(2), this is done automatically by libsys.
The default of inheriting the entire mmap section leads to the situation that if the magic framework misses any memory-mapped areas for any reason, these will effectively translate to a memory leak in the new instance. Currently, one such memory leak is addressed explicitly: the page directory that is allocated with mmap(2) internally by the libc malloc code.
The state transfer procedure may fail if its analysis is not successful, in which case the system will roll back to the old instance, and the live update fails. It is up to the programmer to deal with such problems. This may involve annotating source code, for example to instruct the state transfer procedure to ignore certain pointers, or to copy over certain data as is. It may involve adding special state transfer routines to libmagic, which deal with fundamentally problematic cases such as unions. In rare cases, it may involve adapting source code to avoid state transfer problems. We discuss all this in more detail later.
In the case of ASR rerandomization, not just the dynamic part, but also the static part of the address space will have objects that are relocated between the old and the new instance. In addition, ASR rerandomization permutes the order in which the old instance's dynamic objects are allocated in the new instance. Finally, the asr pass may insert padding which may expose wrong assumptions about alignment of various buffers. Thus, while live rerandomization is a security feature, in practice it may expose not only additional problems with state transfer, but also bugs in the service itself.
In the case of a functional update, the new instance may be fundamentally different from the old instance. Unlike the previous cases of state transfer, such live updates may involve functions and global variables that are added or removed, thus causing problems in the pairing part of the state transfer. The programmer may have to provide explicit state transfer routines in order to deal with these problems.
While MINIX3 allows all of its services to be updated, a small number of services require special exceptions to allow for live updates, because they are crucial to the live update process itself. These services are RS and VM. This section elaborates on the exceptions made for RS and VM, and explains why VM in particular cannot be updated arbitrarily.
TODO
MINIX3 has limited support for performing a live update of the VM (Virtual Memory) service. There are two reasons why VM is a special case. First, VM provides essential memory management and page fault handling functionality to the other system services. Thus, the live update must ensure that none of that functionality is required during the course of a live update that includes VM. Second, VM's core data structures include page tables. If these page tables are changed during a live update, it may be impossible to perform a proper rollback.
During normal operation, VM may allocate memory for itself. VM has both a heap and dynamic pages, implementing special local versions of brk(2) and mmap(2) to support this. In particular, page tables are stored in dynamically allocated memory, effectively in VM's mmap section.
For VM, the live update procedure must therefore include the transfer of such dynamic state from the old to the new VM instance. Since page tables cannot simply be copied, they are made visible to the new instance by mapping the old instance's dynamic memory ranges directly into the new instance's address space.
That means that any changes made to the dynamic data structures by the new instance (page tables included) become visible to the old instance after a rollback. However, the two instances do each have their own static memory (i.e., text and data sections, as well as a preallocated stack). Thus, any changes to dynamic memory made by the new instance would create a potential mismatch between the static and dynamic memory in the old instance after rollback.
Therefore, in order to allow for rollback, VM must not make any changes to its dynamic memory during the live update. That also means that it may not allocate memory during the live update, not for other processes and not for itself. This situation leads to the following exceptions:
In this section, we elaborate on some of the practical details of the state transfer of the magic framework, mainly aiming to allow developers to resolve real-world state transfer failures.
We do not get into most of the theoretical side of the state transfer, and we skip over many other practical details. Interested readers are advised to read the published work of Cristiano Giuffrida - see the “Further reading” section at the bottom of this document.
The magic framework keeps track of each static object of data using a sentry (“state entry”) data structure. The framework keeps track of each dynamic object of data using a dsentry (“dynamic state entry”) data structure, which itself has an embedded sentry data structure. The magic framework uses wrappers around memory allocation routines so that it can allocate extra memory to store the dsentry metadata right before the actual memory object. Finally, each special, out-of-band memory region is maintained in an obdsentry (“out-of-band dynamic state entry”) data structure. Since no extra memory can be allocated next to the actual memory object in this case, obdsentries themselves are (currently) stored as static data within libmagic's own state.
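The idea of storing metadata right before a dynamically allocated object can be sketched as follows. This is not libmagic's dsentry layout or its wrapper API: the union, its fields, and the demo_ function names are hypothetical, and the max_align_t member is only there so that the object behind the header stays suitably aligned.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical metadata header; libmagic's real dsentry layout differs. */
union demo_dsentry {
    struct {
        size_t size;     /* size of the user-visible object */
        int    type_id;  /* stand-in for the object's type information */
    } meta;
    max_align_t align;   /* keeps the object behind the header aligned */
};

/* Allocation wrapper: reserve room for the header, return the object. */
void *demo_malloc(size_t size, int type_id)
{
    union demo_dsentry *d = malloc(sizeof(*d) + size);

    if (d == NULL)
        return NULL;
    d->meta.size = size;
    d->meta.type_id = type_id;
    return d + 1;        /* the object starts right after the header */
}

/* Recover the metadata from an object pointer returned by demo_malloc(). */
union demo_dsentry *demo_entry_of(void *obj)
{
    assert(obj != NULL);
    return (union demo_dsentry *)obj - 1;
}

/* Matching free wrapper: release header and object together. */
void demo_free(void *obj)
{
    if (obj != NULL)
        free(demo_entry_of(obj));
}
```

This also shows why out-of-band regions need a different bookkeeping scheme: memory-mapped I/O or DMA memory offers no spare room in front of the object for such a header.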
The magic framework also uses the concept of a selement (“state element”), which is a particular element within a state entry; for example, it can be one particular field in a structure. State is transferred one element at a time; if the state transfer procedure encounters a problem, it will report about the particular state element causing the problem.
Each pointer is expected to point to a data type known to libmagic. All the possible data types that can be used by the service are enumerated through static analysis by the magic pass, and stored in a type table as part of the instrumented service. It may happen that one data type is cast to another, either in the source code of the service or as a result of the LLVM compilation and linking process. As a result, while the static analysis may conclude that a pointer is for one type, runtime state transfer may find that the pointer was (for example) allocated for another type. Normally, such mismatches would cause state transfer to fail, but casting makes this a legitimate case. As a result, the magic framework has the notion of compatible types: if type A is cast to type B anywhere, type A is marked as a compatible type for type B, and finding type A when transferring data of supposed type B will not result in state transfer failure. The magic pass adds a list of compatible types to the service binary as well, all for use by libmagic at state transfer time.
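A compatible-type check of this kind might look as follows. The table layout, the integer type identifiers, and the function name are hypothetical; the sketch also assumes that the relation is one-directional (A is accepted where B is expected, but not the reverse), which follows the wording above but has not been checked against libmagic.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical table entry: found_type was cast to expected_type somewhere. */
struct compat_pair {
    int found_type;
    int expected_type;
};

/* Returns 1 if data of type 'found' may be transferred as type 'expected'. */
int type_accepted(const struct compat_pair *table, size_t n,
                  int found, int expected)
{
    size_t i;

    assert(table != NULL || n == 0);

    if (found == expected)
        return 1;           /* exact match, no table lookup needed */

    for (i = 0; i < n; i++)
        if (table[i].found_type == found &&
            table[i].expected_type == expected)
            return 1;       /* legitimate mismatch caused by a cast */

    return 0;               /* real mismatch: state transfer fails */
}
```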
The analysis part of state transfer may not always succeed, for a variety of reasons. In particular, the state transfer framework has problems with unknown pointers, unions, and more generally cases of ambiguity. In many cases, the issue can be resolved by the programmer through annotation in source code, which instructs the state transfer framework what to do with a particular variable. A variable can be annotated by prefixing either its type name (through typedef) or its variable name with the annotation prefix. The following annotation prefixes are supported by the magic framework.
noxfer_: No Transfer. This annotation will prevent transfer of the state altogether, instead zeroing out the memory in the new instance. As an example, the noxfer annotation can be used in cases where analysis is failing (e.g., in unions) and the new instance will never be using the old instance's data anyway. A practical example where it is used is the message type. This data type contains a complicated union, and the quiescence model ensures that transferring this state is not necessary: the service being updated should never be involved in processing a message at the time of the update.
ixfer_: Identity Transfer. This annotation will copy the data over as is, without performing analysis on the memory. As an example, the ixfer annotation can be used for pointer values that should not be analyzed as pointers, for instance because they are pointers into another address space. A practical example where it is used is a process table copied in from another service. Such process tables typically contain remote pointers, which will be unused by the local service. Some other values may still be needed after state transfer, which is why ixfer is used rather than noxfer.
cixfer_: Conditional Identity Transfer. This annotation will cause the state transfer framework to try to interpret and transfer the value as a pointer, and fall back to identity transfer if this fails. As an example, the cixfer annotation can be used for variables which may contain either a pointer or a number value which is never a valid pointer, making the variable effectively a union of the two types. A practical example where it is used is a callback value, which is of type void * but may be used to store a small integer as well.
pxfer_: Pointer Transfer. This annotation forces a non-pointer value to be interpreted as a pointer, and transferred accordingly. As an example, the pxfer annotation may be used when a pointer value is stored in an integer type. As of writing, this annotation is not used in practice.
sxfer_: Structure Transfer. This annotation forces a union that consists of structures, to be interpreted as one single structure, and transferred accordingly. The annotation requires that the fields of the structures making up the union all line up. For example, if the first field of one structure in the union is an integer value, then the first field of all other structures in the union must be an integer value as well. If the second field is a pointer in one structure, it must be a pointer in all of them. As an example, the sxfer annotation can be used to resolve state transfer issues with unions that consist of nearly-identical structures. The programmer must line up the structure's fields as appropriate when annotating the union as sxfer.
The transfer exception is applied to the type (or variable) with the annotation. For example, a noxfer typedef for a pointer to a structure will refrain from transferring that pointer:
typedef struct foo * noxfer_foo_ptr_t; /* annotate the pointer */
struct foo my_foo_struct; /* the structure will be transferred */
noxfer_foo_ptr_t my_foo_pointer = &my_foo_struct; /* the pointer will not be transferred */
However, a pointer to a noxfer typedef'ed structure will be transferred; the contents of the structure will not:
typedef struct foo noxfer_foo_t; /* annotate the structure */
noxfer_foo_t my_foo_struct; /* the structure will not be transferred */
noxfer_foo_t * my_foo_pointer = &my_foo_struct; /* the pointer will be transferred */
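The other prefixes can be applied through typedef in the same way. The declarations below are only a sketch: all type and field names are made up for illustration, and whether a union is best annotated through a typedef exactly like this is an assumption. The helper function checks the lining-up requirement that the sxfer annotation places on the structures in the union.

```c
#include <assert.h>
#include <stddef.h>

/* ixfer_: copy as is; the pointer refers to another address space. */
typedef void *ixfer_remote_ptr_t;

/* cixfer_: try to transfer as a pointer, fall back to copying as is. */
typedef void *cixfer_callback_arg_t;

/* sxfer_: a union of structures whose fields line up one to one. */
struct req_a { int id;   char *buf;  };
struct req_b { int code; char *name; };
typedef union {
    struct req_a a;
    struct req_b b;
} sxfer_req_u_t;

/* Hypothetical service state using the annotated types. */
struct demo_state {
    ixfer_remote_ptr_t    remote_addr;  /* never dereferenced locally */
    cixfer_callback_arg_t cb_arg;       /* pointer or small integer */
    sxfer_req_u_t         req;          /* transferred as one structure */
};

/* Verify that integer fields and pointer fields sit at the same offsets. */
void check_req_layout(void)
{
    assert(offsetof(struct req_a, id) == offsetof(struct req_b, code));
    assert(offsetof(struct req_a, buf) == offsetof(struct req_b, name));
    assert(sizeof(struct req_a) == sizeof(struct req_b));
}
```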
It is possible to enable debugging flags in libmagic such that it will print more details on how it handles annotated exceptions: in minix/llvm/include/st/callback.h, change ST_CB_DEFAULT_FLAGS from (ST_CB_PRINT_ERR) to (ST_CB_PRINT_ERR|ST_CB_PRINT_DBG). The debugging statements will be sent to the system log, and have a [DEBUG] prefix.
Custom state transfer routines can be used in cases where annotation does not suffice.
TODO
There is currently one example case where a custom state transfer routine is used, namely for the dsi_u union in the struct data_store structure which is used by the Data Store (DS) service and defined in minix/servers/ds/store.h. The custom state transfer routine is located in minix/llvm/static/magic/minix/magic_ds.c, and basically provides the state transfer process with information as to which of the union elements should be used for transfer.
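The idea of selecting a union element can be illustrated with a minimal sketch. It assumes a simplified, hypothetical store entry with an explicit tag field; it is not the real struct data_store, and it does not use the libmagic callback interface through which the actual routine in magic_ds.c is registered.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a store entry. */
enum entry_kind { ENTRY_U32, ENTRY_MEM };

struct mem_ref { void *data; size_t length; };

struct store_entry {
    enum entry_kind kind;        /* tag: which union member is live */
    union {
        unsigned int u32;        /* plain value: can be copied as is */
        struct mem_ref mem;      /* contains a pointer: needs pointer transfer */
    } u;
};

/* The decision a custom routine hands to the state transfer process. */
enum xfer_choice { XFER_AS_VALUE, XFER_AS_MEM };

static enum xfer_choice choose_union_member(const struct store_entry *e)
{
    assert(e != NULL);
    return e->kind == ENTRY_MEM ? XFER_AS_MEM : XFER_AS_VALUE;
}
```

Without such a hint, the framework has no way to tell whether the bytes of the union hold a number or a pointer that must be adjusted.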
In some cases, small adjustments must be made to a service in order to prevent issues with state transfer. These issues will not result in failure of the state transfer procedure; instead, they may result in a crash of the new instance after a seemingly successful live update. We cover three topics: memory grants, userspace threads, and physically unmovable regions.
One potential issue concerns memory grants. Each service has a memory grant table, which is an array of all the memory grants that allow other processes to read and/or write the service's memory. If the service has any grants active at the time of a live update, the grants should in theory be adjusted corresponding to any relocation of the memory pointed to by the grants.
However, the main union of the grant structure (cp_grant_t, defined in minix/include/minix/safecopies.h) is currently marked as ixfer, meaning it will be transferred as is. This is not a problem for grants that point to memory outside the process being updated, and that means that indirect and magic grants pose no problem for state transfer. It is however a problem for grants that point to memory inside the process being updated, that is, for direct grants.
For this reason, the writer of any service that may potentially have direct grants active at the time of the live update has two options: 1) implement a custom state transfer routine for the cp_grant_t structure in libmagic, thus resolving the problem described in this section altogether, or 2) block live updates of the service whenever the service has active memory grants. Since the first option is preferable, we do not describe the second option in detail here. Instead, see the next section regarding blocking updates using custom quiescence states. In any case, the potential consequence of doing neither is that the service ends up suffering from arbitrary memory corruption after a live update, since the transferred direct grant will point to the wrong memory location.
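The difference between transferring a direct grant as is and adjusting it for relocation can be sketched as follows. The structure and function names are hypothetical stand-ins; the real cp_grant_t has a different layout, and a real custom state transfer routine would obtain the new location from libmagic rather than from explicit base addresses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a direct grant: a range in the granter's own memory. */
struct direct_grant { uintptr_t addr; size_t len; };

/* Identity transfer: copied verbatim, so addr still refers to the old location. */
static struct direct_grant xfer_identity(struct direct_grant g)
{
    return g;
}

/* Relocating transfer: shift addr by the distance the underlying object moved. */
static struct direct_grant xfer_relocate(struct direct_grant g,
    uintptr_t old_base, uintptr_t new_base)
{
    assert(g.addr >= old_base);
    g.addr = new_base + (g.addr - old_base);
    return g;
}
```

With identity transfer, a grant into a buffer that has moved keeps pointing at the buffer's former address, which is the source of the memory corruption described above.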
The live update system itself actually relies on the presence of a long-running direct grant, which provides access of the process's full address space to the process itself. This grant is used during a live update by the new instance to access the memory contents of the old instance. Since the grant provides access to the process's entire address space, it does not suffer from the problem above.
Userspace threads pose a problem for state transfer as well. We have previously explained that the process stack of the old instance can be disregarded by the state transfer procedure because it is “naturally” recreated in the new instance. The same does not apply to the stacks of userspace threads, since variables on the stack are not tracked at run time: even though the threads' stacks are transferred to the new instance by the magic framework, they are seen as blobs of (typically) memory-mapped character arrays. The result is that any pointers on these stacks will not be known to libmagic and thus not be transferred properly. In addition, thread context (CPU register) state will typically be stored as an array of integers, and similarly end up being skipped by the state transfer procedure. The result is that while state transfer may (appear to) succeed, the service will crash after completion of the live update.
At this time, the recommended solution is for the service to shut down all threads explicitly before starting state transfer, and to recreate the threads both after successful live update and as part of a rollback. The service may refuse to be updated if any of its threads are in use and cannot be shut down. The last point requires that the service supply a custom callback routine to SEF to perform that check for a quiescence state other than the default, through sef_setcb_lu_prepare(3). In order to allow the use of a nondefault state, a sef_setcb_lu_state_isvalid(3) callback routine must be supplied as well. For VFS and libblockdriver, we have chosen the following approach:
  * sef_setcb_lu_state_isvalid(sef_cb_lu_state_isvalid_standard) is used to mark all standard quiescence states as valid, including SEF_LU_STATE_REQUEST_FREE and SEF_LU_STATE_PROTOCOL_FREE.
  * The callback routine supplied through sef_setcb_lu_prepare(3): for the SEF_LU_STATE_REQUEST_FREE or the SEF_LU_STATE_PROTOCOL_FREE state, the function will first check whether all worker threads are idle. If they are not, it will return a failure, aborting the live update. If they are, it will shut them down and report success.
  * [...] SEF_LU_STATE_REQUEST_FREE or SEF_LU_STATE_PROTOCOL_FREE, the function recreates the worker threads. This ensures that worker threads are recreated in the old instance upon a rollback.
  * [...] info→prepare_state was either SEF_LU_STATE_REQUEST_FREE or SEF_LU_STATE_PROTOCOL_FREE, then it continues by recreating the worker threads. This ensures that the new instance has worker threads before it leaves the state transfer phase.
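The logic of this approach can be sketched in isolation as follows. This is a simplified model, not the VFS or libblockdriver code: the state numbers are stand-ins that mirror the -state values of service(8) mentioned in this section, the worker bookkeeping is hypothetical, and the functions are not wired to the real SEF callback signatures.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in state numbers; the real SEF_LU_STATE_* constants live in the SEF headers. */
enum { STATE_WORK_FREE = 1, STATE_REQUEST_FREE = 2, STATE_PROTOCOL_FREE = 3 };

#define NR_WORKERS 4

static bool workers_running = true;     /* hypothetical: do worker threads exist? */
static bool worker_busy[NR_WORKERS];    /* hypothetical: is worker i handling work? */

static bool is_thread_safe_state(int state)
{
    return state == STATE_REQUEST_FREE || state == STATE_PROTOCOL_FREE;
}

static bool all_workers_idle(void)
{
    for (int i = 0; i < NR_WORKERS; i++)
        if (worker_busy[i])
            return false;
    return true;
}

/* Prepare step: returns 0 on success, -1 to abort the live update. */
static int lu_prepare(int state)
{
    assert(workers_running);            /* workers exist until we shut them down */
    if (!is_thread_safe_state(state))
        return 0;                       /* other states: threads are left alone here */
    if (!all_workers_idle())
        return -1;
    workers_running = false;            /* shut down all worker threads */
    return 0;
}

/* Run in the old instance upon rollback, and in the new instance during its
 * live update initialization, with the state the update was prepared for. */
static void lu_restore_workers(int prepared_state)
{
    if (is_thread_safe_state(prepared_state))
        workers_running = true;         /* recreate the worker threads */
}
```

The sketch also shows why the default state is unsafe for such a service: for a state other than the two handled here, the threads are never shut down, so their stacks are transferred as untyped memory.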
The result of this approach is that updates must be invoked with -state 2 (request free) or -state 3 (protocol free) in order to guarantee proper state transfer. As a sidenote, none of these issues are a problem for identity transfer, which should continue to work even with -state 1 (work free, the default).
Another case where the programmer may have to ensure that state transfer does not result in problems that will surface only after the live update, is when a service uses memory areas that are physically unmovable. Such memory areas are typically in use for DMA purposes. If the state transfer procedure changes the physical location of the buffers, DMA may be performed from or to the original physical location, resulting in garbage and possibly arbitrary memory corruption. Such DMA areas must be marked as special out-of-band memory in libmagic, and unmarked when freed, using the sef_llvm_add_special_mem_region(3) and sef_llvm_del_special_mem_region(3) SEF calls. This is done automatically by the alloc_contig(3) and free_contig(3) wrapper routines, but must be done explicitly for memory allocated in different ways.
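The bookkeeping behind marking and unmarking can be modeled with a small sketch. The functions below are hypothetical stand-ins with made-up names and signatures; in a real service, the marking is done through the sef_llvm_add_special_mem_region(3) and sef_llvm_del_special_mem_region(3) calls named above, whose exact signatures are not reproduced here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_SPECIAL 8

struct special_region { uintptr_t start; size_t len; bool used; };
static struct special_region special[MAX_SPECIAL];

/* Stand-in for marking a region; returns 0 on success, -1 if the table is full. */
static int mark_special(void *addr, size_t len)
{
    assert(addr != NULL && len > 0);
    for (int i = 0; i < MAX_SPECIAL; i++) {
        if (!special[i].used) {
            special[i] = (struct special_region){ (uintptr_t)addr, len, true };
            return 0;
        }
    }
    return -1;
}

/* Stand-in for unmarking a region; returns 0 on success, -1 if it was not marked. */
static int unmark_special(void *addr)
{
    for (int i = 0; i < MAX_SPECIAL; i++) {
        if (special[i].used && special[i].start == (uintptr_t)addr) {
            special[i].used = false;
            return 0;
        }
    }
    return -1;
}

/* True if state transfer must leave the memory at this address where it is. */
static bool is_special(const void *p)
{
    uintptr_t a = (uintptr_t)p;
    for (int i = 0; i < MAX_SPECIAL; i++)
        if (special[i].used && a >= special[i].start &&
            a - special[i].start < special[i].len)
            return true;
    return false;
}
```

A service that allocates DMA memory by other means than alloc_contig(3) would pair every allocation with a marking call and every release with an unmarking call, in the way that mark_special and unmark_special are paired here.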
However, this is only necessary if DMA can happen across a live update. In cases where it is known that no DMA can possibly be ongoing during the live update, the regions are not actually physically unmovable, and therefore need not be marked as such. For example, this is the case for the file system buffer cache implemented in libminixfs. This library allocates and manages buffers without using physically contiguous memory and alloc_contig(3), instead using mmap(2) directly and requesting DMA I/O in page-sized chunks (in order to avoid DMA issues on ARM). Therefore, it would be affected by the above problem, were it not for the fact that all its block I/O calls are synchronous.
As we mentioned before, memory-mapped I/O poses a similar problem. However, the only way to map such I/O memory is currently through the vm_map_phys(2) and vm_unmap_phys(2) calls, of which the libsys wrappers automatically call the special-memory marking/unmarking functions as well.
As mentioned previously, the state transfer routine reports detected failures, with details written to the system log using an [ERROR] prefix. In this section, we cover a number of common reasons for state transfer to fail in practice, including some example errors and workarounds.
In order to know how to transfer a piece of memory, the magic library must know about the data type associated with that piece of memory. If no type information is known for a piece of memory, it cannot be transferred. There are various reasons why libmagic might not have type information about a piece of memory. The simplest one is a case of a dangling pointer: a pointer that used to be valid at some point, but no longer is, because the memory itself has been deallocated. While the actual program may know not to use that particular pointer anymore, the state transfer routine does not have such knowledge. A typical error resulting from a dangling pointer may look like this, with some important parts of the output highlighted:
[...] <ctype.h> isalpha(3) (etc) set of macros from within libmagic. We worked around this issue by putting our own replacement set of macros in libmagic instead.
== Assembly code ==
Yet another case that leads to invisibility of certain aspects is the direct inclusion of assembly code. Assembly code is machine code, not bitcode, and thus, the bitcode instrumentation passes will have problems processing it. Needless to say, the use of assembly code should be minimal throughout the source code. In cases where it cannot be avoided, custom solutions will have to be found for any resulting state transfer problems. Fortunately, much of the assembly in use by services these days is the result of optimized str*(3) and mem*(3) functions, which require no special treatment for the purpose of state transfer.
== Incompatible types ==
Finally, we describe one class of state transfer failures which are the result of shortcomings in the magic instrumentation framework. LLVM bitcode has the notion of an opaque data type, which is the type used for data of which the type has been declared but not defined, typically as a result of forward declarations of structures. As a result, opaque pointers may show up in various places: instead of resolving these types after they have been instantiated, LLVM tends to cast between various data types which are identical except for the presence of opaque pointers. The magic pass should mark all these practically identical data types as compatible types, but due to the fact that the casts can take rather complex forms, this is not always happening. The result is that in some cases, state transfer may fail because libmagic erroneously detects an incompatibility between a pointer type and the type of data being pointed to. As an example, this is a state transfer error that was reported during state transfer of the PM service:
* [ERROR] uncaught ptr with violations. Current state element:
* SELEMENT: (parent=timers.515278380, num=1, depth=0, address=0xdfb760a8, name=timers.515278380, type=TYPE: (id=96 , name=, size=4, num_child_types=1, type_id=10, bit_width=0, flags(ERDIVvUP)=01000000, values='', type_str={ $minix_timer tmr_next \2, tmr_exp_time i32/long unsigned int, tmr_func opaque*, tmr_arg { (U) $ixfer_tmr_arg_t ta_int i32/int } }*))
* SEL_ANALYZED: (num=1, type=ptr, flags(DIVW)=1110, value=0x08147460, trg_name=mproc, trg_offset=274616, trg_flags(RL)=D0, trg_selements=(#2|0: 1|o=SELEMENT: (parent=mproc, num=0, depth=0, address=0x08147460, name=mproc/143/mp_timer, type=TYPE: (id=38 , name=minix_timer, size=16, num_child_types=4, type_id=9, bit_width=0, flags(ERDIVvUP)=00000000, values='', names='minix_timer_t|minix_timer', type_str={ $minix_timer tmr_next { $minix_timer tmr_next \2, tmr_exp_time i32/long unsigned int, tmr_func hash_3792421438/*, tmr_arg { (U) $ixfer_tmr_arg_t ta_int i32/int } }*, tmr_exp_time i32/long unsigned int, tmr_func hash_3792421438/*, tmr_arg { (U) $ixfer_tmr_arg_t ta_int i32/int } })), 2|o=SELEMENT: (parent=mproc, num=0, depth=0, address=0x08147460, name=mproc/143/mp_timer/tmr_next, type=TYPE: (id=37 , name=, size=4, num_child_types=1, type_id=10, bit_width=0, flags(ERDIVvUP)=00000000, values='', type_str={ $minix_timer tmr_next \2, tmr_exp_time i32/long unsigned int, tmr_func hash_3792421438/*, tmr_arg { (U) $ixfer_tmr_arg_t ta_int i32/int } }*))))
* SEL_STATS: (type=ptr, trg_flags(RL)=D0, ptr_found=1, other_types_found=1, violations=1)
In this case, the analysis failed on the global timers variable. The analysis dump shows that two matching types (#2) were found, both associated with the mproc[143].mp_timer structure field, but neither type matched the type of the pointer. A closer look at the textual representations of the pointer type (the type_str of the primary selement) and of the data types (the type_str of the target selements) reveals that the only difference is that the tmr_func field of the structure type to which the timers variable should point is an opaque pointer, whereas the same tmr_func field of the target structures is a particular function pointer (to a function referred to as hash_3792421438). The remainder of the types are the same.
As an aside, the \n notation indicates type recursion of the type n levels up. The asterisk at the end of a { structure } block indicates a pointer to this structure. In this case, the timers variable is a pointer to a minix_timer_t structure. In the type string of timers, the \2 after the tmr_next field indicates that it is again a pointer to minix_timer_t: one type level up is the structure itself, two type levels up is the pointer to the minix_timer_t structure. In this case there are no three levels up, but in other cases three levels up could for example be a pointer to a pointer to the structure, etcetera.
The analysis fails because it finds different, noncompatible, and therefore other types, even though the opaque pointer and the function pointer are really the same field types. A look at the corresponding declarations in minix/include/minix/timers.h shows that there is indeed a forward declaration of struct minix_timer which is the cause of LLVM's link-time introduction of casts. We resolved this case by extending the casting analysis of the magic pass to include casts of structures through function prototypes.
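For reference, the type strings in the dump correspond to C declarations of roughly the following shape. This is a sketch reconstructed from the dump, not a copy of timers.h: the parameter list of tmr_func_t is an assumption, the union shows only the ta_int member that appears in the dump, and the push_timer helper is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

struct minix_timer;                                  /* the forward declaration */
typedef void (*tmr_func_t)(struct minix_timer *tp);  /* assumed signature */

typedef union {
    int ta_int;                                      /* only member shown in the dump */
} ixfer_tmr_arg_t;

typedef struct minix_timer {
    struct minix_timer *tmr_next;    /* "\2" in the type string of timers */
    unsigned long tmr_exp_time;
    tmr_func_t tmr_func;             /* "opaque*" versus "hash_.../*" in the dump */
    ixfer_tmr_arg_t tmr_arg;
} minix_timer_t;

static minix_timer_t *timers;        /* list head, like the global in the dump */

/* Hypothetical helper: insert a timer at the head of the list. */
static void push_timer(minix_timer_t *tp, unsigned long exp_time, tmr_func_t func)
{
    assert(tp != NULL);
    tp->tmr_next = timers;
    tp->tmr_exp_time = exp_time;
    tp->tmr_func = func;
    tp->tmr_arg.ta_int = 0;
    timers = tp;
}
```

In this shape, the function pointer type is declared while struct minix_timer is still incomplete, which matches the situation the text describes: the structure is used through a function prototype before it is defined.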
The following example also resulted from the same forward declaration of MINIX3 timer structures, this time in the sched (scheduling) service:
* [ERROR] uncaught ptr with violations. Current state element:
* SELEMENT: (parent=sched_timer.29458437, num=4, depth=1, address=0xdfbe70b0, name=sched_timer.29458437/tmr_func, type=TYPE: (id=17 , name=tmr_func_t, size=4, num_child_types=1, type_id=10, bit_width=0, flags(ERDIVvUP)=00000000, values='', type_str=opaque*))
* SEL_ANALYZED: (num=4, type=ptr, flags(DIVW)=1110, value=0x08048dc0, trg_name=balance_queues.29458437, trg_offset=0, trg_flags(RL)=T0, trg_selements=(#1|0: 1|o=SELEMENT: (parent=???, num=0, depth=0, address=0x08048dc0, name=???, type=TYPE: (id=119 , name=, size=1, num_child_types=0, type_id=4, bit_width=0, flags(ERDIVvUP)=11000001, values='', type_str=hash_3792445575/))))
* SEL_STATS: (type=ptr, trg_flags(RL)=T0, ptr_found=1, other_types_found=1, violations=1)
In this case, the type mismatch was not between two structures that differed in opaque fields, but between two function pointers themselves: the function pointer in sched_timer.tmr_func, and the function it is pointing to, balance_queues. Registering these types as compatible would result in much more complexity in the magic pass, and likely still not resolve the more general problem of opaque pointers. This is currently one of the open issues, and we believe that another approach would be more viable; see below. In this particular case, it turned out that the sched service did not need to use timers at all, and we simplified it by getting rid of its timer altogether. Obviously, adapting the actual functionality of a service to allow for state transfer is not always an option, nor is it generally the right approach: the core code of system services should not have to be (re)written specifically to allow for state transfer.
===== Open issues =====
In this section, we describe what we believe are currently the main open issues related to live update. For most issues, no active development is ongoing. We therefore invite any interested parties to work on resolving these issues, and welcome both inquiries and updates regarding the current status on both the MINIX3 newsgroup and info@minix3.org.
This section is roughly sorted by order of importance, starting with the most important issues.
==== The build system ====
As shown in the setup part of the users guide, the entire live update build infrastructure consists of a separate set of scripts built on top of the regular build system. These scripts deviate from the standard build system approach in various ways, for example by building a separate copy of the LLVM toolchain, placing binaries in the MINIX3 source tree, and separately performing scripted steps which should be performed from the regular makefile infrastructure instead. All of these issues should be resolved through proper integration of the live update build infrastructure into the regular build system.
=== Two LLVM toolchains ===
A major part of the problem is the current necessity to build a separate instance of the LLVM toolchain. This is the instance that is suitable for Link-Time Optimization (LTO), built by minix/llvm/generate_gold_plugin.sh, and placed in obj_llvm.i386. Even though the exact same LLVM 3.4 source code is used to compile both this and the additional regular crosscompilation toolchain in obj.i386, the separate compilation is necessary because of a problem with makefiles.
NetBSD uses its own set of makefiles to build imported code using its own build system. MINIX3 imports this system, and thus also uses the NetBSD set of makefiles to build the LLVM toolchain. The problem is that these makefiles do not operate in the same way as LLVM's own set of makefiles, resulting in certain parts of the LLVM toolchain not being built in the same way. The separate LLVM LTO toolchain build does use LLVM's own makefiles, thereby generating some missing pieces that are required for the live update instrumentation.
The solution here is to adapt the NetBSD set of makefiles to build LLVM in a way that is closer to LLVM's own makefiles, thereby generating all the missing pieces without the need to build LLVM twice.
=== Lack of integration ===
Once that step has been taken, it should be possible to resolve the other issues as well, effectively replacing all the *.llvm scripts in minix/llvm with extensions in the regular build system, specifically by adapting the share/mk set of makefiles as appropriate. All of this should be optional, controlled by the MKMAGIC build (pseudo)variable and possibly other, new build variables. Ultimately, relinking with libmagic and invoking the appropriate link-time passes should be performed by those makefiles. As an example, the WeakAliasModuleOverride pass is already invoked this way.
All passes, as well as the magic library, should be (re)built as part of the standard build system infrastructure. As we have indicated earlier, the lack of this step puts an unnecessary burden on the user of the system.
As part of this, none of the generated binaries should be placed in minix/llvm/bin. Instead, they should end up in an appropriate subdirectory of obj.i386, thereby keeping the source directory clean.
Finally, any generated ASR-rerandomized service binaries should automatically be removed when the corresponding service is reinstalled, so as to prevent stale ASR binaries from ending up in a generated image.
==== Instrumentation ====
A number of shortcomings in our instrumentation passes currently lead to potential problems at run time.
=== Type unification ===
As shown in the developers guide, the magic instrumentation pass is not always capable of establishing that two different data types are in fact compatible, resulting in state transfer errors at run time. The main cause of these issues lies in LLVM's use of the opaque placeholder data type.
This problem is a product of circumstances. Between LLVM 2.x and LLVM 3.x, a significant change was made in LLVM regarding the way that types are handled. In a nutshell, rather than unifying various instances of the same data type at compile time, LLVM 3.x keeps these instances as separate types, instead using bit casting between the types to resolve the resulting incompatibilities at link time. More details about this change can be found in the LLVM blog post "LLVM 3.0 Type System Rewrite" by Chris Lattner.
However, the magic framework was written for LLVM 2.x, and as a result, this problem was dealt with as an afterthought. The combination of the wildly varying forms that these bit casts can take, and the limited support for processing the bit casts in the magic pass, has resulted in the situation that not all cases of identical types are recognized as such.
As of writing, this is not yet a real problem, but it eventually will be.
We believe that the right solution would be a new type unification pass, which unifies all effectively-identical types in the module at link time, eliminating redundant types and bitcasts in the module. This pass would be run before the magic pass, thus resolving the original problem while also freeing the magic pass of the burden to provide a complete system for enumerating compatible types. As a beneficial side effect, there would be a reduction in the amount of type state that needs to be included with the service, and a reduction in effort needed by libmagic to search through compatible types.
=== Exceptions for libmagic ===
The instrumentation framework currently makes more exceptions than it should. In particular, the ASR pass exempts all of the magic library from rerandomization. This is highly problematic for the overall effectiveness of ASR: libmagic is in principle linked with all system services, thus providing any attacker with a well known, large, unrandomized set of code and data for use in an attack on any running service. The exact reasons as to why this exception was made are currently unknown. However, if possible, this overall limitation should be resolved by either removing the exception or at least narrowing it to the exact scope of the problem.
In addition, although less importantly, state transfer makes some exceptions based on name prefixes, and some of these name prefixes are overly broad. For example, it is not impossible that the current exception of the prefix st_ also ends up matching certain variables in the actual service. At the very least, all exception prefixes should start with magic_.
==== Memory management ====
The MINIX3 memory management, implemented in the VM service, currently has a number of significant limitations and missing features. Some of its problems are relevant for live update only. Other problems are merely becoming more visible as a result of enabling or using live update functionality.
=== Region transfer issues ===
A problem that we flagged earlier is that, for live update, the transfer of memory-mapped pages in particular requires these pages to be in a strictly delineated address range. This range may not overlap with the address ranges of any of the process's other sections. The range is hardcoded globally, and thus defined much more strictly than necessary for most service processes. Moreover, the definition indiscriminately affects all processes, including application processes. The result is that when the system is built with live update support, all processes are severely restricted in how much of their address space they can use for memory-mapped regions.
Another problem mentioned before, is the bulk transfer of all pages in the process's mmap section, regardless of whether the state transfer framework knows about them. This could easily lead to memory leaks due to transfer of untracked pages.
We believe that both points could be resolved with a system that does not automatically transfer memory-mapped pages from the old to the new instance, but rather performs such transfer on demand, so that the (identity or magic) state transfer routine can determine what memory to transfer.
=== Out-of-memory issues ===
MINIX3 currently does not deal well with running out of memory. Most system services do not have preallocation for pages in their heap, stack, and mmap sections. This may create major issues in low-memory situations. For example, if a service attempts to use an extra page of stack while the system has no available memory, the service will be killed. Beyond VM freeing cached file system data when it runs out of memory, any sort of infrastructure to deal with this general problem is completely absent.
Live update is making this situation even more problematic. The magic library uses more dynamic memory, and is not particularly careful about using preallocated memory when necessary. The ASR functionality increases memory usage, including the use of stack space through its stack padding feature. The result is that there is now an increasingly large number of scenarios where out-of-memory conditions result in failure of running system services, and possibly the entire system.
Even though certain services should be rewritten to deal more gracefully with cases of dynamic memory allocation failure, the example of faulting in stack pages shows that this is not a viable option in general. There has been a partial attempt to prepare the file system services' buffer caches for having their memory stolen by VM at run time, but its implementation is, where present, deeply flawed, and will likely be removed altogether soon. Instead, we believe that the easiest solution for this problem is to let VM reserve a limited amount of memory exclusively for satisfying page faults and page-handling requests involving memory in system services.
In the meantime, it can be expected that test64 of the MINIX3 test set, the test case that tests one particular scenario of running out of memory, will cause test or system failure in an increasing number of cases. It may have to be removed from the default set of tests in the short term.
In addition, MINIX3 does not deal well with running out of special memory. Some services require blocks of physically contiguous memory for DMA purposes. VM currently has no way to recombine fragmented blocks of free memory into contiguous ranges. Some services require memory that is located in the lower 1MB or 16MB of the system memory. The support in VM for obtaining memory in those ranges is very limited as well. Both of these cases may result in the inability of a system service to obtain its needed resources if it is not started immediately at system bootup time.
These problems are not particularly important for live update, since the new instance will inherit special memory from its old instance by default. They are important for crash recovery however, and they are known to cause failures in the testrelpol test set on occasion.
Finally, support for setting or enforcing page protection bits is mostly missing in VM as well. The live update integration has resulted in one particular case where this is now a problem. The MINIX3 userspace threading library, libmthread, inserts a guard page at the bottom of each thread stack in order to detect stack overruns. The guard page was created by unmapping the bottom page of the stack, thus leaving an unmapped hole there. This approach worked, but was not ideal: the hole could potentially be filled by a separate one-page allocation later, thereby subverting the intended protection.
Since libmagic performs extra memory allocations, this problem is a bit more relevant for live update. For this and other reasons, the libmthread code was changed to reallocate the guard page with PROT_NONE protection instead. Theoretically, this should work fine. In practice, since VM does not implement support for protection, the guard page is now simply an additional stack page. Thus, as of writing, the libmthread guard page functionality is broken.
Ideally, this issue would be resolved by implementing proper support for page protection in VM, including for example an implementation of mprotect(2).
We now list a number of other issues concerning the MINIX3 runtime infrastructure side of live update.
The case of userspace threads has shown that it may be necessary for certain services to provide their own handlers for checking, entering, and leaving a custom state of quiescence. Moreover, these services may crash if the default quiescence state is used for a live update instead of the custom state. The result is the requirement that both users and scripts, the update_asr(8) script in particular, be aware of specific services requiring a custom quiescence state. This is annoying and dangerous.
The default quiescence state is currently hardcoded in the service(8) utility, in the form of DEFAULT_LU_STATE in minix/commands/service/service.c. Instead, we believe that the service should be able to specify its own default quiescence state, possibly using an additional SEF API call. It is not clear whether RS would need to be aware of the alternative quiescence state. If not, the translation from a pseudo-state to the real state could take place entirely in the service's own SEF routines. If this approach does not work, it would also be possible to somehow expose each service's default state through the procfs per-service /proc/service/ files, so that at least scripts could add any custom -state options automatically.
While the following issue is relevant more for crash recovery than for live update, it is included here because it affects the infrastructure supporting the testrelpol script.
Each service effectively knows what its own crash recovery policy should be. Separately, procfs has a policy table with an entry for each service in minix/fs/procfs/service.c, exposing the same crash recovery policy information to userland, and the testrelpol script in particular. This is effectively redundant information.
Ideally, each service would communicate its policy to RS. That information can then be used by procfs to expose the policy information to userland, thus eliminating the redundancy.
If process A is being updated, process B should not make use of process A's grants, because those grants may temporarily be inaccessible, invalid, etcetera. The kernel currently has a simple way to enforce that rule, by responding to process B's safecopy kernel call with an ENOTREADY error response. The service-side libsys implementation of sys_*safecopy*(2) automatically suspends the calling service for a short while (using tickdelay(3)) and then retries the safecopy. This shortcut approach works, but it is not ideal, in particular because it could theoretically lead to starvation.
Instead, the kernel should block the caller of a safecopy kernel call for the duration of its target's live update procedure, retrying the safecopy operation and unblocking the caller only once the target is no longer being updated. A proper implementation of this functionality requires several cases to be covered: indirect grants, either the granter or the grantee being terminated or having its process slots swapped, etcetera. As a possible simplification, internally retrying the safecopy operation more than once would not be a problem, since the caller would simply remain blocked if the retried safecopy operation hits a case of live update again.
In a very specific scenario, the kernel performs a memory copy of the entire asynsend table between two processes of which the slots are being swapped. Although it is not yet clear which exact circumstances cause the need for this memory copy, the actual copy action relies on very specific conditions which are not validated before the copy action.
A rather long comment in minix/lib/libsys/sef_liveupdate.c elaborates on the specifics of this case, and suggests why RS may be the only affected service. If the comment is correct, it may be possible to engineer another solution for RS in particular, and remove the copy hack from the kernel.
A number of miscellaneous issues remain. The first, regarding performance, is relatively important. The other issues listed in this section are relatively minor.
The performance of various parts of the live update infrastructure, both at instrumentation time and (in particular) at run time, is poor. One of the effects is that in several cases, live update operations must be given a lenient timeout in order to succeed. In fact, state transfer currently takes too long to consider automatic runtime ASR rerandomization as a realistic option.
We have not yet looked into the causes of the poor performance. This is therefore a rather open-ended issue.
After running the testrelpol script a number of times in a row, it will start to fail on the crash recovery tests for unclear reasons. We know that this is a test script failure rather than an actual failure. We suspect that it is caused by RS's default exponential backoff algorithm for crash recovery causing timeouts in testrelpol. If that is the case, it should be possible to change testrelpol to disable the exponential backoff using existing service(8) flags.
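To illustrate the suspected mechanism, the sketch below assumes a backoff in which the restart delay doubles with every crash. The base delay, the cap, and all names are hypothetical; they are not taken from RS or from testrelpol. Under that assumption, a fixed test timeout is eventually exceeded once enough crashes have accumulated.

```c
/* Hypothetical parameters; these are not RS's actual values. */
#define SKETCH_BASE_DELAY 1u   /* delay before the first restart */
#define SKETCH_MAX_SHIFT  10u  /* the delay stops doubling after this */

/* Delay before restart number 'restarts', doubling each time up to a cap. */
static unsigned sketch_backoff_delay(unsigned restarts)
{
    if (restarts > SKETCH_MAX_SHIFT)
        restarts = SKETCH_MAX_SHIFT;
    return SKETCH_BASE_DELAY << restarts;
}

/*
 * Return the first restart count whose delay exceeds 'timeout',
 * or -1 if the capped delay never exceeds it.
 */
static int sketch_first_timeout(unsigned timeout)
{
    unsigned n;

    for (n = 0; n <= SKETCH_MAX_SHIFT; n++) {
        if (sketch_backoff_delay(n) > timeout)
            return (int)n;
    }
    return -1;
}
```

With these assumed values, a test that waits at most 5 time units would pass for the first three restarts and then start failing, even though crash recovery itself still works.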
The implementation of the magic library currently relies on asserts being enabled. We have changed its Makefile so that asserts should be enabled regardless of build system settings, but this is merely a workaround. Instead, libmagic should function properly (and, in particular, fail properly) regardless of whether asserts are enabled.
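As a rough illustration of what assert-independent checking could look like, the sketch below uses explicit checks that are not compiled away when NDEBUG is defined. SKETCH_CHECK and sketch_checked_index() are hypothetical names made up for this example; they are not part of libmagic.

```c
#include <stdio.h>
#include <stdlib.h>

/*
 * Hypothetical fatal check. Unlike assert(3), it stays active when
 * NDEBUG is defined, so the failure behavior does not depend on
 * build settings.
 */
#define SKETCH_CHECK(cond)                                        \
    do {                                                          \
        if (!(cond)) {                                            \
            fprintf(stderr, "check failed: %s (%s:%d)\n",         \
                #cond, __FILE__, __LINE__);                       \
            abort();                                              \
        }                                                         \
    } while (0)

/*
 * Hypothetical recoverable check: report invalid input to the caller
 * with an error code instead of asserting. On failure, *out is left
 * untouched.
 */
static int sketch_checked_index(const int *table, int size, int index,
    int *out)
{
    if (table == NULL || out == NULL || index < 0 || index >= size)
        return -1;

    *out = table[index];
    return 0;
}
```

The first form suits conditions that must never occur, while the second lets an operation such as state transfer fail cleanly so that the caller can react, for example by triggering a rollback.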
During live update and crash recovery, the following VM error may be seen:
VM: cannot fork with physically contig memory
The error indicates that it is currently not possible to mark physically contiguous memory as copy-on-write, which is true. However, the error may occur during a live update, when VM copies over the memory-mapped pages of a service's old instance to the new instance. In that case, the error is not the result of a fork(2) call. In addition, the error code returned by the function that produces the error message is ignored by its caller. As a result, the reference count of the contiguous memory range is increased anyway, which is exactly what needs to happen for live update operations. Thus, during live updates, this error is both misleading and meaningless. However, we still have to review whether the error is useful to keep around for other scenarios.
The following publication covers the MINIX3 live update architecture, design, and implementation, and provides more details on various theoretical and practical aspects.