====== Live update and rerandomization ======
MINIX3 now has support for live update and rerandomization of its system services. These features are based on LLVM bitcode compilation and instrumentation in combination with various run-time extensions. Live update and rerandomization support is currently fully functional, although still in an experimental state, not enabled by default, and available for x86 only. This document describes the basic idea, provides instructions on how to enable and use the functionality, provides more in-depth information for developers, and lists open issues and further reading material.
===== Introduction =====
This section contains a high-level overview of the live update and rerandomization functionality.
==== Live update ====
A live update is an update to a software component while it is active, allowing the component's code and data to be changed without affecting the environment around it. The MINIX3 live update functionality allows such updates to be applied to its **system services**: the usermode server and driver processes that, in addition to the microkernel, make up the operating system. As a result, these services can be updated at run time without requiring a system reboot. There is no support for live updating the microkernel or user applications at this time.
The live update procedure can be summarized as follows. The component responsible for orchestrating live updates is the RS (Reincarnation Server) service. When RS applies an update to a particular system service, it first brings that service to a stop in a known **quiescence** state, ensuring that the live update will not interfere with the service's normal operation, by exploiting the message-based nature of MINIX3. A new instance of the service is created. This new instance performs its own **state transfer**, copying and adjusting all the relevant data from the old instance to itself. If the state transfer succeeds, the new instance continues to run, and the old instance is killed. If the state transfer fails, RS performs a **rollback**: the new instance is killed, and the system resumes execution of the old instance. In order to maintain the illusion to the rest of the system that there only ever was one service process, the process slots of the old and the new instance are swapped before the new instance gets to run, and swapped back upon rollback.
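The control flow described above can be modeled in a short shell sketch. It is purely illustrative: the function name ''live_update'' and the printed step descriptions are made up for this example and do not correspond to actual RS code, and the relative order of the rollback steps is an assumption. The second argument is a command whose exit status stands in for the outcome of the state transfer.

```shell
#!/bin/sh
# Illustrative model of the live update flow; this is not actual RS code.
# $1: service label
# $2: command whose exit status models the result of state transfer
live_update() {
    label=$1
    transfer=$2
    echo "$label: stop old instance in quiescent state"
    echo "$label: create new instance"
    echo "$label: swap process slots of old and new instance"
    if "$transfer"; then
        echo "$label: state transfer succeeded, kill old instance"
    else
        # Rollback; the order of these two steps is an assumption.
        echo "$label: state transfer failed, swap process slots back"
        echo "$label: kill new instance, resume old instance"
        return 1
    fi
}
```

Calling ''live_update pm true'' models a successful update of PM, while ''live_update pm false'' models a rollback.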
The MINIX3 live update system allows updates to all system services. Those include the RS service itself, and the VM (Virtual Memory) service. The VM service can be updated with severe restrictions only, however. The system also supports **multicomponent** live updates: atomic live updates of several system services at once, possibly including RS and/or VM. In principle, this allows for an atomic live update of the entire MINIX3 service layer.
The state transfer aspect of live update relies heavily on compile-time and in particular link-time instrumentation of system services. This instrumentation is implemented in the form of LLVM "optimization" passes, which operate on LLVM bitcode modules. In most cases, these passes are run after (initial) program linking, by means of the LLVM Link-Time Optimization (LTO) system. Thus, in order to support live update and rerandomization, the system must be compiled using LLVM bitcode and with LTO support. The LLVM pass that performs the static analysis and link-time instrumentation for live update is called the **magic pass**.
In addition, live updates require runtime support for state transfer in each service. For this reason, system services are relinked with a library that provides all the run-time functionality which ultimately allows a new service instance to perform state transfer from its old instance. This library is called the **magic runtime library** or //libmagicrt//. Together, the magic pass and runtime library make up the **magic framework**.
==== Live rerandomization ====
Live rerandomization consists of randomizing the internal address space layout of a component at run time. While the concept of ASR or ASLR - Address Space (Layout) Randomization - is well known, most implementations are rather limited: they perform such randomization only once, when starting a process; they merely randomize the base location of entire process regions, for example the process stack; and they apply the concept to user processes only. In contrast, the MINIX3 live rerandomization can randomize the address space layout of operating system services, as often as desired, and with fine granularity. In order to achieve this, the live rerandomization makes use of live updates.
The fundamental approach consists of a two-step process. First, new versions of the service program are generated, using link-time randomization of various parts of its program binary. Ideally, this would be done at run time; due to various limitations, MINIX3 currently only supports pregenerated randomized binaries of system services. Then, at runtime, the live update system is used to update from one randomized version of each service to another.
The randomization of binaries is done with another link-time pass, called the **asr pass**. The magic runtime library implements various runtime aspects of ASR rerandomization during live update.
===== Users guide =====
In this section, we explain how to set up a MINIX3 system that supports live update and rerandomization, and we describe how to use the new functionality when running MINIX3.
==== Setting up the system ====
We cover all the steps to set up a MINIX3 system that is ready for live update and rerandomization. For now, it requires crosscompilation as well as an additional build of the LLVM source code. The procedure is for x86 targets only.
The current procedure has been tested only from **Linux** as host platform, and may require minor adjustments on other host platforms. We provide a few additional instructions for those other platforms, but these may currently not be complete. Please feel free to add more instructions to this page, and/or open GitHub issues for other platforms and link to them from here.
After setting up an initial environment, the first step is to obtain the MINIX3 source code. After that, the next step is to build an LLVM toolchain with LTO support, which is needed because the regular MINIX3 crosscompilation LLVM toolchain does not include LTO support (yet - we are working on this). Once the LTO-supporting toolchain has been built, the final step is to build the MINIX3 sources, with extra flags to enable magic instrumentation and possibly ASR rerandomization.
Once these steps have been completed successfully for the first time, one can later update the MINIX3 source and then rebuild the system. The LTO-supporting toolchain need not be rebuilt unless we upgrade the LLVM source code itself.
We will now go through all steps in detail. At the end of this section, there is also a summary of the commands to issue.
All of the commands in this section are to be performed on the crosscompilation host system rather than on MINIX3. None of the commands, except the Linux-specific ''sudo apt-get'' example in the first subsection, require more than ordinary user privileges.
=== Setting up the environment ===
The initial step is to set up a crosscompilation environment. General information about setting up a crosscompilation environment can be found on the [[.:crosscompiling|crosscompilation page]]. As one example, the reference platform used to test the instructions in this document was the developer desktop edition of Ubuntu 14.04, a.k.a. ''ubuntu-14.04.2-desktop-i386.iso'', with the following extra packages installed:
$ sudo apt-get install curl clang binutils zlibc zlib1g zlib1g-dev libncurses-dev qemu-system-x86
The MINIX3 build system uses one single directory in which to place all its files. This directory is one level up from the root of the MINIX3 source directory. Thus, it is advisable to create this containing directory at a location known to have enough free hard disk space. Here we use ''/home/user/minix-liveupdate'' as an example, but the location is entirely up to you. The containing directory will end up having one subdirectory for the MINIX3 source code (called ''minix-src'' in this document), one subdirectory for the LLVM LTO toolchain (called ''obj_llvm.i386''), and one subdirectory for the crosscompilation tool chain and compiled objects (called ''obj.i386''). Thus, the ultimate directory structure will look like this:
/home/user/minix-liveupdate/minix-src
/home/user/minix-liveupdate/obj_llvm.i386
/home/user/minix-liveupdate/obj.i386
You have to choose a location for the containing directory, and create it yourself. The three subdirectories should be created automatically as part of the following steps. However, it has been reported that on some platforms (e.g., FreeBSD), some or all of these directories have to be created manually; this can be done with nothing more than a few basic ''mkdir'' commands. In terms of disk space usage, expect to need a bare minimum of **30GB** for the combination of these three subdirectories, with a recommended **40GB** of available space.
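Before starting, it may be worth checking that the chosen location actually has enough free space. The following is a minimal sketch of such a check; the helper name ''have_space'' is hypothetical and not part of the MINIX3 tree. It relies only on the POSIX output format of ''df -P''.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the file system holding directory $1
# has at least $2 GB of space available.
have_space() {
    dir=$1
    need_gb=$2
    # df -P prints one header line followed by one data line; with -k, the
    # fourth column is the available space in 1024-byte blocks.
    avail_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
    [ "$avail_kb" -ge $((need_gb * 1024 * 1024)) ]
}
```

For example, ''have_space /home/user 40 || echo "less than 40GB available"'' warns when the recommended amount of space is not available.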
=== Obtaining or updating the MINIX3 source code ===
The first real step is to fetch the MINIX3 source code. Other wiki pages cover this in more detail, but the simplest approach is to check out the sources from the main MINIX3 repository using [[.:usinggit|git]]:
$ cd /home/user/minix-liveupdate
$ git clone git://git.minix3.org/minix minix-src
This will create a ''minix-src'' subdirectory with the latest version of the MINIX3 source code.
Later on, a newer version of the source code can be pulled from the MINIX3 repository:
$ cd /home/user/minix-liveupdate/minix-src
$ git pull
In both cases, the next step is now to build the source code.
=== Building the LTO toolchain ===
The second step is to build the LLVM LTO infrastructure, if it has not yet been built before. Eventually, this will be done automatically as part of the regular build. For now, we have a script that can perform the build, called ''generate_gold_plugin.sh''. It is located in the ''minix/llvm'' subdirectory of the MINIX3 source tree. The basic procedure therefore consists of the following steps (but read this entire section first):
$ cd /home/user/minix-liveupdate/minix-src/minix/llvm
$ ./generate_gold_plugin.sh
On some platforms, it may be needed to specify the C/C++ compiler and/or the name of the GNU make utility, which can be done as follows:
$ CC=clang CXX=clang++ MAKE=make ./generate_gold_plugin.sh
On FreeBSD and similar platforms, one may have to ensure that GNU make is installed (typically as ''gmake'') first, and pass in ''MAKE=gmake'' to point to it.
This step may take several hours. It can be sped up by supplying a number of parallel jobs, through a ''JOBS=n'' variable:
$ JOBS=8 ./generate_gold_plugin.sh
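The variables above can also be derived from the host system. The sketch below does so under two assumptions: that plain ''make'' is GNU make whenever ''gmake'' is not installed, and that ''getconf _NPROCESSORS_ONLN'' is supported on the host (it is on Linux and FreeBSD, but it is not required by POSIX). The helper names ''pick_make'' and ''pick_jobs'' are made up for this example. The script only prints the resulting command line; it does not run the build.

```shell
#!/bin/sh
# Print the name of the make utility to use: prefer gmake when installed.
pick_make() {
    if command -v gmake >/dev/null 2>&1; then
        echo gmake
    else
        echo make
    fi
}

# Print the number of online CPUs, or 1 if it cannot be determined.
pick_jobs() {
    n=$(getconf _NPROCESSORS_ONLN 2>/dev/null) || n=1
    case $n in
        ''|*[!0-9]*) n=1 ;;    # empty or non-numeric answer
    esac
    echo "$n"
}

# Print (rather than run) the resulting command line.
echo "CC=clang CXX=clang++ MAKE=$(pick_make) JOBS=$(pick_jobs) ./generate_gold_plugin.sh"
```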
As stated before, after this command has finished successfully, it need not be reissued until LLVM is upgraded in the MINIX3 source tree. This is a rare event which is typically part of a larger resynchronization with NetBSD code, and we will clearly announce such events. When this happens, it may be advisable to remove the entire ''obj_llvm.i386'' directory as well as any files in ''minix-src/minix/llvm/bin'', before rerunning the generate_gold_plugin.sh script.
=== Building the system ===
The third step consists of building the system and generating a bootable image out of it. When run for the first time, this step will also build the regular (non-LTO) crosscompilation toolchain. The first run may therefore (also) take several hours. The build procedure is just like regular MINIX3 crosscompilation, differing in only two aspects.
First, the appropriate build variables must be passed in to enable the desired functionality. In order to build the system with live update support through magic instrumentation, the build system must be invoked with the ''MKMAGIC'' build variable set to //yes//. This will perform a bitcode build of the entire system, and perform magic instrumentation on all system services.
In order to build the system with ASR instrumentation, the build system must be invoked with the ''MKASR'' build variable set to //yes//. This will automatically enable magic instrumentation, perform ASR randomization on all system services, and pregenerate a number of ASR-rerandomized service binaries for each service. This number can be controlled with an additional ''ASRCOUNT=n'' build variable, where the //n// value must be between 1 and 65536 (inclusive). The default //ASRCOUNT// is 3.
Second, in order to build a hard disk image suitable for use by the resulting bitcode builds, the ''x86_hdimage.sh'' script must be invoked with the **-b** flag. This will enlarge the generated image to account for the larger binaries, and enable inclusion of ASR-rerandomized binaries if necessary.
These two aspects can be covered in a single build command. The following short procedure will build a hard disk image with magic instrumentation:
$ cd /home/user/minix-liveupdate/minix-src
$ BUILDVARS="-V MKMAGIC=yes" ./releasetools/x86_hdimage.sh -b
In order to speed up the build, a number of parallel jobs may be supplied. It is typically advisable to use as many jobs as there are hardware threads of execution (i.e., CPU cores or hyperthreads) in the system:
$ JOBS=8 BUILDVARS="-V MKMAGIC=yes" ./releasetools/x86_hdimage.sh -b
It may be necessary to ensure that clang is used as the compiler:
$ CC=clang CXX=clang++ JOBS=8 BUILDVARS="-V MKMAGIC=yes" ./releasetools/x86_hdimage.sh -b
Also, some platforms may not be able to compile the compiler toolchain for the target platform due to running out of memory. In that case, it is possible to build an image that does not come with its own compiler toolchain, by passing in the ''MKLLVMCMDS=no'' build variable. This build variable can also be used simply to speed up the compilation procedure.
$ BUILDVARS="-V MKMAGIC=yes -V MKLLVMCMDS=no" ./releasetools/x86_hdimage.sh -b
In order to build an image with ASR randomization, including four additional ASR-rerandomized versions of each system service, use the following build variables:
$ BUILDVARS="-V MKASR=yes -V ASRCOUNT=4" ./releasetools/x86_hdimage.sh -b
Obviously, all variables shown above can be combined as appropriate. The author of this document has used the following command line on several occasions:
$ CC=clang CXX=clang++ JOBS=4 BUILDVARS="-V MKASR=yes -V ASRCOUNT=2 -V MKLLVMCMDS=no" ./releasetools/x86_hdimage.sh -b
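To avoid typing mistakes when combining these variables, the ''BUILDVARS'' string can be generated by a small helper. The following sketch uses a hypothetical function ''make_buildvars'', which is not part of the MINIX3 build system; it only uses the build variables and the ''ASRCOUNT'' range described above.

```shell
#!/bin/sh
# Hypothetical helper: print a BUILDVARS string.
# $1: "magic" or "asr"
# $2: ASRCOUNT value, used for "asr" only (default: 3)
# $3: "no" to add MKLLVMCMDS=no (default: yes)
make_buildvars() {
    mode=$1
    count=${2:-3}
    llvmcmds=${3:-yes}
    case $mode in
        magic)
            vars="-V MKMAGIC=yes" ;;
        asr)
            case $count in
                *[!0-9]*) echo "ASRCOUNT must be a number" >&2; return 1 ;;
            esac
            if [ "$count" -lt 1 ] || [ "$count" -gt 65536 ]; then
                echo "ASRCOUNT must be between 1 and 65536" >&2
                return 1
            fi
            vars="-V MKASR=yes -V ASRCOUNT=$count" ;;
        *)
            echo "unknown mode: $mode" >&2; return 1 ;;
    esac
    if [ "$llvmcmds" = no ]; then
        vars="$vars -V MKLLVMCMDS=no"
    fi
    echo "$vars"
}
```

With this helper, the command line above could be written as ''BUILDVARS="$(make_buildvars asr 2 no)" ./releasetools/x86_hdimage.sh -b''.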
After the first run, the build system will perform recompilation of only the parts of the source code that have changed, and should not take nearly as long to run as the first time. In case of unexpected problems when rebuilding, it may be necessary to throw away the previously generated objects and rebuild the MINIX3 source code in its entirety. This can be done by going to the top-level ''obj.i386'' directory and deleting all files and directories in there, except the ''tooldir.{yourplatform}'' subdirectory. Fully rebuilding the MINIX3 source code will take longer than an incremental rebuild, but since the crosscompilation toolchain is left as is, it will still take nowhere near as long as the first run.
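The cleanup of ''obj.i386'' can be scripted. The following is a minimal sketch, with a hypothetical function name ''clean_objdir''; it deletes every entry directly inside the given directory except those named ''tooldir.*''. Since it removes files recursively, double-check the path before using it.

```shell
#!/bin/sh
# Hypothetical helper: remove everything directly inside directory $1,
# except entries whose name starts with "tooldir.".
clean_objdir() {
    objdir=$1
    for entry in "$objdir"/* "$objdir"/.[!.]*; do
        # Skip glob patterns that did not match anything.
        [ -e "$entry" ] || [ -L "$entry" ] || continue
        case ${entry##*/} in
            tooldir.*) ;;              # keep the crosscompilation toolchain
            *) rm -rf "$entry" ;;
        esac
    done
}
```

In the example setup of this document, the call would be ''clean_objdir /home/user/minix-liveupdate/obj.i386''.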
As explained in more detail on the [[.:crosscompiling|crosscompilation page]], it is also possible to rebuild particular parts of the system without going through the entire "make build" process. This involves the use of the ''nbmake-i386'' tool and generally requires a good understanding of the compilation process.
=== Running the image ===
The x86_hdimage command produces a bootable MINIX3 hard disk image file. The generated image file is called ''minix_x86.img'' and located in the root of the MINIX3 source tree - ''minix-src'' in our examples. Once an image has been generated, it can be run. The most convenient way to run the image is to use **qemu/KVM**. This can be done using the command as given at the end of the x86_hdimage output.
While explaining the use of qemu is beyond the scope of this document, it may be useful to look into the ''-append'', ''-curses'', and ''-serial file:..'' qemu command line arguments. The following command line will launch qemu with KVM support (remove ''--enable-kvm '' to disable KVM support), a curses-based user interface, and system output redirected to a file named ''serial.out'':
$ cd /home/user/minix-liveupdate/minix-src
$ (cd ../obj.i386/destdir.i386/boot/minix/.temp && qemu-system-i386 --enable-kvm -m 256 -kernel kernel -initrd "mod01_ds,mod02_rs,mod03_pm,mod04_sched,mod05_vfs,mod06_memory,mod07_tty,mod08_mib,mod09_vm,mod10_pfs,mod11_mfs,mod12_init" -hda ../../../../../minix-src/minix_x86.img -curses -serial file:../../../../../minix-src/serial.out -append "rootdevname=c0d0p0 cttyline=0")
Extra [[usersguide:bootmonitor|boot options]] can be supplied in the (space-separated) list that follows the ''-append'' switch. For example, adding '' rs_verbose=1'' will enable verbose output in the RS service, which is highly useful for debugging issues with live update.
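When experimenting with several boot options, it can be convenient to assemble the ''-append'' argument separately. The sketch below does this with a hypothetical helper named ''boot_options''; it starts from the two options used in the example command line above and appends any extra options given as arguments.

```shell
#!/bin/sh
# Hypothetical helper: print the argument for qemu's -append switch,
# consisting of the two boot options from the example command line,
# followed by any extra options passed as arguments.
boot_options() {
    opts="rootdevname=c0d0p0 cttyline=0"
    for opt in "$@"; do
        opts="$opts $opt"
    done
    echo "$opts"
}
```

The qemu command line would then end in ''-append "$(boot_options rs_verbose=1)"''.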
=== Summary ===
The following commands can be used to obtain and build a MINIX3 system that supports live update and live rerandomization, including three alternative rerandomized versions of all system services, in addition to the randomized standard ones:
$ export CC=clang CXX=clang++ JOBS=8
$ cd /home/user/minix-liveupdate
$ git clone git://git.minix3.org/minix minix-src
$ cd minix-src/minix/llvm
$ ./generate_gold_plugin.sh
$ cd ../..
$ BUILDVARS="-V MKASR=yes -V MKLLVMCMDS=no" ./releasetools/x86_hdimage.sh -b
The entire procedure will typically take about 30GB of disk space and several hours of time.
Sometime later, the following steps can be used to update the installation to a newer MINIX3 version:
$ cd /home/user/minix-liveupdate/minix-src
$ git pull
$ CC=clang CXX=clang++ JOBS=8 BUILDVARS="-V MKASR=yes -V MKLLVMCMDS=no" ./releasetools/x86_hdimage.sh -b
In contrast to the initial run, the entire update procedure should take no more than an hour.
==== Using live update ====
Once an instrumented MINIX3 system has been built and started, it should be ready for live updates. MINIX3 offers two scripts that make use of the live update functionality: one for testing the infrastructure, and one for performing runtime ASR rerandomization. In addition, the user may perform live updates manually. In this section, we cover both parts.
The commands in this section are to be run within MINIX3, rather than on the host system. They must be run as root, because performing a live update of a system service requires superuser privileges. These two things are reflected by the ''minix#'' prompt used in the examples below.
=== Pre-provided scripts ===
The MINIX3 distribution comes with two scripts that can be used to test and use the live update and rerandomization functionality. The first one is //testrelpol//. This script may be used for basic regression testing of the MINIX3 live update infrastructure. The second one is //update_asr//. This command performs live rerandomization of system services at runtime.
== Infrastructure testing: testrelpol ==
The MINIX3 test suite has a test script that tests the basic MINIX3 crash recovery and live update functionality. The script is called **testrelpol** and can be found in ''/usr/tests/minix-posix'':
minix# cd /usr/tests/minix-posix
minix# ./testrelpol
For its live update tests, this script does //not// use the magic framework for state transfer at all. Instead it uses **identity transfer** which performs a basic memory copy between the old and the new instance. As a result, the testrelpol script should succeed whether or not services are instrumented. However, it may not work reliably on MINIX3 systems that are not built for magic instrumentation (i.e., built with neither ''MKMAGIC=yes'' nor ''MKASR=yes'').
== Live rerandomization: update_asr ==
As we have shown before, the ''MKASR=yes'' host-side build variable performs the //build-time// preparation of a MINIX3 system for live rerandomization. Complementing this, the //run-time// side of the live rerandomization is provided by means of the **update_asr** command. The update_asr command will update system services to their next pregenerated ASR-rerandomized version, using a cyclic system. Live rerandomization is not automatic, and thus, the MINIX3 system administrator is responsible for running the update_asr command at appropriate times.
By default, the update_asr command performs one round of ASR rerandomization, updating each service to its next version:
minix# update_asr
By default, this command will report errors only. More verbose information can be shown using the ''-v'' switch:
minix# update_asr -v
For further details about this command, see the update_asr(8) manual page.
Aside from providing actual security benefits, the update_asr script is the **most complete test** of the live update and rerandomization functionality at this time. It uses the magic framework for state transfer, with full relocation of all state, and it applies the runtime ASR features. As of writing, it runs in the default qemu environment without any errors or subsequent issues.
The only aspect that is not tested with this command is whether ASR rerandomization is //effective//, that is, whether all parts of service address space were properly randomized by the asr pass. After all, ASR rerandomization between identical service copies works just as well, but provides substantially fewer security guarantees. Developers working on the asr pass are encouraged to verify its effectiveness manually, for example using nm(1) on generated service binaries on the host side.
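As a starting point for such a manual check, the following sketch compares two symbol listings and prints the names of all symbols that ended up at the same address in both. The function name ''unmoved_symbols'' is hypothetical. It assumes listings in the default nm(1) output format (address, type, name); which nm binary to use and where the generated service binaries are located on the host are left to the reader. Local symbols with duplicate names are not handled specially.

```shell
#!/bin/sh
# Hypothetical helper: given two files with nm(1) output, print the names
# of the symbols that have the same address in both listings.
unmoved_symbols() {
    awk 'NF != 3 { next }                      # skip undefined symbols
         NR == FNR { addr[$3] = $1; next }     # first file: remember address
         # Appending "" forces a string comparison of the hex addresses.
         ($3 in addr) && (addr[$3] "") == ($1 "") { print $3 }' "$1" "$2"
}
```

A typical use would be to save the nm(1) output of a base service binary and of one of its rerandomized alternatives to two files, and then run ''unmoved_symbols'' on those files. An empty result means that no symbol kept its address.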
=== Live update commands ===
RS can be instructed to perform live updates through the minix-service(8) command, specifically through its **minix-service update** subcommand. This command is also used by the automated scripts. For a full overview of the command's functionality, please see the minix-service(8) manual page as well as the command's output when it is run with no parameters.
In its most fundamental form, the //minix-service update// command will update a running service, identified by its label, to a new version provided as an on-disk binary file. It is however also possible to tell RS to update the service into a copy of itself. In addition, various flags and options can be used for fine-grained control of the live update action. The basic syntax to perform a live update on a single system service is as follows:
minix# minix-service [flags] update [self|<binary>] -label <label> [options]
Through various combinations of this command's parameters, MINIX3 basically supports four types of updates, representing increasingly challenging conditions for the overall live update infrastructure in general, and state transfer in particular. We will now go through all of them, and explain how they can be performed. For more details regarding what is actually going on below the surface, please consult the developers guide section of this document.
== Identity transfer ==
The first update type is **identity transfer**. In this case, the service is updated to an identical copy of itself, with all functions and static data in the new instance located at the exact same addresses as the old instance. Identity transfer bluntly copies over entire memory sections at once, thus requiring no instrumentation at all. This makes it suitable for testing of the MINIX3-specific side of the live update infrastructure, hence its use in the ''testrelpol'' script. Identity transfer is the default of the minix-service(8) command when "self" is given instead of a path to a new binary:
minix# minix-service update self -label pm
This will perform an identity transfer of the PM service. Identity transfer should work for literally all MINIX3 system services. As mentioned, it is guaranteed to work only when the system was built with ''MKMAGIC=yes'', although it will mostly work on systems built without magic support as well. It works regardless of whether the target service was instrumented with the magic framework (or ASR).
If the live update is successful, the minix-service(8) command will be silent, but RS will print a system message that the update succeeded:
RS: update succeeded
If the system was started on qemu with ''OUT=F'', this message will end up in ''serial.out''. Otherwise, the message should show up in the MINIX3 system log (''/var/log/messages'') and possibly on the first console.
If the live update fails, RS should print an error to the system log, and minix-service(8) will complain. In order to debug such failures, it may be useful to enable verbose mode in RS, by starting the system with ''rs_verbose=1'' as shown earlier.
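Since minix-service(8) is silent upon success, a script that wants to confirm an update may count the success messages in a log file before and after the update. The following is a minimal sketch; the helper name ''count_successes'' is hypothetical, and the log file to inspect depends on how the system was started, as described above.

```shell
#!/bin/sh
# Hypothetical helper: print how many times the RS success message
# occurs in log file $1.
count_successes() {
    # grep -c prints the count, but exits with status 1 if it is zero.
    grep -c 'RS: update succeeded' "$1" || true
}
```

If the count has increased by one after a //minix-service update// command, the update was reported as successful by RS.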
== Self state transfer ==
The second update type is **self state transfer**. Self state transfer also performs an update of a service into an identical copy of itself, but instead uses the state transfer functionality of the magic framework. Thus, self state transfer requires that the service be instrumented properly. This update type can be used to test whether a service's state can be transferred without problems. Please note that many of the points covered here also apply to the remaining two update types, as all three are using the state transfer of the magic framework.
Self state transfer is performed by supplying the ''-t'' flag along with "self" to the minix-service update command:
minix# minix-service -t update self -label pm
This command will perform self state transfer of the PM service. The libmagicrt state transfer routine in the new service instance will print additional system messages while it is running. Upon success, the system output will look somewhat like this:
total remote functions: 57. relocated: 54
total remote sentries: 186. relocated normal: 84 relocated string: 101
total remote dsentries: 5
st_data_transfer: processing sentries
st_data_transfer: processing dsentries
st_data_transfer: processing sentries
st_data_transfer: processing dsentries
st_state_transfer: state transfer is done, num type transformations: 0
RS: update succeeded
If the state transfer routine is not able to perform state transfer successfully, it will print messages that start with ''[ERROR]''. RS will then roll back the service to the old instance, and both RS and minix-service(8) will report failure. Self state transfer should succeed for all MINIX3 system services that have been built with bitcode and instrumented with libmagicrt and the magic pass. As of writing, there are no system services for which self state transfer is known to result in ''[ERROR]'' lines and subsequent live update failure. However:
* It is possible that new changes to system services, and even usage scenarios which we have not yet tested, do result in state transfer errors. Such errors should be resolved. The developers guide further below contains information on how to resolve some of these errors.
* Currently, one service is not built with bitcode, namely the memory driver. It is therefore also not instrumented. An attempt to perform self state transfer on any service that is not instrumented will result in a "Function not implemented (error 78)" error. For services other than the memory driver, this is usually a good indication that a step was missed during the build phase.
* Some services have no state to transfer, in which case their new instances will perform a fresh start instead of state transfer. In that case, live update with self state transfer will succeed, but not print the state transfer system messages shown above. This is the case for the IS (Information Server) and readclock.drv services, for example.
* Some services may only be updated once brought into a specific state of quiescence, because the default quiescence state is not sufficiently restrictive. In that case, the user must specify an alternative quiescence state explicitly, through the minix-service(8) ''-state'' option. This currently applies to all services that make use of userspace threads, namely the VFS, ahci, and virtio_blk services. These services must be updated using quiescence state 2 (//request free//) rather than state 1 (//work free//):
minix# minix-service -t update self -label vfs -state 2
Omitting the appropriate state parameter may result in a crash of the service after live update. At the moment, the update_asr(8) script has hardcoded knowledge about these necessary states. None of this is great, and we will be working towards a situation where the default state will not result in a crash - see the section on open issues further below.
* State transfer may be slow, and RS applies a rather strict default timeout for live updates. Therefore, it may sometimes be necessary to set a longer timeout in order to avoid needless failures. This can be done through the ''-maxtime'' option to minix-service(8):
minix# minix-service -t update self -label vfs -state 2 -maxtime 120HZ
The maximum time is specified in clock ticks by default, but may be given in seconds by appending "HZ" to the timeout. The latter may sound confusing and it is, but the original idea was supposedly that the number of seconds is multiplied by the system's clock frequency, also known as its HZ setting. The above example allows the live update of VFS to take up to two minutes.
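The points above can be combined in a small wrapper that prints the appropriate command for a given service. This is a sketch only: the function name ''self_transfer_cmd'' is hypothetical, and the service labels are taken from the text above as is; the labels on a running system may differ.

```shell
#!/bin/sh
# Hypothetical helper: print the minix-service command for self state
# transfer of a service.
# $1: service label
# $2: optional timeout in seconds
self_transfer_cmd() {
    label=$1
    secs=$2
    cmd="minix-service -t update self -label $label"
    case $label in
        # Services that use userspace threads need quiescence state 2.
        vfs|ahci|virtio_blk) cmd="$cmd -state 2" ;;
    esac
    if [ -n "$secs" ]; then
        # The HZ suffix makes minix-service interpret the value as seconds.
        cmd="$cmd -maxtime ${secs}HZ"
    fi
    echo "$cmd"
}
```

For example, ''self_transfer_cmd vfs 120'' prints the VFS command line shown above.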
== ASR rerandomization ==
The third update type is **ASR rerandomization**. Like self state transfer, ASR rerandomization uses the magic framework to perform state transfer. In this case, the service performs state transfer into a rerandomized version of the same service. This involves specifying the path to a rerandomized ASR binary to the minix-service(8) command, as well as the ''-a'' flag. The ''-a'' flag tells the new instance to enable the run-time parts of rerandomization during the live update.
minix# minix-service -a update /service/asr/pm-1 -progname pm -label pm
In a system that has been built with ASR rerandomization, the (randomized) base service binaries are located in ''/service'' and the (randomized) alternative service binaries are located as numbered files in ''/service/asr''. As mentioned before, the update_asr(8) command can be used to perform these updates semi-automatically.
Compared to self state transfer, ASR rerandomization comes with one extra restriction: the VM service cannot be subjected to forms of state transfer more complicated than self state transfer. For this reason, VM is also skipped by the update_asr(8) command. We will explain the restrictions regarding the VM service in the developers guide.
== Functional update ==
The final update type is a **functional update**. Compared to self state transfer, ASR rerandomization relocates code and more data. However, for ASR rerandomization, there are still fundamentally no differences between the old and the new version of the service. In contrast, in the case of a functional update, the service performs state transfer into a new program. While this new program is typically highly similar, it may be different from the running service in various ways.
In terms of the minix-service(8) command, such functional updates can be performed by simply using //minix-service update// with a new binary. For example, one could test a new version of the UDS (UNIX Domain Sockets) service, without installing it into ''/service'' yet, and without affecting its open sockets:
minix# minix-service update /usr/src/minix/net/uds/uds -label uds
The possibility of actual differences between the old and new service versions adds an extra dimension for the state transfer. Additional state transfer problems can be expected in this case, and must be dealt with accordingly. The developers guide will (eventually) elaborate on this point.
Similarly, depending on the nature of the update, the update action may require a specific state of quiescence. Taking UDS as an example, an update may change file descriptor transfers over sockets, in which case the update may impose that no file descriptors be in flight at the time of the update. The old instance of the service must support this as a custom quiescence state. This custom state can then be specified through the ''-state'' option of the //minix-service update// command.
Since the live update functionality is relatively new for MINIX3, we do not yet have much experience with the practical side of performing functional updates to services. This document will be expanded as we gain more insight into the common usage patterns of live update. Stay tuned!
== Multicomponent updates ==
From the user's perspective, updating multiple services at once is not much more complex than updating a single service. First, a number of **minix-service update** commands should be issued, just as before, but each with the ''-q'' flag added:
minix# minix-service -q -t update /service/pm -label pm
minix# minix-service -q -t update /service/vfs -label vfs -state 2
Then, the entire update can be launched with the **minix-service sysctl upd_run** command:
minix# minix-service sysctl upd_run
The RS output will be much more verbose in this case. Note that timeouts are still to be specified on a per-service basis, rather than for the entire update at once. If necessary, any queued //minix-service update// commands may be canceled with the **upd_stop** subcommand:
minix# minix-service sysctl upd_stop
This will cancel the entire multicomponent live update action.
===== Developers guide =====
This part of the document provides in-depth information for developers. We start with information for system service developers, explaining how to support live update for a newly written service. This requires only a limited understanding of the details of the live update infrastructure, and is therefore a somewhat separate section.
The rest of the developers guide is targeted towards people who maintain the live update infrastructure. We first cover some of the theoretical and practical aspects of the live update approach and infrastructure on MINIX3. We then elaborate on several practical aspects related to state transfer using the magic framework, including how to prevent and resolve state transfer issues.
==== Writing a service ====
This section is for writers of system services. We cover two aspects: general requirements for live updates, and specifying custom live update quiescence states.
=== General requirements ===
In the vast majority of cases, allowing future live updates on a new service requires **no action at all** from the service developer. That is, if the service has been written properly, it can also be updated. Specifically, a service can be updated if it meets these conditions:
- It uses the System Event Framework (SEF) API throughout the service;
- It has one main message processing loop;
- It performs all its initialization in SEF initialization callback routines;
- It does not suffer from specific state transfer issues.
The first three points are required for all services in any case, and are not specific to live update. These points are therefore covered better on other pages, in particular those on [[.:driverprogramming|programming device drivers on MINIX3]] and the [[.:sef|System Event Framework API]] (warning: currently outdated). We do explain the reason behind these three points in detail later on.
Only the fourth point is specific to live update, and is relevant only to a small subset of services. This point is covered in more detail in the "State transfer in practice" section below. Specifically, as a service developer, you will want to verify that your new service does not suffer from potential issues with long-running memory grants, userspace threads, and physically unmovable memory. Then, you will want to test **self state transfer** on your service, and resolve any state transfer errors that come up. Only in these cases does the SEF live-update API (that is, the sef_*_lu_*(3) calls) become relevant. We do not elaborate on most of the SEF API in this document.
=== Custom quiescence states ===
In certain cases, a service may have to meet custom requirements before it is allowed to be updated. This depends on both the service and the update. We previously gave an example regarding the UDS service and file descriptors in flight. As another example, an update that affects message protocols may have to ensure that the service has no outstanding requests to other services using that protocol. As yet another example, certain drivers may want to avoid being updated while certain types of DMA are ongoing, etcetera.
It is up to the writer of the service to implement any such custom quiescence states, assigning a number to each of them. It is then up to the system administrator to supply such a state with the //minix-service update// command, using the ''-state'' option. Some of the quiescence states are predefined; others must be defined by the service developer explicitly. The following states are defined:
* State **1** (''SEF_LU_STATE_WORK_FREE''): work free. This state ensures that the service is not currently performing any work. The fact that the service is being prepared at the time of verifying the quiescence state implies that it is not doing any other work, and thus, SEF is hardcoded to accept updates in this state. The service developer cannot override the check for this state.
* State **2** (''SEF_LU_STATE_REQUEST_FREE''): request free. This state ensures that the service is not currently processing any requests from other services. The state is not valid by default, and may be implemented by the service writer.
* State **3** (''SEF_LU_STATE_PROTOCOL_FREE''): protocol free. This state ensures that the service is not currently engaged in any protocol exchange with other services. The state is not valid by default, and may be implemented by the service writer.
* State **4** to **6**: predefined states for specific purposes. These states are handled entirely by RS and SEF, and are not relevant for service developers.
* State **7** and higher (''SEF_LU_STATE_CUSTOM_BASE''+//n//): custom states. These states may be used by services to define their own custom states. The namespace is per-service, so each service may define its custom states with numbers starting from 7 (''SEF_LU_STATE_CUSTOM_BASE+0'').
Thus, a service writer may want to implement states 2, 3, and/or any additional states starting from 7. This involves two necessary parts, and a third optional part.
First, the service must use the sef_setcb_lu_state_isvalid(3) SEF API call to register a callback routine that determines whether a particular state is valid for the service. In order to allow for states 2 and 3, but not any custom states, the standard sef_cb_lu_state_isvalid_standard(3) SEF callback routine may be given:
sef_setcb_lu_state_isvalid(sef_cb_lu_state_isvalid_standard);
The service would typically issue this call before calling sef_startup(3). In order to allow for additional custom states, a custom callback routine must be supplied:
sef_setcb_lu_state_isvalid(my_state_isvalid);
This routine has the signature ''int my_state_isvalid(int state, int flags)'', and will be called when a live update is initiated through minix-service(8). As its most important parameter, ''state'' is the requested quiescence state. The ''flags'' parameter contains update flags and is typically unused. The routine must return ''TRUE'' if the state is valid for the service, and ''FALSE'' otherwise. Most services will want to allow the standard states as well as any custom states:
  #define MY_CUSTOM_STATE_0 (SEF_LU_STATE_CUSTOM_BASE+0)
  #define MY_CUSTOM_STATE_n (SEF_LU_STATE_CUSTOM_BASE+n)
  int my_state_isvalid(int state, int flags) {
    return SEF_LU_STATE_IS_STANDARD(state) ||
        (state >= MY_CUSTOM_STATE_0 && state <= MY_CUSTOM_STATE_n);
  }
Second, the service must use the sef_setcb_lu_prepare(3) SEF API call to specify a callback routine which verifies whether the service accepts a live update for a particular state, typically also before calling sef_startup(3):
sef_setcb_lu_prepare(my_lu_prepare);
This routine has the signature ''int my_lu_prepare(int state)'', and will be called when a live update is initiated through minix-service(8), after ensuring the given state is valid. Again, ''state'' is the requested quiescence state. The function must return ''OK'' if the live update can proceed in this state, and ''ENOTREADY'' otherwise. It should check the standard states and/or any custom states, typically in a switch statement.
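As a rough illustration, a prepare callback for a service with one custom state might look like the sketch below. This is not taken from an existing service. The constants are local stand-ins that mirror the state numbers listed above (a real service gets them from the SEF headers), the values chosen for ''OK'' and ''ENOTREADY'' are placeholders, and ''pending_requests'', ''fds_in_flight'' and ''MY_STATE_NO_FDS_IN_FLIGHT'' are hypothetical names. State 1 is not handled here, since SEF accepts it without consulting the service.

```c
/* Stand-ins so that the sketch compiles outside MINIX3. A real service
 * takes these from the system headers; the OK and ENOTREADY values used
 * here are placeholders, not the real ones. */
#define OK                          0
#define ENOTREADY                   (-1)
#define SEF_LU_STATE_REQUEST_FREE   2
#define SEF_LU_STATE_PROTOCOL_FREE  3
#define SEF_LU_STATE_CUSTOM_BASE    7

/* Hypothetical custom state: no file descriptors in flight. */
#define MY_STATE_NO_FDS_IN_FLIGHT   (SEF_LU_STATE_CUSTOM_BASE + 0)

/* Hypothetical service bookkeeping. */
static int pending_requests;  /* requests accepted but not yet answered */
static int fds_in_flight;     /* file descriptors queued on sockets */

/* Return OK if the service is quiescent for the requested state. */
int my_lu_prepare(int state)
{
    switch (state) {
    case SEF_LU_STATE_REQUEST_FREE:
        return pending_requests == 0 ? OK : ENOTREADY;
    case MY_STATE_NO_FDS_IN_FLIGHT:
        return fds_in_flight == 0 ? OK : ENOTREADY;
    default:
        /* States this service does not know how to check. */
        return ENOTREADY;
    }
}
```

With such a callback in place, an administrator could request the custom state by passing its number (7 in this sketch) to the ''-state'' option.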
Third, the service may optionally provide a quiescence state debugging function through the sef_setcb_lu_state_dump(3) SEF API call. The given callback routine has the signature ''int my_lu_state_dump(int state)'' and should use the sef_lu_dprint(3) printf-like function to print information about the given quiescence state and its current internal state as appropriate, using newline-terminated lines.
==== What is where ====
We now get into the details of the live update infrastructure. For many parts of the story, it may be useful to take a look at the actual source code as well. In this section we give a quick overview of what parts of the source code are where, and what they do.
The LLVM instrumentation passes are located in ''minix/llvm'' of the MINIX3 source code, along with the ''generate_gold_plugin.sh'' script described in the users guide. The following relevant LLVM passes are located in ''minix/llvm/passes'':
* The **WeakAliasModuleOverride** pass resolves a particular issue with weak symbols being used in assembly code. TODO
* The **sectionify** pass is used to tag certain functions and data of bitcode modules as belonging to a certain section. Its main purpose is to tag certain parts of the compiled code such that the magic pass (see below), in a subsequent run over the same code, will treat the tagged parts as special. For example, it is used to ignore all variables in the libc malloc code for state transfer, for reasons explained later.
* The **magic** pass performs link-time static analysis and instrumentation of system services. It is responsible for supplying libmagicrt (see below) with the necessary information to allow for state transfer at runtime, by including descriptions of data types, global variables, and other information, in the service module. In addition, it is responsible for replacing certain function calls in the module, in particular memory management functions, with calls to wrappers in libmagicrt. This allows for runtime tracking of dynamically allocated memory objects.
* The **asr** pass performs randomization of the service binary, for example by rearranging functions, basic blocks within functions, and data, adding padding between those, and letting functions allocate stack padding. The ASR pass does not deal with randomization of dynamically allocated objects. Instead, it passes some settings on to libmagicrt.
In addition to the passes, the following pieces of system functionality are especially important for live update:
* The magic runtime library, **libmagicrt**, is the runtime component of the magic framework that is linked into system services. It implements the actual state transfer routine, which uses both the information embedded in the service by the magic pass, and the tracking information it has gathered about dynamically allocated memory objects at run time. It also implements the runtime tracking itself. Furthermore, libmagicrt implements the aforementioned runtime part of the ASR functionality. For example, libmagicrt can add extra padding when performing memory allocations. The magic runtime library is located in ''minix/lib/libmagicrt''.
* The glue between system services and libmagicrt is implemented as part of the **System Event Framework** (SEF) library routines. These routines also handle the communication between the system service and RS. Use of SEF is compulsory for all system services. The SEF code is part of **libsys**. Its implementation can be found in the ''minix/lib/libsys/sef*.c'' files.
* The source code of **RS**, the Reincarnation Server, is located in ''minix/servers/rs''. RS uses live update functionality implemented in the kernel, located in ''minix/kernel'', and VM, located in ''minix/servers/vm''.
==== The infrastructure ====
We now elaborate on various MINIX3-specific aspects that are important to understand regarding live update. We describe the live update procedure, show the consequences of the quiescence approach, list the properties of various process memory sections, describe the two supported types of state transfer, and elaborate on the exceptions to the general model for various core system services.
=== The update procedure ===
We first describe the live update procedure in more depth.
In general, properly achieving //quiescence// is one of the main challenges for a live update system. For example, if a live update changes the implementation of a particular function, the component being updated must not be executing that function at the time of the live update - if it is, the live update will most likely result in a crash of the component. In MINIX3, the quiescence issue is resolved in a way that leaves little room for problems, by exploiting MINIX3's message-based nature. In essence, all the MINIX3 services consist of a main message loop that repeatedly receives a message and processes this message. MINIX3 supports no kernel threads, and thus, the MINIX3 services have no internal CPU-level concurrency. As a result, a message can be used to enforce quiescence.
MINIX3 live updates are orchestrated by the RS (Reincarnation Server) service. The administrator of the system first compiles a new version of the service into an executable on disk, and then instructs RS to update a particular running system service into the new version, through the minix-service(8) utility. RS starts by loading the new version of the service as a new service process, without letting it run. Thus, there are temporarily two instances of the service: the old instance, which is still running, and the new instance, which contains the new code but not yet any of the necessary state.
RS then asks the old instance of the service to prepare to be updated, by sending a __prepare__ request message to it. At the moment that the service receives and processes the preparation message, it is by definition in a known state, as it cannot also be doing something else at the same time. While this is a good start for quiescence, the service may have to meet additional requirements regarding its current activity, depending on the service and the type of live update. The administrator provides the intended //quiescence state// for the live update when starting the update, and the service itself determines whether or not it is //ready// when handling the __prepare__ message. If the service decides that it does not meet the given quiescence requirements, the live update is aborted.
However, if the old instance does meet the requirements, it acknowledges that it is ready by sending a __ready__ message to RS, blocking on receipt of a reply from RS. Thus, the old instance is effectively stopped in a known state. In order to maintain the externally visible state (most importantly, the communication endpoint) of the service being updated, the process slots of the old and the new service instances are swapped. The new instance, now in the original process slot, is then allowed to run. Upon startup, the new instance finds out from RS that it is the new instance of an old, stopped process, and attempts to perform state transfer from this old process into itself.
State transfer requires transfer of all individual pieces of data from the old to the new process, possibly to a new location. This is performed by the magic framework. In a nutshell, the magic state transfer approach relies on having a full view of all the individual pieces of data that make up the process, along with type information about the data, including for example structure layouts and types of pointers. For static data, this information is generated by the magic pass through static analysis performed at compile time, and included with the service binary. For dynamic data, the information is collected and maintained by the magic runtime library attached to the service. The end result is that the state transfer framework knows about all global variables and functions, and for each pointer, what type of data the pointer points to.
This knowledge, in addition to full access to the memory of the old instance through a special memory grant, allows the libmagicrt state transfer procedure in the new instance to iterate over all data of the old process. This procedure recursively follows any pointers it encounters, and //pairs// each piece of data with the corresponding piece of data in the new process, copying over the data and adjusting it for the new layout as necessary. In certain cases, the state transfer system may not be able to pair all pieces of data, or deal with all pointers. In that case, state transfer fails. Annotations in the service source code, as well as custom data transfer methods, can be provided in order to aid the state transfer process.
Regardless of whether state transfer succeeded or failed, the new instance sends the result of the state transfer to RS using an __init__ request message. If state transfer succeeded, RS allows the new instance to continue to run, and kills the process of the old instance. If the state transfer fails, RS again swaps the process slots of the old and the new instance, allows the old instance to run again, and kills the new instance. In both cases, RS communicates the result to the minix-service(8) utility as well, ultimately letting the system administrator know about the outcome of the live update.
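The control flow described so far can be summarized in a small toy model. This is only a sketch of the decision logic, not RS code: ''struct instance'', ''struct slots'' and ''live_update()'' are hypothetical names, and the outcomes of the __prepare__ and state transfer steps are simply passed in as booleans.

```c
#include <stdbool.h>
#include <stddef.h>

enum lu_result { LU_UPDATED, LU_ABORTED, LU_ROLLED_BACK };

struct instance {
    int  version;
    bool alive;
};

/* Toy model of two process slots: the one the rest of the system talks
 * to, and the one holding the other instance during the update. */
struct slots {
    struct instance *service;
    struct instance *spare;
};

static void swap_slots(struct slots *s)
{
    struct instance *tmp = s->service;
    s->service = s->spare;
    s->spare = tmp;
}

enum lu_result live_update(struct slots *s, struct instance *new_inst,
                           bool old_is_ready, bool transfer_ok)
{
    s->spare = new_inst;          /* new instance loaded, not yet running */

    if (!old_is_ready) {          /* old instance refused the prepare request */
        new_inst->alive = false;
        s->spare = NULL;
        return LU_ABORTED;
    }

    swap_slots(s);                /* new instance takes over the original slot */

    if (!transfer_ok) {
        swap_slots(s);            /* rollback: old instance gets its slot back */
        s->spare->alive = false;  /* kill the new instance */
        s->spare = NULL;
        return LU_ROLLED_BACK;
    }

    s->spare->alive = false;      /* success: kill the old instance */
    s->spare = NULL;
    return LU_UPDATED;
}
```

In all three outcomes exactly one instance remains alive, and it sits in the slot that the rest of the system addresses.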
For multicomponent live updates, all affected services are first brought into the //ready// state, after which they are all updated. Any service failing to get ready in the preparation phase will cause an abort of the entire update, and any service failing the state transfer phase causes a rollback of the entire update.
Updating the RS and VM components requires various deviations from the procedure sketched above. In addition, support for live updating the VM service is limited. We elaborate on these points later on.
=== The quiescence model ===
We describe the quiescence model in a bit more detail, in order to make two points: 1) the implementation of system services must follow a basic standard structure in order to allow for live update, and 2) the process stack is and can be disregarded for the purpose of state transfer.
The following piece of pseudocode represents a simplified and flattened version of the general structure of each system service:
  main:
    # initialization
    receive INIT message from RS
    if INIT message requests a FRESH start:
      perform service initialization
    if INIT message requests a LIVE UPDATE start:
      perform state transfer
    send result of performed action to RS
    # there should be nothing else here
    # the main message loop
    while true:
      receive message
      if message from RS and message is PREPARE:
        # for simplicity, we are always ready
        send READY message to RS
        receive response message from RS
        # if we get here, the live update has failed
        continue
      handle message
As can be seen, the service's initialization code starts by learning from RS what type of initialization it should perform. This can be either //fresh// initialization of the system service, or state transfer for the purpose of live update (for simplicity we disregard crash recovery). If the service is started anew, typically during system boot, it will perform the service initialization. Such fresh initialization typically consists of initializing global variables, performing initialization-only procedures, etcetera. In contrast, if a new service instance is started for the purpose of live update, it will skip the fresh initialization and instead perform state transfer from the old instance.
In practice, all interaction with RS is implemented in the System Event Framework (SEF) library code. The service-specific actions such as the fresh initialization action are implemented as callbacks from SEF. In the case of fresh initialization, the service is to provide a callback function to SEF using the sef_setcb_init_fresh(3) API call. The default state transfer action for a //live update// start does not require code in the actual service at all.
If the service has initialization code that is called outside of the "fresh initialization" procedure, for example at the "there should be nothing else here" point, then this code will also be called in case of a live update, possibly undoing the effects of the state transfer. Therefore, services must perform initialization only from the designated initialization routines.
After either type of initialization, the service will enter the main message loop, where it will repeatedly receive a message and handle that message. If the received message is a __prepare__ request from RS, then the service is about to be updated, and it sends a __ready__ message to RS, blocking until it gets a response. If the live update is successful, this old instance will never get a response, and instead be killed.
As can be seen, in terms of the process stack of the service, the execution path from main() to the point where the service gets blocked receiving the __ready__ response from RS (let's call this the //quiescence point//) is short and simple. As a result, if the state transfer procedure restored the new instance's stack and program counter to continue from the quiescence point, the result would essentially be the same as not doing so: in both cases, the new service would end up at the start of the message loop. Therefore, the MINIX3 state transfer approach chooses to disregard the execution context of the old process, thus obviating the need to transfer the stack altogether. This is viable only due to the well defined quiescence model.
However, it is possible that the functions leading up to the quiescence point, including the main message loop, have local variables on the stack that maintain long-running state. For example, the main() function could maintain a counter for the number of messages received so far. The values of such variables will be lost during the live update. If this were a major issue, the live update framework could be made to instrument the stack as well, but this could come at great cost since instrumenting only the stack of functions leading up to the quiescence point would be difficult. In practice, not having essential long-running variables in main() is rather simple, and we have not seen problems so far.
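The difference can be illustrated with a loose analogy in plain C. In the sketch below, each call to the hypothetical ''run_loop()'' stands for one instance entering its message loop. The stack-local counter starts from zero every time, while the global counter, which lives in the data section and would therefore be covered by state transfer, keeps its value. The names are made up for this example.

```c
/* Global: part of the data section, so state transfer can carry it over. */
static unsigned long total_handled;

/* Handle n messages and return the value of the stack-local counter. */
unsigned long run_loop(int n)
{
    unsigned long local_handled = 0;  /* on the stack: lost on live update */

    for (int i = 0; i < n; i++) {
        local_handled++;
        total_handled++;
    }
    return local_handled;
}
```

Calling ''run_loop(3)'' and then ''run_loop(2)'' returns 3 and 2 respectively, while ''total_handled'' ends up at 5. A service that needs such a counter to survive a live update should therefore keep it in a global variable.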
=== Process sections ===
The address space of a process is typically made up of various memory sections with different purposes, and MINIX3 system services are no different. There are important differences between various sections when it comes to state transfer.
* The new instance's **text** section is already as it should be: it contains the new code which has been loaded for the new instance by RS.
* The new instance's **data** section is initialized as though the service just started, and the state in this section must be transferred from the old service.
* As explained in the previous section, the **stack** section of the old instance can be ignored altogether, instead letting the new instance naturally reconstruct the stack by going through the regular process starting procedure to get back into main() and the message loop.
* The new instance will have an empty **heap** section. Its state transfer procedure will have to use the brk(2) system call in order to request heap memory for itself so that it can transfer the heap state from the old service.
* For the memory-mapped pages that make up the old instance's **mmap** section, things are slightly different: MINIX3 ensures that the new instance automatically inherits a copy-on-write version of all memory-mapped pages. Thus, the new instance will automatically have the old instance's memory-mapped pages mapped into its address space. For some pages, copy-on-write mappings are not possible. This is the case with memory-mapped I/O and for memory used for DMA transfers. Such pages are mapped with full sharing between the two instances.
For a live update of the VM service, the last two points are different. We describe the exceptions for VM in a later section.
With this situation as a given, MINIX3 allows for two forms of state transfer: identity transfer, and state transfer by the magic framework. These forms of state transfer are covered in the next two sections.
=== Identity transfer ===
The simple case is identity transfer. Identity transfer is a minimal state transfer approach that can only transfer state from an old instance to a new instance of exactly the same service, that is, a process with exactly the same address space layout and functionality. Identity transfer is also supported when the target service has not been instrumented, and in fact even when the system has not been compiled using LLVM bitcode at all.
Since the new instance is a newly started copy of the same service, it already has a text section that is identical to that of the old instance. As described, the stack section need not be transferred, and the mmap section is inherited automatically.
Therefore, identity transfer is concerned with the data and heap sections only. The new instance's identity transfer procedure starts by copying over the old instance's entire data section to itself. This includes the variable that contains the size of the old instance's heap (''_brksize''). The identity transfer procedure then calls brk(2) to allocate a heap for itself which is just as large, and copies over the old instance's entire heap section to itself as well. The identity transfer procedure is implemented in the System Event Framework (SEF) as part of libsys.
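The three steps can be sketched with a toy model of an address space. This is not the libsys implementation: ''struct instance'', ''toy_brk()'' and ''identity_transfer()'' are hypothetical, the ''brksize'' field merely stands in for ''_brksize'', and the toy heap is an ordinary allocation rather than a region at a fixed virtual address. In the real procedure both instances use the same addresses, which is why pointers remain valid after a plain copy.

```c
#include <stdlib.h>
#include <string.h>

/* Toy data section: the heap size plus some service globals. */
struct data_section {
    size_t brksize;      /* stand-in for _brksize */
    int    counters[4];
};

struct instance {
    struct data_section data;
    unsigned char      *heap;  /* NULL while the heap is empty */
};

/* Stand-in for brk(2): resize the toy heap to `size` bytes. */
static int toy_brk(struct instance *inst, size_t size)
{
    unsigned char *p = realloc(inst->heap, size ? size : 1);

    if (p == NULL)
        return -1;
    inst->heap = p;
    return 0;
}

int identity_transfer(struct instance *new_inst,
                      const struct instance *old_inst)
{
    /* 1. Copy the data section wholesale, including the heap size. */
    memcpy(&new_inst->data, &old_inst->data, sizeof(new_inst->data));

    /* 2. Allocate a heap that is just as large. */
    if (toy_brk(new_inst, new_inst->data.brksize) != 0)
        return -1;

    /* 3. Copy the heap contents as is. */
    if (new_inst->data.brksize > 0)
        memcpy(new_inst->heap, old_inst->heap, new_inst->data.brksize);
    return 0;
}
```

Note that no type information is needed anywhere in this procedure, which is why identity transfer also works for services that have not been instrumented.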
If the system is not built with ''MKMAGIC=yes'', which means that ''_MINIX_MAGIC'' is not defined, then the mmap section of the process is not well delineated and may in fact overlap with other memory areas. This is intentional, as it ensures that for such a set-up, the address space layout of services is not unnecessarily restricted and services can use the full address space for, say, a page cache. However, as a result, some memory-mapped areas may not be mapped into the new process, possibly leading to a segmentation fault after the live update. Therefore, even identity transfer is not expected to be reliable on a system //not// built with ''MKMAGIC=yes''. Eventually, MINIX3 should be changed to use another approach for transferring memory-mapped regions to the new process altogether, which is either not based on ranges or not the default at all. See also the section on open issues in this document.
=== Magic state transfer ===
The other case is state transfer by the magic framework. This type of state transfer is used by the //self state transfer//, //ASR rerandomization//, and //functional update// update types as covered in the users guide. This form of state transfer relies on the magic pass and library to implement instrumentation and runtime support for state transfer. Again, state transfer is performed by the new instance of the service, using full access to the address space of its old instance.
The magic framework's state transfer procedure transfers data objects one by one. This includes all //static// objects. In this context, an object may for example be one global variable. The actual transfer of an object is not a simple memory copy; it involves analyzing any pointers in the object and adjusting these pointers as appropriate to match the address layout of the new instance.
The state transfer procedure also transfers //dynamic// data objects, which are located in the heap and mmap sections of the old instance. In essence, the procedure recreates the heap and mmap sections during the state transfer, by allocating new heap or mapped memory for each dynamic object, and then transferring its actual contents. This again includes pointer analysis and adjustment. Here, one object is one piece of memory created by a call to malloc(3) or mmap(2), for example.
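The pointer adjustment step can be sketched as follows. This is a simplified illustration and not the libmagicrt data structures or algorithm: ''struct object_pair'' and ''translate_pointer()'' are hypothetical. The idea is that once each old object has been paired with its counterpart in the new instance, a pointer into an old object is rewritten to point at the same offset within the paired new object.

```c
#include <stddef.h>
#include <stdint.h>

/* One paired object: where it lived in the old instance and where its
 * counterpart lives in the new instance. */
struct object_pair {
    uintptr_t old_addr;
    uintptr_t new_addr;
    size_t    size;
};

/* Translate a pointer value taken from the old instance. Returns 0 on
 * success, or -1 if the pointer does not fall inside any known object. */
int translate_pointer(const struct object_pair *pairs, size_t npairs,
                      uintptr_t old_ptr, uintptr_t *new_ptr)
{
    if (old_ptr == 0) {           /* a null pointer stays a null pointer */
        *new_ptr = 0;
        return 0;
    }
    for (size_t i = 0; i < npairs; i++) {
        if (old_ptr >= pairs[i].old_addr &&
            old_ptr - pairs[i].old_addr < pairs[i].size) {
            /* Keep the offset into the object, change only the base. */
            *new_ptr = pairs[i].new_addr + (old_ptr - pairs[i].old_addr);
            return 0;
        }
    }
    return -1;
}
```

The failure case corresponds to a pointer that the framework cannot account for. In the real system, such cases have to be resolved by the programmer, for example through annotations.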
Since MINIX3 already transfers the mmap section to the new instance automatically, the state transfer framework starts by unmapping all memory-mapped areas that it knows it will recreate. However, since some memory areas (the aforementioned memory-mapped I/O and DMA memory) cannot be recreated by the magic framework, these are not destroyed and recreated. These areas are called //special//, //out-of-band// memory. The service has to tell the magic runtime library about special memory areas. For the two common ways of allocating such memory, alloc_contig(3) and vm_map_phys(2), this is done automatically by libsys.
Out-of-band memory is seen as opaque, physically and virtually unmovable memory, and ignored entirely for the purpose of state transfer. Thus, if a piece of out-of-band memory contains a pointer to a piece of memory that is //not// marked as out-of-band, this pointer will be missed during state transfer. For the aforementioned (memory-mapped I/O and DMA) memory types, this is not a problem.
The default of inheriting the entire mmap section leads to the situation that if the magic framework misses any memory-mapped areas for any reason, these will effectively translate to a memory leak in the new instance. Currently, one such memory leak is addressed explicitly: the page directory that is allocated with mmap(2) internally by the libc malloc code.
The state transfer procedure may fail if its analysis is not successful, in which case the system will roll back to the old instance, and the live update fails. It is then up to the programmer to deal with such problems. This may involve annotating source code, for example to instruct the state transfer procedure to ignore certain pointers, or to copy over certain data as is. It may involve adding special state transfer routines to libmagicrt, which deal with fundamentally problematic cases such as unions. In rare cases, it may involve adapting source code to avoid state transfer problems. We discuss all this in more detail later.
In the case of self state transfer, all static objects will have the same location in both the old and the new instance. However, due to their dynamic recreation, the addresses of dynamic objects may change during self state transfer.
In the case of ASR rerandomization, not just the dynamic part, but also the static part of the address space will have objects that are relocated between the old and the new instance. In addition, ASR rerandomization permutes the order in which the old instance's dynamic objects are allocated in the new instance. Finally, the asr pass may insert padding which may expose wrong assumptions about alignment of various buffers. Thus, while live rerandomization is a security feature, in practice it may expose not only additional problems with state transfer, but also bugs in the service itself.
In the case of a functional update, the new instance may be fundamentally different from the old instance. Unlike the previous cases of state transfer, such live updates may involve functions and global variables that are added or removed, thus causing problems in the //pairing// part of the state transfer. The programmer may have to provide explicit state transfer routines in order to deal with these problems.
=== Exceptions for services ===
While MINIX3 allows all of its services to be updated, certain services require special exceptions to allow for live updates, because they are crucial to the live update process itself. These services are RS and VM. This section elaborates on the exceptions made for RS and VM, and explains why VM in particular cannot be updated arbitrarily.
== The RS service ==
TODO
== The VM service ==
MINIX3 has limited support for performing a live update of the VM (Virtual Memory) service. There are two reasons why VM is a special case. First, VM provides essential memory management and page fault handling functionality to the other system services. Thus, the live update must ensure that none of that functionality is required during the course of a live update that includes VM. Second, VM's core data structures include page tables. If these page tables are changed during a live update, it may be impossible to perform a proper rollback.
During normal operation, VM may allocate memory for itself. VM has both a heap and dynamic pages, implementing special local versions of brk(2) and mmap(2) to support this. In particular, page tables are stored in dynamically allocated memory, effectively in VM's mmap section. For VM, the live update procedure must therefore include the transfer of such dynamic state from the old to the new VM instance.
Since page tables cannot simply be copied, they are made visible to the new instance by mapping the old instance's dynamic memory ranges directly into the new instance's address space. That means that any changes made to the dynamic data structures by the new instance (page tables included) become visible to the old instance after a rollback. However, the two instances do each have their own static memory (i.e., text and data sections, as well as a preallocated stack). Thus, any changes to dynamic memory made by the new instance would create a potential mismatch between the static and dynamic memory in the old instance after rollback.
Therefore, in order to allow for rollback, VM must not make any changes to its dynamic memory during the live update. That also means that it may not allocate memory during the live update, not for other processes and not for itself. This situation leads to the following exceptions:
* First and foremost, since the new VM instance essentially inherits the old instance's dynamic memory, the dynamic memory must be ignored by the state transfer framework. For this reason, at startup, VM tells libmagicrt that its entire dynamic memory region consists of special, out-of-band data. As a result, any pointers in this region will not be analyzed or adjusted by the state transfer procedure. This is a good thing, as changes to such pointers would not be undone after a rollback. However, the main consequence is that if the static memory layout of the VM process changes, any pointers in dynamic memory that point to static memory will become invalid. Therefore, updates to VM are limited to the **identity transfer** and **self state transfer** update types.
* Another effect of the automatic dynamic memory inheritance is that dynamic memory allocations need not, and must not, be tracked. Therefore, dynamic memory allocation functions are not instrumented in VM at all, which requires an instrumentation override. This override in turn requires disabling some other instrumentation features, such as the aforementioned libc malloc page directory exception. The features are disabled during VM's linking process, through special statements in its Makefile.
* After a rollback, the old VM instance still has to perform a small number of corrective actions to undo changes made by the new instance. These actions are however kept to a minimum. In the future, more extended non-transparent rollback may be the key to allowing more invasive live updates to the VM service.
* The state transfer procedure requires some temporary memory to do its job. Since it cannot allocate such memory directly, an //initialization buffer// is preallocated in the new VM instance, and the state transfer procedure uses this buffer instead of allocating memory dynamically.
* RS requests VM to preallocate (//pin//) RS's memory before starting a live update, so that RS will not require VM's functionality during the live update.
* For multicomponent live update operations that include VM, all memory-modifying actions are performed before, rather than during, the actual live update operation, using special preparation requests sent by RS to VM. The memory of all new instances is also preallocated in order to avoid memory allocation and pagefaults during the live update. The old VM instance is the last process that is made ready for the update, and the new VM instance is the first process that gets to run right after.
* Despite the pinning, the new VM instance may have to handle brk(2) system calls coming from other new service instances that are all part of the same multicomponent live update. IPC filters are used to ensure that the new VM instance gets requests only from the other services in this group, and not from any other running services. Note that VM does not make any changes to its dynamic memory while handling a brk(2) call. Also, since all memory is preallocated, the VM instance should never get any pagefaults or handle-memory requests for other services' new instances; such requests are blocked by the IPC filters as well. If they do occur, they should result in a timeout of the entire multicomponent live update.
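The initialization buffer mentioned in the list above can be pictured as a simple bump allocator over a statically reserved array. The following is a minimal sketch of that idea only, not VM's actual code: the names ''init_buf_alloc'' and ''init_buf_reset'' and the 4096-byte size are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical size; the real buffer size is not specified in this document. */
#define INIT_BUF_SIZE 4096

/* Statically reserved, so using it never triggers a run-time allocation. */
static _Alignas(max_align_t) unsigned char init_buf[INIT_BUF_SIZE];
static size_t init_buf_used;

/* Hand out 'size' bytes from the buffer, or NULL if the request does not fit. */
void *init_buf_alloc(size_t size)
{
    const size_t align = _Alignof(max_align_t);
    size_t rounded;
    void *p;

    /* Invariant: the fill level is always a multiple of the alignment. */
    assert(init_buf_used % align == 0);

    if (size == 0 || size > INIT_BUF_SIZE)
        return NULL;
    rounded = (size + align - 1) / align * align;
    if (rounded > INIT_BUF_SIZE - init_buf_used)
        return NULL;

    p = &init_buf[init_buf_used];
    init_buf_used += rounded;
    return p;
}

/* Release everything at once, for example once state transfer is done. */
void init_buf_reset(void)
{
    init_buf_used = 0;
}
```

Because the array lives in static memory, running out of space shows up as a ''NULL'' return rather than as a request for more memory, which is the property the VM exception relies on.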
Overall, it should be clear that live update for the VM service is rather brittle. Eventually, a full revision of the live update approach for VM will have to reveal whether some or all of the current restrictions can be lifted.
==== State transfer in practice ====
In this section, we elaborate on some of the practical details of the state transfer of the magic framework, mainly aiming to allow developers to resolve real-world state transfer failures.
We do //not// get into most of the theoretical side of the state transfer, and we skip over many other practical details. Interested readers are advised to read the published work of Cristiano Giuffrida - see the "Further reading" section at the bottom of this document.
=== Some basics and terminology ===
The magic framework keeps track of each //static// object of data using a **sentry** ("state entry") data structure. The framework keeps track of each //dynamic// object of data using a **dsentry** ("dynamic state entry") data structure, which itself has an embedded //sentry// data structure. The magic pass installs libmagicrt wrappers around memory allocation routines so that it can allocate extra memory to store the dsentry metadata right before the actual memory object. Special, out-of-band memory regions are maintained in **obdsentry** ("out-of-band dynamic state entry") data structures. Since no extra memory can be allocated next to the actual memory object in this case, obdsentries themselves are (currently) stored as static data as part of libmagicrt's own state.
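The idea of storing dsentry metadata right before the memory object can be sketched as follows. This is a simplified illustration under stated assumptions, not the libmagicrt implementation: the structures ''sentry_sketch'' and ''dsentry_sketch'', their fields, and the functions ''tracked_malloc'', ''dsentry_of'' and ''tracked_free'' are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical, much-simplified stand-ins for the libmagicrt structures. */
struct sentry_sketch {
    const char *name;   /* symbolic name of the object */
    size_t size;        /* size of the object in bytes */
    int type_id;        /* index into a (hypothetical) type table */
};

struct dsentry_sketch {
    struct sentry_sketch sentry;    /* embedded sentry */
};

/* Header size, rounded up so that the object behind it stays aligned. */
#define DSENTRY_HDR_SIZE \
    ((sizeof(struct dsentry_sketch) + _Alignof(max_align_t) - 1) \
     / _Alignof(max_align_t) * _Alignof(max_align_t))

/* Allocation wrapper: reserve room for the header in front of the object. */
void *tracked_malloc(size_t size, const char *name, int type_id)
{
    unsigned char *raw;
    struct dsentry_sketch *d;

    if (size > SIZE_MAX - DSENTRY_HDR_SIZE)
        return NULL;
    raw = malloc(DSENTRY_HDR_SIZE + size);
    if (raw == NULL)
        return NULL;

    d = (struct dsentry_sketch *)raw;
    d->sentry.name = name;
    d->sentry.size = size;
    d->sentry.type_id = type_id;
    return raw + DSENTRY_HDR_SIZE;   /* the caller only sees the object */
}

/* Recover the metadata from an object pointer by stepping back one header. */
struct dsentry_sketch *dsentry_of(void *obj)
{
    assert(obj != NULL);
    return (struct dsentry_sketch *)((unsigned char *)obj - DSENTRY_HDR_SIZE);
}

void tracked_free(void *obj)
{
    if (obj != NULL)
        free(dsentry_of(obj));
}
```

The sketch also shows why out-of-band regions need a different representation: for memory that was not obtained through such a wrapper, there is no room in front of the object to step back into.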
The magic framework also uses the concept of a **selement** ("state element"), which is a particular element within a state entry; for example, it can be one particular field in a structure. State is transferred one element at a time. If the state transfer procedure encounters a problem, it will report about the state element that is causing the problem.
Each pointer in the service process is expected to point to a data type known to libmagicrt. All the possible data types that can be used by the service are enumerated through static analysis by the magic pass, and stored in a **type** table as part of the instrumented service. It may happen that one data type is cast to another, either in the source code of the service or as a result of the LLVM compilation and linking process. As a result, while the static analysis may conclude that a pointer is for one type, runtime state transfer may find that the pointer was (for example) allocated for another type. Normally, such mismatches would cause state transfer to fail, but casting makes this a legitimate case. Therefore, the magic framework has the notion of **compatible types**: if type A is cast to type B anywhere, type A is marked as a compatible type for type B, and finding type A when transferring data of supposed type B will not result in state transfer failure. The magic pass adds a list of compatible types to the service binary as well, all for use by libmagicrt at state transfer time.
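The compatible-types rule can be illustrated with a small lookup. This is a sketch of the rule as worded above, not of libmagicrt's data structures: the type IDs, the table layout and the function ''type_acceptable'' are hypothetical, and the sketch treats compatibility as one-directional, which is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical type IDs; the real type table is generated by the magic pass. */
enum { TYPE_INT = 1, TYPE_FOO = 2, TYPE_BAR = 3 };

/* "from_type was cast to to_type somewhere in the program". */
struct compat_pair {
    int from_type;
    int to_type;
};

static const struct compat_pair compat_table[] = {
    { TYPE_FOO, TYPE_BAR },
};

/* May data of type 'found' be accepted where type 'expected' was assumed? */
bool type_acceptable(int found, int expected)
{
    size_t i;

    assert(found > 0 && expected > 0);   /* IDs in this sketch are positive */

    if (found == expected)
        return true;
    for (i = 0; i < sizeof(compat_table) / sizeof(compat_table[0]); i++) {
        if (compat_table[i].from_type == found &&
            compat_table[i].to_type == expected)
            return true;
    }
    return false;
}
```

With this table, finding a ''TYPE_FOO'' object behind a pointer declared as ''TYPE_BAR'' is accepted, while an unrelated pairing such as ''TYPE_INT'' and ''TYPE_BAR'' is still reported as a mismatch.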
=== Annotation ===
The analysis part of state transfer may not always succeed, for a variety of reasons. In particular, the state transfer framework has problems with unknown pointers, unions, and more generally cases of ambiguity. Such issues can often be resolved by the programmer through annotation in source code, which instructs the state transfer framework what to do with a particular variable. A variable can be annotated by prefixing either its type name (through ''typedef'') or its variable name with the annotation prefix followed by an underscore (e.g., ''noxfer_foo''). The following annotation prefixes are supported by the magic framework.
* **noxfer**: No Transfer. This annotation will prevent transfer of the state altogether, instead zeroing out the memory in the new instance. As an example, the noxfer annotation can be used in cases where analysis is failing (e.g., in unions) and the new instance will never be using the old instance's data anyway. A practical example where it is used is the ''message'' type. This data type contains a complicated union, and the quiescence model typically ensures that transferring this state is not necessary, as the service being updated is not involved in processing a message at the time of the update.
* **ixfer**: Identity Transfer. This annotation will copy the data over as is, without performing analysis on the memory. As an example, the ixfer annotation can be used for pointer values that should not be analyzed as pointers, for instance because they are pointers into another address space. A practical example where it is used is a process table copied in from another service. Such process tables typically contain external pointers, which will be unused by the local service. Some other values may still be needed after state transfer, which is why ixfer is used rather than noxfer.
* **cixfer**: Conditional Identity Transfer. This annotation will cause the state transfer framework to try to interpret and transfer the value as a pointer, and fall back to identity transfer if this fails. As an example, the cixfer annotation can be used for variables which may contain either a pointer or a number value which is never a valid pointer, making the variable effectively a union of the two types. A practical example where it is used is a callback value, which is of type ''void *'' but may be used to store a small integer as well.
* **pxfer**: Pointer Transfer. This annotation forces a value to be interpreted as a pointer, and transferred accordingly. As an example, the pxfer annotation may be used when a pointer value is stored in an integer type. The pxfer annotation may also be used for a union of (differently typed) pointers. Thus, in some cases, a union-of-structures can be split up into a union of non-pointers and one or more unions of pointers, marking the non-pointer union with ''ixfer'' and the pointer union(s) with ''pxfer''. This is indeed how ''pxfer'' is currently used in practice as well.
* **sxfer**: Structure Transfer. This annotation forces a union that consists of structures, to be interpreted as one single structure, and transferred accordingly. The annotation requires that the fields of the structures making up the union all line up. For example, if the first field of one structure in the union is an integer value, then the first field of all other structures in the union must be an integer value as well. If the second field is a pointer in one structure, it must be a pointer in all of them, etcetera. The sxfer annotation can be used to resolve state transfer issues with unions that consist of nearly-identical structures. The programmer must line up the structure's fields as appropriate when annotating the union as sxfer.
The transfer exception is applied to the type (or variable) with the annotation. For example, a noxfer typedef for a pointer to a structure will refrain from transferring that pointer:
<code c>
typedef struct foo * noxfer_foo_ptr_t; /* annotate the pointer */
struct foo my_foo_struct; /* the structure will be transferred */
noxfer_foo_ptr_t my_foo_pointer = &my_foo_struct; /* the pointer will not be transferred */
</code>
However, a pointer to a noxfer typedef'ed structure will be transferred; the contents of the structure will not:
<code c>
typedef struct foo noxfer_foo_t; /* annotate the structure */
noxfer_foo_t my_foo_struct; /* the structure will not be transferred */
noxfer_foo_t * my_foo_pointer = &my_foo_struct; /* the pointer will be transferred */
</code>
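The other prefixes follow the same naming rule. The sketch below shows how the //cixfer//, //pxfer// and //ixfer// cases described in the list might look in a service. Only the prefixes come from this document; every type, field and function name is hypothetical, and the code shows the naming convention only, not the behavior of the framework.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* cixfer: a callback argument holding either a pointer or a small integer. */
typedef void *cixfer_cb_arg_t;

/* pxfer: a local pointer that happens to be stored in an integer type. */
typedef uintptr_t pxfer_buf_addr_t;

/* ixfer: a structure copied in from elsewhere, transferred as is. */
struct remote_slot {
    void *remote_ptr;   /* only meaningful in another address space */
    int flags;          /* still needed locally after the update */
};
typedef struct remote_slot ixfer_remote_slot_t;

/* noxfer on a variable name: scratch space the new instance never reads. */
int noxfer_scratch[16];

struct request {
    cixfer_cb_arg_t cb_arg;
    pxfer_buf_addr_t buf_addr;
    ixfer_remote_slot_t slot;
};

/* Fill in a request the way the annotations above anticipate. */
void request_init(struct request *r, char *buf, int small_id)
{
    assert(r != NULL);
    r->cb_arg = (void *)(intptr_t)small_id;   /* integer stored in a void * */
    r->buf_addr = (pxfer_buf_addr_t)buf;      /* pointer stored in an integer */
    r->slot.remote_ptr = NULL;
    r->slot.flags = 0;
}
```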
It is possible to enable debugging flags in libmagicrt such that it will print more details on how it handles annotated exceptions: in ''minix/lib/libmagicrt/include/st/callback.h'', change ''ST_CB_DEFAULT_FLAGS'' from ''(ST_CB_PRINT_ERR)'' to ''(ST_CB_PRINT_ERR|ST_CB_PRINT_DBG)''. The debugging statements will be sent to the system log, and have a ''[DEBUG]'' prefix.
=== Custom state transfer routines ===
Custom state transfer routines can be used in cases where annotation does not suffice.
TODO
There is currently one example case where a custom state transfer routine is used, namely for the ''dsi_u'' union in the ''struct data_store'' structure which is used by the Data Store (DS) service and defined in ''minix/servers/ds/store.h''. The custom state transfer routine is located in ''minix/lib/libmagicrt/magic_ds.c'', and provides the state transfer process with information as to which of the union's fields should be transferred.
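The underlying problem is that of a tagged union: only a discriminator stored next to the union says which field is live. The sketch below illustrates that idea in isolation. It is not the code in ''magic_ds.c'', and the structure, field and enumerator names are hypothetical simplifications rather than the definitions from ''store.h''.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical discriminator values. */
enum entry_kind { KIND_U32, KIND_STRING, KIND_MEM };

struct mem_value {
    void *data;
    size_t length;
};

union value_u {
    unsigned int u32;       /* plain value: can be copied as is */
    char *string;           /* pointer: target must be found and relocated */
    struct mem_value mem;   /* contains a pointer as well */
};

struct store_entry {
    enum entry_kind kind;   /* tells which union field is in use */
    union value_u u;
};

enum xfer_choice { XFER_PLAIN_VALUE, XFER_STRING_PTR, XFER_MEM_PTR, XFER_UNKNOWN };

/* Decide how the union of one entry should be treated during transfer. */
enum xfer_choice choose_union_xfer(const struct store_entry *e)
{
    assert(e != NULL);
    switch (e->kind) {
    case KIND_U32:    return XFER_PLAIN_VALUE;
    case KIND_STRING: return XFER_STRING_PTR;
    case KIND_MEM:    return XFER_MEM_PTR;
    default:          return XFER_UNKNOWN;   /* refuse to guess */
    }
}
```

Automatic analysis cannot make this choice, because nothing in the type information links the discriminator to the union; that link is exactly what a custom routine supplies.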
=== Preventing state transfer issues ===
In some cases, small adjustments must be made to a service in order to prevent issues with state transfer. These types of issues will not result in failure of the state transfer procedure; instead, they may result in a crash of the new instance after a seemingly successful live update. We cover three topics: memory grants, userspace threads, and physically unmovable regions.
== Memory grants ==
One potential issue concerns memory grants. Each service has a memory grant table, which is an array of all the memory grants that allow other processes to read and/or write the service's memory. If the service has any grants active at the time of a live update, the grants should in theory be adjusted in accordance with any relocation of the memory pointed to by the grants.
However, the main union of the grant structure (''cp_grant_t'', defined in ''minix/include/minix/safecopies.h'') is currently marked as //ixfer//, meaning it will be transferred as is. This is not a problem for grants that point to memory //outside// the process being updated, and that means that indirect and magic grants pose no problem for state transfer. It is however a problem for grants that point to memory //inside// the process being updated, that is, for **direct grants**.
For this reason, for a service that may potentially have direct grants active at the time of the live update, its writer has two options: 1) implement a custom state transfer routine for the ''cp_grant_t'' structure in libmagicrt, thus resolving the problem described in this entire section altogether, or 2) prevent live updates of the service whenever the service has active memory grants. The first option is preferred. In any case, the potential consequence of doing neither is that the service ends up suffering from arbitrary memory corruption after a live update, since the transferred direct grant will point to the wrong memory location.
The live update system itself actually relies on the presence of a long-running direct grant, which provides access of the process's full address space to the process itself. The new instance uses this grant during a live update to access the memory contents of the old instance. Since the grant provides access to the process's entire address space, it does not suffer from the problem above.
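The second option, refusing an update while direct grants are active, could be approximated by counting such grants and consulting the counter from the service's live update preparation check. The sketch below is self-contained and makes several assumptions: the function names and return codes are hypothetical, and a real service would update the counter at the points where it creates and revokes its direct grants.

```c
#include <assert.h>

#define SKETCH_OK         0      /* hypothetical "ready for update" code */
#define SKETCH_NOT_READY  (-1)   /* hypothetical "refuse update" code */

/* Direct grants created by the service itself. The long-running self grant
 * used by the live update system is deliberately not counted. */
static int nr_direct_grants;

/* Call right after the service has created a direct grant. */
void direct_grant_created(void)
{
    nr_direct_grants++;
}

/* Call right after the service has revoked a direct grant. */
void direct_grant_revoked(void)
{
    assert(nr_direct_grants > 0);
    nr_direct_grants--;
}

/* Preparation check: only allow the update when no direct grant is active. */
int sketch_lu_prepare_grants(void)
{
    return (nr_direct_grants == 0) ? SKETCH_OK : SKETCH_NOT_READY;
}
```

This keeps the service safe at the cost of availability: an update request is rejected rather than risking a stale grant, which is why the custom state transfer routine remains the preferred option.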
== Userspace threads ==
Userspace threads pose a problem for state transfer as well. We have previously explained that the process stack of the old instance can be disregarded by the state transfer procedure because it is "naturally" recreated in the new instance. The same does not apply to the stacks of userspace threads, since stack variables are not tracked at run time: even though the threads' stacks are transferred to the new instance by the magic framework, they are seen as blobs of (typically) memory-mapped character arrays. The result is that any pointers on these stacks will not be known to libmagicrt and thus not be transferred properly. In addition, thread context (CPU register) state will typically be stored as an array of integers, and similarly end up being skipped by the state transfer procedure. The result is that while state transfer may (appear to) succeed, the service will crash after completion of the live update.
At this time, the recommended solution is for the service to shut down all threads explicitly before starting state transfer, and to recreate the threads both after successful live update and as part of a rollback. The service may refuse to be updated if any of its threads are in use and cannot be shut down. The last point requires that the service supply a custom callback routine to SEF to perform that check for a quiescence state other than the default, through sef_setcb_lu_prepare(3). In order to allow the use of a nondefault state, a sef_setcb_lu_state_isvalid(3) callback routine must be supplied as well. For VFS and libblockdriver, we have chosen the following approach:
* Before SEF startup, the call ''sef_setcb_lu_state_isvalid(sef_cb_lu_state_isvalid_standard)'' is used to mark all standard quiescence states as valid, including ''SEF_LU_STATE_REQUEST_FREE'' and ''SEF_LU_STATE_PROTOCOL_FREE''.
* At the same time, a custom callback function is set using sef_setcb_lu_prepare(3). When SEF calls this function with either the ''SEF_LU_STATE_REQUEST_FREE'' or the ''SEF_LU_STATE_PROTOCOL_FREE'' state, the function will first check whether all worker threads are idle. If they are not, it will return a failure, aborting the live update. If they are, it will shut them down and report success.
* Similarly, a custom callback function is set using sef_setcb_lu_state_changed(3). When SEF calls this function with the //old// state being either ''SEF_LU_STATE_REQUEST_FREE'' or ''SEF_LU_STATE_PROTOCOL_FREE'', the function recreates the worker threads. This ensures that worker threads are recreated in the old instance upon a rollback.
* Finally, the service supplies its own state transfer hook using sef_setcb_init_lu(3). This function will first call the normal state transfer function using SEF_CB_INIT_LU_DEFAULT(), returning an error if state transfer failed. If state transfer succeeded however, and the preparation state given in ''info->prepare_state'' was either ''SEF_LU_STATE_REQUEST_FREE'' or ''SEF_LU_STATE_PROTOCOL_FREE'', then it continues by recreating the worker threads. This ensures that the new instance has worker threads before it leaves the state transfer phase.
The result of this approach is that updates must be invoked with ''-state 2'' (//request free//) or ''-state 3'' (//protocol free//) in order to guarantee proper state transfer. As a sidenote, none of these issues are a problem for identity transfer, which should continue to work even with ''-state 1'' (//work free//, the default).
== Physically unmovable regions ==
Another case where the programmer may have to ensure that state transfer does not result in problems that will surface only after the live update, is when a service uses memory areas that are physically unmovable. Such memory areas are typically in use for DMA purposes. If the state transfer procedure changes the physical location of the buffers, DMA may be performed from or to the original physical location, resulting in garbage and possibly arbitrary memory corruption. Such DMA areas must be marked as special out-of-band memory in libmagicrt, and unmarked when freed, using the sef_llvm_add_special_mem_region(3) and sef_llvm_del_special_mem_region(3) SEF calls. This is done automatically by the alloc_contig(3) and free_contig(3) wrapper routines, but must be done explicitly for memory allocated in different ways.
However, this is only necessary if DMA can happen across a live update. In cases where it is known that no DMA can possibly be ongoing during the live update, the regions are not actually physically unmovable, and therefore need not be marked as such. For example, this is the case for the file system buffer cache implemented in libminixfs. This library allocates and manages buffers without using physically contiguous memory and alloc_contig(3), instead using mmap(2) directly and requesting DMA I/O in page-sized chunks (in order to avoid DMA issues on ARM). Therefore, it would be affected by the above problem, were it not for the fact that all its block I/O calls are synchronous. Any future introduction of more asynchrony will turn this situation into a real problem for live update, though.
As we mentioned before, memory-mapped I/O poses a similar problem. However, the only way to map such I/O memory is currently through the vm_map_phys(2) and vm_unmap_phys(2) calls, of which the libsys wrappers automatically call the special-memory marking/unmarking functions as well.
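For memory that is not obtained through alloc_contig(3), the marking and unmarking has to be paired with allocation and release by hand. The sketch below shows that pairing with local stand-ins: ''sketch_add_special'' and ''sketch_del_special'' merely record regions in a table and take the place of the real SEF calls, whose exact signatures are not given in this document, and ''malloc'' stands in for whatever allocation method the driver actually uses.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define MAX_SPECIAL 8   /* hypothetical table size */

struct special_region {
    uintptr_t start;
    size_t len;
    bool used;
};
static struct special_region special[MAX_SPECIAL];

/* Stand-in for marking a region as special out-of-band memory. */
int sketch_add_special(void *addr, size_t len)
{
    assert(addr != NULL && len > 0);
    for (size_t i = 0; i < MAX_SPECIAL; i++) {
        if (!special[i].used) {
            special[i].start = (uintptr_t)addr;
            special[i].len = len;
            special[i].used = true;
            return 0;
        }
    }
    return -1;   /* table full */
}

/* Stand-in for unmarking a region again. */
int sketch_del_special(void *addr)
{
    for (size_t i = 0; i < MAX_SPECIAL; i++) {
        if (special[i].used && special[i].start == (uintptr_t)addr) {
            special[i].used = false;
            return 0;
        }
    }
    return -1;   /* not found */
}

bool sketch_is_special(const void *p)
{
    uintptr_t a = (uintptr_t)p;

    for (size_t i = 0; i < MAX_SPECIAL; i++)
        if (special[i].used && a >= special[i].start &&
            a - special[i].start < special[i].len)
            return true;
    return false;
}

/* Allocate a DMA buffer and mark it in one step. */
void *dma_buf_alloc(size_t len)
{
    void *buf;

    if (len == 0)
        return NULL;
    buf = malloc(len);   /* placeholder for the real allocation */
    if (buf == NULL)
        return NULL;
    if (sketch_add_special(buf, len) != 0) {
        free(buf);       /* never hand out an unmarked DMA buffer */
        return NULL;
    }
    return buf;
}

/* Unmark before releasing, so the table never refers to freed memory. */
void dma_buf_free(void *buf)
{
    if (buf == NULL)
        return;
    (void)sketch_del_special(buf);
    free(buf);
}
```

Wrapping both steps in one pair of functions mirrors what the text says the alloc_contig(3) and free_contig(3) wrappers do automatically, and makes it hard to forget the unmarking on the release path.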
=== Resolving state transfer errors ===
If the magic state transfer procedure encounters problems, it will report failure, with details written to the system log using an ''[ERROR]'' prefix. In this section, we cover a number of common reasons for state transfer to fail in practice, including some example errors and workarounds.
== Dangling pointers ==
In order to know how to transfer a piece of memory, the magic runtime library must know about the data type associated with that piece of memory. If no type information is known for a piece of memory, it cannot be transferred. There are various reasons why libmagicrt might not have type information about a piece of memory. The simplest one is a case of a **dangling pointer**: a pointer that used to be valid at some point, but no longer is, because the memory pointed to has been deallocated. While the actual program may know not to use that particular pointer anymore, the state transfer routine does not have such knowledge. A typical error resulting from a dangling pointer may look like this, with some important parts of the output highlighted:
* **[ERROR]** uncaught ptr with violations. Current state element:
* SELEMENT: ''(parent=sbuf.1900354961, num=1, depth=0, address=0xdfb760a8, **name**=**sbuf**.1900354961, type=TYPE: (id=53 , name=, size=4, num_child_types=1, type_id=10, bit_width=0, flags(ERDIVvUP)=01010000, values=%%''%%, type_str=i8/**char%%*%%**)) ''
* SEL_ANALYZED: ''(num=1, type=ptr, flags(DIVW)=1110, **value**=**0x080cb49f**, trg_name=, trg_offset=0, trg_flags(RL)=D0, trg_selements=(#1|0: 1|p=SELEMENT: (parent=???, num=0, depth=0, address=0x00000000, name=???, type=TYPE: (id=0 , name=**UNKNOWN_TYPE**, size=0, num_child_types=0, type_id=4, bit_width=0, flags(ERDIVvUP)=10000000, values=%%''%%, type_str=UNKNOWN_TYPE/UNKNOWN_TYPE)))) ''
* SEL_STATS: ''(type=ptr, trg_flags(RL)=D0, ptr_found=1, **unknown_found=1**, violations=1) ''
In this case, the global variable **sbuf** (suffixed with a tag to make its name unique) is a char* pointer to location 0x080cb49f. Since the magic runtime library knows no type information about this target (//trg//) memory location, it marks the location with the placeholder type UNKNOWN_TYPE and aborts state transfer because an unknown type was found. Another example:
* **[ERROR]** uncaught ptr with violations. Current state element:
* SELEMENT: ''(parent=inode.3951291702, num=80, depth=2, address=0xdfbe3210, **name**=**inode.3951291702/4/i_data**, type=TYPE: (id=61 , name=, size=4, num_child_types=1, type_id=10, bit_width=0, flags(ERDIVvUP)=01010000, values=%%''%%, type_str=i8/**char%%*%%**)) ''
* SEL_ANALYZED: ''(num=80, type=ptr, flags(DIVW)=1110, **value**=**0x08108098**, trg_name=, trg_offset=0, trg_flags(RL)=H0, trg_selements=(#1|0: 1|p=SELEMENT: (parent=???, num=0, depth=0, address=0x00000000, name=???, type=TYPE: (id=0 , name=**UNKNOWN_TYPE**, size=0, num_child_types=0, type_id=4, bit_width=0, flags(ERDIVvUP)=10000000, values=%%''%%, type_str=UNKNOWN_TYPE/UNKNOWN_TYPE)))) ''
* SEL_STATS: ''(type=ptr, trg_flags(RL)=H0, ptr_found=1, **unknown_found=1**, violations=1) ''
In this case, the **i_data** field of the fifth element (**/4/**) of the global **inode** structure, also a char* pointer, is pointing to address 0x08108098 which is unknown to libmagicrt. The pointer address typically allows one to determine what kind of memory it is, by means of the memory sections of the process. In this particular example, the address was somewhat higher than the service's data end, thus suggesting the memory pointed to is heap memory. This matched with the source code of the service (PFS, the Pipe File Server), which dynamically allocates and frees the i_data buffers using malloc(3) and free(3).
It is up to the programmer of the service to ensure that the state transfer routine will not attempt to transfer a dangling pointer. This can be as simple as zeroing out the pointer after use, which is usually good practice anyway:
<code c>
free(rip->i_data);
rip->i_data = NULL;
</code>
That is the solution that we applied in both cases.
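Where a buffer is released in several places, the two statements can be wrapped in one helper so that the pointer is never freed without also being cleared. The following is a minimal sketch: ''struct inode_sketch'' and ''inode_release_data'' are hypothetical names, not taken from the PFS source.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical, reduced version of an inode with a heap-allocated buffer. */
struct inode_sketch {
    char *i_data;
};

/* Free the buffer and clear the pointer in one step. */
void inode_release_data(struct inode_sketch *rip)
{
    assert(rip != NULL);
    free(rip->i_data);    /* free(NULL) is a no-op, so repeated calls are safe */
    rip->i_data = NULL;   /* nothing dangling is left for state transfer to find */
}
```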
== External pointers ==
A similar problem occurs when a process has a pointer that is only valid in the address space of another process, or possibly the kernel. Unlike dangling pointers, such external pointers are never valid, and thus do not need to be transferred as pointers. The magic framework must be instructed to that end, for example using //noxfer// annotation. However, external pointers often end up in the local address space as a result of copying in entire structures at once (we already gave process tables as an example), in which case it may be necessary to use //ixfer// rather than //noxfer//. For example, the ProcFS (/proc File System) service has several instances of the following construction:
<code c>
typedef struct mproc ixfer_mproc_t;
static ixfer_mproc_t mproc;
</code>
In some cases, it may make more sense to zero out pointers instead. In other cases, we have changed code to retrieve not entire kernel tables but only specific values, or to use the kernel-mapped pages instead of copies of kernel structures to retrieve values. The magic runtime library already ignores pointers into kernel space (that is, 0xf0000000 and higher) altogether.
Theoretically, it is possible that remote pointers end up being valid in the local address space by sheer luck. In known cases of copying in external pointers, it is best not to rely on failures in the magic framework, but rather to annotate the code in a proactive manner.
== Weak symbols ==
If a service uses weak symbols, the code and data pointed to by these weak symbols may not be included in the linked service object at the time that the instrumentation passes are run. These weak symbols will be resolved and included only after the instrumentation stage. This results in the situation that some of the code and data that is part of the service, will not have been analyzed by the magic pass. The result is a range of possible state transfer failures, including cases where pointers end up pointing to unknown static memory and cases where memory allocation is not properly instrumented, ultimately leading to pointers to unknown dynamic memory.
The following example is from the DS service, where its use of the weak aliases for regcomp(3) and regfree(3) resulted in regcomp's malloc(3) calls not being instrumented:
* **[ERROR]** uncaught ptr with violations. Current state element:
* SELEMENT: ''(parent=ds_subs.1944246923, num=9, depth=3, address=0xdfbe6108, name=**ds_subs.1944246923/0/regex/re_g**, type=TYPE: (id=18 , name=, size=4, num_child_types=1, type_id=10, bit_width=0, flags(ERDIVvUP)=00000000, values=%%''%%, type_str=opaque*)) ''
* SEL_ANALYZED: ''(num=9, type=ptr, flags(DIVW)=1110, **value**=**0x08111000**, trg_name=, trg_offset=0, trg_flags(RL)=, trg_selements=(#1|0: 1|p=SELEMENT: (parent=???, num=0, depth=0, address=0x00000000, name=???, type=TYPE: (id=0 , name=**UNKNOWN_TYPE**, size=0, num_child_types=0, type_id=4, bit_width=0, flags(ERDIVvUP)=10000000, values=%%''%%, type_str=UNKNOWN_TYPE/UNKNOWN_TYPE)))) ''
* SEL_STATS: ''(type=ptr, ptr_found=1, **unknown_found=1**, violations=1) ''
In this case, the pointer **ds_subs[0].regex.re_g** ended up pointing to the unknown heap-section value of 0x08111000. We worked around this issue by forcing DS to use the targets of the weak aliases, _regcomp and _regfree, rather than their original names, using Makefile hacks.
== Code used only in libmagicrt ==
If the magic runtime library itself uses other library modules, for example from libc, and these modules are not already used by the service itself anyway, then the bitcode linker may not include them in the linked object on which the instrumentation passes are run. Again, this may result in various failures, and unknown pointers in particular:
* **[ERROR]** uncaught ptr with violations. Current state element:
* SELEMENT: ''(parent=_ctype_tab_, num=1, depth=0, address=0xdfb760a8, **name**=**_ctype_tab_**, type=TYPE: (id=204 , name=, size=4, num_child_types=1, type_id=10, bit_width=0, flags(ERDIVvUP)=11000000, values=%%''%%, type_str=i16/unsigned short*)) ''
* SEL_ANALYZED: ''(num=1, type=ptr, flags(DIVW)=1110, **value**=**0x0809ccb6**, trg_name=, trg_offset=0, trg_flags(RL)=, trg_selements=(#1|0: 1|p=SELEMENT: (parent=???, num=0, depth=0, address=0x00000000, name=???, type=TYPE: (id=0 , name=**UNKNOWN_TYPE**, size=0, num_child_types=0, type_id=4, bit_width=0, flags(ERDIVvUP)=10000000, values=%%''%%, type_str=UNKNOWN_TYPE/UNKNOWN_TYPE)))) ''
* SEL_STATS: ''(type=ptr, ptr_found=1, **unknown_found=1**, violations=1) ''
In this particular failure case, the global **_ctype_tab_** variable pointed into another global variable, at location 0x0809ccb6 in the data section. The other global variable was invisible to the magic pass, so no **sentry** object could be created for it. As a result, libmagicrt did not know about the target of the pointer. The _ctype_tab_ variable itself was used by the isalpha(3) family of macros from within libmagicrt. We worked around this issue by putting our own replacement set of macros in libmagicrt instead.
== Assembly code ==
Yet another case that leads to invisibility of certain aspects is the direct inclusion of assembly code. Assembly code is machine code, not bitcode, and thus the bitcode instrumentation passes will have problems processing it. Needless to say, the use of assembly code should be minimal throughout the source code. In cases where it cannot be avoided, custom solutions have to be found for any resulting state transfer problems. Fortunately, much of the assembly in use by services these days is the result of optimized str*(3) and mem*(3) functions, which require no special treatment for the purpose of state transfer.
== Incompatible types ==
Finally, we describe one class of state transfer failures which are the result of shortcomings in the magic instrumentation framework itself. LLVM bitcode has the notion of an **opaque** data type. The opaque data type is used for data of which the type has been declared but not defined, typically as a result of forward declarations of structures (''struct foo;''). Instead of resolving these types after they have been instantiated, LLVM tends to cast between various data types which are identical except for the presence of opaque pointers. As a result, opaque pointers may show up in various places in linked bitcode.
The magic pass should mark all these practically identical data types as //compatible types//. However, due to the fact that the casts can take rather complex forms, this is not always happening. The result is that in some cases, state transfer may fail because libmagicrt erroneously detects an incompatibility between a pointer type and the type of data being pointed to. As an example, the following state transfer error was reported during state transfer of the PM service:
* **[ERROR]** uncaught ptr with violations. Current state element:
* SELEMENT: ''(parent=timers.515278380, num=1, depth=0, address=0xdfb760a8, **name**=**timers**.515278380, type=TYPE: (id=96 , name=, size=4, num_child_types=1, type_id=10, bit_width=0, flags(ERDIVvUP)=01000000, values=%%''%%, **type_str**={ $minix_timer tmr_next \2, tmr_exp_time i32/long unsigned int, **tmr_func opaque%%*%%**, tmr_arg { (U) $ixfer_tmr_arg_t ta_int i32/int } }*)) ''
* SEL_ANALYZED: ''(num=1, type=ptr, flags(DIVW)=1110, value=0x08147460, trg_name=mproc, trg_offset=274616, trg_flags(RL)=D0, trg_selements=(**#2**|0: **1**|o=SELEMENT: (parent=mproc, num=0, depth=0, address=0x08147460, name=**mproc/143/mp_timer**, type=TYPE: (id=38 , name=minix_timer, size=16, num_child_types=4, type_id=9, bit_width=0, flags(ERDIVvUP)=00000000, values=%%''%%, names='minix_timer_t|minix_timer', **type_str**={ $minix_timer tmr_next { $minix_timer tmr_next \2, tmr_exp_time i32/long unsigned int, tmr_func hash_3792421438/*, tmr_arg { (U) $ixfer_tmr_arg_t ta_int i32/int } }*, tmr_exp_time i32/long unsigned int, **tmr_func hash_3792421438/%%*%%**, tmr_arg { (U) $ixfer_tmr_arg_t ta_int i32/int } })), **2**|o=SELEMENT: (parent=mproc, num=0, depth=0, address=0x08147460, name=mproc/143/mp_timer/tmr_next, type=TYPE: (id=37 , name=, size=4, num_child_types=1, type_id=10, bit_width=0, flags(ERDIVvUP)=00000000, values=%%''%%, **type_str**={ $minix_timer tmr_next \2, tmr_exp_time i32/long unsigned int, **tmr_func hash_3792421438/%%*%%**, tmr_arg { (U) $ixfer_tmr_arg_t ta_int i32/int } }*)))) ''
* SEL_STATS: ''(type=ptr, trg_flags(RL)=D0, ptr_found=1, **other_types_found=1**, violations=1) ''
In this case, the analysis failed on the global **timers** variable. The analysis dump shows that two matching types (**#2**) were found, both associated with the **mproc[143].mp_timer** structure field, but neither type matched the type of the pointer. A closer look at the textual representations of the pointer type (the **type_str** of the primary //selement//) and of the data types (the //type_str// of the target //selement//s) reveals that there is only one difference between the two: the **tmr_func** field of the structure type to which the //timers// variable should point is an **opaque** pointer, whereas the same //tmr_func// field of the target structures is a particular function pointer (to a function referred to as **hash_3792421438**). The remainder of the types are the same.
The type strings are somewhat difficult to read. The asterisk at the end of a { structure } block indicates a pointer to this structure. In this case, the //timers// variable is a pointer to a **minix_timer_t** structure. The **\n** notation indicates type recursion of the type **n** levels up. In the type string of //timers//, the **\2** after the **tmr_next** field indicates that it is again a pointer to //minix_timer_t//: one type level up is the structure itself, two type levels up is the pointer to the //minix_timer_t// structure. In this case there are no three levels up, but in other cases three levels up could for example be a pointer to a pointer to a structure, etcetera. Although irrelevant in this case, the name of each structure is prefixed with a dollar sign, and **(U)** denotes a union.
In this case, the analysis failed because it found different, incompatible, and therefore **other** types, even though the opaque pointer and the function pointer were really the same field types. A look at the corresponding declarations in ''minix/include/minix/timers.h'' shows that there is indeed a forward declaration of ''struct minix_timer'' which is the cause of LLVM's link-time introduction of casts. We resolved this case by extending the casting analysis of the magic pass to include casts of structures through function prototypes.
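The declaration pattern behind this case can be illustrated with a small, self-contained sketch. It is not a copy of ''timers.h'': the ''demo_'' names and the callback signature are assumptions made for illustration, and only the field names are taken from the type strings above. The point is that the function pointer type refers to the structure while the structure is still only forward-declared.

```c
#include <assert.h>
#include <stddef.h>

/* Forward declaration: the type is declared here, but not yet defined. */
struct demo_timer;

/* Assumed callback signature; it refers to the still-incomplete structure. */
typedef void (*demo_tmr_func_t)(struct demo_timer *tp);

/* The full definition follows only after the function pointer typedef. */
typedef struct demo_timer {
    struct demo_timer *tmr_next;      /* next timer in the chain */
    unsigned long tmr_exp_time;       /* expiration time */
    demo_tmr_func_t tmr_func;         /* handler to call on expiry */
    union { int ta_int; } tmr_arg;    /* handler argument */
} demo_timer_t;

static int demo_fired;                /* sum of arguments of fired timers */

static void demo_callback(demo_timer_t *tp)
{
    demo_fired += tp->tmr_arg.ta_int;
}

/* Call the handler of every timer in the chain that has expired by 'now'.
 * Returns the number of handlers called. */
static int demo_expire(demo_timer_t *head, unsigned long now)
{
    demo_timer_t *tp;
    int n = 0;

    for (tp = head; tp != NULL; tp = tp->tmr_next) {
        assert(tp->tmr_func != NULL);
        if (tp->tmr_exp_time <= now) {
            tp->tmr_func(tp);
            n++;
        }
    }
    return n;
}
```

The code itself is ordinary, valid C. The difficulty arises only at the bitcode level, where a module that has seen just the forward declaration has no definition for the structure.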
The following example also resulted from the same forward declaration of MINIX3 timer structures, this time in the sched (scheduling) service:
* **[ERROR]** uncaught ptr with violations. Current state element:
* SELEMENT: ''(parent=sched_timer.29458437, num=4, depth=1, address=0xdfbe70b0, **name**=**sched_timer.29458437/tmr_func**, type=TYPE: (id=17 , name=tmr_func_t, size=4, num_child_types=1, type_id=10, bit_width=0, flags(ERDIVvUP)=00000000, values=%%''%%, type_str=**opaque%%*%%**)) ''
* SEL_ANALYZED: ''(num=4, type=ptr, flags(DIVW)=1110, value=0x08048dc0, trg_name=**balance_queues**.29458437, trg_offset=0, trg_flags(RL)=T0, trg_selements=(#1|0: 1|o=SELEMENT: (parent=???, num=0, depth=0, address=0x08048dc0, name=???, type=TYPE: (id=119 , name=, size=1, num_child_types=0, type_id=4, bit_width=0, flags(ERDIVvUP)=11000001, values=%%''%%, type_str=**hash_3792445575/**)))) ''
* SEL_STATS: ''(type=ptr, trg_flags(RL)=T0, ptr_found=1, **other_types_found=1**, violations=1) ''
In this case, the type mismatch was not between two structures that differed in opaque fields, but between two function pointers themselves: the function pointer in **sched_timer.tmr_func**, and the function it is pointing to, **balance_queues**. Registering these types as compatible would result in much more complexity in the magic pass, and likely still not resolve the more general problem of opaque pointers. This is currently one of the open issues, and we believe that another approach would be more viable; see below. In this particular case, it turned out that the sched service did not need to use timers at all, and we simplified it by getting rid of its use of timers altogether. Obviously, adapting the actual functionality of a service to allow for state transfer is not always an option, nor is it generally the right approach: the core code of system services should not have to be (re)written specifically to allow for state transfer.
===== Open issues =====
In this section, we describe what we believe are currently the main open issues related to live update. For most issues, no active development is ongoing. We therefore invite any interested parties to work on resolving these issues, and welcome both inquiries and updates regarding the current status on both the [[https://groups.google.com/forum/#!forum/minix3|MINIX3 newsgroup]] and [[info@minix3.org|info@minix3.org]].
This section is roughly sorted by order of importance, starting with the most important issues.
==== The build system ====
As shown in the setup part of the users guide, the live update functionality requires that a separate instance of the LLVM toolchain be built. Unlike the standard toolchain, this separate instance is suitable for Link-Time Optimization (LTO). It is built by ''minix/llvm/generate_gold_plugin.sh'', and placed in ''obj_llvm.i386''. The exact same LLVM 3.6.1 source code is used to compile both the LTO-enabled toolchain and the additional regular crosscompilation toolchain in ''obj.i386'', using the exact same configuration flags. The separate compilation is necessary only because of a problem with makefiles.
NetBSD uses its own set of makefiles to build imported code using its own build system. MINIX3 imports this system, and thus also uses the NetBSD set of makefiles to build the LLVM toolchain. The problem is that these makefiles do not operate in the same way as LLVM's own set of makefiles, resulting in certain parts of the LLVM toolchain being built in a different way. The separate LLVM LTO toolchain build does use LLVM's own makefiles, thereby generating some missing pieces that are required for the live update instrumentation.
The solution here is to adapt the NetBSD set of makefiles to build LLVM in a way that is closer to LLVM's own makefiles, thereby generating all the necessary parts of the toolchain without the need to build LLVM twice.
As part of this, the generated instrumentation passes should not be placed in the ''minix/llvm/bin'' subdirectory of the MINIX3 source tree. Instead, they should end up in an appropriate subdirectory of ''obj.i386'', thereby keeping the source directory clean.
==== Instrumentation ====
A number of shortcomings in our instrumentation passes currently lead to potential problems at run time.
=== Type unification ===
As shown in the developers guide, the magic instrumentation pass is not always capable of establishing that two different data types are in fact compatible, resulting in state transfer errors at run time. The main cause of these issues lies in LLVM's use of the **opaque** placeholder data type. We described the practical results of this in the earlier "Incompatible types" section.
This problem is a product of circumstances. Between LLVM 2.x and LLVM 3.x, a significant change was made in LLVM regarding the way that types are handled. In a nutshell, rather than unifying various instances of the same data type at compile time, LLVM 3.x keeps these instances as separate types, instead using bit casting between the types to resolve the resulting incompatibilities at link time. More details about this change can be found in the LLVM blog post
[[http://blog.llvm.org/2011/11/llvm-30-type-system-rewrite.html|LLVM 3.0 Type System Rewrite]] by Chris Lattner.
However, the magic framework was written for LLVM 2.x, and as a result, this problem was dealt with as an afterthought. The combination of the wildly varying forms that these bit casts can take, and the limited support for processing the bit casts in the magic pass, has created the situation that not all cases of identical types are properly registered as //compatible types//. As of writing, this has not yet been a real problem, but it is likely to become a problem in the future.
We believe that the right solution would be the introduction of a new **type unification pass**. This pass would unify all effectively-identical types in the module at link time, eliminating redundant types and bitcasts in the module. The pass could then be run before the magic pass. This would not only resolve the complete problem, but also free the magic pass of the burden to provide a complete system for enumerating compatible types. As a beneficial side effect, there would be a reduction in the amount of type state that needs to be included with the service, and a reduction in effort needed by libmagicrt to search through compatible types.
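As a rough illustration of the decision such a pass would have to make, the following sketch compares two types structurally and treats an opaque placeholder as matching anything. It uses a toy type representation invented for this example (all ''demo_'' names are hypothetical). It is neither LLVM code nor the logic of the magic pass, and a real pass would also have to rewrite the module.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

enum demo_kind { DK_INT, DK_FUNC, DK_OPAQUE, DK_PTR, DK_STRUCT };

struct demo_type {
    enum demo_kind kind;
    const char *name;                 /* used for DK_INT and DK_FUNC only */
    size_t nelem;                     /* 1 for DK_PTR, field count for DK_STRUCT */
    const struct demo_type *elem[4];  /* pointee, or structure fields */
};

struct demo_pair { const struct demo_type *a, *b; };
#define DEMO_MAX_DEPTH 16

static bool demo_compat(const struct demo_type *a, const struct demo_type *b,
    struct demo_pair *stk, size_t depth)
{
    size_t i;

    assert(a != NULL && b != NULL);
    if (a == b)
        return true;
    /* An opaque placeholder stands in for any not-yet-defined type. */
    if (a->kind == DK_OPAQUE || b->kind == DK_OPAQUE)
        return true;
    if (a->kind != b->kind)
        return false;
    if (a->kind == DK_INT || a->kind == DK_FUNC)
        return strcmp(a->name, b->name) == 0;
    /* A pair that is already being compared indicates a recursive type. */
    for (i = 0; i < depth; i++)
        if (stk[i].a == a && stk[i].b == b)
            return true;
    if (depth == DEMO_MAX_DEPTH || a->nelem != b->nelem)
        return false;
    stk[depth].a = a;
    stk[depth].b = b;
    for (i = 0; i < a->nelem; i++)
        if (!demo_compat(a->elem[i], b->elem[i], stk, depth + 1))
            return false;
    return true;
}

static bool demo_types_compatible(const struct demo_type *a,
    const struct demo_type *b)
{
    struct demo_pair stk[DEMO_MAX_DEPTH];

    return demo_compat(a, b, stk, 0);
}

/* Two views of a timer structure: one with an opaque tmr_func pointer,
 * one with a concrete function pointer. */
static bool demo_timer_types_match(void)
{
    static struct demo_type i32 = { DK_INT, "i32", 0, { NULL } };
    static struct demo_type fn = { DK_FUNC, "hash_3792421438", 0, { NULL } };
    static struct demo_type opq = { DK_OPAQUE, NULL, 0, { NULL } };
    static struct demo_type p_opq = { DK_PTR, NULL, 1, { &opq } };
    static struct demo_type p_fn = { DK_PTR, NULL, 1, { &fn } };
    static struct demo_type ta = { DK_STRUCT, NULL, 3, { NULL, &i32, &p_opq } };
    static struct demo_type tb = { DK_STRUCT, NULL, 3, { NULL, &i32, &p_fn } };
    static struct demo_type p_ta = { DK_PTR, NULL, 1, { &ta } };
    static struct demo_type p_tb = { DK_PTR, NULL, 1, { &tb } };

    ta.elem[0] = &p_ta;   /* tmr_next points back to the structure itself */
    tb.elem[0] = &p_tb;
    return demo_types_compatible(&p_ta, &p_tb);
}
```

With this rule, the two timer types from the PM example earlier compare as compatible, while a pointer to an integer and a pointer to a function still do not.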
=== ASR skipping libmagicrt ===
The ASR pass currently exempts all of the magic runtime library from rerandomization. This is highly problematic for the overall effectiveness of ASR: libmagicrt is in principle linked with all system services, thus providing any attacker with a well known, large, unrandomized set of code and data for use in an attack on any running service.
The exact reasons as to why this exception was made are currently unknown. However, if possible, this overall limitation should be resolved by either removing the exception or at least narrowing it to the exact scope of the problem.
==== Memory management ====
The MINIX3 memory management, implemented in the VM service, currently has a number of significant limitations and missing features. Some of its problems are relevant for live update only. Other problems are merely becoming more visible as a result of enabling or using live update functionality.
=== Region transfer issues ===
As we already mentioned earlier, the transfer of memory-mapped pages requires that these pages be in a strictly delineated address range. This range may not overlap with any of the process's other sections' address ranges. The range is hardcoded globally, and thus, defined much more strictly than necessary for most service processes. Moreover, the definition indiscriminately affects all processes, including application processes. The result is that if the system is built with live update support, all processes are severely restricted in how much of their address space they can use for memory-mapped regions. Conversely, if the system is not built with live update support, even identity transfer may fail.
Another problem mentioned before is the bulk transfer of all pages in the process's mmap section, regardless of whether the state transfer framework knows about them. This could easily lead to memory leaks due to transfer of untracked pages.
We believe that both points could be resolved with a system that does not automatically transfer memory-mapped pages from the old to the new instance, but rather performs such transfer on demand, so that the (identity or magic) state transfer routine can determine exactly what memory to transfer.
=== Out-of-memory issues ===
MINIX3 currently does not deal well with running out of memory. Most system services do not have preallocated pages in their heap, stack, and mmap sections. This may create major issues in low-memory situations. For example, if a service attempts to use an extra page of stack while the system has no free memory, the service will be killed, possibly taking down the entire system with it. Beyond VM freeing cached file system data when it runs out of memory, any sort of infrastructure to deal with this general problem is completely absent.
The live update and rerandomization support is making this situation even more problematic. The magic runtime library uses extra dynamic memory, and is not particularly careful about using preallocated memory where necessary. The ASR functionality increases memory usage even further. For example, its stack padding feature requires a considerable amount of extra stack space. The result is that there is now an increasingly large number of scenarios where out-of-memory conditions result in failure of running system services, and possibly the entire system.
Even though certain services should be rewritten to deal more gracefully with cases of dynamic memory allocation failure, the example of faulted-in stack pages illustrates that this is not a viable option in general. There has been a partial attempt to prepare file system services' buffer caches for having their memory stolen by VM at run time, but its implementation is, where present, deeply flawed, and will likely be removed altogether soon. Instead, we believe that the easiest solution for this problem is to let VM reserve a certain amount of memory exclusively for satisfying page faults and page-handling requests involving memory in system services.
In the meantime, it can be expected that **test64** of the MINIX3 test set - the test case that tests one particular scenario of running out of memory - will cause test or system failures in an increasing number of cases. It may have to be removed from the default set of tests in the short term.
=== Contiguous/DMA memory ===
In addition, MINIX3 does not deal with running out of physically contiguous memory at all. Some services require blocks of physically contiguous memory for DMA purposes. VM currently has no way to recombine fragmented blocks of free memory into larger physically contiguous ranges. In addition, some services require memory that is located in the lower 1MB or 16MB of the physical system memory. The support in VM for obtaining memory in those ranges is very limited as well. Both of these cases may result in the inability of a system service to obtain its needed resources if it is not started immediately at system bootup time.
These problems are not particularly important for live update, since the new instance will inherit special memory from the old instance by default. They are important for crash recovery, however, and they are known to cause failures in the ''testrelpol'' test set on occasion.
=== Page protection ===
Finally, support for setting or enforcing page protection bits is mostly missing in VM as well. The live update integration has resulted in one particular case where this is now a problem. The MINIX3 userspace threading library, libmthread, inserts a guard page at the bottom of each thread stack in order to detect stack overruns. The guard page was originally created by unmapping the bottom page of the stack, thus leaving an unmapped hole there. This approach worked, but was not ideal: the hole could potentially be filled by a separate one-page allocation later, thereby subverting the intended protection.
Since libmagicrt performs extra memory allocations, this problem is a bit more relevant for live update. For this and other reasons, the libmthread code was changed to reallocate the guard page with ''PROT_NONE'' protection instead. Theoretically, this should work fine. In practice, since VM does not implement support for protection, the guard page is now simply an additional stack page. Thus, as of writing, the libmthread guard page functionality is broken.
Ideally, this issue would be resolved by implementing proper support for page protection in VM, including for example an implementation of mprotect(2).
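The guard page approach described above can be sketched with standard POSIX calls. This is a minimal illustration and not the libmthread code: the helper name ''demo_alloc_guarded_stack'' is hypothetical, and the use of mmap(2) with ''MAP_FIXED'' to replace the bottom page is an assumption about how such a reallocation could be done.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Allocate 'usable_size' bytes of stack plus one guard page below it.
 * Returns the lowest usable stack address, or NULL on failure.
 * The address of the guard page is stored in '*guard_out'. */
static void *demo_alloc_guarded_stack(size_t usable_size, void **guard_out)
{
    long page = sysconf(_SC_PAGESIZE);
    size_t total;
    char *base;
    void *guard;

    assert(guard_out != NULL);
    assert(page > 0 && usable_size > 0 && usable_size % (size_t)page == 0);

    /* One contiguous mapping: guard page at the bottom, stack above it. */
    total = usable_size + (size_t)page;
    base = mmap(NULL, total, PROT_READ | PROT_WRITE,
        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED)
        return NULL;

    /* Replace the bottom page in place with an inaccessible page. Unlike
     * an unmapped hole, this keeps the address range occupied, so a later
     * one-page allocation cannot end up there. */
    guard = mmap(base, (size_t)page, PROT_NONE,
        MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
    if (guard == MAP_FAILED) {
        munmap(base, total);
        return NULL;
    }

    *guard_out = guard;
    return base + page;
}
```

On a system that enforces page protection, any access to the guard page triggers a fault. On a system that ignores the protection bits, as the text describes for VM, the same call sequence succeeds but the guard page behaves like an ordinary stack page.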
==== Runtime infrastructure ====
We now list a number of other issues concerning the MINIX3 runtime infrastructure side of live update.
=== Default states ===
The case of userspace threads has shown that it may be not just useful, but actually //necessary// for certain services to provide their own handlers for checking, entering, and leaving a custom state of quiescence. These services may crash if the default quiescence state is used for a live update instead of the custom state. The result is the requirement that not just users, but also scripts - the update_asr(8) script in particular - be aware of specific services requiring custom quiescence state. This is inconvenient and dangerous.
The default quiescence state is currently hardcoded in the minix-service(8) utility, in the form of ''DEFAULT_LU_STATE'' in ''minix/commands/minix-service/minix-service.c''. Instead, we believe that the service should be able to specify its own default quiescence state, possibly using an additional SEF API call. It is not yet clear whether RS would need to be aware of the alternative quiescence state. If not, the translation from a pseudo-state to the real state could take place entirely in the service's own SEF routines. Otherwise, the SEF may have to send the default state as extra data to RS at service initialization time.
=== Policy redundancy ===
While the following issue is relevant more for crash recovery than for live update, it is included here because it affects the infrastructure supporting the ''testrelpol'' script.
Each service effectively knows what its own crash recovery policy should be. Separately, procfs has a policy table with an entry for each service in ''minix/fs/procfs/service.c'', containing the same crash recovery policy information, for export to userland and ''testrelpol'' in particular. This is effectively redundant information.
Ideally, each service would communicate its policy to RS. That information can then be used by procfs to expose the policy information to userland, thus eliminating the redundancy.
=== Live update of VM ===
Earlier in this document, we have described the limitations of performing live updates on the VM service, as well as the reasons behind these limitations. Despite a large number of exceptions that allow VM to be updated at all, the resulting situation is that VM can still not be subjected to any meaningful type of update.
It is unclear whether all these limitations are fundamental, however. We believe it may be possible to restructure the VM live update facilities to resolve at least some of the limitations. For example, it might be possible to store the pagetables in a separate memory section, and make actual copies of all or most other dynamic memory in VM. The out-of-band region could then be limited to the pagetable memory, thus allowing for relocation of at least static memory. Furthermore, more explicit rollback support in the old VM instance might even allow changes to VM's own pagetable, thereby possibly allowing dynamic memory allocation during the live update. It remains to be seen whether any of this is possible in practice.
=== Timed retries of safecopies ===
If process A is being updated, process B should temporarily not make use of process A's grants, because during the live update, those grants may be inaccessible, invalid, etcetera. The kernel currently has a simple way to enforce that rule, by responding to process B's safecopy kernel call with an ''ENOTREADY'' error response whenever process A is being updated. The service-side libsys implementation of sys_*safecopy*(2) automatically suspends the calling service for a short while (using tickdelay(3)) and then retries the safecopy. This shortcut approach works, but it is not ideal: it should not be the responsibility of system services to determine when the safecopy can be retried again, and the approach could lead to starvation.
Instead, the kernel should block the caller of a safecopy call for the duration of its target's live update procedure, retrying the safecopy operation and unblocking the caller only once the target is no longer being updated. A proper implementation of this functionality requires several cases to be covered: indirect grants, either the granter or the grantee being terminated or having its process slots swapped, etcetera. As a possible simplification, the kernel could internally retry the safecopy operation more often than necessary, since the caller would simply remain blocked if the retried safecopy operation hits a case of live update again.
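For reference, the service-side retry approach that this proposal would replace can be sketched as follows. The sketch uses stubs instead of the real kernel call and tickdelay(3), so all ''demo_'' names and the error value are stand-ins. The upper bound on the number of attempts is an assumption added to keep the example finite, not something taken from libsys.

```c
#include <assert.h>

#define DEMO_OK         0
#define DEMO_ENOTREADY  (-1)  /* stand-in for the kernel's ENOTREADY */
#define DEMO_MAX_TRIES  8     /* assumed bound, for illustration only */

static int demo_busy_rounds;  /* calls that still find the target mid-update */
static int demo_delays;       /* number of times the caller went to sleep */

/* Stub for the safecopy kernel call: fails while the target is updated. */
static int demo_kernel_safecopy(void)
{
    if (demo_busy_rounds > 0) {
        demo_busy_rounds--;
        return DEMO_ENOTREADY;
    }
    return DEMO_OK;
}

/* Stub for tickdelay(3): only counts how often it was called. */
static void demo_tickdelay(int ticks)
{
    assert(ticks > 0);
    demo_delays++;
}

/* Retry the safecopy for as long as the kernel reports the target as not
 * ready. The caller has to guess how long to wait between attempts. */
static int demo_safecopy_with_retry(void)
{
    int r, tries;

    for (tries = 0; tries < DEMO_MAX_TRIES; tries++) {
        r = demo_kernel_safecopy();
        if (r != DEMO_ENOTREADY)
            return r;
        demo_tickdelay(1);
    }
    return DEMO_ENOTREADY;
}
```

The loop shows both drawbacks mentioned above: the service decides when to retry, and a caller that keeps finding its target mid-update makes no progress.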
=== Copying asynsend tables ===
In a very specific scenario, the kernel performs a memory copy of the entire asynsend table between two processes of which the slots are being swapped. Although it is not yet clear which exact circumstances cause the need for this memory copy, the actual copy action relies on very specific conditions which are not fully validated before the copy action. Thus, this is a rather dangerous kernel feature.
A rather long comment in ''minix/lib/libsys/sef_liveupdate.c'' elaborates on the specifics of this case, and suggests why RS may be the only affected service. If the comment is correct, it may be possible to engineer another solution for RS in particular, and remove the copy hack from the kernel.
==== Other issues ====
A number of miscellaneous issues remain. The first, regarding performance, is relatively important. The other issues listed in this section are relatively minor.
=== Performance ===
The performance of various parts of the live update infrastructure is not fantastic. This is true for both the instrumentation passes and, more importantly, the run-time functionality. As one of the effects, live update operations may have to be given a lenient timeout in order to succeed. In fact, state transfer currently takes too long to consider automatic runtime ASR rerandomization as a realistic option.
We have not yet looked into the causes of the poor performance. Part of it may be due to the extra memory allocations performed by libmagicrt, but that is only a guess. This issue is therefore rather open ended. Statistical profiling may provide at least some hints.
=== Grant table transfer ===
Currently, the safecopy memory grant tables of system services are transferred as is: the main union of the ''cp_grant_t'' structure as defined in ''include/minix/safecopies.h'' is marked as **ixfer**.
In some scenarios, however, it is possible that during a service's live update, the service has grants allocated for remote services. For direct grants (of type ''CPF_DIRECT''), ''cp_direct.cp_start'' is actually a pointer into the local address space. The identity transfer therefore prevents this local pointer from being updated. Especially with ASR, there is a risk that after the live update, the grant points to arbitrary memory within the updated service. In the worst case, a remote user of the grant may end up overwriting this arbitrary memory in the updated service.
To resolve this, the grant structure should not be using **ixfer** for its main union. This probably means that a custom state transfer routine for the grant structure must be written, so as to use a pointer transfer only for ''CPF_DIRECT'' grants.
The same does //not// apply to magic grants (of type ''CPF_MAGIC''), as ''cp_magic.cp_start'' is an address in a remote process, which is either a userland process or a system process blocked on a call to VFS (as of writing, only VFS can use magic grants at all), and thus never subject to live update while the magic grant is active.
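A custom transfer routine along these lines might look like the sketch below. The structure is a simplified stand-in for ''cp_grant_t'': the flag values, the layout, and the ''demo_'' names are assumptions, and the relocation callback is a hypothetical placeholder for whatever pointer transfer the state transfer framework would provide.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_CPF_DIRECT 0x1   /* hypothetical flag values */
#define DEMO_CPF_MAGIC  0x2

/* Simplified stand-in for the grant structure and its main union. */
struct demo_grant {
    int cp_flags;
    union {
        struct { uintptr_t cp_start; size_t cp_len; } cp_direct;
        struct { int cp_who_from; uintptr_t cp_start; size_t cp_len; } cp_magic;
    } cp_u;
};

/* Maps an address in the old instance to the matching address in the new. */
typedef uintptr_t (*demo_relocate_fn)(uintptr_t old_addr);

/* Transfer one grant: copy it as is, except that the local pointer of a
 * direct grant is relocated into the new instance's address space. */
static void demo_transfer_grant(const struct demo_grant *old_grant,
    struct demo_grant *new_grant, demo_relocate_fn relocate)
{
    assert(old_grant != NULL && new_grant != NULL && relocate != NULL);

    *new_grant = *old_grant;  /* identity transfer as the default */
    if (old_grant->cp_flags & DEMO_CPF_DIRECT)
        new_grant->cp_u.cp_direct.cp_start =
            relocate(old_grant->cp_u.cp_direct.cp_start);
}

/* Example relocation: pretend the granted buffer moved up by one page. */
static uintptr_t demo_shift_by_page(uintptr_t old_addr)
{
    return old_addr + 0x1000;
}
```

Magic grants pass through unchanged, in line with the reasoning above that ''cp_magic.cp_start'' refers to a remote address space.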
=== Testrelpol failure ===
If the ''testrelpol'' script is run a number of times in a row, it will start to fail on the crash recovery tests for unclear reasons. We know that this is a test script failure rather than an actual failure. We suspect that it is caused by RS's default exponential backoff algorithm for crash recovery causing timeouts in //testrelpol//. If that is the case, it should be possible to change //testrelpol// to disable the exponential backoff using existing minix-service(8) flags.
=== Libmagicrt asserts ===
The implementation of the magic runtime library currently relies on asserts being enabled. We have changed its Makefile so that asserts should be enabled regardless of build system settings, but this is merely a workaround. Instead, libmagicrt should function properly (and, in particular, fail properly) regardless of whether asserts are enabled.
=== VM fork warning ===
During live update and crash recovery, the following VM error may be seen:
VM: cannot fork with physically contig memory
The error indicates that it is currently not possible to mark physically contiguous memory as copy-on-write, which is true. However, the error may occur during a live update, when VM copies over the memory-mapped pages of a service's old instance to the new instance. The error is therefore not the result of a fork(2) call. In addition, the error code returned by the function producing the error message is ignored by its caller, with the result that the reference count of the contiguous memory range is increased anyway, which is exactly what needs to happen for live update operations. Thus, during live updates, this error is both misleading and meaningless. However, we still have to review whether the error is worth keeping for other scenarios.
=== State transfer prefixes ===
State transfer makes exceptions based on name prefixes. Some of these name prefixes are overly broad. For example, it is possible that the current exception of the prefix ''st_'' also ends up matching certain variables in actual service code by accident. At the very least, all exception prefixes should start with ''magic_''.
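The risk of an overly broad prefix is easy to demonstrate. In the sketch below, the helper and the variable names are hypothetical, including the ''magic_st_'' prefix used for comparison. It only shows how a plain prefix test behaves, not how libmagicrt matches names.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Return true if 'name' starts with 'prefix'. */
static bool demo_has_prefix(const char *name, const char *prefix)
{
    assert(name != NULL && prefix != NULL);
    return strncmp(name, prefix, strlen(prefix)) == 0;
}
```

With the ''st_'' prefix, a service variable named, say, ''st_blocks_cached'' would be exempted by accident. With a prefix starting with ''magic_'', the same variable is no longer matched, while a framework variable such as ''magic_st_state'' still is.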
===== Further reading =====
The following publication covers the MINIX3 live update architecture, design, and implementation, and provides more details on various theoretical and practical aspects.
* Cristiano Giuffrida, [[http://www.minix3.org/theses/Cristiano_Giuffrida_PhD_thesis.pdf|Safe and Automatic Live Update]], Ph.D. thesis, 2014