developersguide:liveupdate

Differences

This shows you the differences between two versions of the page.

developersguide:liveupdate [2015/09/16 16:46]
dcvmoole [Developers guide] extension and restructuring
developersguide:liveupdate [2022/02/12 22:42]
stux renamed service(8) to minix-service(8) in various places
Line 1: Line 1:
-<div center round important>​ 
-**This is a draft.** This page is currently not yet visible to the general public. The plan is that once all live update patches have been merged, this page will be moved to its final location in the wiki's developers guide. Until then, various internal wiki links will appear to be broken. -David 
-</​div>​ 
- 
 ====== Live update and rerandomization ======
  
Line 21: Line 17:
 The state transfer aspect of live update relies heavily on compile-time and in particular link-time instrumentation of system services. This instrumentation is implemented in the form of LLVM "optimization" passes, which operate on LLVM bitcode modules. In most cases, these passes are run after (initial) program linking, by means of the LLVM Link-Time Optimization (LTO) system. Thus, in order to support live update and rerandomization, the system must be compiled using LLVM bitcode and with LTO support. The LLVM pass that performs the static analysis and link-time instrumentation for live update is called the **magic pass**.
  
-In addition, live updates require runtime support for state transfer in each service. For this reason, system services are relinked with a library that provides all the run-time functionality which ultimately allow a new service instance to perform state transfer from its old instance. This library is called the **magic library** or //libmagic//. Together, the magic pass and library make up the **magic framework**.
+In addition, live updates require runtime support for state transfer in each service. For this reason, system services are relinked with a library that provides all the run-time functionality which ultimately allows a new service instance to perform state transfer from its old instance. This library is called the **magic runtime library** or //libmagicrt//. Together, the magic pass and runtime library make up the **magic framework**.
  
 ==== Live rerandomization ====
Line 29: Line 25:
 The fundamental approach consists of a two-step process. First, new versions of the service program are generated, using link-time randomization of various parts of its program binary. Ideally, this would be done at run time; due to various limitations, MINIX3 currently only supports pregenerated randomized binaries of system services. Then, at runtime, the live update system is used to update from one randomized version of each service to another.
  
-The randomization of binaries is done with another link-time pass, called the **asr pass**. The magic library implements various runtime aspects of ASR rerandomization during live update.
+The randomization of binaries is done with another link-time pass, called the **asr pass**. The magic runtime library implements various runtime aspects of ASR rerandomization during live update.
  
 ===== Users guide =====
Line 37: Line 33:
 ==== Setting up the system ====
  
-We cover all the steps to set up a MINIX3 system that is ready for live update and rerandomization. For now, it requires crosscompilation as well as an additional build of the LLVM source code. The procedure is for x86 targets only. The current procedure is not quite ideal, but it is what we have right now, and it should work.
+We cover all the steps to set up a MINIX3 system that is ready for live update and rerandomization. For now, it requires crosscompilation as well as an additional build of the LLVM source code. The procedure is for x86 targets only.
  
-After setting up an initial environment, the MINIX3 update cycle basically consists of four steps: obtaining or updating the MINIX3 source code, building the system, instrumenting the system services, and generating a bootable image. We will go through all steps in detail. At the end of this section, there is also a summary of the commands to issue.
+The current procedure has been tested only from **Linux** as host platform, and may require minor adjustments on other host platforms. We provide a few additional instructions for those other platforms, but these may currently not be complete. Please feel free to add more instructions to this page, and/or open GitHub issues for other platforms and link to them from here.
+
+After setting up an initial environment, the first step is to obtain the MINIX3 source code. After that, the next step is to build an LLVM toolchain with LTO support, which is needed because the regular MINIX3 crosscompilation LLVM toolchain does not include LTO support (yet - we are working on this). Once the LTO-supporting toolchain has been built, the final step is to build the MINIX3 sources, with extra flags to enable magic instrumentation and possibly ASR rerandomization.
+
+Once these steps have been completed successfully for the first time, one can later update the MINIX3 source and then rebuild the system. The LTO-supporting toolchain need not be rebuilt unless we upgrade the LLVM source code itself.
+
+We will now go through all steps in detail. At the end of this section, there is also a summary of the commands to issue.
  
 All of the commands in this section are to be performed on the crosscompilation host system rather than on MINIX3. None of the commands, except the Linux-specific ''sudo apt-get'' example in the first subsection, require more than ordinary user privileges.
Line 55: Line 57:
   /home/user/minix-liveupdate/obj.i386
  
-You have to choose a location for the containing directory, and create it yourself. The three subdirectories will be created automatically as part of the following steps. In terms of disk space usage, expect to be needing a bare minimum of **30GB** for the combination of these three subdirectories, with a recommended **40GB** of available space.
+You have to choose a location for the containing directory, and create it yourself. The three subdirectories should be created automatically as part of the following steps. However, it has been reported that on some platforms (e.g., FreeBSD), some or all of these directories have to be created manually; this can be done with nothing more than a few basic ''mkdir'' commands. In terms of disk space usage, expect to need a bare minimum of **30GB** for the combination of these three subdirectories, with a recommended **40GB** of available space.
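
For platforms where the subdirectories are not created automatically, the manual setup could look like the following minimal sketch. The function names are hypothetical and not part of the MINIX3 source tree; the subdirectory names ''minix-src'', ''obj_llvm.i386'' and ''obj.i386'' are assumed from the paths used elsewhere on this page.

```shell
#!/bin/sh
# Hypothetical helper (not part of MINIX3): create the containing directory
# and the three subdirectories assumed from the paths used on this page.
setup_liveupdate_dirs() {
    base=$1
    mkdir -p "$base/minix-src" "$base/obj_llvm.i386" "$base/obj.i386"
}

# Hypothetical helper: print the free space, in whole gigabytes, of the
# file system that holds the given path.
free_gb() {
    # df -Pk prints one POSIX-format line per file system, in 1024-byte
    # blocks; column 4 of the second line is the available space.
    df -Pk "$1" | awk 'NR == 2 { print int($4 / 1024 / 1024) }'
}
```

A caller could then compare the result against the 30GB minimum, for example with ''[ "$(free_gb /home/user/minix-liveupdate)" -ge 30 ]''. Note that ''git clone'' accepts an existing empty target directory, so creating ''minix-src'' beforehand should not interfere with the next step.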
  
 === Obtaining or updating the MINIX3 source code ===
Line 73: Line 75:
 In both cases, the next step is now to build the source code.
  
-=== Building the system ===
+=== Building the LTO toolchain ===
  
-The next step consists of building the system. When run for the first time, this step will also build the LLVM LTO infrastructure, the crosscompilation tools, and the instrumentation. The first run may take several hours.
+The second step is to build the LLVM LTO infrastructure, if it has not yet been built before. Eventually, this will be done automatically as part of the regular build. For now, we have a script that can perform the build, called ''generate_gold_plugin.sh''. It is located in the ''minix/llvm'' subdirectory of the MINIX3 source tree. The basic procedure therefore consists of the following steps (but read this entire section first):
  
-The center of all the instrumentation activities is the ''minix/llvm'' subdirectory of the MINIX3 source tree. This directory contains the instrumentation passes, runtime library, and supporting scripts. This step and the next steps therefore assume this subdirectory as the current directory:
+  $ cd /home/user/minix-liveupdate/minix-src/minix/llvm
+  $ ./generate_gold_plugin.sh
  
-  $ cd minix-src/minix/llvm
+On some platforms, it may be necessary to specify the C/C++ compiler and/or the name of the GNU make utility, which can be done as follows:
  
-It may be necessary to ensure that clang is used as the compiler, by exporting the following shell variables. GCC should work as well, but has not been tested as thoroughly.
+  $ CC=clang CXX=clang++ MAKE=make ./generate_gold_plugin.sh
  
-  $ export CC=clang CXX=clang++
+On FreeBSD and similar platforms, one may have to ensure that GNU make is installed (typically as ''gmake'') first, and pass in ''MAKE=gmake'' to point to it.
  
-Then, the system can be built with support for instrumentation by running the ''configure.llvm'' script in the current directory, with the ''MKMAGIC'' build variable set to //yes//. To build the infrastructure and system without parallel compilation, simply run the script this way:
+This step may take several hours. It can be sped up by supplying a number of parallel jobs, through a ''JOBS=n'' variable:
  
-  $ BUILDVARS="-V MKMAGIC=yes" ./configure.llvm
+  $ JOBS=8 ./generate_gold_plugin.sh
  
-Alternatively, a number of parallel jobs may be supplied. It is typically advisable to use as many jobs as there are hardware threads of execution (i.e., CPU cores or hyperthreads) in the system:
+As stated before, after this command has finished successfully, it need not be reissued until LLVM is upgraded in the MINIX3 source tree. This is a rare event which is typically part of a larger resynchronization with NetBSD code, and we will clearly announce such events. When this happens, it may be advisable to remove the entire ''obj_llvm.i386'' directory as well as any files in ''minix-src/minix/llvm/bin'', before rerunning the ''generate_gold_plugin.sh'' script.
  
-  $ JOBS=8 BUILDVARS="-V MKMAGIC=yes" ./configure.llvm
+=== Building the system ===
  
-After the first run, the ''configure.llvm'' will perform recompilation of only the parts of the source code that have changed, and should not take nearly as long to run as the first time. In case of unexpected problems when rebuilding, it may be necessary to throw away the previously generated objects and rebuild the MINIX3 source code in its entirety. This can be done by going to the top-level ''obj.i386'' directory and deleting all files and directories in there, except the ''tooldir.{yourplatform}'' subdirectory. Fully rebuilding the MINIX3 source code will take longer than an incremental rebuild, but since the crosscompilation toolchain is left as is, it will still be nowhere close as long as the first run.
+The third step consists of building the system and generating a bootable image out of it. When run for the first time, this step will also build the regular (non-LTO) crosscompilation toolchain. The first run may therefore (also) take several hours. The build procedure is just like regular MINIX3 crosscompilation, differing in only two aspects.
  
-As explained in more detail on the [[.:crosscompiling|crosscompilation page]], it is also possible to rebuild particular parts of the system without going through the entire "make build" process. This involves the use of the ''nbmake-i386'' tool and generally requires a good understanding of the compilation process. It may be worth mentioning that the first ''configure.llvm'' run saves the ''MKMAGIC'' value, so this variable need not be passed to ''nbmake-i386'' each time. We give one example of how to use nbmake-i386 in a later section.
+First, the appropriate build variables must be passed in to enable the desired functionality. In order to build the system with live update support through magic instrumentation, the build system must be invoked with the ''MKMAGIC'' build variable set to //yes//. This will perform a bitcode build of the entire system, and perform magic instrumentation on all system services.
  
-=== Rebuilding the instrumentation ===
+In order to build the system with ASR instrumentation, the build system must be invoked with the ''MKASR'' build variable set to //yes//. This will automatically enable magic instrumentation, perform ASR randomization on all system services, and pregenerate a number of ASR-rerandomized service binaries for each service. This number can be controlled with an additional ''ASRCOUNT=n'' build variable, where the //n// value must be between 1 and 65536 (inclusive). The default //ASRCOUNT// is 3.
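
Since a build takes a long time, it can help to check the ''ASRCOUNT'' range before starting one. The following is a minimal sketch, assuming a hypothetical helper named ''asr_buildvars'' that is not part of the MINIX3 source tree; it only prints a ''BUILDVARS'' value and does not invoke the build itself.

```shell
#!/bin/sh
# Hypothetical helper (not part of MINIX3): print a BUILDVARS value for an
# ASR build. Usage: asr_buildvars [count]
# The count defaults to 3, the documented ASRCOUNT default.
asr_buildvars() {
    count=${1:-3}
    # Reject anything that is not a plain non-negative integer.
    case $count in
        *[!0-9]*) echo "ASRCOUNT must be a number" >&2; return 1 ;;
    esac
    # Enforce the documented inclusive range of 1 to 65536.
    if [ "$count" -lt 1 ] || [ "$count" -gt 65536 ]; then
        echo "ASRCOUNT must be between 1 and 65536" >&2
        return 1
    fi
    echo "-V MKASR=yes -V ASRCOUNT=$count"
}
```

The output is meant to be used as the value of the ''BUILDVARS'' shell variable that is passed to the build command.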
  
-When building the system for the first time, this step may be skipped, as it is performed automatically. However, when the source code is changed for any of the LLVM passes or the magic library - that is, the source code in ''minix/llvm'' - the changed component must be recompiled. **Warning**: updating the MINIX3 source code with ''git pull'' may also upgrade any of these components, in which case it is the responsibility of the user (you) to recompile and reinstall them!
+Second, in order to build a hard disk image suitable for use by the resulting bitcode builds, the ''x86_hdimage.sh'' script must be invoked with the **-b** flag. This will enlarge the generated image to account for the larger binaries, and enable inclusion of ASR-rerandomized binaries if necessary.
  
-Once we properly integrate the LLVM LTO infrastructure into the MINIX3 build system, this step should disappear altogether.
+These two aspects can be covered in a single build command. The following short procedure will build a hard disk image with magic instrumentation:
  
-== Rebuilding libmagic ==
+  $ cd /home/user/minix-liveupdate/minix-src
+  $ BUILDVARS="-V MKMAGIC=yes" ./releasetools/x86_hdimage.sh -b
  
-This substep must be performed whenever the source code of the magic library changes. This is necessary due to the fact that libmagic's dependency tracking is not working correctly, which means the automated step in ''configure.llvm'' may not recompile the library properly.
+In order to speed up the build, a number of parallel jobs may be supplied. It is typically advisable to use as many jobs as there are hardware threads of execution (i.e., CPU cores or hyperthreads) in the system:
  
-The source code of libmagic is located in the ''minix/llvm/static/magic'' subdirectory of the MINIX3 source code. To (re)compile and install libmagic, go to its source directory, issue a ''make clean'' and a ''make install'':
+  $ JOBS=8 BUILDVARS="-V MKMAGIC=yes" ./releasetools/x86_hdimage.sh -b
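
Rather than hardcoding the job count, the number of hardware threads can be looked up on the host. The following is a minimal sketch using a hypothetical ''hw_threads'' helper; it tries ''getconf'', ''nproc'', and ''sysctl'' in turn, since which of these exists depends on the host platform, and falls back to a single job.

```shell
#!/bin/sh
# Hypothetical helper (not part of MINIX3): print the number of online
# hardware threads of the host, falling back to 1 if it cannot be found.
hw_threads() {
    # Try the common interfaces in turn; which one exists depends on the host.
    n=$(getconf _NPROCESSORS_ONLN 2>/dev/null) ||
        n=$(nproc 2>/dev/null) ||
        n=$(sysctl -n hw.ncpu 2>/dev/null) ||
        n=1
    # Guard against empty or non-numeric output (e.g. "undefined").
    case $n in
        ''|*[!0-9]*) n=1 ;;
    esac
    echo "$n"
}
```

With this in place, ''JOBS=8'' in the command above could be replaced by ''JOBS=$(hw_threads)''.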
  
-  $ cd static/magic
-  $ make clean install
+It may be necessary to ensure that clang is used as the compiler:
  
-The library is installed to ''minix/llvm/bin''. In a later step, the ''relink.llvm'' script will pick it up from there.
+  $ CC=clang CXX=clang++ JOBS=8 BUILDVARS="-V MKMAGIC=yes" ./releasetools/x86_hdimage.sh -b
  
-== Rebuilding a pass ==
+Also, some platforms may not be able to compile the compiler toolchain for the target platform due to running out of memory. In that case, it is possible to build an image that does not come with its own compiler toolchain, by passing in the ''MKLLVMCMDS=no'' build variable. This build variable can also be used simply to speed up the compilation procedure.
  
-This substep is also performed automatically for the first time, by the ''generate_gold_plugin.sh'' script invoked from ''configure.llvm''. However, whenever the source code of any of the LLVM instrumentation passes changes, that pass must be recompiled and installed.
+  $ BUILDVARS="-V MKMAGIC=yes -V MKLLVMCMDS=no" ./releasetools/x86_hdimage.sh -b
  
-The source code of the passes is located in the ''minix/llvm/passes'' subdirectory of the MINIX3 source code. A pass can be compiled and installed by going to its ''minix/llvm/passes/{pass}'' subdirectory, and issuing ''make install''.
+In order to build an image with ASR randomization, including four additional ASR-rerandomized versions of each system service, use the following build variables:
  
-For example, to recompile and install the magic pass:
+  $ BUILDVARS="-V MKASR=yes -V ASRCOUNT=4" ./releasetools/x86_hdimage.sh -b
  
-  $ cd passes/magic
-  $ make install
+Obviously, all variables shown above can be combined as appropriate. The author of this document has used the following command line on several occasions:
  
-The passes are installed to ''minix/llvm/bin''. In a later step, the ''build.llvm'' script will pick them up from there.
+  $ CC=clang CXX=clang++ JOBS=4 BUILDVARS="-V MKASR=yes -V ASRCOUNT=2 -V MKLLVMCMDS=no" ./releasetools/x86_hdimage.sh -b
  
-=== Instrumentation and image building ===
+After the first run, the build system will perform recompilation of only the parts of the source code that have changed, and should not take nearly as long to run as the first time. In case of unexpected problems when rebuilding, it may be necessary to throw away the previously generated objects and rebuild the MINIX3 source code in its entirety. This can be done by going to the top-level ''obj.i386'' directory and deleting all files and directories in there, except the ''tooldir.{yourplatform}'' subdirectory. Fully rebuilding the MINIX3 source code will take longer than an incremental rebuild, but since the crosscompilation toolchain is left as is, it will still take nowhere near as long as the first run.
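
Deleting everything except one subdirectory is easy to get wrong by hand. The following is a minimal sketch of that cleanup, assuming a hypothetical ''clean_objdir'' helper that is not part of the MINIX3 source tree. It keeps any top-level entry whose name starts with ''tooldir.'' and removes everything else, so the target path should be double-checked before running it.

```shell
#!/bin/sh
# Hypothetical helper (not part of MINIX3): empty an obj.i386 directory
# while keeping the crosscompilation toolchain in tooldir.*.
# Usage: clean_objdir /home/user/minix-liveupdate/obj.i386
clean_objdir() {
    objdir=$1
    [ -d "$objdir" ] || { echo "no such directory: $objdir" >&2; return 1; }
    # Remove every top-level entry (hidden ones included) except tooldir.*.
    # -mindepth and -maxdepth are extensions supported by GNU and BSD find.
    find "$objdir" -mindepth 1 -maxdepth 1 ! -name 'tooldir.*' \
        -exec rm -rf {} +
}
```

The next build run then recompiles the MINIX3 sources in full, while reusing the toolchain that was left in place.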
  
-After building the system, two more steps need to be performed: instrumentation of system services, and generation of a bootable hard disk image. These steps must be performed every time the system is built, including the first time. In particular, every time a system service is (re)compiled, it must be (re)instrumented afterwards. Furthermore, every time any part of the compiled MINIX3 installation is changed, a new image must be built.
-
-In order to generate a fully instrumented system image with a number of pregenerated ASR binaries for all services, one can run a command that automates both steps. This approach is recommended for most users and covered in the first subsection. Alternatively, the details of manual instrumentation and image building are covered in the two subsections after.
-
-== The easy way: bulk ASR generation ==
-
-The ''clientctl'' script in ''minix/llvm'' provides a convenient way to instrument all services for live update and rerandomization, generate a number of rerandomized versions of each service, and build a hard disk image. The command has the following syntax:
-
-  $ ./clientctl buildasr [N]
-
-Here, N is an optional parameter specifying the number of rerandomized binaries that should be generated in addition to the standard set of randomized binaries. N defaults to 1. For example, the following command will produce a system with four randomized sets of service binaries: one set of ASR-randomized services that are used by default, and three extra rerandomized binaries to which the system can switch at run time:
-
-  $ ./clientctl buildasr 3
-
-The result is a MINIX3 hard disk image file which can be booted in (for example) qemu; see further below.
-
-== The manual way (1/2): instrumentation ==
-
-Instrumentation takes place at the granularity of individual system services. The ''minix/llvm'' directory contains scripts that allow for relinking services against runtime libraries, and instrumenting services with LLVM passes. The general procedure is like this:
-
-  - First, the service is compiled and linked to its basic form.
-  - Then, the resulting linked bitcode object is relinked with **libmagic**.
-  - Finally, link-time instrumentation is applied by running the **magic pass**, possibly as well as the **asr pass**, on the linked bitcode object.
-
-Each step also (re)generates a ready-to-execute machine code version of the service.
-
-Step 1 happens in the "building the system" step, using ''configure.llvm'', as explained in a previous section.
-
-Step 2 is done with the ''relink.llvm'' script in ''minix/llvm''. This script will relink services against a space-separated list of libraries. For live update, only the magic library is relevant:
-
-  $ ./relink.llvm magic
-
-This command will relink all services against libmagic, thus providing them with runtime support for live update.
-
-Step 3 is done with the ''build.llvm'' script in ''minix/llvm''. This script will instrument services with a space-separated list of LLVM passes. For live update, the magic pass should be used:
-
-  $ ./build.llvm magic
-
-This command will instrument all services with the magic pass, performing static analysis and changing the service to include the information necessary for libmagic to perform live updates at runtime.
-
-For live rerandomization support, one must apply not only the magic pass, but also the asr pass:
-
-  $ ./build.llvm magic asr
-
-The resulting service will not only be ready for live update, but also be subjected to fine-grained randomization. It will also be supplied with parameters to perform the runtime component of rerandomization when requested during live updates.
-
-For reference, the ''clientctl buildasr'' command shown above performs this step multiple times to generate different rerandomized versions of each service, storing each in a different location.
-
-We now describe some details that might be useful to know about relinking and applying passes:
-
-  * By default, ''relink.llvm'' and ''build.llvm'' perform their respective actions on all system services. It is however possible instrument only a subset of services, leaving the other services untouched. This can be done by passing a ''C'' shell variable with a comma-separated list of individual services. For example, the following command relinks the PM (Process Manager) service against the magic library:
-
-  $ C=pm ./relink.llvm magic
-
-In the ''C'' variable, the pseudo-targets ''servers'', ''fs'', ''net'', and ''drivers'' can be used to perform the script's actions on the services in the corresponding subdirectories in the MINIX3 source tree. The ''rd'' pseudo-target regenerates the ramdisk, which must be redone after changing any service on the ramdisk. For example, the following command instruments core servers and file system services with the magic and asr passes, and regenerates the ramdisk:
-
-  $ C=servers,fs,rd ./build.llvm magic asr
-
-The ''clientctl buildasr'' command accepts this optional ''C'' shell variable as well. It will however remove any previously generated ASR-rerandomized binaries of all services, irrespective of the ''C'' variable.
-
-  * Each of the three steps undoes the effects of prior invocations of both that step and subsequent steps, but not earlier steps. In other words: compiling and linking a service (step 1) will undo any previous relinking and instrumentation. Relinking a service (step 2) will similarly undo any previous relinking and instrumentation of the same service. Instrumenting a service (step 3) will undo any previous instrumentation, reapplying the instrumentation to the same relinked binary. For this reason, a single ''build.llvm'' invocation must be used to apply all passes at once.
-
-  * Instrumentation with the magic pass will fail if the service has not been relinked with libmagic first. The same applies to the asr pass. However, the asr pass will not fail if the service has not been instrumented with the magic pass. Instrumenting a service with the asr pass but not the magic pass is of limited use: the service will be randomized, but cannot be subjected to live rerandomization.
-
-== The manual way (2/2): building the image ==
-
-Finally, a MINIX3 image can be built from the compiled MINIX3 code using the ''clientctl'' **buildimage** command:
-
-  $ ./clientctl buildimage
-
-This command produces a bootable MINIX3 hard disk image file. The generated image file is called ''minix_x86.img'' and located in the root of the MINIX3 source tree - ''minix-src'' in our examples.
-
-This command is called automatically as part of ''clientctl buildasr''.
+As explained in more detail on the [[.:crosscompiling|crosscompilation page]], it is also possible to rebuild particular parts of the system without going through the entire "make build" process. This involves the use of the ''nbmake-i386'' tool and generally requires a good understanding of the compilation process.
  
 === Running the image ===
  
-Once a hard disk image has been generated, it can be run. The most convenient way to run the image is to use **qemu**. For convenience, the ''clientctl'' script in ''minix/llvm'' has a **run** command to run the image in qemu without further effort:
+The x86_hdimage command produces a bootable MINIX3 hard disk image file. The generated image file is called ''minix_x86.img'' and located in the root of the MINIX3 source tree - ''minix-src'' in our examples. Once an image has been generated, it can be run. The most convenient way to run the image is to use **qemu/KVM**. This can be done using the command as given at the end of the x86_hdimage output.
  
-  $ OUT=F ./clientctl run
+While explaining the use of qemu is beyond the scope of this document, it may be useful to look into the ''-append'', ''-curses'', and ''-serial file:..'' qemu command line arguments. The following command line will launch qemu with KVM support (remove ''<nowiki>--enable-kvm</nowiki>'' to disable KVM support), a curses-based user interface, and system output redirected to a file named ''serial.out'':
  
-The ''OUT'' shell variable can be set to other values to control what to do with serial output. The ''F'' value specifies that the serial output will be redirected to a ''F''ile, namely ''serial.out'' in the current directory. The other supported settings are ''S''tdout, ''C''onsole, and ''P''ty.
-
-Extra [[usersguide:bootmonitor|boot options]] can be supplied through the APPEND variable:
-
-  $ OUT=F APPEND="rs_verbose=1" ./clientctl run
+  $ cd /home/user/minix-liveupdate/minix-src
+  $ (cd ../obj.i386/destdir.i386/boot/minix/.temp && qemu-system-i386 --enable-kvm -m 256 -kernel kernel -initrd "mod01_ds,mod02_rs,mod03_pm,mod04_sched,mod05_vfs,mod06_memory,mod07_tty,mod08_mib,mod09_vm,mod10_pfs,mod11_mfs,mod12_init" -hda ../../../../../minix-src/minix_x86.img -curses -serial file:../../../../../minix-src/serial.out -append "rootdevname=c0d0p0 cttyline=0")
  
-This example will enable verbose output in the RS service, which is highly useful for debugging issues with live update.
+Extra [[usersguide:bootmonitor|boot options]] can be supplied in the (space-separated) list that follows the ''-append'' switch. For example, adding ''rs_verbose=1'' will enable verbose output in the RS service, which is highly useful for debugging issues with live update.
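
When experimenting with several boot options, it may be convenient to assemble the ''-append'' value in one place. The following is a minimal sketch with a hypothetical ''qemu_append'' helper; it starts from the two base options used in the qemu command line on this page and adds any extra options, separated by spaces.

```shell
#!/bin/sh
# Hypothetical helper (not part of MINIX3): print the value for qemu's
# -append switch. Usage: qemu_append [extra-option ...]
qemu_append() {
    # Base options as used in the qemu command line on this page.
    opts="rootdevname=c0d0p0 cttyline=0"
    for extra in "$@"; do
        opts="$opts $extra"
    done
    echo "$opts"
}
```

The result would be passed to qemu as ''-append "$(qemu_append rs_verbose=1)"''.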
  
 === Summary ===
  
-The following commands can be used to obtain, build, instrument, and start a MINIX3 system that supports live update and live rerandomization, including three alternative rerandomized versions of all system services, in addition to the randomized standard ones:
+The following commands can be used to obtain and build a MINIX3 system that supports live update and live rerandomization, including three alternative rerandomized versions of all system services, in addition to the randomized standard ones:
  
+  $ export CC=clang CXX=clang++ JOBS=8
+  $ cd /home/user/minix-liveupdate
   $ git clone git://git.minix3.org/minix minix-src
   $ cd minix-src/minix/llvm
-  $ export CC=clang CXX=clang++
-  $ JOBS=8 BUILDVARS="-V MKMAGIC=yes" ./configure.llvm
-  $ ./clientctl buildasr 3
-  $ OUT=F ./clientctl run
+  $ ./generate_gold_plugin.sh
+  $ cd ../..
+  $ BUILDVARS="-V MKASR=yes -V MKLLVMCMDS=no" ./releasetools/x86_hdimage.sh -b
  
 The entire procedure will typically take about 30GB of disk space and several hours of time.
Line 232: Line 160:
 Sometime later, the following steps can be used to update the installation to a newer MINIX3 version:
  
-  $ cd minix-src/minix/llvm
+  $ cd /home/user/minix-liveupdate/minix-src
   $ git pull
-  $ export CC=clang CXX=clang++
-  $ JOBS=8 BUILDVARS="-V MKMAGIC=yes" ./configure.llvm
-  $ for pass in WeakAliasModuleOverride sectionify magic asr; do (cd passes/$pass && make clean install); done
-  $ (cd static/magic && make clean install)
-  $ ./clientctl buildasr 3
-  $ OUT=F ./clientctl run
+  $ CC=clang CXX=clang++ JOBS=8 BUILDVARS="-V MKASR=yes -V MKLLVMCMDS=no" ./releasetools/x86_hdimage.sh -b
  
 In contrast to the initial run, the entire update procedure should take no more than an hour.
- 
-Instead of the ''​./​clientctl buildasr 3''​ step in the above two examples, one can for example also instrument the system for live update but not live rerandomization,​ using the following three replacement steps: 
- 
-  $ ./​relink.llvm magic 
-  $ ./​build.llvm magic 
-  $ ./clientctl buildimage 
  
 ==== Using live update ====
Line 266: Line 183:
   minix# ./testrelpol
  
-For its live update tests, this script does //not// use the magic framework for state transfer at all. Instead it uses **identity transfer** which performs a basic memory copy between the old and the new instance. As a result, the testrelpol script should succeed whether or not services are instrumented. However, it may not work reliably on MINIX3 systems that are not built for magic instrumentation at all (i.e., built without ''MKMAGIC=yes'').
+For its live update tests, this script does //not// use the magic framework for state transfer at all. Instead it uses **identity transfer** which performs a basic memory copy between the old and the new instance. As a result, the testrelpol script should succeed whether or not services are instrumented. However, it may not work reliably on MINIX3 systems that are not built for magic instrumentation (i.e., built with neither ''MKMAGIC=yes'' nor ''MKASR=yes'').
  
 == Live rerandomization:​ update_asr == == Live rerandomization:​ update_asr ==
  
-As we have shown before, the ''​clientctl buildasr''​ host-side ​command can perform ​the //​build-time//​ preparation of a MINIX3 system for live rerandomization. Complementing this, the //​run-time//​ side of the live rerandomization is provided by means of the **update_asr** command. The update_asr command will update system services to their next pregenerated rerandomized version, using a cyclic system. Live rerandomization is not automatic, and thus, the MINIX3 system administrator is responsible for running the update_asr command at appropriate times.+As we have shown before, the ''​MKASR=yes''​ host-side ​build variable performs ​the //​build-time//​ preparation of a MINIX3 system for live rerandomization. Complementing this, the //​run-time//​ side of the live rerandomization is provided by means of the **update_asr** command. The update_asr command will update system services to their next pregenerated ​ASR-rerandomized version, using a cyclic system. Live rerandomization is not automatic, and thus, the MINIX3 system administrator is responsible for running the update_asr command at appropriate times.
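
The "cyclic system" mentioned above can be pictured with a small sketch. The function name and the numbering scheme (versions 0 to count-1, with version 0 as the base binary) are assumptions made purely for illustration; this is not the actual logic of the update_asr script.

```c
/* Hypothetical model of a cyclic version scheme: versions are numbered
 * 0..count-1, and the successor of the last version is version 0 again. */
static int next_asr_version(int current, int count)
{
    /* Reject an empty set of versions or an out-of-range current version. */
    if (count <= 0 || current < 0 || current >= count)
        return -1;

    return (current + 1) % count;
}
```

With four pregenerated versions, repeated rounds would visit 1, 2, 3, and then wrap around to 0.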
  
 By default, the update_asr command performs one round of ASR rerandomization,​ updating each service to its next version: By default, the update_asr command performs one round of ASR rerandomization,​ updating each service to its next version:
Line 288: Line 205:
 === Live update commands === === Live update commands ===
  
-RS can be instructed to perform live updates through the service(8) command, specifically through its **service update** subcommand. This command is also used by the automated scripts. For a full overview of the command'​s functionality,​ please see the service(8) manual page as well as the command'​s output when it is run with no parameters.+RS can be instructed to perform live updates through the minix-service(8) command, specifically through its **minix-service update** subcommand. This command is also used by the automated scripts. For a full overview of the command'​s functionality,​ please see the minix-service(8) manual page as well as the command'​s output when it is run with no parameters.
  
-In its most fundamental form, the //service update// command will update a running service, identified by its label, to a new version provided as an on-disk binary file. It is however also possible to tell RS to update the service into a copy of itself. In addition, various flags and options can be used for fine-grained control of the live update action. The basic syntax to perform a live update on a single system service is as follows:+In its most fundamental form, the //minix-service update// command will update a running service, identified by its label, to a new version provided as an on-disk binary file. It is however also possible to tell RS to update the service into a copy of itself. In addition, various flags and options can be used for fine-grained control of the live update action. The basic syntax to perform a live update on a single system service is as follows:
  
-  minix# service [flags] update [self|<​binary>​] -label <​label>​ [options]+  minix# ​minix-service [flags] update [self|<​binary>​] -label <​label>​ [options]
  
 Through various combinations of this command'​s parameters, MINIX3 basically supports four types of updates, representing increasingly challenging conditions for the overall live update infrastructure in general, and state transfer in particular. We will now go through all of them, and explain how they can be performed. For more details regarding what is actually going on below the surface, please consult the developers guide section of this document. Through various combinations of this command'​s parameters, MINIX3 basically supports four types of updates, representing increasingly challenging conditions for the overall live update infrastructure in general, and state transfer in particular. We will now go through all of them, and explain how they can be performed. For more details regarding what is actually going on below the surface, please consult the developers guide section of this document.
Line 298: Line 215:
 == Identity transfer == == Identity transfer ==
  
-The first update type is **identity transfer**. In this case, the service is updated to an identical copy of itself, with all functions and static data in the new instance located at the exact same addresses as the old instance. Identity transfer bluntly copies over entire memory sections at once, thus requiring no instrumentation at all. This makes it suitable for testing of the MINIX3-specific side of the live update infrastructure,​ hence its use in the ''​testrelpol''​ script. Identity transfer is the default of the service(8) command when "​self"​ is given instead of a path to a new binary:+The first update type is **identity transfer**. In this case, the service is updated to an identical copy of itself, with all functions and static data in the new instance located at the exact same addresses as the old instance. Identity transfer bluntly copies over entire memory sections at once, thus requiring no instrumentation at all. This makes it suitable for testing of the MINIX3-specific side of the live update infrastructure,​ hence its use in the ''​testrelpol''​ script. Identity transfer is the default of the minix-service(8) command when "​self"​ is given instead of a path to a new binary:
  
-  minix# service update self -label pm+  minix# ​minix-service update self -label pm
  
 This will perform an identity transfer of the PM service. Identity transfer should work for literally all MINIX3 system services. As mentioned, it is guaranteed to work only when the system was built with ''​MKMAGIC=yes'',​ although it will mostly work on systems built without magic support as well. It works regardless of whether the target service was instrumented with the magic framework (or ASR). This will perform an identity transfer of the PM service. Identity transfer should work for literally all MINIX3 system services. As mentioned, it is guaranteed to work only when the system was built with ''​MKMAGIC=yes'',​ although it will mostly work on systems built without magic support as well. It works regardless of whether the target service was instrumented with the magic framework (or ASR).
  
-If the live update is successful, the service(8) command will be silent, but RS will print a system message that the update succeeded:+If the live update is successful, the minix-service(8) command will be silent, but RS will print a system message that the update succeeded:
  
   RS: update succeeded   RS: update succeeded
Line 310: Line 227:
 If the system was started on qemu with ''​OUT=F'',​ this message will end up in ''​serial.out''​. Otherwise, the message should show up in the MINIX3 system log (''/​var/​log/​messages''​) and possibly on the first console. If the system was started on qemu with ''​OUT=F'',​ this message will end up in ''​serial.out''​. Otherwise, the message should show up in the MINIX3 system log (''/​var/​log/​messages''​) and possibly on the first console.
  
-If the live update fails, RS should print an error to the system log, and service(8) will complain. In order to debug such failures, it may be useful to enable verbose mode in RS, by starting the system with ''rs_verbose=1'' as shown earlier.+If the live update fails, RS should print an error to the system log, and minix-service(8) will complain. In order to debug such failures, it may be useful to enable verbose mode in RS, by starting the system with ''rs_verbose=1'' as shown earlier.
  
 == Self state transfer == == Self state transfer ==
Line 316: Line 233:
 The second update type is **self state transfer**. Self state transfer also performs an update of a service into an identical copy of itself, but instead uses the state transfer functionality of the magic framework. Thus, self state transfer requires that the service be instrumented properly. This update type can be used to test whether a service'​s state can be transferred without problems. Please note that many of the points covered here also apply to the remaining two update types, as all three are using the state transfer of the magic framework. The second update type is **self state transfer**. Self state transfer also performs an update of a service into an identical copy of itself, but instead uses the state transfer functionality of the magic framework. Thus, self state transfer requires that the service be instrumented properly. This update type can be used to test whether a service'​s state can be transferred without problems. Please note that many of the points covered here also apply to the remaining two update types, as all three are using the state transfer of the magic framework.
  
-Self state transfer is performed by supplying the ''​-t''​ flag along with "​self"​ to the service update command:+Self state transfer is performed by supplying the ''​-t''​ flag along with "​self"​ to the minix-service update command:
  
-  minix# service -t update self -label pm+  minix# ​minix-service -t update self -label pm
  
-This command will perform self state transfer of the PM service. The libmagic ​state transfer routine in the new service instance will print additional system messages while it is running. Upon success, the system output will look somewhat like this:+This command will perform self state transfer of the PM service. The libmagicrt ​state transfer routine in the new service instance will print additional system messages while it is running. Upon success, the system output will look somewhat like this:
  
   total remote functions: 57. relocated: 54   total remote functions: 57. relocated: 54
Line 332: Line 249:
   RS: update succeeded   RS: update succeeded
  
-If the state transfer routine is not able to perform state transfer successfully,​ it will print messages that start with ''​[ERROR]''​. RS will then roll back the service to the old instance, and both RS and service(8) will report failure. Self state transfer should succeed for all MINIX3 system services that have been built with bitcode and instrumented with libmagic ​and the magic pass. As of writing, there are no system services for which self state transfer is known to result in ''​[ERROR]''​ lines and subsequent live update failure. However:+If the state transfer routine is not able to perform state transfer successfully,​ it will print messages that start with ''​[ERROR]''​. RS will then roll back the service to the old instance, and both RS and minix-service(8) will report failure. Self state transfer should succeed for all MINIX3 system services that have been built with bitcode and instrumented with libmagicrt ​and the magic pass. As of writing, there are no system services for which self state transfer is known to result in ''​[ERROR]''​ lines and subsequent live update failure. However:
  
   * It is possible that new changes to system services, and even usage scenarios which we have not yet tested, do result in state transfer errors. Such errors should be resolved. The developers guide further below contains information on how to resolve some of these errors.   * It is possible that new changes to system services, and even usage scenarios which we have not yet tested, do result in state transfer errors. Such errors should be resolved. The developers guide further below contains information on how to resolve some of these errors.
Line 340: Line 257:
   * Some services have no state to transfer, in which case their new instances will perform a fresh start instead of state transfer. In that case, live update with self state transfer will succeed, but not print the state transfer system messages shown above. This is the case for the IS (Information Server) and readclock.drv services, for example.   * Some services have no state to transfer, in which case their new instances will perform a fresh start instead of state transfer. In that case, live update with self state transfer will succeed, but not print the state transfer system messages shown above. This is the case for the IS (Information Server) and readclock.drv services, for example.
  
-  * Some services may only be updated once brought into a specific state of quiescence, because the default quiescence state is not sufficiently restrictive. In that case, the user must specify an alternative quiescence state explicitly, through the service(8) ''​-state''​ option. This currently applies to all services that make use of userspace threads, namely the VFS, ahci, and virtio_blk services. These services must be updated using quiescence state 2 (//request free//) rather than state 1 (//work free//):+  * Some services may only be updated once brought into a specific state of quiescence, because the default quiescence state is not sufficiently restrictive. In that case, the user must specify an alternative quiescence state explicitly, through the minix-service(8) ''​-state''​ option. This currently applies to all services that make use of userspace threads, namely the VFS, ahci, and virtio_blk services. These services must be updated using quiescence state 2 (//request free//) rather than state 1 (//work free//):
  
-  minix# service -t update self -label vfs -state 2+  minix# ​minix-service -t update self -label vfs -state 2
  
 Omitting the appropriate state parameter may result in a crash of the service after live update. At the moment, the update_asr(8) script has hardcoded knowledge about these necessary states. None of this is great, and we will be working towards a situation where the default state will not result in a crash - see the section on open issues further below. Omitting the appropriate state parameter may result in a crash of the service after live update. At the moment, the update_asr(8) script has hardcoded knowledge about these necessary states. None of this is great, and we will be working towards a situation where the default state will not result in a crash - see the section on open issues further below.
  
-  * State transfer may be slow, and RS applies a rather strict default timeout for live updates. Therefore, it may sometimes be necessary to set a longer timeout in order to avoid needless failures. This can be done through the ''​-maxtime''​ option to service(8):+  * State transfer may be slow, and RS applies a rather strict default timeout for live updates. Therefore, it may sometimes be necessary to set a longer timeout in order to avoid needless failures. This can be done through the ''​-maxtime''​ option to minix-service(8):
  
-  minix# service -t update self -label vfs -state 2 -maxtime 120HZ+  minix# ​minix-service -t update self -label vfs -state 2 -maxtime 120HZ
  
 The maximum time is specified in clock ticks by default, but may be given in seconds by appending "​HZ"​ to the timeout. The latter may sound confusing and it is, but the original idea was supposedly that the number of seconds is multiplied by the system'​s clock frequency, also known as its HZ setting. The above example allows the live update of VFS to take up to two minutes. The maximum time is specified in clock ticks by default, but may be given in seconds by appending "​HZ"​ to the timeout. The latter may sound confusing and it is, but the original idea was supposedly that the number of seconds is multiplied by the system'​s clock frequency, also known as its HZ setting. The above example allows the live update of VFS to take up to two minutes.
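
The arithmetic behind the "HZ" suffix can be made concrete with a minimal sketch. The helper name ''maxtime_to_ticks'' and the clock frequency passed to it are hypothetical; this illustrates the rule described above and is not the parsing code of minix-service(8).

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: convert a -maxtime argument to clock ticks.
 * A plain number is taken as a tick count. A number followed by "HZ" is
 * taken as seconds and multiplied by the clock frequency hz.
 * Returns -1 on malformed input. */
static long maxtime_to_ticks(const char *arg, long hz)
{
    char *end;
    long value;

    value = strtol(arg, &end, 10);
    if (end == arg || value < 0)
        return -1;              /* no digits, or a negative number */
    if (*end == '\0')
        return value;           /* already in clock ticks */
    if (strcmp(end, "HZ") == 0)
        return value * hz;      /* seconds times ticks per second */
    return -1;                  /* unknown suffix */
}
```

Assuming a clock frequency of 60 ticks per second, ''120HZ'' would come out as 7200 ticks, which is the two minutes mentioned above.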
Line 354: Line 271:
 == ASR rerandomization == == ASR rerandomization ==
  
-The third update type is **ASR rerandomization**. Like self state transfer, ASR rerandomization uses the magic framework to perform state transfer. In this case, the service performs state transfer into a rerandomized version of the same service. This involves specifying the path to a rerandomized ASR binary to the service(8) command, as well as the ''​-a''​ flag. The ''​-a''​ flag tells the new instance to enable the run-time parts of rerandomization during the live update.+The third update type is **ASR rerandomization**. Like self state transfer, ASR rerandomization uses the magic framework to perform state transfer. In this case, the service performs state transfer into a rerandomized version of the same service. This involves specifying the path to a rerandomized ASR binary to the minix-service(8) command, as well as the ''​-a''​ flag. The ''​-a''​ flag tells the new instance to enable the run-time parts of rerandomization during the live update.
  
-  minix# service -a update /​service/​asr/​1/pm -label pm+  minix# ​minix-service -a update /​service/​asr/​pm--progname ​pm -label pm
  
-In a system that has been built with ASR rerandomization,​ the (randomized) base service binaries are located in ''/​service''​ and the (randomized) alternative service binaries are located ​in numbered ​subdirectories ​in ''/​service/​asr''​. As mentioned before, the update_asr(8) command can be used to perform these updates semi-automatically.+In a system that has been built with ASR rerandomization,​ the (randomized) base service binaries are located in ''/​service''​ and the (randomized) alternative service binaries are located ​as numbered ​files in ''/​service/​asr''​. As mentioned before, the update_asr(8) command can be used to perform these updates semi-automatically.
  
 Compared to self state transfer, ASR rerandomization comes with one extra restriction:​ the VM service cannot be subjected to forms of state transfer more complicated than self state transfer. For this reason, VM is also skipped by the update_asr(8) command. We will explain the restrictions regarding the VM service in the developers guide. Compared to self state transfer, ASR rerandomization comes with one extra restriction:​ the VM service cannot be subjected to forms of state transfer more complicated than self state transfer. For this reason, VM is also skipped by the update_asr(8) command. We will explain the restrictions regarding the VM service in the developers guide.
Line 366: Line 283:
 The final update type is a **functional update**. Compared to self state transfer, ASR rerandomization relocates code and more data. However, for ASR rerandomization,​ there are still fundamentally no differences between the old and the new version of the service. In contrast, in the case of a functional update, the service performs state transfer into a new program. While this new program is typically highly similar, it may be different from the running service in various ways. The final update type is a **functional update**. Compared to self state transfer, ASR rerandomization relocates code and more data. However, for ASR rerandomization,​ there are still fundamentally no differences between the old and the new version of the service. In contrast, in the case of a functional update, the service performs state transfer into a new program. While this new program is typically highly similar, it may be different from the running service in various ways.
  
-In terms of the service(8) command, such functional updates can be performed by simply using //service update// with a new binary. For example, one could test a new version of the UDS (UNIX Domain Sockets) service, without installing it into ''/​service''​ yet, and without affecting its open sockets:+In terms of the minix-service(8) command, such functional updates can be performed by simply using //minix-service update// with a new binary. For example, one could test a new version of the UDS (UNIX Domain Sockets) service, without installing it into ''/​service''​ yet, and without affecting its open sockets:
  
-  minix# service update /​usr/​src/​minix/​net/​uds/​uds -label uds+  minix# ​minix-service update /​usr/​src/​minix/​net/​uds/​uds -label uds
  
 The possibility of actual differences between the old and new service versions adds an extra dimension for the state transfer. Additional state transfer problems can be expected in this case, and must be dealt with accordingly. The developers guide will (eventually) elaborate on this point. The possibility of actual differences between the old and new service versions adds an extra dimension for the state transfer. Additional state transfer problems can be expected in this case, and must be dealt with accordingly. The developers guide will (eventually) elaborate on this point.
  
-Similarly, depending on the nature of the update, the update action may require a specific state of quiescence. Taking UDS as an example, an update may change file descriptor transfers over sockets, in which case the update may impose that no file descriptors be in flight at the time of the update. The old instance of the service must support this as a custom quiescence state. This custom state can then be specified through the ''​-state''​ option of the //service update// command.+Similarly, depending on the nature of the update, the update action may require a specific state of quiescence. Taking UDS as an example, an update may change file descriptor transfers over sockets, in which case the update may impose that no file descriptors be in flight at the time of the update. The old instance of the service must support this as a custom quiescence state. This custom state can then be specified through the ''​-state''​ option of the //minix-service update// command.
  
 Since the live update functionality is relatively new for MINIX3, we do not yet have much experience with the practical side of performing functional updates to services. This document will be expanded as we gain more insight into the common usage patterns of live update. Stay tuned! Since the live update functionality is relatively new for MINIX3, we do not yet have much experience with the practical side of performing functional updates to services. This document will be expanded as we gain more insight into the common usage patterns of live update. Stay tuned!
Line 378: Line 295:
 == Multicomponent updates == == Multicomponent updates ==
  
-From the user's perspective,​ updating multiple services at once is not much more complex than updating a single service. First, a number of **service update** commands should be issued, just as before, but each with the ''​-q''​ flag added:+From the user's perspective,​ updating multiple services at once is not much more complex than updating a single service. First, a number of **minix-service update** commands should be issued, just as before, but each with the ''​-q''​ flag added:
  
-  minix# service -q -t update /service/pm -label pm +  minix# ​minix-service -q -t update /service/pm -label pm 
-  minix# service -q -t update /​service/​vfs -label vfs -state 2+  minix# ​minix-service -q -t update /​service/​vfs -label vfs -state 2
  
-Then, the entire update can be launched with the **service sysctl upd_run** command:+Then, the entire update can be launched with the **minix-service sysctl upd_run** command:
  
-  minix# service sysctl upd_run+  minix# ​minix-service sysctl upd_run
  
-The RS output will be much more verbose in this case. Note that timeouts are still to be specified on a per-service basis, rather than for the entire update at once. If necessary, any queued //service update// commands may be canceled with the **upd_stop** subcommand:+The RS output will be much more verbose in this case. Note that timeouts are still to be specified on a per-service basis, rather than for the entire update at once. If necessary, any queued //minix-service update// commands may be canceled with the **upd_stop** subcommand:
  
-  minix# service sysctl upd_stop+  minix# ​minix-service sysctl upd_stop
  
 This will cancel the entire multicomponent live update action. This will cancel the entire multicomponent live update action.
- 
-==== Useful host commands ==== 
- 
-The host-side ''​clientctl''​ script in ''​minix/​llvm''​ offers a number of additional convenient commands, mainly for developers. We list some of them here. 
- 
-The **buildboot** command installs just the services that are part of the boot image. It can be used instead of ''​clientctl buildimage''​ when only boot-image services have been changed, thus speeding up the development cycle: 
- 
-  $ ./clientctl buildboot 
- 
-Using this command, it is possible to make and test changes to boot system services fairly quickly. As an example, the following set of steps suffices to make and test changes to the PM service: 
- 
-  $ export PATH=$PATH:/​home/​user/​minix-liveupdate/​obj.i386/​tooldir.{platform}/​bin 
-  $ cd minix-src/​minix/​servers/​pm 
-  [make changes to the PM source code] 
-  $ nbmake-i386 all install 
-  $ cd ../../llvm 
-  $ C=pm ./​relink.llvm magic 
-  $ C=pm ./​build.llvm magic 
-  $ ./clientctl buildboot 
-  $ OUT=F ./clientctl run 
- 
-The **unstack** command shows a stacktrace of pretty much any MINIX3 binary in human-readable form: 
- 
-  $ ./clientctl unstack <​name>​ [address [address ..]] 
- 
-For example, to show a stack trace of the PM service in a human-readable form: 
- 
-  $ ./clientctl unstack pm 0x805a7fd 0x80492a5 0x8048050 
- 
-Note that on ASR-enabled installations,​ the unstack command works only on the base versions of system services. There is currently no way to unstack a stacktrace for any of the ASR-rerandomized service binaries. On one occasion, the author of this document has done that process by hand, by finding the matching assembly code of an ASR-rerandomized service'​s crash site in the service'​s base version. 
  
 ===== Developers guide ===== ===== Developers guide =====
Line 450: Line 337:
 In certain cases, a service may have to meet custom requirements before it is allowed to be updated. This depends on both the service and the update. We previously gave an example regarding the UDS service and transferring file descriptors. As another example, an update that affects message protocols may have to ensure that the service has no outstanding requests to other services using that protocol. As yet another example, certain drivers may want to avoid being updated while certain types of DMA are ongoing, etcetera. In certain cases, a service may have to meet custom requirements before it is allowed to be updated. This depends on both the service and the update. We previously gave an example regarding the UDS service and transferring file descriptors. As another example, an update that affects message protocols may have to ensure that the service has no outstanding requests to other services using that protocol. As yet another example, certain drivers may want to avoid being updated while certain types of DMA are ongoing, etcetera.
  
-It is up to the writer of the service to implement any such custom quiescence states, assigning a number to each of them. It is then up to the system administrator to supply such a state with the //service update// command, using the ''​-state <​number>''​ option. Some of the quiescence states are predefined; others must be defined by the service developer explicitly. The following states are defined:+It is up to the writer of the service to implement any such custom quiescence states, assigning a number to each of them. It is then up to the system administrator to supply such a state with the //minix-service update// command, using the ''​-state <​number>''​ option. Some of the quiescence states are predefined; others must be defined by the service developer explicitly. The following states are defined:
  
   * State **1** (''SEF_LU_STATE_WORK_FREE''): work free. This state ensures that the service is not currently performing any work. The fact that the service is being prepared at the time of verifying the quiescence state implies that it is not doing any other work, and thus, SEF is hardcoded to accept updates in this state. The service developer cannot override the check for this state.   * State **1** (''SEF_LU_STATE_WORK_FREE''): work free. This state ensures that the service is not currently performing any work. The fact that the service is being prepared at the time of verifying the quiescence state implies that it is not doing any other work, and thus, SEF is hardcoded to accept updates in this state. The service developer cannot override the check for this state.
Line 468: Line 355:
   sef_setcb_lu_state_isvalid(my_state_isvalid);​   sef_setcb_lu_state_isvalid(my_state_isvalid);​
  
-This routine has the signature ''​int my_state_isvalid(int state, int flags)'',​ and will be called when a live update is initiated through service(8). As its most important parameter, ''​state''​ is the requested quiescence state. The ''​flags''​ parameter contains update flags and is typically unused. The routine must return ''​TRUE''​ if the state is valid for the service, and ''​FALSE''​ otherwise. Most services will want to allow the standard states as well as any custom states:+This routine has the signature ''​int my_state_isvalid(int state, int flags)'',​ and will be called when a live update is initiated through ​minix-service(8). As its most important parameter, ''​state''​ is the requested quiescence state. The ''​flags''​ parameter contains update flags and is typically unused. The routine must return ''​TRUE''​ if the state is valid for the service, and ''​FALSE''​ otherwise. Most services will want to allow the standard states as well as any custom states:
  
   #define MY_CUSTOM_STATE_0 (SEF_LU_STATE_CUSTOM_BASE+0)   #define MY_CUSTOM_STATE_0 (SEF_LU_STATE_CUSTOM_BASE+0)
Line 479: Line 366:
   sef_setcb_lu_prepare(my_lu_prepare);​   sef_setcb_lu_prepare(my_lu_prepare);​
  
-This routine has the signature ''​int my_lu_prepare(int state)'',​ and will be called when a live update is initiated through service(8), after ensuring the given state is valid. Again, ''​state''​ is the requested quiescence state. The function must return ''​OK''​ if the live update can proceed in this state, and ''​ENOTREADY''​ otherwise. It should check the standard states and/or any custom states, typically in a switch statement.+This routine has the signature ''​int my_lu_prepare(int state)'',​ and will be called when a live update is initiated through ​minix-service(8), after ensuring the given state is valid. Again, ''​state''​ is the requested quiescence state. The function must return ''​OK''​ if the live update can proceed in this state, and ''​ENOTREADY''​ otherwise. It should check the standard states and/or any custom states, typically in a switch statement.
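
The two callbacks described above can be sketched as follows. This is a minimal illustration, not code from an actual MINIX3 service: the constants are stand-in definitions so that the fragment compiles outside the MINIX3 tree (only the values 1 and 2 for the work-free and request-free states come from this document, the others are placeholders), and the ''pending_requests'' and ''fds_in_flight'' counters are hypothetical service bookkeeping.

```c
/* Stand-in definitions so this sketch compiles on its own. In a real
 * service these come from the system headers; the values of ENOTREADY and
 * SEF_LU_STATE_CUSTOM_BASE below are placeholders, not the real ones. */
#define TRUE 1
#define FALSE 0
#define OK 0
#define ENOTREADY (-1)
#define SEF_LU_STATE_WORK_FREE    1
#define SEF_LU_STATE_REQUEST_FREE 2
#define SEF_LU_STATE_CUSTOM_BASE  16
#define MY_CUSTOM_STATE_0 (SEF_LU_STATE_CUSTOM_BASE+0)

/* Hypothetical bookkeeping maintained elsewhere in the service. */
static int pending_requests = 0;
static int fds_in_flight = 0;

/* Report whether the service supports the requested quiescence state. */
static int my_state_isvalid(int state, int flags)
{
    (void)flags; /* update flags are typically unused */

    switch (state) {
    case SEF_LU_STATE_WORK_FREE:
    case SEF_LU_STATE_REQUEST_FREE:
    case MY_CUSTOM_STATE_0:
        return TRUE;
    default:
        return FALSE;
    }
}

/* Report whether the service is currently quiescent in the given state. */
static int my_lu_prepare(int state)
{
    switch (state) {
    case SEF_LU_STATE_WORK_FREE:
        /* Being asked at all implies no other work is in progress. */
        return OK;
    case SEF_LU_STATE_REQUEST_FREE:
        return (pending_requests == 0) ? OK : ENOTREADY;
    case MY_CUSTOM_STATE_0:
        /* Custom requirement: nothing pending and no descriptors in flight. */
        return (pending_requests == 0 && fds_in_flight == 0) ? OK : ENOTREADY;
    default:
        return ENOTREADY;
    }
}
```

In a real service, the two functions would be registered with ''sef_setcb_lu_state_isvalid'' and ''sef_setcb_lu_prepare'' as shown above.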
  
 Third, the service may optionally provide a quiescence state debugging function through the sef_setcb_lu_state_dump(3) SEF API call. The given callback routine has the signature ''​int my_lu_state_dump(int state)''​ and should use the sef_lu_dprint(3) printf-like function to print information about the given quiescence state and its current internal state as appropriate,​ using newline-terminated lines. Third, the service may optionally provide a quiescence state debugging function through the sef_setcb_lu_state_dump(3) SEF API call. The given callback routine has the signature ''​int my_lu_state_dump(int state)''​ and should use the sef_lu_dprint(3) printf-like function to print information about the given quiescence state and its current internal state as appropriate,​ using newline-terminated lines.
Line 487: Line 374:
 We now get into the details of the live update infrastructure. For many parts of the story, it may be useful to take a look at the actual source code as well. In this section we give a quick overview of what parts of the source code are where, and what they do. We now get into the details of the live update infrastructure. For many parts of the story, it may be useful to take a look at the actual source code as well. In this section we give a quick overview of what parts of the source code are where, and what they do.
  
-The LLVM instrumentation code is located in ''minix/llvm'' of the MINIX3 source code, along with the supporting scripts described in the users guide. The following relevant LLVM passes are located in ''minix/llvm/passes'':+The LLVM instrumentation passes are located in ''minix/llvm'' of the MINIX3 source code, along with the generate_gold_plugin.sh script described in the users guide. The following relevant LLVM passes are located in ''minix/llvm/passes'':
  
   * The **WeakAliasModuleOverride** pass resolves a particular issue with weak symbols being used in assembly code. TODO   * The **WeakAliasModuleOverride** pass resolves a particular issue with weak symbols being used in assembly code. TODO
Line 493: Line 380:
   * The **sectionify** pass is used to tag certain functions and data of bitcode modules as belonging to a certain section. Its main purpose is to tag certain parts of the compiled code such that the magic pass (see below), in a subsequent run over the same code, will treat the tagged parts as special. For example, it is used to ignore all variables in the libc malloc code for state transfer, for reasons explained later.   * The **sectionify** pass is used to tag certain functions and data of bitcode modules as belonging to a certain section. Its main purpose is to tag certain parts of the compiled code such that the magic pass (see below), in a subsequent run over the same code, will treat the tagged parts as special. For example, it is used to ignore all variables in the libc malloc code for state transfer, for reasons explained later.
  
-  * The **magic** pass performs link-time static analysis and instrumentation of system services. It is responsible for supplying ​libmagic ​(see below) with the necessary information to allow for state transfer at runtime, by including descriptions of data types, global variables, and other information,​ in the service module. In addition, it is responsible for replacing certain function calls in the module, in particular memory management functions, with calls to wrappers in libmagic. This allows for runtime tracking of dynamically allocated memory objects.+  * The **magic** pass performs link-time static analysis and instrumentation of system services. It is responsible for supplying ​libmagicrt ​(see below) with the necessary information to allow for state transfer at runtime, by including descriptions of data types, global variables, and other information,​ in the service module. In addition, it is responsible for replacing certain function calls in the module, in particular memory management functions, with calls to wrappers in libmagicrt. This allows for runtime tracking of dynamically allocated memory objects.
  
-  * The **asr** pass performs randomization of the service binary, for example by rearranging functions, basic blocks within functions, and data, adding padding between those, and letting functions allocate stack padding. The ASR pass does not deal with randomization of dynamically allocated objects. Instead, it passes some settings on to libmagic.+  * The **asr** pass performs randomization of the service binary, for example by rearranging functions, basic blocks within functions, and data, adding padding between those, and letting functions allocate stack padding. The ASR pass does not deal with randomization of dynamically allocated objects. Instead, it passes some settings on to libmagicrt.
  
 In addition to the passes, the following pieces of system functionality are especially important for live update: In addition to the passes, the following pieces of system functionality are especially important for live update:
  
-  * The magic library, **libmagic**, is the runtime component of system services. It implements the actual state transfer routine, which uses both the information embedded in the service by the magic pass, and the tracking information it has gathered about dynamically allocated memory objects at run time. It also implements the actual runtime tracking. Furthermore, libmagic implements the aforementioned runtime part of the ASR functionality. For example, libmagic can add extra padding when performing memory allocations. The magic library is located in ''minix/llvm/static/magic''.+  * The magic runtime library, **libmagicrt**, is the runtime component of system services. It implements the actual state transfer routine, which uses both the information embedded in the service by the magic pass, and the tracking information it has gathered about dynamically allocated memory objects at run time. It also implements the actual runtime tracking. Furthermore, libmagicrt implements the aforementioned runtime part of the ASR functionality. For example, libmagicrt can add extra padding when performing memory allocations. The magic runtime library is located in ''minix/lib/libmagicrt''.
  
-  * The glue between system services and libmagic ​is implemented as part of the **System Event Framework** (SEF) library routines. These routines also handle the communication between the system service and RS. Use of SEF is compulsory for all system services. The SEF code is part of **libsys**. Its implementation can be found in the ''​minix/​lib/​libsys/​sef*.c''​ files.+  * The glue between system services and libmagicrt ​is implemented as part of the **System Event Framework** (SEF) library routines. These routines also handle the communication between the system service and RS. Use of SEF is compulsory for all system services. The SEF code is part of **libsys**. Its implementation can be found in the ''​minix/​lib/​libsys/​sef*.c''​ files.
  
   * The source code of **RS**, the Reincarnation Server, is located in ''​minix/​servers/​rs''​. RS uses live update functionality implemented in the kernel, located in ''​minix/​kernel'',​ and VM, located in ''​minix/​servers/​vm''​.   * The source code of **RS**, the Reincarnation Server, is located in ''​minix/​servers/​rs''​. RS uses live update functionality implemented in the kernel, located in ''​minix/​kernel'',​ and VM, located in ''​minix/​servers/​vm''​.
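
To give an impression of what runtime tracking of dynamically allocated memory objects involves, the following is a minimal sketch of an allocation wrapper that records each object in a table. It is not the libmagicrt implementation. All names in it (''tracked_malloc'', ''tracked_free'', ''tracked_lookup'', and the table itself) are hypothetical, and the fixed-size table is a simplification.

```c
#include <stddef.h>
#include <stdlib.h>

#define MAX_TRACKED 128  /* arbitrary limit for this sketch */

/* One record per live dynamically allocated object. */
struct tracked_obj {
    void *addr;
    size_t size;
    const char *type_name;  /* stands in for real type information */
};

static struct tracked_obj tracked[MAX_TRACKED];
static size_t tracked_count;

/* Wrapper that the instrumentation would substitute for malloc(3). */
void *tracked_malloc(size_t size, const char *type_name)
{
    void *ptr;

    if (tracked_count == MAX_TRACKED)
        return NULL;
    ptr = malloc(size);
    if (ptr == NULL)
        return NULL;
    tracked[tracked_count].addr = ptr;
    tracked[tracked_count].size = size;
    tracked[tracked_count].type_name = type_name;
    tracked_count++;
    return ptr;
}

/* Wrapper for free(3): drop the record, then release the memory. */
void tracked_free(void *ptr)
{
    size_t i;

    for (i = 0; i < tracked_count; i++) {
        if (tracked[i].addr == ptr) {
            tracked[i] = tracked[--tracked_count];
            break;
        }
    }
    free(ptr);
}

/* Find the record for an object, or NULL if it is not tracked. */
const struct tracked_obj *tracked_lookup(const void *ptr)
{
    size_t i;

    for (i = 0; i < tracked_count; i++)
        if (tracked[i].addr == ptr)
            return &tracked[i];
    return NULL;
}
```

A state transfer routine can then walk such a table to find every dynamic object, together with its size and type, without any knowledge of the allocator internals.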
Line 515: Line 402:
 In general, properly achieving //​quiescence//​ is one of the main challenges for a live update system. For example, if a live update changes the implementation of a particular function, the component being updated must not be executing that function at the time of the live update - if it is, the live update will most likely result in a crash of the component. In MINIX3, the quiescence issue is resolved in a way that leaves little room for problems, by exploiting MINIX3'​s message-based nature. In essence, all the MINIX3 services consist of a main message loop that repeatedly receives a message and processes this message. MINIX3 supports no kernel threads, and thus, the MINIX3 services have no internal CPU-level concurrency. As a result, a message can be used to enforce quiescence. In general, properly achieving //​quiescence//​ is one of the main challenges for a live update system. For example, if a live update changes the implementation of a particular function, the component being updated must not be executing that function at the time of the live update - if it is, the live update will most likely result in a crash of the component. In MINIX3, the quiescence issue is resolved in a way that leaves little room for problems, by exploiting MINIX3'​s message-based nature. In essence, all the MINIX3 services consist of a main message loop that repeatedly receives a message and processes this message. MINIX3 supports no kernel threads, and thus, the MINIX3 services have no internal CPU-level concurrency. As a result, a message can be used to enforce quiescence.
  
-MINIX3 live updates are orchestrated by the RS (Reincarnation Server) service. The administrator of the system first compiles a new version of the service into an executable on disk, and then instructs RS to update a particular running system service into the new version, through the service(8) utility. RS starts by loading the new version of the service as a new service process, without letting it run. Thus, there are temporarily two instances of the service: the old instance, which is still running, and the new instance, which contains the new code but not yet any of the necessary state.+MINIX3 live updates are orchestrated by the RS (Reincarnation Server) service. The administrator of the system first compiles a new version of the service into an executable on disk, and then instructs RS to update a particular running system service into the new version, through the minix-service(8) utility. RS starts by loading the new version of the service as a new service process, without letting it run. Thus, there are temporarily two instances of the service: the old instance, which is still running, and the new instance, which contains the new code but not yet any of the necessary state.
  
 RS then asks the old instance of the service to prepare to be updated, by sending a __prepare__ request message to it. At the moment that the service receives and processes the preparation message, it is by definition in a known state, as it cannot also be doing something else at the same time. While this is a good start for quiescence, the service may have to meet additional requirements regarding its current activity, depending on the service and the type of live update. The administrator provides the intended //​quiescence state// for the live update when starting the update, and the service itself determines whether or not it is //ready// when handling the __prepare__ message. If the service decides that it does not meet the given quiescence requirements,​ the live update is aborted. RS then asks the old instance of the service to prepare to be updated, by sending a __prepare__ request message to it. At the moment that the service receives and processes the preparation message, it is by definition in a known state, as it cannot also be doing something else at the same time. While this is a good start for quiescence, the service may have to meet additional requirements regarding its current activity, depending on the service and the type of live update. The administrator provides the intended //​quiescence state// for the live update when starting the update, and the service itself determines whether or not it is //ready// when handling the __prepare__ message. If the service decides that it does not meet the given quiescence requirements,​ the live update is aborted.
Line 521: Line 408:
 However, if the old instance does meet the requirements,​ it acknowledges that it is ready by sending a __ready__ message to RS, blocking on receipt of a reply from RS. Thus, the old instance is effectively stopped in a known state. In order to maintain the externally visible state (most importantly,​ the communication endpoint) of the service being updated, the process slots of the old and the new service instances are swapped. The new instance, now in the original process slot, is then allowed to run. Upon startup, the new instance finds out from RS that it is the new instance of an old, stopped process, and attempts to perform state transfer from this old process into itself. However, if the old instance does meet the requirements,​ it acknowledges that it is ready by sending a __ready__ message to RS, blocking on receipt of a reply from RS. Thus, the old instance is effectively stopped in a known state. In order to maintain the externally visible state (most importantly,​ the communication endpoint) of the service being updated, the process slots of the old and the new service instances are swapped. The new instance, now in the original process slot, is then allowed to run. Upon startup, the new instance finds out from RS that it is the new instance of an old, stopped process, and attempts to perform state transfer from this old process into itself.
  
-State transfer requires transfer of all individual pieces of data from the old to the new process, possibly to a new location. This is performed by the magic framework. In a nutshell, the magic state transfer approach relies on having a full view of all the individual pieces of data that make up the process, along with type information about the data, including for example structure layouts and types of pointers. For static data, this information is generated by the magic pass through static analysis performed at compile time, and included with the service binary. For dynamic data, the information is collected and maintained by the magic library attached to the service. The end result is that the state transfer framework knows about all global variables and functions, and for each pointer, what type of data the pointer points to.+State transfer requires transfer of all individual pieces of data from the old to the new process, possibly to a new location. This is performed by the magic framework. In a nutshell, the magic state transfer approach relies on having a full view of all the individual pieces of data that make up the process, along with type information about the data, including for example structure layouts and types of pointers. For static data, this information is generated by the magic pass through static analysis performed at compile time, and included with the service binary. For dynamic data, the information is collected and maintained by the magic runtime ​library attached to the service. The end result is that the state transfer framework knows about all global variables and functions, and for each pointer, what type of data the pointer points to.
  
-This knowledge, in addition to full access to the memory of the old instance through a special memory grant, allows the runtime libmagic state transfer procedure in the new instance to iterate over all data of the old process. This procedure recursively follows any pointers it encounters, and //pairs// each piece of data with the corresponding piece of data in the new process, copying over the data and adjusting it as necessary for the new layout. In certain cases, the state transfer system may not be able to pair all pieces of data, or deal with all pointers. In that case, state transfer fails. Annotations in the service source code, as well as custom data transfer methods, can be provided in order to aid the state transfer process.+This knowledge, in addition to full access to the memory of the old instance through a special memory grant, allows the libmagicrt state transfer procedure in the new instance to iterate over all data of the old process. This procedure recursively follows any pointers it encounters, and //pairs// each piece of data with the corresponding piece of data in the new process, copying over the data and adjusting it as necessary for the new layout. In certain cases, the state transfer system may not be able to pair all pieces of data, or deal with all pointers. In that case, state transfer fails. Annotations in the service source code, as well as custom data transfer methods, can be provided in order to aid the state transfer process.
  
-Regardless of whether state transfer succeeded or failed, the new instance sends the result of the state transfer to RS using an __init__ request message. If state transfer succeeded, RS allows the new instance to continue to run, and kills the process of the old instance. If the state transfer fails, RS again swaps the process slots of the old and the new instance, allows the old instance to run again, and kills the new instance. In both cases, RS communicates the result to the service(8) utility as well, ultimately letting the system administrator know about the outcome of the live update.+Regardless of whether state transfer succeeded or failed, the new instance sends the result of the state transfer to RS using an __init__ request message. If state transfer succeeded, RS allows the new instance to continue to run, and kills the process of the old instance. If the state transfer fails, RS again swaps the process slots of the old and the new instance, allows the old instance to run again, and kills the new instance. In both cases, RS communicates the result to the minix-service(8) utility as well, ultimately letting the system administrator know about the outcome of the live update.
  
 For multicomponent live updates, all affected services are first brought into the //ready// state, after which they are all updated. Any service failing to get ready in the preparation phase will cause an abort of the entire update, and any service failing the state transfer phase causes a rollback of the entire update. For multicomponent live updates, all affected services are first brought into the //ready// state, after which they are all updated. Any service failing to get ready in the preparation phase will cause an abort of the entire update, and any service failing the state transfer phase causes a rollback of the entire update.
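
The two-phase structure of a multicomponent live update, with its distinct abort and rollback outcomes, can be modeled as follows. This is a toy model and not RS code: the ''struct service'' fields and the result names are hypothetical, and the process slot swap is reduced to a single flag.

```c
#include <stdbool.h>
#include <stddef.h>

enum update_result { UPDATE_DONE, UPDATE_ABORTED, UPDATE_ROLLED_BACK };

/* Toy model of one service taking part in the update. */
struct service {
    bool can_get_ready;      /* answer of the old instance to "prepare" */
    bool transfer_succeeds;  /* outcome of state transfer in the new instance */
    bool runs_new_version;   /* which instance currently owns the slot */
};

enum update_result live_update(struct service *svc, size_t count)
{
    size_t i, j;

    /* Phase 1: every old instance must reach the ready state. */
    for (i = 0; i < count; i++)
        if (!svc[i].can_get_ready)
            return UPDATE_ABORTED;  /* nothing has been swapped yet */

    /* Phase 2: swap slots and let each new instance transfer state. */
    for (i = 0; i < count; i++) {
        svc[i].runs_new_version = true;
        if (!svc[i].transfer_succeeds) {
            /* Swap back every service updated so far, this one included. */
            for (j = 0; j <= i; j++)
                svc[j].runs_new_version = false;
            return UPDATE_ROLLED_BACK;
        }
    }
    return UPDATE_DONE;
}
```

The model makes the difference between the two failure cases visible: an abort happens before any slot has been swapped, while a rollback has to undo the swaps that were already performed.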
Line 533: Line 420:
 === The quiescence model === === The quiescence model ===
  
-We describe the quiescence model in a bit more detail, in order to make two points: 1) the implementation of system services must follow a basic standard structure in order to allow for live update, and 2) the process stack is and can be disregarded for the purpose of state transfer. The following piece of pseudocode represents a simplified and flattened version of the general structure of each system service:+We describe the quiescence model in a bit more detail, in order to make two points: 1) the implementation of system services must follow a basic standard structure in order to allow for live update, and 2) the process stack is and can be disregarded for the purpose of state transfer. 
 + 
 +The following piece of pseudocode represents a simplified and flattened version of the general structure of each system service:
  
 <​code>​ <​code>​
Line 540: Line 429:
  receive INIT message from RS  receive INIT message from RS
  if INIT message requests a FRESH start:  if INIT message requests a FRESH start:
- perform ​fresh initialization+ perform ​service ​initialization
  if INIT message requests a LIVE UPDATE start:  if INIT message requests a LIVE UPDATE start:
  perform state transfer  perform state transfer
Line 560: Line 449:
  
 As can be seen, the service'​s initialization code starts by learning from RS what type of initialization it should perform. As can be seen, the service'​s initialization code starts by learning from RS what type of initialization it should perform.
-This can be either //fresh// initialization of the system service, or state transfer for the purpose of live update (for simplicity we disregard crash recovery). If the service is started ​initially, typically during system boot, it will perform the fresh initialization, thereby ​initializing global variables, performing initialization-only procedures, etcetera. In contrast, if a new service instance is started for the purpose of live update, it will skip the fresh initialization and instead perform state transfer from the old instance.+This can be either //fresh// initialization of the system service, or state transfer for the purpose of live update (for simplicity we disregard crash recovery). If the service is started ​anew, typically during system boot, it will perform the service initialization. Such fresh initialization ​typically consists of initializing global variables, performing initialization-only procedures, etcetera. In contrast, if a new service instance is started for the purpose of live update, it will skip the fresh initialization and instead perform state transfer from the old instance.
  
-In practice, all interaction with RS is implemented in the System Event Framework (SEF) library ​that must be used by all system services. The service-specific actions such as the fresh initialization action are implemented as callbacks from SEF.+In practice, all interaction with RS is implemented in the System Event Framework (SEF) library ​code. The service-specific actions such as the fresh initialization action are implemented as callbacks from SEF. In the case of fresh initialization,​ the service is to provide a callback function to SEF using the sef_setcb_init_fresh(3) API call. The default state transfer action for a //live update// start does not require code in the actual service at all.
  
-Thus, if the service has initialization code that is called outside of the "fresh initialization"​ procedure, for example at the "there should be nothing else here" point, then this code will also be called in case of a live update, possibly undoing the effects of the state transfer. Therefore, services must perform initialization only from the designated initialization routines; in the case of fresh initialization,​ this is a callback function provided to SEF using the sef_setcb_init_fresh(3) call. This covers point #1 from the start of this section.+If the service has initialization code that is called outside of the "fresh initialization"​ procedure, for example at the "there should be nothing else here" point, then this code will also be called in case of a live update, possibly undoing the effects of the state transfer. Therefore, services must perform initialization only from the designated initialization routines.
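
The effect of misplaced initialization code can be illustrated with a small simulation. None of the names below are SEF API: ''service_start'', the ''init_type'' values and the ''misplaced_init'' flag are hypothetical, and state transfer is reduced to a structure copy.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the two start types discussed above. */
enum init_type { INIT_FRESH, INIT_LIVE_UPDATE };

/* Long-running state of an imaginary service. */
struct service_state {
    int open_files;
};

/* Fresh initialization: the only place that may reset the state. */
static void init_fresh(struct service_state *state)
{
    state->open_files = 0;
}

/* State transfer, reduced to a plain copy for this simulation. */
static void state_transfer(struct service_state *state,
    const struct service_state *old)
{
    *state = *old;
}

/* Start an instance. If misplaced_init is set, the service also runs
 * initialization code after the SEF-driven startup, which is the mistake
 * described above. */
void service_start(struct service_state *state, enum init_type type,
    const struct service_state *old, bool misplaced_init)
{
    if (type == INIT_FRESH)
        init_fresh(state);
    else
        state_transfer(state, old);

    if (misplaced_init)
        state->open_files = 0;  /* silently undoes the state transfer */
}
```

Starting a new instance with ''INIT_LIVE_UPDATE'' from an old instance that has three open files yields three open files in the new instance, unless ''misplaced_init'' is set, in which case the count is reset to zero.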
  
-After either type of initialization,​ the service will enter the main message loop afterwards, where it will repeatedly receive a message and handle that message. If the received message is a __prepare__ request from RS, then the service is about to be updated, and it sends a __ready__ message to RS, blocking until it gets a response. If the live update is successful, this old instance will never get a response, and instead be killed. ​+After either type of initialization,​ the service will enter the main message loop, where it will repeatedly receive a message and handle that message. If the received message is a __prepare__ request from RS, then the service is about to be updated, and it sends a __ready__ message to RS, blocking until it gets a response. If the live update is successful, this old instance will never get a response, and instead be killed.
  
-As can be seen, in terms of stack usage, the execution path from main() to the point where the service gets blocked receiving the __ready__ response from RS (let's call it the //​quiescence point//) is short and simple. ​Even if the state transfer procedure ​modified ​the new instance'​s stack and program counter to continue from the quiescence point, the result would essentially be the same as not doing it: in both cases, the new service would end up at the start of the message loop. Therefore, the MINIX3 state transfer approach chooses to disregard the execution context of the old process, thus obviating the need to transfer the stack. This is viable only due to the well defined quiescence model.+As can be seen, in terms of the process ​stack of the service, the execution path from main() to the point where the service gets blocked receiving the __ready__ response from RS (let's call this the //​quiescence point//) is short and simple. ​As a result, ​if the state transfer procedure ​restored ​the new instance'​s stack and program counter to continue from the quiescence point, the result would essentially be the same as not doing so: in both cases, the new service would end up at the start of the message loop. Therefore, the MINIX3 state transfer approach chooses to disregard the execution context of the old process, thus obviating the need to transfer the stack altogether. This is viable only due to the well defined quiescence model.
  
-However, it is possible that the functions leading up to the quiescence point, including the main message loop, have local variables on the stack that maintain long-running state. For example, the main() function could maintain a counter for the number of messages received so far. The values of such variables will be lost during the live update. If this were a major issue, the live update framework could be made to instrument the stack as well, but this could come at great cost since instrumenting only the stack of functions leading up to the quiescence point would be difficult. In practice, not having essential long-running variables in main() is rather simple, and we have not seen problems so far. This covers point #2 from the start of this section.+However, it is possible that the functions leading up to the quiescence point, including the main message loop, have local variables on the stack that maintain long-running state. For example, the main() function could maintain a counter for the number of messages received so far. The values of such variables will be lost during the live update. If this were a major issue, the live update framework could be made to instrument the stack as well, but this could come at great cost since instrumenting only the stack of functions leading up to the quiescence point would be difficult. In practice, not having essential long-running variables in main() is rather simple, and we have not seen problems so far.
  
 === Process sections === === Process sections ===
Line 576: Line 465:
 The address space of a process is typically made up of various memory sections with different purposes, and MINIX3 system services are no different. There are important differences between various sections when it comes to state transfer. The address space of a process is typically made up of various memory sections with different purposes, and MINIX3 system services are no different. There are important differences between various sections when it comes to state transfer.
  
-  * The new instance'​s **text** section is already as it should be: it contains the new code to which  +  * The new instance'​s **text** section is already as it should be: it contains the new code which has been loaded for the new instance by RS.
   * The new instance'​s **data** section is initialized as though the service just started, and the state in this section must be transferred from the old service.   * The new instance'​s **data** section is initialized as though the service just started, and the state in this section must be transferred from the old service.
- 
   * As explained in the previous section, the **stack** section of the old instance can be ignored altogether, instead letting the new instance naturally reconstruct the stack by going through the regular process starting procedure to get back into main() and the message loop.   * As explained in the previous section, the **stack** section of the old instance can be ignored altogether, instead letting the new instance naturally reconstruct the stack by going through the regular process starting procedure to get back into main() and the message loop.
- +  ​* The new instance will have an empty **heap** section. ​Its state transfer procedure ​will have to use the brk(2) system call in order to request heap memory for itself so that it can transfer the heap state from the old service. 
-  ​* The new instance will have an empty **heap** section. ​It will have to use the brk(2) system call in order to request heap memory for itself so that it can transfer the heap state from the old service. +  * For the memory-mapped pages that make up the old instance's **mmap** section, things are slightly differentMINIX3 ensures that the new instance automatically inherits a copy-on-write version of all memory-mapped pages. Thus, the new instance will automatically have the old instance'​s memory-mapped pages mapped into its address space. For some pages, copy-on-write mappings are not possible. This is the case with memory-mapped I/O and for memory used for DMA transfers. Such pages are mapped with full sharing between the two instances.
- +
-  * For the memory-mapped pages that make up the process's **mmap** section, things are slightly differentMINIX3 ensures that the new instance automatically inherits a copy-on-write version of all memory-mapped pages. For some pages, copy-on-write mappings are not possible. This is the case with memory-mapped I/O and for memory used for DMA transfers. Such pages are mapped with full sharing between the two instances.+
  
 For a live update of the VM service, the last two points are different. We describe the exceptions for VM in a later section. For a live update of the VM service, the last two points are different. We describe the exceptions for VM in a later section.
Line 592: Line 477:
 === Identity transfer === === Identity transfer ===
  
-The simple case is identity transfer. Identity transfer is a minimal approach to transfer state from an old instance to a new instance of exactly the same service, that is, a process with exactly the same address space layout and functionality. Identity transfer is also supported when the target service has not been instrumented, and in fact even when the system has not been compiled using LLVM bitcode altogether.+The simple case is identity transfer. Identity transfer is a minimal state transfer approach that can only transfer state from an old instance to a new instance of exactly the same service, that is, a process with exactly the same address space layout and functionality. Identity transfer is also supported when the target service has not been instrumented, and in fact even when the system has not been compiled using LLVM bitcode altogether.
  
 Since the new instance is a newly started copy of the same service, it already has a text section that is identical to the old instance. As described, the stack section need not be transferred, and the mmap section is inherited automatically. Since the new instance is a newly started copy of the same service, it already has a text section that is identical to the old instance. As described, the stack section need not be transferred, and the mmap section is inherited automatically.
  
-Therefore, the identity transfer is concerned with the data and heap sections only. The identity transfer procedure starts by copying over the old instance's entire data section to itself. This includes the variable that contains the size of the old instance's heap (''_brksize''). The identity transfer procedure then calls brk(2) to allocate a heap for itself which is just as large, and copies over the old instance's entire heap section to itself as well.+Therefore, identity transfer is concerned with the data and heap sections only. The new instance's identity transfer procedure starts by copying over the old instance's entire data section to itself. This includes the variable that contains the size of the old instance's heap (''_brksize''). The identity transfer procedure then calls brk(2) to allocate a heap for itself which is just as large, and copies over the old instance's entire heap section to itself as well. The identity transfer procedure is implemented in the System Event Framework (SEF) as part of libsys.
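
The order of these steps can be sketched in a simplified model, in which the two sections are plain buffers and malloc(3) stands in for brk(2). The ''struct instance'' type, ''DATA_SIZE'' and ''identity_transfer'' are hypothetical names for this sketch. In the real system the heap size variable is itself part of the data section, whereas here it is a separate field.

```c
#include <stdlib.h>
#include <string.h>

#define DATA_SIZE 64  /* arbitrary data section size for this sketch */

/* Simplified model of the sections that identity transfer deals with. */
struct instance {
    unsigned char data[DATA_SIZE];  /* stands in for the data section */
    size_t brksize;                 /* heap size of the instance */
    unsigned char *heap;            /* stands in for the heap section */
};

/* Copy the data and heap sections of old_inst into new_inst.
 * Returns 0 on success, or -1 if no heap memory could be obtained. */
int identity_transfer(struct instance *new_inst,
    const struct instance *old_inst)
{
    /* Step 1: copy the entire data section, including the heap size. */
    memcpy(new_inst->data, old_inst->data, DATA_SIZE);
    new_inst->brksize = old_inst->brksize;

    /* Step 2: obtain a heap that is just as large as the old one. */
    free(new_inst->heap);
    new_inst->heap = NULL;
    if (new_inst->brksize == 0)
        return 0;
    new_inst->heap = malloc(new_inst->brksize);
    if (new_inst->heap == NULL)
        return -1;

    /* Step 3: copy the entire heap section. */
    memcpy(new_inst->heap, old_inst->heap, new_inst->brksize);
    return 0;
}
```

Note that no pointers are adjusted anywhere. This works only because both instances have exactly the same address space layout, which is what restricts identity transfer to a new instance of exactly the same service.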
  
-If the system is not built with ''MKMAGIC=yes'', which means that ''_MINIX_MAGIC'' is not defined, then the mmap section of the process is not well delineated and may in fact overlap with other memory areas. This is intentional, as it ensures that for such a set-up, the address space layout of services is not unnecessarily restricted and services can use the full address space for, say, a page cache. However, as a result, some memory-mapped areas may not be mapped into the new process, possibly leading to segmentation faults. Therefore, even identity transfer is not expected to be reliable on a system //not// built with ''MKMAGIC=yes''. Eventually, MINIX3 should be changed to use another approach for transferring memory-mapped regions to the new process altogether, which is either not based on ranges or not the default at all. +If the system is not built with ''MKMAGIC=yes'', which means that ''_MINIX_MAGIC'' is not defined, then the mmap section of the process is not well delineated and may in fact overlap with other memory areas. This is intentional, as it ensures that for such a set-up, the address space layout of services is not unnecessarily restricted and services can use the full address space for, say, a page cache. However, as a result, some memory-mapped areas may not be mapped into the new process, possibly leading to a segmentation fault after the live update. Therefore, even identity transfer is not expected to be reliable on a system //not// built with ''MKMAGIC=yes''. Eventually, MINIX3 should be changed to use another approach for transferring memory-mapped regions to the new process altogether, which is either not based on ranges or not the default at all. See also the section on open issues in this document.
- +
-Identity transfer is implemented ​in the System Event Framework (SEF) as part of libsys.+
  
 === Magic state transfer === === Magic state transfer ===
  
-The other case is state transfer by the magic framework. This type of state transfer is used by the //self state transfer//, //ASR rerandomization//,​ and //​functional update// update types as covered in the users guide. This form of state transfer relies on the magic pass and library to implement instrumentation and runtime support for state transfer.+The other case is state transfer by the magic framework. This type of state transfer is used by the //self state transfer//, //ASR rerandomization//,​ and //​functional update// update types as covered in the users guide. This form of state transfer relies on the magic pass and library to implement instrumentation and runtime support for state transfer. Again, state transfer is performed by the new instance of the service, using full access to the address space of its old instance.
  
-The magic framework'​s state transfer procedure transfers ​static ​data objects ​to the new instance ​one by one. In this context, an object may for example be one global variable. The actual transfer of an object is not a simple memory copy; it involves analyzing any pointers in the object and adjusting these pointers as appropriate to match the address layout of the new instance.+The magic framework'​s state transfer procedure transfers data objects one by one. This includes all //static// objects. In this context, an object may for example be one global variable. The actual transfer of an object is not a simple memory copy; it involves analyzing any pointers in the object and adjusting these pointers as appropriate to match the address layout of the new instance.
  
-The state transfer procedure also transfers dynamic data objects ​(in the heap and mmap sectionsof the old instance ​to the same location in the new instance one by one. In essence, the procedure recreates the heap and mmap sections during the state transfer, by allocating new heap or mapped memory for each dynamic object, and then transferring its actual contentsagain including ​pointer analysis and adjustment. Here, one object is one piece of memory created by a call to malloc(3) or mmap(2), for example.+The state transfer procedure also transfers ​//dynamic// data objects, which are located ​in the heap and mmap sections of the old instance. In essence, the procedure recreates the heap and mmap sections during the state transfer, by allocating new heap or mapped memory for each dynamic object, and then transferring its actual contents. This again includes ​pointer analysis and adjustment. Here, one object is one piece of memory created by a call to malloc(3) or mmap(2), for example.
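The pointer adjustment that accompanies each object transfer can be sketched as follows. The ''struct obj_map'' table and ''relocate_pointer()'' are hypothetical and only illustrate the idea; the actual bookkeeping in the magic runtime library is more involved. The sketch assumes that for every transferred object, the old address, the new address, and the size are known.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record: where an object lived and where it lives now. */
struct obj_map {
    uintptr_t old_addr; /* start of the object in the old instance */
    uintptr_t new_addr; /* start of the object in the new instance */
    size_t size;        /* size of the object in bytes */
};

/*
 * Translate a pointer value taken from the old instance.  Returns 1 and
 * stores the adjusted value on success, or 0 if the pointer does not point
 * into any known object, which corresponds to an analysis failure.
 */
static int relocate_pointer(const struct obj_map *map, size_t nr_objs,
    uintptr_t old_ptr, uintptr_t *new_ptr)
{
    size_t i;

    if (old_ptr == 0) { /* a NULL pointer stays NULL */
        *new_ptr = 0;
        return 1;
    }
    for (i = 0; i < nr_objs; i++) {
        if (old_ptr >= map[i].old_addr &&
            old_ptr - map[i].old_addr < map[i].size) {
            /* Keep the offset within the object, change the base. */
            *new_ptr = map[i].new_addr + (old_ptr - map[i].old_addr);
            return 1;
        }
    }
    return 0;
}
```

A pointer into the middle of an object keeps its offset, which is why interior pointers survive relocation as long as the object they point into is known.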
  
-Since MINIX3 already transfers the mmap section to the new instance automatically,​ the state transfer framework starts by unmapping all memory-mapped areas that it knows it will recreate. However, since some memory areas (the aforementioned memory-mapped I/O and DMA memory) cannot be recreated by the magic framework, these are not destroyed and recreated. These areas are called //​special//,​ //​out-of-band//​ memory.+Since MINIX3 already transfers the mmap section to the new instance automatically,​ the state transfer framework starts by unmapping all memory-mapped areas that it knows it will recreate. However, since some memory areas (the aforementioned memory-mapped I/O and DMA memory) cannot be recreated by the magic framework, these are not destroyed and recreated. These areas are called //​special//,​ //​out-of-band//​ memory. The service has to tell the magic runtime library about special memory areas. For the two common ways of allocating such memory, alloc_contig(3) and vm_map_phys(2),​ this is done automatically by libsys.
  
-Out-of-band memory is seen as opaque, unmovable memory, and ignored entirely for the purpose of state transfer. Thus, if a piece of out-of-band memory contains a pointer to a piece of memory that is //not// marked as out-of-band,​ this pointer will be missed during state transfer. For the aforementioned (MMIO and DMA) memory types, this is not a problem. The service has to tell the magic library about special memory areas; for the two common ways of allocating such memory, alloc_contig(3) and vm_map_phys(2),​ this is done automatically by libsys.+Out-of-band memory is seen as opaque, ​physically and virtually ​unmovable memory, and ignored entirely for the purpose of state transfer. Thus, if a piece of out-of-band memory contains a pointer to a piece of memory that is //not// marked as out-of-band,​ this pointer will be missed during state transfer. For the aforementioned (memory-mapped I/O and DMA) memory types, this is not a problem.
  
 The default of inheriting the entire mmap section leads to the situation that if the magic framework misses any memory-mapped areas for any reason, these will effectively translate to a memory leak in the new instance. Currently, one such memory leak is addressed explicitly: the page directory that is allocated with mmap(2) internally by the libc malloc code. The default of inheriting the entire mmap section leads to the situation that if the magic framework misses any memory-mapped areas for any reason, these will effectively translate to a memory leak in the new instance. Currently, one such memory leak is addressed explicitly: the page directory that is allocated with mmap(2) internally by the libc malloc code.
  
-The state transfer procedure may fail if its analysis is not successful, in which case the system will roll back to the old instance, and the live update fails. It is up to the programmer to deal with such problems. This may involve annotating source code, for example to instruct the state transfer procedure to ignore certain pointers, or to copy over certain data as is. It may involve adding special state transfer routines to libmagic, which deal with fundamentally problematic cases such as unions. In rare cases, it may involve adapting source code to avoid state transfer problems. We discuss all this in more detail later.+The state transfer procedure may fail if its analysis is not successful, in which case the system will roll back to the old instance, and the live update fails. It is then up to the programmer to deal with such problems. This may involve annotating source code, for example to instruct the state transfer procedure to ignore certain pointers, or to copy over certain data as is. It may involve adding special state transfer routines to libmagicrt, which deal with fundamentally problematic cases such as unions. In rare cases, it may involve adapting source code to avoid state transfer problems. We discuss all this in more detail later.
 + 
 +In the case of self state transfer, all static objects will have the same location in both the old and the new instance. However, due to their dynamic recreation, the addresses of dynamic objects may change during self state transfer.
  
 In the case of ASR rerandomization,​ not just the dynamic part, but also the static part of the address space will have objects that are relocated between the old and the new instance. In addition, ASR rerandomization permutes the order in which the old instance'​s dynamic objects are allocated in the new instance. Finally, the asr pass may insert padding which may expose wrong assumptions about alignment of various buffers. Thus, while live rerandomization is a security feature, in practice it may expose not only additional problems with state transfer, but also bugs in the service itself. In the case of ASR rerandomization,​ not just the dynamic part, but also the static part of the address space will have objects that are relocated between the old and the new instance. In addition, ASR rerandomization permutes the order in which the old instance'​s dynamic objects are allocated in the new instance. Finally, the asr pass may insert padding which may expose wrong assumptions about alignment of various buffers. Thus, while live rerandomization is a security feature, in practice it may expose not only additional problems with state transfer, but also bugs in the service itself.
Line 624: Line 509:
 === Exceptions for services === === Exceptions for services ===
  
-While MINIX3 allows all of its services to be updated, ​a small number of services require special exceptions to allow for live updates, because they are crucial to the live update process itself. These services are RS and VM. This section elaborates on the exceptions made for RS and VM, and explains why VM in particular cannot be updated arbitrarily.+While MINIX3 allows all of its services to be updated, ​certain ​services require special exceptions to allow for live updates, because they are crucial to the live update process itself. These services are RS and VM. This section elaborates on the exceptions made for RS and VM, and explains why VM in particular cannot be updated arbitrarily.
  
 == The RS service == == The RS service ==
Line 634: Line 519:
 MINIX3 has limited support for performing a live update of the VM (Virtual Memory) service. There are two reasons why VM is a special case. First, VM provides essential memory management and page fault handling functionality to the other system services. Thus, the live update must ensure that none of that functionality is required during the course of a live update that includes VM. Second, VM's core data structures include page tables. If these page tables are changed during a live update, it may be impossible to perform a proper rollback. MINIX3 has limited support for performing a live update of the VM (Virtual Memory) service. There are two reasons why VM is a special case. First, VM provides essential memory management and page fault handling functionality to the other system services. Thus, the live update must ensure that none of that functionality is required during the course of a live update that includes VM. Second, VM's core data structures include page tables. If these page tables are changed during a live update, it may be impossible to perform a proper rollback.
  
-During normal operation, VM may allocate memory for itself. VM has both a heap and dynamic pages, implementing special local versions of brk(2) and mmap(2) to support this. In particular, page tables are stored in dynamically allocated memory, effectively in VM's mmap section.+During normal operation, VM may allocate memory for itself. VM has both a heap and dynamic pages, implementing special local versions of brk(2) and mmap(2) to support this. In particular, page tables are stored in dynamically allocated memory, effectively in VM's mmap section. For VM, the live update procedure must therefore include the transfer of such dynamic state from the old to the new VM instance.
  
-For VM, the live update procedure must therefore include the transfer of such dynamic state from the old to the new VM instance. Since page tables cannot simply be copied, they are made visible to the new instance by mapping the old instance's dynamic memory ranges directly into the new instance's address space. +Since page tables cannot simply be copied, they are made visible to the new instance by mapping the old instance's dynamic memory ranges directly into the new instance's address space. That means that any changes made to the dynamic data structures by the new instance (page tables included) become visible to the old instance after a rollback. However, the two instances do each have their own static memory (i.e., text and data sections, as well as a preallocated stack). Thus, any changes to dynamic memory made by the new instance would create a potential mismatch between the static and dynamic memory in the old instance after rollback.
- +
-That means that any changes made to the dynamic data structures by the new instance (page tables included) become visible to the old instance after a rollback. However, the two instances do each have their own static memory (i.e., text and data sections, as well as a preallocated stack). Thus, any changes to dynamic memory made by the new instance would create a potential mismatch between the static and dynamic memory in the old instance after rollback.+
  
 Therefore, in order to allow for rollback, VM must not make any changes to its dynamic memory during the live update. That also means that it may not allocate memory during the live update, not for other processes and not for itself. This situation leads to the following exceptions: Therefore, in order to allow for rollback, VM must not make any changes to its dynamic memory during the live update. That also means that it may not allocate memory during the live update, not for other processes and not for itself. This situation leads to the following exceptions:
  
-  * First and foremost, since the new VM instance essentially inherits the old instance'​s dynamic memory, the dynamic memory must be ignored by the state transfer framework. For this reason, at startup, VM tells libmagic ​that its entire dynamic memory region consists of special, out-of-band data. As a result, any pointers in this region will not be analyzed or adjusted by the state transfer procedure; this is a good thing, as changes to such pointers would not be undone after a rollback. However, the main consequence is that if the static memory layout of the VM process changes, any pointers in dynamic memory that point to static memory will become invalid. Therefore, updates to VM are limited to the **identity transfer** and **self state transfer** update types. +  * First and foremost, since the new VM instance essentially inherits the old instance'​s dynamic memory, the dynamic memory must be ignored by the state transfer framework. For this reason, at startup, VM tells libmagicrt ​that its entire dynamic memory region consists of special, out-of-band data. As a result, any pointers in this region will not be analyzed or adjusted by the state transfer procedure. This is a good thing, as changes to such pointers would not be undone after a rollback. However, the main consequence is that if the static memory layout of the VM process changes, any pointers in dynamic memory that point to static memory will become invalid. Therefore, updates to VM are limited to the **identity transfer** and **self state transfer** update types. 
- +  * Another effect of the automatic dynamic memory inheritance is that dynamic memory allocations need not, and must not, be tracked. Therefore, dynamic memory allocation functions are not instrumented in VM at all, requiring an instrumentation override. This override in turn requires disabling some other instrumentation features, such as the aforementioned libc malloc page directory exception. The features are disabled during VM's linking process, through special statements in its Makefile.
-  * Because ​of the same dynamic memory inheritance,​ dynamic memory allocation functions are not instrumented in VM at all. This also requires the need to disable some other instrumentation features, such as the aforementioned libc malloc page directory exception. The features are disabled ​as part of the VM linking process. +
   * After a rollback, the old VM instance still has to perform a small number of corrective actions to undo changes made by the new instance. These actions are however kept to a minimum. In the future, more extended non-transparent rollback may be the key to allowing more invasive live updates to the VM service.   * After a rollback, the old VM instance still has to perform a small number of corrective actions to undo changes made by the new instance. These actions are however kept to a minimum. In the future, more extended non-transparent rollback may be the key to allowing more invasive live updates to the VM service.
- 
   * The state transfer procedure requires some temporary memory to do its job. Since it cannot allocate such memory directly, an //​initialization buffer// is preallocated in the new VM instance, and the state transfer procedure uses this buffer instead of allocating memory dynamically.   * The state transfer procedure requires some temporary memory to do its job. Since it cannot allocate such memory directly, an //​initialization buffer// is preallocated in the new VM instance, and the state transfer procedure uses this buffer instead of allocating memory dynamically.
- +  ​* RS requests VM to preallocate (//pin//) RS's memory before starting a live update, so that RS will not require VM's functionality during the live update.
-  ​* RS requests VM to preallocate (__pin__) RS's memory before starting a live update, so that RS will not require VM's functionality during the live update. +
   * For multicomponent live update operations that include VM, all memory-modifying actions are performed before, rather than during, the actual live update operation, using special preparation requests sent by RS to VM. The memory of all new instances is also preallocated in order to avoid memory allocation and pagefaults during the live update. The old VM instance is the last process that is made ready for the update, and the new VM instance is the first process that gets to run right after.   * For multicomponent live update operations that include VM, all memory-modifying actions are performed before, rather than during, the actual live update operation, using special preparation requests sent by RS to VM. The memory of all new instances is also preallocated in order to avoid memory allocation and pagefaults during the live update. The old VM instance is the last process that is made ready for the update, and the new VM instance is the first process that gets to run right after.
- 
   * Despite the pinning, the new VM instance may have to handle brk(2) system calls coming from other new service instances that are all part of the same multicomponent live update. IPC filters are used to ensure that the new VM instance gets requests only from the other services in this group, and not from any other running services. Note that VM does not make any changes to its dynamic memory while handling a brk(2) call. Also, since all memory is preallocated,​ the VM instance should never get any pagefaults or handle-memory requests for other services'​ new instances; such requests are blocked by the IPC filters as well. If they do occur, they should result in a timeout of the entire multicomponent live update.   * Despite the pinning, the new VM instance may have to handle brk(2) system calls coming from other new service instances that are all part of the same multicomponent live update. IPC filters are used to ensure that the new VM instance gets requests only from the other services in this group, and not from any other running services. Note that VM does not make any changes to its dynamic memory while handling a brk(2) call. Also, since all memory is preallocated,​ the VM instance should never get any pagefaults or handle-memory requests for other services'​ new instances; such requests are blocked by the IPC filters as well. If they do occur, they should result in a timeout of the entire multicomponent live update.
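The //initialization buffer// mentioned in the list above is, in essence, a preallocated region from which temporary memory is handed out without involving the memory manager. A minimal sketch of such an allocator is given here. The names ''init_buf'' and ''init_buf_alloc()'' and the buffer size are assumptions made for illustration; the actual mechanism used for VM may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define INIT_BUF_SIZE 4096 /* hypothetical size, chosen for the example */

/* Static storage: it exists before state transfer starts. */
static _Alignas(max_align_t) unsigned char init_buf[INIT_BUF_SIZE];
static size_t init_buf_used;

/*
 * Hand out 'size' bytes from the preallocated buffer.  Returns NULL when
 * the buffer is exhausted.  Memory is never returned; the whole buffer can
 * simply be abandoned once state transfer is done.
 */
static void *init_buf_alloc(size_t size)
{
    const size_t align = _Alignof(max_align_t);
    size_t rounded;
    void *ptr;

    if (size == 0 || size > INIT_BUF_SIZE)
        return NULL;
    /* Round up so that the next allocation stays suitably aligned. */
    rounded = (size + align - 1) / align * align;
    if (rounded > INIT_BUF_SIZE - init_buf_used)
        return NULL;
    ptr = &init_buf[init_buf_used];
    init_buf_used += rounded;
    return ptr;
}
```

Because the buffer is part of the static memory of the new instance, using it does not touch the dynamic memory that is shared with the old instance, so a rollback remains possible.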
 +
 +Overall, it should be clear that live update for the VM service is rather brittle. Eventually, a full revision of the live update approach for VM will have to reveal whether some or all of the current restrictions can be lifted.
  
 ==== State transfer in practice ==== ==== State transfer in practice ====
  
-In this section, we elaborate on some of the practical details of the state transfer of the //magic// framework, mainly aiming to allow developers to resolve real-world state transfer failures.+In this section, we elaborate on some of the practical details of the state transfer of the magic framework, mainly aiming to allow developers to resolve real-world state transfer failures.
  
 We do //not// get into most of the theoretical side of the state transfer, and we skip over many other practical details. Interested readers are advised to read the published work of Cristiano Giuffrida - see the "​Further reading"​ section at the bottom of this document. We do //not// get into most of the theoretical side of the state transfer, and we skip over many other practical details. Interested readers are advised to read the published work of Cristiano Giuffrida - see the "​Further reading"​ section at the bottom of this document.
Line 664: Line 543:
 === Some basics and terminology === === Some basics and terminology ===
  
-The magic framework keeps track of each static object of data using a **sentry** ("state entry") data structure. The framework keeps track of each dynamic object of data using a **dsentry** ("dynamic state entry") data structure, which itself has an embedded sentry data structure. The magic framework uses wrappers around memory allocation routines so that it can allocate extra memory to store the dsentry metadata right before the actual memory object. Finally, each special, out-of-band memory region is maintained in an **obdsentry** ("out-of-band dynamic state entry") data structure. Since no extra memory can be allocated next to the actual memory object in this case, obdsentries themselves are (currently) stored as static data within libmagic's own state.+The magic framework keeps track of each //static// object of data using a **sentry** ("state entry") data structure. The framework keeps track of each //dynamic// object of data using a **dsentry** ("dynamic state entry") data structure, which itself has an embedded //sentry// data structure. The magic pass installs libmagicrt wrappers around memory allocation routines so that it can allocate extra memory to store the dsentry metadata right before the actual memory object. Special, out-of-band memory regions are maintained in **obdsentry** ("out-of-band dynamic state entry") data structures. Since no extra memory can be allocated next to the actual memory object in this case, obdsentries themselves are (currently) stored as static data as part of libmagicrt's own state.
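To make the "metadata right before the object" layout concrete, here is a simplified sketch of an allocation wrapper. The structures ''sentry_sketch'' and ''dsentry_sketch'' and the function ''tracked_malloc()'' are hypothetical stand-ins with far fewer fields than the real libmagicrt data structures, and a matching free wrapper (which would have to unlink the entry) is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Simplified stand-in for a state entry. */
struct sentry_sketch {
    const char *name; /* label of the object, for error reporting */
    void *address;    /* start of the actual memory object */
    size_t size;      /* size of the actual memory object */
};

/* Simplified stand-in for a dynamic state entry. */
struct dsentry_sketch {
    struct dsentry_sketch *next; /* list of all dynamic entries */
    struct sentry_sketch sentry; /* the embedded state entry */
};

/* Pad the header so that the object following it is properly aligned. */
union dsentry_header {
    struct dsentry_sketch d;
    max_align_t align;
};

static struct dsentry_sketch *dsentry_list;

/* Allocate an object with its metadata stored right in front of it. */
static void *tracked_malloc(size_t size, const char *name)
{
    union dsentry_header *hdr;

    if (size > (size_t)-1 - sizeof(*hdr))
        return NULL;
    hdr = malloc(sizeof(*hdr) + size);
    if (hdr == NULL)
        return NULL;
    hdr->d.sentry.name = name;
    hdr->d.sentry.address = hdr + 1;
    hdr->d.sentry.size = size;
    hdr->d.next = dsentry_list;
    dsentry_list = &hdr->d;
    return hdr + 1; /* the caller only ever sees the object itself */
}

/* Find the metadata that belongs to an object from tracked_malloc(). */
static struct dsentry_sketch *dsentry_of(void *obj)
{
    return &((union dsentry_header *)obj - 1)->d;
}
```

This layout also shows why out-of-band memory needs a separate table: for memory that is mapped at a fixed place, there is no room in front of the object to put such a header.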
  
-The magic framework also uses the concept of a **selement** ("​state element"​),​ which is a particular element within a state entry; for example, it can be one particular field in a structure. State is transferred one element at a time; if the state transfer procedure encounters a problem, it will report about the particular ​state element causing the problem.+The magic framework also uses the concept of a **selement** ("​state element"​),​ which is a particular element within a state entry; for example, it can be one particular field in a structure. State is transferred one element at a time. If the state transfer procedure encounters a problem, it will report about the state element ​that is causing the problem.
  
-Each pointer is expected to point to a data type known to libmagic. All the possible data types that can be used by the service are enumerated through static analysis by the magic pass, and stored in a **type** table as part of the instrumented service. It may happen that one data type is cast to another, either in the source code of the service or as a result of the LLVM compilation and linking process. As a result, while the static analysis may conclude that a pointer is for one type, runtime state transfer may find that the pointer was (for example) allocated for another type. Normally, such mismatches would cause state transfer to fail, but casting makes this a legitimate case. As a result, the magic framework has the notion of **compatible types**: if type A is cast to type B anywhere, type A is marked as a compatible type for type B, and finding type A when transferring data of supposed type B will not result in state transfer failure. The magic pass adds a list of compatible types to the service binary as well, all for use by libmagic at state transfer time.+Each pointer in the service process is expected to point to a data type known to libmagicrt. All the possible data types that can be used by the service are enumerated through static analysis by the magic pass, and stored in a **type** table as part of the instrumented service. It may happen that one data type is cast to another, either in the source code of the service or as a result of the LLVM compilation and linking process. As a result, while the static analysis may conclude that a pointer is for one type, runtime state transfer may find that the pointer was (for example) allocated for another type. Normally, such mismatches would cause state transfer to fail, but casting makes this a legitimate case. 
Therefore, the magic framework has the notion of **compatible types**: if type A is cast to type B anywhere, type A is marked as a compatible type for type B, and finding type A when transferring data of supposed type B will not result in state transfer failure. The magic pass adds a list of compatible types to the service binary as well, all for use by libmagicrt ​at state transfer time.
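The compatible-types check can be pictured as a table lookup. The sketch below is an assumption about the shape of such a check, not the libmagicrt implementation: types are reduced to integer identifiers, ''struct compat_pair'' and ''types_compatible()'' are hypothetical names, and the relation is treated as one-directional, following the wording above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical table entry: 'found' is acceptable where 'expected' is. */
struct compat_pair {
    int expected_type; /* type B: what static analysis says */
    int found_type;    /* type A: what was cast to type B somewhere */
};

/*
 * Returns 1 if an object of type 'found' may be transferred where type
 * 'expected' was predicted, and 0 if this is a genuine type mismatch.
 */
static int types_compatible(const struct compat_pair *tab, size_t nr_pairs,
    int expected, int found)
{
    size_t i;

    if (expected == found)
        return 1;
    for (i = 0; i < nr_pairs; i++) {
        if (tab[i].expected_type == expected &&
            tab[i].found_type == found)
            return 1;
    }
    return 0;
}
```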
  
 === Annotation === === Annotation ===
  
-In particular the analysis part of state transfer may not always succeed, for a variety of reasons. In particular, the state transfer framework has problems with unknown pointers, unions, and more generally cases of ambiguity. In many cases, the issue can be resolved by the programmer through annotation in source code, which instructs the state transfer framework what to do with a particular variable. A variable can be annotated by prefixing either its type name (through ''typedef'') or its variable name with the annotation prefix. The following annotation prefixes are supported by the magic framework. +In particular the analysis part of state transfer may not always succeed, for a variety of reasons. In particular, the state transfer framework has problems with unknown pointers, unions, and more generally cases of ambiguity. Such issues can often be resolved by the programmer through annotation in source code, which instructs the state transfer framework what to do with a particular variable. A variable can be annotated by prefixing either its type name (through ''typedef'') or its variable name with the annotation prefix followed by an underscore (e.g., ''noxfer_foo''). The following annotation prefixes are supported by the magic framework.
- +
-**noxfer_**: No Transfer. This annotation will prevent transfer of the state altogether, instead zeroing out the memory in the new instance. As an example, the noxfer annotation can be used in cases where analysis is failing (e.g., in unions) and the new instance will never be using the old instance's data anyway. A practical example where it is used is the ''message'' type. This data type contains a complicated union, and the quiescence model ensures that transferring this state is not necessary: the service being updated should never be involved in processing a message at the time of the update. +
- +
-**ixfer_**: Identity Transfer. This annotation ​will copy the data over as is, without performing analysis on the memory. As an example, the ixfer annotation can be used for pointer values that should not be analyzed as pointers, for instance because they are pointers into another address space. A practical example where it is used is a process table copied in from another service. Such process tables typically contain remote pointers, which will be unused ​by the local service. Some other values may still be needed after state transfer, which is why ixfer is used rather than noxfer. +
- +
-**cixfer_**:​ Conditional Identity Transfer. This annotation will cause the state transfer ​framework ​to try to interpret and transfer the value as a pointer, and fall back to identity transfer if this fails. As an example, the cixfer annotation can be used for variables which may contain either a pointer or a number value which is never a valid pointer, making the variable effectively a union of the two types. A practical example where it is used is a callback value, which is of type ''​void *''​ but may be used to store a small integer as well. +
- +
-**pxfer_**: Pointer Transfer. This annotation forces a non-pointer value to be interpreted as a pointer, and transferred accordingly. As an example, the pxfer annotation may be used when a pointer value is stored in an integer type. As of writing, this annotation is not used in practice.+
  
-**sxfer_**: Structure Transfer. This annotation forces a union that consists of structures, to be interpreted as one single structure, and transferred accordingly. The annotation requires that the fields of the structures making up the union all line up. For example, if the first field of one structure in the union is an integer value, then the first field of all other structures in the union must be an integer value as well. If the second field is a pointer in one structure, it must be a pointer in all of them. As an example, the sxfer annotation can be used to resolve state transfer issues with unions that consist of nearly-identical structures. The programmer must line up the structure's fields as appropriate when annotating the union as sxfer.+  * **noxfer**: No Transfer. This annotation will prevent transfer of the state altogether, instead zeroing out the memory in the new instance. As an example, the noxfer annotation can be used in cases where analysis is failing (e.g., in unions) and the new instance will never be using the old instance's data anyway. A practical example where it is used is the ''message'' type. This data type contains a complicated union, and the quiescence model typically ensures that transferring this state is not necessary, as the service being updated is not involved in processing a message at the time of the update. 
 +  * **ixfer**: Identity Transfer. This annotation will copy the data over as is, without performing analysis on the memory. As an example, the ixfer annotation can be used for pointer values that should not be analyzed as pointers, for instance because they are pointers into another address space. A practical example where it is used is a process table copied in from another service. Such process tables typically contain external pointers, which will be unused by the local service. Some other values may still be needed after state transfer, which is why ixfer is used rather than noxfer. 
 +  * **cixfer**: Conditional Identity Transfer. This annotation will cause the state transfer framework to try to interpret and transfer the value as a pointer, and fall back to identity transfer if this fails. As an example, the cixfer annotation can be used for variables which may contain either a pointer or a number value which is never a valid pointer, making the variable effectively a union of the two types. A practical example where it is used is a callback value, which is of type ''​void *''​ but may be used to store a small integer as well. 
 +  * **pxfer**: Pointer Transfer. This annotation forces a value to be interpreted as a pointer, and transferred accordingly. As an example, the pxfer annotation may be used when a pointer value is stored in an integer type. The pxfer annotation may also be used for a union of (differently typed) pointers. Thus, in some cases, a union-of-structures can be split up into a union of non-pointers and one or more unions of pointers, marking the non-pointer union with ''​ixfer''​ and the pointer union(s) with ''​pxfer''​. This is indeed how ''​pxfer''​ is currently used in practice as well. 
 +  * **sxfer**: Structure Transfer. This annotation forces a union that consists of structures, to be interpreted as one single structure, and transferred accordingly. The annotation requires that the fields of the structures making up the union all line up. For example, if the first field of one structure in the union is an integer value, then the first field of all other structures in the union must be an integer value as well. If the second field is a pointer in one structure, it must be a pointer in all of them, etcetera. The sxfer annotation can be used to resolve state transfer issues with unions that consist of nearly-identical structures. The programmer must line up the structure's fields as appropriate when annotating the union as sxfer.
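As an illustration of both annotation forms, the following fragment annotates two types through ''typedef'' and one variable through its name. All identifiers (''struct foo'', ''scratch_foo'', ''cb_arg'', ''ixfer_remote_addr'') are made up for this example, and the assumption that these exact spellings are picked up by the magic pass is based solely on the description of the prefixes above.

```c
#include <assert.h>
#include <stddef.h>

struct foo {
    int value;
};

/* Annotation through the type name: objects of this type are not
 * transferred, and are zeroed out in the new instance instead. */
typedef struct foo noxfer_foo_t;

/* Annotation through the type name: try to transfer the value as a
 * pointer, and fall back to copying it as is when that fails. */
typedef void *cixfer_cb_arg_t;

static noxfer_foo_t scratch_foo; /* scratch state, never needed later */
static cixfer_cb_arg_t cb_arg;   /* a pointer, or a small integer */

/* Annotation through the variable name: copied over without analysis,
 * e.g. for an address that is only valid in another address space. */
static unsigned long ixfer_remote_addr;

/* Ordinary code uses the annotated objects like any other objects. */
static void annotation_demo(void)
{
    scratch_foo.value = 1;
    cb_arg = &scratch_foo;
    ixfer_remote_addr = 0x1000UL;
}
```

The annotations do not change how the program behaves during normal operation; they only carry meaning for the state transfer framework.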
  
 The transfer exception is applied to the type (or variable) with the annotation. For example, a noxfer typedef for a pointer to a structure will refrain from transferring that pointer: The transfer exception is applied to the type (or variable) with the annotation. For example, a noxfer typedef for a pointer to a structure will refrain from transferring that pointer:
Line 698: Line 573:
   noxfer_foo_t * my_foo_pointer = &​my_foo_struct;​ /* the pointer will be transferred */   noxfer_foo_t * my_foo_pointer = &​my_foo_struct;​ /* the pointer will be transferred */
  
-It is possible to enable debugging flags in libmagic ​such that it will print more details on how it handles annotated exceptions: in ''​minix/​llvm/​include/​st/​callback.h'',​ change ''​ST_CB_DEFAULT_FLAGS''​ from ''​(ST_CB_PRINT_ERR)''​ to ''​(ST_CB_PRINT_ERR|ST_CB_PRINT_DBG)''​. The debugging statements will be sent to the system log, and have a ''​[DEBUG]''​ prefix.+It is possible to enable debugging flags in libmagicrt ​such that it will print more details on how it handles annotated exceptions: in ''​minix/​lib/​libmagicrt/​include/​st/​callback.h'',​ change ''​ST_CB_DEFAULT_FLAGS''​ from ''​(ST_CB_PRINT_ERR)''​ to ''​(ST_CB_PRINT_ERR|ST_CB_PRINT_DBG)''​. The debugging statements will be sent to the system log, and have a ''​[DEBUG]''​ prefix.
  
 === Custom state transfer routines === === Custom state transfer routines ===
Line 706: Line 581:
 TODO TODO
  
-There is currently one example case where a custom state transfer routine is used, namely for the ''​dsi_u''​ union in the ''​struct data_store''​ structure which is used by the Data Store (DS) service and defined in ''​minix/​servers/​ds/​store.h''​. The custom state transfer routine is located in ''​minix/​llvm/static/​magic/​minix/​magic_ds.c'',​ and basically ​provides the state transfer process with information as to which of the union elements ​should be used for transfer.+There is currently one example case where a custom state transfer routine is used, namely for the ''​dsi_u''​ union in the ''​struct data_store''​ structure which is used by the Data Store (DS) service and defined in ''​minix/​servers/​ds/​store.h''​. The custom state transfer routine is located in ''​minix/​lib/libmagicrt/​magic_ds.c'',​ and provides the state transfer process with information as to which of the union's fields ​should be transferred.
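The general idea behind such a routine can be sketched in plain C: a field outside the union records which member is in use, and the routine reports that member so that only it is transferred. This is a sketch under assumptions, not the code from ''magic_ds.c''. The ''struct entry'' layout, the flag values, and ''entry_active_member()'' are hypothetical, and the sketch does not use the actual libmagicrt callback interface.

```c
#include <stddef.h>

/* Hypothetical flag values; not the definitions used by DS. */
#define ENTRY_U32  0x1
#define ENTRY_MEM  0x2

/* Hypothetical stand-in for a structure that contains a union. */
struct entry {
    int flags;                      /* records which union member is in use */
    union {
        unsigned int u32;           /* plain value, no pointers */
        struct {
            void *data;             /* pointer that needs pointer transfer */
            size_t length;
        } mem;
    } u;
};

enum active_member { MEMBER_NONE, MEMBER_U32, MEMBER_MEM };

/* Report which union member a state transfer routine should transfer. */
enum active_member entry_active_member(const struct entry *e)
{
    if (e->flags & ENTRY_MEM)
        return MEMBER_MEM;
    if (e->flags & ENTRY_U32)
        return MEMBER_U32;
    return MEMBER_NONE;
}
```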
  
 === Preventing state transfer issues === === Preventing state transfer issues ===
  
-In some cases, small adjustments must be made to a service in order to prevent issues with state transfer. These issues will not result in failure of the state transfer procedure; instead, they may result in a crash of the new instance after a seemingly successful live update. We cover three topics: memory grants, userspace threads, and physically unmovable regions.+In some cases, small adjustments must be made to a service in order to prevent issues with state transfer. These types of issues will not result in failure of the state transfer procedure; instead, they may result in a crash of the new instance after a seemingly successful live update. We cover three topics: memory grants, userspace threads, and physically unmovable regions.
  
 == Memory grants == == Memory grants ==
  
-One potential issue concerns memory grants. Each service has a memory grant table, which is an array of all the memory grants that allow other processes to read and/or write the service'​s memory. If the service has any grants active at the time of a live update, the grants should in theory be adjusted ​corresponding to any relocation of the memory pointed to by the grants.+One potential issue concerns memory grants. Each service has a memory grant table, which is an array of all the memory grants that allow other processes to read and/or write the service'​s memory. If the service has any grants active at the time of a live update, the grants should in theory be adjusted ​in accordance with any relocation of the memory pointed to by the grants.
  
 However, the main union of the grant structure (''​cp_grant_t'',​ defined in ''​minix/​include/​minix/​safecopies.h''​) is currently marked as //ixfer//, meaning it will be transferred as is. This is not a problem for grants that point to memory //outside// the process being updated, and that means that indirect and magic grants pose no problem for state transfer. It is however a problem for grants that point to memory //inside// the process being updated, that is, for **direct grants**. However, the main union of the grant structure (''​cp_grant_t'',​ defined in ''​minix/​include/​minix/​safecopies.h''​) is currently marked as //ixfer//, meaning it will be transferred as is. This is not a problem for grants that point to memory //outside// the process being updated, and that means that indirect and magic grants pose no problem for state transfer. It is however a problem for grants that point to memory //inside// the process being updated, that is, for **direct grants**.
  
-For this reason, the writer of any service that may potentially have direct grants active at the time of the live update, has two options: 1) implement a custom state transfer routine for the ''cp_grant_t'' structure in libmagic, thus resolving the problem described in this section altogether, or 2) block live updates of the service whenever the service has active memory grants. Since the first option is the best, we do not describe the second option in detail here. Instead, see the next section regarding blocking updates using custom quiescence states. In any case, the potential consequence of doing neither is that the service ends up suffering from arbitrary memory corruption after a live update, since the transferred direct grant will point to the wrong memory location.+For this reason, the writer of a service that may have direct grants active at the time of the live update has two options: 1) implement a custom state transfer routine for the ''cp_grant_t'' structure in libmagicrt, thus resolving the problem described in this section altogether, or 2) prevent live updates of the service whenever the service has active memory grants. The first option is preferred. In any case, the potential consequence of doing neither is that the service ends up suffering from arbitrary memory corruption after a live update, since the transferred direct grant will point to the wrong memory location.
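To illustrate what adjusting a direct grant for a relocation amounts to, the sketch below rebases a granted address from the old location of a memory region to its new location. It is a conceptual model only: ''struct direct_grant'' and ''struct relocation'' are hypothetical stand-ins, not the ''cp_grant_t'' layout, and the function is not part of libmagicrt.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of a direct grant: a local address range. */
struct direct_grant {
    uintptr_t addr;     /* start of the granted memory */
    size_t len;         /* length of the granted memory in bytes */
};

/* Hypothetical description of one region moved by state transfer. */
struct relocation {
    uintptr_t old_base; /* start of the region in the old instance */
    uintptr_t new_base; /* start of the region in the new instance */
    size_t size;        /* size of the region in bytes */
};

/*
 * If the granted range lies entirely within the relocated region, rewrite
 * the grant address to the corresponding new location and return 1.
 * Otherwise leave the grant untouched and return 0.
 */
int relocate_direct_grant(struct direct_grant *g, const struct relocation *r)
{
    uintptr_t off;

    if (g->addr < r->old_base)
        return 0;
    off = g->addr - r->old_base;
    if (off >= r->size || g->len > r->size - off)
        return 0;
    g->addr = r->new_base + off;
    return 1;
}
```

An identity transfer skips exactly this step, which is why a transferred direct grant keeps pointing at the old location.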
  
-The live update system itself actually relies on the presence of a long-running direct grant, which provides access of the process'​s full address space to the process itself. ​This grant is used during a live update ​by the new instance ​to access the memory contents of the old instance. Since the grant provides access to the process'​s entire address space, it does not suffer from the problem above.+The live update system itself actually relies on the presence of a long-running direct grant, which provides access of the process'​s full address space to the process itself. ​The new instance uses this grant during a live update to access the memory contents of the old instance. Since the grant provides access to the process'​s entire address space, it does not suffer from the problem above.
  
 == Userspace threads == == Userspace threads ==
  
-Userspace threads pose a problem for state transfer as well. We have previously explained that the process stack of the old instance can be disregarded by the state transfer procedure because it is "​naturally"​ recreated in the new instance. The same does not apply to the stacks of userspace threads, since variables ​on the stack are not tracked at run time: even though the threads'​ stacks are transferred to the new instance by the magic framework, they are seen as blobs of (typically) memory-mapped character arrays. The result is that any pointers on these stacks will not be known to libmagic ​and thus not be transferred properly. In addition, thread context (CPU register) state will typically be stored as an array of integers, and similarly end up being skipped by the state transfer procedure. The result is that while state transfer may (appear to) succeed, the service will crash after completion of the live update.+Userspace threads pose a problem for state transfer as well. We have previously explained that the process stack of the old instance can be disregarded by the state transfer procedure because it is "​naturally"​ recreated in the new instance. The same does not apply to the stacks of userspace threads, since stack variables are not tracked at run time: even though the threads'​ stacks are transferred to the new instance by the magic framework, they are seen as blobs of (typically) memory-mapped character arrays. The result is that any pointers on these stacks will not be known to libmagicrt ​and thus not be transferred properly. In addition, thread context (CPU register) state will typically be stored as an array of integers, and similarly end up being skipped by the state transfer procedure. The result is that while state transfer may (appear to) succeed, the service will crash after completion of the live update.
  
 At this time, the recommended solution is for the service to shut down all threads explicitly before starting state transfer, and to recreate the threads both after successful live update and as part of a rollback. The service may refuse to be updated if any of its threads are in use and cannot be shut down. The last point requires that the service supply a custom callback routine to SEF to perform that check for a quiescence state other than the default, through sef_setcb_lu_prepare(3). In order to allow the use of a nondefault state, a sef_setcb_lu_state_isvalid(3) callback routine must be supplied as well. For VFS and libblockdriver,​ we have chosen the following approach: At this time, the recommended solution is for the service to shut down all threads explicitly before starting state transfer, and to recreate the threads both after successful live update and as part of a rollback. The service may refuse to be updated if any of its threads are in use and cannot be shut down. The last point requires that the service supply a custom callback routine to SEF to perform that check for a quiescence state other than the default, through sef_setcb_lu_prepare(3). In order to allow the use of a nondefault state, a sef_setcb_lu_state_isvalid(3) callback routine must be supplied as well. For VFS and libblockdriver,​ we have chosen the following approach:
Line 737: Line 612:
 == Physically unmovable regions == == Physically unmovable regions ==
  
-Another case where the programmer may have to ensure that state transfer does not result in problems that will surface only after the live update, is when a service uses memory areas that are physically unmovable. Such memory areas are typically in use for DMA purposes. If the state transfer procedure changes the physical location of the buffers, DMA may be performed from or to the original physical location, resulting in garbage and possibly arbitrary memory corruption. Such DMA areas must be marked as special out-of-band memory in libmagic, and unmarked when freed, using the sef_llvm_add_special_mem_region(3) and sef_llvm_del_special_mem_region(3) SEF calls. This is done automatically by the alloc_contig(3) and free_contig(3) wrapper routines, but must be done explicitly for memory allocated in different ways.+Another case where the programmer may have to ensure that state transfer does not result in problems that will surface only after the live update, is when a service uses memory areas that are physically unmovable. Such memory areas are typically in use for DMA purposes. If the state transfer procedure changes the physical location of the buffers, DMA may be performed from or to the original physical location, resulting in garbage and possibly arbitrary memory corruption. Such DMA areas must be marked as special out-of-band memory in libmagicrt, and unmarked when freed, using the sef_llvm_add_special_mem_region(3) and sef_llvm_del_special_mem_region(3) SEF calls. This is done automatically by the alloc_contig(3) and free_contig(3) wrapper routines, but must be done explicitly for memory allocated in different ways.
  
-However, this is only necessary if DMA can happen across a live update. In cases where it is known that no DMA can possibly be ongoing during the live update, the regions are not actually physically unmovable, and therefore need not be marked as such. For example, this is the case for the file system buffer cache implemented in libminixfs. This library allocates and manages buffers without using physically contiguous memory and alloc_contig(3),​ instead using mmap(2) directly and requesting DMA I/O in page-sized chunks (in order to avoid DMA issues on ARM). Therefore, it would be affected by the above problem, were it not for the fact that all its block I/O calls are synchronous.+However, this is only necessary if DMA can happen across a live update. In cases where it is known that no DMA can possibly be ongoing during the live update, the regions are not actually physically unmovable, and therefore need not be marked as such. For example, this is the case for the file system buffer cache implemented in libminixfs. This library allocates and manages buffers without using physically contiguous memory and alloc_contig(3),​ instead using mmap(2) directly and requesting DMA I/O in page-sized chunks (in order to avoid DMA issues on ARM). Therefore, it would be affected by the above problem, were it not for the fact that all its block I/O calls are synchronous. Any future introduction of more asynchrony will turn this situation into a real problem for live update, though.
  
 As we mentioned before, memory-mapped I/O poses a similar problem. However, the only way to map such I/O memory is currently through the vm_map_phys(2) and vm_unmap_phys(2) calls, of which the libsys wrappers automatically call the special-memory marking/​unmarking functions as well. As we mentioned before, memory-mapped I/O poses a similar problem. However, the only way to map such I/O memory is currently through the vm_map_phys(2) and vm_unmap_phys(2) calls, of which the libsys wrappers automatically call the special-memory marking/​unmarking functions as well.
Line 745: Line 620:
 === Resolving state transfer errors === === Resolving state transfer errors ===
  
-As mentioned previously, ​the state transfer ​routine reports detected failures, with details written to the system log entries using an ''​[ERROR]''​ prefix. In this section, we cover a number of common reasons for state transfer to fail in practice, including some example errors and workarounds.+If the magic state transfer ​procedure encounters problems, it will report failure, with details written to the system log entries using an ''​[ERROR]''​ prefix. In this section, we cover a number of common reasons for state transfer to fail in practice, including some example errors and workarounds.
  
 == Dangling pointers == == Dangling pointers ==
  
-In order to know how to transfer a piece of memory, the magic library must know about the data type associated with that piece of memory. If no type information is known for a piece of memory, it cannot be transferred. There are various reasons why libmagic might not have type information about a piece of memory. The simplest one is a case of a **dangling pointer**: a pointer that used to be valid at some point, but no longer is, because the memory itself has been deallocated. While the actual program may know not to use that particular pointer anymore, the state transfer routine does not have such knowledge. A typical error resulting from a dangling pointer may look like this, with some important parts of the output highlighted:+In order to know how to transfer a piece of memory, the magic runtime library must know about the data type associated with that piece of memory. If no type information is known for a piece of memory, it cannot be transferred. There are various reasons why libmagicrt might not have type information about a piece of memory. The simplest one is a case of a **dangling pointer**: a pointer that used to be valid at some point, but no longer is, because the memory pointed to has been deallocated. While the actual program may know not to use that particular pointer anymore, the state transfer routine does not have such knowledge. A typical error resulting from a dangling pointer may look like this, with some important parts of the output highlighted:
  
   * **[ERROR]** uncaught ptr with violations. Current state element:   * **[ERROR]** uncaught ptr with violations. Current state element:
-  * SELEMENT: (parent=sbuf.1900354961,​ num=1, depth=0, address=0xdfb760a8,​ **name**=**sbuf**.1900354961,​ type=TYPE: (id=53 ​  , name=, size=4, num_child_types=1,​ type_id=10, bit_width=0,​ flags(ERDIVvUP)=01010000,​ values=%%''​%%,​ type_str=i8/​**char%%*%%**)) +  * SELEMENT: ​''<​nowiki>​(parent=sbuf.1900354961,​ num=1, depth=0, address=0xdfb760a8,​ **name**=**sbuf**.1900354961,​ type=TYPE: (id=53 ​  , name=, size=4, num_child_types=1,​ type_id=10, bit_width=0,​ flags(ERDIVvUP)=01010000,​ values=%%''​%%,​ type_str=i8/​**char%%*%%**))</​nowiki>''​ 
-  * SEL_ANALYZED:​ (num=1, type=ptr, flags(DIVW)=1110,​ **value**=**0x080cb49f**,​ trg_name=, trg_offset=0,​ trg_flags(RL)=D0,​ trg_selements=(#​1|0:​ 1|p=SELEMENT:​ (parent=???,​ num=0, depth=0, address=0x00000000,​ name=???, type=TYPE: (id=0    , name=**UNKNOWN_TYPE**,​ size=0, num_child_types=0,​ type_id=4, bit_width=0,​ flags(ERDIVvUP)=10000000,​ values=%%''​%%,​ type_str=UNKNOWN_TYPE/​UNKNOWN_TYPE)))) +  * SEL_ANALYZED: ​''<​nowiki>​(num=1, type=ptr, flags(DIVW)=1110,​ **value**=**0x080cb49f**,​ trg_name=, trg_offset=0,​ trg_flags(RL)=D0,​ trg_selements=(#​1|0:​ 1|p=SELEMENT:​ (parent=???,​ num=0, depth=0, address=0x00000000,​ name=???, type=TYPE: (id=0    , name=**UNKNOWN_TYPE**,​ size=0, num_child_types=0,​ type_id=4, bit_width=0,​ flags(ERDIVvUP)=10000000,​ values=%%''​%%,​ type_str=UNKNOWN_TYPE/​UNKNOWN_TYPE))))</​nowiki>''​ 
-  * SEL_STATS: (type=ptr, trg_flags(RL)=D0,​ ptr_found=1,​ **unknown_found=1**,​ violations=1)+  * SEL_STATS: ​''<​nowiki>​(type=ptr, trg_flags(RL)=D0,​ ptr_found=1,​ **unknown_found=1**,​ violations=1)</​nowiki>''​
  
-In this case, the global variable **sbuf** (suffixed with a tag to make its name unique) is a char* pointer to location 0x080cb49f. ​The magic library knows no type information about this target (//trg//) memory location, ​therefore ​marks the location with the placeholder type UNKNOWN_TYPEand fails state transfer because an unknown type was found. Another example:+In this case, the global variable **sbuf** (suffixed with a tag to make its name unique) is a char* pointer to location 0x080cb49f. ​Since the magic runtime ​library knows no type information about this target (//trg//) memory location, ​it marks the location with the placeholder type UNKNOWN_TYPE and aborts ​state transfer because an unknown type was found. Another example:
  
   * **[ERROR]** uncaught ptr with violations. Current state element:   * **[ERROR]** uncaught ptr with violations. Current state element:
-  * SELEMENT: (parent=inode.3951291702,​ num=80, depth=2, address=0xdfbe3210,​ **name**=**inode.3951291702/​4/​i_data**,​ type=TYPE: (id=61 ​  , name=, size=4, num_child_types=1,​ type_id=10, bit_width=0,​ flags(ERDIVvUP)=01010000,​ values=%%''​%%,​ type_str=i8/​**char%%*%%**)) +  * SELEMENT: ​''<​nowiki>​(parent=inode.3951291702,​ num=80, depth=2, address=0xdfbe3210,​ **name**=**inode.3951291702/​4/​i_data**,​ type=TYPE: (id=61 ​  , name=, size=4, num_child_types=1,​ type_id=10, bit_width=0,​ flags(ERDIVvUP)=01010000,​ values=%%''​%%,​ type_str=i8/​**char%%*%%**))</​nowiki>''​ 
-  * SEL_ANALYZED:​ (num=80, type=ptr, flags(DIVW)=1110,​ **value**=**0x08108098**,​ trg_name=, trg_offset=0,​ trg_flags(RL)=H0,​ trg_selements=(#​1|0:​ 1|p=SELEMENT:​ (parent=???,​ num=0, depth=0, address=0x00000000,​ name=???, type=TYPE: (id=0    , name=**UNKNOWN_TYPE**,​ size=0, num_child_types=0,​ type_id=4, bit_width=0,​ flags(ERDIVvUP)=10000000,​ values=%%''​%%,​ type_str=UNKNOWN_TYPE/​UNKNOWN_TYPE)))) +  * SEL_ANALYZED: ​''<​nowiki>​(num=80, type=ptr, flags(DIVW)=1110,​ **value**=**0x08108098**,​ trg_name=, trg_offset=0,​ trg_flags(RL)=H0,​ trg_selements=(#​1|0:​ 1|p=SELEMENT:​ (parent=???,​ num=0, depth=0, address=0x00000000,​ name=???, type=TYPE: (id=0    , name=**UNKNOWN_TYPE**,​ size=0, num_child_types=0,​ type_id=4, bit_width=0,​ flags(ERDIVvUP)=10000000,​ values=%%''​%%,​ type_str=UNKNOWN_TYPE/​UNKNOWN_TYPE))))</​nowiki>''​ 
-  * SEL_STATS: (type=ptr, trg_flags(RL)=H0,​ ptr_found=1,​ unknown_found=1,​ violations=1)+  * SEL_STATS: ​''<​nowiki>​(type=ptr, trg_flags(RL)=H0,​ ptr_found=1, ​**unknown_found=1**, violations=1)</​nowiki>''​
  
-In this case, the **i_data** field of the fifth element (**/4/**) of the global **inode** structure, also a char* pointer, is pointing to address 0x08108098 which is similarly ​unknown to libmagic. The pointer address typically allows one to determine what kind of memory it is, by means of the memory sections of the process. In this particular example, the address was somewhat higher than the service'​s data end, thus suggesting the memory pointed to is heap memory. This matched with the source code of the service (PFS, the Pipe File Server), which dynamically allocates and frees the i_data buffers using malloc(3) and free(3).+In this case, the **i_data** field of the fifth element (**/4/**) of the global **inode** structure, also a char* pointer, is pointing to address 0x08108098 which is unknown to libmagicrt. The pointer address typically allows one to determine what kind of memory it is, by means of the memory sections of the process. In this particular example, the address was somewhat higher than the service'​s data end, thus suggesting the memory pointed to is heap memory. This matched with the source code of the service (PFS, the Pipe File Server), which dynamically allocates and frees the i_data buffers using malloc(3) and free(3).
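The reasoning used above, comparing a pointer value against the memory sections of the process, can be sketched as a small classifier. This is a minimal sketch that assumes a layout in which text, data, heap, and stack appear in that order by increasing address. ''struct mem_layout'' and ''classify_address()'' are hypothetical, and the actual section boundaries of a service have to be obtained separately.

```c
#include <stdint.h>

/* Hypothetical description of a process layout; ranges are [start, end). */
struct mem_layout {
    uintptr_t text_start, text_end;
    uintptr_t data_start, data_end;
    uintptr_t stack_start;          /* lowest address of the stack area */
};

enum mem_kind { MEM_TEXT, MEM_DATA, MEM_HEAP_OR_MMAP, MEM_STACK, MEM_UNKNOWN };

/* Classify an address, assuming text < data < heap < stack. */
enum mem_kind classify_address(const struct mem_layout *l, uintptr_t addr)
{
    if (addr >= l->text_start && addr < l->text_end)
        return MEM_TEXT;
    if (addr >= l->data_start && addr < l->data_end)
        return MEM_DATA;
    if (addr >= l->stack_start)
        return MEM_STACK;
    if (addr >= l->data_end)
        return MEM_HEAP_OR_MMAP;    /* above the data end, below the stack */
    return MEM_UNKNOWN;
}
```

An address slightly above the data end falls into the heap range under these assumptions, which then has to be confirmed against the source code of the service.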
  
 It is up to the programmer of the service to ensure that the state transfer routine will not attempt to transfer a dangling pointer. This can be as simple as zeroing out the pointer after use, which is usually good practice anyway: It is up to the programmer of the service to ensure that the state transfer routine will not attempt to transfer a dangling pointer. This can be as simple as zeroing out the pointer after use, which is usually good practice anyway:
Line 774: Line 649:
 == External pointers == == External pointers ==
  
-A similar problem occurs when a process has a pointer that is only valid in the address space of another process, or possibly the kernel. Unlike dangling pointers, such external pointers are never valid, and thus do not need to be transferred in any case. The magic framework must be instructed to that end, for example using a //noxfer// annotation. However, external pointers often end up in the local address space as a result of copying in entire structures as once (we already gave process tables as an example), in which case it may be necessary to use //ixfer// rather than //noxfer//. For example, the ProcFS (/proc File System) service has several instances of the following construction:+A similar problem occurs when a process has a pointer that is only valid in the address space of another process, or possibly the kernel. Unlike dangling pointers, such external pointers are never valid, and thus do not need to be transferred as pointers. The magic framework must be instructed to that end, for example using a //noxfer// annotation. However, external pointers often end up in the local address space as a result of copying in entire structures at once (we already gave process tables as an example), in which case it may be necessary to use //ixfer// rather than //noxfer//. For example, the ProcFS (/proc File System) service has several instances of the following construction:
  
   typedef struct mproc ixfer_mproc_t;​   typedef struct mproc ixfer_mproc_t;​
   static ixfer_mproc_t mproc;   static ixfer_mproc_t mproc;
  
-In some cases, it may make more sense to zero out pointers instead. In other cases, we have changed code to retrieve not entire kernel tables but only specific values, or to use the kernel-mapped pages instead of copies of kernel structures to retrieve values. The magic library already ignores pointers into kernel space (that is, 0xf0000000 and higher) altogether.+In some cases, it may make more sense to zero out pointers instead. In other cases, we have changed code to retrieve not entire kernel tables but only specific values, or to use the kernel-mapped pages instead of copies of kernel structures to retrieve values. The magic runtime ​library already ignores pointers into kernel space (that is, 0xf0000000 and higher) altogether.
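Two of the points above can be sketched in C: the kernel-space check on a pointer value, and zeroing out external pointers after copying in a structure. The boundary value is taken from the text and applies to the 32-bit layout described here. ''struct remote_rec'' and both function names are hypothetical, and the check only mirrors the described behavior rather than reproducing libmagicrt code.

```c
#include <stddef.h>
#include <stdint.h>

/* Boundary from the text: addresses at or above this are kernel space. */
#define KERNEL_SPACE_START 0xf0000000UL

/* Returns nonzero if the value lies in kernel space. */
int is_kernel_space_pointer(uintptr_t value)
{
    return value >= KERNEL_SPACE_START;
}

/* Hypothetical record copied in from another address space. */
struct remote_rec {
    int id;             /* value that is still needed locally */
    void *remote_ptr;   /* only valid in the remote address space */
};

/* Zero out the external pointers so that they are never analyzed. */
void scrub_remote_pointers(struct remote_rec *recs, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++)
        recs[i].remote_ptr = NULL;
}
```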
  
 Theoretically it is possible that remote pointers end up being valid in the local address space by sheer luck. In known cases of copying in external pointers, it is best not to rely on failures in the magic framework, but rather annotate the code in a proactive manner. Theoretically it is possible that remote pointers end up being valid in the local address space by sheer luck. In known cases of copying in external pointers, it is best not to rely on failures in the magic framework, but rather annotate the code in a proactive manner.
Line 785: Line 660:
 == Weak symbols == == Weak symbols ==
  
-If a service uses weak symbols, the code and data pointed to by these weak symbols may not be included in the linked service object at the time that the instrumentation passes are run. However, the weak symbols will be resolved and included after the instrumentation stage. This results in the situation that some of the code and data that is part of the service, will not have been analyzed by the magic pass. The result is a range of possible state transfer failures, including cases where pointers end up pointing to unknown memory and cases where memory allocation is not properly instrumented,​ ultimately leading to pointers to unknown memory ​as well.+If a service uses weak symbols, the code and data pointed to by these weak symbols may not be included in the linked service object at the time that the instrumentation passes are run. These weak symbols will be resolved and included ​only after the instrumentation stage. This results in the situation that some of the code and data that is part of the service, will not have been analyzed by the magic pass. The result is a range of possible state transfer failures, including cases where pointers end up pointing to unknown ​static ​memory and cases where memory allocation is not properly instrumented,​ ultimately leading to pointers to unknown ​dynamic ​memory.
  
 The following example is from the DS service, where its use of the weak aliases for regcomp(3) and regfree(3) resulted in regcomp'​s malloc(3) calls not being instrumented:​ The following example is from the DS service, where its use of the weak aliases for regcomp(3) and regfree(3) resulted in regcomp'​s malloc(3) calls not being instrumented:​
  
   * **[ERROR]** uncaught ptr with violations. Current state element:   * **[ERROR]** uncaught ptr with violations. Current state element:
-  * SELEMENT: (parent=ds_subs.1944246923,​ num=9, depth=3, address=0xdfbe6108,​ name=**ds_subs.1944246923/​0/​regex/​re_g**,​ type=TYPE: (id=18 ​  , name=, size=4, num_child_types=1,​ type_id=10, bit_width=0,​ flags(ERDIVvUP)=00000000,​ values=%%''​%%,​ type_str=opaque*)) +  * SELEMENT: ​''<​nowiki>​(parent=ds_subs.1944246923,​ num=9, depth=3, address=0xdfbe6108,​ name=**ds_subs.1944246923/​0/​regex/​re_g**,​ type=TYPE: (id=18 ​  , name=, size=4, num_child_types=1,​ type_id=10, bit_width=0,​ flags(ERDIVvUP)=00000000,​ values=%%''​%%,​ type_str=opaque*))</​nowiki>''​ 
-  * SEL_ANALYZED:​ (num=9, type=ptr, flags(DIVW)=1110,​ **value**=**0x08111000**,​ trg_name=, trg_offset=0,​ trg_flags(RL)=,​ trg_selements=(#​1|0:​ 1|p=SELEMENT:​ (parent=???,​ num=0, depth=0, address=0x00000000,​ name=???, type=TYPE: (id=0    , name=**UNKNOWN_TYPE**,​ size=0, num_child_types=0,​ type_id=4, bit_width=0,​ flags(ERDIVvUP)=10000000,​ values=%%''​%%,​ type_str=UNKNOWN_TYPE/​UNKNOWN_TYPE)))) +  * SEL_ANALYZED: ​''<​nowiki>​(num=9, type=ptr, flags(DIVW)=1110,​ **value**=**0x08111000**,​ trg_name=, trg_offset=0,​ trg_flags(RL)=,​ trg_selements=(#​1|0:​ 1|p=SELEMENT:​ (parent=???,​ num=0, depth=0, address=0x00000000,​ name=???, type=TYPE: (id=0    , name=**UNKNOWN_TYPE**,​ size=0, num_child_types=0,​ type_id=4, bit_width=0,​ flags(ERDIVvUP)=10000000,​ values=%%''​%%,​ type_str=UNKNOWN_TYPE/​UNKNOWN_TYPE))))</​nowiki>''​ 
-  * SEL_STATS: (type=ptr, ptr_found=1,​ unknown_found=1,​ violations=1)+  * SEL_STATS: ​''<​nowiki>​(type=ptr, ptr_found=1, ​**unknown_found=1**, violations=1)</​nowiki>''​
  
-In this case, the pointer **ds_subs[0].regex.re_g** ended up pointing to the unknown heap-section value of 0x08111000. We worked around this issue by forcing DS to use the targets of the weak aliases, _regcomp and _regfree, rather than their original names.+In this case, the pointer **ds_subs[0].regex.re_g** ended up pointing to the unknown heap-section value of 0x08111000. We worked around this issue by forcing DS to use the targets of the weak aliases, _regcomp and _regfree, rather than their original names, using Makefile hacks.
  
-== Code used only in libmagic ​==+== Code used only in libmagicrt ​==
  
-Very similarly to the above case, if the magic library itself uses other library modules, for example from libc, and these modules ​were not already used by the service itself anyway, then the bitcode linker may not include them in the linked object on which the instrumentation passes are run. Again, this may result in various failures, and unknown pointers in particular:+If the magic runtime ​library itself uses other library modules, for example from libc, and these modules ​are not already used by the service itself anyway, then the bitcode linker may not include them in the linked object on which the instrumentation passes are run. Again, this may result in various failures, and unknown pointers in particular:
  
   * **[ERROR]** uncaught ptr with violations. Current state element:   * **[ERROR]** uncaught ptr with violations. Current state element:
-  * SELEMENT: (parent=_ctype_tab_,​ num=1, depth=0, address=0xdfb760a8,​ **name**=**_ctype_tab_**,​ type=TYPE: (id=204 ​ , name=, size=4, num_child_types=1,​ type_id=10, bit_width=0,​ flags(ERDIVvUP)=11000000,​ values=%%''​%%,​ type_str=i16/​unsigned short*)) +  * SELEMENT: ​''<​nowiki>​(parent=_ctype_tab_,​ num=1, depth=0, address=0xdfb760a8,​ **name**=**_ctype_tab_**,​ type=TYPE: (id=204 ​ , name=, size=4, num_child_types=1,​ type_id=10, bit_width=0,​ flags(ERDIVvUP)=11000000,​ values=%%''​%%,​ type_str=i16/​unsigned short*))</​nowiki>''​ 
-  * SEL_ANALYZED:​ (num=1, type=ptr, flags(DIVW)=1110,​ **value**=**0x0809ccb6**,​ trg_name=, trg_offset=0,​ trg_flags(RL)=,​ trg_selements=(#​1|0:​ 1|p=SELEMENT:​ (parent=???,​ num=0, depth=0, address=0x00000000,​ name=???, type=TYPE: (id=0    , name=**UNKNOWN_TYPE**,​ size=0, num_child_types=0,​ type_id=4, bit_width=0,​ flags(ERDIVvUP)=10000000,​ values=%%''​%%,​ type_str=UNKNOWN_TYPE/​UNKNOWN_TYPE)))) +  * SEL_ANALYZED: ​''<​nowiki>​(num=1, type=ptr, flags(DIVW)=1110,​ **value**=**0x0809ccb6**,​ trg_name=, trg_offset=0,​ trg_flags(RL)=,​ trg_selements=(#​1|0:​ 1|p=SELEMENT:​ (parent=???,​ num=0, depth=0, address=0x00000000,​ name=???, type=TYPE: (id=0    , name=**UNKNOWN_TYPE**,​ size=0, num_child_types=0,​ type_id=4, bit_width=0,​ flags(ERDIVvUP)=10000000,​ values=%%''​%%,​ type_str=UNKNOWN_TYPE/​UNKNOWN_TYPE))))</​nowiki>''​ 
-  * SEL_STATS: (type=ptr, ptr_found=1,​ unknown_found=1,​ violations=1)+  * SEL_STATS: ​''<​nowiki>​(type=ptr, ptr_found=1, ​**unknown_found=1**, violations=1)</​nowiki>''​
  
-In this particular failure case, the global **_ctype_tab_** variable pointed into another global variable, at location 0x0809ccb6 in the data section, but the other global variable was invisible to the magic pass, so no **sentry** object could be created for it. The _ctype_tab_ variable itself was referred to by the ''<ctype.h>'' isalpha(3) (etc) set of macros from within libmagic. We worked around this issue by putting our own replacement set of macros in libmagic instead.+In this particular failure case, the global **_ctype_tab_** variable pointed into another global variable, at location 0x0809ccb6 in the data section. The other global variable was invisible to the magic pass, so no **sentry** object could be created for it. As a result, libmagicrt did not know about the target of the pointer. The _ctype_tab_ variable itself was used by the ''<ctype.h>'' isalpha(3) (etc) set of macros from within libmagicrt. We worked around this issue by putting our own replacement set of macros in libmagicrt instead.
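
As an illustration of this kind of workaround, the following is a minimal sketch of table-free character classification macros. The ''MAGIC_IS*'' names and the ''magic_ident_len'' helper are hypothetical and are not taken from the libmagicrt source; the sketch also assumes an ASCII character set. Because the macros only compare against character constants, they do not reference ''_ctype_tab_'' or any other libc data object.

```c
/*
 * Hypothetical, table-free replacements for a few <ctype.h> macros.
 * Assumes ASCII. Note that each macro evaluates its argument more than once.
 */
#define MAGIC_ISUPPER(c) ((c) >= 'A' && (c) <= 'Z')
#define MAGIC_ISLOWER(c) ((c) >= 'a' && (c) <= 'z')
#define MAGIC_ISALPHA(c) (MAGIC_ISUPPER(c) || MAGIC_ISLOWER(c))
#define MAGIC_ISDIGIT(c) ((c) >= '0' && (c) <= '9')
#define MAGIC_ISALNUM(c) (MAGIC_ISALPHA(c) || MAGIC_ISDIGIT(c))
/* Space, plus the ASCII range '\t' (9) through '\r' (13). */
#define MAGIC_ISSPACE(c) ((c) == ' ' || ((c) >= '\t' && (c) <= '\r'))

/* Example use: length of the leading C-identifier-like part of a string. */
static unsigned int magic_ident_len(const char *s)
{
    unsigned int n = 0;

    while (MAGIC_ISALNUM(s[n]) || s[n] == '_')
        n++;
    return n;
}
```

For example, ''magic_ident_len("tmr_func*")'' stops at the asterisk. The trade-off of this approach is that locale-specific classification is lost, which is presumably acceptable for code that only parses its own internal strings.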
  
 == Assembly code == == Assembly code ==
  
-Yet another case that leads to invisibility of certain aspects is the direct inclusion of assembly code. Assembly code is machine code, not bitcode, and thus, the bitcode instrumentation passes will have problems processing it. Needless to say, the use of assembly code should be minimal throughout the source code. In cases where it cannot be avoided, custom solutions will have to be found for any resulting state transfer problems. Fortunately, much of the assembly in use by services these days is the result of optimized str*(3) and mem*(3) functions, which require no special treatment for the purpose of state transfer.+Yet another case that leads to invisibility of certain aspects is the direct inclusion of assembly code. Assembly code is machine code, not bitcode, and thus, the bitcode instrumentation passes will have problems processing it. Needless to say, the use of assembly code should be minimal throughout the source code. In cases where it cannot be avoided, custom solutions have to be found for any resulting state transfer problems. Fortunately, much of the assembly in use by services these days is the result of optimized str*(3) and mem*(3) functions, which require no special treatment for the purpose of state transfer.
  
 == Incompatible types == == Incompatible types ==
  
-Finally, we describe one class of state transfer failures which are the result of shortcomings in the magic instrumentation framework. LLVM bitcode has the notion of an **opaque** data type, which is the type used for data of which the type has been declared but not defined, typically as a result of forward declarations of structures. As a result, opaque pointers may show up in various places: instead of resolving these types after they have been instantiated, LLVM tends to cast between various data types which are identical except for the presence of opaque pointers. The magic pass should mark all these practically identical data types as //compatible types//, but due to the fact that the casts can take rather complex forms, this is not always happening. The result is that in some cases, state transfer may fail because libmagic erroneously detects an incompatibility between a pointer type and the type of data being pointed to. As an example, this is a state transfer error that was reported during state transfer of the PM service:+Finally, we describe one class of state transfer failures which are the result of shortcomings in the magic instrumentation framework itself. LLVM bitcode has the notion of an **opaque** data type. The opaque data type is used for data of which the type has been declared but not defined, typically as a result of forward declarations of structures (''struct foo;''). Instead of resolving these types after they have been instantiated, LLVM tends to cast between various data types which are identical except for the presence of opaque pointers. As a result, opaque pointers may show up in various places in linked bitcode.
 + 
 +The magic pass should mark all these practically identical data types as //compatible types//. However, because the casts can take rather complex forms, this does not always happen. The result is that in some cases, state transfer may fail because libmagicrt erroneously detects an incompatibility between a pointer type and the type of data being pointed to. As an example, the following state transfer error was reported during state transfer of the PM service:
  
   * **[ERROR]** uncaught ptr with violations. Current state element:   * **[ERROR]** uncaught ptr with violations. Current state element:
-  * SELEMENT: (parent=timers.515278380,​ num=1, depth=0, address=0xdfb760a8,​ **name**=**timers**.515278380,​ type=TYPE: (id=96 ​  , name=, size=4, num_child_types=1,​ type_id=10, bit_width=0,​ flags(ERDIVvUP)=01000000,​ values=%%''​%%,​ **type_str**={ $minix_timer tmr_next \2, tmr_exp_time i32/long unsigned int, **tmr_func opaque%%*%%**,​ tmr_arg { (U) $ixfer_tmr_arg_t ta_int i32/int } }*)) +  * SELEMENT: ​''<​nowiki>​(parent=timers.515278380,​ num=1, depth=0, address=0xdfb760a8,​ **name**=**timers**.515278380,​ type=TYPE: (id=96 ​  , name=, size=4, num_child_types=1,​ type_id=10, bit_width=0,​ flags(ERDIVvUP)=01000000,​ values=%%''​%%,​ **type_str**={ $minix_timer tmr_next \2, tmr_exp_time i32/long unsigned int, **tmr_func opaque%%*%%**,​ tmr_arg { (U) $ixfer_tmr_arg_t ta_int i32/int } }*))</​nowiki>''​ 
-  * SEL_ANALYZED:​ (num=1, type=ptr, flags(DIVW)=1110,​ value=0x08147460,​ trg_name=mproc,​ trg_offset=274616,​ trg_flags(RL)=D0,​ trg_selements=(**#​2**|0:​ **1**|o=SELEMENT:​ (parent=mproc,​ num=0, depth=0, address=0x08147460,​ name=**mproc/​143/​mp_timer**,​ type=TYPE: (id=38 ​  , name=minix_timer,​ size=16, num_child_types=4,​ type_id=9, bit_width=0,​ flags(ERDIVvUP)=00000000,​ values=%%''​%%,​ names='​minix_timer_t|minix_timer',​ **type_str**={ $minix_timer tmr_next { $minix_timer tmr_next \2, tmr_exp_time i32/long unsigned int, tmr_func hash_3792421438/​*,​ tmr_arg { (U) $ixfer_tmr_arg_t ta_int i32/int } }*, tmr_exp_time i32/long unsigned int, **tmr_func hash_3792421438/​%%*%%**,​ tmr_arg { (U) $ixfer_tmr_arg_t ta_int i32/int } })), **2**|o=SELEMENT:​ (parent=mproc,​ num=0, depth=0, address=0x08147460,​ name=mproc/​143/​mp_timer/​tmr_next,​ type=TYPE: (id=37 ​  , name=, size=4, num_child_types=1,​ type_id=10, bit_width=0,​ flags(ERDIVvUP)=00000000,​ values=%%''​%%,​ **type_str**={ $minix_timer tmr_next \2, tmr_exp_time i32/long unsigned int, **tmr_func hash_3792421438/​%%*%%**,​ tmr_arg { (U) $ixfer_tmr_arg_t ta_int i32/int } }*)))) +  * SEL_ANALYZED: ​''<​nowiki>​(num=1, type=ptr, flags(DIVW)=1110,​ value=0x08147460,​ trg_name=mproc,​ trg_offset=274616,​ trg_flags(RL)=D0,​ trg_selements=(**#​2**|0:​ **1**|o=SELEMENT:​ (parent=mproc,​ num=0, depth=0, address=0x08147460,​ name=**mproc/​143/​mp_timer**,​ type=TYPE: (id=38 ​  , name=minix_timer,​ size=16, num_child_types=4,​ type_id=9, bit_width=0,​ flags(ERDIVvUP)=00000000,​ values=%%''​%%,​ names='​minix_timer_t|minix_timer',​ **type_str**={ $minix_timer tmr_next { $minix_timer tmr_next \2, tmr_exp_time i32/long unsigned int, tmr_func hash_3792421438/​*,​ tmr_arg { (U) $ixfer_tmr_arg_t ta_int i32/int } }*, tmr_exp_time i32/long unsigned int, **tmr_func hash_3792421438/​%%*%%**,​ tmr_arg { (U) $ixfer_tmr_arg_t ta_int i32/int } })), **2**|o=SELEMENT:​ (parent=mproc,​ num=0, depth=0, address=0x08147460,​ 
name=mproc/​143/​mp_timer/​tmr_next,​ type=TYPE: (id=37 ​  , name=, size=4, num_child_types=1,​ type_id=10, bit_width=0,​ flags(ERDIVvUP)=00000000,​ values=%%''​%%,​ **type_str**={ $minix_timer tmr_next \2, tmr_exp_time i32/long unsigned int, **tmr_func hash_3792421438/​%%*%%**,​ tmr_arg { (U) $ixfer_tmr_arg_t ta_int i32/int } }*))))</​nowiki>''​ 
-  * SEL_STATS: (type=ptr, trg_flags(RL)=D0,​ ptr_found=1,​ **other_types_found=1**,​ violations=1)+  * SEL_STATS: ​''<​nowiki>​(type=ptr, trg_flags(RL)=D0,​ ptr_found=1,​ **other_types_found=1**,​ violations=1)</​nowiki>''​
  
-In this case, the analysis failed on the global **timers** variable. The analysis dump shows that two matching types (**#2**) were found, both associated with the **mproc[143].mp_timer** structure field, but neither type matched the type of the pointer. A closer look at the textual representations of the pointer type (the **type_str** of the primary //​selement//​) and of the data types (the //​type_str//​ of the target //​selement//​s) reveals that the only difference ​is that the **tmr_func** field of the structure type to which the //timers// variable should point is an **opaque** pointer, whereas the same //​tmr_func//​ field of the target structures is a particular function pointer (to a function referred to as **hash_3792421438**). The remainder of the types are the same.+In this case, the analysis failed on the global **timers** variable. The analysis dump shows that two matching types (**#2**) were found, both associated with the **mproc[143].mp_timer** structure field, but neither type matched the type of the pointer. A closer look at the textual representations of the pointer type (the **type_str** of the primary //​selement//​) and of the data types (the //​type_str//​ of the target //​selement//​s) reveals that there is only one difference ​between the two: the **tmr_func** field of the structure type to which the //timers// variable should point is an **opaque** pointer, whereas the same //​tmr_func//​ field of the target structures is a particular function pointer (to a function referred to as **hash_3792421438**). The remainder of the types are the same.
  
-As an aside, the **\n** notation indicates type recursion of the type **n** levels up. The asterisk at the end of a { structure } block indicates a pointer to this structure. In this case, the //timers// variable is a pointer to a **minix_timer_t** structure. In the type string of //timers//, the **\2** after the **tmr_next** field indicates that it is again a pointer to //minix_timer_t//: one type level up is the structure itself, two type levels up is the pointer to the //minix_timer_t// structure. In this case there are no three levels up, but in other cases three levels up could for example be a pointer to a pointer to the structure, etcetera.+The type strings are somewhat difficult to read. The asterisk at the end of a { structure } block indicates a pointer to this structure. In this case, the //timers// variable is a pointer to a **minix_timer_t** structure. The **\n** notation indicates type recursion of the type **n** levels up. In the type string of //timers//, the **\2** after the **tmr_next** field indicates that it is again a pointer to //minix_timer_t//: one type level up is the structure itself, two type levels up is the pointer to the //minix_timer_t// structure. In this case there is no third level up, but in other cases three levels up could for example be a pointer to a pointer to the structure, etcetera. Although irrelevant in this case, the name of each structure is prefixed with a dollar sign, and **(U)** denotes a union.
  
-The analysis fails because it finds different, noncompatible, and therefore **other** types, even though the opaque pointer and the function pointer are really the same field types. A look at the corresponding declarations in ''minix/include/minix/timers.h'' shows that there is indeed a forward declaration of ''struct minix_timer'' which is the cause of LLVM's link-time introduction of casts. We resolved this case by extending the casting analysis of the magic pass to include casts of structures through function prototypes.+In this case, the analysis failed because it found different, incompatible, and therefore **other** types, even though the opaque pointer and the function pointer were really the same field types. A look at the corresponding declarations in ''minix/include/minix/timers.h'' shows that there is indeed a forward declaration of ''struct minix_timer'' which is the cause of LLVM's link-time introduction of casts. We resolved this case by extending the casting analysis of the magic pass to include casts of structures through function prototypes.
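
The source-level pattern behind this failure can be reproduced in a few lines of C. The following is a simplified sketch and not the actual contents of ''minix/include/minix/timers.h'': the names ''my_timer'', ''my_tmr_func_t'', ''my_callback'' and ''run_timers'' are hypothetical stand-ins. It shows a structure that is forward-declared, then used in a function pointer prototype, and only afterwards defined.

```c
#include <stddef.h>

/* Forward declaration: the structure is declared here, but not yet defined. */
struct my_timer;

/* A callback type whose prototype mentions the not-yet-defined structure. */
typedef void (*my_tmr_func_t)(struct my_timer *tp);

/* The full definition, which embeds the callback type declared above. */
struct my_timer {
    struct my_timer *tmr_next;   /* next timer in a singly linked list */
    unsigned long tmr_exp_time;  /* expiration time */
    my_tmr_func_t tmr_func;      /* callback to invoke */
};

static int fired;                /* number of callbacks invoked so far */

static void my_callback(struct my_timer *tp)
{
    (void)tp;
    fired++;
}

/* Invoke the callback of every timer on the list; return the list length. */
static size_t run_timers(struct my_timer *head)
{
    size_t n = 0;

    for (struct my_timer *tp = head; tp != NULL; tp = tp->tmr_next) {
        if (tp->tmr_func != NULL)
            tp->tmr_func(tp);
        n++;
    }
    return n;
}
```

At the C level, every use of ''struct my_timer'' refers to one and the same type. As described above, the linked bitcode may nevertheless contain several near-identical instances of such a type that differ only in whether a field like ''tmr_func'' is an opaque pointer or a concrete function pointer.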
  
 The following example also resulted from the same forward declaration of MINIX3 timer structures, this time in the sched (scheduling) service: The following example also resulted from the same forward declaration of MINIX3 timer structures, this time in the sched (scheduling) service:
  
   * **[ERROR]** uncaught ptr with violations. Current state element:   * **[ERROR]** uncaught ptr with violations. Current state element:
-  * SELEMENT: (parent=sched_timer.29458437,​ num=4, depth=1, address=0xdfbe70b0,​ **name**=**sched_timer.29458437/​tmr_func**,​ type=TYPE: (id=17 ​  , name=tmr_func_t,​ size=4, num_child_types=1,​ type_id=10, bit_width=0,​ flags(ERDIVvUP)=00000000,​ values=%%''​%%,​ type_str=**opaque%%*%%**)) +  * SELEMENT: ​''<​nowiki>​(parent=sched_timer.29458437,​ num=4, depth=1, address=0xdfbe70b0,​ **name**=**sched_timer.29458437/​tmr_func**,​ type=TYPE: (id=17 ​  , name=tmr_func_t,​ size=4, num_child_types=1,​ type_id=10, bit_width=0,​ flags(ERDIVvUP)=00000000,​ values=%%''​%%,​ type_str=**opaque%%*%%**))</​nowiki>''​ 
-  * SEL_ANALYZED:​ (num=4, type=ptr, flags(DIVW)=1110,​ value=0x08048dc0,​ trg_name=**balance_queues**.29458437,​ trg_offset=0,​ trg_flags(RL)=T0,​ trg_selements=(#​1|0:​ 1|o=SELEMENT:​ (parent=???,​ num=0, depth=0, address=0x08048dc0,​ name=???, type=TYPE: (id=119 ​ , name=, size=1, num_child_types=0,​ type_id=4, bit_width=0,​ flags(ERDIVvUP)=11000001,​ values=%%''​%%,​ type_str=**hash_3792445575/​**)))) +  * SEL_ANALYZED: ​''<​nowiki>​(num=4, type=ptr, flags(DIVW)=1110,​ value=0x08048dc0,​ trg_name=**balance_queues**.29458437,​ trg_offset=0,​ trg_flags(RL)=T0,​ trg_selements=(#​1|0:​ 1|o=SELEMENT:​ (parent=???,​ num=0, depth=0, address=0x08048dc0,​ name=???, type=TYPE: (id=119 ​ , name=, size=1, num_child_types=0,​ type_id=4, bit_width=0,​ flags(ERDIVvUP)=11000001,​ values=%%''​%%,​ type_str=**hash_3792445575/​**))))</​nowiki>''​ 
-  * SEL_STATS: (type=ptr, trg_flags(RL)=T0,​ ptr_found=1,​ other_types_found=1,​ violations=1)+  * SEL_STATS: ​''<​nowiki>​(type=ptr, trg_flags(RL)=T0,​ ptr_found=1, ​**other_types_found=1**, violations=1)</​nowiki>''​
  
-In this case, the type mismatch was not between two structures that differed in opaque fields, but between two function pointers themselves: the function pointer in **sched_timer.tmr_func**,​ and the function it is pointing to, **balance_queues**. Registering these types as compatible would result in much more complexity in the magic pass, and likely still not resolve the more general problem of opaque pointers. This is currently one of the open issues, and we believe that another approach would be more viable; see below. In this particular case, it turned out that the sched service did not need to use timers at all, and we simplified it by getting rid of its timer altogether. Obviously, adapting the actual functionality of a service to allow for state transfer is not always an option, nor is it generally the right approach: the core code of system services should not have to be (re)written specifically to allow for state transfer.+In this case, the type mismatch was not between two structures that differed in opaque fields, but between two function pointers themselves: the function pointer in **sched_timer.tmr_func**,​ and the function it is pointing to, **balance_queues**. Registering these types as compatible would result in much more complexity in the magic pass, and likely still not resolve the more general problem of opaque pointers. This is currently one of the open issues, and we believe that another approach would be more viable; see below. In this particular case, it turned out that the sched service did not need to use timers at all, and we simplified it by getting rid of its use of timers ​altogether. Obviously, adapting the actual functionality of a service to allow for state transfer is not always an option, nor is it generally the right approach: the core code of system services should not have to be (re)written specifically to allow for state transfer.
  
 ===== Open issues ===== ===== Open issues =====
Line 843: Line 720:
 ==== The build system ==== ==== The build system ====
  
-As shown in the setup part of the users guide, the entire live update build infrastructure consists of a separate set of scripts built on top of the regular build system. These scripts deviate from the standard build system approach in various ways, for example by building a separate copy of the LLVM toolchain, placing binaries in the MINIX3 source tree, and separately performing scripted steps which should be performed from the regular makefile infrastructure instead. All of these issues should be resolved through proper integration of the live update build infrastructure into the regular build system. +As shown in the setup part of the users guide, the live update functionality requires that a separate instance of the LLVM toolchain be built. Unlike the standard toolchain, this separate instance is suitable for Link-Time Optimization (LTO). It is built by ''minix/llvm/generate_gold_plugin.sh'', and placed in ''obj_llvm.i386''. The exact same LLVM 3.6.1 source code is used to compile both the LTO-enabled toolchain and the additional regular crosscompilation toolchain in ''obj.i386'', using the exact same configuration flags. The separate compilation is necessary only because of a problem with makefiles.
- +
-=== Two LLVM toolchains === +
- +
-A major part of the problem is the current necessity to build a separate instance ​of the LLVM toolchain. This is the instance that is suitable for Link-Time Optimization (LTO)built by ''​minix/​llvm/​generate_gold_plugin.sh'',​ and placed in ''​obj_llvm.i386''​. ​Even though the exact same LLVM 3.source code is used to compile both this and the additional regular crosscompilation toolchain in ''​obj.i386'',​ the separate compilation is necessary because of a problem with makefiles+
- +
-NetBSD uses its own set of makefiles to build imported code using its own build system. MINIX3 imports this system, and thus also uses the NetBSD set of makefiles to build the LLVM toolchain. The problem is that these makefiles do not operate in the same way as LLVM's own set of makefiles, resulting in certain parts of the LLVM toolchain not being built in the same way. The separate LLVM LTO toolchain build does use LLVM's own makefiles, thereby generating some missing pieces that are required for the live update instrumentation. +
- +
-The solution here is to adapt the NetBSD set of makefiles to build LLVM in a way that is closer to LLVM's own makefiles, thereby generating all the missing pieces without the need to build LLVM twice. +
- +
-=== Lack of integration === +
- +
-Once that step has been taken, it should be possible to resolve the other issues as well, effectively replacing all the ''​*.llvm''​ scripts in ''​minix/​llvm''​ with extensions in the regular build system, specifically by adapting the ''​share/​mk''​ set of makefiles as appropriate. All of this should be optional, controlled by the ''​MKMAGIC''​ build (pseudo)variable and possibly other, new build variables. Ultimately, relinking with libmagic and invoking the appropriate link-time passes should be performed by those makefiles. As an example, the WeakAliasModuleOverride pass is already invoked this way.+
  
-All passes, as well as the magic library, should be (re)built as part of the standard build system infrastructure. As we have indicated earlier, the lack of this step puts an unnecessary burden on the user of the system.+NetBSD uses its own set of makefiles to build imported code using its own build system. MINIX3 imports this system, and thus also uses the NetBSD set of makefiles to build the LLVM toolchain. The problem is that these makefiles do not operate in the same way as LLVM's own set of makefiles, resulting in certain parts of the LLVM toolchain being built in a different way. The separate LLVM LTO toolchain build does use LLVM's own makefiles, thereby generating some missing pieces that are required for the live update instrumentation.
  
-As part of this, none of the generated binaries should be placed in ''minix/llvm/bin''. Instead, they should end up in an appropriate subdirectory of ''obj.i386'', thereby keeping the source directory clean.+The solution here is to adapt the NetBSD set of makefiles to build LLVM in a way that is closer to LLVM's own makefiles, thereby generating all the necessary parts of the toolchain without the need to build LLVM twice.
  
-Finally, any generated ASR-rerandomized service binaries should automatically be removed when the corresponding service is reinstalled, so as to prevent that stale ASR binaries end up in a generated image.+As part of this, the generated instrumentation passes should not be placed in the ''minix/llvm/bin'' subdirectory of the MINIX3 source tree. Instead, they should end up in an appropriate subdirectory of ''obj.i386'', thereby keeping the source directory clean.
  
 ==== Instrumentation ==== ==== Instrumentation ====
Line 869: Line 734:
 === Type unification === === Type unification ===
  
-As shown in the developers guide, the magic instrumentation pass is not always capable of establishing that two different data types are in fact compatible, resulting in state transfer errors at run time. The main cause of these issues lies in LLVM's use of the **opaque** placeholder data type.+As shown in the developers guide, the magic instrumentation pass is not always capable of establishing that two different data types are in fact compatible, resulting in state transfer errors at run time. The main cause of these issues lies in LLVM's use of the **opaque** placeholder data type. We described the practical results of this in the earlier "​Incompatible types" section.
  
-This problem is a product of circumstances. Between LLVM 2.x and LLVM 3.x, a significant change was made in LLVM regarding the way that types are handled. In a nutshell, rather than unifying various instances of the same data type at compile time, LLVM 3.x keeps these instances as separate ​type, instead using bit casting between the types to resolve the resulting incompatibilities at link time. More details about this change can be found in the LLVM blog post +This problem is a product of circumstances. Between LLVM 2.x and LLVM 3.x, a significant change was made in LLVM regarding the way that types are handled. In a nutshell, rather than unifying various instances of the same data type at compile time, LLVM 3.x keeps these instances as separate ​types, instead using bit casting between the types to resolve the resulting incompatibilities at link time. More details about this change can be found in the LLVM blog post 
 [[http://​blog.llvm.org/​2011/​11/​llvm-30-type-system-rewrite.html|LLVM 3.0 Type System Rewrite]] by Chris Lattner. [[http://​blog.llvm.org/​2011/​11/​llvm-30-type-system-rewrite.html|LLVM 3.0 Type System Rewrite]] by Chris Lattner.
  
-However, the magic framework was written for LLVM 2.x, and as a result, this problem was dealt with as an afterthought. The combination of the wildly varying forms that these bit casts can take, and the limited support for processing the bit casts in the magic pass, has resulted in the situation that not all cases of identical types  +However, the magic framework was written for LLVM 2.x, and as a result, this problem was dealt with as an afterthought. The combination of the wildly varying forms that these bit casts can take, and the limited support for processing the bit casts in the magic pass, has created ​the situation that not all cases of identical types are properly registered as //​compatible types//​. ​As of writing, this has not yet been a real problem, but it is likely to become a problem in the future.
-As of writing, this is not yet a real problem, but it eventually will be.+
  
-We believe that the right solution would be a new **type unification pass**, which unifies ​all effectively-identical types in the module at link time, eliminating redundant types and bitcasts in the module. ​This pass would be run before the magic pass, thus resolving ​the original ​problem ​while also freeing ​the magic pass of the burden to provide a complete system for enumerating compatible types. As a beneficial side effect, there would be a reduction in the amount of type state that needs to be included with the service, and a reduction in effort needed by libmagic ​to search through compatible types.+We believe that the right solution would be the introduction of a new **type unification pass**. This pass would unify all effectively-identical types in the module at link time, eliminating redundant types and bitcasts in the module. ​The pass could then be run before the magic pass. This would not only resolve ​the complete ​problem, but also free the magic pass of the burden to provide a complete system for enumerating compatible types. As a beneficial side effect, there would be a reduction in the amount of type state that needs to be included with the service, and a reduction in effort needed by libmagicrt ​to search through compatible types.
  
-=== Exceptions for libmagic ​===+=== ASR skipping libmagicrt ​===
  
-The instrumentation framework currently makes more exceptions than it should. In particular, the ASR pass exempts all of the magic library from rerandomization. This is highly problematic for the overall effectiveness of ASR: libmagic ​is in principle linked with all system services, thus providing any attacker with a well known, large, unrandomized set of code and data for use in an attack on any running service. The exact reasons as to why this exception was made are currently unknown. However, if possible, this overall limitation should be resolved by either removing the exception or at least narrowing it to the exact scope of the problem.+The ASR pass currently ​exempts all of the magic runtime ​library from rerandomization. This is highly problematic for the overall effectiveness of ASR: libmagicrt ​is in principle linked with all system services, thus providing any attacker with a well known, large, unrandomized set of code and data for use in an attack on any running service.
  
-In addition, although less importantly, state transfer makes some exceptions based on name prefixes, and some of these name prefixes are overly broad. For example, it is not impossible that the current exception of the prefix ''st_'' also ends up matching certain variables in the actual service. At the very least, all exception prefixes should start with ''magic_''.+The exact reasons as to why this exception was made are currently unknown. However, if possible, this overall limitation should be resolved by either removing the exception or at least narrowing it to the exact scope of the problem.
  
 ==== Memory management ==== ==== Memory management ====
Line 891: Line 755:
 === Region transfer issues === === Region transfer issues ===
  
-A problem which we already flagged earlier on, is the issue that for live update, transfer of in particular memory-mapped pages requires these pages to be in a strictly delineated address range. This range may not overlap with any of the process's other sections' address ranges. The range is hardcoded globally, and thus, defined much more strictly than necessary for most service processes. Moreover, the definition indiscriminately affects all processes, including application processes. The result is that when the system is built with live update support, all processes are severely restricted in how much of their address space they can use for memory-mapped regions.+As we already mentioned earlier, the transfer of memory-mapped pages requires that these pages be in a strictly delineated address range. This range may not overlap with any of the process's other sections' address ranges. The range is hardcoded globally, and thus, defined much more strictly than necessary for most service processes. Moreover, the definition indiscriminately affects all processes, including application processes. The result is that if the system is built with live update support, all processes are severely restricted in how much of their address space they can use for memory-mapped regions. Conversely, if the system is not built with live update support, even identity transfer may fail.
  
 Another problem, mentioned before, is the bulk transfer of all pages in the process's mmap section, regardless of whether the state transfer framework knows about them. This could easily lead to memory leaks due to transfer of untracked pages. Another problem, mentioned before, is the bulk transfer of all pages in the process's mmap section, regardless of whether the state transfer framework knows about them. This could easily lead to memory leaks due to transfer of untracked pages.
  
-We believe that both points could be resolved with a system that does not automatically transfer memory-mapped pages from the old to the new instance, but rather performs such transfer on demand, so that the (identity or magic) state transfer routine can determine what memory to transfer.+We believe that both points could be resolved with a system that does not automatically transfer memory-mapped pages from the old to the new instance, but rather performs such transfer on demand, so that the (identity or magic) state transfer routine can determine ​exactly ​what memory to transfer.
  
 === Out-of-memory issues === === Out-of-memory issues ===
  
-MINIX3 currently does not deal well with running out of memory. Most system services do not have preallocation for pages in their heap, stack, and mmap sections. This may create major issues in low-memory situations. For example, if a service attempts to use an extra page of stack while the system has no available ​memory, the service will be killed. Beyond VM freeing cached file system data when it runs out of memory, any sort of infrastructure to deal with this general problem is completely absent.+MINIX3 currently does not deal well with running out of memory. Most system services do not have preallocated ​pages in their heap, stack, and mmap sections. This may create major issues in low-memory situations. For example, if a service attempts to use an extra page of stack while the system has no free memory, the service will be killed, possibly taking down the entire system with it. Beyond VM freeing cached file system data when it runs out of memory, any sort of infrastructure to deal with this general problem is completely absent.
  
-Live update is making this situation even more problematic. The magic library uses more dynamic memory, and is not particularly careful about using preallocated memory ​when necessary. The ASR functionality increases memory usage, ​including the use of stack space through ​its stack padding feature. The result is that there is now an increasingly large number of scenarios where out-of-memory conditions result in failure of running system services, and possibly the entire system.+The live update ​and rerandomization support ​is making this situation even more problematic. The magic runtime ​library uses extra dynamic memory, and is not particularly careful about using preallocated memory ​where necessary. The ASR functionality increases memory usage even further. For example, its stack padding feature ​requires a considerable amount of extra stack space. The result is that there is now an increasingly large number of scenarios where out-of-memory conditions result in failure of running system services, and possibly the entire system.
  
-Even though certain services should be rewritten to deal more gracefully with cases of dynamic memory allocation failure, the example of faulting in stack pages shows that this is not a viable option in general. There has been a partial attempt to prepare file system services' buffer caches for having their memory stolen by VM at run time, but its implementation is, where present, deeply flawed, and will likely be removed altogether soon. Instead, we believe that the easiest solution for this problem is to let VM reserve a limited amount of memory exclusively for satisfying page faults and page-handling requests involving memory in system services.+Even though certain services should be rewritten to deal more gracefully with cases of dynamic memory allocation failure, the example of faulted-in stack pages illustrates that this is not a viable option in general. There has been a partial attempt to prepare file system services' buffer caches for having their memory stolen by VM at run time, but its implementation is, where present, deeply flawed, and will likely be removed altogether soon. Instead, we believe that the easiest solution for this problem is to let VM reserve a certain amount of memory exclusively for satisfying page faults and page-handling requests involving memory in system services.
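
To make the proposed reservation scheme concrete, the following is a toy model of a free-page pool in which a fixed number of pages can only be consumed on behalf of system services. It is a sketch of the idea only: VM contains no such code, and the ''page_pool'' structure and the ''pool_alloc'' and ''pool_free'' functions are hypothetical names introduced for this illustration.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical toy model of a free-page pool with a reserve for services. */
struct page_pool {
    size_t free_pages;     /* number of pages currently free */
    size_t reserved_pages; /* pages that only system services may consume */
};

/*
 * Try to take 'count' pages from the pool. Requests made on behalf of a
 * system service may use every free page. All other requests must leave
 * 'reserved_pages' pages untouched.
 */
static bool pool_alloc(struct page_pool *p, size_t count, bool is_service)
{
    size_t keep = is_service ? 0 : p->reserved_pages;

    if (p->free_pages < keep || p->free_pages - keep < count)
        return false;
    p->free_pages -= count;
    return true;
}

/* Return 'count' pages to the pool. */
static void pool_free(struct page_pool *p, size_t count)
{
    p->free_pages += count;
}
```

With ten free pages of which four are reserved, an application can obtain at most six pages, after which only service requests, such as a page fault on a service's stack, still succeed. The appropriate size of the reserve is left open here.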
  
-In the meantime, it can be expected that **test64** of the MINIX3 test set, the test case that tests one particular scenario of running out of memory, will cause test or system failure in an increasing number of cases. It may have to be removed from the default set of tests in the short term.
+In the meantime, it can be expected that **test64** of the MINIX3 test set, the test case that tests one particular scenario of running out of memory, will cause test or system failure in an increasing number of cases. It may have to be removed from the default set of tests in the short term.
  
 === Contiguous/DMA memory ===
  
-In addition, MINIX3 does not deal well with running out of //special// memory. Some services require blocks of physically contiguous memory for DMA purposes. VM currently has no way to recombine fragmented blocks of free memory into contiguous ranges. Some services require memory that is located in the lower 1MB or 16MB of the system memory. The support in VM for obtaining memory in those ranges is very limited as well. Both of these cases may result in the inability for a system service to obtain its needed resources if it is not started immediately at system bootup time.
+In addition, MINIX3 does not deal with running out of physically contiguous memory at all. Some services require blocks of physically contiguous memory for DMA purposes. VM currently has no way to recombine fragmented blocks of free memory into larger physically contiguous ranges. Furthermore, some services require memory that is located in the lower 1MB or 16MB of the physical system memory. The support in VM for obtaining memory in those ranges is very limited as well. Either case may leave a system service unable to obtain the resources it needs if it is not started immediately at system bootup time.
  
 These problems are not particularly important for live update, since the new instance will inherit special memory from the old instance by default. They are important for crash recovery, however, and they are known to cause failures in the ''testrelpol'' test set on occasion.
Line 915: Line 779:
 === Page protection ===
  
-Finally, support for setting or enforcing page protection bits is mostly missing in VM as well. The live update integration has resulted in one particular case where this is now a problem. The MINIX3 userspace threading library, libmthread, inserts a guard page at the bottom of each thread stack in order to detect stack overruns. The guard page was created by unmapping the bottom page of the stack, thus leaving an unmapped hole there. This approach worked, but was not ideal: the hole could potentially be filled by a separate one-page allocation later, thereby subverting the intended protection.
+Finally, support for setting or enforcing page protection bits is mostly missing in VM as well. The live update integration has resulted in one particular case where this is now a problem. The MINIX3 userspace threading library, libmthread, inserts a guard page at the bottom of each thread stack in order to detect stack overruns. The guard page was originally created by unmapping the bottom page of the stack, thus leaving an unmapped hole there. This approach worked, but was not ideal: the hole could potentially be filled by a separate one-page allocation later, thereby subverting the intended protection.
  
-Since libmagic performs extra memory allocations, this problem is a bit more relevant for live update. For this and other reasons, the libmthread code was changed to reallocate the guard page with ''PROT_NONE'' protection instead. Theoretically, this should work fine. In practice, since VM does not implement support for protection, the guard page is now simply an additional stack page. Thus, as of writing, the libmthread guard page functionality is broken.
+Since libmagicrt performs extra memory allocations, this problem is a bit more relevant for live update. For this and other reasons, the libmthread code was changed to reallocate the guard page with ''PROT_NONE'' protection instead. Theoretically, this should work fine. In practice, since VM does not implement support for protection, the guard page is now simply an additional stack page. Thus, as of writing, the libmthread guard page functionality is broken.
  
 Ideally, this issue would be resolved by implementing proper support for page protection in VM, including for example an implementation of mprotect(2).
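
For illustration, the following sketch shows what a guard page looks like on a POSIX system that does enforce page protection. It is not the libmthread code: ''stack_alloc_guarded'' and ''stack_free_guarded'' are hypothetical helpers, and the sketch assumes working ''mmap'' and ''mprotect'' calls, which is exactly what MINIX3 lacks as of writing.

```c
#define _DEFAULT_SOURCE		/* exposes MAP_ANON on glibc */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Round 'size' up to a whole number of pages. */
static size_t round_to_pages(size_t size, size_t page)
{
	return (size + page - 1) / page * page;
}

/* Hypothetical helper: allocate a stack of at least 'size' usable bytes
 * with one inaccessible guard page below it. Returns the lowest usable
 * address, or NULL on failure. */
static void *stack_alloc_guarded(size_t size)
{
	size_t page = (size_t)sysconf(_SC_PAGESIZE);
	char *base;

	size = round_to_pages(size, page);

	/* A single mapping covers both the guard page and the stack, so
	 * no unrelated allocation can end up in the guard page's place. */
	base = mmap(NULL, page + size, PROT_READ | PROT_WRITE,
	    MAP_PRIVATE | MAP_ANON, -1, 0);
	if (base == MAP_FAILED)
		return NULL;

	/* Revoke all access to the lowest page: an overrun of the
	 * downward-growing stack now faults instead of corrupting data. */
	if (mprotect(base, page, PROT_NONE) != 0) {
		munmap(base, page + size);
		return NULL;
	}

	return base + page;
}

/* Release a stack obtained from stack_alloc_guarded() with the same size. */
static void stack_free_guarded(void *stack, size_t size)
{
	size_t page = (size_t)sysconf(_SC_PAGESIZE);

	munmap((char *)stack - page, page + round_to_pages(size, page));
}
```

Because the guard page stays mapped, only without access rights, the address range remains occupied. This avoids the weakness of the original unmapping approach, where a later one-page allocation could fill the hole.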
Line 927: Line 791:
 === Default states ===
  
-The case of userspace threads has shown that it may be necessary for certain services to provide their own handlers for checking, entering, and leaving a custom state of quiescence. Moreover, these services may crash if the default quiescence state is used for a live update instead of the custom state. The result is the requirement that both users and scripts, the update_asr(8) script in particular, be aware of specific services requiring custom quiescence state. This is annoying and dangerous.
+The case of userspace threads has shown that it may be not just useful, but actually //necessary// for certain services to provide their own handlers for checking, entering, and leaving a custom state of quiescence. These services may crash if the default quiescence state is used for a live update instead of the custom state. The result is the requirement that not just users, but also scripts, the update_asr(8) script in particular, be aware of specific services requiring a custom quiescence state. This is inconvenient and dangerous.
  
-The default quiescence state is currently hardcoded in the service(8) utility, in the form of ''DEFAULT_LU_STATE'' in ''minix/commands/service/service.c''. Instead, we believe that the service should be able to specify its own default quiescence state, possibly using an additional SEF API call. It is not clear whether RS would need to be aware of the alternative quiescence state. If not, the translation from a pseudo-state to the real state could take place entirely in the service's own SEF routines. If this approach does not work, it would also be possible to somehow expose each service's default state through the procfs per-service ''/proc/service/'' files, so that at least scripts could add any custom ''-state'' options automatically.
+The default quiescence state is currently hardcoded in the minix-service(8) utility, in the form of ''DEFAULT_LU_STATE'' in ''minix/commands/minix-service/minix-service.c''. Instead, we believe that the service should be able to specify its own default quiescence state, possibly using an additional SEF API call. It is not yet clear whether RS would need to be aware of the alternative quiescence state. If not, the translation from a pseudo-state to the real state could take place entirely in the service's own SEF routines. Otherwise, the SEF may have to send the default state as extra data to RS at service initialization time.
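
The service-side translation from a pseudo-state to the real state could look roughly as follows. This is only a sketch of the proposal: no such SEF API call exists, and all names and values here (''LU_STATE_PSEUDO_DEFAULT'', ''lu_set_default_state'', ''lu_resolve_state'') are hypothetical.

```c
#include <assert.h>

/* Hypothetical state numbers, not the actual SEF constants. */
enum {
	LU_STATE_PSEUDO_DEFAULT = 0,	/* "whatever the service prefers" */
	LU_STATE_GENERIC = 1,		/* stands in for the generic default */
	LU_STATE_CUSTOM_BASE = 100	/* first service-defined state */
};

/* The default state that this service wants for its live updates. */
static int service_default_state = LU_STATE_GENERIC;

/* Hypothetical SEF call: the service registers its own default state,
 * typically once at initialization time. */
static void lu_set_default_state(int state)
{
	service_default_state = state;
}

/* Translate the state requested for a live update into the state that
 * the service must actually reach. Only the pseudo-state is rewritten;
 * an explicitly requested state is passed through unchanged. */
static int lu_resolve_state(int requested)
{
	if (requested == LU_STATE_PSEUDO_DEFAULT)
		return service_default_state;
	return requested;
}
```

In this variant, the utility would send the pseudo-state unless the user specifies a state explicitly, and neither users nor scripts would need to know which services require a custom state.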
  
 === Policy redundancy ===
Line 935: Line 799:
 While the following issue is relevant more for crash recovery than for live update, it is included here because it affects the infrastructure supporting the ''testrelpol'' script.
  
-Each service effectively knows what its own crash recovery policy should be. Separately, procfs has a policy table with an entry for each service in ''minix/fs/procfs/service.c'', exposing the same crash recovery policy information to userland, and the //testrelpol// script in particular. This is effectively redundant information.
+Each service effectively knows what its own crash recovery policy should be. Separately, procfs has a policy table with an entry for each service in ''minix/fs/procfs/service.c'', containing the same crash recovery policy information, for export to userland and ''testrelpol'' in particular. This is effectively redundant information.
  
 Ideally, each service would communicate its policy to RS. That information can then be used by procfs to expose the policy information to userland, thus eliminating the redundancy.
 +
 +=== Live update of VM ===
 +
 +Earlier in this document, we have described the limitations of performing live updates on the VM service, as well as the reasons behind these limitations. Despite a large number of exceptions that allow VM to be updated at all, the resulting situation is that VM still cannot be subjected to any meaningful type of update.
 +
 +It is unclear whether all these limitations are fundamental,​ however. We believe it may be possible to restructure the VM live update facilities to resolve at least some of the limitations. For example, it might be possible to store the pagetables in a separate memory section, and make actual copies of all or most other dynamic memory in VM. The out-of-band region could then be limited to the pagetable memory, thus allowing for relocation of at least static memory. Furthermore,​ more explicit rollback support in the old VM instance might even allow changes to VM's own pagetable, thereby possibly allowing dynamic memory allocation during the live update. It remains to be seen whether any of this is possible in practice.
  
 === Timed retries of safecopies ===
  
-If process A is being updated, process B should not make use of process A's grants, because those grants may temporarily be inaccessible, invalid, etcetera. The kernel currently has a simple way to enforce that rule, by responding to process B's safecopy kernel call with an ''ENOTREADY'' error response. The service-side libsys implementation of sys_*safecopy*(2) automatically suspends the calling service for a short while (using tickdelay(3)) and then retries the safecopy. This shortcut approach works, but it is not ideal, in particular because it could theoretically lead to starvation.
+If process A is being updated, process B should temporarily not make use of process A's grants, because during the live update, those grants may be inaccessible, invalid, etcetera. The kernel currently has a simple way to enforce that rule, by responding to process B's safecopy kernel call with an ''ENOTREADY'' error response whenever process A is being updated. The service-side libsys implementation of sys_*safecopy*(2) automatically suspends the calling service for a short while (using tickdelay(3)) and then retries the safecopy. This shortcut approach works, but it is not ideal: it should not be the responsibility of system services to determine when the safecopy can be retried, and the approach could lead to starvation.
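
The retry behavior described above can be sketched as follows. This is not the libsys code: the callback types, the ''safecopy_retry'' function, and the simulated target are hypothetical, the fallback value for ''ENOTREADY'' is arbitrary, and the ''max_tries'' bound is an addition for this example that makes the starvation case visible.

```c
#include <assert.h>
#include <errno.h>

/* Fallback for hosts that do not define MINIX's ENOTREADY; the value
 * is arbitrary and only serves this sketch. */
#ifndef ENOTREADY
#define ENOTREADY 1001
#endif

typedef int (*copy_fn_t)(void *arg);		/* one safecopy attempt */
typedef void (*delay_fn_t)(unsigned ticks);	/* stands in for tickdelay(3) */

/* Retry a safecopy for as long as the target is being updated. After
 * 'max_tries' attempts the caller gives up with ENOTREADY. */
static int safecopy_retry(copy_fn_t do_copy, void *arg, delay_fn_t delay,
	unsigned max_tries)
{
	unsigned tries;
	int r;

	for (tries = 0; tries < max_tries; tries++) {
		r = do_copy(arg);
		if (r != ENOTREADY)
			return r;	/* success, or an unrelated error */
		delay(1);		/* sleep briefly, then try again */
	}
	return ENOTREADY;
}

/* Simulated target that is "being updated" for its first few calls. */
struct fake_target {
	int busy_calls;		/* attempts that will still see ENOTREADY */
	int copies_done;	/* successful copies so far */
};

static int fake_safecopy(void *arg)
{
	struct fake_target *t = arg;

	if (t->busy_calls > 0) {
		t->busy_calls--;
		return ENOTREADY;
	}
	t->copies_done++;
	return 0;
}

static unsigned delayed_ticks;	/* total ticks spent waiting */

static void fake_delay(unsigned ticks)
{
	delayed_ticks += ticks;
}
```

The sketch shows why the approach is fragile: the caller has to guess how long to wait, and a target that is updated repeatedly can keep the caller in the loop until it gives up.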
  
-Instead, the kernel should block the caller of a safecopy kernel call for the duration of its target's live update procedure, retrying the safecopy operation and unblocking the caller only once the target is no longer being updated. A proper implementation of this functionality requires several cases to be covered: indirect grants, either the granter or the grantee being terminated or having its process slots swapped, etcetera. As a possible simplification, internally retrying the safecopy operation more than once would not be a problem, since the caller would simply remain blocked if the retried safecopy operation hits a case of live update again.
+Instead, the kernel should block the caller of a safecopy call for the duration of its target's live update procedure, retrying the safecopy operation and unblocking the caller only once the target is no longer being updated. A proper implementation of this functionality requires several cases to be covered: indirect grants, either the granter or the grantee being terminated or having its process slots swapped, etcetera. As a possible simplification, the kernel could internally retry the safecopy operation more often than necessary, since the caller would simply remain blocked if the retried safecopy operation hits a case of live update again.
  
 === Copying asynsend tables ===
  
-In a very specific scenario, the kernel performs a memory copy of the entire asynsend table between two processes whose slots are being swapped. Although it is not yet clear which exact circumstances cause the need for this memory copy, the actual copy action relies on very specific conditions which are not validated before the copy action.
+In a very specific scenario, the kernel performs a memory copy of the entire asynsend table between two processes whose slots are being swapped. Although it is not yet clear which exact circumstances cause the need for this memory copy, the actual copy action relies on very specific conditions which are not fully validated before the copy action. Thus, this is a rather dangerous kernel feature.
  
 A rather long comment in ''minix/lib/libsys/sef_liveupdate.c'' elaborates on the specifics of this case, and suggests why RS may be the only affected service. If the comment is correct, it may be possible to engineer another solution for RS in particular, and remove the copy hack from the kernel.
Line 957: Line 827:
 === Performance ===
  
-The performance of various parts of the live update infrastructure, both at instrumentation time and (in particular) at run time, is not fantastic. One of the effects is that in several cases, live update operations must be given a lenient timeout in order to succeed. In fact, state transfer currently takes too long to consider automatic runtime ASR rerandomization as a realistic option.
+The performance of various parts of the live update infrastructure is not fantastic. This is true for both the instrumentation passes and, more importantly, the run-time functionality. As one of the effects, live update operations may have to be given a lenient timeout in order to succeed. In fact, state transfer currently takes too long to consider automatic runtime ASR rerandomization as a realistic option.
  
-We have not yet looked into the causes of the poor performance. This is therefore rather open-ended issue.
+We have not yet looked into the causes of the poor performance. Part of it may be due to the extra memory allocations performed by libmagicrt, but that is only a guess. This issue is therefore rather open-ended. Statistical profiling may provide at least some hints.
 + 
 +=== Grant table transfer === 
 + 
 +Currently, the safecopy memory grant tables of system services are transferred as is: the main union of the ''​cp_grant_t''​ structure as defined in ''​include/​minix/​safecopies.h''​ is marked as **ixfer**. 
 +In some scenarios, however, it is possible that during a service'​s live update, the service has grants allocated for remote services. For direct grants (of type ''​CPF_DIRECT''​),​ ''​cp_direct.cp_start''​ is actually a pointer into the local address space. The identity transfer therefore prevents this local pointer from being updated. Especially with ASR, there is a risk that after the live update, the grant points to arbitrary memory within the updated service. In the worst case, a remote user of the grant may end up overwriting this arbitrary memory in the updated service. 
 + 
 +To resolve this, the grant structure should not be using **ixfer** for its main union. This probably means that a custom state transfer routine for the grant structure must be written, so as to use a pointer transfer only for ''​CPF_DIRECT''​ grants. 
 + 
 +The same does //not// apply to magic grants (of type ''​CPF_MAGIC''​),​ as ''​cp_magic.cp_start''​ is an address in a remote process, which is either a userland process or a system process blocked on a call to VFS (as of writing, only VFS can use magic grants at all), and thus never subject to live update while the magic grant is active.
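
A custom transfer routine along these lines might look as follows. This is a simplified sketch, not the real ''cp_grant_t'' layout or the magic state transfer API: the ''grant'' and ''reloc'' structures and both functions are hypothetical, and the relocation map merely stands in for whatever pointer translation the state transfer framework provides.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-ins for the CPF_DIRECT and CPF_MAGIC grant types. */
enum grant_type { GRANT_DIRECT, GRANT_MAGIC, GRANT_OTHER };

/* Hypothetical, flattened view of a grant table entry. */
struct grant {
	enum grant_type type;
	uintptr_t start;	/* local address (direct) or remote (magic) */
	size_t len;
};

/* Hypothetical record of one memory range that moved during the update. */
struct reloc {
	uintptr_t old_base;
	uintptr_t new_base;
	size_t size;
};

/* Map an address in the old instance to its place in the new instance.
 * Addresses outside all relocated ranges are returned unchanged. */
static uintptr_t relocate(uintptr_t addr, const struct reloc *map, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (addr >= map[i].old_base &&
		    addr - map[i].old_base < map[i].size)
			return map[i].new_base + (addr - map[i].old_base);
	}
	return addr;
}

/* Transfer one grant. Only direct grants hold a pointer into the local
 * address space, so only their start address is relocated; everything
 * else is copied as is (the equivalent of an identity transfer). */
static void grant_transfer(struct grant *dst, const struct grant *src,
	const struct reloc *map, size_t n)
{
	*dst = *src;
	if (src->type == GRANT_DIRECT)
		dst->start = relocate(src->start, map, n);
}
```

The important property is the asymmetry: the same numeric address is rewritten when it belongs to a direct grant and left alone when it belongs to a magic grant, since the latter refers to another process's address space.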
  
 === Testrelpol failure ===
  
-After running the ''testrelpol'' script a number of times in a row, it will start to fail on the crash recovery tests for unclear reasons. We know that this is a test script failure rather than an actual failure. We suspect that it is caused by RS's default exponential backoff algorithm for crash recovery causing timeouts in //testrelpol//. If that is the case, it should be possible to change //testrelpol// to disable the exponential backoff using existing service(8) flags.
+If the ''testrelpol'' script is run a number of times in a row, it will start to fail on the crash recovery tests for unclear reasons. We know that this is a test script failure rather than an actual failure. We suspect that it is caused by RS's default exponential backoff algorithm for crash recovery causing timeouts in //testrelpol//. If that is the case, it should be possible to change //testrelpol// to disable the exponential backoff using existing minix-service(8) flags.
  
-=== Libmagic asserts ===
+=== Libmagicrt asserts ===
  
-The implementation of the magic library currently relies on asserts being enabled. We have changed its Makefile so that asserts should be enabled regardless of build system settings, but this is merely a workaround. Instead, libmagic should function properly (and, in particular, fail properly) regardless of whether asserts are enabled.
+The implementation of the magic runtime library currently relies on asserts being enabled. We have changed its Makefile so that asserts should be enabled regardless of build system settings, but this is merely a workaround. Instead, libmagicrt should function properly (and, in particular, fail properly) regardless of whether asserts are enabled.
  
 === VM fork warning ===
Line 976: Line 855:
  
 The error indicates that it is currently not possible to mark physically contiguous memory as copy-on-write, which is true. However, the error may occur during a live update, when VM copies over the memory-mapped pages of a service's old instance to the new instance. The error is therefore not the result of a fork(2) call. In addition, the error code returned by the function producing the error message is ignored by its caller, with the result that the reference count of the contiguous memory range is increased anyway, which is exactly what needs to happen for live update operations. Thus, during live updates, this error is both misleading and meaningless. However, we have to review whether it is still useful to keep around the error for other scenarios.
 +
 +=== State transfer prefixes ===
 +
 +State transfer makes exceptions based on name prefixes. Some of these name prefixes are overly broad. For example, it is possible that the current exception of the prefix ''​st_''​ also ends up matching certain variables in actual service code by accident. At the very least, all exception prefixes should start with ''​magic_''​.
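
The risk of an overly broad prefix is easy to demonstrate. The helper below is a plain prefix match, not the actual matching code of the state transfer framework, and the variable names used with it are made up for this example.

```c
#include <assert.h>
#include <string.h>

/* Return 1 if 'name' starts with 'prefix', and 0 otherwise. */
static int has_prefix(const char *name, const char *prefix)
{
	return strncmp(name, prefix, strlen(prefix)) == 0;
}
```

A service variable named ''st_mode_cache'' (hypothetical) would match the exception prefix ''st_'' by accident, whereas a prefix such as ''magic_st_'' would only match names that were deliberately placed in the magic namespace.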
  
 ===== Further reading =====
Line 982: Line 865:
  
   * Cristiano Giuffrida, [[http://www.minix3.org/theses/Cristiano_Giuffrida_PhD_thesis.pdf|Safe and Automatic Live Update]], Ph.D. thesis, 2014
 +
developersguide/liveupdate.txt · Last modified: 2022/02/12 22:42 by stux