developersguide:liveupdate

Differences

This shows you the differences between two versions of the page.

developersguide:liveupdate [2015/09/23 16:25]
dcvmoole Fix up links after page move
developersguide:liveupdate [2022/02/12 22:42]
stux renamed service(8) to minix-service(8) in various places
Line 17: Line 17:
 The state transfer aspect of live update relies heavily on compile-time and in particular link-time instrumentation of system services. This instrumentation is implemented in the form of LLVM "optimization" passes, which operate on LLVM bitcode modules. In most cases, these passes are run after (initial) program linking, by means of the LLVM Link-Time Optimization (LTO) system. Thus, in order to support live update and rerandomization, the system must be compiled using LLVM bitcode and with LTO support. The LLVM pass that performs the static analysis and link-time instrumentation for live update is called the **magic pass**.
 
-In addition, live updates require runtime support for state transfer in each service. For this reason, system services are relinked with a library that provides all the run-time functionality which ultimately allow a new service instance to perform state transfer from its old instance. This library is called the **magic library** or //libmagic//. Together, the magic pass and library make up the **magic framework**.
+In addition, live updates require runtime support for state transfer in each service. For this reason, system services are relinked with a library that provides all the run-time functionality which ultimately allow a new service instance to perform state transfer from its old instance. This library is called the **magic runtime library** or //libmagicrt//. Together, the magic pass and runtime library make up the **magic framework**.
 
 ==== Live rerandomization ====
Line 25: Line 25:
 The fundamental approach consists of a two-step process. First, new versions of the service program are generated, using link-time randomization of various parts of its program binary. Ideally, this would be done at run time; due to various limitations, MINIX3 currently only supports pregenerated randomized binaries of system services. Then, at runtime, the live update system is used to update from one randomized version of each service to another.
 
-The randomization of binaries is done with another link-time pass, called the **asr pass**. The magic library implements various runtime aspects of ASR rerandomization during live update.
+The randomization of binaries is done with another link-time pass, called the **asr pass**. The magic runtime library implements various runtime aspects of ASR rerandomization during live update.
 
 ===== Users guide =====
Line 33: Line 33:
 ==== Setting up the system ====
  
-We cover all the steps to set up a MINIX3 system that is ready for live update and rerandomization. For now, it requires crosscompilation as well as an additional build of the LLVM source code. The procedure is for x86 targets only. The current procedure is not quite ideal, but it is what we have right now, and it should work.
+We cover all the steps to set up a MINIX3 system that is ready for live update and rerandomization. For now, it requires crosscompilation as well as an additional build of the LLVM source code. The procedure is for x86 targets only.
 
-After setting up an initial environment, the MINIX3 update cycle basically consists of four steps: obtaining or updating the MINIX3 source code, building the system, instrumenting the system services, and generating a bootable image. We will go through all steps in detail. At the end of this section, there is also a summary of the commands to issue.
+The current procedure has been tested only from **Linux** as host platform, and may require minor adjustments on other host platforms. We provide a few additional instructions for those other platforms, but these may currently not be complete. Please feel free to add more instructions to this page, and/or open GitHub issues for other platforms and link to them from here.
+
+After setting up an initial environment, the first step is to obtain the MINIX3 source code. After that, the next step is to build an LLVM toolchain with LTO support, which is needed because the regular MINIX3 crosscompilation LLVM toolchain does not include LTO support (yet - we are working on this). Once the LTO-supporting toolchain has been built, the final step is to build the MINIX3 sources, with extra flags to enable magic instrumentation and possibly ASR rerandomization.
+
+Once these steps have been completed successfully for the first time, one can later update the MINIX3 source and then rebuild the system. The LTO-supporting toolchain need not be rebuilt unless we upgrade the LLVM source code itself.
+
+We will now go through all steps in detail. At the end of this section, there is also a summary of the commands to issue.
 
 All of the commands in this section are to be performed on the crosscompilation host system rather than on MINIX3. None of the commands, except the Linux-specific ''sudo apt-get'' example in the first subsection, require more than ordinary user privileges.
Line 51: Line 57:
   /home/user/minix-liveupdate/obj.i386
 
-You have to choose a location for the containing directory, and create it yourself. The three subdirectories will be created automatically as part of the following steps. In terms of disk space usage, expect to be needing a bare minimum of **30GB** for the combination of these three subdirectories, with a recommended **40GB** of available space.
+You have to choose a location for the containing directory, and create it yourself. The three subdirectories should be created automatically as part of the following steps. However, it has been reported that on some platforms (e.g., FreeBSD), some or all of these directories have to be created manually; this can be done with nothing more than a few basic ''mkdir'' commands. In terms of disk space usage, expect to be needing a bare minimum of **30GB** for the combination of these three subdirectories, with a recommended **40GB** of available space.
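If the subdirectories are not created automatically on your host platform, a minimal sketch of creating them by hand could look as follows. The helper name ''make_liveupdate_dirs'' is made up for this example, and the subdirectory names (''minix-src'', ''obj_llvm.i386'', ''obj.i386'') are assumptions taken from the paths used elsewhere on this page; adjust them if your layout differs.

```shell
#!/bin/sh
# Hypothetical helper (not part of the MINIX3 tree): create the containing
# directory and the three subdirectories the build procedure expects.
# The subdirectory names are assumptions based on paths used on this page.
make_liveupdate_dirs() {
    base="$1"
    if [ -z "$base" ]; then
        echo "usage: make_liveupdate_dirs <containing-directory>" >&2
        return 1
    fi
    for sub in minix-src obj_llvm.i386 obj.i386; do
        # mkdir -p also creates the containing directory if needed and
        # does not fail when a directory already exists.
        mkdir -p "$base/$sub" || return 1
    done
}
```

For the example layout used on this page, the call would be ''make_liveupdate_dirs /home/user/minix-liveupdate''.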
  
 === Obtaining or updating the MINIX3 source code ===
Line 69: Line 75:
 In both cases, the next step is now to build the source code.
  
-=== Building the system ===
+=== Building the LTO toolchain ===
 
-The next step consists of building the system. When run for the first time, this step will also build the LLVM LTO infrastructure, the crosscompilation tools, and the instrumentation. The first run may take several hours.
+The second step is to build the LLVM LTO infrastructure, if it has not yet been built before. Eventually, this will be done automatically as part of the regular build. For now, we have a script that can perform the build, called ''generate_gold_plugin.sh''. It is located in the ''minix/llvm'' subdirectory of the MINIX3 source tree. The basic procedure therefore consists of the following steps (but read this entire section first):
 
-The center of all the instrumentation activities is the ''minix/llvm'' subdirectory of the MINIX3 source tree. This directory contains the instrumentation passes, runtime library, and supporting scripts. This step and the next steps therefore assume this subdirectory as the current directory:
+  $ cd /home/user/minix-liveupdate/minix-src/minix/llvm
+  $ ./generate_gold_plugin.sh
 
-  $ cd minix-src/minix/llvm
+On some platforms, it may be needed to specify the C/C++ compiler and/or the name of the GNU make utility, which can be done as follows:
 
-It may be necessary to ensure that clang is used as the compiler, by exporting the following shell variables. GCC should work as well, but has not been tested as thoroughly.
+  $ CC=clang CXX=clang++ MAKE=make ./generate_gold_plugin.sh
 
-  $ export CC=clang CXX=clang++
+On FreeBSD and similar platforms, one may have to ensure that GNU make is installed (typically as ''gmake'') first, and pass in ''MAKE=gmake'' to point to it.
 
-Then, the system can be built with support for instrumentation by running the ''configure.llvm'' script in the current directory, with the ''MKMAGIC'' build variable set to //yes//. To build the infrastructure and system without parallel compilation, simply run the script this way:
+This step may take several hours. It can be sped up by supplying a number of parallel jobs, through a ''JOBS=n'' variable:
 
-  $ BUILDVARS="-V MKMAGIC=yes" ./configure.llvm
+  $ JOBS=n ./generate_gold_plugin.sh
 
-Alternatively, a number of parallel jobs may be supplied. It is typically advisable to use as many jobs as there are hardware threads of execution (i.e., CPU cores or hyperthreads) in the system:
+As stated before, after this command has finished successfully, it need not be reissued until LLVM is upgraded in the MINIX3 source tree. This is a rare event which is typically part of a larger resynchronization with NetBSD code, and we will clearly announce such events. When this happens, it may be advisable to remove the entire ''obj_llvm.i386'' directory as well as any files in ''minix-src/minix/llvm/bin'', before rerunning the ''generate_gold_plugin.sh'' script.
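The cleanup suggested before rerunning ''generate_gold_plugin.sh'' after an LLVM upgrade can be sketched as a small shell function. This is only an illustration: the function name ''clean_lto_build'' is hypothetical, and it assumes that ''obj_llvm.i386'' and ''minix-src'' both live directly under the containing directory, as in the examples on this page. It uses the ''-mindepth'' and ''-maxdepth'' options of ''find'', which GNU and BSD find support but which are not part of POSIX.

```shell
#!/bin/sh
# Hypothetical helper: remove the LTO toolchain objects and the installed
# files in minix/llvm/bin, so that generate_gold_plugin.sh starts fresh.
# Assumes the directory layout used in the examples on this page.
clean_lto_build() {
    base="$1"
    if [ -z "$base" ]; then
        echo "usage: clean_lto_build <containing-directory>" >&2
        return 1
    fi
    # Remove the entire LLVM object directory.
    rm -rf "$base/obj_llvm.i386"
    # Remove regular files (but not subdirectories) from minix/llvm/bin.
    bin="$base/minix-src/minix/llvm/bin"
    if [ -d "$bin" ]; then
        find "$bin" -mindepth 1 -maxdepth 1 -type f -exec rm -f {} +
    fi
}
```

With the example layout, this would be invoked as ''clean_lto_build /home/user/minix-liveupdate''. Since the function deletes files, it is worth double-checking the path before running it.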
  
-  $ JOBS=8 BUILDVARS="-V MKMAGIC=yes" ./configure.llvm
+=== Building the system ===
 
-After the first run, the ''configure.llvm'' will perform recompilation of only the parts of the source code that have changed, and should not take nearly as long to run as the first time. In case of unexpected problems when rebuilding, it may be necessary to throw away the previously generated objects and rebuild the MINIX3 source code in its entirety. This can be done by going to the top-level ''obj.i386'' directory and deleting all files and directories in there, except the ''tooldir.{yourplatform}'' subdirectory. Fully rebuilding the MINIX3 source code will take longer than an incremental rebuild, but since the crosscompilation toolchain is left as is, it will still be nowhere close as long as the first run.
+The third step consists of building the system and generating a bootable image out of it. When run for the first time, this step will also build the regular (non-LTO) crosscompilation toolchain. The first run may therefore (also) take several hours. The build procedure is just like regular MINIX3 crosscompilation, differing in only two aspects.
 
-As explained in more detail on the [[.:crosscompiling|crosscompilation page]], it is also possible to rebuild particular parts of the system without going through the entire "make build" process. This involves the use of the ''nbmake-i386'' tool and generally requires a good understanding of the compilation process. It may be worth mentioning that the first ''configure.llvm'' run saves the ''MKMAGIC'' value, so this variable need not be passed to ''nbmake-i386'' each time. We give one example of how to use nbmake-i386 in a later section.
+First, the appropriate build variables must be passed in to enable the desired functionality. In order to build the system with live update support through magic instrumentation, the build system must be invoked with the ''MKMAGIC'' build variable set to //yes//. This will perform a bitcode build of the entire system, and perform magic instrumentation on all system services.
 
-=== Rebuilding the instrumentation ===
+In order to build the system with ASR instrumentation, the build system must be invoked with the ''MKASR'' build variable set to //yes//. This will automatically enable magic instrumentation, perform ASR randomization on all system services, and pregenerate a number of ASR-rerandomized service binaries for each service. This number can be controlled with an additional ''ASRCOUNT=n'' build variable, where the //n// value must be between 1 and 65536 (inclusive). The default //ASRCOUNT// is 3.
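Since ''ASRCOUNT'' has a limited valid range, it can be convenient to check the value on the host before starting a build that takes hours. The following is a minimal sketch of such a check; the function name ''asr_buildvars'' is hypothetical and not part of the MINIX3 build system. It only composes a string in the ''-V NAME=value'' form used for ''BUILDVARS'' on this page.

```shell
#!/bin/sh
# Hypothetical helper: validate an ASRCOUNT value against the documented
# range (1..65536, default 3) and print a matching BUILDVARS string.
asr_buildvars() {
    count="${1:-3}"
    # Reject anything that is not a plain sequence of digits.
    case "$count" in
        ''|*[!0-9]*)
            echo "ASRCOUNT must be a whole number" >&2
            return 1
            ;;
    esac
    if [ "$count" -lt 1 ] || [ "$count" -gt 65536 ]; then
        echo "ASRCOUNT must be between 1 and 65536" >&2
        return 1
    fi
    printf '%s\n' "-V MKASR=yes -V ASRCOUNT=$count"
}
```

The output can then be passed to the build, for example as ''BUILDVARS="$(asr_buildvars 4)"'' in front of the image build command.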
  
-When building the system for the first time, this step may be skipped, as it is performed automatically. However, when the source code is changed for any of the LLVM passes or the magic library - that is, the source code in ''minix/llvm'' - the changed component must be recompiled. **Warning**: updating the MINIX3 source code with ''git pull'' may also upgrade any of these components, in which case it is the responsibility of the user (you) to recompile and reinstall them!
+Second, in order to build a hard disk image suitable for use by the resulting bitcode builds, the ''x86_hdimage.sh'' script must be invoked with the **-b** flag. This will enlarge the generated image to account for the larger binaries, and enable inclusion of ASR-rerandomized binaries if necessary.
 
-Once we properly integrate the LLVM LTO infrastructure into the MINIX3 build system, this step should disappear altogether.
+These two aspects can be covered in a single build command. The following short procedure will build a hard disk image with magic instrumentation:
 
-== Rebuilding libmagic ==
+  $ cd /home/user/minix-liveupdate/minix-src
+  $ BUILDVARS="-V MKMAGIC=yes" ./releasetools/x86_hdimage.sh -b
 
-This substep must be performed whenever the source code of the magic library changes. This is necessary due to the fact that libmagic's dependency tracking is not working correctly, which means the automated step in ''configure.llvm'' may not recompile the library properly.
+In order to speed up the build, a number of parallel jobs may be supplied. It is typically advisable to use as many jobs as there are hardware threads of execution (i.e., CPU cores or hyperthreads) in the system:
 
-The source code of libmagic is located in the ''minix/llvm/static/magic'' subdirectory of the MINIX3 source code. To (re)compile and install libmagic, go to its source directory, issue a ''make clean'' and a ''make install'':
+  $ JOBS=8 BUILDVARS="-V MKMAGIC=yes" ./releasetools/x86_hdimage.sh -b
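To pick a ''JOBS'' value that matches the number of hardware threads without looking it up by hand, something like the following sketch may help. The function name ''detect_jobs'' is made up for this example. It tries ''nproc'' (GNU coreutils), then ''getconf _NPROCESSORS_ONLN'', then ''sysctl -n hw.ncpu'' (BSD-style systems), and falls back to 1 if none of them yields a number.

```shell
#!/bin/sh
# Hypothetical helper: print the number of online hardware threads,
# or 1 if it cannot be determined with the tools available on this host.
detect_jobs() {
    for probe in "nproc" "getconf _NPROCESSORS_ONLN" "sysctl -n hw.ncpu"; do
        # $probe is intentionally unquoted so that it splits into
        # a command name and its arguments.
        n=$($probe 2>/dev/null) || continue
        case "$n" in
            ''|*[!0-9]*) continue ;;   # not a plain number, try next probe
        esac
        [ "$n" -ge 1 ] && { echo "$n"; return 0; }
    done
    echo 1
}
```

The result can be used in place of a fixed number, for example ''JOBS=$(detect_jobs)'' in front of the build command shown above.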
  
-  $ cd static/magic
-  $ make clean install
+It may be necessary to ensure that clang is used as the compiler:
 
-The library is installed to ''minix/llvm/bin''. In a later step, the ''relink.llvm'' script will pick it up from there.
+  $ CC=clang CXX=clang++ JOBS=8 BUILDVARS="-V MKMAGIC=yes" ./releasetools/x86_hdimage.sh -b
 
-== Rebuilding a pass ==
+Also, some platforms may not be able to compile the compiler toolchain for the target platform due to running out of memory. In that case, it is possible to build an image that does not come with its own compiler toolchain, by passing in the ''MKLLVMCMDS=no'' build variable. This build variable can also be used simply to speed up the compilation procedure.
 
-This substep is also performed automatically for the first time, by the ''generate_gold_plugin.sh'' script invoked from ''configure.llvm''. However, whenever the source code of any of the LLVM instrumentation passes changes, that pass must be recompiled and installed.
+  $ BUILDVARS="-V MKMAGIC=yes -V MKLLVMCMDS=no" ./releasetools/x86_hdimage.sh -b
 
-The source code of the passes is located in the ''minix/llvm/passes'' subdirectory of the MINIX3 source code. A pass can be compiled and installed by going to its ''minix/llvm/passes/{pass}'' subdirectory, and issuing ''make install''.
+In order to build an image with ASR randomization, including four additional ASR-rerandomized versions of each system service, use the following build variables:
 
-For example, to recompile and install the magic pass:
+  $ BUILDVARS="-V MKASR=yes -V ASRCOUNT=4" ./releasetools/x86_hdimage.sh -b
 
-  $ cd passes/magic
-  $ make install
+Obviously, all variables shown above can be combined as appropriate. The author of this document has used the following command line on several occasions:
 
-The passes are installed to ''minix/llvm/bin''. In a later step, the ''build.llvm'' script will pick them up from there.
+  $ CC=clang CXX=clang++ JOBS=4 BUILDVARS="-V MKASR=yes -V ASRCOUNT=2 -V MKLLVMCMDS=no" ./releasetools/x86_hdimage.sh -b
 
-=== Instrumentation and image building ===
+After the first run, the build system will perform recompilation of only the parts of the source code that have changed, and should not take nearly as long to run as the first time. In case of unexpected problems when rebuilding, it may be necessary to throw away the previously generated objects and rebuild the MINIX3 source code in its entirety. This can be done by going to the top-level ''obj.i386'' directory and deleting all files and directories in there, except the ''tooldir.{yourplatform}'' subdirectory. Fully rebuilding the MINIX3 source code will take longer than an incremental rebuild, but since the crosscompilation toolchain is left as is, it will still be nowhere close as long as the first run.
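Deleting everything in ''obj.i386'' except the ''tooldir.{yourplatform}'' subdirectory can be sketched with a single ''find'' invocation. The function name ''clean_objdir'' is hypothetical, and the sketch assumes that the toolchain directory name starts with ''tooldir.'' as described above. It relies on the ''-mindepth'' and ''-maxdepth'' options of GNU and BSD find, which are not part of POSIX.

```shell
#!/bin/sh
# Hypothetical helper: remove all top-level entries of the given object
# directory, except any entry whose name starts with "tooldir." (the
# crosscompilation toolchain, which is expensive to rebuild).
clean_objdir() {
    objdir="$1"
    if [ ! -d "$objdir" ]; then
        echo "no such directory: $objdir" >&2
        return 1
    fi
    find "$objdir" -mindepth 1 -maxdepth 1 ! -name 'tooldir.*' \
        -exec rm -rf {} +
}
```

With the example layout, this would be invoked as ''clean_objdir /home/user/minix-liveupdate/obj.i386'', followed by the regular build command. As with any recursive delete, verify the path first.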
  
-After building the system, two more steps need to be performed: instrumentation of system services, and generation of a bootable hard disk image. These steps must be performed every time the system is built, including the first time. In particular, every time a system service is (re)compiled, it must be (re)instrumented afterwards. Furthermore, every time any part of the compiled MINIX3 installation is changed, a new image must be built.
+As explained in more detail on the [[.:crosscompiling|crosscompilation page]], it is also possible to rebuild particular parts of the system without going through the entire "make build" process. This involves the use of the ''nbmake-i386'' tool and generally requires a good understanding of the compilation process.
- +
-In order to generate a fully instrumented system image with a number ​of pregenerated ASR binaries for all services, one can run a command that automates both steps. This approach is recommended for most users and covered in the first subsection. Alternatively,​ the details of manual instrumentation and image building are covered in the two subsections after. +
- +
-== The easy way: bulk ASR generation == +
- +
-The ''​clientctl''​ script in ''​minix/​llvm''​ provides a convenient way to instrument all services for live update and rerandomization,​ generate a number of rerandomized versions of each service, and build a hard disk image. The command has the following syntax: +
- +
-  $ ./clientctl buildasr [N] +
- +
-Here, N is an optional parameter specifying the number of rerandomized binaries that should be generated in addition to the standard set of randomized binaries. N defaults to 1. For example, the following command will produce a system ​with four randomized sets of service binaries: one set of ASR-randomized services that are used by default, and three extra rerandomized binaries to which the system can switch at run time: +
- +
-  $ ./clientctl buildasr 3 +
- +
-The result is a MINIX3 hard disk image file which can be booted in (for example) qemu; see further below. +
- +
-== The manual way (1/2): instrumentation == +
- +
-Instrumentation takes place at the granularity of individual system services. The ''​minix/​llvm''​ directory contains scripts that allow for relinking services against runtime libraries, and instrumenting services with LLVM passes. The general procedure is like this: +
- +
-  - First, the service is compiled and linked to its basic form. +
-  - Then, the resulting linked bitcode object is relinked with **libmagic**. +
-  - Finally, link-time instrumentation is applied by running the **magic pass**, possibly as well as the **asr pass**, on the linked bitcode object. +
- +
-Each step also (re)generates a ready-to-execute machine code version of the service. +
- +
-Step 1 happens in the "building the system" ​step, using ''​configure.llvm'',​ as explained in a previous section. +
- +
-Step 2 is done with the ''​relink.llvm''​ script in ''​minix/​llvm''​. This script will relink services against a space-separated list of libraries. For live update, only the magic library is relevant: +
- +
-  $ ./​relink.llvm magic +
- +
-This command will relink all services against libmagic, thus providing them with runtime support for live update. +
- +
-Step 3 is done with the ''​build.llvm''​ script in ''​minix/​llvm''​. This script will instrument services with a space-separated list of LLVM passes. For live update, the magic pass should be used: +
- +
-  $ ./​build.llvm magic +
- +
-This command will instrument all services with the magic pass, performing static analysis and changing the service to include the information necessary for libmagic to perform live updates at runtime. +
- +
-For live rerandomization support, one must apply not only the magic pass, but also the asr pass: +
- +
-  $ ./​build.llvm magic asr +
- +
-The resulting service will not only be ready for live update, but also be subjected to fine-grained randomization. It will also be supplied with parameters to perform the runtime component of rerandomization when requested during live updates. +
- +
-For reference, ​the ''​clientctl buildasr''​ command shown above performs this step multiple times to generate different rerandomized versions of each service, storing each in a different location. +
- +
-We now describe some details that might be useful to know about relinking and applying passes: +
- +
-  * By default, ''relink.llvm'' and ''build.llvm'' perform their respective actions on all system services. It is however possible to instrument only a subset of services, leaving the other services untouched. This can be done by passing a ''C'' shell variable with a comma-separated list of individual services. For example, the following command relinks the PM (Process Manager) service against the magic library: +
- +
-  $ C=pm ./​relink.llvm magic +
- +
-In the ''C'' variable, the pseudo-targets ''servers'', ''fs'', ''net'', and ''drivers'' can be used to perform the script's actions on the services in the corresponding subdirectories in the MINIX3 source tree. The ''rd'' pseudo-target regenerates the ramdisk, which must be redone after changing any service on the ramdisk. For example, the following command instruments core servers and file system services with the magic and asr passes, and regenerates the ramdisk: +
- +
-  $ C=servers,​fs,​rd ./​build.llvm magic asr +
- +
-The ''​clientctl buildasr''​ command accepts this optional ''​C''​ shell variable as well. It will however remove any previously generated ASR-rerandomized binaries of all services, irrespective of the ''​C''​ variable. +
- +
-  * Each of the three steps undoes the effects of prior invocations of both that step and subsequent steps, but not earlier steps. In other words: compiling and linking a service (step 1) will undo any previous relinking and instrumentation. Relinking a service (step 2) will similarly undo any previous relinking and instrumentation of the same service. Instrumenting a service (step 3) will undo any previous instrumentation, reapplying the instrumentation to the same relinked binary. For this reason, a single ''build.llvm'' invocation must be used to apply all passes at once. +
- +
-  * Instrumentation with the magic pass will fail if the service has not been relinked with libmagic first. The same applies to the asr pass. However, the asr pass will not fail if the service has not been instrumented with the magic pass. Instrumenting a service with the asr pass but not the magic pass is of limited use: the service will be randomized, but cannot be subjected to live rerandomization. +
- +
-== The manual way (2/2): building the image == +
- +
-Finally, a MINIX3 image can be built from the compiled MINIX3 code using the ''​clientctl''​ **buildimage** command: +
- +
-  $ ./clientctl buildimage +
- +
-This command produces a bootable MINIX3 hard disk image file. The generated image file is called ''​minix_x86.img''​ and located in the root of the MINIX3 source tree - ''​minix-src''​ in our examples. +
- +
-This command is called automatically as part of ''​clientctl buildasr''​.+
  
 === Running the image ===
 
-Once a hard disk image has been generated, it can be run. The most convenient way to run the image is to use **qemu**. For convenience, the ''clientctl'' script in ''minix/llvm'' has a **run** command to run the image in qemu without further effort:
+The x86_hdimage command produces a bootable MINIX3 hard disk image file. The generated image file is called ''minix_x86.img'' and located in the root of the MINIX3 source tree - ''minix-src'' in our examples. Once an image has been generated, it can be run. The most convenient way to run the image is to use **qemu/KVM**. This can be done using the command as given at the end of the x86_hdimage output.
 
-  $ OUT=F ./clientctl run
+While explaining the use of qemu is beyond the scope of this document, it may be useful to look into the ''-append'', ''-curses'', and ''-serial file:..'' qemu command line arguments. The following command line will launch qemu with KVM support (remove ''<nowiki>--enable-kvm</nowiki>'' to disable KVM support), a curses-based user interface, and system output redirected to a file named ''serial.out'':
 
-The ''OUT'' shell variable can be set to other values to control what to do with serial output. The ''F'' value specifies that the serial output will be redirected to a ''F''ile, namely ''serial.out'' in the current directory. The other supported settings are ''S''tdout, ''C''onsole, and ''P''ty.
-
-Extra [[usersguide:bootmonitor|boot options]] can be supplied through the APPEND variable:
-
-  $ OUT=F APPEND="rs_verbose=1" ./clientctl run
+  $ cd /home/user/minix-liveupdate/minix-src
+  $ (cd ../obj.i386/destdir.i386/boot/minix/.temp && qemu-system-i386 --enable-kvm -m 256 -kernel kernel -initrd "mod01_ds,mod02_rs,mod03_pm,mod04_sched,mod05_vfs,mod06_memory,mod07_tty,mod08_mib,mod09_vm,mod10_pfs,mod11_mfs,mod12_init" -hda ../../../../../minix-src/minix_x86.img -curses -serial file:../../../../../minix-src/serial.out -append "rootdevname=c0d0p0 cttyline=0")
 
-This example will enable verbose output in the RS service, which is highly useful for debugging issues with live update.
+Extra [[usersguide:bootmonitor|boot options]] can be supplied in the (space-separated) list that follows the ''-append'' switch. For example, adding '' rs_verbose=1'' will enable verbose output in the RS service, which is highly useful for debugging issues with live update.
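When experimenting with different boot options, composing the ''-append'' argument in one place avoids quoting mistakes in the long qemu command line. The following is a small sketch; the function name ''qemu_append'' is hypothetical, and the two base options are simply copied from the example command line on this page.

```shell
#!/bin/sh
# Hypothetical helper: print the space-separated boot option list for
# qemu's -append switch. The base options are taken from the example
# command line on this page; extra options are appended as given.
qemu_append() {
    base="rootdevname=c0d0p0 cttyline=0"
    if [ "$#" -gt 0 ]; then
        # "$*" joins all arguments with single spaces (default IFS).
        echo "$base $*"
    else
        echo "$base"
    fi
}
```

In the qemu command line, the fixed string would then be replaced with ''-append "$(qemu_append rs_verbose=1)"''.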
  
 === Summary ===
 
-The following commands can be used to obtain, build, instrument, and start a MINIX3 system that supports live update and live rerandomization, including three alternative rerandomized versions of all system services, in addition to the randomized standard ones:
+The following commands can be used to obtain and build a MINIX3 system that supports live update and live rerandomization, including three alternative rerandomized versions of all system services, in addition to the randomized standard ones:
 
+  $ export CC=clang CXX=clang++ JOBS=8
+  $ cd /home/user/minix-liveupdate
   $ git clone git://git.minix3.org/minix minix-src
   $ cd minix-src/minix/llvm
-  $ export CC=clang CXX=clang++
-  $ JOBS=8 BUILDVARS="-V MKMAGIC=yes" ./configure.llvm
-  $ ./clientctl buildasr 3
-  $ OUT=F ./clientctl run
+  $ ./generate_gold_plugin.sh
+  $ cd ../..
+  $ BUILDVARS="-V MKASR=yes -V MKLLVMCMDS=no" ./releasetools/x86_hdimage.sh -b
 
 The entire procedure will typically take about 30GB of disk space and several hours of time.
Line 228: Line 160:
 Sometime later, the following steps can be used to update the installation to a newer MINIX3 version:
 
-  $ cd minix-src/minix/llvm
+  $ cd /home/user/minix-liveupdate/minix-src
   $ git pull
-  $ export CC=clang CXX=clang++
-  $ JOBS=8 BUILDVARS="-V MKMAGIC=yes" ./configure.llvm
-  $ for pass in WeakAliasModuleOverride sectionify magic asr; do (cd passes/$pass && make clean install); done
-  $ (cd static/magic && make clean install)
-  $ ./clientctl buildasr 3
-  $ OUT=F ./clientctl run
+  $ CC=clang CXX=clang++ JOBS=8 BUILDVARS="-V MKASR=yes -V MKLLVMCMDS=no" ./releasetools/x86_hdimage.sh -b
 
 In contrast to the initial run, the entire update procedure should take no more than an hour.
- 
-Instead of the ''​./​clientctl buildasr 3''​ step in the above two examples, one can for example also instrument the system for live update but not live rerandomization,​ using the following three replacement steps: 
- 
-  $ ./​relink.llvm magic 
-  $ ./​build.llvm magic 
-  $ ./clientctl buildimage 
  
 ==== Using live update ====
Line 262: Line 183:
   minix# ./testrelpol
 
-For its live update tests, this script does //not// use the magic framework for state transfer at all. Instead it uses **identity transfer** which performs a basic memory copy between the old and the new instance. As a result, the testrelpol script should succeed whether or not services are instrumented. However, it may not work reliably on MINIX3 systems that are not built for magic instrumentation at all (i.e., built without ''MKMAGIC=yes'').
+For its live update tests, this script does //not// use the magic framework for state transfer at all. Instead it uses **identity transfer** which performs a basic memory copy between the old and the new instance. As a result, the testrelpol script should succeed whether or not services are instrumented. However, it may not work reliably on MINIX3 systems that are not built for magic instrumentation (i.e., built with neither ''MKMAGIC=yes'' nor ''MKASR=yes'').
 
 == Live rerandomization: update_asr ==
 
-As we have shown before, the ''clientctl buildasr'' host-side command can perform the //build-time// preparation of a MINIX3 system for live rerandomization. Complementing this, the //run-time// side of the live rerandomization is provided by means of the **update_asr** command. The update_asr command will update system services to their next pregenerated rerandomized version, using a cyclic system. Live rerandomization is not automatic, and thus, the MINIX3 system administrator is responsible for running the update_asr command at appropriate times.
+As we have shown before, the ''MKASR=yes'' host-side build variable performs the //build-time// preparation of a MINIX3 system for live rerandomization. Complementing this, the //run-time// side of the live rerandomization is provided by means of the **update_asr** command. The update_asr command will update system services to their next pregenerated ASR-rerandomized version, using a cyclic system. Live rerandomization is not automatic, and thus, the MINIX3 system administrator is responsible for running the update_asr command at appropriate times.
 
 By default, the update_asr command performs one round of ASR rerandomization, updating each service to its next version:
Line 284: Line 205:
 === Live update commands === === Live update commands ===
  
-RS can be instructed to perform live updates through the service(8) command, specifically through its **service update** subcommand. This command is also used by the automated scripts. For a full overview of the command'​s functionality,​ please see the service(8) manual page as well as the command'​s output when it is run with no parameters.+RS can be instructed to perform live updates through the minix-service(8) command, specifically through its **minix-service update** subcommand. This command is also used by the automated scripts. For a full overview of the command'​s functionality,​ please see the minix-service(8) manual page as well as the command'​s output when it is run with no parameters.
  
-In its most fundamental form, the //service update// command will update a running service, identified by its label, to a new version provided as an on-disk binary file. It is however also possible to tell RS to update the service into a copy of itself. In addition, various flags and options can be used for fine-grained control of the live update action. The basic syntax to perform a live update on a single system service is as follows:+In its most fundamental form, the //minix-service update// command will update a running service, identified by its label, to a new version provided as an on-disk binary file. It is however also possible to tell RS to update the service into a copy of itself. In addition, various flags and options can be used for fine-grained control of the live update action. The basic syntax to perform a live update on a single system service is as follows:
  
-  minix# service [flags] update [self|<​binary>​] -label <​label>​ [options]+  minix# ​minix-service [flags] update [self|<​binary>​] -label <​label>​ [options]
  
 Through various combinations of this command'​s parameters, MINIX3 basically supports four types of updates, representing increasingly challenging conditions for the overall live update infrastructure in general, and state transfer in particular. We will now go through all of them, and explain how they can be performed. For more details regarding what is actually going on below the surface, please consult the developers guide section of this document. Through various combinations of this command'​s parameters, MINIX3 basically supports four types of updates, representing increasingly challenging conditions for the overall live update infrastructure in general, and state transfer in particular. We will now go through all of them, and explain how they can be performed. For more details regarding what is actually going on below the surface, please consult the developers guide section of this document.
Line 294: Line 215:
 == Identity transfer == == Identity transfer ==
  
-The first update type is **identity transfer**. In this case, the service is updated to an identical copy of itself, with all functions and static data in the new instance located at the exact same addresses as the old instance. Identity transfer bluntly copies over entire memory sections at once, thus requiring no instrumentation at all. This makes it suitable for testing of the MINIX3-specific side of the live update infrastructure,​ hence its use in the ''​testrelpol''​ script. Identity transfer is the default of the service(8) command when "​self"​ is given instead of a path to a new binary:+The first update type is **identity transfer**. In this case, the service is updated to an identical copy of itself, with all functions and static data in the new instance located at the exact same addresses as the old instance. Identity transfer bluntly copies over entire memory sections at once, thus requiring no instrumentation at all. This makes it suitable for testing of the MINIX3-specific side of the live update infrastructure,​ hence its use in the ''​testrelpol''​ script. Identity transfer is the default of the minix-service(8) command when "​self"​ is given instead of a path to a new binary:
  
-  minix# service update self -label pm+  minix# ​minix-service update self -label pm
  
 This will perform an identity transfer of the PM service. Identity transfer should work for literally all MINIX3 system services. As mentioned, it is guaranteed to work only when the system was built with ''​MKMAGIC=yes'',​ although it will mostly work on systems built without magic support as well. It works regardless of whether the target service was instrumented with the magic framework (or ASR). This will perform an identity transfer of the PM service. Identity transfer should work for literally all MINIX3 system services. As mentioned, it is guaranteed to work only when the system was built with ''​MKMAGIC=yes'',​ although it will mostly work on systems built without magic support as well. It works regardless of whether the target service was instrumented with the magic framework (or ASR).
  
-If the live update is successful, the service(8) command will be silent, but RS will print a system message that the update succeeded:+If the live update is successful, the minix-service(8) command will be silent, but RS will print a system message that the update succeeded:
  
   RS: update succeeded   RS: update succeeded
Line 306: Line 227:
 If the system was started on qemu with ''​OUT=F'',​ this message will end up in ''​serial.out''​. Otherwise, the message should show up in the MINIX3 system log (''/​var/​log/​messages''​) and possibly on the first console. If the system was started on qemu with ''​OUT=F'',​ this message will end up in ''​serial.out''​. Otherwise, the message should show up in the MINIX3 system log (''/​var/​log/​messages''​) and possibly on the first console.
  
-If the live update fails, RS should print an error to the system log, and service(8) will complain. In order to debug such failures, it may be useful to enable verbose mode in RS, by starting the system with ''rs_verbose=1'' as shown earlier.+If the live update fails, RS should print an error to the system log, and minix-service(8) will complain. In order to debug such failures, it may be useful to enable verbose mode in RS, by starting the system with ''rs_verbose=1'' as shown earlier.
  
 == Self state transfer == == Self state transfer ==
Line 312: Line 233:
 The second update type is **self state transfer**. Self state transfer also performs an update of a service into an identical copy of itself, but instead uses the state transfer functionality of the magic framework. Thus, self state transfer requires that the service be instrumented properly. This update type can be used to test whether a service'​s state can be transferred without problems. Please note that many of the points covered here also apply to the remaining two update types, as all three are using the state transfer of the magic framework. The second update type is **self state transfer**. Self state transfer also performs an update of a service into an identical copy of itself, but instead uses the state transfer functionality of the magic framework. Thus, self state transfer requires that the service be instrumented properly. This update type can be used to test whether a service'​s state can be transferred without problems. Please note that many of the points covered here also apply to the remaining two update types, as all three are using the state transfer of the magic framework.
  
-Self state transfer is performed by supplying the ''​-t''​ flag along with "​self"​ to the service update command:+Self state transfer is performed by supplying the ''​-t''​ flag along with "​self"​ to the minix-service update command:
  
-  minix# service -t update self -label pm+  minix# ​minix-service -t update self -label pm
  
-This command will perform self state transfer of the PM service. The libmagic ​state transfer routine in the new service instance will print additional system messages while it is running. Upon success, the system output will look somewhat like this:+This command will perform self state transfer of the PM service. The libmagicrt ​state transfer routine in the new service instance will print additional system messages while it is running. Upon success, the system output will look somewhat like this:
  
   total remote functions: 57. relocated: 54   total remote functions: 57. relocated: 54
Line 328: Line 249:
   RS: update succeeded   RS: update succeeded
  
-If the state transfer routine is not able to perform state transfer successfully,​ it will print messages that start with ''​[ERROR]''​. RS will then roll back the service to the old instance, and both RS and service(8) will report failure. Self state transfer should succeed for all MINIX3 system services that have been built with bitcode and instrumented with libmagic ​and the magic pass. As of writing, there are no system services for which self state transfer is known to result in ''​[ERROR]''​ lines and subsequent live update failure. However:+If the state transfer routine is not able to perform state transfer successfully,​ it will print messages that start with ''​[ERROR]''​. RS will then roll back the service to the old instance, and both RS and minix-service(8) will report failure. Self state transfer should succeed for all MINIX3 system services that have been built with bitcode and instrumented with libmagicrt ​and the magic pass. As of writing, there are no system services for which self state transfer is known to result in ''​[ERROR]''​ lines and subsequent live update failure. However:
  
   * It is possible that new changes to system services, and even usage scenarios which we have not yet tested, do result in state transfer errors. Such errors should be resolved. The developers guide further below contains information on how to resolve some of these errors.   * It is possible that new changes to system services, and even usage scenarios which we have not yet tested, do result in state transfer errors. Such errors should be resolved. The developers guide further below contains information on how to resolve some of these errors.
Line 336: Line 257:
   * Some services have no state to transfer, in which case their new instances will perform a fresh start instead of state transfer. In that case, live update with self state transfer will succeed, but not print the state transfer system messages shown above. This is the case for the IS (Information Server) and readclock.drv services, for example.   * Some services have no state to transfer, in which case their new instances will perform a fresh start instead of state transfer. In that case, live update with self state transfer will succeed, but not print the state transfer system messages shown above. This is the case for the IS (Information Server) and readclock.drv services, for example.
  
-  * Some services may only be updated once brought into a specific state of quiescence, because the default quiescence state is not sufficiently restrictive. In that case, the user must specify an alternative quiescence state explicitly, through the service(8) ''​-state''​ option. This currently applies to all services that make use of userspace threads, namely the VFS, ahci, and virtio_blk services. These services must be updated using quiescence state 2 (//request free//) rather than state 1 (//work free//):+  * Some services may only be updated once brought into a specific state of quiescence, because the default quiescence state is not sufficiently restrictive. In that case, the user must specify an alternative quiescence state explicitly, through the minix-service(8) ''​-state''​ option. This currently applies to all services that make use of userspace threads, namely the VFS, ahci, and virtio_blk services. These services must be updated using quiescence state 2 (//request free//) rather than state 1 (//work free//):
  
-  minix# service -t update self -label vfs -state 2+  minix# ​minix-service -t update self -label vfs -state 2
  
 Omitting the appropriate state parameter may result in a crash of the service after live update. At the moment, the update_asr(8) script has hardcoded knowledge about these necessary states. None of this is great, and we will be working towards a situation where the default state will not result in a crash - see the section on open issues further below. Omitting the appropriate state parameter may result in a crash of the service after live update. At the moment, the update_asr(8) script has hardcoded knowledge about these necessary states. None of this is great, and we will be working towards a situation where the default state will not result in a crash - see the section on open issues further below.
  
-  * State transfer may be slow, and RS applies a rather strict default timeout for live updates. Therefore, it may sometimes be necessary to set a longer timeout in order to avoid needless failures. This can be done through the ''​-maxtime''​ option to service(8):+  * State transfer may be slow, and RS applies a rather strict default timeout for live updates. Therefore, it may sometimes be necessary to set a longer timeout in order to avoid needless failures. This can be done through the ''​-maxtime''​ option to minix-service(8):
  
-  minix# service -t update self -label vfs -state 2 -maxtime 120HZ+  minix# ​minix-service -t update self -label vfs -state 2 -maxtime 120HZ
  
 The maximum time is specified in clock ticks by default, but may be given in seconds by appending "​HZ"​ to the timeout. The latter may sound confusing and it is, but the original idea was supposedly that the number of seconds is multiplied by the system'​s clock frequency, also known as its HZ setting. The above example allows the live update of VFS to take up to two minutes. The maximum time is specified in clock ticks by default, but may be given in seconds by appending "​HZ"​ to the timeout. The latter may sound confusing and it is, but the original idea was supposedly that the number of seconds is multiplied by the system'​s clock frequency, also known as its HZ setting. The above example allows the live update of VFS to take up to two minutes.
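The tick/second convention can be illustrated with a short C sketch. This is not the parser used by minix-service(8): the helper name ''maxtime_to_ticks'' is hypothetical, and the clock frequency is passed in as a parameter because the actual HZ value depends on the system.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper, not taken from minix-service(8): convert a -maxtime
 * argument to clock ticks. A plain number is taken as ticks; a number
 * followed by "HZ" is taken as seconds and multiplied by the clock
 * frequency. Returns -1 on malformed input.
 */
long maxtime_to_ticks(const char *arg, long system_hz)
{
    char *end;
    long value;

    assert(system_hz > 0);

    value = strtol(arg, &end, 10);
    if (end == arg || value < 0)
        return -1;                  /* no digits, or a negative number */
    if (*end == '\0')
        return value;               /* plain number: already in ticks */
    if (strcmp(end, "HZ") == 0)
        return value * system_hz;   /* seconds times ticks per second */
    return -1;                      /* unknown suffix */
}
```

Assuming a clock frequency of 60 ticks per second, ''120HZ'' corresponds to 7200 ticks, which is the two minutes mentioned above.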
Line 350: Line 271:
 == ASR rerandomization == == ASR rerandomization ==
  
-The third update type is **ASR rerandomization**. Like self state transfer, ASR rerandomization uses the magic framework to perform state transfer. In this case, the service performs state transfer into a rerandomized version of the same service. This involves specifying the path to a rerandomized ASR binary to the service(8) command, as well as the ''​-a''​ flag. The ''​-a''​ flag tells the new instance to enable the run-time parts of rerandomization during the live update.+The third update type is **ASR rerandomization**. Like self state transfer, ASR rerandomization uses the magic framework to perform state transfer. In this case, the service performs state transfer into a rerandomized version of the same service. This involves specifying the path to a rerandomized ASR binary to the minix-service(8) command, as well as the ''​-a''​ flag. The ''​-a''​ flag tells the new instance to enable the run-time parts of rerandomization during the live update.
  
-  minix# service -a update /​service/​asr/​1/pm -label pm+  minix# ​minix-service -a update /​service/​asr/​pm--progname ​pm -label pm
  
-In a system that has been built with ASR rerandomization,​ the (randomized) base service binaries are located in ''/​service''​ and the (randomized) alternative service binaries are located ​in numbered ​subdirectories ​in ''/​service/​asr''​. As mentioned before, the update_asr(8) command can be used to perform these updates semi-automatically.+In a system that has been built with ASR rerandomization,​ the (randomized) base service binaries are located in ''/​service''​ and the (randomized) alternative service binaries are located ​as numbered ​files in ''/​service/​asr''​. As mentioned before, the update_asr(8) command can be used to perform these updates semi-automatically.
  
 Compared to self state transfer, ASR rerandomization comes with one extra restriction:​ the VM service cannot be subjected to forms of state transfer more complicated than self state transfer. For this reason, VM is also skipped by the update_asr(8) command. We will explain the restrictions regarding the VM service in the developers guide. Compared to self state transfer, ASR rerandomization comes with one extra restriction:​ the VM service cannot be subjected to forms of state transfer more complicated than self state transfer. For this reason, VM is also skipped by the update_asr(8) command. We will explain the restrictions regarding the VM service in the developers guide.
Line 362: Line 283:
 The final update type is a **functional update**. Compared to self state transfer, ASR rerandomization relocates code and more data. However, for ASR rerandomization,​ there are still fundamentally no differences between the old and the new version of the service. In contrast, in the case of a functional update, the service performs state transfer into a new program. While this new program is typically highly similar, it may be different from the running service in various ways. The final update type is a **functional update**. Compared to self state transfer, ASR rerandomization relocates code and more data. However, for ASR rerandomization,​ there are still fundamentally no differences between the old and the new version of the service. In contrast, in the case of a functional update, the service performs state transfer into a new program. While this new program is typically highly similar, it may be different from the running service in various ways.
  
-In terms of the service(8) command, such functional updates can be performed by simply using //service update// with a new binary. For example, one could test a new version of the UDS (UNIX Domain Sockets) service, without installing it into ''/​service''​ yet, and without affecting its open sockets:+In terms of the minix-service(8) command, such functional updates can be performed by simply using //minix-service update// with a new binary. For example, one could test a new version of the UDS (UNIX Domain Sockets) service, without installing it into ''/​service''​ yet, and without affecting its open sockets:
  
-  minix# service update /​usr/​src/​minix/​net/​uds/​uds -label uds+  minix# ​minix-service update /​usr/​src/​minix/​net/​uds/​uds -label uds
  
 The possibility of actual differences between the old and new service versions adds an extra dimension for the state transfer. Additional state transfer problems can be expected in this case, and must be dealt with accordingly. The developers guide will (eventually) elaborate on this point. The possibility of actual differences between the old and new service versions adds an extra dimension for the state transfer. Additional state transfer problems can be expected in this case, and must be dealt with accordingly. The developers guide will (eventually) elaborate on this point.
  
-Similarly, depending on the nature of the update, the update action may require a specific state of quiescence. Taking UDS as an example, an update may change file descriptor transfers over sockets, in which case the update may impose that no file descriptors be in flight at the time of the update. The old instance of the service must support this as a custom quiescence state. This custom state can then be specified through the ''​-state''​ option of the //service update// command.+Similarly, depending on the nature of the update, the update action may require a specific state of quiescence. Taking UDS as an example, an update may change file descriptor transfers over sockets, in which case the update may impose that no file descriptors be in flight at the time of the update. The old instance of the service must support this as a custom quiescence state. This custom state can then be specified through the ''​-state''​ option of the //minix-service update// command.
  
 Since the live update functionality is relatively new for MINIX3, we do not yet have much experience with the practical side of performing functional updates to services. This document will be expanded as we gain more insight into the common usage patterns of live update. Stay tuned! Since the live update functionality is relatively new for MINIX3, we do not yet have much experience with the practical side of performing functional updates to services. This document will be expanded as we gain more insight into the common usage patterns of live update. Stay tuned!
Line 374: Line 295:
 == Multicomponent updates == == Multicomponent updates ==
  
-From the user's perspective,​ updating multiple services at once is not much more complex than updating a single service. First, a number of **service update** commands should be issued, just as before, but each with the ''​-q''​ flag added:+From the user's perspective,​ updating multiple services at once is not much more complex than updating a single service. First, a number of **minix-service update** commands should be issued, just as before, but each with the ''​-q''​ flag added:
  
-  minix# service -q -t update /service/pm -label pm +  minix# ​minix-service -q -t update /service/pm -label pm 
-  minix# service -q -t update /​service/​vfs -label vfs -state 2+  minix# ​minix-service -q -t update /​service/​vfs -label vfs -state 2
  
-Then, the entire update can be launched with the **service sysctl upd_run** command:+Then, the entire update can be launched with the **minix-service sysctl upd_run** command:
  
-  minix# service sysctl upd_run+  minix# ​minix-service sysctl upd_run
  
-The RS output will be much more verbose in this case. Note that timeouts are still to be specified on a per-service basis, rather than for the entire update at once. If necessary, any queued //service update// commands may be canceled with the **upd_stop** subcommand:+The RS output will be much more verbose in this case. Note that timeouts are still to be specified on a per-service basis, rather than for the entire update at once. If necessary, any queued //minix-service update// commands may be canceled with the **upd_stop** subcommand:
  
-  minix# service sysctl upd_stop+  minix# ​minix-service sysctl upd_stop
  
 This will cancel the entire multicomponent live update action. This will cancel the entire multicomponent live update action.
- 
-==== Useful host commands ==== 
- 
-The host-side ''​clientctl''​ script in ''​minix/​llvm''​ offers a number of additional convenient commands, mainly for developers. We list some of them here. 
- 
-The **buildboot** command installs just the services that are part of the boot image. It can be used instead of ''​clientctl buildimage''​ when only boot-image services have been changed, thus speeding up the development cycle: 
- 
-  $ ./clientctl buildboot 
- 
-Using this command, it is possible to make and test changes to boot system services fairly quickly. As an example, the following set of steps suffices to make and test changes to the PM service: 
- 
-  $ export PATH=$PATH:/​home/​user/​minix-liveupdate/​obj.i386/​tooldir.{platform}/​bin 
-  $ cd minix-src/​minix/​servers/​pm 
-  [make changes to the PM source code] 
-  $ nbmake-i386 all install 
-  $ cd ../../llvm 
-  $ C=pm ./​relink.llvm magic 
-  $ C=pm ./​build.llvm magic 
-  $ ./clientctl buildboot 
-  $ OUT=F ./clientctl run 
- 
-The **unstack** command shows a stacktrace of pretty much any MINIX3 binary in human-readable form: 
- 
-  $ ./clientctl unstack <​name>​ [address [address ..]] 
- 
-For example, to show a stack trace of the PM service in a human-readable form: 
- 
-  $ ./clientctl unstack pm 0x805a7fd 0x80492a5 0x8048050 
- 
-Note that on ASR-enabled installations,​ the unstack command works only on the base versions of system services. There is currently no way to unstack a stacktrace for any of the ASR-rerandomized service binaries. On one occasion, the author of this document has done that process by hand, by finding the matching assembly code of an ASR-rerandomized service'​s crash site in the service'​s base version. 
  
 ===== Developers guide ===== ===== Developers guide =====
Line 446: Line 337:
 In certain cases, a service may have to meet custom requirements before it is allowed to be updated. This depends on both the service and the update. We gave an example earlier regarding the UDS service and the transfer of file descriptors. As another example, an update that affects message protocols may have to ensure that the service has no outstanding requests to other services using that protocol. As yet another example, certain drivers may want to avoid being updated while certain types of DMA are ongoing, etcetera. In certain cases, a service may have to meet custom requirements before it is allowed to be updated. This depends on both the service and the update. We gave an example earlier regarding the UDS service and the transfer of file descriptors. As another example, an update that affects message protocols may have to ensure that the service has no outstanding requests to other services using that protocol. As yet another example, certain drivers may want to avoid being updated while certain types of DMA are ongoing, etcetera.
  
-It is up to the writer of the service to implement any such custom quiescence states, assigning a number to each of them. It is then up to the system administrator to supply such a state with the //service update// command, using the ''​-state <​number>''​ option. Some of the quiescence states are predefined; others must be defined by the service developer explicitly. The following states are defined:+It is up to the writer of the service to implement any such custom quiescence states, assigning a number to each of them. It is then up to the system administrator to supply such a state with the //minix-service update// command, using the ''​-state <​number>''​ option. Some of the quiescence states are predefined; others must be defined by the service developer explicitly. The following states are defined:
  
   * State **1** (''SEF_LU_STATE_WORK_FREE''): work free. This state ensures that the service is not currently performing any work. The fact that the service is being prepared at the time of verifying the quiescence state implies that it is not doing any other work, and thus, SEF is hardcoded to accept updates in this state. The service developer cannot override the check for this state.   * State **1** (''SEF_LU_STATE_WORK_FREE''): work free. This state ensures that the service is not currently performing any work. The fact that the service is being prepared at the time of verifying the quiescence state implies that it is not doing any other work, and thus, SEF is hardcoded to accept updates in this state. The service developer cannot override the check for this state.
Line 464: Line 355:
   sef_setcb_lu_state_isvalid(my_state_isvalid);​   sef_setcb_lu_state_isvalid(my_state_isvalid);​
  
-This routine has the signature ''​int my_state_isvalid(int state, int flags)'',​ and will be called when a live update is initiated through service(8). As its most important parameter, ''​state''​ is the requested quiescence state. The ''​flags''​ parameter contains update flags and is typically unused. The routine must return ''​TRUE''​ if the state is valid for the service, and ''​FALSE''​ otherwise. Most services will want to allow the standard states as well as any custom states:+This routine has the signature ''​int my_state_isvalid(int state, int flags)'',​ and will be called when a live update is initiated through ​minix-service(8). As its most important parameter, ''​state''​ is the requested quiescence state. The ''​flags''​ parameter contains update flags and is typically unused. The routine must return ''​TRUE''​ if the state is valid for the service, and ''​FALSE''​ otherwise. Most services will want to allow the standard states as well as any custom states:
  
   #define MY_CUSTOM_STATE_0 (SEF_LU_STATE_CUSTOM_BASE+0)   #define MY_CUSTOM_STATE_0 (SEF_LU_STATE_CUSTOM_BASE+0)
Line 475: Line 366:
   sef_setcb_lu_prepare(my_lu_prepare);​   sef_setcb_lu_prepare(my_lu_prepare);​
  
-This routine has the signature ''​int my_lu_prepare(int state)'',​ and will be called when a live update is initiated through service(8), after ensuring the given state is valid. Again, ''​state''​ is the requested quiescence state. The function must return ''​OK''​ if the live update can proceed in this state, and ''​ENOTREADY''​ otherwise. It should check the standard states and/or any custom states, typically in a switch statement.+This routine has the signature ''​int my_lu_prepare(int state)'',​ and will be called when a live update is initiated through ​minix-service(8), after ensuring the given state is valid. Again, ''​state''​ is the requested quiescence state. The function must return ''​OK''​ if the live update can proceed in this state, and ''​ENOTREADY''​ otherwise. It should check the standard states and/or any custom states, typically in a switch statement.
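As a minimal sketch of such a prepare callback, consider the UDS-like case of a custom state that requires that no file descriptors be in flight. The constants below are stand-ins so that the fragment compiles outside of MINIX3; in a real service they come from the system headers, and their numeric values here are assumptions. The ''fds_in_flight'' counter is hypothetical as well.

```c
/* Stand-in definitions; the real ones live in the MINIX3 headers and the
 * numeric values chosen here are assumptions made for illustration. */
#define OK                        0
#define ENOTREADY                 (-1)
#define SEF_LU_STATE_WORK_FREE    1
#define SEF_LU_STATE_CUSTOM_BASE  100

#define MY_CUSTOM_STATE_0 (SEF_LU_STATE_CUSTOM_BASE+0)

/* Hypothetical service state: descriptors currently being passed around. */
int fds_in_flight;

/* Decide whether the live update may proceed in the requested state. */
int my_lu_prepare(int state)
{
    switch (state) {
    case SEF_LU_STATE_WORK_FREE:
        /* Being asked at all implies that no other work is ongoing. */
        return OK;
    case MY_CUSTOM_STATE_0:
        /* Custom state: only ready once nothing is in flight. */
        return (fds_in_flight == 0) ? OK : ENOTREADY;
    default:
        /* Any state this sketch does not know about is refused. */
        return ENOTREADY;
    }
}
```

Refusing unknown states in the ''default'' branch is a conservative choice for this sketch; a real service would handle every state that its validity callback accepts.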
  
 Third, the service may optionally provide a quiescence state debugging function through the sef_setcb_lu_state_dump(3) SEF API call. The given callback routine has the signature ''​int my_lu_state_dump(int state)''​ and should use the sef_lu_dprint(3) printf-like function to print information about the given quiescence state and its current internal state as appropriate,​ using newline-terminated lines. Third, the service may optionally provide a quiescence state debugging function through the sef_setcb_lu_state_dump(3) SEF API call. The given callback routine has the signature ''​int my_lu_state_dump(int state)''​ and should use the sef_lu_dprint(3) printf-like function to print information about the given quiescence state and its current internal state as appropriate,​ using newline-terminated lines.
Line 483: Line 374:
 We now get into the details of the live update infrastructure. For many parts of the story, it may be useful to take a look at the actual source code as well. In this section we give a quick overview of what parts of the source code are where, and what they do. We now get into the details of the live update infrastructure. For many parts of the story, it may be useful to take a look at the actual source code as well. In this section we give a quick overview of what parts of the source code are where, and what they do.
  
-The LLVM instrumentation code is located in ''minix/llvm'' of the MINIX3 source code, along with the supporting scripts described in the users guide. The following relevant LLVM passes are located in ''minix/llvm/passes'':+The LLVM instrumentation passes are located in ''minix/llvm'' of the MINIX3 source code, along with the generate_gold_plugin.sh script described in the users guide. The following relevant LLVM passes are located in ''minix/llvm/passes'':
  
   * The **WeakAliasModuleOverride** pass resolves a particular issue with weak symbols being used in assembly code. TODO   * The **WeakAliasModuleOverride** pass resolves a particular issue with weak symbols being used in assembly code. TODO
Line 489: Line 380:
   * The **sectionify** pass is used to tag certain functions and data of bitcode modules as belonging to a certain section. Its main purpose is to tag certain parts of the compiled code such that the magic pass (see below), in a subsequent run over the same code, will treat the tagged parts as special. For example, it is used to ignore all variables in the libc malloc code for state transfer, for reasons explained later.   * The **sectionify** pass is used to tag certain functions and data of bitcode modules as belonging to a certain section. Its main purpose is to tag certain parts of the compiled code such that the magic pass (see below), in a subsequent run over the same code, will treat the tagged parts as special. For example, it is used to ignore all variables in the libc malloc code for state transfer, for reasons explained later.
  
-  * The **magic** pass performs link-time static analysis and instrumentation of system services. It is responsible for supplying ​libmagic ​(see below) with the necessary information to allow for state transfer at runtime, by including descriptions of data types, global variables, and other information,​ in the service module. In addition, it is responsible for replacing certain function calls in the module, in particular memory management functions, with calls to wrappers in libmagic. This allows for runtime tracking of dynamically allocated memory objects.+  * The **magic** pass performs link-time static analysis and instrumentation of system services. It is responsible for supplying ​libmagicrt ​(see below) with the necessary information to allow for state transfer at runtime, by including descriptions of data types, global variables, and other information,​ in the service module. In addition, it is responsible for replacing certain function calls in the module, in particular memory management functions, with calls to wrappers in libmagicrt. This allows for runtime tracking of dynamically allocated memory objects.
  
-  * The **asr** pass performs randomization of the service binary, for example by rearranging functions, basic blocks within functions, and data, adding padding between those, and letting functions allocate stack padding. The ASR pass does not deal with randomization of dynamically allocated objects. Instead, it passes some settings on to libmagic.+  * The **asr** pass performs randomization of the service binary, for example by rearranging functions, basic blocks within functions, and data, adding padding between those, and letting functions allocate stack padding. The ASR pass does not deal with randomization of dynamically allocated objects. Instead, it passes some settings on to libmagicrt.
  
 In addition to the passes, the following pieces of system functionality are especially important for live update: In addition to the passes, the following pieces of system functionality are especially important for live update:
  
-  * The magic library, **libmagic**, is the runtime component of system services. It implements the actual state transfer routine, which uses both the information embedded in the service by the magic pass, and the tracking information it has gathered about dynamically allocated memory objects at run time. It also implements that actual runtime tracking. Furthermore, ​libmagic ​implements the aforementioned runtime part of the ASR functionality. For example, ​libmagic ​can add extra padding when performing memory allocations. The magic library is located in ''​minix/​llvm/static/​magic''​.+  * The magic runtime ​library, **libmagicrt**, is the runtime component of system services. It implements the actual state transfer routine, which uses both the information embedded in the service by the magic pass, and the tracking information it has gathered about dynamically allocated memory objects at run time. It also implements that actual runtime tracking. Furthermore, ​libmagicrt ​implements the aforementioned runtime part of the ASR functionality. For example, ​libmagicrt ​can add extra padding when performing memory allocations. The magic runtime ​library is located in ''​minix/​lib/libmagicrt''​.
  
-  * The glue between system services and libmagic ​is implemented as part of the **System Event Framework** (SEF) library routines. These routines also handle the communication between the system service and RS. Use of SEF is compulsory for all system services. The SEF code is part of **libsys**. Its implementation can be found in the ''​minix/​lib/​libsys/​sef*.c''​ files.+  * The glue between system services and libmagicrt ​is implemented as part of the **System Event Framework** (SEF) library routines. These routines also handle the communication between the system service and RS. Use of SEF is compulsory for all system services. The SEF code is part of **libsys**. Its implementation can be found in the ''​minix/​lib/​libsys/​sef*.c''​ files.
  
   * The source code of **RS**, the Reincarnation Server, is located in ''​minix/​servers/​rs''​. RS uses live update functionality implemented in the kernel, located in ''​minix/​kernel'',​ and VM, located in ''​minix/​servers/​vm''​.   * The source code of **RS**, the Reincarnation Server, is located in ''​minix/​servers/​rs''​. RS uses live update functionality implemented in the kernel, located in ''​minix/​kernel'',​ and VM, located in ''​minix/​servers/​vm''​.
Line 511: Line 402:
 In general, properly achieving //​quiescence//​ is one of the main challenges for a live update system. For example, if a live update changes the implementation of a particular function, the component being updated must not be executing that function at the time of the live update - if it is, the live update will most likely result in a crash of the component. In MINIX3, the quiescence issue is resolved in a way that leaves little room for problems, by exploiting MINIX3'​s message-based nature. In essence, all the MINIX3 services consist of a main message loop that repeatedly receives a message and processes this message. MINIX3 supports no kernel threads, and thus, the MINIX3 services have no internal CPU-level concurrency. As a result, a message can be used to enforce quiescence. In general, properly achieving //​quiescence//​ is one of the main challenges for a live update system. For example, if a live update changes the implementation of a particular function, the component being updated must not be executing that function at the time of the live update - if it is, the live update will most likely result in a crash of the component. In MINIX3, the quiescence issue is resolved in a way that leaves little room for problems, by exploiting MINIX3'​s message-based nature. In essence, all the MINIX3 services consist of a main message loop that repeatedly receives a message and processes this message. MINIX3 supports no kernel threads, and thus, the MINIX3 services have no internal CPU-level concurrency. As a result, a message can be used to enforce quiescence.
  
-MINIX3 live updates are orchestrated by the RS (Reincarnation Server) service. The administrator of the system first compiles a new version of the service into an executable on disk, and then instructs RS to update a particular running system service into the new version, through the service(8) utility. RS starts by loading the new version of the service as a new service process, without letting it run. Thus, there are temporarily two instances of the service: the old instance, which is still running, and the new instance, which contains the new code but not yet any of the necessary state.+MINIX3 live updates are orchestrated by the RS (Reincarnation Server) service. The administrator of the system first compiles a new version of the service into an executable on disk, and then instructs RS to update a particular running system service into the new version, through the minix-service(8) utility. RS starts by loading the new version of the service as a new service process, without letting it run. Thus, there are temporarily two instances of the service: the old instance, which is still running, and the new instance, which contains the new code but not yet any of the necessary state.
  
 RS then asks the old instance of the service to prepare to be updated, by sending a __prepare__ request message to it. At the moment that the service receives and processes the preparation message, it is by definition in a known state, as it cannot also be doing something else at the same time. While this is a good start for quiescence, the service may have to meet additional requirements regarding its current activity, depending on the service and the type of live update. The administrator provides the intended //​quiescence state// for the live update when starting the update, and the service itself determines whether or not it is //ready// when handling the __prepare__ message. If the service decides that it does not meet the given quiescence requirements,​ the live update is aborted. RS then asks the old instance of the service to prepare to be updated, by sending a __prepare__ request message to it. At the moment that the service receives and processes the preparation message, it is by definition in a known state, as it cannot also be doing something else at the same time. While this is a good start for quiescence, the service may have to meet additional requirements regarding its current activity, depending on the service and the type of live update. The administrator provides the intended //​quiescence state// for the live update when starting the update, and the service itself determines whether or not it is //ready// when handling the __prepare__ message. If the service decides that it does not meet the given quiescence requirements,​ the live update is aborted.
Line 517: Line 408:
 However, if the old instance does meet the requirements,​ it acknowledges that it is ready by sending a __ready__ message to RS, blocking on receipt of a reply from RS. Thus, the old instance is effectively stopped in a known state. In order to maintain the externally visible state (most importantly,​ the communication endpoint) of the service being updated, the process slots of the old and the new service instances are swapped. The new instance, now in the original process slot, is then allowed to run. Upon startup, the new instance finds out from RS that it is the new instance of an old, stopped process, and attempts to perform state transfer from this old process into itself. However, if the old instance does meet the requirements,​ it acknowledges that it is ready by sending a __ready__ message to RS, blocking on receipt of a reply from RS. Thus, the old instance is effectively stopped in a known state. In order to maintain the externally visible state (most importantly,​ the communication endpoint) of the service being updated, the process slots of the old and the new service instances are swapped. The new instance, now in the original process slot, is then allowed to run. Upon startup, the new instance finds out from RS that it is the new instance of an old, stopped process, and attempts to perform state transfer from this old process into itself.
  
-State transfer requires transfer of all individual pieces of data from the old to the new process, possibly to a new location. This is performed by the magic framework. In a nutshell, the magic state transfer approach relies on having a full view of all the individual pieces of data that make up the process, along with type information about the data, including for example structure layouts and types of pointers. For static data, this information is generated by the magic pass through static analysis performed at compile time, and included with the service binary. For dynamic data, the information is collected and maintained by the magic library attached to the service. The end result is that the state transfer framework knows about all global variables and functions, and for each pointer, what type of data the pointer points to.+State transfer requires transfer of all individual pieces of data from the old to the new process, possibly to a new location. This is performed by the magic framework. In a nutshell, the magic state transfer approach relies on having a full view of all the individual pieces of data that make up the process, along with type information about the data, including for example structure layouts and types of pointers. For static data, this information is generated by the magic pass through static analysis performed at compile time, and included with the service binary. For dynamic data, the information is collected and maintained by the magic runtime ​library attached to the service. The end result is that the state transfer framework knows about all global variables and functions, and for each pointer, what type of data the pointer points to.
  
-This knowledge, in addition to full access to the memory of the old instance through a special memory grant, allows the runtime libmagic state transfer procedure in the new instance to iterate over all data of the old process. This procedure recursively follows any pointers it encounters, and //pairs// each piece of data with the corresponding piece of data in the new process, copying over the data and adjusting it for the new layout as necessary. In certain cases, the state transfer system may not be able to pair all pieces of data, or deal with all pointers. In that case, state transfer fails. Annotations in the service source code, as well as custom data transfer methods, can be provided in order to aid the state transfer process.+This knowledge, in addition to full access to the memory of the old instance through a special memory grant, allows the libmagicrt state transfer procedure in the new instance to iterate over all data of the old process. This procedure recursively follows any pointers it encounters, and //pairs// each piece of data with the corresponding piece of data in the new process, copying over the data and adjusting it for the new layout as necessary. In certain cases, the state transfer system may not be able to pair all pieces of data, or deal with all pointers. In that case, state transfer fails. Annotations in the service source code, as well as custom data transfer methods, can be provided in order to aid the state transfer process.
  
-Regardless of whether state transfer succeeded or failed, the new instance sends the result of the state transfer to RS using an __init__ request message. If state transfer succeeded, RS allows the new instance to continue to run, and kills the process of the old instance. If the state transfer fails, RS again swaps the process slots of the old and the new instance, allows the old instance to run again, and kills the new instance. In both cases, RS communicates the result to the service(8) utility as well, ultimately letting the system administrator know about the outcome of the live update.+Regardless of whether state transfer succeeded or failed, the new instance sends the result of the state transfer to RS using an __init__ request message. If state transfer succeeded, RS allows the new instance to continue to run, and kills the process of the old instance. If the state transfer fails, RS again swaps the process slots of the old and the new instance, allows the old instance to run again, and kills the new instance. In both cases, RS communicates the result to the minix-service(8) utility as well, ultimately letting the system administrator know about the outcome of the live update.
  
 For multicomponent live updates, all affected services are first brought into the //ready// state, after which they are all updated. Any service failing to get ready in the preparation phase will cause an abort of the entire update, and any service failing the state transfer phase causes a rollback of the entire update. For multicomponent live updates, all affected services are first brought into the //ready// state, after which they are all updated. Any service failing to get ready in the preparation phase will cause an abort of the entire update, and any service failing the state transfer phase causes a rollback of the entire update.
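The single-service update flow described above (prepare, ready, slot swap, state transfer, then commit or rollback) can be summarized as a small decision function. This is only an illustrative sketch: the names ''toy_live_update'', ''toy_outcome'' and the other ''toy_'' identifiers are hypothetical and do not correspond to actual RS code or protocol constants.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical outcome codes; not the real RS protocol constants. */
enum toy_outcome { TOY_ABORTED, TOY_ROLLED_BACK, TOY_COMMITTED };

/* Which instance ends up in the service's original process slot. */
enum toy_instance { TOY_OLD, TOY_NEW };

struct toy_result {
    enum toy_outcome outcome;
    enum toy_instance survivor;
    int slot_swaps;             /* how often the process slots were swapped */
};

/*
 * Model one live update of a single service.
 * old_is_ready:      the old instance meets the requested quiescence state
 * state_transfer_ok: the new instance reports success in its init message
 */
struct toy_result toy_live_update(bool old_is_ready, bool state_transfer_ok)
{
    struct toy_result r = { TOY_ABORTED, TOY_OLD, 0 };

    /* Prepare phase: a service that is not ready aborts the update. */
    if (!old_is_ready)
        return r;

    /* The old instance is now blocked; swap slots and run the new one. */
    r.slot_swaps++;
    r.survivor = TOY_NEW;

    /* Success: the new instance keeps running, the old one is killed. */
    if (state_transfer_ok) {
        r.outcome = TOY_COMMITTED;
        return r;
    }

    /* Failure: swap back, resume the old instance, kill the new one. */
    r.slot_swaps++;
    r.survivor = TOY_OLD;
    r.outcome = TOY_ROLLED_BACK;
    return r;
}
```

In this model a rollback always involves exactly two slot swaps, which mirrors the text: the old instance regains its original slot, and with it its communication endpoint.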
Line 602: Line 493:
 The state transfer procedure also transfers //dynamic// data objects, which are located in the heap and mmap sections of the old instance. In essence, the procedure recreates the heap and mmap sections during the state transfer, by allocating new heap or mapped memory for each dynamic object, and then transferring its actual contents. This again includes pointer analysis and adjustment. Here, one object is one piece of memory created by a call to malloc(3) or mmap(2), for example. The state transfer procedure also transfers //dynamic// data objects, which are located in the heap and mmap sections of the old instance. In essence, the procedure recreates the heap and mmap sections during the state transfer, by allocating new heap or mapped memory for each dynamic object, and then transferring its actual contents. This again includes pointer analysis and adjustment. Here, one object is one piece of memory created by a call to malloc(3) or mmap(2), for example.
  
-Since MINIX3 already transfers the mmap section to the new instance automatically,​ the state transfer framework starts by unmapping all memory-mapped areas that it knows it will recreate. However, since some memory areas (the aforementioned memory-mapped I/O and DMA memory) cannot be recreated by the magic framework, these are not destroyed and recreated. These areas are called //​special//,​ //​out-of-band//​ memory. The service has to tell the magic library about special memory areas. For the two common ways of allocating such memory, alloc_contig(3) and vm_map_phys(2),​ this is done automatically by libsys.+Since MINIX3 already transfers the mmap section to the new instance automatically,​ the state transfer framework starts by unmapping all memory-mapped areas that it knows it will recreate. However, since some memory areas (the aforementioned memory-mapped I/O and DMA memory) cannot be recreated by the magic framework, these are not destroyed and recreated. These areas are called //​special//,​ //​out-of-band//​ memory. The service has to tell the magic runtime ​library about special memory areas. For the two common ways of allocating such memory, alloc_contig(3) and vm_map_phys(2),​ this is done automatically by libsys.
  
 Out-of-band memory is seen as opaque, physically and virtually unmovable memory, and ignored entirely for the purpose of state transfer. Thus, if a piece of out-of-band memory contains a pointer to a piece of memory that is //not// marked as out-of-band,​ this pointer will be missed during state transfer. For the aforementioned (memory-mapped I/O and DMA) memory types, this is not a problem. Out-of-band memory is seen as opaque, physically and virtually unmovable memory, and ignored entirely for the purpose of state transfer. Thus, if a piece of out-of-band memory contains a pointer to a piece of memory that is //not// marked as out-of-band,​ this pointer will be missed during state transfer. For the aforementioned (memory-mapped I/O and DMA) memory types, this is not a problem.
Line 608: Line 499:
 The default of inheriting the entire mmap section leads to the situation that if the magic framework misses any memory-mapped areas for any reason, these will effectively translate to a memory leak in the new instance. Currently, one such memory leak is addressed explicitly: the page directory that is allocated with mmap(2) internally by the libc malloc code. The default of inheriting the entire mmap section leads to the situation that if the magic framework misses any memory-mapped areas for any reason, these will effectively translate to a memory leak in the new instance. Currently, one such memory leak is addressed explicitly: the page directory that is allocated with mmap(2) internally by the libc malloc code.
  
-The state transfer procedure may fail if its analysis is not successful, in which case the system will roll back to the old instance, and the live update fails. It is then up to the programmer to deal with such problems. This may involve annotating source code, for example to instruct the state transfer procedure to ignore certain pointers, or to copy over certain data as is. It may involve adding special state transfer routines to libmagic, which deal with fundamentally problematic cases such as unions. In rare cases, it may involve adapting source code to avoid state transfer problems. We discuss all this in more detail later.+The state transfer procedure may fail if its analysis is not successful, in which case the system will roll back to the old instance, and the live update fails. It is then up to the programmer to deal with such problems. This may involve annotating source code, for example to instruct the state transfer procedure to ignore certain pointers, or to copy over certain data as is. It may involve adding special state transfer routines to libmagicrt, which deal with fundamentally problematic cases such as unions. In rare cases, it may involve adapting source code to avoid state transfer problems. We discuss all this in more detail later.
  
 In the case of self state transfer, all static objects will have the same location in both the old and the new instance. However, due to their dynamic recreation, the addresses of dynamic objects may change during self state transfer. In the case of self state transfer, all static objects will have the same location in both the old and the new instance. However, due to their dynamic recreation, the addresses of dynamic objects may change during self state transfer.
Line 634: Line 525:
 Therefore, in order to allow for rollback, VM must not make any changes to its dynamic memory during the live update. That also means that it may not allocate memory during the live update, not for other processes and not for itself. This situation leads to the following exceptions: Therefore, in order to allow for rollback, VM must not make any changes to its dynamic memory during the live update. That also means that it may not allocate memory during the live update, not for other processes and not for itself. This situation leads to the following exceptions:
  
-  * First and foremost, since the new VM instance essentially inherits the old instance'​s dynamic memory, the dynamic memory must be ignored by the state transfer framework. For this reason, at startup, VM tells libmagic ​that its entire dynamic memory region consists of special, out-of-band data. As a result, any pointers in this region will not be analyzed or adjusted by the state transfer procedure. This is a good thing, as changes to such pointers would not be undone after a rollback. However, the main consequence is that if the static memory layout of the VM process changes, any pointers in dynamic memory that point to static memory will become invalid. Therefore, updates to VM are limited to the **identity transfer** and **self state transfer** update types.+  * First and foremost, since the new VM instance essentially inherits the old instance'​s dynamic memory, the dynamic memory must be ignored by the state transfer framework. For this reason, at startup, VM tells libmagicrt ​that its entire dynamic memory region consists of special, out-of-band data. As a result, any pointers in this region will not be analyzed or adjusted by the state transfer procedure. This is a good thing, as changes to such pointers would not be undone after a rollback. However, the main consequence is that if the static memory layout of the VM process changes, any pointers in dynamic memory that point to static memory will become invalid. Therefore, updates to VM are limited to the **identity transfer** and **self state transfer** update types.
   * Another effect of the automatic dynamic memory inheritance is that dynamic memory allocations need not and must not be tracked. Therefore, dynamic memory allocation functions are not instrumented in VM at all, requiring an instrumentation override. This override in turn requires disabling some other instrumentation features, such as the aforementioned libc malloc page directory exception. The features are disabled during VM's linking process, through special statements in its Makefile.   * Another effect of the automatic dynamic memory inheritance is that dynamic memory allocations need not and must not be tracked. Therefore, dynamic memory allocation functions are not instrumented in VM at all, requiring an instrumentation override. This override in turn requires disabling some other instrumentation features, such as the aforementioned libc malloc page directory exception. The features are disabled during VM's linking process, through special statements in its Makefile.
   * After a rollback, the old VM instance still has to perform a small number of corrective actions to undo changes made by the new instance. These actions are however kept to a minimum. In the future, more extended non-transparent rollback may be the key to allowing more invasive live updates to the VM service.   * After a rollback, the old VM instance still has to perform a small number of corrective actions to undo changes made by the new instance. These actions are however kept to a minimum. In the future, more extended non-transparent rollback may be the key to allowing more invasive live updates to the VM service.
Line 652: Line 543:
 === Some basics and terminology === === Some basics and terminology ===
  
-The magic framework keeps track of each //static// object of data using a **sentry** ("state entry") data structure. The framework keeps track of each //dynamic// object of data using a **dsentry** ("dynamic state entry") data structure, which itself has an embedded //sentry// data structure. The magic pass installs libmagic wrappers around memory allocation routines so that it can allocate extra memory to store the dsentry metadata right before the actual memory object. Special, out-of-band memory regions are maintained in **obdsentry** ("out-of-band dynamic state entry") data structures. Since no extra memory can be allocated next to the actual memory object in this case, obdsentries themselves are (currently) stored as static data as part of libmagic's own state.+The magic framework keeps track of each //static// object of data using a **sentry** ("state entry") data structure. The framework keeps track of each //dynamic// object of data using a **dsentry** ("dynamic state entry") data structure, which itself has an embedded //sentry// data structure. The magic pass installs libmagicrt wrappers around memory allocation routines so that it can allocate extra memory to store the dsentry metadata right before the actual memory object. Special, out-of-band memory regions are maintained in **obdsentry** ("out-of-band dynamic state entry") data structures. Since no extra memory can be allocated next to the actual memory object in this case, obdsentries themselves are (currently) stored as static data as part of libmagicrt's own state.
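The idea of storing dsentry metadata right before the actual memory object can be illustrated with a toy allocation wrapper. This is a minimal sketch of the general technique only: ''toy_dsentry'', ''toy_malloc'', ''toy_lookup'', ''toy_free'' and ''toy_count'' are hypothetical names, and the real dsentry layout and wrapper functions in the magic runtime library are considerably richer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical, simplified metadata record for one dynamic object. */
struct toy_dsentry {
    size_t size;                /* size of the user-visible object */
    const char *type_name;      /* stand-in for a type table reference */
    struct toy_dsentry *next;   /* list of all live dynamic objects */
};

/* Pad the header so that the object after it stays suitably aligned. */
union toy_header {
    struct toy_dsentry entry;
    max_align_t align;
};

static struct toy_dsentry *toy_list = NULL;

/* Allocate an object with its metadata placed directly in front of it. */
void *toy_malloc(size_t size, const char *type_name)
{
    union toy_header *h = malloc(sizeof(*h) + size);

    if (h == NULL)
        return NULL;
    h->entry.size = size;
    h->entry.type_name = type_name;
    h->entry.next = toy_list;
    toy_list = &h->entry;
    return h + 1;               /* the object starts right after the header */
}

/* Find the metadata of an object returned by toy_malloc(). */
struct toy_dsentry *toy_lookup(void *obj)
{
    assert(obj != NULL);
    return &((union toy_header *)obj - 1)->entry;
}

/* Unlink the metadata and release header and object together. */
void toy_free(void *obj)
{
    struct toy_dsentry *d, **pp;

    if (obj == NULL)
        return;
    d = toy_lookup(obj);
    for (pp = &toy_list; *pp != NULL && *pp != d; pp = &(*pp)->next)
        ;
    assert(*pp == d);           /* the object must be tracked */
    *pp = d->next;
    free(d);                    /* the entry is at the start of the block */
}

/* Count the dynamic objects that are currently tracked. */
size_t toy_count(void)
{
    size_t n = 0;

    for (struct toy_dsentry *d = toy_list; d != NULL; d = d->next)
        n++;
    return n;
}
```

Placing the header in front of the object makes the lookup a constant-time pointer subtraction. It also shows why this scheme cannot work for out-of-band memory, where no extra space can be allocated next to the object.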
  
 The magic framework also uses the concept of a **selement** ("​state element"​),​ which is a particular element within a state entry; for example, it can be one particular field in a structure. State is transferred one element at a time. If the state transfer procedure encounters a problem, it will report about the state element that is causing the problem. The magic framework also uses the concept of a **selement** ("​state element"​),​ which is a particular element within a state entry; for example, it can be one particular field in a structure. State is transferred one element at a time. If the state transfer procedure encounters a problem, it will report about the state element that is causing the problem.
  
-Each pointer in the service process is expected to point to a data type known to libmagic. All the possible data types that can be used by the service are enumerated through static analysis by the magic pass, and stored in a **type** table as part of the instrumented service. It may happen that one data type is cast to another, either in the source code of the service or as a result of the LLVM compilation and linking process. As a result, while the static analysis may conclude that a pointer is for one type, runtime state transfer may find that the pointer was (for example) allocated for another type. Normally, such mismatches would cause state transfer to fail, but casting makes this a legitimate case. Therefore, the magic framework has the notion of **compatible types**: if type A is cast to type B anywhere, type A is marked as a compatible type for type B, and finding type A when transferring data of supposed type B will not result in state transfer failure. The magic pass adds a list of compatible types to the service binary as well, all for use by libmagic at state transfer time.+Each pointer in the service process is expected to point to a data type known to libmagicrt. All the possible data types that can be used by the service are enumerated through static analysis by the magic pass, and stored in a **type** table as part of the instrumented service. It may happen that one data type is cast to another, either in the source code of the service or as a result of the LLVM compilation and linking process. As a result, while the static analysis may conclude that a pointer is for one type, runtime state transfer may find that the pointer was (for example) allocated for another type. Normally, such mismatches would cause state transfer to fail, but casting makes this a legitimate case. 
Therefore, the magic framework has the notion of **compatible types**: if type A is cast to type B anywhere, type A is marked as a compatible type for type B, and finding type A when transferring data of supposed type B will not result in state transfer failure. The magic pass adds a list of compatible types to the service binary as well, all for use by libmagicrt ​at state transfer time.
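The compatible-type check can be pictured as a table lookup performed whenever the type found at run time differs from the type expected by static analysis. The sketch below is purely illustrative: the type identifiers, the table contents and ''toy_type_ok'' are hypothetical, and it assumes that compatibility is directional (A is accepted where B is expected, but not the reverse), which the text suggests but does not state outright.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical type identifiers, standing in for type table entries. */
enum toy_type { TOY_INT, TOY_STRUCT_A, TOY_STRUCT_B };

/* One entry: 'found' may appear where 'expected' was predicted. */
struct toy_compat {
    enum toy_type found;
    enum toy_type expected;
};

/* Example table: struct A is cast to struct B somewhere in the code. */
static const struct toy_compat toy_compat_table[] = {
    { TOY_STRUCT_A, TOY_STRUCT_B },
};

/* Return true if data of type 'found' may be transferred as 'expected'. */
bool toy_type_ok(enum toy_type expected, enum toy_type found)
{
    size_t n = sizeof(toy_compat_table) / sizeof(toy_compat_table[0]);

    if (expected == found)
        return true;
    for (size_t i = 0; i < n; i++) {
        if (toy_compat_table[i].found == found &&
            toy_compat_table[i].expected == expected)
            return true;
    }
    return false;               /* a mismatch that would fail the transfer */
}
```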
  
 === Annotation === === Annotation ===
Line 665: Line 556:
   * **ixfer**: Identity Transfer. This annotation will copy the data over as is, without performing analysis on the memory. As an example, the ixfer annotation can be used for pointer values that should not be analyzed as pointers, for instance because they are pointers into another address space. A practical example where it is used is a process table copied in from another service. Such process tables typically contain external pointers, which will be unused by the local service. Some other values may still be needed after state transfer, which is why ixfer is used rather than noxfer.   * **ixfer**: Identity Transfer. This annotation will copy the data over as is, without performing analysis on the memory. As an example, the ixfer annotation can be used for pointer values that should not be analyzed as pointers, for instance because they are pointers into another address space. A practical example where it is used is a process table copied in from another service. Such process tables typically contain external pointers, which will be unused by the local service. Some other values may still be needed after state transfer, which is why ixfer is used rather than noxfer.
   * **cixfer**: Conditional Identity Transfer. This annotation will cause the state transfer framework to try to interpret and transfer the value as a pointer, and fall back to identity transfer if this fails. As an example, the cixfer annotation can be used for variables which may contain either a pointer or a number value which is never a valid pointer, making the variable effectively a union of the two types. A practical example where it is used is a callback value, which is of type ''​void *''​ but may be used to store a small integer as well.   * **cixfer**: Conditional Identity Transfer. This annotation will cause the state transfer framework to try to interpret and transfer the value as a pointer, and fall back to identity transfer if this fails. As an example, the cixfer annotation can be used for variables which may contain either a pointer or a number value which is never a valid pointer, making the variable effectively a union of the two types. A practical example where it is used is a callback value, which is of type ''​void *''​ but may be used to store a small integer as well.
-  * **pxfer**: Pointer Transfer. This annotation forces a non-pointer value to be interpreted as a pointer, and transferred accordingly. As an example, the pxfer annotation may be used when a pointer value is stored in an integer type. As of writing, this annotation is not used in practice.+  * **pxfer**: Pointer Transfer. This annotation forces a value to be interpreted as a pointer, and transferred accordingly. As an example, the pxfer annotation may be used when a pointer value is stored in an integer type. The pxfer annotation may also be used for a union of (differently typed) pointers. Thus, in some cases, a union-of-structures can be split up into a union of non-pointers and one or more unions of pointers, marking the non-pointer union with ''ixfer'' and the pointer union(s) with ''pxfer''. This is indeed how ''pxfer'' is currently used in practice as well.
   * **sxfer**: Structure Transfer. This annotation forces a union that consists of structures, to be interpreted as one single structure, and transferred accordingly. The annotation requires that the fields of the structures making up the union all line up. For example, if the first field of one structure in the union is an integer value, then the first field of all other structures in the union must be an integer value as well. If the second field is a pointer in one structure, it must be a pointer in all of them, etcetera. The sxfer annotation can be used to resolve state transfer issues with unions that consist of nearly-identical structures. The programmer must line up the structure'​s fields as appropriate when annotating the union as sxfer.   * **sxfer**: Structure Transfer. This annotation forces a union that consists of structures, to be interpreted as one single structure, and transferred accordingly. The annotation requires that the fields of the structures making up the union all line up. For example, if the first field of one structure in the union is an integer value, then the first field of all other structures in the union must be an integer value as well. If the second field is a pointer in one structure, it must be a pointer in all of them, etcetera. The sxfer annotation can be used to resolve state transfer issues with unions that consist of nearly-identical structures. The programmer must line up the structure'​s fields as appropriate when annotating the union as sxfer.
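
The annotations above can be illustrated with a minimal sketch. It assumes that an annotation is applied by giving a type a name with the matching prefix, as the ''noxfer_foo_t'' and ''ixfer_mproc_t'' examples elsewhere on this page suggest. All type, field and function names below are hypothetical and do not come from the MINIX source tree. The helper function only checks the "fields must line up" requirement of //sxfer// for this one made-up union.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names throughout; only the ixfer_, cixfer_, pxfer_ and
 * sxfer_ prefixes are taken from this page. */

/* cixfer: a callback argument holding either a pointer or a small integer. */
typedef void *cixfer_cb_arg_t;

/* pxfer: a pointer value that is stored in an integer type. */
typedef uintptr_t pxfer_addr_t;

/* ixfer: an entry copied in from another service; its pointer belongs to
 * a different address space and must be copied as is, never followed. */
typedef struct remote_entry {
    int   re_slot;
    void *re_remote_ptr;
} ixfer_remote_entry_t;

/* sxfer: a union of structures whose fields line up one by one
 * (integer, pointer, size in both members). */
struct req_read  { int rr_id; char *rr_buf; size_t rr_len; };
struct req_write { int rw_id; char *rw_buf; size_t rw_len; };

typedef union request {
    struct req_read  rq_read;
    struct req_write rq_write;
} sxfer_request_t;

/* Returns 1 if the two members of sxfer_request_t place their
 * corresponding fields at the same offsets, 0 otherwise. */
int request_fields_line_up(void)
{
    return offsetof(struct req_read, rr_id)  == offsetof(struct req_write, rw_id)
        && offsetof(struct req_read, rr_buf) == offsetof(struct req_write, rw_buf)
        && offsetof(struct req_read, rr_len) == offsetof(struct req_write, rw_len);
}
```

A check such as ''request_fields_line_up()'' is not part of the magic framework; it merely shows what "lining up" means before a union is annotated as //sxfer//.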
  
Line 682: Line 573:
   noxfer_foo_t * my_foo_pointer = &​my_foo_struct;​ /* the pointer will be transferred */   noxfer_foo_t * my_foo_pointer = &​my_foo_struct;​ /* the pointer will be transferred */
  
-It is possible to enable debugging flags in libmagic ​such that it will print more details on how it handles annotated exceptions: in ''​minix/​llvm/​include/​st/​callback.h'',​ change ''​ST_CB_DEFAULT_FLAGS''​ from ''​(ST_CB_PRINT_ERR)''​ to ''​(ST_CB_PRINT_ERR|ST_CB_PRINT_DBG)''​. The debugging statements will be sent to the system log, and have a ''​[DEBUG]''​ prefix.+It is possible to enable debugging flags in libmagicrt ​such that it will print more details on how it handles annotated exceptions: in ''​minix/​lib/​libmagicrt/​include/​st/​callback.h'',​ change ''​ST_CB_DEFAULT_FLAGS''​ from ''​(ST_CB_PRINT_ERR)''​ to ''​(ST_CB_PRINT_ERR|ST_CB_PRINT_DBG)''​. The debugging statements will be sent to the system log, and have a ''​[DEBUG]''​ prefix.
  
 === Custom state transfer routines === === Custom state transfer routines ===
Line 690: Line 581:
 TODO TODO
  
-There is currently one example case where a custom state transfer routine is used, namely for the ''​dsi_u''​ union in the ''​struct data_store''​ structure which is used by the Data Store (DS) service and defined in ''​minix/​servers/​ds/​store.h''​. The custom state transfer routine is located in ''​minix/​llvm/static/​magic/​minix/​magic_ds.c'',​ and provides the state transfer process with information as to which of the union'​s fields should be transferred.+There is currently one example case where a custom state transfer routine is used, namely for the ''​dsi_u''​ union in the ''​struct data_store''​ structure which is used by the Data Store (DS) service and defined in ''​minix/​servers/​ds/​store.h''​. The custom state transfer routine is located in ''​minix/​lib/libmagicrt/​magic_ds.c'',​ and provides the state transfer process with information as to which of the union'​s fields should be transferred.
  
 === Preventing state transfer issues === === Preventing state transfer issues ===
Line 702: Line 593:
 However, the main union of the grant structure (''​cp_grant_t'',​ defined in ''​minix/​include/​minix/​safecopies.h''​) is currently marked as //ixfer//, meaning it will be transferred as is. This is not a problem for grants that point to memory //outside// the process being updated, and that means that indirect and magic grants pose no problem for state transfer. It is however a problem for grants that point to memory //inside// the process being updated, that is, for **direct grants**. However, the main union of the grant structure (''​cp_grant_t'',​ defined in ''​minix/​include/​minix/​safecopies.h''​) is currently marked as //ixfer//, meaning it will be transferred as is. This is not a problem for grants that point to memory //outside// the process being updated, and that means that indirect and magic grants pose no problem for state transfer. It is however a problem for grants that point to memory //inside// the process being updated, that is, for **direct grants**.
  
-For this reason, for a service that may potentially have direct grants active at the time of the live update, its writer has two options: 1) implement a custom state transfer routine for the ''​cp_grant_t''​ structure in libmagic, thus resolving the problem described in this entire section altogether, or 2) prevent live updates of the service whenever the service has active memory grants. The first option is preferred. In any case, the potential consequence of doing neither is that the service ends up suffering from arbitrary memory corruption after a live update, since the transferred direct grant will point to the wrong memory location.+For this reason, for a service that may potentially have direct grants active at the time of the live update, its writer has two options: 1) implement a custom state transfer routine for the ''​cp_grant_t''​ structure in libmagicrt, thus resolving the problem described in this entire section altogether, or 2) prevent live updates of the service whenever the service has active memory grants. The first option is preferred. In any case, the potential consequence of doing neither is that the service ends up suffering from arbitrary memory corruption after a live update, since the transferred direct grant will point to the wrong memory location.
  
 The live update system itself actually relies on the presence of a long-running direct grant, which provides access of the process'​s full address space to the process itself. The new instance uses this grant during a live update to access the memory contents of the old instance. Since the grant provides access to the process'​s entire address space, it does not suffer from the problem above. The live update system itself actually relies on the presence of a long-running direct grant, which provides access of the process'​s full address space to the process itself. The new instance uses this grant during a live update to access the memory contents of the old instance. Since the grant provides access to the process'​s entire address space, it does not suffer from the problem above.
Line 708: Line 599:
 == Userspace threads == == Userspace threads ==
  
-Userspace threads pose a problem for state transfer as well. We have previously explained that the process stack of the old instance can be disregarded by the state transfer procedure because it is "​naturally"​ recreated in the new instance. The same does not apply to the stacks of userspace threads, since stack variables are not tracked at run time: even though the threads'​ stacks are transferred to the new instance by the magic framework, they are seen as blobs of (typically) memory-mapped character arrays. The result is that any pointers on these stacks will not be known to libmagic ​and thus not be transferred properly. In addition, thread context (CPU register) state will typically be stored as an array of integers, and similarly end up being skipped by the state transfer procedure. The result is that while state transfer may (appear to) succeed, the service will crash after completion of the live update.+Userspace threads pose a problem for state transfer as well. We have previously explained that the process stack of the old instance can be disregarded by the state transfer procedure because it is "​naturally"​ recreated in the new instance. The same does not apply to the stacks of userspace threads, since stack variables are not tracked at run time: even though the threads'​ stacks are transferred to the new instance by the magic framework, they are seen as blobs of (typically) memory-mapped character arrays. The result is that any pointers on these stacks will not be known to libmagicrt ​and thus not be transferred properly. In addition, thread context (CPU register) state will typically be stored as an array of integers, and similarly end up being skipped by the state transfer procedure. The result is that while state transfer may (appear to) succeed, the service will crash after completion of the live update.
  
 At this time, the recommended solution is for the service to shut down all threads explicitly before starting state transfer, and to recreate the threads both after successful live update and as part of a rollback. The service may refuse to be updated if any of its threads are in use and cannot be shut down. The last point requires that the service supply a custom callback routine to SEF to perform that check for a quiescence state other than the default, through sef_setcb_lu_prepare(3). In order to allow the use of a nondefault state, a sef_setcb_lu_state_isvalid(3) callback routine must be supplied as well. For VFS and libblockdriver,​ we have chosen the following approach: At this time, the recommended solution is for the service to shut down all threads explicitly before starting state transfer, and to recreate the threads both after successful live update and as part of a rollback. The service may refuse to be updated if any of its threads are in use and cannot be shut down. The last point requires that the service supply a custom callback routine to SEF to perform that check for a quiescence state other than the default, through sef_setcb_lu_prepare(3). In order to allow the use of a nondefault state, a sef_setcb_lu_state_isvalid(3) callback routine must be supplied as well. For VFS and libblockdriver,​ we have chosen the following approach:
Line 721: Line 612:
 == Physically unmovable regions == == Physically unmovable regions ==
  
-Another case where the programmer may have to ensure that state transfer does not result in problems that will surface only after the live update, is when a service uses memory areas that are physically unmovable. Such memory areas are typically in use for DMA purposes. If the state transfer procedure changes the physical location of the buffers, DMA may be performed from or to the original physical location, resulting in garbage and possibly arbitrary memory corruption. Such DMA areas must be marked as special out-of-band memory in libmagic, and unmarked when freed, using the sef_llvm_add_special_mem_region(3) and sef_llvm_del_special_mem_region(3) SEF calls. This is done automatically by the alloc_contig(3) and free_contig(3) wrapper routines, but must be done explicitly for memory allocated in different ways.+Another case where the programmer may have to ensure that state transfer does not result in problems that will surface only after the live update, is when a service uses memory areas that are physically unmovable. Such memory areas are typically in use for DMA purposes. If the state transfer procedure changes the physical location of the buffers, DMA may be performed from or to the original physical location, resulting in garbage and possibly arbitrary memory corruption. Such DMA areas must be marked as special out-of-band memory in libmagicrt, and unmarked when freed, using the sef_llvm_add_special_mem_region(3) and sef_llvm_del_special_mem_region(3) SEF calls. This is done automatically by the alloc_contig(3) and free_contig(3) wrapper routines, but must be done explicitly for memory allocated in different ways.
  
 However, this is only necessary if DMA can happen across a live update. In cases where it is known that no DMA can possibly be ongoing during the live update, the regions are not actually physically unmovable, and therefore need not be marked as such. For example, this is the case for the file system buffer cache implemented in libminixfs. This library allocates and manages buffers without using physically contiguous memory and alloc_contig(3),​ instead using mmap(2) directly and requesting DMA I/O in page-sized chunks (in order to avoid DMA issues on ARM). Therefore, it would be affected by the above problem, were it not for the fact that all its block I/O calls are synchronous. Any future introduction of more asynchrony will turn this situation into a real problem for live update, though. However, this is only necessary if DMA can happen across a live update. In cases where it is known that no DMA can possibly be ongoing during the live update, the regions are not actually physically unmovable, and therefore need not be marked as such. For example, this is the case for the file system buffer cache implemented in libminixfs. This library allocates and manages buffers without using physically contiguous memory and alloc_contig(3),​ instead using mmap(2) directly and requesting DMA I/O in page-sized chunks (in order to avoid DMA issues on ARM). Therefore, it would be affected by the above problem, were it not for the fact that all its block I/O calls are synchronous. Any future introduction of more asynchrony will turn this situation into a real problem for live update, though.
Line 733: Line 624:
 == Dangling pointers == == Dangling pointers ==
  
-In order to know how to transfer a piece of memory, the magic library must know about the data type associated with that piece of memory. If no type information is known for a piece of memory, it cannot be transferred. There are various reasons why libmagic might not have type information about a piece of memory. The simplest one is a case of a **dangling pointer**: a pointer that used to be valid at some point, but no longer is, because the memory pointed to has been deallocated. While the actual program may know not to use that particular pointer anymore, the state transfer routine does not have such knowledge. A typical error resulting from a dangling pointer may look like this, with some important parts of the output highlighted:+In order to know how to transfer a piece of memory, the magic runtime library must know about the data type associated with that piece of memory. If no type information is known for a piece of memory, it cannot be transferred. There are various reasons why libmagicrt might not have type information about a piece of memory. The simplest one is a case of a **dangling pointer**: a pointer that used to be valid at some point, but no longer is, because the memory pointed to has been deallocated. While the actual program may know not to use that particular pointer anymore, the state transfer routine does not have such knowledge. A typical error resulting from a dangling pointer may look like this, with some important parts of the output highlighted:
  
   * **[ERROR]** uncaught ptr with violations. Current state element:   * **[ERROR]** uncaught ptr with violations. Current state element:
-  * SELEMENT: (parent=sbuf.1900354961,​ num=1, depth=0, address=0xdfb760a8,​ **name**=**sbuf**.1900354961,​ type=TYPE: (id=53 ​  , name=, size=4, num_child_types=1,​ type_id=10, bit_width=0,​ flags(ERDIVvUP)=01010000,​ values=%%''​%%,​ type_str=i8/​**char%%*%%**)) +  * SELEMENT: ​''<​nowiki>​(parent=sbuf.1900354961,​ num=1, depth=0, address=0xdfb760a8,​ **name**=**sbuf**.1900354961,​ type=TYPE: (id=53 ​  , name=, size=4, num_child_types=1,​ type_id=10, bit_width=0,​ flags(ERDIVvUP)=01010000,​ values=%%''​%%,​ type_str=i8/​**char%%*%%**))</​nowiki>''​ 
-  * SEL_ANALYZED:​ (num=1, type=ptr, flags(DIVW)=1110,​ **value**=**0x080cb49f**,​ trg_name=, trg_offset=0,​ trg_flags(RL)=D0,​ trg_selements=(#​1|0:​ 1|p=SELEMENT:​ (parent=???,​ num=0, depth=0, address=0x00000000,​ name=???, type=TYPE: (id=0    , name=**UNKNOWN_TYPE**,​ size=0, num_child_types=0,​ type_id=4, bit_width=0,​ flags(ERDIVvUP)=10000000,​ values=%%''​%%,​ type_str=UNKNOWN_TYPE/​UNKNOWN_TYPE)))) +  * SEL_ANALYZED: ​''<​nowiki>​(num=1, type=ptr, flags(DIVW)=1110,​ **value**=**0x080cb49f**,​ trg_name=, trg_offset=0,​ trg_flags(RL)=D0,​ trg_selements=(#​1|0:​ 1|p=SELEMENT:​ (parent=???,​ num=0, depth=0, address=0x00000000,​ name=???, type=TYPE: (id=0    , name=**UNKNOWN_TYPE**,​ size=0, num_child_types=0,​ type_id=4, bit_width=0,​ flags(ERDIVvUP)=10000000,​ values=%%''​%%,​ type_str=UNKNOWN_TYPE/​UNKNOWN_TYPE))))</​nowiki>''​ 
-  * SEL_STATS: (type=ptr, trg_flags(RL)=D0,​ ptr_found=1,​ **unknown_found=1**,​ violations=1)+  * SEL_STATS: ​''<​nowiki>​(type=ptr, trg_flags(RL)=D0,​ ptr_found=1,​ **unknown_found=1**,​ violations=1)</​nowiki>''​
  
-In this case, the global variable **sbuf** (suffixed with a tag to make its name unique) is a char* pointer to location 0x080cb49f. Since the magic library knows no type information about this target (//trg//) memory location, it marks the location with the placeholder type UNKNOWN_TYPE and aborts state transfer because an unknown type was found. Another example:+In this case, the global variable **sbuf** (suffixed with a tag to make its name unique) is a char* pointer to location 0x080cb49f. Since the magic runtime ​library knows no type information about this target (//trg//) memory location, it marks the location with the placeholder type UNKNOWN_TYPE and aborts state transfer because an unknown type was found. Another example:
  
   * **[ERROR]** uncaught ptr with violations. Current state element:   * **[ERROR]** uncaught ptr with violations. Current state element:
-  * SELEMENT: (parent=inode.3951291702,​ num=80, depth=2, address=0xdfbe3210,​ **name**=**inode.3951291702/​4/​i_data**,​ type=TYPE: (id=61 ​  , name=, size=4, num_child_types=1,​ type_id=10, bit_width=0,​ flags(ERDIVvUP)=01010000,​ values=%%''​%%,​ type_str=i8/​**char%%*%%**)) +  * SELEMENT: ​''<​nowiki>​(parent=inode.3951291702,​ num=80, depth=2, address=0xdfbe3210,​ **name**=**inode.3951291702/​4/​i_data**,​ type=TYPE: (id=61 ​  , name=, size=4, num_child_types=1,​ type_id=10, bit_width=0,​ flags(ERDIVvUP)=01010000,​ values=%%''​%%,​ type_str=i8/​**char%%*%%**))</​nowiki>''​ 
-  * SEL_ANALYZED:​ (num=80, type=ptr, flags(DIVW)=1110,​ **value**=**0x08108098**,​ trg_name=, trg_offset=0,​ trg_flags(RL)=H0,​ trg_selements=(#​1|0:​ 1|p=SELEMENT:​ (parent=???,​ num=0, depth=0, address=0x00000000,​ name=???, type=TYPE: (id=0    , name=**UNKNOWN_TYPE**,​ size=0, num_child_types=0,​ type_id=4, bit_width=0,​ flags(ERDIVvUP)=10000000,​ values=%%''​%%,​ type_str=UNKNOWN_TYPE/​UNKNOWN_TYPE)))) +  * SEL_ANALYZED: ​''<​nowiki>​(num=80, type=ptr, flags(DIVW)=1110,​ **value**=**0x08108098**,​ trg_name=, trg_offset=0,​ trg_flags(RL)=H0,​ trg_selements=(#​1|0:​ 1|p=SELEMENT:​ (parent=???,​ num=0, depth=0, address=0x00000000,​ name=???, type=TYPE: (id=0    , name=**UNKNOWN_TYPE**,​ size=0, num_child_types=0,​ type_id=4, bit_width=0,​ flags(ERDIVvUP)=10000000,​ values=%%''​%%,​ type_str=UNKNOWN_TYPE/​UNKNOWN_TYPE))))</​nowiki>''​ 
-  * SEL_STATS: (type=ptr, trg_flags(RL)=H0,​ ptr_found=1,​ **unknown_found=1**,​ violations=1)+  * SEL_STATS: ​''<​nowiki>​(type=ptr, trg_flags(RL)=H0,​ ptr_found=1,​ **unknown_found=1**,​ violations=1)</​nowiki>''​
  
-In this case, the **i_data** field of the fifth element (**/4/**) of the global **inode** structure, also a char* pointer, is pointing to address 0x08108098 which is unknown to libmagic. The pointer address typically allows one to determine what kind of memory it is, by means of the memory sections of the process. In this particular example, the address was somewhat higher than the service'​s data end, thus suggesting the memory pointed to is heap memory. This matched with the source code of the service (PFS, the Pipe File Server), which dynamically allocates and frees the i_data buffers using malloc(3) and free(3).+In this case, the **i_data** field of the fifth element (**/4/**) of the global **inode** structure, also a char* pointer, is pointing to address 0x08108098 which is unknown to libmagicrt. The pointer address typically allows one to determine what kind of memory it is, by means of the memory sections of the process. In this particular example, the address was somewhat higher than the service'​s data end, thus suggesting the memory pointed to is heap memory. This matched with the source code of the service (PFS, the Pipe File Server), which dynamically allocates and frees the i_data buffers using malloc(3) and free(3).
  
 It is up to the programmer of the service to ensure that the state transfer routine will not attempt to transfer a dangling pointer. This can be as simple as zeroing out the pointer after use, which is usually good practice anyway: It is up to the programmer of the service to ensure that the state transfer routine will not attempt to transfer a dangling pointer. This can be as simple as zeroing out the pointer after use, which is usually good practice anyway:
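
A minimal sketch of that practice, modeled loosely on the //i_data// case above: the buffer is released and the pointer is cleared in one place, so that no stale address is left behind for the state transfer routine to trip over. The structure and function names are hypothetical and are not taken from the PFS source.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a structure with a heap-allocated buffer. */
struct inode_sketch {
    char *i_data;
};

/* Allocate the buffer; returns 0 on success, -1 on failure. */
int alloc_data(struct inode_sketch *ip, size_t len)
{
    ip->i_data = malloc(len);
    return ip->i_data != NULL ? 0 : -1;
}

/* Free the buffer and zero the pointer, so that it never dangles.
 * Calling this twice is harmless, since free(NULL) is a no-op. */
void release_data(struct inode_sketch *ip)
{
    free(ip->i_data);
    ip->i_data = NULL;
}
```

Keeping the ''free()'' call and the assignment together in one helper makes it harder to forget the second step.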
Line 763: Line 654:
   static ixfer_mproc_t mproc;   static ixfer_mproc_t mproc;
  
-In some cases, it may make more sense to zero out pointers instead. In other cases, we have changed code to retrieve not entire kernel tables but only specific values, or to use the kernel-mapped pages instead of copies of kernel structures to retrieve values. The magic library already ignores pointers into kernel space (that is, 0xf0000000 and higher) altogether.+In some cases, it may make more sense to zero out pointers instead. In other cases, we have changed code to retrieve not entire kernel tables but only specific values, or to use the kernel-mapped pages instead of copies of kernel structures to retrieve values. The magic runtime ​library already ignores pointers into kernel space (that is, 0xf0000000 and higher) altogether.
  
 Theoretically it is possible that remote pointers end up being valid in the local address space by sheer luck. In known cases of copying in external pointers, it is best not to rely on failures in the magic framework, but rather to annotate the code in a proactive manner. Theoretically it is possible that remote pointers end up being valid in the local address space by sheer luck. In known cases of copying in external pointers, it is best not to rely on failures in the magic framework, but rather to annotate the code in a proactive manner.
Line 774: Line 665:
  
   * **[ERROR]** uncaught ptr with violations. Current state element:   * **[ERROR]** uncaught ptr with violations. Current state element:
-  * SELEMENT: (parent=ds_subs.1944246923,​ num=9, depth=3, address=0xdfbe6108,​ name=**ds_subs.1944246923/​0/​regex/​re_g**,​ type=TYPE: (id=18 ​  , name=, size=4, num_child_types=1,​ type_id=10, bit_width=0,​ flags(ERDIVvUP)=00000000,​ values=%%''​%%,​ type_str=opaque*)) +  * SELEMENT: ​''<​nowiki>​(parent=ds_subs.1944246923,​ num=9, depth=3, address=0xdfbe6108,​ name=**ds_subs.1944246923/​0/​regex/​re_g**,​ type=TYPE: (id=18 ​  , name=, size=4, num_child_types=1,​ type_id=10, bit_width=0,​ flags(ERDIVvUP)=00000000,​ values=%%''​%%,​ type_str=opaque*))</​nowiki>''​ 
-  * SEL_ANALYZED:​ (num=9, type=ptr, flags(DIVW)=1110,​ **value**=**0x08111000**,​ trg_name=, trg_offset=0,​ trg_flags(RL)=,​ trg_selements=(#​1|0:​ 1|p=SELEMENT:​ (parent=???,​ num=0, depth=0, address=0x00000000,​ name=???, type=TYPE: (id=0    , name=**UNKNOWN_TYPE**,​ size=0, num_child_types=0,​ type_id=4, bit_width=0,​ flags(ERDIVvUP)=10000000,​ values=%%''​%%,​ type_str=UNKNOWN_TYPE/​UNKNOWN_TYPE)))) +  * SEL_ANALYZED: ​''<​nowiki>​(num=9, type=ptr, flags(DIVW)=1110,​ **value**=**0x08111000**,​ trg_name=, trg_offset=0,​ trg_flags(RL)=,​ trg_selements=(#​1|0:​ 1|p=SELEMENT:​ (parent=???,​ num=0, depth=0, address=0x00000000,​ name=???, type=TYPE: (id=0    , name=**UNKNOWN_TYPE**,​ size=0, num_child_types=0,​ type_id=4, bit_width=0,​ flags(ERDIVvUP)=10000000,​ values=%%''​%%,​ type_str=UNKNOWN_TYPE/​UNKNOWN_TYPE))))</​nowiki>''​ 
-  * SEL_STATS: (type=ptr, ptr_found=1,​ **unknown_found=1**,​ violations=1)+  * SEL_STATS: ​''<​nowiki>​(type=ptr, ptr_found=1,​ **unknown_found=1**,​ violations=1)</​nowiki>''​
  
 In this case, the pointer **ds_subs[0].regex.re_g** ended up pointing to the unknown heap-section value of 0x08111000. We worked around this issue by forcing DS to use the targets of the weak aliases, _regcomp and _regfree, rather than their original names, using Makefile hacks. In this case, the pointer **ds_subs[0].regex.re_g** ended up pointing to the unknown heap-section value of 0x08111000. We worked around this issue by forcing DS to use the targets of the weak aliases, _regcomp and _regfree, rather than their original names, using Makefile hacks.
  
-== Code used only in libmagic ​==+== Code used only in libmagicrt ​==
  
-If the magic library itself uses other library modules, for example from libc, and these modules are not already used by the service itself anyway, then the bitcode linker may not include them in the linked object on which the instrumentation passes are run. Again, this may result in various failures, and unknown pointers in particular:+If the magic runtime ​library itself uses other library modules, for example from libc, and these modules are not already used by the service itself anyway, then the bitcode linker may not include them in the linked object on which the instrumentation passes are run. Again, this may result in various failures, and unknown pointers in particular:
  
   * **[ERROR]** uncaught ptr with violations. Current state element:   * **[ERROR]** uncaught ptr with violations. Current state element:
-  * SELEMENT: (parent=_ctype_tab_,​ num=1, depth=0, address=0xdfb760a8,​ **name**=**_ctype_tab_**,​ type=TYPE: (id=204 ​ , name=, size=4, num_child_types=1,​ type_id=10, bit_width=0,​ flags(ERDIVvUP)=11000000,​ values=%%''​%%,​ type_str=i16/​unsigned short*)) +  * SELEMENT: ​''<​nowiki>​(parent=_ctype_tab_,​ num=1, depth=0, address=0xdfb760a8,​ **name**=**_ctype_tab_**,​ type=TYPE: (id=204 ​ , name=, size=4, num_child_types=1,​ type_id=10, bit_width=0,​ flags(ERDIVvUP)=11000000,​ values=%%''​%%,​ type_str=i16/​unsigned short*))</​nowiki>''​ 
-  * SEL_ANALYZED:​ (num=1, type=ptr, flags(DIVW)=1110,​ **value**=**0x0809ccb6**,​ trg_name=, trg_offset=0,​ trg_flags(RL)=,​ trg_selements=(#​1|0:​ 1|p=SELEMENT:​ (parent=???,​ num=0, depth=0, address=0x00000000,​ name=???, type=TYPE: (id=0    , name=**UNKNOWN_TYPE**,​ size=0, num_child_types=0,​ type_id=4, bit_width=0,​ flags(ERDIVvUP)=10000000,​ values=%%''​%%,​ type_str=UNKNOWN_TYPE/​UNKNOWN_TYPE)))) +  * SEL_ANALYZED: ​''<​nowiki>​(num=1, type=ptr, flags(DIVW)=1110,​ **value**=**0x0809ccb6**,​ trg_name=, trg_offset=0,​ trg_flags(RL)=,​ trg_selements=(#​1|0:​ 1|p=SELEMENT:​ (parent=???,​ num=0, depth=0, address=0x00000000,​ name=???, type=TYPE: (id=0    , name=**UNKNOWN_TYPE**,​ size=0, num_child_types=0,​ type_id=4, bit_width=0,​ flags(ERDIVvUP)=10000000,​ values=%%''​%%,​ type_str=UNKNOWN_TYPE/​UNKNOWN_TYPE))))</​nowiki>''​ 
-  * SEL_STATS: (type=ptr, ptr_found=1,​ **unknown_found=1**,​ violations=1)+  * SEL_STATS: ​''<​nowiki>​(type=ptr, ptr_found=1,​ **unknown_found=1**,​ violations=1)</​nowiki>''​
  
-In this particular failure case, the global **_ctype_tab_** variable pointed into another global variable, at location 0x0809ccb6 in the data section. The other global variable was invisible to the magic pass, so no **sentry** object could be created for it. As a result, libmagic did not know about the target of the pointer. The _ctype_tab_ variable itself was used by the ''<ctype.h>'' isalpha(3) (etc) set of macros from within libmagic. We worked around this issue by putting our own replacement set of macros in libmagic instead.+In this particular failure case, the global **_ctype_tab_** variable pointed into another global variable, at location 0x0809ccb6 in the data section. The other global variable was invisible to the magic pass, so no **sentry** object could be created for it. As a result, libmagicrt did not know about the target of the pointer. The _ctype_tab_ variable itself was used by the ''<ctype.h>'' isalpha(3) (etc) set of macros from within libmagicrt. We worked around this issue by putting our own replacement set of macros in libmagicrt instead.
  
 == Assembly code == == Assembly code ==
Line 799: Line 690:
 Finally, we describe one class of state transfer failures which are the result of shortcomings in the magic instrumentation framework itself. LLVM bitcode has the notion of an **opaque** data type. The opaque data type is used for data of which the type has been declared but not defined, typically as a result of forward declarations of structures (''​struct foo;''​). Instead of resolving these types after they have been instantiated,​ LLVM tends to cast between various data types which are identical except for the presence of opaque pointers. As a result, opaque pointers may show up in various places in linked bitcode. Finally, we describe one class of state transfer failures which are the result of shortcomings in the magic instrumentation framework itself. LLVM bitcode has the notion of an **opaque** data type. The opaque data type is used for data of which the type has been declared but not defined, typically as a result of forward declarations of structures (''​struct foo;''​). Instead of resolving these types after they have been instantiated,​ LLVM tends to cast between various data types which are identical except for the presence of opaque pointers. As a result, opaque pointers may show up in various places in linked bitcode.
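
As a minimal illustration of where such types come from, consider the C pattern below. A source file that sees only the forward declaration and the ''struct holder'' definition knows nothing about the layout of ''struct foo'', so a pointer to it is a candidate for an opaque type in that module's bitcode. The names are hypothetical, and the definition of ''struct foo'' is included here only to keep the sketch self-contained; in practice it would typically live in a different source file. How LLVM represents the pointer exactly may depend on the LLVM version in use.

```c
#include <assert.h>
#include <stddef.h>

/* Forward declaration: the type is declared but not defined. */
struct foo;

/* This structure only needs a pointer to struct foo, so the
 * incomplete type is sufficient to compile it. */
struct holder {
    struct foo *h_foo;
    int         h_count;
};

/* The definition; typically found in another source file. */
struct foo {
    int f_value;
};

/* Dereferencing the pointer requires the complete type.
 * Returns the stored value, or -1 if no object is attached. */
int holder_value(const struct holder *h)
{
    return h->h_foo != NULL ? h->h_foo->f_value : -1;
}
```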
  
-The magic pass should mark all these practically identical data types as //​compatible types//. However, due to the fact that the casts can take rather complex forms, this is not always happening. The result is that in some cases, state transfer may fail because ​libmagic ​erroneously detects an incompatibility between a pointer type and the type of data being pointed to. As an example, the following state transfer error was reported during state transfer of the PM service:+The magic pass should mark all these practically identical data types as //​compatible types//. However, due to the fact that the casts can take rather complex forms, this is not always happening. The result is that in some cases, state transfer may fail because ​libmagicrt ​erroneously detects an incompatibility between a pointer type and the type of data being pointed to. As an example, the following state transfer error was reported during state transfer of the PM service:
  
   * **[ERROR]** uncaught ptr with violations. Current state element:   * **[ERROR]** uncaught ptr with violations. Current state element:
-  * SELEMENT: (parent=timers.515278380,​ num=1, depth=0, address=0xdfb760a8,​ **name**=**timers**.515278380,​ type=TYPE: (id=96 ​  , name=, size=4, num_child_types=1,​ type_id=10, bit_width=0,​ flags(ERDIVvUP)=01000000,​ values=%%''​%%,​ **type_str**={ $minix_timer tmr_next \2, tmr_exp_time i32/long unsigned int, **tmr_func opaque%%*%%**,​ tmr_arg { (U) $ixfer_tmr_arg_t ta_int i32/int } }*)) +  * SELEMENT: ​''<​nowiki>​(parent=timers.515278380,​ num=1, depth=0, address=0xdfb760a8,​ **name**=**timers**.515278380,​ type=TYPE: (id=96 ​  , name=, size=4, num_child_types=1,​ type_id=10, bit_width=0,​ flags(ERDIVvUP)=01000000,​ values=%%''​%%,​ **type_str**={ $minix_timer tmr_next \2, tmr_exp_time i32/long unsigned int, **tmr_func opaque%%*%%**,​ tmr_arg { (U) $ixfer_tmr_arg_t ta_int i32/int } }*))</​nowiki>''​ 
-  * SEL_ANALYZED:​ (num=1, type=ptr, flags(DIVW)=1110,​ value=0x08147460,​ trg_name=mproc,​ trg_offset=274616,​ trg_flags(RL)=D0,​ trg_selements=(**#​2**|0:​ **1**|o=SELEMENT:​ (parent=mproc,​ num=0, depth=0, address=0x08147460,​ name=**mproc/​143/​mp_timer**,​ type=TYPE: (id=38 ​  , name=minix_timer,​ size=16, num_child_types=4,​ type_id=9, bit_width=0,​ flags(ERDIVvUP)=00000000,​ values=%%''​%%,​ names='​minix_timer_t|minix_timer',​ **type_str**={ $minix_timer tmr_next { $minix_timer tmr_next \2, tmr_exp_time i32/long unsigned int, tmr_func hash_3792421438/​*,​ tmr_arg { (U) $ixfer_tmr_arg_t ta_int i32/int } }*, tmr_exp_time i32/long unsigned int, **tmr_func hash_3792421438/​%%*%%**,​ tmr_arg { (U) $ixfer_tmr_arg_t ta_int i32/int } })), **2**|o=SELEMENT:​ (parent=mproc,​ num=0, depth=0, address=0x08147460,​ name=mproc/​143/​mp_timer/​tmr_next,​ type=TYPE: (id=37 ​  , name=, size=4, num_child_types=1,​ type_id=10, bit_width=0,​ flags(ERDIVvUP)=00000000,​ values=%%''​%%,​ **type_str**={ $minix_timer tmr_next \2, tmr_exp_time i32/long unsigned int, **tmr_func hash_3792421438/​%%*%%**,​ tmr_arg { (U) $ixfer_tmr_arg_t ta_int i32/int } }*)))) +  * SEL_ANALYZED: ​''<​nowiki>​(num=1, type=ptr, flags(DIVW)=1110,​ value=0x08147460,​ trg_name=mproc,​ trg_offset=274616,​ trg_flags(RL)=D0,​ trg_selements=(**#​2**|0:​ **1**|o=SELEMENT:​ (parent=mproc,​ num=0, depth=0, address=0x08147460,​ name=**mproc/​143/​mp_timer**,​ type=TYPE: (id=38 ​  , name=minix_timer,​ size=16, num_child_types=4,​ type_id=9, bit_width=0,​ flags(ERDIVvUP)=00000000,​ values=%%''​%%,​ names='​minix_timer_t|minix_timer',​ **type_str**={ $minix_timer tmr_next { $minix_timer tmr_next \2, tmr_exp_time i32/long unsigned int, tmr_func hash_3792421438/​*,​ tmr_arg { (U) $ixfer_tmr_arg_t ta_int i32/int } }*, tmr_exp_time i32/long unsigned int, **tmr_func hash_3792421438/​%%*%%**,​ tmr_arg { (U) $ixfer_tmr_arg_t ta_int i32/int } })), **2**|o=SELEMENT:​ (parent=mproc,​ num=0, depth=0, address=0x08147460,​ 
name=mproc/​143/​mp_timer/​tmr_next,​ type=TYPE: (id=37 ​  , name=, size=4, num_child_types=1,​ type_id=10, bit_width=0,​ flags(ERDIVvUP)=00000000,​ values=%%''​%%,​ **type_str**={ $minix_timer tmr_next \2, tmr_exp_time i32/long unsigned int, **tmr_func hash_3792421438/​%%*%%**,​ tmr_arg { (U) $ixfer_tmr_arg_t ta_int i32/int } }*))))</​nowiki>''​ 
-  * SEL_STATS: (type=ptr, trg_flags(RL)=D0,​ ptr_found=1,​ **other_types_found=1**,​ violations=1)+  * SEL_STATS: ​''<​nowiki>​(type=ptr, trg_flags(RL)=D0,​ ptr_found=1,​ **other_types_found=1**,​ violations=1)</​nowiki>''​
  
 In this case, the analysis failed on the global **timers** variable. The analysis dump shows that two matching types (**#2**) were found, both associated with the **mproc[143].mp_timer** structure field, but neither type matched the type of the pointer. A closer look at the textual representations of the pointer type (the **type_str** of the primary //​selement//​) and of the data types (the //​type_str//​ of the target //​selement//​s) reveals that there is only one difference between the two: the **tmr_func** field of the structure type to which the //timers// variable should point is an **opaque** pointer, whereas the same //​tmr_func//​ field of the target structures is a particular function pointer (to a function referred to as **hash_3792421438**). The remainder of the types are the same. In this case, the analysis failed on the global **timers** variable. The analysis dump shows that two matching types (**#2**) were found, both associated with the **mproc[143].mp_timer** structure field, but neither type matched the type of the pointer. A closer look at the textual representations of the pointer type (the **type_str** of the primary //​selement//​) and of the data types (the //​type_str//​ of the target //​selement//​s) reveals that there is only one difference between the two: the **tmr_func** field of the structure type to which the //timers// variable should point is an **opaque** pointer, whereas the same //​tmr_func//​ field of the target structures is a particular function pointer (to a function referred to as **hash_3792421438**). The remainder of the types are the same.
Line 815: Line 706:
  
   * **[ERROR]** uncaught ptr with violations. Current state element:   * **[ERROR]** uncaught ptr with violations. Current state element:
-  * SELEMENT: (parent=sched_timer.29458437,​ num=4, depth=1, address=0xdfbe70b0,​ **name**=**sched_timer.29458437/​tmr_func**,​ type=TYPE: (id=17 ​  , name=tmr_func_t,​ size=4, num_child_types=1,​ type_id=10, bit_width=0,​ flags(ERDIVvUP)=00000000,​ values=%%''​%%,​ type_str=**opaque%%*%%**)) +  * SELEMENT: ​''<​nowiki>​(parent=sched_timer.29458437,​ num=4, depth=1, address=0xdfbe70b0,​ **name**=**sched_timer.29458437/​tmr_func**,​ type=TYPE: (id=17 ​  , name=tmr_func_t,​ size=4, num_child_types=1,​ type_id=10, bit_width=0,​ flags(ERDIVvUP)=00000000,​ values=%%''​%%,​ type_str=**opaque%%*%%**))</​nowiki>''​ 
-  * SEL_ANALYZED:​ (num=4, type=ptr, flags(DIVW)=1110,​ value=0x08048dc0,​ trg_name=**balance_queues**.29458437,​ trg_offset=0,​ trg_flags(RL)=T0,​ trg_selements=(#​1|0:​ 1|o=SELEMENT:​ (parent=???,​ num=0, depth=0, address=0x08048dc0,​ name=???, type=TYPE: (id=119 ​ , name=, size=1, num_child_types=0,​ type_id=4, bit_width=0,​ flags(ERDIVvUP)=11000001,​ values=%%''​%%,​ type_str=**hash_3792445575/​**)))) +  * SEL_ANALYZED: ​''<​nowiki>​(num=4, type=ptr, flags(DIVW)=1110,​ value=0x08048dc0,​ trg_name=**balance_queues**.29458437,​ trg_offset=0,​ trg_flags(RL)=T0,​ trg_selements=(#​1|0:​ 1|o=SELEMENT:​ (parent=???,​ num=0, depth=0, address=0x08048dc0,​ name=???, type=TYPE: (id=119 ​ , name=, size=1, num_child_types=0,​ type_id=4, bit_width=0,​ flags(ERDIVvUP)=11000001,​ values=%%''​%%,​ type_str=**hash_3792445575/​**))))</​nowiki>''​ 
-  * SEL_STATS: (type=ptr, trg_flags(RL)=T0,​ ptr_found=1,​ **other_types_found=1**,​ violations=1)+  * SEL_STATS: ​''<​nowiki>​(type=ptr, trg_flags(RL)=T0,​ ptr_found=1,​ **other_types_found=1**,​ violations=1)</​nowiki>''​
  
 In this case, the type mismatch was not between two structures that differed in opaque fields, but between two function pointers themselves: the function pointer in **sched_timer.tmr_func**,​ and the function it is pointing to, **balance_queues**. Registering these types as compatible would result in much more complexity in the magic pass, and likely still not resolve the more general problem of opaque pointers. This is currently one of the open issues, and we believe that another approach would be more viable; see below. In this particular case, it turned out that the sched service did not need to use timers at all, and we simplified it by getting rid of its use of timers altogether. Obviously, adapting the actual functionality of a service to allow for state transfer is not always an option, nor is it generally the right approach: the core code of system services should not have to be (re)written specifically to allow for state transfer. In this case, the type mismatch was not between two structures that differed in opaque fields, but between two function pointers themselves: the function pointer in **sched_timer.tmr_func**,​ and the function it is pointing to, **balance_queues**. Registering these types as compatible would result in much more complexity in the magic pass, and likely still not resolve the more general problem of opaque pointers. This is currently one of the open issues, and we believe that another approach would be more viable; see below. In this particular case, it turned out that the sched service did not need to use timers at all, and we simplified it by getting rid of its use of timers altogether. Obviously, adapting the actual functionality of a service to allow for state transfer is not always an option, nor is it generally the right approach: the core code of system services should not have to be (re)written specifically to allow for state transfer.
Line 829: Line 720:
 ==== The build system ==== ==== The build system ====
  
-As shown in the setup part of the users guide, the entire live update build infrastructure consists of a separate set of scripts stacked on top of the regular build system. These scripts deviate from the standard build system approach in various ways, for example by building a separate copy of the LLVM toolchain, placing binaries in the MINIX3 source tree, and separately performing scripted steps which should be performed from the regular makefile infrastructure instead. All of these issues should be resolved through proper integration of the live update build infrastructure into the regular build system. +As shown in the setup part of the users guide, the live update functionality requires that a separate instance of the LLVM toolchain be built. Unlike the standard toolchain, this separate instance is suitable for Link-Time Optimization (LTO). It is built by ''minix/llvm/generate_gold_plugin.sh'', and placed in ''obj_llvm.i386''. The exact same LLVM 3.6.1 source code is used to compile both the LTO-enabled toolchain and the additional regular crosscompilation toolchain in ''obj.i386'', using the exact same configuration flags. The separate compilation is necessary only because of a problem with makefiles.
- +
-=== Two LLVM toolchains === +
- +
-A major part of the problem is the current necessity to build a separate instance of the LLVM toolchain. Unlike the standard toolchain, this separate instance is suitable for Link-Time Optimization (LTO). It is built by ''​minix/​llvm/​generate_gold_plugin.sh'',​ and placed in ''​obj_llvm.i386''​. The exact same LLVM 3.source code is used to compile both the LTO-enabled toolchain and the additional regular crosscompilation toolchain in ''​obj.i386'',​ using the exact configuration flags. The separate compilation is necessary only because of a problem with makefiles.+
  
 NetBSD uses its own set of makefiles to build imported code using its own build system. MINIX3 imports this system, and thus also uses the NetBSD set of makefiles to build the LLVM toolchain. The problem is that these makefiles do not operate in the same way as LLVM's own set of makefiles, resulting in certain parts of the LLVM toolchain being built in a different way. The separate LLVM LTO toolchain build does use LLVM's own makefiles, thereby generating some missing pieces that are required for the live update instrumentation. NetBSD uses its own set of makefiles to build imported code using its own build system. MINIX3 imports this system, and thus also uses the NetBSD set of makefiles to build the LLVM toolchain. The problem is that these makefiles do not operate in the same way as LLVM's own set of makefiles, resulting in certain parts of the LLVM toolchain being built in a different way. The separate LLVM LTO toolchain build does use LLVM's own makefiles, thereby generating some missing pieces that are required for the live update instrumentation.
Line 839: Line 726:
 The solution here is to adapt the NetBSD set of makefiles to build LLVM in a way that is closer to LLVM's own makefiles, thereby generating all the necessary parts of the toolchain without the need to build LLVM twice. The solution here is to adapt the NetBSD set of makefiles to build LLVM in a way that is closer to LLVM's own makefiles, thereby generating all the necessary parts of the toolchain without the need to build LLVM twice.
  
-=== Lack of integration === +As part of this, the generated instrumentation passes should not be placed in the ''minix/llvm/bin'' subdirectory of the MINIX3 source tree. Instead, they should end up in an appropriate subdirectory of ''obj.i386'', thereby keeping the source directory clean.
- +
-Once that step has been taken, it should be possible to resolve the other issues as well, effectively replacing all the ''​*.llvm''​ scripts in ''​minix/​llvm''​ with extensions in the regular build system, specifically by adapting the ''​share/​mk''​ set of makefiles as appropriate. All of this should be optional, controlled by the ''​MKMAGIC''​ build (pseudo)variable and possibly other, new build variables (e.g., to control ASR settings). Ultimately, relinking with libmagic and invoking the appropriate link-time passes should be performed by those makefiles. As an example, the WeakAliasModuleOverride pass is already invoked this way. +
- +
-All passes, as well as the magic library, should be (re)built as part of the standard build system infrastructure. As we have indicated earlier, the lack of this step puts an unnecessary burden on the user of the system. +
- +
-As part of this, none of the generated ​binaries ​should be placed in ''​minix/​llvm/​bin''​. Instead, they should end up in an appropriate subdirectory of ''​obj.i386'',​ thereby keeping the source directory clean+
- +
-Finally, any generated ASR-rerandomized service binaries should automatically be removed when the corresponding service is reinstalled,​ so as to prevent that stale ASR binaries end up in a generated image.+
  
 ==== Instrumentation ==== ==== Instrumentation ====
Line 862: Line 741:
 However, the magic framework was written for LLVM 2.x, and as a result, this problem was dealt with as an afterthought. The combination of the wildly varying forms that these bit casts can take, and the limited support for processing the bit casts in the magic pass, has created the situation that not all cases of identical types are properly registered as //​compatible types//. As of writing, this has not yet been a real problem, but it is likely to become a problem in the future. However, the magic framework was written for LLVM 2.x, and as a result, this problem was dealt with as an afterthought. The combination of the wildly varying forms that these bit casts can take, and the limited support for processing the bit casts in the magic pass, has created the situation that not all cases of identical types are properly registered as //​compatible types//. As of writing, this has not yet been a real problem, but it is likely to become a problem in the future.
  
-We believe that the right solution would be the introduction of a new **type unification pass**. This pass would unify all effectively-identical types in the module at link time, eliminating redundant types and bitcasts in the module. The pass could then be run before the magic pass. This would not only resolve the complete problem, but also free the magic pass of the burden to provide a complete system for enumerating compatible types. As a beneficial side effect, there would be a reduction in the amount of type state that needs to be included with the service, and a reduction in effort needed by libmagic ​to search through compatible types.+We believe that the right solution would be the introduction of a new **type unification pass**. This pass would unify all effectively-identical types in the module at link time, eliminating redundant types and bitcasts in the module. The pass could then be run before the magic pass. This would not only resolve the complete problem, but also free the magic pass of the burden to provide a complete system for enumerating compatible types. As a beneficial side effect, there would be a reduction in the amount of type state that needs to be included with the service, and a reduction in effort needed by libmagicrt ​to search through compatible types.
  
-=== ASR skipping ​libmagic ​===+=== ASR skipping ​libmagicrt ​===
  
-The ASR pass currently exempts all of the magic library from rerandomization. This is highly problematic for the overall effectiveness of ASR: libmagic ​is in principle linked with all system services, thus providing any attacker with a well known, large, unrandomized set of code and data for use in an attack on any running service.+The ASR pass currently exempts all of the magic runtime ​library from rerandomization. This is highly problematic for the overall effectiveness of ASR: libmagicrt ​is in principle linked with all system services, thus providing any attacker with a well known, large, unrandomized set of code and data for use in an attack on any running service.
  
 The exact reasons as to why this exception was made are currently unknown. However, if possible, this overall limitation should be resolved by either removing the exception or at least narrowing it to the exact scope of the problem. The exact reasons as to why this exception was made are currently unknown. However, if possible, this overall limitation should be resolved by either removing the exception or at least narrowing it to the exact scope of the problem.
Line 886: Line 765:
 MINIX3 currently does not deal well with running out of memory. Most system services do not have preallocated pages in their heap, stack, and mmap sections. This may create major issues in low-memory situations. For example, if a service attempts to use an extra page of stack while the system has no free memory, the service will be killed, possibly taking down the entire system with it. Beyond VM freeing cached file system data when it runs out of memory, any sort of infrastructure to deal with this general problem is completely absent. MINIX3 currently does not deal well with running out of memory. Most system services do not have preallocated pages in their heap, stack, and mmap sections. This may create major issues in low-memory situations. For example, if a service attempts to use an extra page of stack while the system has no free memory, the service will be killed, possibly taking down the entire system with it. Beyond VM freeing cached file system data when it runs out of memory, any sort of infrastructure to deal with this general problem is completely absent.
  
-The live update and rerandomization support is making this situation even more problematic. The magic library uses extra dynamic memory, and is not particularly careful about using preallocated memory where necessary. The ASR functionality increases memory usage even further. For example, its stack padding feature requires a considerable amount of extra stack space. The result is that there is now an increasingly large number of scenarios where out-of-memory conditions result in failure of running system services, and possibly the entire system.+The live update and rerandomization support is making this situation even more problematic. The magic runtime ​library uses extra dynamic memory, and is not particularly careful about using preallocated memory where necessary. The ASR functionality increases memory usage even further. For example, its stack padding feature requires a considerable amount of extra stack space. The result is that there is now an increasingly large number of scenarios where out-of-memory conditions result in failure of running system services, and possibly the entire system.
  
 Even though certain services should be rewritten to deal more gracefully with cases of dynamic memory allocation failure, the example of faulted-in stack pages illustrates that this is not a viable option in general. There has been a partial attempt to prepare the buffer caches of file system services for having their memory stolen by VM at run time, but its implementation is, where present, deeply flawed, and will likely be removed altogether soon. Instead, we believe that the easiest solution for this problem is to let VM reserve a certain amount of memory exclusively for satisfying page faults and page-handling requests involving memory in system services. Even though certain services should be rewritten to deal more gracefully with cases of dynamic memory allocation failure, the example of faulted-in stack pages illustrates that this is not a viable option in general. There has been a partial attempt to prepare the buffer caches of file system services for having their memory stolen by VM at run time, but its implementation is, where present, deeply flawed, and will likely be removed altogether soon. Instead, we believe that the easiest solution for this problem is to let VM reserve a certain amount of memory exclusively for satisfying page faults and page-handling requests involving memory in system services.
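The reservation idea can be sketched as a simple accounting rule. This is only an illustration under stated assumptions: ''struct page_pool'' and ''pool_alloc'' are hypothetical names, not part of VM, and the real allocator would have to distinguish request origins in its own way.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical page accounting state; not VM's actual data structures. */
struct page_pool {
    size_t free_pages;     /* pages currently unallocated */
    size_t reserved_pages; /* pages set aside for system services */
};

/*
 * Try to take 'n' pages from the pool. Requests made on behalf of system
 * services may dip into the reserve; all other requests must leave the
 * reserve untouched. Returns true if the pages were taken.
 */
bool pool_alloc(struct page_pool *pool, size_t n, bool for_system_service)
{
    assert(pool != NULL);

    /* Number of free pages that this kind of request may not consume. */
    size_t floor = for_system_service ? 0 : pool->reserved_pages;

    if (pool->free_pages < floor || pool->free_pages - floor < n)
        return false;

    pool->free_pages -= n;
    return true;
}
```

With ten free pages and a reserve of four, an ordinary request for six pages succeeds, a further ordinary request for one page fails, and a page fault in a system service can still be satisfied from the reserve.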
Line 902: Line 781:
 Finally, support for setting or enforcing page protection bits is mostly missing in VM as well. The live update integration has resulted in one particular case where this is now a problem. The MINIX3 userspace threading library, libmthread, inserts a guard page at the bottom of each thread stack in order to detect stack overruns. The guard page was originally created by unmapping the bottom page of the stack, thus leaving an unmapped hole there. This approach worked, but was not ideal: the hole could potentially be filled by a separate one-page allocation later, thereby subverting the intended protection. Finally, support for setting or enforcing page protection bits is mostly missing in VM as well. The live update integration has resulted in one particular case where this is now a problem. The MINIX3 userspace threading library, libmthread, inserts a guard page at the bottom of each thread stack in order to detect stack overruns. The guard page was originally created by unmapping the bottom page of the stack, thus leaving an unmapped hole there. This approach worked, but was not ideal: the hole could potentially be filled by a separate one-page allocation later, thereby subverting the intended protection.
  
-Since libmagic ​performs extra memory allocations,​ this problem is a bit more relevant for live update. For this and other reasons, the libmthread code was changed to reallocate the guard page with ''​PROT_NONE''​ protection instead. Theoretically,​ this should work fine. In practice, since VM does not implement support for protection, the guard page is now simply an additional stack page. Thus, as of writing, the libmthread guard page functionality is broken.+Since libmagicrt ​performs extra memory allocations,​ this problem is a bit more relevant for live update. For this and other reasons, the libmthread code was changed to reallocate the guard page with ''​PROT_NONE''​ protection instead. Theoretically,​ this should work fine. In practice, since VM does not implement support for protection, the guard page is now simply an additional stack page. Thus, as of writing, the libmthread guard page functionality is broken.
  
 Ideally, this issue would be resolved by implementing proper support for page protection in VM, including for example an implementation of mprotect(2). Ideally, this issue would be resolved by implementing proper support for page protection in VM, including for example an implementation of mprotect(2).
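For reference, the intended guard page behavior can be sketched for a POSIX system on which page protection is enforced. This is a minimal illustration, not the libmthread code: ''alloc_stack_with_guard'' and ''free_stack_with_guard'' are hypothetical helper names, and the sketch assumes working ''mmap'' (with ''MAP_ANONYMOUS'') and ''mprotect'' implementations.

```c
#define _DEFAULT_SOURCE /* for MAP_ANONYMOUS on some C libraries */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: allocate a thread stack of 'usable' bytes (a nonzero
 * multiple of the page size) with one inaccessible guard page below it.
 * Returns the lowest usable stack address, or NULL on failure.
 */
void *alloc_stack_with_guard(size_t usable)
{
    long page = sysconf(_SC_PAGESIZE);

    if (page <= 0 || usable == 0 || usable % (size_t)page != 0)
        return NULL;

    /*
     * Map the guard page and the stack as one region, so that no later
     * allocation can end up in the guard page's address range.
     */
    char *base = mmap(NULL, usable + (size_t)page, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED)
        return NULL;

    /* Revoke all access to the bottom page: overruns now fault. */
    if (mprotect(base, (size_t)page, PROT_NONE) != 0) {
        munmap(base, usable + (size_t)page);
        return NULL;
    }

    return base + page;
}

/* Release a stack obtained from alloc_stack_with_guard(). Returns 0 on success. */
int free_stack_with_guard(void *stack, size_t usable)
{
    long page = sysconf(_SC_PAGESIZE);

    assert(stack != NULL && page > 0);
    return munmap((char *)stack - page, usable + (size_t)page);
}
```

Because the guard page stays mapped, the address range cannot be reused by an unrelated allocation, which was the weakness of the original unmapping approach.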
Line 914: Line 793:
 The case of userspace threads has shown that it may be not just useful, but actually //​necessary//​ for certain services to provide their own handlers for checking, entering, and leaving a custom state of quiescence. These services may crash if the default quiescence state is used for a live update instead of the custom state. The result is the requirement that not just users, but also scripts - the update_asr(8) script in particular - be aware of specific services requiring custom quiescence state. This is inconvenient and dangerous. The case of userspace threads has shown that it may be not just useful, but actually //​necessary//​ for certain services to provide their own handlers for checking, entering, and leaving a custom state of quiescence. These services may crash if the default quiescence state is used for a live update instead of the custom state. The result is the requirement that not just users, but also scripts - the update_asr(8) script in particular - be aware of specific services requiring custom quiescence state. This is inconvenient and dangerous.
  
-The default quiescence state is currently hardcoded in the service(8) utility, in the form of ''​DEFAULT_LU_STATE''​ in ''​minix/​commands/​service/​service.c''​. Instead, we believe that the service should be able to specify its own default quiescence state, possibly using an additional SEF API call. It is not yet clear whether RS would need to be aware of the alternative quiescence state. If not, the translation from a pseudo-state to the real state could take place entirely in the service'​s own SEF routines. Otherwise, the SEF may have to send the default state as extra data to RS at service initialization time.+The default quiescence state is currently hardcoded in the minix-service(8) utility, in the form of ''​DEFAULT_LU_STATE''​ in ''​minix/​commands/​minix-service/minix-service.c''​. Instead, we believe that the service should be able to specify its own default quiescence state, possibly using an additional SEF API call. It is not yet clear whether RS would need to be aware of the alternative quiescence state. If not, the translation from a pseudo-state to the real state could take place entirely in the service'​s own SEF routines. Otherwise, the SEF may have to send the default state as extra data to RS at service initialization time.
  
 === Policy redundancy === === Policy redundancy ===
Line 950: Line 829:
 The performance of various parts of the live update infrastructure is not fantastic. This is true for both the instrumentation passes and, more importantly,​ the run-time functionality. As one of the effects, live update operations may have to be given a lenient timeout in order to succeed. In fact, state transfer currently takes too long to consider automatic runtime ASR rerandomization as a realistic option. The performance of various parts of the live update infrastructure is not fantastic. This is true for both the instrumentation passes and, more importantly,​ the run-time functionality. As one of the effects, live update operations may have to be given a lenient timeout in order to succeed. In fact, state transfer currently takes too long to consider automatic runtime ASR rerandomization as a realistic option.
  
-We have not yet looked into the causes of the poor performance. Part of it may be due to the extra memory allocations performed by libmagic, but that is only a guess. This issue is therefore rather open ended. Statistical profiling may provide at least some hints. +We have not yet looked into the causes of the poor performance. Part of it may be due to the extra memory allocations performed by libmagicrt, but that is only a guess. This issue is therefore rather open ended. Statistical profiling may provide at least some hints.
 + 
 +=== Grant table transfer === 
 + 
 +Currently, the safecopy memory grant tables of system services are transferred as is: the main union of the ''​cp_grant_t''​ structure as defined in ''​include/​minix/​safecopies.h''​ is marked as **ixfer**. 
 +In some scenarios, however, it is possible that during a service'​s live update, the service has grants allocated for remote services. For direct grants (of type ''​CPF_DIRECT''​),​ ''​cp_direct.cp_start''​ is actually a pointer into the local address space. The identity transfer therefore prevents this local pointer from being updated. Especially with ASR, there is a risk that after the live update, the grant points to arbitrary memory within the updated service. In the worst case, a remote user of the grant may end up overwriting this arbitrary memory in the updated service. 
 + 
 +To resolve this, the grant structure should not be using **ixfer** for its main union. This probably means that a custom state transfer routine for the grant structure must be written, so as to use a pointer transfer only for ''​CPF_DIRECT''​ grants. 
 + 
 +The same does //not// apply to magic grants (of type ''​CPF_MAGIC''​),​ as ''​cp_magic.cp_start''​ is an address in a remote process, which is either a userland process or a system process blocked on a call to VFS (as of writing, only VFS can use magic grants at all), and thus never subject to live update while the magic grant is active.
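A custom transfer routine along these lines might look roughly as follows. This is a sketch under loud assumptions: ''struct demo_grant'', ''struct demo_region'', and ''transfer_grant'' are hypothetical and deliberately simplified; they do not reflect the real ''cp_grant_t'' layout or the magic framework's state transfer callback interface.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified, hypothetical stand-in for a grant entry (NOT cp_grant_t). */
enum demo_grant_type { DEMO_UNUSED, DEMO_DIRECT, DEMO_MAGIC };

struct demo_grant {
    enum demo_grant_type type;
    uintptr_t start; /* direct: local address; magic: remote address */
    size_t len;
};

/* Hypothetical record of one object's location before and after the update. */
struct demo_region {
    uintptr_t old_base;
    uintptr_t new_base;
    size_t size;
};

/*
 * Transfer one grant from the old instance to the new one. Only direct
 * grants have their start address relocated; everything else is copied as
 * is. Returns 0 on success, or -1 if a direct grant refers to memory that
 * is not fully covered by any known region.
 */
int transfer_grant(const struct demo_grant *src, struct demo_grant *dst,
                   const struct demo_region *regions, size_t nregions)
{
    assert(src != NULL && dst != NULL);

    *dst = *src; /* identity transfer for all other fields */
    if (src->type != DEMO_DIRECT)
        return 0; /* e.g. magic grants hold remote addresses */

    for (size_t i = 0; i < nregions; i++) {
        const struct demo_region *r = &regions[i];

        if (src->start < r->old_base)
            continue;
        size_t off = src->start - r->old_base;
        if (off < r->size && src->len <= r->size - off) {
            dst->start = r->new_base + off; /* pointer transfer */
            return 0;
        }
    }

    dst->type = DEMO_UNUSED; /* never leave a dangling direct grant */
    return -1;
}
```

Invalidating an unresolvable grant, rather than copying it unchanged, is a design choice of this sketch: it trades a failed remote copy for the guarantee that no remote party can write to arbitrary memory in the updated service.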
  
 === Testrelpol failure === === Testrelpol failure ===
  
-If the ''​testrelpol''​ script is run a number of times in a row, it will start to fail on the crash recovery tests for unclear reasons. We know that this is a test script failure rather than an actual failure. We suspect that it is caused by RS's default exponential backoff algorithm for crash recovery causing timeouts in //​testrelpol//​. If that is the case, it should be possible to change //​testrelpol//​ to disable the exponential backoff using existing service(8) flags.+If the ''​testrelpol''​ script is run a number of times in a row, it will start to fail on the crash recovery tests for unclear reasons. We know that this is a test script failure rather than an actual failure. We suspect that it is caused by RS's default exponential backoff algorithm for crash recovery causing timeouts in //​testrelpol//​. If that is the case, it should be possible to change //​testrelpol//​ to disable the exponential backoff using existing ​minix-service(8) flags.
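To illustrate why repeated crashes could push a test past its timeout, the following sketch computes the total waiting time under a generic capped exponential backoff. It does not reproduce RS's actual algorithm or parameters: ''backoff_delay'', ''backoff_total'', and the base and cap values are assumptions chosen for illustration only.

```c
#include <assert.h>

/*
 * Hypothetical: delay before restart attempt 'n' (0-based), starting at
 * 'base', doubling on each attempt, and never exceeding 'cap'.
 */
unsigned backoff_delay(unsigned base, unsigned cap, unsigned n)
{
    unsigned d = base;

    while (n-- > 0 && d < cap)
        d = (d > cap / 2) ? cap : d * 2; /* avoids overflow near the cap */

    return d < cap ? d : cap;
}

/* Total time spent waiting across the first 'attempts' restarts. */
unsigned backoff_total(unsigned base, unsigned cap, unsigned attempts)
{
    unsigned total = 0;

    assert(cap > 0);
    for (unsigned i = 0; i < attempts; i++)
        total += backoff_delay(base, cap, i);

    return total;
}
```

With a base of 1 and a cap of 8 time units, five consecutive restarts already wait 1 + 2 + 4 + 8 + 8 = 23 units in total, so a test with a fixed per-run timeout can start failing only after several runs in a row.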
  
-=== Libmagic ​asserts ===+=== Libmagicrt ​asserts ===
  
-The implementation of the magic library currently relies on asserts being enabled. We have changed its Makefile so that asserts should be enabled regardless of build system settings, but this is merely a workaround. Instead, ​libmagic ​should function properly (and, in particular, fail properly) regardless of whether asserts are enabled.+The implementation of the magic runtime ​library currently relies on asserts being enabled. We have changed its Makefile so that asserts should be enabled regardless of build system settings, but this is merely a workaround. Instead, ​libmagicrt ​should function properly (and, in particular, fail properly) regardless of whether asserts are enabled.
  
 === VM fork warning === === VM fork warning ===
Line 977: Line 865:
  
   * Cristiano Giuffrida, [[http://​www.minix3.org/​theses/​Cristiano_Giuffrida_PhD_thesis.pdf|Safe and Automatic Live Update]], Ph.D. thesis, 2014   * Cristiano Giuffrida, [[http://​www.minix3.org/​theses/​Cristiano_Giuffrida_PhD_thesis.pdf|Safe and Automatic Live Update]], Ph.D. thesis, 2014
 +
developersguide/liveupdate.txt · Last modified: 2022/02/12 22:42 by stux