developersguide:liveupdate

Differences

This shows you the differences between two versions of the page.

developersguide:liveupdate [2015/09/14 13:16]
dcvmoole First half of the initial document
developersguide:liveupdate [2022/02/12 22:42]
stux renamed service(8) to minix-service(8) in various places
Line 1: Line 1:
-<div center round important>​ 
-**This is a draft.** 
-</​div>​ 
- 
 ====== Live update and rerandomization ======
  
-MINIX3 now has support for live update and rerandomization of its system services. These features are based on LLVM bitcode compilation and instrumentation in combination with various run-time extensions. Live update and rerandomization support is currently fully functional but in an experimental state, not enabled by default, and available for x86 only. This document describes the basic idea, provides instructions on how to enable and use the functionality, provides more in-depth information for developers, and lists open issues.
+MINIX3 now has support for live update and rerandomization of its system services. These features are based on LLVM bitcode compilation and instrumentation in combination with various run-time extensions. Live update and rerandomization support is currently fully functional, although still in an experimental state, not enabled by default, and available for x86 only. This document describes the basic idea, provides instructions on how to enable and use the functionality, provides more in-depth information for developers, and lists open issues and further reading material.
  
 ===== Introduction =====
Line 15: Line 11:
 A live update is an update to a software component while it is active, allowing the component's code and data to be changed without affecting the environment around it. The MINIX3 live update functionality allows such updates to be applied to its **system services**: the usermode server and driver processes that, in addition to the microkernel, make up the operating system. As a result, these services can be updated at run time without requiring a system reboot. There is no support for live updating the microkernel or user applications at this time.
  
-The live update procedure can be summarized as follows. The component responsible for orchestrating live updates is the RS (Reincarnation Server) service. When RS applies an update to a particular system service, it first brings that service to a stop in a known state, by exploiting the message-based nature of MINIX3. A new instance of the service is created. This new instance performs its own **state transfer**, copying and adjusting all the relevant data from the old instance to itself. If the state transfer succeeds, the new instance continues to run, and the old instance is killed. If the state transfer fails, RS performs a **rollback**: the new instance is killed, and the system resumes execution of the old instance. In order to maintain the illusion to the rest of the system that there only ever was one service process, the process slots of the old and the new instance are swapped before the new instance gets to run, and swapped back upon rollback.
+The live update procedure can be summarized as follows. The component responsible for orchestrating live updates is the RS (Reincarnation Server) service. When RS applies an update to a particular system service, it first brings that service to a stop in a known **quiescence** state, ensuring that the live update will not interfere with the service's normal operation, by exploiting the message-based nature of MINIX3. A new instance of the service is created. This new instance performs its own **state transfer**, copying and adjusting all the relevant data from the old instance to itself. If the state transfer succeeds, the new instance continues to run, and the old instance is killed. If the state transfer fails, RS performs a **rollback**: the new instance is killed, and the system resumes execution of the old instance.
+In order to maintain the illusion to the rest of the system that there only ever was one service process, the process slots of the old and the new instance are swapped before the new instance gets to run, and swapped back upon rollback.
  
-The MINIX3 live update system allows updates to all system services. Those includes the RS service itself, and the VM (Virtual Memory) service. The VM service can be updated with severe restrictions only, however. The system also supports **multicomponent** live updates: atomic live updates of several system services at once, possibly including RS and/or VM. In principle, this allows for an atomic live update of the entire MINIX3 service layer.
+The MINIX3 live update system allows updates to all system services. Those include the RS service itself, and the VM (Virtual Memory) service. The VM service can be updated with severe restrictions only, however. The system also supports **multicomponent** live updates: atomic live updates of several system services at once, possibly including RS and/or VM. In principle, this allows for an atomic live update of the entire MINIX3 service layer.
  
 The state transfer aspect of live update relies heavily on compile-time and in particular link-time instrumentation of system services. This instrumentation is implemented in the form of LLVM "optimization" passes, which operate on LLVM bitcode modules. In most cases, these passes are run after (initial) program linking, by means of the LLVM Link-Time Optimization (LTO) system. Thus, in order to support live update and rerandomization, the system must be compiled using LLVM bitcode and with LTO support. The LLVM pass that performs the static analysis and link-time instrumentation for live update is called the **magic pass**.
  
-In addition, live updates require runtime support for state transfer in each service. For this reason, system services are relinked with a library that provides all the run-time functionality which ultimately allow a new service instance to perform state transfer from its old instance. This library is called the **magic library** or //libmagic//. Together, the magic pass and library make up the **magic framework**.
+In addition, live updates require runtime support for state transfer in each service. For this reason, system services are relinked with a library that provides all the run-time functionality which ultimately allow a new service instance to perform state transfer from its old instance. This library is called the **magic runtime library** or //libmagicrt//. Together, the magic pass and runtime library make up the **magic framework**.
  
 ==== Live rerandomization ====
  
-Live rerandomization consists of randomizing the internal address space layout of a component at run time. While the concept of ASR or ASLR - Address Space (Layout) Randomization - is well known, most implementations are rather limited: they do such randomization only once, when starting a process; the randomize the base location of entire process regions, for example the process stack; and, they apply the concept to user processes only. In contrast, the MINIX3 live rerandomization can randomize the address space layout of system services, as often as desired, and with fine granularity. In order to achieve this, the live rerandomization makes use of live updates.
+Live rerandomization consists of randomizing the internal address space layout of a component at run time. While the concept of ASR or ASLR - Address Space (Layout) Randomization - is well known, most implementations are rather limited: they perform such randomization only once, when starting a process; they merely randomize the base location of entire process regions, for example the process stack; and, they apply the concept to user processes only. In contrast, the MINIX3 live rerandomization can randomize the address space layout of operating system services, as often as desired, and with fine granularity. In order to achieve this, the live rerandomization makes use of live updates.
  
-The fundamental idea is to first generate a new version of the service binary, using link-time randomization of various parts of the binary. Ideally, this would be done at run time; due to various limitations, MINIX3 currently only supports pregenerated randomized binaries of system services. Then, at runtime, the live update system is used to update from one randomized version of each service to another.
+The fundamental approach consists of a two-step process. First, new versions of the service program are generated, using link-time randomization of various parts of its program binary. Ideally, this would be done at run time; due to various limitations, MINIX3 currently only supports pregenerated randomized binaries of system services. Then, at runtime, the live update system is used to update from one randomized version of each service to another.
  
-The randomization of binaries is done with another link-time pass, called the **asr pass**. The magic library implements the runtime aspects of ASR rerandomization during live update.
+The randomization of binaries is done with another link-time pass, called the **asr pass**. The magic runtime library implements various runtime aspects of ASR rerandomization during live update.
  
 ===== Users guide =====
  
-In this section, we explain how to set up a MINIX3 system that supports live update and rerandomization, and we describe how to use these functionalities when running MINIX3.
+In this section, we explain how to set up a MINIX3 system that supports live update and rerandomization, and we describe how to use the new functionality when running MINIX3.
  
 ==== Setting up the system ====
  
-We cover all the steps to set up a MINIX3 system that is ready for live update and rerandomization. For now, it requires crosscompilation as well as an additional build of the LLVM source code. The procedure is for x86 targets only. The current procedure is not quite ideal, but it is what we have right now, and it should work.
+We cover all the steps to set up a MINIX3 system that is ready for live update and rerandomization. For now, it requires crosscompilation as well as an additional build of the LLVM source code. The procedure is for x86 targets only.
 + 
 +The current procedure ​has been tested only from **Linux** as host platform, and may require minor adjustments on other host platforms. We provide a few additional instructions for those other platforms, but these may currently not be complete. Please feel free to add more instructions to this page, and/or open GitHub issues for other platforms and link to them from here. 
 + 
 +After setting up an initial environment, the first step is to obtain the MINIX3 source code. After that, the next step is to build an LLVM toolchain with LTO support, which is needed because the regular MINIX3 crosscompilation LLVM toolchain does not include LTO support (yet - we are working on this). Once the LTO-supporting toolchain has been built, the final step is to build the MINIX3 sources, with extra flags to enable magic instrumentation and possibly ASR rerandomization.
 + 
 +Once these steps have been completed successfully for the first time, one can later update the MINIX3 source and then rebuild the system. The LTO-supporting toolchain need not be rebuilt unless we upgrade the LLVM source code itself.
  
-After setting up an initial environment, the MINIX3 update cycle basically consists of four steps: obtaining or updating the MINIX3 source code, building the system, instrumenting the system system, and generating a bootable image. We will go through all steps in detail. There is also a summary of commands to issue at the end.
+We will now go through all steps in detail. At the end of this section, there is also a summary of the commands to issue.
  
 All of the commands in this section are to be performed on the crosscompilation host system rather than on MINIX3. None of the commands, except the Linux-specific ''sudo apt-get'' example in the first subsection, require more than ordinary user privileges.
Line 49: Line 51:
   $ sudo apt-get install curl clang binutils zlibc zlib1g zlib1g-dev libncurses-dev qemu-system-x86
  
-In terms of directory organization, the idea is that everything will end up in one containing directory. Here we use ''/home/user/minix-liveupdate'' as an example, but the location is entirely up to you. This containing directory will end up having one subdirectory for the MINIX3 source code (called ''minix-src'' in this document), one subdirectory for the LLVM LTO toolchain (called ''obj_llvm.i386''), and one subdirectory for the crosscompilation tool chain and compiled objects (called ''obj.i386''). Thus, the ultimate directory structure will look like this:
+The MINIX3 build system uses one single directory in which to place all its files. This directory is one level up from the root of the MINIX3 source directory. Thus, it is advisable to create this containing directory at a location known to have enough free hard disk space. Here we use ''/home/user/minix-liveupdate'' as an example, but the location is entirely up to you. The containing directory will end up having one subdirectory for the MINIX3 source code (called ''minix-src'' in this document), one subdirectory for the LLVM LTO toolchain (called ''obj_llvm.i386''), and one subdirectory for the crosscompilation tool chain and compiled objects (called ''obj.i386''). Thus, the ultimate directory structure will look like this:
  
   /home/user/minix-liveupdate/minix-src
Line 55: Line 57:
   /home/user/minix-liveupdate/obj.i386
  
-You have to choose a location for the containing directory, and create it yourself. The three subdirectories will be created automatically as part of the following steps. In terms of placement, expect to be needing a bare minimum of **30GB** for the combination of these three subdirectories, with a recommended **40GB** of available space.
+You have to choose a location for the containing directory, and create it yourself. The three subdirectories should be created automatically as part of the following steps. However, it has been reported that on some platforms (e.g., FreeBSD), some or all of these directories have to be created manually; this can be done with nothing more than a few basic ''mkdir'' commands. In terms of disk space usage, expect to be needing a bare minimum of **30GB** for the combination of these three subdirectories, with a recommended **40GB** of available space.
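
The directory layout and disk space advice above can be sketched as a small host-side shell helper. This is only a sketch under stated assumptions: the function names ''prepare_tree'' and ''has_free_gb'' are hypothetical and not part of MINIX3, and the 30GB figure is simply the minimum quoted above.

```shell
# Hypothetical helpers; neither function is part of MINIX3.

# Create the containing directory and its three subdirectories by hand,
# as may be needed on platforms where the build does not create them.
prepare_tree() {
    base=$1
    mkdir -p "$base/minix-src" "$base/obj_llvm.i386" "$base/obj.i386"
}

# Succeed if the file system holding directory $1 has at least $2 GB free.
has_free_gb() {
    # df -P prints one line per file system; field 4 is the available
    # space, reported in 1024-byte blocks because of -k.
    avail_kb=$(df -Pk "$1" | awk 'NR == 2 { print $4 }')
    [ "$avail_kb" -ge $(( $2 * 1024 * 1024 )) ]
}
```

A possible invocation, using the example location from this document (''git clone'' accepts an existing empty ''minix-src'' directory as its target):

  $ prepare_tree /home/user/minix-liveupdate
  $ has_free_gb /home/user/minix-liveupdate 30 || echo "less than 30GB free" >&2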
  
 === Obtaining or updating the MINIX3 source code ===
  
-The first real step is then to check out the MINIX3 source code. Other wiki pages cover this in more detail, but the gist is to check out the sources from the main MINIX3 repository using [[.:usinggit|git]]:
+The first real step is to fetch the MINIX3 source code. Other wiki pages cover this in more detail, but the simplest approach is to check out the sources from the main MINIX3 repository using [[.:usinggit|git]]:
  
   $ cd /home/user/minix-liveupdate
   $ git clone git://git.minix3.org/minix minix-src
  
-This will create a ''minix-src'' subdirectory containing the latest version of the MINIX3 source code.
+This will create a ''minix-src'' subdirectory with the latest version of the MINIX3 source code.
  
 Later on, a newer version of the source code can be pulled from the MINIX3 repository:
Line 72: Line 74:
  
 In both cases, the next step is now to build the source code.
 +
 +=== Building the LTO toolchain ===
 +
 +The second step is to build the LLVM LTO infrastructure,​ if it has not yet been built before. Eventually, this will be done automatically as part of the regular build. For now, we have a script that can perform the build, called ''​generate_gold_plugin.sh''​. It is located in the ''​minix/​llvm''​ subdirectory of the MINIX3 source tree. The basic procedure therefore consists of the following steps (but read this entire section first):
 +
 +  $ cd /​home/​user/​minix-liveupdate/​minix-src/​minix/​llvm
 +  $ ./​generate_gold_plugin.sh
 +
 +On some platforms, it may be needed to specify the C/C++ compiler and/or the name of the GNU make utility, which can be done as follows:
 +
 +  $ CC=clang CXX=clang++ MAKE=make ./​generate_gold_plugin.sh
 +
 +On FreeBSD and similar platforms, one may have to ensure that GNU make is installed (typically as ''​gmake''​) first, and pass in ''​MAKE=gmake''​ to point to it.
 +
 +This step may take several hours. It can be sped up by supplying a number of parallel jobs, through a ''​JOBS=n''​ variable:
 +
 +  $ JOBS=8 ./​generate_gold_plugin.sh
 +
 +As stated before, after this command has finished successfully,​ it need not be reissued until LLVM is upgraded in the MINIX3 source tree. This is a rare event which is typically part of a larger resynchronization with NetBSD code, and we will clearly announce such events. When this happens, it may be advisable to remove the entire ''​obj_llvm.i386''​ directory as well as any files in ''​minix-src/​minix/​llvm/​bin'',​ before rerunning the generate_gold_plugin.sh script.
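
The choice between ''make'' and ''gmake'' described above could be automated along the following lines. This is a sketch only: ''pick_gnu_make'' is a hypothetical helper, not part of MINIX3, and it assumes that a GNU make binary identifies itself with the string "GNU Make" in its ''--version'' output.

```shell
# Hypothetical helper, not part of MINIX3: print the first candidate
# command whose --version output identifies it as GNU make.
pick_gnu_make() {
    for candidate in "$@"; do
        if command -v "$candidate" >/dev/null 2>&1 &&
           "$candidate" --version 2>/dev/null | grep -q 'GNU Make'; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    # None of the candidates looked like GNU make.
    return 1
}
```

The result can then be passed in through the ''MAKE'' variable shown earlier:

  $ MAKE=$(pick_gnu_make gmake make) && CC=clang CXX=clang++ MAKE=$MAKE ./generate_gold_plugin.sh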
  
 === Building the system ===
  
-The next step consists of building the system. When run for the first time, this step will also build the LLVM LTO infrastructure, the crosscompilation tools, and the instrumentation. The first run may take several hours.
+The third step consists of building the system and generating a bootable image out of it. When run for the first time, this step will also build the regular (non-LTO) crosscompilation toolchain. The first run may therefore (also) take several hours. The build procedure is just like regular MINIX3 crosscompilation, differing in only two aspects.
  
-The center of all the instrumentation activities is the ''minix/llvm'' subdirectory of the MINIX3 source tree. This directory contains the instrumentation passes, runtime library, and supporting scripts. This step and the next steps therefore assume this subdirectory as the current directory:
+First, the appropriate build variables must be passed in to enable the desired functionality. In order to build the system with live update support through magic instrumentation, the build system must be invoked with the ''MKMAGIC'' build variable set to //yes//. This will perform a bitcode build of the entire system, and perform magic instrumentation on all system services.
  
 +In order to build the system with ASR instrumentation,​ the build system must be invoked with the ''​MKASR''​ build variable set to //yes//. This will automatically enable magic instrumentation,​ perform ASR randomization on all system services, and pregenerate a number of ASR-rerandomized service binaries for each service. This number can be controlled with an additional ''​ASRCOUNT=n''​ build variable, where the //n// value must be between 1 and 65536 (inclusive). The default //​ASRCOUNT//​ is 3.
 +
 +Second, in order to build a hard disk image suitable for use by the resulting bitcode builds, the ''​x86_hdimage.sh''​ script must be invoked with the **-b** flag. This will enlarge the generated image to account for the larger binaries, and enable inclusion of ASR-rerandomized binaries if necessary.
 +
 +These two aspects can be covered in a single build command. The following short procedure will build a hard disk image with magic instrumentation:​
 +
 +  $ cd /​home/​user/​minix-liveupdate/​minix-src
 +  $ BUILDVARS="​-V MKMAGIC=yes"​ ./​releasetools/​x86_hdimage.sh -b
 +
 +In order to speed up the build, a number of parallel jobs may be supplied. It is typically advisable to use as many jobs as there are hardware threads of execution (i.e., CPU cores or hyperthreads) in the system:
 +
 +  $ JOBS=8 BUILDVARS="​-V MKMAGIC=yes"​ ./​releasetools/​x86_hdimage.sh -b
 +
 +It may be necessary to ensure that clang is used as the compiler:
 +
 +  $ CC=clang CXX=clang++ JOBS=8 BUILDVARS="​-V MKMAGIC=yes"​ ./​releasetools/​x86_hdimage.sh -b
 +
 +Also, some platforms may not be able to compile the compiler toolchain for the target platform due to running out of memory. In that case, it is possible to build an image that does not come with its own compiler toolchain, by passing in the ''​MKLLVMCMDS=no''​ build variable. This build variable can also be used simply to speed up the compilation procedure.
 +
 +  $ BUILDVARS="​-V MKMAGIC=yes -V MKLLVMCMDS=no"​ ./​releasetools/​x86_hdimage.sh -b
 +
 +In order to build an image with ASR randomization,​ including four additional ASR-rerandomized versions of each system service, use the following build variables:
 +
 +  $ BUILDVARS="​-V MKASR=yes -V ASRCOUNT=4"​ ./​releasetools/​x86_hdimage.sh -b
 +
 +Obviously, all variables shown above can be combined as appropriate. The author of this document has used the following command line on several occasions:
 +
 +  $ CC=clang CXX=clang++ JOBS=4 BUILDVARS="​-V MKASR=yes -V ASRCOUNT=2 -V MKLLVMCMDS=no"​ ./​releasetools/​x86_hdimage.sh -b
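
Since the build variables are passed as one quoted string, a typo in it is easy to miss. The following sketch composes the string and checks the ''ASRCOUNT'' range stated above. The helper name ''compose_buildvars'' and its argument convention are hypothetical and not part of the MINIX3 build system.

```shell
# Hypothetical helper, not part of MINIX3: print a BUILDVARS value.
#   $1: "magic" (live update only) or "asr" (live update plus ASR)
#   $2: optional ASRCOUNT, accepted only together with "asr"
compose_buildvars() {
    mode=${1:-}
    count=${2:-}
    case $mode in
        magic) vars="-V MKMAGIC=yes" ;;
        asr)   vars="-V MKASR=yes" ;;
        *)     echo "mode must be 'magic' or 'asr'" >&2; return 1 ;;
    esac
    if [ -n "$count" ]; then
        if [ "$mode" != asr ]; then
            echo "ASRCOUNT requires mode 'asr'" >&2
            return 1
        fi
        # Reject non-digits and anything longer than five digits, so the
        # numeric comparison below always sees a small integer.
        case $count in
            *[!0-9]*|??????*) echo "bad ASRCOUNT: $count" >&2; return 1 ;;
        esac
        # ASRCOUNT must lie between 1 and 65536 (inclusive).
        if [ "$count" -lt 1 ] || [ "$count" -gt 65536 ]; then
            echo "ASRCOUNT out of range: $count" >&2
            return 1
        fi
        vars="$vars -V ASRCOUNT=$count"
    fi
    printf '%s\n' "$vars"
}
```

Further variables can still be appended by hand, for example:

  $ BUILDVARS="$(compose_buildvars asr 4) -V MKLLVMCMDS=no" ./releasetools/x86_hdimage.sh -b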
 +
 +After the first run, the build system will perform recompilation of only the parts of the source code that have changed, and should not take nearly as long to run as the first time. In case of unexpected problems when rebuilding, it may be necessary to throw away the previously generated objects and rebuild the MINIX3 source code in its entirety. This can be done by going to the top-level ''​obj.i386''​ directory and deleting all files and directories in there, except the ''​tooldir.{yourplatform}''​ subdirectory. Fully rebuilding the MINIX3 source code will take longer than an incremental rebuild, but since the crosscompilation toolchain is left as is, it will still be nowhere close as long as the first run.
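
The cleanup described above (remove everything in ''obj.i386'' except the ''tooldir.{yourplatform}'' subdirectory) can be sketched as follows. The helper ''clean_objdir'' is hypothetical and not part of MINIX3; it deletes files irreversibly, so the directory argument should be double-checked before use. It assumes a ''find'' that supports ''-mindepth'' and ''-maxdepth'', as the GNU and BSD versions do.

```shell
# Hypothetical helper, not part of MINIX3: delete all entries directly
# inside the given object directory, except the tooldir.* subdirectory
# that holds the crosscompilation toolchain.
clean_objdir() {
    objdir=$1
    if [ ! -d "$objdir" ]; then
        echo "no such directory: $objdir" >&2
        return 1
    fi
    # Only look at direct children; rm -rf then removes each one whole.
    find "$objdir" -mindepth 1 -maxdepth 1 ! -name 'tooldir.*' \
        -exec rm -rf -- {} +
}
```

With the example layout from this document, the call would be:

  $ clean_objdir /home/user/minix-liveupdate/obj.i386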
 +
 +As explained in more detail on the [[.:​crosscompiling|crosscompilation page]], it is also possible to rebuild particular parts of the system without going through the entire "make build" process. This involves the use of the ''​nbmake-i386''​ tool and generally requires a good understanding of the compilation process.
 +
 +=== Running the image ===
 +
 +The x86_hdimage command produces a bootable MINIX3 hard disk image file. The generated image file is called ''​minix_x86.img''​ and located in the root of the MINIX3 source tree - ''​minix-src''​ in our examples. Once an image has been generated, it can be run. The most convenient way to run the image is to use **qemu/​KVM**. This can be done using the command as given at the end of the x86_hdimage output.
 +
 +While explaining the use of qemu is beyond the scope of this document, it may be useful to look into the ''​-append'',​ ''​-curses'',​ and ''​-serial file:​..''​ qemu command line arguments. The following command line will launch qemu with KVM support (remove ''<​nowiki>​--enable-kvm</​nowiki>''​ to disable KVM support), a curses-based user interface, and system output redirected to a file named ''​serial.out'':​
 +
 +  $ cd /​home/​user/​minix-liveupdate/​minix-src
 +  $ (cd ../​obj.i386/​destdir.i386/​boot/​minix/​.temp && qemu-system-i386 --enable-kvm -m 256 -kernel kernel -initrd "​mod01_ds,​mod02_rs,​mod03_pm,​mod04_sched,​mod05_vfs,​mod06_memory,​mod07_tty,​mod08_mib,​mod09_vm,​mod10_pfs,​mod11_mfs,​mod12_init"​ -hda ../​../​../​../​../​minix-src/​minix_x86.img -curses -serial file:​../​../​../​../​../​minix-src/​serial.out -append "​rootdevname=c0d0p0 cttyline=0"​)
 +
 +Extra [[usersguide:​bootmonitor|boot options]] can be supplied in the (space-separated) list that follows the ''​-append''​ switch. For example, adding ''​ rs_verbose=1''​ will enable verbose output in the RS service, which is highly useful for debugging issues with live update. ​
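
As a small illustration of how the space-separated option list is put together, the following sketch builds the value for the ''-append'' switch. The helper ''append_value'' is hypothetical; the two base options are the ones used in the qemu command line above.

```shell
# Hypothetical helper, not part of MINIX3: print the value for qemu's
# -append switch, starting from the two options used in this document
# and adding any extra boot options given as arguments.
append_value() {
    value="rootdevname=c0d0p0 cttyline=0"
    for option in "$@"; do
        value="$value $option"
    done
    printf '%s\n' "$value"
}
```

In the qemu command line, the quoted list after ''-append'' could then be replaced with ''"$(append_value rs_verbose=1)"''.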
 +
 +=== Summary ===
 +
 +The following commands can be used to obtain and build a MINIX3 system that supports live update and live rerandomization,​ including three alternative rerandomized versions of all system services, in addition to the randomized standard ones:
 +
 +  $ export CC=clang CXX=clang++ JOBS=8
 +  $ cd /​home/​user/​minix-liveupdate
 +  $ git clone git://​git.minix3.org/​minix minix-src
   $ cd minix-src/minix/llvm
 +  $ ./​generate_gold_plugin.sh
 +  $ cd ../..
 +  $ BUILDVARS="​-V MKASR=yes -V MKLLVMCMDS=no"​ ./​releasetools/​x86_hdimage.sh -b
  
-It may be necessary to ensure that clang is used as the compiler, by exporting the following shell variables. GCC should work as well, but has not been tested as thoroughly.
+The entire procedure will typically take about 30GB of disk space and several hours of time.
  
-  $ export CC=clang CXX=clang++
+Sometime later, the following steps can be used to update the installation to a newer MINIX3 version:
  
-Then, the system can built with support for instrumentation by running the ''configure.llvm'' script in the current directory, with the ''MKMAGIC'' build variable set to //yes//. To build the infrastructure and system without parallel compilation, simply run the script:
+  $ cd /home/user/minix-liveupdate/minix-src
+  $ git pull
+  $ CC=clang CXX=clang++ JOBS=8 BUILDVARS="-V MKASR=yes -V MKLLVMCMDS=no" ./releasetools/x86_hdimage.sh -b
  
-  $ BUILDVARS="-V MKMAGIC=yes" ./configure.llvm
+In contrast to the initial run, the entire update procedure should take no more than an hour.
  
-Alternatively, a number of parallel jobs may be supplied. It is typically advisable to use as many jobs as there are hardware threads of execution (i.e., CPUs or hyperthreads) in the system:
+==== Using live update ====
  
-  $ JOBS=8 BUILDVARS="-V MKMAGIC=yes" ./configure.llvm
+Once an instrumented MINIX3 system has been built and started, it should be ready for live updates. MINIX3 offers two scripts that make use of the live update functionality: one for testing the infrastructure, and one for performing runtime ASR rerandomization. In addition, the user may perform live updates manually. In this section, we cover both parts.
  
-After the first run, the ''configure.llvm'' will perform recompilation of only parts of the source code that have changed, and should not take nearly as long to run as the first time. In case of unexpected problems when rebuilding, it may be necessary to throw away the previously generated objects and rebuild the MINIX3 source code in its entirety. This can be done by going to the top-level ''obj.i386'' directory and deleting all files and directories except the ''tooldir.{yourplatform}'' subdirectory in there. Fully rebuilding the MINIX3 source code will take longer than an incremental rebuild, but since the crosscompilation toolchain is left as is, it will still be nowhere close as long as the first run.
+The commands in this section are to be run within MINIX3, rather than on the host system. They must be run as root, because performing a live update of a system service requires superuser privileges. These two things are reflected by the ''minix#'' prompt used in the examples below.
  
-As explained in more detail on the [[.:crosscompiling|crosscompilation page]], it is also possible to rebuild particular parts of the system without going through the entire "make build" process. This involves the use of the ''nbmake-i386'' tool and generally requires a good understanding of the compilation process. It may be worth mentioning that the first ''configure.llvm'' run saves the ''MKMAGIC'' value, so this variable need not be passed to ''nbmake-i386'' each time.
+=== Pre-provided scripts ===
  
-=== Rebuilding the instrumentation ===
+The MINIX3 distribution comes with two scripts that can be used to test and use the live update and rerandomization functionality. The first one is //testrelpol//. This script may be used for basic regression testing of the MINIX3 live update infrastructure. The second one is //update_asr//. This command performs live rerandomization of system services at runtime.
  
-When building the system for the first time, this step may be skipped, as it is performed automatically. However, when the source code is changed for any of the LLVM passes or the magic library, that is, the source code in ''minix/llvm'', the changed component must be recompiled. **Warning**: updating the MINIX3 source code with ''git pull'' may also upgrade any of these components, in which case it is the responsibility of the user (you) to recompile and reinstall them!
+== Infrastructure testing: testrelpol ==
  
-Once we properly integrate the LLVM LTO infrastructure into the MINIX3 build system, this step should disappear altogether.
+The MINIX3 test suite has a test script that tests the basic MINIX3 crash recovery and live update functionality. The script is called **testrelpol** and can be found in ''/usr/tests/minix-posix'':
  
-== Rebuilding libmagic ==
+  minix# cd /usr/tests/minix-posix
+  minix# ./testrelpol
  
-This substep must be performed whenever the source code of libmagic changes. This is due to the fact that dependency tracking is not working correctly for libmagic, which means the automated step in ''configure.llvm'' may not recompile the library properly.
+For its live update tests, this script does //not// use the magic framework for state transfer at all. Instead it uses **identity transfer** which performs a basic memory copy between the old and the new instance. As a result, the testrelpol script should succeed whether or not services are instrumented. However, it may not work reliably on MINIX3 systems that are not built for magic instrumentation (i.e., built with neither ''MKMAGIC=yes'' nor ''MKASR=yes'').
  
-The source code of libmagic is located in the ''minix/llvm/static/magic'' subdirectory of the MINIX3 source code. To (re)compile and install libmagic, go to its source directory, issue a ''make clean'' and a ''make install'':
+== Live rerandomization: update_asr ==
  
-  $ cd static/magic +As we have shown before, the ''​MKASR=yes''​ host-side build variable performs the //​build-time//​ preparation of a MINIX3 system for live rerandomization. Complementing this, the //​run-time//​ side of the live rerandomization is provided by means of the **update_asr** command. The update_asr command will update system services to their next pregenerated ASR-rerandomized version, using a cyclic system. Live rerandomization is not automatic, and thus, the MINIX3 system administrator is responsible for running the update_asr command at appropriate times.
-  $ make clean install+
  
-The library is installed to ''​minix/​llvm/​bin''​. In a later step, the ''​relink.llvm''​ script will pick it up from there.+By default, the update_asr command performs one round of ASR rerandomization,​ updating each service to its next version:
  
-== Rebuilding a pass ==+  minix# update_asr
  
-This substep is also performed automatically for the first time, by the ''generate_gold_plugin.sh'' script invoked from ''configure.llvm''. However, whenever the source code of any of the LLVM instrumentation passes changes, that pass must be recompiled and installed.+By default, this command will report errors only. More verbose information can be shown using the ''-v'' switch:
  
-The source code of the passes is located in the ''​minix/​llvm/​passes''​ subdirectory of the MINIX3 source code. A pass can be compiled and installed by going to its ''​minix/​llvm/​passes/​{pass}''​ subdirectory,​ and issuing ''​make install''​.+  ​minix# update_asr -v
  
-For example, to recompile and install the magic pass:+For further details about this command, see the update_asr(8) manual page.
  
-  $ cd passes/magic +Aside from providing actual security benefits, the update_asr script is the **most complete test** of the live update and rerandomization functionality at this time. It uses the magic framework for state transfer, with full relocation of all state, and it applies the runtime ASR features. As of writing, it runs in the default qemu environment without any errors or subsequent issues.
-  $ make install+
  
-The passes are installed to ''minix/llvm/bin''. In a later step, the ''build.llvm'' script will pick them up from there.+The only aspect that is not tested with this command is whether ASR rerandomization is //effective//, that is, whether all parts of the service address space were properly randomized by the asr pass. After all, ASR rerandomization between identical service copies works just as well, but provides substantially fewer security guarantees. Developers working on the asr pass are encouraged to verify its effectiveness manually, for example using nm(1) on generated service binaries on the host side.
  
-=== Instrumentation and image building ​===+=== Live update commands ​===
  
-After building the system, two more steps need to be performed: instrumentation of system services, and generation of a bootable hard disk image. These steps must be performed every time the system is built, including the first time. In particular: every time a system service is (re)compiled, it must be (re)instrumented afterwards. Every time any part of the compiled MINIX3 installation is changed, a new image must be built.+RS can be instructed to perform live updates through the minix-service(8) command, specifically through its **minix-service update** subcommand. This command is also used by the automated scripts. For a full overview of the command's functionality, please see the minix-service(8) manual page as well as the command's output when it is run with no parameters.
  
-In order to generate a fully instrumented system image with a number of pregenerated ASR binaries for all services, one can run a command that automates both steps. This is covered in the first subsection. Alternatively, the details of manual instrumentation and image building are covered in the two subsections after.+In its most fundamental form, the //minix-service update// command will update a running service, identified by its label, to a new version provided as an on-disk binary file. It is however also possible to tell RS to update the service into a copy of itself. In addition, various flags and options can be used for fine-grained control of the live update action. The basic syntax to perform a live update on a single system service is as follows:
  
-== The easy way: bulk ASR generation ==+  minix# minix-service [flags] update [self|<​binary>​] -label <​label>​ [options]
  
-The ''clientctl'' script in ''minix/llvm'' provides a convenient way to instrument all services for live update and rerandomization, generate a number of rerandomized versions of each service, and build a hard disk image. The command has the following syntax:+Through various combinations of this command's parameters, MINIX3 basically supports four types of updates, representing increasingly challenging conditions for the overall live update infrastructure in general, and state transfer in particular. We will now go through all of them, and explain how they can be performed. For more details regarding what is actually going on below the surface, please consult the developers guide section of this document.
  
-  $ ./clientctl buildasr [N]+== Identity transfer ==
  
-Here, N is an optional parameter specifying the number of rerandomized binaries that should be generated in addition to the standard set of randomized binaries. N defaults to 1. For example, the following command will produce a system with four randomized sets of service binaries: one set of ASR-randomized services that are used by default, and three extra rerandomized binaries to which the system can switch at run time:+The first update type is **identity transfer**. In this case, the service is updated to an identical copy of itself, with all functions and static data in the new instance located at the exact same addresses as the old instance. Identity transfer bluntly copies over entire memory sections at once, thus requiring no instrumentation at all. This makes it suitable for testing of the MINIX3-specific side of the live update infrastructure, hence its use in the ''testrelpol'' script. Identity transfer is the default of the minix-service(8) command when "self" is given instead of a path to a new binary:
  
-  ​$ ./clientctl buildasr 3+  ​minix# minix-service update self -label pm
  
-The result is a MINIX3 hard disk image file which can be booted in (for example) qemu; see further below.+This will perform an identity transfer of the PM service. Identity transfer should work for literally all MINIX3 system services. As mentioned, it is guaranteed to work only when the system was built with ''MKMAGIC=yes'', although it will mostly work on systems built without magic support as well. It works regardless of whether the target service was instrumented with the magic framework (or ASR).
  
-== The manual way (1/2): instrumentation ==+If the live update is successful, the minix-service(8) command will be silent, but RS will print a system message that the update succeeded:
  
-Instrumentation takes place at the granularity of individual system services. The ''minix/llvm'' directory contains scripts that allow for relinking services against runtime libraries, and instrumenting services with LLVM passes. The general procedure is like this:+  RS: update succeeded
  
-  - First, the service is compiled and linked to its basic form. +If the system was started on qemu with ''OUT=F'', this message will end up in ''serial.out''. Otherwise, the message should show up in the MINIX3 system log (''/var/log/messages'') and possibly on the first console.
-  - Then, the resulting linked bitcode object is relinked with **libmagic**. 
-  - Finally, link-time instrumentation is applied by running the **magic pass**, possibly as well as the **asr pass**, on the linked bitcode object.
  
-Each step also (re)generates a ready-to-execute machine code version of the service.+If the live update fails, RS should print an error to the system log, and minix-service(8) will complain. In order to debug such failures, it may be useful to enable verbose mode in RS, by starting the system with ''rs_verbose=1'' as shown earlier.
  
-Step 1 happens in the "​building the system"​ step, using ''​configure.llvm'',​ as explained before.+== Self state transfer ==
  
-Step 2 is done with the ''relink.llvm'' script in ''minix/llvm''. This script will relink services against a space-separated list of libraries. For live update, only the magic library is relevant:+The second update type is **self state transfer**. Self state transfer also performs an update of a service into an identical copy of itself, but instead uses the state transfer functionality of the magic framework. Thus, self state transfer requires that the service be instrumented properly. This update type can be used to test whether a service's state can be transferred without problems. Please note that many of the points covered here also apply to the remaining two update types, as all three are using the state transfer of the magic framework.
  
-  $ ./​relink.llvm magic+Self state transfer is performed by supplying the ''​-t''​ flag along with "​self"​ to the minix-service update command:
  
-This command will relink all services against libmagic, thus providing them with runtime support for live update.+  minix# minix-service -t update ​self -label pm
  
-Step 3 is done with the ''build.llvm'' script in ''minix/llvm''. This script will instrument services with a space-separated list of LLVM passes. For live update, the magic pass should be used:+This command will perform self state transfer of the PM service. The libmagicrt state transfer routine in the new service instance will print additional system messages while it is running. Upon success, the system output will look somewhat like this:
  
-  ​./build.llvm magic+  ​total remote functions: 57relocated: 54 
 +  total remote sentries: 186relocated normal: 84 relocated string: 101 
 +  total remote dsentries: 5 
 +  st_data_transfer:​ processing sentries 
 +  st_data_transfer:​ processing dsentries 
 +  st_data_transfer:​ processing sentries 
 +  st_data_transfer:​ processing dsentries 
 +  st_state_transfer:​ state transfer is done, num type transformations:​ 0 
 +  RS: update succeeded
  
-This command will instrument all services with the magic pass, performing static analysis and changing the service to include the information necessary for libmagic to perform live updates at runtime.+If the state transfer routine is not able to perform state transfer successfully, it will print messages that start with ''[ERROR]''. RS will then roll back the service to the old instance, and both RS and minix-service(8) will report failure. Self state transfer should succeed for all MINIX3 system services that have been built with bitcode and instrumented with libmagicrt and the magic pass. As of writing, there are no system services for which self state transfer is known to result in ''[ERROR]'' lines and subsequent live update failure. However:
  
-For live rerandomization support, one must apply not only the magic pass, but also the asr pass:+  * It is possible that new changes to system services, and even usage scenarios which we have not yet tested, do result in state transfer errors. Such errors should be resolved. The developers guide further below contains information on how to resolve some of these errors.
  
-  $ ./build.llvm magic asr+  * Currently, one service is not built with bitcode, namely the memory driver. It is therefore also not instrumented. An attempt to perform self state transfer on any service that is not instrumented will result in a "Function not implemented (error 78)" error. For services other than the memory driver, this is usually a good indication that a step was missed during the build phase.
  
-The resulting service ​will not only be ready for live update, but also be subjected to fine-grained randomization,​ as well as be supplied with parameters to perform ​the runtime component of rerandomization during live updates.+  * Some services have no state to transfer, in which case their new instances ​will perform a fresh start instead of state transfer. In that case, live update ​with self state transfer will succeed, but not print the state transfer system messages shown above. This is the case for the IS (Information Server) and readclock.drv services, for example.
  
-For reference, the ''clientctl buildasr'' command shown above performs this step multiple times to generate different rerandomized versions of each service, storing each in a different location.+  * Some services may only be updated once brought into a specific state of quiescence, because the default quiescence state is not sufficiently restrictive. In that case, the user must specify an alternative quiescence state explicitly, through the minix-service(8) ''-state'' option. This currently applies to all services that make use of userspace threads, namely the VFS, ahci, and virtio_blk services. These services must be updated using quiescence state 2 (//request free//) rather than state 1 (//work free//):
  
-Some details that might be useful to know about relinking and applying passes:+  minix# minix-service -t update self -label vfs -state 2
  
-  * By default, ''relink.llvm'' and ''build.llvm'' perform their respective actions on **all** system services. It is possible to instrument only a subset of services, leaving the other services as is. This can be done by passing a ''C'' shell variable with a comma-separated list of individual services. For example, the following command relinks the PM (Process Manager) service against the magic library:+Omitting the appropriate state parameter may result in a crash of the service after live update. At the moment, the update_asr(8) script has hardcoded knowledge about these necessary states. None of this is great, and we will be working towards a situation where the default state will not result in a crash; see the section on open issues further below.
  
-  $ C=pm ./relink.llvm magic+  * State transfer may be slow, and RS applies a rather strict default timeout for live updates. Therefore, it may sometimes be necessary to set a longer timeout in order to avoid needless failures. This can be done through the ''-maxtime'' option to minix-service(8):
  
-The pseudo-targets ''​servers'',​ ''​fs'',​ ''​net'',​ and ''​drivers''​ will perform actions on the services in the corresponding subdirectories in the MINIX3 source tree. The ''​rd''​ pseudo-target regenerates the ramdisk, which must be redone after changing any service ​on the ramdisk. For example, the following command instruments core servers and file system services with the magic and asr passes, and rebuilds the ramdisk:+  minix# minix-service ​-t update self -label vfs -state 2 -maxtime 120HZ
  
-  $ C=servers,fs,rd ./build.llvm magic asr+The maximum time is specified in clock ticks by default, but may be given in seconds by appending "HZ" to the timeout. The latter may sound confusing and it is, but the original idea was supposedly that the number of seconds is multiplied by the system's clock frequency, also known as its HZ setting. The above example allows the live update of VFS to take up to two minutes.
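The arithmetic behind the "HZ" suffix can be illustrated with a small, self-contained C sketch. This is not the parsing code used by minix-service(8); the function name ''maxtime_to_ticks'' and the clock frequency of 60 used in the usage note are assumptions made purely for illustration.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper (not part of minix-service): convert a -maxtime style
 * argument to clock ticks. A plain number is taken as ticks; a number
 * followed by "HZ" is taken as seconds and multiplied by the clock
 * frequency 'hz'. Returns -1 if the argument cannot be interpreted.
 */
long maxtime_to_ticks(const char *arg, long hz)
{
    char *end;
    long value = strtol(arg, &end, 10);

    if (end == arg || value < 0)
        return -1;              /* no digits, or a negative number */
    if (strcmp(end, "HZ") == 0)
        return value * hz;      /* seconds multiplied by ticks per second */
    if (*end == '\0')
        return value;           /* already expressed in ticks */
    return -1;                  /* unrecognized suffix */
}
```

Under the assumed frequency of 60 ticks per second, ''maxtime_to_ticks("120HZ", 60)'' yields 7200 ticks, which corresponds to the two minutes mentioned above.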
  
-The ''​clientctl buildasr''​ command accepts this optional ''​C''​ shell variable as well. +== ASR rerandomization ==
-  * Each of the three steps undoes the effects of prior invocations of this step and subsequent steps, but not earlier steps: compiling and linking a service (step 1) will undo any previous relinking and instrumentation. Relinking a service (step 2) will similarly undo any previous relinking and instrumentation of the same service. Instrumenting a service (step 3) will undo any previous instrumentation,​ reapplying the instrumentation to the same relinked binary. Therefore, a single ''​build.llvm''​ invocation must be used to apply all passes at once.+
  
-  * Instrumentation with the magic pass will fail if the service has not been relinked with libmagic first. The same applies to the asr pass. However, the asr pass will not fail if the service has not been instrumented with the magic pass. Instrumenting a service with the asr pass but not the magic pass is of limited use: the service will be randomized, but cannot be subjected to live rerandomization.+The third update type is **ASR rerandomization**. Like self state transfer, ASR rerandomization uses the magic framework to perform state transfer. In this case, the service performs state transfer into a rerandomized version of the same service. This involves specifying the path to a rerandomized ASR binary to the minix-service(8) command, as well as the ''-a'' flag. The ''-a'' flag tells the new instance to enable the run-time parts of rerandomization during the live update.
  
-== The manual way (2/2): building the image ==+  minix# minix-service -a update ​/service/​asr/​pm-1 -progname pm -label pm
  
-Finally, a MINIX3 image can be built from the compiled MINIX3 code using the ''clientctl'' **buildimage** command:+In a system that has been built with ASR rerandomization, the (randomized) base service binaries are located in ''/service'' and the (randomized) alternative service binaries are located as numbered files in ''/service/asr''. As mentioned before, the update_asr(8) command can be used to perform these updates semi-automatically.
  
-  $ ./clientctl buildimage+Compared to self state transfer, ASR rerandomization comes with one extra restriction:​ the VM service cannot be subjected to forms of state transfer more complicated than self state transfer. For this reason, VM is also skipped by the update_asr(8) command. We will explain the restrictions regarding the VM service in the developers guide.
  
-This command produces a bootable MINIX3 hard disk image file. The generated image file is called ''​minix_x86.img''​ and located in the root of the MINIX3 source tree - ''​minix-src''​ in our examples.+== Functional update ==
  
-This command ​is called automatically as part of ''​clientctl buildasr''​.+The final update type is a **functional update**. Compared to self state transfer, ASR rerandomization relocates code and more data. However, for ASR rerandomization,​ there are still fundamentally no differences between the old and the new version ​of the service. In contrast, in the case of a functional update, the service performs state transfer into a new program. While this new program is typically highly similar, it may be different from the running service in various ways.
  
-=== Running ​the image ===+In terms of the minix-service(8) command, such functional updates can be performed by simply using //​minix-service update// with a new binary. For example, one could test a new version of the UDS (UNIX Domain Sockets) service, without installing it into ''/​service''​ yet, and without affecting its open sockets:
  
-Once a hard disk image has been generated, it can be run. The most convenient way to run the image is to use qemu. For convenience,​ the ''​clientctl''​ script in ''​minix/llvm''​ has a **run** command to run the image in qemu without further effort:+  ​minix# minix-service update /​usr/​src/​minix/​net/​uds/uds -label uds
  
-  $ OUT=F ./clientctl run+The possibility of actual differences between the old and new service versions adds an extra dimension for the state transfer. Additional state transfer problems can be expected in this case, and must be dealt with accordingly. The developers guide will (eventually) elaborate on this point.
  
-The ''OUT'' shell variable can be set to other values to control what to do with serial output. The ''F'' value specifies that the serial output will be redirected to ''F''ile, namely ''serial.out''. The other supported settings are ''S''tdout, ''C''onsole, and ''P''ty.+Similarly, depending on the nature of the update, the update action may require a specific state of quiescence. Taking UDS as an example, an update may change file descriptor transfers over sockets, in which case the update may impose that no file descriptors be in flight at the time of the update. The old instance of the service must support this as a custom quiescence state. This custom state can then be specified through the ''-state'' option of the //minix-service update// command.
  
-Extra [[usersguide:​bootmonitor|boot options]] can be supplied through ​the APPEND variable:+Since the live update functionality is relatively new for MINIX3, we do not yet have much experience with the practical side of performing functional updates to services. This document will be expanded as we gain more insight into the common usage patterns of live update. Stay tuned!
  
-  $ OUT=F APPEND="​rs_verbose=1" ./clientctl run+== Multicomponent updates ==
  
-This example will enable verbose output in the RS service, which is highly useful for debugging issues with live update.+From the user's perspective, updating multiple services at once is not much more complex than updating a single service. First, a number of **minix-service update** commands should be issued, just as before, but each with the ''-q'' flag added:
  
-=== Summary ===+  minix# minix-service -q -t update /service/pm -label pm 
 +  minix# minix-service -q -t update /​service/​vfs -label vfs -state 2
  
-The following commands ​can be used to obtain, build, instrument, and start a MINIX3 system that supports live update and live rerandomization,​ including three alternative rerandomized versions, in addition to the standard ones, of all system services:+Then, the entire update ​can be launched with the **minix-service sysctl upd_run** command:
  
-  $ git clone git://git.minix3.org/minix minix-src +  minix# minix-service sysctl upd_run
-  $ cd minix-src/​llvm +
-  $ export CC=clang CXX=clang++ +
-  $ JOBS=8 BUILDVARS="​-V MKMAGIC=yes"​ ./​configure.llvm +
-  $ ./clientctl buildasr 3 +
-  $ OUT=F ./clientctl run+
  
-The entire procedure will typically take about 30GB of disk space and several hours of time.+The RS output will be much more verbose in this case. Note that timeouts are still to be specified on a per-service basis, rather than for the entire update at once. If necessary, any queued //minix-service update// commands may be canceled with the **upd_stop** subcommand:
  
-Sometime later, the following steps can be used to update the installation to a newer MINIX3 version:+  minix# minix-service sysctl upd_stop
  
-  $ cd minix-src/​llvm +This will cancel the entire multicomponent live update action.
-  $ git pull +
-  $ export CC=clang CXX=clang++ +
-  $ JOBS=8 BUILDVARS="​-V MKMAGIC=yes" ​./​configure.llvm +
-  $ for pass in WeakAliasModuleOverride sectionify magic asr; do (cd passes/​$pass && make clean install); done +
-  $ (cd static/​magic && make clean install) +
-  $ ./clientctl buildasr 3 +
-  $ OUT=F ./clientctl run+
  
-In contrast to the initial run, the entire update procedure should take no more than an hour.+===== Developers guide =====
  
-Instead of the ''./clientctl buildasr 3'' step in the above two examples, one can for example also instrument the system for live update but not live rerandomization, using the following three replacement steps:+This part of the document provides in-depth information for developers. We start with information for system service developers, explaining how to support live update for a newly written service. This requires limited understanding of the details of the live update infrastructure, and is therefore a somewhat separate section.
  
-  $ ./relink.llvm magic +The rest of the developers guide is targeted towards people who maintain the live update infrastructure. We first cover some of the theoretical and practical aspects of the live update approach and infrastructure on MINIX3. We then elaborate on several practical aspects related to state transfer using the magic framework, including how to prevent and resolve state transfer issues.
-  $ ./build.llvm magic +
-  $ ./clientctl buildimage+
  
-==== Using live update ​====+==== Writing a service ​====
  
-Once an instrumented MINIX3 system has been built and started, it should be ready for live updates. MINIX3 offers two scripts that make use of the live update functionality: one for testing the infrastructure, and one for performing runtime ASR rerandomization. In addition, the user may let the system perform live updates explicitly. In this section, we cover both parts.+This section is for writers of system services. We cover two aspects: general requirements for live updates, and specifying custom live update quiescence states.
  
-The commands in this section are to be run within MINIX3, rather than on the host system. They must be run as root, because performing a live update of a system service requires superuser privileges. These two things are reflected by the ''​minix#''​ prompt used in the examples below.+=== General requirements ===
  
-=== Pre-provided scripts ===+In by far most cases, allowing future live updates on a new service requires **no action at all** from the service developer. That is, if the service has been written properly, it can also be updated. Specifically,​ a service can be updated if it meets these conditions:
  
-The MINIX3 distribution comes with two scripts that can be used to test and use the live update and rerandomization functionality. The first one is //​testrelpol//​. This script may be used for basic regression testing on the MINIX3 live update infrastructure. The second ​one is //​update_asr//​. This command ​performs ​live rerandomization of system services at runtime.+  - It uses the System Event Framework (SEF) API throughout ​the service; 
 +  - It has one main message processing loop; 
 +  - It performs ​all its initialization in SEF initialization callback routines; 
 +  - It does not suffer from specific state transfer issues.
  
-== Infrastructure testing: testrelpol ==+The first three points are required for all services in any case, and are not specific to live update. These points are therefore covered better on other pages, in particular those on [[.:driverprogramming|programming device drivers on MINIX3]] and the [[.:sef|System Event Framework API]] (warning: currently outdated). We do explain the reason behind these three points in detail later on.
  
-The MINIX3 test suite has a test set script that tests the basic MINIX3 crash recovery and live update functionality. The script is called **testrelpol** and can be found in ''/usr/tests/minix-posix'':+Only the fourth point is specific to live update, and is relevant only to a small subset of services. This point is covered in more detail in the "State transfer in practice" section below. Specifically, as a service developer, you will want to verify that your new service does not suffer from potential issues with long-running memory grants, userspace threads, and physically unmovable memory. Then, you will want to test **self state transfer** on your service, and resolve any state transfer errors that come up. Only in these cases does the SEF live-update API (that is, the sef_*_lu_*(3) calls) become relevant. We do not elaborate on most of the SEF API in this document.
  
-  minix# cd /​usr/​tests/​minix-posix +=== Custom quiescence states ===
-  minix# ./​testrelpol+
  
-For its live update tests, this script does //not// use the magic framework for state transfer at all. Instead it uses **identity transfer** which basically just performs a memory copy between the old and the new instance. As a result, the testrelpol script should work whether or not services are instrumented. However, it may not work reliably on MINIX3 systems that are not built for magic instrumentation at all (i.e., built without ''MKMAGIC=yes'').+In certain cases, a service may have to meet custom requirements before it is allowed to be updated. This depends on both the service and the update. We previously gave an example regarding the UDS service and transferring file descriptors. As another example, an update that affects message protocols may have to ensure that the service has no outstanding requests to other services using that protocol. As yet another example, certain drivers may want to avoid being updated while certain types of DMA are ongoing, etcetera.
  
-== Live rerandomization: update_asr ==+It is up to the writer of the service to implement any such custom quiescence states, assigning a number to each of them. It is then up to the system administrator to supply such a state with the //minix-service update// command, using the ''-state <number>'' option. Some of the quiescence states are predefined; others must be defined by the service developer explicitly. The following states are defined:
  
-As we have shown before, the ''clientctl buildasr'' host-side command can perform the //build-time// preparation of a MINIX3 system for live rerandomization. Complementing this, the //run-time// side of the live rerandomization is provided by means of the **update_asr** command. The update_asr command will update system services to their next pregenerated rerandomized version, using a cyclic system. Live rerandomization is not automatic, and thus, the MINIX3 system administrator is responsible for running the update_asr command at appropriate times.+  * State **1** (''SEF_LU_STATE_WORK_FREE''): work free. This state ensures that the service is not currently performing any work. The fact that the service is being prepared at the time of verifying the quiescence state implies that it is not doing any other work, and thus, SEF is hardcoded to accept updates in this state. The service developer cannot override the check for this state. 
+  * State **2** (''SEF_LU_STATE_REQUEST_FREE''): request free. This state ensures that the service is not currently processing any requests from other services. The state is not valid by default, and may be implemented by the service writer. 
 +  * State **3** (''​SEF_LU_STATE_PROTOCOL_FREE''​):​ protocol free. This state ensures that the service ​is not currently engaged in any protocol exchange with other services. The state is not valid by default, and may be implemented by the service writer. 
 +  * State **4** to **6**: predefined states for specific purposes. These states are handled entirely by RS and SEF, and not relevant for service developers. 
 +  * State **7** and higher (''​SEF_LU_STATE_CUSTOM_BASE''​+//​n//​):​ custom states. These states may be used by services to define their own custom states. The namespace is per-service,​ so each service may define its custom states with numbers starting from 7 (''​SEF_LU_STATE_CUSTOM_BASE+0''​).
  
-By default, the update_asr command performs one round of ASR rerandomization, updating each service to its next version:+Thus, a service writer may want to implement states 2, 3, and/or any additional states starting from 7. This involves two necessary parts, and a third optional part.
  
-  minix# update_asr+First, the service must use the sef_setcb_lu_state_isvalid(3) SEF API call to specify a callback routine which specifies whether a particular state is valid for the service. In order to allow for states 2 and 3, but not any custom states, the standard sef_cb_lu_state_isvalid_standard(3) SEF callback routine may be given:
  
-By default, this command will report errors only. More verbose information can be shown using the -v switch:+  sef_setcb_lu_state_isvalid(sef_cb_lu_state_isvalid_standard);​
  
-  minix# update_asr -v+The service would typically issue this call before calling sef_startup(3). In order to allow for additional custom states, a custom callback routine must be supplied:
  
-For further details about this command, see the update_asr(8) manual page.+  sef_setcb_lu_state_isvalid(my_state_isvalid);
  
-Aside from providing actual security benefits, the update_asr script is the **most complete test** of the live update and rerandomization functionality at this time. It uses the magic framework for state transfer, with full-scale relocation of all state, and it applies the runtime ASR features. As of writing, it runs in the default qemu environment without any errors or subsequent issues.+This routine has the signature ''int my_state_isvalid(int state, int flags)'', and will be called when a live update is initiated through minix-service(8). As its most important parameter, ''state'' is the requested quiescence state. The ''flags'' parameter contains update flags and is typically unused. The routine must return ''TRUE'' if the state is valid for the service, and ''FALSE'' otherwise. Most services will want to allow the standard states as well as any custom states:
  
-The only aspect that is not tested with this command, is whether ASR rerandomization is //​effective//,​ that is, whether all parts of its address space were properly randomized by the asr pass. After all, ASR rerandomization between identical service copies works just as well, but provides substantially fewer security guarantees. Developers working on the asr pass are encouraged to check its effectiveness manually, for example using nm(1on generated service binaries on the host side.+  #define MY_CUSTOM_STATE_0 ​(SEF_LU_STATE_CUSTOM_BASE+0) 
 +  #define MY_CUSTOM_STATE_n (SEF_LU_STATE_CUSTOM_BASE+n) 
 +   
 +  return SEF_LU_STATE_IS_STANDARD(state) || (state >= MY_CUSTOM_STATE_0 && state <= MY_CUSTOM_STATE_n);​
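Put together, the state validity callback described in the new revision might look like the following minimal sketch. It is meant to compile outside MINIX3, so the ''TRUE''/''FALSE'' values, ''SEF_LU_STATE_CUSTOM_BASE'' and ''SEF_LU_STATE_IS_STANDARD'' below are stand-in definitions with assumed values (the real ones come from the SEF headers), and the choice of exactly two custom states is hypothetical.

```c
#include <assert.h>

/* Stand-in definitions so that the sketch compiles outside MINIX3.
 * ASSUMPTION: the values are placeholders, not the real SEF ones. */
#ifndef TRUE
#define TRUE 1
#define FALSE 0
#endif
#ifndef SEF_LU_STATE_CUSTOM_BASE
#define SEF_LU_STATE_CUSTOM_BASE 7      /* text: custom states start from 7 */
#endif
#ifndef SEF_LU_STATE_IS_STANDARD
/* ASSUMPTION: treat states 1 through 3 as the standard states. */
#define SEF_LU_STATE_IS_STANDARD(s) ((s) >= 1 && (s) <= 3)
#endif

/* Hypothetical service with two custom quiescence states. */
#define MY_CUSTOM_STATE_0 (SEF_LU_STATE_CUSTOM_BASE + 0)
#define MY_CUSTOM_STATE_1 (SEF_LU_STATE_CUSTOM_BASE + 1)

/* Report whether the requested quiescence state is one we support. */
int my_state_isvalid(int state, int flags)
{
    (void)flags;        /* update flags are typically unused */

    if (SEF_LU_STATE_IS_STANDARD(state))
        return TRUE;
    if (state >= MY_CUSTOM_STATE_0 && state <= MY_CUSTOM_STATE_1)
        return TRUE;
    return FALSE;
}
```

In a real service the routine would be registered with ''sef_setcb_lu_state_isvalid(my_state_isvalid)'' before the call to sef_startup(3).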
  
-=== Live update commands ===
+Second, the service must use the sef_setcb_lu_prepare(3) SEF API call to specify a callback routine which verifies whether the service accepts a live update for a particular state, typically also before calling sef_startup(3):

-RS can be instructed to perform live updates through the service(8) command, specifically through its **service update** subcommand. This command is also used by the automated scripts. For a full overview of the command's functionality, please see the service(8) manual page as well as the command's output when it is run with no parameters.
+  sef_setcb_lu_prepare(my_lu_prepare);

-In its most fundamental form, the //service update// command will update a running service, identified by its label, to a new on-disk binary file. It is however possible to tell RS to update the service into a copy of itself, and to influence the process using various flags and options. The basic syntax to perform a live update on a single system service is as follows:
+This routine has the signature ''int my_lu_prepare(int state)'', and will be called when a live update is initiated through minix-service(8), after ensuring the given state is valid. Again, ''state'' is the requested quiescence state. The function must return ''OK'' if the live update can proceed in this state, and ''ENOTREADY'' otherwise. It should check the standard states and/or any custom states, typically in a switch statement.
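The switch statement suggested for the prepare callback could be sketched as follows. This is an illustration only: the ''OK'' and ''ENOTREADY'' values are stand-ins so that the code compiles outside MINIX3, the ''MY_STATE_*'' names are hypothetical labels for the state numbers used in this document, and ''pending_requests'' and ''fd_transfers_active'' are invented bookkeeping variables of an imaginary service.

```c
#include <assert.h>

/* Stand-in definitions; ASSUMPTION: placeholder values, not MINIX3's. */
#ifndef OK
#define OK 0
#endif
#ifndef ENOTREADY
#define ENOTREADY 1
#endif

/* Hypothetical names for the state numbers mentioned in the text. */
#define MY_STATE_WORK_FREE     1    /* state 1: work free */
#define MY_STATE_REQUEST_FREE  2    /* state 2: request free */
#define MY_CUSTOM_STATE_0      7    /* first custom state */

/* Hypothetical bookkeeping maintained by the service's message loop. */
int pending_requests;       /* requests accepted but not yet answered */
int fd_transfers_active;    /* descriptor transfers in progress */

/* Decide whether the live update may proceed in the requested state. */
int my_lu_prepare(int state)
{
    switch (state) {
    case MY_STATE_WORK_FREE:
        /* We are handling the prepare message, so no work is ongoing. */
        return OK;
    case MY_STATE_REQUEST_FREE:
        return (pending_requests == 0) ? OK : ENOTREADY;
    case MY_CUSTOM_STATE_0:
        return (fd_transfers_active == 0) ? OK : ENOTREADY;
    default:
        /* Unknown states are refused rather than silently accepted. */
        return ENOTREADY;
    }
}
```

Returning ''ENOTREADY'' from the default case is a conservative choice for this sketch; the validity callback should already have filtered out unsupported states.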
  
-  minix# service [flags] update [self|<binary>] -label <label> [options]
+Third, the service may optionally provide a quiescence state debugging function through the sef_setcb_lu_state_dump(3) SEF API call. The given callback routine has the signature ''int my_lu_state_dump(int state)'' and should use the sef_lu_dprint(3) printf-like function to print information about the given quiescence state and its current internal state as appropriate, using newline-terminated lines.
  
-Through various combinations of this command's parameters, MINIX3 basically supports four types of updates, representing increasingly challenging conditions for the overall live update infrastructure in general, and state transfer in particular. We will now go through all of them, and explain how they can be performed. Later on, the developers guide will provide a more in-depth explanation of the four types of updates.
+==== What is where ====

-== Identity transfer ==
+We now get into the details of the live update infrastructure. For many parts of the story, it may be useful to take a look at the actual source code as well. In this section we give a quick overview of what parts of the source code are where, and what they do.
  
-The first update type is **identity transfer**. In this case, the service is updated to an identical copy of itself, with all functions and static data in the new instance located at the exact same addresses as the old instance. Identity transfer bluntly copies over entire sections at once, thus requiring no instrumentation at all. This makes it suitable for testing of the MINIX3-specific side of the live update infrastructure, hence its use in the ''testrelpol'' script. Identity transfer is the default of the service(8) command when "self" is given instead of a path to a new binary:
+The LLVM instrumentation passes are located in ''minix/llvm'' of the MINIX3 source code, along with the generate_gold_plugin.sh script described in the users guide. The following relevant LLVM passes are located in ''minix/llvm/passes'':

-  minix# service update self -label pm
+  * The **WeakAliasModuleOverride** pass resolves a particular issue with weak symbols being used in assembly code. TODO

-This will perform an identity transfer of the PM service. Identity transfer should work for literally all MINIX3 system services. As mentioned, it is guaranteed to work only when the system was built with ''MKMAGIC=yes'', although it will mostly work on systems built without magic support as well. It works regardless of whether the target service was instrumented with the magic framework (or ASR).
+  * The **sectionify** pass is used to tag certain functions and data of bitcode modules as belonging to a certain section. Its main purpose is to tag certain parts of the compiled code such that the magic pass (see below), in a subsequent run over the same code, will treat the tagged parts as special. For example, it is used to ignore all variables in the libc malloc code for state transfer, for reasons explained later.

-If the live update is successful, the service(8) command will be silent, but RS will print a system message that the update succeeded:
+  * The **magic** pass performs link-time static analysis and instrumentation of system services. It is responsible for supplying libmagicrt (see below) with the necessary information to allow for state transfer at runtime, by including descriptions of data types, global variables, and other information, in the service module. In addition, it is responsible for replacing certain function calls in the module, in particular memory management functions, with calls to wrappers in libmagicrt. This allows for runtime tracking of dynamically allocated memory objects.

-  RS: update succeeded
+  * The **asr** pass performs randomization of the service binary, for example by rearranging functions, basic blocks within functions, and data, adding padding between those, and letting functions allocate stack padding. The ASR pass does not deal with randomization of dynamically allocated objects. Instead, it passes some settings on to libmagicrt.

-If the system was started on qemu with ''OUT=F'', this message will end up in ''serial.out''. Otherwise, the message should show up in the system log (''/var/log/messages'') and possibly on the first console.
+In addition to the passes, the following pieces of system functionality are especially important for live update:

-If the live update fails, RS should print an error to the system log, and service(8) will complain. In order to debug such failures, it may be useful to enable verbose mode in RS, by starting the system with ''rs_verbose=1'' as shown earlier.
+  * The magic runtime library, **libmagicrt**, is the runtime component of system services. It implements the actual state transfer routine, which uses both the information embedded in the service by the magic pass, and the tracking information it has gathered about dynamically allocated memory objects at run time. It also implements the actual runtime tracking. Furthermore, libmagicrt implements the aforementioned runtime part of the ASR functionality. For example, libmagicrt can add extra padding when performing memory allocations. The magic runtime library is located in ''minix/lib/libmagicrt''.

-== Self state transfer ==
+  * The glue between system services and libmagicrt is implemented as part of the **System Event Framework** (SEF) library routines. These routines also handle the communication between the system service and RS. Use of SEF is compulsory for all system services. The SEF code is part of **libsys**. Its implementation can be found in the ''minix/lib/libsys/sef*.c'' files.

-The second update type is **self state transfer**. Self state transfer also performs an update of a service into an identical copy of itself, but instead uses the state transfer functionality of the magic framework. Thus, self state transfer requires that the service be instrumented properly, and the update type can be used to test whether a service's state can be transferred without problems. Many of the things explained here also apply to the remaining two update types, as all three are using the state transfer of the magic framework.
+  * The source code of **RS**, the Reincarnation Server, is located in ''minix/servers/rs''. RS uses live update functionality implemented in the kernel, located in ''minix/kernel'', and VM, located in ''minix/servers/vm''.
  
-Self state transfer is performed by supplying the ''-t'' flag along with "self" to the service update command:
+==== The infrastructure ====

-  minix# service -t update self -label pm
+We now elaborate on various MINIX3-specific aspects that are important to understand regarding live update. We describe the live update procedure, show the consequences of the quiescence approach, list the properties of various process memory sections, describe the two supported types of state transfer, and elaborate on the exceptions to the general model for various core system services.

-This command will perform self state transfer of the PM service. The libmagic state transfer routine in the new service instance will print additional system messages while it is running. Upon success, the system output will look somewhat like this:
+=== The update procedure ===

-  total remote functions: 57. relocated: 54
-  total remote sentries: 186. relocated normal: 84 relocated string: 101
-  total remote dsentries: 5
-  st_data_transfer: processing sentries
-  st_data_transfer: processing dsentries
-  st_data_transfer: processing sentries
-  st_data_transfer: processing dsentries
-  st_state_transfer: state transfer is done, num type transformations: 0
-  RS: update succeeded
+We first describe the live update procedure in more depth.
  
-If the state transfer routine is not able to perform state transfer successfully, it will print messages that start with ''[ERROR]''. RS will then roll back the service to the old instance, and both RS and service(8) will report failure. Self state transfer should succeed for all MINIX3 system services that have been built with bitcode and instrumented with libmagic and the magic pass. As of writing, there are no system services for which self state transfer is known to result in ''[ERROR]'' lines and subsequent live update failure. However:
+In general, properly achieving //quiescence// is one of the main challenges for a live update system. For example, if a live update changes the implementation of a particular function, the component being updated must not be executing that function at the time of the live update; if it is, the live update will most likely result in a crash of the component. In MINIX3, the quiescence issue is resolved in a way that leaves little room for problems, by exploiting MINIX3's message-based nature. In essence, all the MINIX3 services consist of a main message loop that repeatedly receives a message and processes this message. MINIX3 supports no kernel threads, and thus, the MINIX3 services have no internal CPU-level concurrency. As a result, a message can be used to enforce quiescence.

-  * It is possible that changes to system services, and even usage scenarios of services which we have not yet tested, result in new state transfer errors. Such errors should be resolved. The developers guide further below contains instructions on how to resolve some of these errors.
+MINIX3 live updates are orchestrated by the RS (Reincarnation Server) service. The administrator of the system first compiles a new version of the service into an executable on disk, and then instructs RS to update a particular running system service into the new version, through the minix-service(8) utility. RS starts by loading the new version of the service as a new service process, without letting it run. Thus, there are temporarily two instances of the service: the old instance, which is still running, and the new instance, which contains the new code but not yet any of the necessary state.

-  * Currently, one service is not built with bitcode, namely the memory driver. It is therefore also not instrumented. An attempt to perform self state transfer on any service that is not instrumented will result in a "Function not implemented (error 78)" error. This is usually a good indication that a step was missed during the build phase.
+RS then asks the old instance of the service to prepare to be updated, by sending a __prepare__ request message to it. At the moment that the service receives and processes the preparation message, it is by definition in a known state, as it cannot also be doing something else at the same time. While this is a good start for quiescence, the service may have to meet additional requirements regarding its current activity, depending on the service and the type of live update. The administrator provides the intended //quiescence state// for the live update when starting the update, and the service itself determines whether or not it is //ready// when handling the __prepare__ message. If the service decides that it does not meet the given quiescence requirements, the live update is aborted.
  
-  * Some services have no state to transfer, in which case their new instances will perform a fresh start instead of state transfer. In that case, live update with self state transfer will succeed, but not print the state transfer system messages shown above. This is the case for the IS (Information Server) and readclock.drv services, for example.
+However, if the old instance does meet the requirements, it acknowledges that it is ready by sending a __ready__ message to RS, blocking on receipt of a reply from RS. Thus, the old instance is effectively stopped in a known state. In order to maintain the externally visible state (most importantly, the communication endpoint) of the service being updated, the process slots of the old and the new service instances are swapped. The new instance, now in the original process slot, is then allowed to run. Upon startup, the new instance finds out from RS that it is the new instance of an old, stopped process, and attempts to perform state transfer from this old process into itself.

-  * Some services may only be updated once brought into a specific state of quiescence, because the default quiescence state is not sufficiently restrictive. In that case, the user must specify an alternative quiescence state explicitly, through the service(8) ''-state'' option. This currently applies to all services that make use of usermode threads, namely the VFS, ahci, and virtio_blk services. They must be updated using quiescence state 2 (request free) rather than state 1 (work free):
+State transfer requires transfer of all individual pieces of data from the old to the new process, possibly to a new location. This is performed by the magic framework. In a nutshell, the magic state transfer approach relies on having a full view of all the individual pieces of data that make up the process, along with type information about the data, including for example structure layouts and types of pointers. For static data, this information is generated by the magic pass through static analysis performed at compile time, and included with the service binary. For dynamic data, the information is collected and maintained by the magic runtime library attached to the service. The end result is that the state transfer framework knows about all global variables and functions, and for each pointer, what type of data the pointer points to.

-  minix# service -t update self -label vfs -state 2
+This knowledge, in addition to full access to the memory of the old instance through a special memory grant, allows the libmagicrt state transfer procedure in the new instance to iterate over all data of the old process. This procedure recursively follows any pointers it encounters, and //pairs// each piece of data with the corresponding piece of data in the new process, copying over the data and adjusting it for the new layout as necessary. In certain cases, the state transfer system may not be able to pair all pieces of data, or deal with all pointers. In that case, state transfer fails. Annotations in the service source code, as well as custom data transfer methods, can be provided in order to aid the state transfer process.

-Omitting the appropriate state parameter may result in a crash of the service after live update. At the moment, the update_asr script has hardcoded knowledge about these necessary states. None of this is great, and we will be working towards a situation where the default state will not result in a crash.
+Regardless of whether state transfer succeeded or failed, the new instance sends the result of the state transfer to RS using an __init__ request message. If state transfer succeeded, RS allows the new instance to continue to run, and kills the process of the old instance. If the state transfer fails, RS again swaps the process slots of the old and the new instance, allows the old instance to run again, and kills the new instance. In both cases, RS communicates the result to the minix-service(8) utility as well, ultimately letting the system administrator know about the outcome of the live update.
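The slot swap and the commit-or-rollback decision described in the new revision can be modelled in a few lines. The following is a toy model only, not RS code: ''struct instance'', ''swap_slots'' and ''finish_live_update'' are hypothetical names, and the real RS tracks far more state per service.

```c
#include <assert.h>

/* Hypothetical, heavily simplified view of a service instance. */
struct instance {
    int slot;       /* process slot, which determines the endpoint */
    int running;    /* nonzero if the instance is allowed to run */
    int alive;      /* zero once the process has been killed */
};

static void swap_slots(struct instance *a, struct instance *b)
{
    int tmp = a->slot;
    a->slot = b->slot;
    b->slot = tmp;
}

/*
 * Model of what happens once the old instance has reported ready.
 * transfer_ok is the state transfer result reported by the new instance.
 * Returns 1 if the update was committed, 0 if it was rolled back.
 */
int finish_live_update(struct instance *old_inst, struct instance *new_inst,
    int transfer_ok)
{
    /* The old instance is blocked; the new one takes over its slot. */
    old_inst->running = 0;
    swap_slots(old_inst, new_inst);
    new_inst->running = 1;

    if (transfer_ok) {
        /* Commit: the new instance keeps running, the old one dies. */
        old_inst->alive = 0;
        return 1;
    }

    /* Rollback: swap back, resume the old instance, kill the new one. */
    swap_slots(old_inst, new_inst);
    old_inst->running = 1;
    new_inst->running = 0;
    new_inst->alive = 0;
    return 0;
}
```

In either outcome exactly one instance ends up alive in the original slot, which is why the endpoint of the service stays the same for its clients.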
  
-  * State transfer may be slow, and RS applies a default timeout for live updates. Therefore, it may be necessary to set a longer timeout in order to avoid needless failures. This can be done through the ''-maxtime'' option to service(8):
+For multicomponent live updates, all affected services are first brought into the //ready// state, after which they are all updated. Any service failing to get ready in the preparation phase will cause an abort of the entire update, and any service failing the state transfer phase causes a rollback of the entire update.

-  minix# service -t update self -label vfs -state 2 -maxtime 120HZ
+Updating the RS and VM components requires various deviations from the procedure sketched above. In addition, support for live updating the VM service is limited. We elaborate on these points later on.

-The maximum time is specified in clock ticks by default, but may be given in seconds by appending "HZ" to the timeout. The latter may sound confusing and it is, but the original idea was supposedly that the number of seconds is multiplied by the system's clock frequency aka its HZ setting. The above command allows the live update of VFS to take up to two minutes.
+=== The quiescence model ===

-== ASR rerandomization ==
+We describe the quiescence model in a bit more detail, in order to make two points: 1) the implementation of system services must follow a basic standard structure in order to allow for live update, and 2) the process stack is, and can be, disregarded for the purpose of state transfer.

-The third update type is **ASR rerandomization**. Like self state transfer, ASR rerandomization uses the magic framework to perform state transfer. In this case, the service performs state transfer into a rerandomized version of the same service. This involves specifying the path to a rerandomized ASR binary to the service(8) command, as well as the ''-a'' flag. The ''-a'' flag tells the new instance to enable the run-time parts of rerandomization during the live update.
+The following piece of pseudocode represents a simplified and flattened version of the general structure of each system service:
  
-  minix# service -a update /service/asr/1/pm -label pm

-When a system has been built with ASR rerandomization, the (randomized) base service binaries are located in ''/service'' and the (randomized) alternative service binaries are located in numbered subdirectories in ''/service/asr''. As mentioned before, the update_asr(8) command can be used to perform these updates semi-automatically.

-ASR rerandomization comes with one extra restriction: the VM service cannot be subjected to more complicated forms of state transfer than self state transfer. It is also skipped by the update_asr(8) command for this reason. We will explain the restrictions regarding the VM service in the developer section.
+<code>
+main:
+        initialization
+        receive INIT message from RS
+        if INIT message requests a FRESH start:
+                perform service initialization
+        if INIT message requests a LIVE UPDATE start:
+                perform state transfer
+        send result of performed action to RS
+
+        # there should be nothing else here
+
+        # the main message loop
+        while true:
+                receive message
+                if message from RS and message is PREPARE:
+                        # for simplicity, we are always ready
+                        send READY message to RS
+                        receive response message from RS
+                        # if we get here, the live update has failed
+                        continue
+                handle message
+</code>
  
-== Functional update ==
+As can be seen, the service's initialization code starts by learning from RS what type of initialization it should perform.
+This can be either //fresh// initialization of the system service, or state transfer for the purpose of live update (for simplicity we disregard crash recovery). If the service is started anew, typically during system boot, it will perform the service initialization. Such fresh initialization typically consists of initializing global variables, performing initialization-only procedures, etcetera. In contrast, if a new service instance is started for the purpose of live update, it will skip the fresh initialization and instead perform state transfer from the old instance.

-The final update type is a **functional update**. Compared to self state transfer, ASR rerandomization relocates code and more data. However, there are fundamentally no differences between the old and the new version of the service. In the case of a functional update, the service performs state transfer into a new program. While typically highly similar, the new program may be different from the running service in various ways.
+In practice, all interaction with RS is implemented in the System Event Framework (SEF) library code. The service-specific actions such as the fresh initialization action are implemented as callbacks from SEF. In the case of fresh initialization, the service is to provide a callback function to SEF using the sef_setcb_init_fresh(3) API call. The default state transfer action for //live update// start does not require code in the actual service at all.

-In terms of the service(8) command, such functional updates can be performed by simply using //service update// with a new binary. For example, one could test a new version of the UDS (UNIX Domain Sockets) service, without installing it into ''/service'' yet, and without affecting its open sockets:
+If the service has initialization code that is called outside of the "fresh initialization" procedure, for example at the "there should be nothing else here" point, then this code will also be called in case of live update, possibly undoing the effects of the state transfer. Therefore, services must perform initialization only from the designated initialization routines.

-  minix# service update /usr/src/minix/net/uds/uds -label uds
+After either type of initialization, the service will enter the main message loop, where it will repeatedly receive a message and handle that message. If the received message is a __prepare__ request from RS, then the service is about to be updated, and it sends a __ready__ message to RS, blocking until it gets a response. If the live update is successful, this old instance will never get a response, and instead be killed.

-The fact that this time there may be actual differences between the old and new versions of the services adds an extra dimension to the state transfer issue. Additional state transfer failures can be expected in this case, and must be dealt with accordingly. The developers guide will eventually elaborate on this point.
+As can be seen, in terms of the process stack of the service, the execution path from main() to the point where the service gets blocked receiving the __ready__ response from RS (let's call this the //quiescence point//) is short and simple. As a result, if the state transfer procedure restored the new instance's stack and program counter to continue from the quiescence point, the result would essentially be the same as not doing so: in both cases, the new service would end up at the start of the message loop. Therefore, the MINIX3 state transfer approach chooses to disregard the execution context of the old process, thus obviating the need to transfer the stack altogether. This is viable only due to the well defined quiescence model.

-Similarly, depending on the nature of the update, the update action may require a specific state of quiescence. Taking UDS as an example, an update may change file descriptor transfers over sockets, in which case the update may impose that no file descriptors are being transferred at the time of the update. The old instance of the service must support this as a quiescence state. This state can then be specified through the ''-state'' option of the //service update// command.
+However, it is possible that the functions leading up to the quiescence point, including the main message loop, have local variables on the stack that maintain long-running state. For example, the main() function could maintain a counter for the number of messages received so far. The values of such variables will be lost during the live update. If this were a major issue, the live update framework could be made to instrument the stack as well, but this could come at great cost since instrumenting only the stack of functions leading up to the quiescence point would be difficult. In practice, not having essential long-running variables in main() is rather simple, and we have not seen problems so far.
  
-Since the live update functionality is relatively new for MINIX3, we do not yet have much experience with the practical side of performing functional updates to services. This document will be expanded as we gain more insight into the common usage patterns of live update. Stay tuned!
+=== Process sections ===

-== Multicomponent updates ==
+The address space of a process is typically made up of various memory sections with different purposes, and MINIX3 system services are no different. There are important differences between various sections when it comes to state transfer.

-From the user's perspective, updating multiple services at once is not much more complex than updating a single service. First, a number of **service update** commands should be issued, each with the ''-q'' flag:
+  * The new instance's **text** section is already as it should be: it contains the new code which has been loaded for the new instance by RS.
+  * The new instance's **data** section is initialized as though the service just started, and the state in this section must be transferred from the old service.
+  * As explained in the previous section, the **stack** section of the old instance can be ignored altogether, instead letting the new instance naturally reconstruct the stack by going through the regular process starting procedure to get back into main() and the message loop.
+  * The new instance will have an empty **heap** section. Its state transfer procedure will have to use the brk(2) system call in order to request heap memory for itself so that it can transfer the heap state from the old service.
+  * For the memory-mapped pages that make up the old instance's **mmap** section, things are slightly different. MINIX3 ensures that the new instance automatically inherits a copy-on-write version of all memory-mapped pages. Thus, the new instance will automatically have the old instance's memory-mapped pages mapped into its address space. For some pages, copy-on-write mappings are not possible. This is the case with memory-mapped I/O and for memory used for DMA transfers. Such pages are mapped with full sharing between the two instances.

-  minix# service -q -t update /service/pm -label pm
-  minix# service -q -t update /service/vfs -label vfs -state 2
+For a live update of the VM service, the last two points are different. We describe the exceptions for VM in a later section.

-Then, the entire update can be launched with the **service sysctl upd_run** command:
+With this situation as a given, MINIX3 allows for two forms of state transfer: identity transfer, and state transfer by the magic framework. These forms of state transfer are covered in the next two sections.
  
-  minix# service sysctl upd_run
+=== Identity transfer ===

-The RS output will be much more verbose in this case. Timeouts are still to be specified on a per-service basis, rather than for the entire update at once. If necessary, any queued //service update// commands may be canceled with the **upd_stop** sysctl subcommand:
+The simple case is identity transfer. Identity transfer is a minimal state transfer approach that can only transfer state from an old instance to a new instance of exactly the same service, that is, a process with exactly the same address space layout and functionality. Identity transfer is also supported when the target service has not been instrumented, and in fact even when the system has not been compiled using LLVM bitcode at all.

-  minix# service sysctl upd_stop
+Since the new instance is a newly started copy of the same service, it already has a text section that is identical to the old instance. As described, the stack section need not be transferred, and the mmap section is inherited automatically.

-This will cancel the entire multicomponent live update action.
+Therefore, identity transfer is concerned with the data and heap sections only. The new instance's identity transfer procedure starts by copying over the old instance's entire data section to itself. This includes the variable that contains the size of the old instance's heap (''_brksize''). The identity transfer procedure then calls brk(2) to allocate a heap for itself which is just as large, and copies over the old instance's entire heap section to itself as well. The identity transfer procedure is implemented in the System Event Framework (SEF) as part of libsys.
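The three steps of identity transfer (copy the data section, grow the heap to the recorded size, copy the heap) can be illustrated with an ordinary userland model. This is a sketch under simplifying assumptions, not the libsys implementation: ''struct service_image'' and ''identity_transfer'' are hypothetical names, malloc(3) stands in for brk(2), and a plain memcpy(3) stands in for the cross-process copy from the old instance.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a service's data section. */
struct data_section {
    int counter;        /* some global state of the service */
    size_t brksize;     /* heap size, kept inside the data section */
};

/* Hypothetical stand-in for the parts of an instance that matter here. */
struct service_image {
    struct data_section data;
    unsigned char *heap;
};

/* Returns 0 on success, -1 if the heap could not be allocated. */
int identity_transfer(struct service_image *new_img,
    const struct service_image *old_img)
{
    /* Step 1: copy the entire data section, including the heap size. */
    memcpy(&new_img->data, &old_img->data, sizeof(new_img->data));

    /* Step 2: obtain a heap of the same size (stand-in for brk(2)). */
    new_img->heap = NULL;
    if (new_img->data.brksize == 0)
        return 0;
    new_img->heap = malloc(new_img->data.brksize);
    if (new_img->heap == NULL)
        return -1;

    /* Step 3: copy over the entire heap contents. */
    memcpy(new_img->heap, old_img->heap, new_img->data.brksize);
    return 0;
}
```

Note that the heap size only becomes known to the new instance after the first step, which is why the data section has to be copied before the heap can be allocated.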
  
-==== Useful host commands ====+If the system is not built with ''​MKMAGIC=yes'',​ which means that ''​_MINIX_MAGIC''​ is not defined, then the mmap section of the process is not well delineated and may in fact overlap with other memory areas. This is intentional,​ as it ensures that for such a set-up, the address space layout of services is not unnecessarily restricted and services can use the full address space for, say, a page cache. However, as a result, some memory-mapped areas may not be mapped into the new process, possibly leading to segmentation fault after the live update. Therefore, even identity transfer is not expected to be reliable on a system //not// built with ''​MKMAGIC=yes''​. Eventually, MINIX3 should be changed to use another approach for transferring memory-mapped regions to the new process altogether, which is either not based on ranges or not the default at all. See also the section on open issues in this document.
  
-The host-side ''​clientctl''​ script in ''​minix/​llvm''​ offers a number of additional convenient commands, mainly for developers. We list some of them here.+=== Magic state transfer ===
  
-The **buildboot** command installs just the services that are part of the boot imageIt can be used instead ​of ''​clientctl buildimage''​ when only boot-image services have been changedthus speeding up the development cycle:+The other case is state transfer by the magic framework. This type of state transfer is used by the //self state transfer//, //ASR rerandomization//,​ and //​functional update// update types as covered in the users guideThis form of state transfer relies on the magic pass and library to implement instrumentation and runtime support for state transfer. Againstate transfer is performed by the new instance of the service, using full access to the address space of its old instance.
  
-  $ ./clientctl buildboot+The magic framework's state transfer procedure transfers data objects one by one. This includes all //static// objects. In this context, an object may for example be one global variable. The actual transfer of an object is not a simple memory copy; it involves analyzing any pointers in the object and adjusting these pointers as appropriate to match the address layout of the new instance.
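
The pointer adjustment described above can be illustrated with a small, self-contained sketch. It is not taken from libmagicrt: the ''obj_map'' structure and the ''adjust_pointer'' function are hypothetical names, and the sketch assumes that the framework already knows, for every transferred object, its address in the old and in the new instance.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record for one transferred object: its address in the old
 * instance, its address in the new instance, and its size in bytes. */
struct obj_map {
    uintptr_t old_base;
    uintptr_t new_base;
    size_t size;
};

/* Translate a pointer value from the old layout to the new layout.
 * Returns 0 on success, or -1 if the pointer targets no known object
 * (the situation in which a real state transfer would have to fail). */
static int
adjust_pointer(const struct obj_map *maps, size_t nmaps, uintptr_t old_ptr,
    uintptr_t *new_ptr)
{
    size_t i;

    assert(new_ptr != NULL);
    assert(maps != NULL || nmaps == 0);

    if (old_ptr == 0) {         /* a null pointer stays a null pointer */
        *new_ptr = 0;
        return 0;
    }
    for (i = 0; i < nmaps; i++) {
        if (old_ptr >= maps[i].old_base &&
            old_ptr - maps[i].old_base < maps[i].size) {
            /* Keep the offset within the object, change the base. */
            *new_ptr = maps[i].new_base + (old_ptr - maps[i].old_base);
            return 0;
        }
    }
    return -1;
}
```

A pointer into the middle of an object keeps its offset within that object, which is why a plain memory copy of the pointer value is not sufficient once objects move.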
  
-Using this command, it is possible to make and test changes to boot system services fairly quickly. As an example, the following set of steps suffices to make and test changes to the PM service:+The state transfer procedure also transfers //dynamic// data objects, which are located in the heap and mmap sections of the old instance. In essence, the procedure recreates the heap and mmap sections during the state transfer, by allocating new heap or mapped memory for each dynamic object, and then transferring its actual contents. This again includes pointer analysis and adjustment. Here, one object is one piece of memory created by a call to malloc(3) or mmap(2), for example.
  
-  $ export PATH=$PATH:/home/user/minix-liveupdate/obj.i386/tooldir.{platform}/bin +Since MINIX3 already transfers the mmap section to the new instance automatically, the state transfer framework starts by unmapping all memory-mapped areas that it knows it will recreate. However, since some memory areas (the aforementioned memory-mapped I/O and DMA memory) cannot be recreated by the magic framework, these are not destroyed and recreated. These areas are called //special// or //out-of-band// memory. The service has to tell the magic runtime library about special memory areas. For the two common ways of allocating such memory, alloc_contig(3) and vm_map_phys(2), this is done automatically by libsys.
-  $ cd minix-src/minix/servers/pm +
-  [make changes to the PM source code] +
-  $ nbmake-i386 all install +
-  $ cd ../../llvm +
-  $ C=pm ./relink.llvm magic +
-  $ C=pm ./build.llvm magic +
-  $ ./clientctl buildboot +
-  $ OUT=F ./clientctl run+
  
-The **unstack** command shows a stacktrace of pretty much any MINIX3 binary in human-readable form:+Out-of-band memory is seen as opaque, physically and virtually unmovable memory, and ignored entirely for the purpose of state transfer. Thus, if a piece of out-of-band memory contains a pointer to a piece of memory that is //not// marked as out-of-band, this pointer will be missed during state transfer. For the aforementioned (memory-mapped I/O and DMA) memory types, this is not a problem.
  
-  $ ./clientctl unstack <name> [address [address ..]]+The default of inheriting the entire mmap section leads to the situation that if the magic framework misses any memory-mapped areas for any reason, these will effectively translate to a memory leak in the new instance. Currently, one such memory leak is addressed explicitly: the page directory that is allocated with mmap(2) internally by the libc malloc code.
  
-For example, to show a stack trace of the PM service in a human-readable form:+The state transfer procedure may fail if its analysis is not successful, in which case the system will roll back to the old instance, and the live update fails. It is then up to the programmer to deal with such problems. This may involve annotating source code, for example to instruct the state transfer procedure to ignore certain pointers, or to copy over certain data as is. It may involve adding special state transfer routines to libmagicrt, which deal with fundamentally problematic cases such as unions. In rare cases, it may involve adapting source code to avoid state transfer problems. We discuss all this in more detail later.
  
-  $ ./clientctl unstack pm 0x805a7fd 0x80492a5 0x8048050+In the case of self state transfer, all static objects will have the same location in both the old and the new instance. However, due to their dynamic recreation, the addresses of dynamic objects may change during self state transfer.
  
-Note that on ASR-enabled installations, the unstack command works only on the base versions of system services: there is currently no way to unstack a stacktrace for any of the ASR-rerandomized service binaries. On one occasion, the author of this document has done that process by hand, by finding the matching assembly code of an ASR-rerandomized service's crash site in the service's base version.+In the case of ASR rerandomization, not just the dynamic part, but also the static part of the address space will have objects that are relocated between the old and the new instance. In addition, ASR rerandomization permutes the order in which the old instance's dynamic objects are allocated in the new instance. Finally, the asr pass may insert padding which may expose wrong assumptions about alignment of various buffers. Thus, while live rerandomization is a security feature, in practice it may expose not only additional problems with state transfer, but also bugs in the service itself.
  
-===== Developers ​guide =====+In the case of a functional update, the new instance may be fundamentally different from the old instance. Unlike the previous cases of state transfer, such live updates may involve functions and global variables that are added or removed, thus causing problems in the //pairing// part of the state transfer. The programmer may have to provide explicit state transfer routines in order to deal with these problems. 
 + 
 +=== Exceptions for services ===
 + 
 +While MINIX3 allows all of its services to be updated, certain services require special exceptions to allow for live updates, because they are crucial to the live update process itself. These services are RS and VM. This section elaborates on the exceptions made for RS and VM, and explains why VM in particular cannot be updated arbitrarily. 
 + 
 +== The RS service == 
 + 
 +TODO 
 + 
 +== The VM service == 
 + 
 +MINIX3 has limited support for performing a live update of the VM (Virtual Memory) service. There are two reasons why VM is a special case. First, VM provides essential memory management and page fault handling functionality to the other system services. Thus, the live update must ensure that none of that functionality is required during the course of a live update that includes VM. Second, VM's core data structures include page tables. If these page tables are changed during a live update, it may be impossible to perform a proper rollback. 
 + 
 +During normal operation, VM may allocate memory for itself. VM has both a heap and dynamic pages, implementing special local versions of brk(2) and mmap(2) to support this. In particular, page tables are stored in dynamically allocated memory, effectively in VM's mmap section. For VM, the live update procedure must therefore include the transfer of such dynamic state from the old to the new VM instance. 
 + 
 +Since page tables cannot simply be copied, they are made visible to the new instance by mapping the old instance's dynamic memory ranges directly into the new instance's address space. That means that any changes made to the dynamic data structures by the new instance (page tables included) become visible to the old instance after a rollback. However, the two instances do each have their own static memory (i.e., text and data sections, as well as a preallocated stack). Thus, any changes to dynamic memory made by the new instance would create a potential mismatch between the static and dynamic memory in the old instance after rollback.
 + 
 +Therefore, in order to allow for rollback, VM must not make any changes to its dynamic memory during the live update. That also means that it may not allocate memory during the live update, not for other processes and not for itself. This situation leads to the following exceptions:​ 
 + 
 +  * First and foremost, since the new VM instance essentially inherits the old instance'​s dynamic memory, the dynamic memory must be ignored by the state transfer framework. For this reason, at startup, VM tells libmagicrt that its entire dynamic memory region consists of special, out-of-band data. As a result, any pointers in this region will not be analyzed or adjusted by the state transfer procedure. This is a good thing, as changes to such pointers would not be undone after a rollback. However, the main consequence is that if the static memory layout of the VM process changes, any pointers in dynamic memory that point to static memory will become invalid. Therefore, updates to VM are limited to the **identity transfer** and **self state transfer** update types. 
 +  * Another effect of the automatic dynamic memory inheritance is that dynamic memory allocations need not and must not be tracked. Therefore, dynamic memory allocation functions are not instrumented in VM at all, requiring an instrumentation override. This override also requires disabling some other instrumentation features, such as the aforementioned libc malloc page directory exception. The features are disabled during VM's linking process, through special statements in its Makefile.
 +  * After a rollback, the old VM instance still has to perform a small number of corrective actions to undo changes made by the new instance. These actions are however kept to a minimum. In the future, more extended non-transparent rollback may be the key to allowing more invasive live updates to the VM service. 
 +  * The state transfer procedure requires some temporary memory to do its job. Since it cannot allocate such memory directly, an //​initialization buffer// is preallocated in the new VM instance, and the state transfer procedure uses this buffer instead of allocating memory dynamically. 
 +  * RS requests VM to preallocate (//pin//) RS's memory before starting a live update, so that RS will not require VM's functionality during the live update. 
 +  * For multicomponent live update operations that include VM, all memory-modifying actions are performed before, rather than during, the actual live update operation, using special preparation requests sent by RS to VM. The memory of all new instances is also preallocated in order to avoid memory allocation and pagefaults during the live update. The old VM instance is the last process that is made ready for the update, and the new VM instance is the first process that gets to run right after. 
 +  * Despite the pinning, the new VM instance may have to handle brk(2) system calls coming from other new service instances that are all part of the same multicomponent live update. IPC filters are used to ensure that the new VM instance gets requests only from the other services in this group, and not from any other running services. Note that VM does not make any changes to its dynamic memory while handling a brk(2) call. Also, since all memory is preallocated,​ the VM instance should never get any pagefaults or handle-memory requests for other services'​ new instances; such requests are blocked by the IPC filters as well. If they do occur, they should result in a timeout of the entire multicomponent live update. 
 + 
 +Overall, it should be clear that live update for the VM service is rather brittle. Eventually, a full revision of the live update approach for VM will have to reveal whether some or all of the current restrictions can be lifted. 
 + 
 +==== State transfer in practice ==== 
 + 
 +In this section, we elaborate on some of the practical details of the state transfer of the magic framework, mainly aiming to allow developers to resolve real-world state transfer failures. 
 + 
 +We do //not// get into most of the theoretical side of the state transfer, and we skip over many other practical details. Interested readers are advised to read the published work of Cristiano Giuffrida - see the "​Further reading"​ section at the bottom of this document. 
 + 
 +=== Some basics and terminology === 
 + 
 +The magic framework keeps track of each //static// object of data using a **sentry** ("state entry") data structure. The framework keeps track of each //dynamic// object of data using a **dsentry** ("dynamic state entry") data structure, which itself has an embedded //sentry// data structure. The magic pass installs libmagicrt wrappers around memory allocation routines so that it can allocate extra memory to store the dsentry metadata right before the actual memory object. Special, out-of-band memory regions are maintained in **obdsentry** ("out-of-band dynamic state entry") data structures. Since no extra memory can be allocated next to the actual memory object in this case, obdsentries themselves are (currently) stored as static data as part of libmagicrt's own state.
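
The idea of storing metadata right before the actual memory object can be sketched as follows. This is a heavily simplified illustration and not the libmagicrt implementation: ''toy_dsentry'', ''toy_header'', ''toy_malloc'', ''toy_dsentry_of'' and ''toy_free'' are hypothetical names, and a real dsentry holds far more information than the three fields shown here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical, heavily simplified stand-in for a dsentry. */
struct toy_dsentry {
    size_t size;                /* size of the object that follows */
    const char *type_name;      /* type recorded at allocation time */
    struct toy_dsentry *next;   /* list of all live dynamic objects */
};

/* Pad the header so that the object behind it stays suitably aligned. */
union toy_header {
    struct toy_dsentry d;
    long double pad_ld;
    long long pad_ll;
    void *pad_ptr;
};

static struct toy_dsentry *toy_dsentry_list = NULL;

/* Allocation wrapper: reserve room for the header plus the object, and
 * hand out a pointer to the memory just past the header. */
static void *
toy_malloc(size_t size, const char *type_name)
{
    union toy_header *h = malloc(sizeof(*h) + size);

    if (h == NULL)
        return NULL;
    h->d.size = size;
    h->d.type_name = type_name;
    h->d.next = toy_dsentry_list;
    toy_dsentry_list = &h->d;
    return h + 1;
}

/* Find the metadata that sits right before an object. */
static struct toy_dsentry *
toy_dsentry_of(void *obj)
{
    assert(obj != NULL);
    return &((union toy_header *)obj - 1)->d;
}

/* Deallocation wrapper: unlink the metadata, then free header and object. */
static void
toy_free(void *obj)
{
    struct toy_dsentry *d, **pp;

    if (obj == NULL)
        return;
    d = toy_dsentry_of(obj);
    for (pp = &toy_dsentry_list; *pp != NULL; pp = &(*pp)->next) {
        if (*pp == d) {
            *pp = d->next;
            break;
        }
    }
    free((union toy_header *)obj - 1);
}
```

Because the header travels with the allocation, the set of live dynamic objects can be enumerated at state transfer time without a separate lookup table. This also shows why the scheme does not work for out-of-band regions, where no extra memory can be placed in front of the object.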
 + 
 +The magic framework also uses the concept of a **selement** ("​state element"​),​ which is a particular element within a state entry; for example, it can be one particular field in a structure. State is transferred one element at a time. If the state transfer procedure encounters a problem, it will report about the state element that is causing the problem. 
 + 
 +Each pointer in the service process is expected to point to a data type known to libmagicrt. All the possible data types that can be used by the service are enumerated through static analysis by the magic pass, and stored in a **type** table as part of the instrumented service. It may happen that one data type is cast to another, either in the source code of the service or as a result of the LLVM compilation and linking process. As a result, while the static analysis may conclude that a pointer is for one type, runtime state transfer may find that the pointer was (for example) allocated for another type. Normally, such mismatches would cause state transfer to fail, but casting makes this a legitimate case. Therefore, the magic framework has the notion of **compatible types**: if type A is cast to type B anywhere, type A is marked as a compatible type for type B, and finding type A when transferring data of supposed type B will not result in state transfer failure. The magic pass adds a list of compatible types to the service binary as well, all for use by libmagicrt at state transfer time.
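
As a rough illustration of the compatible-types check, the following sketch models the list as a table of (supposed type, found type) pairs. The names ''compat_pair'' and ''type_is_compatible'' are hypothetical, types are reduced to plain integer identifiers, and the relation is treated as one-directional exactly as worded above; how libmagicrt represents and searches the list internally is not covered here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical entry: data of type 'supposed' may legitimately turn out to
 * be of type 'found', because 'found' was cast to 'supposed' somewhere. */
struct compat_pair {
    int supposed;
    int found;
};

/* Return nonzero if finding type 'found' is acceptable when the static
 * analysis expected type 'supposed'. */
static int
type_is_compatible(const struct compat_pair *tab, size_t n, int supposed,
    int found)
{
    size_t i;

    assert(tab != NULL || n == 0);

    if (supposed == found)      /* an exact match is always fine */
        return 1;
    for (i = 0; i < n; i++) {
        if (tab[i].supposed == supposed && tab[i].found == found)
            return 1;
    }
    return 0;                   /* a mismatch: state transfer would fail */
}
```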
 + 
 +=== Annotation === 
 + 
 +The analysis part of state transfer may not always succeed, for a variety of reasons. In particular, the state transfer framework has problems with unknown pointers, unions, and more generally cases of ambiguity. Such issues can often be resolved by the programmer through annotation in source code, which instructs the state transfer framework what to do with a particular variable. A variable can be annotated by prefixing either its type name (through ''typedef'') or its variable name with the annotation prefix followed by an underscore (e.g., ''noxfer_foo''). The following annotation prefixes are supported by the magic framework.
 + 
 +  * **noxfer**: No Transfer. This annotation will prevent transfer of the state altogether, instead zeroing out the memory in the new instance. As an example, the noxfer annotation can be used in cases where analysis is failing (e.g., in unions) and the new instance will never be using the old instance's data anyway. A practical example where it is used is the ''message'' type. This data type contains a complicated union, and the quiescence model typically ensures that transferring this state is not necessary, as the service being updated is not involved in processing a message at the time of the update.
 +  * **ixfer**: Identity Transfer. This annotation will copy the data over as is, without performing analysis on the memory. As an example, the ixfer annotation can be used for pointer values that should not be analyzed as pointers, for instance because they are pointers into another address space. A practical example where it is used is a process table copied in from another service. Such process tables typically contain external pointers, which will be unused by the local service. Some other values may still be needed after state transfer, which is why ixfer is used rather than noxfer. 
 +  * **cixfer**: Conditional Identity Transfer. This annotation will cause the state transfer framework to try to interpret and transfer the value as a pointer, and fall back to identity transfer if this fails. As an example, the cixfer annotation can be used for variables which may contain either a pointer or a number value which is never a valid pointer, making the variable effectively a union of the two types. A practical example where it is used is a callback value, which is of type ''​void *''​ but may be used to store a small integer as well. 
 +  * **pxfer**: Pointer Transfer. This annotation forces a value to be interpreted as a pointer, and transferred accordingly. As an example, the pxfer annotation may be used when a pointer value is stored in an integer type. The pxfer annotation may also be used for a union of (differently typed) pointers. Thus, in some cases, a union-of-structures can be split up into a union of non-pointers and one or more unions of pointers, marking the non-pointer union with ''​ixfer''​ and the pointer union(s) with ''​pxfer''​. This is indeed how ''​pxfer''​ is currently used in practice as well. 
 +  * **sxfer**: Structure Transfer. This annotation forces a union that consists of structures to be interpreted as one single structure, and transferred accordingly. The annotation requires that the fields of the structures making up the union all line up. For example, if the first field of one structure in the union is an integer value, then the first field of all other structures in the union must be an integer value as well. If the second field is a pointer in one structure, it must be a pointer in all of them, etcetera. The sxfer annotation can be used to resolve state transfer issues with unions that consist of nearly-identical structures. The programmer must line up the structures' fields as appropriate when annotating the union as sxfer.
 + 
 +The transfer exception is applied to the type (or variable) with the annotation. For example, a noxfer typedef for a pointer to a structure will refrain from transferring that pointer: 
 + 
 +  typedef struct foo * noxfer_foo_ptr_t;​ /* annotate the pointer */ 
 +   
 +  struct foo my_foo_struct;​ /* the structure will be transferred */ 
 +  noxfer_foo_ptr_t my_foo_pointer = &​my_foo_struct;​ /* the pointer will not be transferred */ 
 + 
 +However, a pointer to a noxfer typedef'​ed structure will be transferred;​ the contents of the structure will not: 
 + 
 +  typedef struct foo noxfer_foo_t;​ /* annotate the structure */ 
 +   
 +  noxfer_foo_t my_foo_struct;​ /* the structure will not be transferred */ 
 +  noxfer_foo_t * my_foo_pointer = &​my_foo_struct;​ /* the pointer will be transferred */ 
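
The same naming convention applies to the other prefixes. The following sketch shows how they might look in source code. All type and variable names here are made up for illustration, and the annotations only have an effect when the service is built with the magic instrumentation; compiled without it, they are ordinary identifiers.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical structure copied in from another service; its pointer field
 * is only meaningful in that other service's address space. */
struct remote_proc {
    int pid;
    void *remote_ptr;
};

/* ixfer: copy the data as is, without analyzing remote_ptr as a pointer. */
typedef struct remote_proc ixfer_remote_proc_t;

/* cixfer: try to transfer as a pointer, fall back to copying the raw value. */
typedef void *cixfer_cb_arg_t;

/* pxfer: an integer type that actually holds a local pointer value. */
typedef uintptr_t pxfer_addr_t;

static ixfer_remote_proc_t proc_copy;
static cixfer_cb_arg_t cb_arg;
static pxfer_addr_t saved_addr;

/* Annotation through the variable name instead of the type name. */
static char noxfer_scratch[64];

/* Store a small integer in the callback argument; this is the kind of use
 * that makes the variable effectively a union of pointer and number. */
static void
remember_small_integer(int n)
{
    assert(n >= 0 && n < 4096);
    cb_arg = (cixfer_cb_arg_t)(intptr_t)n;
}
```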
 + 
 +It is possible to enable debugging flags in libmagicrt such that it will print more details on how it handles annotated exceptions: in ''​minix/​lib/​libmagicrt/​include/​st/​callback.h'',​ change ''​ST_CB_DEFAULT_FLAGS''​ from ''​(ST_CB_PRINT_ERR)''​ to ''​(ST_CB_PRINT_ERR|ST_CB_PRINT_DBG)''​. The debugging statements will be sent to the system log, and have a ''​[DEBUG]''​ prefix. 
 + 
 +=== Custom state transfer routines === 
 + 
 +Custom state transfer routines can be used in cases where annotation does not suffice. 
 + 
 +TODO 
 + 
 +There is currently one example case where a custom state transfer routine is used, namely for the ''​dsi_u''​ union in the ''​struct data_store''​ structure which is used by the Data Store (DS) service and defined in ''​minix/​servers/​ds/​store.h''​. The custom state transfer routine is located in ''​minix/​lib/​libmagicrt/​magic_ds.c'',​ and provides the state transfer process with information as to which of the union'​s fields should be transferred. 
 + 
 +=== Preventing state transfer issues === 
 + 
 +In some cases, small adjustments must be made to a service in order to prevent issues with state transfer. These types of issues will not result in failure of the state transfer procedure; instead, they may result in a crash of the new instance after a seemingly successful live update. We cover three topics: memory grants, userspace threads, and physically unmovable regions. 
 + 
 +== Memory grants == 
 + 
 +One potential issue concerns memory grants. Each service has a memory grant table, which is an array of all the memory grants that allow other processes to read and/or write the service'​s memory. If the service has any grants active at the time of a live update, the grants should in theory be adjusted in accordance with any relocation of the memory pointed to by the grants. 
 + 
 +However, the main union of the grant structure (''​cp_grant_t'',​ defined in ''​minix/​include/​minix/​safecopies.h''​) is currently marked as //ixfer//, meaning it will be transferred as is. This is not a problem for grants that point to memory //outside// the process being updated, and that means that indirect and magic grants pose no problem for state transfer. It is however a problem for grants that point to memory //inside// the process being updated, that is, for **direct grants**. 
 + 
 +For this reason, for a service that may potentially have direct grants active at the time of the live update, its writer has two options: 1) implement a custom state transfer routine for the ''​cp_grant_t''​ structure in libmagicrt, thus resolving the problem described in this entire section altogether, or 2) prevent live updates of the service whenever the service has active memory grants. The first option is preferred. In any case, the potential consequence of doing neither is that the service ends up suffering from arbitrary memory corruption after a live update, since the transferred direct grant will point to the wrong memory location. 
 + 
 +The live update system itself actually relies on the presence of a long-running direct grant, which gives the process access to its own full address space. The new instance uses this grant during a live update to access the memory contents of the old instance. Since the grant provides access to the process's entire address space, it does not suffer from the problem above.
 + 
 +== Userspace threads == 
 + 
 +Userspace threads pose a problem for state transfer as well. We have previously explained that the process stack of the old instance can be disregarded by the state transfer procedure because it is "​naturally"​ recreated in the new instance. The same does not apply to the stacks of userspace threads, since stack variables are not tracked at run time: even though the threads'​ stacks are transferred to the new instance by the magic framework, they are seen as blobs of (typically) memory-mapped character arrays. The result is that any pointers on these stacks will not be known to libmagicrt and thus not be transferred properly. In addition, thread context (CPU register) state will typically be stored as an array of integers, and similarly end up being skipped by the state transfer procedure. The result is that while state transfer may (appear to) succeed, the service will crash after completion of the live update. 
 + 
 +At this time, the recommended solution is for the service to shut down all threads explicitly before starting state transfer, and to recreate the threads both after successful live update and as part of a rollback. The service may refuse to be updated if any of its threads are in use and cannot be shut down. The last point requires that the service supply a custom callback routine to SEF to perform that check for a quiescence state other than the default, through sef_setcb_lu_prepare(3). In order to allow the use of a nondefault state, a sef_setcb_lu_state_isvalid(3) callback routine must be supplied as well. For VFS and libblockdriver,​ we have chosen the following approach: 
 + 
 +  * Before SEF startup, the call ''​sef_setcb_lu_state_isvalid(sef_cb_lu_state_isvalid_standard)''​ is used to mark all standard quiescence states as valid, including ''​SEF_LU_STATE_REQUEST_FREE''​ and ''​SEF_LU_STATE_PROTOCOL_FREE''​. 
 +  * At the same time, a custom callback function is set using sef_setcb_lu_prepare(3). When SEF calls this function with either the ''​SEF_LU_STATE_REQUEST_FREE''​ or the ''​SEF_LU_STATE_PROTOCOL_FREE''​ state, the function will first check whether all worker threads are idle. If they are not, it will return a failure, aborting the live update. If they are, it will shut them down and report success. 
 +  * Similarly, a custom callback function is set using sef_setcb_lu_state_changed(3). When SEF calls this function with the //old// state being either ''​SEF_LU_STATE_REQUEST_FREE''​ or ''​SEF_LU_STATE_PROTOCOL_FREE'',​ the function recreates the worker threads. This ensures that worker threads are recreated in the old instance upon a rollback. 
 +  * Finally, the service supplies its own state transfer hook using sef_setcb_init_lu(3). This function will first call the normal state transfer function using SEF_CB_INIT_LU_DEFAULT(),​ returning an error if state transfer failed. If state transfer succeeded however, and the preparation state given in ''​info->​prepare_state''​ was either ''​SEF_LU_STATE_REQUEST_FREE''​ or ''​SEF_LU_STATE_PROTOCOL_FREE'',​ then it continues by recreating the worker threads. This ensures that the new instance has worker threads before it leaves the state transfer phase. 
 + 
 +The result of this approach is that updates must be invoked with ''​-state 2''​ (//request free//) or ''​-state 3''​ (//protocol free//) in order to guarantee proper state transfer. As a sidenote, none of these issues are a problem for identity transfer, which should continue to work even with ''​-state 1''​ (//work free//, the default). 
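
The decision logic of the three callbacks can be modeled in plain C as follows. This is a stand-alone sketch and not the VFS or libblockdriver code: the functions do not use the real SEF types or signatures, all names starting with ''model_'' or ''MODEL_'' are hypothetical, and the numeric state values are taken from the ''-state'' numbers mentioned above. It also assumes that the prepare callback simply reports success for any other state.

```c
#include <assert.h>

/* Local stand-ins for the quiescence states, numbered as for '-state'. */
enum {
    MODEL_STATE_WORK_FREE = 1,
    MODEL_STATE_REQUEST_FREE = 2,
    MODEL_STATE_PROTOCOL_FREE = 3
};

#define MODEL_OK        0
#define MODEL_NOT_READY (-1)
#define MODEL_NR_WORKERS 4

static int worker_busy[MODEL_NR_WORKERS];   /* nonzero: worker is in use */
static int workers_running = 1;             /* do worker threads exist? */

/* Only these two states imply that the threads are shut down and recreated. */
static int
model_state_stops_threads(int state)
{
    return state == MODEL_STATE_REQUEST_FREE ||
        state == MODEL_STATE_PROTOCOL_FREE;
}

/* Prepare callback: refuse the update unless all workers are idle. */
static int
model_lu_prepare(int state)
{
    int i;

    if (!model_state_stops_threads(state))
        return MODEL_OK;        /* assumption: nothing to do otherwise */
    for (i = 0; i < MODEL_NR_WORKERS; i++) {
        if (worker_busy[i])
            return MODEL_NOT_READY;
    }
    workers_running = 0;        /* shut down all worker threads */
    return MODEL_OK;
}

/* State-changed callback: recreate the workers when leaving such a state,
 * which covers a rollback to the old instance. */
static void
model_lu_state_changed(int old_state, int new_state)
{
    (void)new_state;
    if (model_state_stops_threads(old_state))
        workers_running = 1;
}

/* Init hook of the new instance: recreate workers after state transfer. */
static int
model_init_lu(int prepare_state, int transfer_result)
{
    if (transfer_result != MODEL_OK)
        return transfer_result; /* state transfer failed, report it */
    if (model_state_stops_threads(prepare_state))
        workers_running = 1;
    assert(workers_running || !model_state_stops_threads(prepare_state));
    return MODEL_OK;
}
```

In this model, the old instance never enters state transfer with live worker threads, and both the rollback path and the new instance end up with a fresh set of workers.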
 + 
 +== Physically unmovable regions == 
 + 
 +Another case where the programmer may have to ensure that state transfer does not result in problems that will surface only after the live update, is when a service uses memory areas that are physically unmovable. Such memory areas are typically in use for DMA purposes. If the state transfer procedure changes the physical location of the buffers, DMA may be performed from or to the original physical location, resulting in garbage and possibly arbitrary memory corruption. Such DMA areas must be marked as special out-of-band memory in libmagicrt, and unmarked when freed, using the sef_llvm_add_special_mem_region(3) and sef_llvm_del_special_mem_region(3) SEF calls. This is done automatically by the alloc_contig(3) and free_contig(3) wrapper routines, but must be done explicitly for memory allocated in different ways. 
 + 
 +However, this is only necessary if DMA can happen across a live update. In cases where it is known that no DMA can possibly be ongoing during the live update, the regions are not actually physically unmovable, and therefore need not be marked as such. For example, this is the case for the file system buffer cache implemented in libminixfs. This library allocates and manages buffers without using physically contiguous memory and alloc_contig(3),​ instead using mmap(2) directly and requesting DMA I/O in page-sized chunks (in order to avoid DMA issues on ARM). Therefore, it would be affected by the above problem, were it not for the fact that all its block I/O calls are synchronous. Any future introduction of more asynchrony will turn this situation into a real problem for live update, though. 
 + 
 +As we mentioned before, memory-mapped I/O poses a similar problem. However, the only way to map such I/O memory is currently through the vm_map_phys(2) and vm_unmap_phys(2) calls, of which the libsys wrappers automatically call the special-memory marking/​unmarking functions as well. 
 + 
 +=== Resolving state transfer errors === 
 + 
 +If the magic state transfer procedure encounters problems, it will report failure, with details written to the system log as entries with an ''[ERROR]'' prefix. In this section, we cover a number of common reasons for state transfer to fail in practice, including some example errors and workarounds.
 + 
 +== Dangling pointers == 
 + 
 +In order to know how to transfer a piece of memory, the magic runtime library must know about the data type associated with that piece of memory. If no type information is known for a piece of memory, it cannot be transferred. There are various reasons why libmagicrt might not have type information about a piece of memory. The simplest one is a case of a **dangling pointer**: a pointer that used to be valid at some point, but no longer is, because the memory pointed to has been deallocated. While the actual program may know not to use that particular pointer anymore, the state transfer routine does not have such knowledge. A typical error resulting from a dangling pointer may look like this, with some important parts of the output highlighted:
 + 
 +  * **[ERROR]** uncaught ptr with violations. Current state element: 
 +  * SELEMENT: ''<​nowiki>​(parent=sbuf.1900354961,​ num=1, depth=0, address=0xdfb760a8,​ **name**=**sbuf**.1900354961,​ type=TYPE: (id=53 ​  , name=, size=4, num_child_types=1,​ type_id=10, bit_width=0,​ flags(ERDIVvUP)=01010000,​ values=%%''​%%,​ type_str=i8/​**char%%*%%**))</​nowiki>''​ 
 +  * SEL_ANALYZED:​ ''<​nowiki>​(num=1,​ type=ptr, flags(DIVW)=1110,​ **value**=**0x080cb49f**,​ trg_name=, trg_offset=0,​ trg_flags(RL)=D0,​ trg_selements=(#​1|0:​ 1|p=SELEMENT:​ (parent=???,​ num=0, depth=0, address=0x00000000,​ name=???, type=TYPE: (id=0    , name=**UNKNOWN_TYPE**,​ size=0, num_child_types=0,​ type_id=4, bit_width=0,​ flags(ERDIVvUP)=10000000,​ values=%%''​%%,​ type_str=UNKNOWN_TYPE/​UNKNOWN_TYPE))))</​nowiki>''​ 
 +  * SEL_STATS: ''<​nowiki>​(type=ptr,​ trg_flags(RL)=D0,​ ptr_found=1,​ **unknown_found=1**,​ violations=1)</​nowiki>''​ 
 + 
 +In this case, the global variable **sbuf** (suffixed with a tag to make its name unique) is a char* pointer to location 0x080cb49f. Since the magic runtime library knows no type information about this target (//trg//) memory location, it marks the location with the placeholder type UNKNOWN_TYPE and aborts state transfer because an unknown type was found. Another example: 
 + 
 +  * **[ERROR]** uncaught ptr with violations. Current state element: 
 +  * SELEMENT: ''<​nowiki>​(parent=inode.3951291702,​ num=80, depth=2, address=0xdfbe3210,​ **name**=**inode.3951291702/​4/​i_data**,​ type=TYPE: (id=61 ​  , name=, size=4, num_child_types=1,​ type_id=10, bit_width=0,​ flags(ERDIVvUP)=01010000,​ values=%%''​%%,​ type_str=i8/​**char%%*%%**))</​nowiki>''​ 
 +  * SEL_ANALYZED:​ ''<​nowiki>​(num=80,​ type=ptr, flags(DIVW)=1110,​ **value**=**0x08108098**,​ trg_name=, trg_offset=0,​ trg_flags(RL)=H0,​ trg_selements=(#​1|0:​ 1|p=SELEMENT:​ (parent=???,​ num=0, depth=0, address=0x00000000,​ name=???, type=TYPE: (id=0    , name=**UNKNOWN_TYPE**,​ size=0, num_child_types=0,​ type_id=4, bit_width=0,​ flags(ERDIVvUP)=10000000,​ values=%%''​%%,​ type_str=UNKNOWN_TYPE/​UNKNOWN_TYPE))))</​nowiki>''​ 
 +  * SEL_STATS: ''<​nowiki>​(type=ptr,​ trg_flags(RL)=H0,​ ptr_found=1,​ **unknown_found=1**,​ violations=1)</​nowiki>''​ 
 + 
 +In this case, the **i_data** field of the fifth element (**/4/**) of the global **inode** structure, also a char* pointer, is pointing to address 0x08108098 which is unknown to libmagicrt. The pointer address typically allows one to determine what kind of memory it is, by means of the memory sections of the process. In this particular example, the address was somewhat higher than the service'​s data end, thus suggesting the memory pointed to is heap memory. This matched with the source code of the service (PFS, the Pipe File Server), which dynamically allocates and frees the i_data buffers using malloc(3) and free(3). 
 + 
 +It is up to the programmer of the service to ensure that the state transfer routine will not attempt to transfer a dangling pointer. This can be as simple as zeroing out the pointer after use, which is usually good practice anyway: 
 + 
 +  free(rip->​i_data);​ 
 +  rip->​i_data = NULL; 
 + 
 +That is the solution that we applied in both cases. 
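One way to make this habit harder to forget is to wrap the two statements in a small helper. The following is a minimal sketch in plain C; the ''inode_sketch'' structure and the ''release_data()'' helper are hypothetical names used for illustration and are not part of PFS.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a service structure that owns a heap buffer. */
struct inode_sketch {
    char *i_data;  /* dynamically allocated buffer, or NULL if none */
};

/* Free the buffer and clear the pointer, so that a later state transfer
 * never encounters a pointer to memory that has already been released. */
static void
release_data(struct inode_sketch *rip)
{
    assert(rip != NULL);
    free(rip->i_data);   /* free(NULL) is a no-op, so repeated calls are safe */
    rip->i_data = NULL;
}
```

Because the pointer is reset on every release, calling the helper twice on the same structure is harmless.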
 + 
 +== External pointers == 
 + 
 +A similar problem occurs when a process has a pointer that is only valid in the address space of another process, or possibly the kernel. Unlike dangling pointers, such external pointers are never valid, and thus do not need to be transferred as pointers. The magic framework must be instructed to that end, for example using a //noxfer// annotation. However, external pointers often end up in the local address space as a result of copying in entire structures at once (we already gave process tables as an example), in which case it may be necessary to use //ixfer// rather than //noxfer//. For example, the ProcFS (/proc File System) service has several instances of the following construction: 
 + 
 +  typedef struct mproc ixfer_mproc_t;​ 
 +  static ixfer_mproc_t mproc; 
 + 
 +In some cases, it may make more sense to zero out pointers instead. In other cases, we have changed code to retrieve not entire kernel tables but only specific values, or to use the kernel-mapped pages instead of copies of kernel structures to retrieve values. The magic runtime library already ignores pointers into kernel space (that is, 0xf0000000 and higher) altogether. 
 + 
 +Theoretically it is possible that remote pointers end up being valid in the local address space by sheer luck. In known cases of copying in external pointers, it is best not to rely on failures in the magic framework, but rather to annotate the code proactively. 
 + 
 +== Weak symbols == 
 + 
 +If a service uses weak symbols, the code and data pointed to by these weak symbols may not be included in the linked service object at the time that the instrumentation passes are run. These weak symbols will be resolved and included only after the instrumentation stage. As a consequence, some of the code and data that are part of the service will not have been analyzed by the magic pass. This can cause a range of state transfer failures, including cases where pointers end up pointing to unknown static memory and cases where memory allocation is not properly instrumented, ultimately leading to pointers to unknown dynamic memory. 
 + 
 +The following example is from the DS service, where its use of the weak aliases for regcomp(3) and regfree(3) resulted in regcomp'​s malloc(3) calls not being instrumented:​ 
 + 
 +  * **[ERROR]** uncaught ptr with violations. Current state element: 
 +  * SELEMENT: ''<​nowiki>​(parent=ds_subs.1944246923,​ num=9, depth=3, address=0xdfbe6108,​ name=**ds_subs.1944246923/​0/​regex/​re_g**,​ type=TYPE: (id=18 ​  , name=, size=4, num_child_types=1,​ type_id=10, bit_width=0,​ flags(ERDIVvUP)=00000000,​ values=%%''​%%,​ type_str=opaque*))</​nowiki>''​ 
 +  * SEL_ANALYZED:​ ''<​nowiki>​(num=9,​ type=ptr, flags(DIVW)=1110,​ **value**=**0x08111000**,​ trg_name=, trg_offset=0,​ trg_flags(RL)=,​ trg_selements=(#​1|0:​ 1|p=SELEMENT:​ (parent=???,​ num=0, depth=0, address=0x00000000,​ name=???, type=TYPE: (id=0    , name=**UNKNOWN_TYPE**,​ size=0, num_child_types=0,​ type_id=4, bit_width=0,​ flags(ERDIVvUP)=10000000,​ values=%%''​%%,​ type_str=UNKNOWN_TYPE/​UNKNOWN_TYPE))))</​nowiki>''​ 
 +  * SEL_STATS: ''<​nowiki>​(type=ptr,​ ptr_found=1,​ **unknown_found=1**,​ violations=1)</​nowiki>''​ 
 + 
 +In this case, the pointer **ds_subs[0].regex.re_g** ended up pointing to the unknown heap-section value of 0x08111000. We worked around this issue by forcing DS to use the targets of the weak aliases, _regcomp and _regfree, rather than their original names, using Makefile hacks. 
 + 
 +== Code used only in libmagicrt == 
 + 
 +If the magic runtime library itself uses other library modules, for example from libc, and these modules are not already used by the service itself anyway, then the bitcode linker may not include them in the linked object on which the instrumentation passes are run. Again, this may result in various failures, and unknown pointers in particular:​ 
 + 
 +  * **[ERROR]** uncaught ptr with violations. Current state element: 
 +  * SELEMENT: ''<​nowiki>​(parent=_ctype_tab_,​ num=1, depth=0, address=0xdfb760a8,​ **name**=**_ctype_tab_**,​ type=TYPE: (id=204 ​ , name=, size=4, num_child_types=1,​ type_id=10, bit_width=0,​ flags(ERDIVvUP)=11000000,​ values=%%''​%%,​ type_str=i16/​unsigned short*))</​nowiki>''​ 
 +  * SEL_ANALYZED:​ ''<​nowiki>​(num=1,​ type=ptr, flags(DIVW)=1110,​ **value**=**0x0809ccb6**,​ trg_name=, trg_offset=0,​ trg_flags(RL)=,​ trg_selements=(#​1|0:​ 1|p=SELEMENT:​ (parent=???,​ num=0, depth=0, address=0x00000000,​ name=???, type=TYPE: (id=0    , name=**UNKNOWN_TYPE**,​ size=0, num_child_types=0,​ type_id=4, bit_width=0,​ flags(ERDIVvUP)=10000000,​ values=%%''​%%,​ type_str=UNKNOWN_TYPE/​UNKNOWN_TYPE))))</​nowiki>''​ 
 +  * SEL_STATS: ''<​nowiki>​(type=ptr,​ ptr_found=1,​ **unknown_found=1**,​ violations=1)</​nowiki>''​ 
 + 
 +In this particular failure case, the global **_ctype_tab_** variable pointed into another global variable, at location 0x0809ccb6 in the data section. The other global variable was invisible to the magic pass, so no **sentry** object could be created for it. As a result, libmagicrt did not know about the target of the pointer. The _ctype_tab_ variable itself was used by the ''<ctype.h>'' isalpha(3) (etc.) set of macros from within libmagicrt. We worked around this issue by putting our own replacement set of macros in libmagicrt instead. 
 + 
 +== Assembly code == 
 + 
 +Yet another case that leads to invisibility of certain aspects is the direct inclusion of assembly code. Assembly code is machine code, not bitcode, and thus, the bitcode instrumentation passes will have problems processing it. Needless to say, the use of assembly code should be minimal throughout the source code. In cases where it cannot be avoided, custom solutions have to be found for any resulting state transfer problems. Fortunately, much of the assembly in use by services these days is the result of optimized str*(3) and mem*(3) functions, which require no special treatment for the purpose of state transfer. 
 + 
 +== Incompatible types == 
 + 
 +Finally, we describe one class of state transfer failures which are the result of shortcomings in the magic instrumentation framework itself. LLVM bitcode has the notion of an **opaque** data type. The opaque data type is used for data of which the type has been declared but not defined, typically as a result of forward declarations of structures (''​struct foo;''​). Instead of resolving these types after they have been instantiated,​ LLVM tends to cast between various data types which are identical except for the presence of opaque pointers. As a result, opaque pointers may show up in various places in linked bitcode. 
 + 
 +The magic pass should mark all these practically identical data types as //​compatible types//. However, due to the fact that the casts can take rather complex forms, this is not always happening. The result is that in some cases, state transfer may fail because libmagicrt erroneously detects an incompatibility between a pointer type and the type of data being pointed to. As an example, the following state transfer error was reported during state transfer of the PM service: 
 + 
 +  * **[ERROR]** uncaught ptr with violations. Current state element: 
 +  * SELEMENT: ''<​nowiki>​(parent=timers.515278380,​ num=1, depth=0, address=0xdfb760a8,​ **name**=**timers**.515278380,​ type=TYPE: (id=96 ​  , name=, size=4, num_child_types=1,​ type_id=10, bit_width=0,​ flags(ERDIVvUP)=01000000,​ values=%%''​%%,​ **type_str**={ $minix_timer tmr_next \2, tmr_exp_time i32/long unsigned int, **tmr_func opaque%%*%%**,​ tmr_arg { (U) $ixfer_tmr_arg_t ta_int i32/int } }*))</​nowiki>''​ 
 +  * SEL_ANALYZED:​ ''<​nowiki>​(num=1,​ type=ptr, flags(DIVW)=1110,​ value=0x08147460,​ trg_name=mproc,​ trg_offset=274616,​ trg_flags(RL)=D0,​ trg_selements=(**#​2**|0:​ **1**|o=SELEMENT:​ (parent=mproc,​ num=0, depth=0, address=0x08147460,​ name=**mproc/​143/​mp_timer**,​ type=TYPE: (id=38 ​  , name=minix_timer,​ size=16, num_child_types=4,​ type_id=9, bit_width=0,​ flags(ERDIVvUP)=00000000,​ values=%%''​%%,​ names='​minix_timer_t|minix_timer',​ **type_str**={ $minix_timer tmr_next { $minix_timer tmr_next \2, tmr_exp_time i32/long unsigned int, tmr_func hash_3792421438/​*,​ tmr_arg { (U) $ixfer_tmr_arg_t ta_int i32/int } }*, tmr_exp_time i32/long unsigned int, **tmr_func hash_3792421438/​%%*%%**,​ tmr_arg { (U) $ixfer_tmr_arg_t ta_int i32/int } })), **2**|o=SELEMENT:​ (parent=mproc,​ num=0, depth=0, address=0x08147460,​ name=mproc/​143/​mp_timer/​tmr_next,​ type=TYPE: (id=37 ​  , name=, size=4, num_child_types=1,​ type_id=10, bit_width=0,​ flags(ERDIVvUP)=00000000,​ values=%%''​%%,​ **type_str**={ $minix_timer tmr_next \2, tmr_exp_time i32/long unsigned int, **tmr_func hash_3792421438/​%%*%%**,​ tmr_arg { (U) $ixfer_tmr_arg_t ta_int i32/int } }*))))</​nowiki>''​ 
 +  * SEL_STATS: ''<​nowiki>​(type=ptr,​ trg_flags(RL)=D0,​ ptr_found=1,​ **other_types_found=1**,​ violations=1)</​nowiki>''​ 
 + 
 +In this case, the analysis failed on the global **timers** variable. The analysis dump shows that two matching types (**#2**) were found, both associated with the **mproc[143].mp_timer** structure field, but neither type matched the type of the pointer. A closer look at the textual representations of the pointer type (the **type_str** of the primary //​selement//​) and of the data types (the //​type_str//​ of the target //​selement//​s) reveals that there is only one difference between the two: the **tmr_func** field of the structure type to which the //timers// variable should point is an **opaque** pointer, whereas the same //​tmr_func//​ field of the target structures is a particular function pointer (to a function referred to as **hash_3792421438**). The remainder of the types are the same. 
 + 
 +The type strings are somewhat difficult to read. The asterisk at the end of a { structure } block indicates a pointer to this structure. In this case, the //timers// variable is a pointer to a **minix_timer_t** structure. The **\n** notation indicates type recursion of the type **n** levels up. In the type string of //timers//, the **\2** after the **tmr_next** field indicates that it is again a pointer to //minix_timer_t//: one type level up is the structure itself, two type levels up is the pointer to the //minix_timer_t// structure. In this case there is no third level up, but in other cases a third level could be, for example, a pointer to a pointer to a structure. Although irrelevant in this case, the name of each structure is prefixed with a dollar sign, and **(U)** denotes a union. 
 + 
 +In this case, the analysis failed because it found different, incompatible, and therefore **other** types, even though the opaque pointer and the function pointer were really the same field types. A look at the corresponding declarations in ''minix/include/minix/timers.h'' shows that there is indeed a forward declaration of ''struct minix_timer'', which is the cause of LLVM's link-time introduction of casts. We resolved this case by extending the casting analysis of the magic pass to include casts of structures through function prototypes. 
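To make the role of the forward declaration concrete, the following C sketch reconstructs the shape of the timer structure from the type strings in the dump above. It is not a verbatim copy of ''timers.h'': the callback signature (a function taking a pointer to the timer) is an assumption based on the opaque pointer in the dump, the union lists only the one member visible there, and ''mark_expired()'' is a hypothetical callback added for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Forward declaration: at this point the structure is declared but not
 * defined, which is what LLVM represents as an opaque type. */
struct minix_timer;

/* Assumed callback type; it refers to the still-incomplete structure. */
typedef void (*tmr_func_t)(struct minix_timer *tp);

/* Only the union member that appears in the dump is reproduced here. */
typedef union {
    int ta_int;
} ixfer_tmr_arg_t;

/* Field names and order follow the type string in the dump. */
typedef struct minix_timer {
    struct minix_timer *tmr_next;      /* next timer in the chain */
    unsigned long       tmr_exp_time;  /* expiration time */
    tmr_func_t          tmr_func;      /* function to call on expiry */
    ixfer_tmr_arg_t     tmr_arg;       /* argument for the function */
} minix_timer_t;

/* Hypothetical callback, used only to exercise the function pointer. */
static void
mark_expired(struct minix_timer *tp)
{
    assert(tp != NULL);
    tp->tmr_arg.ta_int = 1;
}
```

In source code, the //tmr_func// field has a single type everywhere. The mismatch only arises in the bitcode, where one module may see the structure behind the function pointer as opaque while another sees the full definition.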
 + 
 +The following example also resulted from the same forward declaration of MINIX3 timer structures, this time in the sched (scheduling) service: 
 + 
 +  * **[ERROR]** uncaught ptr with violations. Current state element: 
 +  * SELEMENT: ''<​nowiki>​(parent=sched_timer.29458437,​ num=4, depth=1, address=0xdfbe70b0,​ **name**=**sched_timer.29458437/​tmr_func**,​ type=TYPE: (id=17 ​  , name=tmr_func_t,​ size=4, num_child_types=1,​ type_id=10, bit_width=0,​ flags(ERDIVvUP)=00000000,​ values=%%''​%%,​ type_str=**opaque%%*%%**))</​nowiki>''​ 
 +  * SEL_ANALYZED:​ ''<​nowiki>​(num=4,​ type=ptr, flags(DIVW)=1110,​ value=0x08048dc0,​ trg_name=**balance_queues**.29458437,​ trg_offset=0,​ trg_flags(RL)=T0,​ trg_selements=(#​1|0:​ 1|o=SELEMENT:​ (parent=???,​ num=0, depth=0, address=0x08048dc0,​ name=???, type=TYPE: (id=119 ​ , name=, size=1, num_child_types=0,​ type_id=4, bit_width=0,​ flags(ERDIVvUP)=11000001,​ values=%%''​%%,​ type_str=**hash_3792445575/​**))))</​nowiki>''​ 
 +  * SEL_STATS: ''<​nowiki>​(type=ptr,​ trg_flags(RL)=T0,​ ptr_found=1,​ **other_types_found=1**,​ violations=1)</​nowiki>''​ 
 + 
 +In this case, the type mismatch was not between two structures that differed in opaque fields, but between two function pointers themselves: the function pointer in **sched_timer.tmr_func**,​ and the function it is pointing to, **balance_queues**. Registering these types as compatible would result in much more complexity in the magic pass, and likely still not resolve the more general problem of opaque pointers. This is currently one of the open issues, and we believe that another approach would be more viable; see below. In this particular case, it turned out that the sched service did not need to use timers at all, and we simplified it by getting rid of its use of timers altogether. Obviously, adapting the actual functionality of a service to allow for state transfer is not always an option, nor is it generally the right approach: the core code of system services should not have to be (re)written specifically to allow for state transfer. 
 + 
 +===== Open issues ===== 
 + 
 +In this section, we describe what we believe are currently the main open issues related to live update. For most issues, no active development is ongoing. We therefore invite any interested parties to work on resolving these issues, and welcome both inquiries and updates regarding the current status on both the [[https://​groups.google.com/​forum/#​!forum/​minix3|MINIX3 newsgroup]] and [[info@minix3.org|info@minix3.org]]. 
 + 
 +This section is roughly sorted by order of importance, starting with the most important issues. 
 + 
 +==== The build system ==== 
 + 
 +As shown in the setup part of the users guide, the live update functionality requires that a separate instance of the LLVM toolchain be built. Unlike the standard toolchain, this separate instance is suitable for Link-Time Optimization (LTO). It is built by ''minix/llvm/generate_gold_plugin.sh'', and placed in ''obj_llvm.i386''. The exact same LLVM 3.6.1 source code is used to compile both the LTO-enabled toolchain and the additional regular crosscompilation toolchain in ''obj.i386'', using the exact same configuration flags. The separate compilation is necessary only because of a problem with makefiles. 
 + 
 +NetBSD uses its own set of makefiles to build imported code using its own build system. MINIX3 imports this system, and thus also uses the NetBSD set of makefiles to build the LLVM toolchain. The problem is that these makefiles do not operate in the same way as LLVM's own set of makefiles, resulting in certain parts of the LLVM toolchain being built in a different way. The separate LLVM LTO toolchain build does use LLVM's own makefiles, thereby generating some missing pieces that are required for the live update instrumentation. 
 + 
 +The solution here is to adapt the NetBSD set of makefiles to build LLVM in a way that is closer to LLVM's own makefiles, thereby generating all the necessary parts of the toolchain without the need to build LLVM twice. 
 + 
 +As part of this, the generated instrumentation passes should not be placed in the ''​minix/​llvm/​bin''​ subdirectory of the source MINIX3 tree. Instead, they should end up in an appropriate subdirectory of ''​obj.i386'',​ thereby keeping the source directory clean. 
 + 
 +==== Instrumentation ==== 
 + 
 +A number of shortcomings in our instrumentation passes currently lead to potential problems at run time. 
 + 
 +=== Type unification === 
 + 
 +As shown in the developers guide, the magic instrumentation pass is not always capable of establishing that two different data types are in fact compatible, resulting in state transfer errors at run time. The main cause of these issues lies in LLVM's use of the **opaque** placeholder data type. We described the practical results of this in the earlier "​Incompatible types" section. 
 + 
 +This problem is a product of circumstances. Between LLVM 2.x and LLVM 3.x, a significant change was made in LLVM regarding the way that types are handled. In a nutshell, rather than unifying various instances of the same data type at compile time, LLVM 3.x keeps these instances as separate types, instead using bit casting between the types to resolve the resulting incompatibilities at link time. More details about this change can be found in the LLVM blog post  
 +[[http://​blog.llvm.org/​2011/​11/​llvm-30-type-system-rewrite.html|LLVM 3.0 Type System Rewrite]] by Chris Lattner. 
 + 
 +However, the magic framework was written for LLVM 2.x, and as a result, this problem was dealt with as an afterthought. The combination of the wildly varying forms that these bit casts can take, and the limited support for processing the bit casts in the magic pass, has created the situation that not all cases of identical types are properly registered as //​compatible types//. As of writing, this has not yet been a real problem, but it is likely to become a problem in the future. 
 + 
 +We believe that the right solution would be the introduction of a new **type unification pass**. This pass would unify all effectively-identical types in the module at link time, eliminating redundant types and bitcasts in the module. The pass could then be run before the magic pass. This would not only resolve the complete problem, but also free the magic pass of the burden to provide a complete system for enumerating compatible types. As a beneficial side effect, there would be a reduction in the amount of type state that needs to be included with the service, and a reduction in effort needed by libmagicrt to search through compatible types. 
 + 
 +=== ASR skipping libmagicrt === 
 + 
 +The ASR pass currently exempts all of the magic runtime library from rerandomization. This is highly problematic for the overall effectiveness of ASR: libmagicrt is in principle linked with all system services, thus providing any attacker with a well known, large, unrandomized set of code and data for use in an attack on any running service. 
 + 
 +The exact reasons as to why this exception was made are currently unknown. However, if possible, this overall limitation should be resolved by either removing the exception or at least narrowing it to the exact scope of the problem. 
 + 
 +==== Memory management ==== 
 + 
 +The MINIX3 memory management, implemented in the VM service, currently has a number of significant limitations and missing features. Some of its problems are relevant for live update only. Other problems are merely becoming more visible as a result of enabling or using live update functionality. 
 + 
 +=== Region transfer issues === 
 + 
 +As we already mentioned earlier, the transfer of memory-mapped pages requires that these pages be in a strictly delineated address range. This range may not overlap with any of the process'​s other sections'​ address ranges. The range is hardcoded globally, and thus, defined much more strictly than necessary for most service processes. Moreover, the definition indiscriminately affects all processes, including application processes. The result is that if the system is built with live update support, all processes are severely restricted in how much of their address space they can use for memory-mapped regions. Conversely, if the system is not built with live update support, even identity transfer may fail. 
 + 
 +Another problem mentioned before is the bulk transfer of all pages in the process's mmap section, regardless of whether the state transfer framework knows about them. This could easily lead to memory leaks due to transfer of untracked pages. 
 + 
 +We believe that both points could be resolved with a system that does not automatically transfer memory-mapped pages from the old to the new instance, but rather performs such transfer on demand, so that the (identity or magic) state transfer routine can determine exactly what memory to transfer. 
 + 
 +=== Out-of-memory issues === 
 + 
 +MINIX3 currently does not deal well with running out of memory. Most system services do not have preallocated pages in their heap, stack, and mmap sections. This may create major issues in low-memory situations. For example, if a service attempts to use an extra page of stack while the system has no free memory, the service will be killed, possibly taking down the entire system with it. Beyond VM freeing cached file system data when it runs out of memory, any sort of infrastructure to deal with this general problem is completely absent. 
 + 
 +The live update and rerandomization support is making this situation even more problematic. The magic runtime library uses extra dynamic memory, and is not particularly careful about using preallocated memory where necessary. The ASR functionality increases memory usage even further. For example, its stack padding feature requires a considerable amount of extra stack space. The result is that there is now an increasingly large number of scenarios where out-of-memory conditions result in failure of running system services, and possibly the entire system. 
 + 
 +Even though certain services should be rewritten to deal more gracefully with cases of dynamic memory allocation failure, the example of faulted-in stack pages illustrates that this is not a viable option in general. There has been a partial attempt to prepare file system service'​s buffer caches for having their memory stolen by VM at run time, but its implementation is, where present, deeply flawed, and will likely be removed altogether soon. Instead, we believe that the easiest solution for this problem is to let VM reserve a certain amount of memory exclusively for satisfying page faults and page-handling requests involving memory in system services. 
 + 
 +In the meantime, it can be expected that **test64** of the MINIX3 test set - the test case that tests one particular scenario of running out of memory - will cause test or system failure in an increasing number of cases. It may have to be removed from the default set of tests in the short term. 
 + 
 +=== Contiguous/​DMA memory === 
 + 
 +In addition, MINIX3 does not deal with running out of physically contiguous memory at all. Some services require blocks of physically contiguous memory for DMA purposes. VM currently has no way to recombine fragmented blocks of free memory into larger physically contiguous ranges. In addition, some services require memory that is located in the lower 1MB or 16MB of the physical system memory. The support in VM for obtaining memory in those ranges is very limited as well. Both of these cases may result in the inability for a system service to obtain its needed resources if it is not started immediately at system bootup time. 
 + 
 +These problems are not particularly important for live update, since the new instance will inherit special memory from its old instance by default. They are important for crash recovery, however, and they are known to cause failures in the ''testrelpol'' test set on occasion. 
 + 
 +=== Page protection === 
 + 
 +Finally, support for setting or enforcing page protection bits is mostly missing in VM as well. The live update integration has resulted in one particular case where this is now a problem. The MINIX3 userspace threading library, libmthread, inserts a guard page at the bottom of each thread stack in order to detect stack overruns. The guard page was originally created by unmapping the bottom page of the stack, thus leaving an unmapped hole there. This approach worked, but was not ideal: the hole could potentially be filled by a separate one-page allocation later, thereby subverting the intended protection. 
 + 
 +Since libmagicrt performs extra memory allocations,​ this problem is a bit more relevant for live update. For this and other reasons, the libmthread code was changed to reallocate the guard page with ''​PROT_NONE''​ protection instead. Theoretically,​ this should work fine. In practice, since VM does not implement support for protection, the guard page is now simply an additional stack page. Thus, as of writing, the libmthread guard page functionality is broken. 
 + 
 +Ideally, this issue would be resolved by implementing proper support for page protection in VM, including for example an implementation of mprotect(2). 
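For reference, the intended guard page behavior can be sketched on a POSIX host that does enforce page protection. This is not the libmthread code: the ''alloc_stack_with_guard()'' helper is a hypothetical name, and the sketch assumes that mmap(2) with ''MAP_ANONYMOUS'' and mprotect(2) are available.

```c
#define _DEFAULT_SOURCE  /* for MAP_ANONYMOUS on some C libraries */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Allocate a thread stack whose lowest page is inaccessible.
 * Returns the base of the mapping (the guard page itself), or NULL on
 * failure. The total mapping size is stored in *total_size. */
static void *
alloc_stack_with_guard(size_t usable_size, size_t *total_size)
{
    long page = sysconf(_SC_PAGESIZE);
    size_t pagesz, usable, total;
    void *base;

    assert(page > 0 && total_size != NULL);
    pagesz = (size_t)page;

    /* Round the usable size up to whole pages and add one guard page. */
    usable = ((usable_size + pagesz - 1) / pagesz) * pagesz;
    total = usable + pagesz;

    base = mmap(NULL, total, PROT_READ | PROT_WRITE,
        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED)
        return NULL;

    /* The guard page stays mapped, so no other allocation can fill the
     * hole, but any access to it faults. */
    if (mprotect(base, pagesz, PROT_NONE) != 0) {
        munmap(base, total);
        return NULL;
    }

    *total_size = total;
    return base;
}
```

The usable stack starts one page above the returned base. On a system where protection is not enforced, the same calls succeed but the guard page remains readable and writable, which is the situation described above.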
 + 
 +==== Runtime infrastructure ==== 
 + 
 +We now list a number of other issues concerning the MINIX3 runtime infrastructure side of live update. 
 + 
 +=== Default states === 
 + 
 +The case of userspace threads has shown that it may be not just useful, but actually //​necessary//​ for certain services to provide their own handlers for checking, entering, and leaving a custom state of quiescence. These services may crash if the default quiescence state is used for a live update instead of the custom state. The result is the requirement that not just users, but also scripts - the update_asr(8) script in particular - be aware of specific services requiring custom quiescence state. This is inconvenient and dangerous. 
 + 
 +The default quiescence state is currently hardcoded in the minix-service(8) utility, in the form of ''​DEFAULT_LU_STATE''​ in ''​minix/​commands/​minix-service/​minix-service.c''​. Instead, we believe that the service should be able to specify its own default quiescence state, possibly using an additional SEF API call. It is not yet clear whether RS would need to be aware of the alternative quiescence state. If not, the translation from a pseudo-state to the real state could take place entirely in the service'​s own SEF routines. Otherwise, the SEF may have to send the default state as extra data to RS at service initialization time. 
 + 
 +=== Policy redundancy === 
 + 
 +While the following issue is relevant more for crash recovery than for live update, it is included here because it affects the infrastructure supporting the ''​testrelpol''​ script. 
 + 
 +Each service effectively knows what its own crash recovery policy should be. Separately, procfs has a policy table with an entry for each service in ''​minix/​fs/​procfs/​service.c'',​ containing the same crash recovery policy information,​ for export to userland and ''​testrelpol''​ in particular. This is effectively redundant information. 
 + 
 +Ideally, each service would communicate its policy to RS. That information can then be used by procfs to expose the policy information to userland, thus eliminating the redundancy. 
 + 
 +=== Live update of VM === 
 + 
 +Earlier in this document, we have described the limitations of performing live updates on the VM service, as well as the reasons behind these limitations. Despite a large number of exceptions that allow VM to be updated at all, the resulting situation is that VM can still not be subjected to any meaningful type of update. 
 + 
 +It is unclear whether all these limitations are fundamental,​ however. We believe it may be possible to restructure the VM live update facilities to resolve at least some of the limitations. For example, it might be possible to store the pagetables in a separate memory section, and make actual copies of all or most other dynamic memory in VM. The out-of-band region could then be limited to the pagetable memory, thus allowing for relocation of at least static memory. Furthermore,​ more explicit rollback support in the old VM instance might even allow changes to VM's own pagetable, thereby possibly allowing dynamic memory allocation during the live update. It remains to be seen whether any of this is possible in practice. 
 + 
 +=== Timed retries of safecopies === 
 + 
 +If process A is being updated, process B should temporarily not make use of process A's grants, because during the live update, those grants may be inaccessible,​ invalid, etcetera. The kernel currently has a simple way to enforce that rule, by responding to process B's safecopy kernel call with an ''​ENOTREADY''​ error response whenever process A is being updated. The service-side libsys implementation of sys_*safecopy*(2) automatically suspends the calling service for a short while (using tickdelay(3)) and then retries the safecopy. This shortcut approach works, but it is not ideal: it should not be the responsibility of system services to determine when the safecopy can be retried again, and the approach could lead to starvation. 
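The retry approach described above can be sketched as follows. This is an illustration rather than the libsys implementation: ''ENOTREADY_SKETCH'' is a stand-in value for the real error code, the copy and delay operations are passed in as callbacks so that the fragment does not depend on MINIX headers, and the retry bound is our own addition to keep the sketch from looping forever.

```c
#include <assert.h>
#include <stddef.h>

#define ENOTREADY_SKETCH   (-1001)  /* hypothetical stand-in error code */
#define MAX_RETRIES_SKETCH 100      /* assumption: not in the original */

typedef int  (*copy_fn_t)(void *ctx);   /* one safecopy attempt */
typedef void (*delay_fn_t)(void *ctx);  /* short suspension of the caller */

/* Retry the copy for as long as the target reports that it is being
 * updated, sleeping briefly between attempts. */
static int
safecopy_with_retry(copy_fn_t try_copy, delay_fn_t short_delay, void *ctx)
{
    int r, attempts = 0;

    assert(try_copy != NULL && short_delay != NULL);

    while ((r = try_copy(ctx)) == ENOTREADY_SKETCH) {
        if (++attempts >= MAX_RETRIES_SKETCH)
            break;  /* give up and report the error to the caller */
        short_delay(ctx);
    }
    return r;
}

/* Mock target that is "being updated" for a number of attempts. */
struct mock_target {
    int busy_calls;  /* attempts that will still fail */
    int delays;      /* number of delays performed so far */
};

static int
mock_copy(void *ctx)
{
    struct mock_target *t = ctx;

    if (t->busy_calls > 0) {
        t->busy_calls--;
        return ENOTREADY_SKETCH;
    }
    return 0;
}

static void
mock_delay(void *ctx)
{
    struct mock_target *t = ctx;

    t->delays++;
}
```

The sketch also shows the weakness of the approach: the caller, not the kernel, decides when to try again, and nothing guarantees that an attempt will coincide with a moment at which the target is not being updated.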
 + 
 +Instead, the kernel should block the caller of a safecopy call for the duration of its target'​s live update procedure, retrying the safecopy operation and unblocking the caller only once the target is no longer being updated. A proper implementation of this functionality requires several cases to be covered: indirect grants, either the granter or the grantee being terminated or having its process slots swapped, etcetera. As a possible simplification,​ the kernel could internally retry the safecopy operation more often than necessary, since the caller would simply remain blocked if the retried safecopy operation hits a case of live update again. 
 + 
 +=== Copying asynsend tables === 
 + 
 +In a very specific scenario, the kernel performs a memory copy of the entire asynsend table between two processes of which the slots are being swapped. Although it is not yet clear which exact circumstances cause the need for this memory copy, the actual copy action relies on very specific conditions which are not fully validated before the copy action. Thus, this is a rather dangerous kernel feature. 
 + 
 +A rather long comment in ''​minix/​lib/​libsys/​sef_liveupdate.c''​ elaborates on the specifics of this case, and suggests why RS may be the only affected service. If the comment is correct, it may be possible to engineer another solution for RS in particular, and remove the copy hack from the kernel. 
 + 
 +==== Other issues ==== 
 + 
 +A number of miscellaneous issues remain. The first issue, regarding performance,​ is a relatively important issue. The other issues listed in this section are relatively minor. 
 + 
 +=== Performance === 
 + 
 +The performance of various parts of the live update infrastructure is not fantastic. This is true for both the instrumentation passes and, more importantly,​ the run-time functionality. As one of the effects, live update operations may have to be given a lenient timeout in order to succeed. In fact, state transfer currently takes too long to consider automatic runtime ASR rerandomization as a realistic option. 
 + 
 +We have not yet looked into the causes of the poor performance. Part of it may be due to the extra memory allocations performed by libmagicrt, but that is only a guess. This issue is therefore rather open ended. Statistical profiling may provide at least some hints. 
 + 
 +=== Grant table transfer === 
 + 
 +Currently, the safecopy memory grant tables of system services are transferred as is: the main union of the ''​cp_grant_t''​ structure as defined in ''​include/​minix/​safecopies.h''​ is marked as **ixfer**. 
 +In some scenarios, however, it is possible that during a service'​s live update, the service has grants allocated for remote services. For direct grants (of type ''​CPF_DIRECT''​),​ ''​cp_direct.cp_start''​ is actually a pointer into the local address space. The identity transfer therefore prevents this local pointer from being updated. Especially with ASR, there is a risk that after the live update, the grant points to arbitrary memory within the updated service. In the worst case, a remote user of the grant may end up overwriting this arbitrary memory in the updated service. 
 + 
 +To resolve this, the grant structure should not be using **ixfer** for its main union. This probably means that a custom state transfer routine for the grant structure must be written, so as to use a pointer transfer only for ''​CPF_DIRECT''​ grants. 
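A rough sketch of what such a custom transfer routine might do is given below. All names are hypothetical: ''grant_sketch'' is a simplified stand-in for ''cp_grant_t'', and the relocation is modeled as a single base offset, whereas real state transfer would map the pointer to its target object in the new instance.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-ins for the grant types discussed in this section. */
enum grant_type_sketch {
    GRANT_DIRECT_SKETCH,  /* start is a local address */
    GRANT_MAGIC_SKETCH,   /* start is an address in a remote process */
    GRANT_OTHER_SKETCH
};

struct grant_sketch {
    enum grant_type_sketch type;
    uintptr_t start;  /* start address of the granted memory */
    size_t len;       /* length of the granted memory */
};

/* Transfer one grant from the old instance to the new instance.
 * Only direct grants carry a local pointer that needs adjusting; all
 * other grants are copied as is, as an identity transfer would do. */
static void
transfer_grant(const struct grant_sketch *old_g, struct grant_sketch *new_g,
    uintptr_t old_base, uintptr_t new_base)
{
    assert(old_g != NULL && new_g != NULL);

    *new_g = *old_g;
    if (old_g->type == GRANT_DIRECT_SKETCH)
        new_g->start = old_g->start - old_base + new_base;
}
```

The point of the sketch is the type check: the decision whether to treat the field as a pointer depends on the grant type, which is why a plain annotation on the union is not sufficient.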
 + 
 +The same does //not// apply to magic grants (of type ''​CPF_MAGIC''​),​ as ''​cp_magic.cp_start''​ is an address in a remote process, which is either a userland process or a system process blocked on a call to VFS (as of writing, only VFS can use magic grants at all), and thus never subject to live update while the magic grant is active. 
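 + 
 +As a rough illustration of the selective transfer described above, the following self-contained sketch copies a grant as is by default and rewrites the start address only for direct grants. All names here (''sk_grant'', ''sk_region'', ''sk_transfer_grant'') are hypothetical, simplified stand-ins: the sketch does not use the real ''cp_grant_t'' layout or the libmagicrt state transfer API, and it models the old-to-new address mapping as a plain table.
 + 
```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical grant kinds; the real flags live in include/minix/safecopies.h. */
enum sk_grant_type { SK_GRANT_DIRECT, SK_GRANT_MAGIC };

/* Hypothetical, simplified grant: a type tag plus a per-type union. */
struct sk_grant {
    enum sk_grant_type type;
    union {
        struct { uintptr_t start; size_t len; } direct;          /* local address */
        struct { int from; uintptr_t start; size_t len; } magic; /* remote address */
    } u;
};

/* Hypothetical mapping of one memory object from the old to the new instance. */
struct sk_region {
    uintptr_t old_base;
    uintptr_t new_base;
    size_t size;
};

/* Translate [old, old + len) into the new address space; -1 if unmapped. */
static int sk_translate(const struct sk_region *map, size_t n,
    uintptr_t old, size_t len, uintptr_t *out)
{
    size_t i;

    for (i = 0; i < n; i++) {
        if (old < map[i].old_base)
            continue;
        uintptr_t off = old - map[i].old_base;
        if (off < map[i].size && len <= map[i].size - off) {
            *out = map[i].new_base + off;
            return 0;
        }
    }
    return -1;
}

/*
 * Transfer one grant. Everything is copied unchanged (identity transfer),
 * except that the local start address of a direct grant is translated
 * (pointer transfer). Returns 0 on success, -1 if the address is unmapped.
 */
int sk_transfer_grant(const struct sk_grant *src, struct sk_grant *dst,
    const struct sk_region *map, size_t n)
{
    assert(src != NULL && dst != NULL);

    *dst = *src;
    if (src->type != SK_GRANT_DIRECT)
        return 0; /* magic grants refer to a remote process: leave as is */

    return sk_translate(map, n, src->u.direct.start, src->u.direct.len,
        &dst->u.direct.start);
}
```
 + 
 +With a region that maps old base ''0x1000'' to new base ''0x8000'', a direct grant starting at ''0x1010'' would be transferred to ''0x8010'', while a magic grant with the same start value would be copied unchanged. A real routine would have to rely on libmagicrt's own pointer lookup rather than a table like this one.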
 + 
 +=== Testrelpol failure === 
 + 
 +If the ''testrelpol'' script is run a number of times in a row, it will start to fail on the crash recovery tests for unclear reasons. We know that this is a test script failure rather than an actual failure. We suspect that it is caused by RS's default exponential backoff algorithm for crash recovery causing timeouts in ''testrelpol''. If that is the case, it should be possible to change ''testrelpol'' to disable the exponential backoff using existing minix-service(8) flags.
 + 
 +=== Libmagicrt asserts === 
 + 
 +The implementation of the magic runtime library currently relies on asserts being enabled. We have changed its Makefile so that asserts should be enabled regardless of build system settings, but this is merely a workaround. Instead, libmagicrt should function properly (and, in particular, fail properly) regardless of whether asserts are enabled. 
 + 
 +=== VM fork warning === 
 + 
 +During live update and crash recovery, the following VM error may be seen: 
 + 
 +  VM: cannot fork with physically contig memory 
 + 
 +The error indicates that it is currently not possible to mark physically contiguous memory as copy-on-write, which is true. However, the error may occur during a live update, when VM copies over the memory-mapped pages of a service's old instance to the new instance. In that case, the error is not the result of a fork(2) call. In addition, the error code returned by the function that produces the error message is ignored by its caller, with the result that the reference count of the contiguous memory range is increased anyway, which is exactly what needs to happen for live update operations. Thus, during live updates, this error is both misleading and meaningless. However, we still have to review whether the error remains useful in other scenarios.
 + 
 +=== State transfer prefixes === 
 + 
 +State transfer makes exceptions based on name prefixes. Some of these name prefixes are overly broad. For example, it is possible that the current exception for the prefix ''st_'' also ends up matching certain variables in actual service code by accident. At the very least, all exception prefixes should start with ''magic_''.
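 + 
 +The risk can be made concrete with a minimal sketch. It assumes that exceptions are matched with a plain string-prefix comparison, which may differ from the actual matching code in libmagicrt. The helper ''sk_has_prefix'' and the variable names used with it are hypothetical.
 + 
```c
#include <assert.h>
#include <string.h>

/*
 * Return nonzero if 'name' begins with 'prefix'.
 * Hypothetical helper: it only models prefix-based exception matching.
 */
int sk_has_prefix(const char *name, const char *prefix)
{
    assert(name != NULL && prefix != NULL);

    return strncmp(name, prefix, strlen(prefix)) == 0;
}
```
 + 
 +Under this assumption, a hypothetical service variable named ''st_block_count'' would match the broad prefix ''st_'' and be treated as an exception by accident. It would not match a narrower prefix such as ''magic_st_'', which only names chosen deliberately for the library would carry.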
 + 
 +===== Further reading =====
 + 
 +The following publication covers the MINIX3 live update architecture,​ design, and implementation,​ and provides more details on various theoretical and practical aspects. 
 + 
 +  * Cristiano Giuffrida, [[http://​www.minix3.org/​theses/​Cristiano_Giuffrida_PhD_thesis.pdf|Safe and Automatic Live Update]], Ph.D. thesis, 2014
  
developersguide/liveupdate.txt · Last modified: 2022/02/12 22:42 by stux