developersguide:liveupdate

Differences

This shows you the differences between two versions of the page.

developersguide:liveupdate [2015/09/16 18:38]
dcvmoole [Developers guide] corrections and style
developersguide:liveupdate [2015/11/15 01:51]
dcvmoole [Setting up the system] don't merge my dashes dokuwiki.
Line 1: Line 1:
-<div center round important>
-**This is a draft.** This page is currently not yet visible to the general public. The plan is that once all live update patches have been merged, this page will be moved to its final location in the wiki's developers guide. Until then, various internal wiki links will appear to be broken. -David
-</div>
-
 ====== Live update and rerandomization ======
  
Line 37: Line 33:
 ==== Setting up the system ====
  
-We cover all the steps to set up a MINIX3 system that is ready for live update and rerandomization. For now, it requires crosscompilation as well as an additional build of the LLVM source code. The procedure is for x86 targets only. The current procedure is not quite ideal, but it is what we have right now, and it should work.
+We cover all the steps to set up a MINIX3 system that is ready for live update and rerandomization. For now, it requires crosscompilation as well as an additional build of the LLVM source code. The procedure is for x86 targets only. The current procedure is not quite ideal, but it is what we have right now.
+
+**Warning**: the current procedure has been tested only with **Linux** as the host platform, and may not work from other host platforms without minor or major changes to various scripts. See [[https://github.com/Stichting-MINIX-Research-Foundation/minix/issues/93|here]] for additional information on how to get things going from **FreeBSD**. Please feel free to open additional GitHub issues for other platforms and link to them from here. Also, please do keep in mind the current infrastructure is temporary, and will eventually be replaced by proper integration - see also the section on open issues later in this document.
  
 After setting up an initial environment, the MINIX3 update cycle basically consists of four steps: obtaining or updating the MINIX3 source code, building the system, instrumenting the system services, and generating a bootable image. We will go through all steps in detail. At the end of this section, there is also a summary of the commands to issue.
+
+<div center round important>
+**Important**: due to side effects of an unrelated recent change, the instructions in this section currently do not work on the latest MINIX3 source tree. We are working on resolving this problem. In the meantime, for live update support, please check out git revision **b5400f9**, for example by issuing ''<nowiki>git reset --hard b5400f9</nowiki>'' after cloning the MINIX3 source tree.
+</div>
  
 All of the commands in this section are to be performed on the crosscompilation host system rather than on MINIX3. None of the commands, except the Linux-specific ''sudo apt-get'' example in the first subsection, require more than ordinary user privileges.
Line 669: Line 671:
   * **ixfer**: Identity Transfer. This annotation will copy the data over as is, without performing analysis on the memory. As an example, the ixfer annotation can be used for pointer values that should not be analyzed as pointers, for instance because they are pointers into another address space. A practical example where it is used is a process table copied in from another service. Such process tables typically contain external pointers, which will be unused by the local service. Some other values may still be needed after state transfer, which is why ixfer is used rather than noxfer.
   * **cixfer**: Conditional Identity Transfer. This annotation will cause the state transfer framework to try to interpret and transfer the value as a pointer, and fall back to identity transfer if this fails. As an example, the cixfer annotation can be used for variables which may contain either a pointer or a number value which is never a valid pointer, making the variable effectively a union of the two types. A practical example where it is used is a callback value, which is of type ''void *'' but may be used to store a small integer as well.
-  * **pxfer**: Pointer Transfer. This annotation forces a non-pointer value to be interpreted as a pointer, and transferred accordingly. As an example, the pxfer annotation may be used when a pointer value is stored in an integer type. As of writing, this annotation is not used in practice.
+  * **pxfer**: Pointer Transfer. This annotation forces a value to be interpreted as a pointer, and transferred accordingly. As an example, the pxfer annotation may be used when a pointer value is stored in an integer type. The pxfer annotation may also be used for a union of (differently typed) pointers. Thus, in some cases, a union-of-structures can be split up into a union of non-pointers and one or more unions of pointers, marking the non-pointer union with ''ixfer'' and the pointer union(s) with ''pxfer''. This is indeed how ''pxfer'' is currently used in practice as well.
   * **sxfer**: Structure Transfer. This annotation forces a union that consists of structures, to be interpreted as one single structure, and transferred accordingly. The annotation requires that the fields of the structures making up the union all line up. For example, if the first field of one structure in the union is an integer value, then the first field of all other structures in the union must be an integer value as well. If the second field is a pointer in one structure, it must be a pointer in all of them, etcetera. The sxfer annotation can be used to resolve state transfer issues with unions that consist of nearly-identical structures. The programmer must line up the structure's fields as appropriate when annotating the union as sxfer.
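The union split described for ''pxfer'' can be sketched in C as follows. This is only an illustration of the data layout, not code from the MINIX3 tree: all type and field names are hypothetical, and the annotations are indicated in comments only, since the way annotations are attached to variables is not shown in this part of the page.

```c
#include <assert.h>
#include <stddef.h>

/* All type and field names below are hypothetical illustrations. */
struct node { int id; };

enum req_kind { REQ_READ, REQ_WRITE };

/* Non-pointer members of the original union-of-structures. */
union req_vals {
    int  read_count;    /* used when kind == REQ_READ */
    long write_offset;  /* used when kind == REQ_WRITE */
};

/* Pointer members of the original union-of-structures. */
union req_ptrs {
    char        *read_buf;    /* used when kind == REQ_READ */
    struct node *write_node;  /* used when kind == REQ_WRITE */
};

struct request {
    enum req_kind  kind;
    union req_vals vals;  /* would be marked ixfer: bytes copied as is */
    union req_ptrs ptrs;  /* would be marked pxfer: always a pointer */
};

/* Build a write request; only the REQ_WRITE members are meaningful. */
static struct request make_write_request(long offset, struct node *n)
{
    struct request r;

    assert(n != NULL);
    r.kind = REQ_WRITE;
    r.vals.write_offset = offset;
    r.ptrs.write_node = n;
    return r;
}
```

Whichever member of ''union req_ptrs'' was written last, the stored value is a pointer, so treating the whole union as a pointer is safe. The same holds for the non-pointer union and identity transfer.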
  
Line 833: Line 835:
 ==== The build system ====
  
-As shown in the setup part of the users guide, the entire live update build infrastructure consists of a separate set of scripts built on top of the regular build system. These scripts deviate from the standard build system approach in various ways, for example by building a separate copy of the LLVM toolchain, placing binaries in the MINIX3 source tree, and separately performing scripted steps which should be performed from the regular makefile infrastructure instead. All of these issues should be resolved through proper integration of the live update build infrastructure into the regular build system.
+As shown in the setup part of the users guide, the entire live update build infrastructure consists of a separate set of scripts stacked on top of the regular build system. These scripts deviate from the standard build system approach in various ways, for example by building a separate copy of the LLVM toolchain, placing binaries in the MINIX3 source tree, and separately performing scripted steps which should be performed from the regular makefile infrastructure instead. All of these issues should be resolved through proper integration of the live update build infrastructure into the regular build system.
  
 === Two LLVM toolchains ===
  
-A major part of the problem is the current necessity to build a separate instance of the LLVM toolchain. This is the instance that is suitable for Link-Time Optimization (LTO), built by ''minix/llvm/generate_gold_plugin.sh'', and placed in ''obj_llvm.i386''. Even though the exact same LLVM 3.4 source code is used to compile both this and the additional regular crosscompilation toolchain in ''obj.i386'', the separate compilation is necessary because of a problem with makefiles.
+A major part of the problem is the current necessity to build a separate instance of the LLVM toolchain. Unlike the standard toolchain, this separate instance is suitable for Link-Time Optimization (LTO). It is built by ''minix/llvm/generate_gold_plugin.sh'', and placed in ''obj_llvm.i386''. The exact same LLVM 3.4 source code is used to compile both the LTO-enabled toolchain and the additional regular crosscompilation toolchain in ''obj.i386'', using the exact same configuration flags. The separate compilation is necessary only because of a problem with makefiles.
  
-NetBSD uses its own set of makefiles to build imported code using its own build system. MINIX3 imports this system, and thus also uses the NetBSD set of makefiles to build the LLVM toolchain. The problem is that these makefiles do not operate in the same way as LLVM's own set of makefiles, resulting in certain parts of the LLVM toolchain not being built in the same way. The separate LLVM LTO toolchain build does use LLVM's own makefiles, thereby generating some missing pieces that are required for the live update instrumentation.
+NetBSD uses its own set of makefiles to build imported code using its own build system. MINIX3 imports this system, and thus also uses the NetBSD set of makefiles to build the LLVM toolchain. The problem is that these makefiles do not operate in the same way as LLVM's own set of makefiles, resulting in certain parts of the LLVM toolchain being built in a different way. The separate LLVM LTO toolchain build does use LLVM's own makefiles, thereby generating some missing pieces that are required for the live update instrumentation.
  
-The solution here is to adapt the NetBSD set of makefiles to build LLVM in a way that is closer to LLVM's own makefiles, thereby generating all the missing pieces without the need to build LLVM twice.
+The solution here is to adapt the NetBSD set of makefiles to build LLVM in a way that is closer to LLVM's own makefiles, thereby generating all the necessary parts of the toolchain without the need to build LLVM twice.
  
 === Lack of integration ===
  
-Once that step has been taken, it should be possible to resolve the other issues as well, effectively replacing all the ''*.llvm'' scripts in ''minix/llvm'' with extensions in the regular build system, specifically by adapting the ''share/mk'' set of makefiles as appropriate. All of this should be optional, controlled by the ''MKMAGIC'' build (pseudo)variable and possibly other, new build variables. Ultimately, relinking with libmagic and invoking the appropriate link-time passes should be performed by those makefiles. As an example, the WeakAliasModuleOverride pass is already invoked this way.
+Once that step has been taken, it should be possible to resolve the other issues as well, effectively replacing all the ''*.llvm'' scripts in ''minix/llvm'' with extensions in the regular build system, specifically by adapting the ''share/mk'' set of makefiles as appropriate. All of this should be optional, controlled by the ''MKMAGIC'' build (pseudo)variable and possibly other, new build variables (e.g., to control ASR settings). Ultimately, relinking with libmagic and invoking the appropriate link-time passes should be performed by those makefiles. As an example, the WeakAliasModuleOverride pass is already invoked this way.
  
 All passes, as well as the magic library, should be (re)built as part of the standard build system infrastructure. As we have indicated earlier, the lack of this step puts an unnecessary burden on the user of the system.
Line 859: Line 861:
 === Type unification ===
  
-As shown in the developers guide, the magic instrumentation pass is not always capable of establishing that two different data types are in fact compatible, resulting in state transfer errors at run time. The main cause of these issues lies in LLVM's use of the **opaque** placeholder data type.
+As shown in the developers guide, the magic instrumentation pass is not always capable of establishing that two different data types are in fact compatible, resulting in state transfer errors at run time. The main cause of these issues lies in LLVM's use of the **opaque** placeholder data type. We described the practical results of this in the earlier "Incompatible types" section.
  
-This problem is a product of circumstances. Between LLVM 2.x and LLVM 3.x, a significant change was made in LLVM regarding the way that types are handled. In a nutshell, rather than unifying various instances of the same data type at compile time, LLVM 3.x keeps these instances as separate type, instead using bit casting between the types to resolve the resulting incompatibilities at link time. More details about this change can be found in the LLVM blog post
+This problem is a product of circumstances. Between LLVM 2.x and LLVM 3.x, a significant change was made in LLVM regarding the way that types are handled. In a nutshell, rather than unifying various instances of the same data type at compile time, LLVM 3.x keeps these instances as separate types, instead using bit casting between the types to resolve the resulting incompatibilities at link time. More details about this change can be found in the LLVM blog post
 [[http://blog.llvm.org/2011/11/llvm-30-type-system-rewrite.html|LLVM 3.0 Type System Rewrite]] by Chris Lattner.
  
-However, the magic framework was written for LLVM 2.x, and as a result, this problem was dealt with as an afterthought. The combination of the wildly varying forms that these bit casts can take, and the limited support for processing the bit casts in the magic pass, has resulted in the situation that not all cases of identical types
-As of writing, this is not yet a real problem, but it eventually will be.
+However, the magic framework was written for LLVM 2.x, and as a result, this problem was dealt with as an afterthought. The combination of the wildly varying forms that these bit casts can take, and the limited support for processing the bit casts in the magic pass, has created the situation that not all cases of identical types are properly registered as //compatible types//. As of writing, this has not yet been a real problem, but it is likely to become a problem in the future.
  
-We believe that the right solution would be a new **type unification pass**, which unifies all effectively-identical types in the module at link time, eliminating redundant types and bitcasts in the module. This pass would be run before the magic pass, thus resolving the original problem while also freeing the magic pass of the burden to provide a complete system for enumerating compatible types. As a beneficial side effect, there would be a reduction in the amount of type state that needs to be included with the service, and a reduction in effort needed by libmagic to search through compatible types.
+We believe that the right solution would be the introduction of a new **type unification pass**. This pass would unify all effectively-identical types in the module at link time, eliminating redundant types and bitcasts in the module. The pass could then be run before the magic pass. This would not only resolve the complete problem, but also free the magic pass of the burden to provide a complete system for enumerating compatible types. As a beneficial side effect, there would be a reduction in the amount of type state that needs to be included with the service, and a reduction in effort needed by libmagic to search through compatible types.
  
-=== Exceptions for libmagic ===
+=== ASR skipping libmagic ===
  
-The instrumentation framework currently makes more exceptions than it should. In particular, the ASR pass exempts all of the magic library from rerandomization. This is highly problematic for the overall effectiveness of ASR: libmagic is in principle linked with all system services, thus providing any attacker with a well known, large, unrandomized set of code and data for use in an attack on any running service. The exact reasons as to why this exception was made are currently unknown. However, if possible, this overall limitation should be resolved by either removing the exception or at least narrowing it to the exact scope of the problem.
+The ASR pass currently exempts all of the magic library from rerandomization. This is highly problematic for the overall effectiveness of ASR: libmagic is in principle linked with all system services, thus providing any attacker with a well known, large, unrandomized set of code and data for use in an attack on any running service.
  
-In addition, although less importantly, state transfer makes some exceptions based on name prefixes, and some of these name prefixes are overly broad. For example, it is not impossible that the current exception of the prefix ''st_'' also ends up matching certain variables in the actual service. At the very least, all exception prefixes should start with ''magic_''.
+The exact reasons as to why this exception was made are currently unknown. However, if possible, this overall limitation should be resolved by either removing the exception or at least narrowing it to the exact scope of the problem.
  
 ==== Memory management ====
Line 881: Line 882:
 === Region transfer issues ===
  
-A problem which we already flagged earlier on, is the issue that for live update, transfer of in particular memory-mapped pages requires these pages to be in a strictly delineated address range. This range may not overlap with any of the process's other sections' address ranges. The range is hardcoded globally, and thus, defined much more strictly than necessary for most service processes. Moreover, the definition indiscriminately affects all processes, including application processes. The result is that when the system is built with live update support, all processes are severely restricted in how much of their address space they can use for memory-mapped regions.
+As we already mentioned earlier, the transfer of memory-mapped pages requires that these pages be in a strictly delineated address range. This range may not overlap with any of the process's other sections' address ranges. The range is hardcoded globally, and thus, defined much more strictly than necessary for most service processes. Moreover, the definition indiscriminately affects all processes, including application processes. The result is that if the system is built with live update support, all processes are severely restricted in how much of their address space they can use for memory-mapped regions. Conversely, if the system is not built with live update support, even identity transfer may fail.
  
 Another problem mentioned before is the bulk transfer of all pages in the process's mmap section, regardless of whether the state transfer framework knows about them. This could easily lead to memory leaks due to transfer of untracked pages.
  
-We believe that both points could be resolved with a system that does not automatically transfer memory-mapped pages from the old to the new instance, but rather performs such transfer on demand, so that the (identity or magic) state transfer routine can determine what memory to transfer.
+We believe that both points could be resolved with a system that does not automatically transfer memory-mapped pages from the old to the new instance, but rather performs such transfer on demand, so that the (identity or magic) state transfer routine can determine exactly what memory to transfer.
  
 === Out-of-memory issues ===
  
-MINIX3 currently does not deal well with running out of memory. Most system services do not have preallocation for pages in their heap, stack, and mmap sections. This may create major issues in low-memory situations. For example, if a service attempts to use an extra page of stack while the system has no available memory, the service will be killed. Beyond VM freeing cached file system data when it runs out of memory, any sort of infrastructure to deal with this general problem is completely absent.
+MINIX3 currently does not deal well with running out of memory. Most system services do not have preallocated pages in their heap, stack, and mmap sections. This may create major issues in low-memory situations. For example, if a service attempts to use an extra page of stack while the system has no free memory, the service will be killed, possibly taking down the entire system with it. Beyond VM freeing cached file system data when it runs out of memory, any sort of infrastructure to deal with this general problem is completely absent.
  
-Live update is making this situation even more problematic. The magic library uses more dynamic memory, and is not particularly careful about using preallocated memory when necessary. The ASR functionality increases memory usage, including the use of stack space through its stack padding feature. The result is that there is now an increasingly large number of scenarios where out-of-memory conditions result in failure of running system services, and possibly the entire system.
+The live update and rerandomization support is making this situation even more problematic. The magic library uses extra dynamic memory, and is not particularly careful about using preallocated memory where necessary. The ASR functionality increases memory usage even further. For example, its stack padding feature requires a considerable amount of extra stack space. The result is that there is now an increasingly large number of scenarios where out-of-memory conditions result in failure of running system services, and possibly the entire system.
  
-Even though certain services should be rewritten to deal more gracefully with cases of dynamic memory allocation failure, the example of faulting in stack pages that this is not a viable option in general. There has been a partial attempt to prepare file system service's buffer caches for having their memory stolen by VM at run time, but its implementation is, where present, deeply flawed, and will likely be removed altogether soon. Instead, we believe that the easiest solution for this problem is to let VM reserve a limited amount of memory exclusively for satisfying page faults and page-handling requests involving memory in system services.
+Even though certain services should be rewritten to deal more gracefully with cases of dynamic memory allocation failure, the example of faulted-in stack pages illustrates that this is not a viable option in general. There has been a partial attempt to prepare file system service's buffer caches for having their memory stolen by VM at run time, but its implementation is, where present, deeply flawed, and will likely be removed altogether soon. Instead, we believe that the easiest solution for this problem is to let VM reserve a certain amount of memory exclusively for satisfying page faults and page-handling requests involving memory in system services.
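The idea of a memory reserve for system services can be sketched with a toy model in C. This is not VM code and does not reflect how VM tracks memory: the ''page_pool'' structure and both functions are hypothetical, and only show the policy of refusing ordinary requests once the free page count drops to the reserve.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model of a physical page pool; all names are hypothetical. */
struct page_pool {
    size_t free_pages;     /* pages currently unallocated */
    size_t reserve_pages;  /* pages held back for system services */
};

/*
 * Try to take one page from the pool. Ordinary requests may not dip
 * into the reserve; requests made on behalf of a system service (for
 * example, a page fault on its stack) may use every remaining page.
 */
static bool pool_alloc_page(struct page_pool *pool, bool for_system_service)
{
    size_t floor;

    assert(pool != NULL);
    floor = for_system_service ? 0 : pool->reserve_pages;
    if (pool->free_pages <= floor)
        return false;
    pool->free_pages--;
    return true;
}

/* Return one page to the pool. */
static void pool_free_page(struct page_pool *pool)
{
    assert(pool != NULL);
    pool->free_pages++;
}
```

With three free pages and a reserve of two, one ordinary allocation succeeds and the next fails, while allocations for system services keep succeeding until the pool is empty.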
  
-In the meantime, it can be expected that **test64** of the MINIX3 test set, the test case that tests one particular scenario of running out of memory, will cause test or system failure in an increasing number of cases. It may have to be removed from the default set of tests in the short term.
+In the meantime, it can be expected that **test64** of the MINIX3 test set - the test case that tests one particular scenario of running out of memory - will cause test or system failure in an increasing number of cases. It may have to be removed from the default set of tests in the short term.
  
 === Contiguous/DMA memory ===
  
-In addition, MINIX3 does not deal well with running out of //special// memory. Some services require blocks of physically contiguous memory for DMA purposes. VM currently has no way to recombine fragmented blocks of free memory into contiguous ranges. Some services require memory that is located in the lower 1MB or 16MB of the system memory. The support in VM for obtaining memory in those ranges is very limited as well. Both of these cases may result in the inability for a system service to obtain its needed resources if it is not started immediately at system bootup time.
+In addition, MINIX3 does not deal with running out of physically contiguous memory at all. Some services require blocks of physically contiguous memory for DMA purposes. VM currently has no way to recombine fragmented blocks of free memory into larger physically contiguous ranges. In addition, some services require memory that is located in the lower 1MB or 16MB of the physical system memory. The support in VM for obtaining memory in those ranges is very limited as well. Both of these cases may result in the inability for a system service to obtain its needed resources if it is not started immediately at system bootup time.
  
 These problems are not particularly important for live update, since the new instance will inherit special memory from its old memory by default. They are important for crash recovery however, and they are known to cause failures in the ''testrelpol'' test set on occasion.
Line 905: Line 906:
 === Page protection ===
  
-Finally, support for setting or enforcing page protection bits is mostly missing in VM as well. The live update integration has resulted in one particular case where this is now a problem. The MINIX3 userspace threading library, libmthread, inserts a guard page at the bottom of each thread stack in order to detect stack overruns. The guard page was created by unmapping the bottom page of the stack, thus leaving an unmapped hole there. This approach worked, but was not ideal: the hole could potentially be filled by a separate one-page allocation later, thereby subverting the intended protection.
+Finally, support for setting or enforcing page protection bits is mostly missing in VM as well. The live update integration has resulted in one particular case where this is now a problem. The MINIX3 userspace threading library, libmthread, inserts a guard page at the bottom of each thread stack in order to detect stack overruns. The guard page was originally created by unmapping the bottom page of the stack, thus leaving an unmapped hole there. This approach worked, but was not ideal: the hole could potentially be filled by a separate one-page allocation later, thereby subverting the intended protection.
  
 Since libmagic performs extra memory allocations, this problem is a bit more relevant for live update. For this and other reasons, the libmthread code was changed to reallocate the guard page with ''PROT_NONE'' protection instead. Theoretically, this should work fine. In practice, since VM does not implement support for protection, the guard page is now simply an additional stack page. Thus, as of writing, the libmthread guard page functionality is broken.
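For reference, the general ''PROT_NONE'' guard page technique can be sketched in C as below. This is a generic sketch, not the libmthread code: the function names are hypothetical, it uses ''mprotect'' on an anonymous mapping rather than whatever libmthread does internally, and it assumes a host where page protection is enforced.

```c
#define _DEFAULT_SOURCE  /* for MAP_ANONYMOUS with glibc in strict modes */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Allocate a stack of `pages` usable pages with one inaccessible guard
 * page directly below it. Returns the lowest usable address, or NULL on
 * failure. Function names are hypothetical.
 */
static void *stack_alloc_with_guard(size_t pages)
{
    long ps = sysconf(_SC_PAGESIZE);
    size_t page, total;
    char *base;

    if (ps <= 0 || pages == 0)
        return NULL;
    page = (size_t)ps;
    total = (pages + 1) * page;

    base = mmap(NULL, total, PROT_READ | PROT_WRITE,
        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED)
        return NULL;

    /* The bottom page stays mapped, so nothing else can be placed there. */
    if (mprotect(base, page, PROT_NONE) != 0) {
        munmap(base, total);
        return NULL;
    }
    return base + page;
}

/* Release a stack obtained from stack_alloc_with_guard(). */
static void stack_free_with_guard(void *usable, size_t pages)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);

    assert(usable != NULL);
    munmap((char *)usable - page, (pages + 1) * page);
}
```

Because the guard page remains part of the mapping, a later one-page allocation cannot fill it, which is the weakness of the unmapping approach. The protection only has an effect if the memory manager enforces it.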
Line 917: Line 918:
 === Default states ===
  
-The case of userspace threads has shown that it may be necessary for certain services to provide their own handlers for checking, entering, and leaving a custom state of quiescence. Moreover, these services may crash if the default quiescence state is used for a live update instead of the custom state. The result is the requirement that both users and scripts, the update_asr(8) script in particular, be aware of specific services requiring custom quiescence state. This is annoying and dangerous.
+The case of userspace threads has shown that it may be not just useful, but actually //necessary// for certain services to provide their own handlers for checking, entering, and leaving a custom state of quiescence. These services may crash if the default quiescence state is used for a live update instead of the custom state. The result is the requirement that not just users, but also scripts - the update_asr(8) script in particular - be aware of specific services requiring custom quiescence state. This is inconvenient and dangerous.
  
-The default quiescence state is currently hardcoded in the service(8) utility, in the form of ''DEFAULT_LU_STATE'' in ''minix/commands/service/service.c''. Instead, we believe that the service should be able to specify its own default quiescence state, possibly using an additional SEF API call. It is not clear whether RS would need to be aware of the alternative quiescence state. If not, the translation from a pseudo-state to the real state could take place entirely in the service's own SEF routines. If this approach does not work, it would also be possible to somehow expose each service's default state through the procfs per-service ''/proc/service/'' files, so that at least scripts could add any custom ''-state'' options automatically.
+The default quiescence state is currently hardcoded in the service(8) utility, in the form of ''DEFAULT_LU_STATE'' in ''minix/commands/service/service.c''. Instead, we believe that the service should be able to specify its own default quiescence state, possibly using an additional SEF API call. It is not yet clear whether RS would need to be aware of the alternative quiescence state. If not, the translation from a pseudo-state to the real state could take place entirely in the service's own SEF routines. Otherwise, the SEF may have to send the default state as extra data to RS at service initialization time.
  
 === Policy redundancy ===
Line 925: Line 926:
 While the following issue is relevant more for crash recovery than for live update, it is included here because it affects the infrastructure supporting the ''testrelpol'' script.
  
-Each service effectively knows what its own crash recovery policy should be. Separately, procfs has a policy table with an entry for each service in ''minix/fs/procfs/service.c'', exposing the same crash recovery policy information to userland, and the //testrelpol// script in particular. This is effectively redundant information.
+Each service effectively knows what its own crash recovery policy should be. Separately, procfs has a policy table with an entry for each service in ''minix/fs/procfs/service.c'', containing the same crash recovery policy information, for export to userland and ''testrelpol'' in particular. This is effectively redundant information.
  
 Ideally, each service would communicate its policy to RS. That information can then be used by procfs to expose the policy information to userland, thus eliminating the redundancy.
 +
 +=== Live update of VM ===
 +
 +Earlier in this document, we have described the limitations of performing live updates on the VM service, as well as the reasons behind these limitations. Despite a large number of exceptions that allow VM to be updated at all, the resulting situation is that VM can still not be subjected to any meaningful type of update.
 +
 +It is unclear whether all these limitations are fundamental,​ however. We believe it may be possible to restructure the VM live update facilities to resolve at least some of the limitations. For example, it might be possible to store the pagetables in a separate memory section, and make actual copies of all or most other dynamic memory in VM. The out-of-band region could then be limited to the pagetable memory, thus allowing for relocation of at least static memory. Furthermore,​ more explicit rollback support in the old VM instance might even allow changes to VM's own pagetable, thereby possibly allowing dynamic memory allocation during the live update. It remains to be seen whether any of this is possible in practice.
  
 === Timed retries of safecopies ===
  
-If process A is being updated, process B should not make use of process A's grants, because those grants may temporarily be inaccessible, invalid, etcetera. The kernel currently has a simple way to enforce that rule, by responding to process B's safecopy kernel call with an ''ENOTREADY'' error response. The service-side libsys implementation of sys_*safecopy*(2) automatically suspends the calling service for a short while (using tickdelay(3)) and then retries the safecopy. This shortcut approach works, but it is not ideal, in particular because it could theoretically lead to starvation.
+If process A is being updated, process B should temporarily not make use of process A's grants, because during the live update, those grants may be inaccessible, invalid, etcetera. The kernel currently has a simple way to enforce that rule, by responding to process B's safecopy kernel call with an ''ENOTREADY'' error response whenever process A is being updated. The service-side libsys implementation of sys_*safecopy*(2) automatically suspends the calling service for a short while (using tickdelay(3)) and then retries the safecopy. This shortcut approach works, but it is not ideal: it should not be the responsibility of system services to determine when the safecopy can be retried again, and the approach could lead to starvation.
  
-Instead, the kernel should block the caller of a safecopy kernel call for the duration of its target's live update procedure, retrying the safecopy operation and unblocking the caller only once the target is no longer being updated. A proper implementation of this functionality requires several cases to be covered: indirect grants, either the granter or the grantee being terminated or having its process slots swapped, etcetera. As a possible simplification, internally retrying the safecopy operation more than once would not be a problem, since the caller would simply remain blocked if the retried safecopy operation hits a case of live update again.
+Instead, the kernel should block the caller of a safecopy call for the duration of its target's live update procedure, retrying the safecopy operation and unblocking the caller only once the target is no longer being updated. A proper implementation of this functionality requires several cases to be covered: indirect grants, either the granter or the grantee being terminated or having its process slots swapped, etcetera. As a possible simplification, the kernel could internally retry the safecopy operation more often than necessary, since the caller would simply remain blocked if the retried safecopy operation hits a case of live update again.
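
The service-side retry behavior described above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the libsys code: ''sim_safecopy'', ''sim_tickdelay'', ''SIM_ENOTREADY'' and the ''max_retries'' bound are hypothetical stand-ins for the safecopy kernel call, tickdelay(3), the ''ENOTREADY'' error and a cap that exists only to keep the example terminating.

```c
#include <assert.h>

#define SIM_OK        0
#define SIM_ENOTREADY 1   /* hypothetical stand-in for the ENOTREADY error */

/* Simulated grant owner (process A): ticks left until its update ends. */
struct sim_target {
    int update_ticks_left;
};

/* Stand-in for the safecopy kernel call made by process B. */
static int sim_safecopy(const struct sim_target *t)
{
    return t->update_ticks_left > 0 ? SIM_ENOTREADY : SIM_OK;
}

/* Stand-in for tickdelay(3): time passes and the update makes progress. */
static void sim_tickdelay(struct sim_target *t, int ticks)
{
    t->update_ticks_left -= ticks;
    if (t->update_ticks_left < 0)
        t->update_ticks_left = 0;
}

/*
 * Retry the simulated safecopy until it succeeds. Returns the number of
 * retries that were needed, or -1 if max_retries was reached first. The
 * cap is an assumption of this sketch; without some bound, a caller whose
 * target keeps getting updated could wait indefinitely (starvation).
 */
static int sim_safecopy_retry(struct sim_target *t, int delay, int max_retries)
{
    int retries = 0;

    assert(delay > 0);
    while (sim_safecopy(t) == SIM_ENOTREADY) {
        if (retries == max_retries)
            return -1;
        sim_tickdelay(t, delay);
        retries++;
    }
    return retries;
}
```

In this model the caller has to guess a delay and poll. With kernel-side blocking, the polling loop would not be needed in the service at all.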
  
 === Copying asynsend tables ===
  
-In a very specific scenario, the kernel performs a memory copy of the entire asynsend table between two processes of which the slots are being swapped. Although it is not yet clear which exact circumstances cause the need for this memory copy, the actual copy action relies on very specific conditions which are not validated before the copy action.
+In a very specific scenario, the kernel performs a memory copy of the entire asynsend table between two processes of which the slots are being swapped. Although it is not yet clear which exact circumstances cause the need for this memory copy, the actual copy action relies on very specific conditions which are not fully validated before the copy action. Thus, this is a rather dangerous kernel feature.
  
 A rather long comment in ''minix/lib/libsys/sef_liveupdate.c'' elaborates on the specifics of this case, and suggests why RS may be the only affected service. If the comment is correct, it may be possible to engineer another solution for RS in particular, and remove the copy hack from the kernel.
Line 947: Line 954:
 === Performance ===
  
-The performance of various parts of the live update infrastructure, both at instrumentation time and (in particular) at run time, is not fantastic. One of the effects is that in several cases, live update operations must be given a lenient timeout in order to succeed. In fact, state transfer currently takes too long to consider automatic runtime ASR rerandomization as a realistic option.
+The performance of various parts of the live update infrastructure is not fantastic. This is true for both the instrumentation passes and, more importantly, the run-time functionality. As one of the effects, live update operations may have to be given a lenient timeout in order to succeed. In fact, state transfer currently takes too long to consider automatic runtime ASR rerandomization as a realistic option.
  
-We have not yet looked into the causes of the poor performance. This is therefore a rather open-ended issue.
+We have not yet looked into the causes of the poor performance. Part of it may be due to the extra memory allocations performed by libmagic, but that is only a guess. This issue is therefore rather open ended. Statistical profiling may provide at least some hints.
  
 === Testrelpol failure ===
  
-After running the ''testrelpol'' script a number of times in a row, it will start to fail on the crash recovery tests for unclear reasons. We know that this is a test script failure rather than an actual failure. We suspect that it is caused by RS's default exponential backoff algorithm for crash recovery causing timeouts in //testrelpol//. If that is the case, it should be possible to change //testrelpol// to disable the exponential backoff using existing service(8) flags.
+If the ''testrelpol'' script is run a number of times in a row, it will start to fail on the crash recovery tests for unclear reasons. We know that this is a test script failure rather than an actual failure. We suspect that it is caused by RS's default exponential backoff algorithm for crash recovery causing timeouts in //testrelpol//. If that is the case, it should be possible to change //testrelpol// to disable the exponential backoff using existing service(8) flags.
  
 === Libmagic asserts ===
Line 966: Line 973:
  
 The error indicates that it is currently not possible to mark physically contiguous memory as copy-on-write, which is true. However, the error may occur during a live update, when VM copies over the memory-mapped pages of a service's old instance to the new instance. The error is therefore not the result of a fork(2) call. In addition, the error code returned by the function producing the error message is ignored by its caller, with the result that the reference count of the contiguous memory range is increased anyway, which is exactly what needs to happen for live update operations. Thus, during live updates, this error is both misleading and meaningless. However, we have to review whether it is still useful to keep the error around for other scenarios.
 +
 +=== State transfer prefixes ===
 +
 +State transfer makes exceptions based on name prefixes. Some of these name prefixes are overly broad. For example, it is possible that the current exception of the prefix ''​st_''​ also ends up matching certain variables in actual service code by accident. At the very least, all exception prefixes should start with ''​magic_''​.
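
The risk of an overly broad prefix can be shown with a small, self-contained sketch. It does not reproduce the actual state transfer code: ''has_prefix'', ''is_excepted'', the two prefix lists, the ''magic_st_'' prefix and the variable names used with them are all hypothetical and serve only to illustrate the matching problem.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns 1 if 'name' starts with 'prefix'. */
static int has_prefix(const char *name, const char *prefix)
{
    assert(name != NULL && prefix != NULL);
    return strncmp(name, prefix, strlen(prefix)) == 0;
}

/*
 * Hypothetical exception check: returns 1 if 'name' starts with any
 * prefix in the NULL-terminated list 'prefixes', and 0 otherwise.
 */
static int is_excepted(const char *name, const char *const *prefixes)
{
    size_t i;

    for (i = 0; prefixes[i] != NULL; i++) {
        if (has_prefix(name, prefixes[i]))
            return 1;
    }
    return 0;
}

/* A broad exception prefix, as described in the text above. */
static const char *const broad_prefixes[] = { "st_", NULL };

/* A narrower, assumed alternative that starts with "magic_". */
static const char *const narrow_prefixes[] = { "magic_st_", NULL };
```

With these definitions, a made-up service variable such as ''st_cached_inode'' is matched by the broad list but not by the narrow one, while a made-up infrastructure variable such as ''magic_st_buffer'' is still matched by the narrow list.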
  
 ===== Further reading =====
developersguide/liveupdate.txt · Last modified: 2022/02/12 22:42 by stux