====== Dev and Proc File Systems: Design Document ======

David C. van Moolenbroek\\
Dept. Computer Science, Vrije Universiteit Amsterdam, The Netherlands\\
dcvmoole AT cs DOT vu DOT nl

**Warning** This document has gone stale. There is up-to-date [[:developersguide:vtreefs|documentation of VTreeFS]].

===== Abstract =====

The MINIX3 multiserver operating system is evolving rapidly. Much effort is being put into expanding the operating system's dependable, secure and flexible architecture to provide a modern UNIX-like operating environment on top. As part of that effort, this project aims to introduce functionality that is common to other operating systems: virtual /dev and /proc file systems.

  * The goal of the virtual /dev file system is to provide a pool of device nodes that is populated dynamically by device drivers, rather than statically in advance. This allows drivers to add device nodes based on hardware available in the system.
  * The goal of the virtual /proc file system is to provide a standard interface for providing information about the system. This obviates the need for utilities like ''ps(1)'' and ''top(1)'' to access internal system data structures directly.

===== Overview =====

The current project plan is to cover the following points, preferably in the given order:

  * A shared proc/dev file system library, managing virtual inodes and handling most of the VFS calls.
  * A /dev file system, based on the file system library.
  * Device node registration by existing device drivers.
  * A /proc file system, based on the file system library.
  * Modification of ''ps(1)'', ''top(1)'' and various other small utilities and library routines to make use of /proc.
  * Support for mounting file systems using a "none" device, i.e. no particular block device.
  * Changes to the MINIX3 boot process to mount /dev and /proc upon boot.
  * As much additional system information in /proc (from VM, Inet, RS etcetera) as time permits.

After each point there will be a (possibly short) testing phase to verify that the results are as required.

====== VTreeFS ======

Both devfs and procfs will present a set of virtual files. From the user's point of view, both are read-only. As a result, both file systems will handle most requests from VFS in the exact same way.
It therefore makes sense to put the common functionality into a library. This library should provide the functionality needed to create a virtual tree based read-only file system with ease. Specifically, the library is to provide the following:

  - the main loop of the server;
  - handlers for the requests from VFS;
  - an interface to manipulate the virtual file system tree;
  - callback hooks for refreshing of directories, and reading from regular files and symlinks.

While written to cover the demands of devfs/procfs, the library will be generic enough that new file systems that operate along the same lines can use it as well.

==== The concept ====

An application that uses the library can add and remove directories and other files. All files (including directories) are represented using the primary object of the library: the **inode**. The library will essentially manage a fully connected tree of inodes. Hard links are not supported, so every inode except the root inode is also an entry in its parent directory. The entry is identified in that directory by name.

To satisfy the requirements of ProcFS, an inode may also have an index associated with its entry in the directory. This optional index determines the inode's position when it is returned by a getdents() call.

//Incomplete//

==== VFS request handling ====

The library will provide a meaningful implementation for at least the following VFS requests:

  * REQ_PUTNODE
  * REQ_STAT
  * REQ_FSTATFS
  * REQ_UNMOUNT
  * REQ_SYNC
  * REQ_NEW_DRIVER
  * REQ_READ_S
  * REQ_LOOKUP_S
  * REQ_READSUPER_S
  * REQ_RDLINK_S
  * REQ_GETDENTS

The remaining calls may be implemented by returning ENOSYS.

//Incomplete//

==== The API ====

Even though the library has to be written in C, its interface is close to object oriented: the inode is an opaque structure. The library's public header file should expose a "struct inode" declaration (so that the server using the library can make pointers) but //not// expose its internal fields.
Creation, deletion, querying and manipulation of inodes and their properties takes place entirely through API calls.

<code c>
struct inode;

typedef int index_t;
typedef void *cbdata_t;

#define NIL_INODE ((struct inode *)0)
#define NO_INDEX  (-1)

struct inode_stat {
    mode_t mode;
    uid_t uid;
    gid_t gid;
    size_t size;
    dev_t dev;
};

struct fs_hooks {
    int (*lookup_hook)(struct inode *inode, char *name, cbdata_t cbdata);
    int (*getdents_hook)(struct inode *inode, cbdata_t cbdata);
    int (*read_hook)(struct inode *inode, off_t offset, char **ptr,
        size_t *len, cbdata_t cbdata);
    int (*rdlink_hook)(struct inode *inode, char *ptr, size_t *len,
        cbdata_t cbdata);
    int (*message_hook)(message *m);
};

struct inode *add_inode(struct inode *parent, char *name, index_t index,
    struct inode_stat *stat, index_t nr_indexed_entries, cbdata_t cbdata);
void delete_inode(struct inode *inode);

struct inode *get_inode_by_name(struct inode *parent, char *name);
struct inode *get_inode_by_index(struct inode *parent, index_t index);

char const *get_inode_name(struct inode *inode);
index_t get_inode_index(struct inode *inode);
cbdata_t get_inode_cbdata(struct inode *inode);

struct inode *get_root_inode(void);
struct inode *get_parent_inode(struct inode *inode);
struct inode *get_first_inode(struct inode *parent);
struct inode *get_next_inode(struct inode *previous);

void get_inode_stat(struct inode *inode, struct inode_stat *stat);
void set_inode_stat(struct inode *inode, struct inode_stat *stat);

void start_vtreefs(struct fs_hooks *hooks, struct inode_stat *stat,
    index_t nr_indexed_entries);
</code>

The suggested specification of the API is as follows:

**add_inode** adds an inode into a //parent// inode (which must be a directory), with the given //name//. The //index// parameter indicates the index position for the inode in the parent directory, unless it's equal to NO_INDEX (or negative in general). The //stat// parameter points to a filled structure of inode metadata.
This structure's //mode// field determines the file type and the access permissions (see ''/usr/include/sys/stat.h''); at least directories (S_IFDIR), regular files (S_IFREG), character-special files (S_IFCHR), block-special files (S_IFBLK), and symbolic links (S_IFLNK) have to be supported. The //uid//, //gid//, //size// and //dev// fields specify the owning user and group ID, the size of the inode, and the device number (for block/char special files), respectively. The //nr_indexed_entries// parameter is only used for new directories (S_IFDIR), and indicates the range (0 to nr_indexed_entries-1) reserved for inodes with index numbers; this value may be 0 (and/or negative?) for directories that do not care about index numbers. The //cbdata// parameter specifies a caller-defined value passed to hook calls affecting this inode.

**delete_inode** removes the given //inode//. If the inode is a directory, all of its children will be removed recursively as well.

**get_inode_by_name** and **get_inode_by_index** return an inode given a directory and either a name or an index number, respectively. They may fail, in which case they return NIL_INODE.

**get_inode_name** and **get_inode_index** return the name and index (or NO_INDEX) of a given inode, as assigned with the add_inode() call. The name pointer may simply point into the inode.

**get_inode_cbdata** returns the cbdata value for an inode.

**get_root_inode**, **get_parent_inode**, **get_first_inode** and **get_next_inode** allow walking through the virtual tree of inodes, respectively retrieving the virtual tree's root inode, the parent inode of a given inode, the first child inode of a given parent, and the next inode in a series of children (given the previous result of get_first_inode() or get_next_inode()). The last three may return NIL_INODE if the given inode has no parent (only in the case of the root directory), the directory has no children, or there are no more children, respectively.
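
The tree-walking calls described above can be illustrated with a small sketch. Since the library does not exist yet, the ''struct inode'' layout and the three accessor functions below are stand-ins written only for this example; a real implementation would keep the structure private. ''count_subtree()'', ''inode_depth()'' and ''build_example_tree()'' are hypothetical helper names.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the library's opaque inode; the real layout would be private. */
struct inode {
    struct inode *parent;        /* NIL_INODE for the root */
    struct inode *first_child;   /* first entry of a directory */
    struct inode *next_sibling;  /* next child of the same parent */
};

#define NIL_INODE ((struct inode *)0)

/* Minimal stand-ins for three of the API calls described above. */
struct inode *get_parent_inode(struct inode *inode) { return inode->parent; }
struct inode *get_first_inode(struct inode *parent) { return parent->first_child; }
struct inode *get_next_inode(struct inode *previous) { return previous->next_sibling; }

/* Count the inodes in the subtree rooted at 'top', including 'top' itself. */
int count_subtree(struct inode *top)
{
    struct inode *child;
    int count = 1;

    assert(top != NIL_INODE);

    for (child = get_first_inode(top); child != NIL_INODE;
        child = get_next_inode(child))
        count += count_subtree(child);

    return count;
}

/* Depth of an inode: 0 for the root, 1 for its children, and so on. */
int inode_depth(struct inode *inode)
{
    int depth = 0;

    while ((inode = get_parent_inode(inode)) != NIL_INODE)
        depth++;

    return depth;
}

/* Hypothetical example tree: root -> { a -> { c }, b }. */
static struct inode ex_root, ex_a, ex_b, ex_c;

struct inode *build_example_tree(void)
{
    ex_a.parent = &ex_root;
    ex_b.parent = &ex_root;
    ex_c.parent = &ex_a;
    ex_root.first_child = &ex_a;
    ex_a.next_sibling = &ex_b;
    ex_a.first_child = &ex_c;
    return &ex_root;
}
```

Both helpers rely only on the NIL_INODE return convention, so they would not need to change if the library stored children in a different structure.
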

**get_inode_stat** and **set_inode_stat** retrieve and manipulate inode metadata.

**start_vtreefs** starts the main loop of the vtree file system library, accepting requests from VFS and possibly other sources (passing those on to the application), and making the appropriate callbacks to the application based on the hooks given by the application. This API call will return when the file server is instructed to shut down by VFS and/or PM. The //hooks// parameter specifies a structure of function pointers just like MINIX3's libdriver does. Upon being started, the vtreefs library has to create a root inode; the //stat// and //nr_indexed_entries// parameters of start_vtreefs() determine the initial parameters of this root inode.

==== Callback hooks ====

**lookup_hook** is called every time a lookup for an entry other than "." and ".." is made on an inode that is a directory and is search-accessible by the caller of the REQ_LOOKUP call. The hook call is made right before the library does the actual name lookup. The provided //inode// is the directory inode, and //cbdata// is the callback data of that inode. //name// is the path component being looked up. This hook should allow the application to do at least the following things safely:

  * populate the given directory inode with inodes, leaving the precise lookup of the given //name// to the library;
  * delete the directory inode.

In the latter case, the hook implementation should return an error (typically ENOENT) to indicate that the lookup should not continue. If the hook returns OK, the library should continue the lookup.

**getdents_hook** is called every time a REQ_GETDENTS call is made on a directory inode. The hook call is made right before the library does the actual directory entry enumeration. The //inode// parameter is the inode of this directory, and //cbdata// is the callback data of this inode. The same semantics apply as for //lookup_hook// above.

**read_hook** is called when a user process reads from a regular (S_IFREG) file inode. //inode// and //cbdata// are the inode and callback data of this regular file, respectively. //offset// is the zero-based offset into the file from which reading should start, and //len// points to the requested read length. The hook implementation may return an error indicating why the file cannot be read. If the hook returns OK, then the library assumes that:

  * //ptr// is filled with a ''char *'' pointer to a static array containing the data to return, and,
  * //len// is filled with the length of the data (which may be less, but not more, than the original value of //len//).

However, if EOF is reached for the file, then the hook must return OK, and a length of 0 in //len//. The //ptr// value is then unused.

//The return-pointer-to-array construction is there to avoid the overhead of memory copying the data from the application to the library on every read.//

**rdlink_hook** is called when a user process reads from a symbolic link (S_IFLNK) inode. //inode// and //cbdata// are the inode and callback data of this symlink. //ptr// is a pointer to a memory area within the library, of a size pointed to by //len//. The hook implementation can write up to //len// bytes of data into //ptr//, and is expected to fill //len// with the number of bytes written (which may be less, but not more, than the original value of //len//). The data written must not contain any '\0' bytes. Library implementation suggestion: the memory area should be at least PATH_MAX bytes, but the given value of //len// may be less if the REQ_RDLINK provided a smaller length.

**message_hook** is called whenever the library's main loop receives a message that is not a request from VFS and not a SIGTERM signal notification from PM. The //m// parameter points to the message received. It is up to the hook implementer what to do with the message; the library will not send a reply by itself.
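
As an illustration of the //read_hook// contract described above, here is a minimal sketch of a hook that serves a fixed text from a static array. It is not taken from any existing implementation: the hook name ''example_read_hook'', the file contents, the local definition of ''OK'' and the ''read_len()'' wrapper are all assumptions made to keep the example standalone. Out-of-range offsets are simply reported as EOF here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>

#define OK 0    /* success code, defined locally to keep the sketch standalone */

struct inode;   /* opaque, as in the API above */
typedef void *cbdata_t;

/* Hypothetical file contents; a real file system would generate this. */
static char example_text[] = "0123456789";

/*
 * Example read_hook: on entry *len is the requested length; on return it
 * is the number of bytes available at *ptr, or 0 at end of file.
 */
int example_read_hook(struct inode *inode, off_t offset, char **ptr,
    size_t *len, cbdata_t cbdata)
{
    size_t size = strlen(example_text);

    (void)inode;    /* a real hook would select the file by inode/cbdata */
    (void)cbdata;

    assert(ptr != NULL && len != NULL);

    if (offset < 0 || (size_t)offset >= size) {
        *len = 0;   /* EOF: return OK with length 0; *ptr is left unused */
        return OK;
    }

    /* Return less than requested if the file ends first, never more. */
    if (*len > size - (size_t)offset)
        *len = size - (size_t)offset;

    *ptr = example_text + offset;   /* pointer into static data, no copy */
    return OK;
}

/* Convenience wrapper for experiments: number of bytes the hook delivers. */
size_t read_len(off_t offset, size_t want)
{
    char *p = NULL;
    size_t len = want;

    if (example_read_hook(NULL, offset, &p, &len, NULL) != OK)
        return 0;
    return len;
}
```

Because the hook hands back a pointer into its own static array, the data has to stay valid until the library has copied it out to the caller; the sketch satisfies this by never modifying ''example_text''.
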

All hook pointers given in the fs_hooks structure may be NULL. If a hook pointer is NULL, the library must not call it, and instead use sensible defaults: for the lookup and getdents hooks, simply nothing changes in the request handling; for the read hook, EOF may be returned to REQ_READ calls; for the rdlink hook, the library may return an empty result; for the message hook, the message should simply be ignored.

==== Implementation requirements and hints ====

In no particular order.

  * The naming used in the specification is only a suggestion. In fact, the entire specification is based only on its likely suitability for the rest of the project, and may be changed later if we discover that a different API provides a more convenient model for devfs/procfs.
  * The library **must not** perform any dynamic memory allocation. The inodes may just be stored in a statically sized array. This means that the number of inodes is a compile-time configuration option.
  * The inode number of an inode (used in the VFS-FS protocol) may be a combination of its index into that array, and a per-inode generation number that gets increased every time an inode is reused for different purposes (just like MINIX3's endpoints).
  * The performance of the library is not the main concern (and must //never// come at the expense of code readability), but we would like to avoid quadratic (or worse) complexity. It seems sensible to have two hashtables: one for (parent,name) -> inode lookups, and one for (parent,index) -> inode lookups. Part of the testing phase may be gathering information about, and tuning, the used hashtable sizes and hashing algorithms.
  * The parent can be a single pointer field in an inode; the set of children for an inode may be a linked list (doubly linked, for cheap deletion). In total there will be a lot of linked list manipulation; it may be a good idea to use operations from '''' wherever appropriate.
  * In general, the library may assume that the caller knows what it is doing. However, it is highly recommended that debugging functionality be added (that can be turned on or off with a compile-time configuration option) to make sure that this is indeed the case. A simple example here is that in debugging mode, add_inode() may check whether the given parent already has a child inode with the given name, and delete_inode() may check whether the given inode is the root inode, or even a valid inode pointer at all. The existing assert() and panic() calls may be used in case of errors.
  * The space to store names must be of size NAME_MAX+1, that is, NAME_MAX actual characters and a terminating '\0' character. In debugging mode, providing a longer name to add_inode() is an error, etcetera.
  * For the REQ_GETDENTS implementation, the position field may be used as follows: position 0 and 1 are for "." and "..", positions 2 to //nr_indexed_entries+1// are for indexed child inodes (typically not all of these are present, so many positions will be skipped here), and positions //nr_indexed_entries+2// and onwards are for child inodes without an index number.
  * It is very important that the library is safe with regard to modifications made to the vtree by the application in the callback hook. In particular, if a callback returns an error, the library must not make any assumptions about the original inode still existing.
  * VFS-referenced inodes may be deleted through the API though. It is up to the library to determine whether it will keep around an inode that has been deleted but is still referenced by VFS. The easiest solution would just be to always delete the inode and just throw errors when VFS makes a request for an invalid inode number; a nicer solution would be to keep around the inode until it is not referenced anymore, but this might potentially be forever (taking up an inode forever). Obviously this has an impact on applications using /proc.
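
Two of the hints above, the inode number encoding and the REQ_GETDENTS position layout, are simple enough to sketch directly. The code below is only one possible reading of those hints: the 16-bit split between index and generation, and all function and constant names, are assumptions for the sake of the example. It also assumes that positions and //nr_indexed_entries// are non-negative.

```c
#include <assert.h>

#define INDEX_BITS 16   /* assumption: low bits hold the inode array index */
#define INDEX_MASK ((1UL << INDEX_BITS) - 1)

/* Combine an array index and a generation count into one inode number. */
unsigned long make_inode_nr(unsigned int index, unsigned int generation)
{
    assert(index <= INDEX_MASK);    /* index must fit in the low bits */
    return ((unsigned long)generation << INDEX_BITS) | index;
}

/* Recover the array index from an inode number. */
unsigned int inode_nr_index(unsigned long nr)
{
    return (unsigned int)(nr & INDEX_MASK);
}

/* Recover the generation count, to compare against the slot's current one. */
unsigned int inode_nr_generation(unsigned long nr)
{
    return (unsigned int)(nr >> INDEX_BITS);
}

enum pos_kind { POS_DOT, POS_DOTDOT, POS_INDEXED, POS_UNINDEXED };

/* Classify a getdents position according to the layout suggested above. */
enum pos_kind classify_pos(long pos, long nr_indexed_entries)
{
    if (pos == 0)
        return POS_DOT;
    if (pos == 1)
        return POS_DOTDOT;
    if (pos < nr_indexed_entries + 2)
        return POS_INDEXED;
    return POS_UNINDEXED;
}

/*
 * For an indexed position, return the child index; for an unindexed
 * position, return the ordinal among children without an index.
 * Returns -1 for "." and "..".
 */
long pos_sub(long pos, long nr_indexed_entries)
{
    switch (classify_pos(pos, nr_indexed_entries)) {
    case POS_INDEXED:
        return pos - 2;
    case POS_UNINDEXED:
        return pos - nr_indexed_entries - 2;
    default:
        return -1;
    }
}
```

With this encoding, a request from VFS that carries a stale inode number can be rejected by comparing ''inode_nr_generation()'' against the generation stored in the array slot, which matches the "throw errors" option mentioned in the last hint.
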

==== Testing ====

To test the library, you'll have to write a simple meaningless file system implementation that uses the library and all its features, and perhaps a number of test programs/scripts that trigger all the VFS->FS calls on this file system. If done properly, that should be sufficient to determine whether the library works as intended.

Mounting a file system without specifying a block device is not possible yet (if time permits we'll look at that later in this project), so mounting a file system now requires a block device that you won't use anyway, e.g.:

  mount -t testfs /dev/fd0 /mnt

(this requires that you've installed the test file system to /sbin/testfs)

Note that because the block device requirement will go away, you don't have to implement REQ_BREAD_S/REQ_BWRITE_S in the library.

//Incomplete//

----

====== ProcFS ======

One major goal of the /proc part of the work is to be able to deny access to the ''getsysinfo(2)'' call for userland programs. That means that all the information offered through ''getsysinfo'' that is currently used by userland programs has to be offered through /proc. Only procfs (and some other system servers, like IS) should use ''getsysinfo''. This is much cleaner than the current approach, where userland programs obtain and parse raw copies of various system servers' process tables.

==== Files in /proc ====

At this moment, we anticipate that the /proc file system will provide at least the following files:

^ File ^ Type ^ Priority ^ Description ^
| ''/proc/hz'' | text | +++ | System clock frequency, in ticks per second |
| ''/proc/loadavg'' | text | +++ | Load average, used by ''getloadavg(3)'' et al |
| ''/proc/uptime'' | text | +++ | Uptime information, used by ''uptime(1)'' et al |
| ''/proc/version'' | text | + | System version information |
| ''/proc/''//pid//''/psinfo'' | text | +++ | Process information for ''ps(1)'', ''top(1)'' |
| ''/proc/''//pid//''/status'' | text | +++ | Human-readable process key, value pairs |
| ''/proc/''//pid//''/map'' | text | ++ | Memory map of the process |
| ''/proc/''//pid//''/cmdline'' | text | ++ | Command line of the process |
| ''/proc/''//pid//''/environ'' | text | ++ | Environment variables of the process |
| ''/proc/''//pid//''/cwd'' | symlink | + | Current working directory of the process |
| ''/proc/''//pid//''/root'' | symlink | + | Root directory of the process |
| ''/proc/''//pid//''/exe'' | symlink | + | Executable file for the process |
| ''/proc/''//pid//''/fd/''//N// | symlink | + | Open file descriptors of the process |
| ''/proc/net/tcp'' | text | ++ | List of TCP connections, for ''tcpstat(1)'' |
| ''/proc/net/udp'' | text | ++ | List of UDP connections, for ''udpstat(1)'' |

==== VTreeFS ====

  * Generally, the number of forks and exits will outweigh the number of accesses to /proc by far. Unlike DevFS, ProcFS will therefore not be actively updated about changes to the system status that it's interested in: this would simply cause too much overhead in the common case.
  * Updating of the directory structure for PIDs (''/proc/''//pid//) should therefore be "lazy". On every access, ProcFS' VTreeFS hooks can see if the information that it is about to send back to VFS is still up-to-date.
  * This information has to be obtained from PM, VFS and the kernel (using ''getsysinfo(2)'' calls to PM and VFS). Each call returns a table of process entries (an array of //slots//, NR_TASKS+NR_PROCS in total).
  * The set of PIDs can easily change between subsequent getdents calls. To make sure that every PID is returned exactly once in a directory listing, ProcFS can use processes' slot numbers for the //index// value of the inodes. The maximum number of indexed entries of the root directory is therefore NR_TASKS+NR_PROCS.
  * To make sure that a PID is not reused without ProcFS knowing about it, the //cbdata// value of the ''/proc/''//pid// directory inodes can be the endpoint for that PID.
  * ProcFS can also create only the containing directory for each PID upon access of the root directory, and fill a specific PID's directory with inodes ('psinfo', 'status' etc) as that directory is accessed.

----

====== DevFS ======

The file system side of devfs will use only a small subset of the vtreefs library, and should be very easy to implement. The /dev directory need not change compared to the way it is now: it may be completely flat and essentially offer only block-special and character-special files. The main chunk of this work is writing the infrastructure for letting device drivers add device nodes by making function calls.

==== General design ====

The general design of DevFS is as follows. A device driver requests the creation of a device node by making a call to RS. If this is allowed according to RS's policy, RS will publish (store) the information for the device node in DS, using a special key prefix. The devfs process subscribes to that special key prefix, and thereby gets a message when a new device node is published in DS. It will then retrieve the new entry (or entries), and update its virtual file system accordingly.

Summarizing the process flow of creating a device node:

  Driver -> RS -> DS -> DevFS

Device nodes have a number of properties:

  * The name (e.g. "c0d3p1s0", "null")
  * The type (character or block)
  * The major device number
  * The minor device number
  * The file access mode (e.g. ''rw-r--r--'')
  * The owning UID
  * The owning GID

Remember that the goal of devfs is to let device drivers create device nodes dynamically. In terms of properties, that means a device driver must be able to specify the name and the minor device number at the very least. Theoretically, device drivers should not have anything to do with the other POSIX semantics. Practically speaking, however, the driver will have to specify the file mode and the owning UID and GID. This part can and hopefully will be improved at some later time, with a much more elaborate policy specification in /etc/drivers.conf. Unfortunately, that change is estimated to be too much work for this project.

If the device driver indeed specifies the device name, minor number, UID, GID and file mode for each device, this leaves the device type and major number. Those properties are always the same for all device nodes belonging to a single device driver. It therefore makes sense **not** to specify them on a per-node basis. Additionally, we want to introduce as little redundancy as possible in the whole MINIX3 system. Therefore, we store these two properties on a per-driver (= per-label) basis.

==== Device nodes in RS ====

The 'service' (/bin/service) part of RS currently takes a "-dev" parameter whose argument is a file in /dev, to tell RS that it should use the major device number taken from that device node. With the new DevFS infrastructure, it will be the driver itself that creates its device nodes, so we cannot rely on the presence of a file in /dev for this anymore. Hence, 'service' needs to be changed in this respect, so that it specifies a major number directly, instead of using a file in /dev.
Additionally, the one single place that determines whether a device is a block or character special device is currently the static /dev itself: based on the file type (block or character special), VFS determines how to talk to the device driver. With DevFS, this information will be put into /dev rather than taken from /dev, so now it becomes necessary to specify this somewhere. For now, the most logical place for that is the same place where the 'major' of the device is specified: the 'service' utility.

While retaining the '-dev' parameter for backwards compatibility, 'service' should take two new parameters: -devtype and -devnr. The first one would take 'block' or 'char' and the second would take a number. For example, given the following /dev/c0d0 file in the static /dev we have now:

  brw-------  1 root  operator    3,   0 Dec 19  2007 /dev/c0d0

Instead of starting the first at_wini instance with this command (taken from /usr/src/drivers/memory/ramdisk/rc):

  /bin/service -c up /bin/at_wini -dev /dev/c0d0 -config /etc/drivers.conf -label at_wini_0

the starting command would become:

  /bin/service -c up /bin/at_wini -devtype block -devnr 3 -config /etc/drivers.conf -label at_wini_0

This eliminates the need for /dev/c0d0 to be present before at_wini is actually started.

The major number is already communicated from 'service' to RS (in 'RS_DEV_MAJOR' and the 'rss_major' field of 'struct rs_start'), but a new field needs to be added in both cases to let 'service' pass the "devtype" value to RS as well. For example, call them 'RS_DEV_TYPE' and 'rss_devtype'. In RS itself, a field needs to be added to 'struct rproc' to save this value, e.g. 'r_dev_type'. The reason for all this will be explained in the next section.

==== Device node tuples in DS ====

DS can store key-value pairs. The key is always a string; the value may be a string or a number. We will use only strings, for both key and value.
We use this notation for a mapping from the key "key" to the value "value":

  "key" => "value"

Upon successfully spawning a device driver, RS will create a string entry in DS based on this device driver's label, device type and device major number:

  "dev <label>" => "<type> <major>"

The <type> field is either a 'b' or a 'c' character (with obvious meanings), the <major> field is a decimal number in string form, and the two are separated by a space. In the future, more space-separated fields may be added at the end. For example, the first at_wini instance would have this entry:

  "dev at_wini_0" => "b 3"

Such fields specify the global properties of all device nodes created (indirectly) by that device driver.

The device nodes themselves are stored as follows: "node " The