Dev and Proc File Systems: Design Document

David C. van Moolenbroek
Dept. Computer Science, Vrije Universiteit Amsterdam, The Netherlands
dcvmoole AT cs DOT vu DOT nl

Warning

This document has gone stale. There is up-to-date documentation of VTreeFS.

Abstract

The MINIX3 multiserver operating system is evolving rapidly. Much effort is being put into expanding the operating system's dependable, secure and flexible architecture to provide a modern UNIX-like operating environment on top. As part of that effort, this project aims to introduce functionality that is common to other operating systems: virtual /dev and /proc file systems.

Overview

The current project plan is to cover the following points, preferably in the given order:

After each point there will be a (possibly short) testing phase to verify that the results are as required.

VTreeFS

Both devfs and procfs will present a set of virtual files. From the user's point of view, both are read-only. As a result, both file systems will handle most requests from VFS in the exact same way. It therefore makes sense to put the common functionality into a library. This library should provide the functionality needed to create a virtual tree based read-only file system with ease. Specifically, the library is to provide the following:

  1. the main loop of the server;
  2. handlers for the requests from VFS;
  3. an interface to manipulate the virtual file system tree;
  4. callback hooks for refreshing of directories, and reading from regular files and symlinks.

While written to cover the demands of devfs/procfs, the library will be generic enough that new file systems that operate along the same lines can use it as well.

The concept

An application that uses the library can add and remove directories and other files. All files (including directories) are represented using the primary object of the library: the inode. The library will essentially manage a fully connected tree of inodes.

The library does not support hard links, so every inode except the root inode is also an entry in its parent directory. The entry is identified in that directory by name. To satisfy the requirements of ProcFS, an inode may also have an index associated with its entry in the directory. This optional index determines the inode's position when it is returned by a getdents() call.
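The tree described above can be sketched as follows. This is only an illustration of one possible internal layout, not the library's actual implementation: the structure name sk_inode, its fields, the SK_ constants and the sk_ helper functions are all hypothetical, and the fixed name length is an assumption.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define SK_NO_INDEX (-1)
#define SK_NAME_MAX 32          /* assumed limit, not taken from the design */

/* Hypothetical internal layout; the real library keeps the inode opaque. */
struct sk_inode {
  struct sk_inode *parent;      /* NULL only for the root inode */
  struct sk_inode *first_child; /* head of this directory's child list */
  struct sk_inode *next;        /* next sibling within the parent */
  char name[SK_NAME_MAX];       /* entry name within the parent */
  int index;                    /* optional position, or SK_NO_INDEX */
};

/* Create an inode and link it into its parent (a NULL parent makes a root). */
static struct sk_inode *sk_add(struct sk_inode *parent, const char *name,
                               int index)
{
  struct sk_inode *node;

  assert(name != NULL);
  node = calloc(1, sizeof(*node));
  if (node == NULL)
    return NULL;
  strncpy(node->name, name, SK_NAME_MAX - 1);  /* calloc left the final '\0' */
  node->index = index;
  node->parent = parent;
  if (parent != NULL) {
    node->next = parent->first_child;
    parent->first_child = node;
  }
  return node;
}

/* Find a child by name; without hard links each inode has one entry. */
static struct sk_inode *sk_by_name(struct sk_inode *dir, const char *name)
{
  struct sk_inode *c;

  for (c = dir->first_child; c != NULL; c = c->next)
    if (strcmp(c->name, name) == 0)
      return c;
  return NULL;
}

/* Find a child by its optional index. */
static struct sk_inode *sk_by_index(struct sk_inode *dir, int index)
{
  struct sk_inode *c;

  if (index < 0)
    return NULL;
  for (c = dir->first_child; c != NULL; c = c->next)
    if (c->index == index)
      return c;
  return NULL;
}
```

Because an inode is linked into exactly one parent, the name and the index can live in the inode itself instead of in a separate directory entry object. A real implementation would probably use something faster than a linear list for large directories.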

Incomplete

VFS request handling

The library will provide a meaningful implementation for at least the following VFS requests:

The remaining calls may be implemented by returning ENOSYS.

Incomplete

The API

Even though the library has to be written in C, its interface is close to object oriented: the inode is an opaque structure. The library's public header file should expose a “struct inode” declaration (so that the server using the library can make pointers) but not expose its internal fields. Creation, deletion, querying and manipulation of inodes and their properties takes place entirely through API calls.

#!C
struct inode;
typedef int index_t;
typedef void *cbdata_t;

#define NIL_INODE ((struct inode *)0)
#define NO_INDEX (-1)

struct inode_stat {
  mode_t mode;
  uid_t uid;
  gid_t gid;
  size_t size;
  dev_t dev;
};

struct fs_hooks {
  int (*lookup_hook)(struct inode *inode, char *name, cbdata_t cbdata);
  int (*getdents_hook)(struct inode *inode, cbdata_t cbdata);
  int (*read_hook)(struct inode *inode, off_t offset, char **ptr, size_t *len, cbdata_t cbdata);
  int (*rdlink_hook)(struct inode *inode, char *ptr, size_t *len, cbdata_t cbdata);
  int (*message_hook)(message *m);
};

struct inode *add_inode(struct inode *parent, char *name, index_t index, struct inode_stat *stat,
                        index_t nr_indexed_entries, cbdata_t cbdata);
void delete_inode(struct inode *inode);

struct inode *get_inode_by_name(struct inode *parent, char *name);
struct inode *get_inode_by_index(struct inode *parent, index_t index);

char const *get_inode_name(struct inode *inode);
index_t get_inode_index(struct inode *inode);
cbdata_t get_inode_cbdata(struct inode *inode);

struct inode *get_root_inode(void);
struct inode *get_parent_inode(struct inode *inode);
struct inode *get_first_inode(struct inode *parent);
struct inode *get_next_inode(struct inode *previous);

void get_inode_stat(struct inode *inode, struct inode_stat *stat);
void set_inode_stat(struct inode *inode, struct inode_stat *stat);

void start_vtreefs(struct fs_hooks *hooks, struct inode_stat *stat, index_t nr_indexed_entries);

The suggested specification of the API is as follows:

add_inode adds an inode into a parent inode (which must be a directory), with the given name. The index parameter indicates the index position for the inode in the parent directory, unless it's equal to NO_INDEX (or negative in general). The stat parameter points to a filled structure of inode metadata. This structure's mode field determines the file type and the access permissions (see /usr/include/sys/stat.h); at least directories (S_IFDIR), regular files (S_IFREG), character-special files (S_IFCHR), block-special files (S_IFBLK), and symbolic links (S_IFLNK) have to be supported. The uid, gid, size and dev fields specify the owning user and group ID, the size of the inode, and the device number (for block/char special files), respectively. The nr_indexed_entries parameter is only used for new directories (S_IFDIR), and indicates the range (0 to nr_indexed_entries-1) reserved for inodes with index numbers; this value may be 0 (and/or negative?) for directories that do not care about index numbers. The cbdata parameter specifies a caller-defined value passed to hook calls affecting this inode.

delete_inode removes the given inode. If the inode is a directory, all of its children will be removed recursively as well.

get_inode_by_name and get_inode_by_index return an inode given a directory and either a name or an index number. Both may fail, and in that case return NIL_INODE.

get_inode_name and get_inode_index return the name and index (or NO_INDEX) of a given inode, as assigned with the add_inode() call. The name pointer may simply point into the inode. get_inode_cbdata returns the cbdata value for an inode.

get_root_inode, get_parent_inode, get_first_inode and get_next_inode allow walking through the virtual tree of inodes, respectively retrieving the virtual tree's root inode, the parent inode of a given inode, the first child inode of a given parent, and the next inode in a series of children (given the previous result of get_first_inode() or get_next_inode()). The last three may return NIL_INODE if the directory does not have a parent (only in the case of the root directory), has no children, or has no more children, respectively.

get_inode_stat and set_inode_stat retrieve and manipulate inode metadata.

start_vtreefs starts the main loop of the vtree file system library, accepting requests from VFS and possibly other sources (passing those on to the application), and making the appropriate callbacks to the application based on the hooks given by the application. This API call will return when the file server is instructed to shut down by VFS and/or PM. The hooks parameter specifies a structure of function pointers just like MINIX3's libdriver does. Upon being started, the vtreefs library has to create a root inode; the stat and nr_indexed_entries parameters of start_vtreefs() determine the initial parameters of this root inode.

Callback hooks

lookup_hook is called every time a lookup for an entry other than “.” and “..” is made on an inode that is a directory and is search-accessible by the caller of the REQ_LOOKUP call. The hook call is made right before the library does the actual name lookup. The provided inode is the directory inode, and cbdata is the callback data of that inode. name is the path component being looked up. This hook should allow the application to do at least the following things safely:

In the latter case, the hook implementation should return an error (typically ENOENT) to indicate that the lookup function should not continue. If the hook returns OK, the library should continue the lookup.

getdents_hook is called every time a REQ_GETDENTS call is made on a directory inode. The hook call is made right before the library does the actual directory entry enumeration. The inode parameter is the inode of this directory, and cbdata is the callback data of this inode. The same semantics apply as for lookup_hook above.

read_hook is called when a user process reads from a regular (S_IFREG) file inode. inode and cbdata are the inode and callback data of this regular file, respectively. offset is the zero-based offset into the file from which reading should start, and len points to the requested read length. The hook implementation may return an error indicating why the file cannot be read. If the hook returns OK, then the library assumes that:

However, if EOF is reached for the file, then the hook must return OK, and a length of 0 in len. The ptr value is then unused.

The return-pointer-to-array construction is there to avoid the overhead of memory copying the data from the application to the library on every read.
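As an illustration of the read hook contract, a hook for a small text file might look like the sketch below. It assumes that on an OK return the library reads *len bytes starting at *ptr; that assumption is inferred from the pointer-return design and is not spelled out above. The file contents, the OK stand-in and the name example_read_hook are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>

#ifndef OK
#define OK 0                    /* stand-in for the MINIX OK status */
#endif

struct inode;                   /* opaque, as in the API above */
typedef void *cbdata_t;

/* Hypothetical file contents, kept in memory owned by the application. */
static char hz_text[] = "60\n";

/* Read hook sketch: return a pointer into hz_text instead of copying. */
static int example_read_hook(struct inode *inode, off_t offset, char **ptr,
                             size_t *len, cbdata_t cbdata)
{
  size_t size = strlen(hz_text);

  (void)inode;
  (void)cbdata;
  assert(ptr != NULL && len != NULL);

  /* Offsets outside the file are reported as EOF: OK with a length of 0. */
  if (offset < 0 || (size_t)offset >= size) {
    *len = 0;
    return OK;
  }

  /* Clamp the requested length to what is left in the file. */
  if (*len > size - (size_t)offset)
    *len = size - (size_t)offset;
  *ptr = hz_text + offset;
  return OK;
}
```

The buffer has to stay valid after the hook returns, which is why the sketch points into static storage and not into a local array.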

rdlink_hook is called when a user process reads from a symbolic link (S_IFLNK) inode. inode and cbdata are the inode and callback data of this symlink. ptr is a pointer to a memory area within the library, of a size pointed to by len. The hook implementation can write up to len bytes of data into ptr, and is expected to fill len with the number of bytes written (which may be less, but no more, than the original value of len). The data written must not contain any '\0' bytes. Library implementation suggestion: the memory area should be at least PATH_MAX bytes, but the given value of len may be less if the REQ_RDLINK provided a smaller length.
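A rdlink hook following these rules could be sketched as below. The choice to store the link target as a string in cbdata is an assumption made for the example, as are the OK stand-in and the name example_rdlink_hook.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#ifndef OK
#define OK 0                    /* stand-in for the MINIX OK status */
#endif

struct inode;                   /* opaque, as in the API above */
typedef void *cbdata_t;

/* Rdlink hook sketch: copy the link target into the library's buffer. */
static int example_rdlink_hook(struct inode *inode, char *ptr, size_t *len,
                               cbdata_t cbdata)
{
  /* Assumption: the application stored the target path as cbdata. */
  const char *target = (const char *)cbdata;
  size_t n;

  (void)inode;
  assert(ptr != NULL && len != NULL && target != NULL);

  n = strlen(target);
  if (n > *len)
    n = *len;                   /* never write more than the buffer holds */
  memcpy(ptr, target, n);       /* no '\0' byte is written */
  *len = n;                     /* report the number of bytes written */
  return OK;
}
```

Using strlen() and memcpy() rather than strcpy() keeps the terminating '\0' out of the result, as required above.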

message_hook is called whenever the library's main loop receives a message that is not a request from VFS and not a SIGTERM signal notification from PM. The message parameter points to the message received. It is up to the hook implementer what to do with the message; the library will not send a reply by itself.

All hook pointers given in the fs_hooks structure may be NULL. If a hook pointer is NULL, the library must not call it, and instead use sensible defaults: for the lookup and getdents hooks, simply nothing changes in the request handling; for the read hook, EOF may be returned to REQ_READ calls; for the rdlink hook, the library may return an empty result; for the message hook, the message should simply be ignored.
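The NULL defaults for the lookup and read hooks can be sketched as thin wrappers inside the library. The reduced structure hooks_subset, the wrapper names and the sample hook deny_lookup are hypothetical and only serve to keep the example self-contained.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>

#ifndef OK
#define OK 0                    /* stand-in for the MINIX OK status */
#endif

struct inode;                   /* opaque, as in the API above */
typedef void *cbdata_t;

/* Reduced copy of fs_hooks with only two of the hooks, for illustration. */
struct hooks_subset {
  int (*lookup_hook)(struct inode *inode, char *name, cbdata_t cbdata);
  int (*read_hook)(struct inode *inode, off_t offset, char **ptr,
                   size_t *len, cbdata_t cbdata);
};

/* NULL lookup hook: nothing changes, the library's own lookup proceeds. */
static int call_lookup_hook(const struct hooks_subset *hooks,
                            struct inode *inode, char *name, cbdata_t cbdata)
{
  assert(hooks != NULL);
  if (hooks->lookup_hook == NULL)
    return OK;
  return hooks->lookup_hook(inode, name, cbdata);
}

/* NULL read hook: report EOF by returning OK with a length of 0. */
static int call_read_hook(const struct hooks_subset *hooks,
                          struct inode *inode, off_t offset, char **ptr,
                          size_t *len, cbdata_t cbdata)
{
  assert(hooks != NULL && len != NULL);
  if (hooks->read_hook == NULL) {
    *len = 0;
    return OK;
  }
  return hooks->read_hook(inode, offset, ptr, len, cbdata);
}

/* Sample application hook that hides every entry. */
static int deny_lookup(struct inode *inode, char *name, cbdata_t cbdata)
{
  (void)inode;
  (void)name;
  (void)cbdata;
  return ENOENT;
}
```

Routing every hook call through such a wrapper keeps the NULL checks in one place, so the request handlers themselves need not know whether a hook was provided.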

Implementation requirements and hints

In random order.

Testing

To test the library, you'll have to write a simple meaningless file system implementation that uses the library and all its features, and perhaps a number of test programs/scripts that trigger all the VFS→FS calls on this file system. If done properly, that should be sufficient to determine whether the library works as intended.

Mounting a file system without specifying a block device is not possible yet (if time permits we'll look at that later in this project), so mounting a file system now requires a block device that you won't use anyway, e.g.:

  mount -t testfs /dev/fd0 /mnt

(this requires that you've installed the test file system to /sbin/testfs)

Note that because the block device thing is something that will go away, you don't have to implement REQ_BREAD_S/REQ_BWRITE_S in the library.

Incomplete


ProcFS

One major goal of the /proc part of the work is to be able to deny access to the getsysinfo(2) call for userland programs. That means that all the information offered through getsysinfo that is currently used by userland programs has to be offered through /proc. Only procfs (and some other system servers, like IS) should use getsysinfo. This is much cleaner than the current approach, where userland programs obtain and parse raw copies of various system servers' process tables.

Files in /proc

At this moment, we anticipate that the /proc file system will provide at least the following files:

File                Type     Priority  Description
/proc/hz            text     +++       System clock frequency, in ticks per second
/proc/loadavg       text     +++       Load average, used by getloadavg(3) et al
/proc/uptime        text     +++       Uptime information, used by uptime(1) et al
/proc/version       text     +         System version information
/proc/pid/psinfo    text     +++       Process information for ps(1), top(1)
/proc/pid/status    text     +++       Human-readable process key, value pairs
/proc/pid/map       text     ++        Memory map of the process
/proc/pid/cmdline   text     ++        Command line of the process
/proc/pid/environ   text     ++        Environment variables of the process
/proc/pid/cwd       symlink  +         Current working directory of the process
/proc/pid/root      symlink  +         Root directory of the process
/proc/pid/exe       symlink  +         Executable file for the process
/proc/pid/fd/N      symlink  +         Open file descriptors of the process
/proc/net/tcp       text     ++        List of TCP connections, for tcpstat(1)
/proc/net/udp       text     ++        List of UDP connections, for udpstat(1)

VTreeFS


DevFS

The file system side of devfs will use only a small subset of the vtreefs library, and should be very easy to implement. The /dev directory need not change compared to the way it is now: it may be completely flat and essentially offer only block-special and character-special files. The main chunk of this work is writing the infrastructure for letting device drivers add device nodes by making function calls.

General design

The general design of DevFS is as follows. A device driver requests the creation of a device node by making a call to RS. If this is allowed according to RS's policy, RS will publish (store) the information for the device node in DS, using a special key prefix. The devfs process subscribes to that special key prefix, and thereby gets a message when a new device node is published in DS. It will then retrieve the new entry (or entries), and update its virtual file system accordingly. Summarizing the process flow of creating a device node:

  Driver -> RS -> DS -> DevFS

Device nodes have a number of properties:

Remember that the goal of devfs is to let device drivers create device nodes dynamically. In terms of properties, that means a device driver must be able to specify the name and the minor device number at the very least. Theoretically, device drivers should not have anything to do with the other POSIX semantics. Practically speaking, however, the driver will have to specify the file mode and the owning UID and GID. This part can and hopefully will be improved at some later time, with a much more elaborate policy specification in /etc/drivers.conf. Unfortunately, that change is estimated to be too much work for this project.

If the device driver indeed specifies the device name, minor number, UID, GID and file mode for each device, this leaves the device type and major number. Those properties are always the same for all device nodes belonging to a single device driver. It therefore makes sense not to specify them on a per-node basis. Additionally, we want to introduce as little redundancy as possible in the whole MINIX3 system. Therefore, we store these two properties on a per-driver (= per-label) basis.

Device nodes in RS

The 'service' (/bin/service) part of RS currently takes a “-dev <device>” parameter, where <device> is a file in /dev, to tell RS that it should use the major device taken from that device node. With the new DevFS infrastructure, it will be the driver itself that will create device nodes for it, so we cannot rely on the presence of a file in /dev for this anymore. Hence, 'service' needs to be changed in this respect, so that it specifies a major number directly, instead of using a file in /dev.

Additionally, the one single place that determines whether a device is a block or character special device, is currently the static /dev itself: based on the file type (block or character special), VFS determines how to talk to the device driver. With DevFS, this information will be put into /dev rather than taken from /dev, so now it becomes necessary to specify this somewhere. For now, the most logical place for that is the same place where the 'major' of the device is specified: the 'service' utility.

While retaining the '-dev' parameter for backwards compatibility, 'service' should take two new parameters: -devtype and -devnr. The first one would take 'block' or 'char' and the second would take a number. For example, given the following /dev/c0d0 file in the static /dev we have now:

  brw------- 1 root  operator  3,   0 Dec 19  2007 /dev/c0d0

Instead of starting the first at_wini instance with this command (taken from /usr/src/drivers/memory/ramdisk/rc):

  /bin/service -c up /bin/at_wini -dev /dev/c0d0 -config /etc/drivers.conf -label at_wini_0

...the starting command would become:

  /bin/service -c up /bin/at_wini -devtype block -devnr 3 -config /etc/drivers.conf -label at_wini_0

This eliminates the need for /dev/c0d0 to be present before at_wini is actually started.

The major number is already communicated from 'service' to RS (in 'RS_DEV_MAJOR' and the 'rss_major' field of 'struct rs_start'), but a new field needs to be added in both cases to let 'service' pass the “devtype” value to RS as well. For example, call them 'RS_DEV_TYPE' and 'rss_devtype'. In RS itself, a field needs to be added to 'struct rproc' to save this value, e.g. 'r_dev_type'. The reason for all this will be explained in the next section.

Device node tuples in DS

DS can store key-value pairs. The key is always a string; the value may be a string or a number. We will use only strings, for both key and value. We use this notation for a mapping from the key “key” to the value “value”:

  "key" => "value"

Upon successfully spawning a device driver, RS will create a string entry in DS based on this device driver's label, device type and device major number:

  "dev " <label> => <type> " " <major>

The <type> field is either a 'b' or a 'c' character (with obvious meanings), the <major> field is a decimal number in string form, and the two are separated by a space. In the future, more space-separated fields may be added at the end. For example, the first at_wini instance would have this entry:

  "dev at_wini_0" => "b 3"

Such fields specify the global properties of all device nodes created (indirectly) by that device driver.
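Decoding the value of such a “dev <label>” entry could look like the following sketch. The function name parse_dev_value and its return convention are hypothetical; only the "<type> <major>" format comes from the description above.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Parse a "dev <label>" value such as "b 3".
 * Returns 0 on success and -1 if the value is malformed.
 */
static int parse_dev_value(const char *value, char *type, int *major)
{
  char t;
  int m;

  assert(value != NULL && type != NULL && major != NULL);

  /* The type character must be followed by exactly one space. */
  if (value[0] == '\0' || value[1] != ' ')
    return -1;
  if (sscanf(value, "%c %d", &t, &m) != 2)
    return -1;
  if (t != 'b' && t != 'c')
    return -1;
  if (m < 0)
    return -1;

  *type = t;
  *major = m;
  return 0;
}
```

Anything after the major number is ignored by sscanf(), so values that gain extra space-separated fields in the future would still parse.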

The device nodes themselves are stored as follows:

  "node " <label> " " <name> => <minor> " " <mode> " " <uid> " " <gid>

The <label> field indicates the driver creating the node; the <name> field is the device name string; the <minor> field is the driver-assigned minor number of the device node. The <mode> field is the file mode to be used for the device node, but just the access permissions bits (i.e. NOT the block/char special type bits), in three-digit octal notation: basically, ranging from “000” (---------) to “777” (rwxrwxrwx). The <uid> and <gid> parts are decimal numbers, since currently no party is capable of doing user/group name to ID conversion safely - this too should be fixed using policy specifications later. As an example, consider one of at_wini's device nodes:

  "node at_wini_0 c0d0p0s0" => "128 600 0 0"

This indicates that at_wini_0 has a device node with name “c0d0p0s0” (to end up as /dev/c0d0p0s0), with minor number 128, access mask 600 (rw-------), UID 0 (root) and GID 0 (operator). The “dev” entry with a matching label name (see above) provides the remaining information about the device node (namely, the device type and the major device number).
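On the receiving side, a “node” key and value pair could be decoded roughly as follows. The structure node_entry, the function parse_node_entry and the 31-character limits on label and name are assumptions made for this sketch.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Decoded form of one "node" entry; the field sizes are assumptions. */
struct node_entry {
  char label[32];       /* driver label, e.g. "at_wini_0" */
  char name[32];        /* device name, e.g. "c0d0p0s0" */
  int minor;            /* driver-assigned minor number */
  unsigned int mode;    /* access permission bits only */
  int uid;
  int gid;
};

/*
 * Parse a key such as "node at_wini_0 c0d0p0s0" with a value such as
 * "128 600 0 0". Returns 0 on success and -1 if either part is malformed.
 */
static int parse_node_entry(const char *key, const char *value,
                            struct node_entry *e)
{
  assert(key != NULL && value != NULL && e != NULL);

  memset(e, 0, sizeof(*e));
  if (sscanf(key, "node %31s %31s", e->label, e->name) != 2)
    return -1;
  /* The mode is octal, all other numbers are decimal. */
  if (sscanf(value, "%d %o %d %d", &e->minor, &e->mode, &e->uid,
             &e->gid) != 4)
    return -1;
  if (e->mode > 0777)
    return -1;          /* type bits do not belong in this field */
  return 0;
}
```

With the example entry above, the sketch yields minor 128 and mode 0600; the device type and major number still have to come from the matching “dev” entry.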

RS call for device drivers

The following call should be specified in include/minix/rs.h, and implemented in lib/syslib/rs.c. Use include/minix/ds.h and lib/syslib/ds.c as reference examples of how to do this.

  int rs_register_node(char *name, dev_t minor, _mnx_Mode_t mode, _mnx_Uid_t uid, _mnx_Gid_t gid);

The field types of the actual message sent to RS should be defined in include/minix/com.h; there already is a section on RS there. For example, define and use these names and message fields for the message request type and fields:

message type  RS_REG_NODE   (RS_RQ_BASE + 8) or so
name          RS_NAME_ADDR  m2_p1
              RS_NAME_LEN   m2_i1
minor         RS_DEV_MINOR  m2_l1
mode          RS_DEV_MODE   m2_s1
uid           RS_DEV_UID    m2_i2
gid           RS_DEV_GID    m2_i3

Upon getting such an RS_REG_NODE request, RS should verify that the caller is indeed a system process (and if not, return an error), obtain the label of the calling process, and based on the label and given information, create a key and value string, use ds_publish_str to publish that string in DS, and return whatever return value that call produced, back to the device driver.
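The string construction step in RS could be sketched as below. The helper name build_node_entry and its parameter list are hypothetical; the actual publishing through ds_publish_str and the system-process check are left out because they depend on the MINIX environment.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Build the DS key and value for one device node:
 *   key   "node <label> <name>"
 *   value "<minor> <mode> <uid> <gid>", mode as three octal digits.
 * Returns 0 on success and -1 if either string does not fit.
 */
static int build_node_entry(char *key, size_t keylen, char *val,
                            size_t vallen, const char *label,
                            const char *name, int minor, unsigned int mode,
                            int uid, int gid)
{
  int n;

  assert(key != NULL && val != NULL && label != NULL && name != NULL);

  n = snprintf(key, keylen, "node %s %s", label, name);
  if (n < 0 || (size_t)n >= keylen)
    return -1;          /* key would have been truncated */

  /* Only the permission bits are stored; the type comes from "dev". */
  n = snprintf(val, vallen, "%d %03o %d %d", minor, mode & 0777, uid, gid);
  if (n < 0 || (size_t)n >= vallen)
    return -1;          /* value would have been truncated */

  return 0;
}
```

Since spaces separate the fields of the key, a real implementation may also want to reject device names that contain a space before publishing them.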

Implementation in RS

To reiterate: RS must create a proper “dev <label>” entry in DS upon spawning a device driver, and create a “node <label> <name>” entry with each RS_REG_NODE request. DS currently does not offer an interface to delete published entries, so RS need not be concerned about removing entries when a device driver is taken down. A new comment in the RS code indicating that this is left as future work would be great, though.

In every case, things must be set up in such a way that RS will always have published the “dev <label>” entry of a driver before making “node <label> <name>” entries on behalf of that driver. This will probably be what happens in the most straightforward implementation anyway, though.

DevFS

The basic idea for DevFS is that it uses ds_subscribe() on “node .*”, and repeatedly calls ds_check_str() to get entries that have been changed. Based on that it can create device nodes. Device nodes will never be deleted, as DS entries cannot be deleted either.

Note that the “dev <label>” entries are not subscribed to: DevFS will pull those in on demand. Caching these within DevFS is not required in the initial DevFS version (but see below). In other words, for each “node <label> <name>” entry that DevFS retrieves using ds_check_str(), it can simply make a ds_retrieve_str() call on the corresponding “dev <label>” entry, retrieving the rest of the information that it needs to create the device node. If the ds_retrieve_str() call on the “dev <label>” entry fails for some reason, then DevFS may print a warning and ignore the original “node <label> <name>” entry.

It may be that a device driver, as a result of being restarted (e.g. due to a crash), re-registers its device nodes. DevFS can detect this by seeing that that name is already registered, with the same major device number as the original (the minor may be different the next time though!). DevFS must silently update the entry to the new values in this case.

It may also be that two device drivers register the exact same name (e.g. “node foo mynode” and “node bar mynode”). DevFS can detect this by seeing that the name is already registered, with a different major device number from the original. DevFS must print a warning if it happens, and may (but need not) replace the original entry with the new one.
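The two cases above, re-registration by a restarted driver versus a name clash between two drivers, can be told apart by comparing major numbers. The sketch below shows that decision; the dev_node structure, the reg_action values and both function names are hypothetical, and on a conflict it chooses to keep the original entry, which is one of the two permitted behaviours.

```c
#include <assert.h>
#include <stdio.h>

enum reg_action { REG_CREATE, REG_UPDATE, REG_CONFLICT };

/* Hypothetical record of an already registered device node. */
struct dev_node {
  char name[32];
  int major;
  int minor;
};

/* Decide what a new registration for a given name means. */
static enum reg_action classify_registration(const struct dev_node *existing,
                                             int new_major)
{
  if (existing == NULL)
    return REG_CREATE;          /* name not seen before */
  if (existing->major == new_major)
    return REG_UPDATE;          /* same driver, probably restarted */
  return REG_CONFLICT;          /* another driver claims the same name */
}

/* Apply the decision to an existing node, if there is one. */
static enum reg_action apply_registration(struct dev_node *existing,
                                          int new_major, int new_minor)
{
  enum reg_action action = classify_registration(existing, new_major);

  if (action == REG_UPDATE) {
    existing->minor = new_minor;        /* silently take the new values */
  } else if (action == REG_CONFLICT) {
    /* Warn and keep the original entry (replacing it is also allowed). */
    fprintf(stderr, "devfs: name '%s' already registered by major %d\n",
            existing->name, existing->major);
  }
  return action;
}
```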

Getting notification messages from DS (as a result of the ds_subscribe() call) can be handled by the message hook in VTreeFS.

If time permits, caching of the “dev <label>” entries should be added to DevFS, so that it stores the <label,type,major> tuples locally (possibly in an array of 256 elements, one per major; accessed by label via a hashtable) and need not call ds_retrieve_str() every time a node gets added.
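A minimal version of such a cache is sketched below, with one slot per major number. For brevity it looks labels up with a linear scan where the text suggests a hashtable; the names dev_cache, cache_put and cache_get and the label length limit are hypothetical.

```c
#include <assert.h>
#include <string.h>

#define NR_MAJORS 256
#define LABEL_MAX 16            /* assumed limit, including the '\0' */

/* One cached <label,type,major> tuple; the major is the array index. */
struct dev_cache_entry {
  char label[LABEL_MAX];
  char type;                    /* 'b' or 'c' */
  int used;
};

static struct dev_cache_entry dev_cache[NR_MAJORS];

/* Remember the tuple for a driver. Returns 0 on success, -1 on bad input. */
static int cache_put(const char *label, char type, int major)
{
  assert(label != NULL);
  if (major < 0 || major >= NR_MAJORS)
    return -1;
  if (strlen(label) >= LABEL_MAX)
    return -1;
  strcpy(dev_cache[major].label, label);
  dev_cache[major].type = type;
  dev_cache[major].used = 1;
  return 0;
}

/*
 * Look a driver up by label. Returns 0 and fills in type and major on a
 * hit, or -1 on a miss, in which case the caller would fall back to
 * retrieving the "dev <label>" entry from DS.
 */
static int cache_get(const char *label, char *type, int *major)
{
  int i;

  assert(label != NULL && type != NULL && major != NULL);
  for (i = 0; i < NR_MAJORS; i++) {
    if (dev_cache[i].used && strcmp(dev_cache[i].label, label) == 0) {
      *type = dev_cache[i].type;
      *major = i;
      return 0;
    }
  }
  return -1;
}
```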

Changing drivers

The next step is to change all drivers to dynamically create their own device nodes. They do this with the new rs_register_node() call. Note that, for example, at_wini needs to generate a different name prefix depending on which instance it is (“c0” for the first instance, “c1” for the second).
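Generating the instance-dependent name could be as simple as the following sketch. The helper build_disk_name is hypothetical, and the way a driver learns its own instance number is not covered here.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Build a disk node name such as "c0d0" from the driver instance number
 * and the drive number. Returns 0 on success, -1 on bad input or if the
 * name does not fit in the buffer.
 */
static int build_disk_name(char *buf, size_t len, int instance, int drive)
{
  int n;

  assert(buf != NULL);
  if (instance < 0 || drive < 0)
    return -1;
  n = snprintf(buf, len, "c%dd%d", instance, drive);
  if (n < 0 || (size_t)n >= len)
    return -1;
  return 0;
}
```

The resulting string would then be passed as the name argument of rs_register_node(), together with the matching minor number.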

Incomplete

Changing the boot process

TBD

Planning

This is a very rough sketch.