Dev and Proc File Systems: Design Document

David C. van Moolenbroek
Dept. Computer Science, Vrije Universiteit Amsterdam, The Netherlands
dcvmoole AT cs DOT vu DOT nl


This document has gone stale. There is up-to-date documentation of VTreeFS.


The MINIX3 multiserver operating system is evolving rapidly. Much effort is being put into expanding the operating system's dependable, secure and flexible architecture to provide a modern UNIX-like operating environment on top. As part of that effort, this project aims to introduce functionality that is common to other operating systems: virtual /dev and /proc file systems.

  • The goal of the virtual /dev file system is to provide a pool of device nodes that is populated dynamically by device drivers, rather than statically in advance. This allows drivers to add device nodes based on hardware available in the system.
  • The goal of the virtual /proc file system is to provide a standard interface for providing information about the system. This obviates the need for utilities like ps(1) and top(1) to access internal system data structures directly.


The current project plan is to cover the following points, preferably in the given order:

  • A shared proc/dev file system library, managing virtual inodes and handling most of the VFS calls.
  • A /dev file system, based on the file system library.
  • Device node registration by existing device drivers.
  • A /proc file system, based on the file system library.
  • Modification of ps(1), top(1) and various other small utilities and library routines to make use of /proc.
  • Support for mounting file systems using a “none” device, i.e. no particular block device.
  • Changes to the MINIX3 boot process to mount /dev and /proc upon boot.
  • As much additional system information in /proc (from VM, Inet, RS etcetera) as time permits.

After each point there will be a (possibly short) testing phase to verify that the results are as required.


Both devfs and procfs will present a set of virtual files. From the user's point of view, both are read-only. As a result, both file systems will handle most requests from VFS in the exact same way. It therefore makes sense to put the common functionality into a library. This library should provide the functionality needed to create a virtual tree based read-only file system with ease. Specifically, the library is to provide the following:

  1. the main loop of the server;
  2. handlers for the requests from VFS;
  3. an interface to manipulate the virtual file system tree;
  4. callback hooks for refreshing of directories, and reading from regular files and symlinks.

While written to cover the demands of devfs/procfs, the library will be generic enough that new file systems that operate along the same lines can use it as well.

The concept

An application that uses the library can add and remove directories and other files. All files (including directories) are represented using the primary object of the library: the inode. The library will essentially manage a fully connected tree of inodes.

The library does not support hard links, so every inode except the root inode is also an entry in its parent directory. The entry is identified in that directory by name. To satisfy the requirements of ProcFS, an inode may also have an index associated with its entry in the directory. This optional index determines the inode's position when it is returned by a getdents() call.


VFS request handling

The library will provide a meaningful implementation for at least the following VFS requests:


The remaining calls may be implemented by returning ENOSYS.



Even though the library has to be written in C, its interface is close to object oriented: the inode is an opaque structure. The library's public header file should expose a “struct inode” declaration (so that the server using the library can make pointers) but not expose its internal fields. Creation, deletion, querying and manipulation of inodes and their properties takes place entirely through API calls.

struct inode;
typedef int index_t;
typedef void *cbdata_t;

#define NIL_INODE ((struct inode *)0)
#define NO_INDEX (-1)

struct inode_stat {
  mode_t mode;
  uid_t uid;
  gid_t gid;
  size_t size;
  dev_t dev;
};

struct fs_hooks {
  int (*lookup_hook)(struct inode *inode, char *name, cbdata_t cbdata);
  int (*getdents_hook)(struct inode *inode, cbdata_t cbdata);
  int (*read_hook)(struct inode *inode, off_t offset, char **ptr, size_t *len, cbdata_t cbdata);
  int (*rdlink_hook)(struct inode *inode, char *ptr, size_t *len, cbdata_t cbdata);
  int (*message_hook)(message *m);
};

struct inode *add_inode(struct inode *parent, char *name, index_t index, struct inode_stat *stat,
                        index_t nr_indexed_entries, cbdata_t cbdata);
void delete_inode(struct inode *inode);

struct inode *get_inode_by_name(struct inode *parent, char *name);
struct inode *get_inode_by_index(struct inode *parent, index_t index);

char const *get_inode_name(struct inode *inode);
index_t get_inode_index(struct inode *inode);
cbdata_t get_inode_cbdata(struct inode *inode);

struct inode *get_root_inode(void);
struct inode *get_parent_inode(struct inode *inode);
struct inode *get_first_inode(struct inode *parent);
struct inode *get_next_inode(struct inode *previous);

void get_inode_stat(struct inode *inode, struct inode_stat *stat);
void set_inode_stat(struct inode *inode, struct inode_stat *stat);

void start_vtreefs(struct fs_hooks *hooks, struct inode_stat *stat, index_t nr_indexed_entries);

The suggested specification of the API is as follows:

add_inode adds an inode into a parent inode (which must be a directory), with the given name. The index parameter indicates the index position for the inode in the parent directory, unless it's equal to NO_INDEX (or negative in general). The stat parameter points to a filled structure of inode metadata. This structure's mode field determines the file type and the access permissions (see /usr/include/sys/stat.h); at least directories (S_IFDIR), regular files (S_IFREG), character-special files (S_IFCHR), block-special files (S_IFBLK), and symbolic links (S_IFLNK) have to be supported. The uid, gid, size and dev fields specify the owning user and group ID, the size of the inode, and the device number (for block/char special files), respectively. The nr_indexed_entries parameter is only used for new directories (S_IFDIR), and indicates the range (0 to nr_indexed_entries-1) reserved for inodes with index numbers; this value may be 0 (and/or negative?) for directories that do not care about index numbers. The cbdata parameter specifies a caller-defined value passed to hook calls affecting this inode.

delete_inode removes the given inode. If the inode is a directory, all of its children will be removed recursively as well.

get_inode_by_name and get_inode_by_index return an inode given a directory and either a name or an index number. Both may fail, in which case they return NIL_INODE.

get_inode_name and get_inode_index return the name and index (or NO_INDEX) of a given inode, as assigned with the add_inode() call. The name pointer may simply point into the inode. get_inode_cbdata returns the cbdata value for an inode.

get_root_inode, get_parent_inode, get_first_inode and get_next_inode allow walking through the virtual tree of inodes, respectively retrieving the virtual tree's root inode, the parent inode of a given inode, the first child inode of a given parent, and the next inode in a series of children (given the previous result of get_first_inode() or get_next_inode()). The last three may return NIL_INODE if the directory does not have a parent (only in the case of the root directory), has no children, or has no more children, respectively.

get_inode_stat and set_inode_stat retrieve and manipulate inode metadata.

start_vtreefs starts the main loop of the vtree file system library, accepting requests from VFS and possibly other sources (passing those on to the application), and making the appropriate callbacks to the application based on the hooks given by the application. This API call will return when the file server is instructed to shut down by VFS and/or PM. The hooks parameter specifies a structure of function pointers just like MINIX3's libdriver does. Upon being started, the vtreefs library has to create a root inode; the stat and nr_indexed_entries parameters of start_vtreefs() determine the initial parameters of this root inode.
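To make the tree semantics above concrete, here is a minimal sketch of a statically allocated inode tree with add, lookup-by-name and recursive delete. It is not the library's implementation: the toy_ names, the pool size and the name length are assumptions made for illustration, the stat, index and cbdata parameters are left out, and lookups use a plain linear scan rather than hash tables.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_NR_INODES 8     /* assumption: size of the static inode pool */
#define TOY_NAME_LEN  32    /* assumption: stands in for NAME_MAX+1 */

struct toy_inode {
  int in_use;
  char name[TOY_NAME_LEN];
  struct toy_inode *parent;   /* NULL for the root and for free slots */
};

static struct toy_inode toy_pool[TOY_NR_INODES];

/* Slot 0 of the pool always serves as the root directory. */
struct toy_inode *toy_root(void)
{
  toy_pool[0].in_use = 1;
  return &toy_pool[0];
}

/* Look up a child of 'parent' by name; NULL if there is none. */
struct toy_inode *toy_get_by_name(struct toy_inode *parent, const char *name)
{
  int i;

  for (i = 1; i < TOY_NR_INODES; i++)
    if (toy_pool[i].in_use && toy_pool[i].parent == parent &&
        strcmp(toy_pool[i].name, name) == 0)
      return &toy_pool[i];
  return NULL;
}

/* Add a named child to 'parent'; NULL if the pool is full. */
struct toy_inode *toy_add(struct toy_inode *parent, const char *name)
{
  int i;

  /* Debug-mode style sanity checks. */
  assert(parent != NULL && parent->in_use);
  assert(strlen(name) < TOY_NAME_LEN);
  assert(toy_get_by_name(parent, name) == NULL);

  for (i = 1; i < TOY_NR_INODES; i++) {
    if (!toy_pool[i].in_use) {
      toy_pool[i].in_use = 1;
      toy_pool[i].parent = parent;
      strcpy(toy_pool[i].name, name);
      return &toy_pool[i];
    }
  }
  return NULL;
}

/* Delete an inode and, recursively, all of its children. */
void toy_delete(struct toy_inode *node)
{
  int i;

  assert(node != toy_root());
  for (i = 1; i < TOY_NR_INODES; i++)
    if (toy_pool[i].in_use && toy_pool[i].parent == node)
      toy_delete(&toy_pool[i]);
  node->in_use = 0;
  node->parent = NULL;
}
```

For example, adding a "1234" directory under the root and a "psinfo" file inside it, then deleting "1234", frees both slots again.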

Callback hooks

lookup_hook is called every time a lookup for an entry other than “.” and “..” is made on an inode that is a directory and is search-accessible by the caller of the REQ_LOOKUP call. The hook call is made right before the library does the actual name lookup. The provided inode is the directory inode, and cbdata is the callback data of that inode. name is the path component being looked up. This hook should allow the application to do at least the following things safely:

  • populate the given directory inode with inodes, leaving the precise lookup of the given name to the library;
  • delete the directory inode.

In the latter case, the hook implementation should return an error (typically ENOENT) to indicate that the lookup should not continue. If the hook returns OK, the library should continue the lookup.

getdents_hook is called every time a REQ_GETDENTS call is made on a directory inode. The hook call is made right before the library does the actual directory entry enumeration. The inode parameter is the inode of this directory, and cbdata is the callback data of this inode. The same semantics apply as for lookup_hook above.

read_hook is called when a user process reads from a regular (S_IFREG) file inode. inode and cbdata are the inode and callback data of this regular file, respectively. offset is the zero-based offset into the file from which reading should start, and len points to the requested read length. The hook implementation may return an error indicating why the file cannot be read. If the hook returns OK, then the library assumes that:

  • ptr is filled with a char * pointer to a static array containing the data to return, and,
  • len is filled with the length of the data (which may be less, but not more, than the original value of len).

However, if EOF is reached for the file, then the hook must return OK, and a length of 0 in len. The ptr value is then unused.

The return-pointer-to-array construction is there to avoid the overhead of memory copying the data from the application to the library on every read.

rdlink_hook is called when a user process reads from a symbolic link (S_IFLNK) inode. inode and cbdata are the inode and callback data of this symlink. ptr is a pointer to a memory area within the library, of a size pointed to by len. The hook implementation can write up to len bytes of data into ptr, and is expected to fill len with the number of bytes written (which may be less, but no more, than the original value of len). The data written must not contain any '\0' bytes. Library implementation suggestion: the memory area should be at least PATH_MAX bytes, but the given value of len may be less if the REQ_RDLINK provided a smaller length.

message_hook is called whenever the library's main loop receives a message that is not a request from VFS and not a SIGTERM signal notification from PM. The message parameter points to the message received. It is up to the hook implementer what to do with the message; the library will not send a reply by itself.

All hook pointers given in the fs_hooks structure may be NULL. If a hook pointer is NULL, the library must not call it, and instead use sensible defaults: for the lookup and getdents hooks, simply nothing changes in the request handling; for the read hook, EOF may be returned to REQ_READ calls; for the rdlink hook, the library may return an empty result; for the message hook, the message should simply be ignored.
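The read_hook contract described above (return a pointer into a static array, shrink len if needed, and signal EOF with OK and a length of 0) can be sketched as follows. The hook name, the file contents and the local definitions of OK and cbdata_t are assumptions made so that the example stands alone; a real procfs would generate its text from system data.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>

#define OK 0                  /* assumption: stands in for MINIX3's OK */

struct inode;                 /* opaque, as in the library header */
typedef void *cbdata_t;

/* Hypothetical file contents kept in a static array. */
static char example_text[] = "60\n";

/* Example read hook: hand back a window into the static buffer. */
int example_read_hook(struct inode *inode, off_t offset, char **ptr,
                      size_t *len, cbdata_t cbdata)
{
  size_t size = strlen(example_text);

  (void)inode;                /* unused in this sketch */
  (void)cbdata;
  assert(offset >= 0);

  if ((size_t)offset >= size) {
    *len = 0;                 /* at or past EOF: OK with length 0 */
    return OK;
  }
  if (*len > size - (size_t)offset)
    *len = size - (size_t)offset;   /* len may shrink, never grow */
  *ptr = example_text + offset;     /* no copy: point into the array */
  return OK;
}
```

Because the hook only returns a pointer, the library can copy the data straight to the requesting process without an intermediate buffer.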

Implementation requirements and hints

In random order.

  • The naming used in the specification is only a suggestion. In fact, the entire specification is based on what seems likely to suit the rest of the project, and may be changed later if we discover that a different API provides a more convenient model for devfs/procfs.
  • The library must not perform any dynamic memory allocation. The inodes may just be stored in a statically sized array. This means that the number of inodes is a compile-time configuration option.
  • The inode number of an inode (used in the VFS-FS protocol) may be a combination of its index into that array, and a per-inode generation number that gets increased every time an inode is reused for different purposes (just like MINIX3's endpoints).
  • The performance of the library is not the main concern (and must never come at the expense of code readability), but we would like to avoid quadratic (or worse) complexity. It seems sensible to have two hashtables: one for (parent,name) → inode lookups, and one for (parent,index) → inode lookups. Part of the testing phase may be gathering information about, and tuning, the used hashtable sizes and hashing algorithms.
  • The parent can be a single pointer field in an inode; the set of children for an inode may be a linked list (doubly linked, for cheap deletion). In total there will be a lot of linked list manipulation; it may be a good idea to use operations from <sys/queue.h> wherever appropriate.
  • In general, the library may assume that the caller knows what it is doing. However, it is highly recommended that debugging functionality be added (that can be turned on or off with a compile-time configuration option) to make sure that this is indeed the case. A simple example here is that in debugging mode, add_inode() may check whether the given parent already has a child inode with the given name, and delete_inode() may check whether the given inode is the root inode - or even a valid inode pointer at all. The existing assert() and panic() calls may be used in case of errors.
  • The space to store names must be of size NAME_MAX+1, that is, NAME_MAX actual characters and a terminating '\0' character. In debugging mode, providing a longer name to add_inode() is an error, etcetera.
  • For the REQ_GETDENTS implementation, the position field may be used as follows: position 0 and 1 are for “.” and “..”, positions 2 to nr_indexed_entries+1 are for indexed child inodes (typically not all of these are present, so many positions will be skipped here), and positions nr_indexed_entries+2 and onwards are for child inodes without an index number.
  • It is very important that the library is safe with regards to modifications made to the vtree by the application in the callback hook. In particular, if a callback returns an error, the library must not make any assumptions about the original inode still existing.
  • VFS-referenced inodes may be deleted through the API though. It is up to the library to determine whether it will keep around an inode that has been deleted but is still referenced by VFS. The easiest solution would just be to always delete the inode and just throw errors when VFS makes a request for an invalid inode number; a nicer solution would be to keep around the inode until it is not referenced anymore, but this might potentially be forever (taking up an inode forever). Obviously this has an impact on applications using /proc.
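Two of the hints above are simple arithmetic and can be sketched in a few lines: packing an array slot and a generation count into one inode number, and mapping a REQ_GETDENTS position onto ".", "..", an indexed child or an unindexed child. The function names, the 16/16 bit split and the pos_kind enumeration are assumptions chosen for illustration only.

```c
#include <assert.h>

#define SLOT_BITS 16                        /* assumption: at most 65536 slots */
#define SLOT_MASK ((1UL << SLOT_BITS) - 1)

/* Combine an inode's array slot with its generation count. The generation
 * is truncated to 16 bits, so it wraps around after 65536 reuses. */
unsigned long make_inode_nr(unsigned int slot, unsigned int generation)
{
  assert(slot <= SLOT_MASK);
  return ((unsigned long)(generation & 0xFFFFu) << SLOT_BITS) | slot;
}

unsigned int inode_nr_slot(unsigned long nr)
{
  return (unsigned int)(nr & SLOT_MASK);
}

unsigned int inode_nr_generation(unsigned long nr)
{
  return (unsigned int)((nr >> SLOT_BITS) & 0xFFFFu);
}

enum pos_kind { POS_DOT, POS_DOTDOT, POS_INDEXED, POS_UNINDEXED };

/* Classify a getdents position for a directory that reserves nr_indexed
 * index slots. For POS_INDEXED, *out receives the index to look up; for
 * POS_UNINDEXED, *out receives the ordinal among the unindexed children. */
enum pos_kind classify_pos(long pos, int nr_indexed, long *out)
{
  assert(pos >= 0 && nr_indexed >= 0);
  *out = 0;
  if (pos == 0)
    return POS_DOT;
  if (pos == 1)
    return POS_DOTDOT;
  if (pos < (long)nr_indexed + 2) {
    *out = pos - 2;
    return POS_INDEXED;
  }
  *out = pos - nr_indexed - 2;
  return POS_UNINDEXED;
}
```

With this encoding, a stale inode number held by VFS decodes to the right slot but the wrong generation, which the library can detect and answer with an error.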


To test the library, you'll have to write a simple meaningless file system implementation that uses the library and all its features, and perhaps a number of test programs/scripts that trigger all the VFS→FS calls on this file system. If done properly, that should be sufficient to determine whether the library works as intended.

Mounting a file system without specifying a block device is not possible yet (if time permits we'll look at that later in this project), so mounting a file system now requires a block device that you won't use anyway, e.g.:

  mount -t testfs /dev/fd0 /mnt

(this requires that you've installed the test file system to /sbin/testfs)

Note that because the block device thing is something that will go away, you don't have to implement REQ_BREAD_S/REQ_BWRITE_S in the library.



One major goal of the /proc part of the work is to be able to deny userland programs access to the getsysinfo(2) call. That means that all the information offered through getsysinfo that is currently used by userland programs has to be offered through /proc. Only procfs (and some other system servers, like IS) should use getsysinfo. This is much cleaner than the current approach, where userland programs obtain and parse raw copies of various system servers' process tables.

Files in /proc

At this moment, we anticipate that the /proc file system will provide at least the following files:

File                Type     Priority  Description
/proc/hz            text     +++       System clock frequency, in ticks per second
/proc/loadavg       text     +++       Load average, used by getloadavg(3) et al
/proc/uptime        text     +++       Uptime information, used by uptime(1) et al
/proc/version       text     +         System version information
/proc/pid/psinfo    text     +++       Process information for ps(1), top(1)
/proc/pid/status    text     +++       Human-readable process key, value pairs
/proc/pid/map       text     ++        Memory map of the process
/proc/pid/cmdline   text     ++        Command line of the process
/proc/pid/environ   text     ++        Environment variables of the process
/proc/pid/cwd       symlink  +         Current working directory of the process
/proc/pid/root      symlink  +         Root directory of the process
/proc/pid/exe       symlink  +         Executable file for the process
/proc/pid/fd/N      symlink  +         Open file descriptors of the process
/proc/net/tcp       text     ++        List of TCP connections, for tcpstat(1)
/proc/net/udp       text     ++        List of UDP connections, for udpstat(1)


  • Generally, the number of forks and exits will outweigh the number of accesses to /proc by far. Unlike DevFS, ProcFS will therefore not be actively updated about changes to the system status that it's interested in: this would simply cause too much overhead in the common case.
  • Updating of the directory structure for PIDs (/proc/pid) should therefore be “lazy”. On every access, ProcFS' VTreeFS hooks can check whether the information that it is about to send back to VFS is still up-to-date.
  • This information has to be obtained from PM, VFS and the kernel (using getsysinfo(2) calls to PM and VFS). Each such call returns a table of process entries (an array of slots, NR_TASKS+NR_PROCS in total).
  • The set of PIDs can easily change between subsequent getdents calls. To make sure that every PID is returned exactly once in a directory listing, ProcFS can use processes' slot numbers for the index value of the inodes. The maximum number of indexed entries of the root directory is therefore NR_TASKS+NR_PROCS.
  • To make sure that a PID is not reused without ProcFS knowing about it, the cbdata value of the /proc/pid directory inodes can be the endpoint for that PID.
  • ProcFS can also create only the containing directory for each PID upon access of the root directory, and fill a specific PID's directory with inodes ('psinfo', 'status' etc) as that directory is accessed.
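The lazy update scheme in the list above boils down to a per-slot decision that a lookup or getdents hook could make against a fresh process table snapshot. The sketch below shows only that decision; the slot_info structure, the action names and the function itself are hypothetical and do not correspond to existing MINIX3 structures.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical summary of one process table slot, as procfs might derive
 * it from the tables returned by getsysinfo(2). */
struct slot_info {
  int in_use;      /* nonzero if a process occupies the slot */
  int endpoint;    /* endpoint of that process */
};

enum refresh_action { ACT_NONE, ACT_ADD, ACT_KEEP, ACT_REMOVE, ACT_REPLACE };

/* Decide what to do with the /proc/<pid> directory whose index equals the
 * slot number. 'have_inode' tells whether such a directory exists, and
 * 'inode_endpoint' is the endpoint stored as its cbdata at creation time. */
enum refresh_action refresh_slot(const struct slot_info *slot,
                                 int have_inode, int inode_endpoint)
{
  assert(slot != NULL);

  if (!slot->in_use)
    return have_inode ? ACT_REMOVE : ACT_NONE;  /* process has exited */
  if (!have_inode)
    return ACT_ADD;                             /* new process in this slot */
  if (slot->endpoint == inode_endpoint)
    return ACT_KEEP;                            /* same process as before */
  return ACT_REPLACE;                           /* slot was reused */
}
```

ACT_REPLACE is the case that the endpoint in cbdata exists to catch: the slot is occupied both before and after, but by a different process.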


The file system side of devfs will use only a small subset of the vtreefs library, and should be very easy to implement. The /dev directory need not change compared to the way it is now: it may be completely flat and essentially offer only block-special and character-special files. The main chunk of this work is writing the infrastructure for letting device drivers add device nodes by making function calls.

General design

The general design of DevFS is as follows. A device driver requests the creation of a device node by making a call to RS. If this is allowed according to RS's policy, RS will publish (store) the information for the device node in DS, using a special key prefix. The devfs process subscribes to that special key prefix, and thereby gets a message when a new device node is published in DS. It will then retrieve the new entry (or entries), and update its virtual file system accordingly. Summarizing the process flow of creating a device node:

  Driver -> RS -> DS -> DevFS

Device nodes have a number of properties:

  • The name (e.g. “c0d3p1s0”, “null”)
  • The type (character or block)
  • The major device number
  • The minor device number
  • The file access mode (e.g. rw-r--r--)
  • The owning UID
  • The owning GID

Remember that the goal of devfs is to let device drivers create device nodes dynamically. In terms of properties, that means a device driver must be able to specify the name and the minor device number at the very least. Theoretically, device drivers should not have anything to do with the other POSIX semantics. Practically speaking, however, the driver will have to specify the file mode and the owning UID and GID. This part can and hopefully will be improved at some later time, with a much more elaborate policy specification in /etc/drivers.conf. Unfortunately, that change is estimated to be too much work for this project.

If the device driver indeed specifies the device name, minor number, UID, GID and file mode for each device, this leaves the device type and major number. Those properties are always the same for all device nodes belonging to a single device driver. It therefore makes sense not to specify them on a per-node basis. Additionally, we want to introduce as little redundancy as possible in the whole MINIX3 system. Therefore, we store these two properties on a per-driver (= per-label) basis.

Device nodes in RS

The 'service' (/bin/service) part of RS currently takes a “-dev <device>” parameter, where <device> is a file in /dev, to tell RS that it should use the major device taken from that device node. With the new DevFS infrastructure, it will be the driver itself that will create device nodes for it, so we cannot rely on the presence of a file in /dev for this anymore. Hence, 'service' needs to be changed in this respect, so that it specifies a major number directly, instead of using a file in /dev.

Additionally, the one single place that determines whether a device is a block or character special device, is currently the static /dev itself: based on the file type (block or character special), VFS determines how to talk to the device driver. With DevFS, this information will be put into /dev rather than taken from /dev, so now it becomes necessary to specify this somewhere. For now, the most logical place for that is the same place where the 'major' of the device is specified: the 'service' utility.

While retaining the '-dev' parameter for backwards compatibility, 'service' should take two new parameters: -devtype and -devnr. The first one would take 'block' or 'char' and the second would take a number. For example, given the following /dev/c0d0 file in the static /dev we have now:

  brw------- 1 root  operator  3,   0 Dec 19  2007 /dev/c0d0

Instead of starting the first at_wini instance with this command (taken from /usr/src/drivers/memory/ramdisk/rc):

  /bin/service -c up /bin/at_wini -dev /dev/c0d0 -config /etc/drivers.conf -label at_wini_0

... the starting command would become:

  /bin/service -c up /bin/at_wini -devtype block -devnr 3 -config /etc/drivers.conf -label at_wini_0

This eliminates the need for /dev/c0d0 to be present before at_wini is actually started.

The major number is already communicated from 'service' to RS (in 'RS_DEV_MAJOR' and the 'rss_major' field of 'struct rs_start'), but a new field needs to be added in both cases to let 'service' pass the “devtype” value to RS as well. For example, call them 'RS_DEV_TYPE' and 'rss_devtype'. In RS itself, a field needs to be added to 'struct rproc' to save this value, e.g. 'r_dev_type'. The reason for all this will be explained in the next section.

Device node tuples in DS

DS can store key-value pairs. The key is always a string; the value may be a string or a number. We will use only strings, for both key and value. We use this notation for a mapping from the key “key” to the value “value”:

  "key" => "value"

Upon successfully spawning a device driver, RS will create a string entry in DS based on this device driver's label, device type and device major number:

  "dev " <label> => <type> " " <major>

The <type> field is either a 'b' or a 'c' character (with obvious meanings), the <major> field is a decimal number in string form, and the two are separated by a space. In the future, more space-separated fields may be added at the end. For example, the first at_wini instance would have this entry:

  "dev at_wini_0" => "b 3"

Such fields specify the global properties of all device nodes created (indirectly) by that device driver.

The device nodes themselves are stored as follows:

  "node " <label> " " <name> => <minor> " " <mode> " " <uid> " " <gid>

The <label> field indicates the driver creating the node; the <name> field is the device name string; the <minor> field is the driver-assigned minor number of the device node. The <mode> field is the file mode to be used for the device node, but just the access permissions bits (i.e. NOT the block/char special type bits), in three-digit octal notation: basically, ranging from “000” (---------) to “777” (rwxrwxrwx). The <uid> and <gid> parts are decimal numbers, since currently no party is capable of doing user/group name to ID conversion safely - this too should be fixed using policy specifications later. As example, consider one of at_wini's device nodes:

  "node at_wini_0 c0d0p0s0" => "128 600 0 0"

This indicates that at_wini_0 has a device node with name “c0d0p0s0” (to end up as /dev/c0d0p0s0), with minor number 128, access mask 600 (rw-------), UID 0 (root) and GID 0 (operator). The “dev” entry with a matching label name (see above) provides the remaining information about the device node (namely, the device type and the major device number).
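As an illustration of the two string formats, the following sketch builds the key and value strings for a “dev” and a “node” entry. The function names and the buffer handling are assumptions; the code only formats strings and does not call DS.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Build "dev <label>" => "<type> <major>".
 * Returns 0 on success, -1 if either buffer is too small. */
int format_dev_entry(char *key, size_t keylen, char *val, size_t vallen,
                     const char *label, int is_block, int major)
{
  int r;

  assert(strlen(label) > 0 && major >= 0);
  r = snprintf(key, keylen, "dev %s", label);
  if (r < 0 || (size_t)r >= keylen)
    return -1;
  r = snprintf(val, vallen, "%c %d", is_block ? 'b' : 'c', major);
  if (r < 0 || (size_t)r >= vallen)
    return -1;
  return 0;
}

/* Build "node <label> <name>" => "<minor> <mode> <uid> <gid>".
 * Only the permission bits of 'mode' are stored, as three octal digits. */
int format_node_entry(char *key, size_t keylen, char *val, size_t vallen,
                      const char *label, const char *name,
                      int minor, unsigned int mode, int uid, int gid)
{
  int r;

  assert(strlen(label) > 0 && strlen(name) > 0);
  r = snprintf(key, keylen, "node %s %s", label, name);
  if (r < 0 || (size_t)r >= keylen)
    return -1;
  r = snprintf(val, vallen, "%d %03o %d %d", minor, mode & 0777u, uid, gid);
  if (r < 0 || (size_t)r >= vallen)
    return -1;
  return 0;
}
```

Called with the at_wini values used in the examples above, these functions reproduce the “dev at_wini_0” and “node at_wini_0 c0d0p0s0” entries exactly.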

RS call for device drivers

The following call should be specified in include/minix/rs.h, and implemented in lib/syslib/rs.c. Use include/minix/ds.h and lib/syslib/ds.c as reference examples of how to do this.

  int rs_register_node(char *name, dev_t minor, _mnx_Mode_t mode, _mnx_Uid_t uid, _mnx_Gid_t gid);

The field types of the actual message sent to RS should be defined in include/minix/com.h; there already is a section on RS there. For example, define and use these names and message fields for the message request type and fields:

message type  RS_REG_NODE   (RS_RQ_BASE + 8) or so
name          RS_NAME_ADDR  m2_p1
minor         RS_DEV_MINOR  m2_l1
mode          RS_DEV_MODE   m2_s1
uid           RS_DEV_UID    m2_i2
gid           RS_DEV_GID    m2_i3

Upon getting such an RS_REG_NODE request, RS should verify that the caller is indeed a system process (and if not, return an error), obtain the label of the calling process, and based on the label and given information, create a key and value string, use ds_publish_str to publish that string in DS, and return whatever return value that call produced, back to the device driver.

Implementation in RS

To reiterate: RS must create a proper “dev <label>” entry in DS upon spawning a device driver, and create a “node <label> <name>” entry with each RS_REG_NODE request. DS currently does not offer an interface to delete published entries, so RS need not be concerned about removing entries when a device driver is taken down. A new comment in the RS code indicating that this is left as future work would be great, though.

In every case, things must be set up in such a way that RS will always have published the “dev <label>” entry of a driver before making “node <label> <name>” entries on behalf of that driver. This will probably be what happens in the most straightforward implementation anyway, though.


The basic idea for DevFS is that it uses ds_subscribe() on “node .*”, and repeatedly calls ds_check_str() to get entries that have been changed. Based on that it can create device nodes. Device nodes will never be deleted as DS entries can not be deleted either.

Note that the “dev <label>” entries are not subscribed to: DevFS will pull those in on demand. Caching these within DevFS is not required in the initial DevFS version (but see below). In other words, for each “node <label> <name>” entry that DevFS retrieves using ds_check_str(), it can simply make a ds_retrieve_str() call on the corresponding “dev <label>” entry, retrieving the rest of the information that it needs to create the device node. If the ds_retrieve_str() call on the “dev <label>” entry fails for some reason, then DevFS may print a warning and ignore the original “node <label> <name>” entry.
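The retrieval step can be sketched as a parser that takes a “node” key, its value, and the value of the matching “dev” entry, and combines them into one record. The node_info structure, the function name and the 31-character limits on label and name are assumptions for illustration; the DS calls themselves are left out.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical record holding everything DevFS needs for one device node. */
struct node_info {
  char label[32];      /* assumption: labels fit in 31 characters */
  char name[32];       /* assumption: names fit in 31 characters */
  char type;           /* 'b' or 'c', from the "dev" entry */
  int major;           /* from the "dev" entry */
  int minor;
  unsigned int perm;   /* permission bits only, e.g. 0600 */
  int uid;
  int gid;
};

/* Parse one "node" key/value pair plus the matching "dev" value.
 * Returns 0 on success, -1 if any of the strings is malformed. */
int parse_node_entry(const char *key, const char *val, const char *devval,
                     struct node_info *out)
{
  assert(key != NULL && val != NULL && devval != NULL && out != NULL);
  memset(out, 0, sizeof(*out));

  if (sscanf(key, "node %31s %31s", out->label, out->name) != 2)
    return -1;
  if (sscanf(val, "%d %o %d %d",
             &out->minor, &out->perm, &out->uid, &out->gid) != 4)
    return -1;
  if (sscanf(devval, "%c %d", &out->type, &out->major) != 2)
    return -1;
  if (out->type != 'b' && out->type != 'c')
    return -1;
  out->perm &= 0777u;  /* ignore anything beyond the access bits */
  return 0;
}
```

The caller would then combine type and perm into a block-special or character-special mode and pass the result to add_inode(). A malformed “dev” value leads to -1, matching the case where DevFS prints a warning and ignores the node entry.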

It may be that a device driver, as a result of being restarted (e.g. due to a crash), re-registers its device nodes. DevFS can detect this by seeing that that name is already registered, with the same major device number as the original (the minor may be different the next time though!). DevFS must silently update the entry to the new values in this case.

It may also be that two device drivers register the exact same name (e.g. “node foo mynode” and “node bar mynode”). DevFS can detect this by seeing that the name is already registered, with a different major device number from the original. DevFS must print a warning if it happens, and may (but need not) replace the original entry with the new one.
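The two cases (a restarted driver re-registering a name, and two drivers claiming the same name) can be told apart by comparing major numbers. The sketch below does so on a small static table; the table, its size and the function name are assumptions, and on a conflict it keeps the original entry, which is one of the two behaviors the text allows.

```c
#include <assert.h>
#include <string.h>

#define DEVFS_MAX_NODES 64   /* assumption: table capacity */
#define DEVFS_NAME_LEN  32   /* assumption: stands in for NAME_MAX+1 */

enum reg_result { REG_CREATED, REG_UPDATED, REG_CONFLICT, REG_FULL };

struct dev_node {
  char name[DEVFS_NAME_LEN];
  int major;
  int minor;
};

static struct dev_node devfs_nodes[DEVFS_MAX_NODES];
static int devfs_nr_nodes;

/* Register a node name with its major and minor device numbers. */
enum reg_result devfs_register(const char *name, int major, int minor)
{
  int i;

  assert(strlen(name) < DEVFS_NAME_LEN);

  for (i = 0; i < devfs_nr_nodes; i++) {
    if (strcmp(devfs_nodes[i].name, name) != 0)
      continue;
    if (devfs_nodes[i].major != major)
      return REG_CONFLICT;          /* caller warns; original entry kept */
    devfs_nodes[i].minor = minor;   /* restarted driver: silent update */
    return REG_UPDATED;
  }

  if (devfs_nr_nodes == DEVFS_MAX_NODES)
    return REG_FULL;
  strcpy(devfs_nodes[devfs_nr_nodes].name, name);
  devfs_nodes[devfs_nr_nodes].major = major;
  devfs_nodes[devfs_nr_nodes].minor = minor;
  devfs_nr_nodes++;
  return REG_CREATED;
}
```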

Getting notification messages from DS (as a result of the ds_subscribe() call) can be handled by the message hook in VTreeFS.

If time permits, caching of the “dev <label>” entries should be added to DevFS, so that it stores the <label,type,major> tuples locally (possibly in an array of 256 elements, one per major; accessed by label via a hashtable) and need not call ds_retrieve_str() every time a node gets added.

Changing drivers

The next step is to change all drivers to dynamically create their own device nodes. They do this with the new rs_register_node() call. Note that, for example, at_wini needs to generate a different name prefix depending on which instance it is (“c0” for the first instance, “c1” for the second).


Changing the boot process



This is a very rough sketch.

  • vtreefs: one month.
  • procfs part 1: two weeks for the basic structure, the flat files, and most of the /proc/pid/ files.
  • midterm
  • procfs part 2: two weeks for rewriting ps(1) and top(1) and adding support for /proc/pid/cmdline.
  • devfs: two to three weeks.
soc/2009/procanddevfs/designdocument.txt · Last modified: 2014/11/11 19:11 (external edit)