
The Block Device protocol

This page provides the official documentation of the block device driver protocol of MINIX 3. It describes the protocol used between VFS and file system servers on one side, and block drivers on the other. The current version documents the protocol used in git commit 8a0b9de and later. If you update this document because of changes to MINIX 3, please mention the commit ID of the change in the wiki comment.

General information

The following information is written for people implementing a block device driver, and people implementing a system server that talks to a block driver.

The libblockdriver and libbdev libraries

It is highly recommended that block drivers be written on top of the libblockdriver library. This library takes care of a wide range of tasks common to many or all block devices, and disk drivers in particular. The use of this library allows future protocol changes to be made in one place only. This page provides no documentation on libblockdriver itself, but tries to mention whenever libblockdriver implements a certain feature.

It is highly recommended that users of block drivers (file systems in particular) use the libbdev library. This library provides a common interface for communicating with block devices.

Instances

Each block driver instance is responsible for one or more device controllers. Writers of new block drivers are highly encouraged to have each driver instance be responsible for only one controller, which itself may have several devices attached. A finer-grained division of work (e.g., one driver per attached device) typically does not yield better isolation properties, and is therefore not considered beneficial.

A machine may have multiple controllers of the same type; in this case, multiple copies of the same block driver may be started. Each copy will be given a unique instance number (0 for the first instance of that particular driver, 1 for the second, and so on).

Initialization

Upon initialization, the driver must retrieve the instance number. It is passed as one of the arguments to the driver, in the form instance=N where N is a decimal representation of the instance number. Obtaining the instance number is typically done by means of env_setargs() and env_parse(), as found in <minix/sysutil.h>.
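As an illustration of the instance=N argument format only, the following is a minimal sketch in portable C. It does not use env_setargs() or env_parse(); parse_instance() is a hypothetical helper written for this example, and a real driver would normally rely on the <minix/sysutil.h> functions instead.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not part of MINIX): scan the argument vector for
 * "instance=N" and return N. Returns -1 if the argument is absent or
 * not a plain nonnegative decimal number. */
static long parse_instance(int argc, char **argv)
{
    static const char key[] = "instance=";
    const size_t keylen = sizeof(key) - 1;

    assert(argv != NULL);
    for (int i = 1; i < argc; i++) {
        if (strncmp(argv[i], key, keylen) != 0)
            continue;
        const char *digits = argv[i] + keylen;
        char *end;
        long n = strtol(digits, &end, 10);
        /* Reject empty values, trailing junk, and negative numbers. */
        if (end == digits || *end != '\0' || n < 0)
            return -1;
        return n;
    }
    return -1;
}
```

Whether a missing argument should be treated as an error or as instance 0 is a policy decision that this sketch leaves to the caller.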

The driver should try to initialize the controller immediately when starting up, and abort (for example, by calling panic()) if initialization of the controller fails.

Upon (successful) startup, the block driver must announce its presence in DS. This should be done by calling the blockdriver_announce() function found in libblockdriver.

Requests and replies

The block driver protocol follows a strict blocking request-reply model: a party requests service from a block driver by sending it a request message, and the block driver will respond to that request with a reply message. Multiple requests may be sent at once. It is up to the driver to decide the order in which the requests are replied to. Internally, the driver may implement any form of parallelism, queuing, and so on, as it sees fit. For the purpose of keeping track of its own requests, the caller supplies an (opaque) ID value in each request; the driver copies this ID into each corresponding reply.

Since requests may be sent synchronously or asynchronously, the block driver must use driver_receive() to receive a request along with the IPC primitive used to send the request. After handling the request, it should either use send() or sendnb() to reply to calls made using sendrec(), or use asynsend3() with the AMF_NOREPLY flag to reply to calls made using asynsend() (so that no asynchronous reply can satisfy a synchronous sendrec()). This is taken care of by libblockdriver.

The m_type message field contains the request or reply type. The only allowed reply type is BDEV_REPLY. All message names and field aliases start with BDEV_, and are defined in <minix/com.h>. For forward compatibility reasons, the caller must zero out any unused message fields in the request message, and the callee must zero out unused fields in the reply message. This is typically done through memset() to fill the entire message with zeroes before setting any fields on it.
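The zero-first rule and the echoing of the request ID can be sketched as follows. The structure and constants here are stand-ins invented for this example so that it compiles outside MINIX; the real message layout and the BDEV_ aliases come from <minix/com.h>.

```c
#include <assert.h>
#include <string.h>

/* Stand-in for the MINIX message structure (assumption: simplified
 * layout, not the real one). */
struct demo_msg {
    int  m_type;
    int  i1;        /* BDEV_MINOR in requests, BDEV_STATUS in replies */
    int  i2;        /* BDEV_ACCESS in open requests */
    long l1;        /* BDEV_ID */
    long spare[4];  /* fields this request type does not use */
};

enum { DEMO_BDEV_OPEN = 1, DEMO_BDEV_REPLY = 2 };  /* placeholder values */

/* Build an open request: clear the whole message, then set used fields. */
static void build_open(struct demo_msg *m, int minor, int access, long id)
{
    assert(m != NULL);
    memset(m, 0, sizeof(*m));
    m->m_type = DEMO_BDEV_OPEN;
    m->i1 = minor;
    m->i2 = access;
    m->l1 = id;
}

/* Build a reply: same zero-first rule, and the ID is copied back. */
static void build_reply(struct demo_msg *r, const struct demo_msg *req,
                        int status)
{
    assert(r != NULL && req != NULL);
    memset(r, 0, sizeof(*r));
    r->m_type = DEMO_BDEV_REPLY;
    r->i1 = status;
    r->l1 = req->l1;
}
```

Clearing the whole message first means that fields added by a later protocol version are seen as zero by newer peers, rather than as leftover stack contents.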

Minor device numbers and subdevices

Every request includes a minor device number, which tells the driver which device to use. A driver may choose its own mapping from minor device numbers to devices, and possibly to subranges within devices (e.g. partitions) or whatever the driver sees fit. The rest of this document uses the term subdevice to describe each of those. One or several subdevices may refer to (a subset of) a single physical device, and each subdevice has its own minor device number.

As an illustration: the floppy driver uses the high bit of the minor device number to indicate that the device should be formatted instead of written to, and uses the rest of the bits to encode the floppy drive and partition number.
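A driver-private mapping of this kind might be decoded as in the sketch below. The 8-bit width and the exact bit positions are assumptions made for the example and are not taken from the floppy driver source; only the idea of a high "format" bit comes from the text above.

```c
#include <assert.h>
#include <stdbool.h>

#define DEMO_FORMAT_BIT 0x80u   /* assumption: minors are 8 bits wide */

struct demo_subdev {
    bool     format;   /* high bit set: format instead of write */
    unsigned index;    /* remaining bits: drive/partition selector */
};

/* Split a minor device number into the format flag and the rest. */
static struct demo_subdev decode_minor(unsigned minor)
{
    assert(minor <= 0xFFu);
    struct demo_subdev s;
    s.format = (minor & DEMO_FORMAT_BIT) != 0;
    s.index  = minor & ~DEMO_FORMAT_BIT;
    return s;
}
```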

Open and close semantics

Before a requesting party is allowed to issue I/O requests to a subdevice, this subdevice must first be opened using a BDEV_OPEN request. The requesting party must issue a BDEV_CLOSE request once it is done with the subdevice. The current system infrastructure does not allow the driver to detect whether a requesting party has exited, so it is up to the requesting party to make sure that each BDEV_OPEN request is matched with a BDEV_CLOSE request before it exits.

The driver is expected to keep track of open counts on a per-device (not per-subdevice) basis. The driver should assume that if a device has a nonzero open count, the other side expects the exact same physical device to remain there. Practically, this means that a hardware hot-swap or medium change should cause all subsequent requests to that device to return an error (typically ENXIO for open requests and EIO for transfer requests) until the last party closes the device. Similarly, the driver is expected to make available such a new device or medium only and always when the device is opened initially (i.e., whenever the per-device open count is increased from 0 to 1), and depending on the device type, (re)read its partition tables (see the section on partitions).
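The bookkeeping described above can be sketched as a small state machine. All names are hypothetical, and error values are returned negated to mirror the reply status convention; a real driver would perform hardware probing where the comments indicate.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

struct demo_device {
    int  open_count;  /* per device, shared by all of its subdevices */
    bool stale;       /* medium or hardware changed while open */
};

/* Returns 0 or a negative error code. */
static int demo_open(struct demo_device *d)
{
    if (d->stale)
        return -ENXIO;   /* refuse until the last party has closed */
    /* On the 0 -> 1 transition, a real driver would detect the medium
     * and (re)read partition tables here. */
    d->open_count++;
    return 0;
}

static int demo_transfer(const struct demo_device *d)
{
    return d->stale ? -EIO : 0;
}

/* Called when a hot-swap or medium change is detected. */
static void demo_medium_changed(struct demo_device *d)
{
    if (d->open_count > 0)
        d->stale = true;
}

static int demo_close(struct demo_device *d)
{
    assert(d->open_count > 0);
    if (--d->open_count == 0)
        d->stale = false;   /* the next open sees the new medium */
    return 0;
}
```

In this sketch, close requests still succeed on a stale device so that the open count can drain back to zero; that choice is an assumption of the example.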

System interaction

The driver should use the System Event Framework (SEF). This framework automatically takes care of interaction with RS.

A SIGTERM signal instructs the driver to shut down. However, after receiving this signal, the driver must not actually shut down until all its devices have been closed. Signals may be received by registering a signal handler callback function with SEF.

Driver restarts: the driver side

Drivers that crash or hang are restarted by the Reincarnation Server (RS); this applies to block drivers as well. Block drivers typically need not provide any explicit support for restarts, because they are essentially stateless. However, after such a restart, the driver will have lost information about previous open counts of devices.

A block driver must not specify a SEF callback routine for initialization after a restart. That is, it must not call sef_setcb_init_restart(). As a result, after the driver crashes, the default SEF callback routine will cause the driver to be restarted with a new endpoint.

In addition, if the driver receives a non-open request (for example, a transfer request) that has not been preceded by an open request, it must reply with an ERESTART error. This informs the caller that more (re)initialization is necessary first. This is only a workaround for what is essentially a race condition between VFS and file systems, regarding direct I/O to block-special files. Returning ERESTART is taken care of by libblockdriver, although poorly: no information about endpoints is kept or used.

Driver restarts: the caller side

Any service that uses a block driver will have to implement procedures that 1) detect and 2) recover from driver restarts.

Detection is relatively easy if synchronous communication is used. If the driver restarts during an ongoing sendrec() call, the call will be aborted with an EDEADSRCDST error. In addition, after a driver restart, any sendrec() calls to the old endpoint will result in the same error. With asynchronous communication, requests and/or replies may get lost as part of the crash, and the caller may need an additional method to find out that the driver has restarted. File systems can rely on REQ_NEW_DRIVER requests for this. Other services may have to subscribe to notifications on Data Store (DS) entries starting with the string “drv.blk.”. Each block driver updates its own entry as part of the call to blockdriver_announce().

Recovery consists of first reopening all minor devices on the new driver instance, and then reissuing any previously ongoing requests. Failure during the recovery procedure may be dealt with as the caller sees fit. In general, it is encouraged that a service supports an unbounded number of block device restarts over time, but only a limited number of block device restarts during recovery.
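One possible shape for such a recovery loop is sketched below. The callback structure, the retry limit, and the fake driver used to exercise it are all hypothetical; libbdev has its own implementation, which this example does not attempt to reproduce.

```c
#include <assert.h>
#include <stdbool.h>

#define DEMO_MAX_RECOVERY_RESTARTS 3   /* assumed limit; pick your own */

/* Hypothetical callbacks; each returns false if the driver restarted
 * again while the step was in progress. */
struct demo_ops {
    bool (*reopen_all)(void *ctx);    /* reopen every open minor device */
    bool (*reissue_all)(void *ctx);   /* resend all pending requests */
};

/* Returns true once a full pass completed, false if the driver restarted
 * too often during this one recovery. The counter is local, so the
 * number of restarts over the lifetime of the service is not limited. */
static bool demo_recover(const struct demo_ops *ops, void *ctx)
{
    for (int tries = 0; tries <= DEMO_MAX_RECOVERY_RESTARTS; tries++) {
        if (!ops->reopen_all(ctx))
            continue;   /* restarted again: start over */
        if (!ops->reissue_all(ctx))
            continue;
        return true;
    }
    return false;
}

/* Fake driver for demonstration: the first `failures` steps fail. */
struct demo_fake { int failures; int reopens; int reissues; };

static bool fake_step(struct demo_fake *f)
{
    if (f->failures > 0) {
        f->failures--;
        return false;
    }
    return true;
}

static bool fake_reopen(void *ctx)
{
    struct demo_fake *f = ctx;
    f->reopens++;
    return fake_step(f);
}

static bool fake_reissue(void *ctx)
{
    struct demo_fake *f = ctx;
    f->reissues++;
    return fake_step(f);
}
```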

The libbdev library implements both detection and recovery, but is written for use by file systems. Non-filesystem services may use it, but will have to implement their own DS subscription logic on top.

Transfers

The nature of block devices prescribes that all data transfer requests be idempotent. There are four types of transfer requests: BDEV_READ, BDEV_WRITE, BDEV_GATHER, and BDEV_SCATTER. BDEV_READ reads a contiguous area of data from a subdevice, into a single buffer. BDEV_WRITE writes a contiguous area of data to the subdevice, from a single buffer. BDEV_GATHER and BDEV_SCATTER also read and write a contiguous area of data from and to the subdevice (respectively), but use a vector, where each vector element provides a buffer and the size of that buffer.

All requests make use of memory grants to provide access to the buffers. In addition, the two vector requests provide the vector itself in a read-only grant. Libblockdriver takes care of converting the single-buffer requests into single-element vector requests, and of copying in the vector. All grants are owned by the caller (the request message's m_source).

A transfer either succeeds or results in an error. Upon success, the transfer reply contains the number of bytes transferred as status. The size of the status field imposes a hard limit on transfer sizes of 2 GB (since negative values are error codes), but that is well beyond the expected size of any transfer anyway.

In general, the driver should transfer exactly the number of bytes requested. There are two cases where the transfer may be limited to a lower number of bytes: either the end of the medium or partition is reached (see the section on partitions), or the requested number of vector elements or bytes exceeds what the driver can deal with. In those cases, the reply will contain a lower number of bytes accordingly.
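The end-of-medium case can be expressed as a simple clamp. The helper below is a sketch with a hypothetical name; positions and sizes are in bytes, relative to the start of the subdevice.

```c
#include <assert.h>
#include <stdint.h>

/* Returns the number of bytes that may be transferred starting at `pos`
 * within a subdevice of `dev_size` bytes. A result of 0 means that the
 * position is at or beyond the end of the subdevice. */
static uint64_t clamp_to_end(uint64_t pos, uint64_t count, uint64_t dev_size)
{
    if (pos >= dev_size)
        return 0;
    uint64_t left = dev_size - pos;
    uint64_t n = count < left ? count : left;
    assert(n <= count);
    return n;
}
```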

It is fully up to the driver to decide on restrictions for the parameters of transfers, including for example:

  • byte position alignment (e.g., alignment to the medium sector size)
  • alignment of total request size (e.g., a multiple of the medium sector size)
  • maximum total request size (e.g., 4 MB)
  • alignment of buffer size (e.g., alignment to 2 bytes)
  • alignment of physical buffer address (e.g., 2-byte alignment)
  • buffer memory specifics (e.g., each buffer must be physically contiguous in memory)

A driver may have to impose such restrictions because, for example, it performs DMA directly from or into the buffers provided by the caller, in which case the driver has to conform to whatever the device's DMA engine requires. Driver writers are encouraged to provide a (possibly slower) fallback mode for requests that do not meet the requirements, but this is not required. For optimal compatibility, callers are expected to provide buffers of physically contiguous memory only, aligned to at least a 2-byte boundary, and with a buffer size that is a multiple of 2 bytes. For vector requests, this applies to each individual provided buffer. In general, the driver should lower the request size if a size limit is reached, and return an error for any other violation of its restrictions.
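The distinction between lowering the size and returning an error might look as follows in a driver. The sector size and the 4 MB limit are example values only, and check_request() is a hypothetical helper.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define DEMO_SECTOR   512u                   /* assumed sector size */
#define DEMO_MAX_XFER (4u * 1024u * 1024u)   /* assumed limit: 4 MB */

/* Returns the (possibly lowered) byte count to transfer, or -EINVAL. */
static long check_request(uint64_t pos, uint32_t count)
{
    /* Alignment violations are errors. */
    if (pos % DEMO_SECTOR != 0 || count % DEMO_SECTOR != 0)
        return -EINVAL;
    /* Exceeding the size limit is not an error: transfer less. */
    if (count > DEMO_MAX_XFER)
        count = DEMO_MAX_XFER;
    assert(count % DEMO_SECTOR == 0);
    return (long)count;
}
```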

IOCTLs

The BDEV_IOCTL request type is used to pass I/O control (ioctl) requests to a (sub)device. A memory grant is used to pass any extra data and/or provide a buffer to store results in.

Libblockdriver supports a set of I/O control requests for block-level tracing. These are used by the btrace(8) command line tool; they are not documented here.

Other I/O control requests are custom to the block driver type. More information about ioctls for disk block drivers can be found below. Non-disk block drivers may support their own sets of ioctls.

Protocol messages

This section documents the messages used in the block driver protocol. The error lists are not exhaustive; drivers may return additional errors as they see fit.

BDEV_OPEN

Open a subdevice.

Request

Type    BDEV_OPEN
Fields  BDEV_MINOR   m10_i1  dev_t  minor device number
        BDEV_ACCESS  m10_i2  int    access mode
        BDEV_ID      m10_l1  long   opaque request ID

Reply

Type    BDEV_REPLY
Fields  BDEV_STATUS  m10_i1  int   OK or negative error code
        BDEV_ID      m10_l1  long  opaque request ID, echoed from the request

Error codes

ENXIO   no such device or device not ready
EACCES  requested mode contains W_BIT and device is read-only
EIO     I/O error or unexpected device behavior

Description

This request opens a subdevice. This must always be the first request issued by a caller, after which transfer and ioctl operations may be performed, until the caller closes the subdevice with a BDEV_CLOSE request.

The access mode may consist of a bitwise combination of R_BIT (open for reading) and W_BIT (open for writing). The driver should refuse to open a read-only device for writing, but beyond that the access mode is purely informative: the driver is not expected to enforce the access bits otherwise.

BDEV_CLOSE

Close a subdevice.

Request

Type    BDEV_CLOSE
Fields  BDEV_MINOR  m10_i1  dev_t  minor device number
        BDEV_ID     m10_l1  long   opaque request ID

Reply

Type    BDEV_REPLY
Fields  BDEV_STATUS  m10_i1  int   OK or negative error code
        BDEV_ID      m10_l1  long  opaque request ID, echoed from the request

Error codes

ENXIO     no such device
ERESTART  subdevice has not been previously opened

Description

This request closes a previously opened subdevice.

BDEV_READ, BDEV_WRITE

Perform data transfer on a subdevice, using a single buffer.

Request

Type    BDEV_READ / BDEV_WRITE
Fields  BDEV_MINOR   m10_i1  dev_t          minor device number
        BDEV_COUNT   m10_i2  size_t         number of bytes to transfer
        BDEV_GRANT   m10_i3  cp_grant_id_t  grant (WRITE or READ) for buffer
        BDEV_FLAGS   m10_i4  unsigned int   transfer flags
        BDEV_ID      m10_l1  long           opaque request ID
        BDEV_POS_LO  m10_l2  u32_t          starting byte position (lower 32 bits)
        BDEV_POS_HI  m10_l3  u32_t          starting byte position (upper 32 bits)

Reply

Type    BDEV_REPLY
Fields  BDEV_STATUS  m10_i1  ssize_t  number of bytes read/written, or negative error code
        BDEV_ID      m10_l1  long     opaque request ID, echoed from the request

Error codes

ENXIO     no such device
EIO       device not ready or I/O error
EINVAL    request requirements not met
ERESTART  subdevice has not been previously opened

Description

These requests perform a sequential read from (BDEV_READ) or write to (BDEV_WRITE) the given subdevice, using a grant and size of the buffer that is used as destination or source of the data, respectively. Upon success, the number of bytes read or written is returned, which may be less than the requested number of bytes (or even zero) if the end of the subdevice was encountered during the transfer.

Drivers may use sys_safecopyfrom, sys_safecopyto, or sys_vsafecopy to perform the data copying, but they may also choose to map the buffer to its physical address for DMA using either sys_umap with the VM_GRANT pseudo-segment, or sys_vumap. Please note that it is currently impossible to validate the buffer grant's access type when using sys_umap, so use of sys_vumap is preferred.

The BDEV_FLAGS field may be set to BDEV_NOFLAGS (0), or consist of a bitwise combination of the following flags:

Alias            Value  Meaning
BDEV_FORCEWRITE  0x1    for write requests: do not return until the write has made it to the physical device

The driver may ignore flags that it does not recognize.

BDEV_GATHER, BDEV_SCATTER

Perform data transfer on a subdevice, using a vector of buffers.

Request

Type    BDEV_GATHER / BDEV_SCATTER
Fields  BDEV_MINOR   m10_i1  dev_t          minor device number
        BDEV_COUNT   m10_i2  int            number of elements in the vector
        BDEV_GRANT   m10_i3  cp_grant_id_t  grant (READ) for iovec_s_t vector
        BDEV_FLAGS   m10_i4  unsigned int   transfer flags
        BDEV_ID      m10_l1  long           opaque request ID
        BDEV_POS_LO  m10_l2  u32_t          starting byte position (lower 32 bits)
        BDEV_POS_HI  m10_l3  u32_t          starting byte position (upper 32 bits)

Reply

Type    BDEV_REPLY
Fields  BDEV_STATUS  m10_i1  ssize_t  number of bytes read/written, or negative error code
        BDEV_ID      m10_l1  long     opaque request ID, echoed from the request

Error codes

ENXIO     no such device
EIO       device not ready or I/O error
EINVAL    request requirements not met
ERESTART  subdevice has not been previously opened

Description

These requests perform a sequential read from (BDEV_GATHER) or write to (BDEV_SCATTER) the given subdevice, using a vector of buffer grants and sizes that together make up the full request. The vector is provided by the caller using a (read-only) grant as well, and must contain between 1 and NR_IOREQS elements, inclusive. The driver will return the actual number of bytes transferred in the BDEV_STATUS field of the reply.

The vector is an array of structures of type iovec_s_t, which is defined in <minix/type.h>. Libblockdriver takes care of copying in the vector from the caller.
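After copying in the vector, a driver typically validates the element count and works out the total request size. The sketch below uses a stand-in structure and a placeholder limit so that it compiles outside MINIX; the real iovec_s_t and NR_IOREQS definitions come from the MINIX headers. Rejecting an out-of-range count with EINVAL is an assumption of this example.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define DEMO_NR_IOREQS 64   /* placeholder, not the real NR_IOREQS */

/* Stand-in for iovec_s_t: one grant plus the size of its buffer. */
struct demo_iovec_s {
    int    grant;
    size_t size;
};

/* Returns the total number of bytes described by the vector, or -EINVAL
 * if the element count is out of range. The total is capped so that it
 * still fits a signed 32-bit status value: summing stops before the
 * element that would exceed the cap. */
static long vector_total(const struct demo_iovec_s *iov, int count)
{
    if (count < 1 || count > DEMO_NR_IOREQS)
        return -EINVAL;
    assert(iov != NULL);
    long total = 0;
    for (int i = 0; i < count; i++) {
        if (iov[i].size > (size_t)(0x7FFFFFFFL - total))
            break;   /* lower the request instead of overflowing */
        total += (long)iov[i].size;
    }
    return total;
}
```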

See above for notes on performing data copies, and for possible values of the BDEV_FLAGS field.

BDEV_IOCTL

Perform an IOCTL on the driver or a device.

Request

Type    BDEV_IOCTL
Fields  BDEV_MINOR    m10_i1  dev_t          minor device number
        BDEV_REQUEST  m10_i2  unsigned int   I/O control request
        BDEV_GRANT    m10_i3  cp_grant_id_t  grant (READ and/or WRITE) for buffer
        BDEV_ID       m10_l1  long           opaque request ID

Reply

Type    BDEV_REPLY
Fields  BDEV_STATUS  m10_i1  int   nonnegative result value (typically OK), or negative error code
        BDEV_ID      m10_l1  long  opaque request ID, echoed from the request

Error codes

ENXIO     no such device
EIO       device not ready or I/O error
ENOTTY    request not supported on this device
ERESTART  subdevice has not been previously opened

Description

This request tells the driver to perform an ioctl, which may affect the entire driver or the device identified by the given subdevice. A minor device number for an opened subdevice must be provided even for requests that are driver-wide. The interpretation of ioctls is custom to the block driver type.

A grant may or may not be provided for ioctls that do not have associated data. If it is not provided, BDEV_GRANT should be set to GRANT_INVALID.

Disk drivers

The following extra information is provided for writers of disk block drivers in particular.

Partitions and subpartitions

In the case of disk drivers, subdevices can generally refer to devices, partitions on those devices, or subpartitions on those partitions. Libblockdriver provides code to read and parse (sub)partition tables, but it expects one of two common minor device numbering schemes to be used by the disk driver: the scheme used by the floppy driver, and the scheme used by hard disk drivers. The latter scheme supports up to eight devices per driver, four partitions per device, and four subpartitions per partition.
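For the hard disk scheme, a decoder could look like the sketch below. The limits of eight devices, four partitions, and four subpartitions come from the text above; the concrete numbering (five minors per drive starting at 0, subpartitions starting at 128) is assumed here and should be checked against the MINIX headers before being relied on.

```c
#include <assert.h>

#define DEMO_NR_PARTITIONS  4
#define DEMO_DEV_PER_DRIVE  (1 + DEMO_NR_PARTITIONS)  /* whole disk + 4 */
#define DEMO_MAX_DRIVES     8
#define DEMO_MINOR_SUB_BASE 128   /* assumed first subpartition minor */
#define DEMO_NR_SUBMINORS \
    (DEMO_MAX_DRIVES * DEMO_NR_PARTITIONS * DEMO_NR_PARTITIONS)

enum demo_kind { DEMO_WHOLE, DEMO_PART, DEMO_SUBPART, DEMO_BAD };

struct demo_loc {
    enum demo_kind kind;
    int drive, part, sub;   /* -1 where not applicable */
};

static struct demo_loc decode_disk_minor(int minor)
{
    struct demo_loc l = { DEMO_BAD, -1, -1, -1 };

    if (minor >= 0 && minor < DEMO_MAX_DRIVES * DEMO_DEV_PER_DRIVE) {
        int slot = minor % DEMO_DEV_PER_DRIVE;
        l.drive = minor / DEMO_DEV_PER_DRIVE;
        if (slot == 0) {
            l.kind = DEMO_WHOLE;
        } else {
            l.kind = DEMO_PART;
            l.part = slot - 1;
        }
    } else if (minor >= DEMO_MINOR_SUB_BASE &&
               minor < DEMO_MINOR_SUB_BASE + DEMO_NR_SUBMINORS) {
        int n = minor - DEMO_MINOR_SUB_BASE;
        l.kind  = DEMO_SUBPART;
        l.drive = n / (DEMO_NR_PARTITIONS * DEMO_NR_PARTITIONS);
        l.part  = (n / DEMO_NR_PARTITIONS) % DEMO_NR_PARTITIONS;
        l.sub   = n % DEMO_NR_PARTITIONS;
    }
    assert(l.kind == DEMO_BAD || l.drive < DEMO_MAX_DRIVES);
    return l;
}
```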

If the driver supports partitions, the driver is expected to read in and parse partition tables during the initial device open, even if no device or medium change took place. Changes to the on-device partition tables will not become visible until the device is fully closed first, or unless the DIOCSETP ioctl is used to modify partition tables in memory. When rereading partitions from the device, any previous in-memory modifications made with DIOCSETP are to be forgotten. It is recommended that partition tables be read in only during the initial device open, so as not to confuse parties that already have (sub)partitions open.

With all current drivers, it is possible to open minor device numbers that map to valid but nonexistent partitions and subpartitions. This allows a caller to use DIOCSETP to make the driver's in-memory partition information match a newly written on-device partition table, even when the device is also open by another party. This feature is not very useful and somewhat dangerous, and may be changed in the future. In any case, DIOCGETP should return a size of zero for nonexistent (sub)partitions, and any transfer requests on such (sub)partitions should return either EOF or an error.

Care must be taken with DIOCSETP requests in which the sum of base and size exceeds the medium size. Libblockdriver currently does not, and cannot, check for this; this is a design flaw.
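A driver that knows the medium size could perform this check itself before accepting a DIOCSETP request. The helper below is a sketch with a hypothetical name; it is written so that the addition of base and size cannot overflow.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if the byte range [base, base + size) lies entirely within a
 * medium of `medium` bytes. Comparing size against medium - base avoids
 * computing base + size, which could wrap around. */
static bool range_fits(uint64_t base, uint64_t size, uint64_t medium)
{
    return base <= medium && size <= medium - base;
}
```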

Disk transfers

For generic disk drivers, it is recommended for compatibility reasons that total request sizes of 512-byte multiples be supported, even if the medium sector size is larger.

Disk IOCTLs

The ioctl definitions and structures related to disk drivers can be found in <sys/ioc_disk.h>.

The driver should implement the following ioctl requests: DIOCGETP, DIOCSETP, DIOCFLUSH, DIOCOPENCT. The driver may implement the following ioctl requests: DIOCGETWC, DIOCSETWC, DIOCEJECT, DIOCTIMEOUT.

The DIOCGETP ioctl may be used to obtain the base, size, and geometry of a (sub)device. Geometry data may be faked if the device does not have real geometry. The DIOCSETP ioctl sets the base and size of a subdevice; its effects are temporary and in-memory only. Both these calls make use of a struct partition structure that is defined in <minix/partition.h>.

The DIOCFLUSH ioctl tells a device to flush its write cache, returning only once this operation has completed. The DIOCSETWC ioctl disables or enables the device's write cache, depending on whether the integer value passed in is zero or nonzero, respectively. On some devices, disabling the write cache also invokes a cache flush. The DIOCGETWC ioctl retrieves the current state of the device's write cache, copying back a value of 1 if it is enabled and a value of 0 if it is disabled.

The DIOCOPENCT ioctl allows applications to request the open count of a particular device. An integer that contains the open count of the device is copied to the caller.

The DIOCEJECT ioctl tells the driver to eject the medium from a device, if possible. There is no matching call to request the device to load a medium.

The DIOCTIMEOUT ioctl sets the driver's command timeout, in clock ticks. The previous command timeout is copied back to the caller. A timeout of 0 signifies the driver default timeout. It is up to the driver to decide whether this is a device-specific operation or not.

releases/3.2.1/developersguide/blockprotocol.txt · Last modified: 2014/11/13 15:15 by lionelsambuc