POSIX write()

Take a look at:

http://www.opengroup.org/onlinepubs/000095399/functions/write.html

What a mess – different requirements for regular files, pipes, and “other devices supporting non-blocking operation”. For pipes there are reasons for this (atomic writes), but I think the concept should have been abstracted out (why can’t other devices have atomic writes? Why isn’t there a general mechanism to determine the maximum size of an atomic write for a given file descriptor?).
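For pipes specifically there is at least fpathconf(fd, _PC_PIPE_BUF), which reports the size below which writes are atomic; the complaint is that nothing like it exists for descriptors in general. A minimal sketch:

    #include <limits.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        int fds[2];
        if (pipe(fds) < 0)
            return 1;

        /* For a pipe, _PC_PIPE_BUF gives the largest write that is
         * guaranteed to be atomic on this descriptor. */
        long atomic_max = fpathconf(fds[0], _PC_PIPE_BUF);
        if (atomic_max < 0)
            atomic_max = PIPE_BUF;   /* fall back to the compile-time minimum */

        printf("atomic pipe write limit: %ld bytes\n", atomic_max);
        return 0;
    }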

I also notice, and I think this is particularly stupid, that if write() is interrupted by a signal before transferring any data it returns -1 (with errno set to EINTR) rather than 0. If it transfers some data before being interrupted, it returns the amount of data transferred. Why make a special case out of 0?! This forces extra complexity onto the application, which cannot assume that the return from write() is equal to the number of bytes actually written, and for which -1 in almost every other case signals an abortive error.
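To be concrete, this is roughly the loop every careful application ends up writing (a sketch; write_all() is just a name I made up, not part of POSIX):

    #include <errno.h>
    #include <unistd.h>

    ssize_t write_all(int fd, const void *buf, size_t count)
    {
        const char *p = buf;
        size_t left = count;

        while (left > 0) {
            ssize_t n = write(fd, p, left);
            if (n < 0) {
                if (errno == EINTR)
                    continue;   /* interrupted before any data was written: retry */
                return -1;      /* a real error */
            }
            /* interrupted (or otherwise stopped) after a partial write:
             * carry on from where we left off */
            p += n;
            left -= (size_t)n;
        }
        return (ssize_t)count;
    }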

Unfortunately there is no discussion of the topic I was most interested in: atomicity/ordering of reads and writes to regular files. Consider:

  1. Process A issues a “read()” call to read a particular part of a file. The block device is currently busy so the request is queued.
  2. Process B issues a “write()” request which writes to the same part of the file as process A requested.

The question is now: can the data written by process B be returned to process A, or must read() return the data that was in the file at the time the call was issued? Also, is it allowed that process A might see part of the data that process B wrote, and part of what was in the file at the time of the read request?
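To make the race concrete, here is a sketch of the scenario using pread()/pwrite() from two processes (it assumes a pre-existing file named “data” of at least 4096 bytes; which bytes the parent ends up seeing is exactly what is in question):

    #include <fcntl.h>
    #include <string.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
        char buf[4096];
        int fd = open("data", O_RDWR);
        if (fd < 0)
            return 1;

        if (fork() == 0) {
            /* Process B: overwrite the same region the parent is reading. */
            memset(buf, 'B', sizeof buf);
            pwrite(fd, buf, sizeof buf, 0);
            _exit(0);
        }

        /* Process A: read the region. The request may be queued while the
         * block device is busy, so B's write can land before, after, or in
         * the middle of it as far as the standard is concerned. */
        pread(fd, buf, sizeof buf, 0);
        wait(NULL);
        return 0;
    }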

The standard does say this:

After a write() to a regular file has successfully returned: Any successful read() from each byte position in the file that was modified by that write shall return the data specified by the write() for that position until such byte positions are again modified.

… However, that doesn’t say when read() begins and ends, and it still seems to allow that a write request issued after a read request might affect the result of the read (if we assume that a “successful read()” refers to the point in time at which read() returns with no error, which seems safe).

I’m inclined to think that the write() should cause the read() to immediately return with the data that was written. After all, unless there is explicit synchronization between the two processes, there is no way of controlling the order of events (the write() could just as well have been executed first, and in that case it seems clear that read() should return the freshly written data). In the presence of synchronization, I think it would be acceptable for the writing process to have to wait for the read to finish before its write operation may begin. Still, it would be nice if this were formally specified.

I did eventually find:

http://www.opengroup.org/onlinepubs/000095399/basedefs/xbd_chap03.html

(See “Synchronized I/O Data Integrity Completion” and “Synchronized I/O File Integrity Completion”.) These apply when the file status flags include various combinations of O_SYNC/O_DSYNC/O_RSYNC, as specified via the open() system call. The definitions go part of the way there:

If there were any pending write requests affecting the data to be read at the time that the synchronized read operation was requested, these write requests are successfully transferred prior to reading the data.

… However, that doesn’t specify that write requests issued after the read request must not be performed before reading the data, and it’s possible that the omission is deliberate. On the other hand, the definitions talk about “successful transfer” without actually defining what “transfer” means (for a read operation, it is a “transfer of data to the application”, but for a write?). Surely O_SYNC writes must actually reach the underlying media before they are considered successful – isn’t that the point of the mechanism? – but the standard doesn’t seem to make that explicit. (There is a hint under the definition of “Synchronized Input and Output”, but it’s very general.)

Ah, but wait; I was wrong. The definition of “Successfully Transferred” states that the data must be readable even after a system failure or power failure – so it must have made it to the physical media. The problem with the standard here is that it is not clear (in the definition of “synchronized I/O data integrity completion”) that “successfully transferred” has a defined meaning beyond its plain English one. I did check for a definition of “transferred” and there isn’t one; I only stumbled on “successfully transferred” by chance.
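For reference, this is roughly how those flag combinations are requested at open() time (a sketch; note that O_RSYNC only adds meaning in combination with O_DSYNC or O_SYNC, and not every system implements it as a distinct flag):

    #include <fcntl.h>

    /* Request that reads wait for any pending writes affecting the
     * requested bytes, and that writes are not reported complete until
     * the data has reached stable storage. */
    int open_with_read_integrity(const char *path)
    {
        return open(path, O_RDWR | O_SYNC | O_RSYNC);
    }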

Anyway, my first conclusion is unchanged: A write request should be able to change the result of an earlier read request.

The second question remains unanswered, so I’ll assume that an application must be prepared to accept that, if a write request is issued after a read request, the read may return a result which includes a partial write. This would seem to be true even if O_RSYNC and friends were set. Only write requests issued earlier are required by POSIX to have been fully completed.
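So if torn reads matter, the application has to arrange its own exclusion. One option (a sketch of a workaround, not something POSIX requires) is advisory record locking, so that cooperating readers and writers never overlap on a byte range:

    #include <fcntl.h>
    #include <unistd.h>

    /* Read 'count' bytes at 'offset' under a shared (read) lock.
     * Cooperating writers would take F_WRLCK over the same range before
     * writing, so a read never observes a half-finished write. */
    ssize_t locked_pread(int fd, void *buf, size_t count, off_t offset)
    {
        struct flock lk = {
            .l_type = F_RDLCK,
            .l_whence = SEEK_SET,
            .l_start = offset,
            .l_len = (off_t)count,
        };
        ssize_t n;

        if (fcntl(fd, F_SETLKW, &lk) < 0)   /* block until the range is free */
            return -1;

        n = pread(fd, buf, count, offset);

        lk.l_type = F_UNLCK;                /* release the range */
        fcntl(fd, F_SETLKW, &lk);
        return n;
    }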
