Is writing to a unix file through shell script is synchronized? - shell

i have a requirement where many threads will call same shell script to perform a work, and then will write output(data as single text line) to a common text file.
as here many threads will try to write data to same file, my question is whether unix provides a default locking mechanism so that all can not write at the same time.

Performing a short single write to a file opened for append is mostly atomic; you can get away with it most of the time (depending on your filesystem), but if you want to be guaranteed that your writes won't interrupt each other, or to write arbitrarily long strings, or to be able to perform multiple writes, or to perform a block of writes and be assured that their contents will be next to each other in the resulting file, then you'll want to lock.
While not part of POSIX (unlike the C library call for which it's named), the flock tool provides the ability to perform advisory locking ("advisory" -- as opposed to "mandatory" -- meaning that other potential writers need to voluntarily participate):
(
flock -x 99 || exit # lock the file descriptor
echo "content" >&99 # write content to that locked FD
) 99>>/path/to/shared-file
The use of file descriptor #99 is completely arbitrary -- any unused FD number can be chosen. Similarly, one can safely put the lock on a different file than the one to which content is written while the lock is held.
The advantage of this approach over several conventional mechanisms (such as using exclusive creation of a file or directory) is automatic unlock: If the subshell holding the file descriptor on which the lock is held exits for any reason, including a power failure or unexpected reboot, the lock will be automatically released.

my question is whether unix provides a default locking mechanism so
that all can not write at the same time.
In general, no. At least not something that's guaranteed to work. But there are other ways to solve your problem, such as lockfile, if you have it available:
Examples
Suppose you want to make sure that access to the file "important" is
serialised, i.e., no more than one program or shell script should be
allowed to access it. For simplicity's sake, let's suppose that it is
a shell script. In this case you could solve it like this:
...
lockfile important.lock
...
access_"important"_to_your_hearts_content
...
rm -f important.lock
...
Now if all the scripts that access "important" follow this guideline,
you will be assured that at most one script will be executing between
the 'lockfile' and the 'rm' commands.
But, there's actually a better way, if you can use C or C++: Use the low-level open call to open the file in append mode, and call write() to write your data. With no locking necessary. Per the write() man page:
If the O_APPEND flag of the file status flags is set, the file offset
shall be set to the end of the file prior to each write and no
intervening file modification operation shall occur between changing
the file offset and the write operation.
Like this:
// process-wide global file descriptor
int outputFD = open( fileName, O_WRONLY | O_APPEND, 0600 );
.
.
.
// write a string to the file
ssize_t writeToFile( const char *data )
{
return( write( outputFD, data, strlen( data ) );
}
In practice, you can write anything to the file - it doesn't have to be a NUL-terminated character string.
That's supposed to be atomic on writes up to PIPE_BUF bytes, which is usually something like 512, 4096, or 5120. Some Linux filesystems apparently don't implement that properly, so you may in practice be limited to about 1K on those file systems.

Related

Why does the memory mapped file ever need to be flushed when access is RDWR?

I was reading through one of golang's implementation of memory mapped files, https://github.com/edsrzf/mmap-go/. First he describes the several access modes:
// RDONLY maps the memory read-only.
// Attempts to write to the MMap object will result in undefined behavior.
RDONLY = 0
// RDWR maps the memory as read-write. Writes to the MMap object will update the
// underlying file.
RDWR = 1 << iota
// COPY maps the memory as copy-on-write. Writes to the MMap object will affect
// memory, but the underlying file will remain unchanged.
COPY
But in gommap test file I see this:
func TestReadWrite(t *testing.T) {
mmap, err := Map(f, RDWR, 0)
... omitted for brevity...
mmap[9] = 'X'
mmap.Flush()
So why does he need to call Flush to make sure the contents are written to the file if the access mode is RDWR?
Or is the OS managing this so it only writes when it thinks it should?
If the last option, could you please explain it in a little more detail - what i read is that when the OS is low in memory it writes to the file and frees up memory. Is this correct and does it apply only to RDWR or only to COPY?
Thanks
The program maps a region of memory using mmap. It then modifies the mapped region. The system isn't required to write those modifications back to the underlying file immediately, so a read call on that file (in ioutil.ReadAll) could return the prior contents of the file.
The system will write the changes to the file at some point after you make the changes.
It is allowed to write the changes to the file any time after the changes are made, but by default makes no guarantees about when it writes those changes. All you know is that (unless the system crashes), the changes will be written at some point in the future.
If you need to guarantee that the changes have been written to the file at some point in time, then you must call msync.
The mmap.Flush function calls msync with the MS_SYNC flag. When that system call returns, the system has written the modifications to the underlying file, so that any subsequent call to read will read the modified file.
The COPY option sets the mapping to MAP_PRIVATE, so your changes will never be written back to the file, even if you using msync (via the Flush function).
Read the POSIX documentation about mmap and msync for full details.

dd command file_operations involved?

When we are executing dd command, which write function gets called.
As per my understanding, dd command is not filesystem specific, so no file system's file_operations is involved. Please correct If I am wrong here.
I would like to know which file_operations is involved in carrying out dd operation?
That depends on what you write to.
Either it is a regular file and file system specific calls are used or it is a device and you eventually use to the target disk (or whatever) underlying driver.
http://www.makelinux.net/books/ulk3/understandlk-CHP-14-SECT-5#understandlk-CHP-14-SECT-5
The write system call does indeed end up invoking the file system specific write via the VFS layer. See the vfs_write function.

Wrapping a binary data file to self-convert to CSV?

I'm writing custom firmware for a SparkFun Logomatic V2 that records binary data to a file on a 2GB micro-SD card. The data file size will range from 100 MB to 1 GB.
The format of the binary data is in flux as the board's firmware evolves (it will actually be dynamically reconfigurable at run-time). Rather than create and maintain a separate decoder/converter program for each version of firmware/configuration, I'd much rather make the data files self-converting to CSV format by starting the data file with a Bash script that is written to the data file before data recording starts.
I know how to create a Here Document, but I suspect Bash would be unable to quickly parse and convert a gigabyte of binary data, so I'd like to make the process run much faster by having the script first compile some C code (assume GCC is present and in the path), then run the resulting program, passing the binary data to stdin.
To make the problem more concrete, assume the firmware will create binary data consisting of 4 16-bit integer values: A timestamp (unsigned) followed by 3 accelerometer axes (signed). There is no separator between records (mainly because I'm saturating the SPI interface to the uSD card).
So, I think I need a script with TWO here documents: One for the C code (parameterized by expanded Bash variables), and another for the binary data. Here's where I am so far:
#! env bash
# Produced by firmware version 0.0.0.0.0.1 alpha
# Configuration for this data run:
header_string = "Time, X, Y, Z"
column_count = 4
# Create the converter executable
# Use "<<-" to permit code to be indented for readability.
# Allow variable expansion/substitution.
gcc -xc /tmp/convertit - <<-THE_C_CODE
#include <stdio.h>
int main (int argc, char **argv) {
// Write ${header_string} to stdout
while (1) {
// Read $(column_count} shorts from stdin
// Break if EOF
// Write $(column_count} comma-delimited values to stdout
}
// Close stdout
return 0;
}
THE_C_CODE
# Pass the binary data to the converter
# Hard-quote the Here tag to prevent subsequent expansion/substitution
/tmp/convertit >./$1.csv <<'THE_BINARY_DATA'
...
... hundreds of megabytes of semi-random data ...
...
THE_BINARY_DATA
rm /tmp/convertit
exit 0
Does that look about right? I don't yet have a real data file to test this with, but I wanted to verify the idea before going much further.
Will Bash complain if the closing lines are missing? This may happen if data capture terminates unexpectedly due to a shock knocking loose the battery or uSD card. Or if the firmware borks.
Is there a faster or better method I should consider? For example, I wonder if Bash will be too slow to copy the binary data as fast as the C program can consume it: Should the C program open the data file directly?
TIA,
-BobC
You may want to have a look at makeself. It allows you to change any .tar.gz archive into a self-extracting file which is platform independent (something like a shell script that contains a here document). This will allow you to easily distribute your data and decoder. It also allows you to configure a script contained within the archive to be run when the container script is run. This way you can use makeself for packaging and inside the archive you can put your data files and decoder written in C or bash or whatever language you find suitable.
While it is possible to decode binary data using shell tools (e.g. using od), it's very cumbersome and ineffective. I'd recommend using either a C program or perl which is also likely to be found on almost any machine (check this page).

file_operations Question, how do i know if a process that opened a file for writing has decided to close it?

I'm currently writing a simple "multicaster" module.
Only one process can open a proc filesystem file for writing, and the rest can open it for reading.
To do so i use the inode_operation .permission callback, I check the operation and when i detect someone open a file for writing I set a flag ON.
i need a way to detect if a process that opened a file for writing has decided to close the file so i can set the flag OFF, so someone else can open for writing.
Currently in case someone is open for writing i save the current->pid of that process and when the .close callback is called I check if that process is the one I saved earlier.
Is there a better way to do that? Without saving the pid, perhaps checking the files that the current process has opened and it's permission...
Thanks!
No, it's not safe. Consider a few scenarios:
Process A opens the file for writing, and then fork()s, creating process B. Now both A and B have the file open for writing. When Process A closes it, you set the flag to 0 but process B still has it open for writing.
Process A has multiple threads. Thread X opens the file for writing, but Thread Y closes it. Now the flag is stuck at 1. (Remember that ->pid in kernel space is actually the userspace thread ID).
Rather than doing things at the inode level, you should be doing things in the .open and .release methods of your file_operations struct.
Your inode's private data should contain a struct file *current_writer;, initialised to NULL. In the file_operations.open method, if it's being opened for write then check the current_writer; if it's NULL, set it to the struct file * being opened, otherwise fail the open with EPERM. In the file_operations.release method, check if the struct file * being released is equal to the inode's current_writer - if so, set current_writer back to NULL.
PS: Bandan is also correct that you need locking, but the using the inode's existing i_mutex should suffice to protect the current_writer.
I hope I understood your question correctly: When someone wants to write to your proc file, you set a variable called flag to 1 and also save the current->pid in a global variable. Then, when any close() entry point is called, you check current->pid of the close() instance and compare that with your saved value. If that matches, you turn flag to off. Right ?
Consider this situation : Process A wants to write to your proc resource, and so you check the permission callback. You see that flag is 0, so you can set it to 1 for process A. But at that moment, the scheduler finds out process A has used up its time share and chooses a different process to run(flag is still o!). After sometime, process B comes up wanting to write to your proc resource also, checks that the flag is 0, sets it to 1, and then goes about writing to the file. Unfortunately at this moment, process A gets scheduled to run again and since, it thinks that flag is 0 (remember, before the scheduler pre-empted it, flag was 0) and so sets it to 1 and goes about writing to the file. End result : data in your proc resource goes corrupt.
You should use a good locking mechanism provided by the kernel for this type of operation and based on your requirement, I think RCU is the best : Have a look at RCU locking mechanism

fork() and printf()

As I understood fork() creates a child process by copying the image of the parent process.
My question is about how do child and parent processes share the stdout stream?
Can printf() function of one process be interrupted by other or not?
Which may cause the mixed output.
Or is the printf() function output atomic?
For example:
The first case:
parent: printf("Hello");
child: printf("World\n");
Console has: HeWollorld
The second case:
parent: printf("Hello");
child: printf("World\n");
Console has: HelolWorld
printf() is not guaranteed to be atomic. If you need atomicity, use write() with a string, preformatted using s*printf() etc., if needed. Even then, you should make the size of the data written using write() is not too big:
Write requests of {PIPE_BUF} bytes or less shall not be interleaved with data from other processes doing writes on the same pipe. Writes of greater than {PIPE_BUF} bytes may have data interleaved, on arbitrary boundaries, with writes by other processes, whether or not the O_NONBLOCK flag of the file status flags is set.
stdout is usually line-buffered. stderr is usually unbuffered.
The behavior of printf() may vary (depending on the exact details of your OS, C compiler, etc.). However, in general printf() is not atomic. Thus interleaving (as per your 1st case) can occur

Resources