Wrapping a binary data file to self-convert to CSV? - bash

I'm writing custom firmware for a SparkFun Logomatic V2 that records binary data to a file on a 2GB micro-SD card. The data file size will range from 100 MB to 1 GB.
The format of the binary data is in flux as the board's firmware evolves (it will actually be dynamically reconfigurable at run-time). Rather than create and maintain a separate decoder/converter program for each version of firmware/configuration, I'd much rather make the data files self-converting to CSV format by starting the data file with a Bash script that is written to the data file before data recording starts.
I know how to create a Here Document, but I suspect Bash would be unable to quickly parse and convert a gigabyte of binary data, so I'd like to make the process run much faster by having the script first compile some C code (assume GCC is present and in the path), then run the resulting program, passing the binary data to stdin.
To make the problem more concrete, assume the firmware will create binary data consisting of 4 16-bit integer values: A timestamp (unsigned) followed by 3 accelerometer axes (signed). There is no separator between records (mainly because I'm saturating the SPI interface to the uSD card).
So, I think I need a script with TWO here documents: One for the C code (parameterized by expanded Bash variables), and another for the binary data. Here's where I am so far:
#!/usr/bin/env bash
# Produced by firmware version 0.0.0.0.0.1 alpha
# Configuration for this data run:
header_string="Time, X, Y, Z"
column_count=4
# Create the converter executable
# Use "<<-" to permit code to be indented for readability.
# Allow variable expansion/substitution.
gcc -xc -o /tmp/convertit - <<-THE_C_CODE
#include <stdio.h>
int main (int argc, char **argv) {
// Write ${header_string} to stdout
while (1) {
// Read ${column_count} shorts from stdin
// Break if EOF
// Write ${column_count} comma-delimited values to stdout
}
// Close stdout
return 0;
}
THE_C_CODE
# Pass the binary data to the converter
# Hard-quote the Here tag to prevent subsequent expansion/substitution
/tmp/convertit >./$1.csv <<'THE_BINARY_DATA'
...
... hundreds of megabytes of semi-random data ...
...
THE_BINARY_DATA
rm /tmp/convertit
exit 0
Does that look about right? I don't yet have a real data file to test this with, but I wanted to verify the idea before going much further.
Will Bash complain if the closing lines are missing? This may happen if data capture terminates unexpectedly due to a shock knocking loose the battery or uSD card. Or if the firmware borks.
Is there a faster or better method I should consider? For example, I wonder if Bash will be too slow to copy the binary data as fast as the C program can consume it: Should the C program open the data file directly?
TIA,
-BobC

You may want to have a look at makeself. It allows you to change any .tar.gz archive into a self-extracting file which is platform independent (something like a shell script that contains a here document). This will allow you to easily distribute your data and decoder. It also allows you to configure a script contained within the archive to be run when the container script is run. This way you can use makeself for packaging and inside the archive you can put your data files and decoder written in C or bash or whatever language you find suitable.
While it is possible to decode binary data using shell tools (e.g. using od), it's very cumbersome and inefficient. I'd recommend using either a C program or Perl, which is also likely to be found on almost any machine (check this page).
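To make the decoding step concrete, here is a minimal sketch in Python rather than C or Perl, chosen only to keep the example short. It assumes the record layout from the question (one unsigned 16-bit timestamp followed by three signed 16-bit axes) and additionally assumes little-endian byte order, which the question does not state. The function name convert and the sample bytes are made up for illustration.

```python
import io
import struct

# Assumed record layout: little-endian, one unsigned 16-bit timestamp
# followed by three signed 16-bit axis values. The byte order is a guess.
RECORD = struct.Struct("<Hhhh")

def convert(src, dst, header="Time, X, Y, Z"):
    """Copy fixed-size binary records from src to dst as CSV lines."""
    dst.write(header + "\n")
    count = 0
    while True:
        chunk = src.read(RECORD.size)
        if len(chunk) < RECORD.size:
            # EOF, or a truncated final record (e.g. after power loss): stop.
            break
        dst.write(",".join(str(v) for v in RECORD.unpack(chunk)) + "\n")
        count += 1
    return count

# Two complete records plus one stray trailing byte, all hypothetical data.
sample = (struct.pack("<Hhhh", 1, -2, 3, -4)
          + struct.pack("<Hhhh", 65535, 100, -100, 0)
          + b"\x07")
out = io.StringIO()
n = convert(io.BytesIO(sample), out)
print(out.getvalue(), end="")
```

Because incomplete trailing data is simply dropped, a file cut short by an interrupted capture would still convert up to the last whole record.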

Related

How to handle for loop with large objects in Rstudio?

I have a for loop that works with large objects. By trial and error, I found that I can only load the large object once. If I load the object again, I get the error "Error: cannot allocate vector of size *** Mb". I tried to work around this by removing the object at the end of each iteration. However, I still get the error "Error: cannot allocate vector of size 699.2 Mb" at the beginning of the second iteration.
My for loop has the following structure:
for (i in 1:22) {
VeryLargeObject <- ...i...
...
.
.
.
...
rm(VeryLargeObject)
}
Each VeryLargeObject is 2-3 GB in size. My PC has 16 GB of RAM and 8 cores, and runs 64-bit Windows 10.
Any solution on how I can manage to complete the for loop?
The error "cannot allocate..." likely comes from the fact that rm() does not immediately free memory. So the first object still occupies RAM when you load the second one. Objects that are not assigned to any name (variable) anymore get garbage collected by R at time points that R decides for itself.
Most remedies come from not loading the entire object into RAM:
If you are working with a matrix, create a filebacked.big.matrix() with the bigmemory package. Write your data into this object using var[...,...] syntax like a normal matrix. Then, in a new R session (and a new R script to preserve reproducibility), you can load this matrix from disk and modify it.
The mmap package uses a similar approach, relying on your operating system's ability to map files on disk into memory. The data then appear to a program as if they were in RAM, but are read from disk. To improve speed, the operating system takes care of keeping the relevant parts in RAM.
If you work with data frames, you can use packages like fst and feather that enable you to load only parts of your data frame into a variable.
Transfer your data frame into a database like SQLite and then access the database with R. The package dbplyr enables you to treat a database as a tidyverse-style data set. Here is the RStudio help page. You can also use raw SQL commands with the package DBI.
Another approach is to not write interactively, but to write an R script that processes only one of your objects:
Write an R script named, say, processBigObject.R that gets the file name of your big object and the name of the output file from the command line using commandArgs():
#!/usr/bin/env Rscript
#
# Process a big object
#
# Usage: Rscript processBigObject.R <INPUT_FILENAME> <OUTPUT_FILENAME>
input_filename <- commandArgs(trailingOnly = TRUE)[1]
output_filename <- commandArgs(trailingOnly = TRUE)[2]
# I'm making up function names here, do what you must for your object
o <- readBigObject(input_filename)
s <- calculateSmallerSummaryOf(o)
writeOutput(s, output_filename)
Then, write a shell script or use system2() to call the script multiple times, with different file names. Because R is terminated after each object, the memory is freed:
system2("Rscript", c("processBigObject.R", "bigObject1.dat", "bigObject1_result.dat"))
system2("Rscript", c("processBigObject.R", "bigObject2.dat", "bigObject2_result.dat"))
system2("Rscript", c("processBigObject.R", "bigObject3.dat", "bigObject3_result.dat"))
...

Windows (ReFS,NTFS) file preallocation hint

Assume I have multiple processes writing large files (20gb+). Each process is writing its own file and assume that the process writes x mb at a time, then does some processing and writes x mb again, etc..
As a result, the files become heavily fragmented, since blocks belonging to the different files get allocated one after another on the disk.
Of course it is easy to work around this issue by using SetEndOfFile to "preallocate" the file when it is opened and then setting the correct size before it is closed. But now an application that accesses these files remotely, and is able to parse these in-progress files, sees zeroes at the end of the file and takes much longer to parse it.
I do not have control over the reading application, so I can't optimize it to take zeros at the end into account.
Another dirty fix would be to run defragmentation more often, run Sysinternals' Contig utility, or even implement a custom "defragmenter" which would process my files and consolidate their blocks.
Another more drastic solution would be to implement a minifilter driver which would report a "fake" filesize.
But obviously both solutions listed above are far from optimal. So I would like to know if there is a way to provide a file size hint to the filesystem so it "reserves" the consecutive space on the drive, but still report the right filesize to applications?
Writing larger chunks at a time also helps with fragmentation, but still does not solve the issue.
EDIT:
Since the usefulness of SetEndOfFile in my case seems to be disputed I made a small test:
LARGE_INTEGER size;
LARGE_INTEGER a;
char buf='A';
DWORD written=0;
DWORD tstart;
std::cout << "creating file\n";
tstart = GetTickCount();
HANDLE f = CreateFileA("e:\\test.dat", GENERIC_ALL, FILE_SHARE_READ, NULL, CREATE_ALWAYS, 0, NULL);
size.QuadPart = 100000000LL;
SetFilePointerEx(f, size, &a, FILE_BEGIN);
SetEndOfFile(f);
printf("file extended, elapsed: %d\n",GetTickCount()-tstart);
getchar();
printf("writing 'A' at the end\n");
tstart = GetTickCount();
SetFilePointer(f, -1, NULL, FILE_END);
WriteFile(f, &buf,1,&written,NULL);
printf("written: %d bytes, elapsed: %d\n",written,GetTickCount()-tstart);
While the application was waiting for a keypress after SetEndOfFile, I examined the on-disk NTFS structures:
The image shows that NTFS has indeed allocated clusters for my file. However, the unnamed DATA attribute has StreamDataSize specified as 0.
Sysinternals DiskView also confirms that clusters were allocated.
After pressing Enter to allow the test to continue (and waiting for quite some time, since the file was created on a slow USB stick), the StreamDataSize field was updated.
Since I wrote 1 byte at the end, NTFS now really had to zero everything, so SetEndOfFile does indeed help with the issue that I am "fretting" about.
I would appreciate it very much if answers/comments also provided an official reference to back up the claims being made.
Oh and the test application outputs this in my case:
creating file
file extended, elapsed: 0
writing 'A' at the end
written: 1 bytes, elapsed: 21735
Also, for the sake of completeness, here is an example of how the DATA attribute looks when setting the FileAllocationInfo (note that I created a new file for this picture).
Windows file systems maintain two public sizes for file data, which are reported in the FileStandardInformation:
AllocationSize - a file's allocation size in bytes, which is typically a multiple of the sector or cluster size.
EndOfFile - a file's absolute end of file position as a byte offset from the start of the file, which must be less than or equal to the allocation size.
Setting an end of file that exceeds the current allocation size implicitly extends the allocation. Setting an allocation size that's less than the current end of file implicitly truncates the end of file.
Starting with Windows Vista, we can manually extend the allocation size without modifying the end of file via SetFileInformationByHandle: FileAllocationInfo. You can use Sysinternals DiskView to verify that this allocates clusters for the file. When the file is closed, the allocation gets truncated to the current end of file.
If you don't mind using the NT API directly, you can also call NtSetInformationFile: FileAllocationInformation. Or even set the allocation size at creation via NtCreateFile.
FYI, there's also an internal ValidDataLength size, which must be less than or equal to the end of file. As a file grows, the clusters on disk are lazily initialized. Reading beyond the valid region returns zeros. Writing beyond the valid region extends it by initializing all clusters up to the write offset with zeros. This is typically where we might observe a performance cost when extending a file with random writes. We can set the FileValidDataLengthInformation to get around this (e.g. SetFileValidData), but it exposes uninitialized disk data and thus requires SeManageVolumePrivilege. An application that utilizes this feature should take care to open the file exclusively and ensure the file is secure in case the application or system crashes.

Is writing to a Unix file through a shell script synchronized?

I have a requirement where many threads will call the same shell script to perform some work and will then write their output (data as a single text line) to a common text file.
Since many threads will try to write data to the same file, my question is whether Unix provides a default locking mechanism so that they cannot all write at the same time.
Performing a short single write to a file opened for append is mostly atomic; you can get away with it most of the time (depending on your filesystem), but if you want to be guaranteed that your writes won't interrupt each other, or to write arbitrarily long strings, or to be able to perform multiple writes, or to perform a block of writes and be assured that their contents will be next to each other in the resulting file, then you'll want to lock.
While not part of POSIX (unlike the C library call for which it's named), the flock tool provides the ability to perform advisory locking ("advisory" -- as opposed to "mandatory" -- meaning that other potential writers need to voluntarily participate):
(
flock -x 99 || exit # lock the file descriptor
echo "content" >&99 # write content to that locked FD
) 99>>/path/to/shared-file
The use of file descriptor #99 is completely arbitrary -- any unused FD number can be chosen. Similarly, one can safely put the lock on a different file than the one to which content is written while the lock is held.
The advantage of this approach over several conventional mechanisms (such as using exclusive creation of a file or directory) is automatic unlock: If the subshell holding the file descriptor on which the lock is held exits for any reason, including a power failure or unexpected reboot, the lock will be automatically released.
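The same advisory-locking pattern can also be used from a program instead of the shell. Below is a minimal Python sketch, assuming a Unix system where the standard fcntl module is available. The helper name append_locked and the temporary demo path are made up for illustration.

```python
import fcntl
import os
import tempfile

def append_locked(path, line):
    """Append one line to path while holding an exclusive advisory lock."""
    with open(path, "a") as f:
        fcntl.flock(f, fcntl.LOCK_EX)  # blocks until the lock is acquired
        try:
            f.write(line + "\n")
            f.flush()  # push the data out before the lock is released
        finally:
            fcntl.flock(f, fcntl.LOCK_UN)

# Hypothetical demo file in a fresh temporary directory.
demo_path = os.path.join(tempfile.mkdtemp(), "shared-file")
append_locked(demo_path, "first")
append_locked(demo_path, "second")
with open(demo_path) as f:
    lines = f.read().splitlines()
print(lines)
```

As with the flock tool, this only protects against other writers that take the same lock before writing.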
"my question is whether unix provides a default locking mechanism so
that all can not write at the same time."
In general, no. At least not something that's guaranteed to work. But there are other ways to solve your problem, such as lockfile, if you have it available:
Examples
Suppose you want to make sure that access to the file "important" is
serialised, i.e., no more than one program or shell script should be
allowed to access it. For simplicity's sake, let's suppose that it is
a shell script. In this case you could solve it like this:
...
lockfile important.lock
...
access_"important"_to_your_hearts_content
...
rm -f important.lock
...
Now if all the scripts that access "important" follow this guideline,
you will be assured that at most one script will be executing between
the 'lockfile' and the 'rm' commands.
But there's actually a better way if you can use C or C++: use the low-level open() call to open the file in append mode, and call write() to write your data. No locking is necessary. Per the write() man page:
If the O_APPEND flag of the file status flags is set, the file offset
shall be set to the end of the file prior to each write and no
intervening file modification operation shall occur between changing
the file offset and the write operation.
Like this:
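#include <fcntl.h>   // open(), O_WRONLY, O_APPEND
#include <string.h>  // strlen()
#include <unistd.h>  // write()
// process-wide global file descriptor
int outputFD = open( fileName, O_WRONLY | O_APPEND, 0600 );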
.
.
.
// write a string to the file
ssize_t writeToFile( const char *data )
{
return write( outputFD, data, strlen( data ) );
}
In practice, you can write anything to the file - it doesn't have to be a NUL-terminated character string.
That's supposed to be atomic on writes up to PIPE_BUF bytes, which is usually something like 512, 4096, or 5120. Some Linux filesystems apparently don't implement that properly, so you may in practice be limited to about 1K on those file systems.
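For comparison, the append-mode approach can be sketched in Python through the os module, which wraps the same open() and write() system calls. This is only an illustration: append_line and the temporary log path are hypothetical names, and O_CREAT is added here so the demo can create its own file. Each line goes out in a single os.write() call, so the offset is moved to the end of the file and the data is written as one operation.

```python
import os
import tempfile

def append_line(path, text):
    """Append text with a single write() on an O_APPEND descriptor."""
    fd = os.open(path, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o600)
    try:
        # One write() call per line; returns the number of bytes written.
        return os.write(fd, text.encode("utf-8"))
    finally:
        os.close(fd)

# Hypothetical demo file in a fresh temporary directory.
log_path = os.path.join(tempfile.mkdtemp(), "out.log")
written = append_line(log_path, "worker 1 done\n")
append_line(log_path, "worker 2 done\n")
with open(log_path) as f:
    content = f.read()
print(written, repr(content))
```

A long-running writer would normally keep the descriptor open, as in the C snippet above, rather than reopening the file for every line.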

Writing small amount of data to large number of files on GlusterFS 3.7

I'm experimenting with 2 Gluster 3.7 servers in 1x2 configuration. Servers are connected over 1 Gbit network. I'm using Debian Jessie.
My use case is as follows: open file -> append 64 bytes -> close file, repeated in a loop for about 5000 different files. Execution time for such a loop is roughly 10 seconds if I access the files through a mounted glusterfs drive. If I use libgfapi directly, execution time is about 5 seconds (2 times faster).
However, the same loop executes in 50ms on plain ext4 disk.
There is a huge performance difference between Gluster 3.7 and earlier versions which is, I believe, due to the cluster.eager-lock setting.
My target is to execute the loop in less than 1 second.
I've experimented with lots of Gluster settings, but without success. dd tests with various bsize values behave as if the TCP no-delay option were not set, although from the Gluster source code it seems that no-delay is the default.
Any idea how to improve the performance?
Edit:
I've found a solution that works in my case so I'd like to share it in case anyone else faces the same issue.
The root cause of the problem is the number of round trips between the client and the Gluster server during execution of the open/write/close sequence. I don't know exactly what is happening behind the scenes, but timing measurements show exactly that pattern. Now, the obvious idea would be to "pack" the open/write/close sequence into a single write function. Roughly, the C prototype of such a function would be:
int write(const char* fname, const void *buf, size_t nbyte, off_t offset)
But there is already such an API function, glfs_h_anonymous_write, in libgfapi (thanks go to Suomya from the Gluster mailing group). The somewhat hidden detail is the file identifier, which is not a plain file name but an object of type struct glfs_object. Clients obtain an instance of such an object through the API calls glfs_h_lookupat/glfs_h_creat. The point here is that a glfs_object representing a file name is "stateless" in the sense that the corresponding inode is left intact (not ref counted). One should think of a glfs_object as a plain file name identifier and use it as you would use a file name (actually, glfs_object stores a plain pointer to the corresponding inode without ref counting it).
Finally, we should call glfs_h_lookupat/glfs_h_creat once and then write many times to the file using glfs_h_anonymous_write.
That way I was able to append 64 bytes to 5000 files in 0.5 seconds, which is 20 times faster than using the mounted volume and the open/write/close sequence.

How do I convert an Intel HEX file to raw data like memory view?

I want to make boot loader code for AVR, which can update firmware over the air.
Now I am able to write to the application area using some fixed data. I have a hex file of the new firmware to be updated. How do I convert that hex file to raw data so that I can update the application using that raw data?
If you're using WinAVR for compilation, you can do this using the included avr-objcopy:
C:\WinAVR-20100110\bin> avr-objcopy.exe -I ihex -O binary input_file.hex output.bin
If you're developing on Linux, there's a package, avr-binutils, with the avr-objcopy program.
You may use some tool (http://hex2bin.sourceforge.net/ or another hex2bin converter) or write your own hex parser, though a parser needs extra care when it comes to files > 64 KB.
As you pointed out, the hex file is encoded in Intel Hex format. You have to extract the flash data from the data records. Each record (line) holds up to 16 bytes (common, but may vary) of data.
Note that there are different record types, and some may introduce an address offset, depending on how the flash data is distributed. The Wiki description should be enough to get the concept.
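As a rough illustration of the record handling described above, here is a minimal Python sketch of an Intel HEX parser. It handles data records (type 00), end-of-file (01), and the two address-offset records (02 and 04), and verifies each line's checksum. The function name ihex_to_bytes, the 0xFF fill value for gaps, and the sample text are assumptions made for this example. It does not handle the 64 KB wrap-around within a segment, so treat it as a starting point rather than a complete converter.

```python
def ihex_to_bytes(text, fill=0xFF):
    """Flatten Intel HEX text into (start_address, contiguous image bytes)."""
    memory = {}
    base = 0  # offset introduced by type 02/04 records
    for lineno, line in enumerate(text.splitlines(), 1):
        line = line.strip()
        if not line:
            continue
        if not line.startswith(":"):
            raise ValueError(f"line {lineno}: missing ':' start code")
        raw = bytes.fromhex(line[1:])
        if len(raw) < 5 or len(raw) != raw[0] + 5:
            raise ValueError(f"line {lineno}: wrong record length")
        if sum(raw) & 0xFF != 0:
            raise ValueError(f"line {lineno}: bad checksum")
        count, offset, rectype = raw[0], (raw[1] << 8) | raw[2], raw[3]
        data = raw[4:4 + count]
        if rectype == 0x00:    # data record
            for i, b in enumerate(data):
                memory[base + offset + i] = b
        elif rectype == 0x01:  # end of file
            break
        elif rectype == 0x02:  # extended segment address
            base = ((data[0] << 8) | data[1]) << 4
        elif rectype == 0x04:  # extended linear address
            base = ((data[0] << 8) | data[1]) << 16
        # Types 03 and 05 (start addresses) carry no flash data; skipped.
    if not memory:
        return 0, b""
    start, end = min(memory), max(memory)
    return start, bytes(memory.get(a, fill) for a in range(start, end + 1))

# Hypothetical three-line file: 4 bytes at 0x0000, 2 bytes at 0x0006, EOF.
sample_hex = ":0400000001020304F2\n:02000600AABB93\n:00000001FF\n"
start, image = ihex_to_bytes(sample_hex)
print(hex(start), image.hex())
```

Unwritten gaps are filled with 0xFF here because erased flash usually reads back that way; change the fill argument if your bootloader expects something else.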

Resources