Ruby I/O performance - reading a file char by char

Short version:
How do you read from STDIN (or a file) char by char while maintaining high performance in Ruby? (Though the problem is probably not Ruby specific.)
Long version:
While learning Ruby, I'm designing a little utility that has to read piped text data, find and collect the numbers in it, and do some processing.
cat huge_text_file.txt | program.rb
input > 123123sdas234sdsd5a ...
output > 123123, 234, 5, ...
The text input might be huge (gigabytes), and it might not contain newlines or whitespace (any non-digit char is a separator), so I went with char-by-char reading, though I had my concerns about the performance. It turns out doing it this way is incredibly slow.
Simply reading char by char with no processing at all on a 900 KB input file takes around 7 seconds!
while c = STDIN.read(1)
end
If I input data with newlines and read line by line, the same file is read about 100 times faster.
while s = STDIN.gets
end
It seems like reading from a pipe with STDIN.read(1) doesn't involve any buffering, and every read hits the hard drive; but shouldn't the OS be caching it?
Doesn't STDIN.gets read char by char internally until it encounters '\n'?
Using C, I would probably read the data in chunks, though then I would have to deal with numbers being split by the buffer window, and that doesn't look like an elegant solution for Ruby. So what is the proper way of doing this?
P.S. Timing reading the same file in Python:
for line in f:
    line
f.close()
Running time is 0.01 sec.
c = f.read(1)
while c:
    c = f.read(1)
f.close()
Running time is 0.17 sec.
Thanks!

This script reads the IO object word by word and executes the block every time 1000 words have been found or the end of the file has been reached.
No more than 1000 words are kept in memory at the same time. Note that using " " as the separator means that "words" might contain newlines.
This script uses IO#each with a separator (a space in this case, to get an Enumerator of words), lazy to avoid doing any operation on the whole file content, and each_slice to get an array of batch_size words.
batch_size = 1000
STDIN.each(" ").lazy.each_slice(batch_size) do |batch|
# batch is an Array of batch_size words
end
Instead of using cat and |, you could also read the file directly:
batch_size = 1000
File.open('huge_text_file.txt').each(" ").lazy.each_slice(batch_size) do |batch|
# batch is an Array of batch_size words
end
With this code, no number will be split, no stitching logic is needed, it should be much faster than reading the file char by char, and it will use much less memory than reading the whole file into a String.
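Since the sample input in the question contains no whitespace at all, the chunked C approach the asker mentions is also worth sketching. This is only a sketch: the buffer size is arbitrary, a number spanning a chunk boundary is simply carried over between reads, and a very long digit run would overflow the accumulator.
#include <stdio.h>

/* Scan stdin in large chunks and print each run of digits.
   cur >= 0 means a number is in progress, possibly spanning chunks. */
int main(void)
{
    char buf[1 << 16];
    size_t n;
    long cur = -1;                      /* -1: no number in progress */
    while ((n = fread(buf, 1, sizeof buf, stdin)) > 0) {
        for (size_t i = 0; i < n; i++) {
            if (buf[i] >= '0' && buf[i] <= '9')
                cur = (cur < 0 ? 0 : cur) * 10 + (buf[i] - '0');
            else if (cur >= 0) {
                printf("%ld\n", cur);   /* any non-digit ends a number */
                cur = -1;
            }
        }
    }
    if (cur >= 0)
        printf("%ld\n", cur);           /* flush a trailing number at EOF */
    return 0;
}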

Related

ReadFile truncating console input data containing multibyte characters, how to get correct input?

I was trying to implement a unified input interface using the Windows API function ReadFile for my application, which should be able to handle both console input and redirection. It didn't work as expected with console input containing multibyte (like CJK) characters.
According to the Microsoft documentation, for console input handles ReadFile behaves just like ReadConsoleA. (FYI, results are encoded in the console's current code page, so the A family of console functions is acceptable. And there's no ReadFileW, as ReadFile works on bytes.) The third and fourth arguments of ReadFile are nNumberOfBytesToRead and lpNumberOfBytesRead respectively, but they are nNumberOfCharsToRead and lpNumberOfCharsRead in ReadConsole. To find out the exact mechanism, I did the following test:
BYTE buf[8];
DWORD len;
BOOL f = ReadFile(in, buf, 4, &len, NULL);
if (f) {
    // Print buf, len
    ReadConsoleW(in, buf, 4, &len, NULL); // check count of remaining characters
    // Print len
}
For input like 字, len is set to 4 first (character plus CRLF), indicating the arguments are counting bytes.
For 文字 or a字, len stays 4 and only the first 4 bytes of buf are used at first, but the second read does not get the CRLF. Only when more than 3 characters are input does the second read get the unread LF, then CR. This means that ReadFile actually consumes up to 4 logical characters and discards the part of the input after the first 4 bytes.
The behavior of ReadConsoleA is identical to ReadFile.
Obviously, this is more likely a bug than a design choice. I did some searching and found related feedback dating back to 2009. It seems that ReadConsoleA and ReadFile used to read data fully from the console input, but as this was inconsistent with the ReadFile specification and could cause severe buffer overflows that threatened system processes, Microsoft made a makeshift repair by simply discarding the excess bytes, ignoring support for multibyte charsets. (This is an issue about the behavior after that fix, when the buffer is limited to 1 byte.)
Currently the only practical solution I have come up with to make input correct is to check whether the input handle is a console and, if so, process it differently using ReadConsoleW, which adds complexity to the implementation. Are there other ways to get it right?
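A minimal sketch of that workaround, assuming GetConsoleMode is a reliable console test and that converting back to the console code page with WideCharToMultiByte is acceptable (the function name and buffer sizes are my own, hypothetical choices):
#include <windows.h>

/* Read bytes from handle "in": use ReadConsoleW plus a code-page
   conversion for console handles, plain ReadFile otherwise.
   A sketch only; the fixed WCHAR buffer and error handling are naive. */
BOOL MyRead(HANDLE in, char *buf, DWORD cb, DWORD *got)
{
    DWORD mode;
    if (GetConsoleMode(in, &mode)) {    /* succeeds only for console handles */
        WCHAR wbuf[512];
        DWORD wgot;
        if (!ReadConsoleW(in, wbuf, 512, &wgot, NULL))
            return FALSE;
        /* may truncate if buf is too small for the converted text */
        *got = WideCharToMultiByte(GetConsoleCP(), 0, wbuf, (int)wgot,
                                   buf, (int)cb, NULL, NULL);
        return TRUE;
    }
    return ReadFile(in, buf, cb, got, NULL);  /* pipes and files count bytes */
}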
Maybe I could still keep ReadFile by providing a buffer large enough to hold any input at one time. However, I don't have any idea how to check or set the input buffer size. (I can only enter 256 characters (254 plus CRLF) in my application on my computer, but cmd.exe allows entering 8,192 characters, so this is really a problem.) It would also be helpful if more information about this could be provided.
P.S.: Maybe _getws could also help, but this question is about the Windows API, and my application needs to use some low-level console functions.

Print lines around position in the file

I'm importing a big CSV file (5 GB) into BigQuery, and I got information about an error in the file and its position, specified as a byte offset from the start of the file (for example, 134683757). I'd like to look at the lines around this error position.
Some example lines of the file:
field1, field2, field3
abc, bcd, efg
...
dge, hfr, kdf,
dgj, "a""a", fbd # this line contains an invalid CSV element and causes the error, let's say at position 134683757
skd, frd, lqw
...
asd, fij, fle
I need some command to show lines around error like
dge, hfr, kdf,
dgj, "a""a", fbd
skd, frd, lqw
I tried sed and awk but I didn't find any simple solution.
It was definitely not clear from the original version of the question that you only got a byte offset from the start of the file.
You need to get a better position from the software generating the error; the developer was lazy in reporting an unusable number. It is reasonable to request a line number (and preferably offset within the line), rather than (or as well as) the byte offset from the start.
Assuming that the number is a byte position in the file, that gets tricky. Most Unix utilities work with lines (of variable length). I'd be tempted to write some C code to do the job, but that might be beyond you (and no shame in that).
Failing that, your best bet is likely the dd command. If the number reported is 134683757, then I'd guess that your lines are probably not more than 1 KiB each (adjust the numbers if they're bigger, or smaller), and then use:
dd if=big.csv of=extract.csv bs=1 skip=$((134683757 - 3 * 1024)) count=6144
echo >> extract.csv
You'd then look at extract.csv. The raw dd output probably won't have a newline at the end of the last line (the echo >>extract.csv fixes that). The output will probably start part way through a record and end part way through another record. However, you're likely to have the relevant information, as well as some irrelevant information. As I said, adjust the numbers to suit your exact situation.
The trickiest part is identifying exactly where the byte offset is in the file you get. With custom C code, that can be provided easily (more easily). With the output from dd, you have to do the calculation yourself.
awk -v offset=$((134683757 - 3 * 1024)) '
{ printf "%9d: %s\n", offset, $0; offset += length($0) + 1 }
' extract.csv
That takes the starting offset from the dd command and prefixes (the remnants of) the first line with that number and the data; it then adds the line's length, plus one for the newline that wasn't counted, to the offset, and continues to the end of the file. That gives you the start offset for each line in the extracted data. You can see where your actual start was by looking at the offsets; you should be able to identify which record that was.
You could use a variant of this Awk script that reads the whole file line by line, tracks the offset (as well as the line numbers), and prints the data when it gets to the vicinity of where you have the problem; a C version of that variant is sketched below.
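A rough C sketch of that whole-file variant, assuming lines fit in a 64 KiB buffer; the window size is a guess to be tuned, and the hard-coded defaults are only placeholders:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Print every line whose bytes overlap [target - window, target + window].
   Reads the file sequentially, so the reported offsets are exact. */
int main(int argc, char **argv)
{
    long target = argc > 2 ? atol(argv[2]) : 134683757;
    long window = 3 * 1024;             /* guess; adjust to your line lengths */
    long offset = 0;
    char line[1 << 16];
    FILE *fp = fopen(argc > 1 ? argv[1] : "big.csv", "r");
    if (!fp) { perror("fopen"); return 1; }
    for (long lineno = 1; fgets(line, sizeof line, fp); lineno++) {
        long next = offset + (long)strlen(line);
        if (next >= target - window && offset <= target + window)
            printf("%9ld (line %ld): %s", offset, lineno, line);
        offset = next;
    }
    fclose(fp);
    return 0;
}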
In times long past, I had to deal with data from 1/2 inch mag tapes (those big circular tapes you see in old movies) where the files generated on a mainframe seemed sanely formatted for the first few tens of megabytes, but then the format changed to some alternative format for a few megabytes, and then reverted to the original format once more. I never did find out why; I just learned how to deal with it. Trial and error!

File seek with two-byte characters

I'm writing a small log parser which should find some tags in files.
The files are large (512 MB) and have the following structure:
[2018.07.10 00:30:03:125] VersionInfo\886
...some data...
[2018.07.10 00:30:03:109][TraceID: 8HRWSI105YVO91]->IncomingTime\16
...some data...
[2018.07.10 00:30:03:109][TraceID: 8HRWSI105YVO91]->IncomingData\397
...some data...
[2018.07.10 00:30:03:749][TraceID: 8HRWSI105YVO91]->OutgoingData\26651
...somedata...
Each block (IncomingTime, IncomingData, OutgoingData, etc.) has its size at the end of the header line: 886, 16, 397, 26651. That size is a character count, not bytes. Some blocks are very large and can't be read without a large buffer (if I use bufio). I want to skip unnecessary blocks using file.Seek.
The problem is that file.Seek needs a length in bytes, and I only have a character count (a block may contain Unicode data with two-byte characters). Is there any chance to skip blocks using the character count?
The problem is that file.Seek needs a length in bytes, and I only have a character count (a block may contain Unicode data with two-byte characters). Is there any chance to skip blocks using the character count?
That's actually impossible. As you've described the file format, both of the following are possible:
...VersionInfo\1
[ 20 ]
...VersionInfo\1
[ C2 A0 ]
If you've just read the newline and you know you need to read one character, you know it's somewhere between 1 and 2 bytes (UTF-8 characters can be up to 4 bytes, even), but not which; blindly seeking forward some number of bytes without inspecting the intervening data won't work. The pathological case is a larger block where the first half has many multi-byte characters and the last half has text that happens to look like one of your entry headers.
With this file format you're forced to read it a character at a time.
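Note that "a character at a time" doesn't have to mean one read call per character: with buffered input it's enough to look at each UTF-8 lead byte, which encodes the sequence length. A C sketch of skipping n characters that way (assuming well-formed UTF-8; the same lead-byte test translates directly to Go):
#include <stdio.h>

/* Skip n UTF-8 characters on fp by decoding lead bytes.
   Continuation bytes look like 10xxxxxx and never start a character.
   Returns 0 on success, -1 on EOF. fgetc is buffered, so this is cheap. */
int skip_utf8_chars(FILE *fp, long n)
{
    int c;
    while (n > 0 && (c = fgetc(fp)) != EOF) {
        int len = 1;                    /* ASCII by default */
        if ((c & 0xE0) == 0xC0) len = 2;
        else if ((c & 0xF0) == 0xE0) len = 3;
        else if ((c & 0xF8) == 0xF0) len = 4;
        for (int i = 1; i < len; i++)   /* consume continuation bytes */
            if (fgetc(fp) == EOF)
                return -1;
        n--;
    }
    return n == 0 ? 0 : -1;
}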

OpenCL kernel - every work item overwrites global memory?

I'm trying to write a kernel to get the character frequencies of a string.
First, here is the code I have for kernel right now:
__kernel void readParallel(__global char * indata, __global int * outdata)
{
    int startId = get_global_id(0) * 8;
    int maxId = startId + 8;            // each work item covers 8 symbols
    for (int i = startId; i < maxId; i++)
    {
        ++outdata[indata[i]];
    }
}
The variable indata holds the string in global memory, and outdata is an array of 256 int values in global memory. Every work item reads 8 symbols from the string and should increment the count for the corresponding ASCII code in the array. The code compiles and executes, but outdata contains fewer occurrences overall than there are characters in indata. I think the problem is that the work items overwrite each other's updates in global memory. It would be nice if you could give me some tips to solve this.
By the way, I am a rookie in OpenCL ;-) and, yes, I looked for solutions in other questions.
You are experiencing the effects of your accesses to global memory not being atomic (a C++-oriented description of what those are, or another description by the Intel TBB folks). What happens, chronologically, is:
Some workgroup "thread" loads outData[123] into some register r1
... lots of work, reading and writing, happens, including on outData[123] ...
The same workgroup "thread" increments r1
... lots of work, reading and writing, happens, including on outData[123] ...
The same workgroup "thread" writes r1 to outData[123]
So, the value written to outData[123] "throws away" the updates during the time period between the read and the write (I'm ignoring the possibility of parallel writes corrupting each other rather than one of them winning out).
What you need to do is either:
Use atomic operations - the smallest change to your code, but very inefficient, since it serializes your work to a great extent (see the sketch after this list), or
Use work-item-specific, warp-specific and/or work-group-specific partial results, which require less/cheaper synchronization, and combine them eventually after having done a lot of work on them.
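A minimal sketch of the first option, applied to the kernel above; switching indata to uchar also sidesteps the signed-index issue discussed below (atomic_inc on global int is core since OpenCL 1.1, an extension before that):
__kernel void readParallel(__global const uchar * indata, __global int * outdata)
{
    int startId = get_global_id(0) * 8;
    for (int i = startId; i < startId + 8; i++)
    {
        atomic_inc(&outdata[indata[i]]); // atomic read-modify-write, no lost updates
    }
}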
On an unrelated note, and as @huseyintugrulbuyukisik correctly points out, your code uses signed char values to index the array. To fix that, do one of the following:
reinterpret those chars as unsigned chars for array indices (and reinterpret back when reading the array),
upcast the char values to a larger integral type and add 128 to get an offset into the output array, or
define your kernel to only support ASCII characters (no higher than 127), in which case you can ignore this issue (although that will be a potential crasher if you get invalid input).
If you only care about the frequency of printable characters (but can also have non-printing characters in the input), you could perform a run-time check before counting a character.

Wrapping a binary data file to self-convert to CSV?

I'm writing custom firmware for a SparkFun Logomatic V2 that records binary data to a file on a 2 GB micro-SD card. The data file size will range from 100 MB to 1 GB.
The format of the binary data is in flux as the board's firmware evolves (it will actually be dynamically reconfigurable at run-time). Rather than create and maintain a separate decoder/converter program for each version of firmware/configuration, I'd much rather make the data files self-converting to CSV format by starting the data file with a Bash script that is written to the data file before data recording starts.
I know how to create a Here Document, but I suspect Bash would be unable to quickly parse and convert a gigabyte of binary data, so I'd like to make the process run much faster by having the script first compile some C code (assume GCC is present and in the path), then run the resulting program, passing the binary data to stdin.
To make the problem more concrete, assume the firmware will create binary data consisting of four 16-bit integer values: a timestamp (unsigned) followed by three accelerometer axes (signed). There is no separator between records (mainly because I'm saturating the SPI interface to the uSD card).
So, I think I need a script with TWO here documents: One for the C code (parameterized by expanded Bash variables), and another for the binary data. Here's where I am so far:
#!/usr/bin/env bash
# Produced by firmware version 0.0.0.0.0.1 alpha
# Configuration for this data run:
header_string="Time, X, Y, Z"
column_count=4
# Create the converter executable
# Use "<<-" to permit code to be indented for readability.
# Allow variable expansion/substitution.
gcc -xc - -o /tmp/convertit <<-THE_C_CODE
#include <stdio.h>
int main (int argc, char **argv) {
    // Write ${header_string} to stdout
    while (1) {
        // Read ${column_count} shorts from stdin
        // Break if EOF
        // Write ${column_count} comma-delimited values to stdout
    }
    // Close stdout
    return 0;
}
THE_C_CODE
# Pass the binary data to the converter
# Hard-quote the Here tag to prevent subsequent expansion/substitution
/tmp/convertit >./$1.csv <<'THE_BINARY_DATA'
...
... hundreds of megabytes of semi-random data ...
...
THE_BINARY_DATA
rm /tmp/convertit
exit 0
Does that look about right? I don't yet have a real data file to test this with, but I wanted to verify the idea before going much further.
Will Bash complain if the closing lines are missing? This may happen if data capture terminates unexpectedly due to a shock knocking loose the battery or uSD card. Or if the firmware borks.
Is there a faster or better method I should consider? For example, I wonder if Bash will be too slow to copy the binary data as fast as the C program can consume it: Should the C program open the data file directly?
TIA,
-BobC
You may want to have a look at makeself. It allows you to turn any .tar.gz archive into a self-extracting file which is platform independent (something like a shell script that contains a here document), so you can easily distribute your data and decoder together. It also allows you to configure a script contained within the archive to be run when the container script is run. This way you can use makeself for packaging, and inside the archive you can put your data files and a decoder written in C or Bash or whatever language you find suitable.
While it is possible to decode binary data using shell tools (e.g. od), it's very cumbersome and inefficient. I'd recommend using either a C program or Perl, which is also likely to be found on almost any machine (check this page).
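For the record format described in the question (an unsigned 16-bit timestamp followed by three signed 16-bit axes), such a C converter might look like the sketch below; the host-matching, little-endian byte order is an assumption:
#include <stdio.h>
#include <stdint.h>

/* Read records of four 16-bit values from stdin (unsigned timestamp,
   then three signed axes) and emit CSV on stdout.
   A sketch: assumes the host shares the logger's byte order. */
int main(void)
{
    uint16_t rec[4];
    puts("Time, X, Y, Z");
    while (fread(rec, sizeof rec[0], 4, stdin) == 4) {
        printf("%u, %d, %d, %d\n",
               (unsigned)rec[0],
               (int16_t)rec[1], (int16_t)rec[2], (int16_t)rec[3]);
    }
    return 0;
}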
