How to Find Exact Row in Log File - algorithm

Suppose you have a big log file, billions of lines long. The file has several columns, one of which holds IP addresses in the form xxx.xxx.xxx.xxx.
How can I find one exact line quickly, say the line containing 123.123.123.123?
A naive line-by-line search seems too slow.

If you don't have any other information to go on (such as a date range, assuming the file is sorted), then a line-by-line search is your best option. That doesn't mean you have to read the file one line at a time, though. Also, it might be more efficient to search backwards if you know the entry is recent.
The general approach (for searching backwards) is this:
Declare a buffer. You will read chunks of the file at a time into this buffer as fast as possible (preferably by using low-level operating system calls that can read directly without any buffering/caching).
So you seek to the end of your file minus the size of your buffer and read that many bytes.
Now you search forwards through your buffer for the first newline character. Remember that offset for later, as everything before it is a partial line. Starting at the next line, you search forward to the end of the buffer looking for your string. If it has to be in a certain column but other columns could contain that value, then you need to do some parsing.
Now you continue to search backwards through your file. You seek to the last position you read from minus the chunk size plus the offset that you found when you searched for a newline character. Now, you read again. If you like you can move that partial line to the end of the buffer and read fewer bytes but it's not going to make a huge difference if your chunks are large enough.
And you continue until you reach the beginning of the file. There is of course a special case when the number of bytes left to read is less than the chunk size (in that case you don't skip the first line, because it is complete). I assume that you won't actually reach the beginning of the file, since it seems clear that you don't want to search the entire thing.
So that's the approach when you have no idea where the value is. If you do have some idea on ordering, then of course you probably want to do a binary search. In that case you can use smaller chunk sizes (enough to at least catch a full line).
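A minimal sketch of that backward chunked scan in C (my own illustration of the steps above, not production code; error handling and column-aware parsing are left out):

#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define CHUNK (1 << 20)                       /* 1 MiB read buffer */

/* Scan the file backwards in CHUNK-sized pieces and return the file
   offset of a line containing 'needle', or -1 if nothing was found. */
off_t find_backwards(const char *path, const char *needle)
{
    static char buf[CHUNK + 1];
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    off_t end = lseek(fd, 0, SEEK_END);       /* high boundary of the next read */
    while (end > 0) {
        off_t start = end > CHUNK ? end - CHUNK : 0;
        ssize_t len = pread(fd, buf, (size_t)(end - start), start);
        if (len <= 0)
            break;
        buf[len] = '\0';

        /* Skip the partial first line unless we are at the start of the
           file; it will be covered completely by the next (earlier) read. */
        char *p = buf;
        if (start > 0) {
            char *nl = memchr(buf, '\n', (size_t)len);
            if (nl == NULL || nl == buf + len - 1) {
                end = start;                  /* no complete line fits in this chunk;
                                                 a real version would grow the buffer */
                continue;
            }
            p = nl + 1;
        }

        /* Search the complete lines in this chunk.  A real version would
           split on '\n' and check the right column instead of strstr. */
        char *hit = strstr(p, needle);
        if (hit != NULL) {
            close(fd);
            return start + (hit - buf);       /* offset of the match in the file */
        }

        end = start + (p - buf);              /* next read ends at the partial line */
    }
    close(fd);
    return -1;
}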

You really need to find some regularity in the file and exploit it. Barring that, if you have multiple processors you could split the file into sections and search them in parallel - assuming I/O would not then become the bottleneck.

Related

Dividing a file to several chunks

Let's assume that we have a file of 100k lines, or ~2 GB, and we want to split it into 10 chunks of 10k lines each, so that the chunks can then be processed in parallel. Is there any way to create pointers to the starting line of each of the 10 chunks without needing to traverse the whole file? I was thinking of somehow dividing the file by its size, so that a pointer is created every 200 MB. Is this even feasible?
Yes, of course. But you need to make some assumptions and accept that your chunks will not be exact.
Either assume a standard line length or scan a few lines and measure it. Then you multiply that by the number of lines you are aiming for and just hope it's a good estimate.
Or if you just want 10 chunks take the file size and divide by 10.
So then you jump to that point in the file, either by using lseek and read, pread, or mmap. Then you scan forward until you find the end of a line and the start of the next.
It won't be exact line counts unless you actually count every line. But it will be pretty close.
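A rough sketch of that seek-and-scan step in C (my own illustration, not the linesection code linked below):

#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Fill starts[0..nchunks-1] with the byte offset where each chunk begins.
   Boundaries are estimated from the file size, then nudged forward to the
   next newline so that every chunk starts at the beginning of a line. */
int chunk_starts(const char *path, int nchunks, off_t *starts)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct stat st;
    if (fstat(fd, &st) < 0) {
        close(fd);
        return -1;
    }

    starts[0] = 0;
    for (int i = 1; i < nchunks; i++) {
        off_t pos = st.st_size / nchunks * i;   /* rough boundary by size */
        char c = 0;
        /* one byte at a time keeps the sketch short; a real version
           would pread a small buffer here */
        while (pos < st.st_size && pread(fd, &c, 1, pos) == 1 && c != '\n')
            pos++;
        starts[i] = (pos < st.st_size) ? pos + 1 : st.st_size;
    }
    close(fd);
    return 0;
}

Each worker then handles the byte range from starts[i] up to (but not including) starts[i+1], with the last chunk ending at the file size.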
I was bored and curious so check this out:
https://github.com/zlynx/linesection

How to write a compression algorithm?

I need some help coming up with a simple compression algorithm.
I have two lists of unsigned shorts - one for input, and one for output. The input list starts with a few thousand values, and the output list starts empty.
I'm trying to replace repetitive runs of the same value in the input with a 'decompression instruction' value in the output.
I want it to scan the next 2-15 values ahead of the input position, then scan 2-120 values behind the input position; the best match found is then added to the output as a single value rather than the entire run. This value is essentially a 'decompression instruction', equal to 2*(a+(b*512)+8192), where 'a' is the distance scanned back and 'b' is the distance scanned forward. All such values therefore fall into the 16384-32767 range. If no match is found, the value at the input position is copied literally.
This would yield an output where, in order to decompress it in the future, all values between 16384 and 32767 are read as decompression instructions, and all other values are copied literally.
It doesn't need to compress the data as efficiently as possible - it only needs to compress until the output is 6650 or less in length.
While I realize there are numerous compression routines already available that will do a much better job than this would, I need this exact routine for a specific purpose. I just really can't seem to make this work properly.
If there are any good algorithm writers out there, I'd love to hear from you.
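(For reference, here is one plausible reading of the decompression side of that format in C; this is my own sketch of the spec as described, not the asker's code, and it assumes literal input values never fall in the 16384-32767 range.)

#include <stddef.h>
#include <stdint.h>

/* Sketch of a decoder for the scheme described above: values in the
   16384-32767 range are instructions that copy 'b' values from 'a'
   positions back in the already-decoded output; everything else is a
   literal.  Returns the number of values written to 'out'. */
size_t decompress(const uint16_t *in, size_t in_len, uint16_t *out)
{
    size_t n = 0;
    for (size_t i = 0; i < in_len; i++) {
        uint16_t v = in[i];
        if (v >= 16384 && v <= 32767) {        /* decompression instruction */
            unsigned half = v / 2 - 8192;      /* undo 2*(a + b*512 + 8192) */
            unsigned a = half % 512;           /* distance scanned back (2-120) */
            unsigned b = half / 512;           /* run length ahead (2-15) */
            for (unsigned k = 0; k < b; k++, n++)
                out[n] = out[n - a];           /* copy from 'a' positions back */
        } else {
            out[n++] = v;                      /* literal value */
        }
    }
    return n;
}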
If you have many repeated values, then simply subtract from every value (except the first) the value that precedes it. You will end up with long runs of zeros. Then compress with a standard compression routine, such as zlib, or gzip on the command line. After decompression, it is then simple to undo the subtractions to recover the original data.
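A minimal sketch of that delta step in C (the zlib/gzip stage is left to the standard tools; the function names are mine):

#include <stddef.h>
#include <stdint.h>

/* Replace each value (except the first) with its difference from the
   previous value; long runs of equal values become runs of zeros, which
   zlib/gzip then compress very well.  Unsigned wraparound makes the
   transform exactly reversible. */
void delta_encode(uint16_t *v, size_t n)
{
    if (n < 2)
        return;
    for (size_t i = n - 1; i > 0; i--)     /* backwards, so v[i-1] is still the original */
        v[i] = (uint16_t)(v[i] - v[i - 1]);
}

/* Undo the subtractions after decompression. */
void delta_decode(uint16_t *v, size_t n)
{
    for (size_t i = 1; i < n; i++)
        v[i] = (uint16_t)(v[i] + v[i - 1]);
}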

Searching a file non-sequentially

Usually when I search a file with grep, the search is done sequentially. Is it possible to perform a non-sequential search or a parallel search? Or for example, a search between line l1 and line l2 without having to go through the first l1-1 lines?
You can use tail -n +N file | grep to begin a grep at a given line offset.
You can combine head with tail to search over just a fixed range, for example tail -n +1000 file | head -n 500 | grep pattern to search only lines 1000-1499.
However, this still has to scan the file for end-of-line characters up to the starting line.
In general, sequential reads are the fastest reads for disks. Trying to do a parallel search will most likely cause random disk seeks and perform worse.
For what it is worth, a typical book contains about 200 words per page. At a typical 5 letters per word, you're looking at about 1 KB per page, so 1000 pages would still be only about 1 MB. A standard desktop hard drive can easily read that in a fraction of a second.
You can't speed up disk read throughput this way. In fact, I can almost guarantee you are not saturating your disk read rate right now for a file that small. You can use iostat to confirm.
If your file is completely ASCII, you may be able to speed things up by setting your locale to the C locale to avoid doing any kind of Unicode translation.
If you need to do multiple searches over the same file, it would be worthwhile to build a reverse index to do the search. For code there are tools like exuberant ctags that can do that for you. Otherwise, you're probably looking at building a custom tool. There are tools for doing general text search over large corpuses, but that's probably overkill for you. You could even load the file into a database like Postgresql that supports full text search and have it build an index for you.
Padding the lines to a fixed record length is not necessarily going to solve your problem. As I mentioned before, I don't think you have an I/O throughput issue; you can see that for yourself by simply moving the file to a temporary RAM disk that you create, which removes all potential I/O. If that's still not fast enough, then you're going to have to pursue an entirely different solution.
If your lines are fixed length, you can use dd to read a particular section of the file:
dd if=myfile.txt bs=<line_length> count=<lines_to_read> skip=<start_line> | other_commands
Note that dd will read from disk using the block size specified for input (bs). That can be slow and is better batched by reading a group of lines at once, so that you pull at least 4 KB from disk at a time. In that case you want to look at the skip_bytes and count_bytes flags to be able to start and end at lines that are not a multiple of your block size.
Another interesting option is the output block size obs, which could benefit from being either the same as the input block size or a single line.
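If the lines really are padded to a fixed length, the same trick works from C without dd, because the byte offset of any line can be computed directly; a small sketch under that assumption (names are mine):

#include <fcntl.h>
#include <unistd.h>

/* Read line number 'lineno' (0-based) from a file whose lines are all
   exactly 'line_len' bytes long, including the trailing newline. */
ssize_t read_line_at(int fd, size_t line_len, size_t lineno, char *out)
{
    off_t offset = (off_t)lineno * (off_t)line_len;  /* no scanning required */
    return pread(fd, out, line_len, offset);
}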
The simple answer is: you can't. What you want contradicts itself: You don't want to scan the entire file, but you want to know where each line ends. You can't know where each line ends without actually scanning the file. QED ;)

search 4-5 bytes sequence in big file

I have a file of ~1.5 GB.
I need to look up 3 billion byte sequences in this file. Each sequence is 4 or 5 bytes long.
For each one I want to find the first position, or determine that the sequence does not occur in the file at all.
How can I do this fastest?
The RAM limit on the computer is 4 GB.
Use grep. It's highly optimized for finding things in large files.
If that's not an option, read about the Boyer-Moore algorithm it uses and implement it yourself. It'll take a lot of tweaking to reproduce the same speed grep has though.
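If grep is not an option, a Boyer-Moore-Horspool search (a simplified relative of the algorithm grep uses) is short enough to write yourself; here is a sketch over an in-memory buffer, which is workable here since the 1.5 GB file fits within 4 GB of RAM:

#include <stddef.h>
#include <string.h>

/* Boyer-Moore-Horspool: return the offset of the first occurrence of
   needle[0..nlen-1] in haystack[0..hlen-1], or -1 if it is absent.
   With a 4-5 byte needle the skip table lets most positions be skipped. */
long bmh_search(const unsigned char *haystack, size_t hlen,
                const unsigned char *needle, size_t nlen)
{
    size_t skip[256];

    if (nlen == 0 || hlen < nlen)
        return -1;

    for (size_t i = 0; i < 256; i++)
        skip[i] = nlen;                           /* default shift: whole needle */
    for (size_t i = 0; i + 1 < nlen; i++)
        skip[needle[i]] = nlen - 1 - i;           /* shift by distance from the end */

    for (size_t pos = 0; pos + nlen <= hlen; pos += skip[haystack[pos + nlen - 1]])
        if (memcmp(haystack + pos, needle, nlen) == 0)
            return (long)pos;
    return -1;
}

You would read (or mmap) the file into one buffer and call this once per sequence; for anything like 3 billion lookups, though, the preprocessing/index approach below will be far faster than rescanning.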
Use Preprocessing.
I think you should just create an index: make one pass through the file, recording the first occurrence of every unique 4-byte sequence. Store the 4-byte sequence and its first position in a separate file, sorted by the byte sequence.
A simple binary search on the index file will then find your sequence efficiently.
You could be more clever and use hashing to reduce the search to O(1).
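A sketch of the lookup half of that idea (my record layout, not the answerer's): assume the index file holds fixed 12-byte records, a 4-byte sequence key followed by an 8-byte first-occurrence offset, sorted by key, so a plain binary search with pread finds any key in O(log n) reads.

#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>

#define REC_SIZE 12    /* 4-byte key + 8-byte offset, packed, sorted by key */

/* Return the first-occurrence offset stored for 'key', or -1 if the
   sequence never appears in the data file. */
int64_t index_lookup(int fd, uint32_t key, int64_t nrecords)
{
    int64_t lo = 0, hi = nrecords - 1;
    unsigned char rec[REC_SIZE];

    while (lo <= hi) {
        int64_t mid = lo + (hi - lo) / 2;
        if (pread(fd, rec, REC_SIZE, (off_t)mid * REC_SIZE) != REC_SIZE)
            return -1;

        uint32_t k;
        memcpy(&k, rec, sizeof k);             /* key, stored in native byte order */
        if (k == key) {
            int64_t off;
            memcpy(&off, rec + 4, sizeof off); /* first occurrence in the data file */
            return off;
        }
        if (k < key)
            lo = mid + 1;
        else
            hi = mid - 1;
    }
    return -1;                                 /* sequence not indexed */
}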
Check out the Searchlight search engine.
This program allows multiple sequences of up to 10 ASCII bytes to be stored within a single file. You then point it at a file, directory, file of filenames, file of directory names, arraylist of filenames or an arraylist of directory names and away it goes!!
Furthermore, it reports the file byte position/offset of each sequence found.

Best way to store 1 trillion lines of information

I'm doing calculations and the resultant text file right now has 288012413 lines, with 4 columns. Sample row:
288012413; 4855 18668 5.5677643628300215
The file is nearly 12 GB.
That's just unreasonable. It's plain text. Is there a more efficient way? I only need about 3 decimal places - would a limiter save much room?
Go ahead and use a MySQL database.
MSSQL Express has a limit of 4 GB.
MS Access has a limit of 4 GB.
So those options are out. I think a simple database like MySQL or SQLite without indexing will be your best bet. It will probably be faster to access the data through a database anyway, and on top of that the file size may well be smaller.
Well,
The first column looks suspiciously like a line number - if this is the case then you can probably just get rid of it saving around 11 characters per line.
If you only need about 3 decimal places then you can round / truncate the last column, potentially saving another 12 characters per line.
I.e. you can get rid of 23 characters per line. That line is 40 characters long, so you would approximately halve your file size.
If you do round the last column then you should be aware of the effect that rounding errors may have on your calculations - if the end result needs to be accurate to 3 dp then you might want to keep a couple of extra digits of precision depending on the type of calculation.
You might also want to look into compressing the file if it is just used to storing the results.
Reducing the 4th field to 3 decimal places should reduce the file to around 8GB.
If it's just array data, I would look into something like HDF5:
http://www.hdfgroup.org/HDF5/
The format is supported by most languages, has built-in compression and is well supported and widely used.
If you are going to use the result as a lookup table, why use ASCII for numeric data? Why not define a struct like so:
struct x {
    long   lineno;
    short  thing1;
    short  thing2;
    double value;
};
and write the struct to a binary file? Since all the records are of a known size, advancing through them later is easy.
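A small sketch of that record-size arithmetic (the helper names are mine): because every record is sizeof(struct x) bytes, record i lives at offset i * sizeof(struct x) and can be read without touching anything before it.

#include <stdio.h>

struct x {              /* the struct from the answer above */
    long   lineno;
    short  thing1;
    short  thing2;
    double value;
};

/* Append one record to the binary file. */
int write_record(FILE *f, const struct x *rec)
{
    return fwrite(rec, sizeof *rec, 1, f) == 1 ? 0 : -1;
}

/* Read record number i (0-based) by seeking straight to it.
   (Use fseeko/off_t if the file grows past what a 32-bit long can address.) */
int read_record(FILE *f, long i, struct x *rec)
{
    if (fseek(f, i * (long)sizeof *rec, SEEK_SET) != 0)
        return -1;
    return fread(rec, sizeof *rec, 1, f) == 1 ? 0 : -1;
}

One caveat: the on-disk layout then depends on the compiler's struct padding, so the file should be read back with the same struct definition (or the fields should be written individually).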
well, if the files are that big, and you are doing calculations that require any sort of precision with the numbers, you are not going to want a limiter. That might possibly do more harm than good, and with a 12-15 GB file, problems like that will be really hard to debug. I would use some compression utility, such as GZIP, ZIP, BlakHole, 7ZIP or something like that to compress it.
Also, what encoding are you using? If you are just storing numbers, all you need is ASCII. If you are using Unicode encodings, that will double to quadruple the size of the file vs. ASCII.
Like AShelly, but smaller.
Assuming line #'s are continuous...
struct x {
    short thing1;
    short thing2;
    short value;   /* only 3 dp needed, so store as fixed point n*1000; that leaves 2 digits to the left of the decimal point */
};
Save in a binary file.
lseek(), read() and write() are your friends.
The file will be large(ish), at around 1.7 GB.
The most obvious answer is just "split the data". Put the data into separate files, e.g. 1 million lines per file. NTFS is quite good at handling hundreds of thousands of files per folder.
Then you've got a number of answers regarding reducing data size.
Next, why keep the data as text if you have a fixed-sized structure? Store the numbers as binaries - this will reduce the space even more (text format is very redundant).
Finally, a DBMS can be your best friend. A NoSQL DBMS should work well, though I am not an expert in this area and I don't know which one will handle a trillion records.
If I were you, I would go with the fixed-size binary format, where each record occupies a fixed (16-20?) bytes of space. Then even if I keep the data in one file, I can easily determine the position at which I need to start reading. If you need to do lookups (say by column 1) and the data is not re-generated all the time, then it could be possible to do a one-time sort by lookup key after generation -- this would be slow, but as a one-time procedure it would be acceptable.
