How would I split a text file into thirds using the Terminal if I do not know exactly how many lines of text there are? I know it is around 3,000 or so.
Do you mean splitting the total number of lines into three?
If yes, then write a small script that does the following:
- Count the number of lines; wc -l should do that.
- Divide that number by 3, write exactly that many lines to each of the first two output files, and write all remaining lines to the third file, since the total is not always an exact multiple of 3 (see the sketch below). Hope that helps.
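If a short script is acceptable, here is a minimal Python sketch of that logic, runnable from the Terminal; the file names are placeholders, and the whole file is assumed to fit in memory, which is fine for ~3,000 lines:

def split_in_thirds(path, out_prefix="part"):
    """Split a text file into three pieces by line count.

    The first two pieces get exactly total // 3 lines each; the third
    gets the remainder, since the total is rarely a multiple of 3.
    """
    with open(path) as f:
        lines = f.readlines()              # the "wc -l" step, done in Python

    third = len(lines) // 3
    chunks = [lines[:third], lines[third:2 * third], lines[2 * third:]]

    for i, chunk in enumerate(chunks, start=1):
        with open(f"{out_prefix}{i}.txt", "w") as out:
            out.writelines(chunk)

split_in_thirds("input.txt")               # "input.txt" is a placeholder name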
Suppose I had a program that generates 1 billion numbers and outputs them to a text file. What is the best way to check whether that file contains enough numbers?
For example, my output text file should look like the one below:
1
2
....
1001
1002
1003
...
1000000000
The auditor program should check that the file contains 1 billion rows and that those numbers are in ascending order.
The normal approach would be just to read through the file and check this. The following pseudo-code(1) would be one approach:
set expected to zero
for each line in file:
    add one to expected
    if line not equal to expected:
        exit with failure indication
if expected is not one billion:
    exit with failure indication
exit with success indication
This allows you to indicate failure immediately when a bad line is found; the only case where you have to process the entire file is when it's valid.
That algorithm is, of course, if your numbers need to be consecutive as well as ascending, which appears to be the case.
There are potential optimisations that could be used to pre-check things (such as the file being of a minimum size that will allow it to contain all the numbers). But, until you've got the functionality done correctly, I'd be holding off doing those.
(1) Keep in mind that there are real-world issues you'll need to contend with when implementing in a real language. This includes things such as opening the file, reading each line into a variable, and converting between strings and integers.
I haven't covered that in detail since this question is tagged algorithm.
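To make footnote (1) concrete, here is a minimal Python sketch of the pseudo-code above; "numbers.txt" is a placeholder name:

import sys

def audit(path, target=1_000_000_000):
    expected = 0
    with open(path) as f:
        for line in f:                     # read each line into a variable
            expected += 1
            try:
                value = int(line)          # string-to-integer conversion
            except ValueError:
                return False               # malformed line counts as failure
            if value != expected:
                return False               # bad line: fail immediately
    return expected == target              # fail if the file is short

if __name__ == "__main__":
    sys.exit(0 if audit("numbers.txt") else 1)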
Let's assume that we have a file of 100k lines, or ~2 GB, and we want to split it into 10 chunks of 10k lines each, so that the chunks can then be processed in parallel. Is there any way to create pointers at the starting line of each of the 10 chunks, without needing to traverse the whole file? I was thinking of somehow dividing the file with regard to its size, so that the pointers are created every 200 MB. Is this even feasible?
Yes, of course. But you need to make some assumptions and accept that your chunks will not be exact.
Either assume a standard line length or scan a few lines and measure it. Then you multiply that by the number of lines you are aiming for and just hope it's a good estimate.
Or, if you just want 10 chunks, take the file size and divide it by 10.
So then you jump to that point in the file, either by using lseek and read, pread, or mmap. Then you scan forward until you find the end of a line and the start of the next.
It won't be exact line counts unless you actually count every line. But it will be pretty close.
I was bored and curious so check this out:
https://github.com/zlynx/linesection
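For reference, a minimal Python sketch of that seek-and-scan approach (it uses buffered seek/readline rather than lseek, pread, or mmap):

import os

def chunk_offsets(path, chunks=10):
    """Return byte offsets that start each chunk on a line boundary.

    Seeks to multiples of size/chunks, then scans forward to the next
    newline, so chunks have roughly (not exactly) equal sizes.
    """
    size = os.path.getsize(path)
    offsets = [0]
    with open(path, "rb") as f:
        for i in range(1, chunks):
            f.seek(i * size // chunks)     # jump near the rough boundary
            f.readline()                   # discard the partial line
            offsets.append(f.tell())       # start of the next full line
    return offsets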
Suppose you have a big log file, billions of lines long. The file has some columns, like IP addresses: xxx.xxx.xxx.xxx.
How can I find exactly one line quickly, for example the line containing 123.123.123.123?
A naive line-by-line search seems too slow.
If you don't have any other information to go on (such as a date range, assuming the file is sorted), then a line-by-line search is your best option. That doesn't mean you need to read the file in as lines, though. It might also be more efficient to search backwards, since you know the entry is recent.
The general approach (for searching backwards) is this:
Declare a buffer. You will read chunks of the file at a time into this buffer as fast as possible (preferably by using low-level operating system calls that can read directly without any buffering/caching).
So you seek to the end of your file minus the size of your buffer and read that many bytes.
Now you search forwards through your buffer for the first newline character. Remember that offset for later, as it represents a partial line. Starting at the next line, you search forward to the end of the buffer looking for your string. If it has to be in a certain column but other columns could contain that value, then you need to do some parsing.
Now you continue to search backwards through your file. You seek to the last position you read from minus the chunk size plus the offset that you found when you searched for a newline character. Now, you read again. If you like you can move that partial line to the end of the buffer and read fewer bytes but it's not going to make a huge difference if your chunks are large enough.
And you continue until you reach the beginning of the file. There is of course a special case when the number of bytes to read is less than the chunk size (namely, you don't ignore the first line). I assume that you won't reach the beginning of the file because it seems clear that you don't want to search the entire thing.
So that's the approach when you have no idea where the value is. If you do have some idea on ordering, then of course you probably want to do a binary search. In that case you can use smaller chunk sizes (enough to at least catch a full line).
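A minimal Python sketch of that backwards, chunked scan; it uses ordinary buffered reads rather than the low-level unbuffered calls mentioned above, returns a byte offset rather than the parsed line, and the chunk size and file name in the usage comment are placeholders:

import os

def search_backwards(path, target: bytes, chunk_size=1 << 20):
    """Scan a file from the end in fixed-size chunks, looking for target.

    Returns the byte offset of the last occurrence, or -1. The partial
    line at the start of each chunk is carried over so a hit spanning a
    chunk boundary is not missed.
    """
    with open(path, "rb") as f:
        pos = os.path.getsize(path)
        tail = b""                          # partial line carried from the later chunk
        while pos > 0:
            start = max(0, pos - chunk_size)
            f.seek(start)
            buf = f.read(pos - start) + tail
            hit = buf.rfind(target)
            if hit != -1:
                return start + hit          # offsets in buf are relative to start
            cut = buf.find(b"\n")           # keep the leading partial line; it may
            tail = buf[:cut + 1] if cut != -1 else buf   # complete a line in the earlier chunk
            pos = start
    return -1

# Example: search_backwards("app.log", b"123.123.123.123")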
You really need to search for some regularity in the file and exploit that. Barring that, if you have more processors you could split the file into sections and search in parallel, assuming I/O would not then become a bottleneck.
How do you get the line count of a large file, at least 5 GB, in the shell? What is the fastest approach?
Step 1: head -n "$n" filename > newfile   # copy the first n lines into newfile, e.g. n = 5
Step 2: Get the huge file's size, A.
Step 3: Get newfile's size, B.
Step 4: (A/B)*n is approximately equal to the exact line count (see the sketch below).
Set n to a few different values, repeat, and average the estimates.
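Here is the same estimate as a small Python sketch (the default sample size n is arbitrary):

import os
from itertools import islice

def estimate_line_count(path, n=1000):
    """Estimate total lines as (A / B) * n: file size divided by the
    size of the first n lines, times n."""
    with open(path, "rb") as f:
        sample = list(islice(f, n))        # the "head -n" step
    sample_bytes = sum(len(line) for line in sample)   # B, the size of newfile
    if not sample_bytes:
        return 0
    return os.path.getsize(path) * len(sample) // sample_bytes   # (A / B) * n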
The fastest approach is likely to be wc -l.
The wc command is optimized to do exactly this kind of thing. It's very unlikely that anything else you can do (other than doing it on more powerful hardware) is going to be any faster.
Yes, counting lines in a 5 gigabyte text file is slow. It's a big file.
The only alternative would be to store the data in some different format in the first place, perhaps a database, perhaps a file with fixed-length records. Converting your 5 gigabyte text file to some other format is going to take at least as long as running wc -l on it, but it might be worth it if you're going to be counting lines a lot. It's impossible to say what the tradeoffs are without more information.
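For example, if the data were stored as fixed-length records (the record size below is made up), counting becomes a division rather than a scan:

import os

RECORD_SIZE = 64   # hypothetical fixed record length, in bytes

def record_count(path):
    # With fixed-length records the count is a constant-time division;
    # no scan of the file is needed.
    return os.path.getsize(path) // RECORD_SIZE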
I have a file of ~1.5 GB.
I need to search this file for 3 billion byte sequences. Each sequence is 4 or 5 bytes long.
For each sequence, find the position of its first occurrence, or confirm that the sequence does not appear in the file at all.
What is the fastest way to do this?
The RAM limit on the computer is 4 GB.
Use grep. It's highly optimized for finding things in large files.
If that's not an option, read about the Boyer-Moore algorithm it uses and implement it yourself. It'll take a lot of tweaking to reproduce the same speed grep has though.
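If you do end up implementing it yourself, here is a minimal Python sketch of the Horspool simplification of Boyer-Moore (grep's implementation is considerably more tuned than this):

def horspool_search(haystack: bytes, needle: bytes) -> int:
    """Return the index of the first occurrence of needle, or -1."""
    m, n = len(needle), len(haystack)
    if m == 0:
        return 0
    # Bad-character table: how far the window can slide based on the byte
    # aligned with the end of the pattern; bytes absent from the pattern
    # (excluding its last byte) allow a full slide of m.
    shift = {b: m - 1 - i for i, b in enumerate(needle[:-1])}
    pos = 0
    while pos + m <= n:
        if haystack[pos:pos + m] == needle:
            return pos
        pos += shift.get(haystack[pos + m - 1], m)
    return -1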
Use Preprocessing.
I think you should just create an index: make one pass through the file, recording the first occurrence of every unique 4-byte sequence. Store each 4-byte sequence and its first position in a separate file, sorted by the byte sequence.
A simple binary search on the index file will then find your sequence efficiently.
You could be more clever and use hashing to reduce the search to O(1).
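A minimal Python sketch of the hashed variant of this idea, for the 4-byte case; it assumes the whole file and the index fit in memory, whereas for the 1.5 GB file in the question you would write the index out sorted and binary-search it, as described above:

def build_first_occurrence_index(path, k=4):
    """One pass over the file, recording the first position of every
    unique k-byte sequence. Assumes the file and index fit in RAM."""
    index = {}
    with open(path, "rb") as f:
        data = f.read()
    for pos in range(len(data) - k + 1):
        window = data[pos:pos + k]
        if window not in index:
            index[window] = pos
    return index

# Each 4-byte query is then an O(1) lookup:
# index.get(b"\x01\x02\x03\x04", -1) gives the first position, or -1 if absent.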
Check out the Searchlight search engine.
This program allows multiple sequences of up to 10 ASCII bytes to be stored within a single file. You then point it at a file, directory, file of filenames, file of directory names, arraylist of filenames or an arraylist of directory names and away it goes!!
Furthermore, it reports the file byte position/offset of each sequence found.