Which data structure is suitable? - data-structures

Two files, each terabytes in size. A file comparison tool compares the i-th line of file1 with
the i-th line of file2; if they are the same, it prints the line. Which data structure is suitable?
B-tree
linked list
hash tables
none of them

You need to be able to buffer up at LEAST a line at a time. Here's one way:
While neither file is at EOF:
Read lines A and B from files one and two (each)
If lines are identical, print one of them
Translate this into a suitable programming language, and the problem is solved.
Note that no fancy data structures are involved.
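For concreteness, a minimal sketch of that loop in Python (the file names are illustrative):

# Compare the i-th line of each file; print it when the lines match.
with open("file1.txt") as f1, open("file2.txt") as f2:
    for line1, line2 in zip(f1, f2):  # zip stops when either file reaches EOF
        if line1 == line2:
            print(line1, end="")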

The simple logic is to read one line at a time from each file and compare. It's like:
While line1 is not at EOF of file1 and line2 is not at EOF of file2:
Compare line1 and line2
By the way, you have to know the maximum number of characters a line can contain so you can size the buffer accordingly.
Otherwise, try a big-data framework such as Spark to make your work easier.

Related

bash: Split binary file at predefined positions

I have binary files which contain data structures of various lengths. I would like to save these blocks of data into separate files. The size of each block is known. The split command can, well, split a file, but it does not stop after the first block of data. It slices the file into pieces of equal size.
Therefore, my current solution is to split and cat the remainder of the file back together, iterating my way through the data. This is very clumsy and may even fail in certain circumstances.
What is the best way to slice a binary file precisely at certain positions?
You can use two independent dd commands: one to seek arbitrarily, and another to copy arbitrary lengths.
SEEK=501
BYTES=387
dd if=yourfile bs=$SEEK skip=1 | dd bs=$BYTES count=1 > lump.bin
Note: Although counter-intuitive to what you are actually trying to do, keep the block size high and the count low for best performance. What I mean is, if you want 8192 bytes, use bs=8192 count=1 rather than bs=1 count=8192.
I would need offset and block size as independent parameters.
Using the iflag operand may help you here. It has been available in dd (coreutils) since 2012 and lets skip and count use different units than bs: with iflag=skip_bytes,count_bytes, skip and count are interpreted in bytes rather than in blocks of bs.
An example usage will be:
dd if=infile.bin of=outfile.bin bs=4K skip=1234 count=20 iflag=skip_bytes,count_bytes
which would copy 20 bytes from the input file, starting from byte 1235 (i.e., skipping the first 1234 bytes).
You can refer here for more detailed usage: https://askubuntu.com/a/1178771

Avoiding parsing when loading a file

Suppose that I have the following file (input.txt):
1 2 sometext1
2 3 sometext2
3 4 sometext3
4 5 sometext4
i.e. a tab-delimited file where each line is made of two strings, each representing an integer, and a third string representing arbitrary text.
This file is the input for a PigLatin script:
input = load 'input.txt' as (a:int, b:int, c:chararray);
My assumption is that Pig is going to waste time parsing the text file to produce the corresponding integers. Am I correct?
I would like to store in a binary file the binary representation of the three integers.
How can I make Pig understand such a binary file? Should I simply extend LoadFunc, or do I need to use BinStorage?
How much time are you afraid of wasting here? Assuming the rest of your script does anything meaningful, or that your files are large enough that I/O would be significant, the parsing effort will be negligible compared with everything else.

Is there a way to split a large file into chunks with random sizes?

I know you can split a file with split, but for test purposes I would like to split a large file into chunks whose sizes differ. Is this possible?
Alternatively, if the above-mentioned file is a zip, is there a way to split it into volumes of unequal sizes?
Any suggestions welcome! Thanks!
So the general question that you're asking is: how can I compute N random integers that sum to S? Specifically, S is the size of your file and N is how many smaller files that you want to break it into.
For example, assume that you want to split your file into 4 parts. If a, b, c, and d are four random numbers and X is their sum, then:
a + b + c + d = X
a/X + b/X + c/X + d/X = 1
S*a/X + S*b/X + S*c/X + S*d/X = S
This gives us four numbers (S*a/X, S*b/X, S*c/X, S*d/X) that sum to S, the size of your file.
Which means you'd want to write a script that:
Computes N random numbers (any random numbers).
Computes X as the sum of those random numbers.
Multiplies each of those random numbers by S/X (and makes sure you're left with integers greater than 0 that sum to S)
Splits the original file into pieces using the generated random numbers as sizes, using whatever tool you want.
This is a little much for a shell script, but would be pretty straightforward in something like Perl.
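As the answer notes, a scripting language makes this easy; here is an illustrative sketch in Python rather than Perl (the function and file names are my own, not part of the original answer):

import os
import random

def random_chunk_sizes(total, parts):
    # Steps 1-3: draw random numbers, scale them by total/sum, and fix rounding
    # so the result is a list of positive integers that sum to `total`.
    # Assumes `total` is much larger than `parts`.
    raw = [random.random() for _ in range(parts)]
    scale = total / sum(raw)
    sizes = [max(1, int(r * scale)) for r in raw]
    sizes[-1] += total - sum(sizes)   # absorb the rounding error in the last chunk
    return sizes

def split_randomly(path, parts):
    # Step 4: split the original file into pieces of the generated sizes.
    sizes = random_chunk_sizes(os.path.getsize(path), parts)
    with open(path, "rb") as f:
        for n, size in enumerate(sizes, start=1):
            with open(f"{path}.part{n}", "wb") as out:
                out.write(f.read(size))

# split_randomly("foo.zip", 4)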
Since you tagged the question only with shell, I suppose you want to handle it with a shell script and common Linux commands/tools.
As far as I know, there is no existing tool/command that can split a file randomly. To split a file, we can consider split or dd.
Both tools support options such as how big each split file should be, or how many files to split into. Let's say we first use dd/split to split your file into 500 parts, each of the same size. So we have:
foo.zip.001
foo.zip.002
foo.zip.003
...
foo.zip.500
Then we take this file list as input and merge (cat) groups of files back together. This step could be done with awk or a shell script.
For example, we can build a set of cat statements like:
cat foo.zip.001 foo.zip.002 > part1
cat foo.zip.003 foo.zip.004 foo.zip.005 > part2
cat foo.zip.006 foo.zip.007 foo.zip.008 foo.zip.009 > part3
....
Run the generated cat statements and you get the final part1..partN, each with a different size.
For example:
kent$ seq -f'foo.zip.%g' 20|awk 'BEGIN{i=k=2}NR<i{s=s sprintf ("%s ",$0);next}{k++;i=(NR+k);print "cat "s$0" >part"k-2;s="" }'
cat foo.zip.1 foo.zip.2 >part1
cat foo.zip.3 foo.zip.4 foo.zip.5 >part2
cat foo.zip.6 foo.zip.7 foo.zip.8 foo.zip.9 >part3
cat foo.zip.10 foo.zip.11 foo.zip.12 foo.zip.13 foo.zip.14 >part4
cat foo.zip.15 foo.zip.16 foo.zip.17 foo.zip.18 foo.zip.19 foo.zip.20 >part5
How it performs you will have to test on your own... but at least this should work for your requirement.

How to compare all the lines in a sorted file (file size > 1GB) in a very efficient manner

Let's say the input file is:
Hi my name NONE
Hi my name is ABC
Hi my name is ABC
Hi my name is DEF
Hi my name is DEF
Hi my name is XYZ
I have to create the following output:
Hi my name NONE 1
Hi my name is ABC 2
Hi my name is DEF 2
Hi my name is XYZ 1
The number of words in a single line can vary from 2 to 10. File size will be more than 1GB.
How can I get the required output in the minimum possible time? My current implementation uses a C++ program that reads a line from the file and then compares it with the next line. The running time of this implementation will always be O(n), where n is the number of characters in the file.
To improve the running time, the next option is to use mmap. But before implementing it, I just wanted to confirm: is there a faster way to do it, using any other language or scripting?
uniq -c filename | perl -lane 'print "@F[1..$#F] $F[0]"'
The perl step is only to take the output of uniq (which looks like "2 Hi my name is ABC") and re-order it into "Hi my name is ABC 2". You can use a different language for it, or else leave it off entirely.
As for your question about runtime, big-O seems misplaced here; surely there isn't any chance of scanning the whole file in less than O(n). mmap and strchr seem like possibilities for constant-factor speedups, but a stdio-based approach is probably good enough unless your stdio sucks.
The code for BSD uniq could be illustrative here. It does a very simple job with fgets, strcmp, and a very few variables.
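For reference, the same consecutive-comparison counting can be sketched in a few lines of Python (the file name is illustrative):

def count_consecutive(lines):
    # Count runs of identical adjacent lines, like uniq -c, but print
    # the line first and the count last, matching the required output.
    prev, count = None, 0
    for line in lines:
        line = line.rstrip("\n")
        if line == prev:
            count += 1
        else:
            if prev is not None:
                print(prev, count)
            prev, count = line, 1
    if prev is not None:
        print(prev, count)

# count_consecutive(open("sorted_input.txt"))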
In most cases this operation will be completely I/O bound (especially with well-designed C++).
Given that, it's likely the only bottleneck you need to care about is the disk.
I think you will find this to be relevant:
mmap() vs. reading blocks
Ben Collins has a very good answer comparing mmap to standard read/write.
Well, there are two time scales you are comparing which aren't really related to each other. The first is algorithmic complexity, which you are expressing in O notation. This has, however, nothing to do with the complexity of reading from a file.
Say, in the ideal case, you have all your data in memory and you have to find the duplicates with an algorithm. Depending on how your data is organized (e.g. a simple list, a hash map, etc.) you could go with O(n^2), O(n), or even O(1) if you have a perfect hash (just for detecting the item).
Reading from a file or mapping it to memory has no relation to big-O notation, so you don't consider it in the complexity calculation at all. You just pick the one that takes less measured time, nothing more.
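For instance, the hash-based case mentioned above is what a Python set gives you; a tiny illustrative sketch (the file name is arbitrary):

# Hash-based membership tests are O(1) on average, so the whole scan is O(n).
seen = set()
for line in open("input.txt"):
    if line in seen:
        print("duplicate:", line, end="")
    seen.add(line)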

How to shuffle the lines in a file without reading the whole file in advance?

What's a good algorithm to shuffle the lines in a file without reading the whole file in advance?
I guess it would look something like this: start reading the file line by line from the beginning, storing each line, and at each point decide whether to print one of the lines stored so far (and then remove it from storage) or to do nothing and proceed to the next line.
Can someone verify / prove this and/or maybe post working (perl, python, etc.) code?
Related questions, but not looking at memory-efficient algorithms:
How can I shuffle the lines of a text file on the Unix command line or in a shell script?
How can I randomize the lines in a file using standard tools on Red Hat Linux?
How can I print the lines in STDIN in random order in Perl?
I cannot think of a way to randomly do the entire file without somehow maintaining a list of what has already been written. I think if I had to do a memory-efficient shuffle, I would scan the file, building a list of offsets for the newlines. Once I have this list of newline offsets, I would randomly pick one of them, write the corresponding line to stdout, and then remove it from the list of offsets.
I am not familiar with Perl or Python, but I can demonstrate with PHP.
<?php
$offsets = array();
$f = fopen("file.txt", "r");
$offsets[] = ftell($f);                    // offset of the first line
while (! feof($f))
{
    // the character after each newline is the start of the next line
    if (fgetc($f) == "\n") $offsets[] = ftell($f);
}
// if the file ends with a newline, the last offset points past the final
// line; drop it so we don't emit an empty entry
if (end($offsets) == ftell($f)) array_pop($offsets);
shuffle($offsets);
foreach ($offsets as $offset)
{
    fseek($f, $offset);
    echo fgets($f);                        // print the line starting at this offset
}
fclose($f);
?>
The only other option I can think of, if scanning the file for newlines is absolutely unacceptable, would be (I am not going to code this one out):
1. Determine the file size
2. Create a list of offsets and lengths already written to stdout
3. Loop until bytes_written == filesize
4. Seek to a random offset that is not already in your list of already-written values
5. Back up from that seek to the previous newline or start of file
6. Display that line, and add it to the list of offsets and lengths written
7. Go to 3.
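The answer leaves this uncoded; purely for illustration, here is one way the idea could be sketched in Python (this is not from the original answer, and the retry loop gets expensive as few unprinted lines remain):

import os
import random

def shuffle_lines_by_seeking(path):
    # Hypothetical sketch of steps 1-7 above.
    filesize = os.path.getsize(path)            # step 1
    written_starts = set()                      # step 2: line-start offsets already printed
    bytes_written = 0
    with open(path, "rb") as f:
        while bytes_written < filesize:         # step 3
            pos = random.randrange(filesize)    # step 4
            start = pos
            while start > 0:                    # step 5: back up to the previous newline
                f.seek(start - 1)
                if f.read(1) == b"\n":
                    break
                start -= 1
            if start in written_starts:
                continue                        # already printed this line; retry
            f.seek(start)
            line = f.readline()
            print(line.decode(), end="")        # step 6
            written_starts.add(start)
            bytes_written += len(line)          # loops back to step 3

# shuffle_lines_by_seeking("file.txt")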
The following algorithm is linear in the number of lines in your input file.
Preprocessing:
Find n (the total number of lines) by scanning for newlines (or whatever), and store the character positions marking the beginning and end of each line. So you'd have 2 vectors, say s and e, of size n, where the characters numbered s[i] through e[i] in your input file form the i-th line. In C++ I'd use vector.
Randomly permute a vector of integers from 1 to n (in C++ it would be random_shuffle) and store it in a vector, say, p (e.g. 1 2 3 4 becomes p = [3 1 4 2]). This implies that line i of the new file is now line p[i] in the original file (i.e. in the above example, the 1st line of the new file is the 3rd line of the original file).
Main
Create a new file
Write the first line in the new file by reading the text in the original file between s[p[0]] and e[p[0]] and appending it to the new file.
Continue as in step 2 for all the other lines.
So the overall complexity is linear in the number of lines (since random_shuffle is linear) if you assume read/write & seeking in a file (incrementing the file pointer) are all constant time operations.
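A minimal sketch of this two-pass scheme in Python (the answer describes it in C++ terms; the names here are illustrative, and per-line lengths stand in for the s/e pairs):

import random

def shuffle_lines_two_pass(in_path, out_path):
    # Preprocessing: record the start offset and length of every line
    # (equivalent to the s and e vectors above).
    starts, lengths = [], []
    with open(in_path, "rb") as f:
        offset = 0
        for line in f:
            starts.append(offset)
            lengths.append(len(line))
            offset += len(line)

    # Randomly permute the line indices (the vector p above).
    order = list(range(len(starts)))
    random.shuffle(order)

    # Main pass: seek to each permuted line and append it to the new file.
    with open(in_path, "rb") as src, open(out_path, "wb") as dst:
        for i in order:
            src.seek(starts[i])
            dst.write(src.read(lengths[i]))

# shuffle_lines_two_pass("input.txt", "shuffled.txt")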
You can create an array for N strings and read the first N lines of the file into this array. For the rest, you read one line, select one of the lines from the array at random, and replace that string with the newly read string. You also write the string taken from the array to the output file. This has the advantage that you don't need to iterate over the file twice. The disadvantage is that it will not create a very random output file, especially when N is low (for example, this algorithm can't move the last line more than N lines up in the output).
Edit
Just an example in Python:
import sys
import random

CACHE_SIZE = 23

lines = {}
for l in sys.stdin:  # replace sys.stdin with (f"{x}\n" for x in range(200)) to get a test output
    i = random.randint(0, CACHE_SIZE - 1)
    old = lines.get(i)
    if old is not None:
        print(old, end="")  # emit the line this slot previously held
    lines[i] = l            # keep the new line in the cache
for p in lines.values():    # flush whatever is still cached at EOF
    print(p, end="")
