How to shuffle the lines in a file without reading the whole file in advance? - algorithm

What's a good algorithm to shuffle the lines in a file without reading the whole file in advance?
I guess it would look something like this: start reading the file line by line from the beginning, storing each line as you go, and at each point decide whether to print one of the lines stored so far (and then remove it from storage) or do nothing and proceed to the next line.
Can someone verify / prove this and/or maybe post working (perl, python, etc.) code?
Related questions, but not looking at memory-efficient algorithms:
How can I shuffle the lines of a text file on the Unix command line or in a shell script?
How can I randomize the lines in a file using standard tools on Red Hat Linux?
How can I print the lines in STDIN in random order in Perl?

I cannot think of a way to randomize the entire file without somehow maintaining a list of what has already been written. I think if I had to do a memory-efficient shuffle, I would scan the file, building a list of offsets for the newlines. Once I have this list of newline offsets, I would randomly pick one of them, write the corresponding line to stdout, and then remove it from the list of offsets.
I am not familiar with perl, or python, but can demonstrate with php.
<?php
$offsets = array();
$f = fopen("file.txt", "r");
$offsets[] = ftell($f);   // offset of the first line
while (! feof($f))
{
    // record the offset just after every newline, i.e. the start of the next line
    if (fgetc($f) == "\n") $offsets[] = ftell($f);
}
shuffle($offsets);
foreach ($offsets as $offset)
{
    fseek($f, $offset);
    echo fgets($f);       // an offset at EOF (after a trailing newline) prints nothing
}
fclose($f);
?>
The only other option I can think of, if scanning the file for newlines is absolutely unacceptable, would be the following (I am not going to code this one out in full; a rough sketch follows the steps):
Determine the filesize
Create a list of offsets and lengths already written to stdout
Loop until bytes_written == filesize
Seek to a random offset that is not already in your list of already written values
Back up from that seek to the previous newline or start of file
Display that line, and add it to the list of offsets and lengths written
Go to 3.
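For illustration only, here is a rough Python sketch of those steps (not the answer author's code). It assumes the input file ends with a newline and finds each line start by reading backwards in small blocks; the function name and block size are arbitrary choices.
import os
import random
import sys

def shuffle_by_random_seek(path):
    size = os.path.getsize(path)
    # Unwritten regions of the file; they stay aligned to whole lines.
    remaining = [(0, size)] if size else []
    with open(path, 'rb') as f:
        while remaining:
            # Pick a region weighted by its length, then a random byte within it.
            idx = random.choices(range(len(remaining)),
                                 weights=[e - s for s, e in remaining])[0]
            start, end = remaining[idx]
            pos = random.randrange(start, end)
            # Back up from pos to the previous newline (or the region start),
            # reading backwards in small blocks.
            line_start = start
            scan = pos
            while scan > start:
                step = min(1024, scan - start)
                f.seek(scan - step)
                block = f.read(step)
                nl = block.rfind(b'\n')
                if nl != -1:
                    line_start = scan - step + nl + 1
                    break
                scan -= step
            # Emit the line and mark its byte range as written.
            f.seek(line_start)
            line = f.readline()
            sys.stdout.buffer.write(line)
            line_end = line_start + len(line)
            pieces = []
            if start < line_start:
                pieces.append((start, line_start))
            if line_end < end:
                pieces.append((line_end, end))
            remaining[idx:idx + 1] = pieces
Note that picking a random byte first makes longer lines more likely to come out earlier, so this is not a perfectly uniform shuffle; it simply follows the steps listed above.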

The following algorithm is linear in the number of lines in your input file.
Preprocessing:
Find n (the total number of lines) by scanning for newlines (or whatever), but store the character positions marking the beginning and end of each line. So you'd have two vectors, say s and e, of size n, where the characters from s[i] to e[i] in your input file make up the i-th line. In C++ I'd use vector.
Randomly permute a vector of integers from 1 to n (in C++ it would be random_shuffle) and store it in a vector, say, p (e.g. 1 2 3 4 becomes p = [3 1 4 2]). This implies that line i of the new file is now line p[i] in the original file (i.e. in the above example, the 1st line of the new file is the 3rd line of the original file).
Main
Create a new file
Write the first line in the new file by reading the text in the original file between s[p[0]] and e[p[0]] and appending it to the new file.
Continue as in step 2 for all the other lines.
So the overall complexity is linear in the number of lines (since random_shuffle is linear) if you assume read/write & seeking in a file (incrementing the file pointer) are all constant time operations.
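As a rough sketch of this two-pass idea in Python (the answer describes it in C++; the function and file names below are my own illustration):
import random

def shuffle_lines(in_path, out_path):
    # Pass 1: record the byte offset where every line starts.
    starts = []
    with open(in_path, 'rb') as f:
        offset = 0
        for line in f:
            starts.append(offset)
            offset += len(line)
    # Random permutation of the line indices (random.shuffle is linear).
    order = list(range(len(starts)))
    random.shuffle(order)
    # Pass 2: emit the lines in permuted order via seek + readline.
    with open(in_path, 'rb') as f, open(out_path, 'wb') as out:
        for i in order:
            f.seek(starts[i])
            out.write(f.readline())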

You can create an array of N strings and read the first N lines of the file into this array. For the rest of the file you read one line, select at random one of the lines in the array, write that line to the output file, and replace it with the newly read line. This has the advantage that you don't need to iterate over the file twice. The disadvantage is that it will not create a very random output, especially when N is low (for example, this algorithm can't move the last line more than N lines up in the output).
Edit
Just an example in python:
import sys
import random

CACHE_SIZE = 23

lines = {}
for l in sys.stdin:  # for testing, replace sys.stdin with e.g. ['%d\n' % i for i in range(200)]
    i = random.randint(0, CACHE_SIZE - 1)
    old = lines.get(i)
    if old is not None:
        print(old, end='')
    lines[i] = l

# finally, flush the lines still held in the cache
for _, p in lines.items():
    print(p, end='')

Related

how to improve performance to run faster (python)

Right now, my program takes more than 10 minutes to display all the possible words (if those words are in the file) that can be created from the given letters. The file has more than 4000 words.
How can I make my program run faster while still using recursion and without using any libraries, since I'm new to this?
If the user inputs the letters: b d o s y
then it will look up all the possible words in that file to create:
b
d
boy
boys
by
the code:
words = set()

def found(word, file):
    ## Reads through file and tries
    ## to match given word in a line.
    with open(file, 'r') as rf:
        for line in rf.readlines():
            if line.strip() == word:
                return True
    return False

def scramble(r_letters, s_letters):
    ## Output every possible combination of a word.
    ## Each recursive call moves a letter from
    ## r_letters (remaining letters) to
    ## s_letters (scrambled letters)
    if s_letters:
        words.add(s_letters)
    for i in range(len(r_letters)):
        scramble(r_letters[:i] + r_letters[i+1:], s_letters + r_letters[i])

thesarus = input("Enter the name of the file containing all of the words: ")
letters = input("Please enter your letters separated by a space: ")
word = ''.join(letters.split(' '))
scramble(word, '')
ll = list(words)
ll.sort()
for word in ll:
    if found(word, thesarus):
        print(word)
Your program runs slowly because your algorithm is inefficient.
Since the question requires recursion (to generate all the possible combinations), you could at least improve how you search the file.
Your code opens the file and scans it from the beginning for every single word. This is extremely inefficient.
The first solution that comes to mind is to read the file once and save each word in a set():
words_set = {line.strip() for line in open('somefile')}
or also (less concise)
words_set = set()
with open('somefile') as fp:
    for line in fp:
        words_set.add(line.strip())
Then, you just do
if word in words_set:
    print(word)
I think there could be more efficient ways to do the whole program, but they don't require recursion.
Update
For the sake of discussion, I think it may be useful to also provide an algorithm which is better.
Your code generates all possible combinations, even if those are not likely to be part of a dictionary, in addition to the inefficient search in the file for each word.
A better solution involves storing the words in a more efficient way, such that it is much easier to tell if a particular combination exists or not. For example, you don't want to visit (in the file) all the words composed by characters not present in the list provided by the user.
There is a data structure which I believe to be quite effective for this kind of problem: the trie (or prefix tree). This data structure can be used to store the whole thesaurus file, in place of the set I suggested above.
Then, instead of generating all the possible combinations of letters, you just visit the tree with all the possible letters to find all the possible valid words.
So, for example, if your user enters h o m e x and you have no word starting with x in your thesaurus, you will not generate all the permutations starting with x, such as xe, xo, xh, xm, etc., saving a large amount of computation.
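For illustration only (this is not part of the original answer), a minimal trie sketch along these lines might look like the following; the class and function names are my own, and the file name is a placeholder.
class TrieNode:
    def __init__(self):
        self.children = {}    # letter -> TrieNode
        self.is_word = False

def insert(root, word):
    node = root
    for ch in word:
        node = node.children.setdefault(ch, TrieNode())
    node.is_word = True

def find_words(node, remaining, prefix, results):
    # Only follow branches that exist in the trie, so prefixes that can
    # never form a word (like 'x...' in the example above) are never expanded.
    if node.is_word:
        results.add(prefix)
    for ch in sorted(set(remaining)):
        child = node.children.get(ch)
        if child is not None:
            rest = list(remaining)
            rest.remove(ch)
            find_words(child, rest, prefix + ch, results)

# Usage sketch:
# root = TrieNode()
# for line in open('thesaurus.txt'):
#     insert(root, line.strip())
# results = set()
# find_words(root, list('bdosy'), '', results)
# print(sorted(results))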

Splitting files with redundancy

I need a cmd tool or an algorithm description to do the following:
Given a file of any size (size < 1GB) I need to split the file into n parts (n may be any reasonable number) and 1 file with metadata.
The metadata is something similar to the "parity" byte in RAID systems, which will allow me to restore the whole file if one part of the file is lost. So the metadata is a kind of 'extra' or redundant info that helps restore the original file.
Other n parts should not have any extra information.
So far I have tried the parchive and par2 tools, but they don't implement the idea I described above.
I have also looked into forward error correction (FEC) and Reed-Solomon codes, but didn't get anywhere there either.
Thanks for any help.
You can apply something almost identical to RAID 5 (although that's for hard drives, the same logic applies).
So, send the 1st byte to the 1st file, the 2nd to the 2nd, ..., the nth to the nth, the (n+1)th to the 1st, the (n+2)th to the 2nd, etc.
Then just put the parity of each bit across the n files into the last file, i.e.:
1st bit of last file = 1st bit of 1st file
XOR 1st bit of 2nd file
XOR ...
XOR 1st bit of nth file
2nd bit of last file = XOR of 2nd bits of all the other files as above
This will allow you to restore any file (including the 'metadata' file) by just calculating the parity of the rest of files.
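As a rough sketch of that byte striping plus XOR parity (my own illustration, with placeholder file names; a real tool would also need to record the original length so the zero padding of the last stripe can be undone):
def split_with_parity(path, n):
    # Stripe byte i of the input to data part (i mod n), and write one XOR
    # parity byte per stripe of n bytes.
    parts = [open(f"{path}.part{i}", "wb") for i in range(n)]
    parity = open(f"{path}.parity", "wb")
    with open(path, "rb") as src:
        while True:
            stripe = src.read(n)
            if not stripe:
                break
            stripe = stripe.ljust(n, b"\x00")   # pad the final, short stripe
            p = 0
            for i, byte in enumerate(stripe):
                parts[i].write(bytes([byte]))
                p ^= byte                       # running XOR = parity byte
            parity.write(bytes([p]))
    for f in parts + [parity]:
        f.close()

# Any single missing part (including the parity file) can be rebuilt by
# XOR-ing the corresponding bytes of all the remaining parts.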

Is there a way to split a large file into chunks with random sizes?

I know you can split a file with split, but for test purposes I would like to split a large file into chunks whose sizes differ. Is this possible?
Alternatively, if the above-mentioned file is a zip, is there a way to split it into volumes of unequal sizes?
Any suggestions welcome! Thanks!
So the general question that you're asking is: how can I compute N random integers that sum to S? Specifically, S is the size of your file and N is how many smaller files that you want to break it into.
For example, assume that you want to split your file into 4 parts. If a, b, c, and d are four random numbers, then:
a + b + c + d = X
a/X + b/X + c/X + d/X = 1
S*a/X + S*b/X + S*c/X + S*d/X = S
Giving us four random numbers that sum to S, the size of your file.
Which means you'd want to write a script that:
Computes N random numbers (any random numbers).
Computes X as the sum of those random numbers.
Multiplies each of those random numbers by S/X (and makes sure you're left with integers greater than 0 that sum to S)
Splits the original file into pieces using the generated random numbers as sizes, using whatever tool you want.
This is a little much for a shell script, but would be pretty straightforward in something like Perl.
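As a rough sketch of those steps (in Python rather than the Perl the answer suggests; the function names and output file naming are my own, and it assumes the file size is much larger than N):
import os
import random

def random_sizes(total, n):
    # N random numbers, normalised so the resulting integer sizes sum to `total`.
    weights = [random.random() for _ in range(n)]
    x = sum(weights)
    sizes = [max(1, int(total * w / x)) for w in weights]
    sizes[-1] += total - sum(sizes)   # absorb rounding error in the last chunk
    return sizes

def split_random(path, n):
    total = os.path.getsize(path)
    with open(path, "rb") as src:
        for i, size in enumerate(random_sizes(total, n)):
            with open(f"{path}.{i:03d}", "wb") as out:
                out.write(src.read(size))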
Since you tagged the question only with shell, I suppose you want to handle it with a shell script and the common Linux commands/tools.
As far as I know there is no existing tool/command that can split a file randomly. To split a file, we can consider using split or dd.
Both tools support options such as how big each split file should be, or how many files to split into. Let's say we first use dd/split to split your file into 500 parts, each file the same size. So we have:
foo.zip.001
foo.zip.002
foo.zip.003
...
foo.zip.500
Then we take this file list as input and merge (cat) random groups of files together. This step can be done with awk or a shell script.
For example, we can build a set of cat statements like:
cat foo.zip.001 foo.zip.002 > part1
cat foo.zip.003 foo.zip.004 foo.zip.005 > part2
cat foo.zip.006 foo.zip.007 foo.zip.008 foo.zip.009 > part3
....
Run the generated cat statements and you get the final part1..partN, each part with a different size.
For example:
kent$ seq -f'foo.zip.%g' 20|awk 'BEGIN{i=k=2}NR<i{s=s sprintf ("%s ",$0);next}{k++;i=(NR+k);print "cat "s$0" >part"k-2;s="" }'
cat foo.zip.1 foo.zip.2 >part1
cat foo.zip.3 foo.zip.4 foo.zip.5 >part2
cat foo.zip.6 foo.zip.7 foo.zip.8 foo.zip.9 >part3
cat foo.zip.10 foo.zip.11 foo.zip.12 foo.zip.13 foo.zip.14 >part4
cat foo.zip.15 foo.zip.16 foo.zip.17 foo.zip.18 foo.zip.19 foo.zip.20 >part5
How well it performs you will have to test on your own, but at least this should work for your requirement.

Big data sort and search

I have two files of data, 100-char lines each. File A: 10^8 lines, file B: 10^6 lines. And I need to find all the strings from file B that are not in file A.
At first I was thinking of feeding both files to mysql, but it looks like it won't ever finish creating a unique key on 10^8 records.
I'm waiting for your suggestions on this.
You can perform this operation without a database. The key is to reduce the size of A, since A is much larger than B. Here is how to do this:
Calculate 64-bit hashes using a decent hash function for the strings in the B file. Store these in memory (in a hash table), which you can do because B is small. Then hash all of the strings in your A file, line by line, and see if each one matches a hash for your B file. Any lines with matching hashes (to one from B), should be stored in a file C.
When this process is complete file C will have the small subset of A of potentially matching strings (to B). Now you have a much smaller file C that you need to compare lines of B with. This reduces the problem to a problem where you can actually load all of the lines from C into memory (as a hash table) and compare each line of B to see if it is in C.
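As a rough sketch of that two-pass idea (my own code, with placeholder file names; blake2b truncated to 8 bytes stands in for "a decent 64-bit hash"):
import hashlib

def h64(s):
    # 64-bit hash of a line (any decent 64-bit hash would do).
    return hashlib.blake2b(s.encode(), digest_size=8).digest()

# Pass 1: hash the small file B into memory.
b_hashes = {h64(line.rstrip('\n')) for line in open('B.txt')}

# Pass 2: stream the big file A; lines whose hash matches go to candidate file C.
with open('A.txt') as a, open('C.txt', 'w') as c:
    for line in a:
        if h64(line.rstrip('\n')) in b_hashes:
            c.write(line)

# Final step: C is small, so compare B against it exactly in memory.
c_lines = {line.rstrip('\n') for line in open('C.txt')}
missing = [line.rstrip('\n') for line in open('B.txt')
           if line.rstrip('\n') not in c_lines]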
You can slightly improve on Michael Goldshteyn's answer (https://stackoverflow.com/a/3926745/179529). Since you need to find all the strings in B that are not in A, you can remove an item from the hash table of B's elements whenever you find a match for it among the elements of A. The elements left in the hash table at the end are the ones that were not found in file A.
For the sizes you mention you should be able to keep all of B in memory at once, so you could do a simplified version of Goldshteyn's answer; something like this in python:
#!/usr/bin/python3
import sys

if __name__ == '__main__':
    b = open(sys.argv[2], 'r')
    bs = set()
    for l in b:
        bs.add(l.strip())
    b.close()
    a = open(sys.argv[1], 'r')
    for l in a:
        l = l.strip()
        if l in bs:
            bs.remove(l)
    for x in bs:
        print(x)
I've tested this on two files of 10^5 and 10^7 lines in size, with ~8 chars per line, on an Atom processor. Output from /usr/bin/time:
25.15user 0.27system 0:25.80elapsed 98%CPU (0avgtext+0avgdata 56032maxresident)k
0inputs+0outputs (0major+3862minor)pagefaults 0swaps
60298 60298 509244

Which one is a suitable data structure?

Two files, each terabytes in size. A file comparison tool compares the i-th line of file1 with the i-th line of file2; if they are the same, it prints the line. Which data structure is suitable?
B-tree
linked list
hash tables
none of them
You need to be able to buffer up at LEAST a line at a time. Here's one way:
While neither file is at EOF:
Read lines A and B from files one and two (each)
If lines are identical, print one of them
Translate into suitable programming language, and problem is solved.
Note that no fancy data structures are involved.
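For illustration, a minimal Python sketch of that loop (file names are placeholders):
with open('file1') as f1, open('file2') as f2:
    # zip stops at the shorter file's EOF, i.e. "while neither file is at EOF"
    for line1, line2 in zip(f1, f2):
        if line1 == line2:
            print(line1, end='')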
The simple logic is to read one line at a time from each file and match them.
It's like:
While line1 is not at EOF of file1 and line2 is not at EOF of file2:
Compare line1 and line2
By the way, you have to be sure of the maximum number of characters a line can contain, so you can size your buffer accordingly.
Otherwise, try a big-data framework such as Spark to make your work easier.
