Big data sort and search - sorting

I have two files of data with 100-character lines each. File A has 10^8 lines, file B has 10^6 lines, and I need to find all the strings from file B that are not in file A.
At first I was thinking of feeding both files to MySQL, but it looks like it will never finish creating a unique key on 10^8 records.
I'm waiting for your suggestions on this.

You can perform this operation without a database. The key is to reduce the size of A, since A is much larger than B. Here is how to do this:
Calculate 64-bit hashes using a decent hash function for the strings in the B file. Store these in memory (in a hash table), which you can do because B is small. Then hash all of the strings in your A file, line by line, and see if each one matches a hash for your B file. Any lines with matching hashes (to one from B) should be stored in a file C.
When this process is complete file C will have the small subset of A of potentially matching strings (to B). Now you have a much smaller file C that you need to compare lines of B with. This reduces the problem to a problem where you can actually load all of the lines from C into memory (as a hash table) and compare each line of B to see if it is in C.
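Here is a minimal Python sketch of that two-pass idea. The file names A.txt, B.txt and C.txt are placeholders, and Python's built-in hash() stands in for "a decent 64-bit hash function" (it is only consistent within a single run, which is all this needs):

# Pass 1: hash every line of B and keep the hashes in memory (B is small).
b_hashes = set()
with open("B.txt") as b:
    for line in b:
        b_hashes.add(hash(line.rstrip("\n")))

# Pass 2: stream through A, keeping only lines whose hash matches some B hash.
with open("A.txt") as a, open("C.txt", "w") as c:
    for line in a:
        if hash(line.rstrip("\n")) in b_hashes:
            c.write(line)

# Final step: C is small, so load it as a set and report the B lines not in it.
with open("C.txt") as c:
    c_lines = set(line.rstrip("\n") for line in c)
with open("B.txt") as b:
    for line in b:
        if line.rstrip("\n") not in c_lines:
            print(line.rstrip("\n"))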

You can slightly improve on Michael Goldshteyn's answer (https://stackoverflow.com/a/3926745/179529). Since you need to find all the strings in B that are not in A, you can remove each item from the hash table of B's elements as soon as you find a match for it among the elements of A. The elements that remain in the hash table are the ones that were not found in file A.

For the sizes you mention you should be able to keep all of B in memory at once, so you could do a simplified version of Goldshteyn's answer; something like this in python:
#!/usr/bin/python3
import sys

if __name__ == '__main__':
    # read every line of B into a set
    b = open(sys.argv[2], 'r')
    bs = set()
    for l in b:
        bs.add(l.strip())
    b.close()
    # drop from the set every line that also appears in A
    a = open(sys.argv[1], 'r')
    for l in a:
        l = l.strip()
        if l in bs:
            bs.remove(l)
    # whatever is left was never seen in A
    for x in bs:
        print(x)
I've tested this on two files of 10^5 and 10^7 lines with ~8 chars per line on an Atom processor. Output from /usr/bin/time:
25.15user 0.27system 0:25.80elapsed 98%CPU (0avgtext+0avgdata 56032maxresident)k
0inputs+0outputs (0major+3862minor)pagefaults 0swaps
60298 60298 509244

Related

how to improve performance to run faster (python)

Right now, my program takes more than 10 minutes to display all the possible words (if those words are in the file) that can be created from the given letters. The file contains more than 4000 words.
How can I make my program run faster while still using recursion, and without using any libraries, because I'm new to this?
If the user inputs the letters: b d o s y
then it will look up all the possible words in that file that can be created:
b
d
boy
boys
by
the code:
words = set()

def found(word, file):
    ## Reads through file and tries
    ## to match given word in a line.
    with open(file, 'r') as rf:
        for line in rf.readlines():
            if line.strip() == word:
                return True
        return False

def scramble(r_letters, s_letters):
    ## Output every possible combination of a word.
    ## Each recursive call moves a letter from
    ## r_letters (remaining letters) to
    ## s_letters (scrambled letters)
    if s_letters:
        words.add(s_letters)
    for i in range(len(r_letters)):
        scramble(r_letters[:i] + r_letters[i+1:], s_letters + r_letters[i])

thesarus = input("Enter the name of the file containing all of the words: ")
letters = input("Please enter your letters separated by a space: ")
word = ''.join(letters.split(' '))
scramble(word, '')
ll = list(words)
ll.sort()
for word in ll:
    if found(word, thesarus):
        print(word)
Your program runs slowly because your algorithm is inefficient.
Since the question requires you to use recursion (to generate all the possible combinations), you can at least improve how you search the file.
Your code opens the file and reads through it looking for a single word, and does this once for every candidate word. This is extremely inefficient.
The first solution that comes to mind is to read the file once and save each word in a set():
words_set = {line.strip() for line in open('somefile')}
or also (less concise):
words_set = set()
with open('somefile') as fp:
    for line in fp:
        words_set.add(line.strip())
Then, you just do
if word in words_set:
    print(word)
I think there could be more efficient ways to do the whole program, but they don't require recursion.
Update
For the sake of discussion, I think it may be useful to also provide a better algorithm.
Besides the inefficient search in the file for each word, your code generates all possible combinations, even those that are unlikely to be part of any dictionary.
A better solution involves storing the words in a more efficient way, such that it is much easier to tell whether a particular combination exists or not. For example, you don't want to visit (in the file) all the words composed of characters not present in the list provided by the user.
There is a data structure that I believe is quite effective for this kind of problem: the trie (or prefix tree). This data structure can be used to store the whole thesaurus file, in place of the set that I suggested above.
Then, instead of generating all the possible combinations of letters, you just walk the trie with all the possible letters to find all the possible valid words, as in the sketch below.
So, for example, if your user enters h o m e x and you have no word starting with x in your thesaurus, you will not generate any of the permutations starting with x, such as xe, xo, xh, xm, etc., saving a large amount of computation.
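For illustration, here is a rough Python sketch of the trie idea (the file name thesaurus.txt and the helper names are made up for this example); it only descends into branches for letters that are actually available, so impossible prefixes are never explored:

def build_trie(words):
    # nested dicts; the '$' key marks the end of a valid word
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node['$'] = True
    return root

def find_words(node, letters, prefix='', results=None):
    # walk the trie using only the letters still available
    if results is None:
        results = set()
    if '$' in node:
        results.add(prefix)
    for i, ch in enumerate(letters):
        if ch in node:  # prune: skip letters no stored word continues with
            find_words(node[ch], letters[:i] + letters[i+1:], prefix + ch, results)
    return results

trie = build_trie(line.strip() for line in open('thesaurus.txt'))
print(sorted(find_words(trie, 'bdosy')))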

Is there a way to split a large file into chunks with random sizes?

I know you can split a file with split, but for test purposes I would like to split a large file into chunks whose sizes differ. Is this possible?
Alternatively, if the above-mentioned file is a zip, is there a way to split it into volumes of unequal sizes?
Any suggestions welcome! Thanks!
So the general question that you're asking is: how can I compute N random integers that sum to S? Specifically, S is the size of your file and N is how many smaller files that you want to break it into.
For example, assume that you want to split your file into 4 parts. If a, b, c, and d are four random numbers, then:
a + b + c + d = X
a/X + b/X + c/X + d/X = 1
S*a/X + S*b/X + S*c/X + S*d/X = S
Giving us four random numbers that sum to S, the size of your file.
Which means you'd want to write a script that:
Computes N random numbers (any random numbers).
Computes X as the sum of those random numbers.
Multiplies each of those random numbers by S/X (and makes sure you're left with integers greater than 0 that sum to S)
Splits the original file into pieces using the generated random numbers as sizes, using whatever tool you want.
This is a little much for a shell script, but would be pretty straightforward in something like Perl.
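For what it's worth, here is a rough sketch of those steps in Python (the function name is just illustrative; the actual splitting could then be done with dd or a small read loop):

import random

def random_chunk_sizes(S, N):
    r = [random.random() for _ in range(N)]      # step 1: N random numbers
    X = sum(r)                                   # step 2: their sum
    sizes = [max(1, int(S * v / X)) for v in r]  # step 3: scale so they sum to ~S
    sizes[-1] += S - sum(sizes)                  # crude rounding fix; assumes S >> N
    return sizes

print(random_chunk_sizes(1000000, 4))            # e.g. four sizes summing to 1000000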
Since you tagged the question only with shell, I suppose you want to handle it with a shell script and the common Linux commands/tools.
As far as I know there is no existing tool/command that can split a file randomly. To split a file, we can consider using split or dd.
Both tools support options for how big each split file should be or how many files you want. Let's say we first use dd/split to split your file into 500 parts, each file the same size, so we have:
foo.zip.001
foo.zip.002
foo.zip.003
...
foo.zip.500
Then we take this file list as input and merge groups of them (with cat). This step can be done by awk or a shell script.
For example, we can build a set of cat statements like:
cat foo.zip.001 foo.zip.002 > part1
cat foo.zip.003 foo.zip.004 foo.zip.005 > part2
cat foo.zip.006 foo.zip.007 foo.zip.008 foo.zip.009 > part3
....
Run the generated cat statements and you get the final part1..partN, each part with a different size.
For example:
kent$ seq -f'foo.zip.%g' 20|awk 'BEGIN{i=k=2}NR<i{s=s sprintf ("%s ",$0);next}{k++;i=(NR+k);print "cat "s$0" >part"k-2;s="" }'
cat foo.zip.1 foo.zip.2 >part1
cat foo.zip.3 foo.zip.4 foo.zip.5 >part2
cat foo.zip.6 foo.zip.7 foo.zip.8 foo.zip.9 >part3
cat foo.zip.10 foo.zip.11 foo.zip.12 foo.zip.13 foo.zip.14 >part4
cat foo.zip.15 foo.zip.16 foo.zip.17 foo.zip.18 foo.zip.19 foo.zip.20 >part5
You'll have to test the performance on your own, but at least this should work for your requirement.

Prolog - Getting element from a list of lists

I am having trouble figuring out how to access a single character from a list of strings without using recursion, but instead backtracking.
For example, I have this list of strings and I want to be able to return a single character from one of these strings ('.', 'o', '*'). The program I am working on treats it as rows and columns. It is a fact in my database that looks like this:
matrix(["...o....",
".******.",
"...o....",
".*...*..",
"..o..*..",
".....*..",
".o...*..",
"....o..o"].
I have the predicate:
get(Row,Col,TheChar) :-
that takes a row and column number (with indexes starting at 1) and returns the entry (TheChar) at that specific row and column.
I have a feeling my predicate head might not be built correctly, but I'm really more focused on how to go through each string in the list character by character, without recursion, and return that.
I am new to prolog and am having major difficulty with this.
Any help at all would be greatly appreciated!
Thank you!
An implementation of get/3 might look like this:
get(Row,Col,TheChar) :-
    matrix(M),
    nth(Row,M,RowList),
    nth(Col,RowList,TheChar).
Note that TheChar is unified with a character code, e.g.:
| ?- get(1,4,X).
X = 111
If you want to see the character itself you can, for instance, use atom_codes/2, e.g.:
| ?- get(4,2,X), atom_codes(CharAtom,[X]).
X = 42
CharAtom = *
Hope this helps.
Using your matrix representation, you could do something like this:
cell(X,Y,Cell) :-
    matrix(Rows),
    Matrix =.. [matrix|Rows],
    arg(X,Matrix,Cols),
    Row =.. [row|Cols],
    arg(Y,Row,Cell).
The use of =.. to construct terms on the fly might be a hint that your matrix representation isn't the best. You might consider different representations for your matrix.
Assuming a "standard" matrix with fixed-length rows, you could represent the matrix
A B C D
E F G H
I J K L
in a couple of different ways:
A single string, if the cell values can be represented as a single character and your prolog supports real strings (rather than string-as-list-of-char-atoms):
"ABCDEFGHIJKL"
Lookup is simple and zero-relative (e.g., the first row and the first column are both numbered 0):
( RowLength * RowOffset ) + ColOffset
gives you the index to the appropriate character in the atom. Retrieval consists of a simple substring operation. This has the advantages of speed and simplicity.
A compound term is another option:
matrix( rows( row('A','B','C','D') ,
              row('E','F','G','H') ,
              row('I','J','K','L')
            )
      ).
Lookup is still simple:
cell(X,Y,Matrix,Value) :-
    arg(X,Matrix,Row),
    arg(Y,Row,Value).
A third option might be to use the database to represent your matrix more directly, using the database predicates asserta, assertz, retract, retractall, recorda, recordz, recorded, and erase. You could build a structure of facts in the database along the lines of:
matrix( Matrix_Name ).
matrix_cell( Matrix_Name , RowNumber , ColumnNumber , Value ).
This has the advantage of allowing both sparse (empty cells don't need to be represented) and jagged (rows can vary in length) representations.
Another option (last resort,you might say) would be to jump out into a procedural language, if your prolog allows that, and represent the matrix in a more...matrix-like manner. I had to do that once: we ran into huge performance problems with both memory and CPU once the data model got past a certain size. Our solution was to represent the needed relation as a ginormous array of bits, which was trivial to do in C (and not so much in Prolog).
I'm sure you can come up with other methods of representing matrices as well.
TMTOWTDI (Tim-Toady or "There's More Than One Way To Do It") as they say in the Perl community.

How to shuffle the lines in a file without reading the whole file in advance?

What's a good algorithm to shuffle the lines in a file without reading the whole file in advance?
I guess it would look something like this: start reading the file line by line from the beginning, storing each line as you go, and decide whether you want to print one of the lines stored so far (and then remove it from storage) or do nothing and proceed to the next line.
Can someone verify / prove this and/or maybe post working (perl, python, etc.) code?
Related questions, but not looking at memory-efficient algorithms:
How can I shuffle the lines of a text file on the Unix command line or in a shell script?
How can I randomize the lines in a file using standard tools on Red Hat Linux?
How can I print the lines in STDIN in random order in Perl?
I cannot think of a way to shuffle the entire file randomly without somehow maintaining a list of what has already been written. I think if I had to do a memory-efficient shuffle, I would scan the file, building a list of offsets for the new lines. Once I have this list of newline offsets, I would randomly pick one of them, write the corresponding line to stdout, and then remove it from the list of offsets.
I am not familiar with Perl or Python, but I can demonstrate with PHP.
<?php
$offsets = array();
$f = fopen("file.txt", "r");
$offsets[] = ftell($f);
while (! feof($f))
{
    if (fgetc($f) == "\n") $offsets[] = ftell($f);
}
shuffle($offsets);
foreach ($offsets as $offset)
{
    fseek($f, $offset);
    echo fgets($f);
}
fclose($f);
?>
The only other option I can think of, if scanning the file for new lines is absolutely unacceptable, would be (I am not going to code this one out):
Determine the filesize
Create a list of offsets and lengths already written to stdout
Loop until bytes_written == filesize
Seek to a random offset that is not already in your list of already written values
Back up from that seek to the previous newline or start of file
Display that line, and add it to the list of offsets and lengths written
Go to 3.
The following algorithm is linear in the number of lines in your input file.
Preprocessing:
Find n (the total number of lines) by scanning for newlines (or whatever), but store the character positions marking the beginning and end of each line. So you'd have two vectors, say s and e, of size n, where the characters numbered from s[i] to e[i] in your input file form the i-th line. In C++ I'd use vector.
Randomly permute a vector of integers from 1 to n (in C++ it would be random_shuffle) and store it in a vector, say, p (e.g. 1 2 3 4 becomes p = [3 1 4 2]). This implies that line i of the new file is now line p[i] in the original file (i.e. in the above example, the 1st line of the new file is the 3rd line of the original file).
Main
Create a new file
Write the first line in the new file by reading the text in the original file between s[p[0]] and e[p[0]] and appending it to the new file.
Continue as in step 2 for all the other lines.
So the overall complexity is linear in the number of lines (since random_shuffle is linear) if you assume read/write & seeking in a file (incrementing the file pointer) are all constant time operations.
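A small Python sketch of this approach (the answer has C++ in mind; the file names here are placeholders). It records the byte range of every line, shuffles the ranges, and writes the lines out in the permuted order:

import random

# Preprocessing: record (start, end) byte offsets of every line.
offsets = []
pos = 0
with open("input.txt", "rb") as f:
    for line in f:
        offsets.append((pos, pos + len(line)))
        pos += len(line)

random.shuffle(offsets)  # the random permutation p

# Main: seek to each chosen line in the original file and append it to the new file.
with open("input.txt", "rb") as src, open("output.txt", "wb") as dst:
    for s, e in offsets:
        src.seek(s)
        dst.write(src.read(e - s))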
You can create an array for N strings and read the first N lines of the file into this array. For the rest, you read one line, select one of the lines from the array at random, and replace that array entry with the newly read line, writing the replaced line to the output file. This has the advantage that you don't need to iterate over the file twice. The disadvantage is that it will not produce a very random output file, especially when N is low (for example, this algorithm can't move the last line more than N lines up in the output).
Edit
Just an example in Python 2:
import sys
import random

CACHE_SIZE = 23
lines = {}
for l in sys.stdin: # you can replace sys.stdin with xrange(200) to get a test output
    i = random.randint(0, CACHE_SIZE-1)
    old = lines.get(i)
    if old:
        print old,
    lines[i] = l
for ignored, p in lines.iteritems():
    print p,

How would you sort 1 million 32-bit integers in 2MB of RAM?

Please, provide code examples in a language of your choice.
Update:
No constraints set on external storage.
Example: Integers are received/sent via network. There is a sufficient space on local disk for intermediate results.
Split the problem into pieces small enough to fit into available memory, then use merge sort to combine them.
Sorting a million 32-bit integers in 2MB of RAM using Python by Guido van Rossum
1 million 32-bit integers = 4 MB of memory.
You should sort them using some algorithm that uses external storage. Mergesort, for example.
You need to provide more information. What extra storage is available? Where are you supposed to store the result?
Otherwise, the most general answer:
1. Load the first half of the data into memory (2MB), sort it by any method, and output it to a file.
2. Load the second half of the data into memory (2MB), sort it by any method, and keep it in memory.
3. Use a merge algorithm to merge the two sorted halves and output the complete sorted data set to a file.
This Wikipedia article on external sorting has some useful information.
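A minimal Python sketch of that procedure, generalized to more than two chunks (it assumes the integers arrive on stdin one per line; the chunk size is an illustrative stand-in for "whatever fits in 2MB"):

import heapq
import sys
import tempfile

CHUNK = 100000  # integers per in-memory chunk; pick so a chunk fits in RAM

def spill(values):
    # write one sorted run to a temporary file and rewind it for reading
    f = tempfile.TemporaryFile(mode="w+")
    f.writelines("%d\n" % v for v in values)
    f.seek(0)
    return f

runs, chunk = [], []
for line in sys.stdin:
    chunk.append(int(line))
    if len(chunk) == CHUNK:
        runs.append(spill(sorted(chunk)))
        chunk = []
if chunk:
    runs.append(spill(sorted(chunk)))

# k-way merge of the sorted runs; heapq.merge reads each run lazily
for v in heapq.merge(*((int(line) for line in f) for f in runs)):
    print(v)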
Dual tournament sort with polyphased merge
#!/usr/bin/env python
import random
from sort import Pickle, Polyphase

nrecords = 1000000
available_memory = 2000000 # number of bytes
#NOTE: it doesn't count memory required by Python interpreter
record_size = 24 # (20 + 4) number of bytes per element in a Python list
heap_size = available_memory / record_size

p = Polyphase(compare=lambda x,y: cmp(y, x), # descending order
              file_maker=Pickle,
              verbose=True,
              heap_size=heap_size,
              max_files=4 * (nrecords / heap_size + 1))

# put records
maxel = 1000000000
for _ in xrange(nrecords):
    p.put(random.randrange(maxel))

# get sorted records
last = maxel
for n, el in enumerate(p.get_all()):
    if el > last: # elements must be in descending order
        print "not sorted %d: %d %d" % (n, el, last)
        break
    last = el
assert nrecords == (n + 1) # check all records read
Um, store them all in a file.
Memory map the file (you said there was only 2M of RAM; let's assume the address space is large enough to memory map a file).
Sort them using the file backing store as if it were real memory now!
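A sketch of that idea in Python, using numpy.memmap as a stand-in for a raw mmap (numpy and the file name ints.bin are assumptions, not part of the answer; the file would hold the integers as raw 32-bit values, and how much real RAM the sort touches depends on how the OS pages the mapping):

import numpy as np

# map the file of raw 32-bit integers as if it were an in-memory array
data = np.memmap("ints.bin", dtype=np.int32, mode="r+")
data.sort()    # in-place sort; the OS pages parts of the file in and out
data.flush()   # make sure the sorted data is written back to the file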
Here's a valid and fun solution.
Load half the numbers into memory. Heap sort them in place and write the output to a file. Repeat for the other half. Use external sort (basically a merge sort that takes file i/o into account) to merge the two files.
Aside:
Make heap sort faster in the face of slow external storage:
Start constructing the heap before all the integers are in memory.
Start putting the integers back into the output file while heap sort is still extracting elements
As people above mention, 1 million 32-bit ints take 4 MB.
To fit as many numbers as possible into as little space as possible, you can use the types int, short and char in C++. You could be slick (but have odd, dirty code) by doing several types of casting to stuff things everywhere.
Here it is, off the top of my head:
Anything that is less than 2^8 (0 - 255) gets stored as a char (1-byte data type).
Anything that is less than 2^16 (256 - 65535) and >= 2^8 gets stored as a short (2-byte data type).
The rest of the values would be put into an int (4-byte data type).
You would want to specify where the char section starts and ends, where the short section starts and ends, and where the int section starts and ends.
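As a rough illustration of that idea, using Python's array module as a stand-in for C++ char/short/int arrays (the typecodes 'B', 'H' and 'I' are unsigned 1-, 2- and 4-byte integers):

from array import array

small = array('B')   # 1 byte each: values 0..255
medium = array('H')  # 2 bytes each: values 256..65535
large = array('I')   # 4 bytes each: everything else

# route each value to the smallest array that can hold it
for v in (7, 300, 70000, 42, 100000):
    if v < 2**8:
        small.append(v)
    elif v < 2**16:
        medium.append(v)
    else:
        large.append(v)

print(len(small), len(medium), len(large))  # -> 2 1 2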
No example, but bucket sort has relatively low complexity and is easy enough to implement.
