as I'm well aware, using AppendTo[] for large lists isn't recommended as the function gets progressively slower with each appending done.
Lots of suggestions talk about using Reap[] and Sow[], however the ones I found don't deal with nested lists.
My question is relatively simple, how to substitute AppendTo[] with Reap[], Sow[] as directly as possible?
Specific information regarding my problem: I would like to append data in form of
data = {{x},{y}}
to a list, so after a few iterations, the list would look something like
list = {{{x1},{y1}},{{x2},{y2}},{{x3},{y3}}}
My solution works for only 1 iteration, and then breaks apart because of increasing use of Flatten[], so it's obviously a non-working solution. Of course, I'm open to other alternatives, faster than AppendTo[].
Thanks!
My question may sound a bit strange:
I know that hash functions used for encryption must have the characteristic to greatly change the outputs even for small changes of one bit due to some sort of avalanche effect.
Does an unefficent hash exist with the characteristic that for similar strings it produces similar outputs?
If the answer is yes, can you tell me if there is a way to manipulate this aspect (the definition of "similar") with underlying, pre-known relations between the characters of a text?
These kind of hashing functions are called locality sensitive.
Yes, it exists. As you expect the definition of similar must be given. But it depends on your application - I can just give an example.
Suppose your strings are domains and you want to hash all subdomains into one bucket. Then you can reverse strings like:
finance.yahoo.com => com.yahoo.finance
sport.yahoo.com => com.yahoo.sport
user.mail.yahoo.com => com.yahoo.mail.user
and hash only the first two parts: com.yahoo, dropping the rest. Your hash function can look like this (in python):
def hash(url):
return any_other_hash_function(".".join(url.split(".")[::-1][:2]))
Your question is not strange at all. You can find similar approach in Google's map-reduce or BigTable (and in many many other systems) for keeping similar things altogether, so that one can speed up computations.
The example I gave is for strings, but you can use similar approach for other objects. Here the idea is just to separate items into groups and hash the group id (high-level domain name).
I looking for a algorithm, function or technique that can take a string and convert it to a number. I would like the algorithm or function to have the following properties:
Identical string yields the same calculated value
Similar strings would yield similar values (similar can be defined as similar in meaning or similar in composition)
Capable of handling strings of variable length
I read an article several years ago that gives me hope that this can be achieved. Unfortunately, I have been unable to recall the source of the article.
Similar in composition is pretty easy, I'll let somebody else tackle that.
Similar in meaning is a lot harder, but fun :), I remember reading an article about how a neural network was trained to construct a 2D "semantic meaning graph" of a whole bunch of english words, where the distance between two words represented how "similar" they are in meaning, just by training it on wikipedia articles.
You could do the same thing, but make it one-dimensional, that will give you a single continuous number, where similar words will be close to each other.
Non-serious answer: Map everything to 0
Property 1: check. Property 2: check. Property 3: check.
But I figure you want dissimilar strings to get different values, too. The question then is, what is similar and what is not.
Essentially, you are looking for a hash function.
There are a lot of hash functions designed with different objectives. Crypographic hashes for examples are pretty expensive to compute, because you want to make it really hard to go backwards or even predict how a change to the input affects the output. So they try really hard to violate your condition 2. There are also simpler hash functions that mostly try to spread the data. They mostly try to ensure that close input values are not close to each other afterwards (but it is okay if it is predictable).
You may want to read up on Wikipedia:
https://en.wikipedia.org/wiki/Hash_function#Finding_similar_substrings
(Yes, it has a section on "Finding similar substrings" via Hashing)
Wikipedia also has a list of hash functions:
https://en.wikipedia.org/wiki/List_of_hash_functions
There is a couple of related stuff for you. For example minhash could be used. Here is a minhash-inspired approach for you: Define a few random lists of all letters in your alphabet. Say I have the letters "abcde" only for this example. I'll only use two lists for this example. Then my lists are:
p1 = "abcde"
p2 = "edcba"
Let f1(str) be the index in p1 of the first letter in my test word, f2(str) the first letter in p2. So the word "bababa" would map to 0,3. The word "ababab" also. The word "dada" would make to 0,1, while "ce" maps to 2,0. Note that this map is invariant to word permutations (because it treats them as sets) and for long texts it will converge to "0,0". Yet with some fine tuning it can give you a pretty fast chance of finding candidates for closer inspection.
Fuzzy hashing (context triggered piecewise hashing) may be what you are looking for.
Implemenation: ssdeep
Explanation of the algorithm: Identifying almost identical files using context triggered piecewise hashing
I think you're probably after a hash function, as numerous posters have said. However, similar in meaning is also possible, after a fashion: use something like Latent Dirichlet Allocation or Latent Semantic Analysis to map your word into multidimensional space, relative to a model trained on a large collection of text (these pre-trained models can be downloaded if you don't have access to a representative sample of the kind of text you're interested in). If you need a scalar value rather than multi-dimensional vector (it's hard to tell, you don't say what you want it for) you could try a number of things like the probability of the most probable topic, the mean across the dimensions, the index of the most probable topic, etc. etc.
num = 0
for (byte in getBytes(str))
num += UnsignedIntValue(byte)
This would meet all 3 properties(for #2, this works on the strings binary composition).
I am working on the parallelization an algorithm, which roughly does the following:
Read several text documents with a total of 10k words.
Create an objects for every word in the text corpus.
Create a pair between all word-objects (yes, O(n)). And return the most frequent pairs.
I would like to parallelize the 3. step by creating the pairs between the first 1000 word-objects the rest on the fist machine, the second 1000 word-objects on the next machine, etc.
My question is how to pass the objects created in the 2. step to the Mapper? As far as I am aware I would require input files for this and hence would need to serialize the objects (though haven't worked with this before). Is there a direct way to pass the objects to the Mapper?
Thanks in advance for the help
Evgeni
UPDATE
Thank you for reading my question before. Serialization seems to be the best way to solve this (see java.io.Serializable). Furthermore, I have found this tutorial useful to read data from serialized objects into hadoop: http://www.cs.brown.edu/~pavlo/hadoop/).
How about parallelize all steps? Use your #1 text documents as input to your Mapper. Create the object for every word in the Mapper. In the Mapper your key-value pair will be the word-object pair (or object-word depending on what you are doing). The Reducer can then count the unique pairs.
Hadoop will take care of bringing all the same keys together into the same Reducer.
Use twitter protobufs ( elephant-bird ) . Convert each word into a protobuf object and process it however you want. Also protobufs are much faster and light compared to default java serialization. Refer Kevin Weil's presentation on this. http://www.slideshare.net/kevinweil/protocol-buffers-and-hadoop-at-twitter
I was asked to stay away from HashMap or any sort of Hashing.
The question went something like this -
Lets say you have PRODUCT IDs of up to 20 decimals, along with Product Descriptions. Without using Maps or any sort of hashing function, what's the best/most efficient way to store/retrieve these product IDs along with their descriptions?
Why is using Maps a bad idea for such a scenario?
What changes would you make to sell your solution to Amazon?
A map is good to use when insert/remove/lookup operations are interleaved. Every operations are amortized in O(log n).
In your exemple you are only doing search operation. You may consider that any database update (inserting/removing a product) won't happen so much time. Therefore probably the interviewer want you to get the best data structure for lookup operations.
In this case I can see only some as already proposed in other answers:
Sorted array (doing a binary search)
Hasmap
trie
With a trie , if product ids do not share a common prefix, there is good chance to find the product description only looking at the first character of the prefix (or only the very first characters). For instance, let's take that product id list , with 125 products:
"1"
"2"
"3"
...
"123"
"124"
"1234567"
Let's assume you are looking for the product id titled "1234567" in your trie, only looking to the first letters: "1" then "2" then "3" then "4" will lead to the good product description. No need to read the remaining of the product id as there is no other possibilities.
Considering the product id length as n , your lookup will be in O(n). But as in the exemple explained it above it could be even faster to retreive the product description. As the procduct ID is limited in size (20 characters) the trie height will be limited to 20 levels. That actually means you can consider the look up operations will never goes beyond a constant time, as your search will never goes beyong the trie height => O(1). While any BST lookups are at best amortized O(log N), N being the number of items in your tree .
While an hashmap could lead you to slower lookup as you'll need to compute an index with an hash function that is probably implemented reading the whole product id length. Plus browsing a list in case of collision with other product ids.
Doing a binary search on a sorted array, and performance in lookup operations will depends on the number of items in your database.
A B-Tree in my opinion. Does that still count as a Map?
Mostly because you can have many items loaded at once in memory. Searching these items in memory is very fast.
Consecutive integer numbers give perfect choice for the hash map but it only has one problem, as it does not have multithreaded access by default. Also since Amazon was mentioned in your question I may think that you need to take into account concurency and RAM limitation issues.
What you might do in the response to such question is to explain that since
you are dissallowed to use any built-in data storage schemes, all you can do is to "emulate" one.
So, let's say you have M = 10^20 products with their numbers and descriptions.
You can partition this set to the groups of N subsets.
Then you can organize M/N containers which have sugnificantly reduced number of elements. Using this idea recursively will give you a way to store the whole set in containers with such property that access to them would have accepted performance rate.
To illustrate this idea, consider a smaller example of only 20 elements.
I would like you to imagive the file system with directories "1", "2", "3", "4".
In each directory you store the product descriptions as files in the following way:
folder 1: files 1 to 5
folder 2: files 6 to 10
...
folder 4: files 16 to 20
Then your search would only need two steps to find the file.
First, you search for a correct folder by dividing 20 / 5 (your M/N).
Then, you use the given ID to read the product description stored in a file.
This is just a very rough description, however, the idea is very intuitive.
So, perhaps this is what your interviewer wanted to hear.
As for myself, when I face such questions on interview, even if I fail to get the question correctly (which is the worst case :)) I always try to get the correct answer from the interviewer.
Best/efficient for what? Would have been my answer.
E.g. for storing them, probably the fast thing to do are two arrays with 20 elements each. One for the ids, on for the description. Iterating over those is pretty fast to. And it is efficient memory wise.
Of course the solution is pretty useless for any real application, but so is the question.
There is an interesting alternative to B-Tree: Radix Tree
I think what he wanted you to do, and I'm not saying it's a good idea, is to use the computer memory space.
If you use a 64-bit (virtual) memory address, and assuming you have all the address space for your data (which is never the case) you can store a one-byte value.
You could use the ProductID as an address, casting it to a pointer, and then get that byte, which might be an offset in another memory for actual data.
I wouldn't do it this way, but perhaps that is the answer they were looking for.
Asaf
I wonder if they wanted you to note that in an ecommerce application (such as Amazon's), a common use case is "reverse lookup": retrieve the product ID using the description. For this, an inverted index is used, where each keyword in a description is an index key, which is associated with a list of relevant product identifiers. Binary trees or skip lists are good ways to index these key words.
Regarding the product identifier index: In practice, B-Trees (which are not binary search trees) would be used for a large, disk-based index of 20-digit identifiers. However, they may have been looking for a toy solution that could be implemented in RAM. Since the "alphabet" of decimal numbers is so small, it lends itself very nicely to a trie.
The hashmaps work really well if the hashing function gives you a very uniform distribution of the hashvalues of the existing keys. With really bad hash function it can happen so that hash values of your 20 values will be the same, which will push the retrieval time to O(n). The binary search on the other hand guaranties you O(log n), but inserting data is more expensive.
All of this is very incremental, the bigger your dataset is the less are the chances of a bad key distribution (if you are using a good, proven hash algorithm), and on smaller data sets the difference between O(n) and O(log n) is not much to worry about.
If the size is limited sometimes it's faster to use a sorted list.
When you use Hash-anything, you first have to calculate a hash, then locate the hash bucket, then use equals on all elements in the bucket. So it all adds up.
On the other hand you could use just a simple ArrayList ( or any other List flavor that is suitable for the application), sort it with java.util.Collections.sort and use java.util.Collections.binarySearch to find an element.
But as Artyom has pointed out maybe a simple linear search would be much faster in this case.
On the other hand, from maintainability point of view, I would normally use HashMap ( or LinkedHashMap ) here, and would only do something special here when profiler would tell me to do it. Also collections of 20 have a tendency to become collections of 20000 over time and all this optimization would be wasted.
There's nothing wrong with hashing or B-trees for this kind of situation - your interviewer probably just wanted you to think a little, instead of coming out with the expected answer. It's a good sign, when interviewers want candidates to think. It shows that the organization values thought, as opposed to merely parroting out something from the lecture notes from CS0210.
Incidentally, I'm assuming that "20 decimal product ids" means "a large collection of product ids, whose format is 20 decimal characters".... because if there's only 20 of them, there's no value in considering the algorithm. If you can't use hashing or Btrees code a linear search and move on. If you like, sort your array, and use a binary search.
But if my assumption is right, then what the interviewer is asking seems to revolve around the time/space tradeoff of hashmaps. It's possible to improve on the time/space curve of hashmaps - hashmaps do have collisions. So you might be able to get some improvement by converting the 20 decimal digits to a number, and using that as an index to a sparsely populated array... a really big array. :)
Selling it to Amazon? Good luck with that. Whatever you come up with would have to be patentable, and nothing in this discussion seems to rise to that level.
20 decimal PRODUCT IDs, along with Product Description
Simple linear search would be very good...
I would create one simple array with ids. And other array with data.
Linear search for small amount of keys (20!) is much more efficient then any binary-tree or hash.
I have a feeling based on their answer about product ids and two digits the answer they were looking for is to convert the numeric product ids into a different base system or packed form.
They made a point to indicate the product description was with the product ids to tell you that a higher base system could be used within the current fields datatype.
Your interviewer might be looking for a trie. If you have a [small] constant upper bound on your key, then you have O(1) insert and lookup.
I think what he wanted you to do, and
I'm not saying it's a good idea, is to
use the computer memory space.
If you use a 64-bit (virtual) memory
address, and assuming you have all the
address space for your data (which is
never the case) you can store a
one-byte value.
Unfortunately 2^64 =approx= 1.8 * 10^19. Just slightly below 10^20. Coincidence?
log2(10^20) = 66.43.
Here's a slightly evil proposal.
OK, 2^64 bits can fit inside a memory space.
Assume a bound of N bytes for the description, say N=200. (who wants to download Anna Karenina when they're looking for toasters?)
Commandeer 8*N 64-bit machines with heavy RAM. Amazon can swing this.
Every machine loads in their (very sparse) bitmap one bit of the description text for all descriptions. Let the MMU/virtual memory handle the sparsity.
Broadcast the product tag as a 59-bit number and the bit mask for one byte. (59 = ceil(log2(10^20)) - 8)
Every machine returns one bit from the product description. Lookups are a virtual memory dereference. You can even insert and delete.
Of course paging will start to be a bitch at some point!
Oddly enough, it will work the best if product-id's are as clumpy and ungood a hash as possible.