Parse multiple array similarity query - algorithm

I am working on an algorithm that will compare 2 objects, object 1 and object 2. Each object has attributes which are 5 different arrays, array A, B, C, D, and E.
In order for the two objects to be a match, at least one item from Object 1's array A must be in Object 2's array A, AND at least one item from Object 1's array B must be in Object 2's array B, and so on through array E. The more matches there are in each of arrays A-E, the higher the score the match will produce.
Am I going to have to pull Object 1 and Object 2 and then do an n^2 complexity search on each array to determine which items exist in both? Then I would derive a score from how many matches there were in each array, add them up, and the total would give me the score.
I feel like there has to be a better option for this, especially for Parse.com
Maybe I am going about this problem all wrong - can someone PLEASE help me with it? I would provide some code, but I have not started writing it yet because I cannot wrap my head around the best way to design it. The database for the two objects is already in place, though.
Thanks!
As I said, I may be thinking of this problem in the wrong way. If I am unclear about anything that I am trying to do, let me know and I will update accordingly.

Simplest solution:
Copy all elements of one array from Object 1 into a hash table (unordered map), then iterate over the corresponding array in Object 2 and look each element up in the map. The time complexity is O(N).
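For illustration, here is a minimal sketch of that idea in Java (the class and method names are made up, and String elements are assumed): it counts the overlap of one pair of arrays with a hash set, and sums those counts over the five attribute arrays to produce the score, returning 0 unless every pair overlaps.

import java.util.*;

public class MatchScore {

    // Count how many items of a1 also appear in a2 using a hash set: O(n + m) expected.
    static int countMatches(List<String> a1, List<String> a2) {
        Set<String> lookup = new HashSet<>(a2);
        int matches = 0;
        for (String item : a1) {
            if (lookup.contains(item)) {
                matches++;
            }
        }
        return matches;
    }

    // Score two objects: every array A..E must overlap at least once,
    // and the score is the total number of matches across the five arrays.
    static int score(List<List<String>> object1, List<List<String>> object2) {
        int total = 0;
        for (int i = 0; i < object1.size(); i++) {
            int matches = countMatches(object1.get(i), object2.get(i));
            if (matches == 0) {
                return 0;   // not a match unless every array pair overlaps
            }
            total += matches;
        }
        return total;
    }

    public static void main(String[] args) {
        List<List<String>> obj1 = Arrays.asList(
            Arrays.asList("a1", "a2"), Arrays.asList("b1"), Arrays.asList("c1"),
            Arrays.asList("d1"), Arrays.asList("e1", "e2"));
        List<List<String>> obj2 = Arrays.asList(
            Arrays.asList("a2", "a3"), Arrays.asList("b1"), Arrays.asList("c1"),
            Arrays.asList("d1"), Arrays.asList("e2"));
        System.out.println(score(obj1, obj2)); // 5
    }
}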
Smart solution:
Instead of keeping the elements of each object in plain ("naive") arrays, keep them in arrays structured as hash tables (e.g. open addressing with double hashing). That way the arrays in Objects 1 and 2 are already pre-indexed, and all you need to do is iterate over whichever array contains fewer elements and match its elements against the longer, pre-indexed array.

Related

How to join two already sorted arrays into one sorted array in M$ Flow efficiently

Microsoft Flow doesn't support any sort function for arrays or lists.
For my problem I can use the sort function within an OData request to have some data presorted by the database I'm accessing. In my case, I want to have a list of all start and end dates from a SharePoint calendar in a single array.
I can pull all dates sorted by the start date and I can pull all dates sorted by the end date into separate arrays. Now I have two sorted arrays which I want to join into a single array.
There are very few possibilities for iterating over an array, but the task has some properties which could ease the problem.
Two arrays,
both presorted by the same property as the desired final array,
same size.
Perhaps I'm missing some feature of the OData request, or there's a simple workaround. I'd prefer not to use the REST API or mess around with the JSON manually, but if there's really an elegant solution I won't reject it.
I have a solution, but I don't think it is a good one.
Prerequisites are the two already sorted arrays and two additional arrays.
Let's call the two sorted arrays I have extracted from the SharePoint list array A and array B.
And let's call the additional arrays array S1 and S2.
Then I set up a foreach-loop on array B.
Within that loop I filter array A for all elements less than or equal to the current item of array B.
The output of the filter operation is saved to array S1.
The current item of array B is appended to array S1.
Again, filter array A for all elements, but this time for those greater than the current item of array B.
Save the output of the filter operation to array S2.
Make a union of S1 and S2.
Save the output of the union expression to array A.
As every element of array A has to be copied n times for an n-element array, the effort of processing two arrays of n elements each is not quite optimal, especially considering that both arrays are already sorted in advance:
n² comparisons
2n²+n copy operations (not taking into account the imperfections of the Flow implementation)
If I implemented a complete sort from scratch it would probably perform better, but I also think there must be other means of joining two presorted arrays of compatible content.
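For comparison, the classic linear-time merge the question is hinting at looks like this. Flow cannot run code like this directly, so the Java sketch below is only a reference for the algorithm (O(n + m) comparisons and copies instead of n²):

import java.util.Arrays;

public class MergeSorted {

    // Classic two-pointer merge of two presorted arrays: O(n + m).
    static int[] merge(int[] a, int[] b) {
        int[] result = new int[a.length + b.length];
        int i = 0, j = 0, k = 0;
        while (i < a.length && j < b.length) {
            result[k++] = (a[i] <= b[j]) ? a[i++] : b[j++];
        }
        while (i < a.length) result[k++] = a[i++];   // copy whatever is left over
        while (j < b.length) result[k++] = b[j++];
        return result;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(merge(new int[]{1, 4, 9}, new int[]{2, 3, 10})));
        // [1, 2, 3, 4, 9, 10]
    }
}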

Intersection between two collections

I was asked this interview question: what is the best and most efficient way to get the intersection between two collections, one very big and the other small? Java Collections, basically.
With no other information, what you can do is save time when performing the n x m comparisons, as follows:
1. Let big and small be the collections of size n and m.
2. Let intersection be the intersecting collection (empty for now).
3. Let hashes be the collection of hashes of all elements in small.
4. For each object in big, let h be its hash integer.
5. For each hash value hash in the hashes collection, if hash = h, then compare object with the element of small whose hash is hash. If they are equal, add object to intersection.
So, the idea is to compare hashes instead of objects and only compare objects if their hashes coincide.
Note that the additional collection for the hashes of small is acceptable because of the size of this supposedly small collection. Notice also that the algorithm computes n + m hash values and comparatively few object comparisons.
Here is the code in Smalltalk
set := small asSet.
^big select: [:o | set includes: o]
The Smalltalk code is very compact because the message includes: sent to set works as described in step 5 above. It first compares hashes and then objects if needed. The select: is also a very compact way to express the selection in step 5.
UPDATE
If we enumerate all the comparisons between the elements of both collections we would have to consider n x m pairs of objects, which would account for a complexity of order O(nm) (big-O notation). On the other hand, if we put the small collection into a hashed one (as I did in the Smalltalk example) the inner testing that happens every time we have to check if small includes an object of big will have a complexity of O(1). And given that hashing the small collection is O(m), the total complexity of this method would be O(n + m).
Let's call the two collections large and small, respectively.
The Java Collection is an interface - you cannot actually use it directly - you have to use one of Collection's concrete implementations. For this problem, you can use Sets. A Set has only unique elements, and it has a method contains(Object o). Its subinterface, SortedSet, keeps its elements in ascending order by default.
So copy small into a Set. It's now got no duplicate values. Copy large into a second Set, and this way we can use its contains() method. Create a third Set called intersection, to hold the intersection results.
For each element in small, check if large.contains(element_from_small). Every time you find a match, intersection.add(element_from_small).
At the end of the run through small, you'll have the intersection of all objects in both original collections, with no duplicates. If you want it ordered, copy it into a SortedSet and it'll then be in ascending order.
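Here is a minimal sketch of that approach (hypothetical class name; it assumes the elements implement equals/hashCode properly). It builds a HashSet from large once and then probes it for each element of small, so each lookup is O(1) expected:

import java.util.*;

public class IntersectCollections {

    // Copy the large collection into a HashSet once, then test each element of small: O(n + m) expected.
    static <T> Set<T> intersection(Collection<T> large, Collection<T> small) {
        Set<T> largeSet = new HashSet<>(large);
        Set<T> result = new LinkedHashSet<>();
        for (T element : small) {
            if (largeSet.contains(element)) {
                result.add(element);      // duplicates disappear automatically
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Integer> big = Arrays.asList(7, 3, 9, 1, 5, 8, 2);
        List<Integer> small = Arrays.asList(5, 4, 2);
        System.out.println(new TreeSet<>(intersection(big, small))); // [2, 5], in ascending order
    }
}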

How to determine correspondence between two lists of names?

I have:
1 million university student names and
3 million bank customer names
I managed to convert the strings into numerical values based on hashing (similar strings have similar hash values). I would like to know how I can determine the correlation between these two sets to see if values are pairing up at least 60% of the time.
Can I achieve this using ICC? How does ICC 2-way random work?
Please kindly answer ASAP as I need this urgently.
This kind of entity resolution is normally easy, but I am surprised by the hashing approach here. Hashing loses information that is critical to entity resolution. So, if possible, you shouldn't use hashes, but rather the original strings.
Assuming using original strings is an option, then you would want to do something like this:
List A (1M), List B (3M)
// First, match the entities that match very well, and REMOVE them.
for a in List A
    for b in List B
        if compare(a, b) >= MATCH_THRESHOLD   // This may be 90% etc.
            add (a, b) to matchedList
            remove a from List A
            remove b from List B
// Now, match the entities that match well, and run bipartite matching.
// Bipartite matching is required because each entity can match "acceptably well"
// with more than one entity on the other side.
for a in List A
    for b in List B
        compute compare(a, b)
        set edge(a, b) = compare(a, b)
        if compare(a, b) < THRESHOLD          // This seems to be 60%
            set edge(a, b) = 0
// Now, run bipartite matcher and take results.
The time complexity of this algorithm is O(n1 * n2), which is not very good. There are ways to avoid this cost, but they depend upon your specific entity resolution function. For example, if the last name has to match (to make the 60% cut), then you can simply create sublists in A and B that are partitioned by the first couple of characters of the last name, and just run this algorithm between corresponding sublists. But it may very well be that last name "Nuth" is supposed to match "Knuth", etc. So, some local knowledge of what your name comparison function is can help you divide and conquer this problem better.
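A rough sketch of that blocking idea, assuming plain "First Last" name strings and using the first two letters of the last name as the partition key (the class and method names are invented, and compare(a, b) stands in for whatever name-similarity function you end up using):

import java.util.*;

public class NameBlocking {

    // Group names by a blocking key (here: first two letters of the last name, lower-cased).
    static Map<String, List<String>> block(List<String> names) {
        Map<String, List<String>> blocks = new HashMap<>();
        for (String name : names) {
            String last = name.substring(name.lastIndexOf(' ') + 1).toLowerCase();
            String key = last.length() >= 2 ? last.substring(0, 2) : last;
            blocks.computeIfAbsent(key, k -> new ArrayList<>()).add(name);
        }
        return blocks;
    }

    // Only names that share a blocking key are ever compared, so the n1 * n2 cost
    // shrinks to the sum of block-by-block products.
    static List<String[]> candidatePairs(List<String> listA, List<String> listB) {
        Map<String, List<String>> blocksB = block(listB);
        List<String[]> pairs = new ArrayList<>();
        for (Map.Entry<String, List<String>> entry : block(listA).entrySet()) {
            for (String a : entry.getValue()) {
                for (String b : blocksB.getOrDefault(entry.getKey(), Collections.emptyList())) {
                    pairs.add(new String[] {a, b});   // these pairs get fed to compare(a, b)
                }
            }
        }
        return pairs;
    }
}

The caveat above still applies: a key-based split cannot pair "Nuth" with "Knuth", so the blocking key has to be chosen with the comparison function in mind.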

Data structure to represent piecewise continuous range?

Say that I have an integer-indexed array of length 400, and I want to drop out a few elements from the beginning, lots from the end, and something from the middle too, but without actually altering the original array. That is, instead of looping through the array using indices {0...399}, I want to use a piecewise continuous range such as
{3...15} ∪ {18...243} ∪ {250...301} ∪ {305...310}
What is a good data structure to describe this kind of index range? An obvious solution is to make another "index mediator" array, containing mappings from continuous zero-based indexing to the new coordinates above, but it feels quite wasteful, since almost all elements in it would simply be sequential numbers, with just a few occasional "jumps". Besides, what if I find that, oh, I want to modify the range a bit? The whole index array would have to be rebuilt. Not nice.
A few points to note:
The ranges never overlap. If a new range is added to the data structure and it overlaps with existing ranges, the whole thing should get merged. That is, if I add the range {300...308} to the above example, it should replace the last two ranges with {250...310}.
It should be quite cheap to simply loop through the whole range.
It should also be relatively cheap to query a value directly: "Give me the original index corresponding to the 42nd index in the mapped coordinates".
It should be possible (though maybe not quite cheap) to work other way round: "Give me the mapped coordinate corresponding to 42 in the original coordinates, or tell if it's mapped at all."
Before rolling my own solution, I'd like to know if there exists a well-known data structure that solves this class of problems elegantly.
Thanks!
An array or list of integer pairs seems like the best data structure. It's your choice whether the second integer of each pair is an end point or a count from the first integer.
Edit: On further reflection, this problem is exactly what a database index has to do. If the integer pairs don't have to be in numeric order, you can handle splits more easily. If the number sequence has to remain in order, you need a data structure that allows you to add integer pairs to the middle of the array or list.
A split would be having to change the (6, 12) integer pair to (6, 9) (11, 12), when 10 is removed, as an example.
Besides, what if I find that, oh, I want to modify the range a bit? The whole index array would have to be rebuilt. Not nice.
True. Perhaps one integer pair needs to change. Worst case, you'd have to rebuild the entire array or list.
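One way to keep the integer-pair list cheap to merge and query is to key it by range start in a sorted map. A sketch, with made-up class and method names, assuming inclusive ranges:

import java.util.*;

public class RangeSet {

    // Disjoint, sorted inclusive ranges stored as start -> end.
    private final TreeMap<Integer, Integer> ranges = new TreeMap<>();

    // Add [start, end], merging any existing ranges that overlap or touch it.
    void add(int start, int end) {
        Map.Entry<Integer, Integer> prev = ranges.floorEntry(start);
        if (prev != null && prev.getValue() >= start - 1) {
            start = prev.getKey();
            end = Math.max(end, prev.getValue());
        }
        Iterator<Map.Entry<Integer, Integer>> it =
                ranges.tailMap(start, true).entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<Integer, Integer> next = it.next();
            if (next.getKey() > end + 1) break;      // no longer overlapping or adjacent
            end = Math.max(end, next.getValue());
            it.remove();
        }
        ranges.put(start, end);
    }

    // "Give me the mapped coordinate of an original index, or -1 if it is not mapped."
    int toMapped(int original) {
        int mapped = 0;
        for (Map.Entry<Integer, Integer> e : ranges.entrySet()) {
            if (original < e.getKey()) return -1;
            if (original <= e.getValue()) return mapped + (original - e.getKey());
            mapped += e.getValue() - e.getKey() + 1;
        }
        return -1;
    }

    public static void main(String[] args) {
        RangeSet r = new RangeSet();
        r.add(3, 15); r.add(18, 243); r.add(250, 301); r.add(305, 310);
        r.add(300, 308);               // merges the last two ranges into {250...310}
        System.out.println(r.ranges);  // {3=15, 18=243, 250=310}
        System.out.println(r.toMapped(20)); // 15: original index 20 is the 16th mapped element
    }
}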

Find a common element within N arrays

If I have N arrays, what is the best (in time complexity; space is not important) way to find the common elements? You could just find one element and stop.
Edit: The elements are all Numbers.
Edit: These are unsorted. Please do not sort and scan.
This is not a homework problem. Somebody asked me this question a long time ago. He was using a hash to solve the problem and asked me if I had a better way.
Create a hash index with elements as keys and counts as values. Loop through the values of all arrays and update the counts in the index. Afterwards, run through the index and check which elements have count = N. Looking up an element in the index should be O(1), so combined with looping through all M elements this should be O(M).
If you want to keep order specific to a certain input array, loop over that array and test the element counts in the index in that order.
Some special cases:
if you know that the elements are (positive) integers with a maximum value that is not too high, you could just use a normal array as the "hash" index to keep the counts, with the numbers themselves as array indices.
I've assumed that in each array each number occurs only once. Adapting it for more occurrences should be easy (set the i-th bit in the count for the i-th array, or only update if the current element count == i-1).
EDIT: when I answered the question, it did not yet include the part about wanting "a better way" than hashing.
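For completeness, a minimal sketch of the counting approach for int arrays, assuming (as stated above) that each number occurs at most once per array:

import java.util.*;

public class CommonElement {

    // Returns one element present in all arrays, or null if there is none.
    // Assumes each value occurs at most once within any single array.
    static Integer findCommon(int[][] arrays) {
        Map<Integer, Integer> counts = new HashMap<>();
        for (int[] array : arrays) {
            for (int value : array) {
                counts.merge(value, 1, Integer::sum);   // O(1) expected per update
            }
        }
        for (Map.Entry<Integer, Integer> e : counts.entrySet()) {
            if (e.getValue() == arrays.length) {
                return e.getKey();                      // count == N: present in every array
            }
        }
        return null;
    }

    public static void main(String[] args) {
        int[][] arrays = { {4, 7, 2, 9}, {9, 3, 4}, {1, 4, 9, 8} };
        System.out.println(findCommon(arrays)); // 4 or 9 (both are common)
    }
}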
The most direct method is to intersect the first two arrays and then intersect that result with each of the remaining N-2 arrays.
If 'intersection' is not defined in the language in which you're working, or you require a more specific answer (i.e. you need the answer to 'how do you do the intersection'), then modify your question accordingly.
Without sorting there isn't an optimized way to do this based on the information given (sorting positions all elements relative to each other, so you can then iterate over the arrays checking for elements present in all of them at once).
The question asks whether there is a better way than hashing. There is no better way (i.e. better time complexity), since the time to hash each element is typically constant. Empirical performance is also favorable, particularly if the range of values can be mapped one-to-one to an array maintaining counts; the time is then proportional to the number of elements across all the arrays. Sorting will not give better complexity, since it still needs to visit each element at least once, and then there is the log N factor for sorting each array.
Back to hashing, from a performance standpoint, you will get the best empirical performance by not processing each array fully, but processing only a block of elements from each array before proceeding onto the next array. This will take advantage of the CPU cache. It also results in fewer elements being hashed in favorable cases when common elements appear in the same regions of the array (e.g. common elements at the start of all arrays.) Worst case behaviour is no worse than hashing each array in full - merely that all elements are hashed.
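A sketch of that block-wise traversal (the block size and the early exit are illustrative, and it again assumes each value occurs at most once per array):

import java.util.*;

public class BlockwiseCommon {

    // Hash a block of elements from each array before moving to the next array, so recently
    // touched data stays in cache, and stop as soon as some value has been seen in all N arrays.
    static Integer findCommon(int[][] arrays, int blockSize) {
        Map<Integer, Integer> counts = new HashMap<>();
        int longest = 0;
        for (int[] array : arrays) {
            longest = Math.max(longest, array.length);
        }
        for (int offset = 0; offset < longest; offset += blockSize) {
            for (int[] array : arrays) {
                int end = Math.min(offset + blockSize, array.length);
                for (int i = offset; i < end; i++) {
                    if (counts.merge(array[i], 1, Integer::sum) == arrays.length) {
                        return array[i];   // common elements near the start are found early
                    }
                }
            }
        }
        return null;
    }

    public static void main(String[] args) {
        int[][] arrays = { {5, 1, 8}, {5, 2, 9, 7}, {3, 5} };
        System.out.println(findCommon(arrays, 2)); // 5
    }
}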
I don't think the approach suggested by catchmeifyoutry will work.
Let us say you have two arrays
1: {1,1,2,3,4,5}
2: {1,3,6,7}
then the answer should be 1 and 3. But if we use the hashtable approach, 1 will have a count of 3 and we will never find it in this situation.
Also, the problem becomes more complex if the input is something like this:
1: {1,1,1,2,3,4}
2: {1,1,5,6}
Here I think we should give the output 1,1. The suggested approach fails in both cases.
Solution:
Read the first array and put its elements into a hashtable. If we find the same key again, don't increment the counter. Read the second array in the same manner. Now the hashtable holds the common elements, which have a count of 2.
But again, this approach will fail on the second input set I gave earlier.
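For what it's worth, a sketch of that once-per-array idea using sets: it returns the distinct common elements, so it handles the first example, but, as noted, it still cannot report 1,1 for the second one.

import java.util.*;

public class DistinctCommon {

    // Distinct common elements of two int arrays; repeats within one array count only once.
    static Set<Integer> commonDistinct(int[] first, int[] second) {
        Set<Integer> seenInFirst = new HashSet<>();
        for (int value : first) {
            seenInFirst.add(value);          // a set ignores repeated insertions
        }
        Set<Integer> common = new LinkedHashSet<>();
        for (int value : second) {
            if (seenInFirst.contains(value)) {
                common.add(value);
            }
        }
        return common;
    }

    public static void main(String[] args) {
        System.out.println(commonDistinct(new int[]{1, 1, 2, 3, 4, 5}, new int[]{1, 3, 6, 7})); // [1, 3]
        System.out.println(commonDistinct(new int[]{1, 1, 1, 2, 3, 4}, new int[]{1, 1, 5, 6})); // [1], not 1,1
    }
}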
I'd first start with the degenerate case, finding common elements between 2 arrays (more on this later). From there I'll have a collection of common values which I will use as an array itself and compare it against the next array. This check would be performed N-1 times or until the "carry" array of common elements drops to size 0.
One could speed this up, I'd imagine, by divide-and-conquer, splitting the N arrays into the end nodes of a tree. The next level up the tree is N/2 common element arrays, and so forth and so on until you have an array at the top that is either filled or not. In either case, you'd have your answer.
Without sorting and scanning, the best operational speed you'll get for comparing two arrays for common elements is O(N²).
