I am building a data structure for indexing a collection of S documents of total length n, so that it supports the following query: given two words P1 and P2, count all the documents that contain P1 but not P2. I want the answer to be complete (not to miss any results).
I've built a generalized suffix tree and picked every sqrt(n)-th leaf and its ancestors (deleting every node with only one child). For each pair of remaining internal nodes v and u, I pre-calculate the answer to the query.
With this, if the query words correspond to two of the picked nodes v and u, I can get the answer in O(1). But what can I do when a word does not correspond to one of the nodes we picked?
I can do it easily by keeping an O(n^2) data structure with pre-processing and having all the possible answers ready for O(1) retrieval, but the goal is to build this data structure in O(n) space and make the queries as efficient as possible.
It sounds like an inverted index would still be useful to you. It's a map from words to ordered lists of the documents containing them. The documents need a common, total ordering, and it is in this order that they appear in their per-word buckets.
Assuming your n is the total length of the corpus in word occurrences (not the vocabulary size), the index can be constructed in O(n log n) time and linear space.
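For concreteness, here is a minimal Python sketch of such an index, assuming the corpus is given as (doc_id, text) pairs and that splitting on whitespace is an acceptable tokenizer (names are illustrative):

from collections import defaultdict

def build_inverted_index(docs):
    """Build a map: word -> sorted list of ids of documents containing it.

    `docs` is assumed to be an iterable of (doc_id, text) pairs, with
    doc_ids drawn from a totally ordered type (e.g. ints).
    """
    postings = defaultdict(set)
    for doc_id, text in docs:
        for word in text.split():
            postings[word].add(doc_id)
    # Sort each postings list once so later merges can rely on the ordering.
    return {word: sorted(ids) for word, ids in postings.items()}

# Example:
index = build_inverted_index([(0, "foo bar"), (1, "foo baz"), (2, "bar baz")])
# index["foo"] == [0, 1], index["bar"] == [0, 2]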
Given P1 and P2, you make two separate queries to get the documents containing the two terms respectively. Since the two lists share a common ordering, you can do a linear merge-like algorithm and select just those documents containing P1 but not P2:
c1 <- cursor to first element of list of docs containing P1
c2 <- cursor to first element of list of docs containing P2
results <- []  # our return value
while c1 not exhausted
    if c2 exhausted or *c1 < *c2
        results.append(c1++)
    else if *c1 == *c2
        c1++
        c2++
    else  # *c1 > *c2
        c2++
return results
Notice that every pass of the loop advances at least one cursor, so it runs in time linear in the sum of the sizes of the two initial query results. Since only entries from the c1 cursor enter results, we know all results contain P1.
Finally, note we always advance only the "lagging" cursor: this (and the total document ordering) guarantees that if a document appears in both initial queries, there will be a loop iteration where both cursors point to that document. When this iteration occurs, the middle clause necessarily kicks in and the document is skipped by advancing both cursors. Thus documents containing P2 necessarily do not get added to results.
This query is an example of a general class called Boolean queries; it's possible to extend this algorithm to cover almost any Boolean expression. Certain queries break the efficiency of the algorithm (by forcing it to walk over the entire vocabulary space), but basically, as long as you don't negate every term (i.e. don't ask for not P1 and not P2), you're fine. See this for an in-depth treatment.
I am trying to solve this question:
You are given a rooted tree consisting of n nodes. The nodes are numbered 1,2,…,n, and node 1 is the root. Each node has a color. Your task is to determine for each node the number of distinct colors in the subtree of the node.
The brute-force solution is to store a set of colors for each node and then cumulatively merge them in a depth-first search. That would run in O(n^2), which is not very efficient.
How do I solve this (and the same class of problems) efficiently?
For each node:
Recursively traverse the left and right children.
Have each call return a HashSet of colors.
At each node, merge the left child's set and the right child's set, and add the color of the current node.
Record the resulting count for the current node in a HashMap.
Return the set.
Sample C# code:
public Dictionary<TreeNode, int> distinctColorCount = new Dictionary<TreeNode, int>();

public HashSet<Color> GetUniqueColorsTill(TreeNode t) {
    // If null node, return empty set.
    if (t == null) return new HashSet<Color>();
    // First get the set from the left child, then from the right child.
    var lSet = GetUniqueColorsTill(t.Left);
    var rSet = GetUniqueColorsTill(t.Right);
    // Merge the two sets. Be a little clever here: merge the smaller set into the bigger one.
    var returnSet = lSet.Count > rSet.Count ? lSet : rSet;
    var smallerSet = ReferenceEquals(returnSet, lSet) ? rSet : lSet;
    returnSet.UnionWith(smallerSet);
    // Add the color of the current node, then record the distinct count for this node.
    returnSet.Add(t.Color);
    distinctColorCount[t] = returnSet.Count;
    return returnSet;
}
You can figure out the complexity exactly as #user58697 commented on your question, using the Master Theorem. This is another answer of mine, written a long time ago, that explains the Master Theorem if you need a refresher.
First of all, you'd want to flatten the tree into a list. This technique is often called an 'Euler tour'.
Basically, you make an empty list and run a DFS, and each time you visit a node for the first or the last time, you push its color onto the end of the list. This way you get a list of length 2 * n, where n is the number of nodes. It's easy to see that all the colors corresponding to a node's descendants lie between that node's first and last occurrence in the list. So instead of a tree and queries of the form 'how many distinct colors are there in this node's subtree', you have a list and queries of the form 'how many distinct colors are there between index i and index j'. That actually makes things a lot easier.
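A minimal Python sketch of this flattening step, assuming the tree is given as a children adjacency list and a color array (names are illustrative):

def euler_tour(children, color, root=0):
    """Flatten a rooted tree into a list of length 2*n.

    `children[v]` is assumed to be the list of v's children and `color[v]`
    its color. Each node's color is appended once on entry and once on exit,
    and (first, last) index pairs are recorded per node, so the subtree of v
    corresponds exactly to tour[first[v] .. last[v]].
    """
    n = len(children)
    tour, first, last = [], [0] * n, [0] * n
    stack = [(root, False)]               # iterative DFS to avoid recursion limits
    while stack:
        v, leaving = stack.pop()
        if leaving:
            last[v] = len(tour)
            tour.append(color[v])
        else:
            first[v] = len(tour)
            tour.append(color[v])
            stack.append((v, True))
            for c in children[v]:
                stack.append((c, False))
    return tour, first, last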
First idea -- Mo's technique, O(n sqrt(n)):
I will describe it briefly; I strongly recommend reading up on Mo's technique, as it is well explained in many sources.
Sort all your queries (reminder: they look like this: given a pair (i, j), count the distinct numbers in the sub-array from index i to index j) by their start. Make sqrt(n) buckets, and place a query starting at index i into bucket number i / sqrt(n).
We will answer the queries of each bucket separately. Sort all queries in the bucket by their end. Now process the first one (the query whose end is furthest to the left) by brute force (iterate over the subarray, store the numbers in a set/hashset/map/whatever, and take the size of the set).
To process the next one, we add some numbers at the end (the next query ends farther to the right than the previous one!) and, unfortunately, do something about its start: we need to either delete some numbers from the structure (if the next query's start > the old query's start) or add some numbers at the beginning (if the next query's start < the old query's start). However, we may do that by brute force too, since all queries in the bucket start within the same segment of sqrt(n) indices! In total we get O(n sqrt(n)) time complexity; a sketch follows.
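Here is a hedged Python sketch of Mo's technique for these range-distinct queries; note it keeps a frequency map rather than a plain set so deletions from the window are handled correctly (all names are illustrative):

from collections import defaultdict
from math import isqrt

def mo_distinct(arr, queries):
    """Answer 'how many distinct values in arr[l..r]?' offline with Mo's technique.

    `queries` is a list of (l, r) pairs (inclusive). Queries are bucketed by
    l // sqrt(n) and sorted by r within each bucket, then a sliding window
    with a frequency map is moved by brute force: O((n + q) * sqrt(n)) total.
    """
    n = len(arr)
    block = max(1, isqrt(n))
    order = sorted(range(len(queries)),
                   key=lambda i: (queries[i][0] // block, queries[i][1]))
    freq = defaultdict(int)
    distinct = 0
    cur_l, cur_r = 0, -1                  # current window is empty
    answers = [0] * len(queries)

    def add(x):
        nonlocal distinct
        freq[x] += 1
        if freq[x] == 1:
            distinct += 1

    def remove(x):
        nonlocal distinct
        freq[x] -= 1
        if freq[x] == 0:
            distinct -= 1

    for qi in order:
        l, r = queries[qi]
        while cur_r < r:
            cur_r += 1
            add(arr[cur_r])
        while cur_l > l:
            cur_l -= 1
            add(arr[cur_l])
        while cur_r > r:
            remove(arr[cur_r])
            cur_r -= 1
        while cur_l < l:
            remove(arr[cur_l])
            cur_l += 1
        answers[qi] = distinct
    return answers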
Second idea -- check this out, O(n log n): Is it possible to query number of distinct integers in a range in O(lg N)?
I have created a very large directed, weighted graph, and I'm trying to find the widest path between two points.
Each edge has a count property.
Here is a small portion of the graph:
I found this example and modified the query so that the path collection is directed, like so:
MATCH p = (v1:Vertex {name:'ENTRY'})-[:TRAVELED*]->(v2:Vertex {name:'EXIT'})
WITH p, EXTRACT(c IN RELATIONSHIPS(p) | c.count) AS counts
UNWIND(counts) AS b
WITH p, MIN(b) AS count
ORDER BY count DESC
RETURN NODES(p) AS `Widest Path`, count
LIMIT 1
This query seems to require an enormous amount of memory, and fails even on partial data.
Update: for clarification, the query runs until it runs out of memory.
I've found this link, which combines the use of Spark and Neo4j. Unfortunately, Mazerunner for Neo4j does not support a "widest path" algorithm out of the box. What would be the right approach to run the "widest path" query on a very large graph?
The reason your algorithm is taking so long to run is that (a) you have a big graph, (b) your memory parameters probably need tweaking (see comments), and (c) you're enumerating every possible path between ENTRY and EXIT. Depending on how your graph is structured, this could be a huge number of paths.
Note that for the widest path, the width of a path is the smallest count on any of its edges, and you want the path that maximizes that minimum. This means that you're probably computing and re-computing many paths you can ignore.
Wikipedia has good information on this algorithm you should consider. In particular:
It is possible to find maximum-capacity paths and minimax paths with a single source and single destination very efficiently even in models of computation that allow only comparisons of the input graph's edge weights and not arithmetic on them.[12][18] The algorithm maintains a set S of edges that are known to contain the bottleneck edge of the optimal path; initially, S is just the set of all m edges of the graph. At each iteration of the algorithm, it splits S into an ordered sequence of subsets S1, S2, ... of approximately equal size; the number of subsets in this partition is chosen in such a way that all of the split points between subsets can be found by repeated median-finding in time O(m). The algorithm then reweights each edge of the graph by the index of the subset containing the edge, and uses the modified Dijkstra algorithm on the reweighted graph; based on the results of this computation, it can determine in linear time which of the subsets contains the bottleneck edge weight. It then replaces S by the subset Si that it has determined to contain the bottleneck weight, and starts the next iteration with this new set S. The number of subsets into which S can be split increases exponentially with each step, so the number of iterations is proportional to the iterated logarithm function, O(log* n), and the total time is O(m log* n).[18] In a model of computation where each edge weight is a machine integer, the use of repeated bisection in this algorithm can be replaced by a list-splitting technique of Han & Thorup (2002), allowing S to be split into O(√m) smaller sets Si in a single step and leading to a linear overall time bound.
You should consider implementing this approach with Cypher rather than your current "enumerate all paths" approach, as the "enumerate all paths" approach has you re-checking the same edge counts once for every path that involves that particular edge.
There's no ready-made software that will just do this for you; I'd recommend taking that paragraph (and checking its citations for further information) and then implementing it. Performance-wise, I think you can do much better than your current query.
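If you do export the graph (for example via Spark, as you mention), a much simpler alternative to the linear-time algorithm quoted above is the classic modified-Dijkstra widest-path search, which already avoids enumerating every path. A rough Python sketch, assuming an adjacency list adj[u] = [(v, count), ...] built from your TRAVELED relationships (names are illustrative, and this runs in O(m log n) rather than the linear time described above):

import heapq

def widest_path(adj, source, target):
    """Maximum-bottleneck ('widest') path via a Dijkstra-style search.

    `adj[u]` is assumed to be a list of (v, count) edges with positive counts.
    We repeatedly expand the node whose best-known bottleneck is largest;
    a max-heap is simulated by pushing negated widths.
    """
    best = {source: float("inf")}
    prev = {}
    heap = [(-best[source], source)]
    while heap:
        neg_w, u = heapq.heappop(heap)
        width = -neg_w
        if u == target:
            break
        if width < best.get(u, 0):
            continue                       # stale heap entry
        for v, count in adj.get(u, ()):
            w = min(width, count)          # bottleneck along this extension
            if w > best.get(v, 0):
                best[v] = w
                prev[v] = u
                heapq.heappush(heap, (-w, v))
    if target not in best:
        return None, 0
    path, node = [target], target
    while node != source:
        node = prev[node]
        path.append(node)
    return path[::-1], best[target]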
Some thoughts.
Your query (and the original example query) can be simplified. This may or may not be sufficient to prevent your memory issue.
For each matched path, there is no need to: (a) create a collection of counts, (b) UNWIND it into rows, and then (c) perform a MIN aggregation. The same result could be obtained by using the REDUCE function instead:
MATCH p = (v1:Vertex {name:'ENTRY'})-[:TRAVELED*]->(v2:Vertex {name:'EXIT'})
WITH p, REDUCE(m = 2147483647, c IN RELATIONSHIPS(p) | CASE WHEN c.count < m THEN c.count ELSE m END) AS count
ORDER BY count DESC
RETURN NODES(p) AS `Widest Path`, count
LIMIT 1;
(I assume that the count property value is an int. 2147483647 is the max int value.)
You should create an index (or, perhaps more appropriately, a uniqueness constraint) on the name property of the Vertex label. For example:
CREATE INDEX ON :Vertex(name)
EDITED
This enhanced version of the above query might solve your memory problem:
MERGE (t:Temp) SET t.count = 0, t.widest_path = NULL
WITH t
OPTIONAL MATCH p = (v1:Vertex {name:'ENTRY'})-[:TRAVELED*]->(v2:Vertex {name:'EXIT'})
WITH t, p, REDUCE(m = 2147483647, c IN RELATIONSHIPS(p) | CASE WHEN c.count < m THEN c.count ELSE m END) AS count
WHERE count > t.count
SET t.count = count, t.widest_path = NODES(p)
WITH COLLECT(DISTINCT t)[0] AS t
WITH t, t.count AS count, t.widest_path AS `Widest Path`
DELETE t
RETURN `Widest Path`, count;
It creates (and ultimately deletes) a temporary :Temp node to keep track of the currently "winning" count and the corresponding path nodes. (You must make sure that the label Temp is not otherwise used.)
The WITH clause starting with COLLECT(DISTINCT t) uses aggregation of distinct :Temp nodes (of which there is only 1) to ensure that Cypher only keeps a single reference to the :Temp node, no matter how many paths satisfy the WHERE clause. Also, that WITH clause does NOT include p, so that Cypher does not accumulate paths that we do not care about. It is this clause that might be the most important in helping to avoid your memory issues.
I have not tried this out.
I have some letters with frequency counts, and I have a very long list of words (say 1M).
Suppose I have A-1, B-1, D-1 ("at most one A, at most one B, at most one D"); then I can make "BAD", but not "RAD".
Can I find which words can be made out of those letters in logarithmic time, or something like that, instead of iterating through all the words and looking at the counts of each letter in each word?
What data structure can be used for these words? A trie, maybe? I'm not familiar with them. It would also be great if I could store the letters required for each word along with it. Please help!
Here's a (literal) sketch of a data structure.
[root]
----- | -----
A1 A2 B1 ...
----/- ---|--- -\----
B1 C1 [a] B1 B2 C1 C1 C2 D2 ...
It's a tree, where the leaf nodes are the words in the word list. The words at a leaf node are composed exactly of the bag of letters consisting of the path from the root to that node. Non-leaf nodes are labelled with a letter and a count. A child of a node must either be a leaf (a word) or have a letter strictly later in the alphabet. So, to get to "cat", you go down the path A1,C1,T1, and cat (and act) will be a child of T1. At each node, you traverse the children which have count ≤ your input count (so for the bag A3, C1, T2, you would traverse any node labelled A1,A2,A3, C1, T1 or T2).
The traversal takes O(n) time in the worst case (every word matches), but on average takes substantially less. For a small input bag, it will only traverse a few nodes. For a large input bag, it traverses many nodes, but it will also find many words.
The tree contains at most one node per letter in the wordlist, so it will have size at most proportional to the length of the wordlist.
This is a time- and space-efficient structure which can be computed and stored relatively easily -- it won't take much more space than your wordlist, and it answers queries pretty fast.
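A rough Python rendering of this structure, assuming nested dicts keyed by (letter, count) pairs; the names (BagNode, words_from_bag) are illustrative:

from collections import Counter

class BagNode:
    __slots__ = ("children", "words")
    def __init__(self):
        self.children = {}                 # (letter, count) -> BagNode
        self.words = []                    # words whose full letter bag ends here

def build_bag_tree(words):
    """Build the tree sketched above: edges are (letter, count) pairs in
    alphabetical order; the end node of each path holds the matching words."""
    root = BagNode()
    for w in words:
        node = root
        for letter, count in sorted(Counter(w).items()):
            node = node.children.setdefault((letter, count), BagNode())
        node.words.append(w)
    return root

def words_from_bag(root, bag):
    """Return every stored word that can be made from `bag`, a dict of
    letter -> available count. Only children within the budget are visited;
    insertion order already guarantees letters increase along any path."""
    out, stack = [], [root]
    while stack:
        node = stack.pop()
        out.extend(node.words)
        for (letter, count), child in node.children.items():
            if bag.get(letter, 0) >= count:
                stack.append(child)
    return out

# Example:
# words_from_bag(build_bag_tree(["bad", "rad", "cat", "act"]),
#                {"a": 1, "b": 1, "d": 1})   ->  ["bad"]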
If you need words that have all the letters, I've done something like that before (my crossword cheat program, I'm ashamed to say).
I took a dictionary file and preprocessed it so each line had the letters sorted, followed by the word itself, like:
aaadkrrv:aardvark
Then, if you have the letters ardvkraa, sort that, then look for the lines containing that string before the colon. I used grep since O(n) was good enough but you could easily put all the lines into a balanced binary tree to give you O(log n) complexity.
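The same idea in Python, using a dict keyed by the sorted letters instead of grep (the filename dictionary.txt is an assumption):

from collections import defaultdict

# Preprocess once: key each word by its sorted letters ("aardvark" -> "aaadkrrv").
anagram_index = defaultdict(list)
with open("dictionary.txt") as f:          # assumed: one word per line
    for word in (line.strip() for line in f):
        anagram_index["".join(sorted(word))].append(word)

# Lookup: sort the query letters and fetch the bucket, an O(1) average hash lookup.
print(anagram_index["".join(sorted("ardvkraa"))])   # -> ['aardvark'], if the word is in the file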
That won't help much if you're after words that use only some of the letters but it's not clear whether that's what you wanted.
I can't say I can grasp the problem you present 100% from your description, but from what I see, you can do the following:
You index your list of words. For example, 'B1' is one index, containing the entries that have at most one letter B (or otherwise fulfill the requirement of the problem you are solving). You can also have "composite" indices, like 'A1B1', along the same lines. Given the time budget you can afford for indexing, you can create pretty deep hashes. If you are using an alphabet with 26 letters and want to hash 4-letter combinations, that's only 14,950 indices, and with 3 letters it's a meager 2,600. The indices can be built in one pass over the list, so their creation is linear. Once you are past this stage, a large part of your lookups will be logarithmic. In my example, a 4-letter word lookup is a single fetch. Of course, for longer letter combinations you use the index first and then iterate.
I have a table with about 10,000 entries; each entry has almost 100 boolean values. A user checks a bunch of the booleans and hopes to get a result that matches their request. If that record doesn't exist, I want to show them maybe 5 records that are close (having only one or two values different). Is there a good hashing scheme or data structure that can help me find these results?
Bitmap indices. Google for the paper if you want the complete background; it's not easy, but worth a read. Basically, build bitmaps for your boolean values like this:
010110101010
110100010100
000101001100
And then just XOR your filter through them, sort by number of matches, and return. Since all operations are insanely fast (about one cycle per element, and the data structure uses 100 bits of memory per element), this will usually work even though it's linear.
Addendum: How to XOR. (fixed a bug)
000101001100 source
000101001010 target
000000000110 result of XOR
int n = 0; if (v) do { n++; } while (v &= (v - 1)); return n;  /* Kernighan's bit count: each pass clears the lowest set bit */
The two 1's tell you that there are 2 errors and m-2 matches, where m is the number of bits.
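Putting it together, a small Python sketch of the whole approach, assuming each record is packed into a Python int with one bit per boolean column (names are illustrative):

def closest_records(records, query, k=5):
    """Rank records by Hamming distance to `query`.

    Each record and the query are assumed to be packed into Python ints,
    one bit per boolean column. XOR leaves a 1 exactly where the two
    disagree, so counting 1s gives the number of differing values.
    """
    scored = [(bin(rec ^ query).count("1"), rec) for rec in records]
    scored.sort(key=lambda t: t[0])        # fewest mismatches first
    return scored[:k]

# Example with the 12-bit records shown above:
records = [0b010110101010, 0b110100010100, 0b000101001100]
for dist, rec in closest_records(records, 0b000101001010):
    print(dist, format(rec, "012b"))
# 2 000101001100   (the two 1s in the XOR example above)
# 4 010110101010
# 7 110100010100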
What you describe is a nearest neighbor search: based on a record, find the 5 closest records based on an arbitrary distance function (such as the number of different values).
A hashing function intentionally discards any information except "these values are equal", so it's not really the way to go.
Consider using instead a data structure optimized for nearest neighbor searching, such as a kd-tree or vp-tree. If there's a high probability that a record already exists in the list, you could first use a hash table to look for it, and then fall back on the kd-tree if it does not exist.
This builds on the answer from Kdansky.
Create a dynamic array of entries.
Create a cache.
For each lookup:
    check the cache
    return the cached entry if the value exists
    otherwise, for each value in the dynamic array:
        if the hamming distance is less than the threshold, add it to the result list
    cache the result against the lookup value
    return the result
To find the Hamming distance: XOR the two bit vectors and compute the Hamming weight of the result (http://en.wikipedia.org/wiki/Hamming_weight).
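A compact Python sketch of the pseudocode above, again assuming entries are packed into ints (the names are illustrative):

def make_matcher(entries, threshold):
    """Linear scan with a per-query cache, as in the pseudocode above.

    `entries` are ints (one bit per boolean); `threshold` is the maximum
    Hamming distance to accept. Repeated identical lookups hit the cache.
    """
    cache = {}
    def lookup(query):
        if query in cache:
            return cache[query]
        result = [e for e in entries
                  if bin(e ^ query).count("1") <= threshold]   # XOR + Hamming weight
        cache[query] = result
        return result
    return lookup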
I'm looking for an algorithm to solve the following in a reasonable amount of time.
Given a set of sets, find all such sets that are subsets of a given set.
For example, if you have a set of search terms like ["stack overflow", "foo bar", ...], then given a document D, find all search terms whose words all appear in D.
I have found two solutions that are adequate:
Use a list of bit vectors as an index. To query for a given superset, create a bit vector for it, and then iterate over the list performing a bitwise OR for each vector in the list. If the result is equal to the search vector, the search set is a superset of the set represented by the current vector. This algorithm is O(n) where n is the number of sets in the index, and bitwise OR is very fast. Insertion is O(1). Caveat: to support all words in the English language, the bit vectors will need to be several million bits long, and there will need to exist a total order for the words, with no gaps.
Use a prefix tree (trie). Sort the sets before inserting them into the trie. When searching for a given set, sort it first. Iterate over the elements of the search set, activating nodes that match if they are either children of the root node or of a previously activated node. All paths, through activated nodes to a leaf, represent subsets of the search set. The complexity of this algorithm is O(a log a + ab) where a is the size of the search set and b is the number of indexed sets.
What's your solution?
The prefix trie sounds like something I'd try if the sets were sparse compared to the total vocabulary. Don't forget that if the suffix set of two different prefixes is the same, you can share the subgraph representing the suffix set (this can be achieved by hash-consing rather than arbitrary DFA minimization), giving a DAG rather than a tree. Try ordering your words least or most frequent first (I'll bet one or the other is better than some random or alphabetic order).
For a variation on your first strategy, where you represent each set by a very large integer (bit vector), use a sparse ordered set/map of integers (a trie on the sequence of bits which skips runs of consecutive 0s) - http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.37.5452 (implemented in http://www.scala-lang.org/docu/files/api/scala/collection/immutable/IntMap.html).
If your reference set (of sets) is fixed, and you want to find for many of those sets which ones contain others, I'd compute the immediate containment relation (a directed acyclic graph with a path from a->b iff b is contained in a, and without the redundant arcs a->c where a->b and b->c). The branching factor is no more than the number of elements in a set. The vertices reachable from the given set are exactly those that are subsets of it.
First I would construct 2 data structures, S and E.
S is an array of sets (it holds the N indexed sets).
S[0] = set(element1, element2, ...)
S[1] = set(element1, element2, ...)
...
S[N-1] = set(element1, element2, ...)
E is a map (keyed by element hash) of lists. Each list contains the S-indices where that element appears.
// O( S_total_elements ) = O(n) operation
E[element1] = list(S1, S6, ...)
E[element2] = list(S3, S4, S8, ...)
...
Now, 2 new structures, set L and array C.
I store all the elements of D that exist in E in L (an O(n) operation).
C is an array (S-indices) of counters.
// count how many of each subset's elements appear in D
foreach e in L:
    foreach idx in E[e]:
        C[idx] = C[idx] + 1
Finally,
for i in 0 .. N-1:
    if C[i] == S[i].Count()
        // subset S[i] exists in D
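Here is a small Python sketch of this counting approach, with S as a list of sets and illustrative names throughout:

from collections import defaultdict

def find_contained_subsets(S, D):
    """Return the indices i such that S[i] is a subset of D.

    S is a list of sets; E maps each element to the indices of the sets
    containing it; C[i] counts how many of S[i]'s elements were seen in D.
    """
    E = defaultdict(list)
    for i, s in enumerate(S):              # O(total number of elements)
        for element in s:
            E[element].append(i)

    C = [0] * len(S)
    L = set(D) & E.keys()                  # elements of D that occur in some indexed set
    for e in L:
        for i in E[e]:
            C[i] += 1

    return [i for i in range(len(S)) if C[i] == len(S[i])]

# Example:
S = [{"stack", "overflow"}, {"foo", "bar"}]
print(find_contained_subsets(S, {"stack", "overflow", "trace"}))   # -> [0]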
Can you build an index for your documents? i.e. a mapping from each word to those documents containing that word. Once you've built that, lookup should be pretty quick and you can just do set intersection to find the documents matching all words.
Here's Wiki on full text search.
EDIT: Ok, I got that backwards.
You could convert your document to a set (if your language has a set datatype) and do the same with your searches. Then it becomes a simple matter of testing whether one is a subset of the other.
Behind the scenes, this is effectively the same idea: it would probably involve building a hash table for the document, hashing the queries, and checking each word in the query in turn. This would be O(nm) where n is the number of searches and m the average number of words in a search.
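For example, a minimal Python sketch of that subset test (the sample strings are illustrative):

document = set("the quick brown fox jumps over the lazy dog".split())

searches = ["quick fox", "foo bar", "lazy cat"]
matching = [q for q in searches if set(q.split()) <= document]   # <= is subset test
print(matching)   # -> ['quick fox']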