I'm planning to do a relatively large XPath query using msxml. Is there a maximum length for a query that msxml enforces?
Background: From some external input my code will create a number of xpath-queries and I am interested in the result of all those or-ed together:
myObject.SelectNodes(subQuery1 +"|"+ subQuery2 +"|" + subQuery3 + "|" + ...)
I don't even know at compile time how many subqueries there will be, so I can't predict how long the query string will get.
I'd rather avoid calling SelectNodes once per subquery, because I fear performance would be worse (COM marshaling overhead into MSXML, handling several result trees instead of a single one, etc.).
I've not come across any such limit, and I doubt you could reach one before the sheer cost of running the query made it infeasible anyway.
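If a limit ever did turn up, one fallback is to split the union into a few batches instead of one call per subquery. Here is a minimal sketch in Python (rather than the MSXML host language) of just the string building. The function name and the max_len cap are hypothetical; no actual MSXML limit is known from the discussion above.

```python
from typing import Iterable, List


def build_union_queries(subqueries: Iterable[str], max_len: int = 0) -> List[str]:
    """Join XPath subqueries with '|', starting a new batch past max_len.

    max_len is a hypothetical cap chosen by the caller; 0 means "no cap".
    """
    batches: List[str] = []
    current = ""
    for sub in subqueries:
        part = f"({sub})"  # parenthesise so each subquery stays one union operand
        candidate = part if not current else f"{current}|{part}"
        if max_len and current and len(candidate) > max_len:
            batches.append(current)  # close the batch that would overflow
            current = part
        else:
            current = candidate
    if current:
        batches.append(current)
    return batches
```

With no cap this returns a single string to pass to SelectNodes; with a cap, each returned string would need its own call and the node lists would have to be merged by the caller.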
I'm creating an application using Marklogic 8 and the search API. I need to create facets based on MarkLogic defined collections, but instead of the facet count giving a tally of the number of fragments (documents) which contain X number of occurrences of the keyword search performed, I need the facet count to reflect the total number of times the keyword appears in all documents in the collection.
Right now, I'm using search:search() to process the query and return a element with the facet option enabled.
In the MarkLogic documentation, I've been looking at cts:frequency() which says:
"If you want the total frequency instead of the fragment-based frequency (that is, the total number of occurences of the value in the items specified in the cts:query option of the lexicon API), you must specify the item-frequency option to the lexicon API value input to cts:frequency."
But, I can't get that to work.
I've tried running a query like this in query console, but it times out.
cts:element-values(QName("http://www.tei-c.org/ns/1.0", "TEI"),
"", "item-frequency",
cts:and-query((
fn:collection("KirchlicheDogmatik/volume4/part3"),
cts:word-query("lehre"))))
The issue is probably that you have a range index on <TEI>, which contains the entire document. Range indexes are memory-mapped, so you have essentially forced the complete text contents of your database into memory. It's hard to say exactly what's going on, but it's probably struggling to inspect the values (range indexes are designed for smaller atomic values) and possibly swapping to disk.
MarkLogic has great documentation on its indexing, so I'd recommend starting there for a better understanding on how to use them: https://docs.marklogic.com/guide/concepts/indexing#id_51573
Note that even using the item-frequency option, results (or counts) are not guaranteed to be one-to-one with the "total number of times the keyword appears." It will report the number of "items" matching - in your example it would report on the number of <TEI> elements matching.
The problem of getting an exact count of terms matching a query across the whole database is actually quite hard. To get exact matching values within a document, you would need to use cts:highlight or cts:walk, which requires loading the whole document into memory. That typically works fine for a subset of documents, but ultimately to get an accurate value for the entire database, you would need to load the entire database into memory and process every document.
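To make the distinction concrete, here is a small Python sketch (not MarkLogic code) that contrasts a per-document count with a total occurrence count. The docs mapping and the function name are made up for illustration. The point is that the total requires reading the full text of every document.

```python
import re
from typing import Dict, Tuple


def count_term(docs: Dict[str, str], term: str) -> Tuple[int, int]:
    """Return (documents containing term, total occurrences of term)."""
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    docs_matching = 0
    total_occurrences = 0
    for text in docs.values():
        hits = len(pattern.findall(text))  # needs the whole document text
        if hits:
            docs_matching += 1
        total_occurrences += hits
    return docs_matching, total_occurrences
```

The first number is what a fragment-based facet count gives you; the second is what the question asks for, and it cannot be derived from the first.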
Nearly any approach to getting a term match count requires some kind of approximation and depends heavily on your markup. For example, if you index <p> (or even better <s>) elements, it would be possible to construct a query that uses indexes to count the number of matching paragraphs (or sentences), but that would still load an incredibly large amount of data into memory and keep it there. This is technically feasible if you are willing to allocate enough memory (and/or enough servers), but it hardly seems worth it.
I have a query like this as a key component of my application:
MATCH (group:GroupType)
WHERE group.Name = "String"
MATCH (node:NodeType)
WHERE (node)-[:MEMBER_OF]->(group)
RETURN node
There is an index on :GroupType(Name)
In a database of roughly 10,000 elements this query uses nearly 1 million database hits. Here is the PROFILE of the query:
However, this slight variation of the query which performs an identical search is MUCH faster:
MATCH (group:GroupType)
WHERE group.Name = "String"
MATCH (node:NodeType)-[:MEMBER_OF]->(group)
RETURN node
The only difference is the node:NodeType match and the relationship match are merged into a single MATCH instead of a MATCH ... WHERE. This query uses 1/70th of the database hits of the previous query and is more than 10 times faster, despite performing an identical search:
I thought Cypher treated MATCH ... WHERE statements as single search expressions, so the two queries should compile to identical operations, but these two queries seem to be performing vastly different operations. Why is this?
I would like to start by saying that this is not actually a Cypher problem. Cypher describes what you want, not how to get it, so the performance of this query can vary vastly between, say, Neo4J 3.1.1 and Neo4J 3.2.3.
As the one executing the Cypher is the one that decides how to do this, the real question is "Why doesn't the Neo4J Cypher planner treat these the same?"
Ideally, both of these Cyphers should be equivalent to
MATCH (node:NodeType)-[:MEMBER_OF]->(group:GroupType{Name:"String"})
RETURN node
because they should all produce the same results.
In reality, there are a lot of subtle nuances with dynamically parsing a query that has very many 'equivalent' expressions. But a subtle shift in context can change that equivalence, say if you did this adjustment
MATCH (group:GroupType)
WHERE group.Name = "String"
MATCH (node:NodeType)
WHERE (node)-[:MEMBER_OF]->(group) OR SIZE(group.members) = 1
RETURN node
Now the two queries are almost nothing alike in their results. In order to scale, the query planner must make decision shortcuts to come up with an efficient plan as quickly as possible.
In short, the performance depends on what the server you are throwing it at is running, because coming up with an actionable lookup strategy for a language that lets you ask for ANYTHING/EVERYTHING is hard!
RELATED READING
Optimizing performance
What is Cypher?
MATCH ... WHERE <pattern> isn't the same as MATCH <pattern>.
The first query performs the match, then uses the pattern as a filter applied to every row built up so far.
You can see in the query plan that what's happening is a cartesian product between your first match results and all :NodeType nodes. Then, for each row of the cartesian product, the WHERE checks whether the :GroupType node on that row is connected to the :NodeType node on that row by the given pattern (this is the Expand(Into) operation).
The second query, by contrast, expands the pattern from the previously matched group nodes, so the nodes considered from the expansion are far fewer and almost immediately relevant, only requiring a final filter to ensure that those nodes are :NodeType nodes.
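As a rough illustration of why the two plans differ so much in work done, here is a toy Python model that counts the checks each strategy performs. This is not how Neo4j is implemented; the data structures and function names are assumptions made only for the comparison.

```python
from typing import Dict, List, Set, Tuple


def cartesian_then_filter(
    groups: List[str], nodes: List[str], edges: Set[Tuple[str, str]]
) -> Tuple[List[str], int]:
    """Pair every group with every node, then test each pair for an edge."""
    checks = 0
    result = []
    for group in groups:
        for node in nodes:
            checks += 1  # one relationship test per row of the product
            if (node, group) in edges:
                result.append(node)
    return result, checks


def expand_from_groups(
    groups: List[str], members: Dict[str, List[str]], node_type: Set[str]
) -> Tuple[List[str], int]:
    """Follow MEMBER_OF edges from each group, then filter on the label."""
    checks = 0
    result = []
    for group in groups:
        for node in members.get(group, []):
            checks += 1  # one label test per node actually reached
            if node in node_type:
                result.append(node)
    return result, checks
```

With one matching group, 1,000 :NodeType nodes and three members, the first function does 1,000 checks and the second does three, which is the same shape of gap the PROFILE output shows.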
EDIT
As Tezra points out, Cypher operates by having you define what you want, not how to get it, as the "how" is the planner's job. In the current versions of Neo4j (3.2.3), my explanation stands, in that the planner interprets each of the queries differently and generates different plans for each, but that may be subject to change as Cypher evolves and the planner improves.
In these cases, you should be running PROFILEs on your queries and tuning accordingly.
Anyone who's read Parse documentation has stumbled upon this
Caveat: Count queries are rate limited to a maximum of 160 requests per minute. They can also return inaccurate results for classes with more than 1,000 objects. Thus, it is preferable to architect your application to avoid this sort of count operation (by using counters, for example.)
Why is there such a limitation and inaccuracy?
To quote the Parse Engineering Blog Post: Building Scalable Apps on Parse
Suppose you are building a product catalog. You might want to display
the count of products in each category on the top-level navigation
screen. If you run a count query for each of these UI elements, they
will not run efficiently on large data sets because MongoDB does not
use counting B-trees. Instead, we recommend that you use a separate
Parse Object to keep track of counts for each category. Whenever a
product gets added or deleted, you can increment or decrement the
counts in an afterSave or afterDelete Cloud Code handler.
To add on to this, here is another quote by Hector Ramos from the Parse Developers Google Group
Count queries have always been expensive once you throw some
constraints in. If you only care about the total size of the
collection, you can run a count query without any constraints and that
one should be pretty fast, as getting the total number of records is a
different problem than counting how many of these match an arbitrary
list of constraints. This is just the reality of working with database
systems.
The inaccuracy is not due to the 1000 request object limit. The count query will try to get the total number of records regardless of size, but since the operation may take a large amount of time to complete, it is possible that the database has changed during that window and the count value that is returned may no longer be valid.
The recommended way to handle counts is to essentially maintain your own index using before/after save hooks. However, this is also a non-ideal solution because save hooks can arbitrarily fail part way through and (worse) postSave hooks have no error propagation.
The limitation is simply to stop people using counts too much; in effect they're just as costly at runtime as full queries.
The inaccuracy is because queries are limited to 1000 result objects (100 by default) and counts have the same hard limit.
You can run a recursive query to build up a count, but it's a crappy option. Hence the only really good option at this point in time (and as far as we can see in the future) is to keep an index of the things you're interested in counting and update the counts when anything changes. You would usually do that with save hooks in cloud code.
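The counter approach can be sketched as follows. This is plain in-memory Python, not Parse Cloud Code: the class and method names are hypothetical stand-ins for a Parse class plus its afterSave/afterDelete handlers, and it ignores the failure modes of save hooks mentioned above.

```python
from collections import Counter
from typing import Dict, Optional


class ProductCatalog:
    """Toy store that keeps per-category counts up to date on every write."""

    def __init__(self) -> None:
        self._category_of: Dict[str, str] = {}
        self.category_counts: Counter = Counter()

    def save(self, product_id: str, category: str) -> None:
        previous = self._category_of.get(product_id)
        self._category_of[product_id] = category
        self._after_save(previous, category)

    def delete(self, product_id: str) -> None:
        category = self._category_of.pop(product_id, None)
        if category is not None:
            self._after_delete(category)

    def _after_save(self, previous: Optional[str], category: str) -> None:
        # Stand-in for an afterSave handler: adjust counts only on change.
        if previous == category:
            return
        if previous is not None:
            self.category_counts[previous] -= 1
        self.category_counts[category] += 1

    def _after_delete(self, category: str) -> None:
        # Stand-in for an afterDelete handler.
        self.category_counts[category] -= 1
```

Reading a count is then a single lookup in category_counts instead of a count query over the whole class.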
While thinking about the design of various applications I might like to build some day, in several cases I have had a need to fan out a stream of incoming events based on whether or not they match a large selection of full text search queries provided by users.
A simple example of this problem is the implementation of a tool like Twitter streaming search: given many thousands of new tweets every second, efficiently select only the streaming subscribers whose search query is likely to match an incoming tweet.
A statement of the problem would be something like, "inverse full text search", where the full text is the query, and the search results are the search queries that would match that text.
For single term queries an implementation is obvious: simply tokenize the incoming document, then search a map of term->(list of subscribers), but things become more difficult when boolean queries are possible. In fact the problem is more general than full text search, but it is most easily understood in that context. There are many other examples where a large set of boolean terms needs to be combined in some way to reduce the cost of evaluating them.
For example, imagine 3 search subscriptions:
Google AND Glass
Google AND Analytics
((Glass AND Google) NOT Knol) OR Twitter
One possibility is to parse the query into a tree, then visit each node, extracting the term and using the "map of term" approach. However, this would require re-evaluating the subscriber's query against the incoming document for each term. With enough subscribers, this is going to get slow very quickly.
Instead I am wondering if there is a well documented approach to rewrite the queries, perhaps into a single query, where the result can be evaluated once, and tree nodes are annotated with a list of subscriber queries known to either exactly or almost certainly match any document at that point in the tree.
For example, the above queries might be rewritten so that a map of term->(query tree) exists, such as:
Google -> (Analytics[2]
Glass[1,3])
Twitter -> ([3])
Is there any existing publicly documented system that does something like this? Ideally the solution would allow incrementally adding and removing subscribers, without some expensive step to rewrite the entire structure.
One way to do this is with a simple dictionary that maps terms to queries. So given these four queries:
Query1: Google AND Glass
Query2: Google AND Analytics
Query3: ((Glass AND Google) NOT Knol) OR Twitter
Query4: Quick AND red AND fox
You build a dictionary, keyed by the term:
Google: Query1, Query2, Query3
Glass: Query1, Query3
Analytics: Query2
Knol: Query3
Twitter: Query3
Quick: Query4
red: Query4
fox: Query4
Now, consider a sentence like "The red glass on the knol is from Google."
Parse each word and look it up in the dictionary. For each word in the dictionary, add its list of queries to your master list of queries. Also, for every word that is found in the dictionary, add it to a hash table of relevant words. At the end of this step you'll have two structures: the list of queries to check, and the list of relevant words:
Queries list: Query1, Query2, Query3, Query4
Relevant words: Google, Glass, Knol, red
Now it's a matter of processing each query, checking to see if the words are in the relevant words list.
For Query1, for example, you'd check to see if the relevant words list contains Google and Glass.
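Here is a minimal Python sketch of the two passes described above. The tuple-based query representation, the function names, and the reading of "a NOT b" as "a AND NOT b" are my assumptions; a real system would parse the query strings.

```python
import re
from collections import defaultdict
from typing import Dict, List, Set

QUERIES = {
    "Query1": ("AND", "google", "glass"),
    "Query2": ("AND", "google", "analytics"),
    "Query3": ("OR", ("NOT", ("AND", "glass", "google"), "knol"), "twitter"),
    "Query4": ("AND", "quick", "red", "fox"),
}


def terms_of(query) -> Set[str]:
    """Collect every term mentioned anywhere in a query tree."""
    if isinstance(query, str):
        return {query}
    _, *operands = query
    return set().union(*(terms_of(op) for op in operands))


def evaluate(query, present: Set[str]) -> bool:
    """Evaluate a query tree against the set of relevant words."""
    if isinstance(query, str):
        return query in present
    op, *operands = query
    if op == "AND":
        return all(evaluate(o, present) for o in operands)
    if op == "OR":
        return any(evaluate(o, present) for o in operands)
    if op == "NOT":  # binary form: left AND NOT right
        left, right = operands
        return evaluate(left, present) and not evaluate(right, present)
    raise ValueError(f"unknown operator {op!r}")


def build_index(queries: Dict[str, tuple]) -> Dict[str, Set[str]]:
    """Dictionary keyed by term, listing the queries that mention it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for name, query in queries.items():
        for term in terms_of(query):
            index[term].add(name)
    return index


def matching_queries(text: str, queries, index) -> List[str]:
    candidates: Set[str] = set()
    relevant: Set[str] = set()
    for word in re.findall(r"\w+", text.lower()):  # pass 1: dictionary lookups
        if word in index:
            relevant.add(word)
            candidates |= index[word]
    # pass 2: evaluate only the candidate queries
    return sorted(n for n in candidates if evaluate(queries[n], relevant))
```

For the example sentence, all four queries become candidates but only Query1 survives the second pass, because Knol knocks out the first branch of Query3. Note that a query consisting only of a negation would never become a candidate under this scheme and would need separate handling.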
The complexity of this isn't too bad. You have an O(1) lookup for each parsed word in the text. For each query identified during the parsing phase, you have some number, N, O(1) lookups against the relevant words hash table. There's some very small amount of logic involved in doing the Boolean evaluation, but most queries will be simple "all words" or "any word" type queries (i.e. "this AND that", or "this OR that").
The nice thing about this model is that it's pretty easy to farm out to multiple processors. You can parse the words in a single thread, pushing them to a concurrent queue. Multiple threads service that queue, doing the lookups and building their own lists of queries that need to be checked. When all those lookups are done, you merge the queries lists from the multiple threads and again put them on a concurrent queue that multiple threads can service.
Say you have a million queries, averaging five words each (which would likely be a big average). Absolute worst case here is that some text comes in that contains at least one word from each query. So you have a list of a million queries to check in pass 2. At worst, that's 5 million dictionary lookups.
The first pass of this algorithm is O(n), where n is the number of words in the incoming text. That will create a list of k queries. The second pass is O(km), where m is the average number of words per query.
The beauty of this approach is its simplicity, and it will perform well for moderately large numbers of queries, depending on the size of the text you're feeding it. There is a potentially faster way, but it's much more involved.
Rather than building a dictionary that maps terms to queries, you use a modified Aho-Corasick string search algorithm, very similar to what the Unix fgrep program uses to match multiple fixed strings in a single pass over the text. The details of that are way beyond my ability to explain in a short note here. You might want to track down an old Dr. Dobb's Journal article called something like "Parallel Pattern Matching and fgrep", which as I recall had a reasonably good explanation of how this is done. (A quick search didn't find the article text, but you might have better luck.) You'll also want to read the original Aho-Corasick paper: Efficient String Matching: An Aid to Bibliographic Search. That discusses parallel matching of literal strings, but the basic idea works for matching regular expressions or Boolean search queries.
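For reference, here is a compact Python sketch of the plain (unmodified) Aho-Corasick automaton for literal terms only. It finds which terms occur in a text in one pass; extending it to Boolean queries is the "modified" part and is not shown. Function names are my own.

```python
from collections import deque
from typing import Dict, Iterable, List, Set, Tuple

Automaton = Tuple[List[Dict[str, int]], List[int], List[Set[str]]]


def build_automaton(patterns: Iterable[str]) -> Automaton:
    goto: List[Dict[str, int]] = [{}]   # trie transitions per state
    fail: List[int] = [0]               # failure link per state
    output: List[Set[str]] = [set()]    # patterns ending at each state
    for pattern in patterns:
        state = 0
        for ch in pattern:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                output.append(set())
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        output[state].add(pattern)
    # Breadth-first pass to fill in failure links (depth-1 states keep 0).
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            output[nxt] |= output[fail[nxt]]  # inherit shorter suffix matches
    return goto, fail, output


def find_terms(text: str, automaton: Automaton) -> Set[str]:
    """Return the set of patterns that occur anywhere in text."""
    goto, fail, output = automaton
    state = 0
    found: Set[str] = set()
    for ch in text:
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        found |= output[state]
    return found
```

This version matches substrings, so "glass" would also be found inside "glasses"; whole-word matching needs an extra boundary check.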
If you can parse your query into boolean expressions, what you have is a set of rules, with the input variables the presence or absence of terms in the search text. For each search text you could use parsing + table lookup or Aho-Corasick to work out which terms are present and then use an implementation of the Rete algorithm such as http://en.wikipedia.org/wiki/Drools to work out which rules to fire given that input.
(Alternately, you could batch up your input texts, build a small text search database from them, and then run your queries. My guess is that this stops being stupidly inefficient when you can afford to wait long enough between query runs for the text search database size to be comparable with the size of the combined queries).
Let's say I have two fairly large data sets - the first is called "Base" and it contains 200 million tab delimited rows and the second is called "MatchSet" which has 10 million tab delimited rows of similar data.
Let's say I then also have an arbitrary function called Match(row1, row2) and Match() essentially contains some heuristics for looking at row1 (from MatchSet) and comparing it to row2 (from Base) and determining if they are similar in some way.
Let's say the rules implemented in Match() are custom and complex rules, aka not a simple string match, involving some proprietary methods. Let's say for now Match(row1,row2) is written in pseudo-code so implementation in another language is not a problem (though it's in C++ today).
In a linear model, aka program running on one giant processor - we would read each line from MatchSet and each line from Base and compare one to the other using Match() and write out our match stats. For example we might capture: X records from MatchSet are strong matches, Y records from MatchSet are weak matches, Z records from MatchSet do not match. We would also write the strong/weak/non values to separate files for inspection. Aka, a nested loop of sorts:
for each row1 in MatchSet
{
    for each row2 in Base
    {
        var type = Match(row1, row2);
        switch (type)
        {
            // do something based on type
        }
    }
}
I've started considering Hadoop streaming as a method for running these comparisons as a batch job in a short amount of time. However, I'm having a bit of a hard time getting my head around the map-reduce paradigm for this type of problem.
I understand pretty clearly at this point how to take a single input from hadoop, crunch the data using a mapping function and then emit the results to reduce. However, the "nested-loop" approach of comparing two sets of records is messing with me a bit.
The closest I'm coming to a solution is that I would basically still have to do a 10 million record compare in parallel across the 200 million records, so 200 million/n nodes * 10 million iterations per node. Is that the most efficient way to do this?
From your description, it seems to me that your problem can be arbitrarily complex and could be a victim of the curse of dimensionality.
Imagine for example that your rows represent n-dimensional vectors, and that your matching function is "strong", "weak" or "no match" based on the Euclidean distance between a Base vector and a MatchSet vector. There are great techniques to solve these problems with a trade-off between speed, memory and the quality of the approximate answers. Critically, these techniques typically come with known bounds on time and space, and the probability to find a point within some distance around a given MatchSet prototype, all depending on some parameters of the algorithm.
Rather than for me to ramble about it here, please consider reading the following:
Locality Sensitive Hashing
The first few hits on Google Scholar when you search for "locality sensitive hashing map reduce". In particular, I remember reading [Das, Abhinandan S., et al. "Google news personalization: scalable online collaborative filtering." Proceedings of the 16th international conference on World Wide Web. ACM, 2007] with interest.
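To give a feel for the idea, here is a minimal Python sketch of one common LSH family for Euclidean distance, the random-projection scheme h(v) = floor((a.v + b) / w). Choosing this family, the parameter values, and all names are my assumptions; the references above cover the real trade-offs.

```python
import math
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]
Hash = Tuple[List[float], float]


def make_hashes(dim: int, count: int, width: float, seed: int = 0) -> List[Hash]:
    """Draw `count` random projections (a, b) with a ~ N(0, 1), b ~ U[0, width)."""
    rng = random.Random(seed)
    return [
        ([rng.gauss(0.0, 1.0) for _ in range(dim)], rng.uniform(0.0, width))
        for _ in range(count)
    ]


def bucket_key(vector: Vector, hashes: List[Hash], width: float) -> Tuple[int, ...]:
    """Concatenate the individual hash values into one bucket key."""
    return tuple(
        math.floor((sum(a_i * v_i for a_i, v_i in zip(a, vector)) + b) / width)
        for a, b in hashes
    )


def build_buckets(
    base: Sequence[Vector], hashes: List[Hash], width: float
) -> Dict[Tuple[int, ...], List[Vector]]:
    """Group Base vectors by bucket so a lookup only scans one bucket."""
    buckets: Dict[Tuple[int, ...], List[Vector]] = defaultdict(list)
    for row in base:
        buckets[bucket_key(row, hashes, width)].append(row)
    return buckets
```

A MatchSet vector is then hashed with the same hashes and compared only against the Base vectors in its bucket. Nearby vectors usually share a bucket and distant ones rarely do, which is where the approximate nature of the answers comes from.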
Now, on the other hand, if you can devise a scheme that is directly amenable to some form of hashing, then you can easily produce a key for each record with such a hash (or even a small number of possible hash keys, one of which would match the query "Base" data), and the problem becomes a simple large(-ish) scale join. (I say "largish" because joining 200M rows with 10M rows is quite small if the problem is indeed a join.) As an example, consider the way CDDB computes the 32-bit ID for any music CD (the CDDB1 calculation). Sometimes, a given title may yield slightly different IDs (e.g. different CDs of the same title, or even the same CD read several times). But by and large there is a small set of distinct IDs for that title. At the cost of a small replication of the MatchSet, in that case you can get very fast search results.
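The hash-keyed join could look roughly like this in Python. The three callables are placeholders for your own logic: base_key gives one key per Base row, candidate_keys gives the small set of keys a MatchSet row might fall under, and match_fn stands in for Match().

```python
from collections import defaultdict
from typing import Callable, Dict, Hashable, Iterable, List, Sequence, Tuple


def blocked_match(
    match_set: Sequence[str],
    base: Sequence[str],
    base_key: Callable[[str], Hashable],
    candidate_keys: Callable[[str], Iterable[Hashable]],
    match_fn: Callable[[str, str], str],
) -> List[Tuple[str, str, str]]:
    """Run match_fn only on pairs of rows that share a hash key."""
    buckets: Dict[Hashable, List[str]] = defaultdict(list)
    for row2 in base:
        buckets[base_key(row2)].append(row2)
    results = []
    for row1 in match_set:
        # dict.fromkeys drops duplicate keys while keeping their order
        for key in dict.fromkeys(candidate_keys(row1)):
            for row2 in buckets.get(key, []):
                results.append((row1, row2, match_fn(row1, row2)))
    return results
```

Rows whose keys never collide are never compared, which is the whole saving; it only works if a true match is guaranteed (or very likely) to share at least one key.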
Check Section 3.5 - Relational Joins in the paper 'Data-Intensive Text Processing with MapReduce'. I haven't gone through it in detail, but it might help you.
This is an old question, but your proposed solution is correct assuming that your single stream job does 200M * 10M Match() computations. By doing N batches of (200M / N) * 10M computations, you've achieved a factor of N speedup. By doing the computations in the map phase and then thresholding and steering the results to Strong/Weak/No Match reducers, you can gather the results for output to separate files.
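The partitioning described above can be simulated in a few lines of Python. This is a single-process sketch, not Hadoop streaming code: the function names are hypothetical, match_fn stands in for Match(), and the "reduce" step is just grouping by match type.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterator, List, Sequence, Tuple

Pair = Tuple[str, str]


def split(rows: Sequence[str], n: int) -> List[Sequence[str]]:
    """Deal Base rows into n partitions, one per mapper."""
    return [rows[i::n] for i in range(n)]


def map_partition(
    base_part: Sequence[str],
    match_set: Sequence[str],
    match_fn: Callable[[str, str], str],
) -> Iterator[Tuple[str, Pair]]:
    """Mapper: compare one Base partition against the whole MatchSet."""
    for row2 in base_part:
        for row1 in match_set:
            yield match_fn(row1, row2), (row1, row2)


def run_job(
    base: Sequence[str],
    match_set: Sequence[str],
    match_fn: Callable[[str, str], str],
    n_partitions: int,
) -> Dict[str, List[Pair]]:
    """Run every mapper, then steer results to one list per match type."""
    by_type: Dict[str, List[Pair]] = defaultdict(list)
    for part in split(base, n_partitions):
        for match_type, pair in map_partition(part, match_set, match_fn):
            by_type[match_type].append(pair)  # reducer keyed on match type
    return {k: sorted(v) for k, v in by_type.items()}
```

The total number of match_fn calls is the same for any n_partitions; the speedup comes only from running the partitions on different nodes at once, with the MatchSet shipped to each of them.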
If additional optimizations could be utilized, they'd likely apply to both the single stream and parallel versions. Examples include blocking, so that you need to do fewer than 200M * 10M computations, or precomputing constant portions of the algorithm for the 10M match set.