Speed up search - performance

I have a collection with around 2.5 million records. Currently I'm using Node.js with Mongoose. I'm open to any suggestions that would improve the efficiency of the current search.
The Artist collection schema looks something like this:
var Artist = new mongoose.Schema({
  realname: String,
  names: Array
});
The user passes a string, for example: "Michael Jackson was contacted by police" or "Michael Jackson-Jillie bean". Now I have to find the artist/singer/person, and the only logical approach I could come up with was: iterate through all documents in the collection, check whether any of the names appears in the given string, and if yes we have a match and stop the loop.
But this is very memory-inefficient, since the whole collection has to be loaded into memory so Node.js can iterate through and check all the records.
It seems that retrieving the whole collection takes most of the time. Is there any way to speed this up? In the mongo shell db.collection.find() is pretty fast, but through Mongoose in Node.js it takes way too long.
Why is .find() so slow when used with Mongoose?
Are there other databases designed for this purpose?
How would you solve this problem more efficiently?
Could map-reduce be of any use here?

What you need is full-text search. Possible options:
Use a MongoDB text index (new in 2.4); a rough sketch follows below.
Use an additional full-text search engine. In my opinion, Elasticsearch is very performant and easy for beginners to understand.
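For the MongoDB text index option, a rough sketch with Mongoose could look like the following. It assumes MongoDB 2.6+ (where the $text query operator is available; 2.4 exposes text search through the text command instead) and a model named Artist, so adjust to your setup:

var mongoose = require('mongoose');

var artistSchema = new mongoose.Schema({
  realname: String,
  names: [String]
});

// One text index over both fields lets MongoDB tokenize the user's input
// and match it against indexed terms instead of scanning every document.
artistSchema.index({ realname: 'text', names: 'text' });

var Artist = mongoose.model('Artist', artistSchema);

// Search with the raw user string, sort by relevance, and take the best match.
Artist.find(
  { $text: { $search: 'Michael Jackson was contacted by police' } },
  { score: { $meta: 'textScore' } }
)
  .sort({ score: { $meta: 'textScore' } })
  .limit(1)
  .exec(function (err, docs) {
    if (err) return console.error(err);
    console.log(docs[0]);
  });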

Related

How can I improve Marklogic 7 performance on the following: /*[fn:name()="something"]

I have a basic query:
/*[fn:name()="something"]
(1) MarkLogic 7 is taking multiple seconds; is there an index I can add to make this query faster?
(2) Which in-memory limits should be increased to improve performance?
(3) Are there other ways to improve the performance with a different query but get exactly the same result?
Try using fn:node-name instead. I believe that's optimized. You'll need to handle namespaces properly, and that's part of why it can be optimized while fn:name can't.
/*[fn:node-name()=fn:QName("","something")]
The following two XPaths should be exactly the same:
/theNameOfMyElement
and
/*[fn:name()="theNameOfMyElement"]
The latter is adding an unnecessary and costly qualifier. First off, the * has to search for everything, not just elements. Several other problems exist with that approach.
If my first query is still taking a long time, use cts:search, which is much faster as it searches against indexes. The queries above can be written like this:
cts:search(/theNameOfMyElement, ())
Where the second parameter (empty sequence) can be a qualifying cts:query.
If namespaces are giving you fits, you can just do:
/*:theNameOfMyElement
/*[fn:name()="something"] seems like very bad practice to me. Use /something instead.
EDIT
After seeing that the other answer got accepted, I've been trying to think of what scenario you must be trying to solve if his solution worked and mine didn't. I'm still very certain that there is a faster way by just using xPath the way it was designed to work.
After some thought, I've decided your "real" scenario must either involve a dynamic element name, or involve checking whether the element name matches one of a sequence of names.
I have drawn up a sample, with its output provided below, that demonstrates how you could handle both cases without resorting to a qualifier based on fn:node-name.
let $xml as element(sample) := <sample>
  <wrapper>
    <product>
      <entry>
        <red>$1.00</red>
        <yellow>$3.00</yellow>
        <blue>$4.50</blue>
      </entry>
    </product>
  </wrapper>
</sample>
let $type as xs:string := "product"
return $xml/wrapper/xdmp:unpath($type)/entry/(red|yellow)
(: Returns
  <red>$1.00</red>
  <yellow>$3.00</yellow>
:)
In addition to the other good suggestions, consider applying pagination. MarkLogic can identify interesting content from indexes fast, but pulling up actual content from disk is relatively slow. And sending over all results at once could mean trying to hold all results (potentially billions) in memory before sending a reply over the wire. Pagination allows pulling up results in batches, which keeps memory usage low, and potentially allows parallelization as well.
HTH!

Is it possible to get items from DynamoDB where the primary key ends with a given string?

Is it possible, using the AWS Ruby SDK (or just DynamoDB in general), to get an item or items from a table that uses a primary key only, and where that primary key ends with a certain string?
I haven't come across anything in the docs that explicitly answers this question, either in the ruby ddb docs or the general docs for ddb. I'm not saying the question is not answered, but if it is, I can't find it.
If it is possible, could someone provide an example for ruby or link to the docs where an example exists?
Although #Ryan is correct and this can be done with query, just bear in mind that you're doing a "full-table-scan" here. That might be OK for a one-time job but probably not the best practice for a routine task (and of course not as a part of your API calls).
If your use-case involves quickly finding objects based on their suffix in a specific field, consider extracting this suffix (assuming it's a fixed-size suffix) into another field and adding a secondary index on it. If you want to query arbitrary-length suffixes, I would create a lookup table, update it with possible suffixes (or some of them, to save some calls), and then filter when querying.
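To illustrate the fixed-size-suffix idea, here is a rough sketch with the AWS SDK for Ruby (v2). The table my_table, the suffix attribute and the suffix-index global secondary index are hypothetical names, and the index would have to be created on the table first:

require 'aws-sdk'

dynamodb = Aws::DynamoDB::Client.new(region: 'us-east-1')

# At write time, copy the last few characters of the key into a separate
# attribute so a global secondary index can be queried on it directly.
dynamodb.put_item(
  table_name: 'my_table',
  item: {
    'id'     => 'user-12345-abc',
    'suffix' => 'abc' # fixed-size suffix of the primary key
  }
)

# At read time, query the GSI on the suffix attribute instead of scanning.
resp = dynamodb.query(
  table_name: 'my_table',
  index_name: 'suffix-index',
  key_condition_expression: '#sfx = :s',
  expression_attribute_names: { '#sfx' => 'suffix' },
  expression_attribute_values: { ':s' => 'abc' }
)
resp.items.each { |item| puts item['id'] }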
It looks like you would want to use the Query method in the SDK to find the items you're looking for. It seems that "EndsWith" is not available as a comparison operator in the SDK, though, so you would need to use CONTAINS and then check your results locally.
This should lead to the best performance, letting DynamoDB do the initial heavy lifting and then further pruning the results once you receive them.
http://docs.aws.amazon.com/sdkforruby/api/Aws/DynamoDB/Client.html#query-instance_method
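One caveat on the CONTAINS route: contains() is only usable in a filter, not in a key condition, and since the attribute here is the primary key itself, in practice that means a scan, which lines up with the full-table-scan warning above. A rough sketch of that, again with hypothetical table and attribute names:

require 'aws-sdk'

dynamodb = Aws::DynamoDB::Client.new(region: 'us-east-1')
suffix = 'abc'

# Scan with a contains() filter to cut down what comes over the wire,
# then do the real ends-with check locally, since DynamoDB has no
# ends-with operator. Pagination via last_evaluated_key is omitted here.
resp = dynamodb.scan(
  table_name: 'my_table',
  filter_expression: 'contains(#pk, :s)',
  expression_attribute_names: { '#pk' => 'id' },
  expression_attribute_values: { ':s' => suffix }
)

matches = resp.items.select { |item| item['id'].end_with?(suffix) }
matches.each { |item| puts item['id'] }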

In XQuery, how do I avoid deteriorating performance in a paged simple query?

I have a ML database with a few tens of thousands of documents in it, and a query that returns some simple calculated values for either all or a subset of those documents. The document count has grown to the point that the "all documents" option no longer reliably runs without timing out, and is only going to get worse as the document count grows. The obvious solution is for the client application to use the other form and paginate the results. It's an offline batch process, so overall speed isn't an issue - we'd just like to keep individual requests sane.
The paged version of the query is very simple:
declare namespace ns = "http://some.namespace/here";
declare variable $fromCount external;
declare variable $toCount external;
<response> {
  for $doc in fn:doc()/ns:entity[$fromCount to $toCount]
  return
    <doc> omitted for brevity </doc>
} </response>
The problem is that the query gets slower the further through the document set the requested page is; presumably because it has to load every document in order, check whether it's the right type, and iterate until it has found $fromCount ns:entity elements before it even begins building the response.
One wrinkle is that there are other types of document in the database, so just using fn:doc isn't a realistic option (although they are in different directories, so xdmp:directory() might be an option; something I'll look into).
There also isn't currently an index on the ns:entity element; would that help? It's always the root node of a document, and the documents are quite large, so I'm concerned about the size of the index. Also, (the slow part of) this query isn't interested in the value of the element, just that it exists.
I thought about using the search: API for its built-in paging, but it seems overkill for a query that is intended to match all documents; surely it's possible to manually construct the query that search:search() would build internally.
It seems like what I really need is an efficient list of all root nodes of a certain type in the database. Does MarkLogic maintain such a thing? If not, would an index solve the problem?
Edit: It turns out that the answer in my case is to use the xdmp:directory() option, since ML apparently stores a fast, in-memory list of all documents. Still, if there is a more general solution to the problem, it's bound to be of interest, so I'll leave the question here.
Your analysis is correct:
presumably because it has to load every document in order, check whether it's the right type, and iterate until it has found $fromCount ns:entity elements before it even begins building the response
The usual answer is cts:search plus the unfiltered option. You found that xdmp:directory was faster, but you should still be able to measure pagination times as O(n) even if the scale is smaller. See http://docs.marklogic.com/guide/performance/unfiltered#chapter - basically the database is guarding against returning false positives, unless you tell it not to.
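To make that concrete, a sketch of the paged query rebuilt around cts:search with the unfiltered option might look like the following; treat it as an outline rather than a drop-in replacement:

xquery version "1.0-ml";
declare namespace ns = "http://some.namespace/here";
declare variable $fromCount external;
declare variable $toCount external;

(: Unfiltered search resolves candidates straight from the indexes, and
   fn:subsequence only pulls up the documents for the requested page. :)
<response> {
  for $doc in fn:subsequence(
    cts:search(/ns:entity, cts:and-query(()), "unfiltered"),
    $fromCount,
    $toCount - $fromCount + 1
  )
  return <doc> omitted for brevity </doc>
} </response>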
Another approach might be to use cts:uris and its limit option, but this might require managing pagination state in terms of start values rather than page counts. For example, if the last item on page 1 was "cat", you would pass "cat" as the start value when calling cts:uris for the next page. You could still use pagination start-stop values, too. That would still be O(n), but at a much smaller scale.
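A sketch of that cts:uris approach, assuming the URI lexicon is enabled on the database (the variable names are mine):

xquery version "1.0-ml";
declare namespace ns = "http://some.namespace/here";

(: $last-uri is the last URI returned on the previous page; pass "" for the
   first page. The start value is inclusive, so you may want to skip the
   first result if it repeats the previous page's last URI. :)
declare variable $last-uri as xs:string external;
declare variable $page-size as xs:unsignedLong external;

cts:uris(
  $last-uri,
  fn:concat("limit=", xs:string($page-size)),
  cts:element-query(xs:QName("ns:entity"), cts:and-query(()))
)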

MongoDB find and remove - the fastest way

I have a quick question: what is the fastest way to grab and delete an object from a Mongo collection? Here is the code I currently have:
$cursor = $coll->find()->sort(array('created' => 1))->limit(1);
$obj = $cursor->getNext();
$coll->remove(array('name' => $obj['name']));
As you can see above, it grabs one document from the database and deletes it (so it isn't processed again). However fast this may be, I need it to perform faster. The challenge is that we have multiple processes doing this and processing what they have found, but sometimes two or more of the processes grab the same document, which creates duplicates. Basically, I need to make it so a document can only be grabbed once. Any ideas would be much appreciated.
Peter,
It's hard to say what the best solution is here without understanding all the context, but one approach you could use is findAndModify. This will query for a single document and return it, and also apply an update to it.
You could use this to find a document to process and simultaneously modify a "status" field to mark it as being processed, so that other workers can recognize it as such and ignore it.
There is an example here that may be useful:
http://docs.mongodb.org/manual/reference/command/findAndModify/
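In mongo shell terms, that claim-and-mark pattern could look roughly like this (the collection name and the status field are illustrative):

// Atomically pick the oldest document nobody has claimed yet and mark it
// as being processed, so other workers will skip it.
var doc = db.queue.findAndModify({
  query:  { status: { $ne: "processing" } },
  sort:   { created: 1 },
  update: { $set: { status: "processing", claimedAt: new Date() } },
  new:    true
});

Because the query excludes documents already marked as "processing", two workers can never claim the same document.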
Use the findAndRemove function as documented here:
http://api.mongodb.org/java/current/com/mongodb/DBCollection.html
The findAndRemove function retrieves an object from the Mongo database and deletes it in a single (atomic) operation.
findAndRemove(query, sort[, options], callback)
The query object is used to retrieve the object from the database (see collection.find())
The sort parameter is used to sort the results (in case many were found)
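For completeness, a rough Node.js sketch against the node-mongodb-native 1.x callback API (collection and field names are illustrative):

var MongoClient = require('mongodb').MongoClient;

MongoClient.connect('mongodb://localhost:27017/mydb', function (err, db) {
  if (err) throw err;
  // Atomically fetch the oldest document and remove it in one round trip.
  db.collection('queue').findAndRemove(
    {},                   // query: match any document
    [['created', 1]],     // sort: oldest first
    function (err, doc) { // doc is the removed document (or null if none)
      if (err) throw err;
      console.log('claimed:', doc);
      db.close();
    }
  );
});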
I'm adding a new answer to point out the following:
As commented by #peterscodeproblems on the accepted answer, the native way to do this in MongoDB right now is to use
findAndModify(query=<document>, remove=True)
as pointed out by the documentation.
As it is native and atomic, I expect this to be the fastest way to do this.
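In mongo shell terms that is simply (collection name illustrative):

// Fetch the oldest document and delete it in a single atomic operation.
var doc = db.queue.findAndModify({
  query:  {},
  sort:   { created: 1 },
  remove: true
});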
I am new to MongoDB and not entirely sure what your query is trying to do, but here is how I would do it:
# suppose database is staging
# suppose collection is data
use staging
db.data.remove(<your_query_criteria>)
where <your_query_criteria> is a map and can contain any search criteria you want.
Not sure if this would help you.

Manually specifying how to build an index?

I'm looking into Thinking Sphinx for its potential to solve an indexing problem. It looks like it has a very specific API for telling it which fields to index on a model. I don't like having this layer of abstraction in my way without being able to sidestep it. The thing is, I don't trust Sphinx to be able to interpret my model properly, as this model could have any conceivable property. Basically, I want to encode JSON in an RDBMS. In a way, I'm looking to make an RDBMS behave like MongoDB (RDBMSes have features I don't want to do without). If TS or some other index could be made to understand my models, this could work. Is it possible to manually provide key/value pairs to TS?
"person.name.first" => "John", "person.name.last" => "Doe", "person.age" => 32,
"person.address" => "123 Main St.", "person.kids" => ["Ed", "Harry"]
Is there another indexing tool that could be used from Ruby to index JSON?
(By the way, I have explored a wide variety of NoSQL databases. I am trying to address a very specific set of requirements.)
As Matchu has pointed out in the comments, Sphinx usually interacts directly with the database. This is why Thinking Sphinx is built like it is.
However, Sphinx (but not Thinking Sphinx) can also accept XML data formats - so if you want to go down that path, feel free. You're going to have to understand the underlying Sphinx structure much more deeply than you would if using a normal relational database/ActiveRecord and Thinking Sphinx approach. Riddle may be useful for building a solution, but you'll still need to understand Sphinx itself first.
Basically, when you're specifying what you want to index--that is, when you want to build your own index--you're using the Map part of Map/Reduce. CouchDB supports exactly this. The only problem I ran into with Couch is that I want to query other document objects as the basis of my Map/Reduce since those documents would contain metadata about how I want to build my indexes. This goes against the grain of Map/Reduce however as you have to map a document in isolation with no external data. If you need external data it would instead be denormalized into your documents.
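For what it's worth, the Map step described above could look like the following CouchDB map function; the dotted keys mirror the example pairs from the question and are only illustrative:

// Emits one row per JSON path, so the view behaves like a key/value index
// over each document's properties.
function (doc) {
  if (doc.person) {
    if (doc.person.name) {
      emit("person.name.first", doc.person.name.first);
      emit("person.name.last", doc.person.name.last);
    }
    if (doc.person.age !== undefined) {
      emit("person.age", doc.person.age);
    }
  }
}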
