MongoDB find and remove - the fastest way - performance

I have a quick question: what is the fastest way to grab and delete an object from a Mongo collection? Here is the code I currently have:
$cursor = $coll->find()->sort(array('created' => 1))->limit(1);
$obj = $cursor->getNext();
$coll->remove(array('name' => $obj['name']));
As you can see above, it grabs one document from the database and deletes it (so it isn't processed again). However fast this may be, I need it to perform faster. The challenge is that we have multiple processes doing this and processing what they find, but sometimes two or more of the processes grab the same document, producing duplicates. Basically I need to make it so a document can only be grabbed once. Any ideas would be much appreciated.

Peter,
It's hard to say what the best solution is here without understanding all the context, but one approach you could use is findAndModify. This will query for a single document, return it, and also apply an update to it.
You could use this to find a document to process and simultaneously modify a "status" field to mark it as being processed, so that other workers can recognize it as such and ignore it.
There is an example here that may be useful:
http://docs.mongodb.org/manual/reference/command/findAndModify/
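For illustration, a minimal sketch with the legacy PHP driver (untested; the "status" field and its values are assumptions, not part of the original question):

// Atomically claim the oldest unclaimed document in one operation.
$doc = $coll->findAndModify(
    array('status' => array('$exists' => false)),      // query: only unclaimed documents
    array('$set' => array('status' => 'processing')),  // update: claim it atomically
    null,                                              // fields: return the whole document
    array('sort' => array('created' => 1))             // options: oldest first
);
// $doc can now be processed; no other worker will receive it.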

Use the findAndRemove function as documented here:
http://api.mongodb.org/java/current/com/mongodb/DBCollection.html
The findAndRemove function retrieves an object from the Mongo database and deletes it in a single (atomic) operation.
findAndRemove(query, sort[, options], callback)
The query object is used to retrieve the object from the database (see collection.find())
The sort parameter is used to sort the results (in case many were found)
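A minimal sketch with the Node.js driver whose signature is shown above (untested; newer driver versions expose the same atomic operation as findOneAndDelete):

// Grab-and-delete in one atomic round trip; the collection name is hypothetical.
collection.findAndRemove(
    {},                  // query: match any document
    [['created', 1]],    // sort: oldest first
    function (err, result) {
        if (err) throw err;
        var doc = result.value || result; // result shape varies across driver versions
        // process doc; no other worker can have removed the same document
    }
);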

I'm adding a new answer to highlight the fact:
As commented by @peterscodeproblems on the accepted answer, the native way to do this in MongoDB right now is to use
findAndModify(query=<document>, remove=True)
As pointed out by the documentation.
As it is native and atomic, I expect this to be the fastest way to do this.
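In the mongo shell that looks something like this (a sketch; the collection name is hypothetical):

// Atomically return and delete the oldest document in a single operation.
db.jobs.findAndModify({
    query: {},
    sort: { created: 1 },
    remove: true
})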

I am new to mongodb and not entirely sure what your query is trying to do, but here is how I would do it
# suppose database is staging
# suppose collection is data
use staging
db.data.remove(<your_query_criteria>)
where <your_query_criteria> is a map and can contain any search criteria you want
Not sure if this would help you.

Related

Is it possible to get items from DynamoDB where the primary key ends with a given string?

Is it possible, using the AWS Ruby SDK (or just DynamoDB in general), to get an item or items from a table that uses a primary key only, and where that primary key ends with a certain string?
I haven't come across anything in the docs that explicitly answers this question, either in the ruby ddb docs or the general docs for ddb. I'm not saying the question is not answered, but if it is, I can't find it.
If it is possible, could someone provide an example for ruby or link to the docs where an example exists?
Although @Ryan is correct and this can be done with Query, just bear in mind that you're doing a "full-table scan" here. That might be OK for a one-time job, but probably not best practice for a routine task (and of course not as part of your API calls).
If your use-case involves quickly finding objects based on their suffix in a specific field, consider extracting this suffix (assuming it's fixed-size) into another field and having a secondary index on that one. If you want to query arbitrary-length suffixes, I would create a lookup table and update it with possible suffixes (or some of them, to save some calls, and then filter when querying).
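A rough sketch of the fixed-size-suffix idea with the AWS SDK for Ruby (untested; the table name, attribute names, and index name are all assumptions):

require 'aws-sdk'

client = Aws::DynamoDB::Client.new(region: 'us-east-1')
suffix_len = 4

# On write: store the key's last characters as their own attribute.
id = 'abc-prod'
client.put_item(
  table_name: 'my_table',
  item: { 'id' => id, 'id_suffix' => id[-suffix_len..-1] }
)

# On read: query a (hypothetical) global secondary index keyed on that attribute.
resp = client.query(
  table_name: 'my_table',
  index_name: 'id_suffix-index',
  key_condition_expression: 'id_suffix = :s',
  expression_attribute_values: { ':s' => 'prod' }
)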
It looks like you would want to use the Query method on the SDK to find the items you're looking for. It seems that "EndsWith" is not available as a comparison operator in the SDK, though, so you would need to use CONTAINS and then check your results locally.
This should lead to the best performance, letting DynamoDb do the initial heavy lifting and then further pruning the results once you receive them.
http://docs.aws.amazon.com/sdkforruby/api/Aws/DynamoDB/Client.html#query-instance_method
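Something along these lines (a sketch, untested; table and attribute names are assumptions, and because there is no key condition to narrow on, this falls back to scan rather than query):

require 'aws-sdk'

client = Aws::DynamoDB::Client.new(region: 'us-east-1')
suffix = '-prod' # hypothetical suffix to match

# Let DynamoDB narrow with contains() server-side, then prune to true
# suffix matches locally, as suggested above.
resp = client.scan(
  table_name: 'my_table',
  filter_expression: 'contains(id, :s)',
  expression_attribute_values: { ':s' => suffix }
)
matches = resp.items.select { |item| item['id'].end_with?(suffix) }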

Need help rewriting XQuery to avoid expanded tree cache full error in MarkLogic

I am new to XQuery and MarkLogic.
I am trying to update documents in MarkLogic and get the expanded tree cache full error.
Just to get the work done I have increased the expanded tree cache, but that is not recommended.
I would like to tune this query so that it does not need to simultaneously cache as much XML.
Here is my query
I have uploaded my query as an image because it did not format well when I pasted it into the editor. If anyone knows a better way, please suggest it.
Thanks in advance.
Expanded tree cache errors can be caused by executing queries that select too many XML nodes at once. In your example, this is likely the culprit: /tx:AttVal[tx:AttributeName/text()=$attributeName].
It's possible that calling text() is the source of your problem (and text() is probably not what you mean anyway; see this blog), causing MarkLogic to evaluate that function on all those nodes, and that by simply using /tx:AttVal[tx:AttributeName=$attributeName] you may solve your problem.
Next I would consider adding a path range index on /tx:AttVal/tx:AttributeName and querying those nodes using cts:search and cts:path-range-query. This will be substantially faster than plain XPath without a range index. It's also possible to use XPath with a range index: MarkLogic will automatically optimize the XPath expression to use the range index; however, there can be reasons it doesn't optimize the expression correctly, and you would want to check that using xdmp:plan.
Also note that the general best-practice recommendation for XML in MarkLogic is to use "semantic XML". E.g., when you mean an attribute, use an attribute: <some-node AttributeName="AttVal">. MarkLogic's indexes are optimized out of the box for semantic XML design. However, if you have no option but to work with XML that isn't, then that's what path range indexes were designed for.
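For illustration, a sketch of the range-index approach (it assumes a string path range index on /tx:AttVal/tx:AttributeName has already been configured in the database, and the tx namespace URI is a placeholder):

declare namespace tx = "http://example.com/tx"; (: placeholder URI :)
declare variable $attributeName external;

(: Resolved against the path range index rather than by walking documents. :)
cts:search(
  fn:doc(),
  cts:path-range-query("/tx:AttVal/tx:AttributeName", "=", $attributeName)
)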
I've just solved exactly this scenario. There are two things I did.
I put the node-replace and node-insert type calls (that is, any calls that modify the XML structure) into a separate module, and then called that module using xdmp:invoke, passing in any parameters required, like this:
let $update := xdmp:invoke(
  "/app/lib/update-attribute-node.xqy",
  (xs:QName("newValue"), $new),
  <options xmlns="xdmp:eval">
    <database>{xdmp:modules-database()}</database>
  </options>
)
The reason this works is that the call to xdmp:invoke happens in its own transaction and, once it completes, the memory is cleared up. If you don't do this, then each time you call the update or insert function it will not actually perform the write until the very end, in a single transaction, meaning your memory will fill up pretty quickly.
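For context, the invoked module might look something like this (a hypothetical sketch; the document URI, element names, and namespace are all assumptions):

(: /app/lib/update-attribute-node.xqy :)
declare namespace tx = "http://example.com/tx"; (: placeholder URI :)
declare variable $newValue external; (: bound by xdmp:invoke's parameter sequence :)

let $node := fn:doc("/some/doc.xml")/tx:root/tx:AttVal/tx:AttributeName
return xdmp:node-replace($node, element tx:AttributeName { $newValue })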
Any time I needed to loop over paths in MarkLogic (or documents, or whatever they are called; I've only been using MarkLogic for a few days) and there were a large number of them, I processed them only a few at a time, like below. I came up with an elaborate way of skipping and taking only a batch of documents at a time, but you can do it in any number of ways.
let $whatever:= xdmp:directory("/whatever/")[$start to $end]
I also put this into a separate module so that it is processed immediately and not in a single transaction.
Putting all expensive calls into separate modules and taking only a subset of large data sets at a time helped me solve my expanded tree cache full errors.
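Putting the two together, a rough sketch of the batching loop (the directory, module path, and batch size are hypothetical):

let $batch-size := 100
let $total := xdmp:estimate(cts:search(fn:doc(), cts:directory-query("/whatever/")))
for $start in (1 to $total)[. mod $batch-size eq 1]
return
  (: each invocation runs in its own transaction, so memory is released per batch :)
  xdmp:invoke(
    "/app/lib/process-batch.xqy",
    (xs:QName("start"), $start, xs:QName("end"), $start + $batch-size - 1)
  )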

Speed up search

I have a collection with around 2.5 million records. Currently I'm using nodejs with mongoose. I'm open to any suggestions which would improve the efficiency of the current search.
The Artist collection schema is something like this:
var Artist = new mongoose.Schema({
    realname: String,
    names: Array
});
The user passes a string, for example "Michael Jackson was contacted by police" or "Michael Jackson-Jillie bean". Now I have to find the artist/singer/person, so the only logical thing I could come up with was: iterate through all documents in the collection, check whether any of the names occurs in the given string, and if YES we have a match, so stop the loop.
But this is very memory-inefficient, since it has to load the whole collection into memory so nodejs can iterate through and check all records.
It seems that retrieving the whole collection takes most of the time. Is there any way to speed this up? In the mongo shell db.collection.find() is pretty fast, but using mongoose in nodejs it takes way too long.
Why is .find() so slow when used with mongoose?
Are there any other databases designed for this purpose?
How would you solve this problem more efficiently?
Could map-reduce be of any use here?
What you need is a full-text search. Possible options:
Use a MongoDB text index (new in 2.4).
Use an additional full-text index. In my opinion Elasticsearch is very performant and easy to understand for beginners.
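For the first option, a minimal mongo shell sketch (the collection name is assumed; the $text operator shown requires MongoDB 2.6+, while 2.4 exposed the same index through the text command):

// Index the names array for full-text search (newer shells use createIndex).
db.artists.ensureIndex({ names: "text" })
// Search with the whole input string; the best-scoring artist wins.
db.artists.find(
    { $text: { $search: "Michael Jackson was contacted by police" } },
    { score: { $meta: "textScore" } }
).sort({ score: { $meta: "textScore" } }).limit(1)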

How to uniquely identify two objects in the same page having the same URL

I have two objects in the same page but in different locations (tabs), and I want to verify each of those objects separately ...
I can't uniquely identify either of the objects because they have the same properties.
These objects clearly are unique to a point, because they have completely different text; this means that you will be able to create an object to match only one of them. My suggestion would be to look for the object by using its text property: one of them will always have "Top Ranking"; the other you will need to turn into a regular expression for the text, and it will be something like "Participants (\d+)".
I am assuming that this next suggestion is unlikely to be possible, so I saved it for after the answer you are likely to use, but the best solution would of course be to get someone with access to give these elements IDs for you to search for. This will in the long term be much easier for you to maintain, and not relying on text will allow this test to run in any language.
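For illustration, a hedged sketch in QTP descriptive programming (the object hierarchy and test-object class are assumptions; property values are treated as regular expressions by default, hence the escaped parentheses):

' A sketch in VBScript; "MyApp", "MyPage", and WebElement are hypothetical.
Browser("MyApp").Page("MyPage").WebElement("innertext:=Participants \(\d+\)").Click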
Manaysah, do these objects have different indexes? Use the object spy and determine which index they have; the ordinal identifier index may be a solution to your problem. You could also try adding an innertext object property if possible, using a wildcard for the number inside the parentheses, as it appears dynamic.
Try using XPath for the objects... the XPath will definitely be different.

In XQuery, how do I avoid deteriorating performance in a paged simple query?

I have a ML database with a few tens of thousands of documents in it, and a query that returns some simple calculated values for either all or a subset of those documents. The document count has grown to the point that the "all documents" option no longer reliably runs without timing out, and is only going to get worse as the document count grows. The obvious solution is for the client application to use the other form and paginate the results. It's an offline batch process, so overall speed isn't an issue - we'd just like to keep individual requests sane.
The paged version of the query is very simple:
declare namespace ns = "http://some.namespace/here";
declare variable $fromCount external;
declare variable $toCount external;

<response> {
  for $doc in fn:doc()/ns:entity[$fromCount to $toCount]
  return
    <doc> omitted for brevity </doc>
} </response>
The problem is that the query gets slower the further through the document set the requested page is, presumably because it's having to load every document in order, check whether it's the right type, and iterate until it has found $fromCount ns:entity elements before it even begins building the response.
One wrinkle is that there are other types of document in the database, so just using fn:doc isn't a realistic option (although they are in different directories, so xdmp:directory() might be an option; something I'll look into).
There also isn't currently an index on the ns:entity element; would that help? It's always the root-node of a document, and the documents are quite large, so I'm concerned about the size of the index. Also, (the slow part of) this query isn't interested in the value of the element, just that it exists.
I thought about using the search: API for its built-in paging, but it seems overkill for a query that is intended to match all documents; surely it's possible to manually construct the query that search:search() would build internally.
It seems like what I really need is an efficient list of all root nodes of a certain type in the database. Does MarkLogic maintain such a thing? If not, would an index solve the problem?
Edit: It turns out that the answer in my case is to use the xdmp:directory() option, since ML apparently stores a fast, in-memory list of all documents. Still, if there is a more general solution to the problem, it's bound to be of interest, so I'll leave the question here.
Your analysis is correct:
presumably because it's having to load every document in order, check whether it's the right type and iterate until its found $fromCount ns:entitys before it even begins building the response
The usual answer is cts:search plus the unfiltered option. You found that xdmp:directory was faster, but you should still be able to measure pagination times as O(n) even if the scale is smaller. See http://docs.marklogic.com/guide/performance/unfiltered#chapter - basically the database is guarding against returning false positives, unless you tell it not to.
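For example, a sketch of the unfiltered approach, reusing the question's namespace and external variables:

declare namespace ns = "http://some.namespace/here";
declare variable $fromCount external;
declare variable $toCount external;

<response> {
  for $doc in cts:search(
    fn:doc(),
    cts:element-query(xs:QName("ns:entity"), cts:and-query(())),
    "unfiltered"
  )[$fromCount to $toCount]
  return <doc> omitted for brevity </doc>
} </response>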
Another approach might be to use cts:uris and its limit option, but this might require managing pagination state in terms of start values rather than page counts. For example if the last item on page 1 was "cat", you would use "cat" as arg2 when calling cts:uris for the next page. You could still use pagination start-stop values, too. That would still be O(n) - but at a much smaller scale.
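A sketch of that cursor style (hedged: in the signature I know, the start value is the first argument to cts:uris, with options second):

declare namespace ns = "http://some.namespace/here";

let $page-size := 100
let $last-uri := "/docs/cat.xml" (: hypothetical last URI from the previous page :)
return cts:uris(
  $last-uri,
  (fn:concat("limit=", $page-size)),
  cts:element-query(xs:QName("ns:entity"), cts:and-query(()))
)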
