I am currently migrating from CouchDB to MongoDB. I am still learning this stuff, and I run into a problem any time I execute queries with sorting from my web page.
I am using CodeIgniter as the framework, together with Alex Bilbie's MongoDB library for PHP.
So, here is my problem:
I am querying sensors in a room that update every second (so thousands of documents have already been saved in the collection). To get each sensor's latest value I use this query in my model:
function mongoGetDocLatestDoc($sensorid) {
    // Return the newest document for this sensor (largest _id first)
    $doc = $this->mongo_db
        ->where(array('SensorId' => $sensorid))
        ->limit(1)
        ->order_by(array('_id' => 'DESC'))
        ->get('my_mongo');
    return $doc;
}
If I call this from my controller, the query takes a long time to run, and it is even worse if I change the sort to the timestamp field. The latency roughly doubles each time I call it again for the next sensor, and I have more than 10 sensors that need this query on the same page. Am I doing something wrong, or is there a more efficient way to get the latest data from a collection?
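For reference, here is a rough mongo shell equivalent of that model call, which is convenient for testing the query and its explain() output outside of PHP (the collection name follows the code above; the sensor id value is just a placeholder):
// Latest reading for one sensor, newest _id first
db.my_mongo.find({ "SensorId" : "sensor-1" }).sort({ "_id" : -1 }).limit(1)
// Append .explain() to the line above to see which index, if any, the query uses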
edit:
@Sammaye: I tried creating an index based on your suggestion, and here is the explain output generated after I executed the query:
"cursor" : "BtreeCursor timestamp_desc",
"nscanned" : 326678,
"nscannedObjects" : 326678,
"n" : 50,
"millis" : 4402,
"nYields" : 7,
"nChunkSkips" : 0,
"isMultiKey" : false,
"indexOnly" : false,
"indexBounds" : {
"timestamp" : [
[
{
"$maxElement" : 1
},
{
"$minElement" : 1
}
]
]
}
For comparison, here is the explain output of the original query without the index (which actually executed faster):
"cursor" : "BasicCursor",
"nscanned" : 385517,
"nscannedObjects" : 385517,
"n" : 50,
"millis" : 1138,
"nYields" : 0,
"nChunkSkips" : 0,
"isMultiKey" : false,
"indexOnly" : false,
"indexBounds" : {
}
Okay, thank you all for your great responses :) Here are my findings after experimenting with indexes:
Update MongoDB to the latest version. I found this slightly improved my queries. I upgraded from 2.0.3, which is the default package in Ubuntu 12.04.2, to MongoDB 2.4.3.
Build indexes (or compound indexes) that match exactly how you query most of the time. For example, in my question the query filters on SensorId and sorts by _id descending, so the best strategy to optimise it is an index like:
db.your_collection.ensureIndex({ "SensorId" : 1, "_id" : -1 }, { "name" : "sensorid_idx", "background" : true });
or, in the other case, if I need the latest value by timestamp instead:
db.your_collection.ensureIndex({ "SensorId" : 1, "timestamp" : -1 }, { "name" : "sensor_idx", "background" : true });
I found a very good explanation of MongoDB indexes here.
Hope this helps other people who stumble upon a similar problem...
Related
I have a Cosmos DB MongoDB collection that I'm using purely as a key/value store for arbitrary data, where _id is the key for my collection.
When I run the query below:
globaldb:PRIMARY> db.FieldData.find({_id : new BinData(3, "xIAPpVWVkEaspHxRbLjaRA==")}).explain(true)
I get this result:
{
  "_t" : "ExplainResponse",
  "ok" : 1,
  "queryPlanner" : {
    "plannerVersion" : 1,
    "namespace" : "data.FieldData",
    "indexFilterSet" : false,
    "parsedQuery" : {
      "$and" : [ ]
    },
    "winningPlan" : { },
    "rejectedPlans" : [ ]
  },
  "executionStats" : {
    "executionSuccess" : true,
    "nReturned" : 1,
    "executionTimeMillis" : 106,
    "totalKeysExamined" : 0,
    "totalDocsExamined" : 3571,
    "executionStages" : { },
    "allPlansExecution" : [ ]
  },
  "serverInfo" : #REMOVED#
}
Notice that totalKeysExamined is 0, totalDocsExamined is 3571, and the query took over 106 ms. If I run it without .explain(), it does find the document.
I would have expected this query to be lightning quick given that the _id field is automatically indexed as a unique primary key on the collection. As this collection grows in size, I only expect this problem to get worse.
I'm definitely not understanding something about the index and how it works here. Any help would be most appreciated.
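For reference, a quick way to see which indexes the collection actually exposes is the standard shell call below (just a sketch; Cosmos DB's MongoDB API may report indexes differently from native MongoDB):
// List the indexes reported for the collection
db.FieldData.getIndexes()
// A native MongoDB deployment always lists the automatic { "_id" : 1 } index here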
Thanks!
Scenario:
I have an index with a bunch of multi-tenant data in Elasticsearch 6.x. This data is frequently deleted (via _delete_by_query) and populated by the tenants.
When issuing a _delete_by_query request with wait_for_completion=false, supplying a query JSON to delete a tenant's data, I am able to see generic task information via the _tasks API. The problem is that, with a large number of tenants, it is not clear who is deleting data at any given time.
My question is this:
Is there a way I can view the query that a _delete_by_query task is operating on? Or can I attach an additional parameter to the URL that is cached in the task to differentiate the tasks?
Side note: looking at the docs (https://www.elastic.co/guide/en/elasticsearch/reference/6.6/tasks.html), I see there is a description field in the _tasks API response that contains the query as a string; however, I do not see that level of detail in my description field:
"description" : "delete-by-query [myindex]"
Thanks in advance
One way to identify queries is to add the X-Opaque-Id HTTP header to your queries:
For instance, when deleting all tenant data for (e.g.) User 3, you can issue the following command:
curl -XPOST -H 'X-Opaque-Id: 3' -H 'Content-type: application/json' http://localhost:9200/my-index/_delete_by_query?wait_for_completion=false -d '{"query":{"term":{"user": 3}}}'
You then get a task ID back, and when checking the related task document, you'll be able to identify which task is/was deleting which tenant's data thanks to the headers section, which contains your HTTP header:
"_source" : {
"completed" : true,
"task" : {
"node" : "DB0GKYZrTt6wuo7d8B8p_w",
"id" : 20314843,
"type" : "transport",
"action" : "indices:data/write/delete/byquery",
"status" : {
"total" : 3,
"updated" : 0,
"created" : 0,
"deleted" : 3,
"batches" : 1,
"version_conflicts" : 0,
"noops" : 0,
"retries" : {
"bulk" : 0,
"search" : 0
},
"throttled_millis" : 0,
"requests_per_second" : -1.0,
"throttled_until_millis" : 0
},
"description" : "delete-by-query [deletes]",
"start_time_in_millis" : 1570075424296,
"running_time_in_nanos" : 4020566,
"cancellable" : true,
"headers" : {
"X-Opaque-Id" : "3" <--- user 3
}
},
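If it helps, you can also list the delete-by-query tasks that are currently running, together with their headers, directly via the tasks API; a sketch using the standard detailed/actions parameters (host and port are assumptions):
curl -XGET 'http://localhost:9200/_tasks?detailed=true&actions=*/delete/byquery&pretty'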
I am developing an app where I retrieve data from a Firebase Realtime Database.
On the one hand, I have my objects; there will be around 10,000 entries when it is finished. For each property, such as "Blütenfarbe" (flower colour), a user can select one (or more) characteristics and then gets as a result the plants for which all of these constraints are true. Each property has 2-10 characteristics.
Is querying powerful enough here to get fast results? If not, my thought would be to also set up a container for each characteristic and put the ID of every plant that has this characteristic into it (see the sketch after the JSON below).
This is my first project, so any tip for a better structure is welcome. I don't want to create this database and realise afterwards that it is not well structured.
Thanks for your help :)
{
  "Pflanzen" : {
    "Objekt" : {
      "00001" : {
        "Belaubung" : "Sommergrün",
        "Blütenfarbe" : "Gelb",
        "Blütezeit" : "Februar",
        "Breite" : 20,
        "Duftend" : "Ja",
        "Frosthärte" : "Ja",
        "Fruchtschmuck" : "Nein",
        "Herbstfärbung" : "Gelb",
        "Höhe" : 20,
        "Pflanzengruppe" : "Laubgehölze",
        "Standort" : "Sonnig",
        "Umfang" : 10
      },
      "00002" : {
        "Belaubung" : "Sommergrün",
        "Blütenfarbe" : "Gelb",
        "Blütezeit" : "März",
        "Breite" : 25,
        "Duftend" : "Nein",
        "Frosthärte" : "Ja",
        "Fruchtschmuck" : "Nein",
        "Herbstfärbung" : "Ja",
        "Höhe" : 10,
        "Pflanzengruppe" : "Nadelgehölze",
        "Standort" : "Schatten",
        "Umfang" : 10
      }
    },
    "Eigenschaften" : {
      "Belaubung" : {
        "Sommergrün" : [ "00001", "00002" ],
        "Wintergrün" : [ "..." ]
      },
      "Blütenfarbe" : {
        "Braun" : [ "00002" ],
        "Blau" : [ "00001" ]
      }
    }
  }
}
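To illustrate the two approaches, here is a minimal JavaScript sketch using the Firebase web SDK (the ref paths and values follow the JSON above; everything else is an assumption, not code from the question):
// Approach 1: query the objects directly by a single characteristic
// (for large collections this needs an ".indexOn" : "Blütenfarbe" rule in the security rules)
firebase.database()
  .ref('Pflanzen/Objekt')
  .orderByChild('Blütenfarbe')
  .equalTo('Gelb')
  .once('value')
  .then(snap => console.log(Object.keys(snap.val() || {}))); // IDs of matching plants

// Approach 2: read a pre-built index container for one characteristic
firebase.database()
  .ref('Pflanzen/Eigenschaften/Blütenfarbe/Blau')
  .once('value')
  .then(snap => console.log(snap.val())); // e.g. [ "00001" ]
Note that a Realtime Database query can only order/filter on one child key at a time, so combining several selected characteristics means intersecting the results client-side, whichever of the two layouts you choose.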
GET /myindex/voc/100/_termvectors?pretty=true
{
"fields":["fields.bodyText"],
"term_statistics" : true,
"filter" : {
"min_doc_freq" : 50,
"max_doc_freq" : 60
}
}
This API returns only part of the results.
Is there something like
"from" : 0, "size" : 10,
as in the _search API pagination?
Yes, there is: from represents the offset in the result set at which you start returning results, and size represents the number of hits you want returned.
So if you have something like this:
"from" : 0, "size" : 10,
It'll return your first ten results from the result set. This could be helpful.
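For completeness, this is what from/size pagination looks like in a _search request (a sketch against the same hypothetical index; _search is used here because that is where from and size are documented):
GET /myindex/_search
{
  "from" : 0,
  "size" : 10,
  "query" : {
    "match_all" : {}
  }
}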
I have a problem with the RESTful interface of MongoDB.
I have submitted this query --> http://127.0.0.1:28017/db/collection/?limit=0 (I used limit=0 because I want to fetch all results with an AJAX request),
and the result, in terms of number of rows, is "total_rows" : 38185.
But if I execute db.collection.count() in my shell, the result is 496519.
Why is there this difference? Is it possible to get the same result with an AJAX request?
Thanks in advance for your help.
I'm fairly sure the result is not limited by the number of rows, nor by MongoDB itself, but by the built-in web server (which was mainly created for admin tasks). It is most likely the payload size of the response that is being cut off by the web server, much like an HTTP 413 error (request entity too large).
In my tests I see log entries such as "[websvr] killcursors: found 1 of 1". This kills the cursor opened between the client (in this case the web server) and MongoDB. Most drivers do not need to call OP_KILL_CURSORS explicitly because MongoDB defines a cursor timeout of 10 minutes by default.
Going back to the tests, I conclude that the response payload of the web server built into MongoDB is limited to roughly 38-40 MB. Let me show my analysis.
I created a collection with 1,260,000 documents. A query in the REST web interface returns total_rows : 379,677, i.e. avgObjSize * total_rows ≈ 99.29 bytes * 379,677 ≈ 38 MB.
> db.manyrows.stats()
{
  "ns" : "forum.manyrows",
  "count" : 1260000,
  "size" : 125101640,
  "avgObjSize" : 99.28701587301587,
  "storageSize" : 174735360,
  "numExtents" : 12,
  "nindexes" : 1,
  "lastExtentSize" : 50798592,
  "paddingFactor" : 1,
  "systemFlags" : 1,
  "userFlags" : 0,
  "totalIndexSize" : 48753488,
  "indexSizes" : {
    "_id_" : 48753488
  },
  "ok" : 1
}
======
web output
{"total_rows" : 379677 , "query" : {} , "millis" : 6793}
Continuing: I dropped/removed some documents from the collection so that the data fits into about 38 MB. A new query then returns all documents, i.e. 379,678 of 379,678, or roughly 38 MB.
> db.manyrows.stats()
{
  "ns" : "forum.manyrows",
  "count" : 379678,
  "size" : 38172128,
  "avgObjSize" : 100.53816128403543,
  "storageSize" : 174735360,
  "numExtents" : 12,
  "nindexes" : 1,
  "lastExtentSize" : 50798592,
  "paddingFactor" : 1,
  "systemFlags" : 1,
  "userFlags" : 0,
  "totalIndexSize" : 12329408,
  "indexSizes" : {
    "_id_" : 12329408
  },
  "ok" : 1
}
===
web output
{"total_rows" : 379678 , "query" : {} , "millis" : 27325}
Another sample with a different collection: the result is again capped at about 39 MB ("avgObjSize" : 3440.35 * "total_rows" : 11395 ≈ 39 MB).
> db.messages.stats()
{
  "ns" : "enron.messages",
  "count" : 120477,
  "size" : 414484160,
  "avgObjSize" : 3440.3592386928626,
  "storageSize" : 518516736,
  "numExtents" : 14,
  "nindexes" : 2,
  "lastExtentSize" : 140619776,
  "paddingFactor" : 1,
  "systemFlags" : 1,
  "userFlags" : 1,
  "totalIndexSize" : 436434880,
  "indexSizes" : {
    "_id_" : 3924480,
    "body_text" : 432510400
  },
  "ok" : 1
}
=== web output:
{
  "total_rows" : 11395,
  "query" : {},
  "millis" : 2956
}
Instead of the built-in REST interface, you can try serving the query yourself with a microframework like Bottle.
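The answer suggests Bottle (Python); as a rough equivalent in Node.js, using Express and the official mongodb driver (both are assumptions, not part of the answer), the same idea of serving your own endpoint without the ~38 MB cap looks like this:
// Minimal sketch: expose the collection through your own endpoint instead of MongoDB's built-in REST interface
const express = require('express');
const { MongoClient } = require('mongodb');

const app = express();
const client = new MongoClient('mongodb://127.0.0.1:27017');

app.get('/db/collection', async (req, res) => {
  const docs = await client.db('db').collection('collection')
    .find({})
    .limit(parseInt(req.query.limit, 10) || 0) // limit=0 means "no limit", as in the question
    .toArray();
  res.json({ total_rows: docs.length, rows: docs });
});

// Only start listening once the MongoDB connection is ready
client.connect().then(() => app.listen(3000));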