MongoDB geospatial query with sort - performance issues

I have a query which is very slow (~2.5 s):
db.markers.find({ latlng: { '$within': { '$box': [ [ -16, -140 ], [ 75, 140 ] ] } } }).sort({_id: -1}).limit(1000)
When I run explain for this query I get
{
"cursor" : "GeoBrowse-box",
"isMultiKey" : false,
"n" : 1000,
"nscannedObjects" : 242331,
"nscanned" : 242331,
"nscannedObjectsAllPlans" : 242331,
"nscannedAllPlans" : 242331,
"scanAndOrder" : true,
"indexOnly" : false,
"nYields" : 1383,
"nChunkSkips" : 0,
"millis" : 2351,
"indexBounds" : {
"latlng" : [ ]
},
"lookedAt" : NumberLong(262221),
"matchesPerfd" : NumberLong(242331),
"objectsLoaded" : NumberLong(242331),
"pointsLoaded" : NumberLong(0),
"pointsSavedForYield" : NumberLong(0),
"pointsChangedOnYield" : NumberLong(0),
"pointsRemovedOnYield" : NumberLong(0),
"server" : "xx:27017"
}
When I remove sort({_id: -1}), explain gives me (fast query, 5 ms):
{
"cursor" : "GeoBrowse-box",
"isMultiKey" : false,
"n" : 1000,
"nscannedObjects" : 1000,
"nscanned" : 1000,
"nscannedObjectsAllPlans" : 1000,
"nscannedAllPlans" : 1000,
"scanAndOrder" : false,
"indexOnly" : false,
"nYields" : 0,
"nChunkSkips" : 0,
"millis" : 5,
"indexBounds" : {
"latlng" : [ ]
},
"lookedAt" : NumberLong(1000),
"matchesPerfd" : NumberLong(1000),
"objectsLoaded" : NumberLong(1000),
"pointsLoaded" : NumberLong(0),
"pointsSavedForYield" : NumberLong(0),
"pointsChangedOnYield" : NumberLong(0),
"pointsRemovedOnYield" : NumberLong(0),
"server" : "xx:27017"
}
I have a 2d index on latlng, a descending index on _id, and a compound index:
db.markers.ensureIndex({latlng: '2d', _id:-1})
db.markers.ensureIndex({ latlng: '2d' })
db.markers.ensureIndex({ _id: -1 })
What I want to achieve is to get markers from a particular area, sorted from newest to oldest.
Any ideas or suggestions on how to get well under 2.5 seconds?
If someone wants to run their own tests:
for (var i = 0; i < 260000; i++) {
    var lat = parseFloat(Math.min(-90 + (Math.random() * 180), 90).toFixed(6));
    var lng = parseFloat(Math.min(-180 + (Math.random() * 360), 180).toFixed(6));
    collection.insert({latlng: [lat, lng]}, function () {});
}
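The generation loop above can be wrapped in a small self-contained helper (hypothetical name `randomMarkers`, no driver needed) to sanity-check the generated ranges before loading the data:

```javascript
// Generate `n` random [lat, lng] pairs the same way as the benchmark loop does.
// `randomMarkers` is a hypothetical helper name, not part of any driver API.
function randomMarkers(n) {
  var docs = [];
  for (var i = 0; i < n; i++) {
    var lat = parseFloat(Math.min(-90 + (Math.random() * 180), 90).toFixed(6));
    var lng = parseFloat(Math.min(-180 + (Math.random() * 360), 180).toFixed(6));
    docs.push({ latlng: [lat, lng] });
  }
  return docs;
}

// Every generated point should fall inside the 2d index's default bounds.
var docs = randomMarkers(1000);
var inBounds = docs.every(function (d) {
  return d.latlng[0] >= -90 && d.latlng[0] <= 90 &&
         d.latlng[1] >= -180 && d.latlng[1] <= 180;
});
```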
collection.find({ latlng: { '$within': { '$box': [ [ -90, -180 ], [ 90, 180 ] ] } } }, {latlng: 1, _id: 1 }).sort({_id: -1}).limit(1000).explain()
On my local machine I get (~2.6 s):
{
"cursor" : "GeoBrowse-box",
"isMultiKey" : false,
"n" : 1000,
"nscannedObjects" : 260000,
"nscanned" : 260000,
"nscannedObjectsAllPlans" : 260000,
"nscannedAllPlans" : 260000,
"scanAndOrder" : true,
"indexOnly" : false,
"nYields" : 1612,
"nChunkSkips" : 0,
"millis" : 2613,
"indexBounds" : {
"latlng" : [ ]
},
"lookedAt" : NumberLong(260000),
"matchesPerfd" : NumberLong(260000),
"objectsLoaded" : NumberLong(260000),
"pointsLoaded" : NumberLong(0),
"pointsSavedForYield" : NumberLong(0),
"pointsChangedOnYield" : NumberLong(0),
"pointsRemovedOnYield" : NumberLong(0),
"server" : "xx:27017"
}
Thanks.

Do you actually have the following three indexes defined on your collection?
db.markers.ensureIndex({ latlng: '2d', _id:-1 })
db.markers.ensureIndex({ latlng: '2d' })
db.markers.ensureIndex({ _id: -1 })
The geospatial indexing docs advise against creating multiple geo indexes on the same collection. Although MongoDB will allow it, the behavior may be undesirable. My guess for your case is that the non-compound {latlng: '2d'} index may have been selected instead of the compound one. The explain() output doesn't really help us here, since it simply reports GeoBrowse-box instead of the index name; however, I would suggest manually hinting that the cursor use the compound index and seeing if the results improve, for example:
db.markers.find({ latlng: { $within: { $box: [[-16, -140], [75, 140]] }}}).hint({ latlng: '2d', _id: -1 }).sort({_id: -1}).limit(1000).explain()
Alternatively, simply get rid of the non-compound index, so that {latlng: '2d', _id: -1} becomes the obvious and only choice for the query optimizer.
Lastly, the {_id: -1} index is redundant and can be removed (e.g. db.markers.dropIndex({ _id: -1 })). Per the compound index documentation, direction is only relevant for indexes comprised of multiple fields; for a single-key index, MongoDB can walk the index backwards or forwards easily enough. Since MongoDB already creates an {_id: 1} index for us by default, it's more efficient to simply rely on that.
Now, with indexing out of the way: one caveat with your query is that limits are applied to the geospatial query component before sorting by non-geo criteria (_id in your case). I believe this means that, while your results will indeed be sorted by _id, that sort may not be considering all documents within the matched bounds. This is mentioned in the compound index bit of the documentation, which references SERVER-4247 as a pending solution.
Edit: Following up with your benchmark
I populated the example data: 260k random points with latitudes between ±90 and longitudes between ±180. I then ran your query:
db.markers.find(
{ latlng: { $within: { $box: [[-90, -180], [90, 180]] }}},
{ latlng: 1, _id: 1 }
).sort({_id: -1}).limit(1000).explain()
That took 1713ms (I'll use that as a baseline of comparison instead of your time of 2351ms). I'll also note that the query matched all 260k documents, and scanned the same number of index entries. It appears the limit didn't factor in until the _id sort, which is not what I would have expected based on the note here. I then tweaked the query a bit to examine some other cases:
Original query without the _id sort and limit: nscanned is 260k and time is 1470ms.
Original query without the _id sort: nscanned is 1000 and time is 9ms.
Original query without the limit: nscanned is 260k and time is 2567ms.
I also wanted to test sorting on an unindexed field alone to simulate what might happen for the _id sort after a geo match; however, I couldn't use _id since the default index will always exist. To do this, I deleted the compound geo index and then sorted by the latlng object. This resulted in nscanned of 260k and a time of 1039ms. If I add a limit of 1000, the time was 461ms.
If we add that to the 1470ms above (geo query without a sort and limit), it's very close to the original query without a limit, which was 2567ms. Likewise, if we add 461ms (limited sort) to 1470ms, it's near the original benchmark result of 1713ms. Based on that correlation, I'd wager that the _id sort in your benchmark isn't taking advantage of the compound index at all.
In any event, one other reason the benchmark is slow is due to a very wide geo match. Tighter bounds would definitely result in less data to sort, even with that sort being unindexed. That said, I do think SERVER-4247 would help you, since it would likely process the non-geo sort first before performing the geo match.
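As a sanity check on that reasoning, here is a toy in-memory simulation (not the server's actual execution, just the two access patterns): a sort applied after a wide geo match must touch every matching document before the limit, while an index-ordered walk can stop as soon as the limit is reached.

```javascript
// Toy model: documents with an increasing _id, all matched by a very wide box.
var N = 260000, LIMIT = 1000;
var docs = [];
for (var i = 0; i < N; i++) docs.push({ _id: i });

// scanAndOrder pattern: collect every match, sort by _id desc, then limit.
var scannedSortAfter = 0;
var matched = docs.filter(function (d) { scannedSortAfter++; return true; });
matched.sort(function (a, b) { return b._id - a._id; });
var topSorted = matched.slice(0, LIMIT);

// Index-ordered pattern (what an {_id: -1}-driven plan could do): stop at LIMIT.
var scannedIndexWalk = 0, topWalked = [];
for (var j = N - 1; j >= 0 && topWalked.length < LIMIT; j--) {
  scannedIndexWalk++;
  topWalked.push(docs[j]);
}

// Same 1000 results either way, but scannedSortAfter is 260000
// while scannedIndexWalk is only 1000.
```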

Are your indexes using compound keys?
db.markers.ensureIndex({latlng: '2d', _id:-1})

Related

Nifi: MergeRecord doesn't wait and group up json files to one batch

I ran into a problem with Apache NiFi.
I have about 100,000+ JSON files that look like this:
[ {
"client_customer_id" : 8385419410,
"campaign_id" : "11597209433",
"resourceName" : "customers/8385419410/adGroupAds/118322191652~479093457035",
"campaign" : "11597209433",
"clicks" : "0",
"topImpressionPercentage" : 1,
"videoViews" : "0",
"conversionsValue" : 0,
"conversions" : 0,
"costMicros" : "0",
"ctr" : 0,
"currentModelAttributedConversions" : 0,
"currentModelAttributedConversionsValue" : 0,
"engagements" : "0",
"absoluteTopImpressionPercentage" : 1,
"activeViewImpressions" : "0",
"activeViewMeasurability" : 0,
"activeViewMeasurableCostMicros" : "0",
"activeViewMeasurableImpressions" : "0",
"allConversionsValue" : 0,
"allConversions" : 0,
"averageCpm" : 0,
"gmailForwards" : "0",
"gmailSaves" : "0",
"gmailSecondaryClicks" : "0",
"impressions" : "2",
"interactionRate" : 0,
"interactions" : "0",
"status" : "ENABLED",
"ad.resourceName" : "customers/8385419410/ads/479093457035",
"ad.id" : "479093457035",
"adGroup" : "customers/8385419410/adGroups/118322191652",
"device" : "DESKTOP",
"date" : "2020-11-25"
} ]
Before saving them to the database one by one, I want to batch 1,000-10,000 elements into one JSON file and then save that to the DB to increase speed.
MergeRecord settings:
What I expected: MergeRecord waits some time to group the JSON files into a batch of 1,000-10,000 elements in one JSON, and then sends this batch to the PutDatabaseRecord processor.
Actual behaviour: MergeRecord instantly sends the JSONs to PutDatabaseRecord one by one, without grouping and joining them.
About 1 in 10 flow files contains several JSON files merged into one, as you can see from their sizes on the screenshot. But it seems these processor settings don't apply to all files:
I don't understand where the problem is: the MergeRecord settings or the JSON files? This is really slow, and storing my data (1.5 GB) will probably take a day.
The only way I could replicate this was to use a random table.name for each of the flow files, which causes each file to be placed in its own bin, rapidly overfilling your "Maximum Number of Bins" and causing each file to be sent on as a separate flow file. If you have more than 10 tables, I would increase that setting.
My only other suggestion would be to play around with the Run Schedule and Run Duration of the MergeRecord Processor (on the scheduling tab). If you set the run schedule to 2 minutes (for example), the processor will run once every two minutes and try to merge as many of the files in the queue as it can.
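For reference, here is a sketch of MergeRecord settings that would batch 1,000-10,000 records and also flush bins on a timer (the property names are as they appear in the NiFi UI; the values are assumptions you should adapt):

```
Merge Strategy            : Bin-Packing Algorithm
Minimum Number of Records : 1000
Maximum Number of Records : 10000
Max Bin Age               : 2 min   <- forces a bin to flush even if under the minimum
Maximum Number of Bins    : 100     <- raise this if you merge many distinct tables
```

Max Bin Age in particular gives you the "wait some time, then send whatever has accumulated" behaviour described in the question.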

Performance - ID for Mongo: BSON or String

Background
I was doing some tests to see which would be the best for a primary key. I assumed that BSON would be better than a string. When I run some tests though, I'm getting about the same results. Am I doing something wrong here or can someone confirm that this is correct?
About my tests
I have created 200k records with 2 Mongoid models. I ran everything under Ruby's Benchmark. I did three main queries: a find(id) query, a where(id: id) query, and a where(:id.in => array_of_ids) query. All of them gave me pretty similar response times.
Benchmark.bm(10) do |x|
x.report("String performance") { 100.times { ModelString.where(id: '58205ae41d41c81c5a0289e5').pluck(:id) } }
x.report("BSON performance") { 100.times { ModelBson.where(id: '581a1d271d41c82fc3030a34').pluck(:id) } }
end
Here are my models in Mongoid:
class ModelBson
include Mongoid::Document
end
class ModelString
include Mongoid::Document
field :_id, type: String, pre_processed: true, default: ->{ BSON::ObjectId.new.to_s }
end
Benchmark Results
ID miss "find" query
user system total real
String performance 0.140000 0.070000 0.210000 ( 2.187263)
BSON performance 0.280000 0.060000 0.340000 ( 2.308928)
ID hit "find" query
user system total real
String performance 0.280000 0.060000 0.340000 ( 2.392995)
BSON performance 0.190000 0.060000 0.250000 ( 2.245230)
100 IDs "in" query hit
String performance 0.850000 0.110000 0.960000 ( 9.221822)
BSON performance 0.770000 0.060000 0.830000 ( 8.055971)
db.collection.stats
{
"ns" : "model_bsons",
"count" : 199221,
"size" : 9562704,
"avgObjSize" : 48,
"numExtents" : 7,
"storageSize" : 22507520,
"lastExtentSize" : 11325440,
"paddingFactor" : 1,
"paddingFactorNote" : "paddingFactor is unused and unmaintained in 3.0. It remains hard coded to 1.0 for compatibility only.",
"userFlags" : 1,
"capped" : false,
"nindexes" : 1,
"indexDetails" : {
},
"totalIndexSize" : 6475392,
"indexSizes" : {
"_id_" : 6475392
},
"ok" : 1
}
{
"ns" : "model_strings",
"count" : 197680,
"size" : 9488736,
"avgObjSize" : 48,
"numExtents" : 7,
"storageSize" : 22507520,
"lastExtentSize" : 11325440,
"paddingFactor" : 1,
"paddingFactorNote" : "paddingFactor is unused and unmaintained in 3.0. It remains hard coded to 1.0 for compatibility only.",
"userFlags" : 1,
"capped" : false,
"nindexes" : 1,
"indexDetails" : {
},
"totalIndexSize" : 9304288,
"indexSizes" : {
"_id_" : 9304288
},
"ok" : 1
}
This is correct.
As you can see from the collection stats, documents in both collections have the same average size (avgObjSize of 48 bytes), so document size alone doesn't separate a BSON ObjectId from its string representation here.
What really matters is the index size. Notice that the index on the BSON collection is about 30% smaller than the one on the String collection: a BSON ObjectId is stored as 12 raw bytes, while its string form is 24 hex characters, and the ObjectId keys also take better advantage of index prefix compression. With 200,000 documents the index size difference is too small to produce a visible performance change, but I'd guess that increasing the number of documents would show different results.
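The size gap between the two key types is easy to see outside the database: the string form of an ObjectId is its 24-character hex encoding, twice the 12 raw bytes the BSON ObjectId occupies. A standalone sketch (no driver involved; `hexToBytes` is just an illustrative helper):

```javascript
// A BSON ObjectId is 12 raw bytes; its string form is the 24-char hex encoding.
var idString = '58205ae41d41c81c5a0289e5';

// Decode the hex string back into bytes to compare the stored sizes.
function hexToBytes(hex) {
  var bytes = [];
  for (var i = 0; i < hex.length; i += 2) {
    bytes.push(parseInt(hex.slice(i, i + 2), 16));
  }
  return bytes;
}

var rawBytes = hexToBytes(idString);
// idString.length is 24 (what a String _id index stores per key),
// rawBytes.length is 12 (what an ObjectId _id index stores per key).
```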

MongoDB count query performance

I have a problem with count performance in MongoDB.
I'm using ZF2 and Doctrine ODM with the SoftDelete filter. When I query the collection for the first time with db.getCollection('order').count({"deletedAt": null}), it takes about 30 seconds, sometimes even more. Subsequent queries take about 150 ms, but after a few minutes a query takes about 30 seconds again. This happens only on collections larger than 700 MB.
The server is an Amazon EC2 t2.medium instance running MongoDB 3.0.1.
Maybe it's similar to MongoDB preload documents into RAM for better performance, but those answers do not solve my problem.
Any ideas what is going on?
Edit: explain output:
{
"executionSuccess" : true,
"nReturned" : 111449,
"executionTimeMillis" : 24966,
"totalKeysExamined" : 0,
"totalDocsExamined" : 111449,
"executionStages" : {
"stage" : "COLLSCAN",
"filter" : {
"$and" : []
},
"nReturned" : 111449,
"executionTimeMillisEstimate" : 281,
"works" : 145111,
"advanced" : 111449,
"needTime" : 1,
"needFetch" : 33660,
"saveState" : 33660,
"restoreState" : 33660,
"isEOF" : 1,
"invalidates" : 0,
"direction" : "forward",
"docsExamined" : 111449
},
"allPlansExecution" : []
}
The count has to examine every matching document (the explain output shows a COLLSCAN), which is what creates the performance issue.
Care about the precise number only when it's small: the user wants to know whether there are 100 results or 500. But once it goes beyond, say, 10,000, you can just tell them 'More than 10,000 results found'. Capping the count keeps the scan bounded:
db.getCollection('order').find({"deletedAt": null}).limit(10000).count(true)

can't match digits in haystack elastic search

I have some products that I'm indexing with names like "99% chocolate". If I search for "chocolate", this item matches, but if I search for "99", it doesn't. I came across Using django haystack autocomplete with elasticsearch to search for digits/numbers?, which describes the same issue, but nobody has answered it. Can someone please help?
Edit 2: I'm sorry, I neglected to include an important detail. The numeric search itself works, but the autocomplete doesn't. The relevant lines:
#the relevant line in my index
name_auto = indexes.EdgeNgramField(model_attr='name')
#the relevant line in my view
prodSqs = SearchQuerySet().models(Product).autocomplete(name_auto=request.GET.get('q', ''))
Edit: here are the results of running the analyser:
curl -XGET 'localhost:9200/haystack/_analyze?analyzer=standard&pretty' -d '99% chocolate'
{
"tokens" : [ {
"token" : "99",
"start_offset" : 0,
"end_offset" : 2,
"type" : "<NUM>",
"position" : 1
}, {
"token" : "chocolate",
"start_offset" : 4,
"end_offset" : 13,
"type" : "<ALPHANUM>",
"position" : 2
} ]
}
I finally found the answer here: ElasticSearch: EdgeNgrams and Numbers.
Add the following classes and, under HAYSTACK_CONNECTIONS in your settings file, change the ENGINE to the CustomElasticsearchSearchEngine below instead of the default haystack one:
from haystack.backends.elasticsearch_backend import ElasticsearchSearchBackend, ElasticsearchSearchEngine

class CustomElasticsearchBackend(ElasticsearchSearchBackend):
    """
    The default ElasticsearchSearchBackend settings don't tokenize strings of digits
    the same way as words, so they get lost: the lowercase tokenizer is the culprit.
    Switching to the standard tokenizer and doing the case-insensitivity in the
    filter seems to do the job.
    """
    def __init__(self, connection_alias, **connection_options):
        # see https://stackoverflow.com/questions/13636419/elasticsearch-edgengrams-and-numbers
        self.DEFAULT_SETTINGS['settings']['analysis']['analyzer']['edgengram_analyzer']['tokenizer'] = 'standard'
        self.DEFAULT_SETTINGS['settings']['analysis']['analyzer']['edgengram_analyzer']['filter'].append('lowercase')
        super(CustomElasticsearchBackend, self).__init__(connection_alias, **connection_options)

class CustomElasticsearchSearchEngine(ElasticsearchSearchEngine):
    backend = CustomElasticsearchBackend
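With those two tweaks applied, the resulting edgengram_analyzer definition in the index settings would look roughly like this (a sketch based on haystack's default settings; the haystack_edgengram filter name comes from haystack itself):

```
"analysis": {
  "analyzer": {
    "edgengram_analyzer": {
      "type": "custom",
      "tokenizer": "standard",
      "filter": ["haystack_edgengram", "lowercase"]
    }
  }
}
```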
Running your string 99% chocolate through the standard analyser gives the right results (99 is a term of its own), so if you're not using it currently, you should switch to it.
curl -XGET 'localhost:9200/myindex/_analyze?analyzer=standard&pretty' -d '99% chocolate'
{
"tokens" : [ {
"token" : "99",
"start_offset" : 0,
"end_offset" : 2,
"type" : "<NUM>",
"position" : 1
}, {
"token" : "chocolate",
"start_offset" : 4,
"end_offset" : 13,
"type" : "<ALPHANUM>",
"position" : 2
} ]
}

Error fetching data from table in Presto HIVE_CURSOR_ERROR

We are using Presto (0.69) with the server and client on a single node.
We use the Hive catalog, with tables in ORC format containing 350,000,000 rows.
While running the query select column1 from ORC_Table1 where column2=123456789, we get HIVE_CURSOR_ERROR.
The datatype of column2 is int.
Below is the error stack:
"failures" : [ {
"type" : "com.facebook.presto.spi.PrestoException",
"message" : "Read past end of RLE integer from compressed stream Stream for column 2 kind DATA position: 477741 length: 477741 range: 0 offset: 478409 limit: 478409 range 0 = 0 to 477741 uncompressed: 212681 to 212681",
"cause" : {
"type" : "java.io.EOFException",
"message" : "Read past end of RLE integer from compressed stream Stream for column 2 kind DATA position: 477741 length: 477741 range: 0 offset: 478409 limit: 478409 range 0 = 0 to 477741 uncompressed: 212681 to 212681",
"suppressed" : [ ],
"stack" : [ "org.apache.hadoop.hive.ql.io.orc.RunLengthIntegerReaderV2.readValues(RunLengthIntegerReaderV2.java:46)", "org.apache.hadoop.hive.ql.io.orc.RunLengthIntegerReaderV2.next(RunLengthIntegerReaderV2.java:287)", "org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl$LongTreeReader.next(RecordReaderImpl.java:473)", "org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl$StructTreeReader.next(RecordReaderImpl.java:1157)", "org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.next(RecordReaderImpl.java:2196)", "org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$OrcRecordReader.next(OrcInputFormat.java:106)", "org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$OrcRecordReader.next(OrcInputFormat.java:57)", "com.facebook.presto.hive.GenericHiveRecordCursor.advanceNextPosition(GenericHiveRecordCursor.java:241)", "ScanFilterAndProjectOperator_11.filterAndProjectRowOriented(Unknown Source)", "com.facebook.presto.operator.AbstractScanFilterAndProjectOperator.getOutput(AbstractScanFilterAndProjectOperator.java:177)", "com.facebook.presto.operator.Driver.process(Driver.java:329)", "com.facebook.presto.operator.Driver.processFor(Driver.java:271)", "com.facebook.presto.execution.SqlTaskExecution$DriverSplitRunner.processFor(SqlTaskExecution.java:674)", "com.facebook.presto.execution.TaskExecutor$PrioritizedSplitRunner.process(TaskExecutor.java:443)", "com.facebook.presto.execution.TaskExecutor$Runner.run(TaskExecutor.java:577)", "java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)", "java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)", "java.lang.Thread.run(Thread.java:745)" ]
},
"suppressed" : [ ],
"stack" : [ "com.facebook.presto.hive.GenericHiveRecordCursor.advanceNextPosition(GenericHiveRecordCursor.java:257)", "ScanFilterAndProjectOperator_11.filterAndProjectRowOriented(Unknown Source)", "com.facebook.presto.operator.AbstractScanFilterAndProjectOperator.getOutput(AbstractScanFilterAndProjectOperator.java:177)", "com.facebook.presto.operator.Driver.process(Driver.java:329)", "com.facebook.presto.operator.Driver.processFor(Driver.java:271)", "com.facebook.presto.execution.SqlTaskExecution$DriverSplitRunner.processFor(SqlTaskExecution.java:674)", "com.facebook.presto.execution.TaskExecutor$PrioritizedSplitRunner.process(TaskExecutor.java:443)", "com.facebook.presto.execution.TaskExecutor$Runner.run(TaskExecutor.java:577)", "java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)", "java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)", "java.lang.Thread.run(Thread.java:745)" ],
"errorCode" : {
"code" : 16777217,
"name" : "HIVE_CURSOR_ERROR"
}
The query runs fine on a table containing only a few rows.
Can anyone help me sort this out?
Below is the config.properties:
coordinator=true
node-scheduler.include-coordinator=true
http-server.http.port=8080
task.max-memory=1GB
discovery-server.enabled=true
discovery.uri=http://172.168.1.99:8080
Can Hive read this table? If it can, this is likely a bug that has been fixed in a newer version of the Hive libraries than the one Presto is using, and you will need to wait until Presto upgrades to the newest Hive release. If Hive cannot read the table, the file is either corrupt or there is still a bug in the ORC reader.