How to Quickly Update Mongo Documents' String Fields with Complex Functions - performance

What is the fastest way to update documents in a Mongo database with complex functions, let's say a string search / replace or a sqrt calculation?
Since such operations are missing (there is no $replace, for example), this isn't possible with a plain update, which would probably be the fastest approach: on my test collection it takes only about 50 ms to set a field on some 100k documents.
When I simply iterate over all documents it takes about 45 seconds. It gets a little faster when I limit my query to just the fields I use during the update.
This time of course grows on larger collections, hence the question: is there a faster way than iterating over the collection (e.g. via a map-reduce job)?

No :) Without native support for such functionality you'll be stuck with a read -> modify -> write approach. That said, if the machine can set a field on 100k objects in 50 ms, a read-modify-write pass over those same documents shouldn't take anywhere near 45 seconds. Are you sure the bottleneck is the database rather than the machine running that pass? And are you sure you're batching appropriately and not doing an update per document?
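For illustration, here is a minimal sketch of such a batched pass using the current MongoDB Node.js driver; the connection string, collection name, field name and the replace() transformation are placeholders, not from the question:

// Batched read -> modify -> write: compute the new value client-side and
// flush the updates with bulkWrite() in chunks of 1000 instead of issuing
// one update per document. All names below are illustrative.
const { MongoClient } = require('mongodb');

async function replaceInField() {
  const client = new MongoClient('mongodb://localhost:27017');
  await client.connect();
  const coll = client.db('test').collection('items');

  let ops = [];
  const cursor = coll.find({}, { projection: { myField: 1 } }); // only fetch what we need
  for await (const doc of cursor) {
    ops.push({
      updateOne: {
        filter: { _id: doc._id },
        update: { $set: { myField: doc.myField.replace('foo', 'bar') } },
      },
    });
    if (ops.length === 1000) {
      await coll.bulkWrite(ops, { ordered: false });
      ops = [];
    }
  }
  if (ops.length > 0) {
    await coll.bulkWrite(ops, { ordered: false });
  }
  await client.close();
}

replaceInField().catch(console.error);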

Related

CosmosDB - Gremlin - high memory usage with query containing limit() step

I want to retrieve a large number of items, but using a limit clause:
g.V().hasLabel('foo').as('f').limit(5000).order().by('f_Id',incr).by('f_bar',incr).select('f').unfold().dedup()
This query takes a very long time and consumes about 800 MB of memory to download the collection.
When I use the query below:
g.V().hasLabel('foo').as('f').has('propA','ValueA').has('propB','ABC').limit(5000).order().by('f_Id',incr).by('f_bar',incr).select('f').unfold().dedup()
it is faster and consumes less memory, around 500 MB, to download this collection, but that is still high.
My question is how to optimize the first query, with just limit(), if I do not want to filter by properties A and B.
Second question: why is there such a difference in memory usage between those two results? In both queries I download 5000 items into memory. What would be a possible way to reduce this consumption?
I use the Gremlin driver for .NET.
I'm no expert at CosmosDB optimization, but from a Gremlin perspective, when I look at this traversal:
g.V().hasLabel('foo').as('f').
limit(5000).order().by('f_Id',incr).by('f_bar',incr).
select('f').unfold().dedup()
I wonder why you wouldn't just write it as:
g.V().hasLabel('foo').limit(5000).order().by('f_Id',incr).by('f_bar',incr)
Meaning, you want 5000 "foo" vertices ordered a certain way. The "f" step label and unfold() seem unnecessary, and I don't see how you could end up with duplicates, so you can drop dedup() as well. I'm not sure whether those changes will make any difference to how CosmosDB processes things, but they certainly remove some unneeded processing.
I'd also wonder whether you need to pare down the data returned in your vertices. Right now you're returning all the properties of each vertex. If you don't need all of them, it might be better to be more specific and transform the data into the form your application requires:
g.V().hasLabel('foo').limit(5000).order().by('f_Id',incr).by('f_bar',incr).
valueMap('name','age')
That should help reduce serialization costs.

Collecting large statistical sets with pg_stat_statements?

According to Postgres pg_stat_statements documentation:
The module requires additional shared memory proportional to
pg_stat_statements.max. Note that this memory is consumed whenever the
module is loaded, even if pg_stat_statements.track is set to none.
and also:
The representative query texts are kept in an external disk file, and
do not consume shared memory. Therefore, even very lengthy query texts
can be stored successfully. However, if many long query texts are
accumulated, the external file might grow unmanageably large.
From these it is unclear what the actual memory cost of a high pg_stat_statements.max would be, say at 100k or 500k (the default is 5k). Is it safe to set the value that high, and what could be the negative repercussions of such high levels? Would aggregating statistics into an external database via logstash/fluentd be a preferable approach above certain sizes?
1.
From what I have read, it hashes the query and keeps the hash (plus the counters) in shared memory, saving the query text to disk. So the following concern is more to be expected than overloaded shared memory:
if many long query texts are accumulated, the external file might grow
unmanageably large
The hash of the text is so much smaller than the text itself that I don't think you should worry about the extension's memory consumption even for long queries, especially since the extension uses the query analyser (which runs for EVERY query anyway):
the queryid hash value is computed on the post-parse-analysis
representation of the queries
Setting pg_stat_statements.max 10 times higher should take about 10 times more shared memory, I believe. The growth should be linear. The documentation does not say so explicitly, but logically it should be the case.
There is no definite answer on whether a particular value is safe, because there is no data on your other configuration values and the hardware you have. But as growth should be linear, consider it this way: if you set it to 5k and query runtime grew by almost nothing, then setting it to 50k will prolong it by almost nothing times ten. BTW, my question: who is going to dig through 50,000 slow statements? :)
2.
This extension already pre-aggregates statistics for normalized statements (with the constants stripped out). You can select them straight from the database, so moving the data to another database and querying it there only gives you the benefit of unloading the original DB and loading another one. In other words, you save 50 MB worth of querying on the original, but spend the same on the other. Does it make sense? For me, yes. This is what I do myself, but I also save execution plans for statements (which is not part of the pg_stat_statements extension). Whether it's worth it depends on what you have; there is definitely no need for it just because of the number of queries. Again, unless the external query-text file grows so large that the extension has to act on it:
As a recovery method if that happens, pg_stat_statements may choose to
discard the query texts, whereupon all existing entries in the
pg_stat_statements view will show null query fields
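For illustration, here is a minimal collection-script sketch, assuming the node-postgres ("pg") package and PostgreSQL 13+ column names (older versions use total_time instead of total_exec_time); it just pulls the aggregated rows so they can be shipped to whatever external store you aggregate into:

// Export the top entries from pg_stat_statements for external aggregation.
// Connection string, LIMIT and the chosen columns are illustrative.
const { Client } = require('pg');

async function exportTopStatements() {
  const client = new Client({ connectionString: process.env.DATABASE_URL });
  await client.connect();
  try {
    const { rows } = await client.query(
      `SELECT queryid, calls, rows, total_exec_time, mean_exec_time, query
         FROM pg_stat_statements
        ORDER BY total_exec_time DESC
        LIMIT 100`
    );
    // Hand the rows to logstash/fluentd, or insert them into another database.
    console.log(JSON.stringify(rows));
  } finally {
    await client.end();
  }
}

exportTopStatements().catch(console.error);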

Parse: replacing a large number (several thousands) of records

I've got a class in Parse with 1-4k records per user. These need to be replaced from time to time (they are actually records representing multiple timetables).
The problem I'm facing is that deleting and inserting these records takes a ton of requests. Is there a method to delete and insert a bunch of records that counts as one request? Maybe it's possible from Cloud Code?
I tried compacting all this data into one record, but then I hit the size limit for records (128 KB). Using any sub-format (like a DB or file inside a record) would be really tedious, because the app targets nearly all platforms supported by Parse.
EDIT
For clarification, the problem isn't the limit on saveAll/destroyAll. My problem is the requests-per-second limit (or rather, as the docs state, requests per minute).
Also, I just checked that requests from Cloud Code also seem to count towards that limit.
A possible solution would be to redesign my datasets and use Array columns or something, but I'd rather avoid that if possible.
I think you could try Parse.Object.saveAll, which batch-processes the save() function.
Docs: https://www.parse.com/docs/js/api/symbols/Parse.Object.html#.saveAll
Guide: https://parse.com/questions/parseobjectsaveall-performances
I would use saveAll/destroyAll (or deleteAll?) and anything -All that Parse provides in its SDK.
You'd still hit the 1000-object limit per query, but to counter that you can loop using the .skip property of a query.
Set a limit of 1000 and a skip of 0, run the query, then increase the skip value by the previous limit, and so on; you'd end up with 2 or 3 requests of 1000 objects each. You stop the loop when the result count is smaller than your limit; if it isn't, you query again with the skip set to the limit times the loop count. A sketch of this loop is shown below.
Now, since you say you're facing size issues, maybe you can reduce that query limit to, say, 400, and your loop will simply run for longer, until the number of results is smaller than your limit (and then you can stop querying/limiting/skipping/looping or anything in -ing).
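For illustration, a minimal sketch of that loop using the JS SDK's promise interface inside Cloud Code (where the Parse object is already available); the "Timetable" class, the "owner" field and the page size are assumptions for the example, not from the question:

// Page through existing records with limit + skip, then use the batched
// -All helpers: roughly one request per 1000 objects instead of one per object.
const PAGE = 1000;

async function replaceTimetables(user, newObjects) {
  // Collect every existing record, one page at a time.
  const existing = [];
  let skip = 0;
  while (true) {
    const query = new Parse.Query('Timetable');
    query.equalTo('owner', user);
    query.limit(PAGE);
    query.skip(skip);
    const page = await query.find();
    existing.push(...page);
    if (page.length < PAGE) break; // fewer results than the limit: done
    skip += PAGE;
  }

  await Parse.Object.destroyAll(existing); // batched deletes

  // Batched inserts for the replacement records.
  for (let i = 0; i < newObjects.length; i += PAGE) {
    await Parse.Object.saveAll(newObjects.slice(i, i + PAGE));
  }
}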
Okay, so this isn't an answer to my question, but it's a solution to my problem, so I'm posting it.
My problem was storing and then replacing a large number of small records which add up to a significant size (up to 500 KB of JSON [~1.5 MB as XML] in my current plans).
So I've chosen a middle path: I implemented a sort of vertical partitioning.
What I have is a master User record which holds an array of pointers to another class (called Entries). Entries have only 2 fields: the ID of the school record, and data, which is of type Array.
I decided to split the "partitions" every 1000 records, which comes to roughly 60-70 KB per Entries record, though by my calculations it may go up to ~100 KB.
I also shortened the JSON field names to one letter, because every extra letter across 1000 records costs 1-2 KB, depending on the encoding.
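For illustration, a rough Cloud Code sketch of this partitioning scheme (the one-letter keys, field names and the helper itself are illustrative; only the 1000-record chunk size and the Entries class come from the description above):

// Split a flat list of timetable rows into Entries "partitions" of 1000
// rows each, with one-letter keys to keep every record well under the limit.
const CHUNK = 1000;

async function savePartitions(user, schoolId, rows) {
  const entries = [];
  for (let i = 0; i < rows.length; i += CHUNK) {
    const entry = new Parse.Object('Entries');
    entry.set('schoolId', schoolId);
    // One-letter keys, e.g. s = subject, t = time, r = room.
    entry.set('data', rows.slice(i, i + CHUNK).map(row => ({
      s: row.subject,
      t: row.time,
      r: row.room,
    })));
    entries.push(entry);
  }
  await Parse.Object.saveAll(entries); // a handful of batched requests in total
  user.set('entries', entries);        // array of pointers on the master record
  await user.save();
}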
That approach actually made my PHP code about twice as fast, and there is much less load on the network and the remote database (basically 1000 times fewer inserts/destroys).
So, that is my solution. If anybody has other ideas, please post them as an answer here, because I'm probably not the only one with this problem, and this certainly isn't the only solution.

MyBatis - Pulling out 44000 rows into 3000 objects

I am using the MyBatis 3.1.0 jar along with the Spring MyBatis jar.
It is taking me 16 seconds to pull the 44,000 rows out into 3,000 entity objects.
Time taken: the plain query execution alone takes 11 seconds.
Any suggestions for improving the performance?
Thanks.
First off, do some analysis to see how much of your time is spent (a) pulling out data, e.g. run the query in your standalone DB tool and benchmark that, and then (b) how much is spent marshalling the data into your objects.
Subsequent steps will depend on which of (a) or (b) appears least performant.
If (a), then spend time tuning the query - indexes on tables, denormalise underlying structure.
If (b) consider a flatter or less heavily populated model.
Edit:
Extra thoughts on the DB side to reduce the 11-second duration:
check your query is hitting indexes at every point and not doing any table scans
check your query pulls back the minimum set of fields, e.g. if you just need 8 fields, don't pull back 20
check your query isn't doing inefficient subqueries if you're on MySQL (are you?)
check your query isn't calling any unnecessary functions or subroutines (probably unlikely, but worth mentioning...)
Extra thoughts on the object side:
avoid setting object fields you don't need
Extra general thoughts:
add some logging for, say, a single case (or for 100 cases) to see where the time is spent in the application; general code-optimisation techniques apply here - anything that's in a loop, or isn't 100% necessary, look at changing or removing
if performance is the most important consideration, consider changing your requirement: structure your page or screen differently, for example, so that a faster DB retrieval becomes realistic
2nd edit: do you really need all 44,000 rows and the 3,000 objects they feed? Could you get by with fewer, e.g. break them into 10 groups and paginate through them? (Wild guess; your app might do nothing of that sort.)
Some measures in the MyBatis configuration for handling situations like these:
1) Set fetchSize to a considerable amount. I set my fetchSize to 1000.
2) Don't retrieve and map all columns (mapping only the necessary columns reduced the time).
3) Make use of nested results (joins) instead of nested selects.

What are the performance considerations when adding a large number of documents to a large Solr core?

If I have a Solr core with a half-dozen small fields that's loaded with 100 million documents, will adding a batch of 1 million documents run in a reasonable amount of time? How about 10 million? By reasonable, I'm thinking hours, rather than days. I've been told that this will take a long time to run. Is this really an issue? What are known strategies to improve performance? The fields are typically small, that is, 5-50 characters.
Two suggestions for improving performance, on top of those already mentioned in other answers (the first tried, the second still to be tried):
1) Decrease logging while updating: at INFO level Solr appends one log entry per document. See here for how we did it: http://dmitrykan.blogspot.fi/2011/01/solr-speed-up-batch-posting.html. Some people have reported a "3x speed increase".
2) Set the number of segments in solrconfig.xml to something very large for indexing, like 10000. Once the batch indexing is complete, change the parameter value back to something reasonably low, like 10.
This is a very "tricky" question whose answer differs from schema to schema.
Your solr installation has a half-dozen fields. But, how many are actually indexed? If only one field is indexed, then adding 1 million documents will be faster than adding 1 million docs when 6 fields are indexed.
I think the type of fields that are indexed also matters. A field that is of the type "text_general" is broken down into tokens while indexing whereas a field that is of the type "string" is not. "String" type is not analyzed and is stored as one complete token.
I have some very long fields which are indexed, and adding 2 million docs takes a few minutes (although my installation does not contain 100 million documents). So I do not think it will take days to add 10 million records to your installation.
I am not sure about this, but maybe the configuration of the machine running the Solr instance also matters, so you might need to check whether your CPU and memory can handle this much load.
It's up to you to decide whether a long-running data post is an issue or not. If your application is user-intensive, then I suggest some kind of master-slave configuration, so that users are not impacted by the high CPU usage while you post the data. One strategy I know of for improving performance is "sharding":
http://carsabi.com/car-news/2012/03/23/step-by-step-solr-sharding/
Alternatively, if it is possible to demarcate the records by some field, you could put those different documents onto different servers.
100 million records is a fairly large index for Solr, but adding 10 million records on a good machine should take hours, not days. You may find the following email thread interesting, as it includes both in-depth questions and some final advice on tuning a 10M-record indexing process.
Also, you did not say whether you 'store' the fields as well as index them. If you do, you may also look forward to Solr 4.1 field compression.
An important factor which impacts indexing performance (in terms of time) is the way you have defined your data-config.xml file.
If your fields come from multiple tables in a database, you can configure it in 2 ways:
Entities within entities
A single entity with a join query
The second method is considerably faster than the first, because the number of queries fired against the database is much smaller.
