CosmosDB - Gremlin - high memory usage with query containing limit() step - performance

I want to retrieve a large number of items using a limit() step:
g.V().hasLabel('foo').as('f').limit(5000).order().by('f_Id',incr).by('f_bar',incr).select('f').unfold().dedup()
This query takes a very long time and consumes about 800 MB of memory to download the collection.
When I use the query below:
g.V().hasLabel('foo').as('f').has('propA','ValueA').has('propB','ABC').limit(5000).order().by('f_Id',incr).by('f_bar',incr).select('f').unfold().dedup()
it is faster and consumes less memory, around 500 MB, to download this collection, but that is still high.
My first question is how to optimize the first query, with just limit(), if I do not want to filter by properties A and B.
My second question is why there is such a difference in memory usage between those two results. In both queries I download 5000 items into memory. What could be a possible way to reduce this consumption?
I use the Gremlin driver for .NET.

I'm not an expert at CosmosDB optimization, but from a Gremlin perspective, when I look at this traversal:
g.V().hasLabel('foo').as('f').
limit(5000).order().by('f_Id',incr).by('f_bar',incr).
select('f').unfold().dedup()
I wonder why you wouldn't just write it as:
g.V().hasLabel('foo').limit(5000).order().by('f_Id',incr).by('f_bar',incr)
Meaning, you want 5000 "foo" vertices ordered a certain way. The need to use the "f" step label and unfold() seems unnecessary, and I don't see how you could end up with duplicates, so you can drop dedup(). I'm not sure whether those changes will make any difference to how CosmosDB processes things, but it certainly removes some unneeded processing.
I'd also wonder whether you need to pare down the data returned in your vertices. Right now you're returning all the properties for each vertex. If you don't need all of those, it might be better to be more specific and transform the data into the form your application requires:
g.V().hasLabel('foo').limit(5000).order().by('f_Id',incr).by('f_bar',incr).
valueMap('name','age')
That should help reduce serialization costs.
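For completeness, here is a rough sketch of submitting that trimmed-down traversal through the TinkerPop driver. I'm showing the Java driver here; the .NET driver you mention has an equivalent submit call. The host, port and query are only placeholders, and CosmosDB would additionally need credentials and SSL enabled:
import org.apache.tinkerpop.gremlin.driver.Client;
import org.apache.tinkerpop.gremlin.driver.Cluster;
import org.apache.tinkerpop.gremlin.driver.Result;
import java.util.List;

public class LimitedFooQuery {
    public static void main(String[] args) throws Exception {
        // Placeholder endpoint; credentials and SSL configuration are omitted here.
        Cluster cluster = Cluster.build("your-account.gremlin.cosmos.azure.com").port(443).create();
        Client client = cluster.connect();
        try {
            List<Result> results = client.submit(
                    "g.V().hasLabel('foo').limit(5000)." +
                    "order().by('f_Id',incr).by('f_bar',incr)." +
                    "valueMap('name','age')").all().get();
            System.out.println(results.size() + " results");
        } finally {
            client.close();
            cluster.close();
        }
    }
}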

Related

What does Elasticsearch automatic slicing do?

What does Elasticsearch automatic slicing do? I find the documentation to be very laconic about this feature. I tried searching for other explanations of this functionality, but to no avail. Nor have I managed to find what a slice is in Elasticsearch.
Automatic slicing is a way to parallelize work for a few different endpoints, such as reindex, update by query and delete by query.
The three above APIs all work the same way by making a scroll query over the target index. Scroll queries provide a more performant way of making queries yielding big result sets than normal paged queries. Scroll queries can be further improved by slicing them.
To be clear: if a query is supposed to return a large number of hits, you can make a normal query and page through the results using from/size, but that will not be performant because of deep paging. To circumvent that issue, ES allows you to use scroll queries in order to get results in batches of N hits. Those scroll queries can be further improved by slicing them, i.e. splitting the scroll into multiple slices which can be consumed independently by your client application.
So, say you have a query which is supposed to return 1,000,000 hits, and you want to scroll over that result set in batches of 50,000 hits, using a normal scroll query (i.e. without slicing), your client application will have to make the first scroll call and then 20 more synchronous calls (i.e. one after another) to retrieve each batch of 50K hits.
By using slicing, you can parallelize those 20 scroll calls. If your client application is multi-threaded, you can make each scroll call use, say, 5 slices; you'll then end up with 5 slices of ~10K hits each that can be consumed by 5 different threads in your application, instead of having a single thread consume 50K hits. You can thus leverage the full computing power of your client application to consume those hits.
The ideal number of slices should be a multiple of the number of shards in the source index. For the best performance, you should pick the same number of slices as there are shards in your source index. For that reason, you might want to use automatic slicing instead of manual slicing, as ES will pick that number for you.
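To make that concrete, here is a rough sketch of consuming a sliced scroll with the Elasticsearch low-level Java REST client. The index name, the slice count of 5 and the 10K page size are just the numbers from the example above, not anything prescribed by ES:
import org.apache.http.HttpHost;
import org.apache.http.util.EntityUtils;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.Response;
import org.elasticsearch.client.RestClient;

public class SlicedScrollSketch {
    public static void main(String[] args) {
        RestClient client = RestClient.builder(new HttpHost("localhost", 9200, "http")).build();
        int maxSlices = 5;
        for (int sliceId = 0; sliceId < maxSlices; sliceId++) {
            final int id = sliceId;
            // In a real consumer each slice would run on its own thread.
            new Thread(() -> consumeSlice(client, id, maxSlices)).start();
        }
    }

    static void consumeSlice(RestClient client, int sliceId, int maxSlices) {
        try {
            // First page of this slice; keep the scroll context alive for 1 minute.
            Request request = new Request("POST", "/my-index/_search?scroll=1m");
            request.setJsonEntity(
                    "{ \"slice\": { \"id\": " + sliceId + ", \"max\": " + maxSlices + " },"
                    + " \"size\": 10000,"
                    + " \"query\": { \"match_all\": {} } }");
            Response response = client.performRequest(request);
            String body = EntityUtils.toString(response.getEntity());
            // Parse the hits and the _scroll_id from the body, then keep calling
            // POST /_search/scroll with that id until a page comes back empty.
            System.out.println("slice " + sliceId + " first page: " + body.length() + " bytes");
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}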

Collecting large statistical sets with pg_stat_statements?

According to Postgres pg_stat_statements documentation:
The module requires additional shared memory proportional to
pg_stat_statements.max. Note that this memory is consumed whenever the
module is loaded, even if pg_stat_statements.track is set to none.
and also:
The representative query texts are kept in an external disk file, and
do not consume shared memory. Therefore, even very lengthy query texts
can be stored successfully. However, if many long query texts are
accumulated, the external file might grow unmanageably large.
From these it is unclear what the actual memory cost of a high pg_stat_statements.max would be - say at 100k or 500k (the default is 5k). Is it safe to set the level that high, and what could be the negative repercussions of doing so? Would aggregating statistics into an external database via logstash/fluentd be a preferred approach above certain sizes?
1.
From what I have read, it hashes the query and keeps the hash (with its counters) in shared memory, saving the query text to the file system. So the next concern is more to be expected than overloaded shared memory:
if many long query texts are accumulated, the external file might grow
unmanageably large
The hash of the text is so much smaller than the text itself that I think you should not worry about the extension's memory consumption where long queries are concerned. Especially knowing that the extension uses the query analyser (which will run for EVERY query ANYWAY):
the queryid hash value is computed on the post-parse-analysis
representation of the queries
Setting pg_stat_statements.max 10 times higher should take roughly 10 times more shared memory, I believe. The growth should be linear. The documentation does not say so explicitly, but logically it should be the case.
There is no definite answer on whether a particular value is safe, because that depends on your other configuration values and on your hardware. But since growth should be linear, consider it this way: if you set it to 5K and query runtime has grown by almost nothing, then setting it to 50K will prolong it by almost nothing times ten. BTW, my question - who is going to dig through 50,000 slow statements? :)
2.
This extension already pre-aggregates statistics per normalized statement. You can select them straight from the database, so moving the data to another DB and selecting it there will only give you the benefit of unloading the original DB and loading the other one. In other words, you save 50 MB for a query on the original, but spend the same on the other. Does it make sense? For me - yes. This is what I do myself. But I also save execution plans per statement (which is not part of the pg_stat_statements extension). I believe it depends on what you have and what you need. There is definitely no need for it just because of the number of queries. Again, unless the external query-text file grows so large that the extension can no longer manage it:
As a recovery method if that happens, pg_stat_statements may choose to
discard the query texts, whereupon all existing entries in the
pg_stat_statements view will show null query fields
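As an illustration of selecting it straight from the DB, here is a minimal JDBC sketch that pulls the top offenders from the view. It assumes PostgreSQL 13+ column names (total_exec_time; older versions call it total_time), that the extension is already enabled with CREATE EXTENSION pg_stat_statements, and made-up connection details:
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class TopStatements {
    public static void main(String[] args) throws Exception {
        // Hypothetical connection details.
        String url = "jdbc:postgresql://localhost:5432/mydb";
        try (Connection conn = DriverManager.getConnection(url, "monitor", "secret");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SELECT calls, total_exec_time, query " +
                     "FROM pg_stat_statements " +
                     "ORDER BY total_exec_time DESC " +
                     "LIMIT 20")) {
            while (rs.next()) {
                System.out.printf("%d calls, %.1f ms total: %s%n",
                        rs.getLong("calls"),
                        rs.getDouble("total_exec_time"),
                        rs.getString("query"));
            }
        }
    }
}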

How to figure out the optimal fetch size for the select query

In JDBC the default fetch size is 10, but I guess that's not the best fetch size when I have a million rows. I understand that too low a fetch size reduces performance, but so does a fetch size that is too high.
How can I find the optimal size? And does this have an impact on the DB side - does it chew up a lot of memory?
If your rows are large, keep in mind that all the rows you fetch at once have to be stored in the Java heap in the driver's internal buffers. In 12c, Oracle has 32K VARCHAR columns; if you have 50 of those and they're full, that's 1,600,000 characters per row. Each character takes 2 bytes in Java, so each row can take up to 3.2 MB. If you're fetching rows 100 by 100, you'll need 320 MB of heap to store the data, and that's just for one Statement. So you should only increase the row prefetch size for queries that fetch reasonably small rows (small in data size).
As with (almost) anything, the way to find the optimal size for a particular parameter is to benchmark the workload you're trying to optimize with different values of the parameter. In this case, you'd need to run your code with different fetch size settings, evaluate the results, and pick the optimal setting.
In the vast majority of cases, people pick a fetch size of 100 or 1000 and that turns out to be a reasonably optimal setting. The performance difference among values in that range is generally pretty minimal - you would expect most of the performance difference between runs to be the result of normal random variation rather than changes in the fetch size. If you're trying to get the last iota of performance for a particular workload in a particular configuration, you can certainly do that analysis. For most folks, though, 100 or 1000 is good enough.
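If you do want to run that benchmark yourself, a rough sketch is below. The JDBC URL, credentials and query are placeholders, and real measurements would need warm-up runs and several repetitions per setting:
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class FetchSizeBenchmark {
    public static void main(String[] args) throws Exception {
        int[] fetchSizes = {10, 100, 1000, 10000};
        try (Connection conn = DriverManager.getConnection(
                "jdbc:oracle:thin:@//localhost:1521/ORCLPDB1", "user", "secret")) {
            for (int fetchSize : fetchSizes) {
                long start = System.nanoTime();
                long rows = 0;
                try (PreparedStatement ps = conn.prepareStatement("SELECT * FROM big_table")) {
                    ps.setFetchSize(fetchSize);          // rows transferred per round trip
                    try (ResultSet rs = ps.executeQuery()) {
                        while (rs.next()) {
                            rows++;                      // just drain the result set
                        }
                    }
                }
                long elapsedMs = (System.nanoTime() - start) / 1_000_000;
                System.out.printf("fetchSize=%d -> %d rows in %d ms%n", fetchSize, rows, elapsedMs);
            }
        }
    }
}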
The default value of JDBC fetch size property is driver specific and for Oracle driver it is 10 indeed.
For some queries the fetch size should be larger, for others smaller.
I think a good idea is to set a global fetch size for the whole project and override it for individual queries where it should be bigger (a sketch follows the rule of thumb below).
Look at this article:
http://makejavafaster.blogspot.com/2015/06/jdbc-fetch-size-performance.html
It describes how to set the fetch size globally and override it for carefully selected queries using different approaches: Hibernate, JPA, Spring JDBC templates, or the core JDBC API, along with a simple benchmark against an Oracle database.
As a rule of thumb you can:
set fetchsize to 50 - 100 as global setting
set fetchsize to 100 - 500 (or even more) for individual queries
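As a sketch of that rule of thumb with the Oracle driver (connection details are made up): the defaultRowPrefetch connection property sets the connection-wide default, and Statement.setFetchSize() overrides it for an individual query:
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.Properties;

public class FetchSizeDefaults {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.setProperty("user", "scott");
        props.setProperty("password", "tiger");
        props.setProperty("defaultRowPrefetch", "100");   // default for everything on this connection

        try (Connection conn = DriverManager.getConnection(
                "jdbc:oracle:thin:@//localhost:1521/ORCLPDB1", props)) {

            // This statement inherits the connection-level default of 100.
            try (PreparedStatement ps = conn.prepareStatement("SELECT id, name FROM small_table");
                 ResultSet rs = ps.executeQuery()) {
                while (rs.next()) { /* ... */ }
            }

            // A large export query gets a bigger, per-statement override.
            try (PreparedStatement ps = conn.prepareStatement("SELECT * FROM big_table")) {
                ps.setFetchSize(500);
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) { /* ... */ }
                }
            }
        }
    }
}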
The Oracle JDBC driver does have a default row prefetch size of 10. Check out OracleConnection.getDefaultRowPrefetch in the Oracle JDBC Javadoc.
tl;dr
How to figure out the optimal fetch size for the select query
Evaluate some maximal amount of memory (bytesInMemory); 4 MB, 8 MB or 16 MB are good starts.
Evaluate the maximal size of each column in the query and sum up those sizes (bytesPerRow).
...
Use this formula: fetch_size = bytesInMemory / bytesPerRow
You may adjust the formula result to have predictable values.
Last words: test with different bytesInMemory values and/or different queries to appreciate the results in your application.
The above response was inspired by the Apache MetaModel project (retired to the Apache Attic as of this writing). They found an answer to this exact question: they built a class that calculates a fetch size given a maximal memory amount. The class is based on an Oracle whitepaper explaining how Oracle JDBC drivers manage memory.
Basically, the class is constructed with a maximal memory amount (bytesInMemory). Later, it is asked for a fetch size for a Query (an Apache MetaModel class). The Query class helps find the number of bytes (bytesPerRow) a typical row of the query's results would have. The fetch size is then calculated with the formula below:
fetch_size = bytesInMemory / bytesPerRow
The fetch size is also adjusted to stay in the range [1, 25000]. Other adjustments are made during the calculation of bytesPerRow, but that's too much detail for here.
This class is named FetchSizeCalculator. The link leads to the full source code.
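A simplified sketch of that formula (this is not the actual MetaModel FetchSizeCalculator, just the calculation and the [1, 25000] clamp described above):
public class FetchSizeSketch {

    /** Returns bytesInMemory / bytesPerRow, clamped to [1, 25000]. */
    static int calculateFetchSize(long bytesInMemory, long bytesPerRow) {
        long fetchSize = bytesInMemory / Math.max(bytesPerRow, 1);
        return (int) Math.min(Math.max(fetchSize, 1), 25_000);
    }

    public static void main(String[] args) {
        // Example: allow 8 MB of buffered rows; estimate ~2 KB per row.
        long bytesInMemory = 8L * 1024 * 1024;
        long bytesPerRow = 2 * 1024;
        System.out.println(calculateFetchSize(bytesInMemory, bytesPerRow)); // prints 4096
    }
}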

How to Quickly Update Mongo Documents String Fields with Complex Functions

What is the fastest way to update documents in a Mongo database with complex functions, let's say a string search / replace or a sqrt calculation?
Since such operators are missing (e.g. a $replace), this is not possible with update() - which would probably be the fastest approach, since on my test collection it only takes about 50 ms to set a field on some 100k objects.
When I simply iterate over all documents it takes about 45 seconds. It gets a little faster when I limit my query to the fields I'm using during the update.
This time of course grows larger on larger collections, hence the question whether there is a faster way than iterating over the collection (e.g. via a map-reduce job).
No :) Without native support for such functionality you'll be stuck with a read -> modify -> write approach. That said, if your machine can set a field on 100k objects in 50 ms, the read, modify and write pass over those same documents shouldn't take anywhere near 45 seconds. Are you sure the bottleneck is the database rather than the machine running that pass? Are you sure you're batching appropriately and not doing an update per document?
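To illustrate the batching point: with the Java driver the read -> modify -> write pass can be flushed in bulk instead of one update per document. The database, collection and field names here ("testdb", "items", "text") and the batch size of 1000 are made up for the example:
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.UpdateOneModel;
import com.mongodb.client.model.Updates;
import com.mongodb.client.model.WriteModel;
import org.bson.Document;
import java.util.ArrayList;
import java.util.List;

public class BatchedStringReplace {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> coll = client.getDatabase("testdb").getCollection("items");
            List<WriteModel<Document>> batch = new ArrayList<>();
            // Read only the field we transform, as noted in the question.
            for (Document doc : coll.find().projection(new Document("text", 1))) {
                String text = doc.getString("text");
                if (text == null) {
                    continue;
                }
                batch.add(new UpdateOneModel<>(
                        Filters.eq("_id", doc.get("_id")),
                        Updates.set("text", text.replace("foo", "bar"))));
                if (batch.size() == 1000) {   // flush in batches instead of one round trip per document
                    coll.bulkWrite(batch);
                    batch.clear();
                }
            }
            if (!batch.isEmpty()) {
                coll.bulkWrite(batch);
            }
        }
    }
}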

Should I keep the size of stored fields in Solr to a minimum?

I am looking to introduce Solr to power the search for a business listing website. The site has around 2 million records.
There is a search results page which will display some key data for each result. I believe the data needed for this summary information is around 1KB per result.
I could simply index the fields needed for the search within Solr - but this means a separate database call for each result to populate the summary information. If Solr could return all of this data I would expect it to yield greater performance than ~40 database round-trips.
The concern is that Solr's memory usage would be too large (how might I calculate this?) and that indexing might take too long with the extra data.
You would benefit greatly from storing those fields in Solr compared to making 40 DB round-trips. Just make sure that you mark the field as not indexed (indexed="false") in your schema config, and perhaps also as compressed (compressed="true"), though compression will of course use some CPU when indexing and retrieving.
When a field is not indexed, no analyzers process it at indexing time, which makes it much faster to store than an indexed field.
It's a trade off, and you will have to analyze this yourself.
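For what it's worth, here is a sketch of what the results-page query could look like with SolrJ once the summary fields are stored in Solr; each page of 40 results then comes back without any extra database calls. The collection name, query and field names are invented:
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class ListingSearch {
    public static void main(String[] args) throws Exception {
        try (HttpSolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/listings").build()) {
            SolrQuery query = new SolrQuery("plumber");
            query.setRows(40);                              // one results page
            query.setFields("id", "name", "city", "phone"); // stored (not necessarily indexed) summary fields
            QueryResponse response = solr.query(query);
            for (SolrDocument doc : response.getResults()) {
                System.out.println(doc.getFieldValue("name") + " - " + doc.getFieldValue("city"));
            }
        }
    }
}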
Solr's performance greatly depends on caching, not only of queries, but also of the documents themselves. Those caches depend on memory, and the bigger your documents are, the less you can fit in a fixed amount of memory.
Document size also affects index size and replication times. For large indices with master slave configurations, this can impact the rate at which you can update the index.
Ideally you should measure cache hit rates at different cache sizes, with and without the fields. If you can spend the memory to get a high enough cache hit rate with the fields, then by all means go for it. If you cannot, you may have to fetch the document content from another system.
There is a third alternative you didn't mention, which is to store the documents outside of the DB, but not in Solr. They should be stored in a format which is as close as possible to what you deliver with search results. The code which creates/updates the indices could create/update these documents as well. This is a lot of work, but like everything it comes down to how much performance you need and what you are willing to do to get it.
EDIT: For measuring cache hit rates and throughput, I've found the best test source is your current query logs. Take a day or two's worth of live queries and run them against different indexes and configurations to see how well they work.

Resources