In ArangoDB, does choice of collection impact performance?

I was wondering: you can throw anything into any collection in ArangoDB. I can imagine, however, that placing objects with similar attributes in the same collection has an impact on indexing, which in turn impacts performance.
Is that true, or shouldn't I worry about performance when creating collections?
Thanks, Frank

You do not need to worry much about collections where performance is concerned.
You design for performance largely by indexing your data according to the planned queries and choosing the proper index type for them. Query performance is also hugely affected by whether you filter the data before sorting, and vice versa.
All of this holds as long as you are on a single-server instance. Once you start sharding your data over many cluster nodes, the sharding strategy can again boost or impair performance.
tl;dr: Don't worry about collections before you have worried about your queries and your indexes.
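For illustration, a minimal sketch using the ArangoDB Java driver (assuming the 6.x API; the users collection, email attribute, and database name are hypothetical): the persistent index matching the planned FILTER is what drives performance, not which collection the documents happen to live in.

    import com.arangodb.ArangoCursor;
    import com.arangodb.ArangoDB;
    import com.arangodb.ArangoDatabase;
    import com.arangodb.entity.BaseDocument;
    import com.arangodb.model.PersistentIndexOptions;
    import java.util.Collections;
    import java.util.Map;

    public class IndexForQuery {
        public static void main(String[] args) {
            ArangoDB arango = new ArangoDB.Builder().host("localhost", 8529).build();
            ArangoDatabase db = arango.db("mydb"); // hypothetical database

            // Index the attribute the planned query filters on; this matters
            // far more than which collection the documents happen to live in.
            db.collection("users").ensurePersistentIndex(
                    Collections.singletonList("email"), new PersistentIndexOptions());

            // The FILTER on the indexed attribute (via a bind parameter) is
            // what the index accelerates.
            Map<String, Object> bindVars = Collections.singletonMap("email", "frank@example.com");
            ArangoCursor<BaseDocument> cursor = db.query(
                    "FOR u IN users FILTER u.email == @email RETURN u",
                    bindVars, null, BaseDocument.class);
            cursor.forEachRemaining(doc -> System.out.println(doc.getKey()));
        }
    }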

Related

CQEngine query overhead / precompiled parameterized queries

When I query an indexed collection many times with queries that are identical except for the attribute value, how big is the overhead of executing each one?
Is there a way to precompile a parameterized query to get rid of this overhead?
Edit: Here's a simple benchmark showing that making multiple retrievals from a CQEngine collection with a hash index tends to be ~18 times slower than retrieving items from a LinkedHashMap.
https://github.com/Inego/cqe-simple-benchmark/blob/main/src/main/kotlin/Benchmark.kt
There is no support for parameterized queries per se.
However, if you would like to reduce the overhead of constructing queries frequently, such as the impact on garbage collection, you can leverage the fact that queries are immutable and stateless, and cache frequently used queries.
Queries are trees, so you can also cache frequently used branches of queries and reassemble query trees on the fly, retrieving those branches from the cache.
That said, the overhead of constructing queries should generally be pretty small. I'd recommend benchmarking your application to see if this is really worthwhile.
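To illustrate, a minimal sketch (assuming CQEngine 2.x and a hypothetical Car class): the BLUE query object is built once and cached, then combined with a freshly parameterized branch on the fly.

    import com.googlecode.cqengine.ConcurrentIndexedCollection;
    import com.googlecode.cqengine.IndexedCollection;
    import com.googlecode.cqengine.attribute.SimpleAttribute;
    import com.googlecode.cqengine.index.hash.HashIndex;
    import com.googlecode.cqengine.query.Query;
    import com.googlecode.cqengine.query.option.QueryOptions;
    import static com.googlecode.cqengine.query.QueryFactory.*;

    public class CachedQueries {
        static class Car {
            final String color;
            final int doors;
            Car(String color, int doors) { this.color = color; this.doors = doors; }
        }

        static final SimpleAttribute<Car, String> COLOR = new SimpleAttribute<Car, String>("color") {
            public String getValue(Car car, QueryOptions queryOptions) { return car.color; }
        };
        static final SimpleAttribute<Car, Integer> DOORS = new SimpleAttribute<Car, Integer>("doors") {
            public Integer getValue(Car car, QueryOptions queryOptions) { return car.doors; }
        };

        // Queries are immutable and stateless, so this branch can be built
        // once and reused from any number of threads.
        static final Query<Car> BLUE = equal(COLOR, "blue");

        public static void main(String[] args) {
            IndexedCollection<Car> cars = new ConcurrentIndexedCollection<>();
            cars.addIndex(HashIndex.onAttribute(COLOR));
            cars.add(new Car("blue", 4));
            cars.add(new Car("red", 2));

            // Reassemble a query tree on the fly from the cached branch
            // plus a freshly constructed one.
            Query<Car> blueFourDoor = and(BLUE, equal(DOORS, 4));
            cars.retrieve(blueFourDoor).forEach(c -> System.out.println(c.color + "/" + c.doors));
        }
    }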

Slow aggregation queries while ingesting data

I have a large aggregation query that becomes incredibly slow while I am updating my data. I am not saving the data to a temporary index (and then renaming it when it's done) but saving it directly to the index I'm querying.
What are some ways to improve querying performance while indexing is occurring?
What are the usual bottlenecks here (possibly memory)?
It's hard to tell without any details, as there can be many factors affecting performance.
In general, though, indexing is a computationally intensive operation, so while it may feel counterintuitive, as well as looking at how to improve your search, I'd look at how you can optimize your indexing to reduce the load it causes.
I have had a somewhat similar problem: I observed high I/O utilization while indexing progress came to a halt and search was pretty much unavailable. I had good results with tuning the configuration related to segments and merging; merges can have a pretty bad effect on spinning disks as an index grows and starts merging big segments.
Also, if you don't have strict requirements on how quickly new documents become available, increasing index.refresh_interval and batching documents for indexing can help a lot.
Have a look at the docs here: https://www.elastic.co/guide/en/elasticsearch/guide/2.x/indexing-performance.html
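As an illustration of those two knobs, a minimal sketch against the Elasticsearch REST API, using only Java's built-in HTTP client (the index name myindex is hypothetical): raise index.refresh_interval for the duration of the ingest and push documents through the _bulk endpoint.

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class IngestTuning {
        public static void main(String[] args) throws Exception {
            HttpClient http = HttpClient.newHttpClient();
            String es = "http://localhost:9200";

            // Refresh less often during the bulk load: new documents become
            // searchable later, but the indexing load drops considerably.
            send(http, HttpRequest.newBuilder(URI.create(es + "/myindex/_settings"))
                    .header("Content-Type", "application/json")
                    .PUT(HttpRequest.BodyPublishers.ofString(
                            "{\"index\":{\"refresh_interval\":\"30s\"}}"))
                    .build());

            // Batch documents through the _bulk endpoint (newline-delimited
            // JSON) instead of indexing them one request at a time.
            String bulk = "{\"index\":{\"_index\":\"myindex\"}}\n"
                    + "{\"field\":\"value 1\"}\n"
                    + "{\"index\":{\"_index\":\"myindex\"}}\n"
                    + "{\"field\":\"value 2\"}\n";
            send(http, HttpRequest.newBuilder(URI.create(es + "/_bulk"))
                    .header("Content-Type", "application/x-ndjson")
                    .POST(HttpRequest.BodyPublishers.ofString(bulk))
                    .build());
        }

        static void send(HttpClient http, HttpRequest req) throws Exception {
            HttpResponse<String> res = http.send(req, HttpResponse.BodyHandlers.ofString());
            System.out.println(res.statusCode() + " " + res.body());
        }
    }

Once the load finishes, you would set refresh_interval back to its previous value (the default is 1s) or trigger an explicit refresh.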

How is the concurrent query performance of Elasticsearch?

Can Elasticsearch handle concurrent searches/aggregations well? (For example, 1000 people issuing the same or different queries at the same time.)
Please note that I am not talking about concurrent updates, only search/aggregation.
Databases like Oracle and MySQL all talk about concurrency in their docs, but I did not find Elasticsearch discussing this. Does that mean concurrency is not a problem for the data structures and architecture of Elasticsearch?
I know filter caching is one thing that makes concurrent queries easier. Anything else?
Queries can be cached for re-use with minimal overhead.
https://www.elastic.co/guide/en/elasticsearch/guide/current/filter-caching.html#filter-caching
This allows faster processing of future queries over the same data.
The cluster configuration and data allocation will also have an impact on performance. Requests should be made in a round-robin fashion: if a single node receives 1000 requests simultaneously, its performance will be degraded compared to dividing the work among multiple nodes.
Mappings and analyzers can also have significant influence on performance.
Queries that require retrieval and parsing of the _source field are expensive.
Query-time synonym translation will also be expensive.
The reality is that performance depends on the particular application.
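To illustrate the filter-caching point, a minimal sketch (hypothetical index and field names): clauses placed in the bool query's filter context don't contribute to scoring, and their results are candidates for caching and reuse across concurrent requests.

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class FilterContextSearch {
        public static void main(String[] args) throws Exception {
            // The term clause sits in filter context: it is not scored, and
            // its result can be cached and shared across concurrent queries.
            String query = "{ \"query\": { \"bool\": {"
                    + " \"must\":   [ { \"match\": { \"title\":  \"search\"    } } ],"
                    + " \"filter\": [ { \"term\":  { \"status\": \"published\" } } ]"
                    + " } } }";

            HttpRequest req = HttpRequest.newBuilder(
                            URI.create("http://localhost:9200/myindex/_search"))
                    .header("Content-Type", "application/json")
                    .POST(HttpRequest.BodyPublishers.ofString(query))
                    .build();

            HttpResponse<String> res = HttpClient.newHttpClient()
                    .send(req, HttpResponse.BodyHandlers.ofString());
            System.out.println(res.body());
        }
    }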

Is it appropriate to use a search engine as a caching layer?

We're talking about a normalized dataset, with several different entities that must often be accessed along with related records. We want to be able to search across all of this data. We also want to use a caching layer to store view-ready denormalized data.
Since search engines like Elasticsearch and Solr are fast, and since it seems appropriate in many cases to put the same data into both a search engine and a caching layer, I've read at least anecdotal accounts of people combining the two roles. This makes sense on a surface level, at least, but I haven't found much written about the pros and cons of this architecture. So: is it appropriate to use a search engine as a cache, or is using one layer for two roles a case of being penny wise but pound foolish?
These guys have done this...
http://www.artirix.com/elasticsearch-as-a-smart-cache/
The problem I see is not in the read speed, but in the write speed. You incur a pretty hefty cost for adding things to the cache (forcing a spool to disk and an index merge).
Things like memcached, or ElastiCache if you are on AWS, are much more efficient at both inserts and reads.
"Elasticsearch and Solr are fast" is relative: caching infrastructure is often measured in the single-digit-millisecond range, and the same goes for inserts. These search engines are measured in tens of milliseconds at least for reads, and much higher for writes.
I've heard of setups where ES was used for what it is really good at, full-text search, in parallel with a secondary storage. In these setups the data was not stored in ES (though it can be) - "store": "no" - and after searching ES's indices, the actual records were retrieved from the second storage level, usually an RDBMS, given that ES held a reference to the actual record in the RDBMS (an ID of some sort). If you're not happy with whatever the secondary storage gives you in terms of speed and "search" in general, I don't see why you couldn't set up an ES cluster to provide the missing piece.
The disadvantage here is the time spent architecting the ES data structures, because ES is not as good as an RDBMS at representing relationships. It really doesn't need to be; its main job and purpose are different, and it is actually happier with a denormalized set of data to search over.
Another disadvantage is the complexity of keeping the two storage systems in sync, which will require some thinking ahead. But once the initial setup and architecture are in place, it should be easy afterwards.
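A rough sketch of that two-level pattern (all index, table, and field names are hypothetical; parsing of the ES response is elided): search ES first, then fetch the authoritative records from the RDBMS by primary key.

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;

    public class TwoLevelLookup {
        public static void main(String[] args) throws Exception {
            // 1) Full-text search in ES, asking only for the stored RDBMS id.
            String query = "{ \"_source\": [\"rdbms_id\"],"
                    + " \"query\": { \"match\": { \"body\": \"full text search\" } } }";
            HttpRequest req = HttpRequest.newBuilder(
                            URI.create("http://localhost:9200/articles/_search"))
                    .header("Content-Type", "application/json")
                    .POST(HttpRequest.BodyPublishers.ofString(query))
                    .build();
            String hits = HttpClient.newHttpClient()
                    .send(req, HttpResponse.BodyHandlers.ofString()).body();
            long id = 42L; // in real code, parse the rdbms_id values out of `hits`

            // 2) Fetch the authoritative record from the RDBMS by primary key.
            try (Connection conn = DriverManager.getConnection("jdbc:postgresql://localhost/app");
                 PreparedStatement ps = conn.prepareStatement("SELECT * FROM articles WHERE id = ?")) {
                ps.setLong(1, id);
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) System.out.println(rs.getString("title"));
                }
            }
        }
    }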
The only recommended way of using a search engine is to create indices that match your most frequently accessed denormalised data access patterns. You can call it a cache if you want. For searching it's perfect, as it's fast enough.
What I would recommend adding a cache for is statistics for aggregated queries - "Top 100 hotels in Europe" is a good example.
Maybe you can consider in-memory Lucene indexes instead of Solr or Elasticsearch. Here is an example.
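A minimal sketch of that idea (assuming Lucene 8+, where ByteBuffersDirectory replaced the deprecated RAMDirectory): index a document entirely in heap memory, then search it.

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.queryparser.classic.QueryParser;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.store.ByteBuffersDirectory;
    import org.apache.lucene.store.Directory;

    public class InMemoryLucene {
        public static void main(String[] args) throws Exception {
            StandardAnalyzer analyzer = new StandardAnalyzer();
            Directory dir = new ByteBuffersDirectory(); // heap-resident index

            try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer))) {
                Document doc = new Document();
                doc.add(new TextField("title", "Top 100 hotels in Europe", Field.Store.YES));
                writer.addDocument(doc);
            }

            try (DirectoryReader reader = DirectoryReader.open(dir)) {
                IndexSearcher searcher = new IndexSearcher(reader);
                ScoreDoc[] hits = searcher
                        .search(new QueryParser("title", analyzer).parse("hotels"), 10)
                        .scoreDocs;
                for (ScoreDoc hit : hits) {
                    System.out.println(searcher.doc(hit.doc).get("title"));
                }
            }
        }
    }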

Cost of a query: dependent on or independent of the amount of data?

Could you please tell me whether the cost of a query depends on the amount of data in the database at the time?
That is, does the cost vary as the amount of data varies?
Thanks,
Savitha
The answer is yes: the data size will influence the query execution plan, which is why you must test your queries with realistic amounts of data (and, if possible, realistic data, as the distribution of the data is also important and will influence the query cost).
Every database management system is different in some respects, and what works well for Oracle, MS SQL, or PostgreSQL may not work well for MySQL, and the other way around. Even storage engines have very important differences that can affect performance dramatically.
Of course, a mass of data will slow down the process: when you fire a query, it needs to traverse and search the database, and the more data there is, the longer that takes. The three main issues you should be concerned about when dealing with very large data sets are buffers, indexes, and joins.
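One concrete way to see this is to compare execution plans at different data sizes. A minimal JDBC sketch (MySQL syntax; connection string, table, and column names are hypothetical) that prints the plan the optimizer chose for the data currently loaded:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class ExplainPlan {
        public static void main(String[] args) throws Exception {
            try (Connection conn = DriverManager.getConnection(
                         "jdbc:mysql://localhost/app", "user", "password");
                 Statement st = conn.createStatement();
                 // EXPLAIN reports the plan the optimizer picks for the data
                 // currently in the table; rerun it after loading realistic
                 // volumes and the chosen index and row estimates may change.
                 ResultSet rs = st.executeQuery(
                         "EXPLAIN SELECT * FROM orders WHERE customer_id = 42")) {
                while (rs.next()) {
                    System.out.println(rs.getString("table")
                            + " | key="  + rs.getString("key")
                            + " | rows=" + rs.getString("rows"));
                }
            }
        }
    }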
