Slow aggregation queries while ingesting data - performance

I have a large aggregation query that comes incredibly slow when I am updating my data. I am not saving the data to a tmp index (and then renaming it when it's done) but saving it directly to the index I'm querying.
What are some ways to improve querying performance while indexing is occurring?
What are the usual bottlenecks that I'm seeing here (possibly memory?)?

It's hard to tell without any details, as there can be many factors affecting performance.
In general, though, indexing is a computationally intensive operation, so while it may feel counterintuitive, but as well as looking at how to improve your search, I'd have a look at how you can optimize your indexing to reduce load it causes.
In my experience, I have had a somewhat similar problem. What I observed was high IO utilization, while indexing progress coming to a halt and search pretty much not available. And I had good results with tuning configuration related to segments and merging, which can have a pretty bad effect on spinning disks as an index grows and it starts merging big segments.
Also, if you don't have strict requirements for new documents availability, changing index.refresh_interval and batching documents for indexing can help a lot.
Have a look at docs here https://www.elastic.co/guide/en/elasticsearch/guide/2.x/indexing-performance.html

Related

Lucene index segment lifecycle and performance impact

For a project which maps files in a directory structure to Lucene documents (1:1), I'd like to know the impact of using multiple index segments. When a file on disk changes, the indexing process basically removes the corresponding document and adds a new one.
In the project, at the end of the indexing, the forceMerge() method of IndexWriter is used to reduce the number of segments to 1. This practice has been present in the code for a very long time, likely since early Lucene versions. As noted in the Lucene documentation, this is expensive task:
This is a horribly costly operation, especially when you pass a small maxNumSegments; usually you should only call this if the index is static (will no longer be changed).
Based on this I am considering removing this step altogether. It's just unclear what will be the performance impact.
In one answer a claim is made that multi segment performance got better over the time, however this is pretty vague statement. Is there some benchmark and/or explanatory article that would shed more light on the performance with multiple segments ? What if the segment count grows to thousands, millions ? Is this even possible ? How much will the search/indexing performance degrade ?
Also, when experimenting with disabiling the forceMerge() step, I noticed that after adding bunch of documents to the index, the next time the indexer is run, the segment count grows, however sometimes decreases after subsequent runs of the indexer (according to the segmentInfos field in the IndexReader object). Is there some automatic segment merge process ?

In ArangoDb, does choice of collection impact performance?

I was wondering.
You can throw anything in any collection in Arango. I can imagine however that placing objects with similar attributes in the same collection has impact on indexing, which impacts performance.
Is that true or shouldn't I worry about performance when creating collections?
tnx, Frank
You do not need to worry about performance and collections so much.
You design your performance largely by indexing your data according to the planned queries and choosing the proper index for the above. But your query performance are going to again be hugely affected by filtering the data before sorting and vice versa.
This is all as long as you are on a single server instance. Once you are looking at sharding your data over many cluster nodes, you can again boost or impair the performance.
tldr: Don't worry about collections before you have worried about your queries and your indexes.

How is the concurrency query performance of elasticsearch?

Does elasticsearch can handle concurrency search/aggregation well? (For example, 1000 people issue the same/different query at the same time)
Please note that I am not talking about concurrency update, only search/agg.
Databases like oracle/mysql all talking about concurrency in there docs. Did not find elasticsearch talking about this. Does that mean concurrency is not a problem to the data structure and architecture of elasticsearch?
I know cache of filter is one good thing to make concurrency query easier. Anything else?
Queries can be cached for re-use with minimal overhead.
https://www.elastic.co/guide/en/elasticsearch/guide/current/filter-caching.html#filter-caching
This allows faster processing of future queries over the same data.
The cluster configuration and data allocation will also have an impact on performance. Requests should be made in a round-robin fashion, If a single node is receives 1000 requests simultaneously its performance will be degraded vs dividing the work among multiple nodes.
Mappings and analyzers can also have significant influence on performance.
Queries that require retrieval and parsing of the _source field are expensive.
Using Query-time synonym translation will be expensive.
The reality is the performance is based on the particular application.

Fastest nosql option for number crunching?

I had always thought that Mongo had excellent performance with it's mapreduce functionality, but am now reading that it is a slow implementation of it. So if I had to pick an alternative to benchmark against, what should it be?
My software will be such that users will often have millions of records, and often be sorting and crunching through unpredictable subsets that are 10s or 100s of thousands. Most of the analysis of data that uses the full millions of records can be done in summary tables and the like. I'd originally thought Hypertable was a viable alternative, but in doing research I saw in their documents their mention that Mongo would be a more performant option, while Hypertable had other benefits. But for my application speed is my number one initial priority.
First of all, it's important to decide on what is "fast enough". Undoubtedly there are faster solutions than MongoDB's map/reduce but in most cases you may be looking at significantly higher development cost.
That said MongoDB's map/reduce runs, at time of writing, on a single thread which means it will not utilize all the cpu available to it. Also, MongoDB has very little in the way of native aggregation functionality. This will change fixed with version 2.1 onwards that should improve performance though (see https://jira.mongodb.org/browse/SERVER-447 and http://www.slideshare.net/cwestin63/mongodb-aggregation-mongosf-may-2011).
Now, what MongoDB is good at is scaling up easily, especially when it comes to reads. And this is important because the best solution for number crunching on large datasets is definitely a map/reduce cloud like Augusto suggested. Let such an m/r do the number crunching while MongoDB makes the required data available at high speeds. Database query throughput too low is easily solved by adding more mongo shards. Number crunching/aggregation performance too slow is solved by adding more m/r boxes. Basically performance becomes a function of number of instances you reserve for the problem, and thus cost.

Is excessive use of lucene good?

In my project, entire searching and listing of content is depend on Lucene. I am not facing any performance issues. Still, the project is in development phase and long way to go in production.
I have to find out the performance issues before the project completed in large structure.
Whether the excessive use of lucene is feasible or not?
As an example, I have about 3 GB of text in a Lucene index, and it functions very quickly (milliseconds response times on searches, filters, and sorts). This index contains about 300,000 documents.
Hope that gave some context to your concerns. This is in a production environment.
Lucene is very mature and has very good performance for what it was designed to do. However, it is not an RDBMS. The amount of fine-tuning you can do to improve performance is more limited than a database engine.
You shouldn't rely only on lucene if:
You need frequent updates
You need to do joined queries
You need sophisticated backup solutions
I would say that if your project is large enough to hire a DBA, you should use one...
Performance wise, I am seeing acceptable performance on a 400GB index across 10 servers (a single (4GB, 2CPU) server can handle 40GB of lucene index, but no more. YMMV).
By excessive, do you mean extensive/exclusive?
Lucene's performance is generally very good. I recently ran some performance tests for Lucene on my Desktop with QuadCore # 2.4 GHz 2.39 GHz
I ran various search queries against a disk index composed of 10MM documents, and the slowest query (MatchAllDocs) returned results within 1500 ms. Search queries with two or more search terms would return around 100 ms.
There are tons of performance tweaks you can do for Lucene, and they can significantly increase your search speed.
What would you define as excessive?
If your application has a solid design, and the performance is good, I wouldn't worry too much about it.
Perhaps you could get a data dump to test the performance in a live scenario.
We use lucence to enable type-ahead searching. This means for every letter typed, it hits the lucence index to get the results. Multiple that to tens of textboxes on multiple interfaces and again tens of employees typing, with no complaints and extremely fast response times. (Actually it works faster than any other type-ahead solution we tried).

Resources