I have an OrientDB graph database. Some of the classes are pretty large (e.g. > 60M records), some are somewhat smaller. Orient is pretty fast at searching when indexed.
Some of the properties I index run at nearly 100K records/s, while others start at 5K/s and then slow down towards 10/s.
This is partly due to the maxHeap and bufferSize settings.
I could not find a helpful page on how to work out the server.bat/server.sh settings for indexing certain types of data.
Does anybody have experience with indexing large sets (>>10M items)?
Do I have to restart the server each time I index a large set or start creating edges? Does the index type matter with respect to indexing speed?
For a project which maps files in a directory structure to Lucene documents (1:1), I'd like to know the impact of using multiple index segments. When a file on disk changes, the indexing process basically removes the corresponding document and adds a new one.
In the project, at the end of indexing, the forceMerge() method of IndexWriter is used to reduce the number of segments to 1. This practice has been present in the code for a very long time, likely since early Lucene versions. As noted in the Lucene documentation, this is an expensive task:
This is a horribly costly operation, especially when you pass a small maxNumSegments; usually you should only call this if the index is static (will no longer be changed).
Based on this I am considering removing this step altogether. It's just unclear what the performance impact will be.
In one answer a claim is made that multi-segment performance has improved over time, but this is a pretty vague statement. Is there some benchmark and/or explanatory article that would shed more light on performance with multiple segments? What if the segment count grows to thousands, or millions? Is this even possible? How much will search/indexing performance degrade?
Also, when experimenting with disabling the forceMerge() step, I noticed that after adding a bunch of documents to the index, the segment count grows the next time the indexer is run, but sometimes decreases after subsequent runs of the indexer (according to the segmentInfos field in the IndexReader object). Is there some automatic segment merge process?
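For reference, here is a minimal sketch of the pattern I am considering (assuming Lucene 5.x-style APIs; the index path and field names are placeholders): the indexer replaces changed documents without calling forceMerge(), and the segment count is read back from the reader's leaves.

    import java.nio.file.Paths;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TieredMergePolicy;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    public class SegmentCountCheck {
        public static void main(String[] args) throws Exception {
            Directory dir = FSDirectory.open(Paths.get("/tmp/test-index"));

            // Background merging is governed by the MergePolicy; TieredMergePolicy
            // is the default, so setting it here is only for illustration.
            IndexWriterConfig cfg = new IndexWriterConfig(new StandardAnalyzer());
            cfg.setMergePolicy(new TieredMergePolicy());

            try (IndexWriter writer = new IndexWriter(dir, cfg)) {
                // Replace-on-change: delete the old document for this path, add the new one.
                Document doc = new Document();
                doc.add(new StringField("path", "/some/file.txt", Field.Store.YES));
                doc.add(new TextField("contents", "updated file contents", Field.Store.NO));
                writer.updateDocument(new Term("path", "/some/file.txt"), doc);
                writer.commit();
                // No forceMerge(1) here; merges happen in the background as segments accumulate.
            }

            // Each leaf of a DirectoryReader corresponds to one segment.
            try (DirectoryReader reader = DirectoryReader.open(dir)) {
                System.out.println("segments: " + reader.leaves().size());
            }
        }
    }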
I have a large aggregation query that becomes incredibly slow while I am updating my data. I am not saving the data to a temporary index (and then renaming it when it's done) but saving it directly to the index I'm querying.
What are some ways to improve querying performance while indexing is occurring?
What are the usual bottlenecks in a situation like this (possibly memory?)?
It's hard to tell without any details, as there can be many factors affecting performance.
In general, though, indexing is a computationally intensive operation, so while it may feel counterintuitive, in addition to looking at how to improve your search I'd also have a look at how you can optimize your indexing to reduce the load it causes.
I have had a somewhat similar problem. What I observed was high I/O utilization, with indexing progress coming to a halt and search pretty much unavailable. I had good results tuning the configuration related to segments and merging, which can hit spinning disks hard as an index grows and starts merging big segments.
Also, if you don't have strict requirements for new documents availability, changing index.refresh_interval and batching documents for indexing can help a lot.
Have a look at the docs here: https://www.elastic.co/guide/en/elasticsearch/guide/2.x/indexing-performance.html
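As a rough illustration of both points (this assumes a 2.x-era cluster on localhost:9200 and a made-up index name, my_index; it's a sketch, not production code): relax the refresh interval while bulk loading, push documents through the _bulk endpoint in batches, then restore the refresh interval afterwards.

    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.nio.charset.StandardCharsets;

    public class IndexingTuning {

        // Helper: send a small JSON request to the Elasticsearch REST API.
        static void send(String method, String path, String body) throws Exception {
            HttpURLConnection conn =
                    (HttpURLConnection) new URL("http://localhost:9200" + path).openConnection();
            conn.setRequestMethod(method);
            conn.setRequestProperty("Content-Type", "application/json");
            if (body != null) {
                conn.setDoOutput(true);
                try (OutputStream out = conn.getOutputStream()) {
                    out.write(body.getBytes(StandardCharsets.UTF_8));
                }
            }
            System.out.println(method + " " + path + " -> " + conn.getResponseCode());
            conn.disconnect();
        }

        public static void main(String[] args) throws Exception {
            // 1. Relax refresh while loading: documents become searchable less often,
            //    which reduces the indexing load on the cluster.
            send("PUT", "/my_index/_settings",
                 "{\"index\":{\"refresh_interval\":\"30s\"}}");

            // 2. Batch documents through the _bulk API instead of indexing one by one.
            StringBuilder bulk = new StringBuilder();
            for (int i = 0; i < 1000; i++) {
                bulk.append("{\"index\":{\"_index\":\"my_index\",\"_type\":\"doc\",\"_id\":\"")
                    .append(i).append("\"}}\n");
                bulk.append("{\"field\":\"value ").append(i).append("\"}\n");
            }
            send("POST", "/_bulk", bulk.toString());

            // 3. Restore the default refresh interval once the load is done.
            send("PUT", "/my_index/_settings",
                 "{\"index\":{\"refresh_interval\":\"1s\"}}");
        }
    }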
Given 4-5 nodes with many IMaps holding lots of data, some of the predicate queries have started to become significantly slow. One solution to this performance issue, I think, could be adding indexes. However, this data is part of a sensitive system that is currently being used in production.
Before adding indexes, I was wondering what the consequences of doing this on huge IMaps would be (would it lock the entire map? would it bring down the entire system? etc.). The Hazelcast documentation explains how to do it, but doesn't give any further explanation.
If you want to add the index at runtime, this is what will happen:
the AddIndexOperation will be executed on every partition
during the execution of the AddIndexOperation, the partition will be blocked until all of the partition's data has been iterated and added to the index.
Queries won't be blocked in this timeframe - but get/put operations will.
I would recommend doing it in a "maintenance window" when you have the smallest load.
"Lots of data" is relative: just run a test in your dev environment with exactly the same amount of data to see how long adding an index will take in your environment.
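As a minimal sketch of the runtime variant (assuming Hazelcast 3.x, where IMap.addIndex takes the attribute name and an ordered flag; the map name, attribute and value type here are placeholders):

    import com.hazelcast.core.Hazelcast;
    import com.hazelcast.core.HazelcastInstance;
    import com.hazelcast.core.IMap;

    public class AddIndexAtRuntime {
        public static void main(String[] args) {
            HazelcastInstance hz = Hazelcast.newHazelcastInstance();

            IMap<String, Customer> customers = hz.getMap("customers");

            // Adding the index at runtime triggers the AddIndexOperation described above:
            // each partition is iterated (and blocked for get/put) while the index is built.
            // 'true' requests an ordered index, useful for range predicates.
            customers.addIndex("lastName", true);
        }

        // Placeholder value type; it needs to be serializable so it can live in the map.
        public static class Customer implements java.io.Serializable {
            public String lastName;
        }
    }

If a restart is acceptable instead, declaring the index in the map configuration (MapIndexConfig) should avoid the one-off blocking pass, since the index is then maintained as entries are added.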
I want to get all results from a match-all query in an Elasticsearch cluster. I don't care if the results are up to date and I don't care about the order; I just want to steadily keep going through all results and then start again at the beginning. Is scroll and scan best for this? It seems like a bit of a hit to take a snapshot that I don't need. I'll be looking at processing tens of millions of documents.
Somewhat of a duplicate of elasticsearch query to return all records. But we can add a bit more detail to address the overhead concern. (Viz., "it seems like a bit of a hit taking a snapshot that I don't need.")
A scroll-scan search is definitely what you want in this case. The "snapshot" is not a lot of overhead here. The documentation describes it metaphorically as "like a snapshot in time" (emphasis added). The actual implementation details are a bit more subtle, and quite clever.
A slightly more detailed explanation comes later in the documentation:
Normally, the background merge process optimizes the index by merging together smaller segments to create new bigger segments, at which time the smaller segments are deleted. This process continues during scrolling, but an open search context prevents the old segments from being deleted while they are still in use. This is how Elasticsearch is able to return the results of the initial search request, regardless of subsequent changes to documents.
So the reason the context is cheap to preserve is because of how Lucene index segments behave. A Lucene index is partitioned into multiple segments, each of which is like a stand-alone mini index. As documents are added (and updated), Lucene simply appends a new segment to the index. Segments are write-once: after they are created, they are never again updated.
Over time, as segments accumulate, Lucene will periodically do some housekeeping in the background. It scans through the segments and merges segments to flush the deleted and outdated information, eventually consolidating into a smaller set of fresher and more up-to-date segments. As newer merged segments replace older segments, Lucene will then go and remove any segments that are no longer actively used by the index at large.
This segmented index design is one reason why Lucene is much more performant and resilient than a simple B-tree. Continuously appending segments is cheaper in the long run than the accumulated IO of updating files directly on disk. Plus the write-once design has other useful properties.
The snapshot-like behavior used here by Elasticsearch is to maintain a reference to all of the segments active at the time the scrolling search begins. So the overhead is minimal: some references to a handful of files. Plus, perhaps, the size of those files on disk, as the index is updated over time.
This may be a costly amount of overhead, if disk space is a serious concern on the server. It's conceivable that an index being updated rapidly enough while a scrolling search context is active may as much as double the disk size required for an index. Toward that end, it's helpful to ensure that you have enough capacity such that an index may grow to 2–3 times its expected size.
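If it helps, here is a rough sketch of the scroll loop itself (assuming a 2.x-style REST API on localhost:9200, a made-up index name, and a deliberately crude regex to pull the scroll id out of the response; a real client library would handle all of this for you):

    import java.io.ByteArrayOutputStream;
    import java.io.InputStream;
    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.nio.charset.StandardCharsets;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class ScrollAll {

        // Helper: POST a JSON body and return the response body as a string.
        static String post(String path, String body) throws Exception {
            HttpURLConnection conn =
                    (HttpURLConnection) new URL("http://localhost:9200" + path).openConnection();
            conn.setRequestMethod("POST");
            conn.setRequestProperty("Content-Type", "application/json");
            conn.setDoOutput(true);
            try (OutputStream out = conn.getOutputStream()) {
                out.write(body.getBytes(StandardCharsets.UTF_8));
            }
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            try (InputStream in = conn.getInputStream()) {
                byte[] chunk = new byte[8192];
                int n;
                while ((n = in.read(chunk)) != -1) buf.write(chunk, 0, n);
            }
            return buf.toString("UTF-8");
        }

        public static void main(String[] args) throws Exception {
            Pattern scrollId = Pattern.compile("\"_scroll_id\"\\s*:\\s*\"([^\"]+)\"");

            // Open the scroll context: match_all, 1000 docs per batch.
            // (On 1.x/2.x you would add search_type=scan; on newer versions sort by _doc.)
            String response = post("/my_index/_search?scroll=1m",
                    "{\"size\":1000,\"query\":{\"match_all\":{}}}");

            while (true) {
                // ... process the hits in 'response' here ...
                Matcher m = scrollId.matcher(response);
                if (!m.find() || response.contains("\"hits\":[]")) {
                    break;  // no scroll id or an empty batch: we are done
                }
                // Fetch the next batch with the scroll id from the previous response.
                response = post("/_search/scroll",
                        "{\"scroll\":\"1m\",\"scroll_id\":\"" + m.group(1) + "\"}");
            }
        }
    }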
Most of the Lucene documentation advises keeping a single instance of the IndexReader and reusing it because of the overhead of opening a new reader.
However, I find it hard to see what this overhead is based on and what influences it.
Related to this: how much overhead does having an open IndexReader actually cause?
The context for this question is:
We currently run a clustered Tomcat stack where we do full-text search from the servlet container.
These searches are done on a separate Lucene index for each client, because each client only searches his own data. Each of these indexes contains from a few thousand to (currently) about 100,000 documents.
Because of the clustered Tomcat nodes, any client can connect to any Tomcat node.
Therefore keeping the IndexReader open would actually mean keeping a few thousand IndexReaders open on each Tomcat node. This seems like a bad idea, but constantly reopening doesn't seem like a very good idea either.
While it's possible for me to somewhat change the way we deploy Lucene, I'd rather not if it's not needed.
Usually the field cache is the slowest piece of Lucene to warm up, although other things like filters and segment pointers contribute. The specific amount kept in cache will depend on your usage, especially with stuff like how much data is stored (as opposed to just indexed).
You can use whatever memory-usage investigation tool is appropriate for your environment to see how much Lucene itself takes up for your application, but keep in mind that "warm up cost" also refers to the various caches that the OS and file system maintain, which will probably not appear in top or whatever tool you use.
You are right that having thousands of indexes is not a common practice. The standard advice is to have them share an index and use filters to ensure that the appropriate results are returned.
Since you are interested in performance, you should keep in mind that having thousands of indices on the server will result in thousands of files strewn all across the disk, which will lead to tons of seek time that wouldn't happen if you just had one big index. Depending on your requirements, this may or may not be an issue.
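To make that concrete, here is a minimal sketch of the shared-index-plus-filter approach (assuming Lucene 5.x+ with BooleanQuery.Builder and Occur.FILTER; the index path, field names and client id are made up):

    import java.nio.file.Paths;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.queryparser.classic.QueryParser;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.store.FSDirectory;

    public class SharedIndexSearch {
        public static void main(String[] args) throws Exception {
            try (DirectoryReader reader =
                         DirectoryReader.open(FSDirectory.open(Paths.get("/path/to/shared-index")))) {
                IndexSearcher searcher = new IndexSearcher(reader);

                // The user's full-text query.
                Query userQuery = new QueryParser("contents", new StandardAnalyzer())
                        .parse("some search terms");

                // Restrict results to one client's documents. FILTER clauses do not
                // contribute to scoring and are cacheable, so the extra cost is small.
                Query query = new BooleanQuery.Builder()
                        .add(userQuery, BooleanClause.Occur.MUST)
                        .add(new TermQuery(new Term("clientId", "client-42")),
                             BooleanClause.Occur.FILTER)
                        .build();

                TopDocs hits = searcher.search(query, 10);
                System.out.println("total hits: " + hits.totalHits);
            }
        }
    }

A single shared index also makes it practical to keep one long-lived reader (or a SearcherManager) per node rather than thousands.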
As a side note: it sounds like you may be using a networked file system, which is a big performance hit for Lucene.