Lucene and how to measure index fragmentation

Lucene and how to measure index fragmentation - performance

We are using Lucene 2.9.2 (upgrade to 3.x is planned) and it is a known fact that the search queries become slower over time. Usually we perform a full reindex. I have read the question https://stackoverflow.com/a/668453/356815 and its answers and to answer it right now: we do NOT use optimize() because performance was not acceptable anymore when running it.
Fragmentation?
I wonder the following: What are the best practices to measure the fragmentation of an existing index? Can Luke help me in that?
It would be very interesting to hear your thoughts about this analysis topic.
A bit more infos about our index:
We have indexed 400'000 documents
We heavily use properties per document
For each request we create a new searcher object (as we want changes to appear immediately in the search results)
Query performance is between 30ms (repeated same searches) and 10 seconds (complex)
The index consists of 44 files (15 .del files, 24 cfs files) and has a size of 1GB

Older version of Lucene did not effectively deal with large numbers of segments. This is why some people recommended to optimize (merge all segments together) in order to improve search performance.
This is less true with recent versions of Lucene. Indeed optimize has been renamed to sound less magical (you now need to call forceMerge(1)) and always merging segments is even considered harmful (look at this nice article from Lucene developer Simon Willnauer).
For each request we create a new searcher object
Opening a reader is very costly. You should rather use SearcherManager which will help you reopen (incremental open) your index only when necessary.

Related

Optimizing Elastic Search Index for many updates on few fields

We are working on a large Elasticsearch Index (>1 bi documents) with many searchable fields evenly distributed across >100 shards on 6 nodes.
Only a few of these fields are actually changed, but they are changed very often. About 25% of total requests are changes to these fields.
We have one field that simply holds a boolean, which accounts for more than 90% of the changes done to the document.
It looks like we are taking huge performance hits re-indexing the entire documents, even though a simple boolean is changing.
Reading around I found that this might be a case where one could store the boolean value within a parent-child field, as this would effectively place it in a separate index thus not forcing a recreation of the entire document. But I also read that this comes with disadvantages, such as more heap space usage for the relation.
What would be the best way to solve this challenge?

Yes, since Elasticsearch is internally a write-only system every update effectively creates a new document and marks the old copy as stale to be garbage-collected later.
Parent-child (aka join) or nested fields can help with this but they come with a significant performance hit of their own so your search performance probably will suffer.
Another approach is to use external fast storage like Redis (as described in this article) though it would be harder to maintain and might also turn out to be inefficient for searches, depending on your use case.
General rule of thumb here is: all use cases are different and you should carefully benchmark all feasible options.

Elasticserach forcemerge and disc space problem

I'm new to Elasticsearch, so don’t blame me much.
The situation is this, the server has almost run out of logging space, about 400 MB remains. I had to delete the old logs of two years ago, but as it turned out, it simply marks them deleted, but in fact, deletes them in the background during auto-merge. The index that I was trying to clean up is actively using for write logs to it, but in order to free up disk space, I decided to run a POST /logging/_forcemerge?only_expunge_deletes=true . Through the GET _tasks?detailed=true&actions=*forcemerge, I see that the task is running, but for 2 hours nothing has been happening. The place is not vacated and there is a feeling that it was not worth doing a force merge and listening to all the reviews on forums and sites about this procedure.
The questions themselves.
Is there any way I can find out how long force merge will last?
I turned off the servers that write to this index, as I understand it, is it not worth writing to the index during force merge?
Since I used the parameter only_expunge_deletes=true to merge only segments with deleted documents, will this affect the search by index?
What is the best practice that would not arise in such situations?

Is there any way I can find out how long force merge will last?
No, sorry, a force merge doesn't report any information about its progress.
I turned off the servers that write to this index, as I understand it, is it not worth writing to the index during force merge?
A force merge is generally only useful when you will never again write to an index. There's no reason to stop writing to an index just for the duration of the merge, but conversely if you wish to continue writing to an index then it's not recommended to force-merge it at all.
Since I used the parameter only_expunge_deletes=true to merge only segments with deleted documents, will this affect the search by index?
Merging is often beneficial for searches, which is why Elasticsearch (really Lucene) does it in the background. However, force-merging can disrupt the usual automatic merge process in future, which is why it's recommended not to do it on indices that will see future writes.
What is the best practice that would not arise in such situations?
I think a good practice that you are missing, given that you are indexing logs, is to use time-based indices: every so often (e.g. monthly) start a new index whose name contains the date (e.g. month and year). Elasticsearch lets you search across multiple indices at once (maybe using a wildcard or an alias). Then you can manage the lifecycle of these indices individually (there's even a feature for automatic index lifecycle management) which includes deleting older indices when they reach a suitable age. Deleting a subset of the documents in an index is expensive and doesn't necessarily save space, but deleting an entire index is cheap and frees up space immediately.

Elasticsearch sync database recommended / standard strategy

I'm pondering a strategy to maintain an index for Elasticsearch, I've found a plugin which may handle maintenance quite well however I would like to get a little more intimate with Elasticsearch since I really like her and the plugin would make playtime a little less intimate if you know what I mean.
So anyway, if I have a data set that would have fairly frequent updates (say ~ 1 update / 10s), would I run into performance problems with Elasticsearch? Can partial index updates be done when a single row changes or is a full re-rebuild of the index necessary? The strategy I plan on implementing involves modifying the index whenever I do CRUD with my application (python postgre) so there will be some overhead with the code which I'm not overly concerned about, just the performance. Is my strategy common?
I've used Sphinx which did have partial re-indexing which was run with a cron job to keep in sync, it had mapping between indexes and MySQL tables defined in the config. This was the recommended approach for Sphinx. Is there a recommended approach with Elasticsearch?

There are a number of different strategies for handling this, there's no simple one size fits all solution.
To answer some of your questions, first, there is no such thing as a partial update in Elasticsearch/Lucene. If you update a single field in a document the whole document is rewritten. Be aware of the performance implications of this when designing your schema. If you update a single document however, it should be available near instantly. Elasticsearch is a near-realtime search engine, you don't have to worry about regenerating the index constantly.
For your write load one update / 10s the default performance settings should be fine. That's a very low write load for ES in fact, it can scale much higher. Netflix, for instance, performs 7 millions updates / minute in one of their clusters.
As far as syncing strategies go, I've written an in-depth article on this "Keeping Elasticsearch in Sync"

How does a search engine rank millions of pages within 1 second?

I understand the basics of search engine ranking, including the ideas of "reverse index", "vector space model", "cosine similarity", "PageRank", etc.
However, when a user submits a popular query term, it is very likely that millions of pages containing this term. As a result, a search engine still needs to sort these millions of pages in real time. For example, I just tried searching "Barack Obama" in Google. It shows "About 937,000,000 results (0.49 seconds)". Ranking over 900M items within 0.5 seconds? That really blows my mind!
How does a search engine sort such a large number of items within 1 second? Can anyone give me some intuitive ideas or point out references?
Thanks!
UPDATE:
Most of the responses (including some older discussions) so far seem to contribute the credit to "reverse index". However, as far as I know, reverse index only helps find the "relevant pages". In other words, by inverse index Google could obtain the 900M pages containing "Barack Obama" (out of over several billions of pages). However, it is still not clear how to "rank" these millions of "relevant pages" based on the threads I read so far.
MapReduce framework is unlikely to be the key component for real-time ranking. MapReduce is designed for batch tasks. When submitting a job to a MapReduce framework, the response time is usually at least a minute, which is apparently too slow to meet our request.

The question would be really relevant if we were sure that the ranking was complete. It is quite possible that the ordering provided is approximate.
Given the fluidity of the ranking results, no answer that looks reasonable could be considered incorrect. For example, if an entire section of the web were excluded from the top results, you would not notice, provided they were included later.
This gives the developers a degree of latitude entirely unavailable in almost all other domains.
The real question to ask is - how precisely do the results match the actual rank assigned to each page?

There are two major factors that influence the time it takes for you to get a response from your search engine.
The first is if you're storing your index on hard disk. If you're using a database, it's very likely that you're using the hard disk at least a little. From a cold boot, your queries will be slow until the data necessary for those queries has been pulled into the database cache.
The other is having a cache for your popular queries. It takes a lot longer to search for a query than it does to return results from a cache. Now, the random access time for a disk is too slow, so they need to have it stored in RAM.
To solve both of these problems, Google uses memcached. It's an application that caches the output of the Google search engine and feeds slightly old results to users. This is fine because most of the time the web doesn't change fast enough for it to be a problem, and because of the significant overlap in searches. You can be almost guaranteed that Barack Obama has been searched for recently.
Another issue that effects search engine latency is the network overheads.
Google have been using a custom variant of the Linux (IIRC) that has been optimised for use as a web server. They've managed to reduce some of the time it takes to start turning around results to a query.
The moment a query hits their servers, the server immediately responds back to the user with the header for the HTTP response, even before Google has finished processing the query terms.
I'm sure they have a bunch of other tricks up their sleeves, too.
EDIT:
They also keep their inverted lists sorted already, from the indexing process (it's better to process once than for each query).
With these pre-sorted lists, the most expensive operation is list intersection. Although I'm fairly sure Google doesn't rely on a vector space model, so list intersection isn't so much a factor for them.
The models that pay off the best according to the literature are the probabilistic models. As an example, you may wish to look up Okapi BM25. It does fairly well in practice within my area of research (XML Retrieval). When working with probabilistic models, it tends to be much more efficient to process document at a time instead of term at a time. What this means is that instead of getting a list of all of the documents that contain a term, we look at each document and rank it based on the terms it contains from our query (skipping documents that have no terms).
But if we want to be smart, we can approach the problem in a different way (but only when it appears to be better). If there's a query term that is extremely rare, we can rank with that first, because it has the highest impact. Then we rank with the next best term, and we continue until we've determined if it's likely that this document will be within our top k results.

One possible strategy is just rank the top-k instead of the entire list.
For example, to find the top 100 results from 1 millions hits, by selection algorithm the time complexity is O(n log k). Since k = 100 and n = 1,000,000, in practice we could ignore log(k).
Now, you only need O(n) to obtain the top 100 results out of 1 million hits.

Also I guess the use of NoSQL databases instead of RDBMS helps.
NoSQL databases scales horizontally better, and don't generate bottlenecks. Big guys like Google Facebook or Twitter use them.
As other comments/answers suggested the data might be already sorted, and they are returning offsets of the data found instead of the whole batch.
The real question is not how they sort that many results that quickly, but how do they do it when tens or hundreds of millions of people around the world are querying google at the same time xD

As Xiao said, just rank the top-k instead of the entire list.
Google tells you there are 937,000,000 results, but it won't show them all to you. If you keep scrolling page after page, after a while it will truncate the results :)

Here you go, i looked it up for you and this is what i found! http://computer.howstuffworks.com/internet/basics/search-engine.htm

This ia my theory...Its highly impossible that you are the first guy to search for a keyword.So for every keyword (or a combination) searched on a search engine, it maintains a hash of links to relevent web pages. Everytime you click a link in search results it gets a vote-up on the hashset of that keyword combination. Unfortunatly if you are the first guy, it saves your search keyword(for suggesting future searches) and starts the hashing of that keyword. So you end up with a fewer or no results at all.
The page ranking as you might be knowing depends on many other factors too like backlinks,no. Of pages refering a keyword in seaech. etc.

Regarding your update:
MapReduce framework is unlikely to be the key component for real-time ranking. MapReduce is designed for batch tasks. When submitting a job to a MapReduce framework, the response time is usually at least a minute, which is apparently too slow to meet our request.
MapReduce is not just designed for batch tasks. There are quite a lot MapReduce frameworks supporting real time computing: Apache Spark, Storm, Infinispan Distributed Executor, Hazelcast Distributed Executor Service.
Back to your question MapReduce is the key to distribute the query task to multiple nodes, and then merge the result together.

There's no way you expect to get an accurate answer to this question here ;) Anyway, here's a couple of things to consider - Google uses a unique infrastructure, in every part of it. We cannot even guess the order of complexity of their network equipment or their database storage. That is all I know about the hardware component of this problem.
Now, for the software implementation - like the name says the PageRank is a rank by itself. It doesn't rank the pages when you enter the search query. I assume it ranks it on a totally independent part of the infrastructure every hour. And we already know that Google crawler bots are roaming the Web 24/7 so I assume that new pages are added into an "unsorted" hash map and then they are ranked on the next run of the algorithm.
Next, when you type your query, thousands of CPUs independently scan thousands of different parts of the PageRank database with a gapping factor. For example if the gapping factor is 10, one machine queries the part of the database that has PageRank values from 0-9.99, the other one queries the database from 10-19.99 etc. Since resources aren't an obstacle for Google they can set the gapping factor so low (for example 1) in order for each machine to query less than 100k pages which isn't to much for their hardware. Then when they need to compile the results of your query, since they know which machine ranks exactly which part of the database they can use the 'fill the pool' principle. Let n be the number of links on each Google page. The algorithm that combines the pages returned from queries ran on all those machines against all the different parts of database needs to only fill the first n results. So they take the results from the machine querying against the highest rank of the database. If it is greater than n they're done, if not they move to the next machine. This takes only O(q*g/r) where s is the quantity of the pages Google serves, g is the gapping factor and r is the highest value of PageRank. This assumption is encouraged by the fact that when you turn to second page your query is ran once again (notice the different time taken to generate it) .
This is just my two cents, but I think I'm pretty accurate with this hypothesis.
EDIT: You might want to check this out for complexity of high-order queries.

I don't know what Google really does, but surely they use approximation. For example if the search query is 'Search engine' then the number of results will be = (no. of documents where there is one or more occurrence of the word 'search' + no. of documents where there is one or more occurrence of the word 'engine' ). This can be done in O(1) time complexity. For details read the basic structure of Google http://infolab.stanford.edu/~backrub/google.html.

Why is Solr so much faster than Postgres?

I recently switched from Postgres to Solr and saw a ~50x speed up in our queries. The queries we run involve multiple ranges, and our data is vehicle listings. For example: "Find all vehicles with mileage < 50,000, $5,000 < price < $10,000, make=Mazda..."
I created indices on all the relevant columns in Postgres, so it should be a pretty fair comparison. Looking at the query plan in Postgres though it was still just using a single index and then scanning (I assume because it couldn't make use of all the different indices).
As I understand it, Postgres and Solr use vaguely similar data structures (B-trees), and they both cache data in-memory. So I'm wondering where such a large performance difference comes from.
What differences in architecture would explain this?

First, Solr doesn't use B-trees. A Lucene (the underlying library used by Solr) index is made of a read-only segments. For each segment, Lucene maintains a term dictionary, which consists of the list of terms that appear in the segment, lexicographically sorted. Looking up a term in this term dictionary is made using a binary search, so the cost of a single-term lookup is O(log(t)) where t is the number of terms. On the contrary, using the index of a standard RDBMS costs O(log(d)) where d is the number of documents. When many documents share the same value for some field, this can be a big win.
Moreover, Lucene committer Uwe Schindler added support for very performant numeric range queries a few years ago. For every value of a numeric field, Lucene stores several values with different precisions. This allows Lucene to run range queries very efficiently. Since your use-case seems to leverage numeric range queries a lot, this may explain why Solr is so much faster. (For more information, read the javadocs which are very interesting and give links to relevant research papers.)
But Solr can only do this because it doesn't have all the constraints that a RDBMS has. For example, Solr is very bad at updating a single document at a time (it prefers batch updates).

You didn't really say much about what you did to tune your PostgreSQL instance or your queries. It's not unusual to see a 50x speed up on a PostgreSQL query through tuning and/or restating your query in a format which optimizes better.
Just this week there was a report at work which someone had written using Java and multiple queries in a way which, based on how far it had gotten in four hours, was going to take roughly a month to complete. (It needed to hit five different tables, each with hundreds of millions of rows.) I rewrote it using several CTEs and a window function so that it ran in less than ten minutes and generated the desired results straight out of the query. That's a 4400x speed up.
Perhaps the best answer to your question has nothing to do with the technical details of how searches can be performed in each product, but more to do with ease of use for your particular use case. Clearly you were able to find the fast way to search with Solr with less trouble than PostgreSQL, and it may not come down to anything more than that.
I am including a short example of how text searches for multiple criteria might be done in PostgreSQL, and how a few little tweaks can make a large performance difference. To keep it quick and simple I'm just running War and Peace in text form into a test database, with each "document" being a single text line. Similar techniques can be used for arbitrary fields using the hstore type or JSON columns, if the data must be loosely defined. Where there are separate columns with their own indexes, the benefits to using indexes tend to be much bigger.
-- Create the table.
-- In reality, I would probably make tsv NOT NULL,
-- but I'm keeping the example simple...
CREATE TABLE war_and_peace
(
lineno serial PRIMARY KEY,
linetext text NOT NULL,
tsv tsvector
);
-- Load from downloaded data into database.
COPY war_and_peace (linetext)
FROM '/home/kgrittn/Downloads/war-and-peace.txt';
-- "Digest" data to lexemes.
UPDATE war_and_peace
SET tsv = to_tsvector('english', linetext);
-- Index the lexemes using GiST.
-- To use GIN just replace "gist" below with "gin".
CREATE INDEX war_and_peace_tsv
ON war_and_peace
USING gist (tsv);
-- Make sure the database has statistics.
VACUUM ANALYZE war_and_peace;
Once set up for indexing, I show a few searches with row counts and timings with both types of indexes:
-- Find lines with "gentlemen".
EXPLAIN ANALYZE
SELECT * FROM war_and_peace
WHERE tsv ## to_tsquery('english', 'gentlemen');
84 rows, gist: 2.006 ms, gin: 0.194 ms
-- Find lines with "ladies".
EXPLAIN ANALYZE
SELECT * FROM war_and_peace
WHERE tsv ## to_tsquery('english', 'ladies');
184 rows, gist: 3.549 ms, gin: 0.328 ms
-- Find lines with "ladies" and "gentlemen".
EXPLAIN ANALYZE
SELECT * FROM war_and_peace
WHERE tsv ## to_tsquery('english', 'ladies & gentlemen');
1 row, gist: 0.971 ms, gin: 0.104 ms
Now, since the GIN index was about 10 times faster than the GiST index you might wonder why anyone would use GiST for indexing text data. The answer is that GiST is generally faster to maintain. So if your text data is highly volatile the GiST index might win on overall load, while the GIN index would win if you are only interested in search time or for a read-mostly workload.
Without the index the above queries take anywhere from 17.943 ms to 23.397 ms since they must scan the entire table and check for a match on each row.
The GIN indexed search for rows with both "ladies" and "gentlemen" is over 172 times faster than a table scan in exactly the same database. Obviously the benefits of indexing would be more dramatic with bigger documents than were used for this test.
The setup is, of course, a one-time thing. With a trigger to maintain the tsv column, any changes made would instantly be searchable without redoing any of the setup.
With a slow PostgreSQL query, if you show the table structure (including indexes), the problem query, and the output from running EXPLAIN ANALYZE of your query, someone can almost always spot the problem and suggest how to get it to run faster.
UPDATE (Dec 9 '16)
I didn't mention what I used to get the prior timings, but based on the date it probably would have been the 9.2 major release. I just happened across this old thread and tried it again on the same hardware using version 9.6.1, to see whether any of the intervening performance tuning helps this example. The queries for only one argument only increased in performance by about 2%, but searching for lines with both "ladies" and "gentlemen" about doubled in speed to 0.053 ms (i.e., 53 microseconds) when using the GIN (inverted) index.

Solr is designed primarily for searching data, not for storage. This enables it to discard much of the functionality required from an RDMS. So it (or rather lucene) concentrates on purely indexing data.
As you've no doubt discovered, Solr enables the ability to both search and retrieve data from it's index. It's the latter (optional) capability that leads to the natural question... "Can I use Solr as a database?"
The answer is a qualified yes, and I refer you to the following:
https://stackoverflow.com/questions/5814050/solr-or-database
Using Solr search index as a database - is this "wrong"?
For the guardian solr is the new database
My personal opinion is that Solr is best thought of as a searchable cache between my application and the data mastered in my database. That way I get the best of both worlds.

This biggest difference is that a Lucene/Solr index is like a single-table database without any support for relational queries (JOINs). Remember that an index is usually only there to support search and not to be the primary source of the data. So your database may be in "third normal form" but the index will be completely be de-normalized and contain mostly just the data needed to be searched.
Another possible reason is generally databases suffer from internal fragmentation, they need to perform too much semi-random I/O tasks on huge requests.
What that means is, for example, considering the index architecture of a databases, the query leads to the indexes which in turn lead to the data. If the data to recover is widely spread, the result will take long and that seems to be what happens in databases.

Please read this and this.
Solr (Lucene) creates an inverted index which is where retrieving data gets quite faster. I read that PostgreSQL also has similar facility but not sure if you had used that.
The performance differences that you observed can also be accounted to "what is being searched for ?", "what are the user queries ?"

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio