Payload performance in Lucene

I know there are several topics on the web, as well as on SO, regarding indexing and query performance in Lucene, but I have yet to find one that discusses whether (and if so, by how much) creating payloads affects query performance...
Here's the scenario ...
Let's say I want to index a collection of documents (anywhere from 100K - 10M), and each document has a subsection that I want to be able to search separately (or perhaps rank higher, depending on whether a match was found within that section).
I'm considering adding a payload (during indexing) to any term that appears within that subsection, so I can efficiently make that determination at query-time.
Does anyone know of any performance issues related to using payloads, or even better, could you point me to any online documentation about this topic?
Thanks!
EDIT: I appreciate the alternative solutions to my scenario, but in case I do need to use payloads in the future, does anyone have any comments regarding the original question about query performance?

The textbook solution to what you want to do is index each original document as two fields: one for the full document, and the other for the subsection. You can boost the subsection field separately either during indexing or during retrieval.
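For illustration, a minimal sketch of that two-field approach against a recent Lucene API (the field names "body" and "subsection" are made up here; in older Lucene versions you would boost at index time via Field.setBoost rather than wrapping the query in a BoostQuery):

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.BoostQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

import java.io.IOException;

class TwoFieldSketch {
    // Index the full text and the subsection as two separate fields.
    static void addDoc(IndexWriter writer, String fullText, String subsectionText)
            throws IOException {
        Document doc = new Document();
        doc.add(new TextField("body", fullText, Field.Store.NO));
        doc.add(new TextField("subsection", subsectionText, Field.Store.NO));
        writer.addDocument(doc);
    }

    // Search both fields; a match inside the subsection counts twice as much.
    static Query buildQuery(String term) {
        BooleanQuery.Builder q = new BooleanQuery.Builder();
        q.add(new TermQuery(new Term("body", term)), BooleanClause.Occur.SHOULD);
        q.add(new BoostQuery(new TermQuery(new Term("subsection", term)), 2.0f),
                BooleanClause.Occur.SHOULD);
        return q.build();
    }
}
```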
Having said that, you can read about Lucene payloads here: Getting Started with Payloads.

Your use case doesn't fit well with the purpose of payloads -- it looks to me that any payload information would be redundant.
Payloads are attached to individual occurrences of terms in the document, not to document/term pairs. In order to store and access payloads, you have to use the offset of the term occurrence within the document. In your case, if you know the offset, you should be able to calculate which section the term occurrence is in, without using payload data.
The broader question is the effect of payloads on performance. My experience is that when properly used, the payload implementation takes up less space and is faster than whatever workaround I was previously using. The biggest impact on disk space will be wherever you currently use Field.setOmitTermFreqAndPositions(true) to reduce index size. You will need to include positions to use payloads, which potentially makes the index much larger.
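To make that trade-off concrete, here is a rough sketch against the modern Lucene API (where the old Field.setOmitTermFreqAndPositions toggle became FieldType.setIndexOptions). Payloads live in the positions stream, so the field must be indexed with positions, and they are read back per term occurrence. Actually attaching payloads at index time additionally requires a custom TokenFilter that sets a PayloadAttribute, which is omitted here:

```java
import org.apache.lucene.document.FieldType;
import org.apache.lucene.index.IndexOptions;
import org.apache.lucene.index.LeafReader;
import org.apache.lucene.index.PostingsEnum;
import org.apache.lucene.index.Term;
import org.apache.lucene.util.BytesRef;

import java.io.IOException;

class PayloadSketch {
    // Payloads require positions: a docs-only field (the small-index option)
    // cannot carry them, which is why the index grows.
    static FieldType payloadCapableType() {
        FieldType ft = new FieldType();
        ft.setTokenized(true);
        ft.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS);
        ft.freeze();
        return ft;
    }

    // Payloads are attached to individual term occurrences, so reading one
    // means iterating positions, not just matching documents.
    static void dumpPayloads(LeafReader reader, String field, BytesRef term)
            throws IOException {
        PostingsEnum postings = reader.postings(new Term(field, term), PostingsEnum.PAYLOADS);
        if (postings == null) return;
        while (postings.nextDoc() != PostingsEnum.NO_MORE_DOCS) {
            for (int i = 0; i < postings.freq(); i++) {
                int position = postings.nextPosition();
                BytesRef payload = postings.getPayload(); // null if none stored
                System.out.println("doc=" + postings.docID()
                        + " pos=" + position + " payload=" + payload);
            }
        }
    }
}
```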

Related

Optimizing an Elasticsearch index for many updates on few fields

We are working on a large Elasticsearch index (>1 billion documents) with many searchable fields evenly distributed across >100 shards on 6 nodes.
Only a few of these fields are actually changed, but they are changed very often. About 25% of total requests are changes to these fields.
We have one field that simply holds a boolean, which accounts for more than 90% of the changes done to the document.
It looks like we are taking huge performance hits from re-indexing entire documents, even though only a simple boolean is changing.
Reading around I found that this might be a case where one could store the boolean value within a parent-child field, as this would effectively place it in a separate index thus not forcing a recreation of the entire document. But I also read that this comes with disadvantages, such as more heap space usage for the relation.
What would be the best way to solve this challenge?
Yes: since Elasticsearch storage is internally append-only, every update effectively creates a new document and marks the old copy as stale, to be garbage-collected later.
Parent-child (a.k.a. join) or nested fields can help with this, but they come with a significant performance hit of their own, so your search performance will probably suffer.
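For concreteness, a hypothetical sketch of what such a join-field mapping could look like, written against the 7.x-era Java high-level REST client (the index, relation, and field names are all invented): the frequently flipped boolean would live in a small child document, so updating it rewrites only the child rather than the large parent.

```java
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.client.indices.CreateIndexRequest;
import org.elasticsearch.common.xcontent.XContentType;

import java.io.IOException;

class JoinFieldSketch {
    static void createIndex(RestHighLevelClient client) throws IOException {
        String mapping = "{"
                + "  \"properties\": {"
                + "    \"doc_relation\": {"
                + "      \"type\": \"join\","
                + "      \"relations\": { \"post\": \"flag\" }" // parent -> child
                + "    },"
                + "    \"my_bool\": { \"type\": \"boolean\" }"  // lives on the child
                + "  }"
                + "}";
        client.indices().create(
                new CreateIndexRequest("posts").mapping(mapping, XContentType.JSON),
                RequestOptions.DEFAULT);
    }
}
```

Note that children must be routed to the parent's shard and queries then need has_child/has_parent clauses, which is part of the overhead mentioned above.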
Another approach is to use external fast storage like Redis (as described in this article) though it would be harder to maintain and might also turn out to be inefficient for searches, depending on your use case.
The general rule of thumb here is that all use cases are different, and you should carefully benchmark all feasible options.

Can the Elasticsearch "explain" search option be used for all requests, not just for debugging?

I would like to retrieve information about what exact terms match the search query.
I found out that this problem was discussed in the following topic: https://github.com/elastic/elasticsearch/issues/17045
but was not resolved, "since it would be too cumbersome and expensive to keep this information around" (in the Elasticsearch context).
Then I discovered that by using the "explain" option in a search request I get detailed information about the score calculation, including the matching terms.
I ran a rough performance test comparing search requests with the explain option set to true against requests without it, and the test didn't show a significant impact from using explain.
So I'm wondering: can this option be used in a production system? It looks like something of a workaround, but it seems to work.
Any considerations about this?
First of all, you didn't include the details of your performance test, so it's difficult to say whether it would have a performance impact; the answer is relative to:
Your cluster configuration: total nodes, size, shards, replicas, JVM, number of documents, document sizes.
Index configuration, i.e. the index you are using the explain API on: is it a read-heavy or write-heavy index, how many docs does it hold, how does it perform during peak time, etc.?
Apart from that, an application usually issues only certain types of queries; although the search terms change, whether and why a document matches can be understood from a few samples.
I've worked with search systems extensively and use the explain API a lot, but only on samples and never on all queries, and I haven't seen the latter done anywhere.
EDIT: Please have a look at named queries, which can also be used to check which parts of your query matched the search results; there is more info in this official blog.
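For comparison, a quick sketch with the Elasticsearch Java high-level REST client (the index and field names are invented here): queryName() tags each clause so every hit reports which clauses matched in its matched_queries section, which is much cheaper than asking for a full explanation of the score:

```java
import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.builder.SearchSourceBuilder;

class MatchedQueriesSketch {
    static SearchRequest build(String userQuery) {
        SearchSourceBuilder source = new SearchSourceBuilder()
                .query(QueryBuilders.boolQuery()
                        // Each named clause is reported back per hit in
                        // "matched_queries" -- enough to see what matched...
                        .should(QueryBuilders.matchQuery("title", userQuery)
                                .queryName("title_match"))
                        .should(QueryBuilders.matchQuery("description", userQuery)
                                .queryName("description_match")))
                // ...whereas explain(true) would additionally compute and
                // serialize the full score breakdown for every hit.
                .explain(false);
        return new SearchRequest("my_index").source(source);
    }
}
```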

Does elasticsearch/lucene impose memory overhead for missing values in fieldcache?

This question is for Elasticsearch primarily, but I believe the answer will be based on underlying Lucene semantics.
I'm contemplating using multiple types in the same index. A lot of fields will be sortable and a lot of fields will only be used by one particular type; i.e., fields will be sparse, with say 10% coverage on average.
Since sorting keeps values for all docs in memory (regardless of type), I'd like to know if there's any memory overhead for missing field values (the ~90% in my case).
In a recent post on the official Elasticsearch blog titled "Index vs Type", the author tackles a common problem: choosing whether to model your data using several indices or several types.
One fact is that Lucene indices don't like sparsity. As a result, the author says that
Fields that exist in one type will also consume resources for documents of types where this field does not exist. [...] And the issue is even worse with doc values: for speed reasons, doc values often reserve a fixed amount of disk space for every document, so that values can be addressed efficiently.
There is a Lucene issue that aims to improve this situation; it has been fixed in Lucene 5.4 and will be available in Elasticsearch v2.2. Even then, the author advises modeling your data in a way that limits sparsity as much as possible.
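One mitigation worth noting (an assumption on my part, independent of that Lucene fix): if some of the sparse fields are never sorted or aggregated on, you can disable doc values for them in the mapping so no per-document space is reserved. A hypothetical sketch with the Java high-level REST client, with invented index and field names:

```java
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.client.indices.CreateIndexRequest;
import org.elasticsearch.common.xcontent.XContentType;

import java.io.IOException;

class SparseFieldSketch {
    static void createIndex(RestHighLevelClient client) throws IOException {
        // "rare_field" exists on only ~10% of docs and is never sorted on,
        // so we skip doc values to avoid reserving space for every document.
        String mapping = "{ \"properties\": {"
                + " \"rare_field\": { \"type\": \"keyword\", \"doc_values\": false }"
                + " } }";
        client.indices().create(
                new CreateIndexRequest("my_index").mapping(mapping, XContentType.JSON),
                RequestOptions.DEFAULT);
    }
}
```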

"Fan-out" indexing strategy

I'm planning to use Elasticsearch for a social network kind of platform where users can post "updates", be friends with other users and follow their friends' feed. The basic and probably most frequent query will be "get posts shared with me by friends I follow". This query could be augmented by additional constraints (like tags or geosearch).
I've learned that social networks usually take a fan-out-on-write approach to disseminate "updates" to followers so queries are more localized. So I can see 2 potential indexing strategies:
Store all posts in a single index and search for posts (1) shared with the requester and (2) whose author is among the list of users followed by the requester (the "naive" approach).
Create one index per user, inject posts that are created by followed users and directly search among this index (the "fan-out" approach).
The second option is obviously much more efficient from a search perspective, although it presents sync challenges (e.g. the need to delete posts when I stop following a friend). But the thing I would be most concerned about is the multiplication of indices; in a (successful) social network, we can expect at least tens of thousands of users...
So my questions here are:
How does ES cope with a very high number of indices? Can it incur performance issues?
Any thoughts about a better indexing strategy for my particular use case?
Thanks
Each Elasticsearch index shard is a separate Lucene index, which means several open file descriptors and memory overhead. Generally, even after reducing the number of shards per index from the default 5, the resource consumption in an index-per-user scenario may be too large.
It is hard to give any concrete numbers, but my guess is that if you stick to two shards per index, you would be able to handle no more than 3000 users per m3.medium machine, which is prohibitive in my opinion.
However, you don't necessarily need a dedicated index for every user. You can use filtered aliases to share one index among multiple users. From the application's point of view, it would look like the per-user scenario, without incurring the overhead mentioned above. See this video for details.
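A minimal sketch of creating such a filtered alias with the Java high-level REST client (the feed-<userId> alias scheme and the shared_with field are assumptions for illustration):

```java
import org.elasticsearch.action.admin.indices.alias.IndicesAliasesRequest;
import org.elasticsearch.action.admin.indices.alias.IndicesAliasesRequest.AliasActions;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.index.query.QueryBuilders;

import java.io.IOException;

class FilteredAliasSketch {
    // One shared "posts" index; each user gets an alias that narrows it
    // down to their feed. The app then searches "feed-<userId>" as if it
    // were a dedicated per-user index.
    static void createUserAlias(RestHighLevelClient client, String userId)
            throws IOException {
        IndicesAliasesRequest request = new IndicesAliasesRequest()
                .addAliasAction(AliasActions.add()
                        .index("posts")
                        .alias("feed-" + userId)
                        .filter(QueryBuilders.termQuery("shared_with", userId)));
        client.indices().updateAliases(request, RequestOptions.DEFAULT);
    }
}
```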
That being said, I don't think Elasticsearch is a particularly good fit for a fan-out-on-write strategy. It is, however, a very good solution for a fan-out-on-read scenario (something similar to what you've outlined as option 1):
The biggest advantage of using Elasticsearch is that you can perform relevance scoring, typically based on temporal features like browsing context. Using Elasticsearch just to retrieve documents sorted by timestamp means you don't utilize its potential; solutions like Redis will give you far superior read performance for such a task.
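As an illustration of that kind of scoring, a hedged sketch of a function_score query that combines text relevance with an exponential time decay, so recent matching posts rank highest (the field names are made up):

```java
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.index.query.functionscore.FunctionScoreQueryBuilder;
import org.elasticsearch.index.query.functionscore.ScoreFunctionBuilders;

class RecencyScoringSketch {
    // Text relevance multiplied by a time decay: a post scores highest when
    // it both matches the query terms and was created recently.
    static FunctionScoreQueryBuilder build(String terms) {
        return QueryBuilders.functionScoreQuery(
                QueryBuilders.matchQuery("content", terms),
                ScoreFunctionBuilders.exponentialDecayFunction(
                        "created_at", "now", "7d"));
    }
}
```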
Fan-out-on-write means a lot of writes on each update (especially if you have users with many followers). Elasticsearch is not a database and is not optimized for such a usage pattern; it is, however, well prepared for frequent reads.
Fan-out-on-write also means you produce a lot of 'extra' data by duplicating information about posts. To keep this data in RAM, you should store only metadata, like the id of the document in a separate document store, plus tags. And there are formats other than JSON for storing and searching this kind of structured data efficiently.
Choosing between the two scenarios comes down to your requirements: the average number of followers, the number of 'hubs' that nearly everybody follows, whether the feed is naturally ordered (e.g. by time), and so on. I think the decision whether to use Elasticsearch should be a consequence of this analysis.

Increasing relevancy of search results

I have a problem with making search output more practically useful for end users. The problem relates to the algorithm and approach rather than to an exact technology or framework.
At the moment we have a database of products that can be described with the following schema:
From the search perspective we've done the pretty standard things: third-party text search with a token analyzer, handling mistypes and synonyms (this is not the full list, but as I said, it is rather out of scope). But we still need to do extra work to bring the search results closer to real-life user needs, probably in a way somewhat similar to how Google ranks indexed pages by relevancy. Ideas that we've already considered as potentially applicable to the problem:
Analyze the most popular search requests in widespread search engines (it is still a question how to get them) and increase the rank of those entries in the index which correspond to (could be found with) the popular requests;
Increase the rank of the newest (hot) entries;
Increase the rank of the biggest group of entries which correspond to a popular request and have something in common (that's why it is a group);
I'd appreciate any help, or advice on a direction in which to dig.
You may try pLSA (probabilistic latent semantic analysis); there are many references on the web, and there should be libraries and source code available.
EDIT:
well, I took a closer look at Lucene recently, and it seems to give a much better answer to what the question actually asked (it does not use pLSA). As for integration with the DB, you may use Hibernate Search (although it does not seem to be as powerful as using Lucene directly).
