SOLR External file field performance issue - performance

I am using SOLR 4.5 (standalone instance) and I am trying to use external file fields to improve the ranking of documents. I have two external file fields for two different parameters that change daily, which I use in the "bf" and "boost" params of the edismax parser. Previously, these fields were part of the SOLR index.
I am facing a serious performance issue since moving these fields out of the index into external files. The CPU usage of the SOLR machine reaches 100% at peak load, and the average response time has risen from 13 milliseconds to almost 150 milliseconds.
Is there anything I can do to improve the performance of SOLR when using external file fields? Is there anything to take care of when using external file field values within boost/bf functions?

As described in the SO question "Relevancy boosting very slow in Solr", the key=value pairs that make up the external file should be sorted by key. This is also stated in the Javadoc of ExternalFileField:
The external file may be sorted or unsorted by the key field, but it will be substantially slower (untested) if it isn't sorted.
So if the content of your file looks like this (just an example)
300=3.8294805903e-07
5=3.8294805903e-07
20=3.8294805903e-07
You will need a script that alters the contents to
5=3.8294805903e-07
20=3.8294805903e-07
300=3.8294805903e-07
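A minimal sketch of such a script in Python, assuming one key=value pair per line. The function name and the numeric_keys switch are made up for illustration. Numeric ordering matches the example above; whether numeric or lexicographic order is the right one for your setup depends on the type of the key field, so treat that choice as an assumption to verify.

```python
def sort_external_file_lines(lines, numeric_keys=True):
    """Return the key=value lines sorted by key.

    numeric_keys=True sorts 5 < 20 < 300 (as in the example above);
    numeric_keys=False sorts the keys as plain strings.
    """
    pairs = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"not a key=value line: {line!r}")
        pairs.append((key, value))
    if numeric_keys:
        pairs.sort(key=lambda pair: int(pair[0]))
    else:
        pairs.sort(key=lambda pair: pair[0])
    return [f"{key}={value}" for key, value in pairs]


if __name__ == "__main__":
    unsorted_lines = [
        "300=3.8294805903e-07",
        "5=3.8294805903e-07",
        "20=3.8294805903e-07",
    ]
    for line in sort_external_file_lines(unsorted_lines):
        print(line)
```

In practice you would read the lines from the external file, write the sorted result to a temporary file, and then swap it into place.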

Related

How does ElasticSearch handle an index with 230m entries?

I was looking through Elasticsearch and noticed that you can create an index and bulk add items. I currently have a series of flat files with 220 million entries. I am working on using Logstash to parse them and add them to Elasticsearch, but I feel that having all of it under one index would make it rough to query. Each row is nothing more than 1-3 properties at most.
How does Elasticsearch function in this case? In order to effectively query this index, do you just add additional instances to the cluster and they will work together to crunch the set?
I have been walking through the documentation, and it explains what to do, but it does not always explain why it does what it does.
In order to effectively query this index, do you just add additional instances to the cluster and they will work together to crunch the set?
That is exactly what you need to do. Typically it's an iterative process:
- Start by putting a subset of the data in. You can also put in all the data, if time and cost permit.
- Put some search load on it that is as close as possible to production conditions, e.g. by turning on whatever search integration you're planning to use. If you're planning to only issue queries manually, now's the time to try them and gauge their speed and the relevance of the results.
- See whether the queries are particularly slow and whether their results are relevant enough. You can then change the index mappings or the queries you're using to get faster results, and indeed add more nodes to your cluster.
Since you mention Logstash, there are a few things that may help further:
- Check out Filebeat for indexing the data on an ongoing basis. You may not need to do the work of reading the files and bulk indexing yourself.
- If it's log or log-like data and you're mostly interested in more recent results, it could be a lot faster to split up the data by date and time (e.g. index-2019-08-11, index-2019-08-12, index-2019-08-13). See the Index Lifecycle Management feature for automating this.
- Try using the Keyword field type where appropriate in your mappings. It stops analysis on the field, which prevents full-text searches inside the field and only allows exact string matches. This is useful for fields like a "tags" field or a "status" field with values such as ["draft", "review", "published"].
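The last two suggestions can be sketched in Python as plain data structures. This is only an illustration: the helper names and the "message" field are hypothetical, and the mapping layout assumes the typeless mappings used from Elasticsearch 7.x onward.

```python
from datetime import date


def daily_index_name(prefix, day):
    """Build a per-day index name such as index-2019-08-11."""
    return f"{prefix}-{day.isoformat()}"


def build_index_body(keyword_fields, text_fields=()):
    """Build a create-index request body.

    keyword_fields are mapped as exact-match "keyword" fields,
    text_fields as analyzed full-text "text" fields.
    """
    properties = {name: {"type": "keyword"} for name in keyword_fields}
    properties.update({name: {"type": "text"} for name in text_fields})
    return {"mappings": {"properties": properties}}


if __name__ == "__main__":
    print(daily_index_name("index", date(2019, 8, 11)))
    print(build_index_body(["tags", "status"], ["message"]))
```

The resulting body would be sent when creating each daily index, or placed in an index template so that new daily indices pick it up automatically.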
Good luck!

Associating each document with a function to be satisfied by search parameters in Elasticsearch

In Elasticsearch, can I associate each document with a (different) function that must be satisfied by the parameters I supply on a search in order for the document to be returned by that search?
The functions I would particularly like to use involve a loop, some kind of simple branching (an if-statement or switch-statement), an array-like data structure, string comparisons, and simple boolean operators.
A couple of key notes here:
At query time:
- If you're looking to shape the relevancy function, meaning the actual relevancy score of each document, you could use a script score query.
- If you're only looking to filter out unwanted documents, you could use a script query, which allows you to do just that.
Both of those solutions enable you to compare incoming query parameters against previously indexed values.
Take note that using scripts at query time can lead to increased memory usage and performance issues.
Elasticsearch can also apply a second batch of filtering rules to the actual query result in the form of a post filter. This can come in handy if you're not in a position to stream-process the output at the API view level.
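As a rough illustration of the two query-time options, the request bodies could be built like this in Python. The field name min_level and the parameter name level are hypothetical, and the script_score layout assumes Elasticsearch 7.x, so check it against the version you run.

```python
def script_filter_query(source, params):
    """Body for a script query used as a filter (no effect on scoring)."""
    script = {"lang": "painless", "source": source, "params": params}
    return {"query": {"bool": {"filter": {"script": {"script": script}}}}}


def script_score_query(base_query, source, params):
    """Body for a script_score query that rescores the hits of base_query."""
    return {
        "query": {
            "script_score": {
                "query": base_query,
                "script": {"source": source, "params": params},
            }
        }
    }


if __name__ == "__main__":
    # Keep only documents whose (hypothetical) min_level field is at most
    # the level supplied with the search.
    print(script_filter_query("doc['min_level'].value <= params.level", {"level": 3}))
    # Score documents by how close min_level is to the supplied level.
    print(
        script_score_query(
            {"match_all": {}},
            "1.0 / (1 + Math.abs(doc['min_level'].value - params.level))",
            {"level": 3},
        )
    )
```

Passing values through params rather than concatenating them into the script source lets Elasticsearch reuse the compiled script across requests.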
At index time:
There is such a thing as script fields, which allow you to store a function that computes a result based on other field values and incoming query parameters. They can be really powerful given that they are assigned at index time. I think they might be what you are looking for.
I would not use them unless I needed those field values compared against query params. The reason is that I like my indexing process to be lean and fast, so I tend to compute those kinds of values at the stream level, upstream of the actual bulk indexing request.
Although convenient, the results of those custom scripts are likely to be achievable with a combination of regular queries and filters. In each release, the Elasticsearch team adds new query and field types that let you do what you used to do via scripted queries without the risk of blowing out your memory. A good example of this is the rank feature datatype recently introduced in the 7.x release.
A piece of advice for you: think of your Elasticsearch service as a regular API in your data layer. As such, you can do query processing before the actual call to Elasticsearch, and you can do data processing on the actual Elasticsearch results. If you really can't fit your business rules in there, scripts would be your last resort.
Feel free to contact me if you still have any questions. All the best.

Elasticsearch lucene, understand code path for search

I want to understand how each of the Lucene index files (nvd, dvd, tim, doc; mainly these four) is used in an ES query.
E.g. say my index has ten docs and I am doing an aggregation query. I would like to understand how ES/Lucene accesses these four files for a single query.
I am trying to see if I can make some optimizations in my system, which is mostly disk heavy, to speed up query performance.
I looked at the ES code and understand that the QueryPhase is the most expensive, and it seems to be doing a lot of random access to disk for the log-oriented data I have.
I now want to dive deeper at the Lucene level as well, and possibly debug the code and see it in action. The Lucene code has zero log messages for IndexReader-related classes. Also, debugging the Lucene code directly seems unhelpful since the unit tests don't create indexes with tim, doc, nvd, dvd files.
Any pointers?
As far as I know, ES itself doesn't do much in the search details. If you want to optimize search, in my experience the best approach is to optimize your data layout. Here is a description of some important Lucene files
(see http://lucene.apache.org/core/7_2_1/core/org/apache/lucene/codecs/lucene70/package-summary.html#package.description):
Term Index (.tip) # IN MEMORY.
Term Dictionary (.tim) # ON DISK.
Frequencies (.doc) # ON DISK.
Per-Document Values (.dvd, .dvm), very useful for aggregations. # ON DISK.
Field Index (.fdx) # IN MEMORY.
Field Data (.fdt), where the stored data is finally fetched from disk. # ON DISK.
And there are some points that can optimize performance:
Try to use small data types, for example INTEGER or LONG values instead of STRING.
Disable DocValues on fields that don't need them, and at the same time enable DocValues on the fields you want to sort or aggregate on.
Include only the necessary fields in the source, like "_source": { "includes": ["some_necessary_field"]}.
Only index the fields that you need, using ES-defined mappings.
Split your data across multiple indices.
Add SSDs.
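A small Python sketch of how the mapping-related points above might translate into request bodies. The helper names and the field names (status, bytes, trace_id) are hypothetical. The doc_values and index mapping parameters and the _source includes filter are standard Elasticsearch options, but note that doc_values cannot be set on text fields.

```python
def field_mapping(field_type, aggregatable=True, searchable=True):
    """Mapping for one field.

    aggregatable=False turns doc values off (no sorting or aggregating on it);
    searchable=False stops the field from being indexed for search.
    """
    mapping = {"type": field_type}
    if not aggregatable:
        mapping["doc_values"] = False
    if not searchable:
        mapping["index"] = False
    return mapping


def search_body(query, source_fields):
    """Search request body that returns only the listed _source fields."""
    return {"_source": {"includes": list(source_fields)}, "query": query}


if __name__ == "__main__":
    properties = {
        "status": field_mapping("keyword"),                    # searched and aggregated
        "bytes": field_mapping("long", searchable=False),      # aggregated only
        "trace_id": field_mapping("keyword", aggregatable=False),  # searched only
    }
    print({"mappings": {"properties": properties}})
    print(search_body({"term": {"status": "published"}}, ["status", "bytes"]))
```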

Need clarification about usage of mahout with hadoop

I currently have an implementation of a recommender in Mahout using the in-memory recommendation APIs. However, I would like to move to a distributed solution using Hadoop in order to calculate offline recommendations. This is my first time using Hadoop and I'm looking for clarification on a few concepts and API usages.
Currently, my understanding of Hadoop is minimal and I think that the correct approach is the following:
Use something like Apache Drill to populate HDFS with the user and item data.
Use the recommendation job in Mahout to train on the data from HDFS.
Transform the resultant data in HDFS into index shards to be used by Solr.
Use Solr to provide the recommendations to the user base.
However, I am looking for clarifications on a couple aspects of this design:
How would I utilize a rescorer in the manner that it is used in the in memory live recommendations?
What is the best manner in which to invoke the recommendations job?
I have other questions besides these two but the answers to these would be a huge help.
You may be talking about the Mahout + Hadoop + Solr recommender. This method handles rescoring in a couple of different ways.
The basic recommender can be put together in two ways:
After getting the data into HDFS in the form of (user id, item id, preference weight), run the ItemSimilarityJob on the data (use LLR similarity, which is usually best). It will create what is called an indicator matrix. This will be an item id by item id sparse matrix of values indicating the similarity magnitude between any two items. You must then convert this into values that Solr can index. That means translating the internal Mahout integer IDs into some unique string representation, which is probably what they were at the very beginning. This will look like (item123,item223 item643 item293 item445...) as a CSV, so there are two Solr fields: the first is an item id, the second is a list of similar items. All ids must be text tokens. The query for recommendations is then a Solr query made up of the item ids that a particular user has shown a preference for, so query = "item223 item344 item445...". Make the query against the field that holds the indicator matrix values. You will get back an ordered list of item IDs.
A much easier way that may work for you is to use a tool in the /examples folder of Mahout 1.0-SNAPSHOT or here: https://github.com/pferrel/solr-recommender. It takes in raw log files with unique strings for user and item ids. It does all the work on Hadoop to output CSVs that can be indexed by Solr directly or loaded into a DB as described above.
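To make the conversion step of the first approach concrete, here is a rough Python sketch that turns similarity triples into the two-column CSV described above. It is not the code used by either tool: the (item id, item id, score) input shape, the id_to_name lookup, and the top_n cutoff are all assumptions made for illustration.

```python
import csv
import io
from collections import defaultdict


def indicator_csv(similarities, id_to_name, top_n=50):
    """Turn (item_a, item_b, score) triples into CSV rows of the form
    item123,item223 item643 ...

    Column 1 is the item's string id, column 2 is a space-separated list of
    its most similar items, strongest first.
    """
    neighbours = defaultdict(list)
    for item_a, item_b, score in similarities:
        neighbours[item_a].append((score, item_b))

    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for item in sorted(neighbours):
        ranked = sorted(neighbours[item], key=lambda pair: pair[0], reverse=True)
        similar = " ".join(id_to_name[other] for _, other in ranked[:top_n])
        writer.writerow([id_to_name[item], similar])
    return out.getvalue()


if __name__ == "__main__":
    names = {1: "item123", 2: "item223", 3: "item643"}
    triples = [(1, 2, 0.9), (1, 3, 0.5), (2, 1, 0.9)]
    print(indicator_csv(triples, names), end="")
```

Each output row then maps onto the two Solr fields: the item id and the indicator field holding its similar items.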
The way I did the demo site (https://guide.finderbots.com) is to use my Solr web app integration, putting the indicator matrix into a DB and attaching the similar item list to my collection of items. So item123 got item223 item643 item293 item445... in its indicator field. After you index the collection, the query is then "item223 item344 item445...", i.e. the user's preferred items.
Here are three ways to do rescoring:
Mix in metadata with the query. So you could do query = "item223 item344 item445..." against the indicator field AND "SciFi" against the "genre" field. This gives you blended collaborative filtering and metadata in your query and as you can imagine, the recs are based on the user's prefs but skewed towards "SciFi". There are lots of other interesting things you can do once you get item+indicators+metadata into an index.
Filter recs by metadata. You can get recs that are not skewed but filtered, if you want, using the Solr query = "item223 item344 item445..." against the indicator field AND "SciFi" as a filter against the "genre" field. In this case you get nothing but "SciFi", whereas with option 1 you would get mostly "SciFi".
Get your ordered list of recs back and rescore them in any way you'd like based on other things you know about the user, context, or items. Often these can be encoded into a Solr query and done with one query, but reordering and filtering can be done after the recs are returned too. You would have to write that code; it is not built in.
The fun thing is you can mix filters, metadata fields, and user preferences with what Solr calls "boost" values to get all sorts of rescoring. Solr can even use location to query, skew, or filter.
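The difference between the first two options can be sketched as Solr request parameters built in Python. The field name "indicators" is hypothetical (the "genre" field comes from the example above), and the blended variant assumes the default OR operator, so that a genre match raises the score without being required.

```python
def blended_params(preferred_items, genre, indicator_field="indicators"):
    """Option 1: genre is part of the scored query, so it skews the ranking."""
    items = " ".join(preferred_items)
    return {"q": f'{indicator_field}:({items}) genre:"{genre}"'}


def filtered_params(preferred_items, genre, indicator_field="indicators"):
    """Option 2: genre is a filter query, so only matching items come back."""
    items = " ".join(preferred_items)
    return {"q": f"{indicator_field}:({items})", "fq": f'genre:"{genre}"'}


if __name__ == "__main__":
    prefs = ["item223", "item344", "item445"]
    print(blended_params(prefs, "SciFi"))
    print(filtered_params(prefs, "SciFi"))
```

The third option, rescoring after the recs come back, would be ordinary application code operating on the returned list.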
Note: You don't have to worry about Solr shards necessarily. Solr will index most DBs and HDFS directly but only the index is sharded. You shard an index if you have a very big one, you replicate it if you have lots of queries/second (or for failover). Solr queries are generally very fast so I'd worry about that after you have a functioning system since it's a config thing and shouldn't be affected by the rest of your workflow.

Solr performance with multiple fields

I have to index around 10 million documents in Solr for full-text search. Each of these documents has around 25 additional metadata fields attached to it. Each of the metadata fields is individually small (up to 64 characters). Common queries would involve a search term along with multiple metadata fields used to filter the data. So my question is: which of the following would provide better performance with respect to search response time? (Indexing time is not a concern.)
a. Index the text data and also push all metadata fields into Solr as stored fields, and query Solr for all the fields using a single query. (Effectively, Solr does the filtering on metadata as well as the search.)
b. Store the metadata fields in a DB like MySQL. Use Solr only for full text, and then use the document ids returned from Solr as input to the database to filter on the other metadata and retrieve the final set of documents.
Thanks
Arijit
Definitely a). Solr isn't simply a full-text search engine, it's much more. Its filter queries are at least as good/fast as a MySQL select.
b) is just silly. Fetch many ids from MySQL by selecting those with the correct metadata, do a full-text search in Solr while filtering against that list of ids, then fetch the documents from MySQL or Solr (if you choose to store data in it, not just indexes). I can't imagine a case where this would be faster.
Why complicate things? Especially if indexing time and HD space are not an issue, you should store all your data (meaning: the subset needed by users) in Solr.
The exception would be if you had a large amount of text to store (and retrieve) in each document. In those cases it would be faster to fetch it from an RDB after you get your search results back. Anyway, no one can tell for sure which one would be faster in your case, so I suggest you test the performance of both approaches (using JMeter, for example).
Also, since you don't care about index time, you should do all the processing you can at index time instead of at query time (e.g. synonyms, payloads where they can replace boosting, ...).
See here for some additional info on Solr performance:
http://wiki.apache.org/solr/SolrPerformanceFactors
