Relevancy boosting very slow in Solr - performance

I have a Solr index with about 2.5M items in it and I am trying to use an ExternalFileField to boost relevancy. Unfortunately, it's VERY slow when I try to do this, even though the machine is beefy and Solr has lots of memory available.
In the external file I have contents like:
747501=3.8294805903e-07
747500=3.8294805903e-07
1718770=4.03292174724e-07
1534562=3.8294805903e-07
1956010=3.8294805903e-07
747509=3.8294805903e-07
747508=3.8294805903e-07
1718772=3.8294805903e-07
1391385=3.8294805903e-07
2089652=3.8294805903e-07
1948271=3.8294805903e-07
108368=3.84404072186e-06
Each line is a document ID and its corresponding boosting factor.
In my query I'm using edismax, and I am using the boost parameter, setting it to pagerank. The entire query is here.
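The full query was only linked in the original post; for illustration, the boost-related parameters look something like this (the q and fl values are placeholders, not the real query):
defType=edismax
q=<your search terms>
boost=pagerank
fl=id,score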
In my schema I have:
<!-- External File Field Type -->
<fieldType name="pagerank"
           keyField="id"
           stored="false"
           indexed="true"
           omitNorms="false"
           class="solr.ExternalFileField"
           valType="float"/>
and
<field name="pagerank"
       type="pagerank"
       indexed="true"
       stored="true"
       omitNorms="false"/>
But the performance is just plain bad. Am I missing a setting or something?

According to the javadoc
The external file may be sorted or unsorted by the key field, but it
will be substantially slower (untested) if it isn't sorted.
As I can see, the IDs in your file are unsorted. Can you sort the file and test whether that helps?
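Sorted ascending by the numeric ID, the sample above would become:
108368=3.84404072186e-06
747500=3.8294805903e-07
747501=3.8294805903e-07
747508=3.8294805903e-07
747509=3.8294805903e-07
1391385=3.8294805903e-07
1534562=3.8294805903e-07
1718770=4.03292174724e-07
1718772=3.8294805903e-07
1948271=3.8294805903e-07
1956010=3.8294805903e-07
2089652=3.8294805903e-07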

Related

Why does Solr log an error when adding documents?

I am getting an error when adding a document. My process deletes the document by id and then adds the whole document again.
I am using Spring Boot Solr to do these operations.
Most of the time the document is added, but sometimes I see it is missing.
In the Solr 9 logs I see this error, even though I don't have a groups value in my document at all:
Unknown operation for the an atomic update: GROUPS
schema.xml
<field name="groups" type="string" indexed="true" stored="true" multiValued="true"/>

Elasticsearch 7 - Sort on custom field of multi-field property

I am working on upgrading a system at work from using ES1 to ES7.
Part of the ES1 implementation included a custom plugin to add an analyzer for custom sorting. The custom sorting behavior we have is similar to "natural sort", but extended to deal with legal codes. For example, it will sort 1.1.1 before 1.10.1. We've been calling this "legal sort". We used this plugin to add an extra .legalsort field to multi-field properties in our index, and then we would sort based on this field when searching.
I am currently trying to adapt the main logic for indexing and searching to ES7. I am not trying to replace the "legal sort" plugin yet. When trying to implement sorting for searches, I ran into the error Fielddata is disabled on text fields by default. The solution I've seen suggested for that is to add a .keyword field for any text properties, which will be used for sorting and aggregation. This "works", but I don't see how I can then apply our old logic of sorting based on a .legalsort field.
Is there a way to sort on a field other than .keyword, which can use a custom analyzer, like we were able to in ES1?
The important aspect is not the name of your field (like *.keyword), but the type of the field. For exact-match searches, sorting, and aggregations the type of the field should be "keyword".
If you only use the legalsort field for display, sorting, aggregations, or exact match, simply change the type from "text" to "keyword".
If you want to use the same information for both purposes, it's recommended to make it a multi-field by itself: use the "keyword"-type field for sorting, aggregations, and exact-match search, and use the "text"-type field for full-text search.
Having two types available for the two purposes is a significant improvement over the single string type you had in ES 1.0. When you sorted in ES 1.0, the information stored in the inverted index had to be uninverted and kept in RAM. This data structure is called fielddata. It was unbounded and often caused out-of-memory exceptions. Newer versions of Lucene introduced an alternative data structure that resides on disk (and in the file system cache) as a replacement for fielddata. It is called doc values and allows sorting over huge amounts of data without consuming a significant amount of heap RAM. The only drawback: doc values are not available for analyzed text (fields of type text), hence the need for a field of type keyword.
You could also set the mapping parameter "fielddata" to true for your legalsort field, enabling fielddata for that particular field to get back the previous behaviour with all its drawbacks.
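A minimal sketch of that multi-field layout in an ES 7 mapping (the property name "code" and analyzer name "legal_sort" are placeholders, assuming the custom analyzer is defined in the index settings):
"code": {
  "type": "text",
  "fields": {
    "keyword": { "type": "keyword" },
    "legalsort": { "type": "text", "analyzer": "legal_sort", "fielddata": true }
  }
}
You can then sort on code.keyword for plain lexical order, or on code.legalsort if you keep the fielddata-based approach.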

Can ElasticSearch be used purely for aggregations?

In my current usecase, I'm using ElasticSearch as a document store, over which I am building a faceted search feature.
The docs state the following:
Sorting, aggregations, and access to field values in scripts requires a different data access pattern.
Doc values are the on-disk data structure, built at document index time, which makes this data access pattern possible. They store the same values as the _source but in a column-oriented fashion that is way more efficient for sorting and aggregations.
Does this imply that the aggregations are not dependent on the index? If so, is it advisable to prevent the fields from being indexed altogether by setting {"index": "no"} ?
This is a small deviation, but where does the setting enabled come in? How is it different from index?
On a broader note, should I be using ElasticSearch if aggregations are all I'm looking for? Should I opt for other solutions like MongoDB? If so, what are the performance considerations?
HELP!
It is definitely possible to use Elasticsearch for the sole purpose of aggregating data. I've seen such setups a few times. For instance, in one past project, we'd index data but we'd only run aggregations in order to build financial reports, and we rarely needed to get documents/hits. 99% of the use cases were simply aggregating data.
If you have such a use case, then you can tune your mapping accordingly:
The role of enabled is to decide whether your data is indexed or not. It is true by default, but if you set it to false, your data will simply be stored (in _source) and completely ignored by analyzers, i.e. it won't be analyzed, tokenized, or indexed, and thus it won't be searchable; you'll still be able to retrieve the _source, but not search on it. If you need to use aggregations, then enabled needs to be true (the default value).
The store parameter decides whether you want to store the field or not. By default, the field value is indexed but not stored, as it is already stored within the _source itself and you can retrieve it using source filtering. For aggregations, this parameter doesn't play any role.
If your use case is only about aggregations, you might be tempted to set _source: false, i.e. not store the _source at all, since all you need is to index the field values in order to aggregate them, but this is rarely a good idea for various reasons.
So, to answer your main question: aggregations do depend on the index, but the (doc-)values used for aggregations are written to dedicated files whose structure is much more efficient for sorting and aggregating than reading the data back from the inverted index.
If you're using ES 1.x, make sure to set doc_values to true for all the fields you'll want to aggregate on (except analyzed strings and boolean fields).
If you're using ES 2.x, doc_values is true by default, so you don't need to do anything special.
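For example, an ES 1.x mapping that enables doc values on a not_analyzed string field looks roughly like this (field name is a placeholder):
"properties": {
  "category": {
    "type": "string",
    "index": "not_analyzed",
    "doc_values": true
  }
}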
Update:
It is worth noting that aggregations depend on doc_values (i.e. the per-document values .dvd and .dvm Lucene files), which basically contain the same info as the inverted index, but organized in a column-oriented fashion, which makes them much more efficient for aggregations.
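To close the loop on the aggregations-only use case: such a request typically sets size to 0 so that no hits are returned, only the aggregation results (field and aggregation names are placeholders):
{
  "size": 0,
  "aggs": {
    "by_category": {
      "terms": { "field": "category" }
    }
  }
}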

How Alfresco and SOLR work with indexes and queries

I have a question about how indexed properties work in Alfresco 4.1.6 with SOLR 1.4.
I use something like this for my queries:
// Build the search parameters against the SpacesStore workspace
SearchParameters sp = new SearchParameters();
sp.addStore(StoreRef.STORE_REF_WORKSPACE_SPACESSTORE);
// Use the Alfresco FTS query language
sp.setLanguage(SearchService.LANGUAGE_FTS_ALFRESCO);
sp.setQuery(query);
// Execute the query through the SearchService
ResultSet results = getSearchService().query(sp);
where the query variable is something like this:
PATH:" /app:company_home/app:user_homes/cm:_x0030_123//*" AND
((#cm\:title:food) OR (#cm\:name:abcde) OR (TEXT:valles) OR
(#doc\:custom_property:"report") OR (#doc\:custom_property2:"report")
AND (#doc\:custom_property3:"report") AND TYPE:"{my.model}voc_document"
On my model.xml I specify what custom properties are indexed
<index enabled="true">
My question is... how does SOLR 1.4 work with the indexes if I put two or more indexed properties in the search query? Like Oracle, which picks the best index and uses only that one? Or does SOLR combine all the indexed properties and use all of them for the query?
I need this answer to decide how many indexed properties to declare in my model.xml. Maybe indexing a lot of properties doesn't give me the best and most efficient result, and it is better to index only a few properties.
And finally, one more question: I use LANGUAGE_FTS_ALFRESCO, but I can see that a LANGUAGE_SOLR_FTS_ALFRESCO also exists. Are they the same? Do I need to use the second one if I use SOLR?
Thanks a lot!
Best regards
There is only one "index". Every field you mark as indexable (which is enabled by default) ends up in your solr index. Alfresco takes your query and sends it to SOLR for processing.
If you don't have a lot of documents, you can go ahead and index every field. By far the biggest impact on indexing and search is the full-text index of the content field, which is also enabled by default.
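For reference, the index block shown in the question's model.xml usually sits inside a property definition and looks something like this (property name and child values are illustrative, not taken from the actual model):
<property name="doc:custom_property">
  <type>d:text</type>
  <index enabled="true">
    <atomic>true</atomic>
    <stored>false</stored>
    <tokenised>both</tokenised>
  </index>
</property>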
LANGUAGE_FTS_ALFRESCO will use whatever index subsystem you have enabled. In later versions it may use SOLR or the database, depending on your configuration. If you use LANGUAGE_SOLR_FTS_ALFRESCO, it forces SOLR, so if you don't have SOLR enabled you will get an error.
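For example, to force SOLR explicitly, the snippet from the question would just switch the language constant:
sp.setLanguage(SearchService.LANGUAGE_SOLR_FTS_ALFRESCO);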
Regards!

SOLR External file field performance issue

I am using SOLR 4.5 (standalone instance) and I am trying to use external file fields to improve the ranking of documents. I have two external file fields for two different parameters that change daily, which I use in the "bf" and "boost" params of the edismax parser. Previously, these fields were part of the SOLR index.
I am facing a serious performance issue after moving these fields out of the index into external files. The CPU usage of the SOLR machine reaches 100% under peak load and the average response time has risen from 13 milliseconds to almost 150 milliseconds.
Is there anything I can do to improve the performance of SOLR when using external file fields? Are there any things to take care of when using external file field values within boost/bf functions?
As described in the SO question Relevancy boosting very slow in Solr, the key=value pairs the external file consists of should be sorted by that key. This is also stated in the javadoc of the ExternalFileField:
The external file may be sorted or unsorted by the key field, but it will be substantially slower (untested) if it isn't sorted.
So if the content of your file looks like this (just an example):
300=3.8294805903e-07
5=3.8294805903e-07
20=3.8294805903e-07
you will need a script that reorders the contents to the following (a sketch of such a script is shown after the example):
5=3.8294805903e-07
20=3.8294805903e-07
300=3.8294805903e-07
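A minimal sketch of such a script in Python (the file names are placeholders; it assumes one key=value pair per line with numeric keys):
# sort_external_file.py - sort an ExternalFileField data file by its numeric key
with open("external_pagerank.txt") as src:
    lines = [line.strip() for line in src if line.strip()]

# sort by the integer document key that precedes the '=' sign
lines.sort(key=lambda line: int(line.split("=", 1)[0]))

with open("external_pagerank_sorted.txt", "w") as dst:
    dst.write("\n".join(lines) + "\n")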
