I am trying to group results according to which field matched, in a single query. In the group.query parameter I pass individual queries per field name, and in the q parameter I pass just the search term to be matched against the default copy field. However, it is taking much more time than querying the individual fields separately and combining the data on my end.
Here is a sample query:
http://localhost:8983/solr/psqlTest/select?q=201*&wt=json&indent=true&group=true&group.query=road_name:201*&group.query=own_name:201*&group.query=tel_no:201*&group.query=assm_no:201*&group.limit=5
This query takes on the order of seconds to complete when at least 100 users are hitting it (load-tested with JMeter).
Querying each field separately and joining the results takes on the order of 100-200 ms for the same JMeter parameters.
Why is there such a huge difference in the performance?
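For reference, a sketch of the per-field alternative I compare against (field names are from my schema; the rows value is illustrative):

```python
from urllib.parse import urlencode

# One lightweight query per field; results are merged client-side
fields = ["road_name", "own_name", "tel_no", "assm_no"]
term = "201*"

urls = [
    "http://localhost:8983/solr/psqlTest/select?"
    + urlencode({"q": f"{f}:{term}", "rows": 5, "wt": "json"})
    for f in fields
]
```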
I have a retention index which is used to save transaction data. The index pattern is yearly, meaning transaction-2000, transaction-2001, etc. There is a timestamp field inside each document which indicates when the document occurred.
I also have an alias transaction which points to all yearly transaction indices. When I query the transaction data in my application, I just use the alias name rather than the yearly index name.
My question is: if I query just one year's documents based on the timestamp field, e.g. 2000, will the query be faster if I only query the single index transaction-2000 rather than the alias transaction? Or are they the same speed?
Joey, this is a classic Elasticsearch problem. When you have multiple indices behind an alias, all of them are queried. One way to overcome this is routing, which comes in extremely handy if you already know which index to go to. At query time, if you already know the timestamp range (2000, or 2000 and 2001, for example), you can specifically instruct Elasticsearch to search only those indices behind the alias using routing:
https://www.elastic.co/blog/customizing-your-document-routing
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-shard-routing.html
We had a similar issue when we scaled for a large dataset search (a similar design where multiple indices were behind an alias); routing came in handy and the queries scaled to our requirements. Hope this helps.
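A sketch of the simplest variant: when the years are known up front, target only those yearly indices instead of the whole alias (index and field names match the question; the date bounds are illustrative):

```python
# Build a search that hits only the relevant yearly indices
years = [2000, 2001]
indices = ",".join(f"transaction-{y}" for y in years)
path = f"/{indices}/_search"

body = {
    "query": {
        "range": {
            "timestamp": {"gte": "2000-01-01", "lt": "2002-01-01"}
        }
    }
}
```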
So, I have 2 indexes in my Elasticsearch server.
I need to gather the results from the first index, and for each result I need to gather info from the second index.
How can I do that? I tried the foreach processor, but no luck so far.
Thanks
I need to gather the results from the first index, and for each result I need to gather info from the second index.
Unless you create parent/child relationships, that's not possible in Elasticsearch.
However, note:
In Elasticsearch the key to good performance is to de-normalize your data into documents. Each join field, has_child or has_parent query adds a significant tax to your query performance.
Handle reading from multiple indexes within your application or rethink your index mapping.
The foreach processor is for ingest pipelines, meaning, stuff that gets done at indexing time. So it won't help you when you are trying to gather the results.
In general, it's not going to be possible to query another index (which might live on another shard) from within a query.
In some cases, you can use a join field. There are performance implications, it's only recommended in specific cases.
If you are not in the join-field use case and you can restructure your data to use nested objects, they will be more performant than join fields.
Otherwise, you'll be better off running multiple queries in the application code (maybe you can fetch all the "secondary" results using just one query, so you'd have 2 queries in total?)
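A hypothetical sketch of that two-query approach in application code: collect the keys from the first result set, then fetch every related document from the second index in one terms query (the user_id field is made up for illustration):

```python
# Pretend these hits came from the first index
primary_hits = [
    {"_source": {"user_id": 1}},
    {"_source": {"user_id": 2}},
]

# Collect the join keys, then fetch all secondary docs in ONE query
ids = [hit["_source"]["user_id"] for hit in primary_hits]
secondary_query = {"query": {"terms": {"user_id": ids}}}
```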
In Elasticsearch, can I associate each document with a (different) function that must be satisfied by parameters I supply on a search, in order to be returned on that search?
The functions I would like to use involve a loop, some kind of simple branching (if-statement or switch-statement), an array-like data structure, string comparisons, and simple boolean operators.
A couple of key notes here:
At query time:
- If you're looking to shape the relevancy function, meaning the actual relevancy score of each document, you could use a script_score query.
- If you're only looking to filter out unwanted documents, you could use a script query, which allows you to do just that.
Both of those solutions enable you to compute a score by comparing incoming query parameters against previously indexed values.
Take note that usage of scripts at query time can lead to increased memory usage and performance issues.
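Sketches of both request bodies (the price field and the parameter values are made up for illustration):

```python
# 1) script_score: reshape the relevance score per document
script_score_body = {
    "query": {
        "script_score": {
            "query": {"match_all": {}},
            "script": {
                "source": "_score * doc['price'].value * params.factor",
                "params": {"factor": 1.2},
            },
        }
    }
}

# 2) script query inside a bool filter: keep or drop documents
script_filter_body = {
    "query": {
        "bool": {
            "filter": {
                "script": {
                    "script": {
                        "source": "doc['price'].value < params.budget",
                        "params": {"budget": 100},
                    }
                }
            }
        }
    }
}
```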
Elasticsearch can also apply a second batch of filtering rules to the actual query result, in the form of a post_filter. This can come in handy if you're not in a position to stream-process the output at the API view level.
At index time:
There is also such a thing as script fields, which let you compute a result from other fields' values and incoming query parameters. Note that they are evaluated at search time for each returned document rather than stored at index time. I think they might be what you are looking for.
I would not use those unless I needed the field values compared against query parameters. I like my indexing process to be lean and fast, so I tend to compute those kinds of derived values at the stream level, upstream of the actual bulk indexing request.
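A sketch of a script_fields request body (the price field and discount value are made up):

```python
# script_fields computes a derived value per returned hit at search time
body = {
    "query": {"match_all": {}},
    "script_fields": {
        "discounted_price": {
            "script": {
                "source": "doc['price'].value * (1 - params.discount)",
                "params": {"discount": 0.1},
            }
        }
    },
}
```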
Although convenient, those custom script results can often be achieved with a combination of regular queries and filters. In each release, the Elasticsearch team adds new query and field types that let you do what you used to do via scripted queries without the risk of blowing out your memory. A good example of this is the rank_feature datatype introduced in the 7.x releases.
A piece of advice: think of your Elasticsearch service as a regular API in your data layer. As such, you can do query processing before the actual call to Elasticsearch, and you can do data processing on the actual Elasticsearch results. If you really can't fit your business rules in there, scripts would be your last resort.
Feel free to contact me if you still have any questions. All the best.
I have a giant Elasticsearch query that I am executing. It consists of a number of "subqueries" joined by should clauses on a bool query. I need to know which resulting items match which should clauses.
I sort and then limit the size of the result set. I cannot split the query into multiple queries and use msearch, due to the huge volume of data that would be returned in order to later sort the full result set and again truncate it to size.
Is there a way to tag results that match a portion of a query, so that I can later identify which pieces caused those results?
Yes, have a look at Named Queries.
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-request-named-queries-and-filters.html
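A sketch of how that looks: tag each should clause with _name (field names and search term are made up):

```python
# Each hit in the response then carries a matched_queries array listing
# which named clauses it satisfied, e.g. ["title_match"]
body = {
    "query": {
        "bool": {
            "should": [
                {"match": {"title": {"query": "elk", "_name": "title_match"}}},
                {"match": {"body": {"query": "elk", "_name": "body_match"}}},
            ]
        }
    }
}
```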
I have to index around 10 million documents in Solr for full-text search. Each of these documents has around 25 additional metadata fields attached. Each metadata field is individually small (up to 64 characters). Common queries involve a search term along with multiple metadata fields used to filter the data. So my question is: which would provide better performance with respect to search response time (indexing time is not a concern)?
a. Index the text data and push all metadata fields into Solr as stored fields, then query Solr for all the fields in a single query. (Effectively Solr does the metadata filtering as well as the search.)
b. Store the metadata fields in a database like MySQL. Use Solr only for full text, then use the document ids returned from Solr as input to the database to filter on the other metadata and retrieve the final set of documents.
Thanks
Arijit
Definitely a). Solr isn't simply a full-text search engine; it's much more. Its filter queries are at least as good/fast as a MySQL select.
b) is just silly: fetch many ids from MySQL by selecting those with the correct metadata, do a full-text search in Solr while filtering against that id list, then fetch documents from MySQL or Solr (if you choose to store data in it, not just indexes). I can't imagine a case where this would be faster.
Why complicate things? Especially since indexing time and disk space are not an issue, you should store all your data (meaning: the subset needed by users) in Solr.
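A sketch of option a): the search term goes in q and each metadata filter goes in its own fq parameter, which Solr caches independently (core and field names are made up):

```python
from urllib.parse import urlencode

# Repeated fq params combine as AND; each is cached separately by Solr
params = [
    ("q", "text:grouping"),
    ("fq", "author:smith"),
    ("fq", "year:2014"),
    ("rows", 10),
    ("wt", "json"),
]
url = "http://localhost:8983/solr/mycore/select?" + urlencode(params)
```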
An exception would be if you had a large amount of text to store (and retrieve) in each document. In those cases it would be faster to fetch it from the RDBMS after you get your search results back. Anyway, no one can tell for sure which one would be faster in your case, so I suggest you test the performance of both approaches (using JMeter, for example).
Also, since you don't care about index time, you should do all the processing you can at index time instead of at query time (e.g. synonyms, payloads where they can replace boosting, ...).
See here for some additional info on Solr performance:
http://wiki.apache.org/solr/SolrPerformanceFactors