How to find similar documents in Elasticsearch


My documents are made up of various fields. Given an input document, I want to find similar documents using the input document's fields. How can I achieve this?

{
"query": {
"more_like_this" : {
"ids" : ["12345"],
"fields" : ["field_1", "field_2"],
"min_term_freq" : 1,
"max_query_terms" : 12
}
}
}
This returns documents similar to the document with id 12345. You only need to specify the ids and the field names (title, category, name, etc.), not their values.
Here is another way to do it without ids; in this case you specify the fields and the text to compare against. Example: get documents whose title is similar to:
elasticsearch is fast
{
"query": {
"more_like_this" : {
"fields" : ["title"],
"like" : "elasticsearch is fast",
"min_term_freq" : 1,
"max_query_terms" : 12
}
}
}
You can add more fields and more text to compare against.
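The two variants above can be combined. As a minimal sketch, assuming a newer Elasticsearch version in which document references are passed through `like` rather than `ids`, the request body could be built like this in Python. The helper name `more_like_this_query` and the index name `my_index` are made up for illustration:

```python
def more_like_this_query(fields, like_texts=(), like_doc_ids=(), index=None,
                         min_term_freq=1, max_query_terms=12):
    """Build a more_like_this request body as a plain dict.

    like_texts   -- free text to compare against
    like_doc_ids -- ids of already indexed documents to compare against
    index        -- optional index name for the document references
    """
    like = list(like_texts)
    for doc_id in like_doc_ids:
        ref = {"_id": doc_id}
        if index is not None:
            ref["_index"] = index
        like.append(ref)
    return {
        "query": {
            "more_like_this": {
                "fields": list(fields),
                "like": like,
                "min_term_freq": min_term_freq,
                "max_query_terms": max_query_terms,
            }
        }
    }


# Hypothetical usage: documents similar to doc 12345 and to some text.
body = more_like_this_query(
    ["title"],
    like_texts=["elasticsearch is fast"],
    like_doc_ids=["12345"],
    index="my_index",
)
```

The resulting dict can be serialized with `json.dumps` and sent as the body of a `_search` request.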

You haven't mentioned the types of your fields. A general approach is to use a catch all field (using copy_to) with the more like this query.

{
"query": {
"more_like_this" : {
"fields" : ["first name", "last name", "address", "etc"],
"like" : "your_query",
"min_term_freq" : 1,
"max_query_terms" : 12
}
}
}
Put everything in your_query. You can increase or decrease min_term_freq and max_query_terms.
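The catch-all approach mentioned above could look roughly like the following sketch. It assumes typeless mappings (Elasticsearch 7 or later) and text fields; the catch-all field name `all_text` and the source field names are placeholders, not something from the question:

```python
def catch_all_mapping(source_fields, catch_all="all_text"):
    """Mapping in which every source field is copied into one catch-all field."""
    properties = {
        name: {"type": "text", "copy_to": catch_all} for name in source_fields
    }
    # The catch-all field itself is an ordinary text field.
    properties[catch_all] = {"type": "text"}
    return {"mappings": {"properties": properties}}


def similar_to_text(text, catch_all="all_text"):
    """more_like_this query that only looks at the catch-all field."""
    return {
        "query": {
            "more_like_this": {
                "fields": [catch_all],
                "like": text,
                "min_term_freq": 1,
                "max_query_terms": 12,
            }
        }
    }


# Hypothetical field names and input text.
mapping = catch_all_mapping(["first_name", "last_name", "address"])
query = similar_to_text("John Smith 12 Main Street")
```

With this layout the query names a single field, so new source fields only need a `copy_to` entry in the mapping.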

Related

In Elasticsearch, how to use a range query on a text field?

There is a 'remark' field in my Elasticsearch index that contains various remarks, each with the date on which the remark was given. For example:
remark
------
14/02/2023 To be updated ; 15/02/2023 Further action is needed ; 16/02/2023 Looks good
For implementation-specific reasons, I can't split the date out into a separate field. I need to query all the records that match a given date range in the 'remark' field. For example: retrieve all the records in the date range 15/02/2023 to 16/02/2023.
I have written the following query in Elasticsearch:
GET myindex/_search
{
  "query" : {
    "bool" : {
      "must" : [
        {
          "range" : {
            "remark" : {
              "gte" : "2023-02-15",
              "lte" : "2023-02-16"
            }
          }
        }
      ]
    }
  },
  "highlight" : {
    "fields" : {
      "content" : {
        "type" : "unified",
        "fragment_size" : 150,
        "number_of_fragments" : 3,
        "pre_tags" : [""],
        "post_tags" : [""]
      }
    }
  },
  "size" : 1000
}
The above query doesn't work because the 'remark' field is not of type date. Is there a workaround for this issue?

Yes, a range query can be used on a text or keyword field, but Elasticsearch treats it as an expensive query.
From the documentation, "Using the range query with text and keyword fields":
Range queries on text or keyword fields will not be executed if search.allow_expensive_queries is set to false.
That setting defaults to true. If it has been turned off on your cluster, I wouldn't recommend turning it back on, but if you want to, you can use:
PUT _cluster/settings
{
"transient": {
"search.allow_expensive_queries": "true"
}
}
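After you update the cluster settings, your query will be executed. Keep in mind that a range query on a text field compares terms as strings, not as dates, so bounds such as 2023-02-15 will not line up with values stored as 15/02/2023.
Recommendation: add a new date field: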
PUT index_name/_mapping
{
"properties": {
"remark_date": {
"type": "date"
}
}
}
Then update the existing data. An update by query, combined with a script that sets remark_date, adds the new field and its value to each document:
POST index_name/_update_by_query?wait_for_completion=false
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-range-query.html#ranges-on-text-and-keyword
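As a rough sketch of how the remark_date values could be derived, assuming the dates in 'remark' always use the dd/mm/yyyy layout shown in the question, the dates can be pulled out client-side and converted to ISO format before indexing. A date field can hold several values, so all dates of one remark can go into the same field. The function names are made up for illustration:

```python
import re
from datetime import datetime

# Matches dates written as dd/mm/yyyy, e.g. 15/02/2023.
DATE_PATTERN = re.compile(r"\b(\d{2}/\d{2}/\d{4})\b")


def extract_remark_dates(remark):
    """Return every dd/mm/yyyy date found in the remark as an ISO date string."""
    dates = []
    for raw in DATE_PATTERN.findall(remark):
        dates.append(datetime.strptime(raw, "%d/%m/%Y").date().isoformat())
    return dates


def remark_date_range_query(gte, lte):
    """Range query against the separate remark_date field."""
    return {"query": {"range": {"remark_date": {"gte": gte, "lte": lte}}}}


remark = ("14/02/2023 To be updated ; 15/02/2023 Further action is needed ; "
          "16/02/2023 Looks good")
print(extract_remark_dates(remark))
print(remark_date_range_query("2023-02-15", "2023-02-16"))
```

A document matches the range query if at least one of its remark_date values falls inside the range.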

Elasticsearch slow results with IN query and Scoring

I have text document data (approximately 500k documents) saved in Elasticsearch, where the document text is mapped to its corresponding document number.
I am trying to fetch results in batches for "Sample Text" within a particular set of document numbers (approximately 300k) with scoring, and the results are extremely slow.
Here is the mapping:
PUT my_index
{
"mappings" : {
"doc_repo" : {
"properties" : {
"doc_number" : {
"type" : "integer"
},
"document" : {
"type" : "string",
"term_vector" : "with_positions_offsets_payloads"
}
}
}
}
}
Here is the request query
{
"query" : {
"bool" : {
"must" : [
{
"terms" : {
"document" : [
"sample text"
]
}
},
{
"terms" : {
"doc_number" : [1,2,3....,300K] //ArrayOf_300K_DocNumbers
}
}
]
}
},
"fields" : [
"doc_number"
],
"size" : 500,
"from" : 0
}
I tried fetching results in two other ways:
Results without scoring within a particular set of document numbers (I used filtering for this)
Results with scoring but without any particular set of document numbers (in batches)
Both of these were pretty quick, but the problem comes when I try to achieve both.
Do I need to change the mapping or the search query, or is there another way to achieve this?
Thanks in advance.

The issue was specific to Elasticsearch 2.x; upgrading Elasticsearch solved it.
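Independently of the upgrade, one common way to combine the two fast cases from the question is to score on the text in a must clause and restrict the document numbers in a filter clause, which does not contribute to the score. This is only a sketch of that idea, not the fix reported above; it assumes a version that supports the bool filter clause, and the helper name is made up:

```python
def scored_search_within_docs(text, doc_numbers, size=500, offset=0):
    """Score on the document text, restrict to doc_numbers without scoring them."""
    return {
        "query": {
            "bool": {
                # Scoring clause: analyzed full-text match on the document text.
                "must": [{"match": {"document": text}}],
                # Non-scoring clause: only keep the wanted document numbers.
                "filter": [{"terms": {"doc_number": list(doc_numbers)}}],
            }
        },
        "_source": ["doc_number"],
        "size": size,
        "from": offset,
    }


# Small illustrative id list; the question uses roughly 300k numbers.
body = scored_search_within_docs("sample text", [1, 2, 3])
```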

elasticsearch percolator filter fails

I'm percolating a document against a percolator, and that works fine. When I try to restrict which percolator queries the document is matched against by filtering on the query ids, it doesn't return any results. For example:
{
"doc" : {
"text" : "This is the text within my document"
},
"highlight" : {
"order" : "score",
"pre_tags" : ["<example>"],
"post_tags" : ["</example>"],
"fields" : {
"text" : { "number_of_fragments" : 0 }
}
},
"filter":{"ids":{"values":[11,15]}}
,
"size" : 100
}
I know for sure that those ids are correct, but I always get "matches" : [ ]. When I don't use the filter, ES returns the correct matches.
Thanks for your help.

I think I've solved it. It seems that the filter only works on the "metadata" fields, meaning that you have to add custom fields to the queries indexed in the percolator so that you can filter on them later.
Using my previous example, I would have to index percolator queries like this:
{
"query" : {
"match_phrase" : {
"text" : "document"
}
},
"id" : 11
}
This "manually" adds a redundant id field so that it can be used later as a filter reference.
At percolation time, you have to use something like:
{
"doc" : {
"text" : "This is the text within my document"
},
"filter":{"match":{"id":11}},
"highlight" : {
"order" : "score",
"pre_tags" : ["<example>"],
"post_tags" : ["</example>"],
"fields" : {
"text" : { "number_of_fragments" : 0 }
}
},
"size" : 100
}
This way, only that percolator query is used.

Elasticsearch phrase prefix query on multiple fields

I'm new to ES and I'm trying to build a query that uses phrase_prefix on multiple fields so I don't have to search more than once.
Here's what I've got so far:
{
"query" : {
"text" : {
"first_name" : {
"query" : "Gustavo",
"type" : "phrase_prefix"
}
}
}
}
Does anybody know how to search on more than one field, say "last_name"?

The text query that you are using was deprecated (effectively renamed) a while ago in favour of the match query. The match query supports a single field, but you can use the multi_match query, which supports the very same options and lets you search on multiple fields. Here is an example that should be helpful to you:
{
"query" : {
"multi_match" : {
"fields" : ["title", "subtitle"],
"query" : "trying out ela",
"type" : "phrase_prefix"
}
}
}
You can achieve the same using the Java API like this:
QueryBuilders.multiMatchQuery("trying out ela", "title", "subtitle")
.type(MatchQueryBuilder.Type.PHRASE_PREFIX);
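Applied to the fields from the question, the same request body can be sketched in Python as a plain dict. The helper name is made up, and "last_name" is taken from the question's wording rather than from a known mapping:

```python
def phrase_prefix_query(text, fields):
    """multi_match query that applies phrase_prefix matching to several fields."""
    return {
        "query": {
            "multi_match": {
                "query": text,
                "fields": list(fields),
                "type": "phrase_prefix",
            }
        }
    }


body = phrase_prefix_query("Gustavo", ["first_name", "last_name"])
```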

How should I query Elastic Search given my mapping and using keywords?

I have a very simple mapping which looks like this (I streamlined the example a bit):
{
"location" : {
"properties": {
"name": { "type": "string", "boost": 2.0, "analyzer": "snowball" },
"description": { "type": "string", "analyzer": "snowball" }
}
}
}
Now I index a lot of locations using some random values which are based on real English words.
I'd like to be able to search for locations that match any of the given keywords in either the name or the description field (name is more important, hence the boost I gave it). I tried a few different queries and they don't return any results.
{
"fields" : ["name", "description"],
"query" : {
"terms" : {
"name" : ["savage"],
"description" : ["savage"]
},
"from" : 0,
"size" : 500
}
}
Considering there are locations which have the word savaged in the description it should get me some results (savage is the stem of savaged). It yields 0 results using the above query. I've been using curl to query ES:
curl -XGET -d @query.json http://localhost:9200/myindex/locations/_search
If I use query string instead:
curl -XGET http://localhost:9200/fieldtripfinder/locations/_search?q=description:savage
I actually get one result (of course now it would be searching the description field only).
Basically I am looking for a query that will do a OR kind of search using multiple keywords and compare them to the values in both the name and the description field.

The snowball analyzer stems "savage" to "savag", and the terms query does not analyze its input, which is why the term "savage" didn't return any results. When you specify "savage" in the URL, however, it is analyzed, so you get results. Depending on what your intention is, you can either use the correct stem ("savag") or have your terms analyzed by using a "match" query instead of "terms":
{
  "fields" : ["name", "description"],
  "query" : {
    "bool" : {
      "should" : [
        {"match" : {"name" : "savage"}},
        {"match" : {"description" : "savage"}}
      ]
    }
  },
  "from" : 0,
  "size" : 500
}
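For several keywords, the same shape can be generated programmatically. The sketch below relies on the match query combining the words of its input with OR by default; the helper name and the keyword list are illustrative only:

```python
def any_keyword_query(keywords, fields=("name", "description"), size=500, offset=0):
    """Match documents containing any of the keywords in any of the fields."""
    text = " ".join(keywords)
    # One analyzed match clause per field; bool/should ORs the fields together.
    should = [{"match": {field: text}} for field in fields]
    return {
        "query": {"bool": {"should": should}},
        "from": offset,
        "size": size,
    }


body = any_keyword_query(["savage", "river"])
```

Because the keywords go through the field's analyzer, "savage" is stemmed the same way as the indexed text.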
