Elasticsearch find documents by another document

I want to search for documents in Elasticsearch that have exactly the same fields as a given document with id docId. For example, when a user calls the API with a docId, I want to filter documents so that every document returned matches certain fields of the document with that docId.
For example, can I query Elasticsearch like this:
POST similarTerms/_search
{
  "fields": ["_id", "title"],
  "filter": {
    "query": {
      "match": {
        "title": doc[docId].title
      }
    }
  },
  "size": 30
}
I know I can fetch the document with docId first and then build the query above, but can I somehow avoid that extra network round trip? Even a few milliseconds of improvement matter greatly for my app.
Thanks

This is a textbook scenario for the "more like this" API. Quote from the docs:
The more like this (mlt) API allows to get documents that are "like" a
specified document. Here is an example:
$ curl -XGET 'http://localhost:9200/twitter/tweet/1/_mlt?mlt_fields=tag,content&min_doc_freq=1'
The API simply results in executing a search request with moreLikeThis
query (http parameters match the parameters to the more_like_this
query). This means that the body of the request can optionally include
all the request body options in the search API (aggs, from/to and so
on). Internally, the more like this API is equivalent to performing a
boolean query of more_like_this_field queries, with one query per
specified mlt_fields.
If you plan on testing this (like I did) with only one document in the index, make sure you also set min_term_freq=0 and min_doc_freq=0, otherwise you will get no results: GET /my_index/locations/1/_mlt?min_term_freq=0&min_doc_freq=0
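Note that on clusters newer than the one quoted above the dedicated _mlt endpoint no longer exists; a sketch of the equivalent more_like_this search body, using the index and field names from the example above. Since Elasticsearch resolves the referenced document internally, this also avoids the client-side fetch the question asks about:

```json
POST /twitter/_search
{
  "query": {
    "more_like_this": {
      "fields": ["tag", "content"],
      "like": [
        { "_index": "twitter", "_id": "1" }
      ],
      "min_term_freq": 1,
      "min_doc_freq": 1
    }
  }
}
```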

Related

Reverse searching with Elastic Search

We have a site and want to give users the opportunity to save a search query and be notified when an object is added that would have matched it.
We have an index that contains the search queries the users have saved. Every time a new object is added to the object index, we want to do a reverse search to find the saved queries that would have matched that object. This is to avoid running one search per saved query every time an object is added.
The problem is that the object contains all the data, but a saved search query only contains the properties the user is interested in. So we are getting zero hits for most queries.
Example:
Search query:
{
  "make": "foo",
  "model": "bar"
}
Newly added object:
{
  "make": "foo",
  "model": "bar",
  "type": "jazz"
}
As you can see, the user is interested in any object with make "foo" and model "bar", so we want the query to result in a hit for this object; but because "type": "jazz" is missing from the saved query in the index, we get zero hits.
We use the nest client version 7.13.0 in a dotnet6 application and Elastic Search version 7.13.4.
Would it be possible to reverse search so that a null in the index would be considered as a hit for any search query?
Thank you
You can achieve this with the Percolate query in Elasticsearch.
I recently wrote a blog post on the Percolate query where I explain it with an example.
You save each user query as a percolator document, and when you index an object you call the search API with a percolate query to check whether any saved query matches the document. As you are using the NEST client, this is easy to implement.
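A minimal sketch of that flow, with assumed index and field names (a saved-searches index holding a percolator field plus the queryable object fields). Note that percolation evaluates each saved query against the document, so extra fields on the object, such as "type" below, are simply ignored rather than causing a miss:

```json
PUT /saved-searches
{
  "mappings": {
    "properties": {
      "query": { "type": "percolator" },
      "make":  { "type": "keyword" },
      "model": { "type": "keyword" }
    }
  }
}

PUT /saved-searches/_doc/user-42-search
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "make": "foo" } },
        { "term": { "model": "bar" } }
      ]
    }
  }
}

GET /saved-searches/_search
{
  "query": {
    "percolate": {
      "field": "query",
      "document": {
        "make": "foo",
        "model": "bar",
        "type": "jazz"
      }
    }
  }
}
```

Each hit is a saved query that matches the new object, so its _id tells you which user to notify.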

Elastic Search query analyzing

I want to see all the tokens generated from the 'match' text.
Is there a specific file or capability that shows the details of query execution in Elasticsearch, or some other way to see the sequence of tokens generated when I use match-level queries?
I did not use a log file to see which tokens are generated during the 'match' operation. Instead, I used the _analyze endpoint.
Better still, if you want to use the analyzer of a specific index (in case different indices each use their own custom analyzer), put the name of the index in the URL:
POST /index_name/_analyze
{
  "text": "1174HHA8285M360"
}
This will use the default analyzer defined in that index. And if we have more than one analyzer in an index, we can name it explicitly in the request:
POST /index_name/_analyze
{
  "text": "1174HHA8285M360",
  "analyzer": "analyser_name"
}
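You can also ask for the analyzer attached to a specific field's mapping, which is usually closest to what a match query against that field will actually do (the field name here is an assumption for illustration):

```json
POST /index_name/_analyze
{
  "field": "title",
  "text": "1174HHA8285M360"
}
```

The response lists each emitted token along with its position and character offsets.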

Elasticsearch - use a "tags" index to discover all tags in a given string

I have an elasticsearch v2.x cluster with a "tags" index that contains about 5000 tags: {tagName, tagID}. Given a string, is it possible to query the tags index to get all tags that are found in that string? Not only do I want exact matches, but I also want to be able to control for fuzzy matches without being too generous. By "too generous" I mean a tag should only match if all tokens in the tag are found within a certain proximity of each other (say, 5 words).
For example, given the string:
Model 22340 Sound Spectrum Analyzer
The following tags should match:
"sound analyzer", "sound spectrum analyzer"
BUT NOT:
"sound meter", "light spectrum", "chemical analyzer"
I don't think it's possible to create an accurate elasticsearch query that will auto-tag a random string. That's basically a reverse query. The most accurate way to match a tag to a document is to construct a query for the tag, and then search the document. Obviously this would be terribly inefficient if you need to iterate over each tag to auto-tag a document.
To do a reverse query, you want to use the Elasticsearch Percolator API:
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-percolate.html
The API is very flexible and allows you to create fairly complex queries into documents with multiple fields.
The basic concept is this (assuming your tags have an app specific ID field):
For each tag, create a query for it, and register the query with the percolator (using the tag's ID field).
To auto-tag a string, pass your string (as a document) to the Percolator, which will match it against all registered queries.
Iterate over the matches. Each match includes the _id of the query. Use the _id to reference the tag.
This is also a good article to read: https://www.elastic.co/blog/percolator-redesign-blog-post
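On a 2.x cluster the rough shape would be the following (index, type, and tag names are assumptions): register one query per tag under the special .percolator type, then percolate the string as a document. A match_phrase with slop 5 also gives the within-5-words proximity the question asks for:

```json
PUT /tags/.percolator/tag-17
{
  "query": {
    "match_phrase": {
      "body": {
        "query": "sound spectrum analyzer",
        "slop": 5
      }
    }
  }
}

GET /tags/doc/_percolate
{
  "doc": {
    "body": "Model 22340 Sound Spectrum Analyzer"
  }
}
```

Each match in the response carries the _id of the registered query (here tag-17), which you map back to the tag.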
{
  "query": {
    "match": {
      "tagName": {
        "query": "Model 22340 Sound Spectrum Analyzer",
        "fuzziness": "AUTO",
        "operator": "or"
      }
    }
  }
}
If you want an exact match, so that "sound meter" will not match, you will have to add another field to each tag containing the term count of the tag name, add a script to count the terms in the query, and compare the two in the match query; see Finding Multiple Exact Values.
Regarding the proximity issue: since you require fuzziness, you cannot control proximity, because the match_phrase query is not integrated with fuzziness, as stated in the Elastic docs on the fuzzy match query:
Fuzziness works only with the basic match and multi_match queries. It doesn’t work with phrase matching, common terms, or cross_fields matches.
So you need to decide: fuzziness vs. proximity.
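If you give up fuzziness, the proximity side alone is easy to express with match_phrase and slop (the value 5 mirrors the proximity window from the question):

```json
{
  "query": {
    "match_phrase": {
      "tagName": {
        "query": "sound spectrum analyzer",
        "slop": 5
      }
    }
  }
}
```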
Of course you can. You can achieve what you want using just a match query with the standard analyzer.
curl -XGET "http://localhost:9200/tags/_search?pretty" -d '{
  "query": {
    "match": {
      "tagName": "Model 22340 Sound Spectrum Analyzer"
    }
  }
}'

elasticsearch: allow discovery of document, without exposing source?

I'm trying to set up elasticsearch so that it allows users to discover the existence of documents, without having access to the document itself. For example, imagine a site that aggregates academic articles: they allow full-text search over the body, but only present the abstract.
I am trying to set up a system where different user groups have access to different documents, but everyone has access to the entire index.
What is the path of least resistance for me to set up restricted content search on elasticsearch? Is it a setting? A plugin? Write my own plugin? Fork?
To answer the first part of your question:
First way: you can disable returning the _source field for a particular query, like this.
{
  "_source": false,
  "query": {
    "term": {
      "user": "kimchy"
    }
  }
}
Second way: if you never want to see the _source field, you can disable storing it in the mapping.
{
  "tweet": {
    "_source": {
      "enabled": false
    }
  }
}
On the second part of your question:
I didn't exactly get your requirements, but Shield can be useful if you want simple authentication and role-based access control, so that certain sets of users can't modify documents, and so on.
If you have your own user-facing system, you can achieve this simply by adding an access-permission field to each document and mapping the permissions to users. Then you can use filters when searching for documents, in case you don't want to get into the details of Shield.
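A sketch of that filter approach, with made-up field names: each document carries an allowed_groups field, and every search is wrapped in a filter on the caller's groups. Combined with "_source": false, the caller learns which documents matched the full-text search without receiving their bodies; an abstract could be served from a separate, unrestricted field:

```json
{
  "_source": false,
  "query": {
    "bool": {
      "must": {
        "match": { "body": "full text search terms" }
      },
      "filter": {
        "terms": { "allowed_groups": ["group-a", "group-b"] }
      }
    }
  }
}
```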

ElasticSearch run script on document insertion (Insert API)

Is it possible to specify a script to be executed when inserting a document into Elasticsearch using its Index API? This functionality exists when updating an existing document via the Update API, by passing a script attribute in the HTTP request body. It would be useful in the Index API too, because there may be fields the user wants auto-calculated and populated during insertion, without having to send an additional Update request afterwards just to run the script.
Elasticsearch 1.3
If you just need to search/filter on the fields that you'd like to add, the mapping transform capabilities that were added into 1.3.0 could possibly work for you:
The document can be transformed before it is indexed by registering a
script in the transform element of the mapping. The result of the
transform is indexed but the original source is stored in the _source
field.
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/mapping-transform.html
You can also have the same transformation run when you get a document as well by adding the _source_transform url parameter to the request:
The get endpoint will retransform the source if the _source_transform
parameter is set. The transform is performed before any source
filtering but it is mostly designed to make it easy to see what was
passed to the index for debugging.
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/_get_transformed.html
However, I don't think the _search endpoint accepts the _source_transform url parameter so I don't think you can apply the transformation to search results. That would be a nice feature request.
Elasticsearch 1.4
Elasticsearch 1.4 added a couple features which makes all this much nicer. As you mentioned, the update API allows you to specify a script to be executed. The update API in 1.4 can also accept a default document to be used in the case of an upsert. From the 1.4 docs:
curl -XPOST 'localhost:9200/test/type1/1/_update' -d '{
  "script": "ctx._source.counter += count",
  "params": {
    "count": 4
  },
  "upsert": {
    "counter": 1
  }
}'
In the example above, if the document doesn't exist it uses the contents of the upsert key to initialize the document. So in the case above the counter key in the newly created document will have a value of 1.
Now, if we set scripted_upsert to true (scripted_upsert is another new option in 1.4), our script will run against the newly initialized document:
curl -XPOST 'localhost:9200/test/type1/2/_update' -d '{
  "script": "ctx._source.counter += count",
  "params": {
    "count": 4
  },
  "upsert": {
    "counter": 1
  },
  "scripted_upsert": true
}'
In this example, if the document didn't exist the counter key would have a value of 5.
Full documentation from Elasticsearch site.
