Elasticsearch termvector API not working - elasticsearch

I've set the mapping of the title field for the article type in the testindex1 index as follows:
PUT /testindex1/article/_mapping
{
"article": {
"type": "object",
"dynamic": false,
"properties": {
"title": {
"type": "string",
"store": true,
"term_vector": "with_positions_offsets",
"_index": {
"enabled": true
}
},
}
}
}
omitting the remainder of the mapping specification. (This example and those that follow assume the Marvel Sense dashboard interface.) testindex1 is then populated with articles, including the article with id 4540.
As expected,
GET /testindex1/article/4540/?fields=title
produces
{
"_index": "testindex1",
"_type": "article",
"_id": "4540",
"_version": 1,
"exists": true,
"fields": {
"title": "Elasticsearch is the best solution"
}
}
(The title text has been changed to protect the innocent.)
However,
GET /testindex1/article/4540/_termvector?fields=title
produces
No handler found for uri [/testindex1/article/4540/_termvector?fields=title&_=1404765178625] and method [GET]
I've experimented with variants of the mapping specification, and variants of the termvector request, so far to no avail. I've also looked for tips in official and non-official documentation, and on forums that cover Elasticsearch topics, including Stack Overflow. elasticsearch.org looks authoritative. I expect I've misused the termvector API in a way that will be instantly obvious to people who are familiar with it. Please point out my mistake(s). Thanks.

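The _termvector API endpoint, which returns term vector statistics, was only added in 1.0.0.Beta1, so you will need to upgrade if you want to use term vectors. (The "exists" field in your GET response suggests a pre-1.0 release; from 1.0 on that field is named "found".)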
Term Vectors
Note
Added in 1.0.0.Beta1.
Returns information and statistics on terms in the fields of a
particular document as stored in the index.
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/docs-termvectors.html
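One way to confirm this diagnosis is to compare the version number the cluster reports (version.number in the response to GET /) against 1.0.0. The following is a minimal Python sketch of that comparison, assuming version strings shaped like 0.90.13 or 1.0.0.Beta1. The helper names parse_version and supports_termvector are hypothetical and not part of any Elasticsearch client.

```python
import re

def parse_version(number):
    """Return the leading numeric components of a version string, padded to three."""
    parts = []
    for piece in number.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break  # stop at qualifiers such as "Beta1" or "RC1"
        parts.append(int(match.group()))
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts[:3])

def supports_termvector(number):
    """True if the _termvector endpoint should exist (added in 1.0.0.Beta1)."""
    return parse_version(number) >= (1, 0, 0)

print(supports_termvector("0.90.13"))      # False
print(supports_termvector("1.0.0.Beta1"))  # True
```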

Related

Misspelling suggestion ("did you mean") with phrase suggest and whitespace correction with Elasticsearch

I use the default "english" analyzer for searching documents and it works pretty well.
But I also need "did you mean" results when the search query is misspelled, or when searching by such misspelled phrases.
What analyzers/filters/queries do I need to achieve this behaviour?
Source text
Elasticsearch is a distributed, open source search and analytics engine for all types of data,
including textual, numerical, geospatial, structured, and unstructured. Elasticsearch is built
on Apache Lucene and was first released in 2010 by Elasticsearch N.V. (now known as Elastic).
Known for its simple REST APIs, distributed nature, speed, and scalability, Elasticsearch is
the central component of the Elastic Stack, a set of open source tools for data ingestion,
enrichment, storage, analysis, and visualization. Commonly referred to as the ELK Stack
(after Elasticsearch, Logstash, and Kibana), the Elastic Stack now includes a rich collection
of lightweight shipping agents known as Beats for sending data to Elasticsearch.
Search terms
search query => did you mean XXX?
missed letter or something like
Elastisearch => Elasticsearch
distribated => distributed
Apacje => Apache
extra space
Elastic search => Elasticsearch
no space
opensource => open source
misspelled phrase
serach engne => search engine
Your first case (a missed letter or something similar) can be handled with the fuzzy query, and the second one with a custom analyzer that uses an ngram or edge-ngram tokenizer. For examples of the latter, please refer to my blog on autocomplete.
Here is a fuzzy query example on your sample doc.
Index mapping
{
"mappings": {
"properties": {
"title": {
"type": "text"
}
}
}
}
Index your sample docs and use the search queries below:
{
"query": {
"fuzzy": {
"title": {
"value": "distributed"
}
}
}
}
And the search result:
"hits": [
{
"_index": "didyou",
"_type": "_doc",
"_id": "2",
"_score": 0.89166296,
"_source": {
"title": "distribated"
}
}
]
And for Elasticsearch
{
"query": {
"fuzzy": {
"title": {
"value": "Elasticsearch"
}
}
}
}
And the search result:
"hits": [
{
"_index": "didyou",
"_type": "_doc",
"_id": "1",
"_score": 0.8173577,
"_source": {
"title": "Elastisearch"
}
}
]
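To see why the fuzzy queries above match, it helps to compute the edit distance between each query term and the indexed term. The sketch below is plain Python, not Elasticsearch code. It uses ordinary Levenshtein distance, whereas Lucene by default also counts a swap of two adjacent characters as a single edit, so treat it as an approximation. The auto_fuzziness helper mirrors the documented AUTO rule (no edits for terms of 1-2 characters, one edit for 3-5, two for longer terms); both function names are made up for illustration.

```python
def edit_distance(a, b):
    """Plain Levenshtein distance (insert, delete, substitute each cost 1)."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def auto_fuzziness(term):
    """Allowed edits under the AUTO rule: 0, 1 or 2 depending on term length."""
    if len(term) <= 2:
        return 0
    if len(term) <= 5:
        return 1
    return 2

pairs = [
    ("distributed", "distribated"),
    ("elasticsearch", "elastisearch"),
    ("apache", "apacje"),
]
for query_term, indexed_term in pairs:
    distance = edit_distance(query_term, indexed_term)
    # Each pair is one edit apart, which is within the AUTO allowance.
    print(query_term, indexed_term, distance, distance <= auto_fuzziness(query_term))
```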

Count of "actual hits" (not just matching docs) for arbitrary queries in Elasticsearch

This one really frustrates me. I tried to find a solution for quite a long time, but wherever I try to find questions from people asking for the same, they either want something a little different (like here or here or here) or don't get an answer that solves the problem (like here).
What I need
I want to know how many hits my search has in total, independently from the type of query used. I am not talking about the number of hits you always get from ES, which is the number of documents found for that query, but rather the number of occurrences of document features matching my query.
For example, I could have two documents with a text field "description", both containing the word hero, but one of them containing it twice.
Like in this minimal example here:
Index mapping:
PUT /sample
{
"settings": {
"index" : {
"number_of_shards" : 1,
"number_of_replicas" : 0
}
},
"mappings": {
"doc": {
"properties": {
"name": { "type": "keyword" },
"description": { "type": "text" }
}
}
}
}
Two sample documents:
POST /sample/doc
{
"name": "Jack Beauregard",
"description": "An aging hero"
}
POST /sample/doc
{
"name": "Master Splinter",
"description": "This rat is a hero, a real hero!"
}
...and the query:
POST /sample/_search
{
"query": {
"match": { "description": "hero" }
},
"_source": false
}
... which gives me:
{
"took": 2,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 0.22396864,
"hits": [
{
"_index": "sample",
"_type": "doc",
"_id": "hoDsm2oB22SyyA49oDe_",
"_score": 0.22396864
},
{
"_index": "sample",
"_type": "doc",
"_id": "h4Dsm2oB22SyyA49xDf8",
"_score": 0.22227617
}
]
}
}
So there are two hits ("total": 2), which is correct, because the query matches two documents. BUT I want to know how many times my query matched inside each document (or the sum of these counts), which would be 3 in this example, because the second document contains the search term twice.
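To make the distinction concrete, here is a small client-side sketch in Python that counts both numbers for the two sample descriptions. It is only an illustration of the quantity being asked for: the lowercase word split is a naive stand-in for the real analyzer, and nothing here talks to Elasticsearch.

```python
import re

docs = [
    "An aging hero",
    "This rat is a hero, a real hero!",
]

def occurrences(text, term):
    """Count whole-word, case-insensitive occurrences of term in text."""
    return re.findall(r"\w+", text.lower()).count(term.lower())

per_doc = [occurrences(text, "hero") for text in docs]
matching_docs = sum(1 for count in per_doc if count > 0)
total_occurrences = sum(per_doc)

print(matching_docs)      # 2, what hits.total reports
print(total_occurrences)  # 3, the number being asked for
```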
IMPORTANT: This is just a simple example. But I want this to work for any type of query and any mapping, also nested documents with inner_hits and all.
I didn't expect this to be so difficult, because this must be information ES comes across during search anyway, right? I mean, it ranks the documents with more hits inside them higher, so why can't I get the count of these hits?
I am tempted to call them "inner hits", but that is the name of a different ES feature (see below).
What I tried / could try (but it's ugly)
I could use highlighting (which I do anyway) and try to make the highlighter generate one highlight for each "inner match" (without combining them), then post-process the complete set of search results and count all the highlights. Of course, this is very ugly, because (1) I don't really want to post-process my results and (2) I'd have to fetch all results by setting size to a high enough value, when I actually only want the number of results requested by the client. This would be a lot of overhead!
The feature inner_hits sounds very promising, but it just means that you can handle the hits inside nested documents independently to get a highlighting for each of them. I use this for my nested docs already, but it doesn't solve this problem because (1) it persists on inner hit level and (2) I want this to work with non-nested queries, too.
Is there a way to achieve this in a generic way for arbitrary queries? I'd be most thankful for any suggestions. I'm even down for solving it by tinkering with the ranking or using script fields, anything.
Thanks a lot in advance!
I would definitely not recommend this for any kind of practical use due to the awful performance, but this data is technically available in the term frequency calculation in the results from the explain API. See What is Relevance? for a conceptual explanation and Explain API for usage.
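As a rough illustration of what reading term frequencies out of an explanation could look like, the Python sketch below walks a nested explanation dictionary and sums the values of nodes that look like term-frequency entries. The wording of the description strings differs between Elasticsearch versions, so the prefixes checked here ("termFreq=" and "freq,") are assumptions to adjust for your version. The sample dictionary is hand-written for the example, not real output, and collect_term_freqs is a hypothetical helper.

```python
def collect_term_freqs(explanation):
    """Recursively gather the values of term-frequency nodes in an explain tree."""
    found = []
    description = explanation.get("description", "")
    # Assumed prefixes; check them against the explain output of your version.
    if description.startswith(("termFreq=", "freq,")):
        found.append(explanation.get("value", 0))
    for child in explanation.get("details") or []:
        found.extend(collect_term_freqs(child))
    return found

# Hand-written stand-in for an explanation tree (not real Elasticsearch output).
sample = {
    "value": 0.22,
    "description": "weight(description:hero in 1), result of:",
    "details": [
        {"value": 0.18, "description": "idf, computed from:", "details": []},
        {"value": 1.2, "description": "tfNorm, computed from:", "details": [
            {"value": 2.0, "description": "termFreq=2.0", "details": []},
        ]},
    ],
}

print(sum(collect_term_freqs(sample)))  # 2.0
```

Since an explanation has to be requested per matching document, this inherits the performance cost mentioned above.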

How to apply synonyms at query time instead of index time in Elasticsearch

According to the elasticsearch reference documentation, it is possible to:
Expansion can be applied either at index time or at query time. Each has advantages (⬆)︎ and disadvantages (⬇)︎. When to use which comes down to performance versus flexibility.
The advantages and disadvantages all make sense and for my specific use I want to make use of synonyms at query time. My use case is that I want to allow admin users in my system to curate these synonyms without having to reindex everything on an update. Also, I'd like to do it without closing and reopening the index.
The main reason I believe this is possible is this advantage:
(⬆)︎ Synonym rules can be updated without reindexing documents.
However, I can't find any documentation describing how to apply synonyms at query time instead of index time.
To use a concrete example, if I do the following (example stolen and slightly modified from the reference), it seems like this would apply the synonyms at index time:
/* NOTE: This was all run against elasticsearch 1.5 (if that matters; documentation is identical in 2.x) */
// Create our synonyms filter and analyzer on the index
PUT my_synonyms_test
{
"settings": {
"analysis": {
"filter": {
"my_synonym_filter": {
"type": "synonym",
"synonyms": [
"queen,monarch"
]
}
},
"analyzer": {
"my_synonyms": {
"tokenizer": "standard",
"filter": [
"lowercase",
"my_synonym_filter"
]
}
}
}
}
}
// Create a mapping that uses this analyzer
PUT my_synonyms_test/rulers/_mapping
{
"properties": {
"name": {
"type": "string"
},
"title": {
"type": "string",
"analyzer": "my_synonyms"
}
}
}
// Some data
PUT my_synonyms_test/rulers/1
{
"name": "Elizabeth II",
"title": "Queen"
}
// A query which utilises the synonyms
GET my_synonyms_test/rulers/_search
{
"query": {
"match": {
"title": "monarch"
}
}
}
// And we get our expected result back:
{
"took": 42,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 1.4142135,
"hits": [
{
"_index": "my_synonyms_test",
"_type": "rulers",
"_id": "1",
"_score": 1.4142135,
"_source": {
"name": "Elizabeth II",
"title": "Queen"
}
}
]
}
}
So my question is: how could I amend the above example so that I would be using the synonyms at query time?
Or am I barking up completely the wrong tree and can you point me somewhere else please? I've looked at plugins mentioned in answers to similar questions like https://stackoverflow.com/a/34210587/2240218 and https://stackoverflow.com/a/18481495/2240218 but they all seem to be a couple of years old and unmaintained, so I'd prefer to avoid these.
Simply use search_analyzer instead of analyzer in your mapping, and your synonym analyzer will only be used at search time:
PUT my_synonyms_test/rulers/_mapping
{
"properties": {
"name": {
"type": "string"
},
"title": {
"type": "string",
"search_analyzer": "my_synonyms" <--- change this
}
}
}
To use the custom synonym filter at QUERY TIME instead of INDEX TIME, you first need to remove the analyzer from your mapping:
PUT my_synonyms_test/rulers/_mapping
{
"properties": {
"name": {
"type": "string"
},
"title": {
"type": "string"
}
}
}
You can then use the analyzer that makes use of the custom synonym filter as part of a query_string query:
GET my_synonyms_test/rulers/_search
{
"query": {
"query_string": {
"default_field": "title",
"query": "monarch",
"analyzer": "my_synonyms"
}
}
}
The query_string query allows for specifying an analyzer since it uses a query parser to parse its content. The match query accepts an analyzer parameter as well, so it can be used in the same way.
As you said, when using the analyzer only at query time, you won't need to re-index on every change to your synonyms collection.
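The reason no reindexing is needed can be shown with a toy model in Python. This is not how Elasticsearch implements analysis; it is a simplified sketch in which the "index" stores only the original tokens and the synonym rules are applied to the query alone. The names SYNONYMS, analyze and search are invented for the example.

```python
SYNONYMS = {"queen": {"queen", "monarch"}, "monarch": {"queen", "monarch"}}

def analyze(text, synonyms=None):
    """Lowercase and split on whitespace, optionally expanding synonyms."""
    tokens = set()
    for token in text.lower().split():
        tokens |= synonyms.get(token, {token}) if synonyms else {token}
    return tokens

# Index time: plain analysis, so only the original token is stored.
indexed = {1: analyze("Queen")}

def search(query, synonyms):
    """Return ids of documents sharing at least one token with the expanded query."""
    query_tokens = analyze(query, synonyms)
    return [doc_id for doc_id, tokens in indexed.items() if tokens & query_tokens]

print(search("monarch", SYNONYMS))  # [1]
print(search("monarch", {}))        # []

# Adding a rule later changes search results without touching `indexed`.
SYNONYMS["sovereign"] = {"queen", "sovereign"}
print(search("sovereign", SYNONYMS))  # [1]
```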
Apart from using the search_analyzer, you can refresh the synonyms list by closing and reopening the index after making changes in the synonym file.
Below are the commands to close and reopen your index:
curl -XPOST 'localhost:9200/index_name/_close'
curl -XPOST 'localhost:9200/index_name/_open'
After this, your synonym list is refreshed automatically, without the need to reingest the data.
I followed this reference Elasticsearch — Setting up a synonyms search to configure the synonyms in ES

Percentage of matched terms in Elasticsearch

I am using elasticsearch to find similar documents. Below is the query I am using:
{
"query": {
"more_like_this":{
"like": {
"_index": "docs",
"_type": "pdfs",
"_id": "pdf_1"
},
"min_term_freq": 1,
"min_doc_freq": 1,
"max_query_terms": 50,
"minimum_should_match": "50%"
}
}
}
I am extracting the text from PDF and storing in my index "docs". Below are the mappings for type "pdfs":
{
"properties": {
"content":{
"type": "string",
"analyzer": "my_analyzer"
}
}
}
In the result sets I am getting similar documents with their scores. Based on what I have read so far it is not possible to calculate percentage similarity based on score so I am not trying to do that. I am trying to figure out if it is possible to know:
"Out of 50 query terms from the source document how many terms are
matched in a document? or percentage of terms matched?"
As you can see, in my query I am specifying minimum_should_match as 50%, so I am assuming that Elasticsearch filters the documents somewhere based on what percentage of the terms are matched in a document. I want to get that percentage. I am fairly new to Elasticsearch. So far I have gone through the documentation but couldn't find out how to do it.
Any pointer/help is appreciated!
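For reference, the arithmetic behind the question can be written down in a few lines of Python. This is only a sketch of the numbers involved, not a way to get them from Elasticsearch: required_matches applies the documented rule that a positive percentage is rounded down to a clause count, and matched_percentage assumes you already have the query terms and the document's terms as plain lists (how to obtain them is left open). Both function names are hypothetical.

```python
import math

def required_matches(num_terms, percent):
    """Clauses that must match for a positive percentage, rounded down."""
    return math.floor(num_terms * percent / 100)

def matched_percentage(query_terms, doc_terms):
    """Share of the query terms that also occur in the document, in percent."""
    query_terms = set(query_terms)
    if not query_terms:
        return 0.0
    return 100.0 * len(query_terms & set(doc_terms)) / len(query_terms)

print(required_matches(50, 50))  # 25
print(matched_percentage(["a", "b", "c", "d"], ["b", "d", "x"]))  # 50.0
```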

Discrepancies in ElasticSearch Results

I have a relatively simple search index built up for simple, plain text queries. No routing, custom analyzers or anything like that. One search instance/node, one index.
There are docs within the index that I have deleted, and the RESTful API confirms that:
GET /INDEX_NAME/person/464
{
"_index": "INDEX_NAME",
"_type": "person",
"_id": "464",
"exists": false
}
However the doc is being returned from a simple search
POST /INDEX_NAME/person/_search
{
"query": {
"bool": {
"must": [
{
"query_string": {
"default_field": "person.offices",
"query": "Chicago"
}
}
]
}
}
}
One of the rows that is returned:
{
"_index": "INDEX_NAME",
"_type": "person",
"_id": 464,
"_score": null,
"fields": [
...
]
}
I'm new to ElasticSearch and thought I finally had a grasp of the basic concepts before digging deeper. But I'm not sure why a document that isn't accessible via REST still appears in the search results.
I'm also running into the reverse issue where docs are returned from the API but they are not being returned in the search. For the sake of clarity I am considering that a separate issue for the time being, but I have a feeling that these two issues might be related.
Part of me wants to delete my index and rebuild it, but I don't want to get into the same situation in a few days (and I'm not sure if that would even help).
Any ideas or pointers on why this discrepancy might be happening? Maybe a process is in some zombie state and elasticsearch just needs to be restarted?
