Discrepancies in ElasticSearch Results - elasticsearch

I have a relatively simple search index built up for simple, plain text queries. No routing, custom analyzers or anything like that. One search instance/node, one index.
There are docs within the index that I have deleted, and the RESTful API confirms that:
GET /INDEX_NAME/person/464
{
  "_index": "INDEX_NAME",
  "_type": "person",
  "_id": "464",
  "exists": false
}
However, the doc is still being returned by a simple search:
POST /INDEX_NAME/person/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "query_string": {
            "default_field": "person.offices",
            "query": "Chicago"
          }
        }
      ]
    }
  }
}
One of the hits that is returned:
{
  "_index": "INDEX_NAME",
  "_type": "person",
  "_id": 464,
  "_score": null,
  "fields": [
    ...
  ]
}
I'm new to ElasticSearch and thought I finally had a grasp of the basic concepts before digging deeper, but I don't understand why a document that isn't accessible via the REST API is still appearing in search results.
I'm also running into the reverse issue, where docs are returned by the API but are not returned by the search. For the sake of clarity I'm treating that as a separate issue for the time being, but I have a feeling these two issues might be related.
Part of me wants to delete my index and rebuild it, but I don't want to get into the same situation in a few days (and I'm not sure if that would even help).
Any ideas or pointers on why this discrepancy might be happening? Maybe a process is in some zombie state and elasticsearch just needs to be restarted?

Related

Misspelling suggestion ("did you mean") with phrase suggest and whitespace correction with Elasticsearch

I use the default "english" analyzer for searching documents and it works pretty well.
But I also need "did you mean" results when the search query is misspelled, or to be able to search by such misspelled phrases.
What analyzers/filters/queries do I need to achieve such behaviour?
Source text
Elasticsearch is a distributed, open source search and analytics engine for all types of data,
including textual, numerical, geospatial, structured, and unstructured. Elasticsearch is built
on Apache Lucene and was first released in 2010 by Elasticsearch N.V. (now known as Elastic).
Known for its simple REST APIs, distributed nature, speed, and scalability, Elasticsearch is
the central component of the Elastic Stack, a set of open source tools for data ingestion,
enrichment, storage, analysis, and visualization. Commonly referred to as the ELK Stack
(after Elasticsearch, Logstash, and Kibana), the Elastic Stack now includes a rich collection
of lightweight shipping agents known as Beats for sending data to Elasticsearch.
Search terms
search query => did you mean XXX?
missed letter or something like
Elastisearch => Elasticsearch
distribated => distributed
Apacje => Apache
extra space
Elastic search => Elasticsearch
no space
opensource => open source
misspelled phrase
serach engne => search engine
Your first example (a missed letter or similar) can be handled with a fuzzy query, and the second with a custom analyzer that uses an ngram or edge-ngram tokenizer (see the sketch at the end of this answer); for more examples of that, please refer to my blog on autocomplete.
Adding a fuzzy query example based on your sample doc.
Index mapping
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text"
      }
    }
  }
}
Index your sample docs and use the search queries below.
{
  "query": {
    "fuzzy": {
      "title": {
        "value": "distributed"
      }
    }
  }
}
And the search result:
"hits": [
{
"_index": "didyou",
"_type": "_doc",
"_id": "2",
"_score": 0.89166296,
"_source": {
"title": "distribated"
}
}
]
And for Elasticsearch
{
  "query": {
    "fuzzy": {
      "title": {
        "value": "Elasticsearch"
      }
    }
  }
}
And the search result:
"hits": [
{
"_index": "didyou",
"_type": "_doc",
"_id": "1",
"_score": 0.8173577,
"_source": {
"title": "Elastisearch"
}
}
]
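For the extra-space / no-space / misspelled-phrase cases, the custom analyzer mentioned above is the way to go. Below is a minimal sketch of such an index, assuming an edge_ngram tokenizer; the index name, analyzer names, and gram sizes are illustrative and not taken from the question.
# Sketch only: illustrative index/analyzer names and gram sizes
PUT didyou_ngram
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "my_edge_ngram": {
          "type": "edge_ngram",
          "min_gram": 3,
          "max_gram": 10,
          "token_chars": ["letter", "digit"]
        }
      },
      "analyzer": {
        "edge_ngram_analyzer": {
          "type": "custom",
          "tokenizer": "my_edge_ngram",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "edge_ngram_analyzer",
        "search_analyzer": "standard"
      }
    }
  }
}
With this, a query like "Elastic search" can still overlap the indexed grams of "Elasticsearch"; the no-space case ("opensource") typically needs ngram analysis on the search side as well, so treat this purely as a starting point and tune the gram sizes to your data.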

How to enforce a required field in elastic search?

I am building a CMS and my team has decided to use Elasticsearch on the back end. I am new to it; I have mostly used Mongoose with MongoDB in previous projects. In MongoDB, if I assign a field wrongly or completely skip a required field, MongoDB throws an error.
Is there a way to enforce required fields in elasticsearch?
There is no built-in functionality that allows you to define required/mandatory fields in the mappings. Many will recommend doing such checks on the client side.
However, in Elasticsearch 5.x you can achieve this by using an ingest node.
You can use ingest node to pre-process documents before the actual indexing takes place. This pre-processing happens by an ingest node that intercepts bulk and index requests, applies the transformations, and then passes the documents back to the index or bulk APIs.
To pre-process documents before indexing, you define a pipeline that specifies a series of processors. Each processor transforms the document in some way.
An example that shows how this approach works:
POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "processors": [
      {
        "script": {
          "lang": "painless",
          "inline": "if (ctx.title == null) { throw new Exception('Document does not have the *title* field') }"
        }
      }
    ]
  },
  "docs": [
    {
      "_index": "index",
      "_type": "type",
      "_id": "1",
      "_source": {
        "title": "Elasticsearch 101"
      }
    },
    {
      "_index": "index",
      "_type": "type",
      "_id": "2",
      "_source": {
        "company": "Elastic"
      }
    }
  ]
}
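Beyond the simulate API, here is a rough sketch of how such a pipeline could be registered and then referenced at index time; the pipeline name and the index/type/id below are made up for illustration:
# Hypothetical pipeline name; sketch only, mirrors the simulate example above
PUT _ingest/pipeline/require_title
{
  "description": "Fail indexing when the title field is missing",
  "processors": [
    {
      "script": {
        "lang": "painless",
        "inline": "if (ctx.title == null) { throw new Exception('Document does not have the *title* field') }"
      }
    }
  ]
}

# Indexing through the pipeline; this document has no title, so the request should be rejected
PUT index/type/2?pipeline=require_title
{
  "company": "Elastic"
}
A document that does contain a title would index normally, while the one above fails with the exception thrown by the script processor.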
For more information please take a look here - https://www.elastic.co/guide/en/elasticsearch/reference/5.2/ingest.html

Directions on how to index words and annotate with their type (entity, etc) and then Elasticsearch/w.e. returns these words with the annotations?

I'm trying to build a very simple NLP chat (I could even say pseudo-NLP?), where I want to identify a fixed subset of intents (verbs, sentiments) and entities (products, etc.).
It's a kind of entity identification or named-entity recognition, but I'm not sure I need a full-fledged NER solution for what I want to achieve. I don't care if the person types cars instead of car: they have to type the EXACT word. So there's no need to deal with language stuff here.
It doesn't need to identify and classify the words; I'm just looking for a way such that, when I search a phrase, it returns all results that contain each word of it.
I want to index something like:
want [type: intent]
buy [type: intent]
computer [type: entity]
car [type: entity]
Then the user will type:
I want to buy a car.
Then I send this phrase to ElasticSearch/Solr/w.e. and it should return me something like below (it doesn't have to be structured like that, but each word should come with its type):
[
  {"word": "want", "type": "intent"},
  {"word": "buy", "type": "intent"},
  {"word": "car", "type": "entity"}
]
The approach I came up with was indexing each word as:
{
  "word": "car",
  "type": "entity"
}
{
  "word": "buy",
  "type": "intent"
}
And then I provide the whole phrase, searching by "word". But I have had no success so far, because Elasticsearch doesn't return any of the words, even though the phrase contains words that are indexed.
Any insights/ideas/tips to keep this using one of the main search engines?
If I do need to use a dedicated NER solution, what would be the approach to annotate words like this, without having to worry about fixing typos, multiple languages, etc.? I want to return results only if the person types the intents and entities exactly as they are, so not an advanced NLP solution.
Curiously, I didn't find much about this on Google.
I created a basic index and indexed some documents like this
PUT nlpindex/mytype/1
{
  "word": "buy",
  "type": "intent"
}
I used a query_string query to search for all the words that appear in the phrase:
GET nlpindex/_search
{
  "query": {
    "query_string": {
      "query": "I want to buy a car",
      "default_field": "word"
    }
  }
}
By default the operator is OR, so it will search for every single word of the phrase in the word field.
These are the results I get:
"hits": [
{
"_index": "nlpindex",
"_type": "mytype",
"_id": "1",
"_score": 0.09427826,
"_source": {
"word": "car",
"type": "entity"
}
},
{
"_index": "nlpindex",
"_type": "mytype",
"_id": "4",
"_score": 0.09427826,
"_source": {
"word": "want",
"type": "intent"
}
},
{
"_index": "nlpindex",
"_type": "mytype",
"_id": "3",
"_score": 0.09427826,
"_source": {
"word": "buy",
"type": "intent"
}
}
]
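For reference, this is a sketch of the same query with the operator spelled out explicitly; OR is what lets each word of the phrase match its own document, whereas AND would require all the words to be present in a single document:
# Same query as above, with the default OR operator made explicit
GET nlpindex/_search
{
  "query": {
    "query_string": {
      "query": "I want to buy a car",
      "default_field": "word",
      "default_operator": "OR"
    }
  }
}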
Does this help?

Elasticsearch does not return existing document sometimes

When I fire the following cURL command for an existing document, I get:
root@unassigned-hostname:~# curl -XGET "http://localhost:9200/test/test3/573553/"
result:
{
  "_index": "test",
  "_type": "test3",
  "_id": "573553",
  "exists": false
}
When I fire the same command a second time, I get:
root@unassigned-hostname:~# curl -XGET "http://localhost:9200/test/test3/573553/"
result:
{
  "_index": "test",
  "_type": "test3",
  "_id": "573553",
  "_version": 1,
  "exists": true,
  "_source": {
    "id": "573553",
    "name": "hVTHc",
    "price": "21053",
    "desc": "VGNHNXkAAcVblau"
  }
}
I am using Elasticsearch 0.90.11 on Ubuntu 12.04.
Could anyone please help me figure out this problem?
I have seen cases where Elasticsearch shards can get out of sync during network partitions or very high add/update/delete volume (the same document getting updated/deleted/added within milliseconds of each other, potentially racing). There is no clean way to merge the shards; instead, you just randomly choose a winner. One way to check if this is the case is to repeatedly run a match_all query and check whether the results jump around at all.
If you want to roll the dice and see what happens, you can set replicas down to 0 and then bump them back up to whatever value you were using.
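A rough sketch of both checks, assuming a placeholder index name my_index and that you were previously running with one replica:
# Run this repeatedly and compare hit counts between runs
GET /my_index/_search
{
  "query": { "match_all": {} }
}

# Drop replicas to 0, then bump back to the previous value (1 is just an example)
PUT /my_index/_settings
{ "index": { "number_of_replicas": 0 } }

PUT /my_index/_settings
{ "index": { "number_of_replicas": 1 } }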
While this may not be the reason for your issue, it is worth noting this is one of the reasons not to depend on elasticsearch as your primary source of truth.

Elasticsearch termvector API not working

I've set the mapping for the title field of the article type in the testindex1 index as follows:
PUT /testindex1/article/_mapping
{
  "article": {
    "type": "object",
    "dynamic": false,
    "properties": {
      "title": {
        "type": "string",
        "store": true,
        "term_vector": "with_positions_offsets",
        "_index": {
          "enabled": true
        }
      },
    }
  }
}
omitting the remainder of the mapping specification. (This example and those that follow assume the Marvel Sense dashboard interface.) testindex1 is then populated with articles, including an article with id 4540.
As expected,
GET /testindex1/article/4540/?fields=title
produces
{
  "_index": "testindex1",
  "_type": "article",
  "_id": "4540",
  "_version": 1,
  "exists": true,
  "fields": {
    "title": "Elasticsearch is the best solution"
  }
}
(The title text has been changed to protect the innocent.)
However,
GET /testindex1/article/4540/_termvector?fields=title
produces
No handler found for uri [/testindex1/article/4540/_termvector?fields=title&_=1404765178625] and method [GET]
I've experimented with variants of the mapping specification, and variants of the termvector request, so far to no avail. I've also looked for tips in official and non-official documentation, and on forums that cover Elasticsearch topics, including Stack Overflow. elasticsearch.org looks authoritative. I expect I've misused the termvector API in a way that will be instantly obvious to people who are familiar with it. Please point out my mistake(s). Thanks.
The _termvector API endpoint for returning term vector stats was only added in 1.0 Beta; you will need to upgrade if you want to use term vectors.
Term Vectors
Note: Added in 1.0.0.Beta1.
Returns information and statistics on terms in the fields of a particular document as stored in the index.
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/docs-termvectors.html
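Once on 1.0 or later, the request from the question should work as-is; for reference, a sketch with the optional statistics flags from the term vectors docs (which flags you actually want is an assumption on my part, not something from the question):
# Basic form, as in the question
GET /testindex1/article/4540/_termvector?fields=title

# With a request body asking for extra statistics
GET /testindex1/article/4540/_termvector
{
  "fields": ["title"],
  "term_statistics": true,
  "field_statistics": true,
  "positions": true,
  "offsets": true
}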
