Elasticsearch script_fields with geo_shape

I would like to use Elasticsearch to annotate data with geographic membership.
I'm quite new to Elasticsearch, so I'm a bit unsure about the correct term to search for. I might have missed an obvious answer, but I didn't find what I was looking for in the Elasticsearch documentation.
Problem description:
I have a huge set of events stored in an Elasticsearch database. I would like to query these events and, for each event, add a kind of scripted field telling whether the event's point is within a given polygon.
I should be able to do this with two queries:
a first one retrieving a batch of events
a second one filtering the same events with a geo_shape filter
I could then check, for every event in the first list, whether it's also in the second one, roughly like the sketch below.
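For concreteness, a rough sketch of that two-query approach, using the official Python client with 7.x-style body= arguments; the events index, the location field, and the polygon are all hypothetical:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # hypothetical endpoint

# Query 1: retrieve a batch of events.
batch = es.search(index="events", body={
    "size": 100,
    "query": {"match_all": {}},
})

# Query 2: the same events, filtered by a geo_shape polygon.
inside = es.search(index="events", body={
    "size": 100,
    "query": {"bool": {"filter": {"geo_shape": {
        "location": {  # hypothetical geo field
            "shape": {
                "type": "polygon",
                # one closed ring of [lon, lat] points
                "coordinates": [[[2.2, 48.9], [2.4, 48.9], [2.4, 48.8],
                                 [2.2, 48.8], [2.2, 48.9]]],
            },
            "relation": "within",
        }
    }}}},
})
inside_ids = {hit["_id"] for hit in inside["hits"]["hits"]}

# Annotate every event from the first batch by hand.
for hit in batch["hits"]["hits"]:
    hit["is_within_polygon"] = hit["_id"] in inside_ids
```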
But it would be so much easier to have one query, with the event document annotated with an is_within_polygon boolean.
And it would also allow me to check against multiple polygons in one query.
Is this possible with Elasticsearch?

Related

Query Elasticsearch after the index operation

I have a service A that performs some text processing. After it, service B has to execute a set of Elasticsearch queries on the document. Connectivity between the services is provided by Kafka. The solution is tightly coupled to ES free-text search capabilities, so I can't query another way.
Possible solution:
Store the document in ES and query it there. The problem is that ES is eventually consistent, and I don't know whether the document has been indexed yet or not.
Is there some API to ensure that the document is already indexed?
Another option is to publish the message from service A with a delay of X+5 seconds, where X is the refresh interval of the index where the document is stored. That seems like an unreliable solution to me. What do you think?
Another direction I thought about is some way to run ES queries against a document that is still in memory. For example, if I had some magic way to convert the ES query to Lucene DSL, I wouldn't need to deal with the eventually consistent behavior of Elasticsearch and could query Lucene directly.
Maybe there are some other solutions?
Take a look at the ?refresh flag, so that an indexing request will only return once a refresh has happened. Otherwise you can use the GET API to see whether the document exists or not.
However, there are no magic options here: Elasticsearch is eventually consistent, and you need to factor that in.
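A minimal sketch of both suggestions, assuming the official Python client and a hypothetical docs index:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # hypothetical endpoint

# Option 1: refresh="wait_for" blocks until the document is searchable.
es.index(index="docs", id="1",
         body={"text": "processed by service A"},
         refresh="wait_for")

# Option 2: GET by ID is realtime and not subject to the refresh
# interval, so it confirms the document was stored, but a search
# for it may still miss until the next refresh.
if es.exists(index="docs", id="1"):
    print("document is stored")
```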

How does Elasticsearch handle an index with 230m entries?

I was looking through Elasticsearch and noticed that you can create an index and bulk add items. I currently have a series of flat files with 220 million entries. I am working on Logstash to parse them and add them to Elasticsearch, but I feel that having all of that under one index would be rough to query. Each row is nothing more than 1-3 properties at most.
How does Elasticsearch function in this case? In order to effectively query this index, do you just add additional instances to the cluster and they will work together to crunch the set?
I have been walking through the documentation, and it explains what to do, but not always why it does what it does.
In order to effectively query this index, do you just add additional instances to the cluster and they will work together to crunch the set?
That is exactly what you need to do. Typically it's an iterative process:
start by putting a subset of the data in. You can also put in all the data, if time and cost permit.
put some search load on it that is as close as possible to production conditions, e.g. by turning on whatever search integration you're planning to use. If you're planning to only issue queries manually, now's the time to try them and gauge their speed and the relevance of the results.
see whether the queries are too slow and whether their results are relevant enough. If not, change the index mappings or the queries you're using to get faster, more relevant results, and, if needed, add more nodes to your cluster.
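To gauge query speed concretely, the took field that Elasticsearch returns with every search response reports server-side time in milliseconds. A sketch; the index and query here are made up:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Run a realistic production query and read the server-side timing.
res = es.search(index="rows", body={
    "query": {"match": {"name": "some realistic production query"}},
})
print(res["took"], "ms on the server,",
      res["hits"]["total"]["value"], "matching docs")  # 7.x total format
```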
Since you mention Logstash, there are a few things that may help further:
check out Filebeat for indexing the data on an ongoing basis. You may not need to do the work of reading the files and bulk indexing yourself.
if it's log or log-like data and you're mostly interested in more recent results, it could be a lot faster to split up the data by date & time (e.g. index-2019-08-11, index-2019-08-12, index-2019-08-13). See the Index Lifecycle Management feature for automating this.
try using the Keyword field type where appropriate in your mappings. It turns off analysis on the field, preventing full-text searches inside the field and allowing only exact string matches. Useful for fields like a "tags" field, or a "status" field with values like ["draft", "review", "published"]. (See the mapping sketch below.)
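A sketch of such a mapping with the 7.x mapping API; the index and field names are hypothetical:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# keyword fields skip analysis: exact matches only, no full-text search.
es.indices.create(index="rows-2019-08-11", body={
    "mappings": {
        "properties": {
            "status": {"type": "keyword"},  # e.g. draft / review / published
            "tags":   {"type": "keyword"},
            "added":  {"type": "date"},     # enables the date-based splitting above
        }
    }
})

# Exact-match queries against keyword fields use `term`, not `match`.
res = es.search(index="rows-2019-08-11", body={
    "query": {"term": {"status": "published"}},
})
```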
Good luck!

Can I get messages from the Kibana visualization?

I'm wondering if there is a way to get the list of messages related to a Kibana visualization. I understand that if I apply the same filter in Discover that is on the visualization, I can filter the related messages. But I want a more direct user experience: the user clicks on a region of a graph and gets the related messages that formed that region. Is there any way to do it?
This helped me:
https://discuss.elastic.co/t/can-i-get-the-related-messages-from-a-kibana-visualization/101692/2
It says:
Not directly, unfortunately. You can click on the visualization to create a filter, and you can pin that filter and take it to discover, which will do what you're asking, but isn't very obvious.
The reason is that visualizations are built using aggregate data, so they don't know what the underlying documents are; they only know the aggregate representation of the information. For example, if you have a bunch of traffic data and you are looking at bytes over time, the records get bucketed by time and the aggregate of the bytes in each bucket is shown (average, sum, etc.).
In contrast, Discover only works with the raw documents, showing you exactly what you have stored in Elasticsearch. Both documents and aggregations can use filters and queries, which is why you can create a filter in one and use it in the other, but the underlying data is not the same.
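The difference is easy to see in the underlying requests. A sketch with the official Python client and a hypothetical traffic index: the first request is the bucketed, aggregate view a visualization uses; the second is the raw-document view Discover uses:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# What a "bytes over time" visualization effectively runs:
# no raw hits, only time buckets with an aggregate per bucket.
viz = es.search(index="traffic", body={
    "size": 0,
    "aggs": {
        "per_hour": {
            "date_histogram": {"field": "@timestamp",
                               "fixed_interval": "1h"},  # 7.x parameter name
            "aggs": {"total_bytes": {"sum": {"field": "bytes"}}},
        }
    },
})

# What Discover effectively runs for one of those buckets:
# the raw documents themselves, filtered to the bucket's time range.
docs = es.search(index="traffic", body={
    "query": {"range": {"@timestamp": {
        "gte": "2019-08-11T10:00:00", "lt": "2019-08-11T11:00:00"}}},
})
```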

How can I find related keywords with Elasticsearch?

I am pretty new to Elasticsearch and already love it.
Right now I am interested in understanding how I can get Elasticsearch to make suggestions for similar keywords.
I have already read this article: https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-mlt-query.html.
The More Like This Query (MLT Query) finds documents that are "like" a given set of documents.
This is already more than I am looking for. I don't need similar documents, only related/similar keywords.
So let's say I have an index of documents about movies and I start a query about "godfather". Then Elasticsearch should suggest related keywords, e.g. "al pacino" or "Marlon Brando", because they are likely to occur in the same documents.
Any ideas how this can be done?
Unfortunately, there is no built-in way to do that in Elasticsearch. What you could do is write a program that queries Elasticsearch, collects the matched documents, and gets their _source data (or retrieves it from your original data source, like a DB or file). You would then calculate TF-IDF for each term in the retrieved documents and combine everything to get the top K terms across all returned documents.
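A naive sketch of that approach, with made-up index and field names and a toy whitespace tokenizer; a real implementation would need proper tokenization and stop-word handling:

```python
import math
from collections import Counter

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# 1. Query for the seed keyword and collect the matched documents.
res = es.search(index="movies", body={
    "query": {"match": {"plot": "godfather"}},
    "size": 100,
})
docs = [hit["_source"]["plot"].lower().split()
        for hit in res["hits"]["hits"]]

# 2. Score each term by a crude TF-IDF within the matched set.
n = len(docs)
tf = Counter(term for doc in docs for term in doc)       # term frequency
df = Counter(term for doc in docs for term in set(doc))  # document frequency
scores = {term: tf[term] * math.log(1 + n / df[term]) for term in tf}

# 3. The top K terms are the "related keyword" candidates.
top_k = sorted(scores, key=scores.get, reverse=True)[:10]
print(top_k)
```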

Set up Elasticsearch suggesters that can return suggestions from different data types

We're in the process of setting up Amazon Elasticsearch Service (running Elasticsearch version 2.3).
We have different types of data (that I'm currently thinking of as different document types within the same index).
We have a generic search in an app where we want an inline autocomplete function, that is, a completion suggester returning hits from all different data (document) types. How can that be set up?
When querying suggesters you have to specify an index, so that's why I wanted to keep all the data in the same index. According to the documentation, the completion suggester considers all documents in the index.
Setting up the completion suggester for the first document type was pretty straightforward and is working great. However, as far as I can see, you have to specify a suggest field when querying. That would all have been fine had it not been for the error message we get when setting up the mapping for the second document type:
Type: illegal_argument_exception Reason: "[suggest] is defined as an object in mapping [name_of_document_type] but this name is already used for a field in other types"
Writing this question, I see that it's possible to specify more than one suggester in a single suggest query. Maybe that is how we have to solve it? (I.e. get X results from Y suggesters and compare the scores to get the one suggestion we want to present to the user.)
One of the core principles of good data design for Elasticsearch (as with many data stores) is to optimise your data storage for ease of reading. Usually, this means embracing duplication.
With this in mind, I'd suggest having a separate autocomplete index with a mapping that's designed specifically for the suggester queries.
Whenever you insert or update one of your other documents, map it to your autocomplete type and add or update it in your autocomplete index at the same time (or, depending on how up to date it needs to be, create an offline process to update your autocomplete index, e.g. every day).
Then, when you do your suggest query, you can just use your autocomplete index and not worry about dealing with different types of documents with different fields.
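A sketch of that design against a recent Elasticsearch version (the syntax on 2.3, which the question targets, differs); every index and field name here is made up:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# A dedicated index whose only job is to power the suggester.
es.indices.create(index="autocomplete", body={
    "mappings": {
        "properties": {
            "suggest":     {"type": "completion"},
            "source_type": {"type": "keyword"},  # which data type it came from
            "source_id":   {"type": "keyword"},  # id of the original document
        }
    }
})

# Mirror every write to the "real" indices into the autocomplete index.
es.index(index="autocomplete", body={
    "suggest": {"input": ["The Godfather", "Godfather, The"]},
    "source_type": "movie",
    "source_id": "42",
})

# One suggest query now covers all data types at once.
res = es.search(index="autocomplete", body={
    "suggest": {
        "titles": {
            "prefix": "god",
            "completion": {"field": "suggest"},
        }
    }
})
for option in res["suggest"]["titles"][0]["options"]:
    print(option["text"], option["_source"]["source_type"])
```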
