ElasticSearch, get the field where a text matches - elasticsearch

I index JSON documents whose structure is 'unknown'. I want to search for content, and the result should be the FIELD-NAME that the content belongs to. Is there any way besides running a query and then iterating through the _source document fields to find the content again in the results? I also thought about the percolate feature and generating the queries when indexing a JSON document (but this would create hundreds of queries to check while indexing ...). Maybe there is a simple feature I don't know.
E.g. I know "Main Street" will be stored in the data received (e.g. from a web crawler), but to optimise the crawler it would be helpful to get a suggestion to crawl only the field "property.address.street". The point is that a customer might know some sample data to extract from JSON coming from different sources. To apply this knowledge to already collected data, the relevant field name must be found, especially when you want to do it automatically from provided sample content.
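For reference, the client-side fallback mentioned in the question (walking the _source of each hit and looking for the sample text) could look roughly like the sketch below. It is plain Python with no Elasticsearch client involved; the sample document and the helper names iter_leaf_fields and fields_containing are made up for illustration, and the substring match is a simplification of whatever analysis the real index applies.

```python
from typing import Any, Iterator, List, Tuple


def iter_leaf_fields(node: Any, prefix: str = "") -> Iterator[Tuple[str, Any]]:
    """Yield (dotted_path, value) for every leaf value in a parsed JSON document."""
    if isinstance(node, dict):
        for key, value in node.items():
            path = f"{prefix}.{key}" if prefix else key
            yield from iter_leaf_fields(value, path)
    elif isinstance(node, list):
        # Arrays are treated as multi-valued fields, so the path stays the same.
        for item in node:
            yield from iter_leaf_fields(item, prefix)
    else:
        yield prefix, node


def fields_containing(source: dict, text: str) -> List[str]:
    """Return the sorted dotted paths whose string value contains `text` (case-insensitive)."""
    needle = text.lower()
    return sorted({
        path
        for path, value in iter_leaf_fields(source)
        if isinstance(value, str) and needle in value.lower()
    })


# Hypothetical crawler document, invented for illustration.
doc = {
    "title": "Listing 42",
    "property": {
        "address": {"street": "12 Main Street", "city": "Springfield"},
        "tags": ["garden", "near main street station"],
    },
}

print(fields_containing(doc, "Main Street"))
```

Run over the _source of every hit, this gives a candidate list of field names per sample value; counting how often each path shows up across hits would be one way to pick the most likely field.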

Related

Elastic search document storing

The basic use case we are trying to solve is letting users search the contents of log files.
Let's take a simple situation where a user searches for a keyword that is present in a log file, and I want to render that log file back to the user.
We plan to use Elasticsearch for handling this. The idea I have in mind is to use Elasticsearch as a mechanism to store the indexed log files.
With this concept in mind, I went through https://www.elastic.co/guide/en/elasticsearch/reference/current/index.html
I have a couple of questions:
1) I understand the input provided to Elasticsearch is a JSON doc. It is going to scan this JSON and create/update indexes. So do I need a mechanism to convert my input log files to JSON?
2) Elasticsearch would scan this input document and create/update inverted indexes. These inverted indexes actually point to the exact document. So does that mean ES would store these documents somewhere? Would it store them as JSON docs? Is it purely in memory or on a file system/database?
3) Now, when a user searches for a keyword, ES returns the document which contains the searched keyword. Do I need the ability to convert this JSON doc back to the original log document that the user expects?
Clearly I'm missing something. Sorry for asking such silly questions, but I'm trying to improve my skills and it's WIP.
Also, I understand that there is the ELK stack out there. For some reasons we just want to use ES and not the Logstash and Kibana parts of the stack.
Thanks
Logs need to be parsed into JSON before they can be inserted into Elasticsearch.
All documents are stored on the filesystem. Some data is also kept in memory, but all data is persisted.
When you search Elasticsearch you get back matching JSON documents. If you want to display the original error message, you can store that original message in one of the JSON fields and display just that.
So if you just want to store log messages and not break them into fields or anything, you can simply take each row and send it to Elasticsearch like so:
{ "message": "This is my log message" }
To parse logs, break them into fields and add some logic, you will need to use some sort of app, like Logstash for example.
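If you want to skip Logstash, the "one row, one document" approach above can be sketched in a few lines of Python. This is only an illustration: the index name "logs" and the helper names are assumptions, the body is shaped for the bulk API (an action line followed by the document, newline-delimited), and older Elasticsearch versions also expect a type in the action line or the URL.

```python
import json
from typing import Iterable, List


def lines_to_docs(lines: Iterable[str]) -> List[dict]:
    """Wrap each non-empty log line in a one-field JSON document."""
    return [{"message": line.rstrip("\n")} for line in lines if line.strip()]


def to_bulk_body(docs: Iterable[dict], index: str) -> str:
    """Build a newline-delimited bulk body: an action line, then the document itself."""
    out = []
    for doc in docs:
        out.append(json.dumps({"index": {"_index": index}}))
        out.append(json.dumps(doc))
    return "\n".join(out) + "\n"  # the bulk body has to end with a newline


# Stand-in for lines read from a log file.
sample = ["This is my log message\n", "\n", "Another line\n"]
docs = lines_to_docs(sample)
print(to_bulk_body(docs, index="logs"))
```

The resulting string would be sent as the request body to the _bulk endpoint. Because the original line is kept verbatim in "message", nothing has to be converted back when a hit is shown to the user.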

What's the difference between source filtering and the fields option in the elasticsearch get API?

I'm confused between source filtering (i.e. using the _source_include parameter) and the fields option of the GET API in elasticsearch. How are they different in terms of performance? When are they supposed to be used?
Update: re: fields
Note that this is the 1.x documentation if you just arrived here from the future.
For backwards compatibility, if the fields parameter specifies fields which are not stored (store mapping set to false), it will load the _source and extract it from it. This functionality has been replaced by the source filtering parameter.
-- https://www.elastic.co/guide/en/elasticsearch/reference/1.7/search-request-fields.html#search-request-fields
AFAICT:
_source tells elasticsearch whether to include the source of matched documents in the response. The "source" is the data in the document as it was inserted.
fields tells elasticsearch to include source, but only include the defined fields.
Performance: unless you have low bandwidth to the Elasticsearch server, the difference is probably negligible.
I had the same doubt, here I found what can be the answer.
fields restricts the fields whose contents are parsed and returned
_source_filtering restricts the fields which are returned
Another way of seeing it is to think that fields is used to optimize data transfer and CPU usage while _source_filtering only optimizes data transfer
Source filtering allows us to control which parts of the original JSON document are returned for each hit [...] It's worth keeping in mind that this only saves us on bandwidth costs between the nodes participating in the search as well as the client, not CPU or Disk, as was the case when using fields.
In addition:
One feature about fields that's not commonly known is the ability to select metadata-fields as well. Of particular note is its ability to select the _ttl-field, which actually returns the number of milliseconds until the document expires, not the original lifespan of the document. A very handy feature indeed.
The fields parameter applies only to stored fields. From the 2.3 documentation:
Besides indexing the values of a field, you can also choose to store
the original field value for later retrieval. Users with a Lucene
background use stored fields to choose which fields they would like to
be able to return in their search results. In fact the _source field
is a stored field. In Elasticsearch, setting individual document
fields to be stored is usually a false optimization. The whole
document is already stored as the _source field. It is almost always
better to just extract the fields that you need using the _source
parameter.
See source filtering for how to limit the fields returned from _source.
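To make the two options concrete, here is a small sketch that only builds the GET URLs being compared, using the 1.x parameter names from the question (_source_include and fields). The host, index, type and field names are placeholders, get_url is a hypothetical helper, and as far as I know later versions rename both parameters, so check the docs for your version.

```python
from urllib.parse import urlencode


def get_url(base: str, index: str, doc_type: str, doc_id: str, **params) -> str:
    """Build a 1.x-style GET API URL; each parameter value is a list of field names."""
    # Keep commas and wildcards readable instead of percent-encoding them.
    query = urlencode({key: ",".join(values) for key, values in params.items()}, safe=",*")
    path = f"{base}/{index}/{doc_type}/{doc_id}"
    return f"{path}?{query}" if query else path


base = "http://localhost:9200"  # placeholder host

# Source filtering: _source is loaded and only these paths are returned.
source_url = get_url(base, "twitter", "tweet", "1", _source_include=["user", "message"])

# fields: meant for fields that are individually stored in the mapping.
fields_url = get_url(base, "twitter", "tweet", "1", fields=["user", "message"])

print(source_url)
print(fields_url)
```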

Recommended way to store data in elasticsearch

I want to use elasticsearch on my backend and I have few questions:
My DB contains semi-structured data of products, i.e. each product may have different attributes inside it.
I want to be able to search a text on most of the fields and also search a text on one specific field.
What is the recommended way to store the document in ES? Should I store all text in one field (maybe using the _all feature) or leave it in different fields?
My concern with different fields is that I might end up with a lot of indexes (because I have many different product attributes).
I'm using couchbase as my main DB.
What is the recommended way to move the documents from it to ES, assuming I need to make some modifications on the document ?
To update the index from my code explicitly or use external tool ?
10x,
It depends on how many docs you are indexing at a time. If the number of docs is large, say more than 2 million, then it's better to store everything in one field, which will save time while indexing.
If only a few docs are indexed, then index them field by field and search on the _all field. This gives a clear view of the data and is really helpful for deciding what to display and what not to display.

In Elasticsearch, what happens if I set 'store' to yes on a few fields, but _source to false?

We're building a "unified" search across a lot of different resources in our system. Our index schema includes about 10 generic fields that are indexed, plus 5 which are required to identify the appropriate resource location in our system when results are returned.
The indexed fields often contain sensitive data, so we don't want them stored at all, only indexed for matching, thus we set the _source to FALSE.
I do however want the 5 ident fields returned, so is it possible to set the ident fields to store = yes, but the overall index _source to FALSE and get what I'm looking for in the results?
Have a look at this other answer as well. As mentioned there, in most of the cases the _source field helps a lot. Even though it might seem like a waste because elasticsearch effectively stores the whole document that comes in, that's really handy (e.g. when needing to update documents without sending the whole updated document). At the end of the day it hides a lucene implementation detail, the fact that you need to explicitly store fields if you want to get them back, while users usually expect to get back what they sent to the search engine. Surprisingly, the _source helps performance wise too, as it requires a single disk seek instead of more disk seeks that might be caused by retrieving multiple stored fields. At the end of the day the _source field is just a big lucene stored field containing json, which can be parsed in order to get to specific fields and do some work with them, without needing to store them separately.
That said, depending on your usecase (how many fields you retrieve) it might be useful to have a look at source include/exclude at the bottom of the _source field reference, which allows you to prevent parts (e.g. the sensitive parts of your documents) of the source field from being stored. That would be useful if you want to keep relying on the _source but don't want a part of the input documents to be returned, but you do want to search against those fields, as they are going to be indexed (but not stored!) in the underlying lucene index.
In both cases (either you disable the _source completely or exclude some parts), if you plan to update your documents keep in mind that you'll need to send the whole updated document using the index api. In fact you cannot rely on partial updates provided with the update api as you don't have in the index the complete document that you indexed in the first place, which you would need to apply changes to.
Yes, stored fields do not rely on the _source field, or vice-versa. They are separate, and changing or disabling one shouldn't impact the other.
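As a rough illustration of that setup, the sketch below builds a mapping with _source disabled, the searchable fields indexed but not stored, and the identifier fields stored. It uses 1.x/2.x-era mapping syntax ("string", "not_analyzed") to match the versions discussed here; the type name and all field names are invented, and build_mapping is a hypothetical helper, not part of any client library.

```python
import json
from typing import Iterable


def build_mapping(doc_type: str, searchable: Iterable[str], identifiers: Iterable[str]) -> dict:
    """Mapping body: _source disabled, searchable fields index-only, identifier fields stored."""
    properties = {}
    for name in searchable:
        # Indexed for matching only; the original value cannot be retrieved.
        properties[name] = {"type": "string", "store": False}
    for name in identifiers:
        # Stored individually so they can still be returned without _source.
        properties[name] = {"type": "string", "index": "not_analyzed", "store": True}
    return {
        "mappings": {
            doc_type: {
                "_source": {"enabled": False},
                "properties": properties,
            }
        }
    }


# Hypothetical field names standing in for the generic and ident fields.
mapping = build_mapping(
    "resource",
    searchable=["title", "body"],
    identifiers=["resource_id", "resource_type"],
)
print(json.dumps(mapping, indent=2))
```

With a mapping like this, the identifier fields have to be requested explicitly at search time, since there is no _source to fall back on.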

ElasticSearch: Impact of setting a "not_analyzed" field as "store":"yes"?

Suppose I have a string field specified as not_analyzed in the mapping. If I then add "store":"yes" to the mapping, will ElasticSearch duplicate the storage? My understanding of not_analyzed fields is that they are not run through an Analyzer, indexed as is, but a client is able to match against it. So, if a field is both not_analyzed and store:yes, this could cause ElasticSearch to keep two copies of the string.
My question:
If a string field is stored as both not_analyzed and store:yes, will there be duplicate storage of the string?
I hope that's clear enough. Thanks!
You're mixing up the concept of indexed field and stored field in lucene, the library that elasticsearch is built on top of.
A field is indexed when it goes within the inverted index, the data structure that lucene uses to provide its great and fast full text search capabilities. If you want to search on a field, you do have to index it. When you index a field you can decide whether you want to index it as it is, or you want to analyze it, which means deciding a tokenizer to apply to it, which will generate a list of tokens (words) and a list of token filters that can modify the generated tokens (even add or delete some). The way you index a field affects how you can search on it. If you index a field but don't analyze it, and its text is composed of multiple words, you'll be able to find that document only searching for that exact specific text, whitespaces included.
A field is stored when you want to be able to retrieve it. Let's say Lucene provides some kind of storage too, which doesn't have anything to do with the inverted index itself.
When you search using lucene you get back a list of document ids that match. Then you can retrieve some text from their stored fields, which is what you literally show as search results. If you don't store a field you'll never be able to get it back from lucene (this is not true for elasticsearch though, as I'm going to explain below).
You can have fields that you only want to search on, and never show: indexed and not stored (default in lucene).
You can have fields that you want to search on and also retrieve: indexed and stored.
You can have fields that you don't want to search on, but you do want to retrieve to show them.
Therefore the two data structures are not related to each other. If you both index and store a field in lucene, its content will not be present twice in the same form. Stored fields are stored as they are, as you send them to lucene, while indexed fields might be analyzed and will be part of the inverted index, which is something else. Stored fields are made to be retrieved for a specific document (by lucene document id), while indexed fields are made to search, in such a structure that literally inverts the text having as a result each term as key, together with a list of document ids that contain it (the postings list).
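A toy model may help to see that the two structures are independent. The sketch below is not how Lucene is implemented; it is a deliberately tiny stand-in where the "analyzer" just lowercases and splits on whitespace, and ToyIndex and its methods are invented for the example.

```python
from collections import defaultdict
from typing import List, Optional


def analyze(text: str) -> List[str]:
    """Stand-in analyzer: lowercase and split on whitespace."""
    return text.lower().split()


class ToyIndex:
    def __init__(self) -> None:
        self.postings = defaultdict(set)  # (field, term) -> ids of docs containing the term
        self.stored = defaultdict(dict)   # doc id -> {field: original value}

    def add(self, doc_id: int, field: str, value: str,
            index: bool = True, analyzed: bool = True, store: bool = False) -> None:
        if index:
            # Not analyzed means the whole value becomes a single term.
            terms = analyze(value) if analyzed else [value]
            for term in terms:
                self.postings[(field, term)].add(doc_id)
        if store:
            self.stored[doc_id][field] = value

    def search(self, field: str, term: str) -> List[int]:
        return sorted(self.postings.get((field, term), set()))

    def retrieve(self, doc_id: int, field: str) -> Optional[str]:
        return self.stored.get(doc_id, {}).get(field)


idx = ToyIndex()
idx.add(1, "body", "Quick brown fox")                        # indexed, not stored
idx.add(1, "title", "Quick Fox", store=True)                 # indexed and stored
idx.add(1, "tag", "Quick Fox", analyzed=False)               # indexed as is
idx.add(1, "thumbnail", "fox.png", index=False, store=True)  # stored only

print(idx.search("body", "quick"), idx.retrieve(1, "body"))
print(idx.search("tag", "quick"), idx.search("tag", "Quick Fox"))
print(idx.search("thumbnail", "fox.png"), idx.retrieve(1, "thumbnail"))
```

The "body" field can be found but not read back, the "thumbnail" field can be read back but not found, and the not-analyzed "tag" only matches the exact original string.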
When it comes to elasticsearch things change a little though. When you don't configure a field as stored in your mapping (default is store:no) you are able to retrieve it anyway by default. This happens because elasticsearch always stores in lucene the whole source document that you send to it (unless you disable this feature) within a special lucene field, called _source.
When you search using elasticsearch you get back by default the whole source field, but you can also ask for specific fields. What happens in that case is that elasticsearch checks whether those specific fields are stored or not in lucene. If they are the content will be retrieved from lucene, otherwise the _source stored field will be retrieved from lucene, parsed as json (pull parsing) and those specific fields will be extracted. In the first case it might be a little faster, but not necessarily. If your source is really big and you only want to load a couple of fields, configuring them as stored in lucene would probably make the loading process faster; on the other hand, if your _source is not that big and you want to load many fields, then it's probably better to load only one stored field (the _source), which would lead to a single disk seek, parse it etc. In most of the cases using the _source field works just fine.
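The lookup order described in the previous paragraph can be sketched as follows. This is a simplified model of the behaviour as described in this answer, not Elasticsearch code: load_fields is a hypothetical function, `stored` stands for the individually stored fields of one document, and `source_json` for its raw _source (or None when _source is disabled).

```python
import json
from typing import Dict, Iterable, Optional


def load_fields(stored: Dict[str, object], source_json: Optional[str],
                wanted: Iterable[str]) -> Dict[str, object]:
    """Return the requested fields, preferring stored fields and falling back to _source."""
    result = {}
    parsed = None  # _source is parsed lazily, and at most once
    for name in wanted:
        if name in stored:
            result[name] = stored[name]
            continue
        if source_json is None:
            continue  # not stored and no _source: the value cannot be retrieved
        if parsed is None:
            parsed = json.loads(source_json)
        # Walk a dotted path such as "user.name" through the parsed document.
        node = parsed
        for part in name.split("."):
            if not isinstance(node, dict) or part not in node:
                node = None
                break
            node = node[part]
        if node is not None:
            result[name] = node
    return result


stored = {"id": "42"}
source_json = json.dumps({"id": "42", "user": {"name": "kim"}, "body": "hello"})

print(load_fields(stored, source_json, ["id", "user.name"]))
print(load_fields(stored, None, ["id", "body"]))
```

The second call shows why disabling _source matters: only fields that were explicitly stored can still be returned.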
To answer your question: inverted index and lucene storage are two completely different things. You end up having two copies of the same data in lucene only if you decide to store a field (store:yes in the mapping), since elasticsearch keeps that same content within the json _source, but this doesn't have anything to do with the fact that you're indexing or analyzing the field.

Resources