How to only store the index,not the original text in ES - elasticsearch

I am using elastic search 7.10.1. I would store and search against my blogs. The blog has id,title and content fields.
I would like to search against id, title and content, but since the content of blog is too big, so that I would like to save the original content text outside of Elastic Search, such as HBase.
I would ask how to achieve this in ES?

If you are using a static mapping then simply don't define your content field in your index mapping, and don't populate it while indexing your document to ES.
Refer to Mapping param for more info, and specifically, store param default false which means you can't retrieve field value if _source(true by default) is also disabled.
index param default true, which controls whether the field is searchable or not, in your case if you don't want to search and retrieve it you have to disable these two params.

Related

Best way to add data elastic search

I am posting data so I can search it later in elastic search.
I am posting it like this:
POST https://localhost:9200/superheroes/_doc/27
body:
{
"name": "Flash",
"super_power": "Super Speed"
}
This is automatically stored on a _source object... but I read the _source field itself is not indexed (and thus is not searchable) online, and the entire purpose of this application is quick search by value... like if I wanna know which superheroes have super speed and I write super speed on the search bar.
you are right about the _source field, but if you don't define the mapping for your fields, Elasticsearch generates the default mapping with default param for these fields . You can read more about the mapping param and in your case, if you want field to be searchable it needs to be the part of the inverted index and this is controlled by the index param.
You can override the value of these params by defining your own mapping(recommended otherwise Elasticsearch will guess the field type based on the first data you index in a field).
Hope this helps and clears your doubt, in short if you are not defining your mapping, your text data is searchable by default.

Dealing with Empty Fields

I am new to stormcrawler and elasticsearch in general. I am currently using stormcrawler 2.0 to index website data (including non-HTML items such as PDF's and Word Documents) into elasticsearch. In some cases, the metadata of PDF's or Word documents do not contain a title so the field is stored blank/null in elasticsearch. This is unfortunately causing issues in the webapp I am using to display search results (search-ui). Is there a way I can have stormcrawler insert a default value of "Untitled" into the title field if none exists in the metadata?
I understand that elasticsearch has a null_value field parameter, but if I understand correctly that parameter cannot be used for text fields and only helps with searching.
Thanks!
One option would be to write a custom ParseFilter to give an arbitrary value to any missing key or a key with an empty value. The StormCrawler code has quite a few examples of ParseFilters, see also the WIKI.
The same could be done as a custom Bolt placed between the parser and the indexer; grab the metadata and normalise to your heart's content.

Filtering Elasticsearch fields from index/store

I was wondering what is the recommended approach to filter out some of the fields that are sent to Elasticsearch from Store and Index?
I want to filter our some fields from getting indexed in Elasticsearch. You may ask why you are sending them to Elasticsearch from the first place. Unfortunately, it is sent via another application that doesn't accept any filtering mechanism. Hence, filtering should be addressed at the time of indexing. Here is what we have done, but I am not sure what would be the consequences of these steps:
1- Disable dynamic mapping ("dynamic": "false" ) in ES templates.
2- Including only the required fields in _source and excluding the rest.
According to ES website, some of the ES functionalities will be disabled by disabling _source fields. Given I don't need the filtered fields at all, I was wondering whether the mentioned solution will break anything regarding the remaining fields or not?
There are a few mapping parameters that allow you to do what you want:
index: true/false: if true the field value is indexed in order to be searched later on (default: true)
store: true/false: if true the field values are stored in addition to being indexed. Usually, the field values are stored in the source already, but you can choose to not store the source but store the field value itself (default: false)
enabled: true/false: only for the mapping type as a whole or for object types. you can decide whether to only store the value but not index it
So you can use any combination of the above parameters if you don't want to modify the source documents and simple let ES do it for you.

Only index certain fields from Wikipedia River

I'm trying to use the Wikipedia River
Is there a way / How can I customize the mapping so that ElasticSearch only index the title fields (I'd still like to access the whole text)?
The mapping is useful more to decide how you index data rather than what you index, unless you set it to dynamic: false which means that elasticsearch effectively accepts only the fields that are explicitly declared in the mapping.
The problem is that the wikipedia river always sends a set of fields for every document and this behaviour is not currently configurable, thus there's no way to index only a subset of those fields (e.g. only title and _source). What you could do is modify your search request so that you get back only the fields that you are interested in, but the content of the index will stay the same.

Stored field in elastic search

In the documentation, some types, such as numbers and dates, it specifies that store defaults to no. But that the field can still be retrieved from the json.
Its confusing. Does this mean _source?
Is there a way to not store a field at all, and just have it indexed and searchable?
None of the field types are stored by default. Only the _source field is. That means you can always get back what you sent to the search engine. Even if you ask for specific fields, elasticsearch is going to parse the _source field for you and give you back those fields.
You can disable the _source if you want but then you could only retrieve the fields that you explicitly stored, according to your mapping.

Resources