Dealing with Empty Fields - elasticsearch

I am new to stormcrawler and elasticsearch in general. I am currently using stormcrawler 2.0 to index website data (including non-HTML items such as PDF's and Word Documents) into elasticsearch. In some cases, the metadata of PDF's or Word documents do not contain a title so the field is stored blank/null in elasticsearch. This is unfortunately causing issues in the webapp I am using to display search results (search-ui). Is there a way I can have stormcrawler insert a default value of "Untitled" into the title field if none exists in the metadata?
I understand that elasticsearch has a null_value field parameter, but if I understand correctly that parameter cannot be used for text fields and only helps with searching.
Thanks!

One option would be to write a custom ParseFilter to give an arbitrary value to any missing key or a key with an empty value. The StormCrawler code has quite a few examples of ParseFilters, see also the WIKI.
The same could be done as a custom Bolt placed between the parser and the indexer; grab the metadata and normalise to your heart's content.

Related

elasticsearch unknown setting index.include_type_name

I'm in really weird situations, I need to create indexes in elasticsearch that contain typeless fields. I have a rails application that sends any data per second to my elasticsearch. about my architecture, I have to say I use elastic-stack on docker in ubuntu server and use socket to send data's to elk and all of them are the latest version.
In my rails application user could choose datatype for each field but the issues happen when the user want to change the datatype of one field right after it's created, logstash return this error
error"=>{"type"=>"mapper_parsing_exception", "reason"=>"failed to parse field [field] of type [long] in document with id '5e760cac-cafc-4fd0-9e45-1c650967ccd4'. Preview of field's value: '2022-01-18T08:06:30'", "caused_by"=>{"type"=>"illegal_argument_exception", "reason"=>"For input string: \"2022-01-18T08:06:30\
I found deadly queue letter plugins to save wrong input in my server after that I think if I could index documents without any type the problem is solved so I start to googling and found Removal of mapping types in elasticsearch documents and I follow instructions which describe in tutorials I get the following error:
unknown setting [index.include_type_name] please check that any required plugins are installed, or check the breaking changes documentation for removed settings
even I put "include_type_name" in the request to send to the elastic noting change I have the latest version of elastic.
I think maybe it's helpful to edit the default elasticsearch template but noting the change. could you please help me with what should I do?
As already mentioned in the comments, Elasticsearch does not support changing the data type of a field without a reindex or creating a new index.
For example, if a field is mapped as a numeric field like integer and the user wants to index a string value in this field, elasticsearch will return an mapping error.
You would need to change the mapping of the index and reindex it or create a entirely new index using the new mapping.
None of this is done automatically by elastic, you would need to deal with this in your application, you could catch the error and implement some logic to create a new index with the new mapping, but this also could lead to other problems as having too many indices in the cluster and query errors when the range of the query include index with the same field with different mappings.
One feature that Elasticsearch has that could help you in some way is the runtime fields, with runtime fields you can query a field that has a specific mapping using a different mapping.
For example, if you have a field that has date values, but was wrongly mapped as a keyword or text field, you could use a runtime field to query it as it was a date field.
But again, this will need that you implement a logic to build those runtime fields and also can lead to other problems, not all the data types are available to runtime fields and runtime fields can impact in the performance.
Another feature that could help you is to use of multi-fields, this, I think, is the closest you got of having a field with multiple data types.
Using multi-fields you could have a field named date with the date type and also a field named date.keyword with the keyword type, you also could have a field name code with the keyword type and a field name code.int with the integer type, you would also need to use the ignore_malformed setting in the mapping so elastic does not reject the entire document in case of mapping errors, just the field with the wrong mapping.
Just keep in mind that when use multi-fields, you will have a different field for each mapping, for example date is a field, date.keyword is another field, this will increase the storage usage.
But again, none of this is done automatically, it needs logic in your application, elasticsearch does not allows you to change the mapping of a field, if your application needs this, you will need to implement something in the application that can work with that limitations of elasticsearch.

How to only store the index,not the original text in ES

I am using elastic search 7.10.1. I would store and search against my blogs. The blog has id,title and content fields.
I would like to search against id, title and content, but since the content of blog is too big, so that I would like to save the original content text outside of Elastic Search, such as HBase.
I would ask how to achieve this in ES?
If you are using a static mapping then simply don't define your content field in your index mapping, and don't populate it while indexing your document to ES.
Refer to Mapping param for more info, and specifically, store param default false which means you can't retrieve field value if _source(true by default) is also disabled.
index param default true, which controls whether the field is searchable or not, in your case if you don't want to search and retrieve it you have to disable these two params.

Get top 10 most used words in text fields

I have an index containing thousands of documents, each one of them having a full text field.
I want to search through all those fields and fetch the 10 most common words that come back most often.
I would also like a way of visualizing it on Kibana if that's possible.
The most common way to achieve that is to duplicate your full text field with a keyword datatype. That will get you able to make terms aggregation on that field - doc here. Maybe you could consider to do a significant term aggregation - doc here, thus to avoid the presence of stopwords and common words. In ES 6.x you could use also the significant text aggregation - doc here, without create the keyword field, but i never try it, i don't know how it works. Instead if you need to retrieve the frequency of the words for each document, you should use the termvector - doc here

Lucene: Filter query by doc ID

I want to have in the search response only documents with specified doc id. In stackoverflow I found this question (Lucene filter with docIds) but as far as I understand there is created the additional field in the document and then doing search by this field. Is there another way to deal with it?
Lucene's docids are intended only to be internal keys. You should not be using them as search keys, or storing them for later use. Those ids are subject to change without warning. They will be changed when updating or reindexing documents, and can change at other times, such as segment merges, as well.
If you want your documents to have a unique identifier, you should generate that key separate from the docId, and index it as a field in your document.

Only index certain fields from Wikipedia River

I'm trying to use the Wikipedia River
Is there a way / How can I customize the mapping so that ElasticSearch only index the title fields (I'd still like to access the whole text)?
The mapping is useful more to decide how you index data rather than what you index, unless you set it to dynamic: false which means that elasticsearch effectively accepts only the fields that are explicitly declared in the mapping.
The problem is that the wikipedia river always sends a set of fields for every document and this behaviour is not currently configurable, thus there's no way to index only a subset of those fields (e.g. only title and _source). What you could do is modify your search request so that you get back only the fields that you are interested in, but the content of the index will stay the same.

Resources