Mapping openNLP or StanfordNLP in elasticsearch

I am trying to map openNLP to enable parsing of a field in a document, using the following code:
"article":
"properties":
"content" : { "type" : "opennlp" }
Prior to creating the mapping, I downloaded the named entity extraction binary file from sourceforge.net and installed/unpacked it into the Elasticsearch plugins folder using cURL.
I get the following error message when I try to run the above mapping code:
"error": "MapperParsingException[No handler for type [opennlp]
declared on field [content]]" "status": 400

After some quick Googling I found this: https://github.com/spinscale/elasticsearch-opennlp-plugin
I assume that's the plugin you're trying to install. However, it's outdated and probably not supported by recent Elasticsearch versions.
Its purpose seems to be to extract data from files and index it as tags. The Elasticsearch Mapper Attachments Type plugin does exactly that, so I would encourage you to use it instead of OpenNLP. A quick extract from the documentation:
The mapper attachments plugin adds the attachment type to
Elasticsearch using Apache Tika. The attachment type allows to index
different "attachment" type field (encoded as base64), for example,
microsoft office formats, open document formats, ePub, HTML, and so on
(full list can be found here).
An example of how to map fields using it:
PUT /test/person/_mapping
{
  "person" : {
    "properties" : {
      "my_attachment" : {
        "type" : "attachment"
      }
    }
  }
}
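Once that mapping is in place, documents are indexed by putting the base64-encoded file content directly into the mapped field. A minimal sketch (the index, type, id and payload are illustrative; the payload here is just "Hello world" in base64):
PUT /test/person/1
{
  "my_attachment" : "SGVsbG8gd29ybGQ="
}
Tika then extracts the text and metadata (author, title, content type, and so on) into fields that can be searched like any other field.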

Related

Elasticsearch: How to delete unsupported static index setting created by a previous release?

How can I delete a static setting from an index if the setting is no longer supported by or known to the running ES version?
Indices created with ES 5.2.2 or 5.3.0 have been subject to shrinking with a hot-warm strategy in order to lower the number of shards.
This shrinking created two static index settings shrink.source.name and shrink.source.uuid in the newly created index.
The new index works as expected.
In the meantime I upgraded to ES 6.8.1 and I am preparing the Elasticsearch cluster for ES 7.0 as indices created with older versions are not supported anymore with ES 7.0.
Kibana offers a nice UI for the required reindexing, but this fails due to these two unsupported settings.
As I have no need for these settings anyway (they are just informational for me) I want to delete them from the indices.
Deleting a static setting from an index requires the following steps (sketched with curl below):
close the index
set the setting to null
reopen the index
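For a setting that the running version still recognizes, that sequence looks roughly like this (the setting name below is only a placeholder):
curl -X POST "elk29:9200/logstash-20160915/_close?pretty"
curl -X PUT "elk29:9200/logstash-20160915/_settings?pretty" -H 'Content-Type: application/json' -d' { "index" : { "some.static.setting" : null }}'
curl -X POST "elk29:9200/logstash-20160915/_open?pretty"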
Unfortunately this does not work with settings which are not supported anymore with the current version of ES.
curl -X PUT "elk29:9200/logstash-20160915/_settings?pretty" -H 'Content-Type: application/json' -d' { "index" : { "shrink.source.uuid" : null }}'
{
"error" : {
"root_cause" : [
{
"type" : "remote_transport_exception",
"reason" : "[elk24][10.21.15.24:9300][indices:admin/settings/update]"
}
],
"type" : "illegal_argument_exception",
"reason" : "unknown setting [index.shrink.source.uuid] please check that any required plugins are installed, or check the breaking changes documentation for removed settings"
},
"status" : 400
}
I expected the setting to simply be removed.
Obviously ES emulates the removal of a setting by setting its value to null. Unfortunately this only works with explicitly supported settings, not with outdated, unsupported settings.
The question remains: how can I remove index settings which are no longer supported by the current version of ES?

Random field number in protobuf

A server sends responses in protobuf format. I was trying to recreate the definitions (proto file) using protoc's --decode_raw mode and came across a strange structure:
2 {
  1: 215647270
  2 {
    215647270 {
      1 {
        2: "30093005"
      }
    }
  }
}
As you can see, the value of field #1 (215647270) is some kind of reference to another field number. The value (and the corresponding branch) is random. I couldn't find any information in the official Protobuf documentation regarding this "dynamic" generation of fields.
Does anyone know how to describe this structure with Protocol buffers messages?
I found out that this is protobuf's extensions feature: the "random" numbers are extension field numbers that were simply hardcoded by the authors.

Filebeat - how to override Elasticsearch field mapping?

We're ingesting data to Elasticsearch through Filebeat and have hit a configuration problem.
I'm trying to specify a date format for a particular field (the standard @timestamp field holds indexing time, and we need the actual event time). So far I've been unable to do so: I tried fields.yml, a separate JSON template file, and specifying it inline in filebeat.yml. That last option is just a guess; I haven't found any example of this particular configuration combo.
What am I missing here? I was sure this should work:
filebeat.yml
# rest of the file
template:
  # Template name. By default the template name is filebeat.
  #name: "filebeat"
  # Path to template file
  path: "custom-template.json"
and in custom-template.json
{
  "mappings": {
    "doc": {
      "properties": {
        "eventTime": {
          "type": "date",
          "format": "YYYY-MM-dd HH:mm:ss.SSSS"
        }
      }
    }
  }
}
but it didn't.
We're using Filebeat 6.2.4 and Elasticsearch 6.x.
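Worth noting: in Filebeat 6.x the template options moved under setup.template.*, so the older template: block shown above may be ignored entirely. If memory serves, pointing Filebeat at a custom JSON template looks roughly like the sketch below, but the exact keys should be verified against the Filebeat 6.2 reference:
setup.template.json.enabled: true
setup.template.json.path: "custom-template.json"
setup.template.json.name: "custom-template"    # assumed key names; check the 6.2 docs
setup.template.overwrite: true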
I couldn't get the Filebeat configuration to work, so in the end I changed the time field format in our service and it worked instantly.
I found the official Filebeat documentation to be lacking complete examples. Maybe that's just my problem.
EDIT: actually, it turns out you can specify a list of allowed formats in your mapping.
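A minimal sketch of what such a mapping could look like, with the accepted formats separated by || (the exact format strings here are illustrative):
"eventTime": {
  "type": "date",
  "format": "yyyy-MM-dd HH:mm:ss.SSSS||strict_date_optional_time||epoch_millis"
}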

Unable to update Indices Recovery settings dynamically in elasticsearch

As per this article in the Elasticsearch reference, we can update the following settings dynamically for a live cluster with the cluster-update-settings API:
indices.recovery.file_chunk_size
indices.recovery.translog_ops
indices.recovery.translog_size
But when I try to update any of the above I am getting the following error:
PUT /_cluster/settings
{
  "transient" : {
    "indices.recovery.file_chunk_size" : "5mb"
  }
}
Response:
"type": "illegal_argument_exception",
"reason": "transient setting [indices.recovery.file_chunk_size], not dynamically updateable"
Have they changed this and not updated their reference article, or am I missing something? I am using Elasticsearch 5.0.2.
They have been removed in this pull request:
indices.recovery.file_chunk_size - now fixed to 512kb
indices.recovery.translog_ops - removed without replacement
indices.recovery.translog_size - now fixed to 512kb
indices.recovery.compress - file chunks are not compressed due to lucene's compression but translog operations are.
But I'm surprised it is not reflected in the documentation.
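For what it's worth, the recovery setting that is still dynamically updatable in 5.x (and usually the one people actually want to tune) is indices.recovery.max_bytes_per_sec; a minimal sketch:
PUT /_cluster/settings
{
  "transient" : {
    "indices.recovery.max_bytes_per_sec" : "100mb"
  }
}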

elasticsearch term query with string value containing :(colon)

I am trying to execute a term query whose value is a string containing a colon. It works fine with the Sense plugin.
GET XX/XX/_search
{
  "query": {
    "term" : { "XX.XX" : "7:140453136:T" }
  }
}
But the same term query doesn't work with the Java API.
SearchRequestBuilder response = client.prepareSearch(indexName);
response.setTypes(indexType);
response.setSearchType(SearchType.DFS_QUERY_THEN_FETCH);
// term query with the colon-containing value, same as in Sense
response.setQuery(QueryBuilders.termQuery("XX.XX", "7:140453136:T"));
response.setFrom(0).setSize(60).setExplain(true);
SearchResponse matchallResponse = response.execute().actionGet();
error:
TransportSerializationException[Failed to deserialize response of type [org.elasticsearch.action.search.SearchResponse]]
Caused by: java.lang.ClassNotFoundException: org.apache.lucene.codecs.lucene50.Lucene50DocValuesFormat
In my mapping I had set this field to not_analyzed, so I am sure Elasticsearch is not tokenizing it.
I am using ES 2.1.1.
I see that there is already a question on this, but the solution posted there doesn't solve my problem.
I had the latest Lucene core libraries in my classpath, which don't have Lucene50DocValuesFormat. I replaced Lucene core with 5.3.1 and everything works fine now. The solution to this problem: don't use Lucene core newer than 5.3.1 with ES 2.1.1. Thanks everyone! Hope this helps!
