I am trying to update my current elastic search schema which is on 1.3.2 to the latest one. For one of the indexes, the current schema looks something like the below:
curl -XPOST localhost:9200/_template/<INDEXNAME> -d '{
"template" : "*-<INDEXNAME_TYPE>",
"index.mapping.attachment.indexed_chars": -1,
"mappings" : {
"post" : {
"properties" : {
"sub" : { "type" : "string" },
"sender" : { "type" : "string" },
"dt" : { "type" : "date", "format" : "EEE, d MMM yyyy HH:mm:ss Z" },
"body" : { "type" : "string"},
"attachments" : {
"type" : "attachment",
"path" : "full",
"fields" : {
"attachments" : {
"type" : "string",
"term_vector" : "with_positions_offsets",
"store" : true
},
"name" : {"store" : "yes"},
"title" : {"store" : "yes"},
"date" : {"store" : "yes"},
"content_type" : {"store" : "yes"},
"content_length" : {"store" : "yes"}
}
}
}
}
}
}'
With my old version of Elastic Search, there is a "mapper-attachment" plugin installed. I am aware that the "mapper-attachment" plugin has been replaced by the "Ingest Attachment Processor" and following the examples from the plugins' website, I do understand their examples where I got to create a pipeline,
PUT _ingest/pipeline/attachment
{
"description" : "Extract attachment information from arrays",
"processors" : [
{
"foreach": {
"field": "attachments",
"processor": {
"attachment": {
"target_field": "_ingest._value.attachment",
"field": "_ingest._value.data",
"indexed_chars" : -1
}
}
}
}
]
}
PUT my-index-000001/_doc/my_id?pipeline=attachment
{
"sub" : "This is a test post",
"sender" : "jane.doe#gmail.com",
"dt" : "Sat, 15 Jan 2022 08:50:00 AEST"
"body" : "Test Body",
"fromaddr": "jane.doe#gmail.com",
"toaddr": "larne.jones#gmail.com",
"attachments" : [
{
"filename" : "ipsum.txt",
"data" : "dGhpcyBpcwpqdXN0IHNvbWUgdGV4dAo="
},
{
"filename" : "test.txt",
"data" : "VGhpcyBpcyBhIHRlc3QK"
}
]
}
How do I make use of this new attachment processor to create the index template I had before?
Note: With my index and schema, for each "post", there will be one or many attachments,
The answer is, unlike the previous version, I cannot use the data type of attachment. So following the example from the elastic.co website and from my own question, the answer is in my question itself.
1st: create the pipeline as in the question
2nd Create the schema [see below]
3rd Insert the data as shown in the question. When inserting the data into the index, use pipeline=attachment as the name of the pipeline and the plugin would parse the given attachment into the schema above
curl -XPOST localhost:9200/_template/<INDEXNAME> -d '{
"template" : "*-<INDEXNAME_TYPE>",
"index.mapping.attachment.indexed_chars": -1,
"mappings" : {
"post" : {
"properties" : {
"sub" : { "type" : "string" },
"sender" : { "type" : "string" },
"dt" : { "type" : "date", "format" : "EEE, d MMM yyyy HH:mm:ss Z" },
"body" : { "type" : "string"},
"attachments" : {
"properties" : {
"attachment" : {
"properties" : {
"content" : {
"type" : "text",
"store": true,
"term_vector": "with_positions_offsets"
},
"content_length" : { "type" : "long" },
"content_type" : { "type" : "keyword" },
"language" : { "type" : "keyword"},
"date" : { "type" : "date", "format" : "EEE, d MMM yyyy HH:mm:ss Z" }
}
},
"content" : { "type": "keyword" },
"name" : { "type" : "keyword" }
}
}
}
}
}
}'
Related
I am doing a DIY Tweet Sentiment analyser, I have an index of tweets like these
"_source" : {
"id" : 26930655,
"status" : 1,
"title" : "Hereโs 5 underrated #BTC and realistic crypto accounts that everyone should follow: #Quinnvestments , #JacobOracle , #jevauniedaye , #ginsbergonomics , #InspoCrypto",
"hashtags" : null,
"created_at" : 1622390229,
"category" : null,
"language" : 50
},
{
"id" : 22521897,
"status" : 1,
"title" : "#bulls gonna overtake the #bears soon #ATH coming #ALTSEASON #BSCGem #eth #btc #memecoin #100xgems #satyasanatan ๐๐ฉ๐ฉ๐ฎ๐ณ""",
"hashtags" : null,
"created_at" : 1620045296,
"category" : null,
"language" : 50
}
There Mappings are settings are like
"sentiment-en" : {
"mappings" : {
"properties" : {
"category" : {
"type" : "text"
},
"created_at" : {
"type" : "integer"
},
"hashtags" : {
"type" : "text"
},
"id" : {
"type" : "long"
},
"language" : {
"type" : "integer"
},
"status" : {
"type" : "integer"
},
"title" : {
"type" : "text",
"fields" : {
"raw" : {
"type" : "keyword"
},
"raw_text" : {
"type" : "text"
},
"stop" : {
"type" : "text",
"index_options" : "docs",
"analyzer" : "stop_words_filter"
},
"syn" : {
"type" : "text",
"index_options" : "docs",
"analyzer" : "synonyms_filter"
}
},
"index_options" : "docs",
"analyzer" : "all_ok_filter"
}
}
}
}
}
"settings" : {
"index" : {
"number_of_shards" : "10",
"provided_name" : "sentiment-en",
"creation_date" : "1627975717560",
"analysis" : {
"filter" : {
"stop_words" : {
"type" : "stop",
"stopwords" : [ ]
},
"synonyms" : {
"type" : "synonym",
"synonyms" : [ ]
}
},
"analyzer" : {
"stop_words_filter" : {
"filter" : [ "stop_words" ],
"tokenizer" : "standard"
},
"synonyms_filter" : {
"filter" : [ "synonyms" ],
"tokenizer" : "standard"
},
"all_ok_filter" : {
"filter" : [ "stop_words", "synonyms" ],
"tokenizer" : "standard"
}
}
},
"number_of_replicas" : "0",
"uuid" : "Q5yDYEXHSM-5kvyLGgsYYg",
"version" : {
"created" : "7090199"
}
}
Now the problem is i want to extract all the Hashtags and mentions in a seprate field.
What i want as O/P
"id" : 26930655,
"status" : 1,
"title" : "Hereโs 5 underrated #BTC and realistic crypto accounts that everyone should follow: #Quinnvestments , #JacobOracle , #jevauniedaye , #ginsbergonomics , #InspoCrypto",
"hashtags" : BTC,
"created_at" : 1622390229,
"category" : null,
"language" : 50
},
{
"id" : 22521897,
"status" : 1,
"title" : "#bulls gonna overtake the #bears soon #ATH coming #ALTSEASON #BSCGem #eth #btc #memecoin #100xgems #satyasanatan ๐๐ฉ๐ฉ๐ฎ๐ณ""",
"hashtags" : bulls,bears,ATH, ALTSEASON, BSCGem, eth , btc, memecoin, 100xGem, satyasanatan
"created_at" : 1620045296,
"category" : null,
"language" : 50
}
What i have tried so far
Create a pattern based tokenizer to just read Hashtags and mentions and no other token for field hashtag and mentions did not had much success there.
Tried to write an n-gram tokenizer without any analysers did not achive much success there as well.
Any help would be appreciated, I am open to reindex my data. Thanks in advance !!!
You can use Logstash Twitter input plugin for indexing data and configured below ruby script in filter plugin as mentioned in blog.
if [message] {
ruby {
code => "event.set('hashtags', event.get('message').scan(/\#[a-z]*/i))"
}
}
You can use Logtstash Elasticsearch Input plugin for source index and configured about ruby code in Filter plugin and Logtstash elasticsearch output plugin with destination index.
input {
elasticsearch {
hosts => "localhost:9200"
index => "current_twitter"
query => '{ "query": { "query_string": { "query": "*" } } }'
size => 500
scroll => "5m"
}
}
filter{
if [message] {
ruby {
code => "event.set('hashtags', event.get('message').scan(/\#[a-z]*/i))"
}
}
}
output {
elasticsearch {
index => "new_twitter"
}
}
Another option is to use reingest API with ingest pipeline but ingest pipeline not support ruby code. So you need to convert above ruby code to the painless script.
I am trying to do geomap of a value in Elasticsearch but the value type of the client_location is set as a string and I would like to change it to geo_point. When I run the following I am getting:
#curl -XGET "http://core.z0z0.tk:9200/_all/_mappings/http?pretty"
{
"packetbeat-2015.12.04" : {
"mappings" : {
"http" : {
"properties" : {
"#timestamp" : {
"type" : "date",
"format" : "strict_date_optional_time||epoch_millis"
},
"beat" : {
"properties" : {
"hostname" : {
"type" : "string"
},
"name" : {
"type" : "string"
}
}
},
"bytes_in" : {
"type" : "long"
},
"bytes_out" : {
"type" : "long"
},
"client_ip" : {
"type" : "string"
},
"client_location" : {
"type" : "string"
},
"client_port" : {
"type" : "long"
},
"client_proc" : {
"type" : "string"
},
"client_server" : {
"type" : "string"
},
"count" : {
"type" : "long"
},
"direction" : {
"type" : "string"
},
"http" : {
"properties" : {
"code" : {
"type" : "long"
},
"content_length" : {
"type" : "long"
},
"phrase" : {
"type" : "string"
}
}
},
"ip" : {
"type" : "string"
},
"method" : {
"type" : "string"
},
"notes" : {
"type" : "string"
},
"params" : {
"type" : "string"
},
"path" : {
"type" : "string"
},
"port" : {
"type" : "long"
},
"proc" : {
"type" : "string"
},
"query" : {
"type" : "string"
},
"responsetime" : {
"type" : "long"
},
"server" : {
"type" : "string"
},
"status" : {
"type" : "string"
},
"type" : {
"type" : "string"
}
}
}
}
}
}
When I run the following command to change the type of the value from string to geo_point I am getting the following error:
# curl -XPUT "http://localhost:9200/_all/_mappings/http" -d '
> {
> "http" : {
> "properties" : {
> "client_location" : {
> "type" : "geo_point"
> }
> }
> }
> }
> '
{"error":{"root_cause":[{"type":"merge_mapping_exception","reason":"Merge failed with failures {[mapper [client_location] of different type, current_type [string], merged_type[geo_point]]}"}],"type":"merge_mapping_exception","reason":"Merge failed with failures {[mapper [client_location] of different type, current_type [string], merged_type [geo_point]]}"},"status":400}
Any suggestion how should I correctly change the type?
Thanks in advance.
Unfortunately, once you've created a field you cannot change its type anymore. The best thing to do is to delete the index and recreate it properly with the adequate mapping.
Another temporary solution if you don't want to delete your index immediately, is to create a sub-field of your existing field:
# curl -XPUT "http://localhost:9200/_all/_mappings/http" -d '{
"http": {
"properties": {
"client_location": {
"type": "string",
"fields": {
"geo": {
"type": "geo_point"
}
}
}
}
}
}'
And then you can access it in your queries using client_location.geo.
Also note that you have to re-index your data in order to populate that new sub-field... which means you might just as well delete your index and re-create it properly.
UPDATE
After installing Packetbeat you need to make sure to install the packetbeat template yourself as described here (i.e. it is not done automatically):
https://www.elastic.co/guide/en/beats/packetbeat/current/packetbeat-getting-started.html#packetbeat-template
curl -XPUT 'http://localhost:9200/_template/packetbeat' -d#/etc/packetbeat/packetbeat.template.json
I'm new to ElasticSearch, started working with ElasticSearch 1.7.3 as part of a Logstash-ElasticSearch-Kibana deployment.
I've defined a mapping template for my log messages, this is the interesting part:
{
"template" : "logstash-*",
"settings" : { "index.refresh_interval" : "5s" },
"mappings" : {
"_default_" : {
"_all" : {"enabled" : true, "omit_norms" : true},
"dynamic_templates" : [ {
"date_fields" : {
"match" : "*",
"match_mapping_type" : "date",
"mapping" : { "type" : "date", "doc_values" : true }
}
}],
"properties" : {
"#version" : { "type" : "string", "index" : "not_analyzed" },
"#timestamp" : { "type" : "date", "format" : "dateOptionalTime" },
"message" : { "type" : "string" }
}
} ,
"my_log" : {
"_all" : { "enabled" : true, "omit_norms" : true },
"dynamic_templates" : [ {
"date_fields" : {
"match" : "*",
"match_mapping_type" : "date",
"mapping" : { "type" : "date", "doc_values" : true }
}
}],
"properties" : {
"#timestamp" : { "type" : "date", "format" : "dateOptionalTime" },
"file" : { "type" : "string" },
"message" : { "type" : "string" }
"geolocation" : { "type" : "string" },
}
}
}
}
Although the #timestamp field is defined as doc_value:true I have an error of MemoryException because it is a fielddata:
[FIELDDATA] Data too large, data for [#timestamp] would be larger than
limit of [633785548/604.4 mb]
NOTE:
I know I can change the memory or add more nodes to the cluster, but in my point of view this is a design problem where this field should not be indexed in memory.
my error log in elasticsearch like that:
[2015-09-04 10:59:49,531][DEBUG][action.bulk ] [baichebao-node-2] [questions][0] failed to execute bulk item (index) index {[questions][baichebao][AU-WS7qZwHwGnxdqIztg], source[_na_]}
org.elasticsearch.index.mapper.MapperParsingException: failed to parse
at org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:565)
at org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:493)
at org.elasticsearch.index.shard.IndexShard.prepareCreate(IndexShard.java:466)
at org.elasticsearch.action.bulk.TransportShardBulkAction.shardIndexOperation(TransportShardBulkAction.java:418)
at org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnPrimary(TransportShardBulkAction.java:148)
at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$PrimaryPhase.performOnPrimary(TransportShardReplicationOperationAction.java:574)
at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$PrimaryPhase$1.doRun(TransportShardReplicationOperationAction.java:440)
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:36)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.elasticsearch.ElasticsearchParseException: Failed to derive xcontent
at org.elasticsearch.common.xcontent.XContentFactory.xContent(XContentFactory.java:195)
at org.elasticsearch.common.xcontent.XContentHelper.createParser(XContentHelper.java:75)
at org.elasticsearch.common.xcontent.XContentHelper.createParser(XContentHelper.java:53)
at org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:507)
... 10 more
and my mapping like that:
{
"mappings" : {
"baichebao" : {
"dynamic" : false,
"_all" : { "enable" : false },
"_id" : {
"store" : true,
"path" : "id"
},
"properties" : {
"id" : {
"type" : "long"
},
"content" : {
"type" : "string",
"analyzer" : "ik_syno_smart"
},
"uid" : {
"type" : "integer"
},
"all_answer_count" : {
"type" : "integer"
},
"answer_users" : {
"type" : "integer"
},
"best_answer" : {
"type" : "long"
},
"status" : {
"type" : "short"
},
"created_at" : {
"type" : "long"
},
"distrust" : {
"type" : "short"
},
"is_expert" : {
"type" : "boolean"
},
"series_id" : {
"type" : "integer"
},
"is_closed" : {
"type" : "boolean"
},
"closed_at" : {
"type" : "long"
},
"tags" : {
"type" : "string"
},
"channel_type" : {
"type" : "integer"
},
"channel_sub_type" : {
"type" : "integer"
}
}
}
}
}
But I can not find out which field parse error?
How can i resolve this problem?
This error typically indicates that the document that was sent to elasticsearch cannot be identified as JSON or SMILE document by checking the first 20 bytes. For example, you would get this error if you omit the leading "{" in a JSON document:
curl -XPUT localhost:9200/test/doc/1 -d 'I am not a json document'
or prepend valid JSON with 20+ whitespace characters:
curl -XPUT localhost:9200/test/doc/1 -d ' {"foo": "bar"}'
I have nginx logs and i have this date format [02/Mar/2015:13:02:51 +0000]
What should i use in elasticsearch and what i should put in the dateformat field of Kibana4?
curl -XGET 'http://localhost:9200/_mapping?pretty'
{
"nginx" : {
"mappings" : {
"t07_nginx" : {
"properties" : {
"#timestamp" : {
"type" : "date",
"format" : "dateOptionalTime"
},
"body_bytes_sent" : {
"type" : "string"
},
"geoip_country_code" : {
"type" : "string"
},
"host" : {
"type" : "string"
},
"http_host" : {
"type" : "string"
},
"http_referer" : {
"type" : "string"
},
"http_user_agent" : {
"type" : "string",
"index" : "not_analyzed"
},
"http_x_forwarded_for" : {
"type" : "string"
},
"message" : {
"type" : "string"
},
"msec request_time" : {
"type" : "string"
},
"remote_addr" : {
"type" : "string"
},
"request_http_protocol" : {
"type" : "string"
},
"request_time" : {
"type" : "string"
},
"request_type" : {
"type" : "string"
},
"request_url" : {
"type" : "string"
},
"status" : {
"type" : "string"
},
"upstream_addr" : {
"type" : "string"
},
"upstream_response_time" : {
"type" : "string"
}
}
}
}
}
with the above i can't see any data(events) in Kibana
Thanks
What does the input plugin for nginx/output plugin for elasticsearch in your fluentd config file look like?
Also, make sure you have your time range setup correctly in kibana. I believe it defaults to 15 minutes.