Extract Hashtags and Mentions into separate fields - elasticsearch

I am building a DIY tweet sentiment analyser. I have an index of tweets like these:
"_source" : {
"id" : 26930655,
"status" : 1,
"title" : "Here’s 5 underrated #BTC and realistic crypto accounts that everyone should follow: #Quinnvestments , #JacobOracle , #jevauniedaye , #ginsbergonomics , #InspoCrypto",
"hashtags" : null,
"created_at" : 1622390229,
"category" : null,
"language" : 50
},
{
"id" : 22521897,
"status" : 1,
"title" : "#bulls gonna overtake the #bears soon #ATH coming #ALTSEASON #BSCGem #eth #btc #memecoin #100xgems #satyasanatan 🙏🚩🚩🇮🇳""",
"hashtags" : null,
"created_at" : 1620045296,
"category" : null,
"language" : 50
}
The mappings and settings are like this:
"sentiment-en" : {
"mappings" : {
"properties" : {
"category" : {
"type" : "text"
},
"created_at" : {
"type" : "integer"
},
"hashtags" : {
"type" : "text"
},
"id" : {
"type" : "long"
},
"language" : {
"type" : "integer"
},
"status" : {
"type" : "integer"
},
"title" : {
"type" : "text",
"fields" : {
"raw" : {
"type" : "keyword"
},
"raw_text" : {
"type" : "text"
},
"stop" : {
"type" : "text",
"index_options" : "docs",
"analyzer" : "stop_words_filter"
},
"syn" : {
"type" : "text",
"index_options" : "docs",
"analyzer" : "synonyms_filter"
}
},
"index_options" : "docs",
"analyzer" : "all_ok_filter"
}
}
}
}
}
"settings" : {
"index" : {
"number_of_shards" : "10",
"provided_name" : "sentiment-en",
"creation_date" : "1627975717560",
"analysis" : {
"filter" : {
"stop_words" : {
"type" : "stop",
"stopwords" : [ ]
},
"synonyms" : {
"type" : "synonym",
"synonyms" : [ ]
}
},
"analyzer" : {
"stop_words_filter" : {
"filter" : [ "stop_words" ],
"tokenizer" : "standard"
},
"synonyms_filter" : {
"filter" : [ "synonyms" ],
"tokenizer" : "standard"
},
"all_ok_filter" : {
"filter" : [ "stop_words", "synonyms" ],
"tokenizer" : "standard"
}
}
},
"number_of_replicas" : "0",
"uuid" : "Q5yDYEXHSM-5kvyLGgsYYg",
"version" : {
"created" : "7090199"
}
}
Now the problem is that I want to extract all the hashtags and mentions into a separate field.
What I want as output:
"id" : 26930655,
"status" : 1,
"title" : "Here’s 5 underrated #BTC and realistic crypto accounts that everyone should follow: #Quinnvestments , #JacobOracle , #jevauniedaye , #ginsbergonomics , #InspoCrypto",
"hashtags" : BTC,
"created_at" : 1622390229,
"category" : null,
"language" : 50
},
{
"id" : 22521897,
"status" : 1,
"title" : "#bulls gonna overtake the #bears soon #ATH coming #ALTSEASON #BSCGem #eth #btc #memecoin #100xgems #satyasanatan 🙏🚩🚩🇮🇳""",
"hashtags" : bulls,bears,ATH, ALTSEASON, BSCGem, eth , btc, memecoin, 100xGem, satyasanatan
"created_at" : 1620045296,
"category" : null,
"language" : 50
}
What I have tried so far:
Created a pattern-based tokenizer that emits only hashtag and mention tokens (and nothing else) for the hashtags and mentions fields; did not have much success there.
Tried to write an n-gram tokenizer without any analyzers; did not achieve much success there either.
Any help would be appreciated, and I am open to reindexing my data. Thanks in advance!

You can use the Logstash Twitter input plugin for indexing the data and configure the Ruby script below in the filter plugin, as mentioned in the blog.
if [message] {
  ruby {
    # capture the text after each '#' (letters, digits, underscore) and drop the '#' itself
    code => "event.set('hashtags', event.get('message').scan(/#(\w+)/).flatten)"
  }
}
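If you want to sanity-check the filter before wiring it to Elasticsearch, here is a minimal stdin/stdout sketch (my own, not from the blog); it assumes the tweet text arrives in the message field and also collects @mentions, since the question asks for those as well:
input { stdin { } }
filter {
  if [message] {
    ruby {
      # scan for #hashtags and @mentions, dropping the leading symbol
      code => "
        event.set('hashtags', event.get('message').scan(/#(\w+)/).flatten)
        event.set('mentions', event.get('message').scan(/@(\w+)/).flatten)
      "
    }
  }
}
output { stdout { codec => rubydebug } }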
You can use the Logstash Elasticsearch input plugin for the source index, configure the above Ruby code in the filter plugin, and use the Logstash Elasticsearch output plugin with the destination index.
input {
  elasticsearch {
    hosts => "localhost:9200"
    index => "current_twitter"
    query => '{ "query": { "query_string": { "query": "*" } } }'
    size => 500
    scroll => "5m"
  }
}
filter {
  if [message] {
    ruby {
      code => "event.set('hashtags', event.get('message').scan(/#(\w+)/).flatten)"
    }
  }
}
output {
  elasticsearch {
    index => "new_twitter"
  }
}
Another option is to use the Reindex API with an ingest pipeline, but ingest pipelines do not support Ruby code, so you would need to convert the above Ruby code into a Painless script.
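For completeness, here is a minimal sketch of that second route. The pipeline name extract_tags, the destination index name sentiment-en-v2, and the whitespace-based splitOnToken approach (chosen so the Painless regex setting does not need to be enabled) are my own illustrative choices, and tokens with punctuation glued on (e.g. "#btc,") would still need extra cleanup:
PUT _ingest/pipeline/extract_tags
{
  "description" : "copy #hashtags and @mentions from title into their own fields",
  "processors" : [
    {
      "script" : {
        "lang" : "painless",
        "source" : """
          if (ctx.title != null) {
            def hashtags = [];
            def mentions = [];
            // split on spaces and keep tokens that start with '#' or '@'
            for (def token : ctx.title.splitOnToken(' ')) {
              if (token.startsWith('#') && token.length() > 1) {
                hashtags.add(token.substring(1));
              } else if (token.startsWith('@') && token.length() > 1) {
                mentions.add(token.substring(1));
              }
            }
            ctx.hashtags = hashtags;
            ctx.mentions = mentions;
          }
        """
      }
    }
  ]
}
POST _reindex
{
  "source" : { "index" : "sentiment-en" },
  "dest" : { "index" : "sentiment-en-v2", "pipeline" : "extract_tags" }
}
If you plan to aggregate on the extracted values, it is also worth mapping hashtags and mentions as keyword in the destination index before reindexing.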

Related

Using Ingest Attachment Plugin within elastic search index template

I am trying to update my current Elasticsearch schema, which is on 1.3.2, to the latest version. For one of the indexes, the current schema looks something like this:
curl -XPOST localhost:9200/_template/<INDEXNAME> -d '{
"template" : "*-<INDEXNAME_TYPE>",
"index.mapping.attachment.indexed_chars": -1,
"mappings" : {
"post" : {
"properties" : {
"sub" : { "type" : "string" },
"sender" : { "type" : "string" },
"dt" : { "type" : "date", "format" : "EEE, d MMM yyyy HH:mm:ss Z" },
"body" : { "type" : "string"},
"attachments" : {
"type" : "attachment",
"path" : "full",
"fields" : {
"attachments" : {
"type" : "string",
"term_vector" : "with_positions_offsets",
"store" : true
},
"name" : {"store" : "yes"},
"title" : {"store" : "yes"},
"date" : {"store" : "yes"},
"content_type" : {"store" : "yes"},
"content_length" : {"store" : "yes"}
}
}
}
}
}
}'
With my old version of Elasticsearch, there is a "mapper-attachment" plugin installed. I am aware that the "mapper-attachment" plugin has been replaced by the "Ingest Attachment Processor", and following the examples from the plugin's website I understand that I have to create a pipeline:
PUT _ingest/pipeline/attachment
{
"description" : "Extract attachment information from arrays",
"processors" : [
{
"foreach": {
"field": "attachments",
"processor": {
"attachment": {
"target_field": "_ingest._value.attachment",
"field": "_ingest._value.data",
"indexed_chars" : -1
}
}
}
}
]
}
PUT my-index-000001/_doc/my_id?pipeline=attachment
{
"sub" : "This is a test post",
"sender" : "jane.doe#gmail.com",
"dt" : "Sat, 15 Jan 2022 08:50:00 AEST"
"body" : "Test Body",
"fromaddr": "jane.doe#gmail.com",
"toaddr": "larne.jones#gmail.com",
"attachments" : [
{
"filename" : "ipsum.txt",
"data" : "dGhpcyBpcwpqdXN0IHNvbWUgdGV4dAo="
},
{
"filename" : "test.txt",
"data" : "VGhpcyBpcyBhIHRlc3QK"
}
]
}
How do I make use of this new attachment processor to create the index template I had before?
Note: With my index and schema, for each "post" there will be one or many attachments.
The answer is that, unlike the previous version, I cannot use the attachment data type. So, following the example from the elastic.co website and from my own question, the answer is in the question itself.
1st: Create the pipeline as in the question.
2nd: Create the schema [see below].
3rd: Insert the data as shown in the question. When inserting the data into the index, use pipeline=attachment as the name of the pipeline, and the plugin will parse the given attachment into the schema above.
curl -XPOST localhost:9200/_template/<INDEXNAME> -d '{
"template" : "*-<INDEXNAME_TYPE>",
"index.mapping.attachment.indexed_chars": -1,
"mappings" : {
"post" : {
"properties" : {
"sub" : { "type" : "string" },
"sender" : { "type" : "string" },
"dt" : { "type" : "date", "format" : "EEE, d MMM yyyy HH:mm:ss Z" },
"body" : { "type" : "string"},
"attachments" : {
"properties" : {
"attachment" : {
"properties" : {
"content" : {
"type" : "text",
"store": true,
"term_vector": "with_positions_offsets"
},
"content_length" : { "type" : "long" },
"content_type" : { "type" : "keyword" },
"language" : { "type" : "keyword"},
"date" : { "type" : "date", "format" : "EEE, d MMM yyyy HH:mm:ss Z" }
}
},
"content" : { "type": "keyword" },
"name" : { "type" : "keyword" }
}
}
}
}
}
}'
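One optional refinement, assuming Elasticsearch 6.5 or later (where the index.default_pipeline setting exists): instead of appending ?pipeline=attachment to every index request, the template can declare the pipeline as the index default, for example:
curl -XPOST localhost:9200/_template/<INDEXNAME> -H 'Content-Type: application/json' -d '{
  "template" : "*-<INDEXNAME_TYPE>",
  "settings" : {
    "index.default_pipeline" : "attachment"
  },
  "mappings" : { ...same mappings as above... }
}'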

Kibana index pattern mapping conflict

I am tired of reindexing; every 2-3 weeks I have to reindex.
{
"winlogbeat_sysmon" : {
"order" : 0,
"index_patterns" : [
"log-wlb-sysmon-*"
],
"settings" : {
"index" : {
"lifecycle" : {
"name" : "winlogbeat_sysmon_policy",
"rollover_alias" : "log-wlb-sysmon"
},
"refresh_interval" : "1s",
"number_of_shards" : "1",
"number_of_replicas" : "1"
}
},
"mappings" : {
"properties" : {
"thread_id" : {
"type" : "long"
},
"z_elastic_ecs.event.code" : {
"type" : "long"
},
"geoip" : {
"type" : "object",
"properties" : {
"ip" : {
"type" : "ip"
},
"latitude" : {
"type" : "half_float"
},
"location" : {
"type" : "geo_point"
},
"longitude" : {
"type" : "half_float"
}
}
},
"dst_ip_addr" : {
"type" : "ip"
}
}
},
"aliases" : { }
}
}
This is the template I set earlier; since then I haven't changed anything.
In the current and previous indices of log-wlb-sysmon, dst_ip_addr is an ip field, while older indices of log-wlb-sysmon have it as a text field. In Logstash I didn't see any warning for this issue.

Elasticsearch Suggestions Multi Index and Multi Fields

I have different indexes that contain different fields, and I am trying to figure out how to get suggestions from all indexes and all fields. I know that with GET /_all/_search I can search across all indexes. But how can I get suggestions from all indexes and all fields? I want a feature like Google's "Did you mean" suggestions.
So, I tried this out:
GET /_all/_search
{
"query" : {
"multi_match" : {
"query" : "berlin"
}
},
"suggest" : {
"text" : "berlin",
"my-suggest-1" : {
"term" : {
"field" : "street"
}
},
"my-suggest-2" : {
"term" : {
"field" : "city"
}
},
"my-suggest-3" : {
"term" : {
"field" : "description"
}
}
}
}
"my-suggest-1" and "-2" belongs to Index address (see below) and "my-suggest-3" belongs to Index product. I get the following error:
"error" : {
"root_cause" : [
{
"type" : "illegal_argument_exception",
"reason" : "no mapping found for field [street]"
},
{
"type" : "illegal_argument_exception",
"reason" : "no mapping found for field [city]"
},
{
"type" : "illegal_argument_exception",
"reason" : "no mapping found for field [description]"
}
]
}
But if I use only the fields of one index, I get suggestions, see:
GET /_all/_search
{
"query" : {
"multi_match" : {
"query" : "berlin"
}
},
"suggest" : {
"text" : "berlin",
"my-suggest-1" : {
"term" : {
"field" : "street"
}
},
"my-suggest-2" : {
"term" : {
"field" : "city"
}
}
}
}
Response
...
"failures" : {
...
},
"hits" : {
...
}
"suggest" : {
"my-suggest-1" : [
{
"text" : "berlin",
"offset" : 0,
"length" : 10,
"options" : [
{
"text" : "berliner",
"score" : 0.9,
"freq" : 12
},
{
"text" : "berlinger",
"score" : 0.9,
"freq" : 1
}
]
}
],
"my-suggest-2" : [
{
"text" : "berlin",
"offset" : 0,
"length" : 10,
"options" : []
}
]
...
I don't know how I can get suggestions from both the address and the product index. I would be happy if someone could help me.
Index 1 - Address:
"address" : {
"aliases" : {
....
},
"mappings" : {
"dynamic" : "strict",
"properties" : {
"_entity_type" : {
"type" : "keyword",
"index" : false
},
"street" : {
"type" : "text"
},
"city" : {
"type" : "text"
}
}
},
"settings" : {
...
}
}
Index 2 - Product:
"product" : {
"aliases" : {
...
},
"mappings" : {
"dynamic" : "strict",
"properties" : {
"_entity_type" : {
"type" : "keyword",
"index" : false
},
"description" : {
"type" : "text"
}
}
},
"settings" : {
...
}
}
You can add multiple indices to your search. In this case, you need to run the suggesters over fields that exist in all of those indices, so you need to define all three fields in both indices. The fields "street" and "city" are filled only in the first index and the field "description" is filled only in the second index. This will be your mapping for the "address" index; in this index the "description" field exists but holds no data. In the second index, "street" and "city" exist but hold no data.
"address" : {
"aliases" : {
....
},
"mappings" : {
"dynamic" : "strict",
"properties" : {
"_entity_type" : {
"type" : "keyword",
"index" : false
},
"street" : {
"type" : "text"
},
"city" : {
"type" : "text"
},
"description" : {
"type" : "text"
}
}
},
"settings" : {
...
}
}
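For the second index, "product", the counterpart mapping would look like this (a sketch along the same lines; "street" and "city" are declared but never populated):
"product" : {
"aliases" : {
...
},
"mappings" : {
"dynamic" : "strict",
"properties" : {
"_entity_type" : {
"type" : "keyword",
"index" : false
},
"description" : {
"type" : "text"
},
"street" : {
"type" : "text"
},
"city" : {
"type" : "text"
}
}
},
"settings" : {
...
}
}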

Elasticsearch 'failed to find filter under name '

I have just started with ES 5.2.2.
I am trying to add an analyzer with Russian morphology support.
I run ES using Docker; I created an image with elasticsearch-analysis-morphology installed.
Then I:
create the index,
then put the settings,
and after that get the settings, and everything seems right:
curl http://localhost:9200/news/_settings?pretty
{
"news" : {
"settings" : {
"index" : {
"number_of_shards" : "5",
"provided_name" : "news",
"creation_date" : "1489343955314",
"analysis" : {
"analyzer" : {
"russian_analyzer" : {
"filter" : [
"stop",
"custom_stop",
"russian_stop",
"custom_word_delimiter",
"lowercase",
"russian_morphology",
"english_morphology"
],
"char_filter" : [
"html_strip",
"ru"
],
"type" : "custom",
"tokenizer" : "standard"
}
},
"char_filter" : {
"ru" : {
"type" : "mapping",
"mappings" : [
"Ё=>Е",
"ё=>е"
]
}
},
"filter:" : {
"custom_stop" : {
"type" : "stop",
"stopwords" : [
"n",
"r"
]
},
"russian_stop" : {
"ignore_case" : "true",
"type" : "stop",
"stopwords" : [
"а",
"без",
]
},
"custom_word_delimiter" : {
"split_on_numerics" : "false",
"generate_word_parts" : "false",
"preserve_original" : "true",
"catenate_words" : "true",
"generate_number_parts" : "true",
"catenate_all" : "true",
"split_on_case_change" : "false",
"type" : "word_delimiter",
"catenate_numbers" : "false"
}
}
},
"number_of_replicas" : "1",
"uuid" : "IUkHHwWrStqDMG6fYOqyqQ",
"version" : {
"created" : "5020299"
}
}
}
}
}
Then I try to open the index, but ES gives me this:
{
"error" : {
"root_cause" : [
{
"type" : "exception",
"reason" : "Failed to verify index [news/IUkHHwWrStqDMG6fYOqyqQ]"
}
],
"type" : "exception",
"reason" : "Failed to verify index [news/IUkHHwWrStqDMG6fYOqyqQ]",
"caused_by" : {
"type" : "illegal_argument_exception",
"reason" : "Custom Analyzer [russian_analyzer] failed to find filter under name [custom_stop]"
}
},
"status" : 500
}
I can't understand where I'm wrong.
Can anyone see what the problem is?
There was a mistake in the "filter" section. It was (note the stray colon inside the key name, "filter:" instead of "filter"):
"filter:" : {
"custom_stop" : {
"type" : "stop",
"stopwords" : [
"n",
"r"
]
}...
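It should be (same filters, just the key renamed from "filter:" to "filter"):
"filter" : {
"custom_stop" : {
"type" : "stop",
"stopwords" : [
"n",
"r"
]
}...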
Thanks @asettou and @andrey-morozov

How to use _timestamp in logstash elasticsearch

I am trying to figure out how to use the _timestamp with logstash.
I have tried to add to the mapping:
"_timestamp" : {
"enabled" : true,
"path" : "#timestamp"
},
But that does not have the expected effect. I did this in the elasticsearch-template.json file (I tried with and without "store" : true):
{
"template" : "logstash-*",
"settings" : {
"index.refresh_interval" : "5s"
},
"mappings" : {
"_default_" : {
"_timestamp" : {
"enabled" : true,
"store" : true,
"path" : "#timestamp"
},
"_all" : {"enabled" : true},
"dynamic_templates" : [ {
.....
And I added the modified file to the output filter
output {
elasticsearch_http {
template => '/tmp/elasticsearch-template.json'
host => '127.0.0.1'
port=>9200
}
}
In order to make sure the database is clean I repeatedly do:
curl -XDELETE http://localhost:9200/logstash*
curl -XDELETE http://localhost:9200/_template/logstash
rm ~/.sincedb_*
and then I try to import my logfile. But for some reason, the _timestamp is not set.
The mapping seems to be ok
{
"logstash-2014.03.24" : {
"_default_" : {
"dynamic_templates" : [ {
"string_fields" : {
"mapping" : {
"index" : "analyzed",
"omit_norms" : true,
"type" : "string",
"fields" : {
"raw" : {
"index" : "not_analyzed",
"ignore_above" : 256,
"type" : "string"
}
}
},
"match" : "*",
"match_mapping_type" : "string"
}
} ],
"_timestamp" : {
"enabled" : true,
"store" : true,
"path" : "#timestamp"
},
"properties" : {
"#version" : {
"type" : "string",
"index" : "not_analyzed",
"omit_norms" : true,
"index_options" : "docs"
},
"geoip" : {
"dynamic" : "true",
"properties" : {
"location" : {
"type" : "geo_point"
}
}
}
}
},
"logs" : {
"dynamic_templates" : [ {
"string_fields" : {
"mapping" : {
"index" : "analyzed",
"omit_norms" : true,
"type" : "string",
"fields" : {
"raw" : {
"index" : "not_analyzed",
"ignore_above" : 256,
"type" : "string"
}
}
},
"match" : "*",
"match_mapping_type" : "string"
}
} ],
"_timestamp" : {
"enabled" : true,
"store" : true,
"path" : "#timestamp"
},
"properties" : {
"#timestamp" : {
"type" : "date",
"format" : "dateOptionalTime"
},
The documents in the database look like
{
"_id": "Cps2Lq1nTIuj_VysOwwcWw",
"_index": "logstash-2014.03.25",
"_score": 1.0,
"_source": {
"#timestamp": "2014-03-25T00:47:09.703Z",
"#version": "1",
"created": "2014-03-25 01:47:09,703",
"host": "macbookpro.fritz.box",
"message": "2014-03-25 01:47:09,703 - Starting new HTTP connection (1): localhost",
"path": "/Users/scharf/git/ckann/annotator-store/logs/requests.log",
"text": "Starting new HTTP connection (1): localhost"
},
"_type": "logs"
},
Why is the _timestamp not set?
In short, it does work.
I tested your exact scenario and here's what I found:
When _source is enabled and _timestamp is specified from some path in the _source, you will never see _timestamp as part of the document. However, if you add the ?fields query string part, for example:
http://<localhost>:9200/es_test_logs/ESTest1/ilq4PU3tR9SeoLo794wZlg?fields=_timestamp
you will get the correct _timestamp value.
If, instead of using path, you pass _timestamp externally (in the _source document), you will see _timestamp under the _source property in the document as normal.
If you disable the _source field, you will not see ANY property at all in the document, even those you set as "store" : true. You will only see them when specifying ?fields, or when building a query that returns those fields.
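For example, against the document shown above, a 1.x-style request like the following (using the ?fields parameter; note that _timestamp itself was removed in later major versions) returns the stored timestamp:
curl "http://localhost:9200/logstash-2014.03.25/logs/Cps2Lq1nTIuj_VysOwwcWw?fields=_timestamp"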
