How can I parse a multiline JSON file in Logstash into a hash? - elasticsearch

I have a valid multiline JSON file. I want to parse it and assign its keys as field names and its values as field values.
Is it possible to do this automatically?
input {
  file {
    path => "/home/logstash/xunit.json"
    codec => json
  }
}
output {
  stdout {}
  elasticsearch {
    protocol => "http"
    codec => "json"
    host => "kibana.dev"
    port => "9200"
  }
}
After using this config, I can see that something was indexed, but the fields from my JSON do not appear. Is it possible to grab name, severity, status, and the start & stop dates? (A possible approach is sketched after the JSON example below.)
My JSON example:
[
  {
    "uid" : "441d1d1dd296fe60",
    "name" : "test_buylinks",
    "title" : "Test buylinks",
    "time" : {
      "start" : 1419621623182,
      "stop" : 1419621640491,
      "duration" : 17309
    },
    "severity" : "NORMAL",
    "status" : "FAILED"
  },
  {
    "uid" : "a88c89b377aca0c9",
    "name" : "test_buylinks",
    "title" : "Test buylinks",
    "time" : {
      "start" : 1419621623182,
      "stop" : 1419621640634,
      "duration" : 17452
    },
    "severity" : "NORMAL",
    "status" : "FAILED"
  },
  {
    "uid" : "32c3f8b52386c85c",
    "name" : "test_buylinks",
    "title" : "Test buylinks",
    "time" : {
      "start" : 1419621623185,
      "stop" : 1419621640826,
      "duration" : 17641
    },
    "severity" : "NORMAL",
    "status" : "FAILED"
  }
]
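One possible approach (a rough sketch, not a verified config): the file input reads the file line by line, so a pretty-printed JSON array never reaches the json codec as a single document. You can collapse the whole file into one event with the multiline codec, parse it with the json filter into a temporary field, and then split that field into one event per test. The pattern, the tests target field and the auto_flush_interval value below are assumptions, not something from the question:
input {
  file {
    path => "/home/logstash/xunit.json"
    start_position => "beginning"
    sincedb_path => "/dev/null"
    # Append every line that does not start with '[' to the previous event,
    # so the whole pretty-printed array becomes a single message.
    codec => multiline {
      pattern => "^\["
      negate => true
      what => "previous"
      auto_flush_interval => 2
    }
  }
}
filter {
  # Parse the collected array into a temporary field, then emit one event per element.
  json  { source => "message" target => "tests" }
  split { field => "tests" }
  mutate { remove_field => ["message"] }
}
After the split, each event carries the original keys as nested fields (e.g. [tests][name], [tests][severity], [tests][time][start]), which the elasticsearch output from the question can then index.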

Related

creating data stream through logstash

I have installed an Elasticsearch cluster v7.14.
I have created an ILM policy and an index template. However, the data stream parameters in the Logstash pipeline file are giving an error.
ILM policy -
{
  "testpolicy" : {
    "version" : 1,
    "modified_date" : "2021-08-28T02:58:25.942Z",
    "policy" : {
      "phases" : {
        "hot" : {
          "min_age" : "0ms",
          "actions" : {
            "rollover" : {
              "max_primary_shard_size" : "900mb",
              "max_age" : "2d"
            },
            "set_priority" : {
              "priority" : 100
            }
          }
        },
        "delete" : {
          "min_age" : "2d",
          "actions" : {
            "delete" : {
              "delete_searchable_snapshot" : true
            }
          }
        }
      }
    },
    "in_use_by" : {
      "indices" : [ ],
      "data_streams" : [ ],
      "composable_templates" : [ ]
    }
  }
}
Index template -
{
  "index_templates" : [
    {
      "name" : "access_template",
      "index_template" : {
        "index_patterns" : [
          "test-data-stream*"
        ],
        "template" : {
          "settings" : {
            "index" : {
              "number_of_shards" : "1",
              "number_of_replicas" : "0"
            }
          },
          "mappings" : {
            "_routing" : {
              "required" : false
            },
            "dynamic_date_formats" : [
              "strict_date_optional_time",
              "yyyy/MM/dd HH:mm:ss Z||yyyy/MM/dd Z"
            ],
            "numeric_detection" : true,
            "_source" : {
              "excludes" : [ ],
              "includes" : [ ],
              "enabled" : true
            },
            "dynamic" : true,
            "dynamic_templates" : [ ],
            "date_detection" : true
          }
        },
        "composed_of" : [ ],
        "priority" : 500,
        "version" : 1,
        "data_stream" : {
          "hidden" : false
        }
      }
    }
  ]
}
Logstash pipeline config file -
input {
  beats {
    port => 5044
  }
}
filter {
  if [log_type] == "access_server" and [app_id] == "pa"
  {
    grok {
      match => {
        "message" => "%{YEAR}-%{MONTHNUM}-%{MONTHDAY}[T ]%{HOUR}:%{MINUTE}(?::?%{SECOND})\| %{USERNAME:exchangeId}\| %{DATA:trackingId}\| %{NUMBER:RoundTrip:int}%{SPACE}ms\| %{NUMBER:ProxyRoundTrip:int}%{SPACE}ms\| %{NUMBER:UserInfoRoundTrip:int}%{SPACE}ms\| %{DATA:Resource}\| %{DATA:subject}\| %{DATA:authmech}\| %{DATA:scopes}\| %{IPV4:Client}\| %{WORD:method}\| %{DATA:Request_URI}\| %{INT:response_code}\| %{DATA:failedRuleType}\| %{DATA:failedRuleName}\| %{DATA:APP_Name}\| %{DATA:Resource_Name}\| %{DATA:Path_Prefix}"
      }
    }
    mutate {
      replace => {
        "[type]" => "access_server"
      }
    }
  }
}
output {
  if [log_type] == "access_server" {
    elasticsearch {
      hosts => ['http://10.10.10.76:9200']
      user => elastic
      password => xxx
      data_stream => "true"
      data_stream_type => "logs"
      data_stream_dataset => "access"
      data_stream_namespace => "default"
      ilm_rollover_alias => "access"
      ilm_pattern => "000001"
      ilm_policy => "testpolicy"
      template => "/tmp/access_template"
      template_name => "access_template"
    }
  }
  elasticsearch {
    hosts => ['http://10.10.10.76:9200']
    index => "%{[@metadata][beat]}-%{[@metadata][version]}-%{+YYYY.MM.dd}"
    user => elastic
    password => xxx
  }
}
After the deployment is done, I can only see system indices; the data stream is not created.
[2021-08-28T12:42:50,103][ERROR][logstash.outputs.elasticsearch][main] Invalid data stream configuration, following parameters are not supported: {"template"=>"/tmp/pingaccess_template", "ilm_pattern"=>"000001", "template_name"=>"pingaccess_template", "ilm_rollover_alias"=>"pingaccess", "ilm_policy"=>"testpolicy"}
[2021-08-28T12:42:50,547][ERROR][logstash.javapipeline ][main] Pipeline error {:pipeline_id=>"main", :exception=>#<LogStash::ConfigurationError: Invalid data stream configuration: ["template", "ilm_pattern", "template_name", "ilm_rollover_alias", "ilm_policy"]>, :backtrace=>["/usr/share/logstash/vendor/bundle/jruby/2.5.0/gems/logstash-output-elasticsearch-11.0.2-java/lib/logstash/outputs/elasticsearch/data_stream_support.rb:57:in `check_data_stream_config!'"
[2021-08-28T12:42:50,702][ERROR][logstash.agent ] Failed to execute action {:id=>:main, :action_type=>LogStash::ConvergeResult::FailedAction, :message=>"Could not execute action: PipelineAction::Create<main>, action_result: false", :backtrace=>nil}
The error says that parameters like "template"=>"/tmp/pingaccess_template", "ilm_pattern"=>"000001", "template_name"=>"pingaccess_template", "ilm_rollover_alias"=>"pingaccess", "ilm_policy"=>"testpolicy" are not valid, but they are mentioned in the link below:
https://www.elastic.co/guide/en/logstash/current/plugins-outputs-elasticsearch.html#plugins-outputs-elasticsearch-data-streams
The solution is to use Logstash without it being "aware" of the data_stream.
FIRST of all (before running Logstash), create your ILM policy and index template, BUT add "index.lifecycle.name" to the settings. That way, you are linking the template and the ILM policy. Also, don't forget the data_stream section in the index template (a sketch of the corresponding creation request follows the template below).
{
  "index_templates" : [
    {
      "name" : "access_template",
      "index_template" : {
        "index_patterns" : [
          "test-data-stream*"
        ],
        "template" : {
          "settings" : {
            "index" : {
              "number_of_shards" : "1",
              "number_of_replicas" : "0",
              "index.lifecycle.name": "testpolicy"
            }
          },
          "mappings" : {
            ...
          }
        },
        "composed_of" : [ ],
        "priority" : 500,
        "version" : 1,
        "data_stream" : {
          "hidden" : false
        }
      }
    }
  ]
}
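The JSON above is what GET _index_template returns; for reference, a minimal sketch of the corresponding creation request (names and values taken from this answer, with the mappings body left out just like above):
PUT _index_template/access_template
{
  "index_patterns": ["test-data-stream*"],
  "data_stream": { "hidden": false },
  "priority": 500,
  "template": {
    "settings": {
      "number_of_shards": 1,
      "number_of_replicas": 0,
      "index.lifecycle.name": "testpolicy"
    }
  }
}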
Keep the Logstash output as if the data stream didn't exist, but add action => "create". This is because you can't use the "index" API with data streams; they need the _create API call.
output {
  elasticsearch {
    hosts => ['http://10.10.10.76:9200']
    index => "test-data-stream"
    user => elastic
    password => xxx
    action => "create"
  }
}
That way, Logstash will output to ES, and the index template will be applied automatically (because of the pattern match), along with the ILM policy and the data_stream.
Important: To make it work, you need to start from scratch. If the index "test-data-stream" already exists in ES (as a traditional index), then the data_stream will NOT be created. Test with another index name to make sure it works.
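For example, once events start flowing you can check that a data stream (rather than a plain index) was created, using the host and credentials from the question:
curl -u elastic:xxx "http://10.10.10.76:9200/_data_stream/test-data-stream?pretty"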
The documentation is unclear, but the plugin does not support those options when data stream output is enabled. The plugin is logging the options returned by the invalid_data_stream_params function, which allows action, routing, data_stream, anything else that starts with data_stream_, the shared options defined by the mixin, and the common options defined by the output plugin base.
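In other words, if you want to keep data_stream => "true", the ILM and template options have to be removed and the index template and ILM policy managed outside Logstash (as in the approach above). A minimal sketch of an output that stays within the supported options, reusing the values from the question:
output {
  elasticsearch {
    hosts => ['http://10.10.10.76:9200']
    user => elastic
    password => xxx
    data_stream => "true"
    data_stream_type => "logs"
    data_stream_dataset => "access"
    data_stream_namespace => "default"
  }
}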

Mongoose + GraphQL (Apollo Server) Schema

We have a DB collection that is a little complicated. Many of our keys are JSON objects whose fields aren't fixed and change based on the input given by the user in the UI. How should we write the Mongoose and GraphQL schemas for such a complex type?
{
  "_id" : ObjectId("5ababb359b3f180012762684"),
  "item_type" : "Sample",
  "title" : "This is sample title",
  "sub_title" : "Sample sub title",
  "byline" : "5c6ed39d6ed6def938b71562",
  "lede" : "Sample description",
  "promoted" : "",
  "slug" : [
    "myurl"
  ],
  "categories" : [
    "Technology"
  ],
  "components" : [
    {
      "type" : "Slide",
      "props" : {
        "description" : {
          "type" : "",
          "props" : {
            "value" : "Sample value"
          }
        },
        "subHeader" : {
          "type" : "",
          "props" : {
            "value" : ""
          }
        },
        "ButtonWorld" : {
          "type" : "a-button",
          "props" : {
            "buttonType" : "product",
            "urlType" : "Internal Link",
            "isServices" : false,
            "title" : "Hello World",
            "authors" : [
              {
                "__dataID__" : "Qm9va0F1dGhvcjo1YWJhYjI0YjllNDIxNDAwMTAxMGNkZmY=",
                "_id" : null,
                "First_Name" : "John",
                "Last_Name" : "Doe",
                "Display_Name" : "John Doe",
                "Slug" : "john-doe",
                "Role" : 1
              }
            ],
            "isbns" : [
              "9781497603424"
            ],
            "image" : "978-cover.jpg",
            "price" : "8.99",
            "bisacs" : [],
            "customCategories" : []
          }
        },
        "salePrice" : {
          "type" : "",
          "props" : {
            "value" : ""
          }
        }
      }
    }
  ],
  "tags" : [
    {
      "id" : "5abab58042e2c90011875801",
      "name" : "Tag Test 1"
    },
    {
      "id" : "5abab5831242260011c248f9",
      "name" : "Tag Test 2"
    },
    {
      "id" : "592450e0b1be5055278eb5c6",
      "name" : "horror"
    },
    {
      "id" : "59244a96b1be5055278e9b0e",
      "name" : "Special Report",
      "_id" : "59244a96b1be5055278e9b0e"
    }
  ],
  "created_at" : ISODate("2018-03-27T21:44:21.412Z"),
  "created_by" : ObjectId("591345bda429e90011c1797e")
}
I believe Mongoose has a Mixed type, but how do I represent such a complex type in the Apollo GraphQL server and the Mongoose schema? Also, my resolver is currently just models.product.find(), so with such a complex type I need to understand what update I need to make to my resolver.
It would be great to get a complete solution covering the GraphQL Apollo schema, the Mongoose schema, and the resolver for my data.
I finally found a solution to the problem.
You can declare a new type and reference it in the typeDefs for the GraphQL schema.
In the Mongoose model, you can declare the field as {type: Array}. A rough sketch of what that can look like is below.
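A minimal sketch of that idea (the type and field names are illustrative, loosely based on the document in the question, not a verified schema for it):
const { gql } = require('apollo-server');
const mongoose = require('mongoose');

// GraphQL side: declare dedicated types for the nested pieces you care about
// and reference them from the parent type.
const typeDefs = gql`
  type Tag {
    id: String
    name: String
  }

  type Product {
    _id: ID
    item_type: String
    title: String
    tags: [Tag]
  }

  type Query {
    products: [Product]
  }
`;

// Mongoose side: free-form, user-driven structures can be left loosely typed.
const productSchema = new mongoose.Schema({
  item_type: String,
  title: String,
  components: { type: Array },   // shape varies per document
  tags: { type: Array }          // or Schema.Types.Mixed
});
const Product = mongoose.model('product', productSchema);

// The resolver can stay a plain find(), as in the question.
const resolvers = {
  Query: {
    products: () => Product.find()
  }
};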

Number of records processed in logstash

We're using Logstash to sync Elasticsearch and we have around 3 million documents. It takes 3 to 4 hours to sync. Currently all we can see is that it has started and stopped. Is there any way to see how many records have been processed in Logstash?
If you're using Logstash 5 or higher, the Logstash Monitoring API can help you. You can see and monitor what's happening inside Logstash as it processes events. If you hit the Pipeline stats API, you'll get the total number of processed events per stage and plugin (input/filter/output):
curl -XGET 'localhost:9600/_node/stats/pipelines?pretty'
You'll get this type of response, in which you can clearly see at any time how many events have been processed (a simple way to poll this repeatedly is sketched after the response):
{
  "pipelines" : {
    "test" : {
      "events" : {
        "duration_in_millis" : 365495,
        "in" : 216485,
        "filtered" : 216485,
        "out" : 216485,
        "queue_push_duration_in_millis" : 342466
      },
      "plugins" : {
        "inputs" : [ {
          "id" : "35131f351e2dc5ed13ee04265a8a5a1f95292165-1",
          "events" : {
            "out" : 216485,
            "queue_push_duration_in_millis" : 342466
          },
          "name" : "beats"
        } ],
        "filters" : [ {
          "id" : "35131f351e2dc5ed13ee04265a8a5a1f95292165-2",
          "events" : {
            "duration_in_millis" : 55969,
            "in" : 216485,
            "out" : 216485
          },
          "failures" : 216485,
          "patterns_per_field" : {
            "message" : 1
          },
          "name" : "grok"
        }, {
          "id" : "35131f351e2dc5ed13ee04265a8a5a1f95292165-3",
          "events" : {
            "duration_in_millis" : 3326,
            "in" : 216485,
            "out" : 216485
          },
          "name" : "geoip"
        } ],
        "outputs" : [ {
          "id" : "35131f351e2dc5ed13ee04265a8a5a1f95292165-4",
          "events" : {
            "duration_in_millis" : 278557,
            "in" : 216485,
            "out" : 216485
          },
          "name" : "elasticsearch"
        } ]
      },
      "reloads" : {
        "last_error" : null,
        "successes" : 0,
        "last_success_timestamp" : null,
        "last_failure_timestamp" : null,
        "failures" : 0
      },
      "queue" : {
        "type" : "memory"
      }
    }
  }
}
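If you want to watch the count grow during a long sync, you can poll that endpoint periodically; a small sketch (assuming jq is installed and the pipeline is named test, as in the sample response above):
watch -n 30 'curl -s "localhost:9600/_node/stats/pipelines/test" | jq ".pipelines.test.events"'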

Partial update overwriting whole structure

I'm indexing a new document with the following content
{
  "lastUpdate" : "20180114144020452",
  "name" : "My Process",
  "startDate" : "20180114162356585",
  "endData" : "",
  "tasks" : [
    {
      "1" : {
        "lastUpdate" : "20180114144020452",
        "taskId" : "123",
        "subject" : "Terceira Atividade",
        "status" : "Active",
        "type" : "userTask",
        "assign" : [
          {
            "date" : "20180114144020452",
            "type" : "role",
            "name" : "Time 3",
            "id" : "Team3_345"
          }
        ],
        "receivedDate" : "",
        "readDate" : "",
        "finishDate" : ""
      }
    }
  ]
}
And then I'm trying to change the task.1.status value with the following update content:
{
  "doc" : {
    "tasks" : [
      {
        "1" : {
          "status" : "Closed"
        }
      }
    ]
  }
}
But it's overwriting the whole task.1 structure, deleting the other values and leaving only the status value set to Closed, instead of keeping the other values and changing only the status.
How can I solve this? Thanks
You need to do it via a scripted partial update, like this (a variation that passes the new value as a script parameter is sketched after the request):
POST updates/update/1/_update
{
  "script": {
    "source": "ctx._source.tasks[0]['1'].status = 'Closed'"
  }
}
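A slight variation on the same idea, passing the new value as a script parameter instead of hardcoding it (same index, type and id as above):
POST updates/update/1/_update
{
  "script": {
    "source": "ctx._source.tasks[0]['1'].status = params.newStatus",
    "params": {
      "newStatus": "Closed"
    }
  }
}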

ElasticSearch - Statistical facets on length of the list

I have the following sample mapping:
{
  "book" : {
    "properties" : {
      "author" : { "type" : "string" },
      "title" : { "type" : "string" },
      "reviews" : {
        "properties" : {
          "url" : { "type" : "string" },
          "score" : { "type" : "integer" }
        }
      },
      "chapters" : {
        "include_in_root" : 1,
        "type" : "nested",
        "properties" : {
          "name" : { "type" : "string" }
        }
      }
    }
  }
}
I would like to get a facet on the number of reviews, i.e. the length of the "reviews" array.
For instance, in words, the results I need are: "100 documents with 10 reviews, 20 documents with 5 reviews, ..."
I'm trying the following statistical facet:
{
  "query" : {
    "match_all" : {}
  },
  "facets" : {
    "stat1" : {
      "statistical" : {"script" : "doc['reviews.score'].values.size()"}
    }
  }
}
but it keeps failing with:
{
"error" : "SearchPhaseExecutionException[Failed to execute phase [query_fetch], total failure; shardFailures {[mDsNfjLhRIyPObaOcxQo2w][facettest][0]: QueryPhaseExecutionException[[facettest][0]: query[ConstantScore(NotDeleted(cache(org.elasticsearch.index.search.nested.NonNestedDocsFilter#a2a5984b)))],from[0],size[10]: Query Failed [Failed to execute main query]]; nested: PropertyAccessException[[Error: could not access: reviews; in class: org.elasticsearch.search.lookup.DocLookup]
[Near : {... doc[reviews.score].values.size() ....}]
^
[Line: 1, Column: 5]]; }]",
"status" : 500
}
How can I achieve my goal?
ElasticSearch version is 0.19.9.
Here is my sample data:
{
  "author" : "Mark Twain",
  "title" : "The Adventures of Tom Sawyer",
  "reviews" : [
    {
      "url" : "amazon.com",
      "score" : 10
    },
    {
      "url" : "www.barnesandnoble.com",
      "score" : 9
    }
  ],
  "chapters" : [
    { "name" : "Chapter 1" }, { "name" : "Chapter 2" }
  ]
}
{
  "author" : "Jack London",
  "title" : "The Call of the Wild",
  "reviews" : [
    {
      "url" : "amazon.com",
      "score" : 8
    },
    {
      "url" : "www.barnesandnoble.com",
      "score" : 9
    },
    {
      "url" : "www.books.com",
      "score" : 5
    }
  ],
  "chapters" : [
    { "name" : "Chapter 1" }, { "name" : "Chapter 2" }
  ]
}
It looks like you are using curl to execute your query, and the curl statement looks like this:
curl localhost:9200/my-index/book -d '{....}'
The problem here is that because you are using apostrophes to wrap the body of the request, you need to escape all apostrophes it contains. So your script should become:
{"script" : "doc['\''reviews.score'\''].values.size()"}
or
{"script" : "doc[\"reviews.score"].values.size()"}
The second issue is that from your description it looks like you are looking for a histogram facet or a range facet rather than a statistical facet. So, I would suggest trying something like this:
curl "localhost:9200/test-idx/book/_search?search_type=count&pretty" -d '{
"query" : {
"match_all" : {}
},
"facets" : {
"histo1" : {
"histogram" : {
"key_script" : "doc[\"reviews.score\"].values.size()",
"value_script" : "doc[\"reviews.score\"].values.size()",
"interval" : 1
}
}
}
}'
The third problem is that the script in the facet will be called for every single record in the result list, and if you have a lot of results it might take a really long time. So, I would suggest indexing an additional field called number_of_reviews, populated with the number of reviews by your client (an example of indexing such a field is sketched after the query below). Then your query would simply become:
curl "localhost:9200/test-idx/book/_search?search_type=count&pretty" -d '{
"query" : {
"match_all" : {}
},
"facets" : {
"histo1" : {
"histogram" : {
"field" : "number_of_reviews"
"interval" : 1
}
}
}
}'
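For completeness, a sketch of what indexing a document with that precomputed field could look like (the index and type names follow the answer's examples; the field value is simply the length of the reviews array, computed by the client):
curl -XPUT "localhost:9200/test-idx/book/1" -d '{
  "author" : "Mark Twain",
  "title" : "The Adventures of Tom Sawyer",
  "number_of_reviews" : 2,
  "reviews" : [
    { "url" : "amazon.com", "score" : 10 },
    { "url" : "www.barnesandnoble.com", "score" : 9 }
  ]
}'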
