Suppose I have one JSON log that outputs the following information:
{"timestamp":"someText","alert":"someMoreText","Level":someInt}
And I have another JSON log that outputs the same kind of information but with different labels:
{"ts":"someText","alert":"someMoreText","Level":someInt}
The difference is that "timestamp" and "ts" have different names but carry the same information. How would I reference either of the alternate names with a single JSONPath call, if such a technique is possible?
So for example, if I wanted to reference the timestamp of both logs, I would want to use something like $.[timestamp|ts]
Using the new record processors, you might be able to do something like...
1. Define a schema that has both 'timestamp' and 'ts'
2. Send all the records with 'ts' to an UpdateRecord processor
3. Set the UpdateRecord processor to make /timestamp = /ts
4. Define another version of the schema that doesn't have 'ts'
5. Use a ConvertRecord processor with a writer that uses the second schema
That last step would rewrite the records without the 'ts' field (see the schema sketch below).
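For illustration, here is a minimal sketch of the two schemas this approach assumes, based on the example logs above (the record name and the nullable types are my guesses, not something from your flow):
Schema for the reader and UpdateRecord (has both fields):
{
"type":"record","name":"log",
"fields":[
{"name":"timestamp","type":["null","string"],"default":null},
{"name":"ts","type":["null","string"],"default":null},
{"name":"alert","type":"string"},
{"name":"Level","type":"int"}
]
}
Schema for the ConvertRecord writer (no 'ts'):
{
"type":"record","name":"log",
"fields":[
{"name":"timestamp","type":"string"},
{"name":"alert","type":"string"},
{"name":"Level","type":"int"}
]
}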
Alternatively, you could try defining a schema with a 'timestamp' field and an alias of 'ts', which should let any of the record processors access either field by using 'timestamp'. Whether this works for you depends on what you are doing in the rest of your flow.
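As a sketch of the alias idea, an Avro-style schema where 'ts' is declared as an alias of 'timestamp' might look like this (types guessed from the example logs):
{
"type":"record","name":"log",
"fields":[
{"name":"timestamp","type":"string","aliases":["ts"]},
{"name":"alert","type":"string"},
{"name":"Level","type":"int"}
]
}
With a reader schema like this, records arriving with either "timestamp" or "ts" should be addressable as /timestamp in the record processors.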
My log POCO has several fixed properties, such as user id and timestamp, plus a flexible data bag property, which is a JSON representation of any extra information I'd like to add to the log. This means the property names within this data bag could be anything, which brings me two questions:
How can I configure the mapping so that the data bag property, which is of type string, would be mapped to a JSON object during the indexing, instead of being treated as a normal string?
With the data bag object having arbitrary property names, the overall document type could end up with a huge number of properties. Would this hurt search performance?
For the translation from string to JSON you can use an ingest pipeline with the JSON processor:
https://www.elastic.co/guide/en/elasticsearch/reference/master/json-processor.html
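A minimal sketch of such a pipeline, assuming the data bag property is called data_bag (the pipeline and field names are placeholders):
PUT _ingest/pipeline/parse-data-bag
{
"description": "sketch: parse the string field 'data_bag' (hypothetical name) into a JSON object",
"processors": [
{
"json": {
"field": "data_bag",
"target_field": "data_bag"
}
}
]
}
You can then index with ?pipeline=parse-data-bag or set index.default_pipeline on the index so it is applied automatically.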
It depends on your queries. If you use free-text search across all fields, then yes, a huge number of fields will slow the query. If you use queries like "field":"value", then no, the number of fields is not a problem for searches. You can find additional information about query optimization here:
https://www.elastic.co/guide/en/elasticsearch/reference/7.15/tune-for-search-speed.html#search-as-few-fields-as-possible
And the question is: what do you mean by "huge number"? 1,000? 10,000? 100,000? As part of the optimization I recommend using dynamic templates with a rule that each string field is automatically indexed as "keyword" only, rather than text + keyword. This setting cuts the number of string fields in half.
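A sketch of such a dynamic template (the index name is a placeholder):
PUT my-index
{
"mappings": {
"dynamic_templates": [
{
"strings_as_keyword": {
"match_mapping_type": "string",
"mapping": {
"type": "keyword"
}
}
}
]
}
}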
I’m trying to tag my data according to a lookup table.
The lookup table has these fields:
• Key - represents the field name in the data I want to tag.
In the real data the field is a subfield of the “Headers” field.
An example for the “Key” field:
“*Server*” (* is a wildcard)
• Value - represents the wanted value of the field mentioned above.
The value in the lookup table is only a part of a string in the real data value.
An example for the “Value” field:
“Avtech”.
• Vendor - the value I want to add to the real data if a combination of field and value is found in a document.
An example of a combination in the real data:
“Headers.Server : Linux/2.x UPnP/1.0 Avtech/1.0”
A match for that document in the lookup table will be:
Key= Server (with wildcard on both sides).
Value= Avtech (with wildcard on both sides)
Vendor= Avtech
So basically I’ll need to add a field to that document with the value “Avtech”.
The subfields in “Headers” are dynamic fields that change from document to document.
If a match is not found, I’ll need to add the tag field with the value “Unknown”.
I’ve tried to use the enrich processor, using the lookup table as the source data, with “Value” as the match field and “Vendor” as the enrich field.
In the enrich processor I didn’t know how to refer to the field, since it’s dynamic, and I wanted to check whether the value appears anywhere in the “Headers” subfields.
Also, I don’t think there will be a match between the “Value” in the lookup table and the value of the Headers subfield, since the “Value” field in the lookup table is a substring with wildcards on both sides.
I could use some help to accomplish what I’m trying to do, and with how to search with wildcards inside an enrich processor.
Or if you have another idea besides the enrich processor, such as the parent-child and terms lookup mechanism.
Thanks!
Adi.
There are two ways to accomplish this:
1. Using the combination of Logstash & Elasticsearch
2. Using only the Elasticsearch ingest node
Constraint: you need to know the position of the Vendor term occurring in the Header field.
Approach 1
If so, then you can use the grok filter to extract the term and, based on the term found, do a lookup to get the corresponding value.
Reference
https://www.elastic.co/guide/en/logstash/current/plugins-filters-grok.html
https://www.elastic.co/guide/en/logstash/current/plugins-filters-kv.html
https://www.elastic.co/guide/en/logstash/current/plugins-filters-jdbc_static.html
https://www.elastic.co/guide/en/logstash/current/plugins-filters-jdbc_streaming.html
Approach 2
Create an index consisting of key-value pairs. In the ingest node, create a pipeline which consists of a grok processor followed by an enrich processor. The grok processor would work the same way as mentioned in Approach 1, and you seem to have already got the enrich part working.
Reference
https://www.elastic.co/guide/en/elasticsearch/reference/current/grok-processor.html
If you are able to isolate the subfield within the Headers field where the term of interest is present, it would make things easier for you.
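To make Approach 2 concrete, here is a rough sketch of what the ingest pipeline could look like. It assumes the value sits under Headers.Server, that the vendor term is always the third product token (per the constraint above), and that an enrich policy named vendor-policy matching on the extracted term has already been executed; all of these names are placeholders for your own:
PUT _ingest/pipeline/tag-vendor
{
"description": "sketch: grok the vendor term out of Headers.Server, then enrich it against the lookup index",
"processors": [
{
"grok": {
"field": "Headers.Server",
"patterns": ["%{WORD}/%{NOTSPACE} %{WORD}/%{NOTSPACE} %{WORD:vendor_term}/%{NOTSPACE}"],
"ignore_failure": true
}
},
{
"enrich": {
"policy_name": "vendor-policy",
"field": "vendor_term",
"target_field": "vendor",
"ignore_missing": true
}
},
{
"set": {
"if": "ctx.vendor == null",
"field": "vendor",
"value": "Unknown"
}
}
]
}
With the sample header "Linux/2.x UPnP/1.0 Avtech/1.0", the grok pattern captures vendor_term = Avtech, the enrich processor copies the matching lookup document (including its Vendor field) into vendor, and the set processor fills in "Unknown" when nothing matched.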
I have been told that the only way for Hive to be able to process the addition of new fields to an Avro schema is if the new fields are added at the end of the existing fields. Currently our Avro generation is alphabetical, so a new field could show up elsewhere in the field list.
So, can Hive handle this or not? I know next to nothing about Hive, and while I can find good explanations of how to add new fields from Avro, I can't seem to find any information on whether the location of the added field affects Hive's ability to process them.
As an example, see below. How could the new schema be processed into Hive?
Original Schema
{
"type":"record","name":"user",
"fields":[
{"name":"bday","type":"string"},
{"name":"id","type":"long"},
{"name":"name","type":"string"}
]
}
New Schema (Added field in alphabetical order)
{
"type":"record","name":"user",
"fields":[
{"name":"bday","type":"string"},
{"name":"id","type":"long"},
{"name":"gender","type":"string"},
{"name":"name","type":"string"}
]
}
Yes, Hive can handle this, because that's the way Avro schema resolution works:
if both are records:
the ordering of fields may be different: fields are matched by name
That's possible because every Avro file also includes the schema that was used to write the data, the writer's schema.
So, when you change the schema in Hive (e.g. by modifying the file behind avro.schema.url), you change the reader's schema. But all existing files and their writer's schemas remain untouched.
And yes, for all new fields added you have to provide a default value (using "default":...) regardless of field ordering. Otherwise, the reader (Hive) won't be able to parse files written with the original schema.
It is supported. You have to take care to add a default value for the new fields so that data written with the older schema can still be read.
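As a sketch, the new schema from the question with a default added for the new field (a nullable union with a null default is just one option; any default compatible with the field type works):
{
"type":"record","name":"user",
"fields":[
{"name":"bday","type":"string"},
{"name":"id","type":"long"},
{"name":"gender","type":["null","string"],"default":null},
{"name":"name","type":"string"}
]
}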
I have a use case which is a bit similar to the ES example of dynamic_template where I want certain strings to be analyzed and certain not.
My document fields don't have such a convention and the decision is made based on an external schema. So currently my flow is:
I grab the inputs document from the DB
I grab the appropriate schema (same database, currently using logstash for import)
I adjust the name in the document accordingly (using logstash's ruby mutator):
if not analyzed I don't change the name
if analyzed I change it to ORIGINALNAME_analyzed
This handles the analyzed/not_analyzed problem thanks to the dynamic_template I set, but now the user doesn't know which fields are analyzed, so there's no easy way for him to write queries because he doesn't know the name of the field.
I wanted to use field name aliases but apparently ES doesn't support them. Are there any other mechanisms I'm missing I could use here like field rename after indexation or something else?
For example, this ancient thread mentions that field.sub.name can be queried as just name, but I'm guessing this changed when they disallowed . in field names some time ago, since I cannot get it to work?
Let the user only create queries with the original name. I believe you have some code that converts this user query to Elasticsearch query. When converting to Elasticsearch query, instead of using the field name provided by the user alone use both the field names ORIGINALNAME as well as ORIGINALNAME_analyzed. If you are using a match query, convert it to multi_match. If you are using a term query, convert it to a bool should query. I guess you get where I am going with this.
Elasticsearch won't mind if a field does not exist. This could be a problem if a field already has _analyzed appended in its original name, but with some tricks that can be fixed too.
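For example, a user query against a hypothetical field called title could be expanded into something like (index and field names are placeholders):
GET my-index/_search
{
"query": {
"multi_match": {
"query": "the user's search text",
"fields": ["title", "title_analyzed"]
}
}
}
The same idea applies to term queries, wrapped in a bool should across the two field names.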
What's the difference between _source and fields?
According to the documentation, they are both used to list the fields we want to get back from the index.
The fields option is best used for fields that are stored. When a field is not stored, it behaves similarly to _source.
So if all the fields you want in the result are stored, it would be faster to filter using "fields" instead of _source.
Also, "fields" can be used to get metadata fields if they are stored.
However, one of the limitations of "fields" is that it can only be used to fetch leaf fields, i.e. it cannot be used on nested fields/objects.
The following article I found provides a good explanation.
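As a quick sketch of the two request shapes (index and field names are placeholders; in recent Elasticsearch versions the stored-fields parameter is called stored_fields rather than fields):
GET my-index/_search
{
"_source": ["user_id", "timestamp"],
"query": { "match_all": {} }
}
GET my-index/_search
{
"stored_fields": ["user_id", "timestamp"],
"query": { "match_all": {} }
}
The first filters the returned _source; the second returns only values from fields that were mapped with "store": true.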