Trying to use NiFi to route on an attribute.
I am attempting to take a JSON file in which two of the JSON records contain the following attributes (the file also contains other JSON documents with different attributes):
{
"ts" : "2020-010-07T12:00:00.448392Z",
"uid" : "CHh3F30dkfueLhnxSk",
"id.orig_h" : "10.10.10.10",
"id.orig_p" : 19726,
"id.resp_h" : "172.10.10.20",
"id.resp_p" : 443,
"proto" : "tcp",
"conn_state" : "SH",
"local_orig" : false,
"local_resp" : false,
"missed_bytes" : 0,
"history" : "F",
"orig_pkts" : 1,
"orig_ip_bytes" : 52,
"resp_pkts" : 0,
"resp_ip_bytes" : 0}
{
"ts" : "2020-10-10T12:00:00.461880Z",
"uid" : "CdoiLnRscrxO1BSYb",
"id.orig_h" : "10.10.17.777",
"id.orig_p" : 40433,
"id.resp_h" : "172.10.10.77",
"id.resp_p" : 443,
"version" : "TLSv12",
"cipher" : "TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384",
"curve" : "secp777r1",
"server_name" : "connect-stackoverflow.questions.com",
"resumed" : false,
"established" : true,
"cert_chain_fuids" : [ "FR84qjkl2342SZLwV7", "Ffweqiof48b8j" ],
"client_cert_chain_fuids" : [ ],
"subject" : "CN=connect-ulastm.bentley.com",
"issuer" : "CN=Let's Encrypt Authority X3,O=Let's Encrypt,C=US",
"validation_status" : "ok"}
I want to route specifically on the $.conn_state attribute, but it is not working. I have tried to match the expression with the EvaluateJsonPath processor and pass the result to RouteOnAttribute. The EvaluateJsonPath processor does not match the JSON or forward the document to the RouteOnAttribute processor that follows it.
I have also attempted to use RouteOnAttribute directly on my JSON record, but it does not appear to pull out or identify the attribute for routing...
How would I do this?
Your JSON sample contains more than one object, and EvaluateJsonPath expects a single record when matching $.conn_state. To work with multiple records in JSON (or any data stream), you should use QueryRecord. Once it is configured with a record reader and a record writer, click + in that processor to create a route; its value is a SELECT statement where conn_state is not null (see the sketch below). You can then drag that route to the next processor.
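As a rough sketch (the route name conn_state_records and the JsonTreeReader/JsonRecordSetWriter services are example choices, not taken from your flow), the dynamic property you add with + on QueryRecord could be:
conn_state_records = SELECT * FROM FLOWFILE WHERE conn_state IS NOT NULL
Records that contain a conn_state field (like your first sample) should be routed to the conn_state_records relationship, while the TLS record without that field should not match.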
Related
In NiFi, the PutHDFS processor is used to ingest data into an HDFS directory. There are 100+ possible file types (all in JSON format). The JSON starts with the file type, and this file type should become the file name prefix. How can this be achieved? Please advise.
{
"FILE_TYPE_1": [
{
"ORG_FIELD_1" : 38,
"ORG_FIELD_2" : 1,
"ORG_FIELD_3" : "Per Km",
"ORG_FIELD_4" : "x1",
"ORG_FIELD_5" : 1,
"ORG_FIELD_6" : 10.0,
"ORG_FIELD_7" : 0.0,
"ORG_FIELD_8" : 0.0,
"ORG_FIELD_9" : 96.0,
"ORG_FIELD_10" : 0
}
]
}
You can use EvaluateJsonPath to extract the file-type prefix into an attribute, then use UpdateAttribute to change the name, as sketched below.
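A minimal sketch of those two processors, where the attribute name filetype and the JsonPath are placeholders you would adapt to your data (whether the key name itself can be extracted this way depends on the JsonPath support in your NiFi version; a small script may be needed instead):
EvaluateJsonPath (Destination: flowfile-attribute)
filetype = $.keys()
UpdateAttribute
filename = ${filetype}_${filename}
PutHDFS will then name the file in HDFS using the updated filename attribute.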
Scenario:
I have an index with a bunch of multi-tenant data in Elasticsearch 6.x. This data is frequently deleted (via _delete_by_query) and populated by the tenants.
When issuing a _delete_by_query request with wait_for_completion=false, supplying a query JSON to delete a tenant's data, I am able to see generic task information via the _tasks API. The problem is that, with a large number of tenants, it is not immediately clear who is deleting data at any given time.
My question is this:
Is there a way I can view the query that a _delete_by_query task is operating on? Or can I attach an additional parameter to the URL that is cached in the task to differentiate them?
Side note: looking at the docs (https://www.elastic.co/guide/en/elasticsearch/reference/6.6/tasks.html) I see there is a description field in the _tasks API response that has the query as a String; however, I do not see that level of detail in my description field:
"description" : "delete-by-query [myindex]"
Thanks in advance
One way to identify queries is to add the X-Opaque-Id HTTP header to your queries:
For instance, when deleting all tenant data for (e.g.) User 3, you can issue the following command:
curl -XPOST -H 'X-Opaque-Id: 3' -H 'Content-type: application/json' http://localhost:9200/my-index/_delete_by_query?wait_for_completion=false -d '{"query":{"term":{"user": 3}}}'
You then get a task ID, and when checking the related task document, you'll be able to identify which task is/was deleting which tenant data thanks to the headers section which contains your HTTP header:
"_source" : {
"completed" : true,
"task" : {
"node" : "DB0GKYZrTt6wuo7d8B8p_w",
"id" : 20314843,
"type" : "transport",
"action" : "indices:data/write/delete/byquery",
"status" : {
"total" : 3,
"updated" : 0,
"created" : 0,
"deleted" : 3,
"batches" : 1,
"version_conflicts" : 0,
"noops" : 0,
"retries" : {
"bulk" : 0,
"search" : 0
},
"throttled_millis" : 0,
"requests_per_second" : -1.0,
"throttled_until_millis" : 0
},
"description" : "delete-by-query [deletes]",
"start_time_in_millis" : 1570075424296,
"running_time_in_nanos" : 4020566,
"cancellable" : true,
"headers" : {
"X-Opaque-Id" : "3" <--- user 3
}
},
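To look up that task again later, you can query the tasks API with the node and task id from the response above (a sketch using the sample values; substitute your own):
curl http://localhost:9200/_tasks/DB0GKYZrTt6wuo7d8B8p_w:20314843
or list all currently running delete-by-query tasks, including their headers and descriptions:
curl 'http://localhost:9200/_tasks?detailed=true&actions=*/delete/byquery'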
Context:
1) We are building a CDC pipeline (using Kafka and the Connect framework)
2) We are using Debezium to capture MySQL transaction logs
3) We are using the Elasticsearch connector to add documents to an ES index
Sample change event generated by Debezium:
{
"source" : {
"before" : {
"Id" : 97,
"name" : "Northland",
"code" : "NTL",
"country_id" : 6,
"is_business_mapped" : 0
},
"after" : {
"Id" : 97,
"name" : "Northland",
"code" : "NTL",
"country_id" : 6,
"is_business_mapped" : 1
},
"source" : {
"version" : "0.7.5",
"name" : "__",
"server_id" : 252639387,
"ts_sec" : 1547805940,
"gtid" : null,
"file" : "mysql-bin-changelog.000570",
"pos" : 236,
"row" : 0,
"snapshot" : false,
"thread" : 614,
"db" : "bazaarify",
"table" : "state"
},
"op" : "u",
"ts_ms" : 1547805939683
}
}
What we want:
We want to visualize only 3 columns in Kibana:
1) before - containing the nested JSON as a string
2) after - containing the nested JSON as a string
3) source - containing the nested JSON as a string
I can think of two possibilities here:
a) converting the nested JSON to a string, or
b) combining the column data in Elasticsearch.
I am a newbie to Elasticsearch. Can someone please guide me on how to do this?
I tried defining a custom mapping as well, but it gives me an exception.
You can always view your document as raw JSON in Kibana.
You don't need to manipulate it before indexing into Elasticsearch.
As this is related to visualization, handle it in Kibana only.
Check this link for a screenshot.
Refer to this to add the columns you want to see in the results.
I don't fully understand your use case, but if you would like to turn some JSON objects into their string representations, you can use Logstash for that, or even Elasticsearch's ingest capabilities to convert an object (JSON) to a string.
From the link above, an example:
PUT _ingest/pipeline/my-pipeline-id
{
"description" : "converts the content of the source field to a string",
"processors" : [
{
"convert" : {
"field" : "source",
"type" : "string"
}
}
]
}
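To actually apply the pipeline, reference it at indexing time (a sketch with placeholder index and id; in recent versions you can also set it as the index's index.default_pipeline setting so every document goes through it):
PUT my-index/_doc/1?pipeline=my-pipeline-id
{
"source" : { "db" : "bazaarify", "table" : "state" }
}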
As per the JDBC importer documentation:
It is recommended to use timestamps in UTC for synchronization. This example fetches all product rows that have been added since the last run, using a millisecond-resolution column mytimestamp:
{
"type" : "jdbc",
"jdbc" : {
"url" : "jdbc:mysql://localhost:3306/test",
"user" : "",
"password" : "",
"sql" : [
{
"statement" : "select * from \"products\" where \"mytimestamp\" > ?",
"parameter" : [ "$metrics.lastexecutionstart" ]
}
],
"index" : "my_jdbc_index",
"type" : "my_jdbc_type"
}
}
I want to load data incrementally based on a column modifiedDate whose format is 2015-08-20 14:52:09. I also use a scheduler that runs every minute. I tried the following value for the sql key:
"statement" : "select * from \"products\" where \"modifiedDate\" > ?",
But the data was not loaded.
Am I missing something?
The format of lastexecutionstart is like this: "2016-03-27T06:37:09.165Z".
It contains 'T' and 'Z', which does not match your column's format, and that is why your data was not loaded.
If you want to know more, here is the link:
https://github.com/jprante/elasticsearch-jdbc
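If you cannot change the column, one hedged workaround (assuming the importer binds $metrics.lastexecutionstart as an ISO-8601 string, which you should verify) is to parse it on the MySQL side before comparing:
"statement" : "select * from \"products\" where \"modifiedDate\" > STR_TO_DATE(?, '%Y-%m-%dT%H:%i:%s.%fZ')",
"parameter" : [ "$metrics.lastexecutionstart" ]
Otherwise, keeping modifiedDate in UTC with millisecond resolution, as the importer documentation recommends, avoids the mismatch.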
Sup, good folks of the internet.
Does anyone know how to make nested queries for MongoDB? This is probably best explained by an example. To retrieve specific fields, I can use the :fields option to retrieve a field (e.g. suppose it is called "useful_field"):
collection.find({},{:fields => {"useful_field" => 1}})
But suppose that useful_field itself contains an array of many further fields, i.e.
useful_field = [{"value_I_want"=>"useful","value_I_dont_want"=>"not_useful"}]
My aim is to select "value_I_want". Any thoughts?
Here is a specific entry that I am trying to deal with (a reply to a tweet):
{ "_id" : ObjectId("51b6f71b0364718d71e4bca5"),
"annotations" : { },
"resultType" : "Tweet",
"score" : 1,
"groupName" : "TweetsWithConversation",
"results" : [
{
"kind" : "Tweet",
"score" : 1,
"annotations" : { "ConversationRole" : "Ancestor" },
"value" : { "created_at" : "Fri Jun 07 19:47:51 +0000 2013",
"id" : NumberLong("343091955196104704"),
"id_str" : "343091955196104704",
"text" : "THIS_IS_WHAT_I_WANT",
etc. etc. (Apologies for the odd formatting)
I'm trying to use a method of the form that will let me do something like this:
db.collection.find({},{:fields { some_way_of_selecting(THIS_IS_WHAT_I_WANT)})
(I'm querying as part of a Ruby script)
Otherwise, I'll have to go back into the dark world of regex. No-one wants that.
Try the following
db.collection.find({},{"useful_field.value_I_want": 1})
Maybe try this:
db.collection.find({"resultType" : "Tweet"}, {"results" : {$elemMatch : {"value.text" : "THIS_IS_WHAT_I_WANT"}}})
What you are trying to do is called "projection" - it's specifying what fields you want returned in the second argument to find.
In your case you simply want:
db.collection.find({}, {"results.value.text":1} )