I'm trying to simplify a flow in Apache NiFi.
What I want:
Call the Facebook Graph API to receive campaigns for ad accounts and save them to the DB.
Response example:
[ {
"start_date" : "2018-10-15",
"stop_date" : "2019-03-31",
"id" : "608962192",
"account_id" : "1007311",
"name" : "Axe_Instagram_aug-dec2018_col",
"status" : "ACTIVE",
"start_time" : "2018-10-15",
"stop_time" : "2019-03-31"
}, {
"start_date" : "2018-10-08",
"stop_date" : "2018-10-31",
"id" : "61084542",
"account_id" : "10240051",
"name" : "Axe_IG_aug-dec2018",
"status" : "ACTIVE",
"start_time" : "2018-10-08",
"stop_time" : "2018-10-31"
} ]
Call the Facebook Graph API to receive ads for the ad accounts and save them to the DB.
Response example:
[
{
"id":"23845",
"account_id":"251977841",
"name":"Post_2",
"status":"ACTIVE",
"campaign_id":"2384345125",
"adset_id":"238125",
"bid_amount":87,
"updated_time":"2019-06-20T14:21:06+0300"
},
{
"id":"23843453786320125",
"account_id":"2251971478158841",
"name":"Post_1",
"status":"ACTIVE",
"campaign_id":"238225",
"adset_id":"2384325",
"bid_amount":87,
"updated_time":"2019-06-20T14:21:06+0300"
}
]
Filter ads:
I should keep only active campaigns (from campaigns) using these rules: stop_date should be empty (NULL) OR stop_date should be > '2021-01-01'.
Then check whether the campaign_id of each ad is contained in the result set above.
My current approach:
The two steps above are completed; all the data is stored in the DB.
For each flow file from the ads API I use the following flow:
SplitJson to separate the ads one by one;
EvaluateJsonPath to store campaign_id in an attribute;
ExecuteSQL with the following statement for each flow file:
select *
from facebook_api.campaigns c
where c.id = '${campaign.id}'
and (c.stop_date is null or c.stop_date > '2021-01-01')
This returns either nothing or a campaign that is active by my criteria. After that I can filter them with RouteOnAttribute: ${executesql.rows.count:lt(1)}.
But there is a problem: splitting the ~300 source flow files creates about 100,000 flow files, and I'd make 100,000 unnecessary requests to the DB.
Can I perform requests with the same logic without splitting flow files?
Doing the SplitJson is really inefficient and probably not needed here.
You could do this with PartitionRecord to create FlowFiles that are grouped by campaign_id (and also carry it as an attribute). This means you do not need the SplitJson or EvaluateJsonPath processors, and you only end up with as many FlowFiles as there are unique campaign_ids in the original FlowFile.
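A minimal sketch of that PartitionRecord configuration (the reader/writer service names are just examples) is a single user-defined property whose value is a RecordPath:
PartitionRecord
  Record Reader : JsonTreeReader
  Record Writer : JsonRecordSetWriter
  campaign_id   : /campaign_id   (user-defined property; the value is a RecordPath)
Each outgoing FlowFile then holds the records sharing one campaign_id and carries that value as a campaign_id attribute.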
Now your original ExecuteSQL will still work, but has far fewer FFs to execute on.
However, I'd question why you need to hit an intermediary DB in the first place. Why not have NiFi filter the raw results from hitting the Facebook API?
You could replace the ExecuteSQL with a QueryRecord that does:
select *
from FLOWFILE where (stop_date is null or stop_date > '2021-01-01')
Passing only the matching records to an 'ACTIVE' relationship. This removes the need for the DB in the middle.
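In QueryRecord the SQL goes into a user-defined property whose name becomes an outgoing relationship, so a sketch of the configuration (reader/writer services assumed again) looks like:
QueryRecord
  Record Reader : JsonTreeReader
  Record Writer : JsonRecordSetWriter
  ACTIVE        : select * from FLOWFILE where (stop_date is null or stop_date > '2021-01-01')
Records matching the query are written to a FlowFile routed to the ACTIVE relationship; non-matching records are simply dropped.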
The resulting flow would look something like:
InvokeHTTP (hit facebook API) -> PartitionRecord (split FFs by campaign ID) -> QueryRecord (drop all inactive campaigns)
Another thing to consider: I don't know the Facebook Graph API very well, but are there no query parameters you could add so that the filtering is done on the FB side?
Related
I have about 2000 flow files from REST API calls in JSON format. One file looks like:
[ {
"manager_customer_id" : 637,
"resourceName" : "customers/673/customerClients/3158981",
"clientCustomer" : "customers/3158981",
"hidden" : false,
"level" : "2",
"manager" : false,
"descriptiveName" : "Volvo",
"id" : "3158981"
} ]
Now I want to filter them by the parameter manager. If manager is true, I should skip that flow file, so I need to work only with flow files where manager is false. How can I do this with Apache NiFi?
You can convert your flowfile to a record with the help of ConvertRecord.
It allows you to pass from JSON format to whatever format you prefer; you can also keep JSON.
With your flowfile being a record, you can now use additional processors like:
QueryRecord, so you can run SQL-like commands on the flow file (the matching records are the ones you keep):
"SELECT * FROM FLOWFILE WHERE manager=false"
I recommend the following readings:
Query Record tutorial
Update Record tutorial
You can just use EvaluateJsonPath (to store the value of manager in an attribute) and RouteOnAttribute (to filter based on that attribute). Direct the flow for manager=true to auto-terminate and proceed with the rest to success.
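A rough configuration for that approach (assuming each flow file holds the single-element JSON array from the question, hence the [0] in the path):
EvaluateJsonPath
  Destination : flowfile-attribute
  manager     : $[0].manager
RouteOnAttribute
  skip        : ${manager:equals('true')}
Auto-terminate the skip relationship and continue the flow from unmatched.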
I have a requirement for implementing the following entities in a DynamoDB table.
I have stored these entities in DynamoDB as below.
Partition Key : PROJ#ProjectId:CountryId
Sort Key : Project Name
Company : company data as JSON document
Since this is a one-to-many relationship, N projects of the same company will create N project records, and the same company details will be stored in each record's Company attribute. The reason for doing this is that the most critical data access path is via ProjectId and CountryId (assume that I can't change this DB design).
I have a requirement to implement a search functionality which supports filtering the table by company name, address, project name, country, etc. (using a single filter or any combination of these filters). I'm using DynamoDB Streams to feed an Elasticsearch cluster, updating it there on any creation, deletion or update of the details, and using the Elasticsearch API to query the data.
But I need to index this data in the following format, so that when I receive the details from Elasticsearch the data will not be duplicated:
{
"id" : 1
"name" : "ABC",
"description" : "description",
"address" : "address",
"projects" : [
{
"id" : 10,
"name" : "project 1",
"countryId" : 10
},
{
"id" : 20,
"name" : "project 1",
"countryId" : 10
}
]
}
At record creation time, since Project records are created as single records, is there any recommended or standard way that I can grab all the Project records of a Company, build the above JSON document and index it in Elasticsearch?
This is how I would approach it:
In Elasticsearch, the document id will be the company id.
What you can do is create a Lambda that is triggered by the change streams, uses Elasticsearch's update APIs to find the company document, and applies a Painless script to update the projects section of the document. This will work well for less frequent changes.
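A minimal, hypothetical sketch of such a Lambda in Python (the index name, the stream attribute names and the elasticsearch client wiring are all assumptions; it uses a scripted upsert on the Update API, a more direct variant of the update-by-query idea, since the document _id here is the company id):

import os
from elasticsearch import Elasticsearch

es = Elasticsearch(os.environ["ES_ENDPOINT"])

# Painless: append the project unless one with the same id is already there
APPEND_PROJECT = """
boolean found = false;
for (def p : ctx._source.projects) {
  if (p.id == params.project.id) { found = true; }
}
if (!found) { ctx._source.projects.add(params.project); }
"""

def handler(event, context):
    for record in event["Records"]:
        if record["eventName"] not in ("INSERT", "MODIFY"):
            continue
        image = record["dynamodb"]["NewImage"]
        company_id = image["CompanyId"]["S"]           # hypothetical attribute
        project = {
            "id": int(image["ProjectId"]["N"]),        # hypothetical attribute
            "name": image["ProjectName"]["S"],         # hypothetical attribute
            "countryId": int(image["CountryId"]["N"]), # hypothetical attribute
        }
        # Scripted upsert: create the company doc if missing, otherwise
        # append the project to its "projects" array (skipping duplicates)
        es.update(
            index="companies",
            id=company_id,
            body={
                "scripted_upsert": True,
                "script": {
                    "lang": "painless",
                    "source": APPEND_PROJECT,
                    "params": {"project": project},
                },
                "upsert": {"id": company_id, "projects": [project]},
            },
        )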
Background
I am migrating my ES index to ES version 6. I am currently stuck because ES6 removed the use of multiple "_type"s per index.
Old Implementation (ES2)
My software has many users (>100K). Each user has at least one document in ES. So, the hierarchy looks like this:
INDEX -> TYPE -> Document
myindex-> user-123 -> document-1
The key point here is that with this structure I can easily remove all the documents of a specific user:
DELETE /myindex/user-123
(This deletes all the documents of a specific user with a single command.)
The problem
"_type" is no longer supported by ES6.
Possible solution
Instead of using _type, use the user ID as the index name. So my indices will look like:
"user-123" -> "static-name" -> document
Deleting a user is done by deleting the index (instead of deleting the type, as in the previous implementation).
Questions:
My first worry is about the number of indices and performance: is having something like 1M indices acceptable in terms of performance? Don't forget I have to search across them frequently.
Most of my users have a small number of documents stored in ES. Does it make sense to hold a shard, which should be expensive, for fewer than 10 documents?
Does my data architecture sound reasonable to you?
Any other tips are welcome!
Thanks.
I would not have one index per user; it's a waste of resources, especially if there are only ~10 docs per user.
What I would do instead is to use filtered aliases, one per user.
So the index would be named users and the type would be a static name, e.g. doc. For user 123, the documents of that user would all be stored in users/doc/xyz and in each document you need to add the user id, e.g.
PUT users/doc/xyz
{
...
"userId": 123,
...
}
Then you can define a filtered alias for all documents of user 123, like this:
POST /_aliases
{
"actions" : [
{
"add" : {
"index" : "users",
"alias" : "user-123",
"filter" : { "term" : { "userId" : "123" } }
}
}
]
}
If you need to delete all documents of user 123, then you can simply do it like this:
POST user-123/_delete_by_query?q=*
Having this many indices is definitely not a good approach. If your only concern is deleting multiple documents with a single command, then you can use the Delete By Query API provided by Elasticsearch.
You can introduce a "subtype" attribute in all your documents, containing a per-user value such as "user-<id>". So in your case a document would look like:
{
"attribute1":"value",
"subtype":"user-123"
}
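Deleting all the documents of one user is then a single call, e.g. (assuming the index is named users):
POST /users/_delete_by_query
{
  "query": {
    "term": { "subtype": "user-123" }
  }
}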
I am looking for a way to search in more than one index at the same time using Elastica.
I have an index products, and an index user.
products contains {product_id, product_name, price} and user contains {product_id, user_name, date}. The product_id in both of them is the same; in products each product_id is unique, but in user it is not, as a user can buy the same product multiple times.
Anyway, I want to automatically get the price of a product from the products index while searching through the user index.
I know that we can search over multiple indexes like so (correct me if I'm wrong) :
$search = new \Elastica\Search($client);
$search->addIndex('users')
->addType('user')
->addIndex('products')
->addType('product');
But the problem is, when I write an aggregation on products_id, for example, and then create a new query with some filters:
$products_agg = new \Elastica\Aggregation\Terms('products_id');
$products_agg->setField('products_id')->setSize(0);
$query = new \Elastica\Query();
$query->addAggregation($products_agg);
$query->setQuery($bool);
$search->setQuery($query);
How does Elastica know which index to search in? How can I link this products_id to the other index?
The Elastica library has support for the Multi Search API, which allows executing several search requests within the same API call. The endpoint for it is _msearch.
The format of the requests is similar to the bulk API: the first line is a header part that specifies which index/indices to search on, and the second line contains the typical search request body.
{"index" : "products", "type": "products"}
{"query" : {"match_all" : {}}, "from" : 0, "size" : 10} // write your own query to get price
{"index" : "uesrs", "type" : "user"}
{"query" : {"match_all" : {}}} // query for user
Check the test case in Multi/SearchTest.php to see how to use it.
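For reference, a rough sketch of the same thing through Elastica's wrapper classes (the query objects are placeholders for your own):
$productSearch = new \Elastica\Search($client);
$productSearch->addIndex('products')->setQuery(new \Elastica\Query\MatchAll()); // your price query

$userSearch = new \Elastica\Search($client);
$userSearch->addIndex('users')->setQuery($query); // your filtered/aggregated query

$multi = new \Elastica\Multi\Search($client);
$multi->addSearch($productSearch);
$multi->addSearch($userSearch);

// One ResultSet per search, in the order the searches were added
$resultSets = $multi->search()->getResultSets();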
Basically you want to join two indices based on a common field, as in SQL.
What you can do is model your data in the same index using the join datatype:
https://www.elastic.co/guide/en/elasticsearch/reference/master/parent-join.html
Index all documents in the same index;
make all product documents parents;
make all user documents children;
and then use parent-child aggregations and queries:
https://www.elastic.co/guide/en/elasticsearch/reference/master/parent-join.html#_parent_join_queries_and_aggregations
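For reference, in ES6 a join field is declared in the mapping roughly like this (the index, type and field names here are placeholders):
PUT /shop
{
  "mappings": {
    "doc": {
      "properties": {
        "product_user_join": {
          "type": "join",
          "relations": { "product": "user" }
        }
      }
    }
  }
}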
NOTE: make sure you understand the performance implications of parent-child mapping:
https://www.elastic.co/guide/en/elasticsearch/reference/master/parent-join.html#_parent_join_and_performance
One more thing you can do is store all the product information with every user document that references it.
This can unnecessarily waste space and is not good practice as far as data-modelling rules are concerned.
But since this is a search engine, Elasticsearch itself suggests that it is best to denormalise and duplicate data rather than use parent-child.
You can try the following:
1- Name your indices with a specific suffix, like the following:
myFirstIndex-myProjectName
mySecIndex-myProjectName
myThirdIndex-myProjectName
and so on.
2- That gives you the ability to use * in the index field when searching, because it accepts wildcards, so you can search across multiple indices like this using the Kibana Dev Tools:
GET *-myProjectName/_search
{
"_source": {
"excludes": [ "*" ]
},
"query": { "match_all": {} },
}
This will search every index whose name includes -myProjectName.
You can't query two indices with different mappings. The best way to solve your problem is to just do two queries (application-side joins): in the first query you do the aggregation on the users index, and with the second you get the prices.
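A rough sketch of that two-step approach, reusing the names from the question (the aggregation size and the field names are assumptions):
// 1) Collect the product ids appearing in the users index
$agg = new \Elastica\Aggregation\Terms('products_id');
$agg->setField('products_id')->setSize(1000);

$userQuery = new \Elastica\Query();
$userQuery->addAggregation($agg);
$userQuery->setQuery($bool); // your existing filters

$buckets = $client->getIndex('users')->search($userQuery)
    ->getAggregation('products_id')['buckets'];
$ids = [];
foreach ($buckets as $bucket) {
    $ids[] = $bucket['key'];
}

// 2) Fetch the matching products (and their prices) in a second query
$productQuery = new \Elastica\Query(new \Elastica\Query\Terms('product_id', $ids));
$products = $client->getIndex('products')->search($productQuery);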
Another option would be to add the price to the user index. Sometimes you have to sacrifice a little space for better usability.
I'm trying to build a set of filters in a UI for an ES object. I'd like to aggregate all the documents, group certain properties by value, and get a count for each.
For example I'd like to be able to build a list of available filters like:
State :
TX (5)
NJ (1)
CA (10)
Source :
Location1 (30)
Location2 (25)
Location3 (22)
Where "State" and "Source" are different properties of the document type and the counts are in parenthesis obviously. I understand an Aggregation request would be what I want, I'm just looking for a little guidance. Ideally I'd like to do this with one request and not multiple requests for each property I need a group by count on.
So, if I understand correctly, you just want a count per state value, and the same for source.
Here is a request for that:
POST /_search
{
  "size": 0,
  "aggs": {
    "state": {
      "terms": {
        "field": "state"
      }
    },
    "source": {
      "terms": {
        "field": "source"
      }
    }
  }
}
Both terms aggregations run in the same request, so you get the counts for state and source back in a single response.
Does that help?