What NiFi processor should I be using to filter out data from 2 groups and merge back into one?

I'm working on a flow whose idea I've documented, but I've run into an issue: I can't seem to find a processor to handle this.
The input port is fed many files, which are routed to group A or B based on their data. A file in group A contains more data and gets split out into multiple flows, and for each of those flows I set an ID found in the data. At the end I need to compare A to B and B to A and keep only what is found not to exist in the other group.
I have attached a diagram; at the bottom you'll see where I need to:
Filter out of Group A any ids matched in Group B
Filter out of Group B any ids matched in Group A
Gather what remains and send it on its way for additional processing
Question #1. Does NiFi support this kind of action?
Question #2. What processor can do this comparison?
Question #3. I'm using a Wait at the end, right before the compare, to make sure I have all the data before I send it to be compared. But how do I make sure the comparison gets the data from both sides only once both A and B have all their data, and not as soon as just A or B completes its Wait?

Related

NiFi Create Indexes after Inserting Records into table

I've got my first Process Group that drops the indexes on a table.
That then routes to another Process Group that does inserts into the table.
After successfully inserting the half million rows, I want to create the indexes on the table and analyze it. This is typical Data Warehouse methodology. Can anyone please give advice on how to do this?
I've tried setting counters, but I cannot reference counters in Expression Language. I've tried RouteOnAttribute but am getting nowhere. Now I'm digging into the Wait & Notify processors - maybe there's a solution there??
I have gotten Counters to count the FlowFile SQL insert statements, but cannot reference the Counter values via Expression Language. I.e. this always returns null: "${InsertCounter}", even though InsertCounter appears to be set properly via the UpdateCounter processor in my flow.
So maybe this code can be used?
In the Wait processor, set the Target Signal Count to ${fragment.count}.
Set the Release Signal Identifier in both the Notify and Wait processors to ${fragment.identifier}.
Nothing works.
You can use Wait/Notify processors to do that.
I assume you're using ExecuteSQL, SplitAvro? If so, the flow will look like:
Split approach
At the 2nd ProcessGroup
ExecuteSQL: e.g. 1 output FlowFile containing 5,000 records
SplitAvro: creates 5,000 FlowFiles; this processor adds the fragment.identifier and fragment.count (=5,000) attributes.
split relationship:
XXXX: Do some conversion per record
PutSQL: Insert records individually
Notify: increases the count for the fragment.identifier (Release Signal Identifier) by 1. Executed 5,000 times.
original relationship - to the next ProcessGroup
At the 3rd ProcessGroup
Wait: waits for the count for fragment.identifier (Release Signal Identifier) to reach fragment.count (Target Signal Count). This route processes the original FlowFile, so it executes only once. (The relevant properties are sketched after this list.)
PutSQL: Execute a query to create indices and analyze tables
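For reference, a minimal sketch of the relevant processor properties, assuming the defaults elsewhere and a shared DistributedMapCacheClientService that both processors point at:
Notify (end of the 2nd ProcessGroup):
  Release Signal Identifier = ${fragment.identifier}
  Signal Counter Delta = 1
  Distributed Cache Service = <shared DistributedMapCacheClientService>
Wait (start of the 3rd ProcessGroup):
  Release Signal Identifier = ${fragment.identifier}
  Target Signal Count = ${fragment.count}
  Distributed Cache Service = <same DistributedMapCacheClientService>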
Alternatively, if possible, using Record aware processors would make the flow simpler and more efficient.
Record approach
ExecuteSQL: e.g. 1 output FlowFile containing 5,000 records
Perform record level conversion: With UpdateRecord or LookupRecord, you can do data processing without splitting records into multiple FlowFiles.
PutSQL: Execute a query to create indices and analyze tables. Since a single FlowFile contains all the records, no Wait/Notify is required, and the output FlowFile can be connected to the downstream flow.
I think my suggestion for this question will fit your scenario as well:
How to execute a processor only when another processor is not executing?
Check it out

Spring Batch: Read Twice, One After the Other, from a Database

I need to know the best approach for reading data from one database in chunks (of 100) and, on the basis of that data, reading data from another database server.
Example: take an id from one database server and, based on that id, fetch data from another database server.
I have searched on Google but haven't found a solution for reading twice and writing once in a batch.
One approach is to read in chunks and, inside the processor, take the id and hit the second database. But the processor handles a single item at a time, which is time consuming.
The second approach is to make two different steps, but then we can't share the list of ids with the other step, because only a small amount of data can be shared between steps.
I need to know the best approach to read twice, one after the other.
There is no best approach as it depends on the use case.
One approach is to read in chunks and, inside the processor, take the id and hit the second database. But the processor handles a single item at a time, which is time consuming.
This approach is a common pattern called the "Driving Query Pattern", explained in detail in the Common Batch Patterns section of the reference documentation. The idea is that the reader reads only IDs, and the processor enriches the item by querying the second server for additional data for that item. Of course this will generate a query for each item, but this is what you want anyway, unless you want your second query to send the list of all IDs in the chunk. In that case, you can do it in org.springframework.batch.core.ItemWriteListener#beforeWrite, where you get the list of all items to be written.
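A minimal sketch of that reader/processor pair, assuming Spring Batch with Java config, two DataSource beans (firstDataSource and secondDataSource), illustrative table/column names, and a hypothetical UserDetail type:

import javax.sql.DataSource;
import org.springframework.batch.item.ItemProcessor;
import org.springframework.batch.item.database.JdbcCursorItemReader;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.jdbc.core.JdbcTemplate;

@Configuration
public class DrivingQueryConfig {

    // Hypothetical enriched item type.
    public record UserDetail(long id, String name) {}

    // Driving query: the reader streams only the ids from the first database.
    @Bean
    public JdbcCursorItemReader<Long> idReader(DataSource firstDataSource) {
        JdbcCursorItemReader<Long> reader = new JdbcCursorItemReader<>();
        reader.setName("idReader");
        reader.setDataSource(firstDataSource);
        reader.setSql("SELECT id FROM source_table");              // illustrative query
        reader.setRowMapper((rs, rowNum) -> rs.getLong("id"));
        return reader;
    }

    // Enrichment: for each id read above, query the second database server.
    @Bean
    public ItemProcessor<Long, UserDetail> enrichProcessor(DataSource secondDataSource) {
        JdbcTemplate secondDb = new JdbcTemplate(secondDataSource);
        return id -> secondDb.queryForObject(
                "SELECT id, name FROM detail_table WHERE id = ?",  // illustrative query
                (rs, rowNum) -> new UserDetail(rs.getLong("id"), rs.getString("name")),
                id);
    }
}

The writer then receives the enriched items in chunks of 100; if you would rather fire a single query per chunk, the ItemWriteListener#beforeWrite hook mentioned above is where you can collect the ids of the whole chunk.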
The second approach is to make two different steps, but then we can't share the list of ids with the other step, because only a small amount of data can be shared between steps.
Yes, sharing a lot of data via the execution context is not recommended as this execution context will be persisted between steps. So I think this is not a good option for you.
Hope this helps.

Multiple sub-agents for one table in Net-SNMP

I'm writing a custom MIB to expose a table over SNMP. There will be one table with a fixed set of columns, but a variable number of rows. Is it possible, with Net-SNMP, to add multiple rows to the table from multiple processes (e.g. process A creates row 1, process B creates row 2, etc...)? I would like to avoid having one "master sub-agent" if possible (other than something that is part of Net-SNMP, like snmpd/snmptrapd/etc).
I would like to use mib2c to help generate code if possible, but I can work around that if it can't accomplish what I need.
I'm using Net-SNMP 5.5 at the moment. Upgrading is possible if support for what I need is added in newer versions.
If writing an AgentX sub-agent for snmpd, it looks like you cannot share the table OID across two or more AgentX sub-agents; snmpd responds with an error that the OID is a duplicate for some of the sub-agents. Thus I am continuing my sources with my own sub-sub-agents (based on Enduro/X) which collect the data into a single AgentX sub-agent that fills the SNMP table.
According to https://www.rfc-editor.org/rfc/rfc2741.html#section-7.1.4.1:
7.1.4.1. Handling Duplicate and Overlapping Subtrees
As a result of this registration algorithm there are likely to be
duplicate and/or overlapping subtrees within the registration data
store of the master agent. Whenever the master agent's dispatching
algorithm (see section 7.2.1, "Dispatching AgentX PDUs") determines
that there are multiple subtrees that could potentially contain the
same MIB object instances, the master agent selects one to use,
termed the 'authoritative region', as follows:
1) Choose the one whose original agentx-Register-PDU r.subtree
contained the most subids, i.e., the most specific r.subtree.
Note: The presence or absence of a range subid has no bearing
on how "specific" one object identifier is compared to another.
2) If still ambiguous, there were duplicate subtrees. Choose the
one whose original agentx-Register-PDU specified the smaller
value of r.priority.
So in the best case, you might find that data is collected randomly from one AgentX sub-agent or another if the same OID is registered from different AgentX processes.

Merging similar groups in concatenated PIG input files

I have a Pig job that runs daily, tracking some user accounts where each user has a number of transactions a day. As part of the process this Pig job writes out the transactions grouped by user per day (as an aside, using Avro).
I now want to group together all of the transactions for a week (or a longer period) per user account and process them.
I can do this by brute force as follows in Pig, but it seems that there must be a better way than flattening and re-grouping all of the transactions. In more detail ...
Starting point that works ... (a is a user; (b,c) and (d,e) represent two transactions, as do (f,g) and (h,i))
I read in ...
(a,{(b,c),(d,e)}) -- From first file - Monday
(a,{(f,g),(h,i)}) -- from second file - Tuesday
I Want ...
(a,{(b,c),(d,e),(f,g),(h,i)})
I get close with this script …
-- Read in multiple days (one day per file, $input is directory with all files)
DayGroupedRecord = LOAD '$input' USING AvroStorage();
FlattenRecord = FOREACH DayGroupedRecord GENERATE $0 AS Key, FLATTEN ($1);
WeeklyGroup = GROUP FlattenRecord BY $0;
This gives
(a,{(a,b,c),(a,d,e),(a,f,g),(a,h,i)})
Which is good enough. However, the GROUP has to operate at the per-transaction level, which seems inefficient as the input records are already partly grouped.
Is there a different approach in PIG (perhaps more efficient) where I group the daily groups and then flatten?
I have tried (and failed) with ...
DayGroupedRecord = LOAD '$input' USING AvroStorage();
WeeklyGroupNested = GROUP DayGroupedRecord BY $0;
WeeklyGroup = FOREACH WeeklyGroupNested GENERATE FLATTEN($1);
The group operation looks promising …
(a,{(a,{(b,c),(d,e)}),(a,{(f,g),(h,i)})})
But I can't find out how to flatten the inner bags above .. the script I have just gets me back to where I started ... I have tried a number of variations on the flatten with no success (mostly generating Pig errors).
This is what I get with the above script (and not what I want).
(a,{(b,c),(d,e)})
(a,{(f,g),(h,i)})
As a newbie to Pig, can I get Pig to flatten the inner bags and get close to what I want:
(a,{(b,c),(d,e),(f,g),(h,i)})
Phil
Have you tried the "brute force" method and compared the resource consumption with what you get when you, e.g., just GROUP and forget about trying to get the transactions into a single bag? You might not find the brute force method elegant, but think about what it's doing and whether there's really much inefficiency in it.
Your ideal way is to group by user and merge all the bags that get grouped together. This would mean sending for each input record the key and the bag of transactions to some reducer. From there you would have to iterate through the bags, pulling out each transaction and putting it into a final bag for that user.
The brute force method uses FLATTEN so that for each transaction in each input record, you send the key and the transaction to some reducer. There's some duplication here by repeatedly sending the user ID, but this isn't that big of a deal, particularly if the size of your transaction data is much larger than the size of your user ID. From there, you just add each transaction into a final bag for that user.
That doesn't sound particularly inefficient to me, and it doesn't involve any extra map-reduce steps. The size of the data sent from the mappers to the reducers is pretty close to the same. I suspect that you are not going to substantially increase performance by trying to keep the transactions for a day grouped together throughout the computation.
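If the duplicated key inside the brute-force result bothers you, it can be projected away after the GROUP with a bag projection; here is a sketch (untested, field names assumed) building on the working script above:

-- Read in multiple days (one day per file, $input is directory with all files)
DayGroupedRecord = LOAD '$input' USING AvroStorage();
FlattenRecord = FOREACH DayGroupedRecord GENERATE $0 AS key, FLATTEN($1) AS (val1, val2);
WeeklyGroupNested = GROUP FlattenRecord BY key;
-- keep only the transaction fields inside the bag, dropping the repeated key
WeeklyGroup = FOREACH WeeklyGroupNested GENERATE group AS key, FlattenRecord.(val1, val2);

This should give (a,{(b,c),(d,e),(f,g),(h,i)}) while remaining the same single map-reduce job as the brute-force version.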

Redis multiple requests

I am writing a very simple social networking app that uses Redis.
Each user has a sorted set that contains ids of items in their feed. If I want to display their feed, I do the following steps:
use ZREVRANGE to get ids of items in their feed
use HMGET to get the feed (each feed item is a string)
But now, I also want to know if the user has liked a feed item or not. So I have a set associated with each feed item that contains the ids of users who have liked that feed item.
If I get 15 feed items, I now have to execute an additional 15 requests to Redis to find out, for each feed item, whether the current user has liked it or not (by checking if their id exists in that item's set).
So that will take 15+1 requests.
Is this type of querying considered 'normal' when using Redis? Are there better ways I can structure the data to avoid this many requests?
I am using redis-rb gem.
You can easily refactor your code to collapse the 15 requests into one by using pipelines (which redis-rb supports).
You get the ids from the sorted set with the first request, and then you use them to fetch the many keys you need based on those results (using the pipeline).
With this approach you should have 2 requests in total instead of 16 and keep your code quite simple.
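For example, a minimal sketch with a recent redis-rb, assuming a key scheme like item:<id>:likers for the per-item like sets (adjust the key names to your own schema):

# Request 1: ids of the 15 newest items in the user's feed
item_ids = redis.zrevrange("user:#{user_id}:feed", 0, 14)

# Request 2: one pipelined round trip for all 15 membership checks
liked_flags = redis.pipelined do |pipeline|
  item_ids.each { |item_id| pipeline.sismember("item:#{item_id}:likers", user_id) }
end
# liked_flags[i] is true/false for item_ids[i]

The HMGET for the item bodies can go into the same pipelined block, which keeps rendering the feed at two round trips in total.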
As an alternative, you can use a Lua script and fetch everything in one request.
With this kind of database (a non-relational database), you have to make a trade-off between issuing multiple requests and including some data redundancy.
You should analyze each case separately and consider some aspects, like:
How frequently will this data be accessed?
How much space will this redundancy consume?
How many requests will I have to make in order to have all the data without redundancy?
Is performance an issue?
In your case, I would suggest keeping a Set/Hash, or just JSON-encoded data, for each user with a history of all recent user interactions, such as comments, likes, etc. Every time the user accesses the feed, you just have to read the feed and the history; only two requests.
One thing to keep in mind: on every user interaction, you must update all the redundant data as well.
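As a sketch of that idea (key names illustrative, using redis-rb): keep a redundant per-user set of liked item ids, update it together with the per-item set, and read it once per feed render:

# write path: update both the per-item set and the redundant per-user set
redis.multi do |tx|
  tx.sadd("item:#{item_id}:likers", user_id)
  tx.sadd("user:#{user_id}:liked", item_id)
end

# read path: one extra request, then check membership client-side
liked_ids = redis.smembers("user:#{user_id}:liked")
feed = item_ids.map { |id| [id, liked_ids.include?(id)] }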
