APACHE NIFI lookup workaround

End goal:
Be able to identify a flow file as an insert or update
Idea:
Bring in a record from the source, then
compare the PK in the flowfile to the output (sheet, file, TBD) of a query against the target DB that pulls all PKs for that view. Then I can route that flowfile as an insert or an update. This is for speed purposes.
Question for Stack:
Is there a way I could compare the PK of my current flowfile to that of a flowfile containing all generated values? (I'd only regenerate this flowfile every 6, 12, 18, or 24 hours.)

Related

The output table generated by aggregation differs between keyedTable and keyedStreamTable

When the output table that receives the aggregation results is a keyedTable versus a keyedStreamTable, the results are different.
When the aggregation engine uses tables created with keyedTable and keyedStreamTable to receive the results, the effect differs. The former receives the results but cannot be used as a data source for a larger period; the latter does not aggregate at all, and only keeps the first record of the tick data for each minute.
The code executed by the GUI is as follows:
barColNames=`ActionTime`InstrumentID`Open`High`Low`Close`Volume`Amount`OpenPosition`AvgPrice`TradingDay
barColTypes=[TIMESTAMP,SYMBOL,DOUBLE,DOUBLE,DOUBLE,DOUBLE,INT,DOUBLE,DOUBLE,DOUBLE,DATE]
Choose one of the following two lines of code; the results are inconsistent.
// Generate a 1-minute K line (barsMin01); this is an empty table
share keyedTable(`ActionTime`InstrumentID,100:0, barColNames, barColTypes) as barsMin01
// This line works for aggregation, but its output cannot be used as a data source for other periods
share keyedStreamTable(`ActionTime`InstrumentID,100:0, barColNames, barColTypes) as barsMin01
// This line has no aggregation effect; only the first tick of every minute is kept
// define the aggregation metrics
metrics=<[first(LastPrice), max(LastPrice), min(LastPrice), last(LastPrice), sum(Volume), sum(Amount), sum(OpenPosition), sum(Amount)/sum(Volume)/300, last(TradingDay) ]>
// Aggregation engine: generate the 1-min K line
nMin01=1*60000
tsAggrKlineMin01 = createTimeSeriesAggregator(name="aggr_kline_min01", windowSize=nMin01, step=nMin01, metrics=metrics, dummyTable=ticks, outputTable=barsMin01, timeColumn=`ActionTime, keyColumn=`InstrumentID,updateTime=500, useWindowStartTime=true)
// Subscribe; the 1-min K line will be generated
subscribeTable(tableName="ticks", actionName="act_tsaggr_min01", offset=0, handler=append!{getStreamEngine("aggr_kline_min01")}, batchSize=1000, throttle=1, hash=0, msgAsTable=true)
There are some differences between keyedTable and keyedStreamTable:
keyedTable: when a new record is added, the system automatically checks its primary key. If it matches the primary key of an existing record, the existing record in the table is updated.
keyedStreamTable: when a new record is added, the system automatically checks its primary key. If it matches the primary key of an existing record, the new record is discarded and the existing record is not updated.
That is, one of them is for updating and the other is for filtering.
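A minimal sketch of the difference (using made-up key and column names, not the tables from the question): insert the same key twice into each kind of table and compare.
// sketch: the same key arriving twice in each table type
kt  = keyedTable(`id, 10:0, `id`val, [INT, DOUBLE])
kst = keyedStreamTable(`id, 10:0, `id`val, [INT, DOUBLE])
tableInsert(kt, 1, 10.0)
tableInsert(kt, 1, 20.0)     // duplicate key: existing row is updated, val becomes 20.0
tableInsert(kst, 1, 10.0)
tableInsert(kst, 1, 20.0)    // duplicate key: new row is ignored, val stays 10.0
select * from kt             // one row: id=1, val=20.0
select * from kst            // one row: id=1, val=10.0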
The keyedStreamTable behavior you mention ("does not aggregate, but only keeps the first record of tick data per minute") happens exactly because you set updateTime=500 in createTimeSeriesAggregator. If updateTime is specified, the calculation may run multiple times within the current window, and the keyedStreamTable keeps only the first of those intermediate results per key while discarding the later, more complete ones.
Since you use a keyedStreamTable to receive this result table, updateTime cannot be used. If you want to force the calculation to trigger, you can specify the forceTrigger parameter instead.
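As a sketch of that change, the engine from the question would be created with the updateTime argument simply omitted, so each window is calculated and emitted only once when it closes:
tsAggrKlineMin01 = createTimeSeriesAggregator(name="aggr_kline_min01", windowSize=nMin01, step=nMin01, metrics=metrics, dummyTable=ticks, outputTable=barsMin01, timeColumn=`ActionTime, keyColumn=`InstrumentID, useWindowStartTime=true)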

Passing a parameter from a different source into an insert statement using NiFi

I'm still new to NiFi. What I want to achieve is to pass a parameter from a different source.
Scenario:
I have 2 data sources: JSON data and a record id (from an Oracle function). I declared the record id using ExtractText as "${recid}", and the JSON string default is "$1".
How do I insert into the table using the SQL statement insert into table1 (json, recid) values ('$1','${recid}')?
After I run the processor, I'm not able to get both attributes into one insert statement.
Please help.
[screenshot: NiFi flowfile]
[screenshot: flowfile after MergeContent]
You should merge these 2 flowfiles into one.
Use the MergeContent processor with Attribute Strategy set to Keep All Unique Attributes:
https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-standard-nar/1.6.0/org.apache.nifi.processors.standard.MergeContent/index.html
Take a look at LookupAttribute with a SimpleDatabaseLookupService. You can pass your JSON flow file into that, look up the recid into an attribute, then do the ExtractText -> ReplaceText to get it into SQL form.
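Once the merged flowfile carries both values as attributes, a ReplaceText processor can assemble the statement for PutSQL. A rough sketch, assuming ExtractText stored the JSON body in an attribute named json (a hypothetical name) alongside recid:
ReplaceText:
  Replacement Strategy: Always Replace
  Replacement Value: INSERT INTO table1 (json, recid) VALUES ('${json}', '${recid}')
The flowfile content is then the complete INSERT statement, which can be routed to PutSQL.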

Insert, update, and delete source flag records in the target using Informatica

I implemented the mapping as below. Can someone suggest whether this is a good approach or not?
Old records were copied from production, so they didn't provide a flag for those records. Only new records will have a flag.
Source data:
col1 col2 col3 DML_FLAG
1 a 123 NULL(old record)
2 b 456 I
3 c 678 U
Mapping:
Source --> SQ --> EXP --> LKP (on target, to identify new or update) --> EXP --> RTR (for insert and update) --> UPD (for update) --> Target
For the first-time load I have to load all records, i.e. a full load (old records where DML_FLAG is null, plus new records).
From the 2nd run I have to capture only changed records from the source. For this I am using mapping variables.
Here is my question: the I and U flags are already available in the source, yet I am also using a LKP. Without the lookup, I could use DML_FLAG with two groups, I and U, in the RTR.
But I need to refresh the data every 30 minutes. If, within those 30 minutes, a record is inserted (I) and then updated so its flag changes to 'U' in the source, while that record is not yet present in the target, how can I capture that new record with flag 'U' without the LKP?
Can someone suggest how I can do this without a lookup?
From what I understand of your question, you want to make sure you apply the insert in your target before you apply the update to that same record - is this correct? If so, then just use a target load plan with inserts routed to a separate alias of the same target, placed higher in the load order than the updates.
Whether this is correct as a design choice depends on what the target DB is. For a data warehouse fact table you would usually insert all records, be they inserts or updates, because you would be reporting on the event rather than the record state. For a dimension table it would depend on your slowly changing dimension strategy.

What happens when two updates for the same record come in one file while loading into the DB using Informatica

Suppose I have a table xyz:
id name add city act_flg start_dtm end_dtm
1 amit abc,z pune Y 21012018 null
and this table is loaded from a file using Informatica with SCD2.
Suppose there is one file that contains two records with id=2, i.e.:
2 vipul abc,z mumbai
2 vipul asdf bangalore
So how will this be loaded into the DB?
It depends on how you're doing the SCD type 2. If you are using a lookup with a static cache, both records will be added with end date as null.
The best approach in this scenario is to use a dynamic lookup cache and read your source data in such a way that the latest record is read last. This ensures one record is expired with an end date and only one active record (i.e. end date is null) exists per id.
Hmm, 1 of 2 possibilities depending on what you mean. If you mean that you're pulling data from different source systems which sometimes have the same ids, then it's easy: just stamp both the natural key (i.e. the id) and a source system value on the dimension, along with the arbitrary surrogate key which is unique to your target table (this is a data warehousing basic, so read Kimball).
If you mean that you are somehow tracing real-time changes to a single record in the source system and writing those changes to the input files of your ETL job, then you need to agree with your client whether they're happy for you to aggregate them based on the timestamp of the change and just pick the most recent one, or to create 2 records, one with its expiry datetime set and the other still open (which is the standard SCD approach... again, read Kimball).
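If the client opts for keeping only the most recent change, a minimal SQL sketch of that dedup step (table and column names here are hypothetical) could be pushed into the source qualifier query:
-- keep only the latest change per id before the SCD2 lookup/router logic
SELECT id, name, addr, city, change_ts
FROM (
    SELECT s.*,
           ROW_NUMBER() OVER (PARTITION BY id ORDER BY change_ts DESC) AS rn
    FROM stg_xyz s
) t
WHERE rn = 1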

Neo4j performance for MERGE queries on 100 thousand nodes

I started working with Neo4j recently and I have a performance problem with the MERGE query for creating my graph.
I have a csv file with 100,000 records and want to load the data from this file.
My query for loading is as follows:
//Script to import global Actors data
USING PERIODIC COMMIT 500
LOAD CSV WITH HEADERS FROM "file:///D:/MOT/test_data.csv" AS row
MERGE (c:Country {Name:row.Country})
MERGE (a:Actor {Name: row.ActorName, Aliases: row.Aliases, Type:row.ActorType})
My system configuration:
8.00 GB RAM and Core i5-3330 CPU.
my neo4j config is as follows:
neostore.nodestore.db.mapped_memory=50M
neostore.relationshipstore.db.mapped_memory=50M
neostore.propertystore.db.mapped_memory=90M
neostore.propertystore.db.strings.mapped_memory=130M
neostore.propertystore.db.arrays.mapped_memory=130M
mapped_memory_page_size=1048576
label_block_size=60
array_block_size=120
node_auto_indexing=False
string_block_size=120
When I run this query in the Neo4j browser it takes more than a day. Would you please help me solve the problem? For example, please let me know whether I should change my JVM configuration or change my query or ... and how?
To increase the speed of MERGE queries you should create indexes on your MERGE properties:
CREATE INDEX ON :Country(Name)
CREATE INDEX ON :Actor(Name)
If you have unique node properties, you can increase performance even more by using uniqueness constraints instead of normal indexes:
CREATE CONSTRAINT ON (node:Country) ASSERT node.Name IS UNIQUE
CREATE CONSTRAINT ON (node:Actor) ASSERT node.Name IS UNIQUE
In general your query will be faster if you MERGE on a single, indexed property only:
//Script to import global Actors data
USING PERIODIC COMMIT 500
LOAD CSV WITH HEADERS FROM "file:///D:/MOT/test_data.csv" AS row
MERGE (c:Country {Name:row.Country})
MERGE (a:Actor {Name: row.ActorName})
// if necessary, you can set properties here
ON CREATE SET a.Aliases = row.Aliases, a.Type = row.ActorType
As already answered on the Google group:
It should just take a few seconds.
I presume:
you use Neo4j 2.3.2 ?
you created indexes / constraints for the things you merge on ?
you configured your neo4j instance to run with at least 4G of heap?
you are using PERIODIC COMMIT ?
I suggest that you run a profile on your statement to see where the biggest issues show up.
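For example (a sketch using a placeholder value), prefixing one of the MERGE statements with PROFILE shows whether the planner uses an index or constraint seek on :Country(Name) or falls back to scanning every :Country node:
PROFILE MERGE (c:Country {Name: "SomeCountry"}) RETURN c;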
Otherwise, it is highly recommended to split it up,
e.g. like this:
CREATE CONSTRAINT ON (c:Country) ASSERT c.Name IS UNIQUE;
CREATE CONSTRAINT ON (o:Organization) ASSERT o.Name IS UNIQUE;
CREATE CONSTRAINT ON (a:Actor) ASSERT a.Name IS UNIQUE;
LOAD CSV WITH HEADERS FROM "file:///E:/datasets/Actors_data_all.csv" AS row
WITH distinct row.Country as Country
MERGE (c:Country {Name:Country});
LOAD CSV WITH HEADERS FROM "file:///E:/datasets/Actors_data_all.csv" AS row
WITH distinct row.AffiliationTo as AffiliationTo
MERGE (o:Organization {Name: AffiliationTo});
LOAD CSV WITH HEADERS FROM "file:///E:/datasets/Actors_data_all.csv" AS row
MERGE (a:Actor {Name: row.ActorName}) ON CREATE SET a.Aliases=row.Aliases, a.Type=row.ActorType;
LOAD CSV WITH HEADERS FROM "file:///E:/datasets/Actors_data_all.csv" AS row
WITH distinct row.Country as Country, row.ActorName as ActorName
MATCH (c:Country {Name:Country})
MATCH (a:Actor {Name:ActorName})
MERGE(c)<-[:IS_FROM]-(a);
LOAD CSV WITH HEADERS FROM "file:///E:/datasets/Actors_data_all.csv" AS row
MATCH (o:Organization {Name: row.AffiliationTo})
MATCH (a:Actor {Name: row.ActorName})
MERGE (a)-[r:AFFILIATED_TO]->(o)
ON CREATE SET r.Start=row.AffiliationStartDate, r.End=row.AffiliationEndDate;
