NiFi Create Indexes after Inserting Records into table - apache-nifi

I've got my first Process Group that drops the indexes on a table.
That routes to another Process Group that does the inserts into the table.
After successfully inserting the half million rows, I want to create the indexes on the table and analyze it. This is typical data warehouse methodology. Can anyone please give advice on how to do this?
I've tried setting counters, but cannot reference counters in Expression Language. I've tried RouteOnAttribute but am getting nowhere. Now I'm digging into the Wait & Notify processors - maybe there's a solution there?
I have gotten Counters to count the flow file SQL insert statements, but cannot reference the Counter values via Expression Language. I.e., this always returns null: "${InsertCounter}", even though InsertCounter appears to be set properly by the UpdateCounter processor in my flow.
So maybe this code can be used?
In the Wait processor, set the Target Signal Count to ${fragment.count}.
Set the Release Signal Identifier in both the Notify and Wait processors to ${fragment.identifier}.
Nothing works.

You can use Wait/Notify processors to do that.
I assume you're using ExecuteSQL, SplitAvro? If so, the flow will look like:
Split approach
At the 2nd ProcessGroup
ExecuteSQL: e.g. 1 output FlowFile containing 5,000 records
SplitAvro: creates 5,000 FlowFiles; this processor adds the fragment.identifier and fragment.count (=5,000) attributes.
split:
XXXX: Do some conversion per record
PutSQL: Insert records individually
Notify: Increase count for the fragment.identifier (Release Signal Identifier) by 1. Executed 5,000 times.
original - to the next ProcessGroup
At the 3rd ProcessGroup
Wait: waiting for fragment.identifier (Release Signal Identifier) to reach fragment.count (Target Signal Count). This route processes the original FlowFile, so executed only once.
PutSQL: Execute a query to create indices and analyze tables
Alternatively, if possible, using Record aware processors would make the flow simpler and more efficient.
Record approach
ExecuteSQL: e.g. 1 output FlowFile containing 5,000 records
Perform record level conversion: With UpdateRecord or LookupRecord, you can do data processing without splitting records into multiple FlowFiles.
PutSQL: Execute a query to create indices and analyze tables. Since a single FlowFile contains all records, no Wait/Notify is required, and the output FlowFile can be connected directly to the downstream flow.
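In either approach, the statements the final PutSQL runs could look roughly like the following sketch; fact_sales and the index/column names are placeholders, and the exact analyze syntax depends on your database:
-- run only after all inserts have completed (placeholder table/column names)
CREATE INDEX idx_fact_sales_customer ON fact_sales (customer_id);
CREATE INDEX idx_fact_sales_date ON fact_sales (sale_date);
-- gather statistics, e.g. ANALYZE TABLE fact_sales; on MySQL,
-- or DBMS_STATS.GATHER_TABLE_STATS on Oracle
ANALYZE TABLE fact_sales;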

I think my suggestion to this question will fit your scenario as well:
How to execute a processor only when another processor is not executing?
Check it out

Related

BigQuery streaming insert with deduplication only writes one row

I'm using the Go BigQuery library and the ValueSaver interface to perform a streaming insert to a table. I'm using a single field as an insert key to cause a dedupe, since it's possible that two processes will be trying to write the same data (Cloud Run with 3 instances triggered by a PubSub subscription where a 2nd process might pick up the message and start processing before a 1st can write).
When the job(s) is/are done, I can see the expected rows in the "Preview" tab of the BigQuery web UI for the table. However, under "Streaming buffer statistics" I can see that there is still buffered data, with "Estimated rows" at about what I expect, but "Number of rows" shows 0, and if I query, I only see the first row. All rows in "Preview" have a unique value for the field I'm using as the "InsertId". Eventually (around 90 minutes, as indicated by other questions here), the first row is written to the table and the buffer goes away along with the expected data. If I use bigquery.NoDedupeID I get "instant" writes, but duplicate data.
This leads me to two questions, though it's possible I'm missing the point entirely:
Am I misunderstanding how the "InsertId" is being used to dedupe data?
How do I "close" the insert buffer faster?

Data cleanup in Oracle DB is taking long time for 300 billion records

Problem statement:
There is an address table in Oracle which has relationships with multiple tables like subscriber, member, etc.
Currently the design is such that when there is any change in the associated tables, it increments the record version throughout all tables.
So a new record is added to the address table even if the same address is already present, resulting in a large number of duplicate copies.
We need to identify and remove duplicate records, and update foreign keys in the associated tables, while making sure it doesn't impact the running application.
Tried solution:
We have written a script for the cleanup logic, where a unique hash is generated for every address. If the calculated hash is already present, it means the address is a duplicate; we then merge it into a single address record and update the foreign keys in the associated tables.
But the problem is there are around 300 billion records in the address table, so this cleanup process is taking a lot of time, and it will take several days to complete.
We have tried adding an index on the hash column, but the process still takes a long time.
Also, we have updated the insertion/query logic to use addresses as per the new structure (using the hash, and without the version), in order to take care of incoming requests in production.
We are planning to do the processing in chunks, but it will be a very long, ongoing activity.
Questions:
Would like to know if any further improvements can be made to the above approach.
Will distributed processing help here? (maybe using Hadoop, Spark, Hive, MR, etc.)
Is there some sort of tool that can be used here?
Suggestion 1
Use the built-in parallel delete
delete /*+ parallel(t 8) */ mytable t where ...
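Note that for the delete itself (not just the table scan) to run in parallel, parallel DML usually has to be enabled in the session first; a minimal sketch, where the degree of 8 is just an example:
-- enable parallel DML for this session before issuing the parallel delete
ALTER SESSION ENABLE PARALLEL DML;
delete /*+ parallel(t 8) */ mytable t where ...;
commit;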
Suggestion 2
Use distributed processing (Hadoop Spark/Hive) - watch out for potential contention on indexes or table blocks. It is recommended to have each process work on a logically isolated subset, e.g.
process 1 - delete mytable t where id between 1000 and 1999
process 2 - delete mytable t where id between 2000 and 2999
...
Suggestion 3
If more than ~30% of the table needs to be deleted, the fastest way would be to create an empty table, copy all required rows into it, drop the original table, rename the new one, and recreate all indexes and constraints. Of course this requires downtime, and it greatly depends on the number of indexes - the more you have, the longer it will take.
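A minimal sketch of that approach, assuming mytable and a keep-this-row condition of your own (all names are placeholders):
-- copy only the rows to keep; NOLOGGING and PARALLEL speed up the copy
CREATE TABLE mytable_new NOLOGGING PARALLEL 8 AS
  SELECT * FROM mytable WHERE /* condition that keeps only the non-duplicate rows */ 1 = 1;
-- swap the tables, then rebuild indexes and constraints
ALTER TABLE mytable RENAME TO mytable_old;
ALTER TABLE mytable_new RENAME TO mytable;
-- CREATE INDEX ... / ALTER TABLE ... ADD CONSTRAINT ... as on the original table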
P.S. There are no "magic" tools to do it. In the end they all run the same SQL commands that you can.
It's possible to use the Oracle MERGE statement to insert the data if you use clean SQL.
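A hedged sketch of what such a MERGE could look like for the hash-based dedup; address_clean and the column/bind names are placeholders:
-- insert an address only if its hash is not already present
MERGE INTO address_clean c
USING (SELECT :addr_hash AS addr_hash, :street AS street, :city AS city FROM dual) src
ON (c.addr_hash = src.addr_hash)
WHEN NOT MATCHED THEN
  INSERT (addr_hash, street, city)
  VALUES (src.addr_hash, src.street, src.city);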

How does flink-sql deal with scenario like 'count(distinct )'

I need to calculate "Daily Active Users" in real time using flink-sql, and it is like a count(distinct) operation on daily data.
My question is: if userA logged in this morning at 1am, flink adds 1 to the DAU as expected. Now, if userA logs in again at 10pm, how does flink-sql know that userA has already been processed this morning? Does it need to repeatedly do count(distinct) on the whole day's login log? If not, how does flink handle this scenario?
Distinct is a very expensive operation in streaming. If you don't use time-based windows (TUMBLE, SLIDE, SESSION), the runtime must store all values in state forever, because it needs to assume that another record could arrive at any point in the future.
However, you can set the option table.exec.state.ttl (see the Flink configuration documentation) to control how long you want to keep those records in state. This might be one of the most important options when designing a SQL pipeline with long-running queries where the value space of an operator's input is not constant.
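For example, a daily tumbling-window version of the DAU query lets Flink drop a day's state once the window fires, while the TTL option covers the non-windowed variant; login_log, user_id and login_time are assumed names:
-- keep per-key state for ~36 hours in the unbounded (non-windowed) case
-- (SQL client syntax; the same option can be set on the TableEnvironment config)
SET 'table.exec.state.ttl' = '36 h';
-- daily tumbling window: the distinct-user state for a day is released after the window fires
SELECT TUMBLE_START(login_time, INTERVAL '1' DAY) AS day_start,
       COUNT(DISTINCT user_id) AS dau
FROM login_log
GROUP BY TUMBLE(login_time, INTERVAL '1' DAY);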
By real-time, I assume you mean in a Continuous Query?
See https://ci.apache.org/projects/flink/flink-docs-stable/dev/table/tuning/streaming_aggregation_optimization.html
By default, the unbounded aggregation operator processes input records one by one, i.e., (1) read accumulator from state, (2) accumulate/retract record to accumulator, (3) write accumulator back to state, (4) the next record will do the process again from (1).
The accumulator does not only keep the end result, but also enough data to produce the next result without reading all previous records again.
I guess in the case of count(distinct), it means keeping all unique users per day in the accumulator.

Spring batch to read CSV and update data in bulk to MySQL

I have the below requirement to write a Spring Batch job. I would like to know the best approach to achieve it.
Input: A relatively large file with report data (for today)
Processing:
1. Update Daily table and monthly table based on the report data for today
Daily table - Just update the counts based on ID
Monthly table: Add today's count to the existing value
My concerns are:
1. Since the data is huge, I may end up having multiple DB transactions. How can I do this operation in bulk?
2. To add to the existing counts of the monthly table, I must have the existing counts with me. I may have to maintain a map beforehand. But is it a good way to process it this way?
Please suggest the approach I should follow, or any example if there is one.
Thanks.
You can design a chunk-oriented step to first insert the daily data from the file into the table. When this step is finished, you can use a step execution listener; in its afterStep method you will have a handle to the step execution, from which you can get the write count with StepExecution#getWriteCount. You can then write this count to the monthly table.
since data is huge I may end up having multiple DB transactions. How can I do this operation in bulk?
With a chunk oriented step, data is already written in bulk (one transaction per chunk). This model works very well even if your input file is huge.
To add to the existing counts of the monthly table, I must have the existing counts with me. I may have to maintain a map beforehand. But is this a good way to process in this way?
No need to store the info in a map, you can get the write count from the step execution after the step as explained above.
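If you would rather push the accumulation down to MySQL instead of keeping counts in memory, the chunk's writer can use an upsert; a sketch assuming a monthly table keyed by (id, month) with a cnt column (names and bind parameters are placeholders):
-- add today's count to the existing monthly count, inserting the row if it doesn't exist yet
INSERT INTO monthly_report (id, month, cnt)
VALUES (:id, :month, :todayCount)
ON DUPLICATE KEY UPDATE cnt = cnt + VALUES(cnt);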
Hope this helps.

Data processing and updating of selected records

Basically, the needed job is for a large number of records in a database, and more records can be inserted all the time:
Select <1000> records with status "NEW" -> process the records -> update the records to status "DONE".
This sounds to me like "Map Reduce".
I think that the job described above may be done in parallel, even by different machines, but then my concern is:
When I select <1000> records with status "NEW" - how can I know that none of these records are already being processed by some other job ?
The same records should not be selected and processed more than once of course.
Performance is critical.
The naive solution is to do the mentioned basic job in a loop.
It seems related to big data processing / NoSQL / MapReduce, etc.
Thanks
Since we are considering the performance issue, we can achieve this. The main goal is to distribute records to clients in such a way that no two clients get the same record.
Irrespective of the database:
If you have an additional column that is used for locking records, then on fetching those records you can set the lock, to prevent them from being fetched a second time.
But if you do not have such a capability, then my best bet would be to create another table or an in-memory key-value store, with the record's primary key and a lock, and on fetching records you need to check that the record does not exist in that other table.
If you have HBase then it can be achieved easily; the first approach is achievable with good performance.
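A sketch of the lock-column approach in plain SQL, assuming a status column and a worker_id column you add for claiming rows (table, column, and bind names are placeholders):
-- claim a batch atomically so no two workers pick the same rows
UPDATE records
   SET status = 'PROCESSING', worker_id = :worker
 WHERE status = 'NEW'
 LIMIT 1000;            -- MySQL syntax; on Oracle use AND ROWNUM <= 1000 instead
COMMIT;
-- process only the rows this worker claimed
SELECT * FROM records WHERE status = 'PROCESSING' AND worker_id = :worker;
-- mark them done afterwards
UPDATE records SET status = 'DONE' WHERE status = 'PROCESSING' AND worker_id = :worker;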
