How does Flink SQL deal with a scenario like 'count(distinct)'?

I need to calculate "Daily Active Users" in real time using Flink SQL, which is essentially a 'count(distinct)' operation over a day's data.
My question is: if user A logged in this morning at 1 am and Flink added 1 to the DAU as expected, and user A then logs in again at 10 pm, how does Flink SQL know that user A has already been processed this morning? Does it need to repeatedly run count(distinct) over the whole day's login log? If not, how does Flink handle this scenario?

Distinct is a very expensive operation in streaming. If you don't use time-based windows (TUMBLE, SLIDE, SESSION), the runtime must store all values in state forever, because it has to assume that another record could arrive at any point in the future.
However, you can set the option table.exec.state.ttl (see the configuration documentation) to control how long those records are kept in state. This is one of the most important options when designing a SQL pipeline with long-running queries where the value space of an operator's input is not constant.
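A minimal sketch of both variants, assuming a hypothetical user_logins table with a user_id column and an event-time column login_time (none of these names are from the question), and the legacy TUMBLE group-window syntax:

-- Windowed variant: state is scoped to the daily window and dropped once the window fires.
SELECT
  TUMBLE_START(login_time, INTERVAL '1' DAY) AS day_start,
  COUNT(DISTINCT user_id)                    AS dau
FROM user_logins
GROUP BY TUMBLE(login_time, INTERVAL '1' DAY);

-- Non-windowed variant: bound the state explicitly instead.
-- SQL-client style; the exact SET syntax depends on the Flink version.
SET 'table.exec.state.ttl' = '1 d';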

By real-time, I assume you mean in a Continuous Query?
See https://ci.apache.org/projects/flink/flink-docs-stable/dev/table/tuning/streaming_aggregation_optimization.html
By default, the unbounded aggregation operator processes input records one by one, i.e., (1) read accumulator from state, (2) accumulate/retract record to accumulator, (3) write accumulator back to state, (4) the next record will do the process again from (1).
The accumulator keeps not only the end result but also enough data to produce the next result without reading all previous records again.
I guess in the case of count(distinct), that means keeping all unique users per day in the accumulator.
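For illustration, a continuous (non-windowed) form of the DAU query, again using the hypothetical user_logins table from the sketch above, keeps one accumulator per day, and that accumulator is effectively the set of user ids already counted for that day:

-- Continuous query: one result row per day, updated as logins arrive.
-- A second login by the same user on the same day does not change the count,
-- because the user id is already in that day's accumulator.
SELECT
  DATE_FORMAT(login_time, 'yyyy-MM-dd') AS login_day,
  COUNT(DISTINCT user_id)               AS dau
FROM user_logins
GROUP BY DATE_FORMAT(login_time, 'yyyy-MM-dd');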

Related

Updating the Same Record Multiple Times at the Same Time on Oracle

I have a Voucher Management System. I also have a campaign on that system that works on a first-click, first-served basis. In the first minutes after the campaign starts, many people try to get vouchers, and every time they do, the same mechanism shown in the design image runs.
When updating the number of given vouchers, I noticed a bottleneck. All of these updates try to update the same row within a short time, so transactions queue up and wait for the current update.
After the campaign, I saw that some updates had waited for 10 seconds. How can I solve this?
Firstly, I tried to minimize the query execution time, but it is already a simple query.
Assuming everyone is doing something like:
SELECT ..
FROM voucher_table
WHERE <criteria indicating the first order>
FOR UPDATE
then obviously everyone (except one person) is going to queue up until the first person commits. Effectively you end up with a single-user system.
You might want to check out the SKIP LOCKED clause, which allows a session to attempt to lock an eligible row but skip over it to the next eligible row if the first one is already locked.
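A rough sketch of that pattern, assuming a hypothetical voucher_table with voucher_id and status columns (the real schema is not shown in the question):

DECLARE
  -- Each session walks the available vouchers; SKIP LOCKED means rows locked by
  -- other sessions are skipped instead of waited on, so sessions don't queue up.
  CURSOR c_voucher IS
    SELECT voucher_id
    FROM   voucher_table
    WHERE  status = 'AVAILABLE'
    FOR UPDATE SKIP LOCKED;
  l_voucher_id voucher_table.voucher_id%TYPE;
BEGIN
  OPEN c_voucher;
  FETCH c_voucher INTO l_voucher_id;   -- first row this session could lock
  IF c_voucher%FOUND THEN
    UPDATE voucher_table
    SET    status = 'CLAIMED'
    WHERE  CURRENT OF c_voucher;
  END IF;
  CLOSE c_voucher;
  COMMIT;
END;
/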

NiFi Create Indexes after Inserting Records into table

I've got my first Process Group that drops the indexes on a table.
That routes to another Process Group that does the inserts into the table.
After successfully inserting the half-million rows, I want to create the indexes on the table and analyze it. This is typical data warehouse methodology. Can anyone please give advice on how to do this?
I've tried setting counters, but I cannot reference counters in Expression Language. I've tried RouteOnAttribute but am getting nowhere. Now I'm digging into the Wait & Notify processors - maybe there's a solution there?
I have gotten Counters to count the FlowFile SQL insert statements, but I cannot reference the Counter values via Expression Language. I.e., this always returns null: "${InsertCounter}", even though InsertCounter appears to be set properly via the UpdateCounter processor in my flow.
So maybe this code can be used?
In the Wait processor, set the Target Signal Count to ${fragment.count}.
Set the Release Signal Identifier in both the Notify and Wait processors to ${fragment.identifier}.
Nothing works.
You can use Wait/Notify processors to do that.
I assume you're using ExecuteSQL and SplitAvro? If so, the flow will look like this:
Split approach
At the 2nd ProcessGroup
ExecuteSQL: e.g. 1 output FlowFile containing 5,000 records
SplitAvro: creates 5,000 FlowFiles; this processor adds the fragment.identifier and fragment.count (= 5,000) attributes.
split:
XXXX: Do some conversion per record
PutSQL: Insert records individually
Notify: Increase count for the fragment.identifier (Release Signal Identifier) by 1. Executed 5,000 times.
original - to the next ProcessGroup
At the 3rd ProcessGroup
Wait: waits for the count for fragment.identifier (Release Signal Identifier) to reach fragment.count (Target Signal Count). This route processes the original FlowFile, so it is executed only once.
PutSQL: Execute a query to create indices and analyze tables
Alternatively, if possible, using Record aware processors would make the flow simpler and more efficient.
Record approach
ExecuteSQL: e.g. 1 output FlowFile containing 5,000 records
Perform record level conversion: With UpdateRecord or LookupRecord, you can do data processing without splitting records into multiple FlowFiles.
PutSQL: Execute a query to create indices and analyze the table (a sketch of that SQL is shown below). Since a single FlowFile contains all the records, no Wait/Notify is required, and the output FlowFile can be connected to the downstream flow.
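In either approach, the final PutSQL step would run something like the following; the table, column, and index names are placeholders, and the statistics command depends on your database:

-- Hypothetical post-load DDL for the final PutSQL processor.
CREATE INDEX idx_my_table_col1 ON my_table (col1);
CREATE INDEX idx_my_table_col2 ON my_table (col2);

-- Refresh optimizer statistics after the bulk load
-- (syntax varies: ANALYZE, DBMS_STATS.GATHER_TABLE_STATS, etc.).
ANALYZE my_table;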
I think my suggestion to this question will fit your scenario as well:
How to execute a processor only when another processor is not executing?
Check it out.

Handling Duplicates in Data Warehouse

I was going through the below link for handling Data Quality issues in a data warehouse.
http://www.kimballgroup.com/2007/10/an-architecture-for-data-quality/
"
Responding to Quality Events
I have already remarked that each quality screen has to decide what happens when an error is thrown. The choices are: 1) halting the process, 2) sending the offending record(s) to a suspense file for later processing, and 3) merely tagging the data and passing it through to the next step in the pipeline. The third choice is by far the best choice.
"
In some dimensional feeds (like the Client list), we sometimes get the same Client twice (the two records differing in certain attributes). What is the best solution in this scenario?
I don't want to reject both records (as that would mean incomplete client data).
The source systems are very slow in fixing the issue, so we get the same issues every day. That means a manual fix is also tough, as it has to be applied every day (we receive the client list every day).
Selecting a single record is not possible, as we don't know which value is correct.
Having both records in our warehouse means our joins are disrupted: because there are two rows for the same ID, the fact table rows are doubled in a join.
Any thoughts?
What is the best solution in this scenario?
There are a lot of permutations and combinations in your scenario. The big question is "Are the differing details valid or invalid?", as this will change how you deal with them.
Valid Data Example: Record 1 has John Smith living at 12 Main St; Record 2 has John Smith living at 45 Main St. This is valid because John Smith moved address between the first and second record. If the data is valid, you have options such as creating a slowly changing dimension and tracking the changes (end-date the old record, start-date the new record).
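A minimal sketch of that type 2 slowly-changing-dimension handling, assuming a hypothetical dim_client table with valid_from, valid_to, and is_current columns:

-- Close the current version of the client row...
UPDATE dim_client
SET    valid_to   = CURRENT_DATE,
       is_current = 0
WHERE  client_id  = :client_id
AND    is_current = 1;

-- ...and insert the new version received in today's feed.
INSERT INTO dim_client (client_id, address, valid_from, valid_to, is_current)
VALUES (:client_id, :new_address, CURRENT_DATE, NULL, 1);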
Invalid Data Example: However, if the data is INVALID (e.g. your system somehow creates duplicate keys incorrectly) then your options are different. I doubt you want to surface this data, as it's currently invalid and, as you pointed out, you don't have a way to identify which duplicate record is "correct". But you don't want your whole load to fail or halt.
In this instance you would usually:
Push these duplicate rows to a "Quarantine" area
Push an alert to the people who have the power to fix this operationally
Optionally select one of the records randomly as the "golden detail" record (so your system will still tally with totals) and mark an attribute on the record saying that it's "Invalid" and under investigation.
The point that Kimball is trying to make is that Option 1 is not desirable because it halts your entire system for errors that will inevitably happen; Option 2 isn't ideal because it means your aggregations will appear out of sync with your source systems; so Option 3 is the most desirable, as it still leads to a data fix but doesn't halt the process or the use of the data (though it does alert users that the data is suspect).
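As a rough sketch of the quarantine step described above, assuming hypothetical stg_client (staging) and client_quarantine tables:

-- Copy rows whose client_id appears more than once in today's feed into a
-- quarantine table for operational follow-up.
INSERT INTO client_quarantine (client_id, client_name, load_date)
SELECT client_id, client_name, CURRENT_DATE
FROM   stg_client
WHERE  client_id IN (
         SELECT client_id
         FROM   stg_client
         GROUP  BY client_id
         HAVING COUNT(*) > 1
       );

-- Optionally keep one arbitrary row per duplicated client_id as the "golden" record,
-- flagged as suspect, so that fact rows still join to exactly one dimension row.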

Get a collection and then changes to it without gaps or overlap

How do I reliably get the contents of a table, and then changes to it, without gaps or overlap? I'm trying to end up with a consistent view of the table over time.
I can first query the database, and then subscribe to a change feed, but there might be a gap where a modification happened between those queries.
Or I can first subscribe to the changes, and then query the table, but then a modification might happen in the change feed that's already processed in the query.
Example of this case:
A subscribe 'messages'
B add 'messages' 'message'
A <- changed 'messages' 'message'
A run get 'messages'
A <- messages
Here A received a 'changed' message before it sent its 'messages' query, and the result of the query includes the changed message. Possibly A could simply ignore any changed messages received before the query result. Is it guaranteed that changes received after a query (on the same connection) were not already applied in the previous query, i.e., are they handled on the same thread?
What's the recommended way? I couldn't find any docs on this use case.
I know you said you came up with an answer, but I've been doing this quite a bit, and here is what I've been doing:
r.db('test').table('my_table').between(tsOne, tsTwo, {index: 'timestamp'});
So in my jobs, I run an indexed between query that captures data between the last run time and that exact moment. You can take a lock on the config table that tracks the last_run_time for your jobs so that you can even scale to multiple processors! And because we are using between, the next job waiting for the lock will only grab data created after the first processor ran. Hope that helps!
Michael Lucy of RethinkDB wrote:
For .get.changes and .order_by.limit.changes you should be fine because we already send the initial value of the query for those. For other queries, the only way to do that right now is to subscribe to changes on the query, execute the query, and then read from the changefeed and discard any changes from before the read (how to do this depends on what read you're executing and what legal changes to it are, but the easiest way to hack it would probably be to add a timestamp field to your objects that you increment whenever you do an update).
In 2.1 we're planning to add an optional argument return_initial that will do what I just described automatically and without any need to change your document schema.

How does Oracle handle concurrency in a clustered environment?

I have to implement a database solution in which contention is handled in a clustered environment. There is a scenario in which multiple users try to access a bank account at the same time and deposit money into it if the balance is less than $100. How can I make sure that no extra money is deposited? Basically, this query is supposed to fire:
update acct set balance=balance+25 where acct_no=x;
Since the database is clustered, the account ends up getting deposited into multiple times.
I am looking for a purely Oracle-based solution.
Clustering doesn't matter here: the mechanism that prevents the scenario you're fearing/seeing is locking.
Consider user A and then user B trying to do an update based on a check (less than 100 dollars in the account):
If both the check and the update are done in the same transaction, locking will prevent user B from doing the check UNTIL user A has done both the check and the actual update. In other words, user B will find the check failing and will not perform the requested action.
When a user says "at the same time", you should know that the computer has no such concept; all transactions are sequential, even if their timestamps are identical down to the millisecond. Consider the ID kept in the redo logs: there is only one counter, so transactions X and Y happen before or after each other, never at the same time.
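A minimal sketch of keeping the check and the update together; the acct table and the +25 deposit come from the question, everything else (bind names, the explicit-lock variant) is illustrative:

-- Option 1: fold the check into the statement itself. The second session blocks on
-- the row lock and, after the first commits, re-evaluates the condition against the
-- current balance, so no extra deposit slips through.
UPDATE acct
SET    balance = balance + 25
WHERE  acct_no = :x
AND    balance < 100;
COMMIT;

-- Option 2: explicit check-then-update inside one transaction.
SELECT balance
FROM   acct
WHERE  acct_no = :x
FOR UPDATE;                 -- other sessions wait here until COMMIT/ROLLBACK

UPDATE acct
SET    balance = balance + 25
WHERE  acct_no = :x;
COMMIT;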
That doesn't sound right... When Oracle locks a row for update, the lock is held across all nodes. What version of Oracle are you using, and can you provide a step-by-step example of what you're doing?
Oracle 11g documentation here:
http://docs.oracle.com/cd/B28359_01/server.111/b28318/consist.htm#CNCPT020
