I am learning Flink and I am confused about its use cases. Must we use Flink? What is the difference from a Spring Boot (web framework) application?
For example, I want to count the page views of a website over the past 5 minutes. I can use Flink's windows to count them. But I could also receive the requests with Spring Boot first, save them to a time-series database, and execute a SQL query to count them. What is the difference between the two approaches?
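For illustration only, here is a minimal sketch of what the Flink side of such a page-view count could look like, assuming a placeholder in-memory source of page URLs (the class name and the source are made up for this example):
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class PageViewCount {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Placeholder source: in a real job this would be Kafka, a socket, etc.
        DataStream<String> pageUrls = env.fromElements("/home", "/about", "/home");

        pageUrls
            .map(url -> Tuple2.of(url, 1L))
            .returns(Types.TUPLE(Types.STRING, Types.LONG))             // type hint needed for the lambda
            .keyBy(value -> value.f0)                                   // one count per page
            .window(TumblingProcessingTimeWindows.of(Time.minutes(5)))  // 5-minute windows
            .sum(1)                                                     // sum the per-event counts
            .print();

        env.execute("page-view-count");
    }
}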
In the official fraud-detection use case (https://nightlies.apache.org/flink/flink-docs-release-1.16/docs/try-flink/datastream/), I can detect suspicious transactions with Flink's state and windows. But I could also receive the transactions with a Spring Boot application first, save them to a time-series database, and then query all of the account's transactions to find the suspicious ones. What is the difference between the two approaches?
Is it about cluster scaling? I think Spring Boot + Kubernetes can also scale dynamically quite easily.
I am trying to understand Flink's use cases, and I want to know Flink's benefits compared to a traditional web + database stack.
Related
I am new to Kafka in Spring Boot. I have been through many tutorials and have a fair knowledge of it.
Currently I have been assigned a task and I am facing an issue; I hope to get some help here.
The scenario is as follows.
1) I have a DB that is continuously updated with millions of records.
2) I have to hit the DB every 5 minutes, pick the recently updated data, and send it to Kafka.
Condition: the data that I picked in a previous iteration must not be picked again in the next DB call and Kafka push.
I have finished the Spring Scheduling part that picks the data using findAll() from Spring Data JPA, but how can I write the logic so that it does not pick the old DB records, and only takes the new records and pushes them to Kafka?
My DB table also has a field called "Recent_timeStamp" of type "datetime".
It's hard to tell without really seeing your logic and the way you work with the database, but from what you've described you should not just do findAll() here.
Instead, you should treat your DB table as time-driven data:
Since it has a timestamp field, make sure there is an index on it.
Instead of findAll(), execute something like:
SELECT <...>
FROM <YOUR_TABLE>
WHERE RECENT_TIMESTAMP > ?
ORDER BY RECENT_TIMESTAMP ASC
In this case you'll get the records ordered by increasing timestamp.
The ? denotes the last timestamp that you have already handled,
so you'll have to maintain that state yourself.
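A minimal sketch of what that could look like in a Spring Boot service (the entity, repository, and topic names are assumptions, not from the question; the last handled timestamp is kept in memory here for brevity, whereas in practice you would persist it, as discussed below):
import java.time.LocalDateTime;
import java.util.List;

import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

@Component // requires @EnableScheduling on a configuration class
public class NewRecordsPublisher {

    private final MyRecordRepository repository;          // hypothetical Spring Data JPA repository
    private final KafkaTemplate<String, MyRecord> kafka;  // configured elsewhere

    // The last timestamp we have already handled (the "?" in the query above).
    private LocalDateTime lastSeen = LocalDateTime.MIN;

    public NewRecordsPublisher(MyRecordRepository repository, KafkaTemplate<String, MyRecord> kafka) {
        this.repository = repository;
        this.kafka = kafka;
    }

    @Scheduled(fixedDelay = 300_000) // every 5 minutes
    public void publishNewRecords() {
        List<MyRecord> records =
                repository.findByRecentTimeStampAfterOrderByRecentTimeStampAsc(lastSeen);
        for (MyRecord r : records) {
            kafka.send("my-topic", r);
            lastSeen = r.getRecentTimeStamp(); // advance the pointer as records are sent
        }
    }
}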
Another option is to query the data whose timestamp is less than 5 minutes old; in this case the query will look like this (pseudocode, since the actual syntax varies):
SELECT <...>
FROM <YOUR_TABLE>
WHERE RECENT_TIMESTAMP > now() - 5 minutes
ORDER BY RECENT_TIMESTAMP ASC
The first method is more robust: if your Spring Boot application is down for some reason, you'll be able to recover and query all your records from the point where it failed to send the data. On the other hand, you'll have to save this kind of pointer in some type of persistent storage.
The second solution is "easier" in the sense that you don't have any state to maintain, but on the other hand you will miss the data that arrived while the application was down.
In both cases you might want to use some kind of pagination, because you don't know how many records you'll get from the database, and if the number of records exceeds your memory limits the application will end up throwing an OutOfMemoryError.
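For example, reusing the hypothetical repository and KafkaTemplate from the sketch above, a paged variant could look roughly like this (findByRecentTimeStampAfter with a Pageable is an assumed derived query, not something from the question):
import java.time.LocalDateTime;

import org.springframework.data.domain.PageRequest;
import org.springframework.data.domain.Slice;
import org.springframework.data.domain.Sort;

// Page through the new records in chunks of 1000 instead of loading them all at once.
public void publishInPages(LocalDateTime lastSeen) {
    int page = 0;
    Slice<MyRecord> slice;
    do {
        slice = repository.findByRecentTimeStampAfter(
                lastSeen, PageRequest.of(page++, 1000, Sort.by("recentTimeStamp")));
        slice.forEach(r -> kafka.send("my-topic", r));
    } while (slice.hasNext());
}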
A completely different approach is sending the data to Kafka when you write to the database instead of when you read from it. At that point you have a data chunk of (probably) reasonably limited size, and in general you don't need any state, because you can store to the DB and send to Kafka from the same service, if the architecture of your application permits it.
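If you take that route, the write path can simply do both operations in one service method; a rough sketch with assumed names (note that the Kafka send is not covered by the database transaction, so the two writes can still diverge on failure):
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.stereotype.Service;
import org.springframework.transaction.annotation.Transactional;

@Service
public class RecordWriteService {

    private final MyRecordRepository repository;          // hypothetical JPA repository
    private final KafkaTemplate<String, MyRecord> kafka;  // configured elsewhere

    public RecordWriteService(MyRecordRepository repository, KafkaTemplate<String, MyRecord> kafka) {
        this.repository = repository;
        this.kafka = kafka;
    }

    @Transactional
    public void create(MyRecord record) {
        repository.save(record);        // write to the database
        kafka.send("my-topic", record); // publish the same record to Kafka right away
    }
}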
You can also look into the Kafka Connect component if it serves your purpose:
Kafka Connect is a tool for scalably and reliably streaming data between Apache Kafka® and other data systems. It makes it simple to quickly define connectors that move large data sets in and out of Kafka. Kafka Connect can ingest entire databases or collect metrics from all your application servers into Kafka topics, making the data available for stream processing with low latency. An export connector can deliver data from Kafka topics into secondary indexes like Elasticsearch, or into batch systems such as Hadoop for offline analysis.
I'm using the Debezium embedded connector to listen to changes in a database. It gives me a ChangeEvent<SourceRecord, SourceRecord> object.
I want to go on to use the Confluent plugin KCBQ, which uses SinkRecord to put data into BigQuery, but I'm not able to figure out how to join these two pieces.
Ultimately, how do I ensure that updates, deletes, and schema changes from MySQL are propagated to BigQuery from embedded Debezium?
You will possibly have to use a single message transform (SMT) if you need any custom transformations. However, for this scenario, since this seems to be a commonly needed transformation, the extract-new-record-state transform appears to accomplish it. It may be worth having a look and trying something similar (a sketch follows the links below).
https://issues.redhat.com/browse/DBZ-226
https://issues.redhat.com/browse/DBZ-1896
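Here is a rough sketch of the glue code, assuming the SourceRecord comes from ChangeEvent.value() of the embedded engine and that the resulting record is then handed to the KCBQ sink task (the class name is made up, and the transform's option names should be checked against your Debezium version):
import java.util.Map;

import io.debezium.transforms.ExtractNewRecordState;
import org.apache.kafka.connect.sink.SinkRecord;
import org.apache.kafka.connect.source.SourceRecord;

public class ChangeEventBridge {

    // The extract-new-record-state SMT flattens the Debezium envelope (before/after/op)
    // into the row's new state, which is what most sink connectors expect.
    private final ExtractNewRecordState<SourceRecord> unwrap = new ExtractNewRecordState<>();

    public ChangeEventBridge() {
        // keep delete markers so deletes can be propagated to the sink
        unwrap.configure(Map.of("drop.tombstones", "false", "delete.handling.mode", "rewrite"));
    }

    // Convert one Debezium SourceRecord (changeEvent.value()) into a SinkRecord for the sink task.
    public SinkRecord toSinkRecord(SourceRecord source, long offset) {
        SourceRecord unwrapped = unwrap.apply(source);
        if (unwrapped == null) {
            return null; // record dropped by the transform
        }
        return new SinkRecord(
                unwrapped.topic(),
                unwrapped.kafkaPartition() == null ? 0 : unwrapped.kafkaPartition(),
                unwrapped.keySchema(), unwrapped.key(),
                unwrapped.valueSchema(), unwrapped.value(),
                offset);
    }
}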
Use case:
We have an application that can have multiple users logged in simultaneously on a single database, or multiple users logged in on multiple databases. The relationship can be 1:M, M:M, or M:1. The issue is that we have very rigorous business logic that authenticates these users before letting them log in. Each user has a token of their own and generates their own session accordingly. I cannot fake users, as the app under test will not let them log in.
I can put together a load test using some genuine users that are already present in a single database, generate load using HTTP threads / virtual users from different machines, and make the number of sessions rise periodically.
How do I approach this specific condition: testing for 5x, i.e. 150K concurrent users and 250K sessions/min? I cannot have that many databases available to give me a window of 150K concurrent users. Please advise.
If the app allows concurrent logins, that might be one option: you have X pairs of credentials in a CSV file and a CSV Data Set Config set up to read these credentials. By default, each thread reads the next line on its next iteration, so if the app won't kick out the previously logged-in user, it might be a viable solution.
Another option is to log in once and save the token/session into a JMeter property. JMeter properties are global and can be read by multiple virtual users; this can be done using, for example, the __setProperty() function.
The best solution would be to generate as many test users in the system as you need, because each JMeter thread should represent a real user using a real browser as closely as possible.
Generating sessions is not an issue, but we cannot have data for 150K concurrent users. Plus there are open questions such as: will we target multiple databases or a single one, since we have provision for both in our target app? I need an answer from someone who has executed a use case like this one. I could put user IDs, passwords, and other required information in a CSV file and later read the data from it for each user, but the question is how many users: I cannot create 150K users and then set one up for each iteration. Is there a way around it?
I have a situation where I need to load a dimension table into Kafka.
This is because I want to expose all my application data through Kafka, as the common interface across all company departments/products.
But my dimensions are only correct as a snapshot; it is impossible to process them in incremental mode. For the Kafka Streams side I add a "batch_id" (the timestamp of the load operation). I know this is a hack, but it works fine for me, because I only want to stream the fact tables, which are very big, and I don't want to have two different ways of exposing data.
So now I have the ability to process my dimensions as a stream with a logical window defined by "batch_id".
But now I need to load the dimensions at a time interval (e.g. every 30 seconds). My dimensions' add/update/delete rate is very low; some dimensions are not updated for quarters at a time.
So my question is: is it possible to use bulk mode with some condition?
For example, only if any record in the table has a changed "update_datetime" column? Is it possible to mix bulk + timestamp mode?
As @cricket_007 explained in his comment, there is no such functionality.
So there are two ways to resolve this issue:
write a custom puller, or write a custom plugin for Kafka Connect.
I took the first route, because I use k8s, which is very convenient for maintaining a lot of different services, and a separate service is much easier to monitor.
But if you don't have a comfortable infrastructure for microservices (with resource negotiation, service discovery, automated CI/CD, etc.), I recommend writing a custom plugin for Kafka Connect.
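For reference, a minimal sketch of the first option (a custom puller): check whether the dimension's maximum update_datetime has moved, and only then snapshot the whole table to Kafka. The table, columns, topic, and JSON shape are placeholders:
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.sql.Timestamp;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class DimensionSnapshotPuller {

    private Timestamp lastMaxUpdate = new Timestamp(0);

    public void pollOnce(String jdbcUrl, KafkaProducer<String, String> producer) throws Exception {
        try (Connection conn = DriverManager.getConnection(jdbcUrl);
             Statement stmt = conn.createStatement()) {

            // 1. Cheap change check: has anything in the dimension been touched since the last poll?
            Timestamp maxUpdate;
            try (ResultSet rs = stmt.executeQuery("SELECT MAX(update_datetime) FROM dim_table")) {
                rs.next();
                maxUpdate = rs.getTimestamp(1);
            }
            if (maxUpdate == null || !maxUpdate.after(lastMaxUpdate)) {
                return; // nothing changed, skip the bulk load
            }

            // 2. Full snapshot, tagged with a batch_id (the load timestamp), as in the question.
            long batchId = System.currentTimeMillis();
            try (ResultSet rs = stmt.executeQuery("SELECT id, name FROM dim_table")) {
                while (rs.next()) {
                    // crude hand-built JSON, for illustration only
                    String value = "{\"batch_id\":" + batchId
                            + ",\"name\":\"" + rs.getString("name") + "\"}";
                    producer.send(new ProducerRecord<>("dim_table_topic", rs.getString("id"), value));
                }
            }
            producer.flush();
            lastMaxUpdate = maxUpdate;
        }
    }
}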
I am working on a new project with some existing code. It uses HibernateTemplate.findByNamedQueryAndNamedParam to invoke a stored procedure in the database.
When I execute the stored procedure directly on the database, it completes in 2 or 3 seconds. But when I execute it via the HibernateTemplate method, it takes anywhere between 2 and 34 minutes.
When I turn show_sql on, I see thousands of select statements being triggered. Any pointers on possible reasons why this could happen?
Your problem may be caused by a bad configuration of the 2nd-level cache and the query cache.
Check that the expiration time of your query cache is the same as the expiration time of the 2nd-level cache for all entities fetched by your query (look into the ehcache config file, or that of whatever other 2nd-level cache provider you are using). The query cache only stores the identifiers returned by the query; the entities themselves are loaded from the 2nd-level cache, so if those entries have expired, Hibernate fetches each entity with its own select, which can produce the thousands of selects you are seeing.