In our project using WSO2 ESB(EI 6.1.0) we have some scenarios where we need to have to execute 2 tasks in parallel.
For example, we have a step for inserting steps of the execution in an audit table, and then proceed to the next steps.
For doing that, we use a clone mediator with a continueParent=true and inside the clone we have the sequence that inserts the audita data inside the db. As we set the continueParent=true, the next step doesn't wait the audit insertion to execute.
This is working fine, but in a high load test, we found that the heap is growing very large.
And checking the heap I see a lot of the instances of the clone mediator, and as it clones the context, it is growing a lot and causing OutOfMemory.
My question is:
Is there a way to have a similar behavior but not using the clone mediator?
Thanks,
If you need to execute asynchronous operations like inserting data in an audit table, you should consider the use of JMS queues in your primary mediation : only send those data in such queues and define a dedicated proxy service or jms inbound endpoint that will consume those messages and insert the data in your table
Related
I have a ReportGeneration lambda that takes request from client and adds following entries to a DDB table.
Customer ID <hash key>
ReportGenerationRequestID(UUID) <sort key>
ExecutionStartTime
ReportExecutionStatus < workflow status>
I have enabled DDB stream trigger on this table and a create entry in this table triggers the report generation workflow. This is a multi-step workflow that takes a while to complete.
Where ReportExecutionStatus is the status of the report processing workflow.
I am supposed to maintain the history of all report generation requests that a customer has initiated.
Now What I am trying to do is avoid concurrent processing requests by the same customer, so if a report for a customer is already getting generated don’t create another record in DDB ?
Option Considered :
query ddb for the customerID(consistent read) :
- From the list see if any entry is either InProgress or Scheduled
If not then create a new one (consistent write)
Otherwise return already existing
Issue: If customer clicks in a split second to generate report, two lambdas can be triggered, causing 2 entires in DDB and two parallel workflows can be initiated something that I don’t want.
Can someone recommend what will be the best approach to ensure that there are no concurrent executions (2 worklflows) for the same Report from same customer.
In short when one execution is in progress another one should not start.
You can use ConditionExpression to only create the entry if it doesn't already exist - if you need to check different items, than you can use DynamoDB Transactions to check if another item already exists and if not, create your item.
Those would be the ways to do it with DynamoDB, getting a higher consistency.
Another option would be to use SQS FIFO queues. You can group them by the customer ID, then you wouldn't have concurrent processing of messages for the same customer. Additionally with this SQS solution you get all the advantages of using SQS - like automated retry mechanisms or a dead letter queue.
Limiting the number of concurrent Lambda executions is not possible as far as I know. That is the whole point of AWS Lambda, to easily scale and run multiple Lambdas concurrently.
That said, there is probably a better solution for your problem using a DynamoDB feature called "Strongly Consistent Reads"
By default reads to DynamoDB (if you use the AWS SDK) are eventually consistent, causing the behaviour you observed: Two writes to the same table are made but your Lambda only was able to notice one of those writes.
If you use Strongly consistent reads, the documentation states:
When you request a strongly consistent read, DynamoDB returns a response with the most up-to-date data, reflecting the updates from all prior write operations that were successful.
So your Lambda needs to do a strongly consistent read to your table to check if the customer already has a job running. If there is already a job running the Lambda does not create a new job.
I have a service that listens to multiple queues and saves the data to a database.
One queue gives me a person.
Now if I code it really simple. I just get one message from the queue at a time.
I do the following
Start transaction
Select from person table to check if it exists.
Either update existing or create a new entity
repository.save(entity)
End transaction
The above is clean and robust. But I get alot of messages its not fast enough.
To improve performance I have done this.
Fetch 100 messages from queue
then
Start transaction
Select all persons where id in (...) in one query using ids from incomming persons
Iterate messages and for each one check if it was selected above. If yes then update it if not then create a new
Save all changes with batch update/create
End transaction
If its a simple message the above is really good. It performs. But if the message is complicated or the logic I should do when I get the message is then the above is not so good since there is a change some of the messages will result in a rollback and the code becomes hard to read.
Any ideas on how to make it run fast in a smarter way?
Why do you need to rollback? Can't you just not execute whatever it is that then has to be rolled back?
IMO the smartest solution would be to code this with a single "upsert" statement. Not sure which database you use, but PostgreSQL for example has the ON CONFLICT clause for inserts that can be used to do updates if the row already exists. You could even configure Hibernate to use that on insert by using the #SQLInsert annotation.
I'm a newbie in ETL processing. I am trying to populate a data mart through ETL and have hit a bump. I have 4 ETL tasks(Each task filling a particular table in the Mart) and the problem is that I need to perform them in a particular order so as to avoid constraint violations like Foreign Key constraints. How can I achieve this? Any help is really appreciated.
This is a snap of my current ETL:
Create a separate Data Flow Task for each table you're populating in the Control Flow, and then simply connect them together in the order you need them to run in. You should be able to just copy/paste the components from your current Data Flow to the new ones you create.
The connections between Tasks in the Control Flow are called Precendence Constraints, and if you double-click on one you'll see that they give you a number of options on how to control the flow of your ETL package. For now though, you'll probably be fine leaving it on the defaults - this will mean that each Data Flow Task will wait for the previous one to finish successfully. If one fails, the next one won't start and the package will fail.
If you want some tables to load in parallel, but then have some later tables wait for all of those to be finished, I would suggest adding a Sequence Container and putting the ones that need to load in parallel into it. Then connect from the Sequence Container to your next Data Flow Task(s) - or even from one Sequence Container to another. For instance, you might want one Sequence Container holding all of your Dimension loading processes, followed by another Sequence Container holding all of your Fact loading processes.
A common pattern goes a step further than using separate Data Flow Tasks. If you create a separate package for every table you're populating, you can then create a parent package, and use the Execute Package Task to call each of the child packages in the correct order. This is fantastic for reusability, and makes it easy for you to manually populate a single table when needed. It's also really nice when you're testing, as you don't need to keep disabling some Tasks or re-running the entire load when you want to test a single table. I'd suggest adopting this pattern early on so you don't have a lot of re-work to do later.
I'm writing an application with Kafka Streams (v0.10.0.1) and would like to enrich the records I'm processing with lookup data. This data (timestamped file) is written into a HDFS directory on daily basis (or 2-3 times a day).
How can I load this in the Kafka Streams application and join to the actual KStream?
What would be the best practice to reread the data from HDFS when a new file arrives there?
Or would it be better switching to Kafka Connect and write the RDBMS table content to a Kafka topic which can be consumed by all the Kafka Streams application instances?
Update:
As suggested Kafka Connect would be the way to go. Because the lookup data is updated in the RDBMS on a daily basis I was thinking about running Kafka Connect as a scheduled one-off job instead of keeping the connection always open. Yes, because of semantics and the overhead of keeping a connection always open and making sure that it won't be interrupted..etc. For me having a scheduled fetch in this case looks safer.
The lookup data is not big and records may be deleted / added / modified. I don't know either how I can always have a full dump into a Kafka topic and truncate the previous records. Enabling log compaction and sending null values for the keys that have been deleted would probably won't work as I don't know what has been deleted in the source system. Additionally AFAIK I don't have a control when the compaction happens.
The recommend approach is indeed to ingest the lookup data into Kafka, too -- for example via Kafka Connect -- as you suggested above yourself.
But in this case how can I schedule the Connect job to run on a daily basis rather than continuously fetch from the source table which is not necessary in my case?
Perhaps you can update your question you do not want to have a continuous Kafka Connect job running? Are you concerned about resource consumption (load on the DB), are you concerned about the semantics of the processing if it's not "daily udpates", or...?
Update:
As suggested Kafka Connect would be the way to go. Because the lookup data is updated in the RDBMS on a daily basis I was thinking about running Kafka Connect as a scheduled one-off job instead of keeping the connection always open. Yes, because of semantics and the overhead of keeping a connection always open and making sure that it won't be interrupted..etc. For me having a scheduled fetch in this case looks safer.
Kafka Connect is safe, and the JDBC connector has been built for exactly the purpose of feeding DB tables into Kafka in a robust, fault-tolerant, and performant way (there are many production deployments already). So I would suggest to not fallback to "batch update" pattern just because "it looks safer"; personally, I think triggering daily ingestions is operationally less convenient than just keeping it running for continuous (and real-time!) ingestion, and it also leads to several downsides for your actual use case (see next paragraph).
But of course, your mileage may vary -- so if you are set on updating just once a day, go for it. But you lose a) the ability to enrich your incoming records with the very latest DB data at the point in time when the enrichment happens, and, conversely, b) you might actually enrich the incoming records with stale/old data until the next daily update completed, which most probably will lead to incorrect data that you are sending downstream / making available to other applications for consumption. If, for example, a customer updates her shipping address (in the DB) but you only make this information available to your stream processing app (and potentially many other apps) once per day, then an order processing app will ship packages to the wrong address until the next daily ingest will complete.
The lookup data is not big and records may be deleted / added / modified. I don't know either how I can always have a full dump into a Kafka topic and truncate the previous records. Enabling log compaction and sending null values for the keys that have been deleted would probably won't work as I don't know what has been deleted in the source system.
The JDBC connector for Kafka Connect already handles this automatically for you: 1. it ensures that DB inserts/updates/deletes are properly reflected in a Kafka topic, and 2. Kafka's log compaction ensures that the target topic doesn't grow out of bounds. Perhaps you may want to read up on the JDBC connector in the docs to learn which functionality you just get for free: http://docs.confluent.io/current/connect/connect-jdbc/docs/ ?
I have 10 large files in production, and we need to read each line from the file and convert comma separated values into some value object and send it to JMS queue and also insert into 3 different table in the database
if we take 10 files we will have 33 million lines. We are using spring batch(MultiResourceItemReader) to read the earch line and have write to write it o db and also send it to JMS. it roughly takes 25 hrs to completed all.
Eventhough we have 10 system in production, presently we use only one system to run this job( i am new to spring batch, and not aware how spring supports in load balancing)
Since we have only one system we configured data source to connect to db and max connection is specified as 25.
To improve the performance we thought to use spring multi thread support. started to use 5 threads. we could see the performance improvement and could see everything completed in 10 hours.
Here i Have below questions:
1) if i process using 5 threads, we will publish huge amount of data into JMS queue. Will queue support huge data.Note we have 10 systems in production to read JMS Message from the queue.
2) Using thread(5) and 1 production system is good approach (or) instead of spring batch insert the data into db i can create a rest service and spring batch calls the rest api to insert the data into db and let spring api inserts data into JmS queue(again, if spring batch process file annd use rest to insert data into db, per second i will read 4 or 5 lines and will call the rest api. Note we have 10 production system). If use rest API approach will my system support(rest can handle huge request using load balancer, and also JMS can handle huge and huge message) or using thread in spring batch app using 1 production system is better approach.
Different JMS providers are going to have different limits, but in general messaging can easily handle millions of rows in a small period of time.
Messaging is going to be faster than inserting directly into the database because a message has very little data to manage (other than JMS properties) instead of the overhead of a complete RDBMS or NoSQL database or whatever, messaging out performs them all.
Assuming the individual lines can be processed in any order, then sending all data to the same queue and have n consumers working the back-end is a sound solution.
Your big bottleneck, however, is getting the data into the database. If the destination table(s) have m/any keys/indices on them, there is going to be serious contention because each insert/update/delete needs to rebuild the indices, so even though you have n different consumers trying to update the database, they're going to trounce on each other as the transactions are completed.
One solution I've seen is disabling all database constrains before you start and enabling at the end, and hopefully if things worked the data is consistent and usable; of course, the risk is there was bad data that you didn't catch and now you need to clean up or reattempt the load
A better solution might be to transform the files into a single file that can be batch loaded into the database using a platform-specific tool. These tools often disable indexes, contraint checking, and anything else that's going to slow things down - often times bypassing SQL itself - to get performance.