Pump data into ActiveMQ from a JDBC data source

We have an application provided by a third party which takes a stream of market data (provided by said third party), and writes it into a JDBC compatible database.
The only configuration parameters it has are the JDBC connection string, plus settings allowing us to pick what pieces of data we'd like to be stored in this database.
This is very good for static data, but we'd like to feed this data into our internal ActiveMQ messaging fabric (in addition to writing it into the DB).
The database updates are triggered by pushes of market data to us. I'd like to have this application write the data directly to a set of MQ topics by implementing some kind of JDBC "facade" that would re-route the data directly into MQ.
What I don't want to do is poll the database for new information - as I want to keep the same fluidity of the data (e.g. fast moving stocks will generate a lot more data than slow moving - and we'd want to retain this).
Advice and pointers are very much welcome!

Camel is the answer, but potentially only if you're OK with polling the database. It's great for integration problems like this. If there were some other trigger that you could work with, you could use that to cause the database to be read instead of polling on a timer.
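For what it's worth, the "facade" idea from the question avoids polling altogether: wrap the database handle so that every write is forwarded to a publisher at the moment it happens. Below is a minimal sketch of that idea, written in Python against sqlite3 as a stand-in rather than as a real JDBC driver wrapper. `PublishingCursor` and the `publish` callback are hypothetical names; in practice `publish` would send to an ActiveMQ topic.

```python
import sqlite3


class PublishingCursor:
    """Wraps a DB-API cursor and forwards each write to a publish callback."""

    WRITE_VERBS = ("INSERT", "UPDATE", "DELETE")

    def __init__(self, cursor, publish):
        self._cursor = cursor
        self._publish = publish  # hypothetical: would send to an MQ topic

    def execute(self, sql, params=()):
        # Let the real database run the statement first, so a statement
        # that fails is never published.
        result = self._cursor.execute(sql, params)
        if sql.lstrip().upper().startswith(self.WRITE_VERBS):
            self._publish({"sql": sql, "params": list(params)})
        return result

    def __getattr__(self, name):
        # Everything else (fetchall, close, ...) passes straight through.
        return getattr(self._cursor, name)


published = []  # stand-in for the messaging fabric
conn = sqlite3.connect(":memory:")
cur = PublishingCursor(conn.cursor(), published.append)
cur.execute("CREATE TABLE ticks (symbol TEXT, price REAL)")
cur.execute("INSERT INTO ticks VALUES (?, ?)", ("ACME", 101.5))
cur.execute("SELECT symbol, price FROM ticks")
rows = cur.fetchall()
```

Because messages are emitted per write, fast-moving symbols naturally produce more messages than slow ones, which is the property the question wants to keep.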

ElasticSearch: Jest vs Rest vs TransportClient vs NodeClient

I have gone through the official documentation at https://www.elastic.co/blog/found-interfacing-elasticsearch-picking-client
But it does not give any benchmarks or performance numbers to help choose among the clients. And I am finding it non-trivial to set up a TransportClient or a NodeClient because the documentation for that is also really sparse, with little to no examples whatsoever.
So if someone has already benchmarked these clients, I would really appreciate the numbers, so that I can focus on tuning an established client rather than evaluating which client to choose.
Our application is a write-heavy application and we plan to have a 50-shard, 50-replica ES cluster for that.
All those clients are fine for querying and they all have their pros and cons (below list is not exhaustive):
A Node client provides a single hop into the cluster but since it will also be part of the cluster it can also induce too much chatter within the cluster
A Transport client is not part of the cluster, hence requires a two-hop roundtrip, and communicates with a single node at a time in a round-robin fashion (from the list provided during its construction)
Jest is basically the missing client for the ES REST interface
If you feel like you don't need everything Jest has to offer and simply want to interact with a few endpoints, you might as well create your own REST client using Spring RestTemplate, Apache HTTP, etc.
If you're going to have a write-heavy application I suggest you don't even use any of those clients at all. The main reason is that they are all synchronous in nature and if any component of your architecture or the network were to fail for some reason, then you'd lose data, and that might not be an option for you.
If you have plenty of data to ingest, you normally go the asynchronous way, i.e. storing your data in a temporary (yet durable) queue (Kafka, Redis, JMS, etc) and then let another process stream it to ES. There are many ways to do that, but a very simple one is to use Logstash for that.
Whether you decide to store your data in Kafka or JMS or Redis, you can then let Logstash consume your data and stream it to ES, i.e. you let Logstash worry about the heavy write part, which it does very well. That can be achieved very easily with:
a kafka or redis or stomp input
a few filters to massage your data
an elasticsearch output to forward the resulting data to ES via the bulk endpoint.
With that kind of well-tuned setup, you can handle very heavy write loads without needing to worry about which client you want to use and how you need to tune it. The question is still open for querying, though. But since the write part is paramount in your case, you need to make it solid, and the only serious way to do that is to go asynchronous and let a well-developed and tested ETL tool (such as Logstash or fluentd) do it for you.
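To make the buffering idea concrete, here is a rough sketch of what the consuming side does: drain a batch from the queue and format it for the bulk endpoint. It is only an illustration. The in-memory `queue.Queue` stands in for a durable queue (Kafka, Redis, JMS), the `drain_batch` and `to_bulk_body` helpers and the `events` index name are made up for the example, and the exact action metadata accepted by `_bulk` depends on your ES version.

```python
import json
import queue


def drain_batch(buffer, max_docs):
    """Take up to max_docs documents off the queue without blocking."""
    batch = []
    while len(batch) < max_docs:
        try:
            batch.append(buffer.get_nowait())
        except queue.Empty:
            break
    return batch


def to_bulk_body(index_name, docs):
    """Build a newline-delimited JSON body: one action line, then one source line."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index_name}}))
        lines.append(json.dumps(doc))
    # The bulk body has to end with a newline.
    return "\n".join(lines) + "\n"


# Producer side: the application only enqueues and never talks to ES directly.
buffer = queue.Queue()
for i in range(5):
    buffer.put({"event_id": i})

# Consumer side: batch what is available and send it in one bulk request.
batch = drain_batch(buffer, max_docs=3)
body = to_bulk_body("events", batch)
```

The point of the split is that a slow or unavailable ES cluster only makes the queue grow; the producer keeps going and nothing is lost as long as the queue itself is durable.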
UPDATE
It is worth noting that as of ES 5.0, there will be a new Java REST client available.

How to overcome data mismatch on several database

My system has more than one project, and each project connects to its own DB. When an insert transaction occurs in any project, the record is inserted into all of the DBs, but when an update occurs in a project, the update is applied only to that project's DB and not to the other projects' DBs. That is how my system works. Over time, the data in the DBs diverges. Without changing this process, what can I do to overcome this data mismatch problem?
Suppose a transaction occurs on system-1:
Transaction --> Update --> Modification occurs only in the system-1 DB, not in the system-2 or system-3 DBs
Any suggestion is welcome. If you have any questions, please ask. Thanks in advance.
I'm currently working with almost the same project architecture. Our solution is an Orchestration module that manages a Single_entry_point module. The latter is responsible for unifying the information from the Upstream (a cluster of different databases and service systems) and then uploading/distributing it to a Downstream (Single_Data_Warehouse). That way you can guarantee that all your information is current at every moment. The Orchestrator communicates with all other modules through service messages.
This design is based on the Pipes and Filters pattern.
I think that in your case you only need to add logic for updating the DB information, and can reuse everything you have at this point, provided you spend some time on such a Single_entry_point module that deals not only with inserts but with update transactions too.
When it comes to “eyeballing” validation of databases (done by SQL scripting), you should definitely consider Informatica, specifically for validating data as it is being moved into production systems. The data in your production systems has to be right in order to support your business decision making. Informatica Data Validation Option provides the ETL testing automation and management capabilities to ensure that your production systems are not compromised by the data update process.
If you find that this option doesn't suit your needs, here are some resources I found on this topic:
database-synchronization-an-overview-of-approaches
MSDN Synchronizing Databases
how-to-synchronize-databases-in-different-servers-in-sql-server-2008
sql-comparison-sdk-synchronizing-databases

How can couchbase be used as a caching layer on top of oracle?

I have Oracle as my main RDBMS for reads and writes, but I want to use Couchbase as a caching layer, as it has map-reduce and can be used like memcached. Any idea how I can implement that, and how to transfer and update data in the caching layer when rows in Oracle are inserted or updated?
You haven't said anything about your current performance issues.
I have seen too many applications which do not really take advantage of RDBMS/SQL features, especially if an ORM sits in between.
The usual "cure" is then to put another cache on top of the database, and to synchronize it across a cluster manually using IP multicast (SwarmCache, for example), message queues (JMS) or nightly import jobs. That can create more problems in the end, and it increases system complexity.
So my answer to your question is: I would not do it, as long as there is room for improvement regarding your data model and/or queries.
I believe your question is about database synchronization. This can be done by combining DB dependencies with a "read-through" feature, though I am not sure whether Couchbase offers either. With a DB dependency, cached items depend on DB items, and if the DB items are updated or deleted, the corresponding dependent item in the cache is removed. At the same time, you can write a "read-through" handler that is executed at the server level; its main purpose is to load fresh copies of the removed items into the cache. So, basically, you write the handler once and register it with the cache server, and the cache server executes it when needed to sync new items in the DB with the cache. This reading on DB synchronization can be useful. It's based on a product, NCache.
So your question is not directly related to Couchbase but, as others stated, more about how you can be alerted when data changes in your Oracle instance.
One thing that is not well known is the Oracle Database Change Notification feature that is quite cool for this:
http://docs.oracle.com/cd/E11882_01/java.112/e16548/dbchgnf.htm
So you can create an application that is listening to your changes and pushes the data into Couchbase.
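The listener-plus-cache pattern described in the answers above can be sketched roughly as follows. This is not Couchbase or Oracle API code: `ReadThroughCache`, `load` and `on_change` are hypothetical names, a plain dict stands in for the Oracle table, and `on_change` is simply what your change-notification listener would call.

```python
class ReadThroughCache:
    """Serves reads from memory and reloads an entry after it is invalidated."""

    def __init__(self, load):
        self._load = load  # hypothetical loader, e.g. a SELECT against Oracle
        self._items = {}

    def get(self, key):
        # Read-through: on a miss, fetch from the database and keep the result.
        if key not in self._items:
            self._items[key] = self._load(key)
        return self._items[key]

    def on_change(self, key):
        """Called by the change listener when the row behind key changes."""
        self._items.pop(key, None)


database = {"42": "old value"}  # stand-in for the Oracle table
cache = ReadThroughCache(database.__getitem__)

first = cache.get("42")          # miss, loaded from the database
database["42"] = "new value"     # row changes behind the cache's back
stale = cache.get("42")          # still the old value: nothing invalidated it
cache.on_change("42")            # the listener reports the change
fresh = cache.get("42")          # reloaded on the next read
```

The `stale` read is the whole problem in one line: without a notification from the database, the cache has no way to know the row changed.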

Can effective database replication be done through an asynchronous messaging system?

Given a pre-production Oracle database and a production Oracle database, if around 300K records need to be transferred from the former to the latter, would a messaging system such as an ESB/JMS/TIBCO be a good option?
I don't know Oracle, but if I was trying to asynchronously replicate data with SQL Server, I would use their own internal tools to accomplish it. I would imagine Oracle has similar tools to run jobs to copy between two Oracle databases.
However, I do have quite a bit of experience using an ESB (Mule) with ActiveMQ to replicate data across database technologies. Specifically I've done SQL Server->Mongo and MySQL->Mongo with Mule and ActiveMQ.
So far I've found Mule to be a wonderful solution - especially coupled with ActiveMQ. I've been able to replicate about 400k Wordpress blog posts (from MySQL) to Mongo in about 20 minutes. To transfer 100k articles from a CMS system we were able to get it done in about 30 minutes.
I figured I'd weigh in because you mentioned an ESB and messaging. I would go that route if the integration points are heterogeneous. If you do go down that route, Mule is awesome.
If you are trying to move data from an old database to a new one rather than replicating asynchronously, a simpler method might be a SQL dump and import. Assuming your old database lets you "export" it, the export gives you a SQL file. You can then open that file in a text editor such as Notepad and copy-paste the statements into the SQL executor of your new database, and it will re-create all your tables and populate them with the old data.
Actually using the database tools will be the recommended method for replicating data between databases.
When using messaging, one does not get the guarantee that the data will arrive in the same sequence in which it was sent, or that relationships between tables will be honored, which can result in replication errors unless one builds some mechanism on the JMS receiver side to maintain the sequence. But that looks like a lot of overhead.
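For reference, the receiver-side mechanism mentioned above usually amounts to a resequencer: the sender stamps each message with a sequence number, and the receiver holds back anything that arrives early. A minimal sketch, assuming the sender assigns consecutive sequence numbers starting at 1 (the `Resequencer` class and the payload strings are made up for the example):

```python
class Resequencer:
    """Delivers messages in sequence order, buffering those that arrive early."""

    def __init__(self, deliver, first_seq=1):
        self._deliver = deliver
        self._next_seq = first_seq
        self._pending = {}

    def receive(self, seq, payload):
        self._pending[seq] = payload
        # Release every message that is now contiguous with what was delivered.
        while self._next_seq in self._pending:
            self._deliver(self._pending.pop(self._next_seq))
            self._next_seq += 1


applied = []
resequencer = Resequencer(applied.append)
arrivals = [(2, "insert child"), (1, "insert parent"), (3, "update child")]
for seq, payload in arrivals:
    resequencer.receive(seq, payload)
# applied is now in send order: parent first, then child, then the update
```

Note that this sketch waits forever if a message is lost, so a real implementation would also need timeouts or redelivery, which is part of the overhead being described.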

Performance problems with external data dependencies

I have an application that talks to several internal and external sources using SOAP, REST services or just database stored procedures. Obviously, performance and stability are major issues that I am dealing with. Even when the endpoints are performing at their best, for large sets of data, I easily see calls that take tens of seconds.
So, I am trying to improve the performance of my application by prefetching the data and storing locally - so that at least the read operations are fast.
While my application is the major consumer and producer of data, some of the data can also change from outside my application, which I have no control over. If I use caching, I would never know when to invalidate the cache when such data changes from outside my application.
So I think my only option is to have a job scheduler running that consistently updates the database. I could prioritize the users based on how often they login and use the application.
I am talking about 50 thousand users, and at least 10 endpoints that are terribly slow and can sometimes take a minute for a single call. Would something like Quartz give me the scale I need? And how would I get around the scheduler becoming a single point of failure?
I am just looking for something that doesn't require high maintenance and speeds up at least some of the less complicated subsystems, if not most. Any suggestions?
This does sound like you might need a data warehouse. You would update the data warehouse from the various sources, on whatever schedule was necessary. However, all the read-only transactions would come from the data warehouse, and would not require immediate calls to the various external sources.
This assumes you don't need realtime access to the most up to date data. Even if you needed data accurate to within the past hour from a particular source, that only means you would need to update from that source every hour.
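As a rough illustration of the "update each source only as often as its data needs" idea, the scheduling decision boils down to comparing each source's age against its allowed staleness. The function name, the source names and the intervals below are all made up for the example; whatever scheduler you use would call something like this on each tick.

```python
def due_sources(last_refreshed, max_age, now):
    """Return the sources whose local copy is older than its allowed staleness.

    All times are in seconds; a source that was never refreshed is always due.
    """
    return sorted(
        name
        for name, limit in max_age.items()
        if now - last_refreshed.get(name, float("-inf")) >= limit
    )


# Hypothetical sources: pricing may be at most an hour old, profiles a day.
max_age = {"pricing": 3600, "profiles": 86400}
last_refreshed = {"pricing": 0, "profiles": 0}

due = due_sources(last_refreshed, max_age, now=7200)
```

Two hours in, only `pricing` needs a refresh; `profiles` is left alone until its day is up, so the slow endpoints are called no more often than necessary.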
You haven't said what platforms you're using. If you were using SQL Server 2005 or later, I would recommend SQL Server Integration Services (SSIS) for updating the data warehouse. It's made for just this sort of thing.
Of course, depending on your platform choices, there may be alternatives that are more appropriate.
Here are some resources on SSIS and data warehouses. I know you've stated you will not be using Microsoft products. I include these links as a point of reference: these are the products I was talking about above.
SSIS Overview
Typical Uses of Integration Services
SSIS Documentation Portal
Best Practices for Data Warehousing with SQL Server 2008
