I want to take all BizTalk tracking data and move it into Elasticsearch.
Is there a way to access the data before it's put into the tracking database?
Or do I have to extract it from the tracking database and then load it into Elasticsearch?
Can I use the BAM API for this?
You should not change any BizTalk Stored Procedures.
How current do you need the data?
Here are two thoughts:
1. Take the tracking Archives and load them into another database for Elasticsearch.
2. Do option 1 and also point Elasticsearch at the DTA database. Querying DTA separately is acceptable; just be sure not to block BizTalk's write operations (a rough sketch follows below).
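A minimal sketch of what option 2 could look like from Python; the DTA view name, column names, and connection details below are assumptions, so check them against your own BizTalkDTADb before relying on anything like this:

```python
# Minimal sketch: read tracked events from the DTA database with a read-only,
# non-blocking query and bulk-index them into Elasticsearch. The view name
# (dtav_MessageFacts) and column names are placeholders.
import json
import pyodbc
import requests

DTA_CONN = ("DRIVER={ODBC Driver 17 for SQL Server};"
            "SERVER=bts-sql;DATABASE=BizTalkDTADb;Trusted_Connection=yes")
ES_URL = "http://localhost:9200"

def index_tracking_events(since):
    conn = pyodbc.connect(DTA_CONN)
    cursor = conn.cursor()
    # NOLOCK keeps this purely read-only so BizTalk's writes are never blocked.
    cursor.execute(
        "SELECT MessageInstanceId, EventTimestamp, PortName, Direction "
        "FROM dbo.dtav_MessageFacts WITH (NOLOCK) "
        "WHERE EventTimestamp > ?", since)

    # Build an Elasticsearch _bulk body: one action line plus one document line
    # per tracked event.
    lines = []
    for msg_id, ts, port, direction in cursor.fetchall():
        lines.append(json.dumps({"index": {"_index": "biztalk-tracking",
                                           "_id": str(msg_id)}}))
        lines.append(json.dumps({"timestamp": str(ts),
                                 "port": port,
                                 "direction": direction}))
    conn.close()

    if lines:
        resp = requests.post(f"{ES_URL}/_bulk",
                             data="\n".join(lines) + "\n",
                             headers={"Content-Type": "application/x-ndjson"})
        resp.raise_for_status()
```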
I need to aggregate data coming from DynamoDB to AWS Redshift, and I need it to be accurate and in sync. For the ETL I'm planning to use DynamoDB Streams, a Lambda transform, and Kinesis Firehose into, finally, Redshift.
What would the process be for updated data? Everything I find is fine-tuned just for one-way ETL. What would be the best option to keep both (DynamoDB and Redshift) in sync?
These are my current options:
Trigger an "UPDATE" command directly from Lambda to Redshift (blocking).
Aggregate all update/delete records and process them on an hourly basis "somehow".
Any experience with this? Or maybe Redshift isn't the best solution? I need to extract aggregated data for reporting/dashboarding on 2 TB of data.
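For context, the Lambda transform step I have in mind looks roughly like this; the delivery stream name and the attribute flattening are just placeholders:

```python
# Sketch of a Lambda triggered by DynamoDB Streams that flattens each change
# record and forwards it to a Kinesis Data Firehose delivery stream bound for
# Redshift. Only string/number attributes are handled in this sketch.
import json
import boto3

firehose = boto3.client("firehose")
DELIVERY_STREAM = "dynamo-to-redshift"  # hypothetical delivery stream name

def _plain(attr):
    # Convert a DynamoDB-typed attribute ({'S': 'x'} / {'N': '1'}) to a plain value.
    if "S" in attr:
        return attr["S"]
    if "N" in attr:
        return float(attr["N"])
    return None

def handler(event, context):
    records = []
    for rec in event["Records"]:
        # REMOVE events have no NewImage, so fall back to the key attributes.
        image = rec["dynamodb"].get("NewImage") or rec["dynamodb"]["Keys"]
        row = {k: _plain(v) for k, v in image.items()}
        row["_event"] = rec["eventName"]  # INSERT / MODIFY / REMOVE
        records.append({"Data": (json.dumps(row) + "\n").encode("utf-8")})
    if records:
        # Note: put_record_batch accepts at most 500 records per call.
        firehose.put_record_batch(DeliveryStreamName=DELIVERY_STREAM,
                                  Records=records)
```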
The Redshift COPY command supports using a DynamoDB table as a data source. This may or may not be a possible solution in your case, as there are some limitations to the process; data type and table naming differences can trip you up. It also isn't a great option for incremental updates, but it can be done if the amount of data is small and you can design the updating SQL.
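For reference, a minimal sketch of that COPY run from Python with psycopg2; the cluster endpoint, table names, IAM role, and READRATIO value are placeholders:

```python
# Sketch of running the Redshift COPY with a DynamoDB table as the source.
import psycopg2

conn = psycopg2.connect(host="my-cluster.example.us-east-1.redshift.amazonaws.com",
                        port=5439, dbname="analytics", user="etl", password="...")
with conn, conn.cursor() as cur:
    # READRATIO caps how much of the DynamoDB table's read capacity COPY uses.
    cur.execute("""
        COPY public.orders
        FROM 'dynamodb://Orders'
        IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
        READRATIO 50;
    """)
conn.close()
```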
Another route is DynamoDB Streams. This routes data updates through Kinesis, which can be used to update Redshift at a reasonable rate and helps keep the two databases in sync. It will likely make the data available to Redshift as quickly as possible.
Remember that you are not going to get Redshift to match on a moment-by-moment basis. Is this what you mean by "in sync"? These are very different databases, with very different use cases and architectures to support those use cases. Redshift works in big chunks of data that change more slowly than what typically happens in DynamoDB. Redshift will be updated in "chunks" at a more infrequent rate than DynamoDB. I've built systems that bring this down to 5-minute intervals, but 10-15 minute update intervals is where most end up when trying to keep a warehouse in sync.
The other option is to update Redshift infrequently (hourly?) and use federated queries to combine "recent" data with "older" data stored in Redshift. This is a more complicated solution and will likely mean changes to your data model to support it, but it is doable. So only go here if you really need to query very recent data right alongside older and bigger data.
The answer best suited to my case was to use a staging table with an UPSERT operation (or Redshift's interpretation of it).
I found this approach valid for my use case, where I needed to:
Keep Redshift as up to date as possible without causing blocking.
Work with complex DynamoDB schemas, which can't be used as a source directly, so the data has to be transformed to fit the Redshift DDL.
This is the architecture:
So we constantly load from Kinesis using the same COPY mechanism, but instead of loading directly into the final table, we use a staging one. Once the batch is loaded into staging, we look for duplicates between the two tables. The duplicates in the final table are DELETED before the INSERT is performed.
After trying this, I've found that all the DELETE operations for the same batch perform better when enclosed in a single transaction. A VACUUM operation is also needed afterwards to re-sort the table after the new load.
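Roughly, the merge step looks like the sketch below; it assumes the batch has already been COPYed into the staging table (for example by Firehose), and the table names, key column, and connection details are placeholders:

```python
# Rough sketch of the delete-then-insert merge into the final table.
import psycopg2

conn = psycopg2.connect(host="my-cluster.example.redshift.amazonaws.com",
                        port=5439, dbname="analytics", user="etl", password="...")

# DELETE + INSERT run inside one transaction (psycopg2 commits on exit of the
# `with conn` block), matching the note above about a single transaction.
with conn, conn.cursor() as cur:
    cur.execute("""
        DELETE FROM public.events
        USING staging.events
        WHERE public.events.event_id = staging.events.event_id;
    """)
    cur.execute("INSERT INTO public.events SELECT * FROM staging.events;")
    # TRUNCATE commits implicitly in Redshift and clears staging for the next batch.
    cur.execute("TRUNCATE staging.events;")

# VACUUM cannot run inside a transaction block, so run it after the commit.
conn.autocommit = True
with conn.cursor() as cur:
    cur.execute("VACUUM public.events;")
conn.close()
```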
For further detail on the UPSERT operation, I've found this source very useful.
How do I transfer a huge amount of data (nearly 10 TB) from an Oracle DB to Snowflake in a matter of hours? I see some options like Hevo and Fivetran, which are paid. However, I need the data to be moved fast so that I don't have to keep the production system down.
The fastest way to get data into Snowflake is in 10 MB to 100 MB chunk files. You can then leverage a big warehouse to COPY INTO all of the chunk files at once. I can't speak to how to get the data out of Oracle DB quickly to S3/Azure Blob, though, especially while the system is running its normal workload.
I recommend you look at this document from Snowflake for reference on the Snowflake side: https://docs.snowflake.net/manuals/user-guide/data-load-considerations-prepare.htm
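As an illustration, a simple way to produce chunks of roughly that size from a large CSV export; the paths are placeholders, and the sizing here is measured on the uncompressed data for simplicity:

```python
# Split a large CSV export into ~100 MB gzipped chunks so Snowflake can load
# them in parallel with COPY INTO.
import gzip

CHUNK_BYTES = 100 * 1024 * 1024  # ~100 MB of uncompressed data per chunk

def split_export(src_path, dest_prefix):
    part, written, out = 0, 0, None
    with open(src_path, "rb") as src:
        for line in src:
            # Start a new gzipped chunk when the current one is full.
            if out is None or written >= CHUNK_BYTES:
                if out:
                    out.close()
                part += 1
                written = 0
                out = gzip.open(f"{dest_prefix}_{part:05d}.csv.gz", "wb")
            out.write(line)
            written += len(line)
    if out:
        out.close()

split_export("/data/orders_export.csv", "/data/chunks/orders")
```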
Is there a network speed issue?
Anyway, the data should be compressed when transferred over the network.
There are three locations involved in the staging:
the Oracle database,
the extraction client,
and the cloud storage.
You have two data transfers:
between database and client,
and between client and cloud storage.
If the Oracle version is 12cR2 or newer, the DB client can compress data when getting it out of the database. The data should then be compressed again and transferred to cloud storage at your Snowflake destination.
The final step is to load the data from cloud storage into Snowflake (within the same data center)...
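A minimal sketch of those last hops with the Snowflake Python connector; the account, warehouse, table, and file paths are placeholders:

```python
# Upload the chunk files to a Snowflake table stage (PUT compresses anything
# not already compressed) and load them with COPY INTO on a large warehouse.
import snowflake.connector

conn = snowflake.connector.connect(
    account="myaccount", user="loader", password="...",
    warehouse="LOAD_XL_WH", database="ANALYTICS", schema="PUBLIC")

cur = conn.cursor()
# Stage all chunk files on the ORDERS table stage.
cur.execute("PUT file:///data/chunks/orders_*.csv.gz @%ORDERS AUTO_COMPRESS=TRUE")
# Load every staged file in parallel; the warehouse size drives the parallelism.
cur.execute("""
    COPY INTO ORDERS
    FROM @%ORDERS
    FILE_FORMAT = (TYPE = CSV FIELD_OPTIONALLY_ENCLOSED_BY = '"')
    PURGE = TRUE
""")
cur.close()
conn.close()
```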
Ideally you shouldn't need to keep the production database down. You should be able to categorise the data into
1 - historical data that will not change. You can extract this data at your own leisure, and it should not require the database to be down.
2 - static data that is fairly stable. You can also extract this data at your leisure.
You only need to keep your database fairly stable (not down) while you are extracting the rest of the data. This will require you to build some way to track and validate all your datasets. There is no reason why you couldn't let users continue to read from the database while you are performing the extract from Oracle.
I'm working on a project where we have some legacy data in MySQL and now we want to deploy ES for better full text search.
We still want to use MySQL as the backend data storage because the current system is closely coupled with that.
It seems that most of the available solutions suggest syncing the data between the two, but this would result in storing all the documents twice in both ES and MySQL. Since some of the documents can be rather large, I'm wondering if there's a way to have only a single copy of the documents?
Thanks!
Impossible. This is analogous to asking the following: if you have legacy data in an Excel spreadsheet, can you use a MySQL database to query that data without also storing it in MySQL?
Elasticsearch is not just an application layer that interprets userland queries and turns them into database queries, it is itself a database system (in fact, it can be used as your primary data store, though it's not recommended due to various drawbacks). Its search functionality fundamentally depends on how its own backing storage is organized. Elasticsearch cannot query other databases.
You should consider which portions of your data actually need to be stored in Elasticsearch, i.e. which fields need full-text search. You will need to build a component that syncs that view of the data between your MySQL database and Elasticsearch.
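A rough sketch of such a component, assuming only a couple of small searchable fields need to live in Elasticsearch; the index, table, and column names are placeholders:

```python
# Keep a single authoritative copy in MySQL: index only the searchable fields
# plus the primary key into Elasticsearch, search there, then fetch the full
# rows from MySQL by id.
import pymysql
import requests

ES = "http://localhost:9200"

def index_document(doc_id, title, summary):
    # Only the small, searchable fields go into Elasticsearch -- not the body.
    resp = requests.put(f"{ES}/documents/_doc/{doc_id}",
                        json={"title": title, "summary": summary})
    resp.raise_for_status()

def search(query):
    resp = requests.post(f"{ES}/documents/_search",
                         json={"query": {"multi_match": {
                             "query": query,
                             "fields": ["title", "summary"]}}})
    resp.raise_for_status()
    ids = [hit["_id"] for hit in resp.json()["hits"]["hits"]]
    if not ids:
        return []
    # Pull the full documents (including the large body) from MySQL by id.
    conn = pymysql.connect(host="localhost", user="app", password="...",
                           database="legacy")
    with conn.cursor() as cur:
        placeholders = ",".join(["%s"] * len(ids))
        cur.execute(f"SELECT id, title, body FROM documents "
                    f"WHERE id IN ({placeholders})", ids)
        rows = cur.fetchall()
    conn.close()
    return rows
```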
I am currently working on a project which produces a huge amount of data every day. The project has two functionalities: one is storing the data in HBase for future analysis, and the second is pushing the data into Elasticsearch for monitoring.
As the data is huge, we would have to store it in two platforms (HBase and Elasticsearch)!
I have no experience with either of them. I want to know: is it possible to use Elasticsearch instead of HBase as persistent storage for future analytics?
I recommend reading this old but still valid article: https://www.elastic.co/blog/found-elasticsearch-as-nosql
Keep in mind that Elasticsearch is only a search engine. Whether you can rely on it alone depends on whether your data is critical, or whether you can accept losing some of it, like non-critical logs.
If you don't want to use an additional database for this large volume of data, you could store it as files in something like HDFS.
You should also check out Phoenix (https://phoenix.apache.org/), which may provide the monitoring features you are looking for.
I'm working on incremental data import into Solr from an existing normalized MSSQL database. I can't decide on the strategy I need to implement, and I don't know whether there are existing tools that do this already, so that I don't have to reinvent the wheel.
I need to import documents into Solr 3.6 to build the Solr index; the data is saved in MSSQL in a heavily normalized fashion. Retrieving the data for a single document requires many joins, which is killing performance. I have approx. 1 million such documents in the database, so a full import into Solr is not an option for me.
While deciding the approach I have two issues to consider:
Incremental data import, so that SQL Server doesn't come under heavy load while data is being fetched from the database.
Updating data that has changed in SQL Server into the Solr index once a day.
I am looking for your help in deciding on the strategy and tools for incremental data import into Solr. I think I have the following options:
1. Develop a custom application to fetch data from MSSQL and pass it to Solr. I would need to keep track of which records have already been inserted into Solr and which are pending. Also, about 2% of the records in MSSQL are updated daily, so I would need to track what data has changed and re-import it into Solr at some point.
2. Use an existing Solr tool or utility, like DIH. I'm not sure how this addresses incremental data retrieval, or how it tracks what data has changed in SQL Server. I'm also not sure how DIH will handle the complex joins required to fetch the data from the database.
3. Use something like LuSql with DIH, but I'm still not sure how it addresses both issues. LuSql does give the ability to do complex joins in the database, so I hope it might fit my purpose.
I'm in favor of using LuSql with DIH in Solr if it fits the purpose, but I'm still not sure how it keeps track of what data has changed. Or do I have to manage that part manually, by maintaining the IDs of the documents that changed and then supplying them to LuSql to fetch the data from SQL Server and import it into Solr?
I also welcome suggestions beyond these options for handling this kind of situation.
I will share the way I do this.
I have mostly the same requirements, and until this week I used the Solr data import with delta imports. I have a program that regularly updates a status flag on new items from 0 to 1 and then calls the Solr data import to fetch all documents with status 1. The data import uses a stored procedure to join and retrieve the status-1 documents from the database. If the import finishes successfully, I update the status to 2, so I know those documents are in Solr. If a document changes, I simply flip its status from 2 back to 0, and the import process updates the document in Solr.
Everything works fine for me using this process. I always get the new documents into Solr without having to fetch all the data from the database.
Now my requirements have changed, because we decided to keep the data archived in the database, as we only need it in Solr. So I need a program that deserializes the data and sends it to Solr.
My approach now is to add all new/updated documents via the update handler, commit once they have all been added, and, if the commit succeeds, update the status in the database. I have no experience with this approach yet, so I don't know whether it will work, but I will just try it and see what happens.
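For what it's worth, the status-flag flow looks roughly like this as a small script; the connection string, core name, and columns are placeholders, and it assumes a Solr version with the JSON update endpoint:

```python
# Pick up rows with status = 1, post them to Solr's update handler, commit,
# and only then mark them status = 2.
import pyodbc
import requests

SQL_CONN = ("DRIVER={ODBC Driver 17 for SQL Server};"
            "SERVER=dbserver;DATABASE=AppDb;Trusted_Connection=yes")
SOLR_UPDATE = "http://localhost:8983/solr/documents/update"

def push_pending_to_solr():
    conn = pyodbc.connect(SQL_CONN)
    cur = conn.cursor()
    cur.execute("SELECT doc_id, title, body FROM Documents WHERE status = 1")
    rows = cur.fetchall()
    if not rows:
        conn.close()
        return

    docs = [{"id": str(r.doc_id), "title": r.title, "body": r.body}
            for r in rows]
    # Add the documents and commit in one request; if this fails, the rows
    # keep status = 1 and will simply be retried on the next run.
    resp = requests.post(SOLR_UPDATE, params={"commit": "true"}, json=docs)
    resp.raise_for_status()

    # Only after a successful commit do we mark the rows as indexed.
    ids = [r.doc_id for r in rows]
    cur.execute("UPDATE Documents SET status = 2 WHERE doc_id IN (%s)"
                % ",".join("?" * len(ids)), ids)
    conn.commit()
    conn.close()
```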
In the past I researched a better way to do this, but I couldn't find anything, so if you find a better solution please share it with me.
Good luck :)
We had to index from a heavily normalized schema with 25+ tables, half of them containing over 5M records each, the largest ~20M.
We use Informatica to load these records from Oracle to Solr. ETL tools like Informatica provide ways to join tables or query results outside the relational database. It has a sorter transformation to sort outside the database, an aggregator transformation to group records outside the database, and a lookup transformation as well.
Essentially, our data is denormalized in stages, and the loading/indexing process is distributed.
There are open source ETL tools as well, of course, and there is a Microsoft ETL tool too.
Indexing into Solr happens through the update handler. Delta indexing is very similar to full indexing, with additional logic for change data capture. The ETL activity is scheduled.
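As an illustration, the change-data-capture piece of a delta run can be as simple as a watermark check; the table, columns, and connection details are placeholders:

```python
# Remember the timestamp of the last successful run, pull only rows modified
# since then, and send them to Solr's update handler.
import datetime
import oracledb
import requests

SOLR_UPDATE = "http://localhost:8983/solr/catalog/update"
WATERMARK_FILE = "last_run.txt"

def read_watermark():
    try:
        with open(WATERMARK_FILE) as f:
            return datetime.datetime.fromisoformat(f.read().strip())
    except FileNotFoundError:
        return datetime.datetime(1970, 1, 1)

def delta_index():
    since = read_watermark()
    run_started = datetime.datetime.utcnow()

    conn = oracledb.connect(user="etl", password="...", dsn="dbhost/ORCLPDB1")
    cur = conn.cursor()
    cur.execute("SELECT id, name, description FROM products "
                "WHERE last_modified > :since", since=since)
    docs = [{"id": str(i), "name": n, "description": d}
            for i, n, d in cur.fetchall()]
    conn.close()

    if docs:
        requests.post(SOLR_UPDATE, params={"commit": "true"},
                      json=docs).raise_for_status()

    # Advance the watermark only after a successful commit.
    with open(WATERMARK_FILE, "w") as f:
        f.write(run_started.isoformat())
```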