Debezium Oracle schema.include.list doesn't work - oracle

We are using Debezium 1.9.4-Final to capture changes and send them to Elasticsearch.
It is working, but one thing puzzles me.
We set schema.include.list to restrict capture to the one schema we care about, but when we start the connector, the log shows that the whole database is being scanned and added to the database.server.name topic that is created. The database is huge, so it takes a long time before the connector actually starts capturing the single table we configured.
The relevant part of the configuration looks like this:
database.server.name: server_name
database.dbname: server_name.database
table.include.list: ATBSCH.TB_DROP
schema.include.list: ATBSCH
Is there a way to make Debezium watch only the schema named in the schema.include.list property?

After some investigation, I have found the solution to my problem. I think it was the same as yours. Set the following:
debezium.source.database.history.store.only.captured.tables.ddl=true
The documentation for it: https://debezium.io/documentation/reference/1.9/connectors/oracle.html#oracle-property-database-history-store-only-captured-tables-ddl
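If the connector is registered through Kafka Connect rather than Debezium Server, the same option is written without the debezium.source. prefix, as the anchor in the documentation link above suggests. Below is a rough sketch in Python of the filter-related part of such a configuration, reusing the values from the question. Connection settings are left out, and the connector name "oracle-connector" is only a placeholder.

```python
import json

# Filter-related properties only; connection settings are deliberately omitted.
connector_config = {
    "connector.class": "io.debezium.connector.oracle.OracleConnector",
    "database.server.name": "server_name",
    "database.dbname": "server_name.database",
    "schema.include.list": "ATBSCH",
    "table.include.list": "ATBSCH.TB_DROP",
    # Keep DDL history only for captured tables (Debezium 1.9 property name).
    "database.history.store.only.captured.tables.ddl": "true",
}

# Request body in the {"name": ..., "config": ...} shape used by
# the Kafka Connect REST API (POST /connectors).
# "oracle-connector" is a placeholder name, not taken from the question.
payload = json.dumps({"name": "oracle-connector", "config": connector_config}, indent=2)
print(payload)
```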


Kafka connect API SourceRecord to SinkRecord transformation

I'm using the Debezium embedded engine to listen for changes in a database. It gives me a ChangeEvent<SourceRecord, SourceRecord> object.
I want to go on to use the Confluent plugin KCBQ, which uses SinkRecord, to put the data into BigQuery, but I can't figure out how to join these two pieces.
Ultimately, how do I ensure that updates, deletes and schema changes from MySQL are propagated to BigQuery from embedded Debezium?
You will probably have to use a single message transform (SMT) if you need any custom transformation. For this scenario, though, since this is a commonly needed transform, Debezium's new record state extraction transform (ExtractNewRecordState) seems to accomplish it. It may be worth having a look and trying something similar:
https://issues.redhat.com/browse/DBZ-226
https://issues.redhat.com/browse/DBZ-1896
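To give an idea of what such a transform does, here is a small Python sketch that flattens a Debezium-style change envelope (with "op", "before" and "after" fields) into a plain row. It is not the real ExtractNewRecordState implementation. The function name and the "__deleted" marker are this sketch's own choices, loosely modeled on the transform's option to rewrite deletes instead of dropping them.

```python
from typing import Any, Dict, Optional


def extract_new_state(envelope: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Flatten a Debezium-style change envelope into a plain row (illustrative only)."""
    op = envelope.get("op")
    if op in ("c", "u", "r"):  # create, update, snapshot read
        # The row as it looks after the change.
        return dict(envelope["after"])
    if op == "d":  # delete
        # Keep the last known row and flag it, so a sink can handle the delete.
        row = dict(envelope["before"])
        row["__deleted"] = "true"  # marker chosen for this sketch
        return row
    # Unknown operation: nothing to forward.
    return None


update = {"op": "u", "before": {"id": 1, "name": "a"}, "after": {"id": 1, "name": "b"}}
print(extract_new_state(update))
```

A sink connector then only sees flat rows, which is usually what a BigQuery table expects.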

Elasticsearch to index RDBMS data

These are three simple questions for which it was surprisingly hard to find definite answers.
Does Elasticsearch support indexing data in RDBMS tables (Oracle/SQL Server/Informix) out of the box?
If yes, can you please point me to documentation on how to do it?
If not, what are the alternatives with a good reputation (plugins like Rivers are deprecated)?
I'm surprised there isn't a solid answer for this yet, so here's the solution. Logstash can push data directly from an RDBMS into Elasticsearch.
Briefly (all details are in the first article linked below), you need a JDBC driver for the relational database you'll be using (Postgres, MySQL, etc.) and a config file that specifies the relational database as the input and Elasticsearch as the output. You can also specify a cron-style schedule so the data is refreshed at regular intervals.
Here's the article that covers the configuration and gets you started (see Example 2): https://www.elastic.co/blog/logstash-jdbc-input-plugin
Here's the documentation on configuring the schedule: https://www.elastic.co/guide/en/logstash/current/plugins-inputs-jdbc.html#_scheduling

Spring Boot application with Postgres: indexes not being used during first use

I have a Spring Boot application that is using a Postgres database. When the application is deployed I need to run a transactional operation that uploads a zip file that is used to populate the database. The application is checking for duplicate rows before inserting them (because users can upload duplicate data that should just be ignored).
The problem I am having is that the first time I upload the file, even though the indexes have been created, they are not used when checking for the existence of a row. My theory is that the query planner decides not to use the index because it is looking at the original statistics, which show that the tables are empty. If I upload a small zip file first, the problem goes away because the tables now have data.
I have two questions. First, is my theory correct or is there some other reason for this behaviour? Also, if so, is there a way to force Postgres to update the query plan it uses at some predefined interval within the same transaction and can this be done using JPA? Any ideas are appreciated.
Just in case someone runs into this issue, I'll post the solution I found. It appears my theory was correct. The queries will not use the indexes until some statistics are collected. One way to force this is to call ANALYZE after a number of rows have been written to the database. You can do this using a native query like this:
entityManager.createNativeQuery("ANALYZE " + tbl).executeUpdate();
You can wrap this call in a try/catch and ignore any exceptions that might occur if you change the database engine. I couldn't find a way of doing this in a database-independent way, but this approach works fine and the initial upload now performs as expected.
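The control flow of that approach, sketched in Python rather than JPA. Here insert_row and analyze are hypothetical callables standing in for the real insert and for the native ANALYZE query, and the threshold of 1000 rows is an arbitrary assumption.

```python
from typing import Any, Callable, Iterable


def load_rows(
    rows: Iterable[Any],
    insert_row: Callable[[Any], None],
    analyze: Callable[[], None],
    analyze_after: int = 1000,
) -> int:
    """Insert rows and refresh planner statistics once, after the first batch."""
    written = 0
    for row in rows:
        insert_row(row)
        written += 1
        if written == analyze_after:
            # In the real application this would run the native "ANALYZE <table>" query.
            analyze()
    return written


# Tiny demonstration with stand-in callables that just record what happened.
events = []
load_rows(range(5), lambda r: events.append(("insert", r)), lambda: events.append(("analyze",)), analyze_after=2)
print(events)
```

Running ANALYZE once early is usually enough to move the planner off the "empty table" statistics. Whether it needs repeating later in the same load depends on how much the data distribution changes.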

Realistic Data Backup method for Parse.com

We are building an iOS app with Parse.com, but still can't figure out the right way to back up data efficiently.
As a premise, we have, and will continue to have, a LOT of data store rows.
Say we have a class with 1 million rows and a backup of it, and we want to bring it back into Parse after a disaster (like data loss in production).
The few solutions we have considered are the following:
1) Use external server for backup
BackUp:
- use the REST API to constantly back up data to a remote MySQL server (we chose MySQL for customized analytics purpose, since it's way faster and easier to handle data with MySQL for us)
ImportBack:
a) - recreate JSON objects from MySQL backup and use the REST API to send back to Parse.
Say we use the batch operation, which allows 50 objects to be created in one request, and assume each request takes 1 second: 1 million records will take about 5.6 hours to transfer to Parse.
b) - recreate one JSON file from MySQL backup and use the Dashboard to import data manually.
We just tried this method with a file of 700,000 records: it took about 2 hours for the loading indicator to stop and show the number of rows in the left pane, but the class still never opens in the right pane (it says "operation time out"), and it has been over 6 hours since the upload started.
So we can't rely on 1.b, and 1.a seems to take too long to recover from a disaster (with 10 million records, it would take about 55 hours, roughly 2.3 days).
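The estimate for 1.a can be checked with a few lines of Python. The batch size of 50 and the 1 second per request are the assumptions stated above, not measured values.

```python
def restore_hours(total_rows: int, batch_size: int = 50, seconds_per_request: float = 1.0) -> float:
    """Estimated hours to re-import total_rows through sequential batch requests."""
    requests = -(-total_rows // batch_size)  # ceiling division
    return requests * seconds_per_request / 3600


for rows in (1_000_000, 10_000_000):
    hours = restore_hours(rows)
    print(f"{rows:>10,} rows: {hours:.1f} h ({hours / 24:.1f} days)")
```

The time scales linearly with the row count, so only larger batches, faster requests or parallel requests would shorten it.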
Now we are thinking about the following:
2) Constantly replicate data to another app
Create the following in Parse:
- Production App: A
- Replication App: B
So while A is in production, every single query will be duplicated to B (using a constantly running background job).
The downside, of course, is that it will eat into A's burst limit, as it simply doubles the number of queries. So it's not ideal when thinking about scaling up.
What we want is something like AWS RDS, which gives an option to back up automatically every day.
I wonder how difficult this could be for Parse, since it's based on AWS infrastructure.
Please let me know if you have any ideas on this; we'll be happy to share know-how.
P.S.:
We've noticed an important flaw in idea 2) above.
If we replicate using the REST API, the objectIds of all classes will change, so every one-to-one or one-to-many relation will be broken.
So we are thinking about adding a uuid to every object class.
Is there any problem with this method?
One thing we want to achieve is
query.include("ObjectName")
(or in Obj-C, "includeKey"),
but I suppose that won't be possible if we don't base our app logic on objectId.
We are looking for a workaround for this issue,
but will uuid-based management work under Parse's Datastore logic?
Parse has never lost production data. While we don't currently offer automated backups, you can request one any time you like, and we're working on making all of this even nicer. Additionally, it's easier in most cases to import the JSON export file through the data browser rather than using the REST batch.
I can confirm that today Parse did lose my data. Or at least it appeared so.
After several errors were detected on multiple apps (confirmed by the Parse Status Twitter account), queries against one of our apps returned no data, without any error.
It was because an entire column (of type pointer) of one of our classes had disappeared and its data was no longer present in the dashboard.
We use this pointer column to filter and retrieve data, so the returned queries and collections were empty.
So we decided to recreate the column manually. By chance, recreating the column with the same name and type solved the issue, and the data was still there... I can't explain it, but I really thought the data was lost, and the app behaved as if it were.
So an automated backup and restore feature is mandatory, not optional.
In December 2015, parse.com released a new dashboard with an improved export feature.
Just select your app and click on "App Settings" -> "General" -> "Export app data". Parse generates a JSON file for every class in your app and sends you an email when the export is done.
UPDATE:
Sad but true, parse.com is winding down: http://blog.parse.com/announcements/moving-on/
I had the same issue of backing up Parse Server data. Since Parse Server uses MongoDB, backing up the data is not a problem. I did a simple thing: I downloaded the MongoDB backup from the server and then restored it using
mongorestore /path-to-mongodump (extracted files)
Since Parse has been open-sourced, we can adopt this technique.
For accidental deletes, writing a beforeDelete cloud function that backs up the current row to another class would work.
For regular backups, a manual export of changed records (use a filter) is useful. For recovery, this requires you to write scripts or use the import option in the data browser (I'm not so sure about the latter). You could also write a cloud function to replicate data to your backup server (I haven't tried this yet).
However there are some limitations to cloud code that you should consider before venturing into it:
https://parse.com/docs/cloud_code_guide#functions-resource

Can DB2 tell a web-app when a table data is updated?

I have a table of non trivial size on a DB2 database that is updated X times a day per user input in another application. This table is also read by my web-app to display some info to another set of users. I have a large number of users on my web app and they need to do lots of fuzzy string lookups with data that is up-to-the-minute accurate. So, I need a server side cache to do my fuzzy logic on and to keep the DB from getting hammered.
So, what's the best option? I would hate to pull the entire table every minute when the data changes so rarely. I could set up a trigger to update a timestamp in a smaller table and poll that to see if I need to refresh my cache, but that seems hacky too.
Ideally I would like to have DB2 tell my web-app when something changes, or at least provide a very lightweight mechanism to detect data level changes.
I think if your web application is running in WebSphere, setting up MQ would be a pretty good solution.
You could write triggers that use the MQ Series routines to add things to a queue, and your web app could subscribe to the queue and listen for updates.
If your web app is not in WebSphere then you could still look at this option but it might be more difficult.
A simple solution could be to keep a timestamp (somewhere) for the latest change to the table.
The timestamp could be located in a small table/view that is updated either by the application that updates the big table or by an update trigger on the big table.
The update trigger's only task would be to set the "help" timestamp to CURRENT TIMESTAMP.
Then the web app only checks this timestamp.
If the timestamp is newer than what the web app has, the data is reread from the big table.
A "low-tech" solution that's fairly non-intrusive to the existing system.
Hope this solution fits your setup.
Regards
Sigersted
Having the database push a message to your web app is certainly doable via a variety of mechanisms (such as MQSeries). Similar and easier is to write a Java stored procedure that gets kicked off by the trigger and hands the data to your cache-maintenance interface. But both of these solutions involve a lot of versioning dependencies and the like, which could be a real PITA.
Another option might be to reconsider the entire approach. Is it possible that instead of maintaining a cache on your app's side you could perform your text searching on the original table?
But my suggestion is to do as you (and the other poster) mention - and just update a timestamp in a single-row table purposed to do this, then have your web-app poll that table. Similarly you could just push the changed rows to this small table - and have your cache-maintenance program pull from this table. Either of these is very simple to implement - and should be very reliable.
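The polling side of the timestamp approach could look roughly like this in Python. The class and both callables are hypothetical: read_marker stands for the cheap single-row query against the timestamp table, and load_rows for the expensive read of the big table.

```python
from typing import Any, Callable, List

_UNSET = object()  # sentinel: no marker has been seen yet


class TimestampPolledCache:
    """Reload cached rows only when the change marker differs from the last one seen."""

    def __init__(self, read_marker: Callable[[], Any], load_rows: Callable[[], List[Any]]):
        self._read_marker = read_marker
        self._load_rows = load_rows
        self._seen_marker: Any = _UNSET
        self._rows: List[Any] = []

    def get(self) -> List[Any]:
        # Read the marker BEFORE loading, so a change that lands during the
        # load shows up as a newer marker on the next poll.
        marker = self._read_marker()
        if marker != self._seen_marker:
            self._rows = self._load_rows()
            self._seen_marker = marker
        return self._rows


# Stand-in data source for demonstration; a real one would query DB2.
state = {"marker": 1}
cache = TimestampPolledCache(lambda: state["marker"], lambda: ["rows@%d" % state["marker"]])
print(cache.get())
```

The web app would call get() on each request or on a timer. Most calls then cost one small query instead of a full table read.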
