Closed. This question needs debugging details. It is not currently accepting answers.
Closed 2 years ago.
Our configuration has two connectors. Each connector is connected to its own Elasticsearch, but both connectors read from the same Couchbase bucket. We have noticed that if one of the connectors is started first and reads all of the documents from the bucket, then the second connector, once started, is not able to feed anything into its Elasticsearch. Could this be due to the checkpoint document added by the first connector to the source bucket?
Make sure the two connectors have different group names, otherwise they will share the same replication checkpoint (and weird things will happen if they run at the same time).
Here's the relevant section of the config file:
[group]
name = 'example-group'
Each connector group must be assigned a unique name (in order to keep its replication checkpoints separate). The group name is required even if there is only one connector instance in the group.
Reference: https://docs.couchbase.com/elasticsearch-connector/4.2/configuration.html#group-membership
Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
Closed 2 years ago.
I am studying a use case where we are going to move data from a SQL database (600TB, ~100 tables), in a transformed format, into Hadoop. We don't have logs enabled in the SQL DB. We decided to copy the data as a datamart view and to refresh this view every week. The copied data will be erased every week and rewritten.
This SQL DB is used for reporting purposes that is derived from the datalake. This OLTP database is an old system we are replacing progressively. The dataset that is copied is deleted every week and copied again (refreshed).
80% of data copy is straight with no transformation.
20% has redesign.
We identified 3 options:
Airflow + Beam for the processing
ETL (Informatica), which was excluded
Kafka (Connect, Streams, sink into Hadoop), optionally with Debezium for CDC
What do you think is the best approach regarding performance, overall time to deliver, and data architecture?
Thanks for help !
My thoughts - for what they are worth:
I would definitely not be looking to copy 600TB per week. Given that the majority of this data will not have changed from week to week (I assume), you should be looking to copy across only the data that has changed. As your data in Hadoop will be partitioned, you would mainly be inserting new data into new partitions; for those records that have changed you would just be dropping and reloading a few partitions.
I would copy all the necessary data into a staging area in Hadoop as-is (without transformation) and then process it on the Hadoop platform to produce the data you actually need - you can then drop the staging area data if you want
Data processing tool - if you already have experience of a specific toolset within your company then use that; don't multiply the toolsets in use unless there is critical functionality required that is not available within existing tools. If this one process is all you are going to be using this toolset for then it probably doesn't matter which one you use - pick one that is quickest to learn/deploy. If this toolset is going to be expanded to other use cases then I would definitely use a dedicated ETL/ELT tool rather than use a coding solution (why have you discarded Informatica as a solution?)
The following is definitely an opinion...
If you are building a new analytical platform, I am surprised that you are using Hadoop. Hadoop is legacy technology that has been superseded by more modern and capable Cloud data platforms (Snowflake, etc.).
Also, Hadoop is a horrible platform to try and run analytics on (it's ok as just a data lake to hold data while you decide what you want to do with it). Trying to run queries on it that don't align with how that data is partitioned gives really bad performance (for non-trivial dataset sizes). For example, if your transactions are partitioned by date then running a query to sum transaction values in the last week will run quickly. However, running a query to sum transactions for a specific account (or group of accounts) will perform very badly
Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 2 years ago.
Hi, I want to support getting more than 1000 records from DynamoDB in one GET, and in addition to add an option to send a list of records to DynamoDB via API Gateway.
(Neither is possible at the moment.)
Is there a way to do that? Is a suitable Lambda function the only option?
DynamoDB does not have a limit of getting up to 1000 items - I don't know which of the other layers you use imposes this specific limit of "1000".
If you want to read all items in the table, or all the items of a partition, you have the Scan and Query requests, respectively, which can bring you back even billions of records - but not in one call, of course (you need to do consecutive requests, in what is known as pagination, and there is also the option of a parallel scan).
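The pagination mentioned here can be sketched roughly as follows. This is only a minimal illustration, assuming a boto3-style low-level client whose `scan` call returns `Items` plus a `LastEvaluatedKey` while more pages remain; the function name `scan_all_items` is made up for the example.

```python
from typing import Any, Dict, Iterator


def scan_all_items(client: Any, table_name: str) -> Iterator[Dict[str, Any]]:
    """Yield every item of a table by following Scan pagination.

    `client` is assumed to behave like boto3.client("dynamodb").
    """
    kwargs: Dict[str, Any] = {"TableName": table_name}
    while True:
        page = client.scan(**kwargs)
        yield from page.get("Items", [])
        last_key = page.get("LastEvaluatedKey")
        if not last_key:
            # No LastEvaluatedKey means this was the final page.
            return
        # Resume the next request where the previous page stopped.
        kwargs["ExclusiveStartKey"] = last_key
```

The same loop applies to Query by swapping the call and adding the key condition. Because it is a generator, the caller never holds the whole table in memory at once.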
But it seems what you are really looking for is to read a bunch of unrelated items given their keys. The request for that is BatchGetItem. This request is actually limited to just 100 item keys (much smaller than the limit you mentioned, 1000), and even that number 100 is only guaranteed to work if the items being read are fairly small - otherwise you go over the response size limit and get back responses for only some of the items. But this is hardly a problem - your application can always split up a 10,000-item request into 100 separate requests, sending all those batch requests in sequence or even in parallel.
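Splitting a large key list into batches could look something like the sketch below. It assumes a boto3-style low-level client (`batch_get_item` taking `RequestItems` and returning `Responses` and `UnprocessedKeys`); the helper name `batch_get_all` is hypothetical, and it sends the batches in sequence rather than in parallel.

```python
from typing import Any, Dict, List

MAX_KEYS_PER_BATCH = 100  # the BatchGetItem key limit described above


def batch_get_all(client: Any, table_name: str,
                  keys: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Fetch items for any number of keys, 100 keys per request.

    `client` is assumed to behave like boto3.client("dynamodb").
    """
    items: List[Dict[str, Any]] = []
    for start in range(0, len(keys), MAX_KEYS_PER_BATCH):
        chunk = keys[start:start + MAX_KEYS_PER_BATCH]
        request = {table_name: {"Keys": chunk}}
        # A response may cover only part of the batch (e.g. when the
        # response size limit is hit); re-send whatever was left over.
        while request:
            response = client.batch_get_item(RequestItems=request)
            items.extend(response.get("Responses", {}).get(table_name, []))
            request = response.get("UnprocessedKeys") or {}
    return items
```

In real use you would probably add a delay with backoff before re-sending unprocessed keys, since they are often a sign of throttling.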
Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 2 years ago.
I am populating messages from different tables into a single topic, but I need to separate the messages out again and sink them to a set of destination tables with the same names.
Using debezium/connect, I populate the topic from the source table with a JDBC source connector. With a JDBC sink connector, I populate the destination table (from topic to destination table). In this scenario, I need 1 topic for 1 table. Since there are hundreds of tables, I need a single topic to sync many tables.
Any idea how to achieve this?
You can use Kafka Connect transforms to modify the topic / table names in flight.
It seems you may want to use a RegexRouter to collect multiple topics together.
https://docs.confluent.io/current/connect/transforms/regexrouter.html
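As a rough illustration of what such a transform does, here is a sketch. The `transforms.*` keys and the `RegexRouter` class name are the standard Kafka Connect ones; the connector name and the `dbserver.inventory.` topic prefix are hypothetical, and the `route` function is only a Python approximation of the transform's behavior (rename the topic when the regex matches the whole topic name), not the real implementation.

```python
import re

# Fragment of a sink connector config (not complete). Names and the topic
# prefix are hypothetical; only the transforms.* entries matter here.
sink_config = {
    "name": "example-jdbc-sink",
    "topics.regex": r"dbserver\.inventory\..*",
    "transforms": "route",
    "transforms.route.type": "org.apache.kafka.connect.transforms.RegexRouter",
    "transforms.route.regex": r"dbserver\.inventory\.(.*)",
    "transforms.route.replacement": "$1",
}


def route(topic: str, regex: str, replacement: str) -> str:
    """Approximate RegexRouter: rewrite the topic only on a full match."""
    pattern = re.compile(regex)
    if not pattern.fullmatch(topic):
        return topic  # non-matching topics pass through unchanged
    # Translate Java-style group references ($1) to Python's (\1).
    py_replacement = re.sub(r"\$(\d+)", r"\\\1", replacement)
    return pattern.sub(py_replacement, topic, count=1)


regex = sink_config["transforms.route.regex"]
replacement = sink_config["transforms.route.replacement"]
print(route("dbserver.inventory.customers", regex, replacement))
```

With a capture group in the replacement, each topic is renamed to its bare table name; with a constant replacement, every matching topic would be collected under one name instead.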
These are three simple questions for which it was surprisingly hard to find definite answers.
Does ElasticSearch support indexing data in RDBMS tables ( Oracle/SQLServer/Informix) out of the box?
If yes, can you please point me to documentation on how to do it
If not, what are alternate ways (plugins like Rivers - deprecated) with good reputation
I'm surprised there isn't any solid answer as yet for this. So here's the solution. Logstash directly gives us the ability to push data from an RDBMS into Elasticsearch.
Here's a link to a tutorial which tells you how to go about it. Briefly (all details are in the first link), you simply need a JDBC driver for the relational database you'll be using (Postgres, MySQL, etc.) and a config file specifying your input as the relational database and your output as Elasticsearch. You can also specify a cron-style schedule, which allows you to keep updating at regular intervals.
Here's the article which mentions the configuration and gets you started (See Example 2): https://www.elastic.co/blog/logstash-jdbc-input-plugin
Here's the article which tells you how to configure the Cronjob as such: https://www.elastic.co/guide/en/logstash/current/plugins-inputs-jdbc.html#_scheduling
Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
Closed 3 years ago.
I'm looking for a document-oriented db with a Ruby API that has SQLite-like properties:
self-contained,
serverless,
zero-configuration.
Are there light alternatives to MongoDB or CouchDB?
Is RDDB a possibility?
If not, what are the best paths to walk then?
I know, the question was asked 5 years ago, but just for completeness' sake, embedded MongoDB has happened since:
https://github.com/hamiltop/MongoLiteDB
It's not ready yet, but an embeddable version of CouchDB is on the long-term roadmap.
Replication is intended to enable offline applications with CouchDB. If you ended up with very specific needs, you could replicate data from CouchDB to a local data structure, store it locally, update it, and push the data back via replication, but it would take some code.
If you were using Perl, I'd recommend DBM::Deep, which stores arbitrary data structures on disk, including transactions with commit/rollback, and it's a non-C one-Perl-module install. Doesn't get much lighter than that.
I almost feel you could do some sort of hack to achieve this.
Have a table using sqlite's row ids along with a field for collection name and text blob that would be json code.
Have another table for indexing with fields in a collection (collection name, field name, field value, document row id).
You could do some wrapper class to handle things like updates and lookups. Would be interesting.
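A minimal sketch of that hack, written in Python with the standard `sqlite3` module rather than Ruby. The class and table names are invented for the example, and it indexes only top-level fields, storing each value as JSON text.

```python
import json
import sqlite3


class TinyDocStore:
    """Toy document store: one table of JSON blobs, one table of indexed fields."""

    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        self.db.executescript("""
            CREATE TABLE IF NOT EXISTS docs (
                id INTEGER PRIMARY KEY,      -- alias for SQLite's rowid
                collection TEXT NOT NULL,
                body TEXT NOT NULL           -- the document as JSON
            );
            CREATE TABLE IF NOT EXISTS doc_index (
                collection TEXT NOT NULL,
                field TEXT NOT NULL,
                value TEXT,                  -- JSON-encoded field value
                doc_id INTEGER NOT NULL
            );
            CREATE INDEX IF NOT EXISTS doc_index_lookup
                ON doc_index (collection, field, value);
        """)

    def _reindex(self, collection, doc_id, doc):
        # Drop the old index rows for this document, then add fresh ones.
        self.db.execute("DELETE FROM doc_index WHERE doc_id = ?", (doc_id,))
        rows = [(collection, k, json.dumps(v), doc_id) for k, v in doc.items()]
        self.db.executemany("INSERT INTO doc_index VALUES (?, ?, ?, ?)", rows)

    def insert(self, collection, doc):
        cur = self.db.execute(
            "INSERT INTO docs (collection, body) VALUES (?, ?)",
            (collection, json.dumps(doc)))
        self._reindex(collection, cur.lastrowid, doc)
        self.db.commit()
        return cur.lastrowid

    def update(self, doc_id, doc):
        row = self.db.execute(
            "SELECT collection FROM docs WHERE id = ?", (doc_id,)).fetchone()
        if row is None:
            raise KeyError(doc_id)
        self.db.execute(
            "UPDATE docs SET body = ? WHERE id = ?", (json.dumps(doc), doc_id))
        self._reindex(row[0], doc_id, doc)
        self.db.commit()

    def find(self, collection, field, value):
        rows = self.db.execute(
            "SELECT d.body FROM docs d JOIN doc_index i ON i.doc_id = d.id "
            "WHERE i.collection = ? AND i.field = ? AND i.value = ? "
            "ORDER BY d.id",
            (collection, field, json.dumps(value))).fetchall()
        return [json.loads(r[0]) for r in rows]
```

Usage would be along the lines of `store = TinyDocStore("app.db")`, `store.insert("people", {"name": "Ann"})`, then `store.find("people", "name", "Ann")`. Nested fields and range queries are left out to keep the idea visible.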