Should I use MaterializedPostgreSQL instead of creating a table in ClickHouse? Does it give the same performance? - clickhouse

As the question here asks: what is the difference between the MaterializedPostgreSQL engine and the PostgreSQL engine in ClickHouse?
MaterializedPostgreSQL uses replication slots and physically replicates data from PostgreSQL into ClickHouse.
PostgreSQL is just a proxy table engine: queries against it are forwarded to the remote PostgreSQL server.
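For reference, a minimal sketch of how the two are declared (host, database, table, and credentials below are placeholders; on some ClickHouse versions MaterializedPostgreSQL is experimental and must be enabled first):

-- Replicated copy: ClickHouse keeps its own physical copy of the data and
-- follows a PostgreSQL replication slot to keep it up to date.
-- SET allow_experimental_database_materialized_postgresql = 1;  -- if your version requires it
CREATE DATABASE pg_replica
ENGINE = MaterializedPostgreSQL('pg-host:5432', 'mydb', 'pg_user', 'pg_password');

-- Proxy: no local copy; every query on this table is forwarded to PostgreSQL.
CREATE TABLE pg_proxy (id UInt64, payload String)
ENGINE = PostgreSQL('pg-host:5432', 'mydb', 'source_table', 'pg_user', 'pg_password');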
But the more important question is this: given the advantages of MaterializedPostgreSQL, should I write data to PostgreSQL instead of ClickHouse, and use ClickHouse with MaterializedPostgreSQL as the calculator that produces the reports?
Is query speed on a MaterializedPostgreSQL table the same as on a native ClickHouse table? I think it should be the same, because it seems like ClickHouse copies the data from PostgreSQL, but I'm not sure whether that is correct.

Related

How to know how much memory is used by a query in ClickHouse

I am trying to test the performance of ClickHouse to get a sense of how much memory I need for a dedicated server.
I'm currently using PostgreSQL in production and now want to migrate to ClickHouse, so I inserted some of the production data into a local ClickHouse server and am executing the most frequently used production queries against it.
But I do not know how much memory ClickHouse uses to execute these queries.
After some research I found the answer; I hope it helps others.
ClickHouse has a table called system.query_log that stores statistics for each executed query, such as duration and memory usage.
There is also a table, system.processes, with information about currently running queries.
I'm using the following query to inspect recent queries. It returns memory usage, query duration, number of rows read, functions used, and more:
SELECT * FROM system.query_log
WHERE type != 'QueryStart' AND NOT has(databases, 'system')
ORDER BY event_time_microseconds DESC
LIMIT 20;
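To watch queries while they are still running, you can query system.processes in a similar way. A minimal sketch of the columns I find useful (the ordering and formatting are just one reasonable choice):

-- Currently running queries, heaviest memory consumers first.
SELECT
    query_id,
    user,
    elapsed,
    formatReadableSize(memory_usage) AS memory,
    query
FROM system.processes
ORDER BY memory_usage DESC;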

Avoid data replication when using Elasticsearch + MySQL backend?

I'm working on a project where we have some legacy data in MySQL and now we want to deploy ES for better full text search.
We still want to use MySQL as the backend data storage because the current system is closely coupled with that.
It seems that most of the available solutions suggest syncing the data between the two, but this would result in storing all the documents twice in both ES and MySQL. Since some of the documents can be rather large, I'm wondering if there's a way to have only a single copy of the documents?
Thanks!
Impossible. This is analogous to asking: if you have legacy data in an Excel spreadsheet, can you use a MySQL database to query that data without also storing it in MySQL?
Elasticsearch is not just an application layer that interprets userland queries and turns them into database queries, it is itself a database system (in fact, it can be used as your primary data store, though it's not recommended due to various drawbacks). Its search functionality fundamentally depends on how its own backing storage is organized. Elasticsearch cannot query other databases.
You should consider what portions of your data actually need to be stored in Elasticsearch, i.e. what fields need text completion. You will need to build a component which syncs that view of the data between Elasticsearch and your MySQL database.
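As a rough illustration of that last point (table and column names here are hypothetical): keep the full rows only in MySQL and expose just the searchable fields to the sync component, so Elasticsearch holds the small indexed view and returns ids that you then look up in MySQL.

-- Hypothetical schema: documents(id, title, body, attachment_blob, updated_at).
-- Only the fields worth indexing are exposed to the sync job; the large
-- attachment stays solely in MySQL.
CREATE OR REPLACE VIEW documents_for_search AS
SELECT id, title, body, updated_at
FROM documents;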

Downloading large datasets using JDBC on Redshift

I'm using the Amazon Redshift JDBC driver to connect to Redshift on SQL Workbench/J.
I want to get my hands on a large dataset query result (several million rows).
WbExport seems to have the answer at first glance (http://www.sql-workbench.net/manual/command-export.html).
However, it seems to load the entire result set into memory before exporting it to a file: it gives me a memory warning and aborts the query without even creating the output file, so this approach does not work.
Is there a better approach that doesn't involve ditching SQL Workbench and the JDBC connection? If not, what's a suggested viable alternative that minimizes the amount of new tools or access necessary?
I strongly recommend that you do not try to retrieve millions of rows from Redshift as a query result. This is very inefficient, and it will slow down your cluster while it runs.
Instead, use an UNLOAD statement to extract the data to S3 in parallel. UNLOAD will be 100x-1000x faster. https://docs.aws.amazon.com/redshift/latest/dg/r_UNLOAD.html
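A sketch of what that UNLOAD can look like (bucket, IAM role, table, and filter are placeholders):

-- Writes the result set to S3 as multiple gzipped, pipe-delimited files in parallel.
UNLOAD ('SELECT * FROM my_big_table WHERE event_date >= ''2020-01-01''')
TO 's3://my-bucket/exports/my_big_table_'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-unload-role'
DELIMITER '|'
GZIP
PARALLEL ON;

You can then download the resulting files from S3, for example with the AWS CLI.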
Though not very efficient for Redshift, if you really need it you should be able to set the fetch size for SQL Workbench: http://www.sql-workbench.net/manual/profiles.html
The AWS Redshift documentation on the same topic: https://docs.aws.amazon.com/redshift/latest/dg/queries-troubleshooting.html#set-the-JDBC-fetch-size-parameter
The fastest approach:
1. Unload the data to S3.
2. Then download the data from S3.

Is aggregating outside of Hive a better choice?

I have more of a conceptual question. I'm using Hive to pull data and then I want to insert all the retrieved values into IBM BigSQL (basically DB2) so that aggregating the data is easier/faster. So I want to create a view in Hive, run a nightly CTAS against it, and then migrate the resulting table to DB2 and do the rest of the aggregation there.
Is there a better practice?
I wanted to do everything including aggregation in Hive but it is extremely slow.
Thanks for your suggestions!
Considering that you are using Cloudera, is there a reason why you don't perform the aggregations in Impala? Converting the JSON data to Parquet (which I would recommend if there is not a lot of nested structure) shouldn't be very expensive. Another alternative, depending on the kind of aggregations you are doing, is to use Spark to convert the data (this also depends a lot on your cluster size). I would like to give you more specific hints, but without knowing what aggregations you are doing it is complicated.
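If you do stick with the nightly CTAS approach from the question, here is a hedged sketch of it in HiveQL (view, table, and column names are made up), writing Parquet so that Impala or Spark could later read the same output:

-- Nightly snapshot of the pre-aggregated view, stored as Parquet for export.
DROP TABLE IF EXISTS export_daily;
CREATE TABLE export_daily
STORED AS PARQUET
AS
SELECT report_date, customer_id, SUM(amount) AS total_amount
FROM sales_view
GROUP BY report_date, customer_id;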

Move data from Oracle to Cassandra and/or MongoDB

At work we are thinking of moving from Oracle to a NoSQL database, so I have to run some tests on Cassandra and MongoDB. I have to move a lot of tables to the NoSQL database; the idea is to keep the data synchronized between the two platforms.
So I created a simple procedure that selects from the Oracle DB and inserts into Mongo. Some of my colleagues pointed out that maybe there is an easier (and more professional) way to do it.
Has anybody had this problem before? How did you solve it?
If your goal is to copy your existing structure from Oracle to a NoSQL database, then you should probably reconsider the move in the first place. By doing that, you lose the benefits one gets from going to a non-relational data store.
A good first step would be to take a long look at your existing structure and determine how it can be modified to have a positive impact on your application. Also consider a hybrid system: Cassandra is great for a lot of things, but if you need a relational system and are already using a lot of Oracle functionality, it likely makes sense for most of your database to stay in Oracle, while moving the pieces that require frequent writes and would benefit from a different structure to Mongo or Cassandra.
Once you've made the decisions about your structure, I would suggest writing scripts or programs, or adding a module to your existing app, to write the data in the new format to the new data store. That will give you the most fine-grained control over every step in the process, which, in a large system-wide architectural change, is something I would want to have.
You can also consider using components of the Hadoop ecosystem to perform this kind of ETL task. For that, you need to model your Cassandra DB as per the requirements.
The steps could be to migrate your Oracle table data to HDFS (preferably using Sqoop) and then write a MapReduce job to transform the data and insert it into the Cassandra data model.
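On the point about modelling the Cassandra DB to the requirements, a hedged CQL sketch (keyspace, table, and columns are hypothetical) of the query-first modelling Cassandra expects, with the partition key chosen to match how the data will be read:

-- Designed for the read pattern "all orders for a customer, newest first".
CREATE KEYSPACE IF NOT EXISTS shop
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};

CREATE TABLE IF NOT EXISTS shop.orders_by_customer (
  customer_id uuid,
  order_ts    timestamp,
  order_id    uuid,
  total       decimal,
  PRIMARY KEY ((customer_id), order_ts, order_id)
) WITH CLUSTERING ORDER BY (order_ts DESC, order_id ASC);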
