Greenplum: download a full dump to a local cluster in parallel - JDBC

Is there a more effective way to fetch a full dump of Greenplum than doing it through multiple JDBC connections to the master node?
I need to download a full dump of Greenplum through JDBC. To do the job quicker I am going to use Spark parallelism (fetching data in parallel through multiple JDBC connections). As I understand it, I will have multiple JDBC connections to Greenplum's single master node. I am going to store the data in HDFS in Parquet format.

For parallel export, you can try a gphdfs writable external table.
GPDB segments can write to and read from external sources in parallel.
http://gpdb.docs.pivotal.io/4340/admin_guide/load/topics/g-gphdfs.html
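As a minimal sketch (the table, host, and path names are hypothetical, and this assumes a GPDB 4.3.x release whose gphdfs protocol supports the PARQUET format, as covered in the linked guide), every segment writes its own slice of the table to HDFS in parallel:

CREATE WRITABLE EXTERNAL TABLE customer_export (LIKE customer)
LOCATION ('gphdfs://hdfs-namenode:8020/data/customer_export')
FORMAT 'PARQUET';

INSERT INTO customer_export SELECT * FROM customer;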

Now, you can use the Greenplum-Spark connector to parallelize data transfer between Greenplum segments and Spark executors.
This connector speeds up data transfer as it leverages parallel processing in Greenplum segments and Spark workers. It is definitely faster than using a JDBC connector that transfers data via the Greenplum master node.
Reference:
http://greenplum-spark.docs.pivotal.io/100/index.html

Related

Can you use HDFS as your principal storage?

Is it reliable to save your data in Hadoop and consume it using Spark/Hive etc.?
What are the advantages of using HDFS as your main storage?
HDFS is only as reliable as the Namenode(s) that maintain the file metadata. You'd better set up Namenode HA, take frequent snapshots of that metadata, and store the snapshots externally, away from HDFS.
If all Namenodes are unavailable, or their metadata storage is corrupted, you'll be unable to read the HDFS datanode data, even though the blocks themselves are intact and highly available.
Here are some considerations for storing your data in Hive vs HDFS (and/or HBase).
Hive:
HDFS is a filesystem that supports fail-over and HA. HDFS replicates the data across several datanodes based on the replication factor you have chosen. Hive is built on top of Hadoop and therefore stores its data in HDFS as well, leveraging the HA benefits of HDFS.
Hive supports predicate pushdown, which provides huge performance benefits. Hive can also be combined with modern file formats such as Parquet and ORC, improving performance even more by making predicate pushdown effective (see the sketch below).
Hive provides very easy access to data via HQL (Hive Query Language), an SQL-like language.
Hive works very well with Spark, and you can combine the two, e.g. reading Hive data into DataFrames and saving DataFrames back into Hive.
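As a minimal sketch of the predicate-pushdown point (the table and query are hypothetical), storing a table as ORC lets Hive skip stripes whose min/max statistics don't satisfy the filter:

CREATE TABLE events (id BIGINT, ts TIMESTAMP, payload STRING)
STORED AS ORC;

-- the ts predicate is pushed down to the ORC reader, so non-matching stripes are never read
SELECT id, payload
FROM events
WHERE ts >= '2018-01-01 00:00:00';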
HDFS/HBase:
Hive is a warehouse system used for data analysis, therefore Hive CRUD operations are relatively slower than direct access to HDFS files (or to HBase, which is built for fast CRUD operations). For instance, in a streaming application, saving data to HDFS or HBase will be much faster than saving it to Hive. If you need fast storage (or insert queries) and you don't do any analysis on large datasets, you should prefer HDFS/HBase over Hive.
If performance is crucial for your application, you may prefer to skip the extra Hive layer and access the HDFS files directly.
Your team decides not to use SQL.
Related post:
When to use Hadoop, HBase, Hive and Pig?

I want to ingest data using NiFi in two directions, one to HDFS and one to an Oracle database. Is it possible?

We are using NiFi to ingest data into HDFS. Can the same data be ingested into Oracle or any other database at the same time using NiFi?
I need to publish the same data to two places (HDFS and an Oracle database) and do not want to write two subscriber programs.
NiFi has processors to get data from an RDBMS (e.g. Oracle), such as QueryDatabaseTable and ExecuteSQL, and also from HDFS (ListHDFS, FetchHDFS, etc.). It also has processors to put data into an RDBMS (PutDatabaseRecord, PutSQL, etc.) or into HDFS (e.g. PutHDFS). So you can get your data from multiple sources and send it to multiple targets with NiFi.

Impala vs Hive: how does Impala circumvent MapReduce?

How is Impala able to achieve lower latency than Hive in query processing?
I was going through http://impala.apache.org/overview.html, where it is stated:
To avoid latency, Impala circumvents MapReduce to directly access the
data through a specialized distributed query engine that is very
similar to those found in commercial parallel RDBMSs. The result is
order-of-magnitude faster performance than Hive, depending on the type
of query and configuration.
How does Impala fetch the data without MapReduce (which Hive uses)?
Can we say that Impala is closer to HBase and should be compared with HBase instead of comparing with Hive?
Edit:
Or can we say that, classically, Hive sits on top of MapReduce and requires less memory to work, while Impala does everything in memory and hence requires more memory, since the data is already cached in memory and acted upon on request?
Just read Impala Architecture and Components
Impala is a massively parallel processing (MPP) database engine. It consists of different daemon processes that run on specific hosts.... Impala is different from Hive and Pig because it uses its own daemons that are spread across the cluster for queries.
It circumvents MapReduce containers by having a long running daemon on every node that is able to accept query requests. There is no singular point of failure that handles requests like HiveServer2; all impala engines are able to immediately respond to query requests rather than queueing up MapReduce YARN containers.
Impala however does rely on the Hive Metastore service, because it is just a useful service for mapping the metadata stored in the RDBMS to the Hadoop filesystem. Pig, Spark, PrestoDB, and other query engines also share the Hive Metastore without communicating through HiveServer.
Data is not "already cached" in Impala. Similar to Spark, you must read the data into a large portion of memory in order for operations to be quick. Unlike Spark, the daemons and statestore services remain active for handling subsequent queries.
Impala can query HBase, but it is not similar in architecture and in my experience, a well designed HBase table is faster to query than Impala. Impala is probably closer to Kudu.
Also worth mentioning that it's not really recommended to run Hive on MapReduce anymore. Tez is far better, and Hortonworks states that Hive LLAP is better than Impala, although, as you quoted, it largely "depends on the type of query and configuration."
Impala uses the "Impala Daemon" service to read data directly from the DataNodes (it must be installed on the same hosts as the DataNodes). It caches only the location of files and some statistics in memory, not the data itself.
That is why Impala can't see new files created within a table: you must run INVALIDATE METADATA or REFRESH (depending on your case) to tell Impala to cache the new files and be able to read them directly.
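For example (the database and table names are hypothetical):

REFRESH sales_db.transactions;          -- pick up new data files added to an existing table
INVALIDATE METADATA sales_db.new_table; -- reload metadata for a table created or changed outside Impala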
Since Impala works in memory, you need to have enough memory for the data read by the query. If your query will use more data than you have memory (e.g. a complex query with aggregations on huge tables), use Hive with the Spark engine instead of the default MapReduce:
set hive.execution.engine=spark; just before the query
You can then run the same query in Hive with the Spark engine.
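As a minimal sketch (the table and aggregation are hypothetical):

SET hive.execution.engine=spark;
-- the same heavy aggregation, now executed by Hive on Spark rather than in Impala's memory
SELECT customer_id, SUM(amount) AS total
FROM sales_db.transactions
GROUP BY customer_id;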
Impala is a Cloudera product; you won't find it on Hortonworks or MapR (or other distributions).
Tez, for example, is not included with Cloudera.
It all depends on the platform you are using.

Clarification of Sqoop and Flume

I am very new to big data and I have a little confusion regarding Sqoop and Flume.
I get the basic difference between Sqoop and Flume:
Sqoop is for transferring bulk data from an RDBMS
Flume is for streaming data such as log files
My confusion is that the big data architecture I am looking at (of which I have no virtual copy) groups structured data as transferred by Sqoop and unstructured data as streamed by Flume.
My question regarding that is: does that mean Flume is only for streaming?
What about high-frequency data? And does Flume support transferring unstructured data that is not log files (e.g. audio, video), or would Sqoop be able to handle that?
My final question is: can Sqoop work with federated data sources? If yes, with both real and virtual ones?
Thanks,
Apache Flume is a distributed, reliable, and available system for efficiently collecting, aggregating and moving large amounts of log data from many different sources to a centralized data store.
The use of Apache Flume is not only restricted to log data aggregation. Since data sources are customizable, Flume can be used to transport massive quantities of event data including but not limited to network traffic data, social-media-generated data, email messages and pretty much any data source possible.
Apache Sqoop is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases (it imports data, transforms the data in Hadoop MapReduce, and then exports the data).
Sqoop automates most of this process, relying on the database to describe the schema of the data to be imported. Sqoop uses MapReduce to import and export the data, which provides parallel operation as well as fault tolerance.
Source: sqoop-vs-flume-battle-of-the-hadoop
Reference: INGESTION AND STREAMING
Flume is efficient with streams, and if you just want to dump data from an RDBMS, why not use Sqoop?
If by high-frequency data you mean social media, then yes, Flume can handle it. Unstructured data: yes, Flume may handle that too.
Sqoop is essentially a tool to ingest data into HDFS from an RDBMS. Under the hood, it generates simple Java code which submits a query to the RDBMS and writes the result to HDFS. This means that you can import with Sqoop everything which can be accessed via a JDBC connection and which has a Java driver available. For this reason, you can't use it for files (like logs) or things like that.
So Sqoop can't handle video or audio files.
Flume, instead, is used to monitor and ingest information in real time. You can ingest everything for which there is a Flume source available (https://flume.apache.org/FlumeUserGuide.html#flume-sources).

When should I use Greenplum Database versus HAWQ?

We have a use case for retail industry data. We are building an EDW.
We currently do reporting from HAWQ, but we want to shift our MPP database from HAWQ to Greenplum.
Basically, we would like to make changes to the current data pipeline.
Our points of confusion about GPDB:
How is the GPDB layer going to affect our existing data pipeline? Here the data pipeline is external system --> talend --> hadoop-hawq --> tableau. We want to transform our data pipeline to external system --> talend --> hadoop-hawq --> greenplum --> tableau.
How is Greenplum physically or logically going to help with SQL transformation and reporting?
Which file format should I opt for when storing files in GPDB, given that in HAWQ we store files in plain text format? Which of the supported formats (e.g. Avro, Parquet) are good for writing to GPDB?
How are data files processed from GPDB so that it also brings faster reporting and predictive analysis?
Is there any way to push data from HAWQ into Greenplum? We are looking for guidance on how to shift our reporting use case from HAWQ into Greenplum.
Any help on this would be much appreciated.
This question is sort of like asking, "when should I use a wrench?" The answer is also going to be subjective, as Greenplum can be used for many different things. But I will do my best to give my opinion because you asked.
How is the GPDB layer going to affect our existing data pipeline? Here the data pipeline is external system --> talend --> hadoop-hawq --> tableau. We want to transform our data pipeline to external system --> talend --> hadoop-hawq --> greenplum --> tableau.
There are lots of ways to build the data pipeline; your approach of loading data into Hadoop first and then loading it into Greenplum is very common and works well. You can use External Tables in Greenplum to read data in parallel, directly from HDFS, so the data movement from the Hadoop cluster to Greenplum can be achieved with a simple INSERT statement.
INSERT INTO greenplum_customer SELECT * FROM hdfs_customer_file;
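For that INSERT to work, hdfs_customer_file has to be defined first as a readable external table. A minimal sketch, assuming the gphdfs protocol, hypothetical host and path names, and pipe-delimited text files as written from HAWQ:

CREATE EXTERNAL TABLE hdfs_customer_file (LIKE greenplum_customer)
LOCATION ('gphdfs://hdfs-namenode:8020/data/customer')
FORMAT 'TEXT' (DELIMITER '|');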
How is Greenplum physically or logically going to help with SQL transformation and reporting?
Isolation for one. With a separate cluster for Greenplum, you can provide analytics to your customers without impacting the performance of your Hadoop activity and vice-versa. This isolation also can provide an additional security layer.
Which file format should I opt for when storing files in GPDB, given that in HAWQ we store files in plain text format? Which of the supported formats (e.g. Avro, Parquet) are good for writing to GPDB?
With your data pipeline as you suggested, I would make the data format decision in Greenplum based on performance. For large tables, partition them and make them column-oriented with quicklz compression. For smaller tables, just make them append-optimized. And for tables that have lots of updates or deletes, keep the default heap storage.
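A minimal sketch of the large-table case (the table, columns, and partition range are hypothetical):

CREATE TABLE sales_fact (
    sale_id   bigint,
    sale_date date,
    amount    numeric
)
WITH (appendonly=true, orientation=column, compresstype=quicklz)
DISTRIBUTED BY (sale_id)
PARTITION BY RANGE (sale_date)
    (START (date '2017-01-01') INCLUSIVE
     END (date '2018-01-01') EXCLUSIVE
     EVERY (INTERVAL '1 month'));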
How are data files processed from GPDB so that it also brings faster reporting and predictive analysis?
Greenplum is an MPP database. The storage is "shared nothing" meaning that each node has unique data that no other node has (excluding mirroring for high-availability). A segment's data will always be on the local disk.
In HAWQ, because it uses HDFS, the data for a segment doesn't have to be local. On day 1, when you wrote the data to HDFS, it was local, but after node failures, expansion, etc., HAWQ may have to fetch the data from other nodes. This makes Greenplum's performance a bit more predictable than HAWQ's because of how Hadoop works.
Is there any way to push data from HAWQ into Greenplum? We are looking for guidance on how to shift our reporting use case from HAWQ into Greenplum.
Push, no, but pull, yes. As I mentioned above, you can create an External Table in Greenplum to SELECT data from HDFS. You can also create Writable External Tables in Greenplum to push data to HDFS.
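A minimal sketch of the writable direction (again with hypothetical names), mirroring the readable external table shown earlier:

CREATE WRITABLE EXTERNAL TABLE customer_to_hdfs (LIKE greenplum_customer)
LOCATION ('gphdfs://hdfs-namenode:8020/export/customer')
FORMAT 'TEXT' (DELIMITER '|');

INSERT INTO customer_to_hdfs SELECT * FROM greenplum_customer;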
