When should I use Greenplum Database versus HAWQ? - greenplum

We are having use case for retail industry data. We are into making of EDW.
We are currently doing reporting from HAWQ.But We wanted to shift our MPP database from Hawq into Greenplum.
Basically,We would like to make changes into current data pipeline.
Our confusion points about gpdb :
HOW gpdb layer going to affect our existing data pipeline. Here data
pipeline is external system --> talend -->hadoop-hawq-->tableau. We
want to transform our data pipeline as external system --> talend
-->hadoop-hawq-->greenplum -->tableau.
How Greenplum is physically or logically going to help in SQL
transformation and reporting.
Which file format should i opt for storing the files in GPDB while
HAWQ we are storing files in plain text format.What are the supported format is good for writing in gpdb like avro,parquet etc.
How is data file processed from GPDB . so, that
it also bring faster reporting and predictive analysis.
Is there any way to push data from HAWQ into Greenplum? We are
looking for guidance how to take shift our reporting use case from
HAWQ INTO Greenplum.
Any help on it would be much appreciated?

This query is sort of like asking, "when should I use a wrench?" The answer is also going to be subjective as Greenplum can be used for many different things. But, I will do my best to give my opinion because you asked.
HOW gpdb layer going to affect our existing data pipeline. Here data pipeline is external system --> talend -->hadoop-hawq-->tableau. We want to transform our data pipeline as external system --> talend -->hadoop-hawq-->greenplum -->tableau.
There are lots of ways to do the data pipeline your goal of loading data into Hadoop first and then load it to Greenplum is very common and works well. You can use External Tables in Greenplum to read data in parallel, directly from HDFS. So the data movement from the Hadoop cluster to Greenplum can be achieved with a simple INSERT statement.
INSERT INTO greenplum_customer SELECT * FROM hdfs_customer_file;
How Greenplum is physically or logically going to help in SQL transformation and reporting.
Isolation for one. With a separate cluster for Greenplum, you can provide analytics to your customers without impacting the performance of your Hadoop activity and vice-versa. This isolation also can provide an additional security layer.
Which file format should i opt for storing the files in GPDB while
HAWQ we are storing files in plain text format.What are the supported format is good for writing in gpdb like avro,parquet etc.
With your data pipeline as you suggested, I would make the data format decision in Greenplum based on performance. So large tables, partition the tables and make it column oriented with quicklz compression. For smaller tables, just make it append optimized. And for tables that have lots of updates or deletes, keep it the default heap.
How is data file processed from GPDB . so, that it also bring faster reporting and predictive analysis.
Greenplum is an MPP database. The storage is "shared nothing" meaning that each node has unique data that no other node has (excluding mirroring for high-availability). A segment's data will always be on the local disk.
In HAWQ, because it uses HDFS, the data for the segment doesn't have to be local. Day 1, when you wrote the data to HDFS, it was local but after failed nodes, expansion, etc, HAWQ may have to fetch the data from other nodes. This makes Greenplum's performance a bit more predictable than HAWQ because of how Hadoop works.
Is there any way to push data from HAWQ into Greenplum? We are
looking for guidance how to take shift our reporting use case from
HAWQ INTO Greenplum.
Push, no but pull, Yes. As I mentioned above, you can create an External Table in Greenplum to SELECT data from HDFS. You can also create Writable External Tables in Greenplum to push data to HDFS.

Related

Query Parquet data through Vertica (Vertica Hadoop Integration)

So I have a Hadoop cluster with three nodes. Vertica is co-located on cluster. There are Parquet files (partitioned by Hive) on HDFS. My goal is to query those files using Vertica.
Right now what I did is using HDFS Connector, basically create an external table in Vertica, then link it to HDFS:
CREATE EXTERNAL TABLE tableName (columns)
AS COPY FROM "hdfs://hostname/...../data" PARQUET;
Since the data size is big. This method will not achieve good performance.
I have done some research, Vertica Hadoop Integration
I have tried HCatalog but there's some configuration error on my Hadoop so that's not working.
My use case is to not change data format on HDFS(Parquet), while query it using Vertica. Any ideas on how to do that?
EDIT: The only reason Vertica got slow performance is because it cant use Parquet's partitions. With higher version Vertica(8+), it can utlize hive's metadata now. So no HCatalog needed.
Terminology note: you're not using the HDFS Connector. Which is good, as it's deprecated as of 8.0.1. You're using the direct interface described in Reading Hadoop Native File Formats, with libhdfs++ (the hdfs scheme) rather than WebHDFS (the webhdfs scheme). That's all good so far. (You can also use the HCatalog Connector, but you need to do some additional configuration and it will not be faster than an external table.)
Your Hadoop cluster has only 3 nodes and Vertica is co-located on them, so you should be getting the benefits of node locality automatically -- Vertica will use the nodes that have the data locally when planning queries.
You can improve query performance by partitioning and sorting the data so Vertica can use predicate pushdown, and also by compressing the Parquet files. You said you don't want to change the data so maybe these suggestions don't work for you; they're not specific to Vertica so they might be worth considering anyway. (If you're using other tools to interact with your Parquet data, they'll benefit from these changes too.) The documentation of these techniques was improved in 8.0.x (link is to 8.1 but this was in 8.0.x too).
Additional partitioning support was added in 8.0.1. It looks like you're using at least 8.0; I can't tell if you're using 8.0.1. If you are, you can create the external table to only pay attention to the partitions you care about with something like:
CREATE EXTERNAL TABLE t (id int, name varchar(50),
created date, region varchar(50))
AS COPY FROM 'hdfs:///path/*/*/*'
PARQUET(hive_partition_cols='created,region');

Loading data into HIVE to support front end application

We have a datawarehousing application which we are planning to convert to Hadoop.
Currently, there are 20 feeds that we receive on daily basis and load this data into MySQL database.
As the data is getting large, we are planning to move to Hadoop for faster query processing.
As the first step we are planning to load the data into HIVE on a daily basis instead of MySQL.
Question:-
1.Can I convert Hadoop similar to a DWH application to process files on daily basis?
2.When I load the data in Master Node, will it be sync'd automatically?
It really depends on the size of your data. The Question is a bit complex but in general you will have to design your own pipeline.
If you are analyzing raw logs HDFS will be a good choice to start from. You can use Java, Python or Scala to schedule the Hive jobs on daily basis and use Sqoop if you still need some MySQL data.
In Hive you will have to create partitioned table to be synced and available upon query execution. Partition creation can be also scheduled.
I would suggest to go with Impala instead of Hive as it is more tunable, fault tolerant and easier to use.

How to get data from HDFS? Hive?

I am new to Hadoop. I ran a map reduce on my data and now I want to query it so I can put it into my website. Is Apache Hive the best way to do that? I would greatly appreciate any help.
Keep in mind that Hive is a batch processing system, which under the hoods converts the SQL statements to bunch of MapReduce jobs with stage builds in between. Also, Hive is a high latency system i.e. based on your dataset sizes you are looking at minutes to hours or even days to process a complicated query.
So, if you want to serve the results from your MapReduce job output in your website, its highly recommended you export the results back to a RDBMS using sqoop and then take it from there.
Or, if the data itself is huge and cannot be exported back to RDBMS. Then another option you could think of is using a NoSQL system like HBase.
welcome to Hadoop!
I highly recommend you watch Cloudera Essentials for Apache Hadoop | Chapter 5: The Hadoop Ecosystem and familiarize yourself with the different ways to transfer data inbound and outbound from your HDFS cluster. The video is easy-to-watch and describes advantages / disadvantages to each tool, but this outline should give you the basics of the Hadoop Ecosystem:
Flume - Data integration and import of flat files into HDFS. Designed for asynchronous data streams (e.g., log files). Distributed, scalable, and extensible. Supports various endpoints. Allows preprocessing on data before loading to HDFS.
Sqoop - Bidirectional transfer of structured data (RDBMS) and HDFS. Permits incremental import to HDFS. RDBMS must support JDBC or ODBC.
Hive - SQL-like interface to Hadoop. Requires table structure. JDBC and/or ODBC is required.
Hbase - Allows interactive access of HDFS. Sits on top of HDFS and apply structure to data. Allows for random reads, scales horizontally with cluster. Not a full query language; only permits get/put/scan operations (can be used with Hive and/or Impala). Row-key indexes only on data. Does not use Map Reduce paradigm.
Impala - Similar to Hive, high-performance SQL Engine for querying vast amounts of data stored in HDFS. Does not use Map Reduce. Good alternative to Hive.
Pig - Data flow language for transforming large datasets. Permits schema optionally defined at runtime. PigServer (Java API) permits programmatic access.
Note: I assume the data you are trying to read already exists in HDFS. However, some of the products in the Hadoop ecosystem may be useful for your application or as a general reference, so I included them.
If you're only looking to get data from HDFS then yes, you can do so via Hive.
However, you'll most beneficiate from it if your data are already organized (for instance, in columns).
Lets take an example : your map-reduce job produced a csv file named wordcount.csv and containing two rows : word and count. This csv file is on HDFS.
Let's now suppose you want to know the occurence of the word "gloubiboulga". You can simply achieve this via the following code :
CREATE TABLE data
(
word STRING,
count INT,
text2 STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ",";
LOAD DATA LOCAL INPATH '/wordcount.csv'
OVERWRITE INTO TABLE data;
select word, count from data where word=="gloubiboulga";
Please note that while this language looks highly like SQL, you'll still have to learn a few things about it.

Hbase in comparison with Hive

Im trying to get a clear understanding on HBASE.
Hive:- It just create a Tabular Structure for the Underlying Files in
HDFS. So that we can enable the user to have Querying Abilities on the
HDFS file. Correct me if im wrong here?
Hbase- Again, we have create a Similar table Structure, But bit more
in Structured way( Column Oriented) again over HDFS File system.
aren't they both Same considering the type of job they does. except that Hive runs on Mapredeuce.
Also is that true that we cant create a Hbase table over an Already existing HDFS file?
Hive shares a very similar structures to traditional RDBMS (But Not all), HQL syntax is almost similar to SQL which is good for Database Programmer from learning perspective where as HBase is completely diffrent in the sense that it can be queried only on the basis of its Row Key.
If you want to design a table in RDBMS, you will be following a structured approach in defining columns concentrating more on attributes, while in Hbase the complete design is concentrated around the data, So depending on the type of query to be used we can design a table in Hbase also the columns will be dynamic and will be changing at Runtime (core feature of NoSQL)
You said aren't they both Same considering the type of job they does. except that Hive runs on Mapredeuce .This is not a simple thinking.Because when a hive query is executed, a mapreduce job will be created and triggered.Depending upon data size and complexity it may consume time, since for each mapreduce job, there are some number of steps to do by JobTracker, initializing tasks like maps,combine,shufflesort, reduce etc.
But in case we access HBase, it directly lookup the data they indexed based on specified Scan or Get parameters. Means it just act as a database.
Hive and HBase are completely different things
Hive is a way to create map/reduce jobs for data that resides on HDFS (can be files or HBase)
HBase is an OLTP oriented key-value store that resides on HDFS and can be used in Map/Reduce jobs
In order for Hive to work it holds metadata that maps the HDFS data into tabular data (since SQL works on tables).
I guess it is also important to note that in recent versions Hive is evolving to go beyond a SQL way to write map/reduce jobs and with what HortonWorks calls the "stinger initiative" they have added a dedicated file format (Orc) and import Hive's performance (e.g. with the upcoming Tez execution engine) to deliver SQL on Hadoop (i.e. relatively fast way to run analytics queries for data stored on Hadoop)
Hive:
It's just create a Tabular Structure for the Underlying Files in HDFS. So that we can enable the user to have SQL-like Querying Abilities on existing HDFS files - with typical latency up to minutes.
However, for best performance it's recommended to ETL data into Hive's ORC format.
HBase:
Unlike Hive, HBase is NOT about running SQL queries over existing data in HDFS.
HBase is a strictly-consistent, distributed, low-latency KEY-VALUE STORE.
From The HBase Definitive Guide:
The canonical use case of Bigtable and HBase is the webtable, that is, the web pages
stored while crawling the Internet.
The row key is the reversed URL of the pageā€”for example, org.hbase.www. There is a
column family storing the actual HTML code, the contents family, as well as others
like anchor, which is used to store outgoing links, another one to store inbound links,
and yet another for metadata like language.
Using multiple versions for the contents family allows you to store a few older copies
of the HTML, and is helpful when you want to analyze how often a page changes, for
example. The timestamps used are the actual times when they were fetched from the
crawled website.
The fact that HBase uses HDFS is just an implementation detail: it allows to run HBase on an existing Hadoop cluster, it guarantees redundant storage of data; but it is not a feature in any other sense.
Also is that true that we cant create a Hbase table over an already
existing HDFS file?
No, it's NOT true. Internally HBase stores data in its HFile format.

Lookup data in Relation database from Hadoop side

I am converting SSIS solution to Hadoop for ETL processing in the data-warehouse.
My expected system:
ETL - landing & staging (Hadoop) ----put-data---> Data-warehouse(MySQL)
The problem is: in transform phrase, I need to lookup data in MySQL from hadoop side (pig or mapreduce job). There are 2 solutions:
1st: Clone all tables need to lookup from MySQL into Hadoop. It means that we need to maintain data from 2 places.
2nd: query directly to MySQL. I am worried about many connections come to MySQL server.
What is solution/best practise for this problem? Are there any other solutions.
You will have to have some representation of your dimensional tables in Hadoop. Depending on the way how you do ETL of the dimension data, you might actually have them as a side effect of the ETL.
Are you planning to store the most granular fact data in MySQL? I my experience, Hive + Hadoop beat realational databases when it comes to storing and analyzing the fact data. If you need a realtime access to the results of the queries, you then can "cache" the summary results by storing them in MySQL.

Resources