I have this erratic client who wants to push data from HAWQ to Greenplum after some pre-processing. Is there any way to do this? If not, is it possible to create an external table in Greenplum that reads the data from the HDFS cluster in which HAWQ is running?
Any help will be appreciated.
The simplest thing you can do is push the data from HAWQ to HDFS using a writable external table and then read it from Greenplum using a readable external table with the gphdfs protocol (a sketch follows below). In my opinion this would be the fastest option.
Another option would be to store the data in gzipped CSV files on HDFS and work with them directly from HAWQ. This way, when you need this data in Greenplum, you can just query it in the same way, as an external table.
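A minimal sketch of the unload/read flow; the column list, HDFS paths, NameNode host/port, and the PXF profile are assumptions to adapt to your cluster.

    -- In HAWQ: unload the (pre-processed) data to HDFS
    -- through a writable external table.
    CREATE WRITABLE EXTERNAL TABLE ext_sales_unload (
        id      int,
        amount  numeric,
        sold_at timestamp
    )
    LOCATION ('pxf://namenode:51200/data/sales_unload?PROFILE=HdfsTextSimple')
    FORMAT 'TEXT' (DELIMITER ',');

    INSERT INTO ext_sales_unload SELECT * FROM sales;

    -- In Greenplum: read the same HDFS files back
    -- through a readable external table using gphdfs.
    CREATE EXTERNAL TABLE ext_sales_load (
        id      int,
        amount  numeric,
        sold_at timestamp
    )
    LOCATION ('gphdfs://namenode:8020/data/sales_unload/')
    FORMAT 'TEXT' (DELIMITER ',');

    INSERT INTO sales SELECT * FROM ext_sales_load;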
HAWQ is the same as Greenplum; only the underlying storage is HDFS.
One way: you can create an external (writable) table in HAWQ which will write your data into a file; after that you can create an external (readable) table in Greenplum which will read the data from that file.
Another way: you can copy from one server to another using standard input/output (see the sketch below). I use this often when I need to push data from a development environment to production, or vice versa.
Another way: you can take a backup using pg_dump/gp_dump for a particular table (or tables), then restore it using pg_restore/gp_restore.
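For the standard input/output route, a minimal sketch; the schema and table names are placeholders, and the same COPY syntax works on both sides since HAWQ and Greenplum share the PostgreSQL dialect.

    -- On the source cluster: stream the table out as CSV.
    COPY my_schema.my_table TO STDOUT WITH CSV;

    -- On the target cluster: load the same stream back in.
    COPY my_schema.my_table FROM STDIN WITH CSV;

In practice the two COPY statements are usually chained through a single psql pipeline so the data never has to land on disk.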
Thanks
I am new to the Hadoop ecosystem and am self-learning it through online articles.
I am working on a very basic project so that I can get hands-on experience with what I have learnt.
My use-case is extremely simple: the idea is to present the location of users who log in to the portal to the app admin. So, I have a server which is continuously generating logs; the logs have a user ID, IP address, and timestamp, and all fields are comma-separated.
My idea is to have a Flume agent streaming the live log data and writing it to HDFS, have a Hive process in place which will read the incremental data from HDFS and write it to a Hive table, and then use Sqoop to continuously copy the data from Hive to an RDBMS SQL table and use that SQL table to play with.
So far I have successfully configured a Flume agent which reads logs from a given location and writes them to an HDFS location. But after this I am confused about how I should move the data from HDFS to a Hive table. One idea that comes to mind is to have a MapReduce program that will read the files in HDFS and write to Hive tables programmatically in Java. But I also want to delete files which are already processed and make sure that no duplicate records are read by MapReduce. I searched online and found a command that can be used to copy file data into Hive, but that is sort of a manual, one-time activity. In my use-case I want to push data as soon as it is available in HDFS.
Please guide me on how to achieve this task. Links will be helpful.
I am working on Version: Cloudera Express 5.13.0
Update 1:
I just created an external Hive table pointing to the HDFS location where Flume is dumping logs. I noticed that as soon as the table is created, I can query the Hive table and fetch data. This is awesome. But what will happen if I stop the Flume agent for the time being and let the app server write logs: if I then start Flume again, will Flume only read the new logs and ignore the logs which are already processed? Similarly, will Hive read the new logs which are not yet processed and ignore the ones which it has already processed?
how should I move data from HDFS to HIVE table
This isn't how Hive works. Hive is a metadata layer over existing HDFS storage. In Hive, you would define an EXTERNAL TABLE over wherever Flume writes your data.
As data arrives, Hive "automatically knows" that there is new data to be queried (since it reads all files under the given path).
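For example, a minimal sketch of such a table for the comma-separated logs described in the question; the HDFS path and column names are assumptions.

    -- Hive external table over the directory Flume writes to.
    CREATE EXTERNAL TABLE login_logs (
        user_id    STRING,
        ip_address STRING,
        login_ts   STRING
    )
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY ','
    STORED AS TEXTFILE
    LOCATION '/flume/weblogs/';

    -- Each query scans whatever files are currently under that path.
    SELECT user_id, ip_address, login_ts
    FROM login_logs
    LIMIT 10;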
what will happen if I stop flume agent for time being, let app server to write logs, now if I start flume again then will flume only read new logs and ignore logs which are already processed
Depends on how you've set up Flume. AFAIK, it will checkpoint all processed files and only pick up new ones.
will hive read new logs which are not processed and ignore the ones which it has already processed?
Hive has no concept of unprocessed records. All files in the table location will always be read, limited by your query conditions, upon each new query.
Bonus: Remove Flume and Sqoop. Make your app produce records into Kafka. Have Kafka Connect (or NiFi) write to both HDFS and your RDBMS from a single location (a Kafka topic). If you actually need to read log files, Filebeat or Fluentd use fewer resources than Flume (or Logstash).
Bonus 2: Remove HDFS & RDBMS and instead use a more real-time ingestion pipeline like Druid or Elasticsearch for analytics.
Bonus 3: Presto / SparkSQL / Flink-SQL are faster than Hive (note: the Hive metastore is actually useful, so keep the RDBMS around for that)
Is Hive and Impala integration possible?
After processing data in Hive, I want to store the result data in Impala for better reads. Is that possible?
If yes, can you please share an example?
Both Hive and Impala do not store any data. The data is stored in an HDFS location, and Hive and Impala are both used just to visualize/transform the data present in HDFS.
So yes, you can process the data using Hive and then read it using Impala, provided both of them have been set up properly. But since Impala's metadata cache needs to be refreshed, you need to run the INVALIDATE METADATA and REFRESH commands, for example as sketched below.
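A minimal sketch; the database and table names are placeholders.

    -- Run in impala-shell after Hive appends data to an existing table:
    REFRESH my_db.my_results;

    -- Run after Hive creates or drops tables, or changes a table's structure:
    INVALIDATE METADATA my_db.my_results;

    -- Then query from Impala as usual:
    SELECT count(*) FROM my_db.my_results;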
Impala uses the Hive metastore to read the data. Once you have a table created in Hive, it is possible to read and query it using Impala. All you need to do is REFRESH the table or trigger INVALIDATE METADATA in Impala to read the data.
Hope this helps :)
Hive and Impala are two different query engines. Each query engine is unique in terms of its architecture as well as its performance. We can use the Hive metastore to get metadata and run queries using Impala. A common use case is to connect to Impala/Hive from Tableau. If we are visualizing Hive from Tableau, we get the latest data without any workaround. If we keep loading data continuously, the metadata is updated as well, but Impala is not aware of those changes. So we should run the metadata invalidation query against impalad to refresh its state and sync it with the latest info available in the metastore, so that users get the same results as Hive when they run the same query from Tableau using the Impala engine.
There is no configuration parameter available right now to run this invalidation query periodically. There is a blog post that explains how to execute the metadata invalidation query periodically through the Oozie scheduler to handle such problems; or we can simply set up a cron job on the server itself.
I have a requirement: I need to refresh the production HAWQ database to the QA environment on a daily basis.
How can I move the everyday delta from Production into the QA cluster?
Appreciate your help
Thanks
Veeru
Shameless self-plug - have a look at the following open PR for using Apache Falcon to orchestrate a DR batch job and see if it fits your needs.
https://github.com/apache/incubator-hawq/pull/940
Here is the synopsis of the process:
Run hawqsync-extract to capture known-good HDFS file sizes (protects against HDFS/catalog inconsistency if there is a failure during the sync)
Run ETL batch (if any)
Run hawqsync-falcon, which performs the following steps:
Stop both HAWQ masters (source and target)
Archive source MASTER_DATA_DIRECTORY (MDD) tarball to HDFS
Restart source HAWQ master
Enable HDFS safe mode and force source checkpoint
Disable source and remote HDFS safe mode
Execute Apache Falcon-based distcp sync process
Enable HDFS safe mode and force remote checkpoint
There is also a JIRA with the design description:
https://issues.apache.org/jira/browse/HAWQ-1078
There isn't a built-in tool to do this so you'll have to write some code. It shouldn't be too difficult to write either because HAWQ doesn't support UPDATE or DELETE. You'll only have to append new data to QA.
Create a writable external table in Production for each table; these put the data in HDFS. You'll use the PXF protocol to write the data.
Create a readable external table in QA for each table; these read that data back.
On Day 1, you write everything to HDFS and then read everything from HDFS.
On Day 2 and onward, you find the max(id) from QA, remove the table's files from HDFS, insert into the writable external table with the query filtered so you only get records with an id larger than the max(id) from QA, and lastly execute an insert in QA by selecting all data from the external table.
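A rough sketch of the Day 2+ logic, assuming an append-only table named sales with a monotonically increasing id column; the table and external-table names are placeholders for the writable/readable external tables described above.

    -- In QA: find the high-water mark.
    SELECT max(id) FROM sales;   -- suppose this returns 1000

    -- In Production: after clearing the previous day's files from the
    -- HDFS directory, write only the new rows out through the
    -- writable external table.
    INSERT INTO ext_sales_write
    SELECT * FROM sales WHERE id > 1000;

    -- In QA: append the delta by selecting from the readable external table.
    INSERT INTO sales
    SELECT * FROM ext_sales_read;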
We have a small Hadoop and Greenplum cluster.
The current data pipeline flow is:
External table >> hadoop-hawq external readable table >> hawq internal table.
Output:
1. We are trying to extend the data pipeline using Greenplum. Basically we want to push HAWQ internal table or external readable table data directly into Greenplum.
The reason is that we want to edit our files, and HAWQ does not support UPDATE and DELETE. Is there any alternate way to approach this or push the data? Please guide.
2. How do we access HDFS data via a GPDB external table with the gphdfs protocol?
Thanks in Advance!
If you want to push data in a HAWQ internal table to Greenplum Database, you can:
1) Unload the data in the HAWQ internal table to a file on HDFS using a writable external table. Here is an example of doing the unload: http://gpdb.docs.pivotal.io/4380/admin_guide/load/topics/g-unloading-data-using-a-writable-external-table.html
2) Then load the data in the HDFS file into Greenplum Database using a readable external table with a protocol like gphdfs, gpfdist, etc. You can refer to http://gpdb.docs.pivotal.io/4320/admin_guide/load.html for details.
If you want to push data that is in a readable external table in HAWQ to Greenplum Database, you can directly use the same kind of readable external table in Greenplum Database as in HAWQ.
For gphdfs, here is an example which should help (and a small sketch follows below):
http://gpdb.docs.pivotal.io/4380/admin_guide/load/topics/g-example-1-greenplum-file-server-gpfdist.html
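For question 2, a minimal gphdfs sketch; the NameNode host/port, HDFS path, and columns are assumptions, and gphdfs requires the Hadoop client configuration to be in place on the Greenplum hosts.

    -- In Greenplum: a readable external table over HDFS files via gphdfs.
    CREATE EXTERNAL TABLE ext_hawq_data (
        id     int,
        name   text,
        amount numeric
    )
    LOCATION ('gphdfs://namenode:8020/hawq_unload/data/')
    FORMAT 'TEXT' (DELIMITER ',');

    -- Query it like any other table, or insert it into a regular table.
    SELECT count(*) FROM ext_hawq_data;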
Is there any way to expose Cassandra data as HDFS and then perform Shark/Hive queries on HDFS?
If yes, kindly provide some links on how to transform a Cassandra DB into HDFS.
You can write an identity MapReduce job which takes input from CFS (Cassandra File System) and dumps the data to HDFS.
Once you have the data in HDFS, you can map a Hive table over it and run queries.
The typical way to access Cassandra data in Hive is to use the CqlStorageHandler.
For details, see Hive Support for Cassandra CQL3 (a rough sketch of such a table follows below).
But if you have some reasons to access the data directly, take a look at Cassowary. It is a "Hive storage handler for Cassandra and Shark that reads the SSTables directly. This allows total control over the resources used to run ad-hoc queries so that the impact on real-time Cassandra performance is controlled."
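As an illustration only, here is a heavily hedged sketch of what a storage-handler-backed Hive table can look like; the storage handler class name and the table properties are assumptions that differ between handler builds, so check the README of the handler you pick.

    -- Hive table mapped onto an existing Cassandra keyspace/table.
    -- 'my_ks' and 'users' are placeholders; the handler class and
    -- property names are assumptions; confirm them for your build.
    CREATE EXTERNAL TABLE cassandra_users (
        user_id STRING,
        name    STRING,
        email   STRING
    )
    STORED BY 'org.apache.hadoop.hive.cassandra.cql3.CqlStorageHandler'
    TBLPROPERTIES (
        "cassandra.host"    = "127.0.0.1",
        "cassandra.ks.name" = "my_ks",
        "cassandra.cf.name" = "users"
    );

    -- Once mapped, Hive (or Shark) can query it like a normal table:
    SELECT count(*) FROM cassandra_users;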
I think you are trying to run Hive/Shark against data already in Cassandra. If that is the case, then you don't need to access it as HDFS, but you do need a Hive storage handler for using it against Cassandra.
For this you can use Tuplejump's project, CASH. The README provides instructions on how to build and use it. If you want to put your "big files" in Cassandra and query them, like you do from HDFS, you will need a filesystem that runs on Cassandra, like DataStax's CFS present in DSE, or Tuplejump's SnackFS (present in the Calliope Project Early Access Repo).
Disclaimer: I work for Tuplejump, Inc.
You can use Tuplejump Calliope Project.
https://github.com/tuplejump/calliope
Configure an external Cassandra table in Shark (like Hive) using the storage handler provided in the Tuplejump code.
All the best!
Three Cassandra Hive storage handlers:
https://github.com/2013Commons/hive-cassandra (for Cassandra 2.0 and Hadoop 2)
https://github.com/dvasilen/Hive-Cassandra/tree/HIVE-0.11.0-HADOOP-2.0.0-CASSANDRA-1.2.9
https://github.com/richardalow/cassowary (reads directly from SSTables)