How to replicate HAWQ data between clusters

I have a requirement to refresh the production HAWQ database into the QA environment on a daily basis.
How can I move each day's delta from the Production cluster into the QA cluster?
Appreciate your help.
Thanks,
Veeru

Shameless self-plug - have a look at the following open PR, which uses Apache Falcon to orchestrate a DR batch job, and see if it fits your needs.
https://github.com/apache/incubator-hawq/pull/940
Here is a synopsis of the process:
1) Run hawqsync-extract to capture known-good HDFS file sizes (protects against HDFS/catalog inconsistency if a failure occurs during the sync).
2) Run the ETL batch (if any).
3) Run hawqsync-falcon, which performs the following steps:
   Stop both HAWQ masters (source and target)
   Archive the source MASTER_DATA_DIRECTORY (MDD) tarball to HDFS
   Restart the source HAWQ master
   Enable HDFS safe mode and force a source checkpoint
   Disable source and remote HDFS safe mode
   Execute the Apache Falcon-based distcp sync process
   Enable HDFS safe mode and force a remote checkpoint
There is also a JIRA with the design description:
https://issues.apache.org/jira/browse/HAWQ-1078

There isn't a built-in tool to do this, so you'll have to write some code. It shouldn't be too difficult to write either, because HAWQ doesn't support UPDATE or DELETE; you only have to append new data to QA.
Create a writable external table in Production for each table; it puts the data in HDFS, using the PXF protocol to write it.
Create a readable external table in QA for each table that reads this data.
On day 1, you write everything to HDFS and then read everything from HDFS.
On day 2+, you find the max(id) from QA. Remove the table's old files from HDFS. Insert into the writable external table, but filter the query so you only get records with an id greater than the max(id) from QA. Lastly, execute an insert in QA by selecting all data from the external table.
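For illustration, here is a rough sketch of what those external tables and the daily load could look like. The table name, columns, PXF host/port and HDFS path are all made up, and the exact PXF LOCATION syntax and profile depend on your HAWQ/PXF version:

-- Production (HAWQ): writable external table that lands rows in HDFS via PXF.
CREATE WRITABLE EXTERNAL TABLE events_export (id bigint, payload text)
LOCATION ('pxf://<pxf-host>:<pxf-port>/data/events_export?PROFILE=HdfsTextSimple')
FORMAT 'TEXT' (DELIMITER ',');

-- QA (HAWQ): readable external table over the same HDFS files
-- (assumes QA can reach that HDFS path, or that the files are copied over).
CREATE EXTERNAL TABLE events_import (id bigint, payload text)
LOCATION ('pxf://<pxf-host>:<pxf-port>/data/events_export?PROFILE=HdfsTextSimple')
FORMAT 'TEXT' (DELIMITER ',');

-- Day 2+ on Production: export only the rows above QA's current max(id);
-- <max_id_from_qa> is the value looked up on QA beforehand.
INSERT INTO events_export SELECT * FROM events WHERE id > <max_id_from_qa>;

-- Day 2+ on QA: append the new rows.
INSERT INTO events SELECT * FROM events_import;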

Streaming live data from HDFS to Hive

I am new to the Hadoop ecosystem and am teaching myself through online articles.
I am working on a very basic project to get hands-on experience with what I have learned.
My use-case is extremely simple: I want to show the app admin the location of each user who logs in to the portal. So, I have a server which continuously generates logs; each log entry has a user id, IP address and timestamp, all comma separated.
My idea is to have a Flume agent stream the live log data and write it to HDFS, have a Hive process in place which reads the incremental data from HDFS and writes it to a Hive table, and then use Sqoop to continuously copy data from Hive to an RDBMS SQL table and work with that SQL table.
So far I have successfully configured a Flume agent which reads logs from a given location and writes to an HDFS location. But after this I am confused about how I should move the data from HDFS into a Hive table. One idea that comes to mind is to have a MapReduce program that reads the files in HDFS and writes to Hive tables programmatically in Java. But I also want to delete files which have already been processed and make sure no duplicate records are read by the MapReduce job. I searched online and found a command that can be used to copy file data into Hive, but that is a sort of manual, one-off activity. In my use case I want to push data as soon as it is available in HDFS.
Please guide me on how to achieve this task. Links would be helpful.
I am working on Cloudera Express 5.13.0.
Update 1:
I just created an external Hive table pointing to the HDFS location where Flume is dumping the logs. I noticed that as soon as the table is created, I can query the Hive table and fetch data. This is awesome. But what will happen if I stop the Flume agent for a while, let the app server keep writing logs, and then start Flume again? Will Flume only read the new logs and ignore the logs which have already been processed? Similarly, will Hive read the new logs which are not yet processed and ignore the ones it has already processed?
how should I move data from HDFS to HIVE table
This isn't how Hive works. Hive is a metadata layer over existing HDFS storage. In Hive, you would define an EXTERNAL TABLE over wherever Flume writes your data.
As data arrives, Hive "automatically knows" that there is new data to be queried (since it reads all files under the given path)
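For example, a minimal external table over the Flume output directory could look like the following (the HDFS path, table and column names are assumptions based on your comma-separated user id / IP / timestamp format):

CREATE EXTERNAL TABLE login_events (
  user_id    STRING,
  ip_address STRING,
  event_time STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/flume/portal_logs';

-- Every query scans whatever files currently sit under that path,
-- so newly delivered Flume files show up automatically.
SELECT ip_address, count(*) FROM login_events GROUP BY ip_address;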
what will happen if I stop flume agent for time being, let app server to write logs, now if I start flume again then will flume only read new logs and ignore logs which are already processed
Depends how you've set up Flume. AFAIK, it will checkpoint all processed files and only pick up new ones.
will hive read new logs which are not processed and ignore the ones which it has already processed?
Hive has no concept of unprocessed records. All files in the table location will always be read, limited by your query conditions, upon each new query.
Bonus: Remove Flume and Sqoop. Make your app produce records into Kafka. Have Kafka Connect (or NiFi) write to both HDFS and your RDBMS from a single location (a Kafka topic). If you actually need to read log files, Filebeat or Fluentd take fewer resources than Flume (or Logstash).
Bonus 2: Remove HDFS & RDBMS and instead use a more real-time ingestion pipeline like Druid or Elasticsearch for analytics.
Bonus 3: Presto / SparkSQL / Flink-SQL are faster than Hive (note: the Hive metastore is actually useful, so keep the RDBMS around for that)

Newbie: Hadoop IIS Logs - Reasonable approach?

I am a total beginner with Hadoop, so sorry if this is a stupid question.
My fictional scenario is that I have several web servers (IIS) with several log locations. I want to centralize these log files and, based on the data, analyze the health of the applications and the web servers.
Since the Hadoop ecosystem offers a variety of tools, I am not sure if my solution is a valid one.
So I thought that I would move the log files to HDFS, create an external table on the directory and an internal table, and copy the data via Hive (insert into ... select from) from the external table to the internal table (with some filtering, because of the comment lines beginning with #).
When the data is stored in the internal table, I delete the previously moved files from HDFS.
Technically it works, I have already tried it - but is this a reasonable approach?
And if yes, how would I automate these steps? So far I have done everything manually via Ambari.
Thanks for your input.
BW
Yes, this is a perfectly fine approach.
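For reference, a minimal sketch of the external-to-internal copy you describe might look like this (table, column names and paths are hypothetical; IIS W3C logs are space-delimited and their comment lines start with #):

-- External table over the raw IIS files landed in HDFS (schema is illustrative).
CREATE EXTERNAL TABLE iis_raw (
  log_date  STRING,
  log_time  STRING,
  c_ip      STRING,
  cs_method STRING,
  cs_uri    STRING,
  sc_status STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
LOCATION '/data/iis/incoming';

-- Internal (managed) table that holds the cleaned data.
CREATE TABLE iis_clean (
  log_date  STRING,
  log_time  STRING,
  c_ip      STRING,
  cs_method STRING,
  cs_uri    STRING,
  sc_status STRING
)
STORED AS ORC;

-- Copy, skipping the comment lines that begin with '#'.
INSERT INTO TABLE iis_clean
SELECT * FROM iis_raw WHERE log_date NOT LIKE '#%';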
Outside of setting up the Hive tables ahead of time, what's left to automate?
You want to run things on a schedule? Use Oozie, Luigi, Airflow, or Azkaban.
Ingesting logs from other Windows servers because you have a highly available web service? Use Puppet, for example, to configure your log collection agents (not Hadoop related).
Note: if it's only log file collection that you care about, I would probably use Elasticsearch instead of Hadoop to store the data, Filebeat to continuously watch the log files, Logstash to apply per-message filtering, and Kibana to do visualizations. If you are combining Elasticsearch for fast indexing/searching with Hadoop for archival, you can insert Kafka between the log message ingestion and the message writers/consumers.

How can I copy files from an external Hadoop cluster to Amazon S3 without running any commands on the cluster?

I have a scenario in which I have to pull data from a Hadoop cluster into AWS.
I understand that running distcp on the Hadoop cluster is a way to copy the data into S3, but I have a restriction here: I won't be able to run any commands on the cluster. I should be able to pull the files from the Hadoop cluster into AWS. The data is available in Hive.
I thought of the options below:
1) Sqoop the data from Hive - is that possible?
2) S3DistCp (running it on AWS) - if so, what configuration would be needed?
Any suggestions?
If the Hadoop cluster is visible from EC2-land, you could run a distcp command there, or, if it's a specific bit of data, some Hive query which uses hdfs:// as input and writes out to S3. You'll need to deal with Kerberos auth, though: you cannot use distcp in an un-kerberized cluster to read data from a kerberized one, though you can go the other way.
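To illustrate the Hive-query route: assuming the S3A connector and credentials are configured for Hive, a sketch could look like this (bucket, table and column names are made up):

-- External table whose storage location is S3.
CREATE EXTERNAL TABLE export_to_s3 (
  id      BIGINT,
  payload STRING
)
STORED AS ORC
LOCATION 's3a://my-bucket/export/my_table/';

-- Read from the HDFS-backed Hive table and write the results out to S3.
INSERT OVERWRITE TABLE export_to_s3
SELECT id, payload FROM my_table;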
You can also run distcp locally on one or more machines, though you are limited by the bandwidth of those individual systems. distcp is best when it schedules the uploads on the hosts which actually have the data.
Finally, if it is incremental backup you are interested in, you can use the HDFS audit log as a source of changed files... this is what incremental backup tools tend to use.

How to copy hbase table from hbase-0.94 cluster to hbase-0.98 cluster

We have an hbase-0.94 cluster with hadoop-1.0.1. We don't want any downtime for this cluster while upgrading to hbase-0.98 with hadoop-2.5.1.
I have provisioned another hbase-0.98 cluster with hadoop-2.5.1 and want to copy the hbase-0.94 tables to hbase-0.98. HBase CopyTable does not seem to work for this purpose.
Please suggest a way to perform the above task.
These are the available options; you can choose among them.
1) You can use the org.apache.hadoop.hbase.mapreduce.Export tool to export tables to HDFS and then use hadoop distcp to move the data to the other cluster. When the data is in place on the second cluster, you can use the org.apache.hadoop.hbase.mapreduce.Import tool to import the tables. Please look at http://hbase.apache.org/book.html#export.
2) The second option is to use the CopyTable tool; please look at http://hbase.apache.org/book.html#copytable. Have a look at the Pivotal docs as well.
3) The third option is to enable HBase snapshots, create table snapshots, and then use the ExportSnapshot tool to move them to the second cluster. When the snapshots are on the second cluster, you can clone tables from the snapshots. Please look at http://hbase.apache.org/book.html#ops.snapshots. HBase snapshots allow you to take a snapshot of a table without too much impact on the Region Servers; the snapshot, clone and restore operations don't involve data copying, and exporting a snapshot to another cluster doesn't impact the Region Servers either.
I have used 1 and 3 for moving data between clusters, and in my case 3 was the better solution.
Run the command below on the source cluster; make sure you have cross-cluster authentication enabled.
/usr/bin/hbase org.apache.hadoop.hbase.mapreduce.CopyTable \
  -Ddfs.nameservices=nameservice1,devnameservice \
  -Ddfs.ha.namenodes.devnameservice=devnn1,devnn2 \
  -Ddfs.namenode.rpc-address.devnameservice.devnn1=<destination_namenode01_host>:<destination_namenode01_port> \
  -Ddfs.namenode.rpc-address.devnameservice.devnn2=<destination_namenode02_host>:<destination_namenode02_port> \
  -Ddfs.client.failover.proxy.provider.devnameservice=org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider \
  -Dmapred.map.tasks.speculative.execution=false \
  --peer.adr=<destination_zookeeper host>:<port>:/hbase \
  --versions=<n> <table_name>

How to push data from HAWQ into GREENPLUM?

I have this erratic client who wants to push data from HAWQ to Greenplum after some pre-processing. Is there any way to do this? If not, is it possible to create an external table in Greenplum that reads the data from the HDFS on which HAWQ is running?
Any help will be appreciated.
The simplest thing you can do is push the data from HAWQ to HDFS using a writable external table and then read it from Greenplum using a readable external table with the gphdfs protocol. In my opinion this would be the fastest option.
Another option would be to store the data in gzipped CSV files on HDFS and work with them directly from HAWQ. This way, when you need this data in Greenplum, you can just query it in the same way, as an external table.
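A sketch of the first option (writable external table in HAWQ, gphdfs readable external table in Greenplum), with made-up table, host and path names; the PXF LOCATION details depend on your HAWQ setup, and gphdfs must be configured on the Greenplum side:

-- In HAWQ: writable external table that lands the rows as text files in HDFS.
CREATE WRITABLE EXTERNAL TABLE sales_out (id int, amount numeric, sale_date date)
LOCATION ('pxf://<pxf-host>:<pxf-port>/exchange/sales?PROFILE=HdfsTextSimple')
FORMAT 'TEXT' (DELIMITER ',');

INSERT INTO sales_out SELECT * FROM sales;

-- In Greenplum: readable external table over those files via the gphdfs protocol
-- (point the path at the file or directory the HAWQ table wrote).
CREATE EXTERNAL TABLE sales_in (id int, amount numeric, sale_date date)
LOCATION ('gphdfs://<namenode-host>:<port>/exchange/sales')
FORMAT 'TEXT' (DELIMITER ',');

INSERT INTO sales SELECT * FROM sales_in;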
HAWQ is the same as Greenplum, only the underlying storage is HDFS.
One way: you can create an external (writable) table in HAWQ which will write your data into a file; after this you can create an external (readable) table in Greenplum which will read the data from that file.
Another way: you can copy from one server to another using standard input/output. I use it often when I need to push data from the development environment to production or vice versa.
Another way: you can take a backup of a particular table or tables using pg_dump/gp_dump and then restore it using pg_restore/gp_restore.
Thanks
