What happens if Sqoop fails in the middle of a data transfer - sqoop

What happens when a Sqoop job fails while transferring data from an RDBMS to HDFS, or vice versa?

Sqoop can export data from HDFS into an RDBMS using parallel data transfer tasks. Each task will open a connection to the database, insert into the database via transactions, and commit periodically. This means that before the entire export job is complete, partial data will be available in the database.
If an export map task fails even after multiple retries, the entire job will fail. Task failures can be caused by network connectivity issues, violated database integrity constraints, malformed records on HDFS, cluster capacity issues, and so on. In such a failure case, the data that was already committed will still be present in the database.
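If partially committed rows are a concern, Sqoop's export staging-table option can help, since rows only reach the target table if the whole export succeeds. A minimal sketch, assuming a MySQL target and placeholder connection, table, and path names:

    sqoop export \
      --connect jdbc:mysql://dbhost:3306/sales \
      --username etl -P \
      --table orders \
      --staging-table orders_staging \
      --clear-staging-table \
      --export-dir /data/orders \
      -m 4
    # orders_staging must already exist with the same schema as orders.
    # Rows are inserted into the staging table first and moved to orders in a
    # single transaction only after all map tasks finish, so a failed export
    # leaves the target table untouched.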

Related

Is Hive and Impala integration possible?

After processing data in Hive I want to store the result data in Impala for better read performance; is that possible?
If yes, can you please share an example?
Neither Hive nor Impala stores any data itself. The data is stored in an HDFS location, and Hive and Impala are both just used to query/transform the data present in HDFS.
So yes, you can process the data using Hive and then read it using Impala, provided both of them have been set up properly. But since Impala's metadata cache needs to be refreshed, you need to run the INVALIDATE METADATA or REFRESH commands after Hive makes changes.
Impala uses the Hive metastore to read the data. Once you have created a table in Hive, it is possible to read and query it using Impala. All you need to do is REFRESH the table or trigger INVALIDATE METADATA in Impala to read the data.
Hope this helps :)
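For example (a minimal sketch; the database, table, and host names are placeholders):

    # Create and load a table on the Hive side
    hive -e "CREATE TABLE db1.events (id INT, payload STRING) STORED AS PARQUET"
    # Tell Impala to pick up the new table before querying it
    impala-shell -i impala-host -q "INVALIDATE METADATA db1.events"
    impala-shell -i impala-host -q "SELECT COUNT(*) FROM db1.events"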
Hive and Impala are two different query engines. Each query engine is unique in terms of its architecture as well as its performance. We can use the Hive metastore to get metadata and run queries using Impala. A common use case is to connect to Impala/Hive from Tableau. If we are visualizing Hive data from Tableau, we get the latest data without any workaround: if we keep loading data continuously, the metadata is updated as well. Impala, however, is not aware of those changes, so we should run a metadata invalidation query against impalad to refresh its state and sync it with the latest info available in the metastore. That way users get the same results as Hive when they run the same query from Tableau using the Impala engine.
There is currently no configuration parameter available to run this invalidation query periodically. This blog describes how to execute the metadata invalidation query periodically through the Oozie scheduler to handle such problems, or we can simply set up a cron job on the server itself.
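For instance, a crontab entry along these lines (the host, table name, and schedule are placeholders) keeps a frequently loaded table reasonably fresh:

    # Refresh the table's file/block metadata in Impala every 15 minutes
    */15 * * * * impala-shell -i impala-host -q "REFRESH db1.events" >> /var/log/impala_refresh.log 2>&1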

How to replicate HAWQ data between clusters

I have a requirement: I need to refresh the production HAWQ database into a QA environment on a daily basis.
How can I move the daily delta from the Production cluster into the QA cluster?
Appreciate your help
Thanks
Veeru
Shameless self-plug - have a look at the following open PR for using Apache Falcon to orchestrate a DR batch job and see if it fits your needs.
https://github.com/apache/incubator-hawq/pull/940
Here is the synopsis of the process:
Run hawqsync-extract to capture known-good HDFS file sizes (protects against HDFS / catalog inconsistency if failure during sync)
Run ETL batch (if any)
Run hawqsync-falcon, which performs the following steps:
Stop both HAWQ masters (source and target)
Archive source MASTER_DATA_DIRECTORY (MDD) tarball to HDFS
Restart source HAWQ master
Enable HDFS safe mode and force source checkpoint
Disable source and remote HDFS safe mode
Execute Apache Falcon-based distcp sync process
Enable HDFS safe mode and force remote checkpoint
There is also a JIRA with the design description:
https://issues.apache.org/jira/browse/HAWQ-1078
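For reference, the distcp step on its own boils down to something like the following (host names and paths are placeholders; the catalog archiving, safe-mode handling, and orchestration around it are what hawqsync-falcon adds on top):

    # One-way sync of the HAWQ data directory from the production HDFS to the QA HDFS
    hadoop distcp -update -delete \
      hdfs://prod-nn:8020/hawq_data \
      hdfs://qa-nn:8020/hawq_data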
There isn't a built-in tool to do this, so you'll have to write some code. It shouldn't be too difficult to write either, because HAWQ doesn't support UPDATE or DELETE; you'll only have to append new data to QA.
Create a writable external table in Production for each table, which puts the data in HDFS. You'll use PXF to write the data.
Create a readable external table in QA for each table, which reads this data.
Day 1, you write everything to HDFS and then read everything from HDFS.
Day 2+, you find the max(id) in QA, remove the table's files from HDFS, insert into the writable external table with the query filtered so you only get records with an id larger than QA's max(id), and lastly execute an insert in QA selecting all data from the external table (see the sketch below).
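A rough sketch of those steps, assuming the PXF HdfsTextSimple profile and placeholder column, host, path, and database names (check your HAWQ/PXF version for the exact LOCATION syntax):

    # Production: writable external table that lands data in HDFS via PXF
    psql -d proddb -c "CREATE WRITABLE EXTERNAL TABLE orders_out (id BIGINT, amount NUMERIC, created DATE) LOCATION ('pxf://namenode:51200/sync/orders?PROFILE=HdfsTextSimple') FORMAT 'TEXT' (DELIMITER ',');"
    # Day 2+: only export rows newer than the max(id) already in QA (1000 here is just an example)
    psql -d proddb -c "INSERT INTO orders_out SELECT * FROM orders WHERE id > 1000;"

    # QA: readable external table over the same HDFS files, then append into the real table
    psql -d qadb -c "CREATE EXTERNAL TABLE orders_in (id BIGINT, amount NUMERIC, created DATE) LOCATION ('pxf://namenode:51200/sync/orders?PROFILE=HdfsTextSimple') FORMAT 'TEXT' (DELIMITER ',');"
    psql -d qadb -c "INSERT INTO orders SELECT * FROM orders_in;"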

How to expose Hadoop job and workflow metadata using Hive

What I would like to do is make workflow and job metadata, such as start date, end date, and status, available in a Hive table to be consumed by a BI tool for visualization purposes. I would like to be able to monitor, for example, whether a certain workflow fails at certain hours, its success rate, and so on.
For this purpose I need access to the same data Hue is able to show in the job browser and the Oozie dashboard. What I am looking for, specifically for workflows, is for example the name, submitter, status, and start and end time. The reason I want this is that, in my opinion, this tool lacks a general overview and good search.
The idea is that once I locate this data I will load it into Hive, either directly or through some processing steps.
Questions that I would like to see answered:
Is this data stored in HDFS, or is it scattered across the local disks of the nodes?
If it is stored in HDFS, where can I find it? If it is stored locally on the nodes, how does Hue find and show it?
Assuming I can access the data, in what format should I expect it? Is it stored in general log files, or can I expect somewhat structured data?
I am using CDH 5.8
If jobs are submitted in ways other than through Oozie, my approach won't be helpful.
We collected all the logs from the Oozie server through the Oozie Java API and iterated over the coordinator information to get the required info.
You need to think about what kind of information you want to retrieve.
If all your jobs are submitted through a bundle, then go from the bundle to the coordinator and then to the workflow to find the info.
If you want to get all the coordinator info, simply call the API with the number of coordinators to bring back and fetch the required info.
We then loaded the fetched results into a Hive table, where one can filter results for failed or timed-out coordinators and various other parameters.
You can start by looking into the example given on the Oozie site:
https://oozie.apache.org/docs/3.2.0-incubating/DG_Examples.html#Java_API_Example
If you want to track the status of your jobs scheduled in Oozie, you should use the Oozie RESTful API or Java API. I haven't worked with the Hue interface for operating Oozie, but I guess it still uses the REST API behind the scenes. It provides you with all the necessary information, and you can create some service which consumes this data and pushes it into a Hive table.
Another option is to access the Oozie database. As you probably know, Oozie keeps all the data about scheduled jobs in an RDBMS such as MySQL or Postgres. You can consume this information through a JDBC connector. An interesting approach would be to try to link this information directly into Hive as a set of external tables through the JDBCStorageHandler. Not sure if it works, but it's worth a try.
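As a minimal sketch of the extraction side (host names, paths, and the job id are placeholders; the Oozie CLI shown here talks to the same web-services API mentioned above):

    # List recent coordinator runs (name, user, status, start/end times) from the Oozie server
    oozie jobs -oozie http://oozie-host:11000/oozie -jobtype coordinator -len 200 > /tmp/coord_jobs.txt

    # Drill into a single job for its full details
    oozie job -oozie http://oozie-host:11000/oozie -info 0000123-170101000000000-oozie-oozi-C

    # Stage the extract in HDFS so a Hive external table or a parsing job can pick it up
    hdfs dfs -mkdir -p /warehouse/oozie_metadata
    hdfs dfs -put -f /tmp/coord_jobs.txt /warehouse/oozie_metadata/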

How to push data from SQL to HDFS

I have the following use case:
We have several SQL databases in different locations and we need to load some data from them into HDFS.
The problem is that we do not have access to those servers from our Hadoop cluster (due to security concerns), but we can push data to our cluster.
Is there any tool like Apache Sqoop to do such bulk loading?
Dump the data from your SQL databases as files in some delimited format, for instance CSV, and then do a simple hadoop put command to copy all the files to HDFS (see the sketch below).
That's it.
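A minimal sketch of that approach, assuming a MySQL source and placeholder host, database, and path names (the dump runs on the database side, and the file is then pushed to a host that can reach the cluster):

    # On the database host (not reachable from the cluster): dump a table as tab-delimited text
    mysql -h localhost -u etl -p --batch -e "SELECT * FROM sales.orders" > orders.tsv

    # Push the file to an edge node that can reach the cluster, then load it into HDFS
    scp orders.tsv edge-node:/tmp/
    ssh edge-node 'hdfs dfs -mkdir -p /data/sales/orders && hdfs dfs -put -f /tmp/orders.tsv /data/sales/orders/'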
Let us assume I am working at a small company with a 30-node cluster that processes 100 GB of data daily. This data comes from different RDBMS sources such as Oracle, MySQL, IBM Netezza, DB2, and so on. We do not need to install Sqoop on all 30 nodes; the minimum number of nodes Sqoop has to be installed on is 1. After installing it on one machine, we can reach the source databases from there and import the data using Sqoop.
As far as security is concerned, no import can be done until the database administrator runs the following two commands:
mysql> GRANT ALL PRIVILEGES ON mydb.mytable TO ''@'<IP address of the Sqoop machine>';
mysql> GRANT ALL PRIVILEGES ON mydb.mytable TO '%'@'<IP address of the Sqoop machine>';
These two commands have to be run by the admin.
Then we can use our Sqoop import commands, for example the sketch below.
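A minimal Sqoop import sketch, assuming a MySQL source and placeholder host, database, table, and directory names:

    sqoop import \
      --connect jdbc:mysql://dbhost:3306/mydb \
      --username etl -P \
      --table mytable \
      --target-dir /data/mydb/mytable \
      -m 4
    # -P prompts for the database password; -m 4 runs four parallel map tasks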

Cloudera Impala INVALIDATE METADATA

As discussed in the Impala tutorials, Impala uses a metastore shared with Hive. But it is also mentioned that if you create or modify tables using Hive, you should execute the INVALIDATE METADATA or REFRESH command to inform Impala about the changes.
So I'm confused, and my question is: if the metadata database is shared, why is there a need to execute INVALIDATE METADATA or REFRESH in Impala?
And if it is because Impala caches the metadata, why don't the daemons update their cache themselves on a cache miss, without the need to refresh the metadata manually?
Any help is appreciated.
OK! Let's start with your question in the comment: what is the benefit of a centralized metastore?
Having a central metastore means the user does not have to maintain metadata in two different locations, one each for Hive and Impala. The user can have a central repository, and both tools can access this location for any metadata information.
Now, the second part: why is there a need to run INVALIDATE METADATA or REFRESH when the metastore is shared?
Impala uses a massively parallel processing (MPP) paradigm to get the work done. Instead of reading from the centralized metastore for each and every query, it keeps the metadata on the executor nodes so that it can completely bypass the cold starts where a significant amount of time would be spent reading metadata.
INVALIDATE METADATA/REFRESH propagates the metadata/block information to the executor nodes.
Why do it manually?
In earlier versions of Impala, the catalogd process was not present, and metadata updates had to be propagated via the aforementioned commands. Starting with Impala 1.2, catalogd was added; this process relays metadata changes made through Impala SQL statements to all the nodes in a cluster.
Hence removing the need to do it manually!
Hope that helps.
It is shared, but Impala caches the metadata and uses its statistics in its optimizer. If something is changed in Hive, you have to manually tell Impala to refresh its cache, which is kind of inconvenient.
But if you create or change tables in Impala, you don't have to do anything on the Hive side.
#masoumeh, when you modify a table via Impala SQL statements there is no need for INVALIDATE METADATA or REFRESH; this job is done by catalogd.
But when you add:
a NEW table through Hive (e.g. sqoop import .... --hive-import ...), then you have to run INVALIDATE METADATA tableName via impala-shell;
new data files into an existing table (appending data), then you only have to run REFRESH tableName, because the only thing you need is the metadata for the newly added files.
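A minimal illustration of the two cases (database, table, file, and path names are placeholders):

    # Case 1: a new table was created outside Impala (e.g. via a Hive import)
    impala-shell -q "INVALIDATE METADATA mydb.new_table"

    # Case 2: new data files were appended to an existing table's HDFS directory
    hdfs dfs -put part-00042.parquet /user/hive/warehouse/mydb.db/existing_table/
    impala-shell -q "REFRESH mydb.existing_table"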
