How to get metadata of the source DB using Sqoop - hadoop

Sqoop reads the metadata of the source DB before storing the data into HDFS/Hive.
Is there any method by which we can get this metadata information from Sqoop?

Answering my own question:
To get the metadata with Sqoop, we can use the Sqoop Java APIs, connect to the source database, and retrieve the following metadata:
Table name
DB name
Column details, etc.
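Under the hood Sqoop reads this metadata through the source database's JDBC driver, so a portable way to retrieve the same information yourself is plain JDBC. A minimal sketch, assuming a MySQL source (the connection URL, credentials, and database name are placeholders):

import java.sql.Connection;
import java.sql.DatabaseMetaData;
import java.sql.DriverManager;
import java.sql.ResultSet;

public class SourceDbMetadata {
    public static void main(String[] args) throws Exception {
        // Placeholder connection details -- replace with your source DB
        Connection conn = DriverManager.getConnection(
                "jdbc:mysql://dbhost:3306/sales", "user", "password");
        DatabaseMetaData meta = conn.getMetaData();

        // List the tables in the current catalog/database
        ResultSet tables = meta.getTables(conn.getCatalog(), null, "%", new String[]{"TABLE"});
        while (tables.next()) {
            String table = tables.getString("TABLE_NAME");
            System.out.println("Table: " + table);

            // Column name and SQL type for each table
            ResultSet cols = meta.getColumns(conn.getCatalog(), null, table, "%");
            while (cols.next()) {
                System.out.println("  " + cols.getString("COLUMN_NAME")
                        + " : " + cols.getString("TYPE_NAME"));
            }
            cols.close();
        }
        tables.close();
        conn.close();
    }
}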

Related

Sqoop mapping all datatypes as string

I'm importing a table from Oracle to an S3 directory using Amazon EMR. The files are being imported as Avro, and Sqoop exports the .avsc file with all columns as String.
Does anyone know how to make Sqoop map the correct datatypes?
Use --map-column-java to override the mapping to the appropriate Java data type. For Hive you can use --map-column-hive.
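For illustration, a hypothetical import command (the connection string, table, and column names are placeholders); the overrides are comma-separated COLUMN=Type pairs:

sqoop import \
  --connect jdbc:oracle:thin:@//dbhost:1521/ORCL \
  --username scott -P \
  --table ORDERS \
  --as-avrodatafile \
  --target-dir s3://my-bucket/orders \
  --map-column-java ORDER_DATE=String,AMOUNT=Double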

Schema on read in Hive for a TSV format file

I am new to Hadoop. I have data in TSV format with 50 columns, and I need to store the data into Hive. How can I create and load the data into a table on the fly, without manually creating the table with a CREATE TABLE statement, using schema on read?
Hive requires you to run a CREATE TABLE statement because the Hive metastore must be updated with a description of the data location you're going to be querying later on.
Schema-on-read doesn't mean that you can query every possible file without knowing metadata beforehand, such as the storage location and storage format.
SparkSQL or Apache Drill, on the other hand, will let you infer the schema from a file, but you must again define the column types for a TSV if you don't want everything to be a string column (or coerced to unexpected types). Both of these tools can interact with a Hive metastore for "decoupled" storage of schema information.
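For reference, a minimal external-table definition for a tab-delimited file (the table name, columns, and HDFS path here are hypothetical; with 50 columns you would list them all, or script the DDL generation):

CREATE EXTERNAL TABLE my_tsv_data (
  id INT,
  name STRING,
  amount DOUBLE
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/user/cloudera/tsv_data';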
You can use Hue:
http://gethue.com/hadoop-tutorial-create-hive-tables-with-headers-and/
Or with Spark you can infer the schema of the CSV/TSV file and save it as a Hive table:
val df = spark.read
  .option("delimiter", "\t")
  .option("header", true)
  .option("inferSchema", "true") // <-- HERE
  .csv("/home/cloudera/Book1.csv")
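Then, to persist the inferred schema as a Hive table (the table name is just an example, and this assumes the SparkSession was created with enableHiveSupport()):

df.write.mode("overwrite").saveAsTable("books")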

Can Hive deal with binary data?

Can Hive deal with unstructured data?
We have an image file in an Oracle database, and we need to Sqoop that image out of Oracle into another database, and also export it into a Hive table.
Could you please help me with how to handle that image file in Hive?
Your Oracle data is probably stored as BLOB.
In Hive it should be stored as BINARY.
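For example, a minimal Hive table with a BINARY column to hold the image bytes (the table and column names are placeholders):

CREATE TABLE images (
  id INT,
  image_data BINARY
)
STORED AS ORC;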
Here is a Hortonworks article demonstrating a Sqoop import of an Oracle BLOB into Hive:
https://community.hortonworks.com/content/supportkb/49145/how-to-sqoop-import-oracle-blobclob-data-into-hive.html
Here is an example of processing the BINARY type using a Hive UDF:
https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/udf/UDFBase64.java

How to encrypt data in HDFS, and then create a Hive or Impala table to query it?

Recently, I came across a situation:
There is a data file in a remote HDFS. We need to encrypt the data file and then create an Impala table to query the data in the local HDFS system. I don't know how Impala can query the encrypted data file, or how to solve this.
It can be done by creating a User-Defined Function (UDF) in Hive. You can create a UDF by implementing the Hive UDF interface. Then build a jar out of your UDF class and put it in the Hive lib directory.
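A minimal sketch of such a UDF, assuming the actual decryption routine (MyCrypto.decrypt here is a placeholder) is supplied by you elsewhere:

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

// Example only: old-style Hive UDF that decrypts a string column value
public class DecryptUDF extends UDF {
    public Text evaluate(Text encrypted) {
        if (encrypted == null) {
            return null;
        }
        // Placeholder: call your real decryption routine here
        String plain = MyCrypto.decrypt(encrypted.toString());
        return new Text(plain);
    }
}

After building the jar, you would register and call it from Hive along these lines (paths and names are examples):
ADD JAR /path/to/decrypt-udf.jar;
CREATE TEMPORARY FUNCTION decrypt AS 'DecryptUDF';
SELECT decrypt(encrypted_col) FROM encrypted_table;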

Access Hive Data from Java

I need to access the data in Hive from Java. According to the documentation for the Hive JDBC driver, the current JDBC driver can only be used to query data from the default database of Hive.
Is there a way to access data from a Hive database other than the default one, through Java?
For example, suppose you have a Hive table:
create table visit (
id int,
url string,
ref string
)
partitioned by (date string)
Then you can use the statement
INSERT OVERWRITE DIRECTORY '/tmp/hdfs_out' SELECT * FROM visit WHERE date='2013-05-15';
to write the data to HDFS and then write a MapReduce job to handle it. Or you can use the statement
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/hdfs_out' SELECT * FROM visit WHERE date='2013-05-15';
to write the data to the local file system and write a normal Java program to handle it.
The JDBC documentation can be found in the Hive Confluence documentation. To use the JDBC driver you need access through a Hive server (HiveServer2).
But there are other ways to access the data; it all depends on your setup. You could, for example, also use Spark, assuming the Hive and Hadoop configs are set appropriately.
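If you are going through HiveServer2, a minimal JDBC sketch that targets a non-default database (the host, port, database name, and credentials are placeholders; alternatively you can fully qualify tables as db.table):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        // The database "weblogs" is part of the URL, so it need not be the default DB
        Connection conn = DriverManager.getConnection(
                "jdbc:hive2://hiveserver-host:10000/weblogs", "user", "");
        Statement stmt = conn.createStatement();
        ResultSet rs = stmt.executeQuery("SELECT id, url FROM visit LIMIT 10");
        while (rs.next()) {
            System.out.println(rs.getInt(1) + "\t" + rs.getString(2));
        }
        rs.close();
        stmt.close();
        conn.close();
    }
}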
