What is the best way to store Blob data type in a Hive table, as a string or Binary? - hadoop

What is the best way to store the BLOB data type in a Hive table: as a string or as binary?
We have archived an RDBMS table into Hive using Sqoop. The table has a column of type BLOB, so in Hive we stored it as BINARY. However, we are not able to read the binary content back into a PDF or any other document. Is there any way to read that Hive binary data as a document?
Is storing BLOB data as Hive BINARY the recommended approach, or are there other ways?
Is there any big data component, such as HBase or Cassandra, that supports BLOB types?

It is better to use the Hive BINARY type to store BLOB data in Hive. You can follow the link below: Import blob from oracle to HIVE
You can also use Cassandra or another parallel NoSQL store for the BLOB data. Again, whether to choose Hive or a NoSQL database depends on your use case.
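As an illustration of that recommendation, here is a minimal sketch of a Hive table with a BINARY column, issued through Spark SQL with Hive support enabled (the table and column names are hypothetical; the same DDL works in the Hive CLI):
spark.sql("""
  CREATE TABLE IF NOT EXISTS archived_docs (
    doc_id  STRING,   -- identifier carried over from the RDBMS
    content BINARY    -- the BLOB payload, stored as Hive BINARY
  )
  STORED AS PARQUET
""")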

Related

Schema on read in hive for tsv format file

I am new to Hadoop. I have data in TSV format with 50 columns and I need to store it in Hive. How can I create the table and load the data on the fly, using schema on read, without manually writing a CREATE TABLE statement?
Hive requires you to run a CREATE TABLE statement because the Hive metastore must be updated with a description of the data location you are going to query later on.
Schema-on-read doesn't mean that you can query every possible file without knowing metadata beforehand, such as the storage location and storage format.
SparkSQL or Apache Drill, on the other hand, will let you infer the schema from a file, but you must again define the column types for a TSV if you don't want everything to be a string column (or coerced to unexpected types). Both of these tools can interact with a Hive metastore for "decoupled" storage of schema information.
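For reference, a minimal sketch of such a CREATE TABLE for a tab-delimited file, run here via Spark SQL with Hive support (table name, columns, and location are hypothetical; in practice all 50 columns would be declared):
spark.sql("""
  CREATE EXTERNAL TABLE IF NOT EXISTS my_tsv_table (
    col1 STRING,
    col2 INT
    -- ... remaining columns declared here
  )
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
  STORED AS TEXTFILE
  LOCATION '/data/tsv'
""")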
You can use Hue:
http://gethue.com/hadoop-tutorial-create-hive-tables-with-headers-and/
Or, with Spark, you can infer the schema of the file and save it as a Hive table:
// Read the tab-delimited file, treating the first row as a header
// and letting Spark infer the column types
val df = spark.read
  .option("delimiter", "\t")
  .option("header", "true")
  .option("inferSchema", "true") // <-- HERE
  .csv("/home/cloudera/Book1.csv")

Can Hive deal with binary data?

Can Hive deal with unstructured data?
We have an image file in an Oracle database, and we have to run Sqoop to load that image from Oracle into another database and also export it into a Hive table.
Could you please help with how to handle that image file in Hive?
Your Oracle data is probably stored as BLOB.
In Hive it should be stored as BINARY.
Here is a Hortonworks article demonstrating a Sqoop import of an Oracle BLOB into Hive:
https://community.hortonworks.com/content/supportkb/49145/how-to-sqoop-import-oracle-blobclob-data-into-hive.html
Here is an example of processing the BINARY type using a Hive UDF:
https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/udf/UDFBase64.java
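For a quick sanity check of binary data, the built-in base64() function (backed by the UDF linked above) can also be called from Spark SQL or the Hive CLI; a small sketch, assuming a hypothetical table archived_docs with a BINARY column named content:
// Base64-encode the binary column so it can be inspected as text
spark.sql("SELECT base64(content) AS content_b64 FROM archived_docs LIMIT 1").show(false)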

Where does Hive data get stored?

I am a little confused about where Hive stores its data.
Does it store its data in HDFS or in an RDBMS?
Does the Hive metastore use an RDBMS to store the Hive tables' metadata?
Thanks in advance!
Hive data is stored in a Hadoop-compatible filesystem: S3, HDFS, or another compatible filesystem.
Hive metadata is stored in an RDBMS such as MySQL; see the list of supported RDBMSs.
The location of Hive table data in S3 or HDFS can be specified for both managed and external tables.
The difference between managed and external tables is that for a managed table, DROP TABLE drops the table and deletes its data, whereas for an external table, DROP TABLE drops only the table definition; the data remains as-is and can be used to create other tables over it.
See details here: Create/Drop/Truncate Table
Here is the answer to your question, but I would suggest reading Hive books or the Apache Hive site for a better understanding.
Does it store its data in HDFS or in an RDBMS? - The data for Hive is always stored in HDFS. For managed tables the data is stored by default in the Hive warehouse, which is a directory in HDFS. For a Hive external table the user can specify a location anywhere in HDFS.
Does the Hive metastore use an RDBMS to store the Hive tables' metadata? - Yes, Hive uses an RDBMS to store the metadata.
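To see exactly where a particular table's data lives, DESCRIBE FORMATTED prints the backing directory; a small sketch via Spark SQL, using a hypothetical table name (the same statement works in the Hive CLI):
// The "Location" row shows the HDFS/S3 directory backing the table
spark.sql("DESCRIBE FORMATTED book1").show(100, false)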

Reading BLOB data which is stored as Binary datatype in Hive

We have Oracle BLOB and VARBINARY (SQL Server/Progress) data in Hive, stored as the string or binary datatype. We brought the data over from the respective RDBMSs using Sqoop. Now that the data is in HDFS, we would like to see the actual attachments, such as PDFs, images, and DOC files. How can we deserialize the Hive binary-format data into the corresponding files?
In short, we need to convert the binary data in Hive into the corresponding attachments (PDF, JPG, DOC), assuming we know the file type.
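A minimal sketch of one possible approach via Spark, assuming a hypothetical table attachments with a file-name column and a BINARY content column, and that the selected rows fit in driver memory:
import java.nio.file.{Files, Paths}

// Write each row's binary payload back out as a local file
Files.createDirectories(Paths.get("/tmp/restored"))
spark.table("attachments")
  .select("file_name", "content")
  .collect()
  .foreach { row =>
    val name  = row.getAs[String]("file_name")
    val bytes = row.getAs[Array[Byte]]("content")   // Hive BINARY comes back as Array[Byte]
    Files.write(Paths.get(s"/tmp/restored/$name"), bytes)
  }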

create a Parquet backed Hive table by using a schema file

Cloudera documentation shows a simple way to "create an Avro-backed Hive table by using an Avro schema file." This works great. I would like to do the same thing for a Parquet-backed Hive table, but the relevant documentation in this case lists out every column type rather than reading from a schema. Is it possible to read the Parquet columns from a schema, in the same way as with Avro data?
Currently, the answer appears to be no. There is an open issue with Hive.
https://issues.apache.org/jira/browse/PARQUET-76
The issue has been active recently, so hopefully in the near future Hive will offer the same functionality for Parquet as it does for Avro.
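In the meantime, one workaround is to let Spark read the schema embedded in the Parquet files themselves and register the result in the Hive metastore; a sketch, assuming a hypothetical path and table name (note that saveAsTable writes a managed copy of the data):
// Parquet files carry their own schema, so no column list is needed
val parquetDf = spark.read.parquet("hdfs:///data/events")
parquetDf.write.mode("overwrite").saveAsTable("events")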
