Reading BLOB data which is stored as Binary datatype in Hive - hadoop

We have Oracle BLOB and VARBINARY (SQL Server/Progress) data in Hive, stored as the String or Binary datatype. We brought the data over from the respective RDBMS using Sqoop. Now that the data is in HDFS, we would like to see the actual attachments (PDF, images, DOC, etc.). How can we deserialize the Hive binary-format data into the corresponding files?
In short, we need to convert the binary data in Hive into the corresponding attachments (PDF, JPG, DOC), assuming we know the file type.
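A minimal sketch of one possible approach, assuming Spark with Hive support and a Hive table named archive_docs with columns doc_id (string) and content (binary); the table, column names, and output path are placeholders, not part of the original question:
import java.nio.file.{Files, Paths}
import org.apache.spark.sql.SparkSession

// Sketch: dump each binary cell of a Hive table to a local file.
val spark = SparkSession.builder()
  .appName("hive-binary-export")
  .enableHiveSupport()
  .getOrCreate()

val outDir = Paths.get("/tmp/attachments")
Files.createDirectories(outDir)

spark.sql("SELECT doc_id, content FROM archive_docs")
  .collect()                              // fine for a modest number of rows
  .foreach { row =>
    val id    = row.getString(0)
    val bytes = row.getAs[Array[Byte]](1) // Hive BINARY maps to Array[Byte]
    // The file type is assumed to be known (here: PDF); pick the extension accordingly.
    Files.write(outDir.resolve(s"$id.pdf"), bytes)
  }
For a large table you would write the files from the executors (for example with foreachPartition and the HDFS FileSystem API) rather than collecting everything to the driver.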

Related

Oracle CLOB data type to Redshift data type

We are in the process of migrating Oracle tables to Redshift. We found that a few tables have columns of the CLOB data type. In Redshift we converted CLOB to the VARCHAR(65535) type. While running the COPY command, we are getting:
The length of the data column investigation_process is longer than the length defined in the table. Table: 65000, Data: 90123.
Which data type should we use? Please share your suggestions.
Redshift isn't designed to store CLOB (or BLOB) data. Most databases that do support it store the CLOB separately from the table contents, so as not to burden every query with the excess data. A reference to the CLOB is stored in the row, and the reference is swapped for the CLOB contents when results are generated.
CLOBs should be stored in S3, with a reference to the appropriate CLOB (the S3 key) stored in the Redshift table. The issue is that, AFAIK, there isn't a prepackaged tool for doing this CLOB-for-reference replacement with Redshift. Your solution will need some retooling to perform the replacement for all data users. It's doable; it's just going to take a data layer that performs the needed replacement.
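To make the write path concrete, here is a minimal sketch in Scala with the AWS SDK for Java v1; the bucket name, key scheme, and table layout are assumptions for illustration only:
import com.amazonaws.services.s3.AmazonS3ClientBuilder

// Sketch of the "CLOB in S3, reference in Redshift" pattern.
val s3 = AmazonS3ClientBuilder.defaultClient()

def storeClob(recordId: String, clobText: String): String = {
  val key = s"clobs/investigation_process/$recordId.txt"
  s3.putObject("my-clob-bucket", key, clobText) // upload the CLOB text itself to S3
  key                                           // store this short key in the Redshift VARCHAR column
}

// The Redshift row then carries only the reference (the S3 key), and readers
// fetch the full text from S3 when they actually need it.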

What does "the container format for fields in a row" mean for a file format?

From Hadoop: The Definitive Guide:
There are two dimensions that govern table storage in Hive: the row format and the file format.
The row format dictates how rows, and the fields in a particular row, are stored. In Hive parlance, the row format is defined by a SerDe, a portmanteau word for a Serializer-Deserializer. When acting as a deserializer, which is the case when querying a table, a SerDe will deserialize a row of data from the bytes in the file to objects used internally by Hive to operate on that row of data. When used as a serializer, which is the case when performing an INSERT or CTAS (see “Importing Data” on page 500), the table’s SerDe will serialize Hive’s internal representation of a row of data into the bytes that are written to the output file.
The file format dictates the container format for fields in a row. The simplest format is a plain-text file, but there are row-oriented and column-oriented binary formats available, too.
What does "the container format for fields in a row" mean for a file format?
How is a file format different from a row format?
Read also the guide about SerDe:
Hive uses SerDe (and FileFormat) to read and write table rows.
HDFS files --> InputFileFormat --> <key, value> --> Deserializer --> Row object
Row object --> Serializer --> <key, value> --> OutputFileFormat --> HDFS files
You can create tables with a custom SerDe or using a native SerDe. A native SerDe is used if ROW FORMAT is not specified or ROW FORMAT DELIMITED is specified.
File format represents the file container; it can be text or a binary format like ORC or Parquet.
Row format can be simple delimited text, or something more complex such as regexp/template-based or, for example, JSON.
Consider JSON-formatted records in a text file:
ROW FORMAT SERDE
'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS TEXTFILE
Or JSON records in a sequence file:
ROW FORMAT SERDE
'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS SEQUENCEFILE
Everything is actually a Java class. What is very confusing for beginners is that shortcuts are possible in the DDL; they allow you to write DDL without specifying long and complex class names for every format. Some classes have no corresponding shortcuts embedded in the DDL language.
STORED AS SEQUENCEFILE is a shortcut for
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.SequenceFileInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.mapred.SequenceFileOutputFormat'
These two classes determine how to read/write the file container.
And this class determines how the row should be stored and read (JSON):
ROW FORMAT SERDE
'org.apache.hive.hcatalog.data.JsonSerDe'
And now DDL with row format and file format without shortcuts:
ROW FORMAT SERDE
'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.SequenceFileInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.mapred.SequenceFileOutputFormat'
For an even better understanding of the difference, look at the SequenceFileOutputFormat class (which extends FileOutputFormat) and JsonSerDe (which implements SerDe). You can dig deeper and try to understand the implemented methods and the base classes/interfaces; look at the source code, in particular the serialize and deserialize methods in the JsonSerDe class.
And "the container format for fields in a row" is the InputFormat plus the OutputFormat mentioned in the DDLs above. In the case of an ORC file, for example, you cannot specify a row format (delimited or another SerDe). The ORC file format dictates that only OrcSerde will be used for this type of file container, which has its own internal format for storing rows and columns. You can actually write ROW FORMAT DELIMITED STORED AS ORC in Hive, but the delimited row format will be ignored in that case.

What is the best way to store Blob data type in a Hive table, as a string or Binary?

What is the best way to store the BLOB data type in a Hive table: as String or Binary?
We have archived an RDBMS table into Hive using Sqoop. It has a column of type BLOB, so in Hive we kept it as Binary. But we are not able to read the binary content back into a PDF or any other document. Is there any possibility to read that Hive binary data as a document?
Is storing BLOB data as Hive Binary a recommended approach, or are there other ways?
Is there any big data component, like HBase or Cassandra, that supports BLOB types?
It is better to use Hive Binary to store BLOB data in Hive. You can follow the link below: Import blob from oracle to HIVE
You can also use Cassandra or another parallel NoSQL store for the BLOB data. Again, whether to choose Hive or a NoSQL database depends on your use case.
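If you go the Cassandra route, its blob type maps naturally to a byte buffer; here is a minimal sketch with the DataStax Java driver, where the contact point, keyspace, and table are assumptions:
import java.nio.ByteBuffer
import com.datastax.driver.core.Cluster

// Sketch: store a document's bytes in a Cassandra blob column.
// Assumes: CREATE TABLE archive.documents (doc_id text PRIMARY KEY, content blob);
val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
val session = cluster.connect("archive")
val insert  = session.prepare("INSERT INTO documents (doc_id, content) VALUES (?, ?)")

def storeDocument(docId: String, bytes: Array[Byte]): Unit =
  session.execute(insert.bind(docId, ByteBuffer.wrap(bytes)))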

Schema on read in Hive for a TSV format file

I am new to Hadoop. I have data in TSV format with 50 columns, and I need to store the data in Hive. How can I create the table and load the data into it on the fly, without manually creating the table with a CREATE TABLE statement, using schema on read?
Hive requires you to run a CREATE TABLE statement because the Hive metastore must be updated with the description of what data location you're going to be querying later on.
Schema-on-read doesn't mean that you can query every possible file without knowing metadata beforehand such as storage location and storage format.
SparkSQL or Apache Drill, on the other hand, will let you infer the schema from a file, but you must still define the column types for a TSV if you don't want everything to be a string column (or coerced to unexpected types). Both of these tools can interact with a Hive metastore for "decoupled" storage of schema information.
You can use Hue:
http://gethue.com/hadoop-tutorial-create-hive-tables-with-headers-and/
Or, with Spark, you can infer the schema of a CSV/TSV file and save it as a Hive table:
// Read a tab-delimited file, inferring column names from the header and types from the data
val df = spark.read
  .option("delimiter", "\t")
  .option("header", true)
  .option("inferSchema", "true") // <-- HERE
  .csv("/home/cloudera/Book1.csv")

// Persist it as a Hive table (the table name is just an example)
df.write.saveAsTable("book1")

Can Hive deal with binary data?

Can Hive deal with unstructured data?
We have image files in an Oracle database, and we have to run Sqoop to load those images from Oracle into another database, and to export them into a Hive table as well.
Could you please help me with how to handle those image files in Hive?
Your Oracle data is probably stored as BLOB.
In Hive it should be stored as BINARY.
Here is a Hortonworks article demonstrating a Sqoop import of an Oracle BLOB into Hive:
https://community.hortonworks.com/content/supportkb/49145/how-to-sqoop-import-oracle-blobclob-data-into-hive.html
Here is an example of processing the binary type using a Hive UDF:
https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/udf/UDFBase64.java
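The same base64 encoding that UDFBase64 implements is also exposed as the built-in base64()/unbase64() functions in Hive and Spark SQL, which is often the simplest way to move a BINARY column through text-based tools. A small sketch, assuming a SparkSession named spark with Hive support and hypothetical table and column names:
// Sketch: base64-encode a Hive BINARY column so it survives text-based export,
// then decode it back to the original bytes later. Table/column names are placeholders.
val encoded = spark.sql("SELECT image_id, base64(image_data) AS image_b64 FROM images")
encoded.show(5)

val decoded = spark.sql("SELECT image_id, unbase64(image_b64) AS image_data FROM images_export")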
