Handling blob in hive - hadoop

I want to store and retrieve BLOBs in Hive. Is it possible to store a BLOB in Hive?
If it is not supported, what alternatives can I go with?
The BLOB may also reside inside a relational DB.
I did some research but could not find a relevant solution.

I think it is possible to store BLOBs in Hive. I was importing LOBs from an Oracle DB into Hive through Sqoop, and all I needed to do was cast the LOB into a string:
sqoop import --map-column-java $LOB=String
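For reference, a fuller command might look like the sketch below; the JDBC URL, table, and LOB column names here are placeholders, not values from my original job:
sqoop import \
  --connect jdbc:oracle:thin:@//dbhost:1521/ORCL \
  --username myuser -P \
  --table DOCUMENTS \
  --map-column-java DOC_BODY=String \
  --hive-import --hive-table documents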
You can find more info about LOBs in Sqoop here.
Hope it helps

Starting with Hive 0.8.0 you can use the BINARY data type in Hive. This is the ideal fit for a BLOB. I cannot find the maximum length of a BINARY, but I know it's 2 GB for STRING, so that is my best guess for BINARY too.
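As a minimal sketch (the table and column names are made up for illustration), a Hive table holding the blob bytes could be declared like this:
CREATE TABLE blob_store (
  id      BIGINT,
  payload BINARY
);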

Related

Query MinIO database without converting the files with Pandas

I would like to know if there is any option available to query a MinIO database that stores DeltaTables in Parquet format.
Currently I am using pyarrow with pandas, but it is really slow when the data becomes larger.
I saw that PySpark can be used to query the DeltaTables but I would like to know if there are any other options.
Thanks
It could depend on the scale of the data you are dealing with. For big enough data sets you could try using Presto for SQL-syntax queries over Parquet files from a MinIO source, using the Hive connector. Here is a how-to:
https://blog.min.io/interactive-sql-query-with-presto-on-minio-cloud-storage/
Also, when you hit a large dataset you can take advantage of the Hive partition folder naming convention (i.e. s3://bucketname/year=2019/) to reduce the size of the data set that needs to be queried; here are the docs regarding partitioning in the Hive connector.
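As an illustration, assuming a Hive-connector catalog named hive with a partitioned table events (both names are hypothetical), a query that filters on the partition column only scans the matching year=... prefixes in the bucket:
SELECT count(*)
FROM hive.analytics.events
WHERE year = 2019;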
Unrelated note: credit to this question for helping me remember the convention name.

Convert HBase Table Data to CSV

How do I convert HBase table data to a .CSV file? I'm trying to convert the table data to CSV format, but I couldn't find any code.
hbase001> list
Table
sample_data
Creating an external Hive table mapped onto the HBase table using HBaseStorageHandler can solve your problem. You can then use "select * from table_name" to get the data into a CSV table (stored as textfile, fields terminated by ','). Please refer to the link below for reference.
https://cwiki.apache.org/confluence/display/Hive/HBaseIntegration#HBaseIntegration-Usage
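A minimal sketch of that approach, assuming a single column family cf with one column col1 (the column names are placeholders; adjust the mapping to your actual schema):
CREATE EXTERNAL TABLE hbase_sample (rowkey STRING, col1 STRING)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf:col1")
TBLPROPERTIES ("hbase.table.name" = "sample_data");

CREATE TABLE sample_csv
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
AS SELECT * FROM hbase_sample;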
There are plenty of ways to solve your task. You can use Spark, regular MapReduce, or special tools like Sqoop. This task is rather trivial and you can implement it by yourself once you learn Hadoop. The quickest way for starters is probably Sqoop. Please get yourself familiar with this powerful tool and play with it.
Good luck!

Query Parquet data through Vertica (Vertica Hadoop Integration)

So I have a Hadoop cluster with three nodes, and Vertica is co-located on the cluster. There are Parquet files (partitioned by Hive) on HDFS. My goal is to query those files using Vertica.
Right now what I did is use the HDFS Connector: basically create an external table in Vertica, then link it to HDFS:
CREATE EXTERNAL TABLE tableName (columns)
AS COPY FROM "hdfs://hostname/...../data" PARQUET;
Since the data size is big, this method will not achieve good performance.
I have done some research: Vertica Hadoop Integration.
I have tried HCatalog, but there is some configuration error on my Hadoop, so that's not working.
My use case is to not change the data format on HDFS (Parquet) while querying it with Vertica. Any ideas on how to do that?
EDIT: The only reason Vertica got slow performance is that it can't use Parquet's partitions. With a higher Vertica version (8+), it can utilize Hive's metadata now, so no HCatalog is needed.
Terminology note: you're not using the HDFS Connector. Which is good, as it's deprecated as of 8.0.1. You're using the direct interface described in Reading Hadoop Native File Formats, with libhdfs++ (the hdfs scheme) rather than WebHDFS (the webhdfs scheme). That's all good so far. (You can also use the HCatalog Connector, but you need to do some additional configuration and it will not be faster than an external table.)
Your Hadoop cluster has only 3 nodes and Vertica is co-located on them, so you should be getting the benefits of node locality automatically -- Vertica will use the nodes that have the data locally when planning queries.
You can improve query performance by partitioning and sorting the data so Vertica can use predicate pushdown, and also by compressing the Parquet files. You said you don't want to change the data so maybe these suggestions don't work for you; they're not specific to Vertica so they might be worth considering anyway. (If you're using other tools to interact with your Parquet data, they'll benefit from these changes too.) The documentation of these techniques was improved in 8.0.x (link is to 8.1 but this was in 8.0.x too).
Additional partitioning support was added in 8.0.1. It looks like you're using at least 8.0; I can't tell if you're using 8.0.1. If you are, you can create the external table to only pay attention to the partitions you care about with something like:
CREATE EXTERNAL TABLE t (id int, name varchar(50),
created date, region varchar(50))
AS COPY FROM 'hdfs:///path/*/*/*'
PARQUET(hive_partition_cols='created,region');
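With a definition like that, a query filtering on the partition columns should only read the matching directories, e.g. (the filter values are purely illustrative):
SELECT count(*) FROM t
WHERE created = '2017-01-01' AND region = 'emea';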

Can we use Sqoop to move any structured data file apart from moving data from RDBMS?

This question was asked to me in a recent interview.
As per my knowledge, we can use Sqoop to transfer data between an RDBMS and the Hadoop ecosystem (HDFS, Hive, Pig, HBase).
Can someone please help me find the answer?
As per my understanding, Sqoop can't move any structured data file (like CSV) to HDFS or other Hadoop ecosystem components like Hive, HBase, etc.
Why would you use Sqoop for this?
You can simply put any data file directly into HDFS using its REST, web, or Java APIs.
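For example, landing a CSV in HDFS is a one-liner with the command-line client (the paths here are placeholders):
hdfs dfs -mkdir -p /user/hadoop/sales
hdfs dfs -put /local/path/sales.csv /user/hadoop/sales/
From there, Hive or Pig can read it directly, e.g. via an external table pointing at that directory.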
Sqoop is not meant for this type of use case.
The main purpose of Sqoop import is to fetch data from an RDBMS in parallel.
Apart from that, Sqoop has Sqoop Import Mainframe.
The import-mainframe tool imports all sequential datasets in a partitioned dataset (PDS) on a mainframe to HDFS. A PDS is akin to a directory on open systems. The records in a dataset can contain only character data. Records will be stored with the entire record as a single text field.
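For completeness, an import-mainframe invocation looks roughly like this (the host and dataset names are placeholders):
sqoop import-mainframe \
  --connect mainframe.example.com \
  --dataset MYUSER.SALES.DATA \
  --username myuser -P \
  --target-dir /data/mainframe/sales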

Are Binary and String the only datatypes supported in HBase?

While using a tool, I got confused about whether Binary and String are the only datatypes supported in HBase.
The tool explains the HBase storage type and mentions its possible values as Binary and String.
Can anyone let me know if this is correct?
In HBase everything is kept as byte arrays. You can check this link:
How to store primitive data types in hbase and retrieve
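If the tool in question is the Hive HBase storage handler (an assumption on my part), the Binary/String choice corresponds to the #b/#s suffixes in the column mapping, which control whether values are written as raw bytes or as their string representation, e.g.:
CREATE TABLE hbase_ints (key INT, value INT)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key#b,cf:val#b");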
