Are Binary and String the only datatypes supported in HBase?

While using a tool, I became confused about whether Binary and String are the only datatypes supported in HBase.
The tool describes an HBase storage type and lists its possible values as Binary and String.
Can anyone tell me whether this is correct?

In HBase, everything is stored as byte arrays. You can check this link:
How to store primitive data types in hbase and retrieve
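For illustration, HBase's Bytes utility class is the standard way to convert Java primitives to and from the byte arrays HBase actually stores. A minimal sketch (the row key and values here are made up):

import org.apache.hadoop.hbase.util.Bytes;

public class BytesDemo {
    public static void main(String[] args) {
        // Encode primitives into the byte arrays HBase stores.
        byte[] rowKey = Bytes.toBytes("user-42"); // String -> bytes
        byte[] age    = Bytes.toBytes(29);        // int -> 4 bytes
        byte[] score  = Bytes.toBytes(3.14d);     // double -> 8 bytes

        // Decode them back when reading cells.
        System.out.println(Bytes.toString(rowKey)); // user-42
        System.out.println(Bytes.toInt(age));       // 29
        System.out.println(Bytes.toDouble(score));  // 3.14
    }
}

So "Binary" and "String" in the tool are presumably just two ways of interpreting the same underlying bytes.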

Related

Do all three of Presto, Hive, and Impala support the Avro data format?

I am clear about the SerDe available in Hive to support Avro schemas, and I am comfortable using Avro with Hive
(AvroSerDe).
For Presto, though, I have found this issue:
https://github.com/prestodb/presto/issues/5009
I need to choose components for a fast execution cycle; Presto and Impala provide much shorter execution cycles.
So, can anyone clarify which would be better for different data formats?
Primarily, I am looking for Avro support with Presto right now.
However, let's consider the following data formats stored on HDFS:
Avro format
Parquet format
ORC format
Which is the best to use for high performance across these data formats?
Please suggest.
Impala can read Avro data but cannot write it. Please refer to this documentation page describing the file formats supported by Impala.
Hive supports both reading and writing Avro files.
Presto's Hive connector supports Avro as well. Thanks to David Phillips for pointing out this documentation page.
There are different benchmarks on the internet about performance, but I would rather not link to a specific one, as results depend heavily on the exact use case being benchmarked.
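As a quick illustration, Hive can create an Avro-backed table with STORED AS AVRO (available since Hive 0.14), and Presto's Hive connector can then query that same table. A minimal sketch over Hive JDBC; the host, port, and table name are placeholders:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class AvroTableDemo {
    public static void main(String[] args) throws Exception {
        // Requires the hive-jdbc driver on the classpath;
        // localhost:10000 is a placeholder HiveServer2 address.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:hive2://localhost:10000/default");
             Statement stmt = conn.createStatement()) {
            // STORED AS AVRO derives the Avro schema from the
            // column definitions (Hive 0.14 and later).
            stmt.execute("CREATE TABLE IF NOT EXISTS events_avro ("
                    + "id BIGINT, payload STRING) STORED AS AVRO");
        }
    }
}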

Apache Solr support for ORC file format

I have a bunch of tables in Hive, stored as ORC. I want to index their data in a SolrCloud collection.
Is there any support for indexing data stored in ORC format in Solr?
I've googled around but nothing came up.
It looks like you want Solr to read data from a specific Hive file format.
You might look at the problem the other way around, i.e. use Hive to write data to Solr, and thus let Hive take care of the complexity of the actual input file format (whether ORC, Parquet, Avro, whatever, even HBase data files).
In the LucidWorks GitHub repo you will find a project labeled hive-solr. Have a look.
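For reference, hive-solr exposes Solr as a Hive storage handler, so you can create an external table backed by a Solr collection and INSERT ... SELECT into it from the ORC table. A sketch over Hive JDBC; the storage handler class and table properties follow the project's README (and may differ between versions), and the URLs, paths, and table names are placeholders:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class OrcToSolrDemo {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:hive2://localhost:10000/default");
             Statement stmt = conn.createStatement()) {
            // External table backed by a Solr collection; the handler
            // class and properties are taken from the hive-solr project.
            stmt.execute("CREATE EXTERNAL TABLE IF NOT EXISTS solr_index "
                    + "(id STRING, body STRING) "
                    + "STORED BY 'com.lucidworks.hadoop.hive.LWStorageHandler' "
                    + "LOCATION '/tmp/solr' "
                    + "TBLPROPERTIES('solr.server.url' = 'http://localhost:8888/solr', "
                    + "'solr.collection' = 'collection1')");
            // Hive reads the ORC table and writes the rows to Solr.
            stmt.execute("INSERT OVERWRITE TABLE solr_index "
                    + "SELECT id, body FROM my_orc_table");
        }
    }
}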
I'll accept Samson's answer.
Still, I'm not fully satisfied with this solution. In fact, I still need to create an external table, manually declaring all the fields of the original table. In terms of operations, it is no different from creating a new table (stored as textfile) from the original one, indexing the new text files, and finally dropping them (of course, this may be a problem for very large tables, which is not my case).
Since ORC is a self-describing format, it would be great if Solr could read both field names and data directly from the compressed files.

Deserialize protobuf column with Hive

I am really new to Hive; I apologize if there are any misconceptions in my question.
I need to read a Hadoop SequenceFile into a Hive table. The sequence file contains Thrift binary data, which can be deserialized using the SerDe2 that ships with Hive.
The problem now is: one column in the file is encoded with Google protobuf, so when the Thrift SerDe processes the sequence file, it does not handle the protobuf-encoded column properly.
I wonder if there is a way in Hive to deal with this kind of protobuf-encoded column nested inside a Thrift sequence file, so that each column can be parsed properly?
Thank you so much for any possible help!
I believe you should use a different SerDe to deserialize the protobuf format.
Maybe you can refer to this:
https://github.com/twitter/elephant-bird/wiki/How-to-use-Elephant-Bird-with-Hive
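From that wiki, the rough shape of an Elephant Bird protobuf table definition looks like the following. The generated protobuf class, location, and table name are placeholders, and the exact SerDe and input format class names may vary with the Elephant Bird version. A sketch over Hive JDBC:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class ProtobufTableDemo {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:hive2://localhost:10000/default");
             Statement stmt = conn.createStatement()) {
            // The SerDe and input format follow the Elephant Bird wiki;
            // 'com.example.proto.UserProfile' stands in for your
            // generated protobuf class.
            stmt.execute("CREATE EXTERNAL TABLE users "
                    + "ROW FORMAT SERDE 'com.twitter.elephantbird.hive.serde.ProtobufDeserializer' "
                    + "WITH SERDEPROPERTIES ("
                    + "'serialization.class' = 'com.example.proto.UserProfile') "
                    + "STORED AS INPUTFORMAT "
                    + "'com.twitter.elephantbird.mapred.input.DeprecatedRawMultiInputFormat' "
                    + "OUTPUTFORMAT "
                    + "'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' "
                    + "LOCATION '/path/to/sequence/files'");
        }
    }
}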

Is there a maximum size for the string data type in Hive?

I've googled a ton but haven't found it anywhere. Or does that mean Hive can support an arbitrarily large string data type as long as the cluster allows it? If so, where can I find the largest string size my cluster can support?
Thanks in advance!
The current documentation for Hive lists STRING as a valid datatype, distinct from VARCHAR and CHAR. See the official Apache doc here: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types#LanguageManualTypes-Strings
It wasn't immediately apparent to me that STRING was indeed its own type, but if you scroll down you'll see several cases where it's used distinctly from the others.
While perhaps not authoritative, this page indicates the max length of a STRING is 2 GB: http://www.folkstalk.com/2011/11/data-types-in-hive.html
By default, the column metadata for Hive does not specify a maximum data length for STRING columns.
The driver has the parameter DefaultStringColumnLength; its default is 255, and its maximum value is 32767.
A connection string with this parameter set to the maximum size would look like this: jdbc:hive2://localhost:10000;DefaultStringColumnLength=32767;
(https://github.com/exasol/virtual-schemas/issues/118)
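A minimal sketch of opening a connection with that parameter and reading a long STRING column; the table and column names are made up:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class StringLengthDemo {
    public static void main(String[] args) throws Exception {
        // Raise the driver-side column length so long STRING values
        // are not truncated at the 255-character default.
        String url = "jdbc:hive2://localhost:10000;DefaultStringColumnLength=32767";
        try (Connection conn = DriverManager.getConnection(url);
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT body FROM documents")) {
            while (rs.next()) {
                System.out.println(rs.getString(1).length());
            }
        }
    }
}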
"In the “looser” world in which Hive lives, where it may not own the data files and has to be flexible on file format, Hive relies on the presence of delimiters to separate fields. Also, Hadoop and Hive emphasize optimizing disk reading and writing performance, where fixing the lengths of column values is relatively unimportant." from
https://learning.oreilly.com/library/view/programming-hive/9781449326944/ch03.html#Collection-Data-Types

Handling blob in Hive

I want to store and retrieve a blob in Hive. Is it possible to store a blob in Hive?
If it is not supported, what alternatives can I go with?
The blob may also reside inside a relational DB.
I did some research but haven't found a relevant solution.
I think it is possible to store a blob in Hive. I was importing LOBs from an Oracle DB into Hive through Sqoop, and all I needed to do was cast the LOB to a string:
sqoop import --map-column-java $LOB=String
You can find more info about LOBs in Sqoop here.
Hope it helps.
Starting with Hive 0.8.0, you can use the binary data type in Hive. This is the ideal fit for a blob. I cannot find the max length of a BINARY, but I know it's 2 GB for STRING, so that is my best guess for BINARY too.
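For what it's worth, a minimal sketch of declaring a BINARY column over Hive JDBC (the table and column names are made up):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class BinaryColumnDemo {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:hive2://localhost:10000/default");
             Statement stmt = conn.createStatement()) {
            // BINARY holds arbitrary byte sequences (Hive 0.8.0+),
            // the closest fit Hive has for blob-like data.
            stmt.execute("CREATE TABLE IF NOT EXISTS attachments ("
                    + "id BIGINT, payload BINARY)");
        }
    }
}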
