Is there a maximum size for the string data type in Hive?

I've Googled a ton but haven't found it anywhere. Or does that mean Hive can support an arbitrarily large string data type, as long as the cluster allows it? If so, where can I find the largest string size that my cluster can support?
Thanks in advance!

The current Hive documentation lists STRING as a valid data type, distinct from VARCHAR and CHAR. See the official Apache doc here: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types#LanguageManualTypes-Strings
It wasn't immediately apparent to me that STRING was indeed its own type, but if you scroll down you'll see several cases where it's used distinctly from the others.
While perhaps not authoritative, this page indicates that the maximum length of a STRING is 2 GB. http://www.folkstalk.com/2011/11/data-types-in-hive.html
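To make the distinction concrete, here is a minimal sketch using PyHive (an assumption on my part; any Hive client would do) that declares a STRING column with no length specifier, unlike VARCHAR(n) or CHAR(n). The host, port, table, and column names are hypothetical.

# Minimal sketch: connect to an assumed HiveServer2 on localhost:10000.
from pyhive import hive

conn = hive.Connection(host="localhost", port=10000)
cur = conn.cursor()

# STRING takes no length specifier, unlike VARCHAR(n) or CHAR(n).
cur.execute("CREATE TABLE IF NOT EXISTS demo_strings (id INT, payload STRING)")

# Inspect how long the stored values actually are.
cur.execute("SELECT MAX(LENGTH(payload)) FROM demo_strings")
print(cur.fetchone())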

By default, the column metadata for Hive does not specify a maximum data length for STRING columns.
The driver has a DefaultStringColumnLength parameter; its default is 255 and its maximum value is 32767.
A connection string with this parameter set to its maximum would look like this: jdbc:hive2://localhost:10000;DefaultStringColumnLength=32767;
(https://github.com/exasol/virtual-schemas/issues/118)
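As a hedged illustration of using that connection string from code, here is a sketch with JayDeBeApi; the driver class name, jar path, and credentials are assumptions (DefaultStringColumnLength belongs to the Simba-based Hive JDBC drivers, so not every driver will recognize it).

# Sketch only: the driver class, jar path, and credentials below are placeholders.
import jaydebeapi

conn = jaydebeapi.connect(
    "com.cloudera.hive.jdbc.HS2Driver",                               # assumed driver class
    "jdbc:hive2://localhost:10000;DefaultStringColumnLength=32767",   # parameter at its maximum
    ["hive_user", "hive_password"],                                   # hypothetical credentials
    "/path/to/HiveJDBC-driver.jar",                                   # hypothetical jar location
)
cur = conn.cursor()
cur.execute("SELECT payload FROM demo_strings LIMIT 1")
print(cur.fetchall())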
"In the “looser” world in which Hive lives, where it may not own the data files and has to be flexible on file format, Hive relies on the presence of delimiters to separate fields. Also, Hadoop and Hive emphasize optimizing disk reading and writing performance, where fixing the lengths of column values is relatively unimportant." from
https://learning.oreilly.com/library/view/programming-hive/9781449326944/ch03.html#Collection-Data-Types

Related

Using Parquet metadata to find a specific key

I have a bunch of Parquet files containing data where each row has the form [key, data1, data2, data3,...]. I need to know in which file a certain key is located, without actually opening each file and searching. Is it possible to get this from the Parquet metadata?
The keys are formatted as strings.
I already tried accessing the metadata using PyArrow, but didn't get the data I wanted.
The short answer is no.
Longer answer: Parquet has two types of metadata that help with eliminating data: min/max statistics and, optionally, BloomFilters. With these two you can definitively determine that a file does not contain your key, but you can't determine with 100% certainty that it does (unless your key happens to be a min/max value). PyArrow currently only really exposes row group statistics and doesn't support BloomFilter reading/writing at all.
Also, if the key is of low enough cardinality, then dictionary encoding might be used to encode the column. If all data in a column is dictionary encoded, then it might be possible through some lower-level APIs (likely not pyarrow) to retrieve the dictionaries and scan them instead of the entire file.
If you are in control of the writing process, then sorting data by key and/or limiting the number of keys per file would help make these methods even more efficient.
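Here is a minimal sketch of the pruning you can do with the row group statistics mentioned above, using pyarrow. The file names and the column name "key" are assumptions, and note that a file can only ever be ruled out, never confirmed.

import pyarrow.parquet as pq

def may_contain_key(path, key):
    # Returns False only when every row group's min/max range excludes the key.
    md = pq.ParquetFile(path).metadata
    idx = md.schema.to_arrow_schema().get_field_index("key")   # assumes a flat schema
    for rg in range(md.num_row_groups):
        stats = md.row_group(rg).column(idx).statistics
        if stats is None or not stats.has_min_max:
            return True          # no statistics: cannot rule this file out
        if stats.min <= key <= stats.max:
            return True          # key falls inside the range: file *might* contain it
    return False

candidates = [p for p in ["part-0.parquet", "part-1.parquet"]   # hypothetical file names
              if may_contain_key(p, "some-key")]

Sorting by key at write time, as suggested above, keeps these ranges narrow and non-overlapping, so far more files can be skipped.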

How to store streaming data in cassandra

I am new to Cassandra and I am very confused. I know that Cassandra's write speed is very fast. I want to store Twitter data coming from Storm. When I Googled this, everything I found said to build SSTables and bulk-load them into the cluster. If I have to build an SSTable every time, how is it possible to store streaming Twitter data in Cassandra?
Please help me.
How can I store log data, which is generated at 1000 logs per second?
Please correct me if I am wrong.
I think a single Cassandra node can handle 1000 logs per second without bulk loading if your schema is good. It also depends on the size of each log.
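For example, a minimal sketch with the DataStax Python driver, writing each streamed message with a regular prepared INSERT and no SSTables involved; the keyspace, table, and column names are made up for illustration.

from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("logs_ks")          # hypothetical keyspace

# Prepare once, then execute per incoming message.
insert = session.prepare(
    "INSERT INTO tweets (user_id, tweet_time, body) VALUES (?, ?, ?)"
)

def handle_tweet(user_id, tweet_time, body):
    # Called for every message arriving from Storm (or any other stream source).
    session.execute(insert, (user_id, tweet_time, body))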
Alternatively, you could use Cassandra's COPY FROM command to load CSV data.
For this you need to create a table first.
Here's an example from the DataStax website:
CREATE TABLE airplanes (
name text PRIMARY KEY,
manufacturer text,
year int,
mach float
);
COPY airplanes (name, manufacturer, year, mach) FROM 'temp.csv';
You need to specify the column names in the order in which they appear in the CSV. For values that contain a comma (,), you can enclose them in double quotes (") or use a different delimiter.
For more details, refer to http://www.datastax.com/documentation/cql/3.0/cql/cql_reference/copy_r.html

Are Binary and String the only data types supported in HBase?

While using a tool, I became confused about whether Binary and String are the only data types supported in HBase.
The tool describes the HBase storage type and lists its possible values as Binary and String.
Can anyone let me know if this is correct?
In HBase, everything is kept as byte arrays. You can check this link:
How to store primitive data types in hbase and retrieve
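As a concrete illustration, here is a minimal sketch using happybase (an assumption on my part; it needs the HBase Thrift gateway running): primitives are packed to bytes on write and unpacked on read, because HBase itself only ever sees byte arrays. The table and column family names are hypothetical.

import struct
import happybase

connection = happybase.Connection("localhost")
table = connection.table("metrics")            # hypothetical table with family 'cf'

# Store an integer and a float as raw bytes.
table.put(b"row-1", {
    b"cf:count": struct.pack(">q", 42),        # 8-byte big-endian long
    b"cf:ratio": struct.pack(">d", 0.75),      # 8-byte IEEE double
})

# Read them back and decode on the client side.
row = table.row(b"row-1")
count = struct.unpack(">q", row[b"cf:count"])[0]
ratio = struct.unpack(">d", row[b"cf:ratio"])[0]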

How does HBase enable Random Access to HDFS?

Given that HBase is a database with its files stored in HDFS, how does it enable random access to a singular piece of data within HDFS? By which method is this accomplished?
From the Apache HBase Reference Guide:
HBase internally puts your data in indexed "StoreFiles" that exist on HDFS for high-speed lookups. See the Chapter 5, Data Model and the rest of this chapter for more information on how HBase achieves its goals.
Scanning both chapters didn't reveal a high-level answer for this question.
So how does HBase enable random access to files stored in HDFS?
HBase stores data in HFiles that are indexed (sorted) by their key. Given a random key, the client can determine which region server to ask for the row from. The region server can determine which region to retrieve the row from, and then do a binary search through the region to access the correct row. This is accomplished by having sufficient statistics to know the number of blocks, block size, start key, and end key.
For example: a table may contain 10 TB of data. But, the table is broken up into regions of size 4GB. Each region has a start/end key. The client can get the list of regions for a table and determine which region has the key it is looking for. Regions are broken up into blocks, so that the region server can do a binary search through its blocks. Blocks are essentially long lists of key, attribute, value, version. If you know what the starting key is for each block, you can determine one file to access, and what the byte-offset (block) is to start reading to see where you are in the binary search.
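The lookup described above boils down to binary searches over sorted start keys. Here is a purely conceptual sketch (not HBase client code; the keys below are made up):

import bisect

region_start_keys = ["", "g", "n", "t"]                 # hypothetical region boundaries
block_start_keys = {                                    # hypothetical per-region block index
    "": ["a", "c", "e"], "g": ["g", "j", "l"],
    "n": ["n", "p", "r"], "t": ["t", "w", "y"],
}

def locate(row_key):
    # Last region whose start key is <= row_key.
    r = bisect.bisect_right(region_start_keys, row_key) - 1
    region = region_start_keys[r]
    # Same idea inside that region's block index: find the block to read.
    b = bisect.bisect_right(block_start_keys[region], row_key) - 1
    return region, b

print(locate("query"))   # -> ('n', 1): ask the region starting at 'n', read its second block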
HBase accesses HDFS files by using HFiles. You can check this URL for the details: http://hbase.apache.org/book/hfilev2.html

Field larger than field limit with HCatalog

I'm working in standalone mode (our cluster is not configured yet). I'm trying to create a new table from a file with HCatalog, but I get the following error.
field larger than field limit (131072)
This value seems to be the value of io.file.buffer.size, which is configured to 131072. Am I right? But the description of that option is "Size of read/write buffer used in SequenceFiles", so I'm not sure at all. My file is a text file, so I'm not sure this is the right property to change.
Any idea?
I guess it's either because:
Your field delimiter set in the Hive CREATE statement is not the right one, so a field read into the buffer exceeded the maximum allowed length (see the sketch after this answer).
Your field delimiter is set right, but some field is really long, or is missing the right delimiter. If that's the case, you need to somehow pre-process the file to make sure it won't have such cases.
A similar question and answer here
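As an illustrative sketch of the first case: 131072 happens to be the default field size limit of Python's csv module, which produces this exact error text (whether the HCatalog front end goes through it is an assumption here), and a wrong delimiter makes a whole line parse as one oversized field.

import csv

print(csv.field_size_limit())                    # -> 131072 by default

# A ~200 KB line whose real delimiter is '|'.
line = "|".join(["x" * 1000] * 200) + "\n"
with open("data.txt", "w") as f:
    f.write(line)

# Parsing with the wrong delimiter (',') leaves the entire line as a single field.
try:
    with open("data.txt") as f:
        list(csv.reader(f, delimiter=","))
except csv.Error as e:
    print(e)                                     # field larger than field limit (131072)

# With the right delimiter each field is only 1000 characters, well under the limit.
with open("data.txt") as f:
    rows = list(csv.reader(f, delimiter="|"))
print(len(rows[0]), len(rows[0][0]))             # -> 200 1000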
