Field larger than field limit with HCatalog - Hadoop

I'm working in standalone mode (our cluster is not configured yet). I'm trying to create a new table from a file with HCatalog, but I get the following error.
field larger than field limit (131072)
This value seems to be the value of io.file.buffer.size, which is configured to 131072. Am I right? However, the description of that option is "Size of read/write buffer used in SequenceFiles", and my file is a plain text file, so I'm not sure this is the right property to change.
Any idea?

I guess it's either because:
The field delimiter set in your Hive CREATE statement is not the right one, so a field read into the buffer exceeded the maximum allowed length (see the sketch below).
Your field delimiter is set correctly, but some field is genuinely very long, or is missing its delimiter. If that's the case, you need to pre-process the file somehow to make sure such rows don't occur.
A similar question and answer here
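For the first case, a minimal sketch of a table definition with an explicit delimiter (the table, columns, and tab delimiter are illustrative assumptions, not from the question):
-- a sketch: adjust the delimiter to whatever actually separates fields in your file
CREATE TABLE my_table (
  id INT,
  name STRING,
  payload STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;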

Related

CSV Blob Sink - Skip Writing File when 0 Rows Present

This is a relatively simple problem with (I'm hoping) a similarly-simple solution.
In my ADF ETLs, any time there's a known and expected yet unrecoverable row-based error, I don't want my full ETL to fail. Instead, I'd rather pipe those rows off to a log, which I can then pick up at the end of the ETL for manual inspection. To do this, I use conditional splits.
Most of the time, there shouldn't be any rows like this. When this is the case, I don't want my blob sink to write a file. However, the current behavior writes a file no matter what -- it's just that the file only contains the table header.
Is there a way to skip writing anything to a blob sink when there are no input rows?
Edit: Somehow I forgot to specify -- I'm specifically referring to a Mapping Data Flow with a blob sink.
You can use a Lookup activity (with "First row only" unchecked) to get all your table data first. Then use an If Condition to check the count in the Lookup activity's output; if the count is greater than 0, execute the next activity (or Data Flow).
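For instance, assuming the Lookup activity is named Lookup1 (a made-up name) and "First row only" is unchecked so its output exposes a count property, the If Condition expression could be:
@greater(activity('Lookup1').output.count, 0)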

Checksum doesn't match: corrupted data.: while reading column `cid` at /opt/clickhouse//data/click

I am using ClickHouse to store data, and I'm getting the following error while querying the column cid from the click table.
Checksum doesn't match: corrupted data.
I don't have any replica for now; any suggestions for recovery?
The error comes down to the fact that the stored CityHash128 checksum does not match the checksum computed from the compressed data, which throws this exception in the readCompressedData function.
You can try to disable this check using the disable_checksum option via the disableChecksumming method.
It could work, but corruption most probably means something is wrong with your raw data, and the chances of recovery are small unless you have backups.
Usually, you will get the data part name and the column name in the exception message.
You can then locate that specific data part, remove the files belonging to that single column, and restart the server. You will lose the (already corrupted) data for one column in one data part (it will be filled with default values on read), but all other data will remain.
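As a rough sketch of that procedure, assuming a default MergeTree layout under /var/lib/clickhouse (the database and part names below are placeholders, not taken from the question):
# stop the server before touching files on disk
sudo service clickhouse-server stop
# the exception message names the data part and the column (here: cid)
cd /var/lib/clickhouse/data/<database>/click/<data_part_name>/
# remove only the files belonging to the corrupted column
rm cid.bin cid.mrk*
sudo service clickhouse-server start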

Is there maximum size of string data type in Hive?

I've Googled a ton but haven't found it anywhere. Or does that mean Hive can support arbitrarily large string data as long as the cluster allows it? If so, where can I find the largest string size my cluster can support?
Thanks in advance!
The current documentation for Hive lists STRING as a valid datatype, distinct from VARCHAR and CHAR. See the official Apache doc here: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types#LanguageManualTypes-Strings
It wasn't immediately apparent to me that STRING was indeed its own type, but if you scroll down you'll see several cases where it's used distinctly from the others.
While perhaps not authoritative, this page indicates the maximum length of a STRING is 2 GB: http://www.folkstalk.com/2011/11/data-types-in-hive.html
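As a quick illustration (a hypothetical table), STRING is declared without a length, unlike the bounded types:
-- hypothetical table: STRING takes no length, unlike VARCHAR and CHAR
CREATE TABLE string_demo (
  a STRING,
  b VARCHAR(100),
  c CHAR(10)
);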
By default, the column metadata for Hive does not specify a maximum data length for STRING columns.
The driver has the parameter DefaultStringColumnLength, whose default value is 255.
A connection string with this parameter set to maximum size would look like this: jdbc:hive2://localhost:10000;DefaultStringColumnLength=32767;
(https://github.com/exasol/virtual-schemas/issues/118)
"In the “looser” world in which Hive lives, where it may not own the data files and has to be flexible on file format, Hive relies on the presence of delimiters to separate fields. Also, Hadoop and Hive emphasize optimizing disk reading and writing performance, where fixing the lengths of column values is relatively unimportant." from
https://learning.oreilly.com/library/view/programming-hive/9781449326944/ch03.html#Collection-Data-Types

maxCombinedSplitSize property in hive?

There is a property in Pig named
'pig.maxCombinedSplitSize' – Specifies the size, in bytes, of data to be processed by a single map. Smaller files are combined until this size is reached.
Is there a similar property in hive for specifying the size of data to be processed by a single map?
I am trying the command below, but it doesn't work.
'SET hive.maxCombinedSplitSize=64mb';
Any suggestions?
Try this:
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
set mapred.min.split.size=67108864;
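For reference, 67108864 is simply 64 MB expressed in bytes (64 × 1024 × 1024); these split-size properties take a plain byte count, which is one reason SET hive.maxCombinedSplitSize=64mb does nothing (that property name does not appear to exist, and the 64mb suffix would not be parsed anyway). If you also want an upper bound on the data handled by a single map task, the usual companion setting (my addition, not part of the answer above) is:
set mapred.max.split.size=67108864;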

Deriving FileName from data in Apache Pig

I am working on a situation where I want to store the output of my Pig script into a file. That part is pretty straightforward, but I want the file name to be derived from the data itself. I have a timestamp field in the data, and I want to use, say, MAX(timestamp) as the file name to store all the data for that day.
I know the usage of
STORE data INTO '$outputDir' USING org.apache.pig.piggybank.storage.MultiStorage('$outputDir', '2', 'none', ',');
But this outputDir variable has to be passed in as a parameter, whereas I want to set it to a value derived from a field in the data.
Any pointers will be really helpful.
Thanks & Regards,
Atul Aggarwal
In MultiStorage you specify a root directory because an HDFS installation is typically shared by many users, so you do not want data written just anywhere. Hence you cannot change the root directory, but you can specify which field is used to generate directory names within that root (in your case, field 2). The Javadoc is helpful, but I am guessing you have seen that already?
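A minimal sketch of that approach (field and path names are made up for illustration): derive a day value from the timestamp in a FOREACH, then let MultiStorage create one sub-directory per day under a fixed root.
-- register piggybank for MultiStorage; the jar path may differ in your installation
REGISTER piggybank.jar;
raw = LOAD 'input_data' USING PigStorage(',') AS (id:chararray, value:chararray, ts:long);
-- turn the epoch-millisecond timestamp into a day string such as 2013-05-01
with_day = FOREACH raw GENERATE id, value, ToString(ToDate(ts), 'yyyy-MM-dd') AS day;
-- '2' is the 0-based index of the day field; each distinct day becomes a sub-directory of /data/out
STORE with_day INTO '/data/out' USING org.apache.pig.piggybank.storage.MultiStorage('/data/out', '2', 'none', ',');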
