Hive incorrect header check - Hadoop

I want to query a .gz file that I imported into a Hive table, but when I run a query that requires a MapReduce job, for example:
select count(*) from test;
it fails with the error below:
java.io.IOException: incorrect header check
at org.apache.hadoop.io.compress.zlib.ZlibDecompressor.inflateBytesDirect(Native Method)
at org.apache.hadoop.io.compress.zlib.ZlibDecompressor.decompress(ZlibDecompressor.java:228)
at org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:111)
at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:105)
I checked and found that zlib is the default compression codec.
I tried a bzip2 (.bz2) file and it worked fine, but how can I use a .gz file?
How can I change the default codec to one that supports gzip files?

I had a similar problem. In my case the issue was that the files in the table's folder were of different formats: some were CSV and others were Parquet. Once I kept a single file format in the folder, the issue was resolved.
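A quick way to check for this kind of mix (a sketch only; the warehouse path and file name are assumptions, not taken from the thread) is to look at the first bytes of each file in the table's directory: a genuine gzip file starts with the bytes 1f 8b, a Parquet file starts with PAR1, and plain CSV starts with readable text.
# list the files backing the table
hdfs dfs -ls /user/hive/warehouse/test
# print the first four bytes of one file to see what format it really is
hdfs dfs -cat /user/hive/warehouse/test/part-00000 | head -c 4 | xxd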

I faced the same error: I could read the first few records, but counting the number of records failed with the same exception.
I solved the problem just by renaming my plain (uncompressed) file to .txt. Previously the file had a different extension; I renamed it to .txt (a rename command is sketched below). Also, if you uncompress the file, you can read the data from it.
And if you want to test it, run the count of records as explained above: it does a complete scan, which will tell you exactly whether the data was loaded correctly or not.
I posted this solution in one other place as well.
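If the file really is uncompressed but still carries a compression extension, renaming it in place is enough, because Hadoop picks the decompression codec from the file name suffix. A minimal sketch, assuming the data sits under the default Hive warehouse path (paths and file names here are hypothetical):
# rename the plain-text file so Hive stops trying to run it through the gzip codec
hdfs dfs -mv /user/hive/warehouse/test/data.gz /user/hive/warehouse/test/data.txt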

Related

How to ignore empty parquet files when reading using Hive

I am using Hive 3.1.0 and my query reads a bunch of Parquet files from a certain path every hour. I don't have control over how these files are generated, as they are created by an external process. In some rare cases a Parquet file with zero size ends up in the specified path. I would like Hive to ignore it, but my Hive queries fail with the following error:
<filename>.parquet is not a Parquet file (too small length: 0)
How do I avoid this? There could be too many files landing in an hour, so it would be overkill to build automation to detect and delete empty files. I believe there should be a simpler option in Hive to make it ignore such files.
Try to use the property $file_size: if it is more than 0, then process the data load. It would help if you could provide the query showing how you are trying to access the data.
I don't know how to do this with a Hive property. If anything, you might want to handle empty files in a staging directory before pushing them to final storage, using:
find ./your-directory -type f -empty -print -delete
or, if that is not possible, delete the empty files in your final storage.
List the files to be deleted first as a sanity check.
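If the Parquet files land in HDFS rather than on a local filesystem, the same idea can be expressed with hdfs dfs instead of find. A rough sketch (the landing directory is hypothetical; column 5 of the ls output is the file size):
# list zero-length files first, as a sanity check
hdfs dfs -ls /data/parquet/landing | awk 'NF > 7 && $5 == 0 {print $8}'
# once the list looks right, pipe the same thing into a delete
hdfs dfs -ls /data/parquet/landing | awk 'NF > 7 && $5 == 0 {print $8}' | xargs -r hdfs dfs -rm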

Ab-Initio dml-to-hive

I am using Ab Initio and attempting to write the results of the query in my Input Table component to HDFS, in Parquet format. I tried running dml-to-hive with the text option, but the following is the result and I am not sure what it means.
$ dml-to-hive text $AI_DML/myprojectdml.dml
Usage: dml-to-avro <record_format> <output_file>
or: dml-to-avro help
<record-format> is one of:
<filename> Read record format from file
-string <string> Read record format from string
<output_file> is one of:
<filename> Output Avro schema to file
- Output Avro schema to standard output
I also tried using the Write Hive Table component but I receive the following error:
[B276]
The internal charset "XXcharset_NONE" was encountered when a valid character set data
structure was expected. One possible cause of this error is that you specified a
character set to the Co>Operating System that is misspelled or otherwise incorrect.
If you cannot resolve the error please contact Customer Support.
Any help would be great; I am trying to write my output to HDFS in Parquet format.
Thanks,
Chris Richardson
I know this is a late reply, but if you're still working on this or somebody else stumbles onto this like I did, I think I've found a solution.
I used dml-to-hive to create a DML for parquet format and write it to a file.
dml-to-hive parquet current.dml > parquet.dml
Once this DML is created, you can use it on the in port of the "Write HDFS" component: double-click the component, go to the Port tab, select the "Use File" radio button, and point it at parquet.dml.
Then just set the WRITE_FORMAT choice to parquet and give it a whirl. I was able to create Parquet, ORC, and Avro files using this process.
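Putting that together with the DML from the question, the whole sequence would look roughly like this (a sketch only; the path comes from the question and the component settings from the answer above):
# generate a Parquet-oriented DML from the existing record format
dml-to-hive parquet $AI_DML/myprojectdml.dml > parquet.dml
# then, in the graph: on the Write HDFS component's Port tab choose "Use File",
# point it at parquet.dml, and set WRITE_FORMAT to parquet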

ORA-29285: file write error

I'm trying to extract data from an Oracle table using UTL_FILE, and I'm receiving the error ORA-29285: file write error. The weird part is that if I extract the data directly from the table I get the error, and if I extract it through a simple view the error is returned as well, BUT if I extract it through a view with an ORDER BY the extraction succeeds. I can't figure out where the error comes from; I already checked the line lengths and found nothing. Any suggestion as to what it could be?
I extract a lot of other data through UTL_FILE without problems. This particular data was originally loaded into the Oracle table directly from a CSV file with ANSI encoding, but I have other data loaded the same way that exports correctly. I also checked the encoding to rule out possible mistakes and found nothing.
Many thanks,
Priscila Ferreira

When opening text file as table not entire file is saved

After opening a text file larger than 2.5 MB, DataGrip opens it in read-only mode. If I then edit the text file as a table and export the table with "dump data to file", it only writes the first 2.5 MB to the output file and the rest of the file is never processed.
How do I make DataGrip export the entire file, instead of just the first 2.5 MB?
I already tried increasing the file size limit in the config, but if I go past 100 MB it requires more than 8 GB of RAM to continue.
I am sorry, but there is no workaround for now. Can you please describe the whole task? Perhaps I can help you.

Kettle: load CSV file which contains multiple data tables

I'm trying to import data from a CSV file which, unfortunately, contains multiple data tables. Actually, it's not really a pure CSV file.
It contains a header section with some metadata, and then the actual CSV data parts are separated by:
//-------------
Table <table_nr>;;;;
An example file looks as follows:
Summary;;
Reporting Date;29/05/2013;12:36:18
Report Name;xyz
Reporting Period From;20/05/2013;00:00:00
Reporting Period To;26/05/2013;23:59:59
//-------------
Table 1;;;;
header1;header2;header3;header4;header5
string_aw;0;0;0;0
string_ax;1;1;1;0
string_ay;1;2;0;1
string_az;0;0;0;0
TOTAL;2;3;1;1
//-------------
Table 2;;;
header1;header2;header3;header4
string_bv;2;2;2
string_bw;3;2;3
string_bx;1;1;1
string_by;1;1;1
string_bz;0;0;0
What would be the best way to load such data using Kettle?
Is there a way to split this file into the header and CSV data parts and then process each of them as a separate input?
Thanks in advance for any hints and tips.
Best,
Haes.
I don't think there are any steps that will really help you with data in such a format. You probably need to do some preprocessing before bringing your data into a CSV step. You could still do this in your job, though, by calling out to the shell and executing a command there first, like maybe an awk script to split up the file into its component files and then load those files via the normal Kettle pattern.
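One way to do the splitting this answer describes, assuming the separator lines always start with //--- (file names here are hypothetical), is a small awk command run from a Shell job entry before the CSV input step:
# write everything before the first //--- separator to part_0.csv,
# the first table block to part_1.csv, the second to part_2.csv, and so on
awk 'BEGIN { n = 0 } /^\/\/-+/ { n++; next } { print > ("part_" n ".csv") }' report.csv
part_0.csv then holds the summary metadata, and each table block still begins with its "Table N" line and header row, which the CSV file input step can skip.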
