I am using AbInitio and trying to write the results of the query in my Input Table to HDFS in Parquet format. I tried running dml-to-hive text, but I got the following output and I am not sure what it means.
$ dml-to-hive text $AI_DML/myprojectdml.dml
Usage: dml-to-avro <record_format> <output_file>
or: dml-to-avro help
<record-format> is one of:
<filename> Read record format from file
-string <string> Read record format from string
<output_file> is one of:
<filename> Output Avro schema to file
- Output Avro schema to standard output
I also tried using the Write Hive Table component but I receive the following error:
[B276]
The internal charset "XXcharset_NONE" was encountered when a valid character set data
structure was expected. One possible cause of this error is that you specified a
character set to the Co>Operating System that is misspelled or otherwise incorrect.
If you cannot resolve the error please contact Customer Support.
Any help would be great. I am trying to write my output to HDFS in Parquet format.
Thanks,
Chris Richardson
I know this is a late reply, but if you're still working on this or somebody else stumbles onto this like I did, I think I've found a solution.
I used dml-to-hive to create a DML for parquet format and write it to a file.
dml-to-hive parquet current.dml > parquet.dml
Once this DML is created, you can use it on the in port of the "Write HDFS" component. Double-click the component, go to the Port tab, select the "Use File" radio button, and then point it to parquet.dml.
Then, just set the WRITE_FORMAT choice to parquet and give it a whirl. I was able to create parquet, orc, and avro files using the above process.
I'm trying to extract data from an Oracle table using utl_file, and I'm receiving the error ORA-29285: file write error. The weird thing is that extracting the data directly from the table returns the error, and extracting it through a simple view returns the error as well, BUT extracting it through a view with an ORDER BY succeeds. I can't work out where the error is. I have already checked the line lengths and found nothing. Any suggestions as to what the cause could be?
I extract a lot of other data through utl_file successfully. This particular data was originally loaded into the Oracle table directly from a CSV file with ANSI encoding. However, I have other data loaded the same way that I can export correctly. I also checked the encoding to rule out possible mistakes and found nothing.
Many thanks,
Priscila Ferreira
I want to query a .gz file that I imported into a Hive table, but when I run queries that require a MapReduce job, for example:
select count(*) from test;
it shows the following error:
java.io.IOException: incorrect header check
at org.apache.hadoop.io.compress.zlib.ZlibDecompressor.inflateBytesDirect(Native Method)
at org.apache.hadoop.io.compress.zlib.ZlibDecompressor.decompress(ZlibDecompressor.java:228)
at org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:111)
at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:105)
I checked and found that zlib is the default compression codec.
I tried with a bzip file and it was OK, but how can I use a .gz file?
How can I change the default codec to one that supports .gz files?
I had a similar problem. In my case the issue was that the files in the folder were in different formats: a few were CSV and others were Parquet. Once I kept a single file format, the issue was resolved.
I faced the same error. I could read the first few records, but counting the number of records failed with the same error.
I solved the problem just by renaming my plain (uncompressed) file to .txt. Previously my file name was ; I renamed it to .txt. Also, if you uncompress any file to test, you can read data from it.
If you want to test, run a count of the number of records as explained above. It does a complete scan, which will tell you exactly whether the data is loaded correctly or not.
I posted this solution at one other place
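The renaming fix above is consistent with a file whose name ends in .gz but whose contents are not gzip data, since Hadoop chooses the decompression codec from the file extension. A minimal Python sketch for checking this locally before loading, assuming the file is on the local file system; the helper names and the demo file names are hypothetical. It relies only on the fact that every gzip stream starts with the two bytes 0x1f 0x8b.

```python
import gzip
import os
import tempfile

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream (RFC 1952)


def looks_like_gzip(path):
    """Return True if the file at `path` starts with the gzip magic bytes."""
    with open(path, "rb") as handle:
        return handle.read(2) == GZIP_MAGIC


def demo():
    """Create one real gzip file and one plain file misnamed .gz, then check both."""
    workdir = tempfile.mkdtemp()
    real = os.path.join(workdir, "real.gz")        # hypothetical file name
    fake = os.path.join(workdir, "plain_text.gz")  # hypothetical file name
    with gzip.open(real, "wb") as handle:
        handle.write(b"1,alice\n2,bob\n")
    with open(fake, "wb") as handle:
        handle.write(b"1,alice\n2,bob\n")
    return looks_like_gzip(real), looks_like_gzip(fake)
```

A file for which `looks_like_gzip` returns False should either be compressed properly or renamed so that its extension no longer claims gzip.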
Is it possible to get the filename of a record in Hive? That would be incredibly helpful for debugging.
In my particular case, I have incorrect values in a table that is mapped to a folder with more than 100 large files. Using grep is very inefficient.
Hive supports virtual columns, for example INPUT__FILE__NAME, which gives the name of the input file for a mapper task.
Have a look at the documentation here. It provides some examples of how to do this.
Unfortunately, I'm unable to test this right now. Let me know whether it works or not.
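As an illustration of the virtual column mentioned above, here is a small Python sketch that only builds the HiveQL text. The table name, column names, and filter condition are hypothetical, and the resulting string would still have to be run through whatever Hive client you use.

```python
def build_filename_query(table, columns, condition):
    """Return a HiveQL string that lists matching rows with their source file."""
    # INPUT__FILE__NAME is Hive's virtual column holding the input file's name.
    select_list = ", ".join(["INPUT__FILE__NAME"] + list(columns))
    return "SELECT {} FROM {} WHERE {}".format(select_list, table, condition)


# Hypothetical table and column names, for illustration only.
QUERY = build_filename_query("my_table", ["id", "amount"], "amount < 0")
```

Filtering on the condition that identifies the bad values should narrow the search to the files that actually contain them, instead of grepping every file in the folder.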
How to work on a specific part of a CSV file uploaded into HDFS?
I'm new to Hadoop and I have a question: if I export a relational database to a CSV file and then upload it into HDFS, how can I work on a specific part (table) of the file using MapReduce?
Thanks in advance.
I assume that the RDBMS tables are exported to individual CSV files, one per table, and stored in HDFS. I presume that by 'specific part (table)' you mean the data in particular column(s) of the table(s). If so, place the individual CSV files in separate file paths, say /user/userName/dbName/tables/table1.csv
Now you can configure the job with the input path and the fields to read. You may consider using the default Input Format so that your mapper gets one line at a time as input. Based on the configuration/properties, you can read the specific fields and process the data.
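The line-at-a-time approach described above can be sketched in Python, for instance as the core of a Hadoop Streaming mapper rather than a Java mapper (that substitution is mine, not the answer's). The function name and the choice of field indexes are hypothetical.

```python
import csv


def map_lines(lines, field_indexes):
    """Yield tab-separated output lines containing only the requested CSV fields."""
    for row in csv.reader(lines):
        if not row:
            continue  # skip blank lines
        if max(field_indexes) >= len(row):
            continue  # skip rows that are too short to hold every requested field
        yield "\t".join(row[i] for i in field_indexes)


# Hypothetical sample: keep the first and third columns of each record.
SAMPLE = ["1,alice,30\n", "2,bob,25\n"]
SELECTED = list(map_lines(SAMPLE, [0, 2]))
```

In a streaming job the same function would be fed `sys.stdin` and its results printed one per line. Using `csv.reader` instead of a plain `split(",")` keeps quoted fields that contain commas intact.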
Cascading allows you to get started very quickly with MapReduce. It has a framework that lets you set up Taps to access sources (your CSV file) and process them inside a pipeline, for example to add column A to column B and place the sum into column C by selecting them as Fields.
Use BigTable, meaning convert your database into one big table.
I'm attempting to copy a table to a file using hadoop fs -copyToLocal. The command works swimmingly, minus the fact that all my fields are merged together. Is there a way to specify a delimiter?
I have seen the exact same issue, where copying Hive tables to the local file system runs all the fields together in one giant line and the '\n' character is not honored at the end of each row in the table.
Your best option is to use a custom SerDe (Serializer and Deserializer) to export Hive to CSV as described here. You can get the source code from GitHub as well.
Are you copying from the Hive table? And are you copying directly from the warehouse directory? Please provide the full command that you are using.
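If the files do come straight from the warehouse directory of a plain-text Hive table created with default delimiters, the fields are separated by the non-printing Ctrl-A character (\x01), which is why they appear merged. A minimal Python sketch for post-processing such a copy into CSV, under that assumption; the function name is hypothetical, and this is an alternative to the custom SerDe suggested above, not a description of it.

```python
import csv
import io

HIVE_DEFAULT_DELIMITER = "\x01"  # Ctrl-A, Hive's default field separator for text tables


def hive_text_to_csv(lines, delimiter=HIVE_DEFAULT_DELIMITER):
    """Convert delimiter-separated Hive text rows into a single CSV string."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for line in lines:
        # Drop the trailing newline, split on the Hive delimiter, and let the
        # csv module quote any field that itself contains a comma.
        writer.writerow(line.rstrip("\n").split(delimiter))
    return buffer.getvalue()
```

This only applies to text-format tables. Tables stored in other formats, or created with a different field delimiter, would need different handling.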