I am new to this ETL tool, but the flow is logical, so I'm following it fine so far. I have a raw data file with data in fixed-width columns. I entered all the details based on the file definition and everything seems to be working fine, except that only the first column is being read.
Here is what my positional file read job configuration looks like:
The final output (only the first column is being read):
Here is what the flat file looks like:
Strangely, I can see all the records being read fine.
The issue was with the datatypes. For some reason, Talend wasn't throwing a helpful error. I corrected the datatypes by matching them to the target schema and everything worked out fine.
I realized that all the float datatypes needed to be converted to BigDecimal; that mostly fixed it.
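Just to make the fix concrete, here is a minimal hand-written Java sketch of what a fixed-width read into BigDecimal columns boils down to; this is not Talend's generated code, and the column widths and names are made up for illustration:

import java.math.BigDecimal;

public class PositionalRowSketch {
    public static void main(String[] args) {
        // One 30-character record: three fixed-width columns of 10 characters each.
        String line = "ACC001    0000123.450000067.89";

        String account = line.substring(0, 10).trim();
        // Parsing into BigDecimal (instead of Float) keeps the exact decimal value
        // and matches a BigDecimal column in the target schema.
        BigDecimal amount  = new BigDecimal(line.substring(10, 20).trim());
        BigDecimal balance = new BigDecimal(line.substring(20, 30).trim());

        System.out.println(account + " | " + amount + " | " + balance);
    }
}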
I am using Ab Initio and attempting to have the results of the query in my Input Table populated into HDFS. I want the output in Parquet format. I tried using dml-to-hive with the text option, but the following is my result and I am not sure what it means.
$ dml-to-hive text $AI_DML/myprojectdml.dml
Usage: dml-to-avro <record_format> <output_file>
or: dml-to-avro help
<record-format> is one of:
<filename> Read record format from file
-string <string> Read record format from string
<output_file> is one of:
<filename> Output Avro schema to file
- Output Avro schema to standard output
I also tried using the Write Hive Table component but I receive the following error:
[B276]
The internal charset "XXcharset_NONE" was encountered when a valid character set data
structure was expected. One possible cause of this error is that you specified a
character set to the Co>Operating System that is misspelled or otherwise incorrect.
If you cannot resolve the error please contact Customer Support.
Any help would be great; I am trying to get my output into HDFS in Parquet format.
Thanks,
Chris Richardson
I know this is a late reply, but if you're still working on this or somebody else stumbles onto this like I did, I think I've found a solution.
I used dml-to-hive to create a DML for parquet format and write it to a file.
dml-to-hive parquet current.dml > parquet.dml
Once this DML is created, you can use it on the in port of the "Write HDFS" component. Double-click the component, go to the Port tab, select the "Use File" radio button, and then point it to parquet.dml.
Then, just set the WRITE_FORMAT choice to parquet and give it a whirl. I was able to create parquet, orc, and avro files using the above process.
I'm trying to extract data from an Oracle table. I'm using UTL_FILE for that and I'm receiving the error ORA-29285: file write error. The weird thing is that if I extract the data directly from the table, I get the error; if I extract the data using a simple view, the error is returned as well; BUT if I extract the data using a view with an ORDER BY, the extraction succeeds. I can't understand where the error comes from; I already checked the length of the lines and found nothing. Any suggestion as to what the cause could be?
I extract a lot of other data through UTL_FILE without problems. This particular data was originally loaded into the Oracle table directly from a CSV file with ANSI encoding; however, I have other data loaded the same way that I can export correctly. I checked the encoding too, in order to rule out possible causes, and found nothing.
Many thanks,
Priscila Ferreira
I am new to Hadoop and I am using a single node cluster (for development) to pull some data from a relational database.
Specifically, I am using Spark (version 1.4.1), Java API, to pull data for a query and write it to Hive. I have run into various problems (and have read the manuals and tried searching online), but I think I might be misunderstanding some fundamental part of this.
First, I thought I'd be able to read data into Spark, optionally run some Spark methods to manipulate the data and then write it to Hive through a HiveContext object. But, there doesn't seem to be any way to write straight from Spark to Hive. Is that true?
So I need an intermediate step. I have tried a few different methods of storing the data first before writing to Hive and settled on writing an HDFS text file, since that seemed to work best for me. However, when writing the HDFS file, I get square brackets in the files, like this: [A,B,C]
So, when I load the data into Hive using the "LOAD DATA INPATH..." HiveQL statement, I get the square brackets in the Hive table!!
What am I missing? Or more appropriately, can someone please help me understand the steps I need to do to:
Run a SQL on SQL Server or Oracle DB
Write the data out to a Hive table that can be accessed by a dashboard tool.
My code right now looks something like this:
DataFrame df = sqlContext.read().format("jdbc").options(getSqlContextOptions(driver, dburl, query)).load(); // This step seems to work fine.
JavaRDD<Row> rdd = df.javaRDD();
rdd.saveAsTextFile(getHdfsUri() + pathToFile); // This works, but writes the rows in square brackets, like: [1, AAA].
hiveContext.sql("CREATE TABLE BLAH (MY_ID INT, MY_DESC STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' STORED AS TEXTFILE");
hiveContext.sql("LOAD DATA INPATH '" + getHdfsUri() + hdfsFile + "' OVERWRITE INTO TABLE `BLAH`"); // Gets written like:
MY_ID   MY_DESC
-----   -------
        AAA]
The INT column doesn't get written at all, because the leading [ makes it no longer a numeric value, and the last column picks up the ] from the end of the row in the HDFS file.
Please help me understand why this isn't working or what a better way would be. Thanks!
I am not locked into any specific approach, so all options would be appreciated.
Ok, I figured out what I was doing wrong. I needed to use the write function on a DataFrame built through the HiveContext, together with the com.databricks.spark.csv format, to write the table in Hive. This does not require an intermediate step of saving a file in HDFS, which is great, and it writes to Hive successfully.
DataFrame df = hiveContext.createDataFrame(rdd, struct);
df.select(cols).write().format("com.databricks.spark.csv").mode(SaveMode.Append).saveAsTable("TABLENAME");
I did need to create a StructType object, though, to pass into the createDataFrame method for the proper mapping of the data types (something like what is shown in the middle of this page: Support for User Defined Types for java in Spark). And the cols variable is an array of Column objects, which is really just an array of column names (i.e. something like Column[] cols = {new Column("COL1"), new Column("COL2")};).
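Putting the pieces together, here is a minimal sketch in Spark 1.4 Java, assuming the spark-csv package is on the classpath and reusing the rdd and hiveContext variables from above; the schema simply mirrors the BLAH table from the question:

import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

// Describe the row layout so createDataFrame can map the RDD's fields to typed columns.
StructType struct = DataTypes.createStructType(new StructField[] {
    DataTypes.createStructField("MY_ID", DataTypes.IntegerType, true),
    DataTypes.createStructField("MY_DESC", DataTypes.StringType, true)
});

DataFrame df = hiveContext.createDataFrame(rdd, struct);

// Write straight into a Hive table; no intermediate HDFS text file needed.
df.write()
  .format("com.databricks.spark.csv")
  .mode(SaveMode.Append)
  .saveAsTable("BLAH");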
I think "Insert" is not yet supported.
http://spark.apache.org/docs/latest/sql-programming-guide.html#compatibility-with-apache-hive
To get rid of the brackets in the text file, you should avoid saveAsTextFile. Instead, try writing the contents using the HDFS API, i.e. FSDataOutputStream.
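A rough sketch of that idea, assuming the rdd of Rows from the question, a result set small enough to collect to the driver, and an example output path:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.spark.sql.Row;

Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
FSDataOutputStream out = fs.create(new Path("/tmp/blah.csv"));
try {
    // Format each Row by hand so no [brackets] from Row.toString() end up in the file.
    for (Row row : rdd.collect()) {
        out.writeBytes(row.get(0) + "," + row.get(1) + "\n");
    }
} finally {
    out.close();
}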
Is it possible to get the filename of a record in Hive? That would be incredibly helpful for debugging.
In my particular case, I have incorrect values in a table that is mapped to a folder with more than 100 large files. Using grep would be very inefficient.
Hive supports virtual columns, for example INPUT__FILE__NAME, which gives the input file's name for a mapper task.
Have a look at the documentation here. It provides some examples of how to do this.
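For example, a query along these lines (the table and column names are placeholders) returns the file each matching row came from:

SELECT INPUT__FILE__NAME, some_column
FROM my_table
WHERE some_column = 'incorrect_value';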
Unfortunately, I'm unable to test the same now. Let me know if this is working or not.
I'm trying to import data from a CSV file which, unfortunately, contains multiple data tables. Actually, it's not really a pure CSV file.
It contains a header section with some metadata, and then the actual CSV data parts are separated by:
//-------------
Table <table_nr>;;;;
An example file looks as follows:
Summary;;
Reporting Date;29/05/2013;12:36:18
Report Name;xyz
Reporting Period From;20/05/2013;00:00:00
Reporting Period To;26/05/2013;23:59:59
//-------------
Table 1;;;;
header1;header2;header3;header4;header5
string_aw;0;0;0;0
string_ax;1;1;1;0
string_ay;1;2;0;1
string_az;0;0;0;0
TOTAL;2;3;1;1
//-------------
Table 2;;;
header1;header2;header3;header4
string_bv;2;2;2
string_bw;3;2;3
string_bx;1;1;1
string_by;1;1;1
string_bz;0;0;0
What would be the best way to process and load such data using Kettle?
Is there a way to split this file into the header and CSV data parts and then process each of them as separate inputs?
Thanks in advance for any hints and tips.
Best,
Haes.
I don't think there are any steps that will really help you with data in such a format. You probably need to do some preprocessing before bringing your data into a CSV Input step. You could still do this in your job, though, by calling out to the shell and executing a command there first, for example an awk script that splits the file into its component files, and then loading those files via the normal Kettle pattern.
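As a rough sketch of that preprocessing (the file names are just examples), an awk command that starts a new output file at each // separator line could look like this:

# Split report.csv at every line that starts with "//" (the table separators).
# The metadata header lands in part_0.csv, Table 1 in part_1.csv, and so on;
# the "Table N;;;;" label line stays at the top of each part.
awk 'BEGIN { n = 0 }
     /^\/\// { n++; next }
     { print > ("part_" n ".csv") }' report.csv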