Issue with Comma as a Delimiter in Pig Latin for a Free-Text Column - syntax

I am loading a file with PigStorage. The file has a column, Newvalue, a free-text column that includes commas. When I specify comma as the delimiter, this causes a problem. I am using the following code.
inpt = load '/home/cd36630/CRM/1monthSample.txt' USING PigStorage(',')
AS (BusCom:chararray,Operation:chararray,OperationDate:chararray,
ISA:chararray,User:chararray,Field:chararray,Oldvalue:chararray,
Newvalue:chararray,RecordId:chararray);
Any help is appreciated.

If the input is in CSV form, you can use CSVLoader to load it. This may fix your issue.
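A minimal sketch of that approach, assuming piggybank.jar is on the path and reusing the schema from the question (CSVLoader only helps if the commas inside Newvalue are quoted, as in a well-formed CSV):
REGISTER piggybank.jar;
inpt = LOAD '/home/cd36630/CRM/1monthSample.txt'
       USING org.apache.pig.piggybank.storage.CSVLoader()
       AS (BusCom:chararray, Operation:chararray, OperationDate:chararray,
           ISA:chararray, User:chararray, Field:chararray, Oldvalue:chararray,
           Newvalue:chararray, RecordId:chararray);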
If this doesn't work, you can load each line into a single chararray and then write a UDF to split the line in a way that respects the commas in Newvalue, e.g.:
register 'myudfs.py' using jython as myudfs;
A = LOAD '/home/cd36630/CRM/1monthSample.txt' AS (total:chararray);
-- prepare_input is the splitting function defined in myudfs.py
B = FOREACH A GENERATE myudfs.prepare_input(total);

Related

Talend tFileInputDelimited component java.lang.NumberFormatException for CSV file

As a beginner to TOS for BD, I am trying to read two CSV files in Talend Open Studio. I have inferred the metadata schema from the same CSV file, set the first row to be the header, and set the delimiter as comma (,).
In my code:
The tMap reads the CSV file, does a lookup on another CSV file, and generates two output files: passed and rejected records.
But while running the job I am getting the error below.
Couldn't parse value for column 'Product_ID' in 'row1', value is '4569,Laptop,10'. Details: java.lang.NumberFormatException: For input string: "4569,Laptop,10"
I believe it is treating the entire row as one string and using it as the value for the "Product_ID" column.
I don't know why that is happening when I have set the delimiter and row separator correctly.
(Screenshot: Schema)
I can see that no rows are flowing out of the first tFileInputDelimited due to the above error.
(Screenshot: Job run)
(Screenshot: Input component)
Any idea what else I can check?
Thanks in advance.
In your last screenshot, you can see that the Field separator of your tFileInputDelimited_1 is ; and not ,.
I believe that you haven't set up your component to use the metadata you created for your CSV file.
So you need to configure the component to use the metadata you've created by selecting Repository under Property Type, and selecting the delimited file metadata.

How to use the header (first row) as field names in Pig

Given a CSV file whose first row can be taken as the header, how can one load the field names dynamically in Pig using these headers? i.e.
id,year,total
1,1999,190
2,1998,20
a = LOAD '/path/to/file.csv' USING PigStorage() AS --use first row as field names
> describe a;
> id:bytearray,year:bytearray,total:bytearray
As this is a CSV file and you want to use the first row as a header, you should use CSVLoader() for it. It will treat the first row as a header. Your script will look like this.
--Register the piggybank jar
REGISTER piggybank.jar;
DEFINE CSVLoader org.apache.pig.piggybank.storage.CSVLoader();
A = LOAD '/path/to/file.csv' USING CSVLoader AS (id:int, year:chararray, total:int);
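If the header row still shows up in the loaded relation, a common workaround (a sketch added here, not part of the original answer) is to load every field as chararray, filter the header row out by value, and cast afterwards:
A = LOAD '/path/to/file.csv' USING org.apache.pig.piggybank.storage.CSVLoader()
    AS (id:chararray, year:chararray, total:chararray);
-- drop the header row by matching the literal column name
B = FILTER A BY id != 'id';
-- cast the numeric columns to their intended types
C = FOREACH B GENERATE (int)id AS id, year, (int)total AS total;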

In Pig, when using the LOAD data flow step, what is the difference between using PigStorage and not using it?

In Pig, when using the LOAD data flow step, what is the difference between using PigStorage and not using it?
I want to know the difference between the steps below.
movie2 = load 'movie/part-m-00000' as (mid:int, mname:chararray, myr:int);
movie2 = load 'movie/part-m-00000' using PigStorage(',') as (mid:int, mname:chararray, myr:int);
The default is to use PigStorage, which reads a text file in which fields are separated by a delimiter, with the tab character as the default delimiter.
Specifying using PigStorage(',') changes the delimiter to a comma.
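In other words, the first statement is equivalent to spelling out the default tab delimiter explicitly (shown here only for illustration):
movie2 = load 'movie/part-m-00000' using PigStorage('\t') as (mid:int, mname:chararray, myr:int);
-- behaves the same as: load 'movie/part-m-00000' as (mid:int, mname:chararray, myr:int);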
Adding to rsp's answer, there are two advantages of using PigStorage:
Option to specify the field delimiter
Option to load/store the schema of the input or not (see the sketch below)
More details here: http://pig.apache.org/docs/r0.10.0/api/org/apache/pig/builtin/PigStorage.html
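For the second point, a minimal sketch, assuming Pig 0.10 or later where PigStorage accepts an options string:
-- '-schema' writes a hidden .pig_schema file next to the data when storing
STORE movie2 INTO 'movie_with_schema' USING PigStorage(',', '-schema');
-- a later load can then pick the schema up without an AS clause
movie3 = LOAD 'movie_with_schema' USING PigStorage(',', '-schema');
DESCRIBE movie3;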

Reading files in Pig where the delimiter appears in the data

I want to read a CSV file using Pig. What should I do? I used LOAD and PigStorage(',') but it fails to read the CSV file properly, because wherever it encounters a comma (,) in the data it splits on it. How should I specify the delimiter now if I also have commas in the data?
It's generally impossible to distinguish a comma in the data from a comma used as a delimiter.
You will need to escape the commas that are in your data and use a custom load function (for Pig) that can recognize escaped commas.
Take a look here:
http://ofps.oreilly.com/titles/9781449302641/load_and_store_funcs.html
http://pig.apache.org/docs/r0.7.0/udf.html#Load%2FStore+Functions
Have you had a look at the CSVLoader in the PiggyBank, if you want to read a CSV file? (Of course, the file format needs to be valid.)
First, make sure you have a valid CSV file. If you don't, try changing the source file through Excel (if the file is small) or another tool and export a new CSV with a delimiter that suits your data (e.g. \t tab, ;, etc.). Even better, do another extract with a "good" delimiter.
Your load could then look something like this:
TABLE = LOAD 'input.csv' USING PigStorage(';') AS ( site_id: int,
name: chararray, ... );
Example of your STORE:
STORE TABLE INTO 'clean.csv' USING PigStorage(','); -- delimiter that suits you best

Using Pig to load a file

I am very new to Pig and I am having what feels like a very basic problem.
I have a line of code that reads:
A = load 'Sites/trial_clustering/shortdocs/*'
AS (word1:chararray, word2:chararray, word3:chararray, word4:chararray);
where each file is basically a line of 4 comma-separated words. However, Pig is not splitting this into the 4 words. When I do DUMP A, I get: (Money, coins, loans, debt,,,)
I have tried googling and I cannot seem to find what format my file needs to be in so that Pig will interpret it properly. Please help!
Your problem is that Pig, by default, loads files delimited by tab, not comma. What's happening is that "Money, coins, loans, debt" is all ending up in your first column, word1. When you print it, you get the illusion that you have multiple columns, but really the first one is filled with your whole line, and the others are null.
To fix this, you should specify PigStorage to load by comma by doing:
A = LOAD '...' USING PigStorage(',') AS (...);
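Applied to the load statement from the question, that would look something like:
A = load 'Sites/trial_clustering/shortdocs/*' USING PigStorage(',')
    AS (word1:chararray, word2:chararray, word3:chararray, word4:chararray);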
