Reading files in PIG where delemeter comes in data - hadoop

I want to read a CSV file using PIG what should i Do?. I used load n pigstorage(',') but it fails to read CSV file properly because where it encounters comma (,) in data it splits it.How should i give delimeter now if i have comma in data also?

It's generally impossible to distinguish comma in data from comma as a delimiter.
You will need to escape that comma that is in your 'data' and custom load function (for Pig) that can recognize escaped commas.
Take a look here:
http://ofps.oreilly.com/titles/9781449302641/load_and_store_funcs.html
http://pig.apache.org/docs/r0.7.0/udf.html#Load%2FStore+Functions

Have you had a look at the CSVLoader loader in the PiggyBank if you want to read a CSV file? (of course the file format needs to be valid)

First make sure you have a valid CSV file. In the case you haven't try to change the source file through Excel (if the file is small) or other tool and export a new CSV with a good delimiter for your data (Ex: \t tab, ; , etc). Even better can be do another extract with a "good" delimiter.
Example of your load can be then something like this:
TABLE = LOAD 'input.csv' USING PigStorage(';') AS ( site_id: int,
name: chararray, ... );
Example of your DUMP:
STORE TABLE INTO 'clean.csv' using PigStorage(','); <- delimiter that suits you best

Related

SQLLoader control file for reading comma or tab or pipe separated/delimited data?

I have three kinds of data, that is comma or tab or pipe separated data, trying to ingest file with comma or tab or pipe separated data with single control file?
Is it possible to load different kind of delimited data using single control file?
Egg:
Test1.csv
content:
firstname,lastname
rachel,green
chandler,bing
Test2.tsv
content:
firstname lastname
rachel green
chandler bing
Test3.psv
content:
firstname|lastname
rachel|green
chandler|bing
My current control file:
test.ctl
load data into table USERNAMES APPEND fields terminated by '\t' (firstname,lastname)
Expecting something like:
load data into table USERNAMES APPEND fields optionally terminated by '\t' or "," or "|" (firstname,lastname)
Nope, unfortunately you can't make it using same control file
docs
I believe sqlldr only allows for one delimiter, so you could pre-reprocess the file with a script to make all delimiters the same first (you'll need to know if those characters could be in the data though-could get ugly) and technically this changes the incoming data.
Alternatively you could load the file after pre-processing as above into a staging table where some sanity checks could be performed, then load into the main data table if sanity checks and validation passes.

Talend tInputFileDelimited component java.lang.NumberFormatException for CSV file

As a beginner to TOS for BD, I am trying to read two csv files in Talend OS, i have inferred the metadata schema from the same CSV file, and setup the first row to be header, and delimiter as comma (,)
In my code:
The tMap will read the csv file, and do a lookup on another csv file, and generate two output files passed and reject records.
But while running the job i am getting below error.
Couldn't parse value for column 'Product_ID' in 'row1', value is '4569,Laptop,10'. Details: java.lang.NumberFormatException: For input string: "4569,Laptop,10"
I believe it is considering entire row as one string to be the value for "Product_ID" column
I don't know why that is happening when i have set the delimiter and row separator correctly.
Schema
I can see no rows are going from the first tInputFileDelimited due to above error.
Job Run
Input component
Any idea what else can i check?
Thanks in advance.
In your last screenshot, you can see that the Field separator of your tFileInputDelimited_1 is ; and not ,.
I believe that you haven't set up your component to use the metadata you created for your csv file.
So you need to configure the component to use the metadata you've created by selecting Repository under Property Type, and selecting the delimited file metadata.

How do I export a spreadsheet (csv) in excel using ascii control characters as the delimiters?

I have this csv file that I would like to parse with Ruby. The file's data is a cluster with commas and new lines in the fields but Excel still reads it properly. If the file could be exported from excel using the unit and record separators as the delimiters for the columns and rows, I'd be golden.
Anybody know how to specify those characters in excel? Thanks!
Use Ruby CSV with this option:
:col_sep
The String placed between each field. This String will be transcoded
into the data’s Encoding before parsing.
See more here: http://ruby-doc.org/stdlib-2.0.0/libdoc/csv/rdoc/CSV.html
I ended up having Google Sheets export the file as json. Steps I followed here There were 10,000 records and the browser tab crashed when it tried to do all of them. So I had to piece meal it. I'm sure there's a better way to do it.

Issue with Comma as a Delimiter in Latin Pig for free text column

I am loading a file to PigStorage. The file has a column Newvalue, a free text column which includes commas in it. When I specify comma as delimiter this gives me problem. I am using following code.
inpt = load '/home/cd36630/CRM/1monthSample.txt' USING PigStorage(',')
AS (BusCom:chararray,Operation:chararray,OperationDate:chararray,
ISA:chararray,User:chararray,Field:chararray,Oldvalue:chararray,
Newvalue:chararray,RecordId:chararray);
Any help is appreciated.
If the input is in csv form then you can use CSVLoader to load it. This may fix your issue.
If this doesn't work then you can load into a single chararray and then write a UDF to split the total line in a way that respects the spaces in Newvalue. EG:
register 'myudfs.py' using jython as myudfs ;
A = LOAD '/home/cd36630/CRM/1monthSample.txt' AS (total:chararray) ;
B = FOREACH A GENERATE myudf.prepare_input(total) ;

using PIG to load a file

I am very new to PIG and I am having what feels like a very basic problem.
I have a line of code that reads:
A = load 'Sites/trial_clustering/shortdocs/*'
AS (word1:chararray, word2:chararray, word3:chararray, word4:chararray);
where each file is basically a line of 4 comma separated words. However PIG is not splitting this into the 4 words. When I do dump A, I get: (Money, coins, loans, debt,,,)
I have tried googling and I cannot seem to find what format my file needs to be in so that PIG will interpret it properly. Please help!
Your problem is that Pig, by default, loads files delimited by tab, not comma. What's happening is "Money, coins, loans, debt" are getting stuck in your first column, word1. When you are printing it, you get the illusion that you have multiple columns, but really the first one is filled with your whole line, then the others are null.
To fix this, you should specify PigStorage to load by comma by doing:
A = LOAD '...' USING PigStorage(',') AS (...);

Resources