How to use the header (first row) as field names in Pig - hadoop

Given a CSV file whose first row can be treated as a header, how can one load the field names dynamically in Pig from that header? For example:
id,year,total
1,1999,190
2,1998,20
a = LOAD '/path/to/file.csv' USING PigStorage() AS --use first row as field names
> describe a;
> id:bytearray,year:bytearray,total:bytearray

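Since this is a CSV file, you can load it with CSVLoader from piggybank. Be aware that CSVLoader does not take field names from the first row: you still declare them in the AS clause, and the header line is read as an ordinary record, so it has to be filtered out. (If your piggybank version includes CSVExcelStorage, its SKIP_INPUT_HEADER option is another way to drop the header.) Your script would look like this:
-- Register the piggybank jar
REGISTER piggybank.jar;
DEFINE CSVLoader org.apache.pig.piggybank.storage.CSVLoader();
A = LOAD '/path/to/file.csv' USING CSVLoader AS (id:int, year:chararray, total:int);
-- The header's "id" cannot be cast to int and becomes null, so this drops that row
B = FILTER A BY id IS NOT NULL;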
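If the loader in use does not pick up field names from the header, one possible workaround is to build the AS clause outside Pig and hand it to the script through parameter substitution. The following is only a sketch: the helper name `schema_from_header` and the idea of passing its result as a parameter are assumptions, not part of Pig or piggybank. Every field is typed as bytearray, matching the describe output in the question.

```python
import csv
import io


def schema_from_header(header_line, default_type="bytearray"):
    """Turn a CSV header line into the body of a Pig AS clause."""
    # csv.reader copes with quoted header cells.
    names = next(csv.reader(io.StringIO(header_line)))
    return ", ".join("%s:%s" % (name.strip(), default_type) for name in names)


if __name__ == "__main__":
    # Header taken from the question's sample file.
    print(schema_from_header("id,year,total"))
```

The resulting string could then be supplied with something like pig -param SCHEMA="..." and referenced in the script as AS ($SCHEMA), where SCHEMA is a hypothetical parameter name. The header row itself would still need to be skipped or filtered out when loading.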

Related

Talend tFileInputDelimited component java.lang.NumberFormatException for CSV file

As a beginner with TOS for BD (Talend Open Studio for Big Data), I am trying to read two CSV files. I inferred the metadata schema from the CSV file itself, set the first row as the header, and set the delimiter to a comma (,).
In my job:
The tMap reads the CSV file, does a lookup on another CSV file, and generates two output files: passed and rejected records.
But while running the job I get the error below.
Couldn't parse value for column 'Product_ID' in 'row1', value is '4569,Laptop,10'. Details: java.lang.NumberFormatException: For input string: "4569,Laptop,10"
I believe it is treating the entire row as a single string value for the "Product_ID" column.
I don't know why that happens when I have set the delimiter and row separator correctly.
[Screenshot: Schema]
I can see that no rows come out of the first tFileInputDelimited because of the above error.
[Screenshot: Job Run]
[Screenshot: Input component]
Any idea what else I can check?
Thanks in advance.
In your last screenshot, you can see that the Field separator of your tFileInputDelimited_1 is ; and not ,.
I believe that you haven't set up your component to use the metadata you created for your csv file.
So you need to configure the component to use the metadata you've created by selecting Repository under Property Type, and selecting the delimited file metadata.

Issue with comma as a delimiter in Pig Latin for a free-text column

I am loading a file with PigStorage. The file has a column, Newvalue, which is free text and contains commas. When I specify a comma as the delimiter, that column gets split incorrectly. I am using the following code.
inpt = load '/home/cd36630/CRM/1monthSample.txt' USING PigStorage(',')
AS (BusCom:chararray,Operation:chararray,OperationDate:chararray,
ISA:chararray,User:chararray,Field:chararray,Oldvalue:chararray,
Newvalue:chararray,RecordId:chararray);
Any help is appreciated.
If the input is in CSV form, you can use CSVLoader to load it, which may fix your issue.
If that doesn't work, you can load each line into a single chararray and then write a UDF that splits the line in a way that respects the commas in Newvalue. For example:
REGISTER 'myudfs.py' USING jython AS myudfs;
A = LOAD '/home/cd36630/CRM/1monthSample.txt' AS (total:chararray);
B = FOREACH A GENERATE myudfs.prepare_input(total);
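A minimal sketch of what such a prepare_input function might look like is given here. It rests on an assumption that is not stated in the question: Newvalue is the only field that contains commas, so the seven fields before it and the one field after it (RecordId) can be peeled off from either end. Pig's Jython UDF support normally also expects an @outputSchema decorator declaring the returned tuple; it is left out so the function also runs as plain Python.

```python
# Column counts taken from the AS clause in the question:
# 7 comma-free fields, then Newvalue, then RecordId.
LEADING_FIELDS = 7
TRAILING_FIELDS = 1


def prepare_input(total):
    """Split a raw line into 9 fields, keeping commas inside Newvalue."""
    if total is None:
        return None
    # Peel the first 7 fields off the left; the remainder stays in one piece.
    parts = total.split(",", LEADING_FIELDS)
    if len(parts) <= LEADING_FIELDS:
        return None  # too few delimiters: treat as a malformed record
    head, rest = parts[:LEADING_FIELDS], parts[LEADING_FIELDS]
    # Peel RecordId off the right; what is left in the middle is Newvalue.
    middle = rest.rsplit(",", TRAILING_FIELDS)
    if len(middle) <= TRAILING_FIELDS:
        return None
    return tuple(head + middle)
```

If more than one column can contain commas, this approach breaks down and a properly quoted or escaped input format is needed instead.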

In Pig, what is the difference between LOAD with USING PigStorage and LOAD without it?

In Pig's LOAD step, what is the difference between specifying USING PigStorage and omitting it?
I want to know the difference between the two statements below.
movie2 = load 'movie/part-m-00000' as (mid:int, mname:chararray, myr:int);
movie2 = load 'movie/part-m-00000' using PigStorage(',') as (mid:int, mname:chararray, myr:int);
The default is to use PigStorage, which reads text files in which fields are separated by a delimiter; the default delimiter is the tab character.
Specifying USING PigStorage(',') changes the delimiter to a comma.
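The effect of the delimiter choice can be illustrated in plain Python. This only mimics delimiter-based splitting and is not PigStorage's actual code; the sample line is made up for the illustration.

```python
def split_fields(line, delimiter="\t"):
    """Split one text line into fields the way a delimited loader would."""
    return line.rstrip("\n").split(delimiter)


if __name__ == "__main__":
    line = "1,Toy Story,1995"  # made-up, comma-separated sample record
    print(split_fields(line))       # tab delimiter: the whole line is one field
    print(split_fields(line, ","))  # comma delimiter: three fields
```

With a comma-separated file, the first statement in the question would therefore see a single field per line, which is unlikely to match the three-column schema it declares.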
Adding to rsp's answer, there are two advantages to specifying PigStorage explicitly:
Option to specify the file delimiter
Option to load the schema of the input or not.
More details here: http://pig.apache.org/docs/r0.10.0/api/org/apache/pig/builtin/PigStorage.html

Reading files in Pig when the delimiter appears in the data

I want to read a CSV file using Pig. What should I do? I used LOAD with PigStorage(','), but it fails to read the CSV file properly because it splits a field wherever it encounters a comma (,) in the data. How should I specify the delimiter if the data also contains commas?
In general, a comma in the data cannot be distinguished from a comma used as a delimiter.
You will need to escape the commas that are part of your data and write a custom load function (for Pig) that recognizes escaped commas.
Take a look here:
http://ofps.oreilly.com/titles/9781449302641/load_and_store_funcs.html
http://pig.apache.org/docs/r0.7.0/udf.html#Load%2FStore+Functions
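The splitting logic that such a custom loader would need can be sketched in Python. The backslash as the escape character is an assumption chosen for the illustration, and the function name `split_escaped` is hypothetical; a real Pig load function would be written against Pig's load function interface, which is not shown here.

```python
def split_escaped(line, delimiter=",", escape="\\"):
    """Split line on delimiter, treating escape+char as a literal char."""
    fields, current = [], []
    chars = iter(line)
    for ch in chars:
        if ch == escape:
            # Keep the character after the escape literally (may be a comma).
            current.append(next(chars, ""))
        elif ch == delimiter:
            fields.append("".join(current))
            current = []
        else:
            current.append(ch)
    fields.append("".join(current))
    return fields


if __name__ == "__main__":
    print(split_escaped(r"1,Smith\, John,NY"))
```

The data producer has to apply the same convention when writing the file, otherwise the escape characters will not be there to find.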
Have you looked at the CSVLoader in PiggyBank for reading CSV files? (Of course, the file format needs to be valid.)
First, make sure you have a valid CSV file. If you don't, try changing the source file with Excel (if the file is small) or another tool, and export a new CSV with a delimiter that does not occur in your data (e.g. \t (tab), ;, etc.). Even better, redo the extract with a "good" delimiter.
Your load could then look something like this:
TABLE = LOAD 'input.csv' USING PigStorage(';') AS ( site_id: int,
name: chararray, ... );
Example of your STORE:
STORE TABLE INTO 'clean.csv' USING PigStorage(','); -- use the delimiter that suits you best
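The re-export step suggested above can also be scripted rather than done in Excel. The sketch below uses Python's standard csv module to read a valid, quoted CSV and write it back out with a tab delimiter; the function name `change_delimiter` is hypothetical, and it assumes the chosen delimiter does not occur in the field values.

```python
import csv
import io


def change_delimiter(src, dst, new_delimiter="\t"):
    """Copy CSV rows from src to dst, writing them with new_delimiter."""
    reader = csv.reader(src)  # understands quoted fields containing commas
    # QUOTE_NONE writes fields without surrounding quotes, since a plain
    # delimiter-splitting loader would not strip them.
    writer = csv.writer(dst, delimiter=new_delimiter,
                        quoting=csv.QUOTE_NONE, escapechar="\\",
                        lineterminator="\n")
    for row in reader:
        writer.writerow(row)


if __name__ == "__main__":
    source = io.StringIO('1,"Smith, John",NY\n')
    target = io.StringIO()
    change_delimiter(source, target)
    print(repr(target.getvalue()))
```

For real files, the same function can be called with file objects opened with newline="", and the output can then be loaded with PigStorage using the new delimiter.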

Pass a relation to a PIG UDF when using FOREACH on another relation?

We are using Pig 0.6 to process some data. One of the columns of our data is a space-separated list of ids (such as: 35 521 225). We are trying to map one of those ids using another file that contains two columns of mappings, like this (column 1 is our data, column 2 is a third party's data):
35 6009
521 21599
225 51991
12 6129
We wrote a UDF that takes in the column value (so: "35 521 225") and the mappings from the file. We would then split the column value and iterate over each and return the first mapped value from the passed in mappings (thinking that is how it would logically work).
We are loading the data in PIG like this:
data = LOAD 'input.txt' USING PigStorage() AS (title:chararray, category:chararray);
mappings = LOAD 'mappings.txt' USING PigStorage() AS (ourId:chararray, theirId:chararray);
Then our generate is:
output = FOREACH data GENERATE title, com.example.ourudf.Mapper(category, mappings);
However the error we get is:
'there is an error during parsing: Invalid alias mappings in [data::title: chararray,data::category, chararray]'
It seems that Pig is trying to find a column called "mappings" in our original data, which of course isn't there. Is there any way to pass a loaded relation into a UDF?
Is there any way the "Map" type in PIG will help us here? Or do we need to somehow join the values?
EDIT: To be more specific - we don't want to map ALL of the category ids to the 3rd party ids. We just wanted to map the first. The UDF will iterate over the list of our category ids - and will return when it finds the first mapped value. So if the input looked like:
someProduct\t35 521 225
the output would be:
someProduct\t6009
I don't think you can do it this way in Pig.
A solution similar to what you wanted to do would be to load the mapping file in the UDF and then process each record in a FOREACH. An example is available in PiggyBank LookupInFiles. It is recommended to use the DistributedCache instead of copying the file directly from the DFS.
DEFINE MAP_PRODUCT com.example.ourudf.Mapper('hdfs://../mappings.txt');
data = LOAD 'input.txt' USING PigStorage() AS (title:chararray, category:chararray);
output = FOREACH data GENERATE title, MAP_PRODUCT(category);
This will work if your mapping file is not too big. If it does not fit in memory, you will have to either partition the mapping file and run the script several times, or tweak the mapping file's schema by adding a line number and then use a native join followed by a nested FOREACH with ORDER BY/LIMIT 1 for each product.
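The lookup logic that such a UDF would run for each record can be sketched as follows. This is plain Python showing only the in-memory part; the function names `load_mappings` and `first_mapped` are hypothetical, and reading the mapping file from HDFS or the DistributedCache is not shown.

```python
def load_mappings(lines):
    """Build {ourId: theirId} from whitespace-separated two-column lines."""
    mappings = {}
    for line in lines:
        parts = line.split()
        if len(parts) == 2:  # skip blank or malformed lines
            mappings[parts[0]] = parts[1]
    return mappings


def first_mapped(category, mappings):
    """Return the mapping of the first id in category that has one, else None."""
    if category is None:
        return None
    for our_id in category.split():
        if our_id in mappings:
            return mappings[our_id]
    return None


if __name__ == "__main__":
    # Mapping rows and input value taken from the question.
    table = load_mappings(["35 6009", "521 21599", "225 51991", "12 6129"])
    print(first_mapped("35 521 225", table))
```

Building the dictionary once per task, rather than once per record, is what keeps this approach practical for mapping files that fit in memory.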
