Using MultiStorage to store records in separate files - hadoop

I'm trying to store a set of records like these:
2342514224232 | some text here whatever
2342514224234| some more text here whatever
....
into separate files in the output folder like this:
output / 2342514224232
output / 2342514224234
The value of idstr should be the file name, and the text should be the contents of the file. Here's my Pig code:
REGISTER /home/bytebiscuit/pig-0.11.1/contrib/piggybank/java/piggybank.jar;
A = LOAD 'cleantweets.csv' using PigStorage(',') AS (idstr:chararray, createdat:chararray, text:chararray,followers:int,friends:int,language:chararray,city:chararray,country:chararray,lat:chararray,lon:chararray);
B = FOREACH A GENERATE idstr, text, language, country;
C = FILTER B BY (country == 'United States' OR country == 'United Kingdom') AND language == 'en';
texts = FOREACH C GENERATE idstr,text;
STORE texts INTO 'output/query_results_one' USING org.apache.pig.piggybank.storage.MultiStorage('output/query_results_one', '0');
Running this pig script gives me the following error:
<file pigquery1.pig, line 12, column 0> pig script failed to validate: java.lang.RuntimeException: could not instantiate 'org.apache.pig.piggybank.storage.MultiStorage' with arguments '[output/query_results_one, idstr]'
Any help is really appreciated!

Try this option:
MultiStorage('output/query_results_one', '0', 'none', ',');

In case anybody stumbles across this post like I did, the problem for me was that my pig script looked like:
DEFINE MultiStorage org.apache.pig.piggybank.storage.MultiStorage();
...
STORE stuff INTO 's3:/...' USING MultiStorage('s3:/...','0','none',',');
The DEFINE statement was wrong because it did not specify the inputs/outputs. Dropping the DEFINE statement and using the fully qualified class name directly fixed my problem:
STORE stuff INTO 's3:/...' USING org.apache.pig.piggybank.storage.MultiStorage('s3:/...','0','none',',');
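For readers who want to see the result the question is after, here is a minimal Python sketch of the splitting idea: one output file per idstr, holding that record's text. It runs on a local directory and is not how MultiStorage is implemented; the function name split_records_by_key and the flat file layout are assumptions made for this illustration.

```python
import os
import tempfile


def split_records_by_key(records, out_dir, key_index=0):
    """Write each record's non-key fields to a file named after its key field."""
    os.makedirs(out_dir, exist_ok=True)
    written = []
    for record in records:
        key = record[key_index].strip()
        rest = [field for i, field in enumerate(record) if i != key_index]
        path = os.path.join(out_dir, key)
        # Append, so records that share a key end up in the same file.
        with open(path, "a", encoding="utf-8") as handle:
            handle.write(",".join(rest) + "\n")
        written.append(path)
    return written


if __name__ == "__main__":
    sample = [
        ("2342514224232", "some text here whatever"),
        ("2342514224234", "some more text here whatever"),
    ]
    with tempfile.TemporaryDirectory() as tmp:
        for path in split_records_by_key(sample, tmp):
            print(os.path.basename(path))
```

The key_index parameter plays the same role as the '0' argument in the STORE statements above: it selects which field names the output.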

Related

Is there a way to read fixed length files using csv.reader() module in Python 2.x

I have a fixed length file like:
0001ABC,DEF1234
The file definition is:
id[1:4]
name[5:11]
phone[12:15]
I need to load this data into a table. I tried to use the csv module and defined the fixed length of each field. It works fine except for the name field.
For the name field, only the value up to ABC gets loaded. The reason is that
the csv module treats the comma as a delimiter, so it takes 0001ABC as one value and stops parsing there.
I tried escapechar=',' while reading the file, but that removes the ',' from the data. I also tried quoting=csv.QUOTE_ALL, but that didn't work either.
with open("xyz.csv") as csvfile:
    readCSV = csv.reader(csvfile)
    writeCSV = open("sample_csv", 'w')
    output = csv.writer(writeCSV, dialect='excel', lineterminator="\n")
    for row in readCSV:
        print(row) # to debug #
        data = str(row[0])
        print(data) # to debug #
        id = data[0:4]
        name = data[5:11]
        phone = data[12:15]
        output.writerow([id, name, phone])
    writeCSV.close()
Output of the print commands:
row: ['0001ABC','DEF1234']
data: 0001ABC
Ideally, I expect to see the entire set 0001ABC,DEF1234 in the variable: data.
I can then use the parsing as mentioned in the code to break it into different fields.
Can you please let me know where I am going wrong?
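One possible way around this, sketched below, is to skip csv.reader for the input and slice each raw line, so the comma is never treated as a delimiter. This assumes the positions in the file definition are 1-based and inclusive (so name[5:11] covers ABC,DEF); the function name parse_fixed_width is made up for this example. The code should behave the same under Python 2.x and 3.x.

```python
def parse_fixed_width(line):
    """Slice one fixed-width record into (id, name, phone).

    The 1-based, inclusive positions from the file definition
    (id 1-4, name 5-11, phone 12-15) become 0-based, end-exclusive slices.
    """
    line = line.rstrip("\r\n")
    record_id = line[0:4]
    name = line[4:11]
    phone = line[11:15]
    return record_id, name, phone


if __name__ == "__main__":
    print(parse_fixed_width("0001ABC,DEF1234\n"))
```

The parsed fields can still be passed to csv.writer for the output file; the writer quotes the name field because it contains a comma.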

Pig filter not working

I have the following pig script,
meta_file = LOAD 'meta_file' USING PigStorage(',');
DUMP meta_file;
meta = FOREACH meta_file GENERATE (chararray)$0 AS is_vta:chararray, (chararray)$1 AS id:long;
DUMP meta;
new_d = FILTER meta BY (is_vta == 't');
DUMP new_d;
Contents of meta_file:
"t","7181397"
"t","6331589"
"f","7266217"
"t","6051440"
"t","6901437"
"t","6805292"
"f","7144764"
"t","6820265"
"f","7515321"
"t","4777938"
The DUMP of meta_file looks fine and matches the contents of the file, and so does the DUMP of meta, but new_d is empty. I can see rows in meta where is_vta is t, yet new_d is still empty. Why isn't meta being filtered properly? What am I doing wrong? I am new to Pig Latin and can't figure out what the problem might be.
Thanks for all your help.
Simple way:
new_d = FILTER meta BY is_vta MATCHES '.*t.*';
Another solution:
remquotes = FOREACH meta GENERATE REPLACE($0, '\\"', '') AS is_vta:chararray, id;
new_d = FILTER remquotes BY is_vta == 't';
I think the quotes are causing the problem. There are two ways to handle them:
1. Use piggybank to handle the quotes; the rest of your code should then work.
REGISTER 'piggybank.jar'; -- this jar handles quotes by default
A = LOAD 'fil.csv' USING org.apache.pig.piggybank.storage.CSVExcelStorage(',') AS (---Your Schema --- );
or
2. Use a regex to trim the quotes. See the related question "Remove single quotes from data using Pig".
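To make the quote problem concrete, here is a small Python stand-in (not Pig) using a few rows from the question. Splitting on commas alone leaves the quotes inside each value, so comparing against t matches nothing; a quote-aware parser removes them and the same comparison works. The helper names are hypothetical.

```python
import csv
import io

RAW = '"t","7181397"\n"f","7266217"\n"t","6051440"\n'


def split_keeping_quotes(text):
    """Split on commas only, so the surrounding quotes stay in each value."""
    return [line.split(",") for line in text.splitlines()]


def split_stripping_quotes(text):
    """Let the csv module parse the lines, which removes the quoting."""
    return list(csv.reader(io.StringIO(text)))


if __name__ == "__main__":
    naive = [row for row in split_keeping_quotes(RAW) if row[0] == "t"]
    parsed = [row for row in split_stripping_quotes(RAW) if row[0] == "t"]
    print(len(naive), len(parsed))
```

In the naive split the first field is the three-character string "t" including its quotes, which is why an equality test against t comes back empty.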

Dump is not working

I am using IBM BigInsights.
When I execute the DUMP command in Pig Grunt shell, I am not getting any result.
Sample Input file:
s_no,name,DOB,mobile_no,email_id,country_code,sex,disease,age
11111,bbb1,12-10-1950,1234567890,bbb1#xxx.com,1111111111,M,Diabetes,78
11112,bbb2,12-10-1984,1234567890,bbb2#xxx.com,1111111111,F,PCOS,67
11113,bbb3,712/11/1940,1234567890,bbb3#xxx.com,1111111111,M,Fever,90
11114,bbb4,12-12-1950,1234567890,bbb4#xxx.com,1111111111,F,Cold,88
11115,bbb5,12/13/1960,1234567890,bbb5#xxx.com,1111111111,M,Blood Pressure,76
INFO [JobControl] org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
My code is as follows:
A = LOAD 'healthcare_Sample_dataset1.csv' as(s_no:long,name:chararray,DOB:datetime,mobile_no:long,email_id:chararray,country_code:long,sex:chararray,disease:chararray,age:int);
B = FOREACH A GENERATE name;
C = LIMIT B 5;
DUMP C;
Kindly help me to resolve this.
Thanks and Regards!!!
From your script I can see that you are loading a CSV file. For CSV files you should use CSVLoader() in your Pig script. Your script should look like this:
--Register piggybank jar which contains UDF of CSVLoader
REGISTER piggybank.jar;
-- Define the UDF
define CSVLoader org.apache.pig.piggybank.storage.CSVLoader();
--Load data using CSVLoader
A = load '/user/biadmin/test/CBTTickets.csv' using CSVLoader AS (
Type:chararray,
Id:int,
Summary:chararray,
OwnedBy:chararray,
Status:chararray,
Priority:chararray,
Severity:chararray,
ModifiedDate:datetime,
PlannedFor:chararray,
TimeSpent:int);
B = FOREACH A GENERATE Type;
C = LIMIT B 5;
DUMP C;
Please post your input data if this does not work for you.
You have not given the full path of healthcare_Sample_dataset1.csv, which is why DUMP is not working properly.
Load the data using the full path of that file and DUMP will work.
I think you need to load all fields as bytearray and then remove the first row (the header), because the header values don't match the data types you want to impose on those fields.
OR
remove the first row using a text editor and use your own code.
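The header point can be illustrated with a short Python stand-in (not Pig), using rows from the sample input. The header text cannot be converted to the numeric type declared for the column, so it either has to be skipped or it yields an empty value; None is used here as a rough analogue of a failed cast. The function names are hypothetical.

```python
def to_int_or_none(value):
    """Return int(value), or None when the text is not a valid integer."""
    try:
        return int(value)
    except ValueError:
        return None


def load_ages(lines, skip_header=True):
    """Parse the last comma-separated column (age) of each line as an int."""
    rows = lines[1:] if skip_header else lines
    return [to_int_or_none(line.split(",")[-1]) for line in rows]


SAMPLE = [
    "s_no,name,DOB,mobile_no,email_id,country_code,sex,disease,age",
    "11111,bbb1,12-10-1950,1234567890,bbb1#xxx.com,1111111111,M,Diabetes,78",
    "11112,bbb2,12-10-1984,1234567890,bbb2#xxx.com,1111111111,F,PCOS,67",
]

if __name__ == "__main__":
    print(load_ages(SAMPLE, skip_header=False))
    print(load_ages(SAMPLE))
```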

Pig: Unable to Load BAG

I have a record in this format:
{(Larry Page),23,M}
{(Suman Dey),22,M}
{(Palani Pratap),25,M}
I am trying to LOAD the record using this:
records = LOAD '~/Documents/PigBag.txt' AS (details:BAG{name:tuple(fullname:chararray),age:int,gender:chararray});
But I am getting this error:
2015-02-04 20:09:41,556 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: <line 7, column 101> mismatched input ',' expecting RIGHT_CURLY
Please advise.
It's not a bag since it's not made up of tuples. Try
load ... as (name:tuple(fullname:chararray), age:int, gender:chararray)
For some reason Pig wraps each output line in curly braces, which makes it look like a bag, but it's not. If you saved this data using PigStorage, you can pass it the '-schema' option, which tells PigStorage to write a schema file (.pig_schema) that you can look at to see what the saved schema is. It can also be used when loading with PigStorage to save you the AS clause.
Yes, LiMuBei's point is absolutely right. Your input is not in the right format. Pig always expects a bag to hold a collection of tuples, but in your case it is a collection of a tuple and plain fields. In this case Pig will retain the tuple and reject the fields (age and gender) during load.
But this problem can easily be solved with a different approach (a somewhat hacky solution):
1. Load each input line as a chararray.
2. Remove the curly braces and parentheses from the input.
3. Use the STRSPLIT function to split the input into (name,age,sex) fields.
PigScript:
A = LOAD 'input' USING PigStorage AS (line:chararray);
B = FOREACH A GENERATE FLATTEN(REPLACE(line,'[}{)(]+','')) AS (newline:chararray);
C = FOREACH B GENERATE FLATTEN(STRSPLIT(newline,',',3)) AS (fullname:chararray,age:int,sex:chararray);
DUMP C;
Output:
(Larry Page,23,M)
(Suman Dey,22,M)
(Palani Pratap,25,M)
Now you can access all the fields using fullname,age,sex.
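The same three steps can be tried outside Pig. The Python sketch below mirrors them with the same character class; the function name parse_record is made up for this example. Note that STRSPLIT's limit of 3 (at most three pieces) corresponds to maxsplit=2 in Python's str.split.

```python
import re


def parse_record(line):
    """Strip braces and parentheses, then split into at most three fields."""
    # Step 2: drop every run of '{', '}', '(' or ')'.
    cleaned = re.sub(r"[}{)(]+", "", line.strip())
    # Step 3: split into name, age and sex.
    fullname, age, sex = cleaned.split(",", 2)
    return fullname, int(age), sex


if __name__ == "__main__":
    for raw in ["{(Larry Page),23,M}", "{(Suman Dey),22,M}", "{(Palani Pratap),25,M}"]:
        print(parse_record(raw))
```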

How to split a text file which is having '\t' and ',' values in Pig

I want to convert a text file that contains tab- and comma-separated values into a fully comma-separated file in Pig. I am using Apache Pig version 0.11.1. I have tried the following code, and also tried FLATTEN and TOKENIZE, but I cannot turn the file into a fully CSV file.
a = load '/home/mansoor/Documents/ip.txt' using PigStorage(',') as (key:chararray, val1:chararray, val2:chararray );
b = FOREACH a {
key= STRSPLIT(key,'\t');
GENERATE key;
}
Following is my text file input:
M12345 M123456,M234567,M987653
M23456 M23456,M123456,M234567
M34567 M234567,M765678,M987643
I need a fully comma-separated file like the following output:
M12345,M123456,M234567,M987653
M23456,M23456,M123456,M234567
M34567,M234567,M765678,M987643
How can I do this?
With Pig 0.13, just using LOAD without PigStorage loaded the file correctly.
a = load '/home/mansoor/Documents/ip.txt';
dump a
gives me
(M12345,M123456,M234567,M987653)
(M23456,M23456,M123456,M234567)
(M34567,M234567,M765678,M987643 )
If that's not what you want, you might want to consider the REPLACE function.
Here is a quick and dirty solution to get a usable CSV:
a = load '/home/mansoor/Documents/ip.txt' using PigStorage('\n');
b = foreach a generate FLATTEN(REPLACE($0, '\t', ','));
store b into 'tmp.csv';
You can then use the CSV as intended:
c = load 'tmp.csv' using PigStorage(',') as (key:chararray, val1:chararray, val2:chararray, val3:chararray);
describe c
gives c: {key: chararray,val1: chararray,val2: chararray, val3:chararray}
Try this:
a = load '/home/mansoor/Documents/ip.txt';
store a into '/home/mansoor/Documents/op' using PigStorage(',');
Now the file is fully converted into a CSV file.
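To check the expected result on a small sample before running the job, the tab-to-comma replacement can be sketched in Python. This assumes the key and the value list in the input are separated by a single tab character, as the question describes; the function name tabs_to_commas is hypothetical.

```python
def tabs_to_commas(line):
    """Replace every tab with a comma, leaving existing commas untouched."""
    return line.replace("\t", ",")


SAMPLE = [
    "M12345\tM123456,M234567,M987653",
    "M23456\tM23456,M123456,M234567",
    "M34567\tM234567,M765678,M987643",
]

if __name__ == "__main__":
    for row in SAMPLE:
        print(tabs_to_commas(row))
```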
