Using Pig to load a file - Hadoop

I am very new to Pig and I am having what feels like a very basic problem.
I have a line of code that reads:
A = load 'Sites/trial_clustering/shortdocs/*'
AS (word1:chararray, word2:chararray, word3:chararray, word4:chararray);
where each file is basically a line of 4 comma-separated words. However, Pig is not splitting this into the 4 words. When I do dump A, I get: (Money, coins, loans, debt,,,)
I have tried googling and I cannot seem to find what format my file needs to be in so that Pig will interpret it properly. Please help!

Your problem is that Pig, by default, loads files delimited by tabs, not commas. What's happening is that "Money, coins, loans, debt" all ends up in your first column, word1. When you print it, you get the illusion of multiple columns, but really the first one holds your whole line and the others are null.
To fix this, tell PigStorage to split on commas:
A = LOAD '...' USING PigStorage(',') AS (...);
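Applied to the load statement from the question, the fixed script would look like this (a sketch; the path and schema are copied from the question above):
A = LOAD 'Sites/trial_clustering/shortdocs/*' USING PigStorage(',')
    AS (word1:chararray, word2:chararray, word3:chararray, word4:chararray);
DUMP A;
-- Each line now splits into four chararray fields. Note that PigStorage
-- does not trim whitespace, so any space after a comma stays in the field.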

Related

Compare 2 CSV files using a shell script and print the output to a 3rd file

I am learning shell scripting and am trying to use it to build a testing framework for my team, so I need your help with something.
Overview: I am extracting aggregated values from Hive through my queries using a shell script and storing the result in a separate file, let's say File1.csv.
Now I want to compare the above CSV file with another CSV file, File2.csv, using a shell script and print the result row-wise as PASS (if the records match) or FAIL (if they do not) into a third file, let's say output.txt.
Note: first we need to sort the records in File1.csv, then compare it with File2.csv, and finally store the PASS/FAIL result row-wise into output.txt.
Format of File1.csv
Postcode Location InnerLocation Value_% Volume_%
XYZ London InnerLondon 6.987 2.561
ABC NY High Street 3.564 0.671
DEF Florida Miami 8.129 3.178
Quick help will be appreciated. Thanks in advance.
You have two sorted text files and want to see which lines are different. There is nothing in your question which would make the problem CSV specific.
A convenient tool for this type of task would be sdiff.
sdiff -s File[12].csv
The -s option ensures that you see only different lines, but have a look at the sdiff man page: maybe you also want to add one of the options dealing with white space.
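For example, to also ignore all white space while comparing (GNU diffutils):
sdiff -s -W File[12].csv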
If you need to go into more detail and, for example, point out not just which CSV lines differ but which field in a line differs, and if the files are really general CSV files, you should use a CSV parser rather than a shell script. Parsing a CSV file from a shell script only really works if you know for sure that just a subset of all the features allowed in CSV files is actually used.
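That said, if a plain row-by-row PASS/FAIL report is enough, a minimal sketch could look like the following. It assumes both files end up with the same number of rows after sorting and that the rows contain no '|' character; the file names are the ones from the question:
#!/bin/sh
# Sort File1.csv as the question requires; sort File2.csv the same way
# so that matching records line up row by row.
sort File1.csv > File1.sorted
sort File2.csv > File2.sorted

# Walk both files in lockstep and write PASS/FAIL per row into output.txt.
paste -d'|' File1.sorted File2.sorted | while IFS='|' read -r left right; do
    if [ "$left" = "$right" ]; then
        printf 'PASS: %s\n' "$left"
    else
        printf 'FAIL: %s | %s\n' "$left" "$right"
    fi
done > output.txt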

Escaping separator in data while bulk loading using importtsv tool and ingesting numeric values

I am using the importtsv tool to ingest data, and I have some doubts. I am using HBase 1.1.5.
First, does it ingest non-string/numeric values? I was referring to this link detailing importtsv in the Cloudera distribution. It says: "it interprets everything as strings". So I was wondering what that means.
I am using a simple word-count example where the first column is a word and the second column is the word count.
When I keep the file as follows:
"access","1"
"about","1"
and ingest and then do a scan in the hbase shell, it gives the following output:
about column=f:count, timestamp=1467716881104, value="1"
access column=f:count, timestamp=1467716881104, value="1"
When I keep the file as follows (the double quotes surrounding the count are removed):
"access",1
"about",1
and ingest and then do a scan in the hbase shell, it gives the following output (the double quotes surrounding the count are not there):
about column=f:count, timestamp=1467716881104, value=1
access column=f:count, timestamp=1467716881104, value=1
So, as you can see, there are no double quotes in the count's value.
Q1. Does that mean it is stored as an integer and not as a string? The Cloudera article suggests that a custom MR job needs to be written to ingest non-string values. However, I cannot work out what that means if the above is already ingesting integer values.
Another doubt I have is whether I can escape the column separator when it appears inside a column value. For example, in importtsv we can specify the separator as follows:
-Dimporttsv.separator=,
However, what if I have employee data where the first column is the employee name and the second column is the address? My file will have rows resembling this:
"mahesh","A6,Hyatt Appartment"
That second comma makes importtsv think there are three columns, and it throws BadTsvLineException("Excessive columns").
Thus I tried escaping the comma with a backslash (\) and, just for the sake of curiosity, escaping the backslash with another backslash (that is, \\). So my file had the following lines:
"able","1\"
"z","1\"
"za","1\\1"
When I ran a scan in the hbase shell, it gave the following output:
able column=f:count, timestamp=1467716881104, value="1\x5C"
z column=f:count, timestamp=1467716881104, value="1\x5C"
za column=f:count, timestamp=1467716881104, value="1\x5C\x5C1"
Q2. So it seems that instead of escaping the character that follows the backslash, it encodes the backslash itself as \x5C. Is that the case? Is there no way to escape the column separator while bulk loading data using importtsv?
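As far as I know, importtsv treats the separator as a single literal byte and offers no escaping mechanism, so a common workaround is to rewrite the file with a separator byte that cannot occur in the data before loading. A sketch, assuming gawk is available, that the fields are quoted as in the example above, that '|' never occurs in the values, and using placeholder table, column-family, and file names:
# Re-encode the quoted CSV with '|' as the separator; gawk's FPAT treats a
# double-quoted string, embedded commas and all, as a single field.
gawk -v FPAT='"[^"]*"|[^,]+' -v OFS='|' '{ $1 = $1; print }' \
    employees.csv > employees.psv

# Load with '|' as the separator (employees.psv must be in HDFS).
hbase org.apache.hadoop.hbase.mapreduce.ImportTsv \
    '-Dimporttsv.separator=|' \
    -Dimporttsv.columns=HBASE_ROW_KEY,f:address \
    employees employees.psv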

Advanced concatenation of lines based on a specific number of compared columns in CSV

This is a question based on a previously solved problem.
I have the following type of .csv files (they aren't all sorted, but the structure of the columns is the same):
name1,address1,town1,zip1,email1,web1,,,,category1
name2,address2,town2,zip2,email2,,,,,category2
name3,address3,town3,zip3,email3,,,,,category3_1
name3,address3,town3,zip3,,,,,,category3_2
name3,address3,town3,zip3,,,,,,category3_3
name4,address4,town4,zip4,,,,,,category4_1
name4,address4,town4,zip4,email4,,,,,category4_2
name4,address4,town4,zip4,email4,,,,,category4_3
name4,address4,town4,zip4,,,,,,category4_4
name5,address5,town5,zip5,,,,,,category5_1
name5,address5,town5,zip5,,web5,,,,category5_2
name6,address6,town6,zip6,,,,,,category6
The first 4 columns are always populated; the other columns are not always populated, except for the last one, category.
An empty space between "," delimiters means there is no data for that field on the particular line.
If nameX appears with addressY instead of addressX, it is a different record (not the same line) and should not be concatenated.
I need a script in sed or awk, or maybe bash (but that solution would be a little slower on bigger files [hundreds of MB+]), that takes the first 4 columns (in this case), compares them, and, if they match, merges every category with the ";" delimiter while keeping the structure and as much data as possible in the other columns of the matched lines of the .csv file:
name1,address1,town1,zip1,email1,web1,,,,category1
name2,address2,town2,zip2,email2,,,,,category2
name3,address3,town3,zip3,email3,,,,,category3_1;category3_2;category3_3
name4,address4,town4,zip4,email4,,,,,category4_1;category4_2;category4_3;category4_4
name5,address5,town5,zip5,,web5,,,,category5_1;category5_2
name6,address6,town6,zip6,,,,,,category6
If that is not possible, a solution could be to retain the data from the first line of the duplicated records (the one with categoryX_1). Example:
name1,address1,town1,zip1,email1,web1,,,,category1
name2,address2,town2,zip2,email2,,,,,category2
name3,address3,town3,zip3,email3,,,,,category3_1;category3_2;category3_3
name4,address4,town4,zip4,,,,,,category4_1;category4_2;category4_3;category4_4
name5,address5,town5,zip5,,,,,,category5_1;category5_2
name6,address6,town6,zip6,,,,,,category6
Does the .csv have to be sorted before using the script?
Thank you again!
sed -n 's/.*/²&³/;H
$ { g
:cat
s/\(²\([^,]*,\)\{4\}\)\(\([^,]*,\)\{5\}\)\([^³]*\)³\(.*\)\n\1\(\([^,]*,\)\{5\}\)\([^³]*\)³/\1~\3~ ~\7~\5;\9³\6/
t fields
b clean
:fields
s/~\([^,]*\),\([^~]*~\) ~\1,\([^~]*~\)/\1,~\2 ~\3/
t fields
s/~\([^,]*\),\([^~]*~\) ~\([^,]*,\)\([^~]*~\)/\1\3~\2 ~\4/
t fields
s/~~ ~~//g
b cat
:clean
s/.//;s/[²³]//g
p
}' YourFile
POSIX version (so use --posix with GNU sed), and it works without sorting your file first.
Two recursive loops after loading the full file into the buffer, adding markers for easier manipulation, and a lot of fun with sed group substitution (it just about reaches the maximum number of groups available):
one loop appends the categories line by line (one line after the other, as needed by the next loop on each field) and builds a big temporary sub-field structure (2 groups of fields from the 2 concatenated lines; fields 5 to 9 form 1 group);
the other loop ungroups the sub-fields back into their original places;
finally, the markers and the leading newline are removed.
This assumes the characters ², ³ and ~ never occur in the data, because they are used as markers (you can use other markers and adapt the script to your new markers).
Note:
For performance on files of hundreds of MB, I guess awk will be a lot more efficient; a sketch follows below.
Sorting the data beforehand would certainly also help performance by reducing the amount of data to manipulate after each category loop.
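For reference, a minimal awk sketch of the merge, assuming plain CSV without quoted commas and a constant column count per line, as in the sample (input.csv is a placeholder name):
awk -F',' -v OFS=',' '
{
    key = $1 OFS $2 OFS $3 OFS $4      # the first 4 columns identify a record
    if (!(key in nf)) {                # first sighting: keep the whole line
        order[++n] = key
        nf[key] = NF
        for (i = 1; i <= NF; i++) val[key, i] = $i
    } else {                           # duplicate key: keep the most data
        for (i = 5; i < NF; i++)
            if (val[key, i] == "") val[key, i] = $i
        val[key, nf[key]] = val[key, nf[key]] ";" $NF   # merge categories
    }
}
END {
    for (j = 1; j <= n; j++) {         # print records in first-seen order
        key = order[j]
        line = val[key, 1]
        for (i = 2; i <= nf[key]; i++) line = line OFS val[key, i]
        print line
    }
}' input.csv
No prior sorting is needed, since the records are collected in associative arrays.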
I found that this particular problem is faster to process through a DB...
SQL - GROUP BY to combine/concat a column
DB: MySQL through WAMP

Issue with Comma as a Delimiter in Pig Latin for a Free Text Column

I am loading a file with PigStorage. The file has a column, Newvalue, a free-text column which includes commas. When I specify comma as the delimiter, this causes a problem. I am using the following code.
inpt = load '/home/cd36630/CRM/1monthSample.txt' USING PigStorage(',')
AS (BusCom:chararray,Operation:chararray,OperationDate:chararray,
ISA:chararray,User:chararray,Field:chararray,Oldvalue:chararray,
Newvalue:chararray,RecordId:chararray);
Any help is appreciated.
If the input is in CSV form then you can use CSVLoader to load it. This may fix your issue.
If this doesn't work, then you can load each line into a single chararray and write a UDF to split the total line in a way that respects the commas in Newvalue. E.g.:
register 'myudfs.py' using jython as myudfs ;
A = LOAD '/home/cd36630/CRM/1monthSample.txt' AS (total:chararray) ;
B = FOREACH A GENERATE myudfs.prepare_input(total) ;
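For completeness, a hypothetical sketch of what myudfs.py could look like. It assumes only Newvalue can contain commas; the fixed-head/fixed-tail splitting strategy is my assumption, not part of the original answer:
# myudfs.py -- hypothetical sketch; assumes only Newvalue contains commas.
from pig_util import outputSchema

@outputSchema('t:(BusCom:chararray,Operation:chararray,OperationDate:chararray,'
              'ISA:chararray,User:chararray,Field:chararray,Oldvalue:chararray,'
              'Newvalue:chararray,RecordId:chararray)')
def prepare_input(total):
    parts = total.split(',')
    head = parts[:7]                   # the 7 fixed columns before Newvalue
    record_id = parts[-1]              # RecordId is always the last column
    new_value = ','.join(parts[7:-1])  # everything in between is Newvalue
    return tuple(head + [new_value, record_id])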

Reading files in Pig where the delimiter occurs in the data

I want to read a CSV file using Pig; what should I do? I used LOAD with PigStorage(','), but it fails to read the CSV file properly because it splits the line wherever it encounters a comma (,) in the data. How should I specify the delimiter if I also have commas in the data?
It's generally impossible to distinguish a comma in the data from a comma used as a delimiter.
You will need to escape the commas that are in your data and write a custom load function (for Pig) that can recognize escaped commas.
Take a look here:
http://ofps.oreilly.com/titles/9781449302641/load_and_store_funcs.html
http://pig.apache.org/docs/r0.7.0/udf.html#Load%2FStore+Functions
Have you had a look at the CSVLoader in PiggyBank if you want to read a CSV file? (Of course, the file format needs to be valid.)
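A minimal sketch with PiggyBank's CSVLoader (the jar path, file name, and schema are placeholders):
REGISTER /path/to/piggybank.jar;
-- CSVLoader keeps double-quoted fields intact, so commas inside quotes
-- do not split the field.
A = LOAD 'input.csv'
    USING org.apache.pig.piggybank.storage.CSVLoader()
    AS (name:chararray, address:chararray);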
First make sure you have a valid CSV file. If you don't, try changing the source file with Excel (if the file is small) or another tool and export a new CSV with a delimiter that suits your data (e.g. \t tab, ;, etc.). Even better would be to do another extract with a "good" delimiter.
Your load could then look something like this:
TABLE = LOAD 'input.csv' USING PigStorage(';') AS ( site_id: int,
name: chararray, ... );
Example of your STORE:
STORE TABLE INTO 'clean.csv' USING PigStorage(','); -- use the delimiter that suits you best
