How does the REPLACE function in Pig work? - hadoop

My input file, words.txt, is shown below. None of its records contains a space.
Hi
Hi
How
I am loading this file into Pig
words = LOAD '/user/inputs/words.txt' USING PigStorage() AS (line:chararray);
words_each = FOREACH words GENERATE REPLACE(line,'','|') ;
dump words_each;
I am getting output as
|H|i|
|H|i|
|H|o|w|
But I would like to know exactly how REPLACE treats '', the second argument I pass to it.
There are no spaces in my file, so why am I getting | in my output?

As written, your statement calls REPLACE with '' (an empty string) as the pattern, which contains no whitespace.
To replace spaces, pass ' ' instead.
The two cases are different:
words_each = FOREACH words GENERATE REPLACE(line,'','|') ; -- empty pattern
words_each = FOREACH words GENERATE REPLACE(line,' ','|') ; -- single space
REPLACE treats its second argument as a regular expression, and an empty pattern matches at every position in the string: before each character and at the end. The first statement therefore inserts a pipe symbol (|) at each of those positions, which gives |H|i|. The second statement changes nothing because your file contains no spaces.
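The empty-pattern behaviour can be reproduced outside Pig. The following is a minimal sketch in Python (3.7 or later assumed), where `re.sub` handles an empty pattern the same way for this input. Python's `re` module is only a stand-in here, and the helper name `insert_at_every_position` is made up for illustration.

```python
import re

def insert_at_every_position(text, filler="|"):
    # An empty pattern matches before each character and once at the end,
    # so the filler is inserted at every one of those positions.
    return re.sub("", filler, text)

print(insert_at_every_position("Hi"))   # |H|i|
print(insert_at_every_position("How"))  # |H|o|w|

# Replacing a single space changes nothing when the text has no spaces.
print(re.sub(" ", "|", "Hi"))           # Hi
```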

Related

Word Count with custom delimiter (|) using Pig

I am a newbie to Pig Latin. I want to process the file below and find the most frequently occurring word.
Hadoop|is|an|open|source|Java-based|programming|framework|that|supports|
the|processing|and|storage|of|extremely|large|data|sets|in|a|distributed|computing|environment.
The file contains a | as a delimiter.
There are quite a few word count examples out there. In any case, here's one that handles the '|' delimiter:
lines = LOAD 'input.txt' AS (line:chararray);
newlines = FOREACH lines GENERATE REPLACE(line,'\\|',' ') AS newline;
words = FOREACH newlines GENERATE FLATTEN(TOKENIZE(newline)) as word;
grouped = GROUP words BY word;
w_count = FOREACH grouped GENERATE group, COUNT(words);
DUMP w_count;
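For readers who want to check the expected counts on a small sample before running the Pig job, the same idea can be sketched in Python. This is an illustrative equivalent, not part of the Pig solution; the function name `count_words` and the sample lines are assumptions.

```python
from collections import Counter

def count_words(lines, delimiter="|"):
    """Count words in lines whose words are separated by `delimiter`."""
    counts = Counter()
    for line in lines:
        # A trailing delimiter produces an empty string, which is skipped.
        counts.update(word for word in line.strip().split(delimiter) if word)
    return counts

sample = ["Hadoop|is|an|open|source|", "open|source|is|fun"]
counts = count_words(sample)
for word, n in counts.items():
    print(word, n)
```

`counts.most_common(k)` then returns the k most frequent words, which is the part of the question that the Pig script leaves to a later ORDER and LIMIT step.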

How can I remove part of an image name in MATLAB?

I have a folder, data, that contains a list of images named as follows:
AHTD3A0001_Para1.tif
AHTD3A0002_Para1.tif
AHTD3A0003_Para1.tif
.
.
AHTD3A1012_Para1
I want to delete the first part of each image name (AHTD3A) so that the images are renamed like this:
0001_Para1.tif
0002_Para1.tif
0003_Para1.tif
.
.
1012_Para1.tif
Any suggestions for MATLAB code would be appreciated. Thanks in advance.
You can simply use strrep to replace part of a string.
oldnames = {'AHTD3A0001_Para1.tif' 'AHTD3A0002_Para1.tif'};
newnames = strrep(oldnames, 'AHTD3A', '');
% '0001_Para1.tif' '0002_Para1.tif'
If the filename prefix isn't always the same and you simply want the four digits followed by _Para1.tif, you could instead use a regular expression with regexprep:
newnames = regexprep(oldnames, '.*(?=\d{4}_Para1\.tif)', '');
Or you can match the part you want to keep using regexp instead:
newnames = regexp(oldnames, '\d{4}_.*', 'match', 'once')
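To see what the lookahead pattern does independently of MATLAB, here is a small sketch that applies the same regular expression with Python's `re` module. The helper name `strip_prefix` is hypothetical, and the sketch only computes the new names; it does not rename any files.

```python
import re

# Everything before "four digits + _Para1.tif" is matched and removed;
# the lookahead (?=...) keeps the digits and suffix out of the match.
PREFIX = re.compile(r".*(?=\d{4}_Para1\.tif)")

def strip_prefix(names):
    return [PREFIX.sub("", name) for name in names]

old_names = ["AHTD3A0001_Para1.tif", "AHTD3A1012_Para1.tif"]
print(strip_prefix(old_names))
```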

Multiple sequence alignment. Convert multi-line format to single-line format?

I have a multiple sequence alignment file in which the lines from the different sequences are interspersed, as in the format output by Clustal and other popular multiple sequence alignment tools. It looks like this:
TGFb3_human_used_for_docking ALDTNYCFRNLEENCCVRPLYIDFRQDLGWKWVHEPKGYYANFCSGPCPY
tr|B3KVH9|B3KVH9_HUMAN ALDTNYCFRNLEENCCVRPLYIDFRQDLGWKWVHEPKGYYANFCSGPCPY
tr|G3UBH9|G3UBH9_LOXAF ALDTNYCFRNLEENCCVRPLYIDFRQDLGWKWVHEPKGYYANFCSGPCPY
tr|G3WTJ4|G3WTJ4_SARHA ALDTNYCFRNLEENCCVRPLYIDFRQDLGWKWVHEPKGYYANFCSGPCPY
TGFb3_human_used_for_docking LRSADTTHST-
tr|B3KVH9|B3KVH9_HUMAN LRSADTTHST-
tr|G3UBH9|G3UBH9_LOXAF LRSTDTTHST-
tr|G3WTJ4|G3WTJ4_SARHA LRSADTTHST-
Each line begins with a sequence identifier, and then a sequence of characters (in this case describing the amino acid sequence of a protein). Each sequence is split into several lines, so you see that the first sequence (with ID TGFb3_human_used_for_docking) has two lines. I want to convert this to a format in which each sequence has a single line, like this:
TGFb3_human_used_for_docking ALDTNYCFRNLEENCCVRPLYIDFRQDLGWKWVHEPKGYYANFCSGPCPYLRSADTTHST-
tr|B3KVH9|B3KVH9_HUMAN ALDTNYCFRNLEENCCVRPLYIDFRQDLGWKWVHEPKGYYANFCSGPCPYLRSADTTHST-
tr|G3UBH9|G3UBH9_LOXAF ALDTNYCFRNLEENCCVRPLYIDFRQDLGWKWVHEPKGYYANFCSGPCPYLRSTDTTHST-
tr|G3WTJ4|G3WTJ4_SARHA ALDTNYCFRNLEENCCVRPLYIDFRQDLGWKWVHEPKGYYANFCSGPCPYLRSADTTHST-
(In this particular example the sequences are almost identical, but in general they aren't!)
How can I convert from multi-line multiple sequence alignment format to single-line?
It looks like you need to write a script of some sort to achieve this. Here's a quick example I wrote in Python. It won't line the whitespace up prettily like in your example (if you care about that, you'll have to mess around with formatting), but it gets the rest of the job done:
# Create a dictionary to accumulate full sequences
full_sequences = {}
# Loop through the original file (replace test.txt with your file name)
# and append each line's fragment to the appropriate dictionary entry
with open("test.txt") as infile:
    for line in infile:
        fields = line.split()
        # Skip blank lines and lines without both an ID and a sequence
        if len(fields) < 2:
            continue
        full_sequences[fields[0]] = full_sequences.get(fields[0], "") + fields[1]
# Now loop through the dictionary and write each entry as a single line.
# Note: this overwrites the input file; use a different name to keep the original.
with open("test.txt", "w") as outfile:
    for seq in full_sequences:
        outfile.write(seq + "\t\t" + full_sequences[seq] + "\n")
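If the aligned layout from the question matters, one option is to pad every identifier to the width of the longest one. The following is a sketch under that assumption; the function names `merge_alignment_lines` and `format_aligned` and the sample data are made up for illustration, and it works on a list of lines rather than on a file.

```python
def merge_alignment_lines(lines):
    """Concatenate interleaved alignment fragments per sequence ID."""
    sequences = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # blank or malformed line
        sequences[fields[0]] = sequences.get(fields[0], "") + fields[1]
    return sequences

def format_aligned(sequences):
    """Pad each ID so that all sequences start in the same column."""
    width = max(len(seq_id) for seq_id in sequences)
    return [f"{seq_id.ljust(width)} {seq}" for seq_id, seq in sequences.items()]

sample = [
    "seq_a ALDTNY",
    "sequence_b ALDTNY",
    "seq_a LRS-",
    "sequence_b LRT-",
]
for row in format_aligned(merge_alignment_lines(sample)):
    print(row)
```

Dictionaries keep insertion order in Python 3.7 and later, so the sequences come out in the order they first appear in the input.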

I want to tokenize a string using the following delimiters in Pig: dash, comma, hash, space and colon

How can I do this using STRSPLIT, TOKENIZE, or any other method?
You can use STRSPLIT with a regex to solve this problem. I am not sure whether your input has a single delimiter or a combination of them (dash, comma, colon, space and hash), but the solution below works for both.
Input:
a#b c-d,e
f e,g#h:i
1,2,3,4,5
l#y#z#h#n
A B C D E
PigScript:
A = LOAD 'input' AS (line:chararray);
B = FOREACH A GENERATE FLATTEN(STRSPLIT(line,'[-,:\\s#]',5));
DUMP B;
Output:
(a,b,c,d,e)
(f,e,g,h,i)
(1,2,3,4,5)
(l,y,z,h,n)
(A,B,C,D,E)
If your input has only a single delimiter, say '#' or any other delimiter you mentioned, then try the script below (the third argument, 5, is the maximum number of fields to return, here the number of columns in your input):
B = FOREACH A GENERATE FLATTEN(STRSPLIT(line,'#',5));
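With multiple delimiters, to add a new one such as '$', just add it inside the character class of the regex.
Note that '$' is a special character in a regex, but inside a character class it is literal and needs no escaping. Keep the dash first (or last) in the class so that it is not read as a range: '[-,:\\s#$]'. Writing '[\\$-,:\\s#]' instead would define a range from '$' to ',' and also split on characters such as '+' and '*'.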
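The character-class approach can be tried on the sample rows outside Pig. Below is a minimal sketch using Python's `re.split`, which is only a stand-in for the regex engine Pig uses; the function name `tokenize` is hypothetical. It also illustrates one detail worth checking when extending the class: a dash between two characters is read as a range, so its position inside the brackets matters.

```python
import re

# Dash is placed first so it is a literal dash rather than a range operator.
DELIMITERS = re.compile(r"[-,:\s#]")

def tokenize(line):
    return DELIMITERS.split(line)

print(tokenize("a#b c-d,e"))  # ['a', 'b', 'c', 'd', 'e']
print(tokenize("f e,g#h:i"))  # ['f', 'e', 'g', 'h', 'i']

# Pitfall: here "$-," is a range from '$' to ',', which includes '+'.
print(re.split(r"[\$-,]", "x+y"))  # ['x', 'y']
# With the dash first, '+' is no longer treated as a delimiter.
print(re.split(r"[-,$]", "x+y"))   # ['x+y']
```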

PigStorage and Variable Schemas from Input

I have a comma separated text file like
1,abc,1,
2,def,1,2,3,4
3,ghi,1,2
4,jkl,1,5,6,7,8,9
5,mno
The text file will always have the first two values, but will have 0 or more values after the second comma.
How can I load this data and give an alias to the first two values?
I can load it and not give an alias to the first two values via:
A = LOAD 'data.txt' USING PigStorage(',');
From here, I can do a B = FOREACH A GENERATE $0 AS foo:chararray, $1 AS bar:chararray; but it would discard the rest. It would be nice to do a wildcard and put the rest in a tuple.
Is there any way to do this?
Try this ($2.. projects every field from the third one onward):
B = foreach A generate $0 as foo:chararray, $1 as bar:chararray, $2..;
Reference:
Drop single column in Pig
I am not sure exactly what you need, but you could try this:
A = LOAD 'data.txt' USING PigStorage(',') AS (foo:chararray, bar:chararray);
This ignores any values after the second comma in your file.
Alternatively, you can create a map for the remaining fields.
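The shape the question asks for (two named fields plus everything else grouped together) can be illustrated on the sample rows with a short Python sketch. This is not Pig code, and the function name `split_record` is an assumption; it only shows what the result looks like for rows with zero or more trailing values.

```python
def split_record(line, delimiter=","):
    """Return (foo, bar, rest), where rest is a tuple of the remaining fields."""
    fields = line.rstrip("\n").split(delimiter)
    foo, bar = fields[0], fields[1]
    rest = tuple(fields[2:])  # empty tuple when only two fields are present
    return foo, bar, rest

for row in ["1,abc,1,", "2,def,1,2,3,4", "5,mno"]:
    print(split_record(row))
```

In this sketch the trailing comma in the first row produces an empty string as the last element of `rest`; how Pig represents that trailing empty field is not covered here.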
