Split characters inside Pig field

I have a text input with '|' separator as
0.0000|25000| |BM|BM901002500109999998|SZ
which I split using PigStorage
A = LOAD '/user/hue/data.txt' using PigStorage('|');
Now I need to split the field BM901002500109999998 into separate fields based on character position, e.g. positions 0-2 = BM as Field1, and likewise for the rest.
After this step I should get BM, 90100, 2500, 10, 9999998.
Is there any way to do this in a Pig script? Otherwise I plan to write a UDF that inserts a separator at the required positions.
Thanks.

You are looking for SUBSTRING:
A = LOAD '/user/hue/data.txt' using PigStorage('|');
B = FOREACH A GENERATE SUBSTRING($4,0,2) AS FIELD_1, SUBSTRING($4,2,7) AS FIELD_2, SUBSTRING($4,7,11) AS FIELD_3, SUBSTRING($4,11,13) AS FIELD_4, SUBSTRING($4,13,20) AS FIELD_5;
The output would be:
dump B;
(BM,90100,2500,10,9999998)
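SUBSTRING takes a zero-based start index and an exclusive stop index; see the built-in functions section of the Pig documentation for more on this function.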
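To sanity-check the offsets before running the job, the same slices can be reproduced outside Pig. The sketch below is plain Python, not Pig Latin. It relies only on SUBSTRING using a zero-based start and an exclusive stop, which is the same convention as Python slicing. The names `OFFSETS` and `split_fixed_width` are illustrative and do not come from the answer.

```python
# Zero-based (start, stop) pairs copied from the SUBSTRING calls above;
# stop is exclusive, as in Python slicing.
OFFSETS = [(0, 2), (2, 7), (7, 11), (11, 13), (13, 20)]


def split_fixed_width(value, offsets=OFFSETS):
    """Cut a string into fields at fixed character positions."""
    return tuple(value[start:stop] for start, stop in offsets)


fields = split_fixed_width("BM901002500109999998")
print(fields)  # ('BM', '90100', '2500', '10', '9999998')
```

If a field width changes, only the offset pairs need updating, both here and in the FOREACH statement.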

I think it would be much more efficient to use the built-in function REGEX_EXTRACT_ALL.
You can get an idea of how to use it from:
http://pig.apache.org/docs/r0.8.1/piglatin_ref2.html#REGEX_EXTRACT_ALL
See also the related question "STRSPLIT and REGEX_EXTRACT_ALL in PigLatin".
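This answer does not give a pattern, so here is one possible fixed-width pattern for the field in the question, checked in Python rather than Pig. It is only a sketch: the pattern is an assumption about what the answer intends, and `re.fullmatch` is used on the assumption that REGEX_EXTRACT_ALL has to match the whole string.

```python
import re

# Hypothetical pattern: five capture groups of widths 2, 5, 4, 2 and 7.
PATTERN = re.compile(r"(.{2})(.{5})(.{4})(.{2})(.{7})")


def extract_all(value):
    """Return every capture group, or None when the whole string does not match."""
    match = PATTERN.fullmatch(value)
    return match.groups() if match else None


print(extract_all("BM901002500109999998"))  # ('BM', '90100', '2500', '10', '9999998')
print(extract_all("BM9010"))                # None: too short for the pattern
```

In Pig, the same pattern string would go in as the second argument of REGEX_EXTRACT_ALL, with the resulting tuple passed through FLATTEN to get separate fields.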

Related

Is there a way to turn unix commands as map reduce activities?

I'm trying to get UNIX outputs in hive query.
For example the following query doesn't work:
select transform ('')
using 'pwd'
as syspath
But this query works:
select transform ('')
using 'hive -e "select 10 as col1"'
as col1
How do I run UNIX commands or bash scripts as a map reduce job so that their output is available in Hive?
Thanks in advance!

count the number of characters from a file
Why would you ever use Hive for that? Spark is so much more flexible. Something like this, assuming spark is a SparkSession:
val charCount = spark.sparkContext.textFile("path/to/file.txt")
  .flatMap(line => line.toList)       // one element per character
  .map(char => (char, 1L))            // this is literally just wordcount, now
  .reduceByKey(_ + _)
  .map { case (_, count) => count }   // keep only the per-character counts
  .sum()                              // total over all characters, returned as a Double
println(charCount)

Splitting Pig tuple

I want to split the following tuple into two tuples using pig script.
(key=bb7bde5661923b947ce59958773e85c5\,\/css\/bootstrap.min.cssHTTP\/1.1\,\/con-us.php,\/con-us.phpHTTP\/1.1\)
I want the output as follows:
(key=bb7bde5661923b947ce59958773e85c5\) (\/css\/bootstrap.min.cssHTTP\/1.1\,\/con-us.php,\/con-us.phpHTTP\/1.1\)

Yes, you can solve this with REGEX_EXTRACT_ALL and TOTUPLE. First split the string into two parts: the first column is everything before the first comma, and the second column is the rest of the string. Then convert each column to a tuple and store the result.
input
key=bb7bde5661923b947ce59958773e85c5\,\/css\/bootstrap.min.cssHTTP\/1.1\,\/con-us.php,\/con-us.phpHTTP\/1.1\
PigScript:
A = LOAD 'input' AS (line:chararray);
B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT_ALL(line,'^([^,]+),(.*)$')) AS (col1,col2);
C = FOREACH B GENERATE TOTUPLE(col1),TOTUPLE(col2);
STORE C INTO 'output';
Output (stored in the output/part* files):
(key=bb7bde5661923b947ce59958773e85c5\) (\/css\/bootstrap.min.cssHTTP\/1.1\,\/con-us.php,\/con-us.phpHTTP\/1.1\)
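The regex can be tried on its own before running the script. Below is a small Python check (not Pig) of the same pattern. The sample line is a shortened, made-up value in the same shape as the real input, and `split_at_first_comma` is an illustrative name.

```python
import re

# Same pattern as in the Pig script: everything before the first comma,
# then everything after it.
FIRST_COMMA = re.compile(r"^([^,]+),(.*)$")


def split_at_first_comma(line):
    """Return (head, tail), or None if nothing precedes a comma in the line."""
    match = FIRST_COMMA.match(line)
    return match.groups() if match else None


# Shortened, made-up sample shaped like the real input.
print(split_at_first_comma("key=abc,/css/a.css,/b.php"))  # ('key=abc', '/css/a.css,/b.php')
print(split_at_first_comma("no-comma-here"))              # None
```

Because `[^,]+` cannot cross a comma, the split always happens at the first one, however many commas follow.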

Pig Script STRSPLIT

My column has a first name and a last name separated by a space. I want to use a Pig function to split it into 2 different columns. I thought of the STRSPLIT function, but I don't know how to use it.
Could anyone help me with this simple question?

You can try something like the sample code below. It does the following:
1. Reads each line as a single column
2. Applies the STRSPLIT function with whitespace as the delimiter
3. Stores the first name and last name in two separate columns
input.txt
Pearson Charles
James Michael
Smith Linda
PigScript:
A = LOAD 'input.txt' AS line;
B = FOREACH A GENERATE FLATTEN(STRSPLIT(line,'\\s+',2)) AS (firstname:chararray,lastname:chararray);
C = FOREACH B GENERATE firstname,lastname;
DUMP C;
Output:
(Pearson,Charles)
(James,Michael)
(Smith,Linda)
For more information, see:
http://pig.apache.org/docs/r0.13.0/func.html#strsplit
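The third argument to STRSPLIT caps the number of pieces, which matters as soon as a name contains more than one space. A rough Python equivalent (not Pig) is sketched below. It assumes the limit behaves like Java's String.split limit, and the row with a middle name is a made-up example. Note that Python's `re.split` counts splits rather than pieces, so a limit of 2 pieces corresponds to `maxsplit=1`.

```python
import re


def split_name(line):
    """Split on the first run of whitespace into at most two pieces."""
    # A Pig limit of 2 pieces is one split, hence maxsplit=1.
    return tuple(re.split(r"\s+", line, maxsplit=1))


print(split_name("Pearson Charles"))  # ('Pearson', 'Charles')
print(split_name("Smith Linda Ann"))  # ('Smith', 'Linda Ann'), hypothetical row
```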

PigStorage and Variable Schemas from Input

I have a comma separated text file like
1,abc,1,
2,def,1,2,3,4
3,ghi,1,2
4,jkl,1,5,6,7,8,9
5,mno
The text file will always have the first two values, but will have 0 or more values after the second comma.
How can I load this data and give an alias to the first two values?
I can load it and not give an alias to the first two values via:
A = LOAD 'data.txt' USING PigStorage(',');
From here, I can do a B = FOREACH A GENERATE $0 AS foo:chararray, $1 AS bar:chararray; but it would discard the rest. It would be nice to do a wildcard and put the rest in a tuple.
Is there any way to do this?

Try this:
B = foreach A generate $0 as foo:chararray, $1 as bar:chararray, $2..;
Reference: the related question "Drop single column in Pig".
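For comparison, the shape the question asks for (two named values plus "the rest" grouped together) looks like this in Python. This is only an illustration of the idea, not Pig. `parse_row` is a made-up name, and every row is assumed to have at least two values, as the question states.

```python
def parse_row(line):
    """Name the first two comma-separated values and keep the rest together."""
    foo, bar, *rest = line.split(",")
    return foo, bar, tuple(rest)


print(parse_row("2,def,1,2,3,4"))  # ('2', 'def', ('1', '2', '3', '4'))
print(parse_row("5,mno"))          # ('5', 'mno', ())
```

The `$2..` range in the Pig statement keeps the remaining values as separate trailing fields rather than nesting them in a tuple.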

I am not sure about what you need.
Try this one
A = LOAD 'data.txt' USING PigStorage(',') AS (foo:chararray, bar:chararray);
This will ignore the other values after the second comma in your file.

Or you can create a map for the remaining fields.

Expecting QUOTED STRING in pig script

I have written a script to select from vsql:
LOAD 'sql://{select * from sandesh.insights_voice_day
WHERE Observation_date BETWEEN '2011-11-22' AND '2011-11-23' AND
Type='total'
ORDER BY Observation_date}'
It fails with the exception "Expecting QUOTEDSTRING". What is the problem?

After LOAD, Pig expects a single-quoted string naming the file you are loading. Pig is not SQL, so you have to do something like first dumping your query results into a file and then:
-- declare the remaining columns of your file in the schema as needed
A = LOAD 'your_file' AS (observation_date:chararray, Type:chararray);
-- >= and <= keep both end dates, like BETWEEN in the original query
B = FILTER A BY observation_date >= '2011-11-22' AND observation_date <= '2011-11-23' AND
Type == 'total';
C = ORDER B BY observation_date;
DUMP C;
Note that this compares and orders the dates as strings. So depending on the version of Pig you're using, you may need to convert the timestamps with the appropriate function. Something like:
http://pig.apache.org/docs/r0.8.1/api/org/apache/pig/piggybank/evaluation/datetime/convert/CustomFormatToISO.html
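Whether string ordering is actually a problem depends on the date format. Here is a rough check in Python (not Pig), with made-up sample values: zero-padded `yyyy-MM-dd` text, as used in the query, sorts the same way as strings and as dates, while a format such as `dd/MM/yyyy` does not.

```python
from datetime import datetime

# ISO-style dates: string order and date order agree.
iso = ["2011-11-23", "2011-02-01", "2011-11-22"]
by_string = sorted(iso)
by_date = sorted(iso, key=lambda s: datetime.strptime(s, "%Y-%m-%d"))

# Day-first dates: string order compares the day before the year.
dmy = ["23/11/2011", "01/12/2011", "22/02/2012"]
dmy_by_string = sorted(dmy)
dmy_by_date = sorted(dmy, key=lambda s: datetime.strptime(s, "%d/%m/%Y"))

print(by_string == by_date)          # True
print(dmy_by_string == dmy_by_date)  # False
```

So a conversion is mainly needed when the stored format is not already year-first and zero-padded.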

The problem seems to be the repeated use of single quotes inside the single-quoted string. The following, written on a single line with double quotes inside the query, seems to compile (pig -c test.pig):
A = LOAD 'sql://{select * from sandesh.insights_voice_day WHERE Observation_date BETWEEN "2011-11-22" AND "2011-11-23" AND Type="total" ORDER BY Observation_date}';
