Pig 0.12.0 - extracting last two characters from a string - hadoop

I am using CDH 5.5, Pig 0.12.0. I have a chararray like this: 25 - 45 and I want to extract 25 and 45 out of this String.
So, I did this:
minValue = (int)SUBSTRING(value,0,2);
maxValue = ((int)SUBSTRING(value,6,2);
I am able to extract minValue but unable to extract the maxValue i.e. last two characters of the given String.
Even I tried but this one is also not working.:
maxValue = ((int)SUBSTRING(value,-2,2);
Please let me know how to make this work.

If delimeter is colon ( - ) always, then we can split and flatten the chararray to extract min and max value.
A = LOAD 'input.csv' USING PigStorage(',') AS (min_max:chararray);
B = FOREACH A GENERATE FLATTEN(STRSPLIT(min_max,' - ',0)) AS (min_val:chararray, max_val:chararray);
DUMP B;
Input :
25 - 45
35 - 65
45 - 85
Output :
(25,45)
(35,65)
(45,85)

You have to use the index of the specific character in the SUBSTRING function.
Here is what you need.
maxValue = (int)SUBSTRING(value,5,7);

Related

How to make this in laravel QueryBuilder

I want to take the max of the progress column that is ordered with a distinct column name.
For example;
progress - journey_id
0 - 90
12 - 90
89 - 90
7 - 90
the code I tried to make but I couldn't do more
$impressions = DB::table('journey_content_impression')
->where('user_id', 11)
->where('journey_id',12)
->select('journey_item_id','progress')->distinct()->get();
Finally, It should give us one row with the max number.
Also, other columns of row should be available like this,
$impressions->first()->user_id; //
Try this:
$impressions = DB::table('journey_content_impression')
->where('user_id',{id})
->where('journey_id',{j_id})
->groupBy('journey_item_id')
->get(['journey_item_id', DB::raw('MAX(progress) as progress')]);
Maybe it helps you try this
$impressions = DB::table('journey_content_impression')
->where([['user_id', 11],['journey_id',12]])
->select('journey_item_id','progress')->distinct()->get();

removing bad data from a data file using pig

I have a data file like this
1943 49 1
1975 91 L
1903 56 3
1909 52 3
1953 96 3
1912 82
1976 66 3
1913 35
1990 45 1
1927 92 A
1912 2
1924 22
1971 2
1959 94 E
now using pig script I want to remove the bad data like removing those rows which have characters and empty fields
I tried this way
records = load '/user/a106524609/test.txt' using PigStorage(' ') as
(year:chararray, temperature:int, quality:int);
rec1 = filter records by temperature != 'null' and (quality != 'null ')
Load it as lines
A = load 'data.txt' using PigStorage('\n') as (line:chararray);
Split on all whitespaces
B = FOREACH A GENERATE FLATTEN(STRSPLIT(line, '\\s+')) as (year:int,temp:int,quality:chararray);
Filter by valid strings
C = FILTER B BY quality IN ('0','1','2','3','4','5','6','7','8','9');
(Optionally) Cast to an int
D = FOREACH C GENERATE year,temp,(int)quality;
In Spark, I would start with a regex match of the expected format.
val cleanRows = sc.textFile("data.txt")
.filter(line => line.matches("(?:\\d+\\s+){2}\\d+"))

How to split a string by amount of characters in a batch file?

I have about 6GB of various text files, the files have many lines but each record is missing its commas so all the data is in 1 record. I want to create a batch file where I can add commas at the appropriate places in each "record". I'm hoping to add commas so I can then import this into a database.
For example the file would be structured like this.
IDnameADDRESSphoneEMAILetc
IDnameADDRESSphoneEMAILetc
IDnameADDRESSphoneEMAILetc
Each field has a unique length which I know, and it's static between all files.
For example
ID - 10 characters
NAME - 40 characters
ADDRESS - 30 characters
etc
This will need to be run on an ongoing basis as new files come in so I'm hoping for something I can give a non technical person they can just run.
Any quick way to do this in a bat file?
Using your example above. Note we count the characters starting from 0, then tell the set to use letters starting at a certain count, counting the word length from there. See bottom for layout.
#echo off
setlocal enabledelayedexpansion
for /F "tokens=* delims=" %%a in (filename.txt) do (
set str=%%a
set id=!str:~0,2!
set na=!str:~2,4!
set add=!str:~6,7!
set ph=!str:~13,5!
set em=!str:~18,5!
set etc=!str:~23,3!
echo !id!,!na!,!add!,!ph!,!em!,!etc!
)
Characters assigned in a string as:
I D n a m e A D D R E S S p h o n e E M A I L e t c
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
ID starts at Character 0 and is 2 characters, including itself :~0,2
name starts at character 2 and is 4 characters long :~2,4
etc..
For many files just add another loop as a main loop or give a list of files.
Based on your provided example, here is a quick powershell command, (despite no tag):
(GC 'Report.txt' | Select -First 1).Insert(10,',').Insert(51,',').Insert(82,',') > 'Fixed.txt'
It takes the first line of Report.txt…
After 10 characters insert ,(0 + 10 = 10) + 1
After another 40 characters insert ,(11 + 40 = 51) + 1
After another 30 characters insert ,(52 + 30 = 82) + 1
etc.
…then outputs the line complete with insertions to Fixed.txt
Just continue the .Insert(<number>,',') sequence for your other fixed width column sizes and ensure you've changed the filenames to suit your circumstances.
Edit
The following as an update to your comment and subsequent edit should work for all lines in the file.
GC 'Report.txt' | % {($_).Insert(10,',').Insert(51,',').Insert(82,',')} | Out-File 'Fixed.txt'

Spotfire Custom Expression : Calculate (Num/Den) Percentages

I am trying to plot Num/Den type percentages using OVER. But my thoughts does not appear to translate into spotfire custom expression syntax.
Sample Input:
RecordID CustomerID DOS Age Gender Marker
9621854 854693 09/22/15 37 M D
9732721 676557 09/18/15 65 M D
9732700 676557 11/18/15 65 M N
9777003 5514882 11/25/15 53 M D
9853242 1753256 09/30/15 62 F D
9826842 1260021 09/30/15 61 M D
9897642 3375185 09/10/15 74 M N
9949185 9076035 10/02/15 52 M D
10088610 3512390 09/16/15 33 M D
10120650 41598 10/11/15 67 F N
9949185 9076035 10/02/15 52 M D
10088610 3512390 09/16/15 33 M D
10120650 41598 09/11/15 67 F N
Expected Out:
Row Labels D Cumulative_D N Cumulative_N Percentage
Sep 6 6 2 2 33.33%
Oct 2 8 1 3 37.50%
Nov 1 9 1 4 44.44%
My counts are working.
I want to take the same Cumulative_N & Cumulative_D count and plot Percentage over [Axis.X] as a line chart.
Here's what I am using:
UniqueCount(If([Marker]="N",[CustomerID])) / UniqueCount(If([Marker]="D",[CustomerID])) THEN SUM([Value]) OVER (AllPrevious([Axis.X])) as [CumulativePercent]
I understand SUM([Value]) is not the way to go. But I don't know what to use instead.
Also tried the one below as well, but did not:
UniqueCount(If([Marker]="N",[CustomerID])) OVER (AllPrevious([Axis.X])) / UniqueCount(If([Marker]="D",[CustomerID])) OVER (AllPrevious([Axis.X])) as [CumulativePercent]
Can you have a look ?
I found a way to make it work, but it may not fit your overall solution. I should mention i used Count() versus UniqueCount() so that the results would mirror your desired output.
Add a transformation to your current data table
Insert a calculated column Month([DOS]) as [TheMonth]
Set Row Identifers = [TheMonth]
Set value columns and aggregation methods to Count([CustomerID])
Set column titles to [Marker]
Leave the column name pattern as %M(%V) for %C
That will give you a new data table. Then, you can do your cumulative functions. I did them in a cross table to replicate your expected results. Insert a new cross table and set the values to:
Sum([D]), Sum([N]), Sum([D]) OVER (AllPrevious([Axis.Rows])) as [Cumulative_D],
Sum([N]) OVER (AllPrevious([Axis.Rows])) as [Cumulative_N],
Sum([N]) OVER (AllPrevious([Axis.Rows])) / Sum([D]) OVER (AllPrevious([Axis.Rows])) as [Percentage]
That should do it.
I don't know if Spotfire released a fix or based on everyone's inputs I could get the syntax right. But here is the solution that worked for me.
For Columns D & N,
COUNT([CustomerID])
For columns Cumulative_D & Cumulative_N,
Count([CustomerID]) OVER (AllPrevious([Axis.X])) where [Axis.X] is DOS(Month), Marker
For column Percentage,
Count(If([Marker]="N",[CustomerID])) OVER (AllPrevious([Axis.X])) / Count(If([Marker]="D",[CustomerID])) OVER (AllPrevious([Axis.X]))
where [Axis.X] is DOS(Month)

Pig Scripting - Cast STRING to INT

Beginner in Pig, Need help
For all NON - AlphaNumeric, Cast the STRING TO INT
- To be handled without passing each field name separately.
Sample data -
00013425731998101620140402300032736901 00000000AAA001200X111685V00000000
00283335542006120920131010300030003105 00000000AAA001200X117407 00000000
00000000331998101620140402300033128107 00000000AAA001200X111685 00000000
00003902331999090620140402300032545208 00000000AAA001200X111685 00000000
Its a fixedwidth file, mapping details as follow -
orderNumber 1 9
origin 10 10
Startdate 11 18
ModDate 19 26
Identifier 27 36
Code 37 38
CodeType 39 40
Number 41 48
Num 49 114
Either use substr to extract the parts and then cast them or use a regexp. For example for the first two fields:
input = load ... as (line:chararray);
a = foreach input generate SUBSTRING(line, 0, 9) as orderNumber:long, SUBSTRING(line, 9, 10) as origin:chararray;
This way you should be able to convert each part of the input line into the desired components.
Alternatively you could write a UDF that takes a string as input and does the splitting and returns a bag or tuple.

Resources