How to convert target values with pig? - hadoop

I have some data whose target column takes 4 distinct values, and I want three of those values to be merged into a single one using Pig Latin.
Input:
ID    | Target
------|-------
test1 | 1
test2 | 1
test3 | 2
test4 | 2
test5 | 3
test6 | 4
test7 | 2

Output:
ID    | Target
------|-------
test1 | 1
test2 | 1
test3 | 2
test4 | 2
test5 | 2
test6 | 2
test7 | 2
Does someone know the best way to do it?

Use the bincond operator to check whether the target value is greater than 1 and, if so, replace it with the value you want, in this case 2.
A = LOAD 'data.txt' USING PigStorage('\t') AS (Id:chararray, target:int);
B = FOREACH A GENERATE Id, (target > 1 ? 2 : target) AS target;
DUMP B;
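Against the sample input above, the DUMP should print tuples along these lines:
(test1,1)
(test2,1)
(test3,2)
(test4,2)
(test5,2)
(test6,2)
(test7,2)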

Related

Vertically divide an array so we get the minimum number of splits

I am thinking about the following problem.
I can have an array of strings like
Col1 Col2 Col3 Col4
aa aa aa aa
aaa aaa aaaaa aaa
aaaa aaaaaaa aa a
...........................
Actually it is a CSV file, and I need a way to divide it vertically into one or more files. The condition for splitting is that no file may contain a row whose total length exceeds some byte limit. For simplicity we can rewrite that array with the lengths:
Col1 Col2 Col3 Col4
2 2 2 2
3 3 5 3
4 7 2 1
...........................
And let's say the limit is 10, i.e. if a row's sum is > 9 we should split. So if we split into 2 files [Col1, Col2, Col3] and [Col4], this will not satisfy the condition, because the first file will contain 3 + 3 + 5 > 9 in the second row and 4 + 7 + 2 > 9 in the third row. If we split into [Col1, Col2] and [Col3, Col4], this will not satisfy the condition either, because the first file will contain 4 + 7 > 9 in the third row. So we split into 3 files: [Col1], [Col2, Col3] and [Col4]. Now every file is correct and looks like:
File1 | File2 | File3
------------------------------
Col1 | Col2 Col3 | Col4
2 | 2 2 | 2
3 | 3 5 | 3
4 | 7 2 | 1
...............................
So it should split from left to right, giving as many columns as possible to the leftmost file. The problem is that this file can be huge and I don't want to read it into memory, so I read the initial file line by line, and somehow I should determine a set of indexes at which to split, if that is possible at all. I hope I described the problem well enough for you to understand it.
Generally awk is quite good at handling large CSV files.
You could try something like this to retrieve the maximum length of each column, and then decide how to split.
Let's say the file.txt contains
Col1;Col2;Col3;Col4
aa;aa;aa;aa
aaa;aaa;aaaaa;aaa
aaaa;aaaaaaa;aa;a
(Assuming Windows-style quoting.) Running the following:
> awk -F";" "NR>1{for (i=1; i<=NF; i++) max[i]=(length($i)>max[i]?length($i):max[i])} END {for (i=1; i<=NF; i++) printf \"%d%s\", max[i], (i==NF?RS:FS)}" file.txt
Will output:
4;7;5;3
Could you try this on your real data set?
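Once you have the per-column maxima, one conservative way to pick the split indexes without loading the file into memory is a greedy left-to-right pass over those maxima: start a new file whenever adding the next column's maximum would push the running sum over the limit. Since the sum of the maxima bounds every row's sum, the result always satisfies the condition, though it may produce more files than strictly necessary (a group of columns can fit row by row even when its maxima do not). A minimal sketch (Unix-style quoting for readability), assuming the maxima line from above is saved in a hypothetical maxima.txt:
awk -F";" -v limit=9 '{
    sum = 0; start = 1
    for (i = 1; i <= NF; i++) {
        # if the next column cannot fit, close the current group
        # (i > start keeps an oversized single column in a group of its own)
        if (i > start && sum + $i > limit) {
            printf "File: Col%d-Col%d\n", start, i - 1
            start = i; sum = 0
        }
        sum += $i
    }
    # flush the last group
    printf "File: Col%d-Col%d\n", start, NF
}' maxima.txt
For the maxima 4;7;5;3 and limit 9 this prints [Col1], [Col2] and [Col3, Col4]; note it is more cautious than the ideal [Col1], [Col2, Col3], [Col4] from the question, because 7 + 5 exceeds the limit even though no single row of Col2 and Col3 does.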

Merging two files with columns in bash [closed]

I have a file1.txt with the following content:
test4 30
test6 29
test3 17
test2 12
test5 5
This file is ordered by the second column; I sorted it with sort -nr -k 2.
I have also file2.txt with the content of:
test2 A
test3 B
test4 C
test5 D
test6 E
What I want as the result (result.txt) is:
test4 C 30
test6 E 29
test3 B 17
test2 A 12
test5 D 5
Using awk:
awk 'FNR == NR { a[$1] = $2; next } { print $1, a[$1], $2 }' file2 file1
Output:
test4 C 30
test6 E 29
test3 B 17
test2 A 12
test5 D 5
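For readability, here is the same program spelled out (FNR == NR is true only while awk is still reading the first file named on the command line):
awk '
    FNR == NR {        # first file (file2): remember the letter for each id
        a[$1] = $2
        next
    }
    {                  # second file (file1): print id, remembered letter, number
        print $1, a[$1], $2
    }
' file2 file1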
If file1 is not yet sorted, you can do:
sort -nr -k 2 file1 | awk 'FNR == NR { a[$1] = $2; next } { print $1, a[$1], $2 }' file2 -
Or
awk 'FNR == NR { a[$1] = $2; next } { print $1, a[$1], $2 }' file2 <(sort -nr -k 2 file1)
There are many ways to format the output. You can use column -t:
... | column -t
Output:
test4 C 30
test6 E 29
test3 B 17
test2 A 12
test5 D 5
Or you can use printf, although I'd prefer column -t, since the table would break if a column grows wider than the fixed width that printf provides.
... { printf "%s%3s%4.2s\n", $1, a[$1], $2 }' ...
Output:
test4 C 30
test6 E 29
test3 B 17
test2 A 12
test5 D 5
Don't sort file1 numerically before processing it; keep it sorted by the first column (join expects both inputs to be sorted on the join field).
Assuming you have:
file1:
test2 12
test3 17
test4 30
test5 5
test6 29

file2:
test2 A
test3 B
test4 C
test5 D
test6 E
Using join file2 file1 | sort -nr -k 3 will yield:
test4 C 30
test6 E 29
test3 B 17
test2 A 12
test5 D 5
Use -t' ' if you want your spacing left unmodified by join.
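If you only have the numerically sorted version of file1, you can restore the ordering join needs on the fly; a sketch assuming bash (for process substitution):
join file2 <(sort -k 1,1 file1) | sort -nr -k 3
This produces the same output as above.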

Selecting greatest value of B for equal entries in A

Consider the following table:
A | B
-----|------
123 | 1
456 | 2
123 | 5
456 | 0
789 | 3
789 | 9
123 | 6
I want to get the following output:
A | B
-----|------
123 | 6
456 | 2
789 | 9
In other words: the greatest value of B for each equal value in A.
The initial table above already comes from another query, which selects only duplicated values of A:
select A, B from tbl where A in (
    select A from tbl
    group by A
    having count(A) > 1
);
I tried wrapping another grouping function, with and without max(B), around this query, but had no success.
How can I get the desired output?
Just use max:
select A, max(B)
from tbl
group by A
having count(A) > 1
Maybe I'm being naive here, but:
SELECT tbl2.A, MAX(tbl2.B)
FROM (
    select A, B from tbl where A in (
        select A from tbl
        group by A
        having count(A) > 1
    )
) as tbl2
GROUP BY tbl2.A
seems like it should work.
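Both answers produce the desired output above. If your database supports window functions, an equivalent sketch for comparison (not from the thread; syntax assumes a reasonably modern SQL dialect):
select A, B
from (
    select A, B,
           row_number() over (partition by A order by B desc) as rn,
           count(*) over (partition by A) as cnt
    from tbl
) t
where t.rn = 1   -- the greatest B per A
  and t.cnt > 1; -- keep only duplicated A values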

SSRS Interactive Sort Row Group data by Column Group

In SSRS I am attempting to do an interactive sort on the combination of a column group and a row group.
So my data would look like:
-------- ------------- --------- -------- -----------------
Category CampaignType Campaign Quarter Value
-------- ------------- --------- -------- -----------------
Cat 1 CampType1 Camp1 Q1 1
Cat 1 CampType1 Camp1 Q2 4
Cat 1 CampType1 Camp2 Q1 51
Cat 1 CampType2 Camp1 Q1 3
Cat 1 CampType2 Camp1 Q2 1
Cat 1 CampType2 Camp2 Q1 3
Cat 1 CampType2 Camp2 Q3 56
Cat 2 CampType1 Camp1 Q1 8
Cat 2 CampType1 Camp1 Q3 11
Cat 2 CampType1 Camp2 Q1 2
Cat 2 CampType2 Camp1 Q1 23
Cat 2 CampType2 Camp2 Q1 3
Cat 2 CampType3 Camp2 Q2 8
-------- ------------- --------- -------- -----------------
So I am grouping on the rows by:
-Category
---CampType
------Campaign
And grouping on the columns by Quarter.
Getting the sum of the "Value" at the intersection of these two points.
Add some totals and we're done.
I would like to be able to sort the campaign value sums by quarter.
That, in my opinion, seems like a reasonable request.
(Screenshots showing the tablix in the designer and a rendered example followed here.)
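The thread has no answer here, but a common SSRS pattern for this (a sketch, not from the original post) is to put the interactive sort on the Campaign group's header textbox and pin the sort expression to one quarter, e.g.:
=Sum(IIf(Fields!Quarter.Value = "Q1", Fields!Value.Value, Nothing))
An interactive sort expression generally cannot reference the clicked column-group member dynamically, which is why the quarter is hard-coded in this sketch.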

LINQ: retrieve rows from a table whose values (in a certain column) do not appear in another table

I have two tables which have a column in common. How can I retrieve the rows of the first table whose values in that column do not appear in the corresponding column of the other table?
Here is an example:
table1:
table1ID | foo   | fooBool
---------|-------|--------
1        | test1 | true
2        | test2 | true
3        | test3 | true
4        | test4 | false

table2:
table2ID | foo   | fooin
---------|-------|------
1        | test2 | 5
2        | test3 | 7
Therefore the result of the LINQ query, the rows of table1 whose foo values do not appear in table2, is:
table1ID | foo | fooBool
---------------------------
1 | test1 | true
4 | test4 | false
var results = from t1 in db.table1
              where !db.table2.Any(t2 => t2.foo == t1.foo)
              select t1;
You could also use the Intersect() IEnumerable extension, projecting the common column first (the two tables have different element types):
var results = db.table1.Select(t1 => t1.foo)
                       .Intersect(db.table2.Select(t2 => t2.foo));
Or in LINQ query syntax:
var codes = (from t1 in db.table1 select t1.foo)
            .Intersect(from t2 in db.table2 select t2.foo);
This would produce
results
----------
test2
test3
Update
Thanks to Joe for pointing out that Intersect would produce the common items (added sample above). What you would want is the Except extension method.
var results = db.table1.Select(t1 => t1.foo)
                       .Except(db.table2.Select(t2 => t2.foo));
Or in LINQ query syntax:
var codes = (from t1 in db.table1 select t1.foo)
            .Except(from t2 in db.table2 select t2.foo);
This would produce
results
----------
test1
test4
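To get the full table1 rows shown in the question, rather than just the foo values, one sketch is to feed the Except result back into a Where/Contains:
// foo values present in table1 but missing from table2
var missingFoos = db.table1.Select(t1 => t1.foo)
                           .Except(db.table2.Select(t2 => t2.foo));
// the corresponding full rows of table1
var rows = db.table1.Where(t1 => missingFoos.Contains(t1.foo));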
