Max value: Max of a given column in the CSV file - Hadoop

Input (file1): the job should take the file below as input.
C1 - ABC,DEF,GHI,JKL.
C2 - 10,15,20,30.
C3 - B1,B2,B3,B4.
C4 - 5,2,6,9.
Input parameter: column number (e.g. 2).
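The requested computation can be sketched in plain Python. This is not a Hadoop job; it only shows the per-column maximum that a mapper/reducer pair would have to produce. It assumes C1 to C4 are the columns of a comma-separated file with one record per line, and that the column number is 1-based. The name `max_of_column` is hypothetical.

```python
import csv
import io


def max_of_column(lines, column_no):
    """Return the maximum numeric value in the 1-based column `column_no`.

    Returns None when no numeric value is found in that column.
    """
    best = None
    for row in csv.reader(lines):
        if len(row) < column_no:
            continue  # skip empty or short rows
        try:
            value = float(row[column_no - 1])
        except ValueError:
            continue  # skip non-numeric cells such as a header
        if best is None or value > best:
            best = value
    return best


# Sample data laid out as assumed above: C1,C2,C3,C4 per line.
sample = io.StringIO("ABC,10,B1,5\nDEF,15,B2,2\nGHI,20,B3,6\nJKL,30,B4,9\n")
print(max_of_column(sample, 2))  # 30.0
```

In a MapReduce setting the same comparison would typically be split: each mapper emits the value of the chosen column, and a single reducer keeps the largest one.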

Related

total row count of parent group (summarize total)

I have a table with two groups, the second nested inside the first, like so:
Parent - Child - Col
X        X1      xx
                 xx
         X2      xx
                 xx
                 xx
---------------------
Y        Y1      yy
                 yy
I only care about getting the count of Cols for each parent group.
This is the desired result:
Parent - Child - Col
X        X1      xx
                 xx
         X2      xx
                 xx
                 xx
Total : 5
----------------------
Y        Y1      yy
                 yy
Total : 2
I managed to get the total for the Child group using the expression =Count(Fields!Col.Value) by putting a cell inside either of the two groups. I then tried to sum it by clicking on it and choosing Total, but that gives the total for the whole table, not for each parent group.
I assume it's the 2 that you can't get?
If so then you can get this by counting the number of distinct Parents
=CountDistinct(Fields!Parent.Value, "myDataSetName")
"myDataSetName" is the name of the dataset you are counting from. It is case-sensitive and must be enclosed in quotes.

Split last column into two equal halves in unix

I need to split the last column into two separate columns and delete part of it.
Currently every value in the last column has 6 digits. I need to split each value into two separate columns.
The first column should hold the first three digits and the second column the next three.
Ultimately I want to delete the newly created second column.
Data -
ID c1 c2 c3 c4 c5
12 A XY 123 456 657098
The new file should be created as below -
Data 2
ID c1 c2 c3 c4 c5
12 A XY 123 456 657
Thanks
You can use this awk command, which checks the length of the last column in each row:
awk 'length($NF) == 6 { $NF = substr($NF, 1, 3) } 1' file
Data -
ID c1 c2 c3 c4 c5
12 A XY 123 456 657

Avoid accuracy problems while computing the permanent using the Ryser formula

Task
I want to calculate the permanent P of an NxN matrix for N up to 100. I can make use of the fact that the matrix has only M=4 (or slightly more) distinct rows and cols. The matrix might look like
A1 ... A1  B1 ... B1  C1 ... C1  D1 ... D1  |
...                                         | r1 identical rows
A1 ... A1  B1 ... B1  C1 ... C1  D1 ... D1  |
A2 ... A2  B2 ... B2  C2 ... C2  D2 ... D2
...
A2 ... A2  B2 ... B2  C2 ... C2  D2 ... D2
A3 ... A3  B3 ... B3  C3 ... C3  D3 ... D3
...
A3 ... A3  B3 ... B3  C3 ... C3  D3 ... D3
A4 ... A4  B4 ... B4  C4 ... C4  D4 ... D4
...
A4 ... A4  B4 ... B4  C4 ... C4  D4 ... D4
---------
c1 identical cols
Here c and r are the multiplicities of the cols and rows. All values in the matrix lie between 0 and 1 and are encoded as double-precision floating-point numbers.
Algorithm
I tried to use the Ryser formula to calculate the permanent. For the formula, one needs to first calculate the sum of each row and multiply all the row sums. For the matrix above this yields
S0 = (c1 * A1 + c2 * B1 + c3 * C1 + c4 * D1)^r1 * ...
* (c1 * A4 + c2 * B4 + c3 * C4 + c4 * D4)^r4
As a next step the same is done with col 1 deleted
S1 = ((c1-1) * A1 + c2 * B1 + c3 * C1 + c4 * D1)^r1 * ...
* ((c1-1) * A4 + c2 * B4 + c3 * C4 + c4 * D4)^r4
and this number is subtracted from S0.
The algorithm continues with all possible ways of deleting single cols and groups of cols. The products of the row sums of the remaining matrix are added (even number of cols deleted) or subtracted (odd number of cols deleted).
The task can be solved relatively efficiently if one makes use of the identical cols (for example, the result S1 will pop up exactly c1 times).
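The grouped form of the sum described above can be sketched as follows. Instead of enumerating every subset of cols, it enumerates how many cols of each type are kept and weights each term by the number of subsets with that pattern, so there are (c1+1)*(c2+1)*... terms rather than 2^N. The function name and argument layout are hypothetical, and exact rational arithmetic (`fractions.Fraction`) is used here only so that the small examples can be checked by hand; it is not what the poster uses in MATLAB.

```python
from fractions import Fraction
from itertools import product
from math import comb


def grouped_ryser_permanent(a, r, c):
    """Permanent of a square matrix built from repeated row and col types.

    a[i][k] is the entry shared by row type i and col type k,
    r[i] is the multiplicity of row type i, c[k] that of col type k.
    """
    n = sum(c)
    if sum(r) != n:
        raise ValueError("matrix must be square")
    total = Fraction(0)
    # kept[k] = number of cols of type k that are NOT deleted
    for kept in product(*(range(ck + 1) for ck in c)):
        # number of col subsets that share this pattern
        weight = 1
        for ck, kk in zip(c, kept):
            weight *= comb(ck, kk)
        # product of the row sums of the remaining matrix
        term = Fraction(1)
        for row, ri in zip(a, r):
            row_sum = sum(kk * Fraction(x) for kk, x in zip(kept, row))
            term *= row_sum ** ri
        deleted = n - sum(kept)
        # added for an even number of deleted cols, subtracted for odd
        total += (-1) ** deleted * weight * term
    return total


# [[1, 2], [3, 4]]: permanent = 1*4 + 2*3
print(grouped_ryser_permanent([[1, 2], [3, 4]], [1, 1], [1, 1]))  # 10
# 3x3 all-ones matrix: permanent = 3!
print(grouped_ryser_permanent([[1]], [3], [3]))  # 6
```

Because every intermediate value is an exact rational, a sketch like this can serve as a slow reference against which a double-precision implementation is compared on small inputs.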
Problem
Even if the final result is small, the intermediate results S0, S1, ... can reach values up to N^N. A double can hold a number this large, but its absolute rounding error at that magnitude is on the order of, or larger than, the expected overall result. The expected result P is on the order of c1!*c2!*c3!*c4! (actually I am interested in P/(c1!*c2!*c3!*c4!), which should lie between 0 and 1).
I tried to arrange the additions and subtractions of the values S so that the sums of the intermediate results stay around 0. This helps in the sense that I can avoid intermediate results exceeding N^N, but it improves things only a little. I also thought about using logarithms for the intermediate results to keep the absolute numbers down, but the relative accuracy of the encoded numbers would still be bounded by the floating-point encoding, so I think I would run into the same problem. If possible, I want to avoid data types that implement variable-precision arithmetic, for performance reasons (currently I am using MATLAB).

Query in Oracle for running sum

I need a result set with a running sum: each record's value added to the sum of all previous records.
Logic
My table has one key column C1 and a numeric column C2. I need three columns in the output, one of which is a running sum. The first two columns are the same as in the source; the third column, C3, is computed as follows:
The first record of C3 = first record C2.
Second record C3 = "First Record C2 + Second Record C2";
Third record C3 = "First Record C2 + Second Record C2 + Third Record C2"
and so on for all the records.
Ex.
I have one source table like
C1 C2
---------
a 1
b 2
c 3
I need output like below
C1 C2 C3
-------------
a 1 1
b 2 3
c 3 6
-- order by the key column c1; ordering by c2 would give tied c2 values the same running sum
select c1, c2, sum(c2) over (order by c1) c3
from table_name
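To make the expected result concrete, the same running sum can be sketched outside the database. This is only an illustration in Python, assuming the rows are ordered by the key column C1; the function name `with_running_sum` is hypothetical.

```python
from itertools import accumulate


def with_running_sum(rows):
    """Append a running sum of C2 to each (C1, C2) row, ordered by C1."""
    ordered = sorted(rows, key=lambda row: row[0])
    totals = accumulate(c2 for _c1, c2 in ordered)
    return [(c1, c2, c3) for (c1, c2), c3 in zip(ordered, totals)]


for row in with_running_sum([("a", 1), ("b", 2), ("c", 3)]):
    print(row)
# ('a', 1, 1)
# ('b', 2, 3)
# ('c', 3, 6)
```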

Find a node in a tree based on some selection criteria

      [BASE]
     /   \    \
   C1    C2    C3
  /  \           \
C4    C5          C6
I have a tree like the above. It is an N-ary tree which is not balanced. The problem is that I need to select one of the nodes based on some condition, like
Select C1 when k1 = a
Select C4 when k1 = a and k2 = b and k3 = c
Select C5 when k1 = a and k' = z
Select C2 when k'' = b
Select C3 when k5 = 9
Select C6 when k5 = 9 and k6 = 10
The input to the program would be an arbitrary number of key-value pairs. For example, if the input is k1=a,k2=b,k3=c,k8=10, I should select C4, as that is the best match.
My first idea was to traverse the tree and, for each node, match its selection criteria against the input set. But I soon realized that this tree can be huge and the Base node can have tens of thousands of child nodes under it, so going node by node might not be a good idea. If there is a way to select the nodes more efficiently, I would love to know it.
It looks like your k's point to a directory structure, and the leaf of this structure (exactly one leaf per directory) is the node you are looking for. You can keep this string in the node as another value. What is not clear in the question is how the k's are related to the tree,
e.g.
a->c1
a/b/c->c4
I have found a workable solution like this one
----------------------------------------
|rowId|param1|param2|param3|param4|node|
----------------------------------------
|10   | a    |      |      |      | C1 |
----------------------------------------
|14   | a    | b    | c    |      | C4 |
----------------------------------------
|18   | a    | b    |      |      | C5 |
----------------------------------------
Let's call it a condition table. Each column represents one of the input keys (k), and each combination of values maps to the node to be selected. This table can be thought of as an in-memory data structure or as a real table in an RDBMS.
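A minimal sketch of such a condition-table lookup as an in-memory structure is shown below. It assumes that each row stores only the keys it constrains, that a row matches when all of its conditions are satisfied by the input, and that "best match" means the matching row with the most conditions; the thread does not define this precisely. The rules are taken from the selection list in the question, and `select_node` is a hypothetical name.

```python
def select_node(condition_table, inputs):
    """Return the node of the most specific row whose conditions all hold.

    condition_table: list of (conditions_dict, node_name) pairs.
    inputs: dict of input key/value pairs; extra keys are ignored.
    Returns None when no row matches.
    """
    best_node, best_size = None, -1
    for conditions, node in condition_table:
        if all(inputs.get(key) == value for key, value in conditions.items()):
            if len(conditions) > best_size:
                best_node, best_size = node, len(conditions)
    return best_node


condition_table = [
    ({"k1": "a"}, "C1"),
    ({"k1": "a", "k2": "b", "k3": "c"}, "C4"),
    ({"k5": "9"}, "C3"),
    ({"k5": "9", "k6": "10"}, "C6"),
]

# Parse the example input from the question into a dict.
inputs = dict(pair.split("=") for pair in "k1=a,k2=b,k3=c,k8=10".split(","))
print(select_node(condition_table, inputs))  # C4
```

This version scans every row, which is simple but linear in the table size. With tens of thousands of rows, one option is to index the rows by one of their keys so that only candidate rows are checked.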

Resources