How do I number the lines of a file taking into account the string values of its columns (UNIX shell) - bash

Can someone help me? I would like to number the lines of a tab-separated file in UNIX depending on the columns in that file. However, in some rows the last column contains the same letters with the same length but in a different order, and such rows must be considered the same if all the previous columns match too. In summary, the input is something like
rs758613821 574290 insertion_inframe P 285 AAAP
rs758613821 574290 insertion_inframe P 285 APAA
rs758613821 574290 insertion_inframe P 285 APLA
rs1367252071 574290 deletion_inframe CADDL 134 F
rs538 3246 frameshift_variant F 97 FGLYP
rs538 3246 frameshift_variant F 97 PYFLG
And the output should be
1 rs758613821 574290 insertion_inframe P 285 AAAP
1 rs758613821 574290 insertion_inframe P 285 APAA
2 rs758613821 574290 insertion_inframe P 285 APLA
3 rs1367252071 574290 deletion_inframe CADDL 134 F
4 rs538 3246 frameshift_variant F 97 FGLYP
4 rs538 3246 frameshift_variant F 97 PYFLG
and so on...
So far I have written the following code:
awk 'BEGIN {FS=OFS="\t"}
function intern(sym) {
    if (sym in table)
        return table[sym]
    return table[sym] = ++counter
}
{ print intern($1"\t"$2"\t"$3"\t"$4"\t"$5"\t"$6), $0 }' "input" > "output"
Nevertheless, I haven't solved the problem with the last column: rows should be assigned the same number when that column has the same letters and length although in a different order. Is it possible to do this in a UNIX environment? I think maybe by means of the substr function or something similar, but I'm not sure what the proper code would be. Thanks in advance for your support and help!

Not sure this is what you want to do, but give it a try:
$ awk 'function canon(f) {            # return f with its letters sorted
           n=split(f,a,"")            # split into single characters (gawk)
           asort(a)                   # sort them (gawk)
           c=""
           for(i=1;i<=n;i++) c=c a[i]
           return c
       }
       {key=canon($NF)}
       !(key in keys) {keys[key]=++ctr}
       {print keys[key],$0}' file
1 rs758613821 574290 insertion_inframe P 285 AAAP
1 rs758613821 574290 insertion_inframe P 285 APAA
2 rs758613821 574290 insertion_inframe P 285 APLA
3 rs1367252071 574290 deletion_inframe CADDL 134 F
4 rs538 3246 frameshift_variant F 97 FGLYP
4 rs538 3246 frameshift_variant F 97 PYFLG
This converts the last field into a canonical form (its letters sorted) and counts the unique instances. Note that asort() and split() with an empty separator are GNU awk (gawk) features.
To use the full record as the key, do this instead:
...
{line=$0; $NF=canon($NF); key=$0}
!(key in keys) {keys[key]=++ctr}
{print keys[key],line}' file
Copy the line, replace the last field with its canonical form, use the updated record as the key, count the unique instances, and print the count together with the original line.
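For reference, here is this second variant in one piece (a sketch assembled from the two snippets above; like the first script, it needs GNU awk for asort() and the empty-separator split()):
$ awk 'function canon(f) {n=split(f,a,""); asort(a); c=""
           for(i=1;i<=n;i++) c=c a[i]
           return c}
       {line=$0; $NF=canon($NF); key=$0}
       !(key in keys) {keys[key]=++ctr}
       {print keys[key],line}' file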

Related

Comparison of 2 CSV files having the same column names with different data

I have two CSV files, each having 2 columns with the same column names. 1.csv was generated first and 2.csv was generated an hour later. I want to see the profit % increase or decrease for each business unit compared to the previous report. For example, business unit B has an increase of 50% (((15-10)/10)*100), while C has a decrease of 50%. Some new business units (AG & JK) were added in the new hourly report and should be treated as new. A few business units (D) were removed in the next hour and can be considered no longer required.
So basically, how can I compare the files and extract this data?
Busines Profit %
A 0
B 10
C 10
D 0
E 0
F 1615
G 0
Busines profit %
A 0
B 15
C 5
AG 5
E 0
F 1615
G 0
JK 10
Updated requirement:
Business  Profit % old  Profit % new  Variation
A         0             0             0
B         10            15            50%
C         10            5             -50%
D         0             -             cleared
AG        -             5             New
E         0             0             0
F         1615          1615          0%
G         0             0             0%
JK        -             10            New
I'd use awk for the job, something like this:
$ awk 'NR==FNR{                 # process the first file (the old report)
         a[$1]=$2               # hash old profit, key is the business unit
         next                   # move on to the next record of file1
       }
       {                        # process the second file (the new report)
         if(!($1 in a))         # if the business unit is not found in hash a
           p="new"              # it must be new
         else                   # otherwise calculate the % change,
           p=($2-a[$1])/(a[$1]==0?1:a[$1])*100   # dividing by 1 when old is 0
         print $1,p             # output business unit and % change
       }' file1 file2
A 0
B 50
C -50
AG new
E 0
F 0
G 0
JK new
One-liner version with appropriate semicolons:
$ awk 'NR==FNR{a[$1]=$2;next}{if(!($1 in a))p="new";else p=($2-a[$1])/(a[$1]==0?1:a[$1])*100;print $1,p}' file1 file2
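The updated requirement also flags units that disappeared (D, "cleared"), which the script above never prints. A sketch that extends it with an END block listing old units absent from the new report (the output order of that block is unspecified):
$ awk 'NR==FNR{a[$1]=$2;next}
       {seen[$1]=1
        if(!($1 in a)) p="new"
        else p=($2-a[$1])/(a[$1]==0?1:a[$1])*100
        print $1,p}
       END{for(b in a) if(!(b in seen)) print b,"cleared"}' file1 file2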

Use awk to print first column, second column and one column every four after that

I have a text file and I want to extract the first column, second column and one column every 3 after that. Also, I want to get rid of row #2. How can I do that using awk or similar?
An example:
I have a text file as such:
A B C D E F G H I J .. N (header 1)
A B C D E F G H I J .. N (header 2)
A B C D E F G H I J .. N (row 1)
A B C D E F G H I J .. N (row 2)
A B C D E F G H I J .. N (row n)
I'm trying to get it as
A B F J .. N (header 1)
A B F J .. N (row 1)
A B F J .. N (row 2)
A B F J .. N (row n)
Thanks.
P.S. I've tried playing with the solutions mentioned in "How to print every 4th column up to nth column and from (n+1)th column to last using awk?" but they don't work for me.
$ awk 'NR!=2{out=$1; for (i=2;i<=NF;i+=4) out = out OFS $i; print out}' file
A B F J 1)
A B F J 1)
A B F J 2)
A B F J n)
The above output is messy because of the ...s and comments in your sample input, which make it untestable. Always post ACTUAL, testable input/output, not a description or other abstract representation of it. And don't repeat the same data on every line, as that makes it hard to map the output fields to the input fields and so makes your requirements harder to understand. This would have been a more useful example:
$ cat file
101 102 103 104 105 106 107 108 109 110 111
201 202 203 204 205 206 207 208 209 210 211
301 302 303 304 305 306 307 308 309 310 311
401 402 403 404 405 406 407 408 409 410 411
$ awk 'NR!=2{out=$1; for (i=2;i<=NF;i+=4) out = out OFS $i; print out}' file
101 102 106 110
301 302 306 310
401 402 406 410
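If the stride ever needs to change, the same idea works with the step passed in as a variable (a small variation on the answer above, not part of the original):
$ awk -v step=4 'NR!=2{out=$1; for (i=2;i<=NF;i+=step) out = out OFS $i; print out}' file
101 102 106 110
301 302 306 310
401 402 406 410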

Replacing selective numbers with NaNs

I have eight columns of data. Columns 1, 3, 5 and 7 contain 3-digit numbers. Columns 2, 4, 6 and 8 contain 1s and zeros and correspond to columns 1, 3, 5 and 7 respectively. Where there is a zero in an even column, I want to change the corresponding number to NaN. More simply, if it were
155 1 345 0
328 1 288 1
884 0 145 0
326 1 332 1
159 0 186 1
then 884 would be replaced with NaN, as would 159, 345 and 145 with the other numbers remaining the same. I need to use NaN to maintain the data in matrix form.
I know I could use
data(3,1)=NaN; data(5,1)=NaN
etc., but this is very time-consuming. Any suggestions would be very welcome.
Approach 1
a1 = [
155 1 345 0
328 1 288 1
884 0 145 0
326 1 332 1
159 0 186 1]
t1 = a1(:,[2:2:end])      % flag columns (even)
data1 = a1(:,[1:2:end])   % data columns (odd)
t1(t1==0)=NaN             % zeros become NaN
t1(t1==1)=data1(t1==1)    % ones become the corresponding data values
a1(:,[1:2:end]) = t1      % write the result back into the odd columns
Output -
a1 =
155 1 NaN 0
328 1 288 1
NaN 0 NaN 0
326 1 332 1
NaN 0 186 1
Approach 2
[x1,y1] = find(~a1(:,[2:2:end]))        % rows/cols of zeros in the flag columns
a1(sub2ind(size(a1),x1,2*y1-1)) = NaN   % flag column y1 maps to data column 2*y1-1
I would split the problem into two matrices, with one being a logical mask, the other holding your data.
data = your_mat(:,1:2:end);
valid = your_mat(:,2:2:end);
Then you can simply do:
data(~valid)=NaN;
You could then rebuild your data by doing:
your_mat(:,1:2:end) = data;
Here is an interesting solution. I would expect it to perform quite well, but be aware that it is a bit tricky!
data(~data(:,2:end))=NaN
Using logical indexing:
even = a1(:,2:2:end); % even columns
odd = a1(:,1:2:end); % odd columns
odd(even == 0) = NaN; % set odd columns to NaN if corresponding col is 0
a1(:,1:2:end) = odd; % assign back to a1
a1 =
155 1 NaN 0
328 1 288 1
NaN 0 NaN 0
326 1 332 1
NaN 0 186 1
Here is an alternative solution. You can use circshift, in the following manner.
First create a mask of the even columns, of the same size as your input matrix A:
AM = false(size(A)); AM(:,2:2:end) = true;
Then circshift the mask (A==0)&AM one element to the left, to shift it onto the odd columns.
A(circshift((A==0)&AM,[0 -1])) = nan;
NOTE: I've searched for a one-liner ... I don't think it's a good one, but here is one you can use, based on my solution:
A(circshift(bsxfun(@and, A==0, mod(0:size(A,2)-1,2)),[0 -1])) = nan;
The dirty part is using bsxfun to create the mask AM on the fly: I apply the parity test to a vector of column indices, and bsxfun extends it over the whole matrix A. You can create this mask any other way, of course.

Sum up custom grand total on crosstab in BIRT

I have a crosstab and create a custom grand total for the row level in each column dimension by using a data element expression.
Crosstab Example:
           Cat 1                   Cat 2                  GT
ITEM    C    F     %  VALUE     C    F     %  VALUE
A     101    0   0.9     10   112  105  93.8     10      20
B     294    8   2.7      6    69   66  95.7     10      16
C     211    7   3.3      4   212  161  75.9      6      10
------------------------------------------------------------
GT    606   15  2.47      6   393  332  84.5      8  **14**
Explanation for the GT row:
1. The C and F columns are sums of the values above, but the % column is the result of dividing F by C.
2. A data element fills the VALUE column; it comes from a range-of-values definition that varies for each Cat (category). For instance, in Cat 1, a % value between 0 and 1 gives a VALUE of 10, between 1 and 2 gives 8, etc. The condition for Cat 2: between 85 and 100 gives 10, between 80 and 85 gives 8, etc.
3. The GT row (with the value of 14) is obtained by adding the VALUE of Cat 1 and Cat 2.
I am able to make points 1 and 2 above work, but I can't seem to make the GT row work. I don't know the code/expression to sum up the VALUE data element across these 2 categories, because the VALUE fields come from a single data element in design mode.
I have found the solution to my problem: I can produce the result by using report variables. I assign 2 report variables in the % field expression, based on the category in the data cube dimension (using an if statement). Then, in the data element expression, I call both variables and add them.

Check if a string exists in non-consecutive lines in a given column

I have files with the following format:
ATOM 8962 CA VAL W 8 8.647 81.467 25.656 1.00115.78 C
ATOM 8963 C VAL W 8 10.053 80.963 25.506 1.00114.60 C
ATOM 8964 O VAL W 8 10.636 80.422 26.442 1.00114.53 O
ATOM 8965 CB VAL W 8 7.643 80.389 25.325 1.00115.67 C
ATOM 8966 CG1 VAL W 8 6.476 80.508 26.249 1.00115.54 C
ATOM 8967 CG2 VAL W 8 7.174 80.526 23.886 1.00115.26 C
ATOM 4440 O TYR S 89 4.530 166.005 -14.543 1.00 95.76 O
ATOM 4441 CB TYR S 89 2.847 168.812 -13.864 1.00 96.31 C
ATOM 4442 CG TYR S 89 3.887 169.413 -14.756 1.00 98.43 C
ATOM 4443 CD1 TYR S 89 3.515 170.073 -15.932 1.00100.05 C
ATOM 4444 CD2 TYR S 89 5.251 169.308 -14.451 1.00100.50 C
ATOM 4445 CE1 TYR S 89 4.464 170.642 -16.779 1.00100.70 C
ATOM 4446 CE2 TYR S 89 6.219 169.868 -15.298 1.00101.40 C
ATOM 4447 CZ TYR S 89 5.811 170.535 -16.464 1.00100.46 C
ATOM 4448 OH TYR S 89 6.736 171.094 -17.321 1.00100.20 O
ATOM 4449 N LEU S 90 3.944 166.393 -12.414 1.00 94.95 N
ATOM 4450 CA LEU S 90 5.079 165.622 -11.914 1.00 94.44 C
ATOM 5151 N LEU W 8 -66.068 209.785 -11.037 1.00117.44 N
ATOM 5152 CA LEU W 8 -64.800 210.035 -10.384 1.00116.52 C
ATOM 5153 C LEU W 8 -64.177 208.641 -10.198 1.00116.71 C
ATOM 5154 O LEU W 8 -64.513 207.944 -9.241 1.00116.99 O
ATOM 5155 CB LEU W 8 -65.086 210.682 -9.033 1.00115.76 C
ATOM 5156 CG LEU W 8 -64.274 211.829 -8.478 1.00113.89 C
ATOM 5157 CD1 LEU W 8 -64.528 211.857 -7.006 1.00111.94 C
ATOM 5158 CD2 LEU W 8 -62.828 211.612 -8.739 1.00112.96 C
In principle, column 5 (W in this case, representing the chain ID) should be identical only in consecutive chunks. However, in files with too many chains there are not enough letters in the alphabet to assign a unique ID per chain, so duplicates may occur.
I would like to be able to check whether or not this is the case. In other words, I would like to know if a given chain ID (A-Z, always in the 5th column) is present in non-consecutive chunks. I do not mind that it changes from W to S; I would like to know if there are two chunks sharing the same chain ID, i.e. whether W or S reappears at some point. In fact, this is only a problem if they also share the first and the 6th columns, but I do not want to complicate things too much.
I do not want to print the lines, just to know the name of the file in which the issue occurs and the chain ID (in this case W), in order to fix it. In fact, I already know how to fix the problem, but I need to identify the problematic files so I can focus on those and not repair already sane files.
SOLUTION (thanks to all for your help, and especially to sehe):
for pdb in *.pdb ; do
    hit=$(awk '$1 == "ATOM"' "$pdb" | cut -c22-23 | uniq | sort | uniq -dc)
    [ -n "$hit" ] && echo "$pdb = $hit"
done
For this particular sample:
cut -c22-23 t | uniq | sort | uniq -dc
will output
2 W
(character column 22 contains 2 separate runs of the letter 'W')
Untested:
awk '
    seen[$5] && $5 != current {        # a chain ID seen before reappears after a gap
        print "found non-consecutive chain on line " NR
        exit
    }
    { current = $5; seen[$5] = 1 }     # track the current chunk and all seen IDs
' filename
Here you go; this awk script is tested and takes into account not just 'W':
{
    if (ln[$5] && ln[$5] + 1 != NR) {   # ID seen before, but not on the previous line
        print "dup " $5 " at line " NR;
    }
    ln[$5] = NR;                        # remember the last line this ID was seen on
}
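Since the goal is the file name plus the offending chain ID rather than line numbers, the same check can be wrapped in a loop over the files (a sketch combining the answers above, restricted to ATOM records):
for pdb in *.pdb ; do
    awk '$1 != "ATOM" { next }    # only chain IDs on ATOM records matter
         seen[$5] && $5 != cur { print FILENAME ": chain " $5 " reappears"; exit }
         { cur = $5; seen[$5] = 1 }' "$pdb"
done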
