How to find any decrement in the column? - sorting

I am trying to detect any decrement in a column and, if one is found, print the last highest value.
For example:
From 111 to 445 there is a continuous increase in the column, but the final 333 is less than the number before it.
111 aaa
112 aaa
112 aaa
113 sdf
115 aaa
222 ddd
333 sss
333 sss
444 sss
445 sss
333 aaa <--- this is less than the number above it (445)
If any such scenario is found, then print 445 sss

Like this, for example:
$ awk '{if (before>$1) {print before_line}} {before=$1; before_line=$0}' a
445 sss
What is it doing? It compares the stored variable before with the first field of the current line; if the previous value is bigger, it prints the previously stored line before_line. Either way, it then updates before and before_line for the next record.
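For readability, the same logic can also be spelled out as a small script (a sketch equivalent to the one-liner above; decrement.awk is just an illustrative file name):
$ cat decrement.awk
{
    if (before > $1)          # previous first field was bigger than the current one
        print before_line     # so print the line stored on the previous record
    before = $1               # remember this record's first field
    before_line = $0          # and the whole record, for the next comparison
}
$ awk -f decrement.awk a
445 sss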
It works for many cases as well:
$ cat a
111 aaa
112 aaa
112 aaa
113 sdf
115 aaa <--- this
15 aaa
222 ddd
333 sss
333 sss
444 sss
445 sss <--- this
333 aaa
$ awk '{if (before>$1) {print before_line}} {before=$1; before_line=$0}' a
115 aaa
445 sss

Store each number in a variable called prevNumber; then, when you read the next one, do a check such as if (newNumber < prevNumber) print prevNumber;
I don't really know what language you are using, but in awk it could look like the sketch below.
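Here is that prevNumber idea in awk, which the rest of this thread uses (prevLine is added so the whole line can be printed, as the question asks):
awk '{ if (NR > 1 && $1 < prevNumber) print prevLine    # smaller than the previous number: print the stored line
       prevNumber = $1; prevLine = $0 }' a              # remember the current number and line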

You can say:
awk '$1 > max {max=$1; maxline=$0}; END{ print maxline}' inputfile
For your input, it'd print:
445 sss

Related

Remove duplicates in each individual column from a text file

I have a text file of 7 tab-delimited columns. Each column has a different number of lines with values that could be duplicated. I want to remove the duplicates so that each column has only unique values for that specific column. As an example:
Input
C1 C2 C3 C4 C5 C6 C7
111 111 222 333 111 222 777
222 111 333 333 222 333 666
222 111 444 111 333 555 555
333 444 555 222 444 666 444
444 666 555 777 555 666 333
444 777 777 555 666 888 333
777 888 999 666 888
999
Output
C1 C2 C3 C4 C5 C6 C7
111 111 222 333 111 222 777
222 444 333 111 222 333 666
333 666 444 222 333 555 555
444 777 555 777 444 666 444
777 888 777 555 555 888 333
999 999 666 666
888
I figure I would need to use awk to print each column and use sort -u separately, and then paste those outputs together. So, is there a way to make a loop that, for each of the i columns in a text file, would print the column, pipe it through sort -u, and then paste it all together?
Thanks in advance,
Carlos
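For what it's worth, the loop-and-paste approach described in the question could be sketched like this (a rough sketch that assumes exactly 7 tab-separated columns and temporary files named col*.tmp; awk '!seen[]++' is used in place of sort -u so that the header row and the first-seen order are preserved, which is what the desired output shows):
#!/bin/sh
# one pass per column: pull out column i, drop empty cells and duplicates (keeping first-seen order)
for i in 1 2 3 4 5 6 7; do
    awk -F'\t' -v c="$i" '$c != "" && !seen[$c]++ { print $c }' input.txt > "col$i.tmp"
done
# stitch the per-column files back together, tab-separated
paste col1.tmp col2.tmp col3.tmp col4.tmp col5.tmp col6.tmp col7.tmp
rm -f col[1-7].tmp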
Using perl instead for its support of true multidimensional arrays:
perl -lane '
    for my $n (0..$#F) {
        if (!exists ${$vals[$n]}{$F[$n]}) {
            push @{$cols[$n]}, $F[$n];
            ${$vals[$n]}{$F[$n]} = 1;
        }
    }
    END {
        for (1..$.) {
            my @row;
            for my $n (0..$#cols) {
                push @row, shift @{$cols[$n]};
            }
            print join("\t", @row);
        }
    }' input.txt
Using any awk in any shell on every Unix box:
$ cat tst.awk
BEGIN { FS=OFS="\t" }
{
    for (colNr=1; colNr<=NF; colNr++) {
        val = $colNr
        if ( !seen[colNr,val]++ ) {
            rowNr = ++colRowNrs[colNr]
            vals[rowNr,colNr] = val
            numRows = (rowNr > numRows ? rowNr : numRows)
        }
    }
    numCols = (NF > numCols ? NF : numCols)
}
END {
    for (rowNr=1; rowNr<=numRows; rowNr++) {
        for (colNr=1; colNr<=numCols; colNr++) {
            val = vals[rowNr,colNr]
            printf "%s%s", val, (colNr<numCols ? OFS : ORS)
        }
    }
}
$ awk -f tst.awk file
C1 C2 C3 C4 C5 C6 C7
111 111 222 333 111 222 777
222 444 333 111 222 333 666
333 666 444 222 333 555 555
444 777 555 777 444 666 444
777 888 777 555 555 888 333
999 999 666 666
888
Assumptions
an (awk) array of the entire output result will fit in memory
variable number of columns and rows
One idea consists of a (sparse) 2-dimensional array of values, where the array structure would look like:
values[<column#>][<row#>]=<unique_cell_value>
One idea using a single awk invocation that a) requires a single pass through the input file and b) does not require any transposing/pasting (in case anyone takes Cyrus' comment/suggestion seriously):
awk '
BEGIN { FS=OFS="\t" }
{ maxNF = (NF > maxNF ? NF : maxNF)      # keep track of max number of columns
  for (i=1; i<=NF; i++) {
      if ( $i == "" )                    # ignore empty cell
         continue
      for (j=1; j<=ndx[i]; j++) {        # loop through values already seen for this column
          if ( $i == vals[i][j] ) {      # and if already seen then
             $i = ""                     # clear the current cell and
             break                       # break out of this for/testing loop
          }
      }
      if ( $i != "" ) {                  # if we got this far and the cell is not empty then
         vals[i][++ndx[i]] = $i          # store the new value in our array
      }
  }
}
END { for (j=1; j<=NR; j++) {            # loop through all possible rows
          pfx = ""
          for (i=1; i<=maxNF; i++) {     # loop through all possible columns
              printf "%s%s", pfx, vals[i][j]   # non-existent array entries default to ""
              pfx = OFS
          }
          printf "\n"
      }
}
' input_file
NOTE: The array-of-arrays structure (arr[i][j]) requires GNU awk; otherwise we could convert to a pseudo dual-index array structure of arr[i,j] (see the sketch after the sample output below).
This generates:
C1 C2 C3 C4 C5 C6 C7
111 111 222 333 111 222 777
222 444 333 111 222 333 666
333 666 444 222 333 555 555
444 777 555 777 444 666 444
777 888 777 555 555 888 333
999 999 666 666
888
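For reference, here is a sketch of the pseudo dual-index conversion mentioned in the note above; it only swaps vals[i][j] for vals[i,j] (and likewise for the assignment), so it should behave the same under any POSIX awk:
awk '
BEGIN { FS=OFS="\t" }
{ maxNF = (NF > maxNF ? NF : maxNF)
  for (i=1; i<=NF; i++) {
      if ( $i == "" )
         continue
      for (j=1; j<=ndx[i]; j++) {        # values already seen for this column
          if ( $i == vals[i,j] ) {       # vals[i,j] instead of vals[i][j]
             $i = ""
             break
          }
      }
      if ( $i != "" ) {
         vals[i,++ndx[i]] = $i
      }
  }
}
END { for (j=1; j<=NR; j++) {
          pfx = ""
          for (i=1; i<=maxNF; i++) {
              printf "%s%s", pfx, vals[i,j]    # missing entries still default to ""
              pfx = OFS
          }
          printf "\n"
      }
}
' input_file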

How to replace columns (matching pattern) using awk?

I am trying to use awk to edit files, but I can't manage to do it without creating intermediate files.
Basically, I want to search file2, file3, and so on using column 1, and for the lines whose 1st column matches, take the replacement 2nd column from those files. (Note that file2 and file3 may contain other stuff.)
I have
File1.txt
aaa 111
aaa 222
bbb 333
bbb 444
File2.txt
zzz zzz
aaa 999
zzz zzz
aaa 888
File3.txt
bbb 000
bbb 001
yyy yyy
yyy yyy
Desired output
aaa 999
aaa 888
bbb 000
bbb 001
This does what you specified, but I guess there are many edge cases not covered.
$ awk 'NR==FNR{a[$1]; next} $1 in a' file{1..3}
aaa 999
aaa 888
bbb 000
bbb 001
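Spelled out with comments (the same one-liner, expanded and using the file names from the question):
awk '
NR==FNR { a[$1]; next }   # first file only (File1.txt): remember each value of column 1, then skip to the next line
$1 in a                   # remaining files: the default action prints lines whose first column was seen in File1.txt
' File1.txt File2.txt File3.txt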

Retrieving multiple last rows of occurrences of a column based on same first column

I have the following file:
ABC 1234 2333 BCD
ABC 121 123 BCD
ABC 124 231 BCD
ABC 2342 2344 CDK
MBN 231 252 RFC
MBN 230 212 RFC
MBN 213 215 RFC
MBN 233 235 RFC
MBN 12 67 RTC
MBN 67 98 TCF
I want to find the last row for each unique combination of the first and fourth columns, based on a search from another file. My other file will have:
ABC
MBN
The code should look for ABC first in the above file, then find the last occurrence for BCD (and so on for each fourth-column value), and the output would be:
ABC 124 231 BCD
ABC 2342 2344 CDK
MBN 233 235 RFC
MBN 67 98 TCF
I have begun by first finding the occurrence of ABC with:
grep ABC abovefile.txt | head -1
You can use this awk command:
awk 'NR==FNR{search[$1];next} $1 in search{key=$1 SEP $4; if (!(key in data)) c[++n]=key;
data[key]=$0} END{for (i=1; i<=n; i++) print data[c[i]]}' file2 file1
Output:
ABC 124 231 BCD
ABC 2342 2344 CDK
MBN 233 235 RFC
MBN 12 67 RTC
MBN 67 98 TCF
This solution is using 3 arrays:
search to hold search items from file2
data to hold records from file1 with the key as $1,$4
c for keeping the order of the already processed keys
Code Breakup:
NR==FNR # Execute next block for the 1st file in the list (i.e. file2)
{search[$1];next} # store first column in search array and move to next record
$1 in search # for next file in the list if first col exists in search array
key=$1 SEP $4 # make key variable as $1, $4
if(!(key in data))# if key is not in data array
c[++n]=key # store in array c with an incrementing index
data[key]=$0} # now store full record in data array with index=key
END # run this block at the end
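One small caveat: SEP is not a built-in awk variable, so being unset it just concatenates $1 and $4 directly. If different $1/$4 pairs could ever collide that way, awk's built-in SUBSEP can be used as the separator instead, e.g.:
awk 'NR==FNR{search[$1];next} $1 in search{key=$1 SUBSEP $4; if (!(key in data)) c[++n]=key;
data[key]=$0} END{for (i=1; i<=n; i++) print data[c[i]]}' file2 file1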

calculate percentage between columns in bash?

I have a long tab-formatted file with many columns. I would like to calculate a percentage for two columns (the 3rd and 4th, each relative to the 2nd) and print this % next to the corresponding numbers in this format (46.00%).
input:
file1 323 434 45 767 254235 275 2345 467
file1 294 584 43 7457 254565 345 235445 4635
file1 224 524 4343 12457 2542165 345 124445 41257
Desired output:
file1 323 434(134.37%) 45(13.93%) 767 254235 275 2345 467
file1 294 584(198.64%) 43(14.63%) 7457 254565 345 235445 4635
file1 224 524(233.93%) 4343(1938.84%) 12457 2542165 345 124445 41257
i tried:
cat test_file.txt | awk '{printf "%s (%.2f%)\n",$0,($4/$2)*100}' OFS="\t" | awk '{printf "%s (%.2f%)\n",$0,($3/$2)*100}' | awk '{print $1,$2,$3,$11,$4,$10,$5,$6,$7,$8,$9}' - | sed 's/ (/(/g' | sed 's/ /\t/g' >out.txt
It works, but I want a shorter way of doing this.
I would say:
$ awk '{$3=sprintf("%d(%.2f%%)", $3, ($3/$2)*100); $4=sprintf("%d(%.2f%%)", $4, ($4/$2)*100)}1' file
file1 323 434(134.37%) 45(13.93%) 767 254235 275 2345 467
file1 294 584(198.64%) 43(14.63%) 7457 254565 345 235445 4635
file1 224 524(233.93%) 4343(1938.84%) 12457 2542165 345 124445 41257
With a function to avoid duplication:
awk 'function print_nice (num1, num2) {
         return sprintf("%d(%.2f%%)", num1, (num1/num2)*100)
     }
     {$3=print_nice($3,$2); $4=print_nice($4,$2)}1' file
This uses sprintf to build the value in a specific format and store it back into the field. The calculations are straightforward.
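Since the question mentions a tab-formatted file, a tab-preserving variant could look like this (a sketch; it only adds a BEGIN block setting the input and output separators):
awk 'BEGIN { FS=OFS="\t" }
     function print_nice (num1, num2) {
         return sprintf("%d(%.2f%%)", num1, (num1/num2)*100)
     }
     {$3=print_nice($3,$2); $4=print_nice($4,$2)}1' file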

filtering fields based on certain values

I wish you all a very happy New Year.
I have a file that looks like this (example). There is no header, and the file has about 10000 such rows:
123 345 676 58 1
464 222 0 0 1
555 22 888 555 1
777 333 676 0 1
555 444 0 58 1
PROBLEM: I only want the rows where both fields 3 and 4 have a non-zero value, i.e. in the above example rows 1 and 3 should be included and the rest excluded. How can I do this?
The output should look like this:
123 345 676 58 1
555 22 888 555 1
Thanks.
awk is perfect for this kind of stuff:
awk '$3 && $4' input.txt
This will give you the output that you want.
$3 && $4 is a filter. $3 is the value of the 3rd field, $4 is the value of the fourth. Zero values are evaluated as false, anything else as true. If there can be negative values, then you need to write it more precisely:
awk '$3 > 0 && $4 > 0' input.txt
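If a field could contain non-numeric text (which awk evaluates as true whenever it is non-empty), forcing a numeric comparison is a slightly more defensive variant (a sketch; not needed for the sample data shown):
awk '($3 + 0) != 0 && ($4 + 0) != 0' input.txt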
