Values based comparison in Unix

I have two variables like the ones below.
a=rw,bg,hard,timeo=600,wsize=1048576,rsize=1048576,nfsvers=3,tcp,actimeo=0,noacl,lock
b=bg,rg,hard,timeo=600,wsize=1048576,rsize=1048576,nfsvers=3,tcp,actimeo=0,noacl,lock
My if condition fails because it compares the values position by position: it looks for the rw value from position 1 of a at position 1 of b, but in b that value is at position 2.
How can I compare the two lines even though the order of the fields is not the same?

This script seems to work:
a="rw,bg,hard,timeo=600,wsize=1048576,rsize=1048576,nfsvers=3,tcp,actimeo=0,noacl,lock"
b="bg,rg,hard,timeo=600,wsize=1048576,rsize=1048576,nfsvers=3,tcp,actimeo=0,noacl,lock"
{ echo "$a"; echo "$b"; } |
awk -F, \
'NR == 1 { for (i = 1; i <= NF; i++) a[$i] = 1 }
NR == 2 { for (i = 1; i <= NF; i++)
{
if ($i in a)
delete a[$i]
else
{
mismatch++
print "Unmatched item (row 2):", $i
}
}
}
END {
for (i in a)
{
print "Unmatched item (row 1):", i
mismatch++
}
if (mismatch > 0)
print mismatch, "Mismatches"
else
print "Rows are the same"
}'
Example runs:
$ bash pairing.sh
Unmatched item (row 2): rg
Unmatched item (row 1): rw
2 Mismatches
$ sed -i.bak 's/,rg,/,rw,/' pairing.sh
$ bash pairing.sh
Rows are the same
$
There are undoubtedly ways to make the script more compact, but the code is fairly straightforward. If a field appears twice in the second row and once in the first row, the second occurrence will be reported as an unmatched item. The code doesn't check for duplicates while processing the first row; it's an easy exercise to make it do so. The code doesn't print the input data for verification; it probably should.
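The duplicate handling mentioned above can be sketched by counting occurrences rather than just flagging them. This is a minimal sketch, not the script from the answer: the function name compare_options is hypothetical, and it reports only through its exit status instead of printing the unmatched items.

```shell
# compare_options is a hypothetical helper: exit status 0 if both
# comma-separated lists hold the same items the same number of times.
compare_options() {
    printf '%s\n%s\n' "$1" "$2" | awk -F, '
        NR == 1 { for (i = 1; i <= NF; i++) count[$i]++ }   # tally row 1
        NR == 2 { for (i = 1; i <= NF; i++) count[$i]-- }   # cancel against row 2
        END {
            for (item in count)
                if (count[item] != 0) bad = 1               # leftover on either side
            exit bad + 0
        }'
}

a="rw,bg,hard,timeo=600"
b="bg,rw,hard,timeo=600"
if compare_options "$a" "$b"; then
    echo "same options"
else
    echo "different options"
fi
```

Because every item is counted up for the first row and down for the second, a value that appears twice on one side and once on the other leaves a non-zero count and is treated as a mismatch.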

Related

awk numbered columns and ignore errors

The following works well and captures all 2nd column values for S_nn. The goal is to add numbers in the 2nd column.
awk -F "," '/s_/ {cons = cons + $2} END {print cons}' G.csv
How can I change this to add only when nnn is between N1 and N2 e.g. s_23 and s_24?
Also is it possible to consider 1 if a line has junk instead of numbers in the 2nd column?
S_22, 1
S_23, 0
S_24, 1
S_25, 1
S_26, ?
Sample input: sum s_24 to s_26
Sample output: 1+1+1=3 (the last one is for error)

The solution is rather simple: all you need to do is perform a numeric test.
awk -v start=24 -v stop=26 '
BEGIN { FS="[_,]" }
(start <= $2 ) && ($2 <= stop) { s = s + (($3==$3+0)?$3:1) }
END{ print s+0 }' <file>
which outputs
3
How does it work:
Line 1 defines the start and stop variables.
The BEGIN block redefines the field separator as either _ or ,, so each line now has 3 fields.
The second line checks whether field 2 (the number) lies between start and stop; if so, it adds to the sum.
Field 3 is tested for being a number with the condition $3==$3+0; if the test fails, the value is taken to be 1.
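The numeric test in the last step can be tried on its own. The following is a small sketch under the assumption that each value arrives as the first field of a line; the function name value_or_one is made up for illustration.

```shell
# value_or_one is a hypothetical helper: print each argument if awk
# treats it as a number, otherwise print 1.
value_or_one() {
    printf '%s\n' "$@" | awk '{ print (($1 == $1 + 0) ? $1 : 1) }'
}

value_or_one 1 0 "?" abc 7
```

A field that looks like a number is compared numerically with $1 + 0 and therefore matches; anything else is compared as a string and fails the test.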
If you want to see the numbers printed, you can do :
awk -v start=24 -v stop=26 '
BEGIN{ FS="[_,]" }
(start <= $2 ) && ($2 <= stop) {
v = ($3==$3+0)?$3:1
s = s + v
printf "%s%d", (c++?"+":""), v
}
END{ printf "=%d\n", s }' <file>
output :
1+1+1=3
The printf statement prints "+" before every value except the first. This is tracked with a counter c, which starts at zero by default. The expression (c++?"+":"") determines whether we are printing the first entry. c++ returns the current value of c and then sets c to c+1; this is called the post-increment operator. Thus, the first time, c=0, so (c++?"+":"") returns "" and sets c to 1. The second time, it returns "+" and sets c to 2.
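The post-increment separator idiom works for any list, not only for this sum. Here is a minimal sketch; the function name join_plus is hypothetical.

```shell
# join_plus is a hypothetical helper: join its arguments with "+".
join_plus() {
    printf '%s\n' "$@" | awk '
        { printf "%s%s", (c++ ? "+" : ""), $1 }   # no separator before the first item
        END { print "" }                          # finish the line
    '
}

join_plus 1 1 1
```

The counter c is zero (false) only for the first line, so the separator is emitted before every item except the first.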

How to write a script that searches for numeric pattern in huge file?

I have 200000 integers written in a file like this
0
1
2
3
.
98
99
.
.
100
101
102
.
I want to write an awk script (or something using join) that tells me how many times this pattern (from 0 to 99) repeats itself.

Not battle tested:
awk 'i++!=$0{i=$0==0?1:0}i==100{c++;i=0}END{print c}' p.txt
Breakdown:
i++ != $0 { # Compare the cursor (i) with the input line, then advance it
i = $0==0 ? 1 : 0 # On a mismatch, reset the cursor: to 1 if the current line is 0,
# because that line already matches the start of the pattern; otherwise to 0
}
i == 100 { # Full pattern found:
c++ # add to count
i = 0 # reset cursor
}
END { print c } # Print matched count

You can do this using a state variable which is reset anytime the pattern is incomplete. For example:
#!/usr/bin/awk -f
BEGIN {
state = -1;
count = 0;
}
/^[0-9]+$/ {
if ( $0 == ( state + 1 ) || $0 == 0 ) {
state = $0;
if ( state == 99 ) {
count++;
}
} else {
state = -1;
}
next;
}
{ state = -1; next; }
END {
print count;
}
This script assumes awk is in /usr/bin (the usual case). You would put the script in a file, e.g., "patterns", and run it like
./patterns < p.txt
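Both answers can be checked against generated input. The sketch below is a variation on the same cursor idea, not either answer verbatim: the function name count_runs and the variable want are made up, and it assumes one value per line on standard input.

```shell
# count_runs is a hypothetical helper: count complete 0..99 runs on stdin.
count_runs() {
    awk '
        BEGIN { want = 0; count = 0 }
        {
            if ($0 == want) want++              # line continues the current run
            else want = ($0 == 0) ? 1 : 0       # restart; a 0 already begins a new run
            if (want == 100) { count++; want = 0 }
        }
        END { print count }'
}

# Two full runs, a stray 5, then an incomplete run of 0..49.
awk 'BEGIN {
    for (r = 0; r < 2; r++) for (i = 0; i < 100; i++) print i
    print 5
    for (i = 0; i < 50; i++) print i
}' | count_runs
```

Initialising want in BEGIN keeps a non-numeric or empty first line from being compared against an uninitialised variable.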

awk: More elegant way to filter a file with another one

I've recently approached the incredibly fast awk since I needed to parse very big files.
I had to parse this kind of input...
ID 001R_FRG3G Reviewed; 256 AA.
AC Q6GZX4;
[...]
SQ SEQUENCE 256 AA; 29735 MW; B4840739BF7D4121 CRC64;
MAFSAEDVLK EYDRRRRMEA LLLSLYYPND RKLLDYKEWS PPRVQVECPK APVEWNNPPS
EKGLIVGHFS GIKYKGEKAQ ASEVDVNKMC CWVSKFKDAM RRYQGIQTCK IPGKVLSDLD
AKIKAYNLTV EGVEGFVRYS RVTKQHVAAF LKELRHSKQY ENVNLIHYIL TDKRVDIQHL
EKDLVKDFKA LVESAHRMRQ GHMINVKYIL YQLLKKHGHG PDGPDILTVK TGSKGVLYDD
SFRKIYTDLG WKFTPL
//
ID 002L_FRG3G Reviewed; 320 AA.
AC Q6GZX3;
[...]
SQ SEQUENCE 320 AA; 34642 MW; 9E110808B6E328E0 CRC64;
MSIIGATRLQ NDKSDTYSAG PCYAGGCSAF TPRGTCGKDW DLGEQTCASG FCTSQPLCAR
IKKTQVCGLR YSSKGKDPLV SAEWDSRGAP YVRCTYDADL IDTQAQVDQF VSMFGESPSL
AERYCMRGVK NTAGELVSRV SSDADPAGGW CRKWYSAHRG PDQDAALGSF CIKNPGAADC
KCINRASDPV YQKVKTLHAY PDQCWYVPCA ADVGELKMGT QRDTPTNCPT QVCQIVFNML
DDGSVTMDDV KNTINCDFSK YVPPPPPPKP TPPTPPTPPT PPTPPTPPTP PTPRPVHNRK
VMFFVAGAVL VAILISTVRW
//
ID 004R_FRG3G Reviewed; 60 AA.
AC Q6GZX1; dog;
[...]
SQ SEQUENCE 60 AA; 6514 MW; 12F072778EE6DFE4 CRC64;
MNAKYDTDQG VGRMLFLGTI GLAVVVGGLM AYGYYYDGKT PSSGTSFHTA SPSFSSRYRY
...filter it with a file like this...
Q6GZX4
dog
...to get an output like this:
Q6GZX4 MAFSAEDVLKEYDRRRRMEALLLSLYYPNDRKLLDYKEWSPPRVQVECPKAPVEWNNPPSEKGLIVGHFSGIKYKGEKAQASEVDVNKMCCWVSKFKDAMRRYQGIQTCKIPGKVLSDLDAKIKAYNLTVEGVEGFVRYSRVTKQHVAAFLKELRHSKQYENVNLIHYILTDKRVDIQHLEKDLVKDFKALVESAHRMRQGHMINVKYILYQLLKKHGHGPDGPDILTVKTGSKGVLYDDSFRKIYTDLGWKFTPL 256
dog MNAKYDTDQGVGRMLFLGTIGLAVVVGGLMAYGYYYDGKTPSSGTSFHTASPSFSSRYRY 60
To do this, I came up with this code:
BEGIN{
while(getline<"filterFile.txt">0)B[$1];
}
{
if ($1=="ID")
len=$4;
else{
if ($1=="AC"){
acc=0;
line = substr($0,6,length($0)-6);
split(line,A,"; ");
for (i in A){
if (A[i] in B){
acc=A[i];
}
}
if (acc){
printf acc"\t";
}
}
if (acc){
if(substr($0, 1, 5) == "     "){
printf $1$2$3$4$5$6;
}
if ($1 == "//"){
print "\t"len
}
}
}
}
However, since I've seen many examples of similar tasks done with awk, I think there probably is a much more elegant and efficient way to do it. But I can't really grasp the super-compact examples usually found around the internet.
Since this is my input, my output and my code I think this is a good occasion to understand more of awk optimization in terms of performance and coding-style, if some awk-guru has some time and patience to spend in this task.

Perl to the rescue:
#!/usr/bin/perl
use warnings;
use strict;
open my $FILTER, '<', 'filterFile.txt' or die $!;
my %wanted; # Hash of the wanted ids.
chomp, $wanted{$_} = 1 for <$FILTER>;
$/ = "//\n"; # Record separator.
while (<>) {
my ($id_string) = /^ AC \s+ (.*) /mx;
my @ids = split /\s*;\s*/, $id_string;
if (my ($id) = grep $wanted{$_}, @ids) {
print "$id\t";
my ($seq) = /^ SQ \s+ .* $ ((?s:.*)) /mx;
$seq =~ s/\s+//g; # Remove whitespace.
$seq =~ s=//$==; # Remove the final //.
print "$seq\t", length $seq, "\n";
}
}

An awk solution with a different field separator (this way you avoid using substr and split):
BEGIN {
while (getline<"filterFile.txt">0) filter[$1] = 1;
FS = "[ \t;]+"; OFS = ""; ORS = "";
}
{
if (flag) {
if (len)
if ($1 == "//") {
print "\t" len "\n";
flag = 0; len = 0;
} else {
$1 = $1;
print;
}
else if ($1 == "SQ") len = $3;
} else if ($1 == "AC") {
for (i = 1; ++i < NF;)
if (filter[$i]) {
flag = 1;
print $i "\t";
break;
}
}
}
END { if (flag) print "\t" len }
Note: this code is designed to be fast rather than short. That's why I didn't try to remove the nested if/else conditions; instead I tried to reduce the total number of tests performed over the whole file as much as possible.
However, after several changes since my first version and several benchmarks, I must admit that choroba's Perl version is a little faster.

For that kind of task, an idea is to pipe your second file through awk or sed in order to create on the fly a new awk script parsing the big file. As an example:
Control file (f1):
test
dog
Data (f2):
tree 5
test 2
nothing
dog 1
An idea to start with:
sed 's/^\(.*\)$/\/\1\/ {print $2}/' f1 | awk -f - f2
(where -f - means: read the awk script from the standard input rather than from a named file).

This may not be much shorter than the original, but splitting the work across multiple awk scripts makes the code simpler. The first awk generates the records of interest, the second extracts the information, and the third formats it.
$ awk 'NR==FNR{keys[$0];next}
{RS="//";
for(k in keys)
if($0~k)
{print "key",k; print $0}}' keys file |
awk '/key/{key=$2;f=0;next}
/SQ/{f=1;print "\n\n"key,$3;next}
f{gsub(" ","");printf "%s", $0}
END{print}' |
awk -vRS= -vOFS="\t" '{print $1,$3,$2}'
will print
Q6GZX4 MAFSAEDVLKEYDRRRRMEALLLSLYYPNDRKLLDYKEWSPPRVQVECPKAPVEWNNPPSEKGLIVGHFSGIKYKGEKAQASEVDVNKMCCWVSKFKDAMRRYQGIQTCKIPGKVLSDLDAKIKAYNLTVEGVEGFVRYSRVTKQHVAAFLKELRHSKQYENVNLIHYILTDKRVDIQHLEKDLVKDFKALVESAHRMRQGHMINVKYILYQLLKKHGHGPDGPDILTVKTGSKGVLYDDSFRKIYTDLGWKFTPL 256
dog MNAKYDTDQGVGRMLFLGTIGLAVVVGGLMAYGYYYDGKTPSSGTSFHTASPSFSSRYRY 60

Your code looks almost OK as-is. Keep it simple, single-pass like that.
Only a couple suggestions:
1) The business around the split is too messy/brittle. Maybe try it this way:
acc="";
n=split($0,A,"[; ]+");
for (i=2;i<=n;++i){
if (A[i] in B){
acc=A[i];
break;
}
}
2) Don't use input data in the first argument to your printfs. You never know when something that looks like printf formatting might come in and really mess things up:
printf "%s\t",acc;
printf "%s%s%s%s%s%s",$1,$2,$3,$4,$5,$6;
Update with one more possible "elegance":
3) The awk style of pattern{action} is already a form of if/then, so you can avoid a lot of your outer if/then nesting:
$1=="ID" {len=$4}
$1=="AC" {
acc="";
...
}
acc {
if(substr($0, 1, 5) == "     "){
...
}
}
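Put together, the three suggestions might look like the sketch below. It is one possible arrangement rather than the asker's final script: the function name filter_sequences and the array name wanted are hypothetical, and it assumes that sequence lines start with blanks, that every record ends with //, and that the filter file is not empty.

```shell
# filter_sequences is a hypothetical helper.
# $1 = filter file (one accession per line), $2 = flat file to scan.
filter_sequences() {
    awk '
        NR == FNR  { if (NF) wanted[$1]; next }     # first file: remember accessions
        $1 == "ID" { len = $4 }                     # sequence length from the ID line
        $1 == "AC" && acc == "" {                   # look for a wanted accession
            n = split($0, A, "[; ]+")
            for (i = 2; i <= n; ++i)
                if (A[i] in wanted) { acc = A[i]; break }
            if (acc != "") printf "%s\t", acc
        }
        acc != "" && /^ / {                         # sequence data of a wanted record
            gsub(/ /, "")
            printf "%s", $0
        }
        acc != "" && $1 == "//" {                   # end of record: append the length
            printf "\t%s\n", len
            acc = ""
        }
    ' "$1" "$2"
}

# Usage: filter_sequences filterFile.txt inputfile
```

Each printf uses a fixed format string, so a stray % in the data is printed literally instead of being interpreted as a conversion.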

In Vim it's actually a one-liner to find the pattern:
/^AC.\{-}Q6GZX4;\_.\{-}\nSQ\_.\{-}\n\zs\_.\{-}\ze\/\//
where Q6GZX4; is your pattern to find in order to match the sequence characters.
The above basically will do:
Search for the line with AC at the beginning (^) which is followed by Q6GZX4;.
Follow across multiple lines (\_.\{-}) to the line starting with SQ (\nSQ).
Then follow to the next line ignoring what's in the current (\_.\{-}\n).
Now start selecting the main pattern (\zs), which is basically everything across multiple lines (\_.\{-}) until (\ze) the // pattern is found.
Then execute normal-mode Vim commands (norm) which select the match (gn) and yank it into the x register ("xy).
You may now print the register (echo @x) or remove whitespace characters from it.
This can be extended into Ex editor script as below (e.g. cmd.ex):
let s="Q6GZX4"
exec '/^AC.\{-}' . s . ';\_.\{-}\nSQ\_.\{-}\n\zs\_.\{-}\ze\/\//norm gn"xy'
let @x=substitute(@x,'\W','','g')
silent redi>>/dev/stdout
echon s . " " . @x
redi END
q!
Then run from the command-line as:
$ ex inputfile < cmd.ex
Q6GZX4 MAFSAEDVLKEYDRRRRMEALLLSLYYPNDRKLLDYKEWSPPRVQVECPKAPVEWNNPPSEKGLIVGHFSGIKYKGEKAQASEVDVNKMCCWVSKFKDAMRRYQGIQTCKIPGKVLSDLDAKIKAYNLTVEGVEGFVRYSRVTKQHVAAFLKELRHSKQYENVNLIHYILTDKRVDIQHLEKDLVKDFKALVESAHRMRQGHMINVKYILYQLLKKHGHGPDGPDILTVKTGSKGVLYDDSFRKIYTDLGWKFTPL
The above example can be further extended for multiple files or matches.

awk 'FNR == NR { aFilter[ $1 ";"] = $1; next }
/^AC/ {
if (String !~ /^$/) print Taken "\t" String "\t" Len
Taken = ""; String = ""
for ( i = 2; i <= NF && Taken ~ /^$/; i++) {
if( $i in aFilter) Taken = aFilter[ $i]
}
Take = Taken !~ /^$/
next
}
Take && /^SQ/ { Len = $3; next }
Take && /^[[:blank:]]/ {
gsub( /[[:blank:]]*/, "")
String = String $0
}
END { if( String !~ /^$/) print Taken "\t" String "\t" Len }
' filter.txt YourFile
Not really shorter, but maybe a bit more generic. The heavy part is extracting the values that serve as the filter from the line.

Print lines indexed by a second file

I have two files:
File with strings (new line terminated)
File with integers (one per line)
I would like to print the lines from the first file indexed by the lines in the second file. My current solution is to do this
while read index
do
sed -n ${index}p $file1
done < $file2
It essentially reads the index file line by line and runs sed to print that specific line. The problem is that it is slow for large index files (thousands and ten thousands of lines).
Is it possible to do this faster? I suspect awk can be useful here.
I searched SO as best I could but could only find people trying to print line ranges rather than indexing by a second file.
UPDATE
The index is generally not shuffled. It is expected for the lines to appear in the order defined by indices in the index file.
EXAMPLE
File 1:
this is line 1
this is line 2
this is line 3
this is line 4
File 2:
3
2
The expected output is:
this is line 3
this is line 2

If I understand you correctly, then
awk 'NR == FNR { selected[$1] = 1; next } selected[FNR]' indexfile datafile
should work, under the assumption that the index is sorted in ascending order or you want lines to be printed in their order in the data file regardless of the way the index is ordered. This works as follows:
NR == FNR { # while processing the first file
selected[$1] = 1 # remember if an index was seen
next # and do nothing else
}
selected[FNR] # after that, select (print) the selected lines.
If the index is not sorted and the lines should be printed in the order in which they appear in the index:
NR == FNR { # processing the index:
++counter
idx[$0] = counter # remember that and at which position you saw
next # the index
}
FNR in idx { # when processing the data file:
lines[idx[FNR]] = $0 # remember selected lines by the position of
} # the index
END { # and at the end: print them in that order.
for(i = 1; i <= counter; ++i) {
print lines[i]
}
}
This can be inlined as well (with semicolons after ++counter and idx[$0] = counter), but I'd probably put it in a file, say foo.awk, and run awk -f foo.awk indexfile datafile. With an index file
1
4
3
and a data file
line1
line2
line3
line4
this will print
line1
line4
line3
The remaining caveat is that this assumes that the entries in the index are unique. If that, too, is a problem, you'll have to remember a list of index positions, split it while scanning the data file and remember the lines for each position. That is:
NR == FNR {
++counter
idx[$0] = idx[$0] " " counter # remember a list here
next
}
FNR in idx {
split(idx[FNR], pos) # split that list
for(p in pos) {
lines[pos[p]] = $0 # and remember the line for
# all positions in them.
}
}
END {
for(i = 1; i <= counter; ++i) {
print lines[i]
}
}
This, finally, is the functional equivalent of the code in the question. How complicated you have to go for your use case is something you'll have to decide.
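A somewhat simpler variant of the last script stores the index order in one array and the needed lines in another, which handles repeated and unsorted indices without splitting position lists. This is a sketch under the assumption that the index file is non-empty and holds plain integers; the function name print_indexed and the array names are hypothetical.

```shell
# print_indexed is a hypothetical helper.
# $1 = index file (one line number per line), $2 = data file.
print_indexed() {
    awk '
        NR == FNR   { order[++n] = $1; need[$1]; next }   # index file: keep order and set
        FNR in need { line[FNR] = $0 }                    # data file: keep wanted lines only
        END         { for (i = 1; i <= n; i++) print line[order[i]] }
    ' "$1" "$2"
}

# Usage: print_indexed indexfile datafile
```

Only the selected lines are held in memory, and the data file is still read just once.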

This awk script does what you want:
$ cat lines
1
3
5
$ cat strings
string 1
string 2
string 3
string 4
string 5
$ awk 'NR==FNR{a[$0];next}FNR in a' lines strings
string 1
string 3
string 5
The first block only runs for the first file, where the line number for the current file FNR is equal to the total line number NR. It sets a key in the array a for each line number that should be printed. next skips the rest of the instructions. For the file containing the strings, if the line number is in the array, the default action is performed (so the line is printed).

Use nl to number the lines in your strings file, then use join to merge the two:
~ $ cat index
1
3
5
~ $ cat strings
a
b
c
d
e
~ $ join index <(nl strings)
1 a
3 c
5 e
If you want the inverse (show the lines that are NOT in your index):
$ join -v 2 index <(nl strings)
2 b
4 d
Mind also the comment by @glennjackman: if your files are not lexically sorted, then you need to sort them before passing them in:
$ join <(sort index) <(nl strings | sort -b)

In order to complete the answers that use awk, here's a solution in Python that you can use from your bash script:
cat << EOF | python
import sys
lines = []
with open("$file2") as f:
    for line in f:
        lines.append(int(line))
i = 0
with open("$file1") as f:
    for line in f:
        i += 1
        if i in lines:
            sys.stdout.write(line)
EOF
The only advantage here is that Python is much easier to understand than awk :).

Uniq in awk; removing duplicate values in a column using awk

I have a large data file in the following format:
ENST00000371026 WDR78,WDR78,WDR78, WD repeat domain 78 isoform 1,WD repeat domain 78 isoform 1,WD repeat domain 78 isoform 2,
ENST00000371023 WDR32 WD repeat domain 32 isoform 2
ENST00000400908 RERE,KIAA0458, atrophin-1 like protein isoform a,Homo sapiens mRNA for KIAA0458 protein, partial cds.,
The columns are tab separated. Multiple values within columns are comma separated. I would like to remove the duplicate values in the second column to result in something like this:
ENST00000371026 WDR78 WD repeat domain 78 isoform 1,WD repeat domain 78 isoform 1,WD repeat domain 78 isoform 2,
ENST00000371023 WDR32 WD repeat domain 32 isoform 2
ENST00000400908 RERE,KIAA0458 atrophin-1 like protein isoform a,Homo sapiens mRNA for KIAA0458 protein, partial cds.,
I tried the following code below but it doesn't seem to remove the duplicate values.
awk '
BEGIN { FS="\t" } ;
{
split($2, valueArray,",");
j=0;
for (i in valueArray)
{
if (!( valueArray[i] in duplicateArray))
{
duplicateArray[j] = valueArray[i];
j++;
}
};
printf $1 "\t";
for (j in duplicateArray)
{
if (duplicateArray[j]) {
printf duplicateArray[j] ",";
}
}
printf "\t";
print $3
}' knownGeneFromUCSC.txt
How can I remove the duplicates in column 2 correctly?

Your script acts only on the second record (line) in the file because of NR==2. I took it out, but it may be what you intend. If so, you should put it back.
The in operator checks for the presence of the index, not the value, so I made duplicateArray an associative array* that uses the values from valueArray as its indices. This saves from having to iterate over both arrays in a loop within a loop.
The split statement sees "WDR78,WDR78,WDR78," as four fields rather than three so I added an if to keep it from printing a null value which would result in ",WDR78," being printed if the if weren't there.
* In reality all arrays in AWK are associative.
awk '
BEGIN { FS="\t" } ;
{
split($2, valueArray,",");
j=0;
for (i in valueArray)
{
if (!(valueArray[i] in duplicateArray))
{
duplicateArray[valueArray[i]] = 1
}
};
printf $1 "\t";
for (j in duplicateArray)
{
if (j) # prevents printing an extra comma
{
printf j ",";
}
}
printf "\t";
print $3
delete duplicateArray # for non-gawk, use split("", duplicateArray)
}'
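Since for (j in duplicateArray) visits the keys in an unspecified order, the values may come out reordered. The sketch below keeps the first occurrence of each value in its original position; it is an alternative arrangement rather than the answer's code, and the function name dedupe_col2 and the array name seen are hypothetical. It assumes tab-separated input and drops the trailing comma in column 2.

```shell
# dedupe_col2 is a hypothetical helper: remove repeated comma-separated
# values from column 2 of tab-separated input, keeping first occurrences.
dedupe_col2() {
    awk 'BEGIN { FS = OFS = "\t" }
    {
        n = split($2, parts, ",")
        out = ""
        split("", seen)                              # portable way to clear an array
        for (i = 1; i <= n; i++) {
            if (parts[i] == "" || (parts[i] in seen)) continue
            seen[parts[i]]                           # the value becomes an index
            out = (out == "" ? "" : out ",") parts[i]
        }
        $2 = out
        print
    }' "$@"
}

# Usage: dedupe_col2 knownGeneFromUCSC.txt
```

The in test works here because each value is stored as an array index, which is the point made in the explanation above.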

Perl:
perl -F'\t' -lane'
$F[1] = join ",", grep !$_{$_}++, split ",", $F[1];
print join "\t", @F; %_ = ();
' infile
awk:
awk -F'\t' '{
n = split($2, t, ","); _2 = x
split(x, _) # use delete _ if supported
for (i = 0; ++i <= n;)
_[t[i]]++ || _2 = _2 ? _2 "," t[i] : t[i]
$2 = _2
}-3' OFS='\t' infile
The line 4 in the awk script is used to preserve the original order of the values in the second field after filtering the unique values.

Sorry, I know you asked about awk... but Perl makes this much simpler:
$ perl -n -e ' @t = split(/\t/);
%t2 = map { $_ => 1 } split(/,/,$t[1]);
$t[1] = join(",",keys %t2);
print join("\t",@t); ' knownGeneFromUCSC.txt

Pure Bash 4.0 (one associative array):
declare -a part # parts of a line
declare -a part2 # parts 2. column
declare -A check # used to remember items in part2
while read line ; do
part=( $line ) # split line using whitespaces
IFS=',' # separator is comma
part2=( ${part[1]} ) # split 2. column using comma
if [ ${#part2[@]} -gt 1 ] ; then # more than 1 field in 2. column?
check=() # empty check array
new2='' # empty new 2. column
for item in ${part2[@]} ; do
(( check[$item]++ )) # remember items in 2. column
if [ ${check[$item]} -eq 1 ] ; then # not yet seen?
new2=$new2,$item # add to new 2. column
fi
done
part[1]=${new2#,} # remove leading comma
fi
IFS=$'\t' # separator for the output
echo "${part[*]}" # rebuild line
done < "$infile"
