How to Compare two files line by line and output the whole line if different - bash

I have two sorted files:
1) a control file (ctrl.txt), which is generated by an external process
2) a line count file (count.txt), which I generate using `wc -l`
$more ctrl.txt
Thunderbird|1000
Mustang|2000
Hurricane|3000
$more count.txt
Thunder_bird|1000
MUSTANG|2000
Hurricane|3001
I want to compare these two files while ignoring cosmetic differences in column 1 (the file names), such as the "_" in Thunder_bird or the upper case in MUSTANG, so that my output shows only the file whose counts don't match:
Hurricane|3000
I have this idea to only compare second column from both the files and output whole line if they are different
I have seen other examples in AWK but I could not get anything to work.

Could you please try following awk and let me know if this helps you.
awk -F"|" 'FNR==NR{gsub(/_/,"");a[tolower($1)]=$2;next} {gsub(/_/,"")} ((tolower($1) in a) && $2!=a[tolower($1)])' ctrl.txt count.txt
Adding a non-one liner form of solution too now.
awk -F"|" '
FNR==NR{
gsub(/_/,"");
a[tolower($1)]=$2;
next}
{ gsub(/_/,"") }
((tolower($1) in a) && $2!=a[tolower($1)])
' ctrl.txt count.txt
Explanation: Adding explanation too here for above code.
awk -F"|" ' ##Setting the field separator to | (pipe) here for all lines in the Input_file(s).
FNR==NR{ ##Checking condition FNR==NR, which is TRUE only while the first Input_file (ctrl.txt in this case) is being read. The following instructions are executed when this condition is TRUE.
gsub(/_/,""); ##Using the gsub function of awk to globally substitute _ with an empty string in the current line.
a[tolower($1)]=$2; ##Creating an array named a whose index is the first field in LOWER CASE (to ignore case differences) and whose value is $2 of the current line.
next} ##next is a built-in awk keyword that skips all further instructions for the current line (so the statements below only run while the 2nd Input_file, count.txt, is being read).
{ gsub(/_/,"") } ##Statements from here on are executed while the 2nd Input_file is being read; gsub removes all occurrences of _ from the line.
((tolower($1) in a) && $2!=a[tolower($1)]) ##Checking whether the lower-case form of $1 is present in array a and $2 of the current line is NOT equal to the value stored in a. If this condition is TRUE, the current line is printed: no action is given, so awk by default prints the current line (from count.txt, with its underscores removed).
' ctrl.txt count.txt ##Mentioning the Input_file names here which we have to pass to awk.
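The expected output in the question is the control-file line (Hurricane|3000), whereas the command above prints the line from count.txt after its underscores have been removed. As a minimal sketch, assuming the control-file line is what should be reported, the variant below normalizes only the lookup key and leaves the lines themselves untouched. The helper norm() and the array names are introduced here for illustration and are not part of the original answer; the sample files are recreated in a temporary directory.

```shell
# Recreate the two sample files from the question in a scratch directory.
tmpdir=$(mktemp -d)
printf '%s\n' 'Thunderbird|1000' 'Mustang|2000' 'Hurricane|3000' > "$tmpdir/ctrl.txt"
printf '%s\n' 'Thunder_bird|1000' 'MUSTANG|2000' 'Hurricane|3001' > "$tmpdir/count.txt"

result=$(awk -F'|' '
  # norm() is a helper made up for this sketch: drop underscores, lower-case.
  function norm(s) { gsub(/_/, "", s); return tolower(s) }
  FNR == NR { count[norm($1)] = $2; line[norm($1)] = $0; next }  # first file: ctrl.txt
  { k = norm($1) }                                               # second file: count.txt
  (k in count) && $2 != count[k] { print line[k] }               # report the control line
' "$tmpdir/ctrl.txt" "$tmpdir/count.txt")

echo "$result"
rm -rf "$tmpdir"
```

Printing $0 instead of line[k] in the last rule would report the count.txt line (Hurricane|3001) in its original spelling.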

Related

How to count a matching pattern in one line?

I have a fasta file containing sequences
>lcl|QCYY01003067.1_cds_ROT65593.1_2
ATGCGTCTCCCCTTTAGAGAGTTCTCTCTAGCTACGTA
>lcl|QCYY01003067.1_cds_ROT65593.1_3
ATCTCTNNNNNNNNNNATATCCCCTTTNNNNNCTCTCT
>lcl|QCYY01003067.1_cds_ROT65593.1_4
ATCTCTNNNNNNNNNNATATCCCCTTCTCGGGGCCCC
I want to count the number of 'N' characters and also the number of patterns (runs of consecutive Ns) occurring in each line. There is no need to include the header lines (e.g. >lcl|QCYY01003067.1_cds_ROT65593.1_2).
eg:-
line 2=0,0
line 4=15,2
line 6=10,1
How to improve this code:
grep -n '[{N}]' <filename> | cut -d : -f 1 | uniq -c
Another awk:
$ awk 'NR%2==0{printf "line %d=%d,%d\n",NR,gsub(/N/,"N"),gsub(/N+/,"")}' file
Output:
line 2=0,0
line 4=15,2
line 6=10,1
Explained:
$ awk '
NR%2==0 { # process even records
printf "line %d=%d,%d\n",NR,gsub(/N/,"N"),gsub(/N+/,"") # count with gsub
}' file
gsub(/N/,"N") counts the number of Ns in the record (gsub returns the number of replacements made). gsub(/N+/,"") counts the number of runs of consecutive Ns. Notice that "" removes those Ns from the record, so if you need to process the data further afterwards, use gsub(/N+/,"&") instead.
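The two counts can also be taken in separate statements with "&" as the replacement, so the record stays intact and the result does not rely on the order in which awk evaluates the printf arguments. A small sketch, using made-up sequence data (seq_a and seq_b are hypothetical) and skipping header lines by pattern rather than by line number:

```shell
# Hypothetical two-record FASTA file, one sequence line per header.
fasta=$(mktemp)
printf '%s\n' '>seq_a' 'ATGCGT' '>seq_b' 'ATNNNGGNNCT' > "$fasta"

result=$(awk '
  !/^>/ {
    total = gsub(/N/, "&")     # number of N characters; "&" puts each match back
    runs  = gsub(/N+/, "&")    # number of runs of consecutive N
    printf "line %d=%d,%d %s\n", FNR, total, runs, $0
  }' "$fasta")

echo "$result"
rm -f "$fasta"
```

The sequence is printed after the counts only to show that the record is unchanged.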
Updated:
The version I wrote for your already-deleted next question.
I added an extra record to your data which demonstrates the question I asked in the comments (is ...N\nNN.. one pattern (NNN) or two patterns (N,NN) by your definition):
...
>seq4
ATCTCTNNNNNNNNNNATATCCCCTTCTCGGGGCCNNN
NNNNNTTTTTCTCTCTCGCGCTCGTCGAAAAATGCCCC
This one is for GNU awk (for using RT):
$ gawk '
BEGIN {
RS=">seq[^\n]+"
}
NR>1 {
# gsub(/\n/,"") # UNCOMMENT THIS IF NEWLINE SEPARATED PATTERN IS ONE PATTERN
printf "%s=%d,%d\n",rt,gsub(/N/,"N"),gsub(/N+/,"")
}
{
rt=RT
}' file
Output (pay special attention to the seq4):
>seq1=0,0
>seq2=15,2
>seq3=15,2
>seq4=18,3
or if you uncomment the gsub(/\n/,"") to remove the newline separating strings, the output is:
>seq1=0,0
>seq2=15,2
>seq3=15,2
>seq4=18,2
One-liner (with the one gsub uncommented):
$ gawk 'BEGIN{RS=">seq[^\n]+"}NR>1{gsub(/\n/,"");printf "%s=%d,%d\n",rt,gsub(/N/,"N"),gsub(/N+/,"")}{rt=RT}' file
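For awk implementations without RT, a similar per-sequence count can be sketched by collecting the sequence lines under each header and reporting when the next header (or the end of input) is reached. This version assumes that a run of N wrapped across a line break counts as one pattern; the function report() and the sample records are hypothetical and not taken from the answer above.

```shell
# Hypothetical FASTA file with sequences wrapped over several lines.
fasta=$(mktemp)
printf '%s\n' '>seq1' 'ATGC' 'GGCC' '>seq2' 'ATNN' 'NNGT' 'CNNA' > "$fasta"

result=$(awk '
  # report() prints the counts for the sequence collected so far.
  function report() {
    if (name == "") return
    total = gsub(/N/, "&", seq)
    runs  = gsub(/N+/, "&", seq)
    printf "%s=%d,%d\n", name, total, runs
  }
  /^>/ { report(); name = $0; seq = ""; next }   # new header: flush the previous record
  { seq = seq $0 }                               # join wrapped sequence lines
  END { report() }                               # flush the last record
' "$fasta")

echo "$result"
rm -f "$fasta"
```

To count a wrapped run as two patterns instead, the lines could be joined with a separator character that is not N.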
Could you please try following.
awk '
!/^>/{
while(match($0,/N+/)){
count++
total+=length(substr($0,RSTART,RLENGTH))
$0=substr($0,RSTART+RLENGTH)
}
printf("%s %d=%d,%d\n","line",FNR,total,count)
count=total=""
}
' Input_file
Output will be as follows.
line 2=0,0
line 4=15,2
line 6=10,1
Explanation: Adding detailed explanation for above code here.
awk ' ##Starting awk program from here.
!/^>/{ ##Checking condition: if a line does NOT start with > then do the following.
while(match($0,/N+/)){ ##Running a while loop for as long as a match is found for a run of consecutive N characters.
count++ ##Incrementing variable count by 1 each time the cursor comes here.
total+=length(substr($0,RSTART,RLENGTH)) ##Adding the length of the matched text (a run of consecutive N characters in the current line) to variable total.
$0=substr($0,RSTART+RLENGTH) ##Resetting the current line to only the REST of the line, starting from the character right after the matched text, so that we skip the previous match and look for further ones in the rest of the line.
} ##Closing BLOCK for above mentioned while loop here.
printf("%s %d=%d,%d\n","line",FNR,total,count) ##Printing the word line and the values of FNR, total and count here.
count=total="" ##Nullifying variables count and total here, so that previous values are NOT added to the values of the next line.
}
' Input_file ##Mentioning Input_file name here.
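The same loop can be run on a copy of the line, so that $0 is still available afterwards, and RLENGTH already holds the length of the match. A sketch under those assumptions, with a hypothetical one-sequence input (the variable name rest is chosen for illustration):

```shell
result=$(printf '%s\n' '>seq_a' 'ATNNNGGNNCT' | awk '
  !/^>/ {
    rest = $0; runs = 0; total = 0       # work on a copy, keep $0 untouched
    while (match(rest, /N+/)) {
      runs++
      total += RLENGTH                   # length of the current run of N
      rest = substr(rest, RSTART + RLENGTH)
    }
    printf "line %d=%d,%d %s\n", FNR, total, runs, $0
  }')

echo "$result"
```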

replace names in fasta

I want to change the sequence names in a fasta file according to a text file containing the new names. I found several approaches, and seqkit ("Replace key with value by key-value file") made a good impression, but I can't get it running.
The fasta file seq.fa looks like
>BC1
ATGCATGCATGCATGCATGCATGCATGCATGCATGCATGC
ATGCATGCATGCATGCATGCATGCATGCATGCATGCATGC
>BC2
TGCATGCATGCATGCATGCATGCATGCATGCATGCATGCG
GCATGCATGCATGCATGCATGCATGCATGCATGCG
>BC3
GCATGCATGCATGCATGCATGCATGCATGCATGCCCCCCC
TGCATGCATGCATG
and the ref.txt tab delimited text file like
BC1 1234
BC2 1235
BC3 1236
Using seqkit in Git Bash runs through the file but doesn't change the names.
seqkit replace -p' (.+)$' -r' {kv}' -k ref.txt seq.fa --keep-key
I'm used to R and new to bash and can't find the bug, but I guess I need to adjust for tab and _ ?
In the example at https://bioinf.shenwei.me/seqkit/usage/#replace (part 7, "Replace key with value by key-value file"), the sequence name is tab delimited and only the second part is replaced.
Any advice on how to adjust the code?
The desired outcome should look like this, with BC1 replaced by its number from the text file, 1234:
>1234
ATGCATGCATGCATGCATGCATGCATGCATGCATGCATGC
ATGCATGCATGCATGCATGCATGCATGCATGCATGCATGC
>1235
TGCATGCATGCATGCATGCATGCATGCATGCATGCATGCG
GCATGCATGCATGCATGCATGCATGCATGCATGCG
>1236
GCATGCATGCATGCATGCATGCATGCATGCATGCCCCCCC
TGCATGCATGCATG
could you please try following.
awk '
FNR==NR{
a[$1]=$2
next
}
($2 in a) && /^>/{
print ">"a[$2]
next
}
1
' ref.txt FS="[> ]" seq.fa
Explanation: Adding detailed explanation for above code.
awk ' ##Starting awk program here.
FNR==NR{ ##FNR==NR is condition which will be TRUE when 1st Input_file named ref.txt will be read.
a[$1]=$2 ##Creating an array named a whose index is $1 and value is $2 of current line.
next ##next will skip all further statements from here.
} ##Closing BLOCK for FNR==NR condition here.
($2 in a) && /^>/{ ##Checking condition if $2 of current line is present in array a and starts with > then do following.
print ">"a[$2] ##Printing > and value of array a whose index is $2.
next ##next will skip all further statements from here.
}
1 ##Mentioning 1 will print the lines(those which are NOT starting with > in Input_file seq.fa)
' ref.txt FS="[> ]" seq.fa ##Mentioning Input_file names here and setting FS= either space or > for Input_file seq.fa here.
EDIT: As per the OP's comment, the output should also carry an occurrence number (e.g. >1234_1), so adding the following code now.
awk '
FNR==NR{
a[$1]=$2
b[$1]=++c[$2]
next
}
($2 in a) && /^>/{
print ">"a[$2]"_"b[$2]
next
}
1
' ref.txt FS="[> ]" seq.fa
awk solution that doesn't require GNU awk:
awk 'NR==FNR{a[$1]=$2;next}
NF==2{$2=a[$2]; print ">" $2;next}
1' FS='\t' ref.txt FS='>' seq.fa
The first statement fills the array a with the content of the tab-delimited file ref.txt.
The second statement handles the header lines of the second file seq.fa (the lines that split into 2 fields with > as the field delimiter): it looks up the new name in a and prints it after a >.
The last statement prints all remaining lines of that same file unchanged.
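Headers without an entry in ref.txt end up with an empty name in the last solution. A minimal sketch, assuming such headers should be kept unchanged and that the whole header after > is the key; BC9 is a made-up ID that is deliberately missing from the mapping:

```shell
# Shortened, hypothetical versions of ref.txt and seq.fa.
dir=$(mktemp -d)
printf 'BC1\t1234\nBC2\t1235\n' > "$dir/ref.txt"
printf '%s\n' '>BC1' 'ATGC' '>BC9' 'GGCC' > "$dir/seq.fa"

result=$(awk -F'\t' '
  NR == FNR { new[$1] = $2; next }       # first file: old name -> new name
  /^>/ {
    id = substr($0, 2)                   # header text without the leading >
    if (id in new) $0 = ">" new[id]      # rename only when a mapping exists
  }
  { print }
' "$dir/ref.txt" "$dir/seq.fa")

echo "$result"
rm -rf "$dir"
```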

Replace header of one column by file name

I have about 100 comma-separated text files with eight columns.
Example of two file names:
sample1_sorted_count_clean.csv
sample2_sorted_count_clean.csv
Example of file content:
Domain,Phylum,Class,Order,Family,Genus,Species,Count
Bacteria,Proteobacteria,Alphaproteobacteria,Sphingomonadales,Sphingomonadaceae,Zymomonas,Zymomonas mobilis,0.0
Bacteria,Bacteroidetes,Flavobacteria,Flavobacteriales,Flavobacteriaceae,Zunongwangia,Zunongwangia profunda,0.0
For each file, I would like to replace the column header "Count" by sample ID, which is contained in the first part of the file name (sample1, sample2)
In the end, the header should then look like this:
Domain,Phylum,Class,Order,Family,Genus,Species,sample1
If I use my code, the header looks like this:
Domain,Phylum,Class,Order,Family,Genus,Species,${f%_clean.csv}
for f in *_clean.csv; do echo ${f}; sed -e "1s/Domain,Phylum,Class,Order,Family,Genus,Species,RPMM/Domain,Phylum,Class,Order,Family,Genus,Species,${f%_clean.csv}/" ${f} > ${f%_clean.csv}_clean2.csv; done
I also tried:
for f in *_clean.csv; do gawk -F"," '{$NF=","FILENAME}1' ${f} > t && mv t ${f%_clean.csv}_clean2.csv; done
In this case, "Count" is replaced by the entire file name, but every row of the column now contains the file name and the count values are gone. This is not what I want.
Do you have any ideas on what else I may try?
Thank you very much in advance!
Anna
If you are ok with awk, could you please try following.
awk 'BEGIN{FS=OFS=","} FNR==1{var=FILENAME;sub(/_.*/,"",var);$NF=var} 1' *.csv
EDIT: Since the OP asks that everything after the 2nd underscore be removed from the file name, try the following.
awk 'BEGIN{FS=OFS=","} FNR==1{split(FILENAME,array,"_");$NF=array[1]"_"array[2]} 1' *.csv
Explanation: Adding explanation for above code here.
awk ' ##Starting awk program from here.
BEGIN{ ##Starting BEGIN section of code from here, which will be executed before Input_file(s) are being read.
FS=OFS="," ##Setting FS and OFS as comma here for all files all lines.
} ##Closing BEGIN section here.
FNR==1{ ##Checking condition if FNR==1 which means very first line is being read for Input_file then do following.
split(FILENAME,array,"_") ##Using the built-in awk function split to split FILENAME (which contains the current file name) into an array named array with delimiter _ here.
$NF=array[1]"_"array[2] ##Setting the last field to the 1st array element, an underscore and the 2nd array element.
} ##Closing FNR==1 condition BLOCK here.
1 ##Mentioning 1 prints every line of the current Input_file, including the edited header line.
' *.csv ##Passing all *.csv files to awk program here.
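Both commands above write all files to standard output in one stream. If one output file per input is wanted, as in the question's loop, the sample ID can be derived in the shell and passed in with -v. A sketch, assuming the ID is everything before the first underscore and using shortened, made-up CSV content:

```shell
# Two hypothetical input files with a reduced set of columns.
dir=$(mktemp -d)
printf 'Domain,Species,Count\nBacteria,Zymomonas mobilis,0.0\n' > "$dir/sample1_sorted_count_clean.csv"
printf 'Domain,Species,Count\nBacteria,Zunongwangia profunda,0.0\n' > "$dir/sample2_sorted_count_clean.csv"

for f in "$dir"/*_clean.csv; do
  base=${f##*/}          # strip the directory part
  id=${base%%_*}         # keep the text before the first underscore
  awk -v id="$id" 'BEGIN { FS = OFS = "," } FNR == 1 { $NF = id } 1' "$f" \
    > "${f%_clean.csv}_clean2.csv"
done

result=$(head -n 1 "$dir/sample2_sorted_count_clean2.csv")   # new header
body=$(sed -n '2p' "$dir/sample1_sorted_count_clean2.csv")   # data row, unchanged
echo "$result"
echo "$body"
rm -rf "$dir"
```

Only the header line is rewritten; the data rows, including the count values, pass through as they are.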

Transpose rows to column after nth column in bash

I have a file like below format:
$ cat file_in.csv
1308123;28/01/2019;28/01/2019;22/01/2019
1308456;20/11/2018;27/11/2018;09/11/2018;15/11/2018;10/11/2018;02/12/2018
1308789;06/12/2018;04/12/2018
1308012;unknown
How can I transpose it as below, starting from the second column:
1308123;28/01/2019
1308123;28/01/2019
1308123;22/01/2019
1308456;20/11/2018
1308456;27/11/2018
1308456;09/11/2018
1308456;15/11/2018
1308456;10/11/2018
1308456;02/12/2018
1308789;06/12/2018
1308789;04/12/2018
1308012;unknown
I'm testing my script, but I get a wrong result:
echo "123;23/05/2018;24/05/2018" | awk -F";" 'NR==3{a=$1";";next}{a=a$1";"}END{print a}'
Thanks in advance
1st Solution: The easiest solution is to loop through all fields (with the field separator set to ;, of course) and print $1 together with each field on a new line. Note that the loop runs from i=2 up to the value of NF, leaving out the first field, since we need to print on a new line from the 2nd column onwards.
awk 'BEGIN{FS=OFS=";"} {for(i=2;i<=NF;i++){print $1,$i}}' Input_file
2nd Solution: Using the one-time substitution (sub) and global substitution (gsub) functions of awk. Here I first change the very first occurrence of ; to ### (assuming that your Input_file will NOT have these characters together; in case it does, choose any unique character(s) which are NOT in your Input_file in place of ###). Then I globally substitute all remaining occurrences of ; with ORS, val (a variable which holds the value of $1) and ;, so that each value moves to a new line prefixed with the first column. Finally I change ### in the first field back to ;. Why this approach? If we DO NOT protect the very first occurrence of ; with some other character, gsub will also place a NEW LINE right after the first column, which we DO NOT want. (Also, as per Ed sir's comment, this solution was tested with 1 Input_file and may have issues while reading multiple Input_files.)
awk 'BEGIN{FS=OFS=";"} {val=$1;sub(";","###");gsub(";",ORS val ";");sub("###",";",$1)} 1' Input_file
Another awk
awk -F";" 'NF>1{ OFS="\n" $1 ";"; $1=""; print substr($0,2) }' file
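A row that contains only an ID and no values yields no output with the loop in the 1st solution. If such rows should be passed through, a guard can be added in front of the loop. This is a sketch; the row 1308999 is hypothetical and does not appear in the question's data:

```shell
result=$(printf '%s\n' '1308789;06/12/2018;04/12/2018' '1308012;unknown' '1308999' |
  awk 'BEGIN { FS = OFS = ";" }
       NF < 2 { print; next }                        # ID only: keep the row as it is
       { for (i = 2; i <= NF; i++) print $1, $i }')  # one output line per value

echo "$result"
```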

join all lines that have the same first column to the same line

For example:
File:
1234:abcd
1234:930
1234:999999
194:keee
194:284
194:222222
Result:
1234:abcd:930:999999
194:keee:284:222222
I have exhausted my brain to the best of my knowledge and can't come up with a way. Sorry to bother you guys!
$ awk -F: '$1==last {printf ":%s",$2; next} NR>1 {print "";} {last=$1; printf "%s",$0;} END{print "";}' file
1234:abcd:930:999999
194:keee:284:222222
How it works
-F:
This tells awk to use a : as the field separator.
$1==last {printf ":%s",$2; next}
If the first field of this line is the same as the first field of the last line, print a colon followed by field 2. Then, skip the rest of the commands and start over with the next line.
NR>1 {print "";}
If we get here, that means that this line has a new, not-seen-before value of the first field. If this is not the first line, we finish the last line by printing a newline character.
{last=$1; printf "%s",$0;}
Update the variable last with the new value of field 1. Then, print this line.
END{print "";}
After we reach the end of the file, print one last newline character.
Combining non-consecutive lines
Consider this test file:
$ cat testfile2
3:abcd
4:abcd
10:123
3:999
4:999
10:123
Apply this awk script:
$ awk -F: '{a[$1]=a[$1]":"$2;} END{for (x in a) print x ":" substr(a[x],2);}' testfile2
3:abcd:999
4:abcd:999
10:123:123
In this approach, the lines will not necessarily come out in any particular order. If order is important, you may want to pipe this output to sort.
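If the keys should come out in the order in which they first appear, without an extra sort, a second array can record that order. A sketch of this variant (the array names joined and order are chosen for illustration):

```shell
result=$(printf '%s\n' '3:abcd' '4:abcd' '10:123' '3:999' '4:999' '10:123' |
  awk -F: '
    !($1 in joined) { order[++n] = $1; joined[$1] = $2; next }  # first time this key is seen
    { joined[$1] = joined[$1] ":" $2 }                          # later lines: append the value
    END { for (i = 1; i <= n; i++) print order[i] ":" joined[order[i]] }')

echo "$result"
```

Like the script above, this keeps only the second field of each line, so values that themselves contain a colon would be cut short.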
