Filter records based on Text in Unix - shell

I'm trying to extract all the records that match the text "IN" in the 10th field of this file.
I tried the following, but it's not giving me accurate results. Any help would be highly appreciated.
awk '$10 == "IN" {print $0}'
My input file:
A1|A2|A3|A4|A5|A6|A7|A8|A9|PK|A11|A13|A14|A15|A16|A17|A18
1|2|3|4|5|6|7|8|9|IN|11|12|13|14|15|16|17|18
AW|BW|CQ|AA|AR|AF|RR|AKL|ASD|US|PP|BN|TY|OL|Q3|M8|I7|V6
AR|BR|CR|A8|AN|AQ|RU|A11|A13|IN|P9P|B0N|T2Y|O4L|Q43|M88|I71|V16
My output should be:
1|2|3|4|5|6|7|8|9|IN|11|12|13|14|15|16|17|18
AR|BR|CR|A8|AN|AQ|RU|A11|A13|IN|P9P|B0N|T2Y|O4L|Q43|M88|I71|V16
All the records that match "IN" in the 10th field should be kept.

Since you haven't specified a field separator in your awk code, awk defaults to whitespace, but your Input_file is pipe (|) delimited, so you need to tell awk about it in the code.
Could you please try the following.
awk -F'|' '$10=="IN"' Input_file
Explanation of the above code:
awk -F'|' ' ##Setting field separator as |(pipe) for all lines of Input_file.
$10=="IN" ##Checking condition if 10th field is equal to IN here if yes then print the current line.
' Input_file ##Mentioning Input_file name here.
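As a quick sanity check, the fixed one-liner can be run against the sample data from the question (Input_file is just a placeholder name):

```shell
# Recreate the pipe-delimited sample Input_file from the question.
cat > Input_file <<'EOF'
A1|A2|A3|A4|A5|A6|A7|A8|A9|PK|A11|A13|A14|A15|A16|A17|A18
1|2|3|4|5|6|7|8|9|IN|11|12|13|14|15|16|17|18
AW|BW|CQ|AA|AR|AF|RR|AKL|ASD|US|PP|BN|TY|OL|Q3|M8|I7|V6
AR|BR|CR|A8|AN|AQ|RU|A11|A13|IN|P9P|B0N|T2Y|O4L|Q43|M88|I71|V16
EOF

# -F'|' sets the field separator; a TRUE condition with no action
# block prints the whole matching line.
awk -F'|' '$10=="IN"' Input_file
```

This prints only the two rows whose 10th field is IN.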

Related

replace names in fasta

I want to change the sequence names in a fasta file according to a text file containing the new names. I found several approaches; seqkit made a good impression, but I can't get it running. Replace key with value by key-value file
The fasta file seq.fa looks like
>BC1
ATGCATGCATGCATGCATGCATGCATGCATGCATGCATGC
ATGCATGCATGCATGCATGCATGCATGCATGCATGCATGC
>BC2
TGCATGCATGCATGCATGCATGCATGCATGCATGCATGCG
GCATGCATGCATGCATGCATGCATGCATGCATGCG
>BC3
GCATGCATGCATGCATGCATGCATGCATGCATGCCCCCCC
TGCATGCATGCATG
and the tab-delimited text file ref.txt looks like
BC1 1234
BC2 1235
BC3 1236
Running seqkit in Git Bash goes through the file but doesn't change the names.
seqkit replace -p' (.+)$' -r' {kv}' -k ref.txt seq.fa --keep-key
I'm used to R and new to bash; I can't find the bug, but I guess I need to adjust for the tab and _?
As in example 7, "Replace key with value by key-value file", at https://bioinf.shenwei.me/seqkit/usage/#replace, the sequence name is tab-delimited and only the second part is replaced.
Any advice on how to adjust the code?
The desired outcome, replacing BC1 with the number 1234 from the text file, should look like:
>1234
ATGCATGCATGCATGCATGCATGCATGCATGCATGCATGC
ATGCATGCATGCATGCATGCATGCATGCATGCATGCATGC
>1235
TGCATGCATGCATGCATGCATGCATGCATGCATGCATGCG
GCATGCATGCATGCATGCATGCATGCATGCATGCG
>1236
GCATGCATGCATGCATGCATGCATGCATGCATGCCCCCCC
TGCATGCATGCATG
Could you please try the following.
awk '
FNR==NR{
a[$1]=$2
next
}
($2 in a) && /^>/{
print ">"a[$2]
next
}
1
' ref.txt FS="[> ]" seq.fa
Explanation: adding a detailed explanation of the above code.
awk ' ##Starting awk program here.
FNR==NR{ ##FNR==NR is condition which will be TRUE when 1st Input_file named ref.txt will be read.
a[$1]=$2 ##Creating an array named a whose index is $1 and value is $2 of current line.
next ##next will skip all further statements from here.
} ##Closing BLOCK for FNR==NR condition here.
($2 in a) && /^>/{ ##Checking condition if $2 of current line is present in array a and starts with > then do following.
print ">"a[$2] ##Printing > and value of array a whose index is $2.
next ##next will skip all further statements from here.
}
1 ##Mentioning 1 will print the lines (those which are NOT starting with > in Input_file seq.fa).
' ref.txt FS="[> ]" seq.fa ##Mentioning the Input_file names here and setting FS to either space or > for Input_file seq.fa.
EDIT: As per the OP's comment, an occurrence number (e.g. >1234_1) is needed in the output too, so adding the following code now.
awk '
FNR==NR{
a[$1]=$2
b[$1]=++c[$2]
next
}
($2 in a) && /^>/{
print ">"a[$2]"_"b[$2]
next
}
1
' ref.txt FS="[> ]" seq.fa
awk solution that doesn't require GNU awk:
awk 'NR==FNR{a[$1]=$2;next}
NF==2{$2=a[$2]; print ">" $2;next}
1' FS='\t' ref.txt FS='>' seq.fa
The first statement fills the array a with the content of the tab-delimited file ref.txt.
The second statement prints the header lines of the second file, seq.fa, i.e. those with 2 fields given > as the field delimiter, replacing the name with its value from a.
The last statement prints all other lines of that same file.
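A minimal reproduction of that solution, using shortened sequences in place of the full ones from the question:

```shell
# Recreate ref.txt (tab-delimited) and a shortened seq.fa.
printf 'BC1\t1234\nBC2\t1235\nBC3\t1236\n' > ref.txt
cat > seq.fa <<'EOF'
>BC1
ATGCATGC
>BC2
TGCATGCA
>BC3
GCATGCAT
EOF

# FS='\t' applies while reading ref.txt; FS='>' applies to seq.fa,
# so each header line splits into an empty $1 and the name in $2.
awk 'NR==FNR{a[$1]=$2;next}
NF==2{$2=a[$2]; print ">" $2;next}
1' FS='\t' ref.txt FS='>' seq.fa
```

The headers come out as >1234, >1235, >1236 with the sequence lines unchanged.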

Replace header of one column by file name

I have about 100 comma-separated text files with eight columns.
Example of two file names:
sample1_sorted_count_clean.csv
sample2_sorted_count_clean.csv
Example of file content:
Domain,Phylum,Class,Order,Family,Genus,Species,Count
Bacteria,Proteobacteria,Alphaproteobacteria,Sphingomonadales,Sphingomonadaceae,Zymomonas,Zymomonas mobilis,0.0
Bacteria,Bacteroidetes,Flavobacteria,Flavobacteriales,Flavobacteriaceae,Zunongwangia,Zunongwangia profunda,0.0
For each file, I would like to replace the column header "Count" with the sample ID, which is contained in the first part of the file name (sample1, sample2).
In the end, the header should then look like this:
Domain,Phylum,Class,Order,Family,Genus,Species,sample1
If I use my code, the header looks like this:
Domain,Phylum,Class,Order,Family,Genus,Species,${f%_clean.csv}
for f in *_clean.csv; do echo ${f}; sed -e "1s/Domain,Phylum,Class,Order,Family,Genus,Species,RPMM/Domain,Phylum,Class,Order,Family,Genus,Species,${f%_clean.csv}/" ${f} > ${f%_clean.csv}_clean2.csv; done
I also tried:
for f in *_clean.csv; do gawk -F"," '{$NF=","FILENAME}1' ${f} > t && mv t ${f%_clean.csv}_clean2.csv; done
In this case, "Count" is replaced by the entire file name, but now each row of the column contains the file name; the count values are no longer present. This is not what I want.
Do you have any ideas on what else I may try?
Thank you very much in advance!
Anna
If you are ok with awk, could you please try the following.
awk 'BEGIN{FS=OFS=","} FNR==1{var=FILENAME;sub(/_.*/,"",var);$NF=var} 1' *.csv
EDIT: Since the OP is asking that everything after the 2nd underscore be removed from the file name, try the following.
awk 'BEGIN{FS=OFS=","} FNR==1{split(FILENAME,array,"_");$NF=array[1]"_"array[2]} 1' *.csv
Explanation of the above code:
awk ' ##Starting awk program from here.
BEGIN{ ##Starting BEGIN section of code from here, which will be executed before Input_file(s) are being read.
FS=OFS="," ##Setting FS and OFS as comma here for all files all lines.
} ##Closing BEGIN section here.
FNR==1{ ##Checking condition if FNR==1 which means very first line is being read for Input_file then do following.
split(FILENAME,array,"_") ##Using awk's built-in split() function to split FILENAME (which holds the current file name) into an array named array on the delimiter _.
$NF=array[1]"_"array[2] ##Setting the last field to the array's 1st element, an underscore, then the array's 2nd element.
} ##Closing FNR==1 condition BLOCK here.
1 ##Mentioning 1 will print the rest of the lines for current Input_file.
' *.csv ##Passing all *.csv files to awk program here.
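A small reproduction of the first one-liner against a single sample file (content truncated to two lines):

```shell
# Recreate one of the sample CSV files from the question.
cat > sample1_sorted_count_clean.csv <<'EOF'
Domain,Phylum,Class,Order,Family,Genus,Species,Count
Bacteria,Proteobacteria,Alphaproteobacteria,Sphingomonadales,Sphingomonadaceae,Zymomonas,Zymomonas mobilis,0.0
EOF

# On the first line of each file, strip everything from the first
# underscore onward out of FILENAME and use the rest as the new last header.
awk 'BEGIN{FS=OFS=","} FNR==1{var=FILENAME;sub(/_.*/,"",var);$NF=var} 1' sample1_sorted_count_clean.csv
```

The header's last column becomes sample1 while the data rows pass through untouched.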

AWK not printing output file separator OFS

Input
15.01.2018;Payment sent;;500.00;;
20.12.2017;Payment received;10.40;;;
Expected output
15.01.2018;Payment sent;-500.00
20.12.2017;Payment received;10.40
Current output
15.01.2018Payment sent-500.00
20.12.2017Payment received10.40
Does anyone see the problem in my command?
awk 'BEGIN{OFS=";";FS=";"} {print match($4, /[^ ]/) ? $1$2$3"-"$4 : $1$2$3}' < in.csv > out.csv
Thank you
I don't understand why you're surprised that when you print $1$2$3 there's no OFS between them, but I also don't understand why you were trying to use the logic in your script at all instead of just:
$ awk 'BEGIN{FS=OFS=";"} {print $1, $2, ($3=="" ? "-"$4 : $3)}' file
15.01.2018;Payment sent;-500.00
20.12.2017;Payment received;10.40
The following awk may help you with the same.
awk -F";" '$4~/[0-9]/{$4="-"$4}{gsub(/;+/,";");sub(/;$/,"")} 1' OFS=";" Input_file
Output will be as follows.
15.01.2018;Payment sent;-500.00
20.12.2017;Payment received;10.40
Explanation of the above code:
awk -F";" ' ##Setting the field separator as semicolon here.
$4~/[0-9]/{ ##Checking whether the 4th field contains a digit; if yes then do the following:
$4="-"$4 ##Adding a dash (-) before the 4th field value here.
}
{
gsub(/;+/,";"); ##Globally replacing runs of semicolons with a single semicolon, as per the OP's shown output.
sub(/;$/,"") ##Replacing a semicolon at the end of the line with NULL here.
}
1 ##awk works on the condition{action} model; here the condition is TRUE and no action is mentioned, so the default print happens.
' OFS=";" Input_file ##Setting OFS (output field separator) as semicolon and mentioning the Input_file name here too.
awk '{sub(/sent;;/,"sent;-")sub(/;;+/,"")}1' file
15.01.2018;Payment sent;-500.00
20.12.2017;Payment received;10.40
The first sub changes the double semicolon after "sent" to ";-", and the second removes the trailing semicolons.
None of these answers is actually responsive to the OP's question. The question was "Why isn't the OFS appearing in the output?" The answer is quite simple, and one person made a snarky comment in the right direction but nobody actually answered.
Here's the answer: In the ... print $1$2$3... part, there are no spaces between $1, $2, and $3, so you've asked awk to just put those fields right next to each other with no space or field separator. If you had ... print $1,$2,$3.. then you'd have the result you are looking for.
And yes, I know this is an old, dead question.
There's absolutely no point in using ";" as either FS or OFS
{m,g}awk 'sub(";*$",_,$!(NF=NF))' FS='sent;+' OFS='sent;-'
{m,g}awk NF=NF RS=';*\r?\n' FS='sent;+' OFS='sent;-'
15.01.2018;Payment sent;-500.00
20.12.2017;Payment received;10.40
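The OFS point is easy to verify in isolation against the sample input: commas between the print arguments emit OFS, while concatenation does not. A minimal run of the earlier one-liner:

```shell
# Recreate the sample semicolon-delimited input.
printf '%s\n' '15.01.2018;Payment sent;;500.00;;' \
              '20.12.2017;Payment received;10.40;;;' > in.csv

# Commas between print arguments insert OFS; print $1$2$3 would not.
awk 'BEGIN{FS=OFS=";"} {print $1, $2, ($3=="" ? "-"$4 : $3)}' in.csv
```

The output keeps the semicolons between fields, as in the expected output above.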

How to remove partial duplicates from text file?

How can I remove partial duplicates in bash using either awk, grep or sort?
Input:
"3","6"
"3","7"
"4","9"
"5","6"
"26","48"
"543","7"
Expected Output:
"3","6"
"3","7"
"4","9"
"26","48"
Could you please try the following and let me know if this helps you.
awk -F'[",]' '!a[$5]++' Input_file
Output will be as follows.
"3","6"
"3","7"
"4","9"
"26","48"
EDIT: Adding an explanation here too.
awk -F'[",]' ' ##Setting field separator as " or , for every line of Input_file.
!a[$5]++ ##Creating an array named a whose index is $5 (the fifth field). The condition is TRUE only the first time a given 5th field is seen; the ++ then increments its count, so later duplicates fail the test. Since awk works on condition{action} and no action is mentioned here, the default print of the current line happens.
' Input_file ##Mentioning the Input_file here too.
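A quick check of the dedup one-liner against the sample input:

```shell
# Recreate the quoted, comma-separated sample Input_file.
cat > Input_file <<'EOF'
"3","6"
"3","7"
"4","9"
"5","6"
"26","48"
"543","7"
EOF

# With FS set to the class [",], the second quoted number lands in $5,
# so !a[$5]++ keeps only the first line seen for each second value.
awk -F'[",]' '!a[$5]++' Input_file
```

Lines "5","6" and "543","7" are dropped because 6 and 7 were already seen as second values.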

Extracting expression from csv field

I'm trying to extract the value that comes after word= in a CSV file that looks like this:
1473228800,0.0,word=google.sentence=Android.something=not_set
1480228800,100.0,word=google_analytics.number=not_set.country=US.source=internet
1493228800,0.0,location=NY.word=Android.sentence=not_set.something=not_set.type=gauge
and the output I need is (it's important for me to print only "word" and its value):
1473228800,0.0,word=google
1480228800,100.0,word=google_analytics
1493228800,0.0,word=Android
I tried using sed and awk, but each gave me a solution that worked for only some of the lines in the csv file.
This is my last try using awk:
awk -F "," '{sub(/.*word.*=(.*)\.*/,"word=\1", $3);print $1","$2","$3}'
awk solution:
awk -F, '{match($3,/word=[^.]+/); print $1,$2,substr($3,RSTART,RLENGTH)}' OFS=',' file
The output:
1473228800,0.0,word=google
1480228800,100.0,word=google_analytics
1493228800,0.0,word=Android
match($3,/word=[^.]+/) - to match the needed sequence within the 3rd field
substr($3,RSTART,RLENGTH) - to extract matched sequence from the 3rd field
The match() function sets the predefined variable RSTART to the
index. It also sets the predefined variable RLENGTH to the length in
characters of the matched substring.
try:
awk -F, '{sub(/.*word/,"word",$3);sub(/\..*/,"",$3);print $1,$2,$3}' OFS="," Input_file
Setting the field separator to , then substituting everything up to and including word in $3 with the string word. Then substituting everything from the DOT onward in $3 with NULL, as we don't need it per your question. Then printing the first, second and third fields, with the output field separator set to comma.
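Both answers can be checked against the sample data; here is the match()/substr() version run end to end:

```shell
# Recreate the sample CSV from the question.
cat > file <<'EOF'
1473228800,0.0,word=google.sentence=Android.something=not_set
1480228800,100.0,word=google_analytics.number=not_set.country=US.source=internet
1493228800,0.0,location=NY.word=Android.sentence=not_set.something=not_set.type=gauge
EOF

# match() sets RSTART/RLENGTH for "word=" up to the next dot;
# substr() then cuts exactly that piece out of the 3rd field.
awk -F, '{match($3,/word=[^.]+/); print $1,$2,substr($3,RSTART,RLENGTH)}' OFS=',' file
```

Note it finds word= wherever it sits in the field, which is why the third line (where location= comes first) still works.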