AWK find if line is newline or # - bash

I have the following. It ignores the lines that contain just #, but not the empty ones (lines containing only a newline).
Do you know of a way I can kill two birds with one stone?
I.e. if a line doesn't contain more than one character, delete the line.
function check_duplicates {
awk '
FNR==1{files[FILENAME]}
{if((FILENAME, $0) in a) dupsInFile[FILENAME]
else
{a[FILENAME, $0]
dups[$0] = $0 in dups ? (dups[$0] RS FILENAME) : FILENAME
count[$0]++}}
{if ($0 ~ /#/) {
delete dups[$0]
}}
#Print duplicates in more than one file
END{for(k in dups)
{if(count[k] > 1)
{print ("\n\nDuplicate line found: " k) " - In the following file(s)"
print dups[k] }}
printf "\n";
}' $SITEFILES
awk '
NR {
b[$0]++
}
$0 in b {
if ($0 ~ /#/) {
delete b[$0]
}
if (b[$0]>1) {
print ("\n\nRepeated line found: "$0) " - In the following file"
print FILENAME
delete b[$0]
}
}' $SITEFILES
}
The expected input is usually as follows.
#File Path's
/path/to/file1
/path/to/file2
/path/to/file3
/path/to/file4
#
/more/paths/to/file1
/more/paths/to/file2
/more/paths/to/file3
/more/paths/to/file4
/more/paths/to/file5
/more/paths/to/file5
In this case, /more/paths/to/file5, occurs twice, and should be flagged as such.
However, there are also many newlines, which I'd rather ignore.
Er, it also has to be awk, I'm doing a tonne of post processing, and don't want to vary from awk for this bit, if that's okay :)
It really seems to be a bit tougher than I would have expected.
Cheers,
Ben

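You can combine both conditions into a single regex. awk strips the newline from each record before matching, so an empty line never contains \n; match it with ^$ instead:
if ($0 ~ /#|^$/) {
delete dups[$0]
}
OR
To be more specific, you can write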
if ($0 ~ /^#?$/) {
delete dups[$0]
}
What it does:
^ matches the start of the line.
#? matches zero or one #.
$ matches the end of the line.
So, ^$ matches empty lines and ^#$ matches lines with only #.
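
As a minimal sketch of how that pattern could sit in front of a duplicate check (the sample lines and the seen array are made up for illustration, and this only skips lines that are empty or exactly #, not every line containing a #):

```shell
out=$(printf '%s\n' '#' '' '/a' '/b' '' '/b' '#' |
awk '
/^#?$/ { next }                                        # skip empty lines and lines holding only "#"
seen[$0]++ == 1 { print "Duplicate line found: " $0 }  # report a line on its second sighting
')
printf '%s\n' "$out"
```

Because the skipped lines never reach the seen array, there is nothing to delete afterwards.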


Use awk to create index of words from file

I'm learning UNIX for school and I'm supposed to create a command line that takes a text file and generates a dictionary index showing the words (excluding articles and prepositions) and the lines where each one appears in the file.
I found a problem similar to mine at: https://unix.stackexchange.com/questions/169159/how-do-i-use-awk-to-create-an-index-of-words-in-file?newreg=a75eebee28fb4a3eadeef5a53c74b9a8 The problem is that when I run the solution
$ awk '
{
gsub(/[^[:alpha:] ]/,"");
for(i=1;i<=NF;i++) {
a[$i] = a[$i] ? a[$i]", "FNR : FNR;
}
}
END {
for (i in a) {
print i": "a[i];
}
}' file | sort
The output contains special characters (which I don't want) like:
-Quiero: 21
Sancho,: 2, 4, 8
How can I remove all the special characters and exclude articles and prepositions?
$ echo This is this test. | # some test text
awk '
BEGIN{
x["a"];x["an"];x["the"];x["on"] # the stop words
OFS=", " # list separator to a
}
{
for(i=1;i<=NF;i++) # list words in a line
if($i in x==0) { # if word is not a stop word
$i=tolower($i) # lowercase it
gsub(/^[^a-z]|[^a-z]$/,"",$i) # remove leading and trailing non-alphabets
a[$i]=a[$i] (a[$i]==""?"":OFS) NR # add record number to list
}
}
END { # after file is processed
for(i in a) # in no particular order
print i ": " a[i] # ... print elements in a
}'
this: 1, 1
test: 1
is: 1

Replace in one file with value from another file not working properly

I have two files. A mapping file and an input file.
cat map.txt
test:replace
cat input.txt
The word test should be replaced.But the word testbook should not be
replaced just because it has "_test" in it.
Using the below command to find in the file and replace it with value in mapping file.
awk 'FNR==NR{ array[$1]=$2; next } { for (i in array) gsub(i, array[i]) }1' FS=":" map.txt FS=" " input.txt
What it does is search the input file for the words listed in map.txt and replace each one with the word that follows the ":" in the mapping file.
In the above example, "test" is replaced with "replace".
Current result:
The word replace should be replaced.But the word replacebook should not be replaced just because it has _replace in it.
Expected Result:
The word replace should be replaced.But the word testbook should not be replaced just because it has "_test" in it.
So what I need is for the word to be replaced only when it stands alone. If it is joined to any other character, it should be ignored.
Any help is appreciated.
Thanks in advance.
With GNU awk for word boundaries:
awk -F':' '
NR==FNR { map[$1] = $2; next }
{
for (old in map) {
new = map[old]
gsub("\\<"old"\\>",new)
}
print
}
' map input
The above will fail if old contains regexp metacharacters or escape characters, or if new contains &, but as long as both use word-constituent characters it'll be fine.
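
If those caveats matter, one way around them is to avoid regexps for the search altogether. The following is only a sketch, not part of the answer above: it assumes a single old/new pair passed with -v, and treats letters, digits and underscore as word characters. The variable names (res, pos, before, after) are invented for the example.

```shell
out=$(printf '%s\n' 'The word test should be replaced. But testbook and "_test" stay.' |
awk -v old='test' -v new='replace' '
{
  line = $0; res = ""; pos = 1; n = length(old)
  # index() does a plain substring search, so old is never read as a regexp
  while (n > 0 && (p = index(substr(line, pos), old)) > 0) {
    start  = pos + p - 1
    before = (start > 1) ? substr(line, start - 1, 1) : ""
    after  = substr(line, start + n, 1)
    res = res substr(line, pos, start - pos)
    # keep the match untouched when a word character sits on either side
    if (before ~ /[A-Za-z0-9_]/ || after ~ /[A-Za-z0-9_]/) { res = res old }
    else { res = res new }
    pos = start + n
  }
  print res substr(line, pos)
}')
printf '%s\n' "$out"
```

Since the replacement is built by concatenation rather than gsub(), an & in new is copied literally.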
Loop over all the words with a for loop and replace where needed:
$ awk '
NR==FNR { # hash the map file
a[$1]=$2
next
}
{
for(i=1;i<=NF;i++) # loop every word and if it s hashed, replace it
if($i in a) # ... and if it s hashed...
$i=a[$i] # replace it
}1
' FS=":" map FS=" " input
The word replace should be replaced.But the word testbook should not be replaced just because it has "_test" in it.
Edit: Using match to extract words from strings to preserve punctuations:
$ cat input2
Replace would Yoda test.
$ awk '
NR==FNR { # hash the map file
a[$1]=$2
next
}
{
for(i=1;i<=NF;i++) {
# here should be if to weed out obvious non-word-punctuation pairs
# if($i ~ /^[a-zA-Z]+[,.!?]/)
match($i,/^[a-zA-Z]+/) # match from beginning of word. ¿correct?
w=substr($i,RSTART,RLENGTH) # extract word
if(w in a) # match in a
sub(w,a[w],$i)
}
}1' FS=":" map FS=" " input
Replace would Yoda replace.

awk get the nextline

I'm trying to use awk to format a file that contains multiple lines.
Contents of the file:
ABC;0;1
ABC;0;0;10
ABC;0;2
EFG;0;1;15
HIJ;2;8;00
KLM;4;18;12
KLM;6;18;1200
KLM;10;18;14
KLM;1;18;15
Desired result:
ABC;0;1;10;2
EFG;0;1;15
HIJ;2;8;00
KLM;4;18;12;1200;14;15
I am using the code below :
awk -F ";" '{
ligne= ligne $0
ma_var = $1
{getline
if($1 != ma_var){
ligne= ligne "\n" $0
}
else {
ligne= ligne";"NF
}
}
}
END {
print ligne
} ' ${FILE_IN} > ${FILE_OUT}
The objective is to compare the first column of the next line to the first column of the current line. If they match, add the last column of the next line to the current line and delete the next line; otherwise print the next line.
Kind regards,
As with life, it's a lot easier to make decisions based on what has happened (the previous line) than what will happen (the next line). Restate your requirements as "the objective is to compare the first column of the current line to the first column of the previous line; if they match, add the last column of the current line to the previous line and delete the current line, else print the current line", and the code to implement it becomes relatively straightforward:
$ cat tst.awk
BEGIN { FS=OFS=";" }
$1 == p1 { prev = prev OFS $NF; next }
{ if (NR>1) print prev; prev=$0; p1=$1 }
END { print prev }
$ awk -f tst.awk file
ABC;0;1;10;2
EFG;0;1;15
HIJ;2;8;00
KLM;4;18;12;1200;14;15
If you're ever tempted to use getline again, be sure you fully understand everything discussed at http://awk.freeshell.org/AllAboutGetline before making a decision.
I would take a slightly different approach than Ed:
$ awk '$1 == p { printf ";%s", $NF; next } NR > 1 { print "" } {p=$1;
printf "%s" , $0} END{print ""}' FS=\; input
At each line, check if the first column matches the previous. If it does, just print the last field. If it doesn't, print the whole line with no trailing newline.
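
To see those steps in action, here is the same idea traced on the first four lines of the question's sample, written out one rule per line. This is just an illustrative run; the only change is a guard in END so that empty input prints nothing.

```shell
out=$(printf '%s\n' 'ABC;0;1' 'ABC;0;0;10' 'ABC;0;2' 'EFG;0;1;15' |
awk -F';' '
$1 == p { printf ";%s", $NF; next }   # same key as the previous line: append the last field
NR > 1  { print "" }                  # new key: terminate the previous output line
        { p = $1; printf "%s", $0 }   # start a new output line without a trailing newline
END     { if (NR) print "" }          # terminate the final output line
')
printf '%s\n' "$out"
```

Note that a first line with an empty first column would compare equal to the still-unset p, so the sketch assumes the key column is never empty.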

Remove empty lines followed by a pattern

I'm trying to find a way to remove empty lines which are found in my asciidoc file before a marker string, such as:
//Empty line
[source,shell]
I'd need:
[source,shell]
I'm trying with:
sed '/^\s*$\n[source,shell]/d' file
However, it doesn't produce the expected effect (even when escaping the brackets). Any help?
You may use this awk script to delete an empty line that directly precedes the marker:
awk -v desired_val="[source,shell]" '
# print the previous line unless it is empty and the current line is the marker
NR > 1 && !(prev == "" && $0 == desired_val) { print prev }
{ prev = $0 }
END { if (NR > 0) print prev }' your_file
The next script is a little longer than the previous one, but it deletes all empty lines before the desired value.
# AWK script file
# Provides clearing all empty lines in front of desired value
# Usage: awk -v DESIRED_VAL="your_value" -f "awk_script_fname" input_fname
BEGIN { i=0 }
{
# If line is empty - save counts of empty strings
if ( length($0) == 0 ) { i++; }
# If line is not empty and is DESIRED_VAL, print it
if ( length ($0) != 0 && $0 == DESIRED_VAL )
{
print $0; i=0;
}
# If line is not empty and not DESIRED_VAL, print all empty str and current
if ( length ($0) != 0 && $0 != DESIRED_VAL )
{
for (m=0;m<i;m++) { print ""; } i=0; print $0;
}
}
# If the last lines are empty, print them
END { for (m=0;m<i;m++) { print ""; } }
This awk script is run with the following command:
awk -v DESIRED_VAL="your_value" -f "awk_script_fname" input_fname
Your sed line doesn't work because sed processes one line at a time, so it will not match a pattern that includes \n unless you manipulate the pattern space.
If you still want to do it with sed:
sed '/^$/{N;s/\n\(\[source,shell]\)/\1/}' file
How it works: When matching an empty line, read the next line into the pattern space and remove the empty line if a marker is found. Note that this won't work correctly if you have two empty lines before the marker, as the first empty line will consume the second one and there will be no matching with the marker.
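
For the case of two or more empty lines before the marker, a compact awk sketch along these lines might do (the marker variable and the sample input are assumptions for illustration; empty lines are held back in a counter and only released when the next non-empty line is not the marker):

```shell
out=$(printf '%s\n' 'text' '' '' '[source,shell]' 'code' '' 'end' |
awk -v marker='[source,shell]' '
$0 == "" { blanks++; next }                    # hold back empty lines
{
  if ($0 != marker)                            # not the marker: release the held lines
    for (; blanks > 0; blanks--) print ""
  blanks = 0                                   # marker: the held lines are discarded
  print
}
END { for (; blanks > 0; blanks--) print "" }  # trailing empty lines are kept
')
printf '%s\n' "$out"
```

Only lines that are completely empty are treated as blank here; lines containing just spaces would need a pattern such as /^[[:space:]]*$/ instead.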

Sequence length of FASTA file

I have the following FASTA file:
>header1
CGCTCTCTCCATCTCTCTACCCTCTCCCTCTCTCTCGGATAGCTAGCTCTTCTTCCTCCT
TCCTCCGTTTGGATCAGACGAGAGGGTATGTAGTGGTGCACCACGAGTTGGTGAAGC
>header2
GGT
>header3
TTATGAT
My desired output:
>header1
117
>header2
3
>header3
7
# 3 sequences, total length 127.
This is my code:
awk '/^>/ {print; next; } { seqlen = length($0); print seqlen}' file.fa
The output I get with this code is:
>header1
60
57
>header2
3
>header3
7
I need a small modification in order to deal with multiple sequence lines.
I also need a way to have the total sequences and total length. Any suggestion will be welcome... In bash or awk, please. I know that is easy to do it in Perl/BioPerl and actually, I have a script to do it in those ways.
An awk / gawk solution can be composed of three stages:
1. Every time a header is found, these actions should be performed:
   - Print the previous seqlen, if it exists.
   - Print the tag.
   - Initialize seqlen.
2. For the sequence lines, we just need to accumulate totals.
3. Finally, at the END stage, we print the remaining seqlen.
Commented code:
awk '/^>/ { # header pattern detected
    if (seqlen) {
        # print previous seqlen if exists
        print seqlen
    }
    # print the tag
    print
    # initialize sequence
    seqlen = 0
    # skip further processing
    next
}
# accumulate sequence length
{
    seqlen += length($0)
}
# remnant seqlen if exists
END { if (seqlen) { print seqlen } }' file.fa
A oneliner:
awk '/^>/ {if (seqlen){print seqlen}; print ;seqlen=0;next; } { seqlen += length($0)}END{print seqlen}' file.fa
For the totals:
awk '/^>/ {
    if (seqlen) {
        print seqlen
    }
    print
    seqtotal += seqlen
    seqlen = 0
    seq += 1
    next
}
{
    seqlen += length($0)
}
END {
    print seqlen
    print seq " sequences, total length " seqtotal + seqlen
}' file.fa
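
The question's desired output ends with a "# N sequences, total length M." line. A sketch that produces that exact summary in one pass could look like this; the helper function emit and the shortened sample sequences are made up for the example, and unlike the versions above it also prints 0 for a header that has no sequence lines.

```shell
out=$(printf '%s\n' '>header1' 'ACGT' 'AC' '>header2' 'GGT' |
awk '
# emit() closes the record in progress: print its length and add it to the total
function emit() { if (n) { print len; total += len } }
/^>/ { emit(); print; n++; len = 0; next }   # header: finish the previous record, start a new one
     { len += length($0) }                   # sequence line: accumulate its length
END  { emit(); printf "# %d sequences, total length %d.\n", n, total }
')
printf '%s\n' "$out"
```

Counting headers in n instead of testing seqlen is what lets empty records show up as 0.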
A quick way with any awk would be this:
awk '/^>/{if (l!="") print l; print; l=0; next}{l+=length($0)}END{print l}' file.fasta
You might also be interested in BioAwk, an adapted version of awk that is tuned to process FASTA files:
bioawk -c fastx '{print ">" $name ORS length($seq)}' file.fasta
Note: BioAwk is based on Brian Kernighan's awk, which is documented in "The AWK Programming Language" by Al Aho, Brian Kernighan, and Peter Weinberger (Addison-Wesley, 1988, ISBN 0-201-07981-X). I'm not sure if this version is compatible with POSIX.
I wanted to share some tweaks to klashxx's answer that might be useful. Its output differs in that it prints the sequence id and its length on one line. It's no longer a one-liner, so the downside is you'll have to save it as a script file.
It also parses out the sequence id from the header line, based on whitespace (chrM in >chrM gi|251831106|ref|NC_012920.1|). Then, you can select a specific sequence based on the id by setting the variable target like so: $ awk -f seqlen.awk -v target=chrM seq.fa.
BEGIN {
    OFS = "\t"; # tab-delimited output
}
# Use substr instead of regex to match a starting ">"
substr($0, 1, 1) == ">" {
    if (seqlen) {
        # Only print info for this sequence if no target was given
        # or its id matches the target.
        if (! target || id == target) {
            print id, seqlen;
        }
    }
    # Get sequence id:
    # 1. Split header on whitespace (fields[1] is now ">id")
    split($0, fields);
    # 2. Get portion of first field after the starting ">"
    id = substr(fields[1], 2);
    seqlen = 0;
    next;
}
{
    seqlen = seqlen + length($0);
}
END {
    if (! target || id == target) {
        print id, seqlen;
    }
}
"seqkit" is a quick way:
seqkit fx2tab --length --name --header-line sequence.fa
