bash: subtract constant number after prefix

I have a large text file with many entries like this:
/locus_tag="PREFIX_05485"
including the leading spaces. Unfortunately, the numbering does not start at 00001.
The only part in this line that is changing is the number.
I would like to change the PREFIX (this I can do easily with sed), but I also want to decrease the number so it looks like this:
/locus_tag="myNewPrefix_00001"
(the next entry should be ..."myNewPrefix_00002" and so on). Alternatively, the entry could also be without leading zeros.
As far as I know, sed cannot do arithmetic (like subtracting a constant number). Any ideas how I can solve that?
Thank you very much. If the question is unclear, please let me know and I will try to improve it.
EDIT: Sometimes the same number occurs twice (this should also be the case in the modified file). For instance,
/locus_tag="PREFIX_12345"
/locus_tag="PREFIX_12345"
/locus_tag="PREFIX_12346"
/locus_tag="PREFIX_12347"
should be in the end
/locus_tag="myNewPrefix_00001"
/locus_tag="myNewPrefix_00001"
/locus_tag="myNewPrefix_00002"
/locus_tag="myNewPrefix_00003"

You may use awk:
awk -v pf='myNewPrefix' 'BEGIN{FS=OFS="="}
$1 ~ /\/locus_tag$/ && split($2, a, /_/) == 2 {
  $2 = sprintf("\"%s_%05d\"", pf, (a[2] in seen ? i : ++i)); seen[a[2]]
} 1' file
/locus_tag="myNewPrefix_00001"
/locus_tag="myNewPrefix_00001"
/locus_tag="myNewPrefix_00002"
/locus_tag="myNewPrefix_00003"
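For a quick end-to-end check, the answer can be run against the sample from the edit (the file name file is arbitrary, and the bare seen[a[2]] is written as an explicit assignment, which behaves identically):

```shell
# Build the sample input from the question's edit.
cat > file <<'EOF'
/locus_tag="PREFIX_12345"
/locus_tag="PREFIX_12345"
/locus_tag="PREFIX_12346"
/locus_tag="PREFIX_12347"
EOF

awk -v pf='myNewPrefix' 'BEGIN{FS=OFS="="}
$1 ~ /\/locus_tag$/ && split($2, a, /_/) == 2 {
  # Reuse the current counter for a repeated number, otherwise advance it.
  $2 = sprintf("\"%s_%05d\"", pf, (a[2] in seen ? i : ++i)); seen[a[2]] = 1
} 1' file
```

Duplicate input numbers map to the same output number, as the edit requires.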

Check this Perl one-liner:
/tmp> cat littlebird.txt
abcdef
/locus_tag="PREFIX_12345"
hello hai
/locus_tag="PREFIX_12345"
/locus_tag="PREFIX_12346"
/locus_tag="PREFIX_12347"
123 456
end
/tmp> perl -pe 'BEGIN{$r=qr/PREFIX_(\d+)/} $kv2{$1} //= sprintf("%04d", ++$i) if /$r/; s/$r/PREFIX_$kv2{$1}/g' littlebird.txt
abcdef
/locus_tag="PREFIX_0001"
hello hai
/locus_tag="PREFIX_0001"
/locus_tag="PREFIX_0002"
/locus_tag="PREFIX_0003"
123 456
end
/tmp>

Related

Compare two text files line by line, finding differences but ignoring numerical values differences

I'm working on a bash script to compare two similar text files line by line and find any differences between them. I should point out the difference and say on which line it occurs, but I should ignore numerical values in the comparison.
Example:
Process is running; process found : 12603 process is listening on port 1200
Process is running; process found : 43023 process is listening on port 1200
In the example above, the script shouldn't find any difference since it's just the process id and it changes all the time.
But otherwise I want it to notify me of the differences between the lines.
Example:
Process is running; process found : 12603 process is listening on port 1200
Process is not running; process found : 43023 process is not listening on port 1200
I already have a working script to find the differences, and I've used the following function to find the difference while ignoring the numerical values, but it's not working perfectly. Any suggestions?
COMPARE_FILES()
{
awk 'NR==FNR{a[FNR]=$0;next}$0!~a[FNR]{print $0}' $1 $2
}
Where $1 and $2 are the two files to compare.
Would you please try the following:
COMPARE_FILES() {
awk '
NR==FNR {a[FNR]=$0; next}
{
b=$0; gsub(/[0-9]+/,"",b)
c=a[FNR]; gsub(/[0-9]+/,"",c)
if (b != c) {printf "< %s\n> %s\n", $0, a[FNR]}
}' "$1" "$2"
}
Jettison the digits before making the comparison. I would improve your code by replacing
NR==FNR{a[FNR]=$0;next}$0!~a[FNR]{print $0}
with
NR==FNR{a[FNR]=$0;next}gensub(/[[:digit:]]/,"","g",$0)!~gensub(/[[:digit:]]/,"","g",a[FNR]){print $0}
Explanation: I harness the gensub string function because it returns a new string (gsub alters the variable it operates on). I replace every [:digit:] character with the empty string (i.e. delete it) globally.
Using any awk:
compare_files() {
  awk '{key=$0; gsub(/[0-9]+(\.[0-9]+)?/,0,key)} NR==FNR{a[FNR]=key; next} key!~a[FNR]' "$@"
}
The above doesn't just remove the digits, it replaces every set of numbers, whether they're integers like 17 or decimals like 17.31, with the number 0 to avoid false matches.
For example, given input like:
file1: foo 1234 bar
file2: foo bar
If you just remove the digits then those 2 lines incorrectly become identical:
file1: foo bar
file2: foo bar
whereas if you replace all numbers with a 0 then they correctly remain different:
file1: foo 0 bar
file2: foo bar
Note that although the lines are compared after converting numbers to 0, the original lines are not modified, so the output shows the original lines, not the modified ones, for ease of further investigating the differences.
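A minimal, self-contained sketch of this idea (file names are invented; the regex !~ match is swapped for a literal != comparison so that regex metacharacters in a line can't cause false matches):

```shell
tmp=$(mktemp -d)
printf '%s\n' 'Process is running; process found : 12603' 'foo 1234 bar' > "$tmp/file1"
printf '%s\n' 'Process is running; process found : 43023' 'foo bar'      > "$tmp/file2"

compare_files() {
  awk '
    {key=$0; gsub(/[0-9]+(\.[0-9]+)?/,0,key)}   # normalize every number to 0
    NR==FNR {a[FNR]=key; next}                  # first file: remember normalized lines
    key != a[FNR] {printf "line %d differs: %s\n", FNR, $0}
  ' "$1" "$2"
}

compare_files "$tmp/file1" "$tmp/file2"
# prints: line 2 differs: foo bar
```

Line 1 matches because both process ids normalize to 0; line 2 still differs because 1234 becomes 0 rather than disappearing.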

grep for duration=N while N is larger than X, the position of this term changes between lines

I have a very long file with a format of column_name=column_val, column_name2=column_val2 and so on.
The columns are not in a fixed order; let's say, for example, I have this file:
bar=x moshe=foo test=x duration=5
moshe=foo2 test=y duration=0 bar=y
duration=3 moshe=foo3 bar=z test=x
I want to return only lines where duration is greater than 2.
As far as I know awk is not an option since I can't tell where the columns are located in each line.
On IRC in the #bash channel someone recommended using gawk's match(). There too I had trouble seeing how to handle the duration appearing in a different place on each line.
Any ideas?
Thanks
You can use duration= as field separator:
# showing field content in numeric context
$ awk -F'duration=' '{print +$2}' ip.txt
5
0
3
# use required numeric comparison to get desired output
$ awk -F'duration=' '+$2 > 2' ip.txt
bar=x moshe=foo test=x duration=5
duration=3 moshe=foo3 bar=z test=x
See https://www.gnu.org/software/gawk/manual/html_node/Strings-And-Numbers.html for conversion details
Unary + works on GNU awk, not sure about other versions. 0+$2 should work everywhere to force numeric context.
Note that if you have multiple duration= in a line, only the first one will be tested.
Extract the data with regex and compare.
awk '0+gensub(".*duration=([0-9]*).*", "\\1", "1") > 2'
#edit as above, the 0+ is needed to convert string to integer.
If you want to use grep (the ( |$) alternation also handles a duration= value at the end of a line):
grep -E 'duration=([3-9]|[0-9]{2,})( |$)' "file"
awk '/duration/ {
  for (counter = 1; counter <= NF; counter++) {
    if ($counter ~ /^duration=/) {
      value = substr($counter, index($counter, "=") + 1);
      if (value + 0 > 2) {
        print $0;
      }
    }
  }
}' <inputfile>

inserting text around a list of things

This is something simple that, for some reason, is eluding me. I should think I'd be able to do this from the bash prompt with a very simple script or one-liner.
I have a file, consisting of a list of numbers:
12345
23456
34567
45678
Very simple. I want to change it to:
arglebargle-12345-fulferol-12345-applesauce
arglebargle-23456-fulferol-23456-applesauce
arglebargle-34567-fulferol-34567-applesauce
arglebargle-45678-fulferol-45678-applesauce
So... insert a string on both sides of a number (or a string, it just happens that these strings are numbers, it is not essential that they be numbers)... then append the original string, and put a third string after that.
I think I would prefer to do this in sed or awk as a one-liner. Or as a ruby script. It should be so easy, and it is evading my mind for some reason!
Using sed
echo "12345
> 23456
> 34567
> 45678" | sed -e 's/\(.*\)/arglebargle-\1-fulferol-\1-applesauce/g'
arglebargle-12345-fulferol-12345-applesauce
arglebargle-23456-fulferol-23456-applesauce
arglebargle-34567-fulferol-34567-applesauce
arglebargle-45678-fulferol-45678-applesauce
If you want to substitute in place in a file (where your file is a.txt, for example) you can do:
sed -i 's/\(.*\)/arglebargle-\1-fulferol-\1-applesauce/g' a.txt
Something like this?
nums = %w(12345 23456 34567 45678)
nums.each { |num| puts "arglebargle-#{num}-fulferol-#{num}-applesauce" }
Output:
arglebargle-12345-fulferol-12345-applesauce
arglebargle-23456-fulferol-23456-applesauce
arglebargle-34567-fulferol-34567-applesauce
arglebargle-45678-fulferol-45678-applesauce
This should accomplish the job:
awk '{ print "arglebargle-" $0 "-fulferol-" $0 "-applesauce" }' numFile

AWK between 2 patterns - first occurrence

I have this example of an ini file. I need to extract the names between the two patterns [Name_Z1] and OBJ=Name_Z1 and put each of them on its own line.
The problem is that there is more than one occurrence of Name_Z1 and OBJ=Name_Z1, and I only need the first occurrence.
[Name_Z5]
random;text
Names;Jesus;Tom;Miguel
random;text
OBJ=Name_Z5
[Name_Z1]
random;text
Names;Jhon;Alex;Smith
random;text
OBJ=Name_Z1
[Name_Z2]
random;text
Names;Chris;Mara;Iordana
random;text
OBJ=Name_Z2
[Name_Z1_Phone]
random;text
Names;Bill;Stan;Mike
random;text
OBJ=Name_Z1_Phone
My desired output would be:
Jhon
Alex
Smith
I am currently writing a larger script in bash and I am stuck on this part. I would prefer awk to do the job.
Any help is greatly appreciated. Thank you!
For Wintermute's solution: the [Name_Z1] part looks like this:
[CAB_Z1]
READ_ONLY=false
FilterAttr=CeaseTime;blank|ObjectOfReference;contains;511047;512044;513008;593026;598326;CL5518;CL5521;CL5538;CL5612;CL5620|PerceivedSeverity;=;Critical;Major;Minor|ProbableCause;!=;HOUSE ALARM;IO DEVICE|ProblemText;contains;AIRE;ALIMENTA;BATER;CONVERTIDOR;DISTRIBUCION;FUEGO;HURTO;MAINS;MALLO;MAYOR;MENOR;PANEL;TEMP
NAME=CAB_Z1
And the [Name_Z1_Phone] part looks like this:
[CAB_Z1_FUEGO]
READ_ONLY=false
FilterAttr=CeaseTime;blank|ObjectOfReference;contains;511047;512044;513008;593026;598326;CL5518;CL5521;CL5538;CL5612;CL5620|PerceivedSeverity;=;Critical;Major;Minor|ProbableCause;!=;HOUSE ALARM;IO DEVICE|ProblemText;contains;FUEGO
NAME=CAB_Z1_FUEGO
The fix should be somewhere around the "|PerceivedSeverity"
Expected Output:
511047
512044
513008
593026
598326
CL5518
CL5521
CL5538
CL5612
CL5620
This should work:
sed -n '/^\[Name_Z1/,/^OBJ=Name_Z1/ { /^Names/ { s/^Names;//; s/;/\n/g; p; q } }' foo.txt
Explanation: Written readably, the code is
/^\[Name_Z1/,/^OBJ=Name_Z1/ {
/^Names/ {
s/^Names;//
s/;/\n/g
p
q
}
}
This means: In the pattern range /^\[Name_Z1/,/^OBJ=Name_Z1/, for all lines that match the pattern /^Names/, remove the Names; in the beginning, then replace all remaining ; with newlines, print the whole thing, and then quit. Since it immediately quits, it will only handle the first such line in the first such pattern range.
EDIT: The update made things a bit more complicated. I suggest
sed -n '/^\[CAB_Z1/,/^NAME=CAB_Z1/ { /^FilterAttr=/ { s/^.*contains;\(.*\)|PerceivedSeverity.*$/\1/; s/;/\n/g; p; q } }' foo.txt
The main difference is that instead of removing ^Names from a line, the substitution
s/^.*contains;\(.*\)|PerceivedSeverity.*$/\1/;
is applied. This isolates the part between contains; and |PerceivedSeverity before continuing as before. It assumes that there is only one such part in the line. If the match is ambiguous, it will pick the one that appears last in the line.
A (g)awk way that doesn't need a set number of fields (although I have assumed that contains; will always be on the line you need the names from):
(g)awk '(x+=/Z1/)&&match($0,/contains;([^|]+)/,a)&&gsub(";","\n",a[1]){print a[1];exit}' f
Explanation
(x+=/Z1/)                      - Increments x when Z1 is found. Also part of a
                                 condition, so x must be nonzero to continue.
match($0,/contains;([^|]+)/,a) - Matches contains; and then captures everything after,
                                 up to the |. Stores the capture in a. Again a
                                 condition, so it must succeed to continue.
gsub(";","\n",a[1])            - Substitutes all the ; for newlines in the capture
                                 group a[1].
{print a[1];exit}              - If all conditions are met, print a[1] and exit.
This way should work in (m)awk
awk '(x+=/Z1/)&&/contains/{split($0,a,"|");y=split(a[2],b,";");for(i=3;i<=y;i++)
print b[i];exit}' file
sed -n '/\[Name_Z1\]/,/OBJ=Name_Z1$/ s/Names;//p' file.txt | tr ';' '\n'
That is sed -n to avoid printing anything not explicitly requested. Start from Name_Z1 and finish at OBJ=Name_Z1. Remove Names; and print the rest of the line where it occurs. Finally, replace semicolons with newlines.
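A quick check of this pipeline against the sample ini from the question (file name is invented):

```shell
# Rebuild the sample ini from the question.
cat > file.txt <<'EOF'
[Name_Z5]
random;text
Names;Jesus;Tom;Miguel
random;text
OBJ=Name_Z5
[Name_Z1]
random;text
Names;Jhon;Alex;Smith
random;text
OBJ=Name_Z1
[Name_Z1_Phone]
random;text
Names;Bill;Stan;Mike
random;text
OBJ=Name_Z1_Phone
EOF

sed -n '/\[Name_Z1\]/,/OBJ=Name_Z1$/ s/Names;//p' file.txt | tr ';' '\n'
# prints: Jhon, Alex, Smith (one per line)
```

The \] in the start pattern keeps [Name_Z1_Phone] from opening a second range, and the $ anchor on the end pattern keeps OBJ=Name_Z1_Phone from being mistaken for the range end.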
An awk solution would be:
$ awk -F";" '/Name_Z1/{f=1} f && /Names/{print $2,$3,$4} /OBJ=Name_Z1/{exit}' OFS="\n" input
Jhon
Alex
Smith
OR
$ awk -F";" '/Name_Z1/{f++} f==1 && /Names/{print $2,$3,$4}' OFS="\n" input
Jhon
Alex
Smith
-F";" sets the field separator to ;
/Name_Z1/{f++} matches the line against the pattern /Name_Z1/; if it matches, f is incremented
f==1 && /Names/{print $2,$3,$4} means: if f == 1 and the line matches the pattern Names, print columns 2, 3 and 4 (delimited by ;)
OFS="\n" sets the output field separator to a newline
EDIT
$ awk -F"[;|]" '/Z1/{f++} f==1 && NF>1{for (i=5; i<15; i++)print $i}' input
511047
512044
513008
593026
598326
CL5518
CL5521
CL5538
CL5612
CL5620
Here is a more generic solution for data in groups of blocks.
This awk does not need the end tag, just the start.
awk -vRS= -F"\n" '/^\[Name_Z1\]/ {n=split($3,a,";");for (i=2;i<=n;i++) print a[i];exit}' file
Jhon
Alex
Smith
How it works:
awk -vRS= -F"\n" '       # By setting RS to nothing, one record equals one block; FS is set so one line is one field
/^\[Name_Z1\]/ {         # Search for the block starting with [Name_Z1]
    n=split($3,a,";")    # Split field 3, the names, and store the number of parts in n
    for (i=2;i<=n;i++)   # Loop from the second to the last part
        print a[i]       # Print the names
    exit                 # Exit after the first find
}' file
With updated data
cat file
data
[CAB_Z1_FUEGO]
READ_ONLY=false
FilterAttr=CeaseTime;blank|ObjectOfReference;contains;511047;512044;513008;593026;598326;CL5518;CL5521;CL5538;CL5612;CL5620|PerceivedSeverity;=;Critical;Major;Minor|ProbableCause;!=;HOUSE ALARM;IO DEVICE|ProblemText;contains;FUEGO
NAME=CAB_Z1_FUEGO
data
awk -vRS= -F"\n" '/^\[CAB_Z1_FUEGO\]/ {split($3,a,"|");n=split(a[2],b,";");for (i=3;i<=n;i++) print b[i]}' file
511047
512044
513008
593026
598326
CL5518
CL5521
CL5538
CL5612
CL5620
The following awk script will do what you want:
awk 's==1&&/^Names/{gsub("Names;","",$0);gsub(";","\n",$0);print}/^\[Name_Z1\]$/||/^OBJ=Name_Z1$/{s++}' inputFileName
In more detail:
s==1 && /^Names;/ {
gsub ("Names;","",$0);
gsub(";","\n",$0);
print
}
/^\[Name_Z1\]$/ || /^OBJ=Name_Z1$/ {
s++
}
The state s starts with a value of zero and is incremented whenever you find one of the two lines:
[Name_Z1]
OBJ=Name_Z1
That means, between the first set of those lines, s will be equal to one. That's where the other condition comes in. When s is one and you find a line starting with Names;, you do two substitutions.
The first is to get rid of the Names; at the front, the second is to replace all ; semi-colon characters with a newline. Then you print it out.
The output for your given test data is, as expected:
Jhon
Alex
Smith

Delete lines before and after a match in bash (with sed or awk)?

I'm trying to delete two lines on either side of a pattern match from a file full of transactions. I.e. find the match, then delete the two lines before it, the two lines after it, and the match itself, then write this back to the original file.
So the input data is
D28/10/2011
T-3.48
PINITIAL BALANCE
M
^
and my attempt so far is
sed -i '/PINITIAL BALANCE/,+2d' test.txt
However this is only deleting two lines after the pattern match and then deleting the pattern match. I can't work out any logical way to delete all 5 lines of data from the original file using sed.
an awk one-liner may do the job:
awk '/PINITIAL BALANCE/{for(x=NR-2;x<=NR+2;x++)d[x];}{a[NR]=$0}END{for(i=1;i<=NR;i++)if(!(i in d))print a[i]}' file
test:
kent$ cat file
######
foo
D28/10/2011
T-3.48
PINITIAL BALANCE
M
x
bar
######
this line will be kept
here
comes
PINITIAL BALANCE
again
blah
this line will be kept too
########
kent$ awk '/PINITIAL BALANCE/{for(x=NR-2;x<=NR+2;x++)d[x];}{a[NR]=$0}END{for(i=1;i<=NR;i++)if(!(i in d))print a[i]}' file
######
foo
bar
######
this line will be kept
this line will be kept too
########
Adding some explanation:
awk '/PINITIAL BALANCE/{for(x=NR-2;x<=NR+2;x++)d[x];} #if match found, add the line and +- 2 lines' line number in an array "d"
{a[NR]=$0} # save all lines in an array with line number as index
END{for(i=1;i<=NR;i++)if(!(i in d))print a[i]}' #finally print only those index not in array "d"
file # your input file
sed will do it:
sed '/\n/!N;/\n.*\n/!N;/\n.*\n.*PINITIAL BALANCE/{$d;N;N;d};P;D'
It works this way:
if sed has only one line in the pattern space, it appends another one
if there are only two, it appends a third
if the pattern space matches LINE + LINE + LINE-with-BALANCE, it appends the two following lines, deletes them all, and starts over
if not, it prints the first line of the pattern space, deletes it, and starts over without wiping the rest of the pattern space
To handle the pattern appearing on the first line, you should modify the script:
sed '1{/PINITIAL BALANCE/{N;N;d}};/\n/!N;/\n.*\n/!N;/\n.*\n.*PINITIAL BALANCE/{$d;N;N;d};P;D'
However, it fails if another PINITIAL BALANCE occurs among the lines that are going to be deleted. Then again, other solutions fail there too =)
For such a task, I would probably reach for a more advanced tool like Perl:
perl -ne 'push @x, $_;
if (@x > 4) {
if ($x[2] =~ /PINITIAL BALANCE/) { undef @x }
else { print shift @x }
}
END { print @x }' input-file > output-file
This will remove 5 lines from the input file: the 2 lines before the match, the matched line, and the two lines after it. You can change the total number of lines being removed by modifying @x > 4 (this removes 5 lines) and the line being matched by modifying $x[2] (this makes the match occur on the third line to be removed and so removes the two lines before the match).
A more simple and easy to understand solution might be:
awk '/PINITIAL BALANCE/ {print NR-2 "," NR+2 "d"}' input_filename \
| sed -f - input_filename > output_filename
awk is used to make a sed script that deletes the lines in question, and the result is written to output_filename.
This uses two processes, which might be less efficient than the other answers.
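On a small made-up sample, the intermediate sed script and the final result look like this (the surrounding foo/bar lines are invented):

```shell
cat > input_filename <<'EOF'
foo
D28/10/2011
T-3.48
PINITIAL BALANCE
M
^
bar
EOF

# Step 1: awk emits one delete command per match (the match sits on line 4).
awk '/PINITIAL BALANCE/ {print NR-2 "," NR+2 "d"}' input_filename
# prints: 2,6d

# Step 2: sed reads that script from stdin (-f -) and applies it.
awk '/PINITIAL BALANCE/ {print NR-2 "," NR+2 "d"}' input_filename \
  | sed -f - input_filename
# prints: foo and bar
```

Note that sed -f - (reading the script from stdin) is a GNU sed feature; on other seds, write the generated script to a temporary file first.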
This might work for you (GNU sed):
sed ':a;$q;N;s/\n/&/2;Ta;/\nPINITIAL BALANCE$/!{P;D};$q;N;$q;N;d' file
Save this code into a file grep.sed:
H
s:.*::
x
s:^\n::
:r
/PINITIAL BALANCE/ {
N
N
d
}
/.*\n.*\n/ {
P
D
}
x
d
and run a command like this:
sed -i -f grep.sed FILE
Or you can use it directly as a one-liner:
sed -i 'H;s:.*::;x;s:^\n::;:r;/PINITIAL BALANCE/{N;N;d;};/.*\n.*\n/{P;D;};x;d' FILE
