awk sed filter values in all lines greater/smaller than - bash

is there a way to construct a filter in awk (or something similar) that for a given file, say:
0.99,0.98,1.1,0.85,0.92
0.76,1.4,0.99,0.99,0.82
1.0,1.45,0.78,0.91,0.95
would replace any record in a line that is greater than 1.0 with 1.0?

Here is something you can do with awk
awk -F, '{for(i=1;i<=NF;i++) if($i>1) {$i="replacement"}}1' OFS=, file
Test:
$ cat file
0.99,0.98,1.1,0.85,0.92
0.76,1.4,0.99,0.99,0.82
1.0,1.45,0.78,0.91,0.95
$ awk -F, '{for(i=1;i<=NF;i++) if($i>1) {$i="replacement"}}1' OFS=, file
0.99,0.98,replacement,0.85,0.92
0.76,replacement,0.99,0.99,0.82
1.0,replacement,0.78,0.91,0.95

Here’s a sed solution:
sed -e 's/[1-9][0-9]*\.[0-9]*/1.0/g' in-file > out-file
The pattern [1-9][0-9]*\.[0-9]* simply matches any sequence that begins with a digit greater than 0, followed by zero or more digits, followed by the decimal point, followed by additional digits. If you want an in-place replacement, you can use the -i option:
sed -i -e 's/[1-9][0-9]*\.[0-9]*/1.0/g' in-file

Related

Sed command to replace numbers between space and :

I have a file with a records like the below
FIRST 1: SECOND 2: THREE 4: FIVE 255: SIX 255
I want to remove values between space and :
FIRST:SECOND:THREE:FIVE:SIX
with code
awk -F '[[:space:]]*,:*' '{$1=$1}1' OFS=, file
tried on gnu awk:
awk -F' [0-9]*(: *|$)' -vOFS=':' '{print $1,$2,$3,$4,$5}' file
tried on gnu sed:
sed -E 's/\s+[0-9]+(:|$)\s*/\1/g' file
Explanation of awk,
regex , a space, followed by [0-9]+ one or more number followed by literal : followed by one or more space: *, if all such matched, then collect everything else than this matched pattern, ie. FIRST, SECOND,... so on because -F option determine it as field separator (FS) and $1, $2 .. so on is always else than FS. But the output needs nice look ie. has FS so that'd be : and it'd be awk variable definition -vOFS=':'
You can add [[:digit:]] also with a ending asterisk, and leave only a space just after OFS= :
$ awk -F '[[:space:]][[:digit:]]*' '{$1=$1}1' OFS= file
FIRST:SECOND:THREE:FIVE:SIX
To get the output we want in idiomatic awk, we make the input field separator (with -F) contain all the stuff we want to eliminate (anchored with :), and make the output field separator (OFS) what we want it replaced with. The catch is that this won't eliminate the space and numbers at the end of the line, and for this we need to do something more. GNU’s implementation of awk will allow us to use a regular expression for the input record separator (RS), but we could just do a simple sub() with POSIX complaint awk as well. Finally, force recalculation via $1=$1... the side effects for this pattern/statement are that the buffer will be recalculated doing FS/RS substitution for us, and that non-blank lines will take the default action -- which is to print.
gawk -F '[[:space:]]*[[:digit:]]*:[[:space:]]*' -v OFS=: -v RS='[[:space:]]*[[:digit:]]*\n' '$1=$1' file
Or:
awk -F '[[:space:]]*[[:digit:]]*:[[:space:]]*' -v OFS=: '{ sub(/[[:space:]]*[[:digit:]]*$/, “”) } $1=$1' file
A sed implementation is fun but probably slower (because current versions of awk have better regex implementations).
sed 's/[[:space:]]*[[:digit:]]*:[[:space:]]/:/g; s/[[:space:]]*[[:digit:]]*[[:space:]]*$//' file
Or if POSIX character classes are not available...
sed 's/[\t ]*[0-9]*:[\t ]/:/g; s/[\t ]*[0-9]*[\t ]*$//' file
Something tells me that your “FIRST, SECOND, THIRD...” might be more complicated, and might contain digits... in this case, you might want to experiment with replacing * with + for awk or with \+ for sed.

Using grep to pull a series of random numbers from a known line

I have a simple scalar file producing strings like...
bpred_2lev.ras_rate.PP 0.9413 # RAS prediction rate (i.e., RAS hits/used RAS)
Once I use grep to find this line in the output.txt, is there a way I can directly grab the "0.9413" portion? I am attempting to make a cvs file and just need whatever value is generated.
Thanks in advance.
There are several ways to combine finding and extracting into a single command:
awk (POSIX-compliant)
awk '$1 == "bpred_2lev.ras_rate.PP" { print $2 }' file
sed (GNU sed or BSD/OSX sed)
sed -En 's/^bpred_2lev\.ras_rate\.PP +([^ ]+).*$/\1/p' file
GNU grep
grep -Po '^bpred_2lev\.ras_rate\.PP +\K[^ ]+' file
You can use awk like this:
grep <your_search_criteria> output.txt | awk '{ print $2 }'

Extract string between two patterns (inclusive) while conserving the format

I have a file in the following format
cat test.txt
id1,PPLLTOMaaaaaaaaaaaJACK
id2,PPLRTOMbbbbbbbbbbbJACK
id3,PPLRTOMcccccccccccJACK
I am trying to identify and print the string between TOM and JACK including these two strings, while maintaining the first column FS=,
Desired output:
id1,TOMaaaaaaaaaaaJACK
id2,TOMbbbbbbbbbbbJACK
id3,TOMcccccccccccJACK
So far I have tried gsub:
awk -F"," 'gsub(/.*TOM|JACK.*/,"",$2) && !_[$0]++' test.txt > out.txt
and have the following output
id1 aaaaaaaaaaa
id2 bbbbbbbbbbb
id3 ccccccccccc
As you can see I am getting close but not able to include TOM and JACK patterns in my output. Plus I am also losing the original FS. What am I doing wrong?
Any help will be appreciated.
You are changing a field ($2) which causes awk to reconstruct the record using the value of OFS as the field separator and so in this case changing the commas to spaces.
Never use _ as a variable name - using a name with no meaning is just slightly better than using a name with the wrong meaning, just pick a name that means something which, in this case is seen but idk what you are trying to do when using that in this context.
gsub() and sub() do not support capture groups so you either need to use match()+substr():
$ awk 'BEGIN{FS=OFS=","} match($2,/TOM.*JACK/){$2=substr($2,RSTART,RLENGTH)} 1' file
id1,TOMaaaaaaaaaaaJACK
id2,TOMbbbbbbbbbbbJACK
id3,TOMcccccccccccJACK
or use GNU awk for the 3rd arg to match()
$ gawk 'BEGIN{FS=OFS=","} match($2,/TOM.*JACK/,a){$2=a[0]} 1' file
id1,TOMaaaaaaaaaaaJACK
id2,TOMbbbbbbbbbbbJACK
id3,TOMcccccccccccJACK
or for gensub():
$ gawk 'BEGIN{FS=OFS=","} {$2=gensub(/.*(TOM.*JACK).*/,"\\1","",$2)} 1' file
id1,TOMaaaaaaaaaaaJACK
id2,TOMbbbbbbbbbbbJACK
id3,TOMcccccccccccJACK
The main difference between the match() and gensub() solutions is how they would behave if TOM appeared twice on the line:
$ cat file
id1,PPLLfooTOMbarTOMaaaaaaaaaaaJACK
id2,PPLRTOMbbbbbbbbbbbJACKfooJACKbar
id3,PPLRfooTOMbarTOMcccccccccccJACKfooJACKbar
$
$ awk 'BEGIN{FS=OFS=","} match($2,/TOM.*JACK/,a){$2=a[0]} 1' file
id1,TOMbarTOMaaaaaaaaaaaJACK
id2,TOMbbbbbbbbbbbJACKfooJACK
id3,TOMbarTOMcccccccccccJACKfooJACK
$
$ awk 'BEGIN{FS=OFS=","} {$2=gensub(/.*(TOM.*JACK).*/,"\\1","",$2)} 1' file
id1,TOMaaaaaaaaaaaJACK
id2,TOMbbbbbbbbbbbJACKfooJACK
id3,TOMcccccccccccJACKfooJACK
and just to show one way of stopping at the first instead of the last JACK on the line:
$ awk 'BEGIN{FS=OFS=","} match($2,/TOM.*JACK/,a){$2=gensub(/(JACK).*/,"\\1","",a[0])} 1' file
id1,TOMbarTOMaaaaaaaaaaaJACK
id2,TOMbbbbbbbbbbbJACK
id3,TOMbarTOMcccccccccccJACK
Use capture groups to save the parts of the line you want to keep. Here's how to do it with sed
sed 's/^\([^,]*,\).*\(TOM.*JACK\).*/\1\2/' <test.txt > out.txt
Do you mean to do the following?
$ cat test.txt
id1,PPLLTOMaaaaaaaaaaaJACKABCD
id2,PPLRTOMbbbbbbbbbbbJACKDFCC
id3,PPLRTOMcccccccccccJACKSDER
$ cat test.txt | sed -e 's/,.*TOM/,TOM/g' | sed -e 's/JACK.*/JACK/g'
id1,TOMaaaaaaaaaaaJACK
id2,TOMbbbbbbbbbbbJACK
id3,TOMcccccccccccJACK
$
This should work as long as the TOM and JACK do not repeat themselves.
sed 's/\(.*,\).*\(TOM.*JACK\).*/\1\2/' <oldfile >newfile
Output:
id1,TOMaaaaaaaaaaaJACK
id2,TOMbbbbbbbbbbbJACK
id3,TOMcccccccccccJACK

Reading numbers from a text line in bash shell

I'm trying to write a bash shell script, that opens a certain file CATALOG.dat, containing the following lines, made of both characters and numbers:
event_0133_pk.gz
event_0291_pk.gz
event_0298_pk.gz
event_0356_pk.gz
event_0501_pk.gz
What I wanna do is print the numbers (only the numbers) inside a new file NUMBERS.dat, using something like > ./NUMBERS.dat, to get:
0133
0291
0298
0356
0501
My problem is: how do I extract the numbers from the text lines? Is there something to make the script read just the number as a variable, like event_0%d_pk.gz in C/C++?
A grep solution:
grep -oP '[0-9]+' CATALOG.dat >NUMBERS.dat
A sed solution:
sed 's/[^0-9]//g' CATALOG.dat >NUMBERS.dat
And an awk solution:
awk -F"[^0-9]+" '{print $2}' CATALOG.dat >NUMBERS.dat
There are many ways that you can achieve your result. One way would be to use awk:
awk -F_ '{print $2}' CATALOG.dat > NUMBERS.dat
This sets the field separator to an underscore, then prints the second field which contains the numbers.
Awk
awk 'gsub(/[^[:digit:]]/,"")' infile
Bash
while read line; do echo ${line//[!0-9]}; done < infile
tr
tr -cd '[[:digit:]\n]' <infile
You can use grep command to extract the number part.
grep -oP '(?<=_)\d+(?=_)' CATALOG.dat
gives output as
0133
0291
0298
0356
0501
Or
much simply
grep -oP '\d+' CATALOG.dat
You don't need perl mode in grep for this. BREs can do this.
grep -o '[[:digit:]]\+' CATALOG.dat > NUMBERS.dat

Add blank column using awk or sed

I have a file with the following structure (comma delimited)
116,1,89458180,17,FFFF,0403254F98
I want to add a blank column on the 4th field such that it becomes
116,1,89458180,,17,FFFF,0403254F98
Any inputs as to how to do this using awk or sed if possible ?
thank you
Assuming that none of the fields contain embedded commas, you can restate the task as replacing the third comma with two commas. This is just:
sed 's/,/,,/3'
With the example line from the file:
$ echo "116,1,89458180,17,FFFF,0403254F98" | sed 's/,/,,/3'
116,1,89458180,,17,FFFF,0403254F98
You can use this awk,
awk -F, '$4="," $4' OFS=, yourfile
(OR)
awk -F, '$4=FS$4' OFS=, yourfile
If you want to add 6th and 8th field,
awk -F, '{$4=FS$4; $1=FS$1; $6=FS$6}1' OFS=, yourfile
Through awk
$ echo '116,1,89458180,17,FFFF,0403254F98' | awk -F, -v OFS="," '{print $1,$2,$3,","$4,$5,$6}'
116,1,89458180,,17,FFFF,0403254F98
It prints a , after third field(delimited) by ,
Through GNU sed
$ echo 116,1,89458180,17,FFFF,0403254F98| sed -r 's/^([^,]*,[^,]*,[^,]*)(.*)$/\1,\2/'
116,1,89458180,,17,FFFF,0403254F98
It captures all the characters upto the third command and stored it into a group. Characters including the third , upto the last are stored into another group. In the replacement part, we just add an , between these two captured groups.
Through Basic sed,
Through Basic sed
$ echo 116,1,89458180,17,FFFF,0403254F98| sed 's/^\([^,]*,[^,]*,[^,]*\)\(.*\)$/\1,\2/'
116,1,89458180,,17,FFFF,0403254F98
echo 116,1,89458180,17,FFFF,0403254F98|awk -F',' '{print $1","$2","$3",,"$4","$5","$6}'
Non-awk
t="116,1,89458180,17,FFFF,0403254F98"
echo $(echo $t|cut -d, -f1-3),,$(echo $t|cut -d, -f4-)
You can use bellow awk command to achieve that.Replace the $3 with what ever the column that you want to make it blank.
awk -F, '{$3="" FS $3;}1' OFS=, filename
sed -e 's/\([^,]*,\)\{4\}/&,/' YourFile
replace the sequence of 4 [content (non comma) than comma ] by itself followed by a comma

Resources