replace exact number in shell

I have the following matrix:
0.380451 0.381955 0 0.237594
0.317293 0.362406 0 0.320301
0.261654 0.38797 0 0.350376
0 0 0 1
0 1 0 0
0 0 0 1
0 0.001504 0 0.998496
0.270677 0.35188 0.018045 0.359398
0.36391 0.305263 0 0.330827
0.359398 0.291729 0.037594 0.311278
0.359398 0.276692 0.061654 0.302256
I want to replace only the standalone zeros (not the zeros followed by a decimal point) with 0.001. How can I do that with sed or gsub?

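This is not elegant, and not super portable, but it works on your specific example (X stands in for the replacement value, 0.001 in your case):
sed -e 's=^0 =X =g
s= 0$= X=g
s= 0 = X =g
s= 0 = X =g' data.txt
First of all, it assumes that the fields in the input file are separated by a single space. The first expression handles a "0" at the beginning of the line, the second a "0" at the end of the line, and the third a "0" with a space on both sides. The third expression appears twice because each match consumes the space after the zero, so in a run such as 0 0 0 1 a single pass skips every other zero.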
Any particular reason to use only sed for this? I am sure that a simple awk script could do a better job, and also be more robust.
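Following up on the awk suggestion, here is a minimal sketch. It compares each field as a string, so only a field that is exactly 0 is replaced and 0.001504 is left alone. It assumes whitespace-separated fields, and changed lines are rejoined with single spaces. The function name replace_zeros is made up for this example.

```shell
# Hypothetical helper: replace fields that are exactly "0" with 0.001.
replace_zeros() {
  # $i == "0" is a string comparison, so 0.001504 or 0.0 do not match.
  # The trailing 1 prints every line, changed or not.
  awk '{ for (i = 1; i <= NF; i++) if ($i == "0") $i = "0.001" } 1' "$@"
}

printf '%s\n' '0 0 0 1' '0 0.001504 0 0.998496' | replace_zeros
```

With a file, the call would be replace_zeros data.txt > new_data.txt.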

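Match the whitespace that follows the zero as part of the pattern:
echo 0 0.001504 0 0.998496 | sed 's/0[\t ]/Z /g'
Note that this also matches any number that merely ends in 0 (10 would become 1Z), and it misses a zero at the end of a line.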

Related

Extract compound data from SDF file using IDNUMBER and write to a new file

I'm still quite new to awk and have been trying to use a bash script and awk to filter a file according to a list of codes in a separate text file. While there are a few similar questions around, I have been unable to adapt their implementations.
My first file, idnumber.txt, looks like this:
4323-7584
K8933-4943
L2837-0493
The file I am attempting to filter the molecule blocks from has entries as follows:
-ISIS- -- StrEd --
28 29 0 0 0 0 0 0 0 0999 V2000
-1.7382 0.7650 0.0000 C 0 0 0
18 27 1 0 0 0 0
M END
> <IDNUMBER> (K784-9550)
K784-9550
$$$$
-ISIS- -- StrEd --
28 29 0 0 0 0 0 0 0 0999 V2000
-1.7382 0.7650 0.0000 C 0 0 0
18 27 1 0 0 0 0
M END
> <IDNUMBER> (4323-7584)
4323-7584
$$$$
-ISIS- -- StrEd --
28 29 0 0 0 0 0 0 0 0999 V2000
-1.7382 0.7650 0.0000 C 0 0 0
18 27 1 0 0 0 0
M END
> <IDNUMBER> (4323-7584)
L2789-0943
$$$$
-ISIS- -- StrEd --
28 29 0 0 0 0 0 0 0 0999 V2000
-1.7382 0.7650 0.0000 C 0 0 0
18 27 1 0 0 0 0
M END
> <IDNUMBER> (4323-2738)
4323-2738
> <SALT>
NaCl
$$$$
The file repeats in this fashion, each block starting with -ISIS- -- StrEd -- and ending with $$$$. I need to extract the entire block for each ID listed in idnumber.txt. So the expected output is every block from above, from -ISIS- to $$$$, that has a matching ID in idnumber.txt.
Each entry has a different length, and I am trying to extract the entire block starting from the -ISIS- -- StrEd -- line.
I have tried a few options of sed trying to recognise the first line to the IDNUMBER and extracting around it but that didn't work. My current iteration of the code is as follows:
#!/bin/bash
cat idnumbers.txt | while read line
do
sed -n '/^-ISIS-$/,/^$line$/p' compound_library.sdf > filtered.sdf
done
The logic behind what I was attempting was to find the block that would match the start as the ISIS phrase and end with the relevant ID number, copying that to a file. I realise now that what my logic was doing would skip the $$$$ that terminates each block.
But I have a feeling I am missing something as it is not actually writing anything to filtered.sdf.
Expected output:
-ISIS- -- StrEd --
28 29 0 0 0 0 0 0 0 0999 V2000
-1.7382 0.7650 0.0000 C 0 0 0
18 27 1 0 0 0 0
M END
> <IDNUMBER> (4323-7584)
4323-7584
$$$$
-ISIS- -- StrEd --
28 29 0 0 0 0 0 0 0 0999 V2000
-1.7382 0.7650 0.0000 C 0 0 0
18 27 1 0 0 0 0
M END
> <IDNUMBER> (4323-7584)
L2789-0943
$$$$
Edit:
So I have tried a different approach based on another question but have not been able to figure out how to alter the key assigned to a record in awk based on recognizing the characters at the line containing the IDNUMBER because it is a different field for each record.
awk 'BEGIN{RS="\\$\\$\\$\\$"; ORS="$$$$"}
(NR==FNR){a[$1]=$0; next}
($1 in a) { print a[$1] }' file1.sdf RS="\n" file2.txt
I assume it would be a matter of changing the field reference in the array $1 to an expression that recognizes the line after > <IDNUMBER>(xyz), but I am unsure how to go about achieving that.
You wrote that you have a feeling you are missing something.
In this command
sed -n '/^-ISIS-$/,/^$line$/p' compound_library.sdf > filtered.sdf
you are using the following regular expressions:
^-ISIS-$
^$line$
^ denotes the start of a line and $ denotes the end of a line.
The first looks for -ISIS- spanning the whole line, whilst your file has
-ISIS- -- StrEd --
that is, -ISIS- as only part of the line. You should therefore use a regular expression without the end anchor, for example just -ISIS-. As written, the range never starts, which is why nothing is written to filtered.sdf.
The second is inside single quotes, so the shell never expands $line and sed receives that text literally. A $ that is not at the end of the pattern is an ordinary character, so this expression only matches a line consisting of the text $line, which your file does not contain. Even with the first expression fixed, sed would therefore keep printing until the end of the file. I have no idea if this is the desired behaviour, but be warned that the more common way to do that in GNU sed is to use $ as an address (meaning the last line). For example, to print the first line holding a digit and all following lines:
sed -n '/[0-9]/,$p' file.txt
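To illustrate the quoting point with a small sketch: inside double quotes the shell substitutes the variable before sed sees the pattern, and \$ keeps a literal $ for the end-of-line anchor. The helper name print_id_line and the sample IDs are made up for the demonstration. It assumes the ID contains no characters that are special in a sed regular expression.

```shell
# Hypothetical helper: print the input lines that consist exactly of the given ID.
print_id_line() {
  line=$1
  # Double quotes: the shell expands ${line}; \$ passes a literal $ to sed,
  # where it acts as the end-of-line anchor because it ends the pattern.
  sed -n "/^${line}\$/p"
}

printf '%s\n' 'K784-9550' '4323-7584' | print_id_line '4323-7584'
```

The same pattern written in single quotes, '/^$line$/p', would be passed to sed unchanged and would not match the ID.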
Maybe this is what you are looking for. Some explanation:
[[:blank:]] matches a space or tab only, not newline characters.
The first regex looks for the start pattern -ISIS- -- StrEd -- you mentioned (with a variable number of spaces/tabs in between). On a match, the variable found is set to 1 and the collected text is cleared, because a new block begins.
The second regex looks for the header line > <IDNUMBER> (xxxx-xxxx) (also with a variable number of spaces/tabs), where xxxx-xxxx comes from the file idnumber.txt. On a match, found is set to 2, so we know the current block is one we want to print.
While found is at least 1, each input line of compound_library.sdf is appended to the variable text.
The third regex looks for $$$$, which is the "real" end of the block. If found is 2 at that point, the whole variable text is printed. Either way, found is reset to 0 so the next block starts afresh.
while IFS= read -r IdNumber; do
  awk '
    BEGIN {
      found=0
    }
    # start of a block: begin collecting afresh
    /^[[:blank:]]*-ISIS-[[:blank:]]*--[[:blank:]]*StrEd[[:blank:]]*--/ {
      found=1
      text=""
    }
    # header line naming the ID we are looking for
    /^>[[:blank:]]*<IDNUMBER>[[:blank:]]*\('"${IdNumber}"'\)/ {
      found=2
      #print "IdNumber='"${IdNumber}"', found=" found >>"/dev/stderr"
    }
    # collect every line of the current block
    found >= 1 {
      text=sprintf("%s%s\n", text, $0)
    }
    # end of the block: print it only if the ID matched
    /^\$\$\$\$$/ {
      if (found == 2) {
        printf "%s", text
      }
      found=0
    }' \
    compound_library.sdf
    #compound_library.sdf > ${IdNumber}.sdf
done < idnumber.txt
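The loop above starts one awk process per ID and rereads the library each time. As a hedged alternative, closer to the NR==FNR attempt in the question's edit, the IDs can be loaded into an array first and the library read once. This is only a sketch: the function name filter_sdf is made up, and it assumes the ID to match is the text in parentheses on the > <IDNUMBER> header line (which is what the expected output in the question implies), not the data line below it.

```shell
# Hypothetical helper. Usage: filter_sdf idnumber.txt compound_library.sdf > filtered.sdf
filter_sdf() {
  awk '
    # First file: remember every non-empty line as a wanted ID.
    NR == FNR { if ($0 != "") want[$0] = 1; next }
    # Second file: collect the lines of the current block.
    { block = block $0 "\n" }
    # Header line: take the text between the parentheses as the ID.
    /^>[ \t]*<IDNUMBER>/ {
      id = $0
      sub(/^[^(]*\(/, "", id)
      sub(/\).*$/, "", id)
      if (id in want) keep = 1
    }
    # End of block: print it if its ID was wanted, then reset.
    /^\$\$\$\$/ {
      if (keep) printf "%s", block
      block = ""
      keep = 0
    }
  ' "$1" "$2"
}
```

Because the IDs are used as array keys rather than spliced into a regular expression, characters such as dots in an ID need no escaping.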

use bash or awk to replace part of a string

I have the following example lines in a file:
sweet_25 2 0 4
guy_guy 2 4 6
ging_ging 0 0 3
moat_2 0 1 0
I want to process the file and have the following output:
sweet_25 2 0 4
guy 2 4 6
ging 0 0 3
moat_2 0 1 0
Notice that the required change happens on lines 2 and 3: where text is followed by an underscore and more text, the underscore and the text following it are removed.
I have not succeeded with the following:
sed -E 's/([a-zA-Z])_[a-zA-Z]/$1/g' file.txt >out.txt
Any bash or awk advice will be welcome. Thanks.
If you want to replace the whole word after the underscore, you have to repeat the character class one or more times using [a-zA-Z]+ and use \1 in the replacement.
sed -E 's/([a-zA-Z])_[a-zA-Z]+/\1/g' file.txt >out.txt
If the words should be the same before and after the underscore, you can use a repeating capture group with a backreference.
If you only want to do this for the start of the string you can prepend ^ to the pattern and omit the /g at the end of the sed command.
sed -E 's/([a-zA-Z]+)(_\1)+/\1/g' file.txt >out.txt
The pattern matches:
([a-zA-Z]+) Capture group 1, match 1 or more occurrences of a char a-zA-Z
(_\1)+ Capture group 2, repeat matching _ and the same text captured by group 1
The file out.txt will contain:
sweet_25 2 0 4
guy 2 4 6
ging 0 0 3
moat_2 0 1 0
With your shown samples, please try following awk code.
awk 'split($1,arr,"_") && arr[1] == arr[2]{$1=arr[1]} 1' Input_file
Explanation: awk's split function splits the 1st field into an array named arr, using _ as the delimiter. If the 1st element of arr is equal to the 2nd element, only the 1st element is saved back to the first field ($1). The trailing 1 prints every line, edited or not.
You can do it more simply, like this:
sed -E 's/_[a-zA-Z]+//' file.txt >out.txt
This just replaces the first underscore that is followed by one or more alphabetical characters, together with those characters, with nothing.
$ awk 'NR~/^[23]$/{sub(/_[^ ]+/,"")} 1' file
sweet_25 2 0 4
guy 2 4 6
ging 0 0 3
moat_2 0 1 0
I would do:
awk '$1~/[[:alpha:]]_[[:alpha:]]/{sub(/_.*/,"",$1)} 1' file
Prints:
sweet_25 2 0 4
guy 2 4 6
ging 0 0 3
moat_2 0 1 0
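Since the question also asked for plain bash, here is a hedged sketch using only parameter expansion. It follows the same interpretation as the split-based awk answer: the first field is shortened only when the text before and after the underscore is identical. The function name dedupe_first_field is made up, and the output is rejoined with a single space after the first field.

```shell
# Hypothetical helper: collapse a first field of the form word_word to word.
dedupe_first_field() {
  while read -r first rest; do
    head=${first%%_*}   # text before the first underscore
    tail=${first#*_}    # text after the first underscore
    # Rewrite only when an underscore is present and both halves are identical.
    if [ "$first" != "$head" ] && [ "$head" = "$tail" ]; then
      first=$head
    fi
    printf '%s\n' "$first${rest:+ $rest}"
  done
}

printf '%s\n' 'sweet_25 2 0 4' 'guy_guy 2 4 6' 'ging_ging 0 0 3' 'moat_2 0 1 0' | dedupe_first_field
```

For a large file the awk or sed answers will be much faster, since a shell read loop processes one line at a time.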

Iterating over a text file in bash and rounding each number

My file looks like this
0 0 1 0.2 1 1
1 1 0.8 0.1 1
0.2 0.4 1 0 1
And I need to create a new output file
0 0 1 0 1 1
1 1 1 0 1
0 0 1 0 1
i.e. if the number is greater than 0.5, it is rounded up to 1, and if it is less than 0.5, it is rounded down to 0 and put into a new file.
The file is quite large, with ~ 1400000000 values. I would quite like to write a bash script to do this.
I am guessing the best way to do this would be to iterate over each value in a for loop, with an if statement inside which tests whether the number is greater or less than 0.5 and then prints 0 or 1 dependent.
The pseudocode would look like this, but my bash isn't great, so before you tell me it isn't syntactically correct: I already know.
#!/bin/bash
#reads in each line
while read p; do
#loops through each number in each line
for i in p; do
#tests if each number is greater than or equal to 0.5 and prints accordingly
if [i => 0.5]
then
print 1
else
print 0
fi
done < test.txt >
I'm not really sure how to do this. Can anyone help? Thanks.
awk '{
for( i=1; i<=NF; i++ )
$i = $i<0.5 ? 0 : 1
}1' input_file > output_file
$i = $i<0.5 ? 0 : 1 changes each field to 0 or 1 and {...}1 will print the line with the changed values afterwards.
Another awk without loops (this needs GNU awk, because RT is a gawk extension):
$ awk -v RS='[ \n]' '{printf ($1>=0.5) RT}' file
0 0 1 0 1 1
1 1 1 0 1
0 0 1 0 1
if the values are not between 0 and 1, you may want to change to
$ awk -v RS='[ \n]' '{printf "%.0f%s", $1, RT}' file
Note that the default rounding is to even (i.e. 0.5 -> 0, but 1.5 -> 2). If you always want to round halves up:
$ awk -v RS='[ \n]' '{i=int($1); printf "%d%s", i+(($1-i)>=0.5), RT}' file
This should take care of non-negative numbers. For negatives, there are again two alternatives: round towards zero or towards negative infinity.
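As a hedged sketch of one way to cover negatives as well, the rule floor(x + 0.5) sends every half towards positive infinity, so -1.5 becomes -1 (the "towards zero" choice for negative halves) while 1.5 becomes 2. The version below avoids RT, so it should run in any POSIX awk. The function names round_half_up and floor are made up for the example.

```shell
# Hypothetical helper: round every field to the nearest integer, halves upward.
round_half_up() {
  awk '
    # int() truncates towards zero, so step down by one for negative non-integers.
    function floor(x,   t) { t = int(x); return (t > x) ? t - 1 : t }
    { for (i = 1; i <= NF; i++) $i = floor($i + 0.5) }
    1
  ' "$@"
}

printf '%s\n' '0.2 0.5 1.5 -0.5 -1.2 2' | round_half_up
```

Note that this treats exactly 0.5 as 1, which matches the first awk answer above.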
Here's one in Perl using regex and look-ahead:
$ perl -p -e 's/0(?=\.[6789])/1/g;s/\.[0-9]+//g' file
0 0 1 0 1 1
1 1 1 0 1
0 0 1 0 1
I went with the "if it is less than 0.5, it is rounded down to 0" part of the question.

Delete row if value in 3rd column is in another text file

I have a long text file (haplotypes.txt) that looks like this:
19 rs541392352 55101281 A 0 0 ...
19 rs546022921 55106773 C T 0 ...
19 rs531959574 31298342 T 0 0 ...
And a simple text file (positions.txt) that looks like this:
55103603
55106773
55107854
55112489
I would like to remove all the rows where the third field is present in positions.txt, to obtain the following output:
19 rs541392352 55101281 A 0 0 ...
19 rs531959574 31298342 T 0 0 ...
I hope someone can help.
With AWK:
awk 'NR == FNR{a[$0] = 1;next}!a[$3]' positions.txt haplotypes.txt
Breakdown:
NR == FNR { # If file is 'positions.txt'
a[$0] = 1 # Store line as key in associative array 'a'
next # Skip next blocks
}
!a[$3] # Print if third column is not in the array 'a'
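One hedged variation on the breakdown above: looking up a[$3] creates an empty array entry for every third field that is tested, whereas the in operator checks membership without creating anything, which may matter for memory on a long file. The function name drop_listed_rows and the array name skip are made up for this sketch.

```shell
# Hypothetical helper. Usage: drop_listed_rows positions.txt haplotypes.txt
drop_listed_rows() {
  awk '
    NR == FNR { skip[$1]; next }   # first file: remember each position as an array key
    !($3 in skip)                  # second file: print rows whose 3rd field is not a key
  ' "$1" "$2"
}
```

Like the original one-liner, this relies on NR == FNR being true only for the first file, so positions.txt must not be empty.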
This should work:
$ grep -vwFf positions.txt haplotypes.txt
19 rs541392352 55101281 A 0 0 ...
19 rs531959574 31298342 T 0 0 ...
-f positions.txt: read patterns from file
-v: invert matches
-w: match only complete words (avoid substring matches)
-F: fixed string matching (don't interpret patterns as regular expressions)
This expects that only the third column looks like a long number. If the pattern happens to match the exact same word in one of the columns that aren't shown, you can get false positives. To avoid that, you'd have to use an awk solution filtering by column (see andlrc's answer).

How to update matrix-like-data (.txt) file in bash programming?

I am new to bash programming. I have this file, whose contents are:
A B C D E
1 X 0 X 0 0
2 0 X X 0 0
3 0 0 0 0 0
4 X X X X X
Where X means the cell has a value and 0 means it is empty.
From there, let's say the user enters B3, which is a 0; that means I need to replace it with X. What is the best way to do it? I need to update this file constantly.
FYI: This is a homework question, so please don't give a direct code/answer. But any sample code (like how to use a given function, etc.) will be very much appreciated.
Regards,
Newbie Bash Scripter
EDIT:
If I am not wrong, Bash can call/update directly the specific column. Can it be done with row+column?
If you can use sed I'll throw out this tidbit:
sed -i "/^2/s/. /X /4" /path/to/matrix_file
Input
A B C D E
1 0 0 0 0 0
2 X 0 0 X 0
3 X X 0 0 X
Output
A B C D E
1 0 0 0 0 0
2 X 0 X X 0
3 X X 0 0 X
Explanation
^2: This restricts sed to only work on lines which begin with 2, i.e. the 2nd row
s: This is the replacement command
Note for the next two, '_' represents a white space
._: This is the pattern to match. The . is a regular expression that matches any character, thus ._ matches "any character followed by a space". Note that this could also be [0X]_ if you are guaranteed that the only two characters you can have are '0' and 'X'
X_: This is the replacement text. In this case we are replacing 'any character followed by a space' as described above with 'X followed by a space'
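4: This replaces only the 4th occurrence of the pattern text above, i.e. the 4th column including the row index.
What would be left for you to do is use variables in the place of ^2 and 4, such as ^$row and $col, and then map the letters A - E to 2 - 6 (the row index counts as the first occurrence, so A is the 2nd). Note that the last column has no space after it, so this pattern will not match it.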
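As a small sketch of the letter-to-number step only: printf with a leading quote prints the character code of a letter, and since the row index is the first match of the pattern, column A corresponds to occurrence 2. The helper name col_to_count and the sample rows are made up, and this assumes a single uppercase letter A to E.

```shell
# Hypothetical helper: convert a column letter to the occurrence count for sed.
col_to_count() {
  # printf '%d' "'A" prints 65, the character code of A, so A -> 2, B -> 3, ...
  echo $(( $(printf '%d' "'$1") - 63 ))
}

row=2
count=$(col_to_count C)
# No -i here, so the result goes to standard output instead of changing a file.
printf '%s\n' 'A B C D E' '1 0 0 0 0 0' '2 X 0 0 X 0' | sed "/^${row} /s/. /X /${count}"
```

The space after ${row} in the address keeps row 2 from also matching a row labelled 20 or 21.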
something to get you started
#!/bin/bash
# call it using ./script B 1
# make it executable using "chmod 755 script"
# read input parameters
col=$1
row=$2
# construct below splits on whitespace
while read -r -a line
do
for i in "${line[@]}"; do
array+=( "$i" )
done
done < m
# Now you have the matrix in a one-dimensional array that can be indexed.
# Let's print it
for i in "${array[@]}"; do
echo "$i"
done
Here's a starter for you using AWK:
awk -v col=B -v row=3 'BEGIN{getline; for (i=1;i<=NF;i++) cols[$i]=i+1} NR==row+1{print $cols[col]}'
The i+1 and row+1 account for the row heading and column heading, respectively.
