How to find column values and replace in bash - bash

I could do this easily in R with grepl and row indexing, but wanted to try this in shell. I have a text file that looks like what I have below. I would like to find rows where It matches TWGX and wherever it match, I would like to concatenate column 1 and column 2 separated by _ and make it column values for both column 1 and column 2.
text:
NIALOAD NIALOAD 0 0 2 1
NIALOAD NIALOAD 0 0 2 1
NIALOAD NIALOAD 0 0 1 1
TWGX-MAP 10064-8036056040 0 0 0 -9
TWGX-MAP 11570-8036056502 0 0 0 -9
TWGX-MAP 11680-8036055912 0 0 0 -9
This is the result I want:
NIALOAD NIALOAD 0 0 2 1
NIALOAD NIALOAD 0 0 2 1
NIALOAD NIALOAD 0 0 1 1
TWGX-MAP_10064-8036056040 TWGX-MAP_10064-8036056040 0 0 0 -9
TWGX-MAP_11570-8036056502 TWGX-MAP_11570-8036056502 0 0 0 -9
TWGX-MAP_11680-8036055912 TWGX-MAP_11680-8036055912 0 0 0 -9

The regex /TWGX/ selects the lines containing that string and applies the action that follows. The 1 is an awk shorthand that will print both the modified and unmodified lines.
$ awk 'BEGIN{FS=OFS="\t"} /TWGX/ {tmp = $1 "_" $2; $1 = $2 = tmp}1' file
NIALOAD NIALOAD 0 0 2 1
NIALOAD NIALOAD 0 0 2 1
NIALOAD NIALOAD 0 0 1 1
TWGX-MAP_10064-8036056040 TWGX-MAP_10064-8036056040 0 0 0 -9
TWGX-MAP_11570-8036056502 TWGX-MAP_11570-8036056502 0 0 0 -9
TWGX-MAP_11680-8036055912 TWGX-MAP_11680-8036055912 0 0 0 -9
BEGIN { FS = OFS = "\t" }
# Just once, before processing the file, set FS (file separator) and OFS (output file separator) to be the tab character
/TWGX/ {tmp = $1 "_" $2; $1 = $2 = tmp}
# For every line that contains a match for TWGX create a mashup of the first two columns, and assign it to each of columns 1 and 2. (Note that in awk string concatenation is done by simply putting expressions next to one another)
1
# This is an awk idiom that consists of the pattern 1, which is always true. By not explicitly specifying an action to go with that pattern, the default action of printing the whole line will be executed.

Related

Splitting a large file containing multiple molecules

I have a file that contains 10,000 molecules. Each molecule is ending with keyword $$$$. I want to split the main files into 10,000 separate files so that each file will have only 1 molecule. Each molecule have different number of lines. I have tried sed on test_file.txt as:
sed '/$$$$/q' test_file.txt > out.txt
input:
$ cat test_file.txt
ashu
vishu
jyoti
$$$$
Jatin
Vishal
Shivani
$$$$
output:
$ cat out.txt
ashu
vishu
jyoti
$$$$
I can loop it through whole main file to create 10,000 separate files but how to delete the last molecule that was just moved to new file from main file. Or please suggest if there is a better method for it, which I believe there is. Thanks.
Edit1:
$ cat short_library.sdf
untitled.cdx
csChFnd80/09142214492D
31 34 0 0 0 0 0 0 0 0999 V2000
8.4660 6.2927 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
8.4660 4.8927 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
1.2124 2.0951 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
2.4249 2.7951 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
1 2 2 0 0 0 0
2 3 1 0 0 0 0
30 31 1 0 0 0 0
31 26 1 0 0 0 0
M END
> <Mol_ID> (1)
1
> <Formula> (1)
C22H24ClFN4O3
> <URL> (1)
http://www.selleckchem.com/products/Gefitinib.html
$$$$
Dimesna.cdx
csChFnd80/09142214492D
16 13 0 0 0 0 0 0 0 0999 V2000
2.4249 1.4000 0.0000 S 0 0 0 0 0 0 0 0 0 0 0 0
3.6415 2.1024 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
4.8540 1.4024 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
5.4904 1.7512 0.0000 Na 0 3 0 0 0 0 0 0 0 0 0 0
1 2 1 0 0 0 0
2 3 1 0 0 0 0
1 14 2 0 0 0 0
M END
> <Mol_ID> (2)
2
> <Formula> (2)
C4H8Na2O6S4
> <URL> (2)
http://www.selleckchem.com/products/Dimesna.html
$$$$
Here's a simple solution with standard awk:
LANG=C awk '
{ mol = (mol == "" ? $0 : mol "\n" $0) }
/^\$\$\$\$\r?$/ {
outFile = "molecule" ++fn ".sdf"
print mol > outFile
close(outFile)
mol = ""
}
' input.sdf
If you have csplit from GNU coreutils:
csplit -s -z -n5 -fmolecule test_file.txt '/^$$$$$/+1' '{*}'
This will do the whole job directly in bash:
molsplit.sh
#!/bin/bash
filenum=0
end=1
while read -r line; do
if [[ $end -eq 1 ]]; then
end=0
filenum=$((filenum + 1))
exec 3>"molecule${filenum}.sdf"
fi
echo "$line" 1>&3
if [[ "$line" = '$$$$' ]]; then
end=1
exec 3>&-
fi
done
Input is read from stdin, though that would be easy enough to change. Something like this:
./molsplit.sh < test_file.txt
ADDENDUM
From subsequent commentary, it seems that the input file being processed has Windows line endings, whereas the processing environment's native line ending format is UNIX-style. In that case, if the line-termination style is to be preserved then we need to modify how the delimiters are recognized. For example, this variation on the above will recognize any line that starts with $$$$ as a molecule delimiter:
#!/bin/bash
filenum=0
end=1
while read -r line; do
if [[ $end -eq 1 ]]; then
end=0
filenum=$((filenum + 1))
exec 3>"molecule${filenum}.sdf"
fi
echo "$line" 1>&3
case $line in
'$$$$'*) end=1; exec 3>&-;;
esac
done
The same statement that sets the current output file name also closes the previous one. close(_)^_ here is same as close(_)^0, which ensures the filename always increments for the next one, even if the close() action resulted in an error.
— if the output file naming scheme allows for leading-edge zeros, then change that bit to close(_)^(_<_), which ALWAYS results in a 1, for any possible string or number, including all forms of zero, the empty string, inf-inities, and nans.
mawk2 'BEGIN { getline __<(_ = "/dev/null")
ORS = RS = "[$][$][$][$][\r]?"(FS = RS)
__*= gsub("[^$\n]+", __, ORS)
} NF {
print > (_ ="mol" (__+=close(_)^_) ".txt") }' test_file.txt
The first part about getline from /dev/null neither sets $0 | NF nor modifies NR | FNR, but it's existence ensures the first time close(_) is called it wouldn't error out.
gcat -n mol12345.txt
1 Shivani
2 jyoti
3 Shivani
4 $$$$
it was reasonably speedy - from 5.60 MB synthetic test file created 187,710 files in 11.652 secs.

Replace values of one column based on other column conditions in shell

I have a tab separated text file below. I want to match values in column 2 and replace the values in column 5. The condition is if there are X or Y in column 2, I want column 5 to have 1 just like in the result below.
1:935662:C:CA 1 0 935662 0
1:941119:A:G 2 0 941119 0
1:942934:G:C 3 0 942934 0
1:942951:C:T X 0 942951 0
1:943937:C:T X 0 943937 0
1:944858:A:G Y 0 944858 0
1:945010:C:A X 0 945010 0
1:946247:G:A 1 0 946247 0
result:
1:935662:C:CA 1 0 935662 0
1:941119:A:G 2 0 941119 0
1:942934:G:C 3 0 942934 0
1:942951:C:T X 0 942951 1
1:943937:C:T X 0 943937 1
1:944858:A:G Y 0 944858 1
1:945010:C:A X 0 945010 1
1:946247:G:A 1 0 946247 0
I tried awk -F'\t' '{ $5 = ($2 == X ? 1 : $2) } 1' OFS='\t' file.txt but I am not sure how to match both X and Y in one step.
With awk:
awk 'BEGIN{FS=OFS="\t"} $2=="X" || $2=="Y"{$5="1"}1' file
Output:
1:935662:C:CA 1 0 935662 0
1:941119:A:G 2 0 941119 0
1:942934:G:C 3 0 942934 0
1:942951:C:T X 0 942951 1
1:943937:C:T X 0 943937 1
1:944858:A:G Y 0 944858 1
1:945010:C:A X 0 945010 1
1:946247:G:A 1 0 946247 0
See: 8 Powerful Awk Built-in Variables – FS, OFS, RS, ORS, NR, NF, FILENAME, FNR
Assuming you want $5 to be zero (as opposed to remaining unchanged) if the condition is false:
$ awk 'BEGIN{FS=OFS="\t"} {$5=($2 ~ /^[XY]$/)} 1' file
1:935662:C:CA 1 0 935662 0
1:941119:A:G 2 0 941119 0
1:942934:G:C 3 0 942934 0
1:942951:C:T X 0 942951 1
1:943937:C:T X 0 943937 1
1:944858:A:G Y 0 944858 1
1:945010:C:A X 0 945010 1
1:946247:G:A 1 0 946247 0

Add Columns Values with Shell

I've an input file which looks as below.
pmx . pmnosysrelspeechneighbr -m 1 -r
INFO: The ROP files contain suspected faulty counter values.
They have been discarded but can be kept with pmr/pmx option "k" (pmrk/pmxk) or highlighted with pmx option "s" (pmxs)
Date: 2017-11-04
Object Counter 14:45 15:00 15:15 15:30
UtranCell=UE1069XA0 pmNoSysRelSpeechNeighbr 0 1 0 0
UtranCell=UE1069XA1 pmNoSysRelSpeechNeighbr 0 0 0 0
UtranCell=UE1069XA2 pmNoSysRelSpeechNeighbr 0 0 0 0
UtranCell=UE1069XA3 pmNoSysRelSpeechNeighbr 0 0 2 0
UtranCell=UE1069XB0 pmNoSysRelSpeechNeighbr 0 0 0 0
UtranCell=UE1069XB1 pmNoSysRelSpeechNeighbr 0 0 0 3
UtranCell=UE1069XB2 pmNoSysRelSpeechNeighbr 0 0 0 0
UtranCell=UE1069XB3 pmNoSysRelSpeechNeighbr 0 0 0 0
UtranCell=UE1069XC0 pmNoSysRelSpeechNeighbr 0 0 0 0
UtranCell=UE1069XC1 pmNoSysRelSpeechNeighbr 0 0 0 4
UtranCell=UE1069XC2 pmNoSysRelSpeechNeighbr 0 0 0 0
UtranCell=UE1069XC3 pmNoSysRelSpeechNeighbr 0 0 1 0
UtranCell=UE1164XA0 pmNoSysRelSpeechNeighbr 0 3 0 0
UtranCell=UE1164XA1 pmNoSysRelSpeechNeighbr 0 0 0 0
UtranCell=UE1164XA2 pmNoSysRelSpeechNeighbr 1 0 0 0
Now I want the output as below which is basically sum of the time column (from $3 to $6) values.
Counter 14:45 15:00 15:15 15:30
pmNoSysRelSpeechNeighbr 1 4 3 7
I've been trying with below command. But it's just giving sum of one column values:
pmx . pmnosysrelspeechneighbr -m 1 -r | grep - i ^Object| awk '{sum += $4} END {print $1 , sum}'
Try this out, you will get both header and trailer as sum of individual columns.
BEGIN {
trail="pmNoSysRelSpeechNeighbr";
}
{
if($1=="Object") print $2 OFS $3 OFS $4 OFS $5 OFS $6;
else if($1 ~ /^UtranCell/) {
w+=$3; x+=$4; y+=$5; z+=$6;
}
}
END {
print trail OFS w OFS x OFS y OFS z;
}
You need to sum each of the columns separately:
awk -v g=pmNoSysRelSpeechNeighbr '$0 ~ g { for(i=3;i<=6;i++) sum[i]+=$i }
END { printf g; for(i=3;i<=6;i++) printf OFS sum[i] }' file
but only for lines (records) containing the group (counter) of interest ($0~"pmNoSysRelSpeechNeighbr").
Note you (almost) never need to pipe grep's output to awk, because awk already supports extended regular expressions filtering with /regex/ { action }, or var ~ /regex/ { action }. One exception would be the need for PCRE (grep -P).
As an alternative to awk for simple "command-line statistical operations" on textual files, you could also use GNU datamash.
For example, to sum columns 3 to 6, but group by column 2:
grep 'UtranCell' file | datamash -W -g2 sum 3-6

Removing duplicate lines with different columns

I have a file which looks like follows:
ENSG00000197111:I12 0
ENSG00000197111:I12 1
ENSG00000197111:I13 0
ENSG00000197111:I18 0
ENSG00000197111:I2 0
ENSG00000197111:I3 0
ENSG00000197111:I4 0
ENSG00000197111:I5 0
ENSG00000197111:I5 1
I have some lines that are duplicated but I cannot remove by sort -u because the second column has different values for them (1 or 0). How do I remove such duplicates by keeping the lines with second column as 1 such that the file will be
ENSG00000197111:I12 1
ENSG00000197111:I13 0
ENSG00000197111:I18 0
ENSG00000197111:I2 0
ENSG00000197111:I3 0
ENSG00000197111:I4 0
ENSG00000197111:I5 1
you can use awk and or operator, if the order isn't mandatory
awk '{d[$1]=d[$1] || $2}END{for(k in d) print k, d[k]}' file
you get
ENSG00000197111:I2 0
ENSG00000197111:I3 0
ENSG00000197111:I4 0
ENSG00000197111:I5 1
ENSG00000197111:I12 1
ENSG00000197111:I13 0
ENSG00000197111:I18 0
Edit, only sort solution
You can use sort with a double pass, example
sort -k1,1 -k2,2r file | sort -u -k1,1
you get,
ENSG00000197111:I12 1
ENSG00000197111:I13 0
ENSG00000197111:I18 0
ENSG00000197111:I2 0
ENSG00000197111:I3 0
ENSG00000197111:I4 0
ENSG00000197111:I5 1

Bash exclude lines where proportion of columns contain matched value

I have a lage text file that I would like to filter by excluding lines that have a number of columns matching a certain character. I had previously removed lines where all columns from 2 onwards contained a 0 or a . like so:
awk '{
for (i=2; i<=NF; i++)
if ($i!~/^(\.|0)/) {
print
break
}
}'
but now I would like it so that I would print lines that had less than a specific number of columns with this value (".").
For example with data:
A B C D E
0 1 . 0 0
1 ./. 0 1 1
1 1 0 0 0
0 0 . . 0
. ./. . . .
and a match value of 2 I would expect the bottom two lines to be excluded so that the output would be:
A B C D E
0 1 . 0 0
1 ./. 0 1 1
1 1 0 0 0
Any ideas?
With awk:
$ awk '{c=0;for(i=1;i<NF;i++) c += ($i == ".")}c<2' file
A B C D E
0 1 . 0 0
1 ./. 0 1 1
1 1 0 0 0
Basically it iterates each column and add one to the counter if the column equals a period (.).
The c<2 part will only print the line if there is less than two columns with periods.
With sed one can use:
$ sed -r 'h;s/[^. ]+//g;s/\.\. *//g;/\. \./d;x' file
A B C D E
0 1 . 0 0
1 ./. 0 1 1
1 1 0 0 0
-r enables extended regular expressions (-E on *BSD).
Basically a copy of the pattern space is stored in the hold buffer, then all but spaces and periods is removed.
Now if the pattern space contains two separate periods it can be deleted if not the pattern space can be exchanged with the hold buffer.
$ awk '{delete a; for(i=1;i<=NF;i++) a[$i]++; if(a["."]>=2) next} 1' foo
A B C D E
0 1 . 0 0
1 ./. 0 1 1
1 1 0 0 0
It iterates all fields (for), counts field values and if 2 or more . in a record, restrains from printing (next). If you want to count the periods only from field 3 onward, change the start value of i in the for: for(i=3; ...).
$ cat ip.txt
A B C D E
0 1 . 0 0
1 ./. 0 1 1
1 1 0 0 0
0 0 . . 0
. ./. . . .
$ perl -ne '(#c)=/\.\/\.|\./g; print if $#c < 1' ip.txt
A B C D E
0 1 . 0 0
1 ./. 0 1 1
1 1 0 0 0
(#c)=/\.\/\.|\./g array of ./. or . matches from current line
$#c indicates index of last element, i.e (size of array - 1)
So, to ignore lines containing 3 elements like ./. or . use $#c < 2
Similar to #spasic's answer, but easier (for me) to read!
perl -ane 'print if (grep { /^\.$/} #F) < 2' file
A B C D E
0 1 . 0 0
1 ./. 0 1 1
1 1 0 0 0
The -a separates the space-separated fields into an array called #F for me. I then grep in the array #F looking for elements that consist of just a period - i.e. those that start with a period and end immediately after the period. That counts the lone periods in each line and I print the line if that number is less than 2.
Perhaps this is alright.
awk '$0 !~/\. \./' file
A B C D E
0 1 . 0 0
1 ./. 0 1 1
1 1 0 0 0

Resources