Splitting a large file containing multiple molecules - bash

I have a file that contains 10,000 molecules. Each molecule ends with the keyword $$$$. I want to split the main file into 10,000 separate files so that each file contains only one molecule. Each molecule has a different number of lines. I have tried sed on test_file.txt as:
sed '/$$$$/q' test_file.txt > out.txt
input:
$ cat test_file.txt
ashu
vishu
jyoti
$$$$
Jatin
Vishal
Shivani
$$$$
output:
$ cat out.txt
ashu
vishu
jyoti
$$$$
I can loop this through the whole main file to create 10,000 separate files, but how do I delete from the main file the last molecule that was just moved to a new file? Or please suggest a better method, which I believe exists. Thanks.
Edit1:
$ cat short_library.sdf
untitled.cdx
csChFnd80/09142214492D
31 34 0 0 0 0 0 0 0 0999 V2000
8.4660 6.2927 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
8.4660 4.8927 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
1.2124 2.0951 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
2.4249 2.7951 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
1 2 2 0 0 0 0
2 3 1 0 0 0 0
30 31 1 0 0 0 0
31 26 1 0 0 0 0
M END
> <Mol_ID> (1)
1
> <Formula> (1)
C22H24ClFN4O3
> <URL> (1)
http://www.selleckchem.com/products/Gefitinib.html
$$$$
Dimesna.cdx
csChFnd80/09142214492D
16 13 0 0 0 0 0 0 0 0999 V2000
2.4249 1.4000 0.0000 S 0 0 0 0 0 0 0 0 0 0 0 0
3.6415 2.1024 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
4.8540 1.4024 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
5.4904 1.7512 0.0000 Na 0 3 0 0 0 0 0 0 0 0 0 0
1 2 1 0 0 0 0
2 3 1 0 0 0 0
1 14 2 0 0 0 0
M END
> <Mol_ID> (2)
2
> <Formula> (2)
C4H8Na2O6S4
> <URL> (2)
http://www.selleckchem.com/products/Dimesna.html
$$$$

Here's a simple solution with standard awk:
LANG=C awk '
    { mol = (mol == "" ? $0 : mol "\n" $0) }
    /^\$\$\$\$\r?$/ {
        outFile = "molecule" ++fn ".sdf"
        print mol > outFile
        close(outFile)
        mol = ""
    }
' input.sdf
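For the two-molecule test_file.txt shown above, this should produce molecule1.sdf and molecule2.sdf; for instance (contents reconstructed from the sample):
$ cat molecule2.sdf
Jatin
Vishal
Shivani
$$$$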

If you have csplit from GNU coreutils:
csplit -s -z -n5 -fmolecule test_file.txt '/^\$\$\$\$/+1' '{*}'
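csplit numbers its pieces from zero and adds no extension, so with -n5 the files land in molecule00000, molecule00001, and so on; -z elides the empty piece that the trailing $$$$ would otherwise leave. If .sdf extensions are wanted, a hypothetical rename loop:
for f in molecule[0-9]*; do mv "$f" "$f.sdf"; done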

This will do the whole job directly in bash:
molsplit.sh
#!/bin/bash

filenum=0
end=1

while read -r line; do
    if [[ $end -eq 1 ]]; then
        end=0
        filenum=$((filenum + 1))
        exec 3>"molecule${filenum}.sdf"
    fi
    echo "$line" 1>&3
    if [[ "$line" = '$$$$' ]]; then
        end=1
        exec 3>&-
    fi
done
Input is read from stdin, though that would be easy enough to change. Something like this:
./molsplit.sh < test_file.txt
ADDENDUM
From subsequent commentary, it seems that the input file being processed has Windows line endings, whereas the processing environment's native line ending format is UNIX-style. In that case, if the line-termination style is to be preserved then we need to modify how the delimiters are recognized. For example, this variation on the above will recognize any line that starts with $$$$ as a molecule delimiter:
#!/bin/bash

filenum=0
end=1

while read -r line; do
    if [[ $end -eq 1 ]]; then
        end=0
        filenum=$((filenum + 1))
        exec 3>"molecule${filenum}.sdf"
    fi
    echo "$line" 1>&3
    case $line in
        '$$$$'*) end=1; exec 3>&-;;
    esac
done
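Alternatively, if the Windows line endings do not need to be preserved in the output, you could normalize them once up front with standard tr and keep the original script unchanged:
tr -d '\r' < test_file.txt | ./molsplit.sh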

The same statement that sets the current output file name also closes the previous one. close(_)^_ here is the same as close(_)^0, which ensures the file counter always increments for the next file, even if the close() call resulted in an error. If the output file naming scheme allows for leading zeros, change that bit to close(_)^(_<_), which ALWAYS results in 1, for any possible string or number, including all forms of zero, the empty string, infinities, and NaNs.
mawk2 'BEGIN { getline __ < (_ = "/dev/null")
               ORS = RS = "[$][$][$][$][\r]?" (FS = RS)
               __ *= gsub("[^$\n]+", __, ORS)
} NF {
    print > (_ = "mol" (__ += close(_)^_) ".txt") }' test_file.txt
The first part, the getline from /dev/null, neither sets $0/NF nor modifies NR/FNR, but its existence ensures that the first time close(_) is called it won't error out.
$ gcat -n mol12345.txt
1 Shivani
2 jyoti
3 Shivani
4 $$$$
It was reasonably speedy: from a 5.60 MB synthetic test file it created 187,710 files in 11.652 secs.

Related

Replace values of one column based on other column conditions in shell

I have a tab-separated text file below. I want to match values in column 2 and replace the values in column 5. The condition is: if column 2 contains X or Y, I want column 5 to be 1, just like in the result below.
1:935662:C:CA 1 0 935662 0
1:941119:A:G 2 0 941119 0
1:942934:G:C 3 0 942934 0
1:942951:C:T X 0 942951 0
1:943937:C:T X 0 943937 0
1:944858:A:G Y 0 944858 0
1:945010:C:A X 0 945010 0
1:946247:G:A 1 0 946247 0
result:
1:935662:C:CA 1 0 935662 0
1:941119:A:G 2 0 941119 0
1:942934:G:C 3 0 942934 0
1:942951:C:T X 0 942951 1
1:943937:C:T X 0 943937 1
1:944858:A:G Y 0 944858 1
1:945010:C:A X 0 945010 1
1:946247:G:A 1 0 946247 0
I tried awk -F'\t' '{ $5 = ($2 == X ? 1 : $2) } 1' OFS='\t' file.txt but I am not sure how to match both X and Y in one step.
With awk:
awk 'BEGIN{FS=OFS="\t"} $2=="X" || $2=="Y"{$5="1"}1' file
Output:
1:935662:C:CA 1 0 935662 0
1:941119:A:G 2 0 941119 0
1:942934:G:C 3 0 942934 0
1:942951:C:T X 0 942951 1
1:943937:C:T X 0 943937 1
1:944858:A:G Y 0 944858 1
1:945010:C:A X 0 945010 1
1:946247:G:A 1 0 946247 0
See: 8 Powerful Awk Built-in Variables – FS, OFS, RS, ORS, NR, NF, FILENAME, FNR
Assuming you want $5 to be zero (as opposed to remaining unchanged) if the condition is false:
$ awk 'BEGIN{FS=OFS="\t"} {$5=($2 ~ /^[XY]$/)} 1' file
1:935662:C:CA 1 0 935662 0
1:941119:A:G 2 0 941119 0
1:942934:G:C 3 0 942934 0
1:942951:C:T X 0 942951 1
1:943937:C:T X 0 943937 1
1:944858:A:G Y 0 944858 1
1:945010:C:A X 0 945010 1
1:946247:G:A 1 0 946247 0
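This works because an awk regex comparison evaluates to 1 (match) or 0 (no match), and that number is assigned directly to $5. A minimal demonstration:
$ echo 'X' | awk '{print ($1 ~ /^[XY]$/)}'
1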

How to find column values and replace in bash

I could do this easily in R with grepl and row indexing, but wanted to try it in shell. I have a text file that looks like what I have below. I would like to find the rows that match TWGX and, wherever they match, concatenate column 1 and column 2 separated by _ and make that the value of both column 1 and column 2.
text:
NIALOAD NIALOAD 0 0 2 1
NIALOAD NIALOAD 0 0 2 1
NIALOAD NIALOAD 0 0 1 1
TWGX-MAP 10064-8036056040 0 0 0 -9
TWGX-MAP 11570-8036056502 0 0 0 -9
TWGX-MAP 11680-8036055912 0 0 0 -9
This is the result I want:
NIALOAD NIALOAD 0 0 2 1
NIALOAD NIALOAD 0 0 2 1
NIALOAD NIALOAD 0 0 1 1
TWGX-MAP_10064-8036056040 TWGX-MAP_10064-8036056040 0 0 0 -9
TWGX-MAP_11570-8036056502 TWGX-MAP_11570-8036056502 0 0 0 -9
TWGX-MAP_11680-8036055912 TWGX-MAP_11680-8036055912 0 0 0 -9
The regex /TWGX/ selects the lines containing that string and applies the action that follows. The 1 is an awk shorthand that will print both the modified and unmodified lines.
$ awk 'BEGIN{FS=OFS="\t"} /TWGX/ {tmp = $1 "_" $2; $1 = $2 = tmp}1' file
NIALOAD NIALOAD 0 0 2 1
NIALOAD NIALOAD 0 0 2 1
NIALOAD NIALOAD 0 0 1 1
TWGX-MAP_10064-8036056040 TWGX-MAP_10064-8036056040 0 0 0 -9
TWGX-MAP_11570-8036056502 TWGX-MAP_11570-8036056502 0 0 0 -9
TWGX-MAP_11680-8036055912 TWGX-MAP_11680-8036055912 0 0 0 -9
BEGIN { FS = OFS = "\t" }
# Just once, before processing the file, set FS (input field separator) and OFS (output field separator) to the tab character
/TWGX/ {tmp = $1 "_" $2; $1 = $2 = tmp}
# For every line that contains a match for TWGX create a mashup of the first two columns, and assign it to each of columns 1 and 2. (Note that in awk string concatenation is done by simply putting expressions next to one another)
1
# This is an awk idiom that consists of the pattern 1, which is always true. By not explicitly specifying an action to go with that pattern, the default action of printing the whole line will be executed.
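To see that final idiom in isolation, a minimal demonstration:
$ printf 'a\nb\n' | awk '1'
a
b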

How to find sum of elements in column inside of a text file (Bash)

I have a log file with lots of unnecessary information. The only important part of that file is a table that describes some statistics. My goal is to have a script that will accept a column name as an argument and return the sum of all the elements in the specified column.
Example log file:
.........
Skipped....
........
WARNING: [AA[409]: Some bad thing happened.
--- TOOL_A: READING COMPLETED. CPU TIME = 0 REAL TIME = 2
--------------------------------------------------------------------------------
----- TOOL_A statistics -----
--------------------------------------------------------------------------------
NAME Attr1 Attr2 Attr3 Attr4 Attr5
--------------------------------------------------------------------------------
AAA 885 0 0 0 0
AAAA2 1 0 2 0 0
AAAA4 0 0 2 0 0
AAAA8 0 0 2 0 0
AAAA16 0 0 2 0 0
AAAA1 0 0 2 0 0
AAAA8 0 0 23 0 0
AAAAAAA4 0 0 18 0 0
AAAA2 0 0 14 0 0
AAAAAA2 0 0 21 0 0
AAAAA4 0 0 23 0 0
AAAAA1 0 0 47 0 0
AAAAAA1 2 0 26 0
NOTE: Some notes
......
Skipped ......
The expected usage: script.sh Attr1
Expected output:
888
I've tried to find something with sed/awk but failed to figure out a solution.
tldr;
$ cat myscript.sh
#!/bin/sh
logfile=${1}
attribute=${2}
field=$(grep -o "NAME.\+${attribute}" ${logfile} | wc -w)
sed -nre '/NAME/,/NOTE/{/NAME/d;/NOTE/d;s/\s+/\t/gp;}' ${logfile} | \
cut -f${field} | \
paste -sd+ | \
bc
$ ./myscript.sh mylog.log Attr3
182
Explanation:
assign command-line arguments ${1} and ${2} to the logfile and attribute variables, respectively.
with wc -w, count the number of words in the portion of the header line from NAME through ${attribute}; that count is the field index, assigned to field
with sed
suppress automatic printing (-n) and enable extended regular expressions (-r)
find lines between the NAME and NOTE lines, inclusive
delete the lines that match NAME and NOTE
translate each contiguous run of whitespace to a single tab and print the result
cut using the field index
paste all numbers as an infix summation
evaluate the infix summation via bc
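To see the last two stages in isolation, a minimal demonstration using the nonzero Attr1 values from the example log:
$ printf '885\n1\n2\n' | paste -sd+ | bc
888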
Quick and dirty (without any other spec)
awk -v CountCol=2 '/^[^[:blank:]]/ && NF == 6 { S += $( CountCol) } END{ print S + 0 }' YourFile
with column name
awk -v ColName='Attr1' '/^[[:blank:]]/ && NF == 6 { for(i=1;i<=NF;i++){ if ($i == ColName) CountCol = i } } /^[^[:blank:]]/ && NF == 6 && CountCol { S += $(CountCol) } END{ print S + 0 }' YourFile
You should add a header/trailer filter to avoid noisy lines (a flag would suit this perfectly), but lacking information about the structure needed to set such a flag, I use a simple field count (assuming text fields evaluate to 0, so they do not change the sum when taken into account).
$ awk -v col='Attr3' '/NAME/{for (i=1;i<=NF;i++) f[$i]=i} col in f{sum+=$(f[col]); if (!NF) {print sum+0; exit} }' file
182

Insert new lines with missing values in an array

I have data like below:
2016-07-25:06 5
2016-07-25:07 1
2016-07-25:08 1
2016-07-25:09 2
2016-07-25:10 1
2016-07-25:11 1
2016-07-25:13 9
2016-07-25:14 1
In the above, I want to display hours from 00 through 23, like below:
2016-07-25:00 0
2016-07-25:01 0
2016-07-25:02 0
2016-07-25:03 0
2016-07-25:04 0
2016-07-25:05 0
2016-07-25:06 5
2016-07-25:07 1
2016-07-25:08 1
2016-07-25:09 2
2016-07-25:10 1
2016-07-25:11 1
2016-07-25:12 0
2016-07-25:13 9
2016-07-25:14 1
2016-07-25:15 0
2016-07-25:16 0
2016-07-25:17 0
2016-07-25:18 0
2016-07-25:19 0
2016-07-25:20 0
2016-07-25:21 0
2016-07-25:22 0
2016-07-25:23 0
Could you please let me know how I can achieve this using awk?
Thank you!
Using awk you can do this:
awk -F '[:[:blank:]]+' '{for (;i<$2; i++) printf "%s:%02d\t0\n", $1, i; print; i++; s=$1}
END{for (;i<24; i++) printf "%s:%02d\t0\n", s, i}' file
2016-07-25:00 0
2016-07-25:01 0
2016-07-25:02 0
2016-07-25:03 0
2016-07-25:04 0
2016-07-25:05 0
2016-07-25:06 5
2016-07-25:07 1
2016-07-25:08 1
2016-07-25:09 2
2016-07-25:10 1
2016-07-25:11 1
2016-07-25:12 0
2016-07-25:13 9
2016-07-25:14 1
2016-07-25:15 0
2016-07-25:16 0
2016-07-25:17 0
2016-07-25:18 0
2016-07-25:19 0
2016-07-25:20 0
2016-07-25:21 0
2016-07-25:22 0
2016-07-25:23 0
$ cat tst.awk
BEGIN { FS="[:[:space:]]+" }
function prt() {
    if ( NR > 1 ) {
        for (i=0; i<=23; i++) {
            printf "%s:%02d%s%d\n", prev, i, OFS, val[prev,i]
        }
        delete val
    }
}
$1 != prev { prt() }
{ val[$1,$2+0]=$3; prev=$1 }
END { prt() }
$ awk -f tst.awk file
2016-07-25:00 0
2016-07-25:01 0
2016-07-25:02 0
2016-07-25:03 0
2016-07-25:04 0
2016-07-25:05 0
2016-07-25:06 5
2016-07-25:07 1
2016-07-25:08 1
2016-07-25:09 2
2016-07-25:10 1
2016-07-25:11 1
2016-07-25:12 0
2016-07-25:13 9
2016-07-25:14 1
2016-07-25:15 0
2016-07-25:16 0
2016-07-25:17 0
2016-07-25:18 0
2016-07-25:19 0
2016-07-25:20 0
2016-07-25:21 0
2016-07-25:22 0
2016-07-25:23 0
This uses more tools than just awk, but it might be helpful:
#!/bin/bash
date="2016-07-25"   # or a method to get the date you are interested in

# Generate all the zero lines
remaining=$(for i in 0{0..9} {10..23}; do echo "$date:$i 0"; done | grep -v "$(awk '{print $1}' datafile)")

# Add the original data and sort the lines
echo -e "$remaining\n$(cat datafile)" | sort -n

Replace blank fields with zeros in AWK

I wish to replace blank fields with zeros using awk, but when I use sed 's/ /0/' file I end up replacing all white space, when I only wish to fill in missing data. Using awk '{print NF}' file returns different field counts (i.e. 9 and 4) because of the empty fields.
input:
590073920 20120523 0 M $480746499 CM C 500081532 SP
501298333 0 M *BB
501666604 0 M *OO
90007162 7 M +178852
90007568 3 M +189182
output:
590073920 20120523 0 M $480746499 CM C 500081532 SP
501298333 0 0 M *BB 0 0 0 0
501666604 0 0 M *OO 0 0 0 0
90007162 0 7 M +178852 0 0 0 0
90007568 0 3 M +189182 0 0 0 0
Using GNU awk's FIELDWIDTHS feature for fixed-width processing:
$ awk '{for(i=1;i<=NF;i++)if($i~/^ *$/)$i=0}1' FIELDWIDTHS="11 9 5 2 16 3 2 11 2" file | column -t
590073920 20120523 0 M $480746499 CM C 500081532 SP
501298333 0 0 M *BB 0 0 0 0
501666604 0 0 M *OO 0 0 0 0
90007162 0 7 M +178852 0 0 0 0
90007568 0 3 M +189182 0 0 0 0
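As a minimal illustration of FIELDWIDTHS (GNU awk only): fields are taken by character count rather than by separator, so a field can consist entirely of blanks:
$ echo 'ab  cd' | gawk 'BEGIN{FIELDWIDTHS="2 2 2"} {print "[" $2 "]"}'
[  ]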
