Paste multiple files while excluding first column - bash

I have a directory with 100 files of the same format:
> S43.txt
Gene S43-A1 S43-A10 S43-A11 S43-A12
DDX11L1 0 0 0 0
WASH7P 0 0 0 0
C1orf86 0 15 0 1
> S44.txt
Gene S44-A1 S44-A10 S44-A11 S44-A12
DDX11L1 0 0 0 0
WASH7P 0 0 0 0
C1orf86 0 15 0 1
I want to make a giant table containing all the columns from all the files; however, when I do this:
paste S88.txt S89.txt | column -d '\t' >test.merge
Naturally, the file contains two 'Gene' columns.
How can I paste ALL the files in the directory at once?
How can I exclude the first column from all the files after the first one?
Thank you.

If you're using bash, you can use process substitution in paste:
paste S43.txt <(cut -d ' ' -f2- S44.txt) | column -t
Gene S43-A1 S43-A10 S43-A11 S43-A12 S44-A1 S44-A10 S44-A11 S44-A12
DDX11L1 0 0 0 0 0 0 0 0
WASH7P 0 0 0 0 0 0 0 0
C1orf86 0 15 0 1 0 15 0 1
The cut -d ' ' -f2- S44.txt part reads all but the first column of S44.txt (use -d$'\t' instead if your files are tab-delimited; cut defaults to tab when -d is omitted).
To do this for all the files matching S*.txt, use this snippet:
arr=(S*.txt)
file="${arr[0]}"
for f in "${arr[@]:1}"; do
    paste "$file" <(cut -d$'\t' -f2- "$f") > _file.tmp && mv _file.tmp file.tmp
    file=file.tmp
done
# Align the final output:
column -t file.tmp
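If you prefer a single paste invocation over the temporary-file loop, here is a sketch (assuming tab-delimited files whose names contain no spaces or shell metacharacters): build one paste command with a process substitution per additional file, then expand it with eval.
files=(S*.txt)
cmd="paste ${files[0]}"
for f in "${files[@]:1}"; do
    cmd+=" <(cut -f2- $f)"   # drop the Gene column from every file after the first
done
eval "$cmd" | column -t > merged.txt
eval is needed because process substitutions stored in a string are not expanded until the string is evaluated as a command.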

Use join with the --nocheck-order option:
join --nocheck-order S43.txt S44.txt | column -t
(the column -t command is just there to make it pretty)
However, since you say you want to join all the files and join only takes two at a time, you should be able to do this (assuming your shell is bash):
tmp=$(mktemp)
files=(*.txt)
cp "${files[0]}" result.file
for file in "${files[@]:1}"; do
    join --nocheck-order result.file "$file" | column -t > "$tmp" && mv "$tmp" result.file
done


Splitting a large file containing multiple molecules

I have a file that contains 10,000 molecules. Each molecule ends with the keyword $$$$. I want to split the main file into 10,000 separate files so that each file will have only 1 molecule. Each molecule has a different number of lines. I have tried sed on test_file.txt as:
sed '/$$$$/q' test_file.txt > out.txt
input:
$ cat test_file.txt
ashu
vishu
jyoti
$$$$
Jatin
Vishal
Shivani
$$$$
output:
$ cat out.txt
ashu
vishu
jyoti
$$$$
I can loop through the whole main file to create 10,000 separate files, but how do I delete from the main file the last molecule that was just moved to a new file? Or please suggest a better method for it, which I believe there is. Thanks.
Edit1:
$ cat short_library.sdf
untitled.cdx
csChFnd80/09142214492D
31 34 0 0 0 0 0 0 0 0999 V2000
8.4660 6.2927 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
8.4660 4.8927 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
1.2124 2.0951 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
2.4249 2.7951 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
1 2 2 0 0 0 0
2 3 1 0 0 0 0
30 31 1 0 0 0 0
31 26 1 0 0 0 0
M END
> <Mol_ID> (1)
1
> <Formula> (1)
C22H24ClFN4O3
> <URL> (1)
http://www.selleckchem.com/products/Gefitinib.html
$$$$
Dimesna.cdx
csChFnd80/09142214492D
16 13 0 0 0 0 0 0 0 0999 V2000
2.4249 1.4000 0.0000 S 0 0 0 0 0 0 0 0 0 0 0 0
3.6415 2.1024 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
4.8540 1.4024 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
5.4904 1.7512 0.0000 Na 0 3 0 0 0 0 0 0 0 0 0 0
1 2 1 0 0 0 0
2 3 1 0 0 0 0
1 14 2 0 0 0 0
M END
> <Mol_ID> (2)
2
> <Formula> (2)
C4H8Na2O6S4
> <URL> (2)
http://www.selleckchem.com/products/Dimesna.html
$$$$
Here's a simple solution with standard awk:
LANG=C awk '
{ mol = (mol == "" ? $0 : mol "\n" $0) }
/^\$\$\$\$\r?$/ {
    outFile = "molecule" ++fn ".sdf"
    print mol > outFile
    close(outFile)
    mol = ""
}
' input.sdf
If you have csplit from GNU coreutils:
csplit -s -z -n5 -fmolecule test_file.txt '/^$$$$$/+1' '{*}'
This will do the whole job directly in bash:
molsplit.sh
#!/bin/bash
filenum=0
end=1
while read -r line; do
    if [[ $end -eq 1 ]]; then
        end=0
        filenum=$((filenum + 1))
        exec 3>"molecule${filenum}.sdf"
    fi
    echo "$line" 1>&3
    if [[ "$line" = '$$$$' ]]; then
        end=1
        exec 3>&-
    fi
done
Input is read from stdin, though that would be easy enough to change. Something like this:
./molsplit.sh < test_file.txt
ADDENDUM
From subsequent commentary, it seems that the input file being processed has Windows line endings, whereas the processing environment's native line ending format is UNIX-style. In that case, if the line-termination style is to be preserved then we need to modify how the delimiters are recognized. For example, this variation on the above will recognize any line that starts with $$$$ as a molecule delimiter:
#!/bin/bash
filenum=0
end=1
while read -r line; do
    if [[ $end -eq 1 ]]; then
        end=0
        filenum=$((filenum + 1))
        exec 3>"molecule${filenum}.sdf"
    fi
    echo "$line" 1>&3
    case $line in
        '$$$$'*) end=1; exec 3>&-;;
    esac
done
The same statement that sets the current output file name also closes the previous one. close(_)^_ here is the same as close(_)^0, which ensures the filename always increments for the next one, even if the close() action resulted in an error.
If the output file naming scheme allows for leading zeros, then change that bit to close(_)^(_<_), which always results in 1, for any possible string or number, including all forms of zero, the empty string, infinities, and NaNs.
mawk2 'BEGIN { getline __<(_ = "/dev/null")
ORS = RS = "[$][$][$][$][\r]?"(FS = RS)
__*= gsub("[^$\n]+", __, ORS)
} NF {
print > (_ ="mol" (__+=close(_)^_) ".txt") }' test_file.txt
The first part about getline from /dev/null neither sets $0 or NF nor modifies NR or FNR, but its existence ensures that the first time close(_) is called it won't error out.
gcat -n mol12345.txt
1 Shivani
2 jyoti
3 Shivani
4 $$$$
It was reasonably speedy: from a 5.60 MB synthetic test file it created 187,710 files in 11.652 secs.

How do I remove duplicated by position SNPs using PLink?

I am working with PLINK to analyse SNP chip data.
Does anyone know how to remove duplicated SNPs (duplicated by position)?
If we already have files in PLINK format, then we should have a .bim for binary PLINK files or a .map for text PLINK files. In either case the positions are in the 3rd column and the SNP names in the 2nd column.
We need to create a list of SNPs that are duplicated:
sort -k3n myFile.map | uniq -f2 -D | cut -f2 > dupeSNP.txt
Then run plink with --exclude flag:
plink --file myFile --exclude dupeSNP.txt --out myFileSubset
You can also do it directly in PLINK 1.9 using the --list-duplicate-vars flag
together with the require-same-ref, ids-only, or suppress-first modifiers, depending on what you want to do.
Check https://www.cog-genomics.org/plink/1.9/data#list_duplicate_vars for more details.
If you want to delete all occurrences of a variant with duplicates, you will have to use the --exclude flag on the output file of --list-duplicate-vars,
which should have a .dupvar extension.
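For example, a sketch of that two-step approach, assuming a binary fileset named myFile (.bed/.bim/.fam); the ids-only modifier keeps only variant IDs in the report so it can be fed straight to --exclude, and the output names are illustrative:
plink --bfile myFile --list-duplicate-vars ids-only --out dups   # writes dups.dupvar (IDs only)
plink --bfile myFile --exclude dups.dupvar --make-bed --out myFile_noDups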
I should caution that the two answers given below yield different results. This is because the sort | uniq method only takes into account SNP and bp location, whereas the PLINK method (--list-duplicate-vars) takes into account A1 and A2 as well.
Similar to sort | uniq on the .map file, we could use AWK on a .gen file that looks like this:
22 rs1 12 A G 1 0 0 1 0 0
22 rs1 12 G A 0 1 0 0 0 1
22 rs2 16 C A 1 0 0 0 1 0
22 rs2 16 C G 0 0 1 1 0 0
22 rs3 17 T CTA 0 0 1 0 1 0
22 rs3 17 CTA T 1 0 0 0 0 1
# Get list of duplicate rsXYZ ID's
awk -F' ' '{print $2}' chr22.gen |\
sort |\
uniq -d > chr22_rsid_duplicates.txt
# Get list of duplicated bp positions
awk -F' ' '{print $3}' chr22.gen |\
sort |\
uniq -d > chr22_location_duplicates.txt
# Now match this list of bp positions to gen file to get the rsid for these locations
awk 'NR==FNR{a[$1]=$2;next}$3 in a{print $2}' \
chr22_location_duplicates.txt \
chr22.gen |\
sort |\
uniq \
> chr22_rsidBylocation_duplicates.txt
cat chr22_rsid_duplicates.txt \
chr22_rsidBylocation_duplicates.txt \
> tmp
# Get list of duplicates (by location and/or rsid)
cat tmp | sort | uniq > chr22_duplicates.txt
plink --gen chr22.gen \
--sample chr22.sample \
--exclude chr22_duplicates.txt \
--recode oxford \
--out chr22_noDups
This will classify rs2 as a duplicate; however, for the PLINK list-duplicate-vars method rs2 will not be flagged as a duplicate.
If one wants to obtain the same results using PLINK (a non-trivial task for BGEN file formats, since awk, sed etc. do not work on binary files!), you can use the --rm-dup command from PLINK 2.0. The list of all duplicate SNPs removed can be logged (to a file ending in .rmdup.list) using the list parameter, like so:
plink2 --bgen chr22.bgen \
--sample chr22.sample \
--rm-dup exclude-all list \
--export bgen-1.1 \
--out chr22_noDups
Note: I'm saving the output as version 1.1 since plink1.9 still has commands not available in plink version 2.0. Therefore the only way to use bgen files with plink1.9 (at this time) is with the older 1.1 version.

How to find sum of elements in column inside of a text file (Bash)

I have a log file with lots of unnecessary information. The only important part of that file is a table which describes some statistics. My goal is to have a script which will accept a column name as argument and return the sum of all the elements in the specified column.
Example log file:
.........
Skipped....
........
WARNING: [AA[409]: Some bad thing happened.
--- TOOL_A: READING COMPLETED. CPU TIME = 0 REAL TIME = 2
--------------------------------------------------------------------------------
----- TOOL_A statistics -----
--------------------------------------------------------------------------------
NAME Attr1 Attr2 Attr3 Attr4 Attr5
--------------------------------------------------------------------------------
AAA 885 0 0 0 0
AAAA2 1 0 2 0 0
AAAA4 0 0 2 0 0
AAAA8 0 0 2 0 0
AAAA16 0 0 2 0 0
AAAA1 0 0 2 0 0
AAAA8 0 0 23 0 0
AAAAAAA4 0 0 18 0 0
AAAA2 0 0 14 0 0
AAAAAA2 0 0 21 0 0
AAAAA4 0 0 23 0 0
AAAAA1 0 0 47 0 0
AAAAAA1 2 0 26 0 0
NOTE: Some notes
......
Skipped ......
The expected usage is: script.sh Attr1
Expected output:
888
I've tried to find something with sed/awk but failed to figure out a solution.
tldr;
$ cat myscript.sh
#!/bin/sh
logfile=${1}
attribute=${2}
field=$(grep -o "NAME.\+${attribute}" ${logfile} | wc -w)
sed -nre '/NAME/,/NOTE/{/NAME/d;/NOTE/d;s/\s+/\t/gp;}' ${logfile} | \
cut -f${field} | \
paste -sd+ | \
bc
$ ./myscript.sh mylog.log Attr3
182
Explanation:
assign command-line arguments ${1} and ${2} to the logfile and attribute variables, respectively.
with wc -w, count the number of words within the line that contains both NAME and ${attribute} (this count is the field index) and assign it to field
with sed
suppress automatic printing (-n) and enable extended regular expressions (-r)
find lines between the NAME and NOTE lines, inclusive
delete the lines that match NAME and NOTE
translate each contiguous run of whitespace to a single tab and print the result
cut using the field index
paste all numbers as an infix summation
evaluate the infix summation via bc
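For instance, the last two stages turn the extracted column into an infix sum that bc evaluates (a quick sketch using the non-zero Attr1 values from the sample table):
printf '885\n1\n2\n' | paste -sd+ - | bc
888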
Quick and dirty (without any other spec)
awk -v CountCol=2 '/^[^[:blank:]]/ && NF == 6 { S += $( CountCol) } END{ print S + 0 }' YourFile
with column name:
awk -v ColName='Attr1' '
   /^[[:blank:]]/ && NF == 6 { for (i = 1; i <= NF; i++) if ($i == ColName) CountCol = i }
   /^[^[:blank:]]/ && NF == 6 && CountCol { S += $(CountCol) }
   END { print S + 0 }
   ' YourFile
You should add a header/trailer filter to avoid noisy lines (a flag would suit this perfectly), but lacking info about the structure to set such a flag, I use a simple field count (assuming text fields have 0 as value, so they don't change the sum when taken into account).
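For example, a sketch of such a flag-based filter, keyed on the NAME header and NOTE trailer shown in the sample log (column name and file name as in the examples above):
awk -v ColName='Attr1' '
   /^NAME/ { for (i = 1; i <= NF; i++) if ($i == ColName) CountCol = i
             inTable = 1; next }                        # header: locate the column, raise the flag
   /^NOTE/ { inTable = 0 }                              # trailer: drop the flag
   inTable && CountCol && $CountCol ~ /^[0-9]+$/ { S += $CountCol }
   END { print S + 0 }
   ' YourFile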
$ awk -v col='Attr3' '/NAME/{for (i=1;i<=NF;i++) f[$i]=i} col in f{sum+=$(f[col]); if (!NF) {print sum+0; exit} }' file
182

Removing duplicate lines with different columns

I have a file which looks as follows:
ENSG00000197111:I12 0
ENSG00000197111:I12 1
ENSG00000197111:I13 0
ENSG00000197111:I18 0
ENSG00000197111:I2 0
ENSG00000197111:I3 0
ENSG00000197111:I4 0
ENSG00000197111:I5 0
ENSG00000197111:I5 1
I have some lines that are duplicated, but I cannot remove them with sort -u because the second column has different values (1 or 0). How do I remove such duplicates, keeping the lines whose second column is 1, so that the file becomes:
ENSG00000197111:I12 1
ENSG00000197111:I13 0
ENSG00000197111:I18 0
ENSG00000197111:I2 0
ENSG00000197111:I3 0
ENSG00000197111:I4 0
ENSG00000197111:I5 1
You can use awk and the OR operator, if the order isn't mandatory:
awk '{d[$1]=d[$1] || $2}END{for(k in d) print k, d[k]}' file
you get
ENSG00000197111:I2 0
ENSG00000197111:I3 0
ENSG00000197111:I4 0
ENSG00000197111:I5 1
ENSG00000197111:I12 1
ENSG00000197111:I13 0
ENSG00000197111:I18 0
Edit: a sort-only solution.
You can use sort with a double pass, for example:
sort -k1,1 -k2,2r file | sort -u -k1,1
you get,
ENSG00000197111:I12 1
ENSG00000197111:I13 0
ENSG00000197111:I18 0
ENSG00000197111:I2 0
ENSG00000197111:I3 0
ENSG00000197111:I4 0
ENSG00000197111:I5 1
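If you need to keep the original order of first appearance, a two-pass variant of the first awk idea could also work (a sketch; the file is read twice):
awk 'NR == FNR { d[$1] = d[$1] || $2; next }
     !seen[$1]++ { print $1, d[$1] }' file file
The first pass ORs the second column per key; the second pass prints each key once, in its original order.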

Format output in bash

I have output in bash that I would like to format. Right now my output looks like this:
1scom.net 1
1stservicemortgage.com 1
263.net 1
263.sina.com 1
2sahm.org 1
abac.com 1
abbotsleigh.nsw.edu.au 1
abc.mre.gov.br 1
ableland.freeserve.co.uk 1
academicplanet.com 1
access-k12.org 1
acconnect.com 1
acconnect.com 1
accountingbureau.co.uk 1
acm.org 1
acsalaska.net 1
adam.com.au 1
ada.state.oh.us 1
adelphia.net 1
adelphia.net 1
adelphia.net 1
adelphia.net 1
adelphia.net 1
adelphia.net 1
adelphia.net 1
adelphia.net 1
adelphia.net 1
adelphia.net 1
adelphia.net 1
adelphia.net 1
aecom.yu.edu 1
aecon.com 1
aetna.com 1
agedwards.com 1
ahml.info 1
The problem with this is none of the numbers on the right line up. I would like them to look like this:
1scom.net 1
1stservicemortgage.com 1
263.net 1
263.sina.com 1
2sahm.org 1
Would there be any way to make them look like this without knowing exactly how long the longest domain is? Any help would be greatly appreciated!
The code that outputted this is:
grep -E -o -r "\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,6}\b" $ARCHIVE | sed 's/.*@//' | uniq -ci | sort | sed 's/^ *//g' | awk ' { t = $1; $1 = $2; $2 = t; print; } ' > temp2
ALIGNMENT:
Just use cat with the column command and that's it:
cat /path/to/your/file | column -t
For more details on the column command, refer to http://manpages.ubuntu.com/manpages/natty/man1/column.1.html
EDITED:
View file in terminal:
column -t < /path/to/your/file
(as noted by anishsane)
Export to a file:
column -t < /path/to/your/file > /output/file
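If column is not available, here is a sketch of a two-pass awk alternative that measures the longest name and pads it with printf (temp2 is the output file from the question's pipeline):
awk 'NR == FNR { if (length($1) > w) w = length($1); next }
     { printf "%-" w "s %s\n", $1, $2 }' temp2 temp2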
