Cannot sort VCF with bcftools due to invalid input

I am trying to compress & index a VCF file and am facing several issues.
When I use bgzip/tabix, it throws an error saying it cannot be indexed due to some unsorted values.
# code used to bgzip and tabix
bgzip -c fn.vcf > fn.vcf.gz
tabix -p vcf fn.vcf.gz
# below is the error returned
[E::hts_idx_push] Unsorted positions on sequence #1: 115352924 followed by 115352606
tbx_index_build failed: fn.vcf.gz
When I use bcftools sort to sort this VCF to tackle #1, it throws an error due to invalid entries.
# code used to sort
bcftools sort -O z --output-file fn.vcf.gz fn.vcf
# below is the error returned
Writing to /tmp/bcftools-sort.YSrhjT
[W::vcf_parse_format] Extreme FORMAT/AD value encountered and set to missing at chr12:115350908
[E::vcf_parse_format] Invalid character '\x0F' in 'GT' FORMAT field at chr12:115352482
Error encountered while parsing the input
Cleaning
I've tried sorting with Linux commands to get around #2. However, when I run the code below, the size of fout.vcf is almost half that of fin.vcf, indicating something might be going wrong.
grep "^#" fin.vcf > fout.vcf
grep -v "^#" fin.vcf | sort -k1,1V -k2,2n >> fout.vcf
Please let me know if you have any advice regarding:
How I could sort/fix the problematic entries in my VCF in a safe and feasible way. (The file is 340G, so I cannot simply open it and edit by hand.)
Why my Linux sort might be behaving oddly (i.e. returning a file much smaller than the original).
Any comments or suggestions are appreciated!

Try this:
mkdir tmp        ## 1. create a tmp folder in your working directory
tmp=/yourpath/   ## 2. or assign an existing folder to a variable and pass it to -T instead
bcftools sort file.vcf -T ./tmp -Oz -o file.vcf.gz
You can index the file after sorting:
bcftools index file.vcf.gz
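If bcftools sort still aborts on the corrupted record (the '\x0F' byte in the GT field), one possible workaround - a sketch, assuming the bad bytes are confined to a handful of data lines and that GNU grep with -P support is available - is to strip lines containing control characters before sorting. Note that this drops the affected variant records, so check how many lines it removes before trusting the output:
# count candidate bad records (control characters other than tab/newline)
LC_ALL=C grep -acP '[\x00-\x08\x0B\x0C\x0E-\x1F]' fn.vcf
# write a cleaned copy without those lines, then sort and index it
LC_ALL=C grep -avP '[\x00-\x08\x0B\x0C\x0E-\x1F]' fn.vcf > fn.clean.vcf
bcftools sort fn.clean.vcf -T ./tmp -Oz -o fn.clean.vcf.gz
bcftools index fn.clean.vcf.gz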

Related

Is it possible to grep using an array as pattern?

TL;DR
How to filter an ls/find output using grep
with an array as a pattern?
Background story:
I have a pipeline which I have to rerun for datasets that ran into an error.
Which datasets ran into an error is saved in a tab-separated file.
I want to delete the files where the pipeline ran into an error.
To do so, I extracted the dataset names from another file containing the finished datasets and saved them in a bash array {ds1 ds2 ...}. Now I am stuck because I cannot figure out how to exclude the datasets in the array from my deletion step.
This is the folder structure (X=1-30):
datasets/dsX/results/dsX.tsv
Not excluding the finished datasets, i.e. deleting the folders of both the failed and the finished datasets, works like a charm:
#1. move content to a trash folder
ls /datasets/*/results/*|xargs -I '{}' mv '{}' ./trash/
#2. delete the empty folders
find /datasets/*/. -type d -empty -delete
But since I want to exclude the finished datasets I thought it would be clever to save them in an array:
#find finished datasets by extracting the dataset names from a tab separated log file
mapfile -t -s 1 finished < <(awk '{print $2}' "$path/$log_pf")
echo "${finished[@]}"
which works as expected but now I am stuck in filtering the ls output using that array:
*pseudocode
#trying to ignore the dataset in the array - not working
ls -I${finished[@]} -d /datasets/*/
#trying to reverse grep for the finished datasets - not working
ls /datasets/*/ | grep -v {finished}
What do you think about my current ideas?
Is this possible using bash only? I guess I could do that easily in Python,
but for training purposes I want to do it in bash.
grep can get the patterns from a file using the -f option. Note that file names containing newlines will cause problems.
If you need to process the input somehow, you can use process substitution:
grep -f <(process the input...)
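Applied to the array from the question, a sketch (assuming the entries of finished are bare dataset names such as ds1, ds2, ..., and that no name is a substring of another) could be:
# list dataset directories, drop the ones whose name appears in the finished array
ls -d /datasets/*/ | grep -vF -f <(printf '%s\n' "${finished[@]}")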
I must admit I'm confused about what you're doing, but if you're just trying to produce a list of files excluding those stored in column 2 of some other file, and your file/directory names can't contain spaces, then that'd be:
find /datasets -type f | awk 'NR==FNR{a[$2]; next} !($0 in a)' "$path/$log_pf" -
If that's not all you need then please edit your question to clarify your requirements and add concise testable sample input and expected output.

Most efficient way to add non-unique data to unique data in bash

I have a massive file with each line being unique. I have a collection of smaller files (but still relatively large) where the lines are not unique. This collection is constantly growing. I need to add the small files into the big file and make sure there are no duplicates in the big file. Right now what I do is add all the files into one, and then run sort -u on it. However this ends up rescanning the entire big file which takes longer and longer as more files come in, and seems inefficient. Is there a better way to do this?
If the big file is already sorted, it would be more efficient to sort -u only the smaller files, and then sort -u -m (merge) the result with the big file. -m assumes the inputs are already individually sorted.
Example (untested):
#!/bin/bash
# Merges unique lines in the files passed as arguments into BIGFILE.
BIGFILE=bigfile.txt
TMPFILE=$(mktemp)
trap "rm $TMPFILE" EXIT
sort -u "$#" > "$TMPFILE"
sort -um "$TMPFILE" "$BIGFILE" -o "$BIGFILE"
This answer explains why -o is necessary.
If you like process substitutions you can even do it in a one-liner:
sort -um <(sort -u "$@") "$BIGFILE" -o "$BIGFILE"
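For example, if the script is saved as merge.sh (a hypothetical name) and two new batch files arrive:
./merge.sh new-batch-1.txt new-batch-2.txt
Only the new files get a full sort; the big file is merged in a single sequential pass.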

How to get frequency counts of unique values in a list using UNIX?

I have a file that has a couple thousand domain names in a list. I easily generated a list of just the unique names using the uniq command. Now, I want to go through and find how many times each of the items in the uniques list appears in the original, non-unique list. I thought this should be pretty easy to do with this loop, but I'm running into trouble:
for name in 'cat uniques.list'; do grep -c $name original.list; done > output.file
For some reason, it's spitting out a result that shows some count of something (honestly not sure what) for the uniques file and the original file.
I feel like I'm overlooking something really simple here. Any help is appreciated.
Thanks!
Simply use uniq -c on your file:
-c, --count
prefix lines by the number of occurrences
The command to get the final output:
sort original.list | uniq -c
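For example, with a hypothetical original.list containing a few repeated domains:
printf '%s\n' a.com b.com a.com c.com a.com b.com > original.list
sort original.list | uniq -c
      3 a.com
      2 b.com
      1 c.com
Note that uniq -c only counts adjacent duplicate lines, which is why the input is sorted first.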

Unix script is sorting the input

I am having some trouble here with my homework assignment. Maybe you guys can advise what to read or what commands I can use in order to create the following:
Create a shell script test that will act as follows:
The script will display the following message on the terminal screen:
Enter file names (wild cards OK)
The script will read the list of names.
For each file on the list that is a proper file, display a table giving the ten most frequently used words in the file, sorted with the most frequent first. Include the count.
Repeat steps 1-3 over and over until the user indicates end-of-file. This is done by entering the single character Ctrl-d as a file name.
Here is what I have so far:
#!/bin/bash
echo 'Enter file names (wild cards OK)'
read input_source
if test -f "$input_source"
then
I usually ignore homework questions that don't show some progress and effort to learn something - but you're so beautifully cheeky that I'll make an exception.
Here is what you want:
while read -ep 'Files?> ' files
do
for file in $files
do
echo "== word counts for the $file =="
# split into one word per line, count how often each word occurs,
# then sort by count (most frequent first) and keep the top ten
tr -cs '[:alnum:]' '\n' < "$file" | sort | uniq -c | sort -nr | head
done
done
And now, at least try to understand what the above is doing...
PS: voting to close...
How to find the ten most frequently used words in a file
Assumptions:
The files given have one word per line.
The files are not huge, so efficiency isn't a primary concern.
You can use sort and uniq -c to count the occurrences of each value in a file, then a reverse numeric sort to put the most frequent first, and head to cut off all but the top ten.
sort "$afile" | uniq -c | sort -rn | head
Some tips:
you have access to the complete bash manual: it's daunting at first, but it's an invaluable reference: http://www.gnu.org/software/bash/manual/bashref.html
You can get help about bash builtins at the command line: try help read
the read command can handle printing the prompt with the -p option (see previous tip)
you'll accomplish the last step with a while loop:
while read -p "the prompt" filenames; do
# ...
done

Why is sort -k not working all the time?

I have now a script that puts a list of files in two separate arrays:
First, I get a file list from a ZIP file and fill FIRST_Array() with it. Second, I get a file list from a control file within the ZIP file and fill SECOND_Array() with it.
while read length date time filename
do
FIRST_Array+=( "$filename" )
echo "$filename" >> FIRST.report.out
done < <(/usr/bin/unzip -qql AAA.ZIP |sort -g -k12 -t~)
Third, I compare both array like so:
diff -q <(printf "%s\n" "${FIRST_Array[@]}") <(printf "%s\n" "${SECOND_Array[@]}") | wc -l
I can tell why diff fails because I output each array to a file: FIRST.report.out and SECOND.report.out are simply not sorted the same way.
1) FIRST.report.out (what's inside the ZIP file)
JGS-Memphis~AT1~Pre-Test~X-BanhT~JGMDTV387~6~P~1100~HR24-500~033072053326~20120808~240914.XML
JGS-Memphis~PRE~DTV_PREP~X-GuinE~JGMDTV069~6~P~1100~H24-700~033081107519~20120808~240914.XML
JGS-Memphis~PRE~DTV_PREP~X-MooreBe~JGM98745~40~P~1100~H21-200~029264526103~20120808~240914.XML
JGS-Memphis~FUN~Pre-Test~X-RossA~jgmdtv168~2~P~1100~H21-200~029415655926~20120808~240914.XML
2) SECOND.report.out (what's inside the ZIP's control file)
JGS-Memphis~AT1~Pre-Test~X-BanhT~JGMDTV387~6~P~1100~HR24-500~033072053326~20120808~240914.XML
JGS-Memphis~FUN~Pre-Test~X-RossA~jgmdtv168~2~P~1100~H21-200~029415655926~20120808~240914.XML
JGS-Memphis~PRE~DTV_PREP~X-GuinE~JGMDTV069~6~P~1100~H24-700~033081107519~20120808~240914.XML
JGS-Memphis~PRE~DTV_PREP~X-MooreBe~JGM98745~40~P~1100~H21-200~029264526103~20120808~240914.XML
Using sort -k12 -t~ made sense since ~ is the delimiter for the file's date field (12th position), but it is not working consistently. Adding -g made no difference.
The sorting gets worse when my script processes bigger ZIP files. Why is sort -k not working all the time? How can I sort both arrays?
You don't really have a clean field 12 in your data: your separator is '~' in your spec, but your data contains both ~ and sometimes - as separators.
You can check with:
head -n 1 your.data.file | sed -e "s/~/\n/g"
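For example, running that check on the first filename from the listing above splits it into its '~' fields (output abbreviated):
echo 'JGS-Memphis~AT1~Pre-Test~X-BanhT~JGMDTV387~6~P~1100~HR24-500~033072053326~20120808~240914.XML' | sed -e "s/~/\n/g"
JGS-Memphis
AT1
Pre-Test
...
20120808
240914.XML
Counting down, the date (20120808) is the 11th '~'-separated field and 240914.XML is the 12th, so -k12 is not keyed on the date you expect.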
Business requirements have changed, and sorting is no longer required in this case. This thread can be closed. Thank you.
