How to split fastq file that has both forward and reverse reads mixed - bioinformatics

everyone.
I hope you guys are doing well.
As the header says, I have a fastq file that has both pair-end reads mixed. I want to split it into forward and reverse reads.
AIM:
convert file.fastq to forward.fastq and reverse.fastq
I tried this code but it didn't work
paste - - - - - - - - < read.fq \
| tee >(cut -f 1-4 | tr "\t" "\n" > read-1.fq) \
| cut -f 5-8 | tr "\t" "\n" > read-2.fq

Related

Use cat to sort multiple fastq files in a directory

I have the following to sort my fastq by their sequence identifiers:
zcat 001_T1_1.fastq.gz | paste - - - - | sort -k1,1 -t " " | tr "\t" "\n" | gzip -c > 001_T1_1_sorted.fastq.gz
zcat 001_T1_2.fastq.gz | paste - - - - | sort -k1,1 -t " " | tr "\t" "\n" | gzip -c > 001_T1_2_sorted.fastq.gz
It is a bit slow when i am trying it for one sample. Can we make it faster and run for all fastq.gz in a directory?
How can i make it with bash ?
I think this might be a bit faster as we remove two executables from the pipe-line:
zcat file_in.gz |
awk 'BEGIN{PROCINFO["sorted_in"]="#val_str_asc"}
{n=int((NR-1)/4); a[n] = a[n] $0 ORS }
END { for(i in a) printf "%s",a[i] }' - | gzip -c - > file_out.gz

Connecting Wget and Sed Commands in One Script?

I use 3 commands (wget/sed/and a tr/sort) that all work in command line to produce a most-common words list. I use commands sequentially, saving output from sed to use in the tr/sort command. Now I need to graduate to writing a script that combines these 3 commands. So, 1) wget downloads a file, that I put into 2) sed -e 's/<[^>]*>//g' wget-file.txt, and that output > goes to 3)
cat sed-output.txt | tr -cs A-Za-z\' '\n' | tr A-Z a-z | sort | uniq -c |
sort -k1,1nr -k2 | sed ${1:-100}q > words-list.txt
I'm aware of the problem/debate about using regex to remove HTML tags, but these 3 commands are working for me for the moment. So thanks in helping pull this together.
Using awk.
wget -O- http://down.load/file| awk '{ gsub(/<[^>]*>/,"") # remove the content in label <>
$0=tolower($0) # convert all to lowercase
gsub(/[^a-z]]*/," ") # remove all non-letter chars and replaced by space
for (i=1;i<=NF;i++) a[$i]++ # save each word in array a, and sum it.
}END{for (i in a) print a[i],i|"sort -nr|head -100"}' # print the result, sort it, and get the top 100 records only
This command should do the job:
wget -O- http://down.load/file | sed -e 's/<[^>]*>//g' | \
tr -cs A-Za-z\' '\n' | tr A-Z a-z | sort | uniq -c | \
sort -k1,1nr -k2 | sed ${1:-100}q > words-list.txt

Unix uniq command to CSV file

I have a text file (list.txt) containing single and multi-word English phrases. My goal is to do a word count for each word and write the results to a CSV file.
I have figured out the command to write the amount of unique instances of each word, sorted from largest to smallest. That command is:
$ tr 'A-Z' 'a-z' < list.txt | tr -sc 'A-Za-z' '\n' | sort | uniq -c | sort -n -r | less > output.txt
The problem is the way the new file (output.txt) is formatted. There are 3 leading spaces, followed by the number of occurrences, followed by a space, followed by the word. Then on to a next line. Example:
9784 the
6368 and
4211 for
2929 to
What would I need to do in order to get the results in a more desired format, such as CSV? For example, I'd like it to be:
9784,the
6368,and
4211,for
2929,to
Even better would be:
the,9784
and,6368
for,4211
to,2929
Is there a way to do this with a Unix command, or do I need to do some post-processing within a text editor or Excel?
Use awk as follows:
> cat input
9784 the
6368 and
4211 for
2929 to
> cat input | awk '{ print $2 "," $1}'
the,9784
and,6368
for,4211
to,2929
You full pipeline will be:
$ tr 'A-Z' 'a-z' < list.txt | tr -sc 'A-Za-z' '\n' | sort | uniq -c | sort -n -r | awk '{ print $2 "," $1}' > output.txt
use sed to replace the spaces with comma
cat extra_set.txt | sort -i | uniq -c | sort -nr | sed 's/^ *//g' | sed 's/ /\, /'

How to sort groups of lines?

In the following example, there are 3 elements that have to be sorted:
"[aaa]" and the 4 lines (always 4) below it form a single unit.
"[kkk]" and the 4 lines (always 4) below it form a single unit.
"[zzz]" and the 4 lines (always 4) below it form a single unit.
Only groups of lines following this pattern should be sorted; anything before "[aaa]" and after the 4th line of "[zzz]" must be left intact.
from:
This sentence and everything above it should not be sorted.
[zzz]
some
random
text
here
[aaa]
bla
blo
blu
bli
[kkk]
1
44
2
88
And neither should this one and everything below it.
to:
This sentence and everything above it should not be sorted.
[aaa]
bla
blo
blu
bli
[kkk]
1
44
2
88
[zzz]
some
random
text
here
And neither should this one and everything below it.
Maybe not the fastest :) [1] but it will do what you want, I believe:
for line in $(grep -n '^\[.*\]$' sections.txt |
sort -k2 -t: |
cut -f1 -d:); do
tail -n +$line sections.txt | head -n 5
done
Here's a better one:
for pos in $(grep -b '^\[.*\]$' sections.txt |
sort -k2 -t: |
cut -f1 -d:); do
tail -c +$((pos+1)) sections.txt | head -n 5
done
[1] The first one is something like O(N^2) in the number of lines in the file, since it has to read all the way to the section for each section. The second one, which can seek immediately to the right character position, should be closer to O(N log N).
[2] This takes you at your word that there are always exactly five lines in each section (header plus four following), hence head -n 5. However, it would be really easy to replace that with something which read up to but not including the next line starting with a '[', in case that ever turns out to be necessary.
Preserving start and end requires a bit more work:
# Find all the sections
mapfile indices < <(grep -b '^\[.*\]$' sections.txt)
# Output the prefix
head -c+${indices[0]%%:*} sections.txt
# Output sections, as above
for pos in $(printf %s "${indices[#]}" |
sort -k2 -t: |
cut -f1 -d:); do
tail -c +$((pos+1)) sections.txt | head -n 5
done
# Output the suffix
tail -c+$((1+${indices[-1]%%:*})) sections.txt | tail -n+6
You might want to make a function out of that, or a script file, changing sections.txt to $1 throughout.
Assuming that other lines do not contain a [ in them:
header=`grep -n 'This sentence and everything above it should not be sorted.' sortme.txt | cut -d: -f1`
footer=`grep -n 'And neither should this one and everything below it.' sortme.txt | cut -d: -f1`
head -n $header sortme.txt #print header
head -n $(( footer - 1 )) sortme.txt | tail -n +$(( header + 1 )) | tr '\n[' '[\n' | sort | tr '\n[' '[\n' | grep -v '^\[$' #sort lines between header & footer
#cat sortme.txt | head -n $(( footer - 1 )) | tail -n +$(( header + 1 )) | tr '\n[' '[\n' | sort | tr '\n[' '[\n' | grep -v '^\[$' #sort lines between header & footer
tail -n +$footer sortme.txt #print footer
Serves the purpose.
Note that the main sort work is done by 4th command only. Other lines are to reserve header & footer.
I am also assuming that, between header & first "[section]" there are no other lines.
This might work for you (GNU sed & sort):
sed -i.bak '/^\[/!b;N;N;N;N;s/\n/UnIqUeStRiNg/g;w sort_file' file
sort -o sort_file sort_file
sed -i -e '/^\[/!b;R sort_file' -e 'd' file
sed -i 's/UnIqUeStRiNg/\n/g' file
Sorted file will be in file and original file in file.bak.
This will present all lines beginning with [ and following 4 lines, in sorted order.
UnIqUeStRiNg can be any unique string not containing a newline, e.g. \x00

'Read' command stripping '\n' string

I want to extract data from a file which looks like this :
BK20120802130531:/home/michael/Scripts/usb_backup.sh
BK20120802130531:/home/michael/Scripts/yad_0.17.1.1-1_i386.deb
BK20120802130731:/home/michael/Scripts/gbk.sh
BK20120802130131:/home/michael/Scripts/alt-notify-send.sh
BK20120802130131:/home/michael/Scripts/bk.bak
BK20120802130131:/home/michael/Scripts/bk.sh
BK20120802130131:/home/michael/Scripts/demande_password.sh
The idea is to show on the screen (without creating a temporary file, nor modifying the original file) what follows :
alt-notify-send.sh
/home/michael/Scripts
bk.bak
/home/michael/Scripts
bk.sh
/home/michael/Scripts
demande_password.sh
/home/michael/Scripts
gbk.sh
/home/michael/Scripts
usb_backup.sh
/home/michael/Scripts
yad_0.17.1.1-1_i386.deb
/home/michael/Scripts
To sum up :
Strip the characters before ':'
Put the filenames before their corresponding directory
Sort the filenames by alphabetical order
Do a carriage return between each filename and its corresponding directory
I succeed doing all this, but there is still an ugly thing in my code concerning point #4 :
cut -f 2 -d ':' $big_file | \
sort -u | \
while read file ; do
echo "$(basename "$file")zipzapzupzop$(dirname "$file")" # <-- ugly thing #1
done | \
sort -dfb | \
while read line ; do
echo $line
done | \
sed 's/zipzapzupzop/\n/' # <-- ugly thing #2
At the beginning, I had written :
echo "$(basename "$file")\n$(dirname "$file")"
in place of ugly thing#1, in order to be able to do
echo -e "$line"
in the second while boucle. However, the read command strips each time the '\n' string, so that I obtain
alt-notify-send.shn/home/michael/Scripts
bk.bakn/home/michael/Scripts
bk.shn/home/michael/Scripts
demande_password.shn/home/michael/Scripts
gbk.shn/home/michael/Scripts
usb_backup.shn/home/michael/Scripts
yad_0.17.1.1-1_i386.debn/home/michael/Scripts
I tried to protect the '\' character by another '\', but the result is the same.
man read
is of no help either. So, is it a proper way to do this ?
read is a shell builtin, and man read may be giving you the docs for the (mostly unrelated) syscall.
read -r will prevent read from processing \ sequences.
The whole thing could have been done with a single awk script though:
awk '
{
start = index($0, ":") + 1
end = match($0, "[^/]*$")
out[NR] = substr($0, end) "\n" substr($0, start, end - start - 1)
}
END {
asort(out)
for (i = 1; i <= NR; i++)
print out[i]
}'
If you don't need to handle spaces in filenames, you can do this:
cat $bigfile | sed 's/.*://' | while read file; do
echo "$(basename $file) $(dirname $file)"
done | sort | awk '{print $1"\n"$2}'
You can do it with the following pipeline (should be on one line, I've split it and added comments for readability):
| sed -e 's/^[^:]*://' # Remove from start of line to first ':'
-e 's?/\([^/]*$\)? \1?' # Replace final '/' with a space
| sort -k2 # Sort on column 2 (filename)
| awk '{print $2"\n"$1}' # Reverse fields
See the following transcript:
echo 'BK20120802130531:/home/michael/Scripts/usb_backup.sh
BK20120802130531:/home/michael/Scripts/yad_0.17.1.1-1_i386.deb
BK20120802130731:/home/michael/Scripts/gbk.sh
BK20120802130131:/home/michael/Scripts/alt-notify-send.sh
BK20120802130131:/home/michael/Scripts/bk.bak
BK20120802130131:/home/michael/Scripts/bk.sh
BK20120802130131:/home/michael/Scripts/demande_password.sh'
| sed -e 's/^[^:]*://'
-e 's?/\([^/]*$\)? \1?'
| sort -k2
| awk '{print $2"\n"$1}'
alt-notify-send.sh
/home/michael/Scripts
bk.bak
/home/michael/Scripts
bk.sh
/home/michael/Scripts
demande_password.sh
/home/michael/Scripts
gbk.sh
/home/michael/Scripts
usb_backup.sh
/home/michael/Scripts
yad_0.17.1.1-1_i386.deb
/home/michael/Scripts
Just keep in mind that sort may not work as expected with lines containing spaces.
Assuming you do not have hash tags in your filenames you could use this coreutils pipeline:
cut -d: -f2- infile \
| sed -r 's,(.*)/([^/]*)$,\2#\1,' \
| sort -t'#' \
| tr '#' '\n'
cut removes the first part.
sed splits the path, swaps filename and directory and delimits them with a #.
sort hash tag delimited text.
tr finally replaces the hash tag with a newline.
If you know the number of path elements, you can use the simpler version:
cut -d: -f2- infile \
| sort -t/ -k4,4 \
| sed 's,(.*)/([^/]*)$,\2\n\1,'

Resources