converting nexus to FASTA format - shell

I have many .nexus files that I want to convert to FASTA-style format and combine into one .fasta file. Here is my example code:
for i in *.nexus;
do
awk 'NR >5' /path/to/nexus_files/$i | tr -d "'" | tr " " "\n" | sed 's/locus/>locus/g' > /path/to/fasta/${i}.fasta
done
This works for the first nexus file, but the #NEXUS header remains in subsequent conversions.
Input:
#NEXUS
begin data;
dimensions ntax=1 nchar=300;
format datatype=dna missing=? gap=-;
matrix
'locus1_individual-1'
???????????????????????????????TAGATTTTTTAGTCCTTAC
;
end;
Desired output:
>locus1_individual-1
???????????????????????????????TAGATTTTTTAGTCCTTAC

To speed it up a bit, you can reduce the number of commands needed:
for i in *.nexus;
do
awk 'NR>5 {gsub(f,"");gsub(/ /,"\n");gsub(/locus/,">&");print}' f="'" "/path/to/nexus_files/$i" > "/path/to/fasta/${i}.fasta"
done
An idea from anishsane. (all in one awk)
awk 'FNR>5 {sub(/\.nexus$/,"",FILENAME);sub(/.*\//,"/path/to/fasta/",FILENAME);gsub(f,"");gsub(/ /,"\n");gsub(/locus/,">&");print >FILENAME".fasta"}' f="'" /path/to/nexus_files/*
The first sub removes the .nexus extension from the filename.
The second changes the path to /path/to/fasta/.
It's important to use FNR here, since many files are read within one awk invocation.

Try:
for i in *.nexus;
do
awk 'FNR>5' "/path/to/nexus_files/$i" | tr -d "'" | tr " " "\n" | sed 's/locus/>locus/g' > "/path/to/fasta/${i}.fasta"
done
NR is the total number of records across all input files; FNR is the per-file record count, which resets at the start of each new file.
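The reset behaviour is easy to see with two tiny throwaway files (the names a.txt and b.txt are just for illustration):

```shell
# Create two small sample files
printf 'x\ny\n' > a.txt
printf 'z\n' > b.txt

# Print the filename, the global record number, and the per-file record number
awk '{print FILENAME, NR, FNR}' a.txt b.txt
# a.txt 1 1
# a.txt 2 2
# b.txt 3 1   <- NR keeps counting, FNR restarts
```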

Related

Getting last X fields from a specific line in a CSV file using bash

I'm trying to get, as a bash variable, the list of users from my csv file. The problem is that the number of users is random and can be from 1-5.
Example CSV file:
"record1_data1","record1_data2","record1_data3","user1","user2"
"record2_data1","record2_data2","record2_data3","user1","user2","user3","user4"
"record3_data1","record3_data2","record3_data3","user1"
I would like to get something like
list_of_users=$(cat file.csv | grep "record2_data2" | <something>)
echo $list_of_users
user1,user2,user3,user4
I'm trying this:
cat file.csv | grep "record2_data2" | awk -F, -v OFS=',' '{print $4,$5,$6,$7,$8 }' | sed 's/"//g'
My result is:
user2,user3,user4,,
Question:
How to remove all "," from the end of my result? Sometimes it is just one but sometimes can be user1,,,,
Can I do it in a better way? Users always start after the 3rd column in my file.
This will do what your code seems to be trying to do (print the users for a given string record2_data2 which only exists in the 2nd field):
$ awk -F',' '{gsub(/"/,"")} $2=="record2_data2"{sub(/([^,]*,){3}/,""); print}' file.csv
user1,user2,user3,user4
but I don't see how that's related to your question subject of Getting last X records from CSV file using bash so idk if it's what you really want or not.
Better to use a bash array, and join it into a CSV string when needed:
#!/usr/bin/env bash
readarray -t listofusers < <(cut -d, -f4- file.csv | tr -d '"' | tr ',' $'\n' | sort -u)
IFS=,
printf "%s\n" "${listofusers[*]}"
cut -d, -f4- file.csv | tr -d '"' | tr ',' $'\n' | sort -u is the important bit - it first only prints out the fourth and following fields of the CSV input file, removes quotes, turns commas into newlines, and then sorts the resulting usernames, removing duplicates. That output is then read into an array with the readarray builtin, and you can manipulate it and the individual elements however you need.
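For a quick end-to-end check, the pipeline can be run against the sample data from the question (a sketch; file.csv is created in the current directory):

```shell
cat > file.csv <<'EOF'
"record1_data1","record1_data2","record1_data3","user1","user2"
"record2_data1","record2_data2","record2_data3","user1","user2","user3","user4"
"record3_data1","record3_data2","record3_data3","user1"
EOF

# Read the de-duplicated usernames into an array, then join with commas
readarray -t listofusers < <(cut -d, -f4- file.csv | tr -d '"' | tr ',' '\n' | sort -u)
IFS=,
printf "%s\n" "${listofusers[*]}"
# user1,user2,user3,user4
```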
GNU sed solution, let file.csv content be
"record1_data1","record1_data2","record1_data3","user1","user2"
"record2_data1","record2_data2","record2_data3","user1","user2","user3","user4"
"record3_data1","record3_data2","record3_data3","user1"
then
sed -n -e 's/"//g' -e '/record2_data/ s/[^,]*,[^,]*,[^,]*,// p' file.csv
gives output
user1,user2,user3,user4
Explanation: -n turns off automatic printing; the expressions mean the following: the 1st globally substitutes " with the empty string, i.e. deletes them; the 2nd, for lines containing record2_data, substitutes (s) everything up to and including the 3rd , with the empty string, i.e. deletes it, and prints (p) the changed line.
(tested in GNU sed 4.2.2)
awk -F',' '
/record2_data2/{
for(i=4;i<=NF;i++) o=sprintf("%s%s,",o,$i);
gsub(/"|,$/,"",o);
print o
}' file.csv
user1,user2,user3,user4
This might work for you (GNU sed):
sed -E '/record2_data/!d;s/"([^"]*)"(,)?/\1\2/4g;s///g' file
Delete all records except for that containing record2_data.
Remove double quotes from the fourth field onward.
Remove the remaining double-quoted fields (the first three) entirely.
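A run against the sample file illustrates the three steps (GNU sed assumed, since the empty regex in s///g reuses the previous pattern):

```shell
cat > file <<'EOF'
"record1_data1","record1_data2","record1_data3","user1","user2"
"record2_data1","record2_data2","record2_data3","user1","user2","user3","user4"
"record3_data1","record3_data2","record3_data3","user1"
EOF

# Keep only the matching record, unquote fields 4 onward, delete the first three quoted fields
sed -E '/record2_data/!d;s/"([^"]*)"(,)?/\1\2/4g;s///g' file
# user1,user2,user3,user4
```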

Merge multiple files into a single row file with a delimiter

UPDATED QS:
I have been working on a bash script that will merge multiple text files with numerical values into a single one-row text file, inserting a delimiter between some of the files' values while merging.
Example:
File1.txt has the following contents:
168321099
File2.txt has:
151304
151555
File3.txt has:
16980925
File4.txt has:
154292
149092
Now I want an output.txt file like below:
, 168321099 151304 151555 16980925 , 154292 149092
Basically, each file's values are delimited by a space and placed on a single row, with a comma as the 1st and 6th field of the output row.
I tried:
cat * > out.txt, but it's not coming out as expected.
I am not very sure if I understood your question correctly, but I interpreted it as follows:
The set of files file1,...,filen contain a set of words which you want to have printed in one single line.
Each word is space separated
In addition to the string of words, you want the first field to be a , and another , between words 4 and 5.
The cat+tr+awk solution:
$ cat <file1> ... <filen> | tr '\n' ' ' | awk '{$1=", "$1; $4=$4" ,"; print}'
The awk solution:
$ awk 'NR==1||NR==5{printf s",";s=" "}{printf " "$1}' <file1> ... <filen>
If tr is available on your system you can do the following: cat * | tr "\n" " " > out.txt
tr "\n" " " translates all line breaks to spaces
If the number of lines per file is constant, then the easiest way is tr as @Littlefinix suggested, with a couple of anonymous files to supply the commas, and an echo at the end to add an explicit newline to the output line:
cat <(echo ",") File1.txt File2.txt File3.txt <(echo ",") File4.txt | tr "\n" " " > out.txt; echo >> out.txt
out.txt is exactly what you specified:
, 168321099 151304 151555 16980925 , 154292 149092
If the number of lines per input file might vary (e.g., File2.txt has 3 or 4 lines, etc.), then placing the commas always in the 1st and 6th field will be more involved, and you'd probably need a script and not a one-liner.
The following single awk could help you with the same.
awk 'FNR==1{count++;} {printf("%s%s",count==1||(count==(ARGC-1)&&FNR==1)?", ":" ",$0)} END{print ""}' *.txt
Adding a non-one-liner form of the solution too now.
awk '
FNR==1 { count++ }
{ printf("%s%s",count==1||(count==(ARGC-1)&&FNR==1)?", ":" ",$0) }
END { print "" }
' *.txt

Comparing 2 files with a for loop in bash

I am trying to compare the values in 2 files. For each row in Summits3.txt I want to define the value in Column 1 as "Chr" and then find the rows in generef.txt which have my value for "Chr" in column 2.
Then I would like to output some info about that row from generef.txt to out.txt and then repeat until the end.
I am using the following script:
#!/bin/bash
IFS=$'\n'
for i in $(cat Summits3.txt)
do
Chr=$(echo "$i" | awk '{print $1}')
awk -v var="$Chr" '{
if ($2==""'${Chr}'"")
print $2, $3
}' generef.txt > out.txt
done
It "works", but it's only comparing values from the last line of Summits3.txt. It seems like it's not looping through the awk bit.
Anyway please help if you can!
I think you might be looking for something like this:
awk 'FNR == NR {a[$1]; next} $2 in a {print $2, $3}' Summits3.txt generef.txt > out.txt
Basically, you read column one from the first file into an array (the array index is your chr and the value is the empty string), then for the second file print only rows where the second column is in the index set of the array. FNR is the row number in the file currently being processed; NR is the row number across all rows processed so far. This is a general look-up command I use for pulling out genes or variants from one file that are present in the other.
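A minimal sketch with made-up sample data (the file contents below are assumptions for illustration):

```shell
# Column 1 of Summits3.txt holds the Chr values to look up
printf 'chr1 peakA\nchr3 peakB\n' > Summits3.txt
# Column 2 of generef.txt holds the Chr value for each gene row
printf 'g1 chr1 100\ng2 chr2 200\ng3 chr3 300\n' > generef.txt

awk 'FNR == NR {a[$1]; next} $2 in a {print $2, $3}' Summits3.txt generef.txt
# chr1 100
# chr3 300
```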
In your code above it should be appending to out.txt: >> out.txt. But then you have to make sure to reset out.txt before each run.
Besides using external scripts inside a loop (that is expensive), the first thing we see is that you redirect your output to a file from inside the loop. The output file is recreated each time, so please change it to append (>>), or better, move the redirection outside the loop.
When you want to use a loop, try this
while read -r Chr other; do
cut -d" " -f2,3 generef.txt | grep -E "^${Chr} "
done < Summits3.txt > out.txt
When you want to avoid the loop (needed for large input files), an awk or some combined command can be used.
This first solution can fail:
grep -f <(cut -d" " -f1 Summits3.txt) <(cut -d" " -f2,3 generef.txt)
You only want matches of the complete field Chr, starting at the first position and ending at a space (I assume space is the field separator):
grep -f <(cut -d" " -f1 Summits3.txt| sed 's/.*/^& /') <(cut -d" " -f2,3 generef.txt)
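With the same kind of sample data, the anchored patterns only match the whole first field (a sketch; the file contents are assumptions for illustration):

```shell
printf 'chr1 peakA\nchr3 peakB\n' > Summits3.txt
printf 'g1 chr1 100\ng2 chr2 200\ng3 chr3 300\n' > generef.txt

# sed turns each Chr value into an anchored pattern like "^chr1 "
grep -f <(cut -d" " -f1 Summits3.txt | sed 's/.*/^& /') <(cut -d" " -f2,3 generef.txt)
# chr1 100
# chr3 300
```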

Get the contents of one column given another column

I have a tab separated file with 3 columns. I'd like to get the contents of the first column, but only for the rows where the 3rd column is equal to 8. How do I extract these values? If I just wanted to extract the values in the first column, I would do the following:
cat file1 | tr "\t" "~" | cut -d"~" -f1 >> file_with_column_3
I'm thinking something like:
cat file1 | tr "\t" "~" | if cut -d"~" -f3==8; then cut -d"~" -f1 ; fi>> file_with_column_3
But that doesn't quite seem to work.
Given that your file is tab delimited, it seems like this problem would be well suited for awk.
Something simple like below should work for you, though without any sample data I can't say for sure (try to always include this on questions on SO)
awk -F'\t' '$3==8 {print $1}' inputfile > outputfile
The -F'\t' sets the input delimiter as tab.
$3==8 compares if the 3rd column based on that delimiter is 8.
If so, the {print $1} is executed, which prints the first column.
Otherwise, nothing is done and awk proceeds to the next line.
If your file had a header you wanted to preserve, you could just modify this like the following, which tells awk to print if the current record number is 1.
awk -F'\t' 'NR==1 {print;} $3==8 {print $1}' inputfile > outputfile
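A quick check with a small tab-separated sample (made up for illustration):

```shell
# Three tab-separated columns; only rows whose 3rd column is 8 should match
printf 'alpha\t10\t8\nbeta\t20\t3\ngamma\t30\t8\n' > inputfile

awk -F'\t' '$3==8 {print $1}' inputfile
# alpha
# gamma
```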
awk can handle this better:
awk -F '\t' '$3 == 8 { print $1 }' file1
You can do it with bash only too:
cat x | while read y; do split=(${y}); [ "${split[2]}" == '8' ] && echo "${split[0]}"; done
The input is read into variable y, then split into an array. The IFS (input field separator) defaults to <space><tab><newline>, so it splits on tabs too. The third field of the array is then compared to '8'. If it matches, the first field of the array is printed. Remember that array indices start counting at zero.
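Run against a small tab-separated sample, the loop echoes only the first field of matching rows (a sketch with made-up data; note this needs bash for the array syntax):

```shell
printf 'beta\t20\t3\nalpha\t10\t8\n' > x

# Only the row whose 3rd field is 8 gets its 1st field echoed
cat x | while read y; do split=(${y}); [ "${split[2]}" == '8' ] && echo "${split[0]}"; done
# alpha
```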

Bash script to search csv file column and count how many times a value shows up

I am really new to bash and I was trying to search a csv file column for a value and then add a counter. I found this online, but it prints the column, and I have been trying to count how many times an R shows up, not print the whole thing.
awk -F "\"*,\"*" '{print $2}' "$file"
The csv file is like:
12345,R,N,N,Y,N,N,N,Bob Builder
I am looking for R in column 2. Can anybody point me in the right direction?
The following should do what you want (where file.csv is your csv file):
Case sensitive version:
cut -f 2 -d , file.csv | grep -c R
Case insensitive version:
cut -f 2 -d , file.csv | grep -ic R
Explanation
cut -f 2 -d , file.csv
This takes each line of file.csv and extracts the specified fields. The -f 2 option means extract field 2 and the -d , means use a ',' as the field delimiter. The output of this is then piped to grep.
grep -c R This looks for lines containing 'R'. Since it is passed the contents of the previous cut command, it is looking for an 'R' in field two. The -c option means count the number of matching lines.
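For instance, with a few rows in the style of the question (the extra rows are made up to give something to count):

```shell
cat > file.csv <<'EOF'
12345,R,N,N,Y,N,N,N,Bob Builder
67890,N,R,N,N,N,N,N,Alice Example
24680,R,Y,N,N,N,N,N,Carol Example
EOF

# Field 2 is R, N, R -> two matching lines
cut -f 2 -d , file.csv | grep -c R
# 2
```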
Using awk only:
awk -F "\"*,\"*" '{if ($2 == "R") cnt++} END{print cnt+0}' file
For fun - perl only - this counts everything.
perl -F, -anle 'map{$cnt{$_}{$F[$_]}++}0..$#F;END{print $cnt{1}{R}}'
