I have a bunch of TSV files in my folder, and for every one of them I would like to get a FASTA file where the header after the '>' sign is the name of the file.
My TSV file has 5 columns without header:
Thus:
inputfile called: "A.coseq.table_headless.tsv"
HIV1B-pol-seed 15 MAX 1959 GTAACAGACTCACAATATGCATTAGGAATCATTCAAGC
output file called "A.fasta"
>A_MAX
GTAACAGACTCACAATATGCATTAGGAATCATTCAAGC
I want to run the script in bash for all the files at once, and I have this script, which does not work because the awk print statement contains a shell variable in curly braces:
for sample in `ls *coseq.table_headless.tsv`
do
base1=$(basename $sample "coseq.table_headless.tsv")
awk '{print ">"${base1}"_"$3"\n"$5}' ${base1}coseq.table_headless.tsv > ${base1}fasta
done
Any idea how to correct this code?
Thank you very much
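A minimal fix of the loop itself (untested sketch): the shell variable ${base1} is never expanded inside the single-quoted awk program, so it has to be passed in with awk's -v option; including the leading "." in the basename suffix also gives "A" rather than "A.":
for sample in *coseq.table_headless.tsv
do
    # strip the whole suffix, dot included, so base1 becomes "A"
    base1=$(basename "$sample" .coseq.table_headless.tsv)
    # ${base1} is not visible inside '...', so hand it to awk as the variable b
    awk -v b="$base1" '{print ">" b "_" $3 "\n" $5}' "$sample" > "${base1}.fasta"
done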
If the basename is the part up to the first ".", you can get rid of the loop as well:
awk '{split(FILENAME,base,".");
print ">" base[1] "_" $3 "\n" $5 > base[1]".fasta"}' *coseq.table_headless.tsv
The other solutions posted so far have a few issues:
not closing the files as they're written will produce "too many open files" errors unless you use GNU awk,
calculating the output file name every time a line is read rather than once when the input file is opened is inefficient, and
using an unparenthesized expression on the right side of output redirection is undefined behavior and so will only work in some awks (including GNU awk).
This will work robustly and efficiently in all awks:
awk '
FNR==1 { close(out); f=FILENAME; sub(/\..*/,"",f); pfx=">"f"_"; out=f".fasta" }
{ print pfx $3 ORS $5 > out }
' *coseq.table_headless.tsv
Another awk solution:
awk '{ pfx=substr(FILENAME,1,index(FILENAME,".")-1);
printf(">%s_%s\n%s\n",pfx,$3,$5) > pfx".fasta" }' *coseq.table_headless.tsv
pfx contains the first part of the filename (up to the first ".").
In a csv file, if in between two commas there are more than two " present, then I want to replace them with only two " using a shell script.
Example
If in the csv file it is like """any word"", it should get replaced with "any word"; or if it is like [any number of "], it should get replaced with "".
FYI: " is a double quote, not two single quotes,
and [] is not actually present in the data; I wrote it that way for understanding.
awk solution:
sample testfile contents:
sdsdf,"""hello"",sdsdf
asdasd,[asdasd asdasd]",sdfsdf
sdf,"[asdasd]",asdasd
The job:
awk -F, '{ for(i=1;i<=NF;i++) if($i~/"{2,}/) gsub(/"+/,"\"",$i);
else if($i~/^[^"]*"{1,}[^"]*$/) $i="\"\""; }1' OFS=',' testfile
The output:
sdsdf,"hello",sdsdf
asdasd,"",sdfsdf
sdf,"[asdasd]",asdasd
Using Roman's test file:
awk -F, '{gsub(/"""hello""/,"\42hello\42",$2); gsub(/\[asdasd asdasd\]/,"\42")}1' OFS=, file
sdsdf,"hello",sdsdf
asdasd,"",sdfsdf
sdf,"[asdasd]",asdasd
Here's a sed solution, which works between commas as the OP wants, but which doesn't work if there are commas in between the quotation marks:
sed ':a;s/\(,"[^,"]*\|^"[^,"]*\)"\([^,]\)/\1\2/;ta' testfile
Using Roman's test file my output is:
sdsdf,"hello",sdsdf
asdasd,[asdasd asdasd]",sdfsdf
sdf,"[asdasd]",asdasd
Note that the second field of the second line is different in my version, as I'm not sure what behavior OP wants in that case or if fields like that even exist.
I have a few csv extracts that I am trying to fix up the date on; they are as follows:
"Time Stamp","DBUID"
2016-11-25T08:28:33.000-8:00,"5tSSMImFjIkT0FpiO16LuA"
The first column is always the "Time Stamp". I would like to convert this so it only keeps the date "2016-11-25" and drops the "T08:28:33.000-8:00".
The end result would be..
"Time Stamp","DBUID"
2016-11-25,"5tSSMImFjIkT0FpiO16LuA"
There are plenty of files with different dates.
Is there a way to do this in ksh? Some kind of for-each loop to loop through all the files and replace the long timestamp, leaving just the date?
Use sed:
$ sed '2,$s/T[^,]*//' file
"Time Stamp","DBUID"
2016-11-25,"5tSSMImFjIkT0FpiO16LuA"
How it works:
2,$        # Skip the header (first line); removing this will make the replacement on the first line as well.
s/T[^,]*// # Replace everything between T (inclusive) and , (exclusive); `[^,]*' matches everything but `,' zero or more times.
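Since there are several files, a small loop could wrap that command; this sketch assumes a *.csv glob and GNU sed's -i for in-place editing:
for f in *.csv
do
    sed -i '2,$s/T[^,]*//' "$f"   # skip the header line, then trim each timestamp in place
done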
Here's one solution using a standard AIX utility (awk):
awk -F, -v OFS=, 'NR>1{sub(/T.*$/,"",$1)}1' file > file.cln && mv file.cln file
output
"Time Stamp","DBUID"
2016-11-25,"5tSSMImFjIkT0FpiO16LuA"
(but I no longer have access to an aix environment, so only tested with my local awk).
NR>1 skips the header line, and the sub() is limited to only the first field (up to the first comma). The trailing 1 char is awk shorthand for {print $0}.
If your data layout changes and you get extra commas in your data, this may require fixing.
IHTH
Using sed:
sed -i "s/\([0-9]\{4\}\)-\([0-9]\{2\}\)-\([0-9]\{2\}\).*,/\1-\2-\3,/" file.csv
Output:
"Time Stamp","DBUID"
2016-11-25,"5tSSMImFjIkT0FpiO16LuA"
-i edit files in place
s substitute
This is a perfect job for awk, but unlike the previous answer, I recommend using the substring function.
awk -F, -v OFS=, 'NR > 1{$1 = substr($1,1,10)} {print $0}' file.txt
Explanation
-F,: The -F flag sets the input field separator, in this case a comma; -v OFS=, keeps the output comma-separated when $1 is reassigned
NR > 1: Ignore the first row
$1: Refers to the first field
$1 = substr($1,1,10): Sets the first field to the first 10 characters of the field. In the example, this is the date portion
print $0: This will print the entire row
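awk doesn't edit files in place (without GNU awk's -i inplace extension), so applying this to every extract from ksh could look like the following sketch; the *.csv glob and temp-file name are assumptions:
for f in *.csv
do
    # rewrite each file via a temp copy, keeping only the date in column 1
    awk -F, -v OFS=, 'NR > 1 {$1 = substr($1,1,10)} {print $0}' "$f" > "$f.tmp" && mv "$f.tmp" "$f"
done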
Input
A file called input_file.csv, which has 7 columns, and n rows.
Example header and row:
Date Location Team1 Team2 Time Prize_$ Sport
2016 NY Raptors Gators 12pm $500 Soccer
Output
One output file per distinct value in column 7 of the original file, where the rows in each new file share that value. Each file is named after that shared value from column 7. Note: each file will have the same header. (The script currently does this.)
Example: if 2 rows in the original file had golf as their value for column 7, they would be grouped together in a file called golf.csv. If 3 other rows shared soccer as their value for column 7, they would be found in soccer.csv.
An array that has the name of each generated file in it. This array lives outside of the scope of awk. (This is what I need help with.)
Example: Array = [golf.csv, soccer.csv]
Situation
The following script produces the desired output. However, I want to run another script on each of the newly generated files and I don't know how.
Question:
My idea is to store the names of each new file in an array. That way, I can loop through the array and do what I want to each file. The code below passes the array into awk as a variable called myarray, but I don't know how to add the name of each file to the array.
#!/bin/bash
ARRAY=()
awk -v myarray="$ARRAY" -F"\",\"" 'NR==1 {header=$0}; NF>1 && NR>1 {if(! files[$7]) {print header >> ("" $7 ".csv"); files[$7]=1}; print $0 >> ("" $7 ".csv"); close("" $7 ".csv");}' input_file.csv
for i in "${ARRAY[@]}"
do
:
echo $i
done
Rather than struggling to get awk to fill your shell array variable, why not:
make sure that the *.csv files are created in a clean directory
use globbing to loop over all *.csv files in that directory?
awk -F'","' ... # your original Awk command
for i in *.csv # use globbing to loop over resulting *.csv files
do
:
echo $i
done
Just off the top of my head, untested because you haven't supplied very much sample data, what about this?
#!/usr/bin/awk -f
FNR==1 {
header=$0
next
}
!($7 in files) {
files[$7]=sprintf("sport-%s.csv", $7)
print header > files[$7]
}
{
print > files[$7]
}
END {
printf("declare -a sportlist=( ")
for (sport in files) {
printf("\"%s\" ", sport)
}
printf(" )\n");
}
The idea here is that we index the array files[] by sport name and store the corresponding filename as the value. (You can format the filename inside sprintf() as you see fit.) We step through the file, writing a header line whenever we see a new sport with no recorded filename. Then for non-header lines, we print to the file based on the sport name.
For your second issue, exporting the array back to something outside of awk, the END block here will output a declare line which can be interpreted by bash. If you feel lucky, you can eval this awk script inside command expansion, and the declare command will effectively be interpreted by your shell:
eval $(/path/to/awkscript inputfile.csv)
Or, if you subscribe to the school of thought that considers eval to be evil, you can redirect the awk script's standard output to a temporary file which you source:
/path/to/awkscript inputfile.csv > /tmp/yadda.$$
. /tmp/yadda.$$
(Don't use this temp file, make a real one with mktemp or the like.)
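With mktemp, that could look something like this sketch:
tmp=$(mktemp)
/path/to/awkscript inputfile.csv > "$tmp"
. "$tmp"
rm -f "$tmp"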
There's no way for any program to modify the environment of the parent shell. Just have the awk script output the names of the files as standard output, and use command substitution to put them in an array.
filesArray=($(awk ... ))
If the files might have spaces in them, you need a different solution; assuming you're on bash 4, you can just be sure to print each file on a separate line and use readarray:
readarray -t filesArray < <( awk ... )
If the files might have newlines in them, too, then things get tricky...
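A sketch of that approach, reusing the field separator and column-7 logic from the question (the seen and out names are just illustrative helpers):
readarray -t filesArray < <(
  awk -F'","' '
    NR==1 { header=$0; next }
    NF>1  {
      out = $7 ".csv"
      # first time this value appears: write the header and report the new file name on stdout
      if (!(out in seen)) { seen[out]=1; print header > out; print out }
      print $0 > out
    }
  ' input_file.csv
)
for f in "${filesArray[@]}"; do
  echo "$f"    # run whatever per-file processing is needed here
done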
If your file is not large, you can run another script to get the unique $7 elements, for example:
$ awk 'NR>1&&!a[$7]++{print $7}' sports
will print the values; you can change it to your file name format as well, such as
$ awk 'NR>1&&!a[$7]++{print tolower($7)".csv"}' sports
This can then be piped to your other process, here for example to wc:
$ awk ... sports | xargs wc
This will do what I THINK you want:
oIFS="$IFS"; IFS=$'\n'
array=( $(awk '{out=$7".csv"; print > out} !seen[out]++{print out}' input_file.csv) )
IFS="$oIFS"
If your input file really is comma-separated instead of space-separated as you show in the sample input in your question, then adjust the awk script to suit (you might want to look at GNU awk and FPAT).
If you don't have GNU awk then you'll need to add a bit more code to close the open output files as you go.
The above will fail if you have file names that contain newlines but will be fine for blank chars or other white space.
I have several thousand csv files I wish to reformat. They all have a standard filename with incremental integer, eg. file_1.csv, file_2.csv, file_3.csv, and they all have the same format:
CH1
s,Volts
-1e-06,-0.0028,
-9.998e-07,-0.0032,
-9.99e-07,-0.0036,
Each file is 10,002 lines long. I want to remove the header, and I want to separate the two columns into separate files. I have the following code, which produces the results I want for a single input file:
tail -10000 file_1.csv |
awk -F, '{print $1 > "s.dat"; print $2 > "Volts.dat"}'
However, I want something that will produce the equivalent files for each csv file, say replacing s.dat with s_$i.dat or similar, but I'm not sure how to go about this, nor how to read in each separate csv file in a loop rather than explicitly stating it as file_1.csv.
awk to the rescue!
awk -F, 'FNR>2{print $1 > ("s_" FILENAME ".dat");
               print $2 > ("Volts_" FILENAME ".dat")}' file*
or, reading the column names from the data files:
$ awk -F, 'FNR==2{s="_"FILENAME".dat";h1=$1s;h2=$2s}
FNR>2{print $1 > h1; print $2 > h2}' file*
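A variation of the same idea that strips the ".csv" from FILENAME, so the outputs come out as s_file_1.dat and Volts_file_1.dat (the exact naming scheme is an assumption), and closes the previous pair of output files whenever a new input file starts:
awk -F, 'FNR==1 { close(o1); close(o2)                 # release the previous file handles
                  base=FILENAME; sub(/\.csv$/,"",base) # drop the .csv extension once per file
                  o1="s_" base ".dat"; o2="Volts_" base ".dat" }
         FNR>2  { print $1 > o1; print $2 > o2 }' file_*.csv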
I am just splitting a very large csv file into parts. Whenever I run the following command, the file doesn't completely split; instead it returns the following error. How can I avoid this and split the whole file?
awk -F, '{print > $2}' test1.csv
awk: YY1 makes too many open files
input record number 31608, file test1.csv
source line number 1
Just close the files after writing:
awk -F, '{print >> $2; close($2)}' test1.csv
You must have a lot of lines. Are you sure that the second column repeats enough to put those records into individual files? Anyway, awk is holding the files open until the end. You'll need a process that can close the file handles when not in use.
Perl to the rescue. Again.
#!/usr/bin/perl
while( <> ) {
    @content = split /,/, $_;                 # split the CSV line on commas
    open( OUT, ">> $content[1]" ) or die "whoops: $!";
    print OUT $_;
    close OUT;                                # close after each write so file handles don't pile up
}
usage: script.pl your_monster_file.csv
outputs the entire line into a file named the same as the value of the second CSV column in the current directory, assuming no quoted fields etc.