File names stacking during for loop - macos

I am new to shell scripting, so this might be a dumb question. I haven't found an answer online, though. I am taking a coworker's script and changing it so that it works for my data. Right now I am running a test that only uses three of my data files. The script reaches a for loop that is supposed to run once for each of the different files (three times).
listtumor=`cat /Users/TReiersen/Work-Folder/OV/DataProcessing/TestRun/MatchedTumorTest.txt`
for i in $listtumor
do
lst=`ls /Users/TReiersen/Work-Folder/OV/DataProcessing/TestRun/freshstart/${i}*.txt | awk -F'/' '{print $9}'`
MatchedTumorTest.txt just contains the three file names that I am using for the test, without the '.txt' extension. As far as I can tell, this code should just run through the loop three times, once for each file. Instead I am getting this error:
ls: /Users/TReiersen/Work-Folder/OV/DataProcessing/TestRun/freshstart/TCGA-04-1514-01A-01D-0500-02_S01_CGH_105_Dec08\rTCGA-04-1530-01A-02D-0500-02_S01_CGH_105_Dec08\rTCGA-04-1542-01A-01D-0500-02_S01_CGH_105_Dec08*.txt: No such file or directory
For some reason all of the file names are stacked on top of each other instead of the loop going to each one individually. Any ideas why this is happening?
Thanks,
T.J.

It looks like the lines in your text file may be separated by carriage returns instead of newlines. Since none of the file names in your example have spaces, the for loop should work just fine if you initialize your listtumor like this:
listtumor=`tr '\r' '\n' < /Users/TReiersen/Work-Folder/OV/DataProcessing/TestRun/MatchedTumorTest.txt`
The tr command will translate the carriage returns into newlines, which is what most text processors (like the shell's own for loop) expect, and write the result to standard output.
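A quick way to confirm that carriage returns are the culprit, using only standard tools, is:
file /Users/TReiersen/Work-Folder/OV/DataProcessing/TestRun/MatchedTumorTest.txt
cat -v /Users/TReiersen/Work-Folder/OV/DataProcessing/TestRun/MatchedTumorTest.txt
file will report something like "ASCII text, with CR line terminators", and cat -v shows each carriage return as ^M.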

The for loop doesn't do too well with some kinds of separators. Try this instead:
while read line; do
lst=`ls /Users/TReiersen/Work-Folder/OV/DataProcessing/TestRun/freshstart/${line}*.txt | awk -F'/' '{print $9}'`
...
done < /Users/TReiersen/Work-Folder/OV/DataProcessing/TestRun/MatchedTumorTest.txt
I'm assuming here that you're separating MatchedTumorTest.txt with newlines.

So, putting it all together:
dir="/Users/TReiersen/Work-Folder/OV/DataProcessing/TestRun"
file="$dir/MatchedTumorTest.txt"
< "$file" tr '\r' '\n' | while read tumor
do
ls "$dir/freshstart" | grep "$tumor.*\.txt$"
done
will print all .txt file names in the directory $dir/freshstart that contain a name from the file MatchedTumorTest.txt.

Related

How to add an empty line at the end of these commands?

I am in a situation where I have so many fastq files that I want to convert to fasta.
Since they belong to the same sample, I would like to merge the fasta files to get a single file.
I tried running these two commands:
sed -n '1~4s/^#/>/p;2~4p' INFILE.fastq > OUTFILE.fasta
cat infile.fq | awk '{if(NR%4==1) {printf(">%s\n",substr($0,2));} else if(NR%4==2) print;}' > file.fa
And the output file is correctly a fasta file.
However I get a problem in the next step. When I merge files with this command:
cat $1 >> final.fasta
The final file apparently looks correct. But when I run makeblastdb it gives me the following error:
FASTA-Reader: Ignoring invalid residues at position(s): On line 512: 1040-1043, 1046-1048, 1050-1051, 1053, 1055-1058, 1060-1061, 1063, 1066-1069, 1071-1076
Looking at what's on that line, I found that a file header had been appended to the end of the previous file's sequence, so it ends up like this:
GGCTTAAACAGCATT>e45dcf63-78cf-4769-96b7-bf645c130323
So how can I add a blank line to the end of the file within the scripts that convert fastq to fasta? That way, when I merge the files, the records are stacked correctly instead of a header landing at the end of the previous file's sequence.

So how can I add a blank line to the end of the file within the scripts that convert fastq to fasta?
Using GNU sed, I would replace
cat $1 >> final.fasta
using
sed '$a\\n' $1 >> final.fasta
Explanation: the sed expression means: at the last line ($), append a newline (\n); this action is carried out before the default action of printing. If you prefer GNU AWK, you can get the same behavior this way:
awk '{print}END{print ""}' $1 >> final.fasta
Note: I was unable to test either solution, as you have not provided enough information for that. I assume the line above is somewhere inside a loop and $1 is always the name of a file that exists in the current working directory.
If the only thing you need is an extra blank line, and the input files are within 1.5 GB in size, then just do:
awk NF=NF RS='^$' FS='\n' OFS='\n'
It should work for mawk 1/2, gawk, and nawk, and maybe others as well. The reason this works, despite appearing not to do anything special, is that the extra \n comes from ORS.
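Putting the pieces together, here is a minimal sketch of the whole merge step (assuming the per-sample fasta files are passed to the script as arguments, since your snippet only shows $1):
for f in "$@"; do
    awk '{print} END{print ""}' "$f"    # copy the file and add one blank line after it
done > final.fasta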

bash for with awk command inside

I have this piece of code in a bash script:
for file in "$(ls | grep .*.c)"
do
cat $file |awk '/.*open/{print $0}'|awk -v nomeprog=$file 'BEGIN{FS="(";printf "the file e %s with the open call:", nameprog}//{ print $2}'
done
This gives me this error:
awk: cmd. line:1: file.c
awk: cmd. line:1: ^ syntax error
I get this error when I have more than one .c file in the folder; with just one file it works.
Overall, you should probably follow Charles Duffy's recommendation to use more appropriate tools for the task. But I'd like to go over why the current script isn't working and how to fix it, as a learning exercise.
Also, two quick recommendations for shell script checking & troubleshooting: run your scripts through shellcheck.net to point out common mistakes, and when debugging put set -x before the problem section (and set +x after), so the shell will print out what it thinks is going on as the script runs.
The problem is due to how you're using the file variable. Let's look at what this does:
for file in "$(ls | grep .*.c)"
First, ls prints a list of files in the current directory, one per line. ls is really intended for interactive use, and its output can be ambiguous and hard to parse correctly; in a script, there are almost always better ways to get lists of filenames (and I'll show you one in a bit).
The output of ls is piped to grep .*.c, which is wrong in a number of ways. First, since that pattern contains a wildcard character ("*"), the shell will try to expand it into a list of matching filenames. If the directory contains any hidden (leading-".") .c files, the shell will replace the pattern with a list of those, and nothing is going to work right at all. Always quote the pattern argument to grep to prevent this.
But the pattern itself (".*.c") is also wrong; it searches for any number of arbitrary characters (".*"), followed by a single arbitrary character ("." -- this is in a regex, so "." is not treated literally), followed by a "c". And it searches for this anywhere in the line, so any filename that contains a "c" somewhere other than the first position will match. The pattern you want would be something like '[.]c$' (note that I wrapped it in single-quotes, so the shell won't try to treat $ as a variable reference like it would in double-quotes).
Then there's another problem, which is (part of) the problem you're actually experiencing: the output of that ls | grep is expanded in double-quotes. The double-quotes around it tell the shell not to do its usual word-split-and-wildcard-expand thing on the result. The common (but still wrong) thing to do here is to leave off the double-quotes, because word-splitting will probably break the list of filenames up into individual filenames, so you can iterate over them one-by-one. (Unless any filenames contain funny characters, in which case it can give weird results.) But with double-quotes it doesn't split them, it just treats the whole thing as one big item, so your loop runs once with file set to "src1.c\nsrc2.c\nsrc3.c" (where the \n's represent actual newlines).
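You can see the difference with a small, purely hypothetical demonstration (printf just stands in for the ls | grep pipeline here):
files=$(printf 'src1.c\nsrc2.c\nsrc3.c')
for f in "$files"; do echo "[$f]"; done    # one iteration: the whole list, newlines included
for f in $files; do echo "[$f]"; done      # three iterations, but only because word-splitting happens to be safe here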
This is the sort of trouble you can get into by parsing ls. Don't do it, just use a shell wildcard directly:
for file in *.c
This is much simpler, avoids all the confusion about regex pattern syntax vs wildcard pattern syntax, ambiguity in ls's output, etc. It's simple, clear, and it just works.
That's probably enough to get it to work for you, but there are a couple of other things you really should fix if you're doing something like this. First, you should double-quote variable references (i.e. use "$file" instead of just $file). This is another part of the error you're getting; look at the second awk command:
awk -v nomeprog=$file 'BEGIN{FS="(";printf "the file e %s with the open call:", nameprog}//{ print $2}'
With file set to "src1.c\nsrc2.c\nsrc3.c", the shell will do its word-split-and-wildcard-expand thing on it, giving:
awk -v nomeprog=src1.c src2.c src3.c 'BEGIN{FS="(";printf "the file e %s with the open call:", nameprog}//{ print $2}'
awk will thus set its nomeprog variable to "src1.c", and then try to run "src2.c" as an awk command (on input files named "src3.c" and "BEGIN{FS=..."). "src2.c" is, of course, not a valid awk command, so you get a syntax error.
This sort of confusion is typical of the chaos that can result from unquoted variable references. Double-quote your variable references.
The other thing, which is much less important, is that you have a useless use of cat. Anytime you have the pattern:
cat somefile | somecommand
(and it's just a single file, not several that need to be catenated together), you should just use:
somecommand <somefile
and in some cases like awk and grep, the command itself can take input filename(s) directly as arguments, so you can just use:
somecommand somefile
so in your case, rather than
cat "$file" | awk '/.*open/{print $0}' | awk -v nomeprog="$file" 'BEGIN{FS="(";printf "the file e %s with the open call:", nameprog}//{ print $2}'
I'd just use:
awk '/.*open/{print $0}' "$file" | awk -v nomeprog="$file" 'BEGIN{FS="(";printf "the file e %s with the open call:", nameprog}//{ print $2}'
(Although, as Charles Duffy pointed out, even that can be simplified quite a lot.)
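Putting those fixes together, a corrected version of the loop might look something like this sketch (note that the original sets nomeprog but prints nameprog; I've made the name consistent, and shopt -s nullglob is optional, making the loop simply do nothing when there are no .c files):
shopt -s nullglob
for file in *.c; do
    awk '/open/' "$file" |
        awk -v nomeprog="$file" 'BEGIN{FS="("; printf "the file %s with the open call: ", nomeprog} {print $2}'
done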

Utilising variables in tail command

I am trying to extract characters from a reference file in which their byte positions are known. To do this, I have a long list of numbers stored in a variable, which I have used as the input to a tail command.
For example, the reference file looks like:
ggaaatgcattcaaacatgc
And the list looks like:
5
10
7
15
I have tried using this code:
list=$(<pos.txt)
echo "$list"
cat ref.txt | tail -c +"list" | head -c1 > out.txt
However, it keeps returning "invalid number of bytes: '+5\n10\n7\n15...'"
My expected output would be
a
t
g
a
...
Can anybody tell me what I'm doing wrong? Thanks!
It looks like you are trying to access your list variable in your tail command. You can access it like this: $list rather than just using quotes around it.
Your logic is flawed even after fixing the variable access. The list variable includes all lines of your pos.txt file, including the newline characters (\n), which are invisible in many UIs and programs but are of course significant when you are manually reading single bytes. You need to feed the lines one by one to make it work properly.
Also, unless those numbers are indexes from the end, you need to feed them to head instead of tail.
If I understood what you are attempting to do correctly, this should work:
while read line
do
head -c $line ref.txt | tail -c 1 >> out.txt
done < pos.txt
The reason for your command failure is simple. The variable list contains a multi-line string read from the pos.txt file, newlines included. You cannot pass more than one integer value to the -c flag.
Your attempt can be fixed quite easily by removing the call to cat and reading the positions one at a time:
while IFS= read -r lineNo; do
tail -c "$lineNo" ref.txt | head -c1
done < pos.txt
But if your intention is to print each character on its own line, head does not output that way. For your given input it just forms the string atga on a single line, not one character per line.
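If you do want one character per line, a small tweak to the loop above (still just a sketch) is to print the newline yourself:
while IFS= read -r lineNo; do
    tail -c +"$lineNo" ref.txt | head -c1
    echo    # neither tail nor head emits a newline here, so print one per position
done < pos.txt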
As Gordon mentions in one of the comments, for much more efficient FASTA-file processing you could just use a single invocation of awk (skipping the repeated forks to head/tail). Your provided input does not involve any headers to skip, which makes it straightforward:
awk ' FNR==NR{ n = split($0,arr,""); for(i=1;i<=n;i++) hash[i] = arr[i] }
( $0 in hash ){ print hash[$0] } ' ref.txt pos.txt
With the sample ref.txt and pos.txt above, this prints a, t, g, a on separate lines, in the same order as the positions appear in pos.txt.
You could use cut instead of tail:
pos=$(<pos.txt)
cut -c ${pos//$'\n'/,} --output-delimiter=$'\n' ref.txt
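The ${pos//$'\n'/,} expansion replaces every newline in the position list with a comma, so for the sample pos.txt the command that actually runs is effectively (note that --output-delimiter requires GNU cut; BSD cut does not have it):
cut -c 5,10,7,15 --output-delimiter=$'\n' ref.txt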
Or just awk:
awk -F '' 'NR==FNR{c[$0];next} {for(i in c) print $i}' pos.txt ref.txt
both yield (though not necessarily in the order the positions appear in pos.txt; cut, for example, always outputs characters in file order):
a
g
t
a

How do I delete all rows with a blank space in the third column within a file?

So, I have a file which contains the results of some calculations I've run in the past weeks. I've collected the results in a file which I intend to plot. It is basically a bunch of rows with the format "x" "y" "f(x,y)", like this:
1.7 4.7 -460.5338556921
1.7 4.9 -460.5368762353
1.7 5.5
However, some lines, exemplified by the last one, contain a blank space in the 3rd column, resulting from failed calculations. I'd still like to plot the viable points, but, as there are thousands of points (and therefore rows), that task just can't be accomplished easily by hand. I'd like to know how to make a script or program (I'd prefer a shell script, but I'll gladly go along with whatever works), which identifies those lines and deletes them. Does anyone know a way to do it?
awk '$3' <filename>
or better
awk 'NF > 2' <filename> # in case an entry in column 3 happens to be zero
This will do the job!
The simplest form of grep command, which should be understood by just about any grep these days:
grep -v '^[^[:space:]]*[[:space:]]*[^[:space:]]*[[:space:]]*$' <filename>
With grep:
grep ' .* [^ ]' file
or using ERE:
grep -E '\s\S+\s\S' file
I would use:
perl -lanE 'print if @F==3 && /^[\d\s\.+-]+$/' file
This will print only lines:
which contain 3 fields
and contain only numbers, spaces, and the characters . + -
I do not know how you are going to plot. You might like a grep or awk solution so that you can pipe all valid lines into your plotting application.
When you need to call a program for each set of values, you can skip the invalid lines when you are reading the values:
while read -r x y fxy; do
if [ -n "${fxy}" ]; then
myplotter "$x" "$y" "${fxy}"
fi
done < file

Using sed to dynamically generate a file name

I have a CSV file that I'd like to split up based on a field in the file. Essentially, there can be two brands, GVA and HBVL. I'd like to split the file into a file for each brand before I import it into a database.
Sample of the CSV file
"D509379D5055821451C3695A3752DCCD",'1900-01-01 01:00:00',"M","1740","GVA",'2009-07-01 13:25:00',0
"159A58BE41012787D531C7157F688D86",'1900-01-01 00:00:00',"V","1880","GVA",'2008-06-06 11:21:00',0
"D0BB5C058794BBE4478DDA536D1E4872",'1900-01-01 00:00:00',"M","9270","GVA",'2007-09-18 13:21:00',0
"BCC7096803E5E60E05DC12FB9951E0CF",'1900-01-01 00:00:00',"M","3500","HBVL",'2007-09-18 13:21:00',1
"7F85FCE6F13775A8A3054E3438B81599",'1900-01-01 00:00:00',"M","3970","HBVL",'2007-09-18 13:20:00',0
Part of the problem is the size of the file. It's about 39mb. My original attempt at this looked like this:
while read line ; do
name=`echo $line | sed -n 's/\(.*\)"\(GVA\|HBVL\)",\(.*\)$/\2/ p' | tr [:upper:] [:lower:] `
info=`echo $line | sed -n 's/\(.*\)"\(GVA\|HBVL\)",\(.*\)$/\1\3/ p'`
echo "${info}" >> ${BASEDIR}/${today}/${name}.txt
done < ${file}
After about 2.5 hours, only about half of the file had been processed. I have another file that could potentially be up to 250 MB in size, and I can't imagine how long that would take.
What I'd like to do is pull the brand out of the line and write the line to a file named after the brand. I can remove the brand, but I don't know how to use it to create a file. I've started in sed, but I'm not above using another language if it's more appropriate.
The original while loop with multiple commands per line is DIRE!
sed -e '/"GVA"/w gva.file' -e '/"HBVL"/w hbvl.file' -n $file
The sed script says:
write lines that match the GVA tag to gva.file
write lines that match the HBVL tag to hbvl.file
and don't print anything else ('-n')
Note that different versions of sed can handle different numbers of auxiliary files. If you need more than, say, twenty output files at once, you may need to look at other technology (but test what the limit is on your machine). If the file is sorted so that all the GVA records appear together followed by all the HBVL records, you could consider using csplit. Alternatively, a scripting language like Perl could handle more. If you exceed the number of file descriptors allowed to your process, it becomes hard to do the splitting in a single pass over the data file.
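For that sorted case, a csplit sketch (assuming every GVA record comes before every HBVL record, and using the hypothetical output prefix brand_) would be:
csplit -f brand_ "$file" '/"HBVL"/'
This writes the GVA lines to brand_00 and the HBVL lines to brand_01.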
grep '"GVA"' $file >GVA.txt
grep '"HVBL"' $file >HVBL.txt
# awk -F"," '{o=$5;gsub(/\"/,"",o);print $0 > o}' OFS="," file
# more GVA
"D509379D5055821451C3695A3752DCCD",'1900-01-01 01:00:00',"M","1740","GVA",'2009-07-01 13:25:00',0
"159A58BE41012787D531C7157F688D86",'1900-01-01 00:00:00',"V","1880","GVA",'2008-06-06 11:21:00',0
"D0BB5C058794BBE4478DDA536D1E4872",'1900-01-01 00:00:00',"M","9270","GVA",'2007-09-18 13:21:00',0
# more HBVL
"BCC7096803E5E60E05DC12FB9951E0CF",'1900-01-01 00:00:00',"M","3500","HBVL",'2007-09-18 13:21:00',1
"7F85FCE6F13775A8A3054E3438B81599",'1900-01-01 00:00:00',"M","3970","HBVL",'2007-09-18 13:20:00',0
