Merging fastq files by identifiers with a shell script - bash

I have to merge files with the following naming pattern:
[SampleID]_[custom_ID01]_ID[RUN_ID]_L001_R1.fastq
[SampleID]_[custom_ID02]_ID[RUN_ID]_L002_R1.fastq
[SampleID]_[custom_ID03]_ID[RUN_ID]_L003_R1.fastq
[SampleID]_[custom_ID04]_ID[RUN_ID]_L004_R1.fastq
I need to merge all files with identical [SampleID] but different "Lanes" (L001-L004).
The following script works fine when directly run in the terminal:
custom_id="000"
RUN_ID="0025"
wd="/path/to/script/" # was missing/ incorrect
# get ALL sample identifiers
touch temp1.txt
for line in $wd/*.fastq ; do
fastq_identifier=$(echo "$line" | cut -d"_" -f1);
echo $fastq_identifier >> temp1.txt
done
# keep only the unique sample identifiers
cat temp1.txt | uniq > temp2.txt
input_var=$(cat temp2.txt)
# concatenate all fastq (different lanes) with identical identifier
for line in $input_var; do
cat $line*fastq >> $line"_"$custom_id"_ID"$Run_ID"_L001_R1.fastq"
done
rm temp1.txt temp2.txt;
But if I create a script file (concatenate_fastq.sh) and make it executable
$ chmod +x concatenate_fastq.sh
and run it
$ ./concatenate_fastq.sh
I got the following error:
concatenate_fastq.sh: line 17: /*.fastq_000_ID_L001_R1.fastq: Permission denied
Thanks to the hints below, I solved the problem by fixing
wd=/path/to/script/

The immediate problem seems to be that wd is unset. If your script genuinely contains exactly the line
wd="/path/to/script/"
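then I would suspect invisible control characters in the script file (using a Windows editor is a common way to shoot yourself in the foot). Separately, the script assigns RUN_ID but expands $Run_ID; shell variable names are case-sensitive, so the run ID expands to nothing, which is why the error message shows ID_L001 with no number in between.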
More generally, your script should cope correctly when the wildcard does not match any files. A common way to do that is to shopt -s nullglob but the subsequent script would still need adaptation then.
Refactoring the script to loop only over actual matches would help avoid trouble. Perhaps something like this:
shopt -s nullglob # bashism
printf '%s\n' "$wd"/*.fastq |
cut -d_ -f1 |
uniq |
while read -r line; do
cat "$line"*fastq >> "${line}_${custom_id}_ID${RUN_ID}_L001_R1.fastq"
done
You'll notice that this simplifies the script tremendously, and avoids the pesky temporary files.
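The same idea can be sketched without nullglob, so that it also runs in plain sh. This is only an illustration under stated assumptions: the function name merge_lanes and its separate output directory are hypothetical, and the sample ID is taken to be everything before the first underscore. Writing the merged files to a different directory keeps them from matching the input glob on a later run.

```shell
#!/bin/sh
# merge_lanes SRC_DIR OUT_DIR CUSTOM_ID RUN_ID   (hypothetical helper)
# Appends every lane file of a sample to one merged file per sample.
merge_lanes() {
    src=$1 out=$2 custom_id=$3 run_id=$4
    mkdir -p "$out" || return 1
    for f in "$src"/*_R1.fastq; do
        [ -e "$f" ] || continue      # glob matched nothing: skip the literal pattern
        name=${f##*/}                # strip the directory part
        sample=${name%%_*}           # text before the first underscore
        cat "$f" >> "$out/${sample}_${custom_id}_ID${run_id}_L001_R1.fastq"
    done
}
```

It would be called as, for example, merge_lanes /path/to/fastq /path/to/merged 000 0025. Because the function appends, the output directory should be emptied before running it a second time.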

I solved it with:
if [ $# -ne 3 ] ; then
echo -e "Usage: $0 {path_to_working_directory} {custom_ID:Z+} {run_ID:ZZZZ}\n"
exit 1
fi
wd=$1
custom_id=$2
RUN_ID=$3
cd "$wd" || exit 1 # work inside the directory that holds the fastq files
input_var=$(ls *fastq | cut --fields 1 -d "_" | uniq)
for line in $input_var; do
cat "$line"*fastq >> "${line}_${custom_id}_ID${RUN_ID}_L001_R1.fastq"
done

Related

cat multiple files in separate directories file1 file2 file3....file100 using loop in bash script

I have several files in multiple directories, such as 1/file1, 2/file2, 3/file3, ..., 100/file100. I want to cat all of those files into a single file using a loop over the index in a bash script. Is there an easy loop for doing so?
Thanks,
seq 100 | sed 's:.*:dir&/file&:' | xargs cat
seq 100 generates list of numbers from 1 to 100
sed
s substitutes
: separates parts of the command
.* the whole line
: separator. Usually / is used, but it's used in replacement string.
dir&/file& by dir<whole line>/file<whole line>
: separator
so it generates list of dir1/file1 ... dir100/file100
xargs - pass input as arguments to ...
cat - so it will execute cat dir1/file1 dir2/file2 ... dir100/file100.
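To see what the pipeline generates before handing it to cat, the xargs stage can be left off. The snippet below is a small dry run that uses 3 instead of 100 purely to keep the output short; the variable name paths is only for illustration.

```shell
#!/bin/sh
# Build the path list only; nothing is read or concatenated here.
paths=$(seq 3 | sed 's:.*:dir&/file&:')
printf '%s\n' "$paths"
# prints:
# dir1/file1
# dir2/file2
# dir3/file3
```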
This loop should do the trick:
for ((i=1; i<=100; i++)); do cat "dir${i}/file${i}"; done > output
I made an example of the directory structure and files you describe. Create the directories and files, each with its own content:
for ((i=1;i<=100;i++)); do
mkdir "$i" && touch "$i/file$i" && echo content of "$(pwd) $i" > "$i/file$i"
done
Check the created directories.
ls */*
ls */* | sort -n
If you see that the directories and files are created then proceed to the next step.
This solution does not involve any external command from the shell except of course cat :-)
Now we can print the contents of each file using bash syntax.
i=1
while [[ -e "$i" ]]; do
cat "$i"/*
((i++))
done
The same loop in POSIX syntax, tested in dash:
i=1
while [ -e "$i" ]; do
cat "$i"/*
i=$((i+1))
done
To write the result to a file, add the redirection after the done.
You can add more tests if you like; see help test.
One more thing :-), you can just check the contents using tail and brace expansion
tail -n +1 {1..100}/*
You can do the same with cat and redirect its output to a file; just remember that the {1..100} form of brace expansion is a bash 3+ feature.
cat {1..100}/*

What's wrong with this file renaming loop?

I'm trying to iterate through all the files in a directory and rename them from the prefix ABC to XYZ using the command below
while read file; do mv \"$file\" \"$(echo $file | sed -e s/ABC/XYZ/g)\" ; done < <(ls -1)
When I throw an echo in front of the mv, everything looks like it should work fine and copy/pasting the outputted command works fine but it won't execute correctly within the context of the loop giving me a usage error as if the command is malformed like below.
usage: mv [-f | -i | -n] [-v] source target
mv [-f | -i | -n] [-v] source ... directory
Even though the outputted command from the check with echo gives
mv "ABC Test1" "XYZ Test1"
which should be a valid command and works if I copy paste.
Any idea what is going on?
Replace:
while read file; do mv \"$file\" \"$(echo $file | sed -e s/ABC/XYZ/g)\" ; done < <(ls -1)
With:
for file in *
do
mv "$file" "${file//ABC/XYZ}"
done
Notes:
This is very important: Never parse ls. ls is only designed to produce human-friendly output.
To iterate over all files in a directory, use for file in *; do ...; done. This will work reliably for all manner of file names, including file names with newlines, blanks, or other difficult characters.
\" produces a literal quote character, not a syntactic one. Since we want the syntactic meaning of " here, we leave it unescaped.
There are times when one needs sed but this isn't one of them.
The shell is capable of doing simple substitutions without all the issues associated with command substitution. Thus, $(echo $file | sed -e s/ABC/XYZ/g) can be replaced with ${file//ABC/XYZ}.
The form ${var//old/new} is called pattern substitution and is documented in man bash.
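A dry run is a cheap way to check the substitution before renaming anything. The sketch below is bash-specific (pattern substitution and printf %q are not POSIX sh); the helper name plan_renames and the sample file names are made up for illustration.

```shell
#!/usr/bin/env bash
# Dry run: print each rename command instead of executing it.
plan_renames() {
    local file
    for file in "$@"; do
        # ${file//ABC/XYZ} replaces every occurrence of ABC with XYZ;
        # %q prints each name quoted so that it could be pasted back into a shell.
        printf 'mv -- %q %q\n' "$file" "${file//ABC/XYZ}"
    done
}

plan_renames "ABC Test1" "ABC_ABC.txt"
```

Once the printed commands look right, the printf line can be swapped for the real mv "$file" "${file//ABC/XYZ}".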
Very stupid mistake. There was no need to escape the quotes in the mv command. Taking those out makes it work as expected. Escaping the quotes shows the correct output with echo but does not give intended behavior.
while read file; do mv "$file" "$(echo $file | sed -e s/ABC/XYZ/g)" ; done < <(ls -1)

bash replace string with floating variable

I'm trying to replace a number in a file with a calculated floating-point value in a bash script. Specifically, I want to replace 1.1111 with the value of "km" and save the result in the mesh.in file. I keep getting an error on the sed line, and I think there may be an issue with the floating-point variable. echo "$km" does work, so I know that km itself is not the issue.
#!/bin/bash
read -p "Angle in degrees : " n1
read -p "bcsa : " n2
cd viv_example_se2d
sed s/^bcsa.\*/"bcsa $n2"/ runfile.viv >temp
mv -f temp runfile.viv
cd ../
for i in $(seq 2 0.5 12)
do
if [ ! -d U*_$i ];then
mkdir U*_$i
fi
printf -v "km" "%.4f\n" $(echo | bc | awk "BEGIN {print 4*3.14159265359*3.14159265359/($i*$i)}")
echo "$km"
cd viv_example_se2d
sed s/1.1111/$km/g mesh_master.in > temp$i
mv -f temp$i mesh.in
cd ../
echo $home/lustre/projects/p057_swin/ogoldman/Ellipse_$n1/U*_$i | xargs -n 1 cp viv_example_se2d/*
done;
The problem is the newline in the value of $km. It is confusing sed.
That being said this script is also a bit of a mess.
You should quote your variables when you use them to protect against problems with whitespace and glob characters in the values.
You don't need xargs to cp multiple files that you can expand via a glob. cp will happily take multiple files to copy directly. (Oh, or is that copying multiple files to directories produced via that glob?)
You have a useless echo | bc | bit near the awk command.
Using full/relative paths in sed/etc. is better than cding around generally.
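As a rough sketch of the first point, the value can be computed by awk alone and captured with command substitution, which strips the trailing newline. The template line below is a made-up stand-in for the contents of mesh_master.in, and i is fixed to one value for illustration.

```shell
#!/bin/sh
# km = 4*pi^2 / i^2, formatted to four decimals without a trailing newline.
i=2
km=$(awk -v i="$i" 'BEGIN { printf "%.4f", 4 * 3.14159265359 * 3.14159265359 / (i * i) }')

# Hypothetical one-line template standing in for mesh_master.in.
template='wavenumber 1.1111'

# The dot is escaped so that it matches a literal "." and not any character.
result=$(printf '%s\n' "$template" | sed "s/1\.1111/$km/g")
printf '%s\n' "$result"
```

Passing i to awk with -v also avoids splicing shell variables into the awk program text.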

How could I append '\' in front of the space within a file name?

I was working on a program that could transfer files using sftp program:
sftp -oBatchMode=no -b ${BATCH_FILE} user#$123.123.123.123:/home << EOF
bye
EOF
One of my requirements is that I must use a BATCH_FILE with sftp. The batch file is generated by the following script:
files=$(ls -1 ${SRC_PATH}/*.txt)
echo "$files" > ${TEMP_FILE}
while read file
do
if [ -s "${file}" ]
then
echo ${file} >> "${PARSE_FILE}" ## line 1
fi
done < ${TEMP_FILE}
awk '$0="put "$0' ${PARSE_FILE} > ${BATCH_FILE}
Somehow my program isn't able to handle file names with spaces in them. I tried replacing line 1 with the following code, but it failed; the output shows filename\.txt.
newfile=`echo $file | tr ' ' '\\ '`
echo ${newfile} >> "${PARSE_FILE}"
In order to handle file names with spaces, how can I insert a \ in front of each space within a file name?
THE PROBLEM
The problem is that tr SET1 SET2 will replace the Nth character in SET1 with the Nth character in SET2, which means that you are effectively replacing every space by \, instead of adding a backslash before every space.
PROPOSED SOLUTION
Instead of trying to escape the spaces manually, wrap every use of a variable that might contain spaces in double quotes and let the shell handle it for you.
See the below example:
$ echo $FILENAME
file with spaces.txt
$ ls $FILENAME
ls: cannot access file: No such file or directory
ls: cannot access with: No such file or directory
ls: cannot access spaces.txt: No such file or directory
$ ls "$FILENAME"
file with spaces.txt
But I really wanna replace stuff..
Well, if you really want a command to change every ' ' (space) into '\ ' (backslash, space) you could use sed with a basic replace-pattern, as the below:
$ echo "file with spaces.txt" | sed 's, ,\\ ,g'
file\ with\ spaces.txt
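The difference between the two tools can be checked side by side. This is just a small demonstration; the variable names with_tr and with_sed are chosen for illustration.

```shell
#!/bin/sh
name='file with spaces.txt'

# tr maps characters one-to-one, so each space BECOMES a backslash.
with_tr=$(printf '%s\n' "$name" | tr ' ' '\\')

# sed can replace one character with two: a backslash followed by a space.
with_sed=$(printf '%s\n' "$name" | sed 's/ /\\ /g')

printf '%s\n%s\n' "$with_tr" "$with_sed"
# prints:
# file\with\spaces.txt
# file\ with\ spaces.txt
```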
I haven't looked too closely at what you're trying to do there, but I do know that bash can handle filenames with spaces in them if you double-quote them. Why not try quoting every filename variable and see if that works? You're quoting some of them but not all yet.
Like try these: "${newfile}" or just "$newfile" "$file" "$tempfile" etc...
You can further simplify your code if you're using Bash:
function generate_batch_file {
for FILE in "${SRC_PATH}"/*.txt; do
[[ -s $FILE ]] && echo "put ${FILE// /\\ }"
done
}
sftp -oBatchMode=no -b <(generate_batch_file) user#$123.123.123.123:/home <<< "bye"
Alternatively, you can rename the file to a name without spaces, transfer it, and rename it back once the transfer is done.

Trying to write a script to clean <script.aa=([].slice+'hjkbghkj') from multiple htm files, recursively

I am trying to modify a bash script to remove a glob of malicious code from a large number of files.
The community will benefit from this, so here it is:
#!/bin/bash
grep -r -l 'var createDocumentFragm' /home/user/Desktop/infected_site/* > /home/user/Desktop/filelist.txt
for i in $(cat /home/user/Desktop/filelist.txt)
do
cp -f $i $i.bak
done
for i in $(cat /home/user/Desktop/filelist.txt)
do
$i | sed 's/createDocumentFragm.*//g' > $i.awk
awk '/<\/SCRIPT>/{p=1;print}/<\/script>/{p=0}!p'
This is where the script bombs out with this message:
+ for i in '$(cat /home/user/Desktop/filelist.txt)'
+ sed 's/createDocumentFragm.*//g'
+ /home/user/Desktop/infected_site/index.htm
I get 2 errors and the script stops.
/home/user/Desktop/infected_site/index.htm: line 1: syntax error near unexpected token `<'
/home/user/Desktop/infected_site/index.htm: line 1: `<html><head><script>(function (){ '
I have the first 2 parts done.
The files containing createDocumentfragm have been enumerated in a text file correctly.
The files listed in filelist.txt have been duplicated in their original location with .bak appended, i.e. infected_site/some_directory/infected_file.htm and infected_file.htm.bak, effectively making sure we have a backup.
All I need to do now is write an AWK command that will use the list of files in filelist.txt, use the entire glob of malicious text as a pattern, and remove it from the files. Using just the uppercase script as the starting point, and the lower case script is too generic and could delete legitimate text
I suspect this may help me, but I don't know how to use it correctly.
http://backreference.org/2010/03/13/safely-escape-variables-in-awk/
Once I have this part figured out, and after you have verified that the files weren't mangled you can do this to clean out the bak files:
for i in $(cat /home/user/Desktop/filelist.txt)
do
rm -f $i.bak
done
Several things:
You have:
$i | sed 's/var createDocumentFragm.*//g' > $i.awk
You probably meant this (keeping your use of cat, which we'll talk about in a moment):
cat $i | sed 's/var createDocumentFragm.*//g' > $i.awk
You're treating each file in your file list as if it was a command and not a file.
Now, about your use of cat. If you're using cat for almost anything but concatenating multiple files together, you probably are doing something not quite right. For example, you could have done this:
sed 's/var createDocumentFragm.*//g' "$i" > "$i.awk"
Note that I don't have to print out my file to STDOUT and then pipe that into sed. The sed command can take the file name directly.
I'm also a bit confused about the awk statement. Exactly what file are you using awk on? Your awk statement names no input file, so it reads STDIN and prints its output on the screen. Is the sed statement supposed to feed into the awk statement?
You also want to avoid for loops over a list of files. That is very inefficient, and can cause problems with the command line getting overloaded. Not a big issue today, but can affect you when you least suspect it. What happens is that your $(cat /home/user/Desktop/filelist.txt) must execute first before the for loop can even start.
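A further practical difference shows up with file names that contain spaces: for i in $(cat ...) splits the list on every blank, while read takes one line at a time. The two-entry list below is made up purely to show the effect.

```shell
#!/bin/sh
list=$(mktemp)
printf '%s\n' 'site/a b.htm' 'site/c.htm' > "$list"

# Word splitting turns the first line into two separate items.
for_count=0
for f in $(cat "$list"); do for_count=$((for_count + 1)); done

# read -r with an empty IFS keeps each line intact.
read_count=0
while IFS= read -r f; do read_count=$((read_count + 1)); done < "$list"

printf 'for: %s items, while read: %s items\n' "$for_count" "$read_count"
# prints: for: 3 items, while read: 2 items
rm -f "$list"
```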
A little rewriting of your program:
cd ~/Desktop
grep -r -l 'var createDocumentFragm' infected_site/* > filelist.txt
while read file
do
cp -f "$file" "$file.bak"
sed 's/var createDocumentFragm.*//g' "$file" > "$file.awk"
awk '/<\/SCRIPT>/{p=1;print}/<\/script>/{p=0}!p' # no input file is named, so awk reads standard input
done < filelist.txt
We can use one loop, and we made it a while loop. I could even feed the grep into that while loop:
grep -r -l 'var createDocumentFragm' infected_site/* | while read file
do
cp -f "$file" "$file.bak"
sed 's/var createDocumentFragm.*//g' "$file" > "$file.awk"
awk '/<\/SCRIPT>/{p=1;print}/<\/script>/{p=0}!p'
done
and then I don't even have to create a temporary file.
Let me know what's going on with the awk. I suspect you wanted something like this:
grep -r -l 'var createDocumentFragm' infected_site/* | while read file
do
cp -f "$file" "$file.bak"
sed 's/var createDocumentFragm.*//g' "$file" \
| awk '/<\/SCRIPT>/{p=1;print}/<\/script>/{p=0}!p' > "$file.awk"
done
Also note I put quotes around file names. This helps prevent problems if file name has a space in it.
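Putting the backup and the sed step together, a minimal sketch might look like the following. It deliberately leaves out the awk stage, since its intent is unclear. The helper name clean_tree is hypothetical, and the pattern is assumed to contain no slashes or regex metacharacters because it is used as-is in both grep and sed. File names containing newlines are not handled.

```shell
#!/bin/sh
# clean_tree DIR PATTERN   (hypothetical helper)
# Backs up every file under DIR that contains PATTERN, then removes the text
# from PATTERN to the end of the line in the original file.
clean_tree() {
    dir=$1 pattern=$2
    # Backups are excluded so that files created while grep is still
    # walking the tree are not picked up again.
    grep -r -l --exclude='*.bak' -- "$pattern" "$dir" |
    while IFS= read -r file; do
        cp -f -- "$file" "$file.bak" || continue
        # Write to a temporary file first, then replace the original.
        sed "s/$pattern.*//" "$file.bak" > "$file.tmp" &&
            mv -f -- "$file.tmp" "$file"
    done
}
```

A call such as clean_tree infected_site 'var createDocumentFragm' would leave a .bak copy next to each modified file, which can be compared with diff before the backups are deleted.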

Resources