Bash script to concatenate text files with specific substrings in filenames - bash

Within a certain directory I have many directories, each containing a bunch of text files. I’m trying to write a script that, in each directory, concatenates only the files with the string ‘R1’ in their filename into one file within that directory, and those with ‘R2’ into another. This is what I wrote, but it’s not working.
#!/bin/bash
for f in */*.fastq; do
if grep 'R1' $f ; then
cat "$f" >> R1.fastq
fi
if grep 'R2' $f ; then
cat "$f" >> R2.fastq
fi
done
I get no errors and the files are created as intended but they are empty files. Can anyone tell me what I’m doing wrong?
Thank you all for the fast and detailed responses! I think I wasn't very clear in my question, but I need the script to only concatenate the files within each specific directory so that each directory has a new file ( R1 and R2). I tried doing
cat /*R1*.fastq >*/R1.fastq
but it gave me an ambiguous redirect error. I also tried Charles Duffy's for loop but looping through the directories and doing a nested loop to run though each file within a directory like so
for f in */; do
for d in "$f"/*.fastq;do
case "$d" in
*R1*) cat "$d" >&3
*R2*) cat "$d" >&4
esac
done 3>R1.fastq 4>R2.fastq
done
but it was giving an unexpected token error regarding ')'.
Sorry in advance if I'm missing something elementary, I'm still very new to bash.

A Note To The Reader
Please review edit history on the question in considering this answer; several parts have been made less relevant by question edits.
One cat Per Output File
For the purpose at hand, you can probably just let shell globbing do all the work (if R1 or R2 will be in the filenames, as opposed to the directory names):
set -x # log what's happening!
cat */*R1*.fastq >R1.fastq
cat */*R2*.fastq >R2.fastq
One find Per Output File
If it's a really large number of files, by contrast, you might need find:
find . -mindepth 2 -maxdepth 2 -type f -name '*R1*.fastq' -exec cat '{}' + >R1.fastq
find . -mindepth 2 -maxdepth 2 -type f -name '*R2*.fastq' -exec cat '{}' + >R2.fastq
...this is because of the OS-dependent limit on command-line length; the find command given above will put as many arguments onto each cat command as possible for efficiency, but will still split them up into multiple invocations where otherwise the limit would be exceeded.
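To watch the batching happen, a minimal sketch (the scratch directory and file names are invented for the demonstration) can swap cat for an inline shell that reports how many file names each invocation received:

```bash
#!/usr/bin/env bash
# Sketch: show how "find -exec ... {} +" batches arguments.
# The scratch tree and file names below are invented for the demo.
demo=$(mktemp -d)
mkdir "$demo/sampleA" "$demo/sampleB"
printf 'a1\n' >"$demo/sampleA/x_R1_001.fastq"
printf 'b1\n' >"$demo/sampleB/y_R1_001.fastq"
printf 'a2\n' >"$demo/sampleA/x_R2_001.fastq"

# The OS limit (in bytes) that makes batching necessary.
getconf ARG_MAX

# Each run of the inline shell prints how many file names it was handed.
batches=$(cd "$demo" && find . -mindepth 2 -maxdepth 2 -type f -name '*R1*.fastq' \
    -exec sh -c 'echo "one invocation with $# file(s)"' sh {} +)
echo "$batches"

# The same traversal with cat concatenates both R1 files into one output.
(cd "$demo" && find . -mindepth 2 -maxdepth 2 -type f -name '*R1*.fastq' \
    -exec cat {} + >R1.fastq)
r1_lines=$(wc -l <"$demo/R1.fastq")
echo "R1.fastq has $r1_lines lines"
```

With only two matching files everything fits in a single invocation; with enough files to exceed the limit, find would print one such line per batch.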
Iterate-And-Test
If you really do want to iterate over everything, and then test the names, consider a case statement for the job, which is much more efficient than using grep to check just one line:
for f in */*.fastq; do
case $f in
*R1*) cat "$f" >&3 ;;
*R2*) cat "$f" >&4 ;;
esac
done 3>R1.fastq 4>R2.fastq
Note the use of file descriptors 3 and 4 to write to R1.fastq and R2.fastq respectively -- that way we're only opening the output files once (and thus truncating them exactly once) when the for loop starts, and reusing those file descriptors rather than re-opening the output files at the beginning of each cat. (That said, running cat once per file -- which find -exec {} + avoids -- is probably more overhead on balance).
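As a self-contained sketch of that descriptor technique (the directory and file names are made up, and everything happens in a scratch directory), the loop can be run end to end like this:

```bash
#!/usr/bin/env bash
# Sketch: open each output once for the whole loop via descriptors 3 and 4.
# Directory and file names are invented for the demo.
demo=$(mktemp -d)
cd "$demo" || exit 1
mkdir s1 s2
printf 'one\n'   >s1/a_R1.fastq
printf 'two\n'   >s1/a_R2.fastq
printf 'three\n' >s2/b_R1.fastq

for f in */*.fastq; do
    case $f in
        *R1*) cat "$f" >&3 ;;   # write to whatever fd 3 points at
        *R2*) cat "$f" >&4 ;;   # likewise for fd 4
    esac
done 3>R1.fastq 4>R2.fastq      # both files are opened and truncated once, here

r1=$(cat R1.fastq)
r2=$(cat R2.fastq)
printf 'R1.fastq:\n%s\nR2.fastq:\n%s\n' "$r1" "$r2"
```

Each case branch ends with ;;, which bash requires between branches; leaving it out produces a syntax error near the next pattern's closing parenthesis.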
Operating Per-Directory
All of the above can be updated to work on a per-directory basis quite trivially. For example:
for d in */; do
find "$d" -name R1.fastq -prune -o -name '*R1*.fastq' -exec cat '{}' + >"$d/R1.fastq"
find "$d" -name R2.fastq -prune -o -name '*R2*.fastq' -exec cat '{}' + >"$d/R2.fastq"
done
There are only two significant changes:
We're no longer specifying -mindepth, which previously ensured that our input files came only from subdirectories.
We're excluding R1.fastq and R2.fastq from our input files, so we never try to use the same file as both input and output. This is a consequence of the prior change: Previously, our output files couldn't be considered as input because they didn't meet the minimum depth.
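If the reads sit directly inside each directory (an assumption; the find version also descends into deeper levels), the same per-directory idea can be sketched with plain globs. The sample names below are invented, and unlike the find version this sketch does not create an empty output when a directory has no matching inputs:

```bash
#!/usr/bin/env bash
# Sketch: per-directory concatenation with plain globs instead of find.
# Assumes a flat layout: reads live directly inside each sample directory.
shopt -s nullglob               # an unmatched glob expands to nothing
demo=$(mktemp -d)
cd "$demo" || exit 1
mkdir s1 s2
printf 'one\n' >s1/a_R1_001.fastq
printf 'two\n' >s1/a_R1_002.fastq
printf 'six\n' >s2/b_R2_001.fastq

for d in */; do
    for tag in R1 R2; do
        out=${d}${tag}.fastq
        inputs=()
        for f in "$d"*"$tag"*.fastq; do
            # Skip the output of a previous run so it is never its own input.
            [[ $f == "$out" ]] || inputs+=("$f")
        done
        if (( ${#inputs[@]} > 0 )); then
            cat "${inputs[@]}" >"$out"
        fi
    done
done

ls */R[12].fastq
```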

Your grep is searching the file contents instead of file name. You could rewrite it this way:
for f in */*.fastq; do
[[ -f $f ]] || continue
if [[ $f = *R1* ]]; then
cat "$f" >> R1.fastq
elif [[ $f = *R2* ]]; then
cat "$f" >> R2.fastq
fi
done
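The difference between the two tests can be seen in a small sketch (the file name and contents are hypothetical): grep reads what is inside the file, while the [[ ... ]] pattern match looks only at the name.

```bash
#!/usr/bin/env bash
# Sketch: grep PATTERN FILE inspects the file's contents, not its name.
demo=$(mktemp -d)
f=$demo/sample_R1.fastq          # hypothetical file name
printf 'ACGT\n' >"$f"            # contents never mention "R1"

if grep -q 'R1' "$f"; then content_match=yes; else content_match=no; fi
if [[ $f == *R1* ]]; then name_match=yes; else name_match=no; fi

echo "contents contain R1: $content_match"
echo "name contains R1:    $name_match"
```

This is why the original loop created empty outputs: no file's contents matched, so neither cat ever ran on real data.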

A find in a for loop might suit this:
for i in R1 R2
do
find . -type f -name "*${i}*.fastq" -exec cat '{}' + >"$i.txt"
done

Related

How to use bash string formatting to reverse date format?

I have a lot of files that are named as: MM-DD-YYYY.pdf. I want to rename them as YYYY-MM-DD.pdf I’m sure there is some bash magic to do this. What is it?
For files in the current directory:
for name in ./??-??-????.pdf; do
if [[ "$name" =~ (.*)/([0-9]{2})-([0-9]{2})-([0-9]{4})\.pdf ]]; then
echo mv "$name" "${BASH_REMATCH[1]}/${BASH_REMATCH[4]}-${BASH_REMATCH[2]}-${BASH_REMATCH[3]}.pdf"
fi
done
Recursively, in or under the current directory:
find . -type f -name '??-??-????.pdf' -exec bash -c '
for name do
if [[ "$name" =~ (.*)/([0-9]{2})-([0-9]{2})-([0-9]{4})\.pdf ]]; then
echo mv "$name" "${BASH_REMATCH[1]}/${BASH_REMATCH[4]}-${BASH_REMATCH[2]}-${BASH_REMATCH[3]}.pdf"
fi
done' bash {} +
Enabling the globstar shell option in bash lets us do the following (will also, like the above solution, handle all files in or below the current directory):
shopt -s globstar
for name in **/??-??-????.pdf; do
if [[ "$name" =~ (.*)/([0-9]{2})-([0-9]{2})-([0-9]{4})\.pdf ]]; then
echo mv "$name" "${BASH_REMATCH[1]}/${BASH_REMATCH[4]}-${BASH_REMATCH[2]}-${BASH_REMATCH[3]}.pdf"
fi
done
All three of these solutions use a regular expression to pick out the relevant parts of the filenames, and then rearrange these parts into the new name. The only difference between them is how the list of pathnames is generated.
The code prefixes mv with echo for safety. To actually rename files, remove the echo (but run at least once with echo to see that it does what you want).
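To try the regular-expression step without touching any files, here is a minimal sketch. new_name is a hypothetical helper, not part of the answer; it assumes names of the form MM-DD-YYYY.pdf as in the question and only prints the YYYY-MM-DD.pdf name:

```bash
#!/usr/bin/env bash
# Sketch: the regex step on its own, as a function that only prints the new name.
# new_name is a hypothetical helper introduced for this illustration.
new_name() {
    local re='^(.*/)?([0-9]{2})-([0-9]{2})-([0-9]{4})\.pdf$'
    if [[ $1 =~ $re ]]; then
        # Groups: 1 = directory prefix (may be empty), 2 = MM, 3 = DD, 4 = YYYY
        printf '%s%s-%s-%s.pdf\n' "${BASH_REMATCH[1]}" \
            "${BASH_REMATCH[4]}" "${BASH_REMATCH[2]}" "${BASH_REMATCH[3]}"
    else
        return 1
    fi
}

new_name ./03-15-2021.pdf
new_name archive/2019/12-31-2019.pdf
new_name notes.pdf || echo "notes.pdf: not a MM-DD-YYYY name"
```

Keeping the pattern in a variable and expanding it unquoted on the right of =~ avoids quoting surprises inside [[ ... ]].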
A direct approach example from the command line:
$ ls
10-01-2018.pdf 11-01-2018.pdf 12-01-2018.pdf
$ ls [0-9]*-[0-9]*-[0-9]*.pdf|sed -r 'p;s/([0-9]{2})-([0-9]{2})-([0-9]{4})/\3-\1-\2/'|xargs -n2 mv
$ ls
2018-10-01.pdf 2018-11-01.pdf 2018-12-01.pdf
The ls output is piped to sed, where the p command prints each line unmodified (in other words, the original name of the file) and the s command performs and outputs the conversion.
The ls + sed result is a combined output that consists of alternating old_file_name and new_file_name lines.
Finally, we pipe the resulting feed through xargs to perform the actual rename of the files.
From xargs man:
-n number Execute command using as many standard input arguments as possible, up to number arguments maximum.
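A cautious way to check the pairing before renaming anything is sketched below. The names are fed in with printf rather than ls, echo stands in for the real mv, and sed -E is used as the portable spelling of -r; like the pipeline above, it assumes file names without whitespace.

```bash
#!/usr/bin/env bash
# Sketch: inspect the old/new pairs before letting xargs run mv.
# The three names are fed with printf, so no real files are needed.
pairs=$(printf '%s\n' 10-01-2018.pdf 11-01-2018.pdf 12-01-2018.pdf |
    sed -E 'p;s/([0-9]{2})-([0-9]{2})-([0-9]{4})/\3-\1-\2/')
echo "$pairs"

# xargs -n2 groups that stream two at a time; echo stands in for the real mv.
commands=$(echo "$pairs" | xargs -n2 echo mv)
echo "$commands"
```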
You can use the following command, which is very close to klashxx's:
for f in *.pdf; do echo "$f"; mv "$f" "$(echo "$f" | sed 's#\(..\)-\(..\)-\(....\)#\3-\1-\2#')"; done
before:
ls *.pdf
12-01-1998.pdf 12-03-2018.pdf
after:
ls *.pdf
1998-12-01.pdf 2018-12-03.pdf
If the folder also contains other PDF files that do not follow this format, you can select only the files named MM-DD-YYYY.pdf with the following command:
for f in `find . -maxdepth 1 -type f -regextype sed -regex '\./[0-9]\{2\}-[0-9]\{2\}-[0-9]\{4\}\.pdf' | xargs -n1 basename`; do echo "$f"; mv "$f" "$(echo "$f" | sed 's#\(..\)-\(..\)-\(....\)#\3-\1-\2#')"; done
Explanations:
find . -maxdepth 1 -type f -regextype sed -regex '\./[0-9]\{2\}-[0-9]\{2\}-[0-9]\{4\}\.pdf' looks only for regular files in the current working directory whose names follow your format; xargs -n1 basename then strips the leading ./ from each result. Folders and other file types with a matching name are not taken into account, and other *.pdf files are ignored.
For each file, mv is run with a new file name computed by sed, using back-references to the three groups for MM, DD and YYYY.
For these simple filenames, using a more verbose pattern, you can simplify the body of the loop a bit:
twodigit=[[:digit:]][[:digit:]]
fourdigit="$twodigit$twodigit"
for f in $twodigit-$twodigit-$fourdigit.pdf; do
IFS=- read month day year <<< "${f%.pdf}"
mv "$f" "$year-$month-$day.pdf"
done
This is basically @Kusalananda's answer, but without the verbosity of regular-expression matching.

Searching for hundreds of files on a server

I have a list of 577 image files that I need to search for on a large server. I am no expert when it comes to bash so the best I could do myself was 577 lines of this:
find /source/directory -type f -iname "alternate1_1052956.tif" -exec cp {} /dest/directory \;
...repeating this line for each file name. It works... but it's unbelievably slow because it searches the entire server for one file and then moves on to the next line, but each search could take 20 minutes. I left this overnight and it only found 29 of them by the morning which is just way too slow. It could take two weeks at that rate to find all of these.
I've tried separating each line with -o as an OR separator in the hopes that it would search once for 577 files but I can't get it to work.
Does anyone have any suggestions? I also tried using the .txt file I have of the file names as a basis for the search but couldn't get that to work either. Unfortunately I don't have the paths for these files, only the basenames.
If you want to copy all .tif files
find /source/directory -type f -name "*.tif" -exec cp {} /dest/directory \;
On macOS, use the mdfind command, which looks the filename up in the Spotlight index. This is very fast because it is only an index lookup, just like the locate command on Linux:
cp $(mdfind alternate1_1052956.tif) /dest/directory
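If you have all the filenames in a file (one per line), use xargs. The mdfind call has to run inside a child shell so that it is evaluated once per name, rather than once, up front, by the calling shell:
xargs -I {} sh -c 'cp $(mdfind "$1") /dest/directory' _ {} < file_with_list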
Create a file with all the filenames, then write a loop that runs through that file and executes each find in the background.
Note that this starts all the searches simultaneously, so it will take a lot of memory. Make sure you have enough memory for this.
while read -r line; do
find /source/directory -type f -iname "$line" -exec cp {} /dest/directory \; &
done < input.file
wait
There are a few assumptions made in this answer: you have a list of all 577 file names (let's call it inputfile.list), and there are no whitespaces in the file names. The following may work:
$ cat findcopy.sh
#!/bin/bash
cmd=$(
echo -n 'find /path/to/directory -type f '
readarray -t filearr < inputfile.list # Read the list to an array
n=0
for f in "${filearr[@]}" # Loop over the array and print -iname
do
(( n > 0 )) && echo "-o -iname ${f}" || echo "-iname ${f}"
((n++))
done
echo -n ' | xargs -I {} cp {} /path/to/destination/'
)
eval $cmd
execute: ./findcopy.sh
Note for macOS: its default bash (3.2) doesn't have readarray. Instead, use any other simple method to feed the list into an array, for example,
filearr=($(cat inputfile.list))
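The same single-pass idea can also be sketched without eval by collecting the -iname tests in a bash array, which copes with whitespace in names and does not need readarray. The src, dest and inputfile.list locations below are stand-ins created in a scratch directory; note that each listed name is still treated by find as a glob pattern.

```bash
#!/usr/bin/env bash
# Sketch: build the -iname tests in an array and run find once, without eval.
# src, dest and inputfile.list are stand-ins created in a scratch directory.
work=$(mktemp -d)
src=$work/source
dest=$work/dest
mkdir -p "$src/deep/er" "$dest"
touch "$src/deep/alternate1_1.tif" "$src/deep/er/ALTERNATE1_2.TIF" "$src/other.tif"
printf '%s\n' alternate1_1.tif alternate1_2.tif >"$work/inputfile.list"

tests=()
while IFS= read -r name; do
    [[ -n $name ]] || continue                  # skip blank lines
    if (( ${#tests[@]} > 0 )); then
        tests+=(-o)                             # OR between successive names
    fi
    tests+=(-iname "$name")
done <"$work/inputfile.list"

(( ${#tests[@]} > 0 )) || { echo "no names given" >&2; exit 1; }

# The parentheses keep -type f applied to every alternative.
find "$src" -type f \( "${tests[@]}" \) -exec cp {} "$dest" \;

copied=$(( $(find "$dest" -type f | wc -l) ))
echo "copied $copied file(s)"
```

The tree is walked once, however many names are in the list, which is the point of joining the tests with -o.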

Find pipe to multiple commands (grep and file)

Here is my problem: I am trying to parse a lot of files on a system to find some tokens. My tokens are stored in a file, one token per line (for example token.txt). The paths to parse are stored in another file, one path per line (for example path.txt).
I use a combination of find and grep to do my stuff. Here is one attempt:
for path in $(cat path.txt)
do
for line in $(find $path -type f -print0 | xargs -0 grep -anf token.txt 2>/dev/null);
do
#some stuffs here
done
done
It seems to work fine, though I don't really know if there is another way to make it faster (I am a beginner in programming and the shell).
My problem is : For each file found by the find command, I want to get all the files that are compressed. For this, I wanted to use the file command. The problem is that I need the output of the find command for both grep and file.
What is the best way to achieve this ? To summarize my problem, I would like something like this :
for path in $(cat path.txt)
do
for line in $(find $path -type f);
do
#Use file command to test all the files, and do some stuff
#Use grep to find some tokens in all the files, and do some stuff
done
done
I don't know if my explanations are clear, I tried my best.
EDIT: I read that using a for loop to read a file is bad, but some people claim that a while read loop is also bad. I am a bit lost, to be honest; I can't really find the proper way to do this.
The way you are doing it is fine, but here is another way to do it. With this method you won't have to add additional loops to iterate over each item in your configuration files. There are ways to simplify this further, but it would not be as readable.
To test this:
In "${DIR}/path" I have two directories listed (one on each line). Both directories are contained in the same parent directory as this script. In the "${DIR}/token" file, I have three tokens (one on each line) to search for.
#!/usr/bin/env bash
#
# Directory where this script is located
#
DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )"
#
# Loop through each file contained in our path list
#
for f in $(find $(cat "${DIR}/path") -type f); do
for c in $(grep -f "${DIR}/token" "${f}"); do
echo "${f}"
echo "${c}"
# Do your file command here
done
done
I think you need something like this:
find $(cat places.txt) -type f -exec bash -c 'file "$1" | grep -q compressed && echo "$1 looks compressed"' _ {} \;
Sample Output
/Users/mark/tmp/a.tgz looks compressed
This script is looking in all the places listed in places.txt and running a new bash shell for each file it finds. Inside the bash shell it is testing if the file is compressed and echoing a message if it is - I guess you will do something else but you don't say what.
Another way of writing that more verbosely if you have lots to do:
#!/bin/bash
while read -r d; do
find "$d" -type f -exec bash -c '
file "$1" | grep -q "compressed"
if [ $? -eq 0 ]; then
echo "$1" is compressed
else
echo "$1" is not compressed
fi' _ {} \;
done < places.txt
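For the original goal of giving each found file to both file and grep, one possible shape is sketched below. path.txt and token.txt follow the question, the scratch tree and its contents are invented, and the printf lines are placeholders for whatever processing is actually needed:

```bash
#!/usr/bin/env bash
# Sketch: walk each tree once and hand every file to both `file` and `grep`.
# path.txt and token.txt follow the question; the scratch tree is invented.
work=$(mktemp -d)
mkdir "$work/tree"
printf 'nothing here\nsecret-token inside\n' >"$work/tree/plain.txt"
printf 'boring\n' >"$work/tree/other.txt"
printf '%s\n' "$work/tree" >"$work/path.txt"
printf '%s\n' secret-token >"$work/token.txt"

report=$(
    while IFS= read -r dir; do
        # -print0 with read -d '' keeps names containing spaces or newlines intact.
        while IFS= read -r -d '' f; do
            if command -v file >/dev/null && file -b "$f" | grep -q compressed; then
                printf 'compressed: %s\n' "$f"
            fi
            # -a: treat binary as text, -n: line numbers, -f: patterns from a file
            grep -an -f "$work/token.txt" -- "$f" |
                while IFS= read -r hit; do
                    printf 'token: %s:%s\n' "${f##*/}" "$hit"
                done
        done < <(find "$dir" -type f -print0)
    done <"$work/path.txt"
)
echo "$report"
```

Reading lines with while IFS= read -r avoids the word splitting that for x in $(...) performs, which is the usual reason for loops over command output are discouraged.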

Globbing for only files in Bash

I'm having a bit of trouble with globs in Bash. For example:
echo *
This prints out all of the files and folders in the current directory.
e.g. (file1 file2 folder1 folder2)
echo */
This prints out all of the folders with a / after the name.
e.g. (folder1/ folder2/)
How can I glob for just the files?
e.g. (file1 file2)
I know it could be done by parsing ls but also know that it is a bad idea. I tried using extended globbing but couldn't get that to work either.
Without using any external utility, you can try a for loop with glob support:
for i in *; do [ -f "$i" ] && echo "$i"; done
I don't know if you can solve this with globbing, but you can certainly solve it with find:
find . -maxdepth 1 -type f
You can do what you want in bash like this:
shopt -s extglob
echo !(*/)
But note that what this actually does is match "not directory-likes."
It will still match dangling symlinks, symlinks pointing to not-directories, device nodes, fifos, etc.
It won't match symlinks pointing to directories, though.
If you want to iterate over normal files and nothing more, use find -maxdepth 1 -type f.
The safe and robust way to use it goes like this:
find . -maxdepth 1 -type f -print0 | while IFS= read -r -d '' file; do
printf "%s\n" "$file"
done
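When the list of files is needed later in the script, a small sketch that stays within bash collects the regular files from a glob into an array (the scratch directory and entry names are invented):

```bash
#!/usr/bin/env bash
# Sketch: collect only regular files from a glob into an array.
shopt -s nullglob                 # empty directory gives an empty list, not a literal '*'
demo=$(mktemp -d)
cd "$demo" || exit 1
touch file1 'file 2'
mkdir folder1 folder2

files=()
for entry in *; do
    # -f follows symlinks; also test [[ ! -L $entry ]] to exclude them.
    if [[ -f $entry ]]; then
        files+=("$entry")
    fi
done

printf '%d regular file(s):\n' "${#files[@]}"
printf '  %s\n' "${files[@]}"
```

Because the array lives in the current shell, it avoids the subshell that a find | while read pipeline introduces.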
My go-to in this scenario is the find command. I just had to use it to find and replace dozens of instances in a given directory. I'm sure there are many other ways of skinning this cat, but the pure Bash for loop example above isn't recursive.
for file in $( find path/to/dir -type f -name '*.js' );
do sed -i -e 's#FIND#REPLACEMENT#g' "$file";
done

Suppress output to StdOut when piping echo

I'm making a bash script that crawls through a directory and outputs all files of a certain type into a text file. I've got that working, it just also writes out a bunch of output to console I don't want (the names of the files)
Here's the relevant code so far, tmpFile is the file I'm writing to:
for DIR in `find . -type d` # Find problem directories
do
for FILE in `ls "$DIR"` # Loop through problems in directory
do
if [[ `echo ${FILE} | grep -e prob[0-9]*_` ]]; then
`echo ${FILE} >> ${tmpFile}`
fi
done
done
The files I'm putting into the text file are in the format described by the regex prob[0-9]*_ (something like prob12345_01)
Where I pipe the output from echo ${FILE} into grep, it still outputs to stdout, something I want to avoid. I think it's a simple fix, but it's escaping me.
All this can be done in one single find command. Consider this:
find . -type f -name "prob[0-9]*_*" -exec echo {} >> ${tmpFile} \;
EDIT:
Even simpler: (Thanks to @GlennJackman)
find . -type f -name "prob[0-9]*_*" >> $tmpFile
To answer your specific question, you can pass -q to grep for silent output.
if echo "hello" | grep -q el; then
echo "found"
fi
But since you're already using find, this can be done with just one command:
find . -regex ".*prob[0-9]*_.*" -printf '%f\n' >> ${tmpFile}
find's regex is a match on the whole path, which is why the leading and trailing .* is needed.
The -printf '%f\n' prints the file name without directory, to match what your script is doing.
What you want to do is read the output of the find command;
for every entry find returns, get all (*) the files under that location,
then check whether each filename matches the pattern you want,
and if it matches, add it to the tmpfile:
while read -r dir; do
for file in "$dir"/*; do # will not match hidden files, unless dotglob is set
if [[ "$file" =~ prob[0-9]*_ ]]; then
echo "$file" >> "$tmpFile"
fi
done
done < <(find . -type d)
However, find can do that alone.
anubhava got there first ;)
so see his answer for how that's done.

Resources