Merge files with ID before underscore - bash

I am looking for a way to merge files that share the same ID before the first underscore in the filename. The output should contain the ID only, followed by .fastq.gz, and must be gzipped.
Input:
0394_L007_R1.fastq.gz
0394_L008_R1.fastq.gz
0444_L005_R1.fastq.gz
0444_L006_R1.fastq.gz
Output:
0394.fastq.gz
0444.fastq.gz
Something more convenient than:
cat 0394_L007_R1.fastq.gz 0394_L008_R1.fastq.gz > 0394.fastq.gz

A simple loop that keeps appending to the target file. So it's really just a matter of finding the correct "target file" for the current file and appending to it.
#!/bin/bash
# append each input to the output named after its ID (the part before
# the first underscore); *_*.fastq.gz avoids re-matching merged outputs
for x in *_*.fastq.gz; do
    currid=$(echo "$x" | cut -d'_' -f1)
    cat "$x" >> "$currid".fastq.gz
done
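Note that >> appends, so a second run would double the merged data. A minimal guard, assuming merged outputs are the only .fastq.gz names without an underscore, is to remove them before merging:
for f in *.fastq.gz; do
    # keep inputs (they contain an underscore); remove old merged outputs
    [[ $f == *_* ]] || rm -- "$f"
done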

First, collect the unique identifiers in an associative array:
declare -A ids
for f in *.fastq.gz; do
    ids[${f%%_*}]=1
done
Then use gzcat (zcat on most Linux systems) to pipe the uncompressed contents of each
matching file to gzip, recompressing the output into a single file.
for id in "${!ids[@]}"; do
    gzcat "$id"_*.fastq.gz | gzip -c > "$id".fastq.gz
done
(Or, because I forgot that concatenated Gzip files are themselves valid Gzip files,
for id in "${!ids[@]}"; do
    cat "$id"_*.fastq.gz > "$id".fastq.gz
done
)
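A quick sanity check on the example above (zcat is gzcat on BSD/macOS; both counts should match):
# decompressed line count of the merge vs. the sum of the inputs
zcat 0394.fastq.gz | wc -l
zcat 0394_L007_R1.fastq.gz 0394_L008_R1.fastq.gz | wc -l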

Using a simple command to list the unique target names:
ls | tr '_' '.' | cut -d'.' -f1,4,5 | uniq
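That one-liner only prints the target names; a hedged sketch extending it into the actual merge (assuming the filenames follow the ID_Lnnn_Rn pattern above) would be:
ls | tr '_' '.' | cut -d'.' -f1,4,5 | uniq |
while read -r out; do
    # strip everything after the first dot to recover the ID
    cat "${out%%.*}"_*.fastq.gz > "$out"
done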

Related

looping into files having same string in second part of its name

I am using a loop to zip many CSV files based on their prefix (the first element of the name):
printf '%s\n' *_*.csv | cut -d_ -f1 | uniq |
while read -r prefix
do
    zip "$ZIP_PATH/${prefix}_${DATE_EXPORT}_M2.zip" "$prefix"_*.csv
done
And it works very well. As input I have:
123_20211124_DONG.csv
123_20211124_FINA.csv
123_20211124_INDEM.csv
123_20211202_FINA.csv
123_20211202_INDEM.csv
and the zip loop will pack all these files because they have the same prefix.
Now I would like to pack only those with $DATE_EXPORT = 20211202; in other words, only those whose second element in the file name equals the DATE_EXPORT variable.
I tried using grep like this:
printf '%s\n' *_*.csv | grep "$DATE_EXPORT" | cut -d_ -f1 | uniq |
while read -r prefix
do
    zip "$ZIP_PATH/${prefix}_${DATE_EXPORT}_M2.zip" "$prefix"_*.csv
done
But it does not work. Any help, please?
"$prefix"_*.csv in the zip command in your second example is not filtered for "$DATE_EXPORT". Try "${prefix}_${DATE_EXPORT}_"*.csv or similar. You can also use *"_${DATE_EXPORT}_"*.csv with printf, instead of grep.
Also, I'm not sure what's going on with $cut, but obviously cut is the usual name.
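Putting the comment's suggestions together, a hedged sketch of the corrected loop (with $ZIP_PATH and $DATE_EXPORT as defined elsewhere in the asker's script):
printf '%s\n' *"_${DATE_EXPORT}_"*.csv | cut -d_ -f1 | uniq |
while read -r prefix
do
    # both the name list and the zip glob are now filtered by date
    zip "$ZIP_PATH/${prefix}_${DATE_EXPORT}_M2.zip" "${prefix}_${DATE_EXPORT}_"*.csv
done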

More universal alternative to this sed command?

I have a variable called $dirs storing directories in a dir tree:
root/animals/rats/mice
root/animals/cats
And I have another variable called $remove that holds the names of the directories I want to remove from the $dirs variable, for example:
rats
crabs
I am using a for loop to do that:
for d in $remove; do
    dirs=$(echo "$dirs" | sed "/\b$d\b/d")
done
After that loop is done, what I should be left with is:
root/animals/cats
because the loop found rats.
I have tested this approach on 3 systems but it only works as expected on 2.
Is there a more universal approach that would work on all shells?
(The \b word boundary is a GNU sed extension, which is why the loop only works on some of your systems.) You are looking for something like
echo "${dirs}" | grep -Ev "rats|crabs"
When you can't store the exclusion list in the |-separated format, build it on the fly:
echo "${dirs}" | grep -Ev "$(echo "${remove}" | tr -s '\n' '|' | sed 's/|$//')"
You can use the excludeFile technique without a temp file with
echo "${dirs}" | grep -vf <(echo "${remove}")
I am not sure which of these solutions will be best supported.

File name comparison in Bash

I have two files containing lists of files. I need to check which files are missing from the list in the second file. The problem is that I must not match the full name, only the last 19 characters of the file names.
E.g.
MyFile12343220150510230000.xlsx
and
MyFile99999620150510230000.xlsx
are same files.
This is a unique problem and I don't know how to start. Kindly help.
An awk-based solution:
$ awk '
    {start = length($0) - 18}                  # position where the last 19 characters begin
    NR==FNR {a[substr($0, start)]++; next}     # save that suffix for every line of file2.list
    {if (!a[substr($0, start)]) print $0}      # print lines of file.list whose suffix is absent
' file2.list file.list
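A quick test with the two example names (creating the list files the awk invocation expects; nothing is printed because the last 19 characters match):
printf '%s\n' MyFile99999620150510230000.xlsx > file2.list
printf '%s\n' MyFile12343220150510230000.xlsx > file.list
awk '{start=length($0)-18} NR==FNR{a[substr($0,start)]++; next} {if(!a[substr($0,start)]) print $0}' file2.list file.list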
First you can use comm to match the exact file names and obtain a list of files not matching. Then you can use agrep. I've never used it, but you might find it useful.
Or, as a last option, you can brute-force it and, for every line in the first file, search the second:
#!/bin/bash
# Iterate through the first file
while read -r LINE; do
    # Extract the section of the filename that has to match in the other file
    CHECK_SECTION="$(echo "$LINE" | sed -nre 's/^.*([0-9]{14})\.(.*)$/\1.\2/p')"
    # Build a regex to match the filenames in the second file
    SEARCH_REGEX="^.*$CHECK_SECTION$"
    # Search...
    grep -E "$SEARCH_REGEX" inputFile_2.txt
done < inputFile_1.txt
Here I assumed the filenames end with 14 digits that must match in the other file, plus a file extension that can differ from file to file but must also match:
MyFile123432 20150510230000 .xlsx
|  variable  |  14 digits  | .ext
So, if the first file is FILE1 and the second is FILE2, and the intention is only to identify the files in FILE2 that don't exist in FILE1, the following should do:
tmp1=$(mktemp)
tmp2=$(mktemp)
rev "$FILE1" | cut -c -19 | sort | uniq > "${tmp1}"
rev "$FILE2" | cut -c -19 | sort | uniq > "${tmp2}"
diff "${tmp1}" "${tmp2}" | rev
rm "${tmp1}" "${tmp2}"
In a nutshell, this reverses the characters on each line, and extracts the part you're interested in, saving to a temporary file, for each list of files. The reversal of characters is done since you haven't said whether or not the length of filenames is guaranteed to be constant---the only thing we can rely on here is that the last 19 characters are of a fixed format (in this case, although the format is easily inferred, it isn't really relevant). The sort is important in order for the diff to show you what's not in the second file that is in the first.
If you're certain that there will only ever be files missing from FILE2 and not the other way around (that is, files in FILE2 that don't exist in FILE1), then you can clean things up by removing the cruft introduced by diff, so the last line becomes:
diff ${tmp1} ${tmp2} | rev | grep -i xlsx | sed 's/[[:space:]]\+.*//'
The grep limits the output to those lines with xlsx filenames, and the sed removes everything on a line from the first space encountered onwards.
Of course, technically this only tells you which timestamp-grouped sets of files exist in FILE1 but not FILE2; as I understand it, this is what you're looking for (your problem description implies that MyFile12343220150510230000.xlsx and MyFile99999620150510230000.xlsx would have identical content). If the file names are always the same length (as you subsequently affirmed), then there's no need for the rev commands, and the cut commands can simply refer to fixed character positions.
In any case, to get the final list of files, you'll have to use the "cleaned up" output to filter the content of FILE1; so, modifying the script above to include the "cleanup" command, we can filter the files that you need using a grep; the whole script then becomes:
tmp1=$(mktemp)
tmp2=$(mktemp)
missing=$(mktemp)
rev "$FILE1" | cut -c -19 | sort | uniq > "${tmp1}"
rev "$FILE2" | cut -c -19 | sort | uniq > "${tmp2}"
diff "${tmp1}" "${tmp2}" | rev | grep -i xlsx | sed 's/[[:space:]]\+.*//' > "${missing}"
grep -E "("`echo $(<${missing}) | sed 's/[[:space:]]/|/g'`")" "${tmp1}"
rm "${tmp1}" "${tmp2}" "${missing}"
The extended grep command (-E) just builds up an "or" regular expression for each timestamp-plus-extension and applies it to the first file. Of course, this is all assuming that there will never be timestamp-groups that exist in FILE2 and not in FILE1--if this is the case, then the "diff output processing" bit needs to be a little more clever.
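To see what the backtick-embedded command substitution builds, here is the transformation in isolation (hypothetical contents for the missing file):
printf '%s\n' 20150510230000.xlsx 20150601120000.xlsx > missing.txt
# the unquoted expansion collapses newlines to spaces; sed turns them into |
echo $(<missing.txt) | sed 's/[[:space:]]/|/g'
# -> 20150510230000.xlsx|20150601120000.xlsx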
Or you could use standard coreutils:
for i in $(cat file1 file2 | sort | uniq -u); do
    grep -q "$i" file1 && \
        echo "file2 missing '$i'" || \
        echo "file1 missing '$i'"
done
It will identify which non-common entries are missing from which file. You can also manipulate the non-common filenames in any way you like, e.g. parameter expansion/substring extraction, substring removal, or character indexes.
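For example, substring extraction alone recovers the 19-character comparison key (a sketch, not part of the original answer):
f=MyFile12343220150510230000.xlsx
echo "${f: -19}"   # last 19 characters: 20150510230000.xlsx
# note the space before -19; without it, ${f:-19} means "default to 19"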

Faster grep in many files from several strings in a file

I have the following working script that greps, in a directory of many files, for specific strings previously saved into a file.
I use the file extension to grep all files, as their names are random; note that every string from the strings file should be searched for in all the files.
Also, I cut grep's output, as it returns 2 or 3 lines per matched file and I only want the specific part that shows the filename.
I might be doing something redundant; how could it be made faster?
#!/bin/bash
# working but slow
cd /var/FILES_DIRECTORY
while read line
do
    LC_ALL=C fgrep "$line" *.cps | cut -c1-27 >> /var/tmp/test_OUT.txt
done < "/var/tmp/test_STRINGS.txt"
grep -F -f /var/tmp/test_STRINGS.txt *.cps | cut -c1-27
Isn't that what you're looking for?
This should speed up your script, since it runs grep once over all the files instead of once per string (-F keeps your fgrep's fixed-string matching):
#!/bin/bash
# working fast
cd /var/FILES_DIRECTORY
export LC_ALL=C
grep -F -f /var/tmp/test_STRINGS.txt *.cps | cut -c1-27 > /var/tmp/test_OUT.txt
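If only the file names are wanted, a further hedged refinement is grep -l, which prints each matching file name once (per file, not per string) and makes the cut unnecessary:
grep -lF -f /var/tmp/test_STRINGS.txt *.cps > /var/tmp/test_OUT.txt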

How to loop over files in natural order in Bash?

I am looping over all the files in a directory with the following command:
for i in *.fas; do some_code; done;
However, I get them in this order
vvchr1.fas
vvchr10.fas
vvchr11.fas
vvchr2.fas
...
instead of
vvchr1.fas
vvchr2.fas
vvchr3.fas
...
what is natural order.
I have tried sort command, but to no avail.
readarray -d '' entries < <(printf '%s\0' *.fas | sort -zV)
for entry in "${entries[@]}"; do
    # do something with $entry
done
where printf '%s\0' *.fas yields a NUL separated list of directory entries with the extension .fas, and sort -zV sorts them in natural order.
Note that you need GNU sort installed in order for this to work.
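A quick check of the sort itself, using the names from the question (GNU sort assumed):
printf '%s\n' vvchr1.fas vvchr10.fas vvchr11.fas vvchr2.fas | sort -V
# vvchr1.fas
# vvchr2.fas
# vvchr10.fas
# vvchr11.fas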
With the option -g, sort compares according to general numerical value:
for FILE in `ls ./raw/ | sort -g`; do echo "$FILE"; done
0.log
1.log
2.log
...
10.log
11.log
This will only work if the names of the files are numerical. If they are strings, you will get them in alphabetical order. E.g.:
for FILE in `ls ./raw/* | sort -g`; do echo "$FILE"; done
raw/0.log
raw/10.log
raw/11.log
...
raw/2.log
You will get the files in ASCII order. This means that vvchr10* comes before vvchr2*. I realise that you can not rename your files (my bioinformatician brain tells me they contain chromosome data, and we simply don't call chromosome 1 "chr01"), so here's another solution (not using sort -V which I can't find on any operating system I'm using):
ls *.fas | sed 's/^\([^0-9]*\)\([0-9]*\)/\1 \2/' | sort -k2,2n | tr -d ' ' |
while read filename; do
    # do work with $filename
done
This is a bit convoluted and will not work with filenames containing spaces.
Another solution: Suppose we'd like to iterate over the files in size-order instead, which might be more appropriate for some bioinformatics tasks:
du *.fas | sort -k1,1n |
while read filesize filename; do
    # do work with $filename
done
To reverse the sorting, just add r after -k1,1n (to get -k1,1nr).
You mean that files with the number 10 come before files with the number 3 in your list? That's because ls sorts its results very simply, so something-10.whatever sorts before something-3.whatever.
One solution is to rename all files so they have the same number of digits (single-digit numbers get a leading 0); a sketch of such a rename follows.
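A hedged sketch of that rename, assuming names like vvchrN.fas with at most two digits (printf's %02d supplies the leading zero):
for f in vvchr[0-9].fas; do
    n=${f#vvchr}; n=${n%.fas}    # extract the number
    mv -- "$f" "$(printf 'vvchr%02d.fas' "$n")"
done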
while IFS= read -r file; do
    ls -l "$file" # or whatever
done < <(find . -name '*.fas' 2>/dev/null | sed -r -e 's/([0-9]+)/ \1/' | sort -k 2 -n | sed -e 's/ //;')
Solves the problem, presuming the file naming stays consistent, doesn't rely on very-recent versions of GNU sort, does not rely on reading the output of ls and doesn't fall victim to the pipe-to-while problems.
Like @Kusalananda's solution (perhaps easier to remember?) but catering for all files(?):
array=($(ls | sed 's/[^0-9]*\([0-9]*\)\..*/\1 &/' | sort -n | sed 's/^[^ ]* //'))
for x in "${array[@]}"; do echo "$x"; done
In essence: add a sort key, sort, remove the sort key.
Use sort -rh with a while loop:
du -sh * | sort -rh | grep -P "avi$" | awk '{print $2}' | while read f; do fp=$(pwd)/$f; echo "$fp"; done
