looping into files having same string in second part of its name - shell

I am using a loop to zip many CSV files based on their prefix (the first element of the file name):
printf '%s\n' *_*.csv | cut -d_ -f1 | uniq |
while read -r prefix
do
zip $ZIP_PATH/"$prefix"_"$DATE_EXPORT"_M2.zip "$prefix"_*.csv
done
It works very well. As input I have
123_20211124_DONG.csv
123_20211124_FINA.csv
123_20211124_INDEM.csv
123_20211202_FINA.csv
123_20211202_INDEM.csv
and the zip loop packs all of these files because they share the same prefix.
However, I would like to pack only the files whose second name element equals the DATE_EXPORT variable, for example DATE_EXPORT=20211202.
I tried using grep like this:
printf '%s\n' *_*.csv | grep $DATE_EXPORT | cut -d_ -f1 | uniq |
while read -r prefix
do
zip $ZIP_PATH/"$prefix"_"$DATE_EXPORT"_M2.zip "$prefix"_*.csv
done
But it does not work. Any help, please?

"$prefix"_*.csv in the zip command in your second example is not filtered for "$DATE_EXPORT". Try "${prefix}_${DATE_EXPORT}_"*.csv or similar. You can also use *"_${DATE_EXPORT}_"*.csv with printf, instead of grep.
Also, I'm not sure what's going on with $cut, but obviously cut is the usual name.
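A minimal sketch of that suggestion, assuming bash and that DATE_EXPORT and ZIP_PATH are set as in the question. The function names prefixes_for_date and zip_for_date are made up for illustration. The point is that the date filter is applied twice: once when collecting prefixes and once in the glob handed to zip.

```shell
#!/bin/bash
# Hypothetical helper: print each unique prefix that has at least one
# file named <prefix>_<date>_*.csv in the current directory.
prefixes_for_date() {
    local date="$1" f
    for f in *_"${date}"_*.csv; do
        [ -e "$f" ] || continue          # the glob matched nothing
        printf '%s\n' "${f%%_*}"         # keep the text before the first underscore
    done | sort -u
}

# Hypothetical wrapper: one archive per prefix, holding only that date's files.
zip_for_date() {
    local date="$1" zip_path="$2" prefix
    prefixes_for_date "$date" |
    while IFS= read -r prefix; do
        zip "$zip_path/${prefix}_${date}_M2.zip" "${prefix}_${date}_"*.csv
    done
}
```

Run it from the directory that holds the CSV files, for example: zip_for_date "$DATE_EXPORT" "$ZIP_PATH"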

Related

sort and get unique files after removing extension of filename

I am trying to remove the part of each filename after the second underscore and get the unique names. I read many answers and put together a script. It works up to the cut command, but it does not give me the unique filenames. I have tried the following, but I am not getting the desired output.
script used:
for filename in ${path/to/files}/*.gz;
do
fname=$(basename ${filename} | cut -f 1-2 -d "_" | sort | uniq)
echo "${fname}"
done
file example:
filename1_00_1.gz
filename1_00_2.gz
filename2_00_1.gz
filename2_00_2.gz
Required output:
filename1_00
filename2_00
So, with all of that said, how can I get a unique list of files in the required output format?
Thanks a lot in advance.
Apply sort and uniq after you are done printing the names. uniq only removes adjacent duplicates, so sort first (/path/to/files stands in for your directory):
for filename in /path/to/files/*.gz
do
fname=$(basename "$filename" | cut -f 1-2 -d "_")
echo "$fname"
done | sort | uniq
Or just do
for filename in /path/to/files/*.gz; do f=${filename##*/}; echo "${f%_*.gz}"; done | sort | uniq
for f in *.gz; do echo ${f%_*.gz}; done | sort | uniq
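The same result can be sketched with parameter expansion only, assuming the part to drop always starts at the last underscore of the file name. unique_stems is a hypothetical name, and sort -u stands in for sort | uniq.

```shell
#!/bin/bash
# Hypothetical helper: list the unique "<name>_<nn>" stems of the .gz files in a directory.
unique_stems() {
    local dir="$1" path name
    for path in "$dir"/*.gz; do
        [ -e "$path" ] || continue       # the glob matched nothing
        name=${path##*/}                 # strip the directory part
        printf '%s\n' "${name%_*}"       # drop the last underscore and everything after it
    done | sort -u
}
```

Usage: unique_stems /path/to/files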

Merge files with ID before underscore

I am looking for a way to merge files that share the same ID before the first underscore in the filename. The output name should contain only the ID, followed by .fastq.gz. The output must be gzipped.
in
0394_L007_R1.fastq.gz
0394_L008_R1.fastq.gz
0444_L005_R1.fastq.gz
0444_L006_R1.fastq.gz
out
0394.fastq.gz
0444.fastq.gz
Something more convenient than:
cat 0394_L007_R1.fastq.gz 0394_L008_R1.fastq.gz > 0394.fastq.gz
A simple loop that keeps appending to the target file. So it's really just a matter of finding the correct "target file" for current file and appending to it.
#! /bin/bash
for x in *_*.fastq.gz; do
currid=$(echo "$x" | cut -d'_' -f1)
cat "$x" >> "$currid".fastq.gz
done
First, collect the unique identifiers in an associative array:
declare -A ids
for f in *_*.fastq.gz; do
ids[${f%%_*}]=1
done
Then use gzcat to pipe the (uncompressed) contents of each
matching file to gzip to recompress the output into a single file.
for id in "${!ids[@]}"; do
gzcat "$id"_*.fastq.gz | gzip -c > "$id".fastq.gz
done
(Or, because I forgot that concatenated Gzip files are themselves valid Gzip files,
for id in "${!ids[@]}"; do
cat "$id"_*.fastq.gz > "$id".fastq.gz
done
)
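A sketch of the plain cat variant that can be rerun safely, assuming every input name contains an underscore and no merged output name does. merge_by_id is a made-up name. The glob *_*.fastq.gz keeps previously merged outputs such as 0394.fastq.gz out of the input list, and each target is overwritten rather than appended to.

```shell
#!/bin/bash
# Hypothetical helper: merge <id>_*.fastq.gz into <id>.fastq.gz in the current directory.
merge_by_id() {
    local f id
    for f in *_*.fastq.gz; do
        [ -e "$f" ] || continue          # the glob matched nothing
        printf '%s\n' "${f%%_*}"         # ID = text before the first underscore
    done | sort -u |
    while IFS= read -r id; do
        # Concatenated gzip members form a valid gzip file, so cat is enough.
        cat "$id"_*.fastq.gz > "$id.fastq.gz"
    done
}
```

After running merge_by_id, gzip -t 0394.fastq.gz is one way to confirm that the merged file is still a valid gzip stream.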
Using a simple command:
ls | tr '_' '.' | cut -d'.' -f1,4,5 | uniq

How to write a shell script that reads all the file names in the directory and finds a particular string in file names?

I need a shell script to find a string in file names like the following one:
FileName_1.00_r0102.tar.gz
And then pick the highest value from multiple occurrences.
I am interested in "1.00" part of the file name.
I am able to get this part separately in the UNIX shell using the commands:
find /directory/*.tar.gz | cut -f2 -d'_' | cut -f1 -d'.'
1
2
3
1
find /directory/*.tar.gz | cut -f2 -d'_' | cut -f2 -d'.'
00
02
05
00
The problem is there are multiple files with this string:
FileName_1.01_r0102.tar.gz
FileName_2.02_r0102.tar.gz
FileName_3.05_r0102.tar.gz
FileName_1.00_r0102.tar.gz
I need to pick the file with FileName_("highest value")_r0102.tar.gz
But since I am new to shell scripting I am not able to figure out how to handle these multiple instances in script.
The script which I came up with just for the integer part is as follows:
#!/bin/bash
for file in /directory/*
file_version = find /directory/*.tar.gz | cut -f2 -d'_' | cut -f1 -d'.'
done
OUTPUT: file_version:command not found
Kindly help.
Thanks!
If you just want the latest version number:
cd /path/to/files
printf '%s\n' *r0102.tar.gz | cut -d_ -f2 | sort -t. -k1,1n -k2,2n | tail -n1
If you want the file name:
cd /path/to/files
latest=$(printf '%s\n' *r0102.tar.gz | cut -d_ -f2 | sort -t. -k1,1n -k2,2n | tail -n1)
printf '%s\n' *_"${latest}"_r0102.tar.gz
You could try the following which finds all the matching files, sorts the filenames, takes the last in that list, and then extracts the version from the filename.
#!/bin/bash
file_version=$(find ./directory -name "FileName*r0102.tar.gz" | sort | tail -n1 | sed -r 's/.*_(.+)_.*/\1/g')
echo ${file_version}
I tried the following line and it works; it should do what you need:
ls ./*.tar.gz | sort | sed -n '/[0-9]\.[0-9][0-9]/p' | tail -n 1
It's unnecessary to parse the filename's version number prior to finding the actual filename. Use GNU ls's -v (natural sort of (version) numbers within text) option:
ls -v FileName_[0-9.]*_r0102.tar.gz | tail -1
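Where GNU ls -v is not available, a sketch that compares the major and minor numbers numerically may help. It assumes names always look like FileName_<major>.<minor>_r0102.tar.gz; latest_file is a hypothetical name.

```shell
#!/bin/bash
# Hypothetical helper: print the file with the highest <major>.<minor> version
# among FileName_*_r0102.tar.gz in the current directory.
latest_file() {
    local f v
    for f in FileName_*_r0102.tar.gz; do
        [ -e "$f" ] || continue          # the glob matched nothing
        v=${f#*_}                        # drop "FileName_"
        v=${v%%_*}                       # keep "<major>.<minor>"
        printf '%s %s\n' "$v" "$f"       # "version filename" pairs
    done |
    sort -t. -k1,1n -k2,2n |             # numeric on major, then on minor
    tail -n 1 |
    cut -d' ' -f2-                       # strip the leading version column
}
```

Sorting each field numerically means 10.00 ranks above 9.00, which a plain lexicographic sort would get wrong.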

using cut on a line having multiple instances of the same delimiter - unix

I am trying to write a generic script which can have different file name inputs.
This is just a small part of my bash script.
For example, let's say folder 444-55 has 2 files
qq.filter.vcf
ee.filter.vcf
I want my output to be -
qq
ee
I tried this and it worked -
ls /data2/delivery/Stack_overflow/1111_2222_3333_23/secondary/444-55/*.filter.vcf | sort | cut -f1 -d "." | xargs -n 1 basename
But let's say I have a folder like this -
/data2/delivery/Stack_overflow/de.1111_2222_3333_23/secondary/444-55/*.filter.vcf
My script's output would then be
de
de
How can I make it generic?
Thank you so much for your help.
Something like this in a script will "cut" it:
for i in /data2/delivery/Stack_overflow/1111_2222_3333_23/secondary/444-55/*.filter.vcf
do
basename "$i" | cut -f1 -d.
done | sort
advantages:
it does not parse the output of ls, which is frowned upon
it cuts after having applied the basename treatment, and the cut ignores the full path.
it also sorts last so it's guaranteed to be sorted according to the prefix
Just move the basename call earlier in the pipeline:
printf "%s\n" /data2/delivery/Stack_overflow/1111_2222_3333_23/secondary/444-55/*.filter.vcf |
xargs -n 1 basename |
sort |
cut -f1 -d.
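Another sketch, assuming the suffix is always exactly .filter.vcf: basename accepts a suffix to strip as its second argument, so cut is not needed and dots in the directory path no longer matter. sample_names is a made-up name.

```shell
#!/bin/bash
# Hypothetical helper: print the sample names of the *.filter.vcf files in a directory.
sample_names() {
    local dir="$1" path
    for path in "$dir"/*.filter.vcf; do
        [ -e "$path" ] || continue       # the glob matched nothing
        basename "$path" .filter.vcf     # strip the directory and the known suffix
    done | sort
}
```

Usage: sample_names /data2/delivery/Stack_overflow/de.1111_2222_3333_23/secondary/444-55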

File name comparison in Bash

I have two files, each containing a list of file names. I need to check which files are missing from the second list. The catch is that I must not match the full name, only the last 19 characters of each file name.
E.g
MyFile12343220150510230000.xlsx
and
MyFile99999620150510230000.xlsx
are same files.
This is a unique problem and I don't know how to start. Kindly help.
awk based solution:
$ awk '
{start=length($0) - 18;}
NR==FNR{a[substr($0, start)]++; next;} # save the last 19 characters of every line in file2.list
{if(!a[substr($0, start)]) print $0;} # print lines of file.list whose last 19 characters are not in file2.list
' file2.list file.list
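The same awk idea, wrapped as a sketch with the suffix length as a parameter. The function name missing_in_second and its argument order are assumptions made for illustration. Note that the NR==FNR idiom assumes the list being checked is not empty.

```shell
#!/bin/bash
# Hypothetical helper.
#   $1 = reference list, $2 = list to check, $3 = trailing characters to compare (default 19)
# Prints the lines of $1 whose last $3 characters do not appear at the end of any line in $2.
missing_in_second() {
    awk -v n="${3:-19}" '
        { key = substr($0, length($0) - n + 1) }   # last n characters of the line
        NR == FNR { seen[key] = 1; next }          # first file read: the list to check
        !(key in seen)                             # second file read: print unmatched lines
    ' "$2" "$1"
}
```

Usage: missing_in_second file.list file2.list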
First you can use comm to match the exact file names and obtain a list of the files that do not match. Then you can use agrep. I've never used it, but you might find it useful.
Or, as a last option, you can brute-force it: for every line in the first file, search the second:
#!/bin/bash
# Iterate through the first file
while IFS= read -r LINE; do
# Find the section of the filename that has to match in the other file
CHECK_SECTION="$(echo "$LINE" | sed -nre 's/^.*([0-9]{14})\.(.*)$/\1.\2/p')"
# Create a regex to match the filenames in the second file
SEARCH_REGEX="^.*$CHECK_SECTION$"
# Search...
grep -E "$SEARCH_REGEX" inputFile_2.txt
done < inputFile_1.txt
Here I assumed the filenames end with 14 digits that must match in the other file and a file extension that can be different from file to file but that has to match too:
MyFile12343220150510230000.xlsx
| variable | 14digits |.ext
So, if the first file is FILE1 and the second file is FILE2 then if the intention is only to identify the files in FILE2 that don't exist in FILE1, the following should do:
tmp1=$(mktemp)
tmp2=$(mktemp)
cat $FILE1 | rev | cut -c -19 | sort | uniq > ${tmp1}
cat $FILE2 | rev | cut -c -19 | sort | uniq > ${tmp2}
diff ${tmp1} ${tmp2} | rev
rm ${tmp1} ${tmp2}
In a nutshell, this reverses the characters on each line, and extracts the part you're interested in, saving to a temporary file, for each list of files. The reversal of characters is done since you haven't said whether or not the length of filenames is guaranteed to be constant---the only thing we can rely on here is that the last 19 characters are of a fixed format (in this case, although the format is easily inferred, it isn't really relevant). The sort is important in order for the diff to show you what's not in the second file that is in the first.
If you're certain that there will only ever be files missing from FILE2 and not the other way around (that is, files in FILE2 that don't exist in FILE1), then you can clean things up by removing the cruft introduced by diff, so the last line becomes:
diff ${tmp1} ${tmp2} | rev | grep -i xlsx | sed 's/[[:space:]]\+.*//'
The grep limits the output to those lines with xlsx filenames, and the sed removes everything on a line from the first space encountered onwards.
Of course, technically this only tells you what time-stamped-grouped groups of files exist in FILE1 but not FILE2--as I understand it, this is what you're looking for (my understanding of your problem description is that MyFile12343220150510230000.xlsx and MyFile99999620150510230000.xlsx would have identical content). If the file names are always the same length (as you subsequently affirmed), then there's no need for the rev's and the cut commands can just be amended to refer to fixed character positions.
In any case, to get the final list of files, you'll have to use the "cleaned up" output to filter the content of FILE1; so, modifying the script above so that it includes the "cleanup" command, we can filter the files that you need using a grep--the whole script then becomes:
tmp1=$(mktemp)
tmp2=$(mktemp)
missing=$(mktemp)
cat $FILE1 | rev | cut -c -19 | sort | uniq > ${tmp1}
cat $FILE2 | rev | cut -c -19 | sort | uniq > ${tmp2}
diff ${tmp1} ${tmp2} | rev | grep -i xlsx | sed 's/[[:space:]]\+.*//' > ${missing}
grep -E "("`echo $(<${missing}) | sed 's/[[:space:]]/|/g'`")" "$FILE1"
rm ${tmp1} ${tmp2} ${missing}
The extended grep command (-E) just builds up an "or" regular expression for each timestamp-plus-extension and applies it to the first file. Of course, this is all assuming that there will never be timestamp-groups that exist in FILE2 and not in FILE1--if this is the case, then the "diff output processing" bit needs to be a little more clever.
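If only the "in the first list but not the second" direction matters, comm -23 on the two sorted key lists avoids post-processing diff output. This is a sketch under that assumption, not the script above; missing_from_second is a hypothetical name, and rev is assumed to be installed.

```shell
#!/bin/bash
# Hypothetical helper.
#   $1 = first list, $2 = second list
# Prints the lines of $1 whose last 19 characters occur in no line of $2.
missing_from_second() {
    local keys1 keys2 missing
    keys1=$(mktemp); keys2=$(mktemp); missing=$(mktemp)
    # Last 19 characters of each line, de-duplicated and sorted for comm.
    rev "$1" | cut -c -19 | rev | sort -u > "$keys1"
    rev "$2" | cut -c -19 | rev | sort -u > "$keys2"
    comm -23 "$keys1" "$keys2" > "$missing"      # keys found only in the first list
    grep -F -f "$missing" "$1"                   # map the keys back to full names
    rm -f "$keys1" "$keys2" "$missing"
}
```

grep -F treats each key as a fixed string, so the dot in .xlsx is not interpreted as a regular-expression wildcard.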
Or you could use your standard coreutil tools:
for i in $(cat f1.txt f2.txt | sort | uniq -u); do
grep -q "$i" f1.txt && \
echo "f2 missing '$i'" || \
echo "f1 missing '$i'"
done
It will identify which non-common entries are missing from which file. You can also manipulate the non-common filenames in any way you like, e.g. parameter expansion/substring extraction, substring removal, or character indexes.
