using cut on a line having multiple instances of the same delimiter - unix - bash

I am trying to write a generic script which can have different file name inputs.
This is just a small part of my bash script.
For example, let's say folder 444-55 has 2 files:
qq.filter.vcf
ee.filter.vcf
I want my output to be -
qq
ee
I tried this and it worked -
ls /data2/delivery/Stack_overflow/1111_2222_3333_23/secondary/444-55/*.filter.vcf | sort | cut -f1 -d "." | xargs -n 1 basename
But let's say I have a folder like this -
/data2/delivery/Stack_overflow/de.1111_2222_3333_23/secondary/444-55/*.filter.vcf
My script's output would then be
de
de
How can I make it generic?
Thank you so much for your help.

Something like this in a script will "cut" it:
for i in /data2/delivery/Stack_overflow/1111_2222_3333_23/secondary/444-55/*.filter.vcf
do
basename "$i" | cut -f1 -d.
done | sort
advantages:
it does not parse the output of ls, which is frowned upon
it runs cut only after basename has stripped the directory, so dots in the path (such as de.1111_2222_3333_23) cannot interfere
it sorts last, so the output is guaranteed to be sorted by prefix

Just move the basename call earlier in the pipeline:
printf "%s\n" /data2/delivery/Stack_overflow/1111_2222_3333_23/secondary/444-55/*.filter.vcf |
xargs -n 1 basename |
sort |
cut -f1 -d.
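For completeness, here is a sketch of the same idea using only bash parameter expansion instead of basename and cut; the directory shown is the dotted one from the question, to illustrate that dots in the path do not interfere:
for i in /data2/delivery/Stack_overflow/de.1111_2222_3333_23/secondary/444-55/*.filter.vcf
do
    base=${i##*/}                  # strip the directory part, like basename
    printf '%s\n' "${base%%.*}"    # strip everything from the first dot onward
done | sort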

Related

looping over files that have the same string in the second part of their name

I am using a loop to zip many CSV files based on their prefix (the first element of the name):
printf '%s\n' *_*.csv | cut -d_ -f1 | uniq |
while read -r prefix
do
zip $ZIP_PATH/"$prefix"_"$DATE_EXPORT"_M2.zip "$prefix"_*.csv
done
And it works very well. As input I have
123_20211124_DONG.csv
123_20211124_FINA.csv
123_20211124_INDEM.csv
123_20211202_FINA.csv
123_20211202_INDEM.csv
and the zip loop will pack all these files because they have the same prefix
Now I would like to pack only the files whose second element in the file name is 20211202, in other words equal to the DATE_EXPORT variable (DATE_EXPORT=20211202).
I tried using grep like this:
printf '%s\n' *_*.csv | grep $DATE_EXPORT | cut -d_ -f1 | uniq |
while read -r prefix
do
zip $ZIP_PATH/"$prefix"_"$DATE_EXPORT"_M2.zip "$prefix"_*.csv
done
But it does not work. Any help, please?
"$prefix"_*.csv in the zip command in your second example is not filtered for "$DATE_EXPORT". Try "${prefix}_${DATE_EXPORT}_"*.csv or similar. You can also use *"_${DATE_EXPORT}_"*.csv with printf, instead of grep.
Also, I'm not sure what's going on with $cut, but obviously cut is the usual name.
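Putting those two suggestions together, the corrected loop might look like the sketch below (it assumes ZIP_PATH and DATE_EXPORT are already set, as in your script):
# filter by the date in the glob itself, and again when adding files to the zip
printf '%s\n' *"_${DATE_EXPORT}_"*.csv | cut -d_ -f1 | uniq |
while read -r prefix
do
    zip "$ZIP_PATH/${prefix}_${DATE_EXPORT}_M2.zip" "${prefix}_${DATE_EXPORT}_"*.csv
done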

sort and get unique file names after removing the extension

I am trying to remove the part of the filename after the second underscore and get the unique names. I saw many answers and formed a script. The script works fine up to the cut command, but it does not give unique filenames. I have tried the following command but I am not getting the desired output.
script used:
for filename in /path/to/files/*.gz;
do
fname=$(basename ${filename} | cut -f 1-2 -d "_" | sort | uniq)
echo "${fname}"
done
file example:
filename1_00_1.gz
filename1_00_2.gz
filename2_00_1.gz
filename2_00_2.gz
Required output:
filename1_00
filename2_00
So, with all of that said, how can I get a unique list of files in the required output format?
Thanks a lot in advance.
Apply uniq and sort after you are done printing the names, i.e. outside the loop; inside the command substitution each call only ever sees a single name, so there is nothing to deduplicate (the glob expansion is already sorted, which is why uniq before the final sort works here):
for filename in /path/to/files/*.gz
do
fname=$(basename "${filename}" | cut -f 1-2 -d "_")
echo "${fname}"
done | uniq | sort
Or just do
for filename in /path/to/files/*.gz; do echo "${filename%_*.gz}"; done | uniq | sort
for f in *.gz; do echo ${f%_*.gz}; done | sort | uniq
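If the loop is not needed at all, a single pipeline gives the same result (a sketch, using the same placeholder path as above):
# print basenames, keep the first two underscore-separated fields, then dedupe
printf '%s\n' /path/to/files/*.gz | xargs -n 1 basename | cut -d_ -f1-2 | sort -u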

How to write a shell script that reads all the file names in the directory and finds a particular string in file names?

I need a shell script to find a string in file names like the following one:
FileName_1.00_r0102.tar.gz
And then pick the highest value from multiple occurrences.
I am interested in "1.00" part of the file name.
I am able to get this part separately in the UNIX shell using the commands:
find /directory/*.tar.gz | cut -f2 -d'_' | cut -f1 -d'.'
1
2
3
1
find /directory/*.tar.gz | cut -f2 -d'_' | cut -f2 -d'.'
00
02
05
00
The problem is there are multiple files with this string:
FileName_1.01_r0102.tar.gz
FileName_2.02_r0102.tar.gz
FileName_3.05_r0102.tar.gz
FileName_1.00_r0102.tar.gz
I need to pick the file with FileName_("highest value")_r0102.tar.gz
But since I am new to shell scripting, I am not able to figure out how to handle these multiple instances in a script.
The script which I came up with just for the integer part is as follows:
#!/bin/bash
for file in /directory/*
file_version = find /directory/*.tar.gz | cut -f2 -d'_' | cut -f1 -d'.'
done
OUTPUT: file_version:command not found
Kindly help.
Thanks!
If you just want the latest version number:
cd /path/to/files
printf '%s\n' *r0102.tar.gz | cut -d_ -f2 | sort -n -t. -k1,2 | tail -n1
If you want the file name:
cd /path/to/files
latest=$(printf '%s\n' *r0102.tar.gz | cut -d_ -f2 | sort -n -t. -k1,2 | tail -n1)
printf '%s\n' *${latest}_r0102.tar.gz
You could try the following which finds all the matching files, sorts the filenames, takes the last in that list, and then extracts the version from the filename.
#!/bin/bash
file_version=$(find ./directory -name "FileName*r0102.tar.gz" | sort | tail -n1 | sed -r 's/.*_(.+)_.*/\1/g')
echo ${file_version}
I have tried it, and the one-liner below works; it may be what you need:
ls ./*.tar.gz | sort | sed -n '/[0-9]\.[0-9][0-9]/p' | tail -n 1
It's unnecessary to parse the filename's version number prior to finding the actual filename. Use GNU ls's -v (natural sort of (version) numbers within text) option:
ls -v FileName_[0-9.]*_r0102.tar.gz | tail -1
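If the version string itself is needed afterwards, it can be peeled out of that filename with parameter expansion; a sketch, assuming FileName_ is the literal prefix from the question:
newest=$(ls -v FileName_[0-9.]*_r0102.tar.gz | tail -1)   # e.g. FileName_3.05_r0102.tar.gz
version=${newest#FileName_}   # drop the leading prefix
version=${version%%_*}        # drop everything from the next underscore onward
echo "$version"               # prints 3.05 for the sample files above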

More universal alternative to this sed command?

I have a variable called $dirs storing directories in a dir tree:
root/animals/rats/mice
root/animals/cats
And I have another variable called $remove that holds the names of the directories I want to remove from the $dirs variable, for example:
rats
crabs
I am using a for loop to do that:
for d in $remove; do
dirs=$(echo "$dirs" | sed "/\b$d\b/d")
done
After that loop is done, what I should be left with is:
root/animals/cats
because the loop found rats.
I have tested this approach on 3 systems but it only works as expected on 2.
Is there a more universal approach that would work on all shells?
You are looking for something like
echo "${dirs}" | grep -Ev "rats|crabs"
When you can't store the exclusion list in |-separated form, convert it on the fly:
echo "${dirs}" | grep -Ev "$(echo "${remove}" | tr -s "\n" "|" | sed 's/|$//')"
You can use the exclude-file technique (grep -f) without a temp file:
echo "${dirs}" | grep -vf <(echo "${remove}")
I am not sure which of these solutions will be best supported.
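As a quick check, here is the last variant run against the sample data from the question (note that process substitution requires bash, ksh or zsh):
dirs='root/animals/rats/mice
root/animals/cats'
remove='rats
crabs'

# keep only the paths that match none of the names in $remove
echo "${dirs}" | grep -vf <(echo "${remove}")
# prints: root/animals/cats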

How to compare a file to a list in Linux with one line of code?

Hey, so I've got another predicament that I am stuck in. I wanted to see approximately how many Indian people are using the Stampede computer. So I set up an Indian-surnames text file in vim that has about 50 of the most common surnames in India, and I want to compare those names in the file to the user name list.
So far this is the code I have
getent passwd | cut -f 5 -d: | cut -f 2 -d' '
getent passwd gets the user list, which is going to look like this:
tg827313:x:827313:8144474:Brandon Williams
The cut commands will get just the last name, so the output for the example will be
Williams
Now I can use grep to compare files, but how do I use it to compare the getent passwd list with the file?
To count how many of the last names of computer users appear in the file namefile, use:
getent passwd | cut -f 5 -d: | cut -f 2 -d' ' | grep -wFf namefile | wc -l
How it works
getent passwd | cut -f 5 -d: | cut -f 2 -d' '
This is your code which I will assume works as intended for you.
grep -wFf namefile
This selects names that match a line in namefile. The -F option tells grep not to use regular expressions for the names; they are treated as fixed strings. The -f option tells grep to read the strings from the given file. -w tells grep to match whole words only.
wc -l
This returns a count of the lines in the output.
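For example, with a hypothetical namefile containing a few surnames (the file name and its contents are made up for illustration), the whole pipeline runs like this:
# build a small surname list; replace this with your own file of ~50 surnames
printf '%s\n' Patel Sharma Singh > namefile
# count the users whose last name appears in namefile
getent passwd | cut -f 5 -d: | cut -f 2 -d' ' | grep -wFf namefile | wc -l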
