Count distinct file names in a directory using shell scripting - shell

I have one folder which consists of around 400-plus files. I have to count the number of distinct files, since there may be more than one version of a file.
For example, if a folder has these seven files:
V07Y_0021_YP_0100_001.PDF - This is unique
V07Y_0021_YP_0099_001.PDF - This is unique
V07Y_0021_YP_0003_001.PDF - This is a duplicate; _001.PDF is the first version
V07Y_0021_YP_0003_002.PDF - This is a duplicate; _002.PDF is the second version
V07Y_0021_YP_0109_001.PDF - This is a duplicate; _001.PDF is the first version
V07Y_0021_YP_0108_001.PDF - This is unique
V07Y_0021_YP_0109_002.PDF - This is a duplicate; _002.PDF is the second version
In the above files, _0109, _0100, _0099 is the page number, and _001, _002 after it is the version. There can also be more than two versions of the same file (page number).
So I have to implement logic that gives me a count of 5, since each set of duplicates should be counted only once.
I have tried various approaches like find directoryName -type f -printf '%f\n' | sort -u
This didn't work for me, as I also have to match a pattern.
If anybody knows the logic, please share.
Thanks in advance.

find . -type f -printf '%f\n' |
# Remove the version part
sed 's!_[0-9][0-9][0-9]\.PDF$!!' |
# remove duplicates
sort -u
would output:
V07Y_0021_YP_0003
V07Y_0021_YP_0099
V07Y_0021_YP_0100
V07Y_0021_YP_0108
V07Y_0021_YP_0109
Tested on repl
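To get the count of 5 directly rather than the list, append wc -l - a minimal sketch assuming the PDFs sit under the current directory:

```shell
# Strip the version suffix, deduplicate the page names, then count them.
find . -type f -printf '%f\n' |
sed 's!_[0-9][0-9][0-9]\.PDF$!!' |
sort -u |
wc -l
```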

If you just want to count :
ls targetDirectory/V07Y_0021_YP* | cut -d'_' -f4 | sort -u | wc -l
This gives you the number of unique items.
ls: list the files; cut: take the fourth '_'-separated field; sort -u: remove duplicates; wc -l: count lines.
You can remove | wc -l to get the list of files.

Related

terminal find file with latest patch number

I have a folder with a lot of patch files with pattern
1.1.hotfix1
1.2.hotfix2
2.1.hotfix1
2.1.hotfix2 ...etc
and I have to find the latest patch (2.1.hotfix2 should be the result for the example) with bash.
How can I achieve it?
Reverse-order all files by modification time and print the first line.
In case you have other files too, restrict the listing to names containing hotfix:
ls -t1 *hotfix* | head -n 1
You can use find with -regex (note that find's default emacs regex has no \d; use [0-9]), and take the last line of a version sort:
find * -type f -regex "[0-9]+\.[0-9]+\.hotfix[0-9]+" | sort -V | tail -1
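One caveat worth knowing: a plain sort compares lexically, so 10.1.hotfix1 would land before 2.1.hotfix2. GNU sort's -V flag compares version components numerically, which is easy to check on the example names:

```shell
# sort -V orders version strings numerically, component by component.
printf '%s\n' 1.1.hotfix1 1.2.hotfix2 2.1.hotfix1 2.1.hotfix2 |
sort -V | tail -1
# -> 2.1.hotfix2
# A plain sort would misplace multi-digit versions (10.1.hotfix1 before
# 2.1.hotfix2); sort -V keeps 10 after 2.
```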

combine files based on similar ID in the middle

01_002_H10_S190_L004_R1_001.fastq.gz
01_002_H10_S190_L008_R1_001.fastq.gz
01_002_H11_S191_L004_R1_001.fastq.gz
01_002_H11_S191_L008_R1_001.fastq.gz
I want to merge the two files that have the same ID, based on the letter and two numbers (H10, H11, etc.). All the files have an ID of one letter followed by two numbers. Also, the string before H10, H11 is always 01_002_.
I have a bash script to combine files, but I am not sure how to get the two files that belong together (XXXX in my script below):
declare -A ids
for f in XXXXXXXX; do ids[${f%%_*}]=1; done
Desired output:
01_002_H10.fastq.gz
01_002_H11.fastq.gz
This returns all names separated by newlines:
find . -printf "%f\n" | egrep -o "^01_002_[A-Z0-9]+" | sort | uniq
You can integrate it like this:
for f in $(find . -printf "%f\n" | egrep -o "^01_002_[A-Z0-9]+" | sort | uniq);
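The loop body can then concatenate each ID's lane files into one output (gzip streams can be concatenated as-is and remain valid). A sketch assuming the naming scheme above, producing the desired 01_002_Hnn.fastq.gz names:

```shell
# For each unique 01_002_Hnn ID, concatenate its per-lane files.
for id in $(find . -printf "%f\n" | egrep -o "^01_002_[A-Z0-9]+" | sort -u); do
  # The output name (${id}.fastq.gz) does not match this glob,
  # so re-running the loop is safe.
  cat "${id}"_*_L*_R1_001.fastq.gz > "${id}.fastq.gz"
done
```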

Count and remove extraneous files (bash)

I am getting stuck on finding a succinct solution to the following.
In a given directory, I have the following files:
10_MIDAP.nii.gz
12_MIDAP.nii.gz
14_MIDAP.nii.gz
16_restAP.nii.gz
18_restAP.nii.gz
I am only supposed to have two "MIDAP" files and one "restAP" file. The additional files may not contain the full data, so I need to remove them. These are likely going to be smaller in size and/or the earlier sequence number (e.g., 10).
I know how to count / echo the number of files:
MIDAP=`find $DATADIR -name "*MIDAP.nii.gz" | wc -l`
RestAP=`find $DATADIR -name "*restAP.nii.gz" | wc -l`
echo "MIDAP files = $MIDAP"
echo "RestAP files = $RestAP"
Any suggestions on how to succinctly remove the unneeded files, such that I end up with two "MIDAP" files and one "restAP" (in cases where there are extraneous files)? As of now, imagining it would be something like this...
if (( $MIDAP > 2 )); then
...magic happens
fi
Thanks for any advice!
Here is an approach.
Create test files:
$ for i in {1..10}; do touch ${i}_restAP; touch ${i}_MIDAP; done
Sort based on the numbers, and remove everything but the last file (or last two files):
$ find . -name '*restAP*' | sort -V | head -n -1 | xargs rm
$ find . -name '*MIDAP*' | sort -V | head -n -2 | xargs rm
$ ls -1
10_MIDAP
10_restAP
9_MIDAP
You may want to change the sort key if the selection should be based on file size instead.
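If "incomplete" shows up as a smaller file rather than an earlier sequence number, the same delete-all-but-the-last idea works on a size-prefixed listing. A sketch using GNU find's -printf, assuming file names without whitespace:

```shell
# Keep only the largest restAP file: prefix each path with its byte size,
# sort numerically, drop the last (largest) line, delete the rest.
find . -name '*restAP*' -printf '%s\t%p\n' |
sort -n |
head -n -1 |
cut -f2- |
xargs -r rm
```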

Using 'find' to select unknown patterns in file names with bash

Let's say I have a directory with 4 files in it.
path/to/files/1_A
path/to/files/1_B
path/to/files/2_A
path/to/files/2_B
I want to create a loop, which on each iteration, does something with two files, a matching X_A and X_B. I need to know how to find these files, which sounds simple enough using pattern matching. The problem is, there are too many files, and I do not know the prefixes aka patterns (1_ and 2_ in the example). Is there some way to group files in a directory based on the first few characters in the filename? (Ultimately to store as a variable to be used in a loop)
You could get all the 3-character prefixes by printing out all the file names, trimming them to three characters, and then getting the unique strings.
find -printf '%f\n' | cut -c -3 | sort -u
Then if you wanted to loop over each prefix, you could write a loop like:
find -printf '%f\n' | cut -c -3 | sort -u | while IFS= read -r prefix; do
echo "Looking for $prefix*..."
find -name "$prefix*"
done
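A glob-only variant collects the prefixes into a bash associative array instead of running find twice; a sketch assuming bash 4+ and two-character prefixes to match the 1_/2_ example:

```shell
# Group path/to/files/* by the first two characters of the file name
# (requires bash 4+).
declare -A groups
for f in path/to/files/*; do
  name=${f##*/}            # basename
  prefix=${name:0:2}       # e.g. "1_" or "2_"
  groups[$prefix]+="$f "   # append this file to its prefix's list
done
for prefix in "${!groups[@]}"; do
  echo "$prefix: ${groups[$prefix]}"
done
```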

Counting the maximum "levels" aka largest number of subfolders in a directory on a server, using mac OSX

I have a server with a lot of files and folders in it. We are talking a very large number - in the thousands easily for both files and folders, probably a lot more but I haven't done a count yet.
I'll give an example of my hierarchy and then ask my question.
Example Folders:
Starting folder
    folder1a
        folder2a
            folder3a
            file3a
    folder1b
        folder2b
            folder3b
            file3b
        file2b (at the same level as folder2b)
Here, folder3a and folder3b are both 3 levels down from the original starting folder on the server, 'starting folder' (is 'levels' even the right word?). I'm trying to count the deepest nesting of folders in this directory. I think I know how to count the actual number of folders in the terminal - if you put in
ls -lR | grep ^d | wc -l
It should give you the number of subdirectories in the specified directory. For this example folder, it should give 6 subdirectories. However, I just want the deepest level - in this case 3 levels down from the starting directory, for both folder3a and folder3b. So I would want my code to return 3 instead of 6.
If I am also reading the following code correctly, running:
echo */ | wc
Should return 2 for the number of sub-directories in a directory, but doesn't count more than one level down.
Is it possible to run through the whole server and return the deepest level? If I am thinking about this correctly, I want it to return the number of double-clicks it would take me to reach the deepest folder on the server (note that I don't actually need the file path, just the number).
I'll happily explain myself if I'm not making as much sense as I should be.
You can use find like this to list directories only:
find . -type d
If you then pass the list of names through awk, you can count how many fields (i.e. levels) each has by using the directory separator (/) as awk's field separator:
find . -type d | awk -F'/' '{print NF}'
then you can run that through a reverse numeric sort, so the largest comes first:
find . -type d | awk -F'/' '{print NF}' | sort -rn | head -1
Updated
If you want to check that I count the same as you do, you can run this command and it will show you both the numbers of directories and all the names:
find . -type d | awk -F'/' '{print NF,$0}' | sort -rn
If I understand your question correctly, this should work for you:
find . -type d | sed -e s':[^/]::g' | wc -L
First we search for all subdirectories and have the path names printed.
In the path names (which will be like "./folder1a/folder2a") we eliminate everything but the slashes.
We use wc -L to return the longest line length, which is the maximum number of slashes, and thus the maximum level of a subdirectory. (Note that -L is a GNU wc extension; the BSD wc that ships with macOS doesn't have it.)
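The two steps can also be collapsed into a single awk pass that tracks the maximum field count; it prints just the depth (0 for the starting directory itself):

```shell
# "." has 1 /-separated field, "./a/b" has 3, so depth below the start is NF-1.
find . -type d | awk -F'/' 'NF > max { max = NF } END { print max - 1 }'
```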