combine files based on similar ID in the middle - bash

01_002_H10_S190_L004_R1_001.fastq.gz
01_002_H10_S190_L008_R1_001.fastq.gz
01_002_H11_S191_L004_R1_001.fastq.gz
01_002_H11_S191_L008_R1_001.fastq.gz
I want to merge the files two at a time, pairing files that share the same ID: one letter followed by two numbers (H10, H11, etc.). Every file has such an ID, and the string before it is always 01_002_.
I have a bash script to combine files, but I'm not sure how to find the two files that belong together (the XXXX part in my script below).
declare -A ids
for f in XXXXXXXX; do ids[${f%%_*}]=1; done
Desired output:
01_002_H10.fastq.gz
01_002_H11.fastq.gz

This prints each distinct 01_002_XNN prefix on its own line:
find . -printf "%f\n" | grep -Eo "^01_002_[A-Z0-9]+" | sort -u
You can integrate it like this:
for f in $(find . -printf "%f\n" | grep -Eo "^01_002_[A-Z0-9]+" | sort -u); do
    # merge the two files matching "$f" here
done
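Filling in the loop body, a minimal sketch (assumptions: the files sit in the current directory, each ID is one letter plus two digits as stated, and concatenating gzip members yields a valid .gz, which holds for .fastq.gz):

```shell
# For each distinct 01_002_XNN prefix, concatenate both lane files
# into a single merged file named after the prefix.
for id in $(find . -maxdepth 1 -printf "%f\n" | grep -Eo "^01_002_[A-Z][0-9]{2}" | sort -u); do
    cat "${id}"_*_R1_001.fastq.gz > "${id}.fastq.gz"
done
```

The output file 01_002_H10.fastq.gz has no underscore after the ID, so the glob `"${id}"_*` never picks up the merged file itself.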

Related

How to count distinct file names in a directory using shell scripting

I have one folder containing around 400-plus files, and I have to count the number of distinct files, since there may be more than one version of a file.
For example, suppose a folder has these 7 files:
V07Y_0021_YP_0100_001.PDF - This is unique
V07Y_0021_YP_0099_001.PDF - This is unique
V07Y_0021_YP_0003_001.PDF - This is a duplicate; _001.PDF is the first version
V07Y_0021_YP_0003_002.PDF - This is a duplicate; _002.PDF is the second version
V07Y_0021_YP_0109_001.PDF - This is a duplicate; _001.PDF is the first version
V07Y_0021_YP_0108_001.PDF - This is unique
V07Y_0021_YP_0109_002.PDF - This is a duplicate; _002.PDF is the second version
In the above files, _0109, _0100, _0099 are page numbers, and the _001, _002 after them is the version. There can also be more than two versions of the same file (page number).
So I have to implement logic that gives me a count of 5, since the two duplicated pages are each counted only once.
I have tried various approaches like find directoryName -type f -printf '%f\n' | sort -u
This didn't work for me, as I also have to match a pattern.
If anybody knows the logic, please share.
Thanks in advance.
find . -type f -printf '%f\n' |
# Remove the version part
sed 's!_[0-9][0-9][0-9]\.PDF$!!' |
# remove duplicates
sort -u
would output:
V07Y_0021_YP_0003
V07Y_0021_YP_0099
V07Y_0021_YP_0100
V07Y_0021_YP_0108
V07Y_0021_YP_0109
Tested on repl
If you just want to count :
ls targetDirectory/V07Y_0021_YP* | cut -d'_' -f4 | sort -u | wc -l
This prints the number of unique items.
ls lists the files; cut takes the fourth '_'-separated field (the page number); sort -u removes duplicates; wc -l counts the lines.
You can remove | wc -l to get the list of files.
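Combining the two answers above, a one-pipeline sketch that counts the distinct page numbers directly (assuming the fixed V07Y_0021_YP_<page>_<version>.PDF naming shown in the question):

```shell
# Strip the version suffix, deduplicate, and count what remains.
find . -type f -printf '%f\n' |
    sed 's!_[0-9][0-9][0-9]\.PDF$!!' |   # remove the _NNN.PDF version part
    sort -u |                            # one line per page number
    wc -l
```

Unlike the ls-based variant, this does not depend on the page number being exactly the fourth '_'-separated field.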

How to grab the last result of a find command?

My find command produces the following output:
./alex20_0
./alex20_1
./alex20_2
./alex20_3
I saved this result as a variable. Now the only part I really need is the last entry, which is essentially the highest number or latest version.
So from the above output, all I need to extract is ./alex20_3 and save that as a variable. Is there a way to extract just the last directory output by the find command?
I would take the last n characters to extract it, since the output is already in order, but the character count changes once we reach versions like ./alex20_10.
Try this:
your_find_command | tail -n 1
find can list your files in any order. To extract the latest version you have to sort the output of find. The safest way to do this is
find . -maxdepth 1 -name "string" -print0 | sort -zV | tail -zn1
If your implementation of sort or tail does not support -z and you are sure that the filenames are free of line-breaks you can also use
find . -maxdepth 1 -name "string" -print | sort -V | tail -n1
There could be multiple ways to achieve this -
Using the tail command (as suggested by @Roadowl):
find branches -name 'alex*' | tail -n 1
Using the awk command:
find branches -name 'alex*' | awk 'END{print}'
Using the sed command:
find branches -name 'alex*' | sed -e '$!d'
Other possible options are to use a bash script, Perl, or any other language. Your best bet is whichever you find most convenient.
Since you want the file name with the highest version, you can try the following:
$ ls
alex20_0 alex20_1 alex20_2 alex20_3
$ find . -iname "*alex*" -print | sort | tail -n 1
./alex20_3
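One caveat with the last answer: plain sort compares strings character by character, so it misorders versions once they reach two digits. A quick demonstration:

```shell
# Lexicographic sort puts alex20_10 before alex20_2, so plain sort
# picks the wrong "latest" file:
printf './alex20_1\n./alex20_2\n./alex20_10\n' | sort | tail -n 1
# -> ./alex20_2

# GNU version sort (-V) compares the numeric suffix properly:
printf './alex20_1\n./alex20_2\n./alex20_10\n' | sort -V | tail -n 1
# -> ./alex20_10
```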

Using 'find' to select unknown patterns in file names with bash

Let's say I have a directory with 4 files in it.
path/to/files/1_A
path/to/files/1_B
path/to/files/2_A
path/to/files/2_B
I want to create a loop which, on each iteration, does something with two files: a matching X_A and X_B. I need to know how to find these files, which sounds simple enough using pattern matching. The problem is that there are too many files, and I do not know the prefixes, i.e. the patterns (1_ and 2_ in the example). Is there some way to group files in a directory based on the first few characters of the filename? (Ultimately stored in a variable to be used in a loop.)
You could get all the 3-character prefixes by printing out all the file names, trimming them to three characters, and then getting the unique strings.
find -printf '%f\n' | cut -c -3 | sort -u
Then if you wanted to loop over each prefix, you could write a loop like:
find -printf '%f\n' | cut -c -3 | sort -u | while IFS= read -r prefix; do
    echo "Looking for $prefix*..."
    find -name "$prefix*"
done
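For the four-file example in the question, the prefixes are two characters long, and each pair can be addressed directly inside the loop. A sketch (assuming two-character prefixes like "1_" and that both the _A and _B halves exist for every prefix):

```shell
# Group files by their two-character prefix, then act on each A/B pair.
find . -maxdepth 1 -type f -printf '%f\n' | cut -c -2 | sort -u |
while IFS= read -r prefix; do
    # replace echo with whatever should happen to the pair
    echo "Processing ${prefix}A and ${prefix}B"
done
```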

Using the command line "find | grep -wc" how do I return values that are greater than zero?

I downloaded a .htm file and typed this in
find . -iname "*.htm" | xargs grep -Ewcp 'sevenfold'
I am unsure what the -E does, but I know -wcp is word count and path. A sample of what shows up is this:
./bible/bible/RecoveryVersion_htm/ZecN.htm:1
./bible/bible/RecoveryVersion_htm/ZecO.htm:0
./bible/bible/RecoveryVersion_htm/Zep.htm:0
./bible/bible/RecoveryVersion_htm/ZepN.htm:0
./bible/bible/RecoveryVersion_htm/ZepO.htm:0
This list is rather long, with many zeros. How do I narrow the search to display only the files with nonzero hits for the word? Can I somehow put in an if statement, "if the value is 0, don't display it"? Is this possible?
You can use another invocation of grep to filter the results:
find . -iname "*.htm" | xargs grep -Ewcp 'sevenfold' | grep -v ':0$'
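If you only need the names of the matching files rather than the per-file counts, grep's -l option is a simpler alternative (a sketch; -l prints each file that contains at least one match, so no zero lines appear in the first place):

```shell
# List only the .htm files that contain the whole word "sevenfold".
find . -iname "*.htm" -exec grep -Ewl 'sevenfold' {} +
```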

How to get sorted list of files by modified date that match a certain filename and print out part of text in unix shell?

Sorry for the long title. I'm basically trying to write a script that will run find, get a list of all files named README sorted by modified date, and print out a section of text from each. It's an easy way for me to go to a directory which has a number of project folders and print out summaries. This is what I have so far:
find . -name "README" | xargs -I {} sed -n '/---/,/NOTES/p' {}
I can't seem to get this to be sorted by modified date. Any help would be great!
You can use the -printf option in find:
$ find . -name 'README' -printf '%T@\t%p\n' | sort -n | cut -f 2-
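Putting it together with the sed extraction from the question, a sketch (%T@ prints the epoch modification time, sort -n orders by it, cut drops the timestamp column, and xargs feeds each README to sed in that order):

```shell
# Print the ---…NOTES section of every README, oldest-modified first.
find . -name 'README' -printf '%T@\t%p\n' |
    sort -n |
    cut -f 2- |
    xargs -I {} sed -n '/---/,/NOTES/p' {}
```

Add -r to sort if you want the most recently modified README first.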
