How to loop over files in natural order in Bash? - bash

I am looping over all the files in a directory with the following command:
for i in *.fas; do some_code; done;
However, I get them in this order
vvchr1.fas
vvchr10.fas
vvchr11.fas
vvchr2.fas
...
instead of
vvchr1.fas
vvchr2.fas
vvchr3.fas
...
which is the natural order.
I have tried the sort command, but to no avail.

readarray -d '' entries < <(printf '%s\0' *.fas | sort -zV)
for entry in "${entries[@]}"; do
    # do something with "$entry"
    printf '%s\n' "$entry"
done
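where printf '%s\0' *.fas yields a NUL-separated list of the directory entries with the extension .fas, and sort -zV sorts them in natural (version) order while keeping the NUL separators.
Note that -z and -V are GNU sort options, so you need GNU sort installed for this to work; readarray -d '' additionally requires Bash 4.4 or newer.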
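To try the approach above without touching real data, here is a minimal sketch. It assumes GNU sort and Bash 4.4 or newer; the scratch directory and the sample file names are made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical scratch directory with sample files (names are illustrative only).
workdir=$(mktemp -d)
touch "$workdir"/vvchr1.fas "$workdir"/vvchr10.fas "$workdir"/vvchr11.fas \
      "$workdir"/vvchr2.fas "$workdir"/vvchr3.fas

# The glob is expanded inside the scratch directory (in a subshell), printed
# NUL-delimited, version-sorted by GNU sort, and read into an array.
readarray -d '' entries < <(cd "$workdir" && printf '%s\0' *.fas | sort -zV)

for entry in "${entries[@]}"; do
    printf '%s\n' "$entry"
done

rm -rf "$workdir"
```

The loop should list vvchr2.fas and vvchr3.fas before vvchr10.fas, which is the order the question asks for.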

With the -g option, sort compares according to general numerical value:
for FILE in `ls ./raw/ | sort -g`; do echo "$FILE"; done
0.log
1.log
2.log
...
10.log
11.log
This only works if the file names are purely numeric. If they start with non-numeric text (here, the directory prefix that ls prints when given a glob), you get them in alphabetical order instead. E.g.:
for FILE in `ls ./raw/* | sort -g`; do echo "$FILE"; done
raw/0.log
raw/10.log
raw/11.log
...
raw/2.log

You will get the files in ASCII order. This means that vvchr10* comes before vvchr2*. I realise that you can not rename your files (my bioinformatician brain tells me they contain chromosome data, and we simply don't call chromosome 1 "chr01"), so here's another solution (not using sort -V which I can't find on any operating system I'm using):
ls *.fas | sed 's/^\([^0-9]*\)\([0-9]*\)/\1 \2/' | sort -k2,2n | tr -d ' ' |
while read -r filename; do
    # do work with "$filename"
    printf '%s\n' "$filename"
done
This is a bit convoluted and will not work with filenames containing spaces.
Another solution: Suppose we'd like to iterate over the files in size-order instead, which might be more appropriate for some bioinformatics tasks:
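du *.fas | sort -k1,1n |
while read -r filesize filename; do
    # do work with "$filename"
    printf '%s\n' "$filename"
done
du prints the size in the first column and the name in the second, so the numeric sort key is the first field. To reverse the sorting, just add r after -k1,1n (to get -k1,1nr).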

You mean that files with the number 10 come before files with the number 3 in your list? That's because ls sorts its output lexicographically, character by character, so something-10.whatever sorts before something-3.whatever.
One solution is to rename all files so that their numbers have the same number of digits (single-digit numbers get a leading 0).
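A minimal sketch of such a zero-padding rename follows. The vvchr prefix and .fas extension are taken from the question; the scratch directory, the sample files, and the padding width of two digits are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical scratch directory with sample files.
workdir=$(mktemp -d)
touch "$workdir"/vvchr{1,2,10,11}.fas

for f in "$workdir"/vvchr*.fas; do
    name=${f##*/}          # strip the directory part
    num=${name#vvchr}      # drop the prefix, e.g. "1.fas"
    num=${num%.fas}        # drop the extension, e.g. "1"
    # 10# forces base 10 so that a number like 08 is not read as octal.
    padded=$(printf 'vvchr%02d.fas' "$((10#$num))")
    [ "$name" = "$padded" ] || mv -- "$f" "$workdir/$padded"
done

renamed=$(cd "$workdir" && printf '%s\n' *.fas)
printf '%s\n' "$renamed"
rm -rf "$workdir"
```

After the rename, plain glob order and natural order coincide, so an ordinary for loop over *.fas visits the files in the desired sequence.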

while IFS= read -r file ; do
ls -l "$file" # or whatever
done < <(find . -name '*.fas' 2>/dev/null | sed -r -e 's/([0-9]+)/ \1/' | sort -k 2 -n | sed -e 's/ //;')
Solves the problem, presuming the file naming stays consistent, doesn't rely on very-recent versions of GNU sort, does not rely on reading the output of ls and doesn't fall victim to the pipe-to-while problems.

Like @Kusalananda's solution (perhaps easier to remember?) but catering for all files(?):
mapfile -t array < <(ls | sed 's/[^0-9]*\([0-9]*\)\..*/\1 &/' | sort -n | sed 's/^[^ ]* //')
for x in "${array[@]}"; do echo "$x"; done
In essence add a sort key, sort, remove sort key.

Use sort -rh and a while loop:
du -sh * | sort -rh | grep -P "avi$" | awk '{print $2}' | while read -r f; do fp=$(pwd)/$f; echo "$fp"; done

Related

sort and get unique files after removing extension of filename

I am trying to remove the part of each filename after the second underscore and get the unique names. I saw many answers and put together a script. The script works up to the cut command, but it does not give the unique filenames. I have tried the following command, but I am not getting the desired output.
script used:
for filename in ${path/to/files}/*.gz;
do
fname=$(basename ${filename} | cut -f 1-2 -d "_" | sort | uniq)
echo "${fname}"
done
file example:
filename1_00_1.gz
filename1_00_2.gz
filename2_00_1.gz
filename2_00_2.gz
Required output:
filename1_00
filename2_00
So, with all of that said, how can I get a unique list of files in the required output format?
Thanks a lot in advance.
Apply sort and uniq after you are done printing the names, not inside the loop, where each call only sees a single name (sort first, because uniq only removes adjacent duplicates):
for filename in ${path/to/files}/*.gz;
do
    fname=$(basename "${filename}" | cut -f 1-2 -d "_")
    echo "${fname}"
done | sort | uniq
Or just do
for filename in ${path/to/files}/*.gz; do echo "${filename%_*.gz}"; done | sort | uniq
for f in *.gz; do echo ${f%_*.gz}; done | sort | uniq
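The order of sort and uniq matters because uniq only collapses duplicate lines that are adjacent. A small illustration with made-up names, in which the duplicates are deliberately not next to each other:

```shell
# Sample names (made up); the two copies of filename1_00 are not adjacent.
names='filename1_00
filename2_00
filename1_00'

# uniq sees no adjacent repeats, so the duplicate survives the later sort.
uniq_then_sort=$(printf '%s\n' "$names" | uniq | sort)

# Sorting first makes duplicates adjacent; sort -u deduplicates in one step.
sorted_unique=$(printf '%s\n' "$names" | sort -u)

printf 'uniq | sort:\n%s\n\nsort -u:\n%s\n' "$uniq_then_sort" "$sorted_unique"
```

With glob output the names usually arrive already sorted, which is why uniq | sort often appears to work; sort | uniq (or sort -u) does not depend on that.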

Extract numbers and find missing numbers in filenames

In bash, I have a folder containing some subfolders and files as
folder1/
folder2/
script_1.ext
script_2.ext
script_4.ext
...
script_N.ext
, where N is a known large number. I would like to know which numbers are missing in the filenames.
I am trying to come up with simple code that extracts the numbers from the filenames (in the example, 1,2,4,...,N) and finds the numbers missing from 1:N (in the example, 3).
I am very new to bash scripts. I tried to find similar questions and answers but I couldn't.
Any input will be appreciated!
ps. I have tried
ls -1 | sed 's/script_//' | sed 's/.ext//'
and successfully extracted the numbers, but I am unsure how to save those numbers and compare with 1,...,N to obtain missing numbers.
Basically I want to find numbers in 1,...,N for a known N, that do not exist in the filenames.
Presuming script_ and .ext are common patterns among your files, loop through 1 to N, build the filenames, check their existence, and report the ones that are missing.
N=10 # known N
for ((i=1;i<=N;i++)); do
    f=script_$i.ext
    if [ ! -f "$f" ]; then
        printf '%s is missing\n' "$f"
    fi
done
Extracting the numbers is fine. Comparing with 1..N can be done as follows.
Assuming the numbers are in the range 1..100:
diff <(ls -1 | sed 's/script_//' | sed 's/\.ext$//' | sort -n) <(seq 1 100) | sed -n 's/> //p'
Or, if the upper bound is below 10 (so that lexical and numeric order coincide and no sort -n is needed):
diff <(ls -1 | sed 's/script_//' | sed 's/\.ext$//') <(seq 1 9) | sed -n 's/> //p'
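As an alternative to diff, the same comparison can be sketched with comm, which prints the lines unique to one of two sorted inputs. This is a different technique from the answer above, shown only for comparison; the scratch directory, the sample files, and N=7 are made up.

```shell
#!/usr/bin/env bash
# Hypothetical scratch directory mirroring the layout in the question.
workdir=$(mktemp -d)
mkdir "$workdir"/folder1 "$workdir"/folder2
touch "$workdir"/script_{1,2,4,5,7}.ext
N=7

# Numbers that are present in the file names.
present=$(cd "$workdir" && printf '%s\n' script_*.ext | sed 's/^script_//; s/\.ext$//')

# comm needs both inputs in the same (lexical) sort order.
# -13 keeps only the lines unique to the second input, i.e. the missing numbers;
# the final sort -n restores numeric order.
missing=$(comm -13 <(printf '%s\n' "$present" | sort) <(seq 1 "$N" | sort) | sort -n)

printf '%s\n' "$missing"
rm -rf "$workdir"
```

Globbing script_*.ext instead of parsing ls also keeps the subfolders out of the comparison.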

How do I find the "full symmetric difference" of several files in bash?

I have five files that each list full file paths like so:
File one
/full/file/path/one.xlsx
/full/file/path/two.txt
/full/file/path/three.pdf
....
File two
/a/b/c/d/r.txt
/full/file/path/two.txt
....
File three
/obe/two/three/graph.m
/full/file/path/two.txt
....
File four
.....
File five
.....
All five may contain the same exact full file paths. However, I want to filter out paths that are common to each file. In other words, I want the total intersection of all files removed. Below is a visual aid describing what I want with a smaller example of three files (excuse my poor mouse drawing skills):
The page on the symmetric difference did not describe exactly what I wanted, hence the visual aid and the quotes around the phrase full symmetric difference.
Question
How do I filter lines of text in several files to get the situation I want above?
Assuming that each file is free of duplicates, you could:
Concatenate all files (cat file1 file2 ... file5)
Count how often each line appears (sort | uniq -c)
And keep only lines which appeared fewer than five times (sed -En 's/^ *[1-4] //p')
sort file1 ... file5 | uniq -c | sed -En 's/^ *[1-4] //p'
However, if some file may contain the same line multiple times, then you would have to remove these duplicates first.
f() { sort -u "$1"; }
sort <(f file1) ... <(f file5) | uniq -c | sed -En 's/^ *[1-4] //p'
or (a bit slower but easier to edit)
for i in file1 ... file5; do sort -u "$i"; done |
sort | uniq -c | sed -En 's/^ *[1-4] //p'
If for some reason you want to keep duplicates from individual files and also want to retain the original order of lines, then you can invert the above command to only print lines which appeared in every file and remove these lines using grep:
f() { sort -u "$1"; }
grep -Fxvhf <(sort <(f file1) ... <(f file5) |
uniq -c | sed -En 's/^ *5 //p') file1 ... file5
or (a bit slower but easier to edit)
files=(file1 ... file5)
grep -Fxvhf <(for i in "${files[@]}"; do sort -u "$i"; done |
sort | uniq -c | sed -En 's/^ *5 //p') "${files[@]}"
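The same idea (count in how many files each line occurs, then drop the lines that occur in all of them) can also be sketched in a single awk pass, which avoids hard-coding the number of files in a regular expression. This is only a sketch: the three sample files and their paths are made up, and it assumes every awk operand is an input file with a distinct name, so that ARGC - 1 equals the number of files.

```shell
# Hypothetical sample files; /p/two.txt is the only path common to all three.
workdir=$(mktemp -d)
printf '%s\n' /p/one.xlsx /p/two.txt /p/three.pdf > "$workdir/file1"
printf '%s\n' /a/r.txt /p/two.txt                 > "$workdir/file2"
printf '%s\n' /o/graph.m /p/two.txt /p/one.xlsx   > "$workdir/file3"

result=$(awk '
    !seen[FILENAME, $0]++ { count[$0]++ }   # count each line once per file
    END {
        nfiles = ARGC - 1                   # number of file operands
        for (line in count)
            if (count[line] < nfiles) print line
    }' "$workdir/file1" "$workdir/file2" "$workdir/file3" | LC_ALL=C sort)

printf '%s\n' "$result"
rm -rf "$workdir"
```

Because duplicates within one file are counted only once, no separate sort -u step per file is needed. The order in which awk walks the array is unspecified, hence the final sort.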

File Name comparison in Bash

I have two files, each containing a list of file names. I need to check which files are missing from the second list. The problem is that I must not match the full name, only the last 19 characters of each file name.
E.g
MyFile12343220150510230000.xlsx
and
MyFile99999620150510230000.xlsx
are same files.
This is a unique problem and I don't know how to start. Kindly help.
awk based solution:
$ awk '
{start=length($0) - 18;}
NR==FNR{a[substr($0, start)]++; next;} # save the last 19 characters of every line in file2.list
{if(!a[substr($0, start)]) print $0;} # if that key is not present in file2.list, print the line from file.list
' file2.list file.list
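A worked example of this approach on two tiny lists is sketched below. The first name in each list is taken from the question; the second name in file.list (with a different timestamp) is made up, as is the scratch directory.

```shell
# Hypothetical sample lists.
workdir=$(mktemp -d)
printf '%s\n' MyFile12343220150510230000.xlsx \
              MyFile12343220150511230000.xlsx > "$workdir/file.list"
printf '%s\n' MyFile99999620150510230000.xlsx > "$workdir/file2.list"

missing=$(awk '
    { key = substr($0, length($0) - 18) }   # last 19 characters of the line
    NR == FNR { seen[key] = 1; next }       # first operand: remember its keys
    !(key in seen)                          # second operand: print unmatched lines
' "$workdir/file2.list" "$workdir/file.list")

printf '%s\n' "$missing"
rm -rf "$workdir"
```

Only the entry whose last 19 characters have no counterpart in file2.list is printed; the first entry matches despite the different number in the middle of the name.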
First you can use comm to match the exact file names and obtain a list of the files that do not match. Then you can use agrep. I've never used it, but you might find it useful.
Or, as a last option, you can brute-force it and, for every line in the first file, search the second:
#!/bin/bash
# Iterate through the first file
while IFS= read -r LINE; do
    # Find the section of the filename that has to match in the other file
    CHECK_SECTION="$(echo "$LINE" | sed -nre 's/^.*([0-9]{14})\.(.*)$/\1.\2/p')"
    # Create a regex to match the filenames in the second file
    SEARCH_REGEX="^.*$CHECK_SECTION$"
    # Search...
    grep -E "$SEARCH_REGEX" inputFile_2.txt
done < inputFile_1.txt
Here I assumed the filenames end with 14 digits that must match in the other file and a file extension that can be different from file to file but that has to match too:
MyFile12343220150510230000.xlsx
| variable | 14digits |.ext
So, if the first file is FILE1 and the second file is FILE2, and the intention is only to identify the files in FILE1 that are missing from FILE2, the following should do:
tmp1=$(mktemp)
tmp2=$(mktemp)
cat $FILE1 | rev | cut -c -19 | sort | uniq > ${tmp1}
cat $FILE2 | rev | cut -c -19 | sort | uniq > ${tmp2}
diff ${tmp1} ${tmp2} | rev
rm ${tmp1} ${tmp2}
In a nutshell, this reverses the characters on each line, and extracts the part you're interested in, saving to a temporary file, for each list of files. The reversal of characters is done since you haven't said whether or not the length of filenames is guaranteed to be constant---the only thing we can rely on here is that the last 19 characters are of a fixed format (in this case, although the format is easily inferred, it isn't really relevant). The sort is important in order for the diff to show you what's not in the second file that is in the first.
If you're certain that there will only ever be files missing from FILE2 and not the other way around (that is, files in FILE2 that don't exist in FILE1), then you can clean things up by removing the cruft introduced by diff, so the last line becomes:
diff ${tmp1} ${tmp2} | rev | grep -i xlsx | sed 's/[[:space:]]\+.*//'
The grep limits the output to those lines with xlsx filenames, and the sed removes everything on a line from the first space encountered onwards.
Of course, technically this only tells you which timestamp groups of files exist in FILE1 but not in FILE2. As I understand it, this is what you're looking for (my understanding of your problem description is that MyFile12343220150510230000.xlsx and MyFile99999620150510230000.xlsx would have identical content). If the file names are always the same length (as you subsequently affirmed), then there's no need for the rev commands, and the cut commands can just be amended to refer to fixed character positions.
In any case, to get the final list of files, you'll have to use the "cleaned up" output to filter the content of FILE1; so, modifying the script above so that it includes the "cleanup" command, we can filter the files that you need using a grep--the whole script then becomes:
tmp1=$(mktemp)
tmp2=$(mktemp)
missing=$(mktemp)
cat $FILE1 | rev | cut -c -19 | sort | uniq > ${tmp1}
cat $FILE2 | rev | cut -c -19 | sort | uniq > ${tmp2}
diff ${tmp1} ${tmp2} | rev | grep -i xlsx | sed 's/[[:space:]]\+.*//' > ${missing}
grep -E "("`echo $(<${missing}) | sed 's/[[:space:]]/|/g'`")" "$FILE1"
rm ${tmp1} ${tmp2} ${missing}
The extended grep command (-E) just builds up an "or" regular expression for each timestamp-plus-extension and applies it to the first file. Of course, this is all assuming that there will never be timestamp-groups that exist in FILE2 and not in FILE1--if this is the case, then the "diff output processing" bit needs to be a little more clever.
Or you could use standard coreutils tools:
for i in $(cat file1 file2 | sort | uniq -u); do
    grep -q "$i" file1 && \
        echo "file2 missing '$i'" || \
        echo "file1 missing '$i'"
done
It will identify which non-common entries are missing from which file. You can also manipulate the non-common filenames in any way you like, e.g. parameter expansion/substring extraction, substring removal, or character indexes.

How to sort the results of find (including nested directories) alphabetically in bash

I have a list of directories based on the results of running the "find" command in bash. As an example, the result of find are the files:
test/a/file
test/b/file
test/file
test/z/file
I want to sort the output so it appears as:
test/file
test/a/file
test/b/file
test/z/file
Is there any way to sort the results within the find command, or by piping the results into sort?
If you have the GNU version of find, try this:
find test -type f -printf '%h\0%d\0%p\n' | sort -t '\0' -n | awk -F '\0' '{print $3}'
To use these file names in a loop, do
find test -type f -printf '%h\0%d\0%p\n' | sort -t '\0' -n | awk -F '\0' '{print $3}' | while IFS= read -r file; do
    # use "$file"
    printf '%s\n' "$file"
done
The find command prints three things for each file: (1) its directory, (2) its depth in the directory tree, and (3) its full name. By including the depth in the output we can use sort -n to sort test/file above test/a/file. Finally we use awk to strip out the first two columns since they were only used for sorting.
Using \0 as a separator between the three fields allows us to handle file names with spaces and tabs in them (but not newlines, unfortunately).
$ find test -type f
test/b/file
test/a/file
test/file
test/z/file
$ find test -type f -printf '%h\0%d\0%p\n' | sort -t '\0' -n | awk -F'\0' '{print $3}'
test/file
test/a/file
test/b/file
test/z/file
If you are unable to modify the find command, then try this convoluted replacement:
find test -type f | while read file; do
printf '%s\0%s\0%s\n' "${file%/*}" "$(tr -dc / <<< "$file")" "$file"
done | sort -t '\0' | awk -F'\0' '{print $3}'
It does the same thing, with ${file%/*} being used to get a file's directory name and the tr command being used to count the number of slashes, which is equivalent to a file's "depth".
(I sure hope there's an easier answer out there. What you're asking doesn't seem that hard, but I am blanking on a simple solution.)
find test -type f -printf '%h\0%p\n' | sort | awk -F'\0' '{print $2}'
The result of find is, for example,
test/a'\0'test/a/file
test'\0'test/file
test/z'\0'test/z/file
test/b'\0'test/b/text file.txt
test/b'\0'test/b/file
where '\0' stands for null character.
These compound strings can be properly sorted with a simple sort:
test'\0'test/file
test/a'\0'test/a/file
test/b'\0'test/b/file
test/b'\0'test/b/text file.txt
test/z'\0'test/z/file
And the final result is
test/file
test/a/file
test/b/file
test/b/text file.txt
test/z/file
(Based on John Kugelman's answer, with the "depth" element removed, as it is redundant.)
If you want to sort alphabetically, the best way is:
find test -print0 | sort -z
(The example in the original question actually wanted files before directories, which is not the same and requires extra steps)
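To consume the NUL-delimited, sorted output in a loop, one option is read -d '' in Bash. The sketch below assumes GNU sort (for -z) and a find that supports -print0; the scratch directory and its contents are made up, and LC_ALL=C is set only to make the byte-wise order reproducible.

```shell
#!/usr/bin/env bash
# Hypothetical directory tree mirroring the question, plus one name with a space.
workdir=$(mktemp -d)
mkdir -p "$workdir"/test/{a,b,z}
touch "$workdir/test/file" "$workdir/test/a/file" \
      "$workdir/test/b/text file.txt" "$workdir/test/z/file"

sorted=()
# read -d '' splits on NUL, so spaces and even newlines in names are preserved.
while IFS= read -r -d '' path; do
    sorted+=("$path")
done < <(cd "$workdir" && find test -type f -print0 | LC_ALL=C sort -z)

printf '%s\n' "${sorted[@]}"
rm -rf "$workdir"
```

In this plain alphabetical order, test/file lands between test/b/... and test/z/..., not at the top, which illustrates why the files-before-directories order from the question needs the extra sort key used in the other answers.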
Try this. For reference, it first sorts on the second field starting at its second character, which only exists for the file directly under test; the r flag reverses that key, so this file comes first. After that it sorts on the second field starting at its first character. (-t is the field delimiter, -k is the key.)
find test -name file | sort -t'/' -k2.2r -k2.1
Run info sort for more details. There are many ways to combine -t and -k to get different results.

Resources