How to sort the results of find (including nested directories) alphabetically in bash

I have a list of directories based on the results of running the "find" command in bash. As an example, the results of find are these files:
test/a/file
test/b/file
test/file
test/z/file
I want to sort the output so it appears as:
test/file
test/a/file
test/b/file
test/z/file
Is there any way to sort the results within the find command, or by piping the results into sort?

If you have the GNU version of find, try this:
find test -type f -printf '%h\0%d\0%p\n' | sort -t '\0' -n | awk -F '\0' '{print $3}'
To use these file names in a loop, do
find test -type f -printf '%h\0%d\0%p\n' | sort -t '\0' -n | awk -F '\0' '{print $3}' | while IFS= read -r file; do
    # use "$file"
done
The find command prints three things for each file: (1) its directory, (2) its depth in the directory tree, and (3) its full name. By including the depth in the output we can use sort -n to sort test/file above test/a/file. Finally we use awk to strip out the first two columns since they were only used for sorting.
Using \0 as a separator between the three fields allows us to handle file names with spaces and tabs in them (but not newlines, unfortunately).
$ find test -type f
test/b/file
test/a/file
test/file
test/z/file
$ find test -type f -printf '%h\0%d\0%p\n' | sort -t '\0' -n | awk -F'\0' '{print $3}'
test/file
test/a/file
test/b/file
test/z/file
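If you want to reproduce this demonstration, the sample tree from the question can be recreated with:
mkdir -p test/a test/b test/z
touch test/file test/a/file test/b/file test/z/file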
If you are unable to modify the find command, then try this convoluted replacement:
find test -type f | while IFS= read -r file; do
    printf '%s\0%s\0%s\n' "${file%/*}" "$(tr -dc / <<< "$file")" "$file"
done | sort -t '\0' | awk -F'\0' '{print $3}'
It does the same thing, with ${file%/*} being used to get a file's directory name and tr -dc / being used to keep only the slashes, so the second field is a string of slashes whose length corresponds to the file's "depth".
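As an aside, a pure-bash way to derive the same depth without the tr subshell is parameter expansion (a sketch; the slashes and depth variable names are mine):
file='test/a/file'
slashes=${file//[!\/]/}   # delete every character that is not a slash
depth=${#slashes}         # 2 - the number of slashes in the path
echo "$depth"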
(I sure hope there's an easier answer out there. What you're asking doesn't seem that hard, but I am blanking on a simple solution.)

find test -type f -printf '%h\0%p\n' | sort | awk -F'\0' '{print $2}'
The result of find is, for example,
test/a'\0'test/a/file
test'\0'test/file
test/z'\0'test/z/file
test/b'\0'test/b/text file.txt
test/b'\0'test/b/file
where '\0' stands for the null character.
These compound strings can be properly sorted with a simple sort:
test'\0'test/file
test/a'\0'test/a/file
test/b'\0'test/b/file
test/b'\0'test/b/text file.txt
test/z'\0'test/z/file
And the final result is
test/file
test/a/file
test/b/file
test/b/text file.txt
test/z/file
(Based on John Kugelman's answer, with the "depth" element removed, as it is entirely redundant.)

If you want to sort alphabetically, the best way is:
find test -print0 | sort -z
(The example in the original question actually wanted files sorted before the contents of their subdirectories, which is not the same thing and requires extra steps.)
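To consume that NUL-delimited output in a loop without breaking on odd filenames, bash's read can split on NUL; a sketch (the printf body is just a placeholder for whatever you do with each path):
find test -print0 | sort -z | while IFS= read -r -d '' path; do
    printf '%s\n' "$path"
done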

Try this. For reference, it first sorts on the second character of the second field, which only exists for the top-level file, with r for reverse so that it sorts first; after that it sorts on the first character of the second field. [-t is the field delimiter, -k is the key]
find test -name file |sort -t'/' -k2.2r -k2.1
Run info sort for more details; there are a ton of different ways to combine -t and -k to get different results.
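With the sample tree from the original question, this produces the requested order:
$ find test -name file | sort -t'/' -k2.2r -k2.1
test/file
test/a/file
test/b/file
test/z/file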

Related

filename group by a pattern and select only one from each group

I have the following files (60,000+ actually; this is just a sample), and all the log files follow this pattern:
analyse-ABC008795-84865-201911261249.log
analyse-ABC008795-84866-201911261249.log
analyse-ABC008795-84867-201911261249.log
analyse-ABC008795-84868-201911261249.log
analyse-ABC008795-84869-201911261249.log
analyse-ABC008796-84870-201911261249.log
analyse-ABC008796-84871-201911261249.log
analyse-ABC008796-84872-201911261249.log
analyse-ABC008796-84873-201911261249.log
Only the numbers change across the log files. I want to take one file from each category, where files are categorized by their ABC.... number. So, as you can see, there are only two categories here:
analyse-ABC008795
analyse-ABC008796
So, what I want is one file (let's say the first file) from each category. The output should look like this:
analyse-ABC008795-84865-201911261249.log
analyse-ABC008796-84870-201911261249.log
This should be done in a Bash/Linux environment, so that after I get this list, I can use grep to check whether my "searching string" is contained in those files:
ls -l | <what should I do to group and get one file from each category> | grep "searching string"
With bash and awk:
files=(*.log)
printf '%s\n' "${files[@]}" | awk -F- '!seen[$2]++'
Or use find instead of a bash array for a more portable approach.
find . -type f -name '*.log' | awk -F- '!seen[$2]++'
If your find has the -printf action and you don't want the leading ./ on the filenames, add this before the pipe |:
-printf '%f\n'
The !seen[$2]++ removes the second and subsequent occurrences of each key without having to sort the input first. $2 is the second field of each line, as split by the -F- delimiter.
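To wire this into the grep step from the question, one option is to hand the surviving filenames to grep -l, which prints the names of the files that contain the search string (a sketch; it assumes the log file names contain no spaces or newlines, which holds for the samples):
find . -type f -name '*.log' | awk -F- '!seen[$2]++' | xargs grep -l "searching string"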

How to extract codes using the grep command?

I have a file with the following input lines:
John|1|R|Category is not found for local configuration/code/123.NNN and customer 113
TOM|2|R|Category is not found for local configuration/code/123.NNN and customer 114
PETER|3|R|Category is not found for local configuration/code/456.1 and customer 115
I need to extract only the code portion (123.NNN, 456.1) using the grep command.
I tried the command below and didn't get the proper result; I'm getting 2 extra unwanted characters in the output. Please suggest another way to achieve this with grep.
find ./ -type f -name <FileName> -exec cut -f 4 -d'|' {} + |
grep -o 'Category is not found for local configuration/code/...\....' |
grep -o '...\....' | sort | uniq
Current Output:
123.NNN
456.1 a
Expected output:
123.NNN
456.1
You can use another grep regular expression.
find ./ -type f -name f -exec cut -f 4 -d'|' {} + |
grep -o 'Category is not found for local configuration/code/...\.[^ ]*' |
grep -o '...\..*' | sort | uniq
The . matches any character; [^ ]* matches any sequence of characters up to the first space.
Output:
123.NNN
456.1
Your regex specifies a fixed character width for strings of variable width. Based on your examples, something like
[0-9]\+\.[A-Z0-9]\+
would seem like a better regex. However, we could probably also simplify this by merging the cut and multiple grep commands into a single Awk script.
find etc etc -exec awk -F '|' '
$4 ~ /Category is not found for local configuration\/code\/[0-9]{3}\.[0-9A-Z]/ {
split($4, a, /\/code\/);
split(a[2], b); print b[1] }' {} + |
sort -u
The two split operations are just a cheap way to pick out the text between /code/ and the next whitespace character; we have already established by way of the regex match that the string after /code/ matches the pattern we're after.
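To see the two splits in isolation, you can feed one sample line straight to awk:
echo 'PETER|3|R|Category is not found for local configuration/code/456.1 and customer 115' |
awk -F'|' '{ split($4, a, /\/code\//); split(a[2], b); print b[1] }'
This prints 456.1: the first split leaves everything after /code/ in a[2], and the second split (on whitespace, the default) isolates the code.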
Notice also how sort has a -u option which allows you to replace (trivial cases of) uniq.
The regex variant supported by Awk is slightly different from that supported by POSIX grep; the backslashed \+ in grep's BRE dialect is a plain + in the dialect called ERE, which is [more or less] supported by Awk - and by grep -E. If you have grep -P, you can use a third variant which has a convenient feature:
find etc etc -exec grep -oP '^([^|]*[|]){3}[^|]*Category is not found for local configuration/code/\K[0-9]{3}\.[0-9A-Z]+' {} + |
sort -u
The \K says "match up through here, but forget everything before this" and so only prints the part after this token.
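For example, on one of the sample lines:
echo 'John|1|R|Category is not found for local configuration/code/123.NNN and customer 113' |
grep -oP '/code/\K[0-9]{3}\.[0-9A-Z]+'
prints just 123.NNN, because everything matched before \K is excluded from the output.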
With sed:
sed -E -n 's#.*code/(.*)\s+and.*#\1#p' file.txt | uniq
Output:
123.NNN
456.1
I'd use the -P option:
grep -oP '/code/\K\S+' file | sort -u
You want to extract the non-whitespace characters following /code/
An awk using match():
$ awk 'match($0,/[0-9]+\.[A-Z0-9]+/)&&++a[(b=substr($0,RSTART,RLENGTH))]==1{print b}' file
Output:
123.NNN
456.1
Pretty printed for slightly better readability:
$ awk '
match($0,/[0-9]+\.[A-Z0-9]+/) && ++a[(b=substr($0,RSTART,RLENGTH))]==1 {
print b
}' file
It's not possible just using grep. You should use AWK instead:
awk '{split($7, ar, "/"); print ar[3]}' FILE
Explanation:
The split function splits a string, here $7 (the 7th whitespace-separated field), placing the result in the array ar and using / as the delimiter.
Then we print the 3rd element of the array.
Note:
I am assuming that all of your input looks like the samples you have given us, i.e.:
aaa|b|c|ddd is not found for local configuration/code/111.nnn and customer nnn
Where aaa and ddd will not contain whitespace.
I also assume you really do have a file FILE containing those lines. It's a bit unclear.
Input:
▶ cat FILE
John|1|R|Category is not found for local configuration/code/123.NNN and customer 113
TOM|2|R|Category is not found for local configuration/code/123.NNN and customer 114
PETER|3|R|Category is not found for local configuration/code/456.1 and customer 115
Output:
▶ awk '{split($7, ar, "/"); print ar[3]}' FILE
123.NNN
123.NNN
456.1
A single sed can do the filtering.
(The pattern can be further generalized as others have suggested, if that is an option, but be careful not to oversimplify it to the point where it matches unexpected input.)
sed -nE 's#(\S+\s+){6}configuration/code/(\S+)\s.*#\2#p' input.txt
To replace your exact command:
find ./ -type f -name <Filename> -exec cat {} \; | sed -nE 's#(\S+\s+){6}configuration/code/(\S+)\s.*#\2#p' | sort | uniq
Simple substitutions on individual lines are the job sed is best suited for. This will work using any sed in any shell on any UNIX box:
$ cat file
John|1|R|Category is not found for local configuration/code/123.NNN and customer 113
TOM|2|R|Category is not found for local configuration/code/123.NNN and customer 114
PETER|3|R|Category is not found for local configuration/code/456.1 and customer 115
$ sed -n 's:.*Category is not found for local configuration/code/\([^ ]*\).*:\1:p' file | sort -u
123.NNN
456.1

Bash, getting the latest folder based on its name which is a date

Can anyone tell me how, using bash, to get the name of the latest folder, where folder names are formatted as dates? For example:
20161121/
20161128/
20161205/
20161212/
The output should be: 20161212
Just use GNU sort with the -nr flags for a reverse numerical sort.
find . ! -path . -type d | sort -nr | head -1
As an example structure, I have the following folders in my current path:
find . ! -path . -type d
./20161121
./20161128
./20161205
./20161212
See how the sort picks up the folder you need,
find . ! -path . -type d | sort -nr
./20161212
./20161205
./20161128
./20161121
and head -1 for first entry alone,
find . ! -path . -type d | sort -nr | head -1
./20161212
To store it in a variable, use command substitution $(), as in:
myLatestFolder=$(find . ! -path . -type d | sort -nr | head -1)
Sorting everything seems like extra work if all you want is a single entry. It could be especially problematic if you need to sort a very large number of entries. Plus, you should note that find-based solutions will by default traverse subdirectories, which might or might not be what you're after.
$ shopt -s extglob
$ mkdir 20160110 20160612 20160614 20161120
$ printf '%d\n' 20+([0-9]) | awk '$1>d{d=$1} END{print d}'
20161120
$
While the pattern 20+([0-9]) doesn't precisely match dates (it's hard to validate dates without at least a couple of lines of code), we've at least got a bit of input validation via printf, and a simple "print the highest" awk one-liner to parse printf's results.
Oh, also, this handles any directory entries that are named appropriately, and does not validate that they are themselves directories. That too would require either an extra test or a different tool.
One method to require items to be directories would be the use of a trailing slash:
$ touch 20161201
$ printf '%s\n' 20+([0-9])/ | awk '$1>d{d=$1} END{print d}'
20161120/
But that loses the input validation (the %d format for printf).
If you felt like it, you could build a full pattern for your directory names though:
$ dates='20[01][0-9][01][0-9][0-3][0-9]'
$ printf '%s\n' $dates/ | awk '$1>d{d=$1} END{print d}'
20161120/
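For completeness, a plain loop can combine both checks, relying on the trailing slash for the directory requirement and on the fact that YYYYMMDD names compare correctly as strings (a sketch; latest is my variable name):
latest=
for d in 20[01][0-9][01][0-9][0-3][0-9]/; do
    [ -d "$d" ] || continue   # skips the unexpanded pattern when nothing matches
    d=${d%/}
    [ "$d" \> "$latest" ] && latest=$d
done
echo "$latest"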

SHELL printing just right part after . (DOT)

I need to find just the extensions of all the files in a directory (if two files have the same extension, it should appear only once). I already have something, but the output of my script looks like:
test.txt
test2.txt
hello.iso
bay.fds
hellllu.pdf
I'm using grep -e '.' and it just highlights the dots.
And I need just the extensions, given in one variable, like txt,iso,fds,pdf.
Is there anyone who could help? I had this working once using an array, but today I found out it has to work on dash too.
You can use find with awk to get all unique extensions:
find . -type f -name '?*.?*' -print0 |
awk -F. -v RS='\0' '!seen[$NF]++{print $NF}'
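Note that a NUL record separator (RS='\0') is a GNU awk feature; with a strictly POSIX awk you can fall back to newline-separated records, at the cost of newline-proof filenames:
find . -type f -name '?*.?*' | awk -F. '!seen[$NF]++{print $NF}'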
This can be done with find as well, but I think this is easier:
for f in *.*; do echo "${f##*.}"; done | sort -u
If you want a comma-separated list of the unique extensions assigned to a variable, you can do this:
ext=$(for f in *.*; do echo "${f##*.}"; done | sort -u | paste -sd,)
echo $ext
csv,pdf,txt
Alternatively, with ls:
ls -1 *.* | rev | cut -d. -f1 | rev | sort -u | paste -sd,
The rev/rev pair is required if a filename contains more than one dot, assuming the extension comes after the last dot. For any other directory, simply change *.* to dirpath/*.* in all of the scripts.
I'm not sure I understand your comment. If you don't assign the result to a variable, it is printed to standard output by default. If you want to pass the directory name to a script, put the code into a script file and replace dirpath with $1, assuming the directory will be the first argument to the script:
#!/bin/bash
# print unique extensions in the directory passed as an argument, e.g.
ls -1 "$1"/*.* ...
If you have subdirectories whose names contain dots, the scripts above will include them as well; to limit the results to regular files only, replace the ls command with:
find . -maxdepth 1 -type f -name "*.*" | ...
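Since the question mentions dash: the for-loop above is already POSIX, so a /bin/sh version that stores the comma-separated list needs no bashisms (a sketch; ext is my variable name):
ext=$(for f in ./*.*; do printf '%s\n' "${f##*.}"; done | sort -u | paste -sd, -)
echo "$ext"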

Force sort command to ignore folder names

I ran the following from a base folder ./
find . -name *.xvi.txt | sort
Which returns the following sort order:
./LT/filename.2004167.xvi.txt
./LT/filename.2004247.xvi.txt
./pred/2004186/filename.2004186.xvi.txt
./pred/2004202/filename.2004202.xvi.txt
./pred/2004222/filename.2004222.xvi.txt
As you can see, the filenames follow a regular structure, but the files themselves might be located in different parts of the directory structure. Is there a way of ignoring the folder names and/or directory structure so that the sort returns a list of folders/filenames based ONLY on the file names themselves? Like so:
./LT/filename.2004167.xvi.txt
./pred/2004186/filename.2004186.xvi.txt
./pred/2004202/filename.2004202.xvi.txt
./pred/2004222/filename.2004222.xvi.txt
./LT/filename.2004247.xvi.txt
I've tried a few different switches under the find and sort commands, but no luck. I could always copy everything out to a single folder and sort from there, but there are several hundred files, and I'm hoping that a more elegant option exists.
Thanks! Your help is appreciated.
If your find has -printf you can print both the base filename and the full filename. Sort by the first field, then strip it off.
find . -name '*.xvi.txt' -printf '%f %p\n' | sort -k1,1 | cut -f 2- -d ' '
I have chosen a space as the delimiter. If your filenames include spaces, you should choose another delimiter, one that does not appear in your filenames. If any filenames include newlines, you'll have to modify this, because it won't work.
Note that the glob in the find command should be quoted.
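If your coreutils are recent enough, GNU sort and cut both accept -z for NUL-terminated records, which survives spaces and even newlines in filenames, as long as no filename contains a tab (an assumption here):
tab=$(printf '\t')
find . -name '*.xvi.txt' -printf '%f\t%p\0' | sort -z -t "$tab" -k1,1 | cut -z -f 2- | tr '\0' '\n'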
If your find doesn't have printf, you could use awk to accomplish the same thing:
find . -name '*.xvi.txt' | awk -F / '{ print $NF, $0 }' | sort | sed 's/.* //'
The same caveats about spaces that Dennis Williamson mentioned apply here. And for variety, I'm using sed to strip off the sort field, instead of cut.
find . -name '*.xvi.txt' | sort -t'.' -k3 -n
will sort it as you want. The only problem arises if a filename or directory name includes additional dots.
To avoid that, you can use:
find . -name '*.xvi.txt' | sed 's/[0-9]\+.xvi.txt$/\\&/' | sort -t'\' -k2 | sed 's/\\//'
