How do I sort file paths based on multiple embedded numbers? - bash

I have run a program to generate results for different values of the parameters R, C and RP, which are reflected in the directory name containing each output file; the output files themselves are all named results.txt.
For instance, in the directory name params_R_7_C_16_RP_0, 7 is the value of parameter R, 16 is the value of parameter C and 0 is the value of parameter RP.
I want to get all results.txt files in the current directory tree, sorted by the embedded values of R, C and RP in their hosting directories.
I first use the following command to get the results.txt files that I want to parse:
find ./ -name "results.txt"
and the output is:
./params_R_11_C_9_RP_0/results.txt
./params_R_7_C_9_RP_0/results.txt
./params_R_7_C_4_RP_0/results.txt
./params_R_11_C_16_RP_0/results.txt
./params_R_9_C_4_RP_0/results.txt
./params_R_5_C_9_RP_0/results.txt
./params_R_9_C_25_RP_0/results.txt
./params_R_7_C_16_RP_0/results.txt
./params_R_5_C_25_RP_0/results.txt
./params_R_5_C_16_RP_0/results.txt
./params_R_11_C_4_RP_0/results.txt
./params_R_9_C_16_RP_0/results.txt
./params_R_7_C_25_RP_0/results.txt
./params_R_11_C_25_RP_0/results.txt
./params_R_5_C_4_RP_0/results.txt
./params_R_9_C_9_RP_0/results.txt
and I tried the following sort command:
find ./ -name "results.txt" | sort
which results in lexical sorting:
./params_R_11_C_16_RP_0/results.txt
./params_R_11_C_25_RP_0/results.txt
./params_R_11_C_4_RP_0/results.txt
./params_R_11_C_9_RP_0/results.txt
./params_R_5_C_16_RP_0/results.txt
./params_R_5_C_25_RP_0/results.txt
./params_R_5_C_4_RP_0/results.txt
./params_R_5_C_9_RP_0/results.txt
./params_R_7_C_16_RP_0/results.txt
./params_R_7_C_25_RP_0/results.txt
./params_R_7_C_4_RP_0/results.txt
./params_R_7_C_9_RP_0/results.txt
./params_R_9_C_16_RP_0/results.txt
./params_R_9_C_25_RP_0/results.txt
./params_R_9_C_4_RP_0/results.txt
./params_R_9_C_9_RP_0/results.txt
But what I actually want is selective numerical sorting: first by R value, then C, then RP:
./params_R_5_C_4_RP_0/results.txt
./params_R_5_C_9_RP_0/results.txt
./params_R_5_C_16_RP_0/results.txt
./params_R_5_C_25_RP_0/results.txt
./params_R_7_C_4_RP_0/results.txt
./params_R_7_C_9_RP_0/results.txt
./params_R_7_C_16_RP_0/results.txt
./params_R_7_C_25_RP_0/results.txt
./params_R_9_C_4_RP_0/results.txt
./params_R_9_C_9_RP_0/results.txt
./params_R_9_C_16_RP_0/results.txt
./params_R_9_C_25_RP_0/results.txt
...
I considered padding the embedded numbers (e.g., params_R_005_C_004_RP_0) when generating the paths list, but that would require an additional processing step, which I want to avoid.
Can the desired sorting be achieved directly?

You need the -V (version sort) flag for sort:
find ./ -name "results.txt" | sort -V

If you use GNU sort (a recent-enough version), @Fabricator's answer, based on GNU sort's -V option, is by far the simplest solution.
Otherwise, try this POSIX-compliant solution:
find . -name 'results.txt' | sort -n -t _ -k3,3 -k5,5 -k7,7
-n specifies numeric sorting
-t _ splits the input line into fields based on separator char. _
-k3,3 -k5,5 -k7,7 sorts the input based first on field 3, then field 5, then field 7, corresponding to the R, C and RP values.
(Note that using -k with a single number - e.g., -k3 - would instead result in sorting from field 3 through the remainder of the line).
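To see why those are the right field numbers, here is how the _ separator splits one of the example paths (a quick illustrative check with awk; it is not part of the solution itself):
awk -F_ '{ for (i = 1; i <= NF; i++) print i, $i }' <<< './params_R_7_C_16_RP_0/results.txt'
which prints:
1 ./params
2 R
3 7
4 C
5 16
6 RP
7 0/results.txt
Field 7 is 0/results.txt rather than a bare number, but with -n sort still reads its leading digits (here 0) as the numeric key, so the RP value is compared correctly.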

Try:
find ./ -name "results.txt" | sort -k 3 -t _ -n -k 5 -n

Related

How do I concatenate and rename split file pairs with a loop?

I have a directory that looks like this:
S-100-1-54359386-left.fastq.gz
S-100-1-54469454-left.fastq.gz
S-20-1-54356384-left.fastq.gz
S-20-1-54468477-left.fastq.gz
S-40-1-54343370-left.fastq.gz
S-40-1-54465479-left.fastq.gz
S-100-2-54359386-left.fastq.gz
S-100-2-54469454-left.fastq.gz
S-20-2-54356384-left.fastq.gz
S-20-2-54468477-left.fastq.gz
S-40-2-54343370-left.fastq.gz
S-40-2-54465479-left.fastq.gz
Each pair of consecutive files needs to be concatenated and given a unique name. I can use the following for each pair:
zcat S-40-2-54343370-left.fastq.gz S-40-2-54465479-left.fastq.gz | \
gzip -c > S-40-2.left.fq.gz
...but I would like something more elegant. Notice that the number after the second dash also changes. The number after the third dash (54359386, etc.) makes each file unique, but doesn't need to be preserved after the files are concatenated. Any advice? I'm not sure how to structure a loop to identify the pairs.
Assuming that each group contains only 2 files, here is a complex pipeline:
echo S-*.fastq.gz | tr ' ' '\n' | sort -t'-' -k1,3n | xargs -n2 bash \
-c 'fn=$(cut -d"-" -f1-3 <<<$0)".left.fq.gz"; zcat "$0" "$1" | gzip -c > "$fn"'
sort -t'-' -k1,3n - sorts the file list by the first 3 fields, treating - as the field separator
-n2 - xargs option that tells it to use at most 2 arguments per command line
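A loop-based alternative sketch, if you prefer to avoid xargs (it assumes the filenames contain no whitespace and that exactly two files share each S-<n>-<n> prefix, as in the listing above):
for prefix in $(ls S-*-left.fastq.gz | cut -d- -f1-3 | sort -u); do
    # e.g. prefix=S-40-2; the glob picks up both files of the pair
    zcat "$prefix"-*-left.fastq.gz | gzip -c > "$prefix.left.fq.gz"
done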

Counting occurrences of unique strings in bash without first sorting the data

I'm doing some data gathering on massive log files and I need to count the occurrences of unique strings. Generally the way this is done is with a command like:
zcat <file> | grep -o <filter> | sort | uniq -c | sort -n
What I'm looking to do is not pay the performance penalty of the sort after the grep. Is this possible to do without leaving bash?
You can use awk to count the uniques and avoid sort:
zgrep -o <filter> <file> |
awk '{count[$0]++} END{for (i in count) print count[i], i}'
Also note you can avoid zcat and call zgrep directly.
Since you mentioned you don't want to leave bash, you could try associative arrays: use the input lines as keys and the counts as values. To learn about associative arrays, see http://www.gnu.org/software/bash/manual/html_node/Arrays.html.
But, be sure to benchmark the performance - you may nevertheless be better off using sort and uniq, or perl, or ...
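A minimal pure-bash sketch of that associative-array approach (assumes bash 4+; <filter> and <file> are the same placeholders used above):
declare -A count
while IFS= read -r line; do
    count["$line"]=$(( ${count["$line"]:-0} + 1 ))   # bump the counter for this string
done < <(zgrep -o <filter> <file>)
for key in "${!count[@]}"; do
    printf '%s %s\n' "${count[$key]}" "$key"         # print "count string", unsorted
done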
jq has built-in associative arrays, so you could consider one of the following approaches, which are both efficient (like awk):
zgrep -o <filter> <file> |
jq -nR 'reduce inputs as $line ({}; .[$line] += 1)'
This would produce the results as a JSON object with the frequencies as the object's values, e.g.
{
"a": 2,
"b": 1,
"c": 1
}
If you want each line of output to consist of a count and value (in that order), then an appropriate jq invocation would be:
jq -nRr 'reduce inputs as $line ({}; .[$line] += 1)
| to_entries[] | "\(.value) \(.key)"'
This would produce output like so:
2 a
1 b
1 c
The jq options used here are:
-n # for use with `inputs`
-R # "raw" input
-r # "raw" output

Sorting issue in Bash Script

I have a whole file full of filenames that is output by the find command below:
find "$ARCHIVE" -type f -name '*_[0-9][0-9]' | sed 's/_[0-9][0-9]$//' > temp
I am now trying to sort these file names and count them to find out which one appears the most. The problem I am having is that whenever I execute:
sort -g temp
It prints all the sorted file names to the command line and I am unsure why. Any help with this issue would be greatly appreciated!
You may need this:
sort temp | uniq -c | sort -nr
First we sort temp, then we prefix each line with its number of occurrences (uniq -c), and finally we sort numerically in reverse order (sort -nr) so the most frequent filename comes first.
Example file:
/home/user/testfiles/405/prob405823
/home/user/testfiles/405/prob405823
/home/user/testfiles/527/prob527149
/home/user/testfiles/518/prob518433
Output:
2 /home/user/testfiles/405/prob405823
1 /home/user/testfiles/527/prob527149
etc..
Resources:
Linux / Unix Command: sort
uniq(1) - Linux man page
ptierno - comments to improve answer
You could do everything after the find in one awk command (this one uses GNU awk 4.*):
find "$ARCHIVE" -type f -name *_[0-9][0-9] |
awk '
{ cnt[gensub(/_[0-9][0-9]$/,"","")]++ }
END {
PROCINFO["sorted_in"] = "#val_num_desc"
for (file in cnt) {
print cnt, file
}
}
'

If xargs is map, what is filter?

I think of xargs as the map function of the UNIX shell. What is the filter function?
EDIT: it looks like I'll have to be a bit more explicit.
Let's say I have to hand a program which accepts a single string as a parameter and returns with an exit code of 0 or 1. This program will act as a predicate over the strings that it accepts.
For example, I might decide to interpret the string parameter as a filepath, and define the predicate to be "does this file exist". In this case, the program could be test -f, which, given a string, exits with 0 if the file exists, and 1 otherwise.
I also have to hand a stream of strings. For example, I might have a file ~/paths containing
/etc/apache2/apache2.conf
/foo/bar/baz
/etc/hosts
Now, I want to create a new file, ~/existing_paths, containing only those paths that exist on my filesystem. In my case, that would be
/etc/apache2/apache2.conf
/etc/hosts
I want to do this by reading in the ~/paths file, filtering those lines by the predicate test -f, and writing the output to ~/existing_paths. By analogy with xargs, this would look like:
cat ~/paths | xfilter test -f > ~/existing_paths
It is the hypothesized program xfilter that I am looking for:
xfilter COMMAND [ARG]...
Which, for each line L of its standard input, will call COMMAND [ARG]... L, and if the exit code is 0, it prints L, else it prints nothing.
To be clear, I am not looking for:
a way to filter a list of filepaths by existence. That was a specific example.
how to write such a program. I can do that.
I am looking for either:
a pre-existing implementation, like xargs, or
a clear explanation of why this doesn't exist
If map is xargs, filter is... still xargs.
Example: list files in the current directory and filter out non-executable files:
ls | xargs -I{} sh -c "test -x '{}' && echo '{}'"
This could be made handy through a (non-production-ready) function:
xfilter() {
xargs -I{} sh -c "$* '{}' && echo '{}'"
}
ls | xfilter test -x
Alternatively, you could use a parallel filter implementation via GNU Parallel:
ls | parallel "test -x '{}' && echo '{}'"
So, you're looking for the:
reduce( compare( filter( map(.. list()) ) ) )
which can be rewritten as
list | map | filter | compare | reduce
The main power of bash is pipelining, so there is no need for a special filter and/or reduce command. In fact, nearly all Unix commands can act in one (or more) of these roles:
list
map
filter
reduce
Imagine:
find mydir -type f -print | xargs grep -H '^[0-9]*$' | cut -d: -f 2 | sort -nr | head -1
^------list+filter------^ ^--------map-----------^ ^--filter--^ ^compare^ ^reduce^
Creating a test case:
mkdir ./testcase
cd ./testcase || exit 1
for i in {1..10}
do
strings -1 < /dev/random | head -1000 > file.$i.txt
done
mkdir emptydir
You will get a directory named testcase containing 10 files and one directory:
emptydir file.1.txt file.10.txt file.2.txt file.3.txt file.4.txt file.5.txt file.6.txt file.7.txt file.8.txt file.9.txt
Each file contains 1000 lines of random strings; some lines contain only numbers.
Now run the command:
find testcase -type f -print | xargs grep -H '^[0-9]*$' | cut -d: -f 2 | sort -nr | head -1
and you will get the largest number-only line across all the files, e.g. 42. (Of course, this can be done more efficiently; this is only a demo.)
Decomposed:
The find testcase -type f -print will print every plain file, so this is the LIST (already reduced to files only). Output:
testcase/file.1.txt
testcase/file.10.txt
testcase/file.2.txt
testcase/file.3.txt
testcase/file.4.txt
testcase/file.5.txt
testcase/file.6.txt
testcase/file.7.txt
testcase/file.8.txt
testcase/file.9.txt
The xargs grep -H '^[0-9]*$' acts as MAP: it runs a grep command for each file from the list. grep is usually used as a filter, e.g. command | grep, but here (with xargs) it transforms the input (filenames) into lines containing only digits. Output, many lines like:
testcase/file.1.txt:1
testcase/file.1.txt:8
....
testcase/file.9.txt:4
testcase/file.9.txt:5
The structure of each line is filename, colon, number. We want only the numbers, so we call a pure filter, cut -d: -f2, which strips the filename from each line. It outputs many lines like:
1
8
...
4
5
Now the reduce (getting the largest number): sort -nr sorts all the numbers numerically in reverse (descending) order, so its output is like:
42
18
9
9
...
0
0
and head -1 prints the first line (the largest number).
Of course, you can write your own list/filter/map/reduce functions directly with bash programming constructs (loops, conditionals and such), or you can employ any full-blown scripting language like Perl, special-purpose languages like awk or sed, or dc (RPN) and such.
Having a special filter command such as:
list | filter_command cut -d: -f 2
simply isn't needed, because you can directly use:
list | cut
You can have awk do the filter and reduce functions.
Filter (keep only the odd-numbered input lines):
awk 'NR % 2'
Reduce:
awk '{ p = p + $0 } END { print p }'
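For instance, chaining the two stages on purely illustrative input:
seq 1 10 | awk 'NR % 2' | awk '{ p = p + $0 } END { print p }'
# keeps 1 3 5 7 9, then sums them to 25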
I totally understand your question here as a long-time functional programmer, and here is the answer: Bash/Unix command pipelining isn't as clean as you'd hoped.
In the example above:
find mydir -type f -print | xargs grep -H '^[0-9]*$' | cut -d: -f 2 | sort -nr | head -1
^------list+filter------^ ^--------map-----------^ ^--filter--^ ^compare^ ^reduce^
a more pure form would look like:
find mydir | xargs -L 1 bash -c 'test -f "$1" && echo "$1"' _ | xargs grep -H '^[0-9]*$' | cut -d: -f 2 | sort -nr | head -1
^--list--^ ^---------------------filter----------------------^ ^-----------map----------^ ^-----map----^ ^compare^ ^reduce^
But, for example, grep also has a filtering capability: grep -q mypattern, which simply returns 0 if its input matches the pattern.
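For instance, a rough sketch using grep -q as such a predicate over filenames (the pattern TODO is only an illustration, and this assumes the filenames contain no newlines):
for f in *; do grep -q 'TODO' "$f" 2>/dev/null && echo "$f"; done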
To get something more like what you want, you would simply have to define a filter bash function and export it so that it is compatible with xargs.
But then you get into some problems. Like, test has binary and unary operators. How will your filter function handle this? And, what would you decide to output on true in these cases? Not insurmountable, but weird. Assuming only unary operations:
filter(){
    # evaluate the given test expression against each input line and
    # print the line only when the test succeeds
    # (note: [[ x > y ]] compares lexically, not numerically)
    while read -r LINE || [[ -n "${LINE}" ]]; do
        eval "[[ ${LINE} $1 ]]" 2> /dev/null && echo "$LINE"
    done
}
so you could do something like
seq 1 10 | filter "> 4"
5
6
7
8
9
As I wrote this, I kinda liked it.

Bash and sort files in order

With a previous bash script I created a list of files:
data_1_box
data_2_box
...
data_10_box
...
data_99_box
The thing is that now I need to concatenate them, so I tried
ls -l data_*
but I get
.....
data_89_box
data_8_box
data_90_box
...
data_99_box
data_9_box
but I need to get them in the succession 1, 2, 3, 4, ..., 9, ..., 89, 90, 91, ..., 99.
Can it be done in bash?
ls data_* | sort -n -t _ -k 2
-n: sorts numerically
-t: field separator '_'
-k: sort on second field, in your case the numbers after the first '_'
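Since the original goal was concatenation, the sorted list can then be fed straight to cat. A minimal sketch, assuming the file names contain no whitespace (all_data is just an illustrative output name):
cat $(ls data_* | sort -n -t _ -k 2) > all_data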
How about using the -v flag to ls? The purpose of the flag is to sort files according to version number, but it works just as well here and eliminates the need to pipe the result to sort:
ls -lv data_*
If your sort has version sort, try:
ls -1 | sort -V
(that's a capital V).
This is a generic answer! You have to adapt it to your specific set of data:
ls | sort
Example:
ls | sort -n -t _ -k 2
Maybe you'll like SistemaNumeri.py ("fix numbers"): it renames your
data_1_box
data_2_box
...
data_10_box
...
data_99_box
in
data_01_box
data_02_box
...
data_10_box
...
data_99_box
Here's the way to do it in bash if your sort doesn't have version sort:
cat <your_former_ls_output_file> | awk 'BEGIN { FS="_" } { printf("%03d\n", $2) }' | sort | awk '{ printf("data_%d_box\n", $1) }'
All in one line. Keep in mind, I haven't tested this on your specific data, so it might need a little tweaking to work correctly for you. This outlines a good, robust and relatively simple solution, though. Of course, you can always swap the cat+filename at the beginning with the actual ls to create the file list on the fly, as shown below. For capturing the actual filename column, you can choose between the correct ls parameters or piping through either cut or awk.
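For example, a sketch of the same pipeline fed directly from ls (assuming the data_N_box naming shown in the question):
ls data_*_box | awk 'BEGIN { FS="_" } { printf("%03d\n", $2) }' | sort | awk '{ printf("data_%d_box\n", $1) }'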
One suggestion I can think of is this:
for i in `seq 1 5`
do
cat "data_${i}_box"
done
I have files in a folder and need to sort them based on the number. E.g.:
abc_dr-1.txt
hg_io-5.txt
kls_er_we-3.txt
sd-4.txt
sl_rt_we_yh-2.txt
I need to sort them based on the number, so I used this:
ls -1 | sort -t '-' -nk2
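For the listing above, that sorts numerically on the second '-'-separated field (the number before .txt), which should give:
abc_dr-1.txt
sl_rt_we_yh-2.txt
kls_er_we-3.txt
sd-4.txt
hg_io-5.txt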
