Bash and sort files in order - bash

with a previous bash script I created a list of files:
data_1_box
data_2_box
...
data_10_box
...
data_99_box
the thing is that now I need to concatenate them, so I tried
ls -l data_*
but I get
.....
data_89_box
data_8_box
data_90_box
...
data_99_box
data_9_box
but I need them in the order 1, 2, 3, 4, ..., 9, ..., 89, 90, 91, ..., 99.
Can it be done in bash?

ls data_* | sort -n -t _ -k 2
-n: sorts numerically
-t: field separator '_'
-k: sort on second field, in your case the numbers after the first '_'
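To try this on sample names without touching the real directory, here is a minimal sketch. The function name sort_by_field2 is made up for illustration, and it assumes names of the form prefix_N_suffix with no whitespace:

```shell
# Hypothetical helper: sort lines shaped like prefix_N_suffix by the number N.
# -t _     use '_' as the field separator
# -k 2,2n  compare only field 2, numerically
sort_by_field2() {
  sort -t _ -k 2,2n
}

printf '%s\n' data_10_box data_9_box data_1_box data_89_box | sort_by_field2
```

For the concatenation itself, something like cat $(ls data_* | sort -t _ -k 2,2n) > combined should work, as long as the file names contain no whitespace.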

How about using the -v flag to ls? The purpose of the flag is to sort files according to version number, but it works just as well here and eliminates the need to pipe the result to sort:
ls -lv data_*

If your sort has version sort, try:
ls -1 | sort -V
(that's a capital V).

This is a generic answer! You have to apply rules to the specific set of data
ls | sort
Example:
ls | sort -n -t _ -k 2

maybe you'll like SistemaNumeri.py ("fix numbers"): it renames your
data_1_box
data_2_box
...
data_10_box
...
data_99_box
in
data_01_box
data_02_box
...
data_10_box
...
data_99_box

Here's the way to do it in bash if your sort doesn't have version sort:
cat <your_former_ls_output_file> | awk ' BEGIN { FS="_" } { printf( "%03d\n",$2) }' | sort | awk ' { printf( "data_%d_box\n", $1) }'
All in one line. Keep in mind that I haven't tested this on your specific data, so it might need a little tweaking. It outlines a robust and relatively simple approach, though. Of course, you can replace the cat and filename at the beginning with the actual ls to generate the list on the fly. To capture just the filename column, either pass the right ls options or pipe through cut or awk.
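A variation on the same idea that keeps the original name intact is to decorate, sort, then undecorate. This is only a sketch: pad_sort is a made-up name, the %05d width is an arbitrary choice, and it assumes names shaped like data_N_box:

```shell
# Prefix each name with a zero-padded copy of its number, sort on that prefix
# as plain text, then cut the prefix off again.
# Assumes names like data_<N>_box with N below 100000.
pad_sort() {
  awk -F_ '{ printf "%05d %s\n", $2, $0 }' | sort | cut -d' ' -f2-
}

printf '%s\n' data_10_box data_2_box data_1_box data_99_box | pad_sort
```

Because the padded key has a fixed width, a plain lexical sort gives numeric order, so this does not depend on sort -n or sort -V.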

One suggestion I can think of is this :
for i in `seq 1 5`
do
cat "data_${i}_box"
done

I have files in a folder and need to sort them based on the number. E.g. -
abc_dr-1.txt
hg_io-5.txt
kls_er_we-3.txt
sd-4.txt
sl_rt_we_yh-2.txt
I need to sort them based on number.
So I used this to sort.
ls -1 | sort -t '-' -nk2

Related

How to sort release version string in descending order with Bash

I have a list of release version strings that looks something like this:
releases=( "1.3.1243" "2.0.1231" "0.8.4454" "1.2.4124" "1.2.3231" "0.9.5231" )
How can I use bash to sort my releases array in descending order (so that the leftmost value has the highest precedence)?
After sorting, the example above would be in the following order:
"2.0.1231", "1.3.1243", "1.2.4124", "1.2.3231", "0.9.5231", "0.8.4454"
You can actually do it quite easily with command substitution and the version sort option to sort, e.g.
releases=($(printf "%s\n" "${releases[@]}" | sort -rV))
(note: the printf-trick simply separates the elements on separate lines so they can be piped to sort for sorting. printf "%s\n", despite having only one "%s" conversion specifier, will process all input)
Now releases contains:
releases=("2.0.1231" "1.3.1243" "1.2.4124" "1.2.3231" "0.9.5231" "0.8.4454")
releases=( "1.3.1243" "2.0.1231" "0.8.4454" "1.2.4124" "1.2.3231" "0.9.5231" )
sorted=( $(echo ${releases[*]} | sed 's/ /\n/g' | sort -t. -k1,1rn -k2,2rn -k3,3rn) )
echo ${sorted[*]}
This uses sed and sort to reverse sort the items, using . as the field separator, and treating each field as numeric:
2.0.1231 1.3.1243 1.2.4124 1.2.3231 0.9.5231 0.8.4454
releases=( "1.3.1243" "2.0.1231" "0.8.4454" "1.2.4124" "1.2.3231" "0.9.5231" )
readarray -t sorted < <(printf '%s\n' "${releases[@]}" | sort -Vr)
declare -p sorted
declare -a sorted=([0]="2.0.1231" [1]="1.3.1243" [2]="1.2.4124" [3]="1.2.3231" [4]="0.9.5231" [5]="0.8.4454")

Counting occurrences of unique strings in bash without first sorting the data

I'm doing some data gathering on massive log files and I need to count the occurrences of unique strings. Generally the way this is done is with a command like:
zcat <file> | grep -o <filter> | sort | uniq -c | sort -n
What I'm looking to do is not pay the performance penalty of the sort after the grep. Is this possible to do without leaving bash?
You can use awk to count the uniques and avoid sort:
zgrep -o <filter> <file> |
awk '{count[$0]++} END{for (i in count) print count[i], i}'
Also note you can avoid zcat and call zgrep directly.
Since you mentioned you don't want to leave bash: You could try it using associative arrays: You could use the input lines as key, and the count as value. To learn about associative arrays see here: http://www.gnu.org/software/bash/manual/html_node/Arrays.html.
But, be sure to benchmark the performance - you may nevertheless be better off using sort and uniq, or perl, or ...
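A minimal sketch of the associative-array approach, assuming bash 4 or later. The function name count_lines is made up for illustration, and a read loop like this is likely to be much slower than awk on large logs:

```shell
# Count occurrences of each input line using a bash associative array.
# Requires bash >= 4 (declare -A / local -A).
count_lines() {
  local -A count=()
  local line key
  while IFS= read -r line; do
    [[ -n $line ]] || continue                    # an empty string is not a valid key
    count[$line]=$(( ${count[$line]:-0} + 1 ))    # default to 0 on first sight
  done
  # Keys come back in no particular order.
  for key in "${!count[@]}"; do
    printf '%d %s\n' "${count[$key]}" "$key"
  done
}

printf '%s\n' a b a c | count_lines
```

The output order is unspecified; pipe it through sort -n if you need it ordered by count. Only the distinct lines get sorted at that point, not the whole input.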
jq has built-in associative arrays, so you could consider one of the following approaches, which are both efficient (like awk):
zgrep -o <filter> <file> |
jq -nR 'reduce inputs as $line ({}; .[$line] += 1)'
This would produce the results as a JSON object with the frequencies as the object's values, e.g.
{
"a": 2,
"b": 1,
"c": 1
}
If you want each line of output to consist of a count and value (in that order), then an appropriate jq invocation would be:
jq -nRr 'reduce inputs as $line ({}; .[$line] += 1)
| to_entries[] | "\(.value) \(.key)"'
This would produce output like so:
2 a
1 b
1 c
The jq options used here are:
-n # for use with `inputs`
-R # "raw" input
-r # "raw" output

Sorting issue in Bash Script

I have a whole file full of filenames that is outputted from the find command below:
find "$ARCHIVE" -type f -name *_[0-9][0-9] | sed 's/_[0-9][0-9]$//' > temp
I am now trying to sort these file names and count them to find out which one appears the most. The problem I am having with this is whenever I execute:
sort -g temp
It prints all the sorted file names to the command line and I am unsure why. Any help with this issue would be greatly appreciated!
You may need this:
sort temp| uniq -c | sort -nr
First we sort temp, then uniq -c prefixes each distinct line with its number of occurrences, and finally sort -nr orders those lines numerically by that count (-n), highest first (-r).
Example file:
/home/user/testfiles/405/prob405823
/home/user/testfiles/405/prob405823
/home/user/testfiles/527/prob527149
/home/user/testfiles/518/prob518433
Output:
2 /home/user/testfiles/405/prob405823
1 /home/user/testfiles/527/prob527149
etc..
Resources:
Linux / Unix Command: sort
uniq(1) - Linux man page
ptierno - comments to improve answer
You could do everything after the find in one awk command (this one uses GNU awk 4.*):
find "$ARCHIVE" -type f -name '*_[0-9][0-9]' |
awk '
{ cnt[gensub(/_[0-9][0-9]$/, "", 1)]++ }
END {
    # traverse the array by value, numerically, highest count first
    PROCINFO["sorted_in"] = "@val_num_desc"
    for (file in cnt) {
        print cnt[file], file
    }
}
'

How do I sort file paths based on multiple embedded numbers?

I have run a program to generate results with different parameters, R, C and RP, reflected in the directory name of the output files, all named results.txt.
For instance, in directory name params_R_7_C_16_RP_0, the 7 is the value of parameter R, 16 is the value of parameter C and 0 is the value of parameter RP.
I want to get all results.txt files in the current directory tree, sorted by the embedded values of R,C and RP in their hosting directories.
I first use the following command to get the results.txt files that I want to parse:
find ./ -name "results.txt"
and the output is:
./params_R_11_C_9_RP_0/results.txt
./params_R_7_C_9_RP_0/results.txt
./params_R_7_C_4_RP_0/results.txt
./params_R_11_C_16_RP_0/results.txt
./params_R_9_C_4_RP_0/results.txt
./params_R_5_C_9_RP_0/results.txt
./params_R_9_C_25_RP_0/results.txt
./params_R_7_C_16_RP_0/results.txt
./params_R_5_C_25_RP_0/results.txt
./params_R_5_C_16_RP_0/results.txt
./params_R_11_C_4_RP_0/results.txt
./params_R_9_C_16_RP_0/results.txt
./params_R_7_C_25_RP_0/results.txt
./params_R_11_C_25_RP_0/results.txt
./params_R_5_C_4_RP_0/results.txt
./params_R_9_C_9_RP_0/results.txt
and I tried the following sort command:
find ./ -name "results.txt" | sort
which results in lexical sorting:
./params_R_11_C_16_RP_0/results.txt
./params_R_11_C_25_RP_0/results.txt
./params_R_11_C_4_RP_0/results.txt
./params_R_11_C_9_RP_0/results.txt
./params_R_5_C_16_RP_0/results.txt
./params_R_5_C_25_RP_0/results.txt
./params_R_5_C_4_RP_0/results.txt
./params_R_5_C_9_RP_0/results.txt
./params_R_7_C_16_RP_0/results.txt
./params_R_7_C_25_RP_0/results.txt
./params_R_7_C_4_RP_0/results.txt
./params_R_7_C_9_RP_0/results.txt
./params_R_9_C_16_RP_0/results.txt
./params_R_9_C_25_RP_0/results.txt
./params_R_9_C_4_RP_0/results.txt
./params_R_9_C_9_RP_0/results.txt
But what I actually want is selective numerical sorting: first by R value, then C, then RP:
./params_R_5_C_4_RP_0/results.txt
./params_R_5_C_9_RP_0/results.txt
./params_R_5_C_16_RP_0/results.txt
./params_R_5_C_25_RP_0/results.txt
./params_R_7_C_4_RP_0/results.txt
./params_R_7_C_9_RP_0/results.txt
./params_R_7_C_16_RP_0/results.txt
./params_R_7_C_25_RP_0/results.txt
./params_R_9_C_4_RP_0/results.txt
./params_R_9_C_9_RP_0/results.txt
./params_R_9_C_16_RP_0/results.txt
./params_R_9_C_25_RP_0/results.txt
...
I considered padding the embedded numbers (e.g., params_R_005_C_004_RP_0) when generating the paths list, but that would require an additional processing step, which I want to avoid.
Can the desired sorting be achieved directly?
You need the -V flag for sort
find ./ -name "results.txt" | sort -V
If you use GNU sort (a recent-enough version), @Fabricator's answer, based on GNU sort's -V option, is by far the simplest solution.
Otherwise, try this POSIX-compliant solution:
find . -name 'results.txt' | sort -n -t _ -k3,3 -k5,5 -k 7,7
-n specifies numeric sorting
-t _ splits the input line into fields based on separator char. _
-k3,3 -k5,5 -k 7,7 sorts the input based first on field 3, then field 5, then field 7, corresponding to the R, C and RP values.
(Note that using -k with a single number - e.g., -k3 - would instead result in sorting from field 3 through the remainder of the line).
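To see the key selection at work on a few of the sample paths, here is a small sketch. sort_by_params is a made-up name, and the per-key n modifier is used in place of the global -n, which should be equivalent here:

```shell
# Split on '_' and compare fields 3, 5 and 7 (the R, C and RP values) numerically.
# Each key is limited to a single field with the start,end form.
sort_by_params() {
  sort -t _ -k 3,3n -k 5,5n -k 7,7n
}

printf '%s\n' \
  ./params_R_11_C_9_RP_0/results.txt \
  ./params_R_7_C_16_RP_0/results.txt \
  ./params_R_7_C_4_RP_0/results.txt \
  ./params_R_5_C_25_RP_0/results.txt | sort_by_params
```

Field 7 is really "0/results.txt", but a numeric comparison reads only the leading digits, so the trailing path does not matter.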
try find ./ -name "results.txt" | sort -k 3 -t _ -n -k 5 -n

Best way to simulate "group by" from bash?

Suppose you have a file that contains IP addresses, one address in each line:
10.0.10.1
10.0.10.1
10.0.10.3
10.0.10.2
10.0.10.1
You need a shell script that counts for each IP address how many times it appears in the file. For the previous input you need the following output:
10.0.10.1 3
10.0.10.2 1
10.0.10.3 1
One way to do this is:
cat ip_addresses |uniq |while read ip
do
echo -n $ip" "
grep -c $ip ip_addresses
done
However it is really far from being efficient.
How would you solve this problem more efficiently using bash?
(One thing to add: I know it can be solved from perl or awk, I'm interested in a better solution in bash, not in those languages.)
ADDITIONAL INFO:
Suppose that the source file is 5GB and the machine running the algorithm has 4GB. So sort is not an efficient solution, neither is reading the file more than once.
I liked the hashtable-like solution - anybody can provide improvements to that solution?
ADDITIONAL INFO #2:
Some people asked why would I bother doing it in bash when it is way easier in e.g. perl. The reason is that on the machine I had to do this perl wasn't available for me. It was a custom built linux machine without most of the tools I'm used to. And I think it was an interesting problem.
So please, don't blame the question, just ignore it if you don't like it. :-)
sort ip_addresses | uniq -c
This will print the count first, but other than that it should be exactly what you want.
The quick and dirty method is as follows:
cat ip_addresses | sort -n | uniq -c
If you need to use the values in bash you can assign the whole command to a bash variable and then loop through the results.
PS
If the sort command is omitted, you will not get the correct results as uniq only looks at successive identical lines.
for summing up multiple fields, based on a group of existing fields, use the example below : ( replace the $1, $2, $3, $4 according to your requirements )
cat file
US|A|1000|2000
US|B|1000|2000
US|C|1000|2000
UK|1|1000|2000
UK|1|1000|2000
UK|1|1000|2000
awk 'BEGIN { FS=OFS=SUBSEP="|"}{arr[$1,$2]+=$3+$4 }END {for (i in arr) print i,arr[i]}' file
US|A|3000
US|B|3000
US|C|3000
UK|1|9000
The canonical solution is the one mentioned by another respondent:
sort | uniq -c
It is shorter and more concise than what can be written in Perl or awk.
You write that you don't want to use sort, because the data's size is larger than the machine's main memory size. Don't underestimate the implementation quality of the Unix sort command. Sort was used to handle very large volumes of data (think the original AT&T's billing data) on machines with 128k (that's 131,072 bytes) of memory (PDP-11). When sort encounters more data than a preset limit (often tuned close to the size of the machine's main memory) it sorts the data it has read in main memory and writes it into a temporary file. It then repeats the action with the next chunks of data. Finally, it performs a merge sort on those intermediate files. This allows sort to work on data many times larger than the machine's main memory.
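If you want to control that spilling behaviour yourself, GNU sort exposes it through -S (buffer size) and -T (temporary directory). A sketch, assuming GNU coreutils; the function name and the 64M limit are arbitrary choices for illustration:

```shell
# Assumes GNU sort: -S caps the in-memory buffer, -T says where spill files go.
# LC_ALL=C forces plain byte comparison, which is usually faster.
count_with_bounded_memory() {
  local input=$1 tmpdir
  tmpdir=$(mktemp -d) || return 1
  LC_ALL=C sort -S 64M -T "$tmpdir" "$input" | uniq -c
  rm -rf "$tmpdir"
}

# Usage (ip_addresses is the file from the question):
# count_with_bounded_memory ip_addresses
```

Pointing -T at a disk with enough free space matters more than the exact buffer size, since the intermediate files can add up to roughly the size of the input.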
cat ip_addresses | sort | uniq -c | sort -nr | awk '{print $2 " " $1}'
this command would give you desired output
Solution ( group by like mysql)
grep -ioh "facebook\|xing\|linkedin\|googleplus" access-log.txt | sort | uniq -c | sort -n
Result
3249 googleplus
4211 linkedin
5212 xing
7928 facebook
It seems that you have to either write a large amount of code to simulate hashes in bash to get linear behavior, or stick to the superlinear versions.
Among those versions, saua's solution is the best (and simplest):
sort -n ip_addresses.txt | uniq -c
I found http://unix.derkeiler.com/Newsgroups/comp.unix.shell/2005-11/0118.html. But it's ugly as hell...
I feel awk associative array is also handy in this case
$ awk '{count[$1]++}END{for(j in count) print j,count[j]}' ips.txt
A group by post here
You probably can use the file system itself as a hash table. Pseudo-code as follows:
for every entry in the ip address file; do
let addr denote the ip address;
if file "addr" does not exist; then
create file "addr";
write the number "1" in the file;
else
read the number from "addr";
increase the number by 1 and write it back;
fi
done
In the end, all you need to do is to traverse all the files and print the file names and numbers in them. Alternatively, instead of keeping a count, you could append a space or a newline each time to the file, and in the end just look at the file size in bytes.
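The pseudo-code above could be sketched in bash roughly like this. The function name count_with_files is made up, the counter files live in a scratch directory from mktemp, and it assumes every input line is usable as a file name (true for IPv4 addresses):

```shell
# One counter file per distinct input line, kept in a temporary directory.
# Assumes each line is a valid file name (no '/' and not empty).
count_with_files() {
  local dir line n f
  dir=$(mktemp -d) || return 1
  while IFS= read -r line; do
    [ -n "$line" ] || continue
    n=0
    # Read the current count if this key was seen before.
    [ -f "$dir/$line" ] && read -r n < "$dir/$line"
    echo $(( n + 1 )) > "$dir/$line"
  done
  # Traverse the counter files; the glob expands in sorted order.
  for f in "$dir"/*; do
    [ -f "$f" ] || continue
    printf '%s %s\n' "${f##*/}" "$(cat "$f")"
  done
  rm -rf "$dir"
}

printf '%s\n' 10.0.10.1 10.0.10.1 10.0.10.3 10.0.10.2 10.0.10.1 | count_with_files
```

Memory use stays flat, but every line costs a couple of file operations, so expect this to be slow on large inputs.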
Most of the other solutions count duplicates. If you really need to group key value pairs, try this:
Here is my example data:
find . | xargs md5sum
fe4ab8e15432161f452e345ff30c68b0 a.txt
30c68b02161e15435ff52e34f4fe4ab8 b.txt
30c68b02161e15435ff52e34f4fe4ab8 c.txt
fe4ab8e15432161f452e345ff30c68b0 d.txt
fe4ab8e15432161f452e345ff30c68b0 e.txt
This will print the key value pairs grouped by the md5 checksum.
cat table.txt | awk '{print $1}' | sort | uniq | xargs -i grep {} table.txt
30c68b02161e15435ff52e34f4fe4ab8 b.txt
30c68b02161e15435ff52e34f4fe4ab8 c.txt
fe4ab8e15432161f452e345ff30c68b0 a.txt
fe4ab8e15432161f452e345ff30c68b0 d.txt
fe4ab8e15432161f452e345ff30c68b0 e.txt
GROUP BY under bash
Regarding this SO thread, there are different answers for different needs.
1. Counting IPs as the question requests (GROUP BY IP address).
Since an IP address is easy to convert to a single integer, a pure bash function can be a lot more efficient for a small set of addresses, especially if you need to repeat this kind of operation many times!
Pure bash (no fork!)
There is a way using a bash function. It is very quick because there is no fork!
countIp () {
local -a _ips=(); local _a
while IFS=. read -a _a ;do
((_ips[_a<<24|${_a[1]}<<16|${_a[2]}<<8|${_a[3]}]++))
done
for _a in ${!_ips[@]} ;do
printf "%.16s %4d\n" \
$(($_a>>24)).$(($_a>>16&255)).$(($_a>>8&255)).$(($_a&255)) ${_ips[_a]}
done
}
Note: each IP address is converted to a 32-bit unsigned integer, which is used as the array index. This uses plain (indexed) bash arrays!
time countIp < ip_addresses
10.0.10.1 3
10.0.10.2 1
10.0.10.3 1
real 0m0.001s
user 0m0.004s
sys 0m0.000s
time sort ip_addresses | uniq -c
3 10.0.10.1
1 10.0.10.2
1 10.0.10.3
real 0m0.010s
user 0m0.000s
sys 0m0.000s
On my host, this is a lot quicker than forking sort and uniq for up to approximately 1,000 addresses, but it takes about one full second when I try to sort and count 10,000 addresses.
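The integer-index trick used by countIp can be looked at in isolation. The two helper names below are made up for illustration, and the sketch assumes well-formed dotted-quad input without leading zeros (bash arithmetic would read a leading zero as octal):

```shell
# Pack a dotted-quad IPv4 address into one integer.
ip_to_int() {
  local IFS=. a b c d
  read -r a b c d <<< "$1"          # split on '.' because IFS is '.'
  echo "$(( a << 24 | b << 16 | c << 8 | d ))"
}

# Unpack the integer back into dotted-quad form.
int_to_ip() {
  local n=$1
  echo "$(( n >> 24 & 255 )).$(( n >> 16 & 255 )).$(( n >> 8 & 255 )).$(( n & 255 ))"
}

ip_to_int 10.0.10.1      # prints 167774721
int_to_ip 167774721      # prints 10.0.10.1
```

Bash indexed arrays are sparse and their indices are listed in ascending order, which is why the function's output comes out sorted by address without any call to sort.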
2. GROUP BY duplicates (files content)
By using a checksum, you can identify duplicate files:
find . -type f -exec sha1sum {} + |
sort |
sed '
:a;
$s/^[^ ]\+ \+//;
N;
s/^\([^ ]\+\) \+\([^ ].*\)\n\1 \+\([^ ].*\)$/\1 \2\o11\3/;
ta;
s/^[^ ]\+ \+//;
P;
D;
ba
'
This will print all duplicates, one group per line, separated by a tab ($'\t', octal 011). You could change /\1 \2\o11\3/; to /\1 \2|\3/; to use | as the separator.
./b.txt ./e.txt
./a.txt ./c.txt ./d.txt
Could be written as (with | as separator):
find . -type f -exec sha1sum {} + | sort | sed ':a;$s/^[^ ]\+ \+//;N;
s/^\([^ ]\+\) \+\([^ ].*\)\n\1 \+\([^ ].*\)$/\1 \2|\3/;ta;s/^[^ ]\+ \+//;P;D;ba'
Pure bash way
By using nameref, you could build bash arrays holding all duplicates:
declare -iA sums='()'
while IFS=' ' read -r sum file ;do
declare -n list=_LST_$sum
list+=("$file")
sums[$sum]+=1
done < <(
find . -type f -exec sha1sum {} +
)
From there, you have a set of arrays, each holding the names of all duplicate files as separate elements:
for i in ${!sums[@]};do
declare -n list=_LST_$i
printf "%d %d %s\n" ${sums[$i]} ${#list[@]} "${list[*]}"
done
This may output something like:
2 2 ./e.txt ./b.txt
3 3 ./c.txt ./a.txt ./d.txt
Here the count of files per checksum (${sums[$shasum]}) matches the number of elements in the array ${_LST_ShAsUm[@]}.
for i in ${!sums[@]};do
declare -n list=_LST_$i
echo ${list[@]@A}
done
declare -a _LST_22596363b3de40b06f981fb85d82312e8c0ed511=([0]="./e.txt" [1]="./b.txt")
declare -a _LST_f572d396fae9206628714fb2ce00f72e94f2258f=([0]="./c.txt" [1]="./a.txt" [2]="./d.txt")
Note that this method could handle spaces and special characters in filenames!
3. GROUP BY columns in a table
Since Anonymous already provided an efficient awk sample, here is a pure bash solution.
Say you want to sum columns 3 through the last one, grouped by columns 1 and 2, with table.txt looking like:
US|A|1000|2000
US|B|1000|2000
US|C|1000|2000
UK|1|1000|2000
UK|1|1000|2000|3000
UK|1|1000|2000|3000|4000
For not too big tables, you could:
myfunc() {
local -iA restabl='()';
local IFS=+
while IFS=\| read -ra ar; do
restabl["${ar[0]}|${ar[1]}"]+="${ar[*]:2}"
done
for i in ${!restabl[@]} ;do
printf '%s|%s\n' "$i" "${restabl[$i]}"
done
}
This could output something like:
myfunc <table.txt
UK|1|19000
US|A|3000
US|C|3000
US|B|3000
And to have table sorted:
myfunc() {
local -iA restabl='()';
local IFS=+ sorted=()
while IFS=\| read -ra ar; do
sorted[64#${ar[0]}${ar[1]}]="${ar[0]}|${ar[1]}"
restabl["${ar[0]}|${ar[1]}"]+="${ar[*]:2}"
done
for i in ${sorted[@]} ;do
printf '%s|%s\n' "$i" "${restabl[$i]}"
done
}
This should return:
myfunc <table.txt
UK|1|19000
US|A|3000
US|B|3000
US|C|3000
I'd have done it like this:
perl -e 'while (<>) {chop; $h{$_}++;} for $k (keys %h) {print "$k $h{$k}\n";}' ip_addresses
but uniq might work for you.
Importing data to sqlite db and using sql syntax (just an other idea).
I know it's too much for this example but would be useful for complex queries with multiple files (tables)
#!/bin/bash
trap clear_db EXIT
clear_db(){ rm -f "mydb$$"; }
# add header to input_file (IP)
INPUT_FILE=ips.txt
# import file into db
sqlite3 -csv mydb$$ ".import ${INPUT_FILE} mytable"
# using sql statements on table 'mytable'
sqlite3 mydb$$ -separator " " "SELECT IP, COUNT(*) FROM mytable GROUP BY IP;"
10.0.10.1 3
10.0.10.2 1
10.0.10.3 1
I understand you are looking for something in Bash, but in case someone else might be looking for something in Python, you might want to consider this:
mySet = set()
for line in open("ip_address_file.txt"):
line = line.rstrip()
mySet.add(line)
As values in the set are unique by default and Python is pretty good at this stuff, you might win something here. I haven't tested the code, so it might be bugged, but this might get you there. And if you want to count occurrences, using a dict instead of a set is easy to implement.
Edit:
I'm a lousy reader, so I answered wrong. Here's a snippet with a dict that counts occurrences.
mydict = {}
for line in open("ip_address_file.txt"):
line = line.rstrip()
if line in mydict:
mydict[line] += 1
else:
mydict[line] = 1
The dictionary mydict now holds the unique IPs as keys and the number of times each occurred as values.
This does not answer the count element of the original question, but this question is the first search engine result when searching for what I wanted to achieve, so I thought this may help someone as it relates to 'group by' functionality.
I wanted to order files based on groupings of them, where the presence of some string in the filename determined the group.
It uses a temporary grouping/ordering prefix which is removed after ordering; sed substitute expressions (s#pattern#replacement#g) match the target string and prepend an integer to the line corresponding to the desired sort order of that target string. Then, grouping prefix is removed with cut.
Note that the sed expressions could be joined (e.g. sed -e '<expr>; <expr>; <expr>;') but here they're split for readability.
It's not pretty and probably not fast (I'm dealing with <50 items), but it is at least conceptually simple and doesn't require learning awk.
#!/usr/bin/env bash
for line in $(find /etc \
| sed -E -e "s#^(.*${target_string_A}.*)#${target_string_A_sort_index}:\1#;" \
| sed -E -e "s#^(.*${target_string_B}.*)#${target_string_B_sort_index}:\1#;" \
| sed -E -e "s#^/(.*)#00:/\1#;" \
| sort \
| cut -c4-
)
do
echo "${line}"
done
e.g. Input
/this/is/a/test/a
/this/is/a/test/b
/this/is/a/test/c
/this/is/a/special/test/d
/this/is/a/another/test/e
#!/usr/bin/env bash
for line in $(find /etc \
| sed -E -e "s#^(.*special.*)#10:\1#;" \
| sed -E -e "s#^(.*another.*)#05:\1#;" \
| sed -E -e "s#^/(.*)#00:/\1#;" \
| sort \
| cut -c4-
)
do
echo "${line}"
done
/this/is/a/test/a
/this/is/a/test/b
/this/is/a/test/c
/this/is/a/another/test/e
/this/is/a/special/test/d
A combination of awk + sort (with the version sort flag) is probably fastest (if your environment has awk at all):
echo "${input...}" |
{m,g}awk '{ __[$+_]++ } END { for(_ in __) { print "",+__[_],_ } }' FS='^$' OFS='\t' |
gsort -t$'\t' -k 3,3 -V
Only the post-GROUP-BY summary rows are sent to the sorting utility, which is far less system-intensive than pre-sorting all the input rows for no reason.
For small inputs, e.g. fewer than 1000 rows or so, just directly sort|uniq -c it.
3 10.0.10.1
1 10.0.10.2
1 10.0.10.3
Sort may be omitted only if identical lines are already adjacent in the input, since uniq compares consecutive lines only:
uniq -c <source_file>
or
echo "$list" | uniq -c
if the source list is a variable

Resources