I have a line that goes like:
string 2 2 3 3 1 4
where the 2nd, 4th and 6th columns represent an ID (assuming each ID number is unique) and 3rd, 5th and 7th columns represent some data associated with respective ID.
How can I re-arrange the line so that it will be sorted by the ID?
string 1 4 2 2 3 3
Note: a line may have any number of IDs, unlike the example.
Using shell script, I'm thinking something like
while read n
do
echo $(echo $n | sork -k (... stuck here) )
done < infile
Another bash alternative which does not rely on how many ids there are:
#!/usr/bin/env bash
x='string 2 2 3 3 1 4'
out="${x%% *}"
in=($x)
for (( i = 1; i < ${#in[*]}; i += 2 ))
do
new[${in[i]}]=${in[i+1]}
done
for i in ${!new[#]}
do
out="$out $i ${new[i]}"
done
echo $out
You can put a loop around the lot if you then want to read a file
I'll add an gawk solution to your long list of options.
This is a standalone script:
#!/usr/bin/env gawk -f
{
line=$1
# Collect the tuples into values of an array,
for (i=2;i<NF;i+=2) a[i]=$i FS $(i+1)
# This sorts the array "a" by value, numerically, ascending...
asort(a, a, "#val_num_asc")
# And this for loop gathers the result.
for (i=0; i<length(a); i++) line=line FS a[i]
# Finally, print the line,
print line
# and clear the array for the next round.
delete a
}
This works by copying your tuples into an array, sorting the array, then reassembling the sorted tuples in a for loop that prints the array elements.
Note that it's gawk-only (not traditional awk) because of the use of asort().
$ cat infile
string 2 2 3 3 1 4
other 5 1 20 9 3 7
$ ./sorttuples infile
string 1 4 2 2 3 3
other 3 7 5 1 20 9
As a bash script this can be done with:
Code:
#!/usr/bin/env bash
# send field pairs as separate lines
function emit_line() {
while [ $# -gt 0 ] ; do
echo "$1" "$2"
shift; shift
done
}
# break the line into pieces and send to sort
function sort_line() {
echo $1
shift
emit_line $* | sort
}
# loop through the lines in the file and sort by key-value pairs
while read n; do
echo $(sort_line $n)
done < infile
File infile:
string 2 2 3 3 1 4
string 2 2 0 3 4 4 1 7
string 2 2 0 3 2 1
Output:
string 1 4 2 2 3 3
string 0 3 1 7 2 2 4 4
string 0 3 2 1 2 2
Update:
Cribbing the sort from grail's version, to remove the (much slower) external sort:
function sort_line() {
line="$1"
shift
while [ $# -gt 0 ] ; do
data[$1]=$2
shift; shift
done
for i in ${!data[#]}; do
out="$line $i ${data[i]}"
done
unset data
echo $line
}
while read n; do
sort_line $n
done < infile
You can use python for this. This function breaks up the column into a list of tuples that can then be sorted. itertools.chain is then used to re-assemble the key values pairs.
Code:
import itertools as it
def sort_line(line):
# split the line on white space
x = line.split()
# make a tuple of key value pairs
as_tuples = [tuple(x[i:i+2]) for i in range(1, len(x), 2)]
# sort the tuples, and flatten them with chain
sorted_kv = list(it.chain(*sorted(as_tuples)))
# join the results back into a string
return ' '.join([x[0]] + sorted_kv)
Test Code:
data = [
"string 2 2 3 3 1 4",
"string 2 2 0 3 4 4 1 7",
]
for line in data:
print(sort_line(line))
Results:
string 1 4 2 2 3 3
string 0 3 1 7 2 2 4 4
Related
Is there an efficient way to count the number of occurrences for each item of list A in list B? This question has been solved in different programming languages (e.g., C/C++, Java, Python) but I have not found the solution in BASH. My very naive idea is to use a nested for loop to solve it but I think there should be a better approach for this.
# input
listA=(1 2 3 4 5)
listB=(3 1 2 4 1 3 4 5 2 6 8 7 3 9 6 5 1 2)
# expected output
# 1: 3
# 2: 3
# 3: 3
# 4: 2
# 5: 2
Any comments/suggestions are appreciated!
You say in comment you are not comfortable with using for loops, so here is a solution without them:
$ join -2 2 <(printf '%s\n' "${listA[#]}" | sort) \
<(printf '%s\n' "${listB[#]}" | sort | uniq -c)
1 3
2 3
3 3
4 2
5 2
Explanation
<(...) are Bash's process substitutions. join is given pseudo-files that actually correspond to the output of the commands.
printf '%s\n' "${listA[#]}" | sort sorts the element in listA and print them one by line.
printf '%s\n' "${listB[#]}" | sort | uniq -c does the same with listB but uses uniq -c to prefix each element with its number of occurrences.
join keeps the lines in this second output that matches a line in the first output.
A solution in plain bash using an associative array, without using nested loops:
#!/bin/bash
listA=(1 2 3 4 5)
listB=(3 1 2 4 1 3 4 5 2 6 8 7 3 9 6 5 1 2)
declare -A freq # associative array to hold frequencies
for elem in "${listA[#]}"; do freq[$elem]=0; done
for elem in "${listB[#]}"; do [[ ${freq[$elem]} ]] && ((++freq[$elem])); done
for elem in "${listA[#]}"; do printf '%s: %d\n' "$elem" "${freq[$elem]}"; done
Notes:
Elements are not restricted to integers nor single-character elements;
script should work for any kind of element (including elements containing spaces, tabs, newlines... etc, except the null byte ('\0'), of course).
Its efficiency depends on how associative arrays are implemented internally in bash.
Associative arrays were introduced into bash with version 4.0.
I am using bash in order to process software responses on-the-fly and I am looking for a way to find the
index of the maximum element in the array.
The data that gets fed to the bash script is like this:
25 9
72 0
3 3
0 4
0 7
And so I create two arrays. There is
arr1 = [ 25 72 3 0 0 ]
arr2 = [ 9 0 3 4 7 ]
And what I need is to find the index of the maximum number in arr1 in order to use it also for arr2.
But I would like to see if there is a quick - optimal way to do this.
Would it maybe be better to use a dictionary structure [key][value] with the data I have? Would this make the process easier?
I have also found [1] (from user jhnc) but I don't quite think it is what I want.
My brute - force approach is the following:
function MAX {
arr1=( 25 72 3 0 0 )
arr2=( 9 0 3 4 7 )
local indx=0
local max=${arr1[0]}
local flag
for ((i=1; i<${#arr1[#]};i++)); do
#To avoid invalid arithmetic operators when items are floats/doubles
flag=$( python <<< "print(${arr1$[${i}]} > ${max})")
if [ $flag == "True" ]; then
indx=${i}
max=${arr1[${i}]}
fi
done
echo "MAX:INDEX = ${max}:${indx}"
echo "${arr1[${indx}]}"
echo "${arr2[${indx}]}"
}
This approach obviously will work, BUT, is it the optimal one? Is there a faster way to perform the task?
arr1 = [ 99.97 0.01 0.01 0.01 0 ]
arr2 = [ 0 6 4 3 2 ]
In this example, if an array contains floats then I would get a
syntax error: invalid arithmetic operator (error token is ".97)
So, I am using
flag=$( python <<< "print(${arr1$[${i}]} > ${max})")
In order to overcome this issue.
Finding a maximum is inherently an O(n) operation. But there's no need to spawn a Python process on each iteration to perform the comparison. Write a single awk script instead.
awk 'BEGIN {
split(ARGV[1], a1);
split(ARGV[2], a2);
max=a1[1];
indx=1;
for (i in a1) {
if (a1[i] > max) {
indx = i;
max = a1[i];
}
}
print "MAX:INDEX = " max ":" (indx - 1)
print a1[indx]
print a2[indx]
}' "${arr1[*]}" "${arr2[*]}"
The two shell arrays are passed as space-separated strings to awk, which splits them back into awk arrays.
It's difficult to do it efficiently if you really do need to compare floats. Bash can't do floats, which means invoking an external program for every number comparison. However, comparing every number in bash, is not necessarily needed.
Here is a fast, pure bash, integer only solution, using comparison:
#!/bin/bash
arr1=( 25 72 3 0 0)
arr2=( 9 0 3 4 7)
# Get the maximum, and also save its index(es)
for i in "${!arr1[#]}"; do
if ((arr1[i]>arr1_max)); then
arr1_max=${arr1[i]}
max_indexes=($i)
elif [[ "${arr1[i]}" == "$arr1_max" ]]; then
max_indexes+=($i)
fi
done
# Print the results
printf '%s\n' \
"Array1 max is $arr1_max" \
"The index(s) of the maximum are:" \
"${max_indexes[#]}" \
"The corresponding values from array 2 are:"
for i in "${max_indexes[#]}"; do
echo "${arr2[i]}"
done
Here is another optimal method, that can handle floats. Comparison in bash is avoided altogether. Instead the much faster sort(1) is used, and is only needed once. Rather than starting a new python instance for every number.
#!/bin/bash
arr1=( 25 72 3 0 0)
arr2=( 9 0 3 4 7)
arr1_max=$(printf '%s\n' "${arr1[#]}" | sort -n | tail -1)
for i in "${!arr1[#]}"; do
[[ "${arr1[i]}" == "$arr1_max" ]] &&
max_indexes+=($i)
done
# Print the results
printf '%s\n' \
"Array 1 max is $arr1_max" \
"The index(s) of the maximum are:" \
"${max_indexes[#]}" \
"The corresponding values from array 2 are:"
for i in "${max_indexes[#]}"; do
echo "${arr2[i]}"
done
Example output:
Array 1 max is 72
The index(s) of the maximum are:
1
The corresponding values from array 2 are:
0
Unless you need those arrays, you can also feed your input script directly in to something like this:
#!/bin/bash
input-script |
sort -nr |
awk '
(NR==1) {print "Max: "$1"\nCorresponding numbers:"; max = $1}
{if (max == $1) print $2; else exit}'
Example (with some extra numbers):
$ echo \
'25 9
72 0
72 11
72 4
3 3
3 14
0 4
0 1
0 7' |
sort -nr |
awk '(NR==1) {max = $1; print "Max: "$1"\nCorresponding numbers:"}
{if (max == $1) print $2; else exit}'
Max: 72
Corresponding numbers:
4
11
0
You can also do it 100% in awk, including sorting:
$ echo \
'25 9
72 0
72 11
72 4
3 3
3 14
0 4
0 1
0 7' |
awk '
{
col1[a++] = $1
line[a-1] = $0
}
END {
asort(col1)
col1_max = col1[a-1]
print "Max is "col1_max"\nCorresponding numbers are:"
for (i in line) {
if (line[i] ~ col1_max"\\s") {
split(line[i], max_line)
print max_line[2]
}
}
}'
Max is 72
Corresponding numbers are:
0
11
4
Or, just to get the maximum of column 1, and any single number from column 2, that corresponds with it. As simply as possible:
$ echo \
'25 9
72 0
3 3
0 4
0 7' |
sort -nr |
head -1
72 0
I have a text file containing a line of various numbers (i.e. 2 4 1 7 12 1 4 4 3 1 1 2)
I'm trying to get the index for each occurrence of 1. This is my code for what I'm currently doing (subtracting each index value by 1 since my indexing starts at 0).
eq='0'
gradvec=()
count=0
length=0
for item in `cat file`
do
((count++))
if (("$item"=="$eq"))
then
((length++))
if (("$length"=='1'))
then
gradvec=$((count -1))
else
gradvec=$gradvec' '$((count - 1))
fi
fi
done
Although the code works, I was wondering if there was a shorter way of doing this? The result is the gradvec variable being
2 5 9 10
Consider this as the input file:
$ cat file
2 4 1 7 12 1
4 4 3 1 1 2
To get the indices of every occurrence of 1 in the input file:
$ awk '$1==1 {print NR-1}' RS='[[:space:]]+' file
2
5
9
10
How it works:
$1==1 {print NR-1}
If the value in any record is 1, print the record number minus 1.
RS='[[:space:]]+'
Define the record separator as one or more of any kind of space.
For a file that contains entries similar to as follows:
foo 1 6 0
fam 5 11 3
wam 7 23 8
woo 2 8 4
kaz 6 4 9
faz 5 8 8
How would you replace the nth field of every mth line with the same element using bash or awk?
For example, if n = 1 and m = 3 and the element = wot, the output would be:
foo 1 6 0
fam 5 11 3
wot 7 23 8
woo 2 8 4
kaz 6 4 9
wot 5 8 8
I understand you can call / print every mth line using e.g.
awk 'NR%7==0' file
So far I have tried to keep this in memory but to no avail... I need to keep the rest of the file as well.
I would prefer answers using bash or awk, but sed solutions would also be helpful. I'm a beginner in all three. Please explain your solution.
awk -v m=3 -v n=1 -v el='wot' 'NR % m == 0 { $n = el } 1' file
Note, however, that the inter-field whitespace is not guaranteed to be preserved as-is, because awk splits a line into fields by any run of whitespace; as written, the output fields of modified lines will be separated by a single space.
If your input fields are consistently separated by 2 spaces, however, you can effectively preserve the input whitespace by adding -F' ' -v OFS=' ' to the awk invocation.
-v m=3 -v n=1 -v el='wot' defines Awk variables m, n, and el
NR % m == 0 is a pattern (condition) that evaluates to true for every m-th line.
{ $n = el } is the associated action that replaces the nth field of the input line with variable el, causing the line to be rebuilt, implicitly using OFS, the output-field separator, which defaults to a space.
1 is a common Awk shorthand for printing the (possibly modified) input line at hand.
Great little exercise. While I would probably lean toward an awk solution, in bash you can also rely on parameter expansion with substring replacement to replace the nth field of every mth line. Essentially, you can read every line, preserving whitespace, then check your line count, e.g. if c is your line counter and m your variable for mth line, you could use:
if (( $((c % m )) == 0)) ## test for mth line
If the line is a replacement line, you can read each word into an array after restoring default word-splitting and then use your array element index n-1 to provide the replacement (e.g. ${line/find/replace} with ${line/"${array[$((n-1))]}"/replace}).
If it isn't a replacement line, simply output the line unchanged. A short example could be similar to the following (to which you can add additional validations as required)
#!/bin/bash
[ -n "$1" -a -r "$1" ] || { ## filename given an readable
printf "error: insufficient or unreadable input.\n"
exit 1
}
n=${2:-1} ## variables with default n=1, m=3, e=wot
m=${3:-3}
e=${4:-wot}
c=1 ## line count
while IFS= read -r line; do
if (( $((c % m )) == 0)) ## test for mth line
then
IFS=$' \t\n'
a=( $line ) ## split into array
IFS=
echo "${line/"${a[$((n-1))]}"/$e}" ## nth replaced with e
else
echo "$line" ## otherwise just output line
fi
((c++)) ## advance counter
done <"$1"
Example Use/Output
n=1, m=3, e=wot
$ bash replmn.sh dat/repl.txt
foo 1 6 0
fam 5 11 3
wot 7 23 8
woo 2 8 4
kaz 6 4 9
wot 5 8 8
n=1, m=2, e=baz
$ bash replmn.sh dat/repl.txt 1 2 baz
foo 1 6 0
baz 5 11 3
wam 7 23 8
baz 2 8 4
kaz 6 4 9
baz 5 8 8
n=3, m=2, e=99
$ bash replmn.sh dat/repl.txt 3 2 99
foo 1 6 0
fam 5 99 3
wam 7 23 8
woo 2 99 4
kaz 6 4 9
faz 5 99 8
An awk solution is shorter (and avoids problems with duplicate occurrences of the replacement string in $line), but both would need similar validation of field existence, etc.. Learn from both and let me know if you have any questions.
I have a file like this
file.txt
0 1 a
1 1 b
2 1 d
3 1 d
4 2 g
5 2 a
6 3 b
7 3 d
8 4 d
9 5 g
10 5 g
.
.
.
I want reset row number count to 0 in first column $1 whenever value of field in second column $2 changes, using awk or bash script.
result
0 1 a
1 1 b
2 1 d
3 1 d
0 2 g
1 2 a
0 3 b
1 3 d
0 4 d
0 5 g
1 5 g
.
.
.
As long as you don't mind a bit of excess memory usage, and the second column is sorted, I think this is the most fun:
awk '{$1=a[$2]+++0;print}' input.txt
This awk one-liner seems to work for me:
[ghoti#pc ~]$ awk 'prev!=$2{first=0;prev=$2} {$1=first;first++} 1' input.txt
0 1 a
1 1 b
2 1 d
3 1 d
0 2 g
1 2 a
0 3 b
1 3 d
0 4 d
0 5 g
1 5 g
Let's break apart the script and see what it does.
prev!=$2 {first=0;prev=$2} -- This is what resets your counter. Since the initial state of prev is empty, we reset on the first line of input, which is fine.
{$1=first;first++} -- For every line, set the first field, then increment variable we're using to set the first field.
1 -- this is awk short-hand for "print the line". It's really a condition that always evaluates to "true", and when a condition/statement pair is missing a statement, the statement defaults to "print".
Pretty basic, really.
The one catch of course is that when you change the value of any field in awk, it rewrites the line using whatever field separators are set, which by default is just a space. If you want to adjust this, you can set your OFS variable:
[ghoti#pc ~]$ awk -vOFS=" " 'p!=$2{f=0;p=$2}{$1=f;f++}1' input.txt | head -2
0 1 a
1 1 b
Salt to taste.
A pure bash solution :
file="/PATH/TO/YOUR/OWN/INPUT/FILE"
count=0
old_trigger=0
while read a b c; do
if ((b == old_trigger)); then
echo "$((count++)) $b $c"
else
count=0
echo "$((count++)) $b $c"
old_trigger=$b
fi
done < "$file"
This solution (IMHO) have the advantage of using a readable algorithm. I like what's other guys gives as answers, but that's not that comprehensive for beginners.
NOTE:
((...)) is an arithmetic command, which returns an exit status of 0 if the expression is nonzero, or 1 if the expression is zero. Also used as a synonym for let, if side effects (assignments) are needed. See http://mywiki.wooledge.org/ArithmeticExpression
Perl solution:
perl -naE '
$dec = $F[0] if defined $old and $F[1] != $old;
$F[0] -= $dec;
$old = $F[1];
say join "\t", #F[0,1,2];'
$dec is subtracted from the first column each time. When the second column changes (its previous value is stored in $old), $dec increases to set the first column to zero again. The defined condition is needed for the first line to work.