reset row number count in awk - bash

I have a file like this
file.txt
0 1 a
1 1 b
2 1 d
3 1 d
4 2 g
5 2 a
6 3 b
7 3 d
8 4 d
9 5 g
10 5 g
.
.
.
I want to reset the row number count to 0 in the first column ($1) whenever the value of the second column ($2) changes, using awk or a bash script.
result
0 1 a
1 1 b
2 1 d
3 1 d
0 2 g
1 2 a
0 3 b
1 3 d
0 4 d
0 5 g
1 5 g
.
.
.

As long as you don't mind a bit of excess memory usage, and the second column is sorted, I think this is the most fun:
awk '{$1=a[$2]+++0;print}' input.txt
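The cryptic a[$2]+++0 tokenizes as a[$2]++ + 0: a post-increment of a per-group counter, with the + 0 coercing the initially empty array value to numeric 0. A more explicit but equivalent sketch of the same idea:

```shell
# Per-group counter: count[$2] starts out empty, so "+ 0" forces it to 0
# the first time a group key is seen; it is incremented after use.
printf '9 1 a\n9 1 b\n9 2 g\n' |
    awk '{ $1 = count[$2] + 0; count[$2]++; print }'
# prints: 0 1 a / 1 1 b / 0 2 g
```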

This awk one-liner seems to work for me:
[ghoti@pc ~]$ awk 'prev!=$2{first=0;prev=$2} {$1=first;first++} 1' input.txt
0 1 a
1 1 b
2 1 d
3 1 d
0 2 g
1 2 a
0 3 b
1 3 d
0 4 d
0 5 g
1 5 g
Let's break apart the script and see what it does.
prev!=$2 {first=0;prev=$2} -- This is what resets your counter. Since the initial state of prev is empty, we reset on the first line of input, which is fine.
{$1=first;first++} -- For every line, set the first field, then increment the variable we're using to set the first field.
1 -- this is awk short-hand for "print the line". It's really a condition that always evaluates to "true", and when a condition/statement pair is missing a statement, the statement defaults to "print".
Pretty basic, really.
The one catch of course is that when you change the value of any field in awk, it rewrites the line using whatever field separators are set, which by default is just a space. If you want to adjust this, you can set your OFS variable:
[ghoti@pc ~]$ awk -vOFS=" " 'p!=$2{f=0;p=$2}{$1=f;f++}1' input.txt | head -2
0 1 a
1 1 b
Salt to taste.

A pure bash solution:
file="/PATH/TO/YOUR/OWN/INPUT/FILE"
count=0
old_trigger=0
while read a b c; do
    if ((b == old_trigger)); then
        echo "$((count++)) $b $c"
    else
        count=0
        echo "$((count++)) $b $c"
        old_trigger=$b
    fi
done < "$file"
This solution (IMHO) has the advantage of a readable algorithm. I like what the other answers offer, but they're not that comprehensible for beginners.
NOTE:
((...)) is an arithmetic command, which returns an exit status of 0 if the expression is nonzero, or 1 if the expression is zero. Also used as a synonym for let, if side effects (assignments) are needed. See http://mywiki.wooledge.org/ArithmeticExpression
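The exit-status behaviour described above can be seen directly:

```shell
# ((expr)) succeeds (status 0) when the expression is nonzero,
# and fails (status 1) when it is zero.
(( 5 > 2 )) && echo "nonzero expression -> status 0"
(( 0 )) || echo "zero expression -> status 1"
```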

Perl solution:
perl -naE '
$dec = $F[0] if defined $old and $F[1] != $old;
$F[0] -= $dec;
$old = $F[1];
say join "\t", @F[0,1,2];'
$dec is subtracted from the first column each time. When the second column changes (its previous value is stored in $old), $dec increases to set the first column to zero again. The defined condition is needed for the first line to work.

Related

Split a file based on number of groups in first column in bash and maximum line number

Consider the following (sorted) file test.txt where in the first column a occurs 3 times, b occurs once, c occurs 2 times and d occurs 4 times.
a 1
a 2
a 1
b 1
c 1
c 1
d 2
d 1
d 2
d 1
I would like to split this file into smaller files with a maximum of 4 lines. However, I need to retain the groups in the smaller files, meaning that all lines that start with the same value in column $1 need to be in the same file. The size of a group is, in this example, never larger than the desired output length.
The expected output would be:
file1:
a 1
a 2
a 1
b 1
file2:
c 1
c 1
file3:
d 2
d 1
d 2
d 1
From the expected output, you can see that if two or more groups together have no more than the maximum number of lines (here 4), they should go into the same file.
Therefore: a + b together have 4 entries and can go into the same file. However, c + d together have 6 entries, so c has to go into its own file.
I am aware of this Awk oneliner:
awk '{print>$1".test"}' test.txt
But this results in a separate file for each group. That would not make much sense in the real-world problem I am facing, since it would lead to a lot of files being transferred to the HPC and back, making the overhead too intense.
A bash solution would be preferred. But it could also be Python.
Another awk approach. I've had a busy day and this is only tested with your sample data, so anything could happen. It creates files named file<n>.txt, where n>0:
$ awk -v n=4 '
BEGIN {
    fc=1                                         # file numbering initialized
}
{
    if($1==p||FNR==1)                            # when $1 remains the same
        b=b (++cc==1?"":ORS) $0                  # keep buffering
    else {
        if(n-(cc+cp)>=0) {                       # if room in previous file
            print b >> sprintf("file%d.txt",fc)  # append to it
            cp+=cc
        } else {                                 # if it will not fit
            close(sprintf("file%d.txt",fc))
            print b > sprintf("file%d.txt",++fc) # create new
            cp=cc
        }
        b=$0
        cc=1
    }
    p=$1
}
END {                                            # same as the else above
    if(n-(cc+cp)>=0)
        print b >> sprintf("file%d.txt",fc)
    else {
        close(sprintf("file%d.txt",fc))
        print b > sprintf("file%d.txt",++fc)
    }
}' file
I hope I have understood your requirement correctly. Could you please try the following? It was written and tested with GNU awk.
awk -v count="1" '
FNR==NR{
    max[$1]++
    if(!a[$1]++){
        first[++count2]=$1
    }
    next
}
FNR==1{
    for(i in max){
        maxtill=(max[i]>maxtill?max[i]:maxtill)
    }
    prev=$1
}
{
    if(!b[$1]++){ ++count1 }
    c[$1]++
    if(prev!=$1 && prev){
        if((maxtill-currentFill)<max[$1]){ count++ }
        else if(maxtill==max[$1]){ count++ }
    }
    else if(prev==$1 && c[$1]==maxtill && count1<count2){
        count++
    }
    else if(c[$1]==maxtill && prev==$1){
        if(max[first[count1+1]]>(maxtill-c[$1])){ count++ }
    }
    prev=$1
    outputFile="outfile"count
    print > (outputFile)
    currentFill=currentFill==maxtill?1:++currentFill
}
' Input_file Input_file
Testing the above solution with the OP's sample Input_file:
cat Input_file
a 1
a 2
a 1
b 1
c 1
c 1
d 2
d 1
d 2
d 1
It will create 3 output files, named outfile1, outfile2 and outfile3, as follows.
cat outfile1
a 1
a 2
a 1
b 1
cat outfile2
c 1
c 1
cat outfile3
d 2
d 1
d 2
d 1
2nd time testing (with my own custom sample): let's say the following is the Input_file.
cat Input_file
a 1
a 2
a 1
b 1
c 1
c 1
d 2
d 1
d 2
d 1
d 4
d 5
When I run the above solution, 2 output files will be created, named outfile1 and outfile2, as follows.
cat outfile1
a 1
a 2
a 1
b 1
c 1
c 1
cat outfile2
d 2
d 1
d 2
d 1
d 4
d 5
This might work for you (GNU sed, bash and csplit):
f(){
    local g=$1
    shift
    while (( $# > 1 ))
    do
        (($#==2)) && echo $2 && break
        (($2-$1==$g)) && echo $2 && shift && continue
        (($3-$1==$g)) && echo $3 && shift 2 && continue
        (($2-$1<$g)) && (($3-$1>$g)) && echo $2 && shift && continue
        set -- $1 ${@:3}
    done
}
csplit file $(f 4 $(sed -nE '1=;N;/^(\S+\s).*\n\1/!=;D' file))
This will split file into separate files named xxnn where nn is 00,01,02,...
The sed command produces a list of line numbers that splits the file on change of key.
The function f then rewrites these numbers grouping them in to lengths of 4 or less.
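To see what the sed stage feeds into f, you can run it alone on the sample data from the question (GNU sed assumed):

```shell
# Build the sample input, then print the line number of the first line of
# each group (i.e. wherever the key in column 1 changes).
printf 'a 1\na 2\na 1\nb 1\nc 1\nc 1\nd 2\nd 1\nd 2\nd 1\n' > file
sed -nE '1=;N;/^(\S+\s).*\n\1/!=;D' file
# prints 1, 4, 5 and 7 on separate lines
```

f 4 then merges those group starts so that csplit never produces a file longer than 4 lines.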

Replace a value in a file by another one (bash/awk)

I have a file (a coordinates file for those who know what it is) like following :
1 C 1
2 C 1 1 1.60000
3 H 5 1 1.10000 2 109.4700
4 H 5 1 1.10000 2 109.4700 3 109.4700 1
and so on... My idea is to replace the value "1.60000" in the second line by other values, using a for loop.
I would like the value to start at, let's say, 0 and stop at 2.0, with an increment step of 0.05.
Here is what I already tried:
#!/bin/bash
a=0;
for ((i=0; i<=10 (for example); i++)); do
    awk '{if ((NR==2) && ($5=="1.60000")) {($5=a)} print $0 }' file.dat > ${i}_file.dat
    a=$((a+0.05))
done
But, unfortunately it doesn't work. I tried a lot of combination for the {$5=a} statement but without conclusive results.
Here is what I obtained:
1 C 1
2 C 1 1
3 H 5 1 1.10000 2 109.4700
4 H 5 1 1.10000 2 109.4700 3 109.4700 1
The value 1.60000 simply disappears, or at least is replaced by a blank.
Any advice ?
Thanks a lot,
Pierre-Louis
For this, perhaps sed is a better alternative:
$ v=0.00; for ((i=0; i<=40; i++)); do
    sed '2s/1.60/'"$v"'/' file > file_"$i"
    v=$(echo "$v + 0.05" | bc | xargs printf "%.2f\n")
done
Explanation
sed '2s/1.60/'"$v"'/' file change the value 1.60 on second line with the value of variable v
Floating-point arithmetic in bash is hard. This adds 0.05 to the value and formats it (0.05 instead of .05) so that we can use it in the substitution with sed.
Exercise to you: in bash try to add 0.05 to 0.05 and format the output as 0.10 with leading zero.
example with awk (glenn's suggestion)
for ((i=0; i<=10; i++)); do
    awk -v "i=$i" '
        FNR==2 { $5=sprintf("%2.1f", i*0.5); print $0 }
    ' file.dat # > "${i}_file.dat" # uncomment for file output
done
Advantage: awk manages the floating-point arithmetic.
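Worth spelling out why the original attempt printed a blank: awk never sees the shell variable a, so $5 was set to an empty string. Passing the value in with -v fixes that; a minimal sketch, with file.dat recreated from a shortened version of the question's sample:

```shell
# Recreate a shortened version of the question's coordinates file.
cat > file.dat <<'EOF'
1 C 1
2 C 1 1 1.60000
3 H 5 1 1.10000 2 109.4700
EOF
# Hand the shell variable to awk explicitly with -v; inside the script an
# unset "a" silently expands to the empty string.
a=0.05
awk -v a="$a" 'NR==2 && $5=="1.60000" { $5=a } { print }' file.dat
# the second line comes out as: 2 C 1 1 0.05
```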

Efficient way of indexing a specific number from a text file

I have a text file containing a line of various numbers (i.e. 2 4 1 7 12 1 4 4 3 1 1 2)
I'm trying to get the index of each occurrence of 1. This is my code for what I'm currently doing (subtracting 1 from each index value since my indexing starts at 0).
eq='1'
gradvec=()
count=0
length=0
for item in `cat file`
do
    ((count++))
    if (("$item" == "$eq"))
    then
        ((length++))
        if (("$length" == 1))
        then
            gradvec=$((count - 1))
        else
            gradvec=$gradvec' '$((count - 1))
        fi
    fi
done
Although the code works, I was wondering if there was a shorter way of doing this? The result is the gradvec variable being
2 5 9 10
Consider this as the input file:
$ cat file
2 4 1 7 12 1
4 4 3 1 1 2
To get the indices of every occurrence of 1 in the input file:
$ awk '$1==1 {print NR-1}' RS='[[:space:]]+' file
2
5
9
10
How it works:
$1==1 {print NR-1}
If the value in any record is 1, print the record number minus 1.
RS='[[:space:]]+'
Define the record separator as one or more of any kind of space.
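If you would rather stay in pure bash, the same indices can be had with one array and ${!array[@]} (my own sketch, not part of the answer above):

```shell
printf '2 4 1 7 12 1\n4 4 3 1 1 2\n' > file
# Read every whitespace-separated number into one array (word-splitting is
# intended here), then print the 0-based index of each element equal to 1.
nums=( $(cat file) )
for i in "${!nums[@]}"; do
    [ "${nums[$i]}" -eq 1 ] && echo "$i"
done
# prints 2, 5, 9 and 10
```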

Replace the nth field of every mth line using awk or bash

For a file that contains entries similar to as follows:
foo 1 6 0
fam 5 11 3
wam 7 23 8
woo 2 8 4
kaz 6 4 9
faz 5 8 8
How would you replace the nth field of every mth line with the same element using bash or awk?
For example, if n = 1 and m = 3 and the element = wot, the output would be:
foo 1 6 0
fam 5 11 3
wot 7 23 8
woo 2 8 4
kaz 6 4 9
wot 5 8 8
I understand you can call / print every mth line using e.g.
awk 'NR%7==0' file
So far I have tried to keep this in memory but to no avail... I need to keep the rest of the file as well.
I would prefer answers using bash or awk, but sed solutions would also be helpful. I'm a beginner in all three. Please explain your solution.
awk -v m=3 -v n=1 -v el='wot' 'NR % m == 0 { $n = el } 1' file
Note, however, that the inter-field whitespace is not guaranteed to be preserved as-is, because awk splits a line into fields by any run of whitespace; as written, the output fields of modified lines will be separated by a single space.
If your input fields are consistently separated by 2 spaces, however, you can effectively preserve the input whitespace by adding -F' ' -v OFS=' ' to the awk invocation.
-v m=3 -v n=1 -v el='wot' defines Awk variables m, n, and el
NR % m == 0 is a pattern (condition) that evaluates to true for every m-th line.
{ $n = el } is the associated action that replaces the nth field of the input line with variable el, causing the line to be rebuilt, implicitly using OFS, the output-field separator, which defaults to a space.
1 is a common Awk shorthand for printing the (possibly modified) input line at hand.
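Putting those pieces together on the first three sample lines:

```shell
# Every 3rd line gets field 1 replaced by "wot"; the trailing "1" prints
# each (possibly modified) line.
printf 'foo 1 6 0\nfam 5 11 3\nwam 7 23 8\n' |
    awk -v m=3 -v n=1 -v el='wot' 'NR % m == 0 { $n = el } 1'
# the last line comes out as: wot 7 23 8
```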
Great little exercise. While I would probably lean toward an awk solution, in bash you can also rely on parameter expansion with substring replacement to replace the nth field of every mth line. Essentially, you can read every line, preserving whitespace, then check your line count, e.g. if c is your line counter and m your variable for mth line, you could use:
if (( $((c % m )) == 0)) ## test for mth line
If the line is a replacement line, you can read each word into an array after restoring default word-splitting and then use your array element index n-1 to provide the replacement (e.g. ${line/find/replace} with ${line/"${array[$((n-1))]}"/replace}).
If it isn't a replacement line, simply output the line unchanged. A short example could be similar to the following (to which you can add additional validations as required)
#!/bin/bash

[ -n "$1" -a -r "$1" ] || {   ## filename given and readable
    printf "error: insufficient or unreadable input.\n"
    exit 1
}

n=${2:-1}   ## variables with defaults n=1, m=3, e=wot
m=${3:-3}
e=${4:-wot}

c=1         ## line count
while IFS= read -r line; do
    if (( $((c % m )) == 0))                ## test for mth line
    then
        IFS=$' \t\n'
        a=( $line )                         ## split into array
        IFS=
        echo "${line/"${a[$((n-1))]}"/$e}"  ## nth replaced with e
    else
        echo "$line"                        ## otherwise just output line
    fi
    ((c++))                                 ## advance counter
done <"$1"
Example Use/Output
n=1, m=3, e=wot
$ bash replmn.sh dat/repl.txt
foo 1 6 0
fam 5 11 3
wot 7 23 8
woo 2 8 4
kaz 6 4 9
wot 5 8 8
n=1, m=2, e=baz
$ bash replmn.sh dat/repl.txt 1 2 baz
foo 1 6 0
baz 5 11 3
wam 7 23 8
baz 2 8 4
kaz 6 4 9
baz 5 8 8
n=3, m=2, e=99
$ bash replmn.sh dat/repl.txt 3 2 99
foo 1 6 0
fam 5 99 3
wam 7 23 8
woo 2 99 4
kaz 6 4 9
faz 5 99 8
An awk solution is shorter (and avoids problems with duplicate occurrences of the replacement string in $line), but both would need similar validation of field existence, etc. Learn from both, and let me know if you have any questions.
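The parameter-expansion step at the heart of the script can be tried in isolation:

```shell
# Split one line into words, then swap word n for the replacement via
# ${line/find/replace}; quoting the "find" part makes it a literal match.
line='fam 5 11 3'
n=1
a=( $line )                          # default word-splitting
echo "${line/"${a[$((n-1))]}"/wot}"  # prints: wot 5 11 3
```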

How to update matrix-like-data (.txt) file in bash programming?

I am new to bash programming. I have this file; its contents are:
A B C D E
1 X 0 X 0 0
2 0 X X 0 0
3 0 0 0 0 0
4 X X X X X
Where X means it has a value, and 0 means it's empty.
From there, let's say the user enters B3, which is a 0, meaning I need to replace it with X. What is the best way to do this? I need to constantly update this file.
FYI: This is a homework question, so please don't give a direct code/answer. But any sample code (like how to use a relevant function, etc.) will be very much appreciated.
Regards,
Newbie Bash Scripter
EDIT:
If I am not wrong, bash can call/update a specific column directly. Can it be done with row+column?
If you can use sed I'll throw out this tidbit:
sed -i "/^2/s/. /X /4" /path/to/matrix_file
Input
A B C D E
1 0 0 0 0 0
2 X 0 0 X 0
3 X X 0 0 X
Output
A B C D E
1 0 0 0 0 0
2 X 0 X X 0
3 X X 0 0 X
Explanation
^2: This restricts sed to only work on lines which begin with 2, i.e. the 2nd row
s: This is the replacement command
Note for the next two, '_' represents a white space
._: This is the pattern to match. The . is a regular expression that matches any character, thus ._ matches "any character followed by a space". Note that this could also be [0X]_ if you are guaranteed that the only two characters you can have are '0' and 'X'
X_: This is the replacement text. In this case we are replacing 'any character followed by a space' as described above with 'X followed by a space'
4: This matches the 4th occurrence of the pattern text above. Since the row index is the 1st occurrence, this is the 3rd data column, i.e. column C.
What would be left for you to do is use variables in the place of ^2 and 4 such as ^$row and $col and then map the letters A - E to 1 - 5
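One possible sketch of that parameterization (my own; the variable names are assumptions, not from the answer): map the column letter to the occurrence number via its ASCII code, remembering that the row index is the 1st ". " occurrence.

```shell
# 'A' is ASCII 65, and column A is the 2nd occurrence of ". " on a row
# (the row index is the 1st), so occurrence = (letter code - 65) + 2.
row=2; col=C
occ=$(( $(printf '%d' "'$col") - 65 + 2 ))
printf 'A B C D E\n1 0 0 0 0 0\n2 X 0 0 X 0\n' | sed "/^$row/s/. /X /$occ"
# row 2 comes out as: 2 X 0 X X 0
```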
something to get you started
#!/bin/bash
# call it using ./script B 1
# make it executable using "chmod 755 script"

# read input parameters
col=$1
row=$2

# construct below splits on whitespace
while read -a line
do
    for i in "${line[@]}"; do
        array=( "${array[@]}" "$i" )
    done
done < m

# Now you have the matrix in a one-dimensional array that can be indexed.
# Let's print it
for i in "${array[@]}"; do
    echo "$i"
done
Here's a starter for you using AWK:
awk -v col=B -v row=3 'BEGIN{getline; for (i=1;i<=NF;i++) cols[$i]=i+1} NR==row+1{print $cols[col]}'
The i+1 and row+1 account for the row heading and column heading, respectively.
