Increment last number of first line in file - bash

I want to write a shell script which can increment the last value of the first line of a certain file structure:
File-structure:
p cnf integer integer
integer integer ... 0
For Example:
p cnf 11 9
1 -2 0
3 -1 5 0
To:
p cnf 11 10
1 -2 0
3 -1 5 0
The dots should stay the same.

If you could use perl:
perl -pe 's/(-*\d+)$/$1+1/e if $. == 1' inputfile
Here (-*\d+)$ captures the integer value (optionally negative) at the end of the line, and the e flag evaluates the replacement as a Perl expression, so the captured value is incremented. The $. == 1 condition restricts the substitution to the first line.
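A quick check with the sample file from the question (assuming it is saved as inputfile):
$ perl -pe 's/(-*\d+)$/$1+1/e if $. == 1' inputfile
p cnf 11 10
1 -2 0
3 -1 5 0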

With GNU awk:
awk 'NR==1{$NF++} {print}' file
or
awk 'NR==1{$NF++}1' file
Output:
p cnf 11 10
1 -2 0
3 -1 5 0
$NF contains the last column.
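If the file should be modified in place rather than written to stdout, GNU awk 4.1+ ships an inplace extension; a minimal sketch, assuming a recent gawk:
gawk -i inplace 'NR==1{$NF++} 1' file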

Related

use bash or awk to replace part of a string

I have the following example lines in a file:
sweet_25 2 0 4
guy_guy 2 4 6
ging_ging 0 0 3
moat_2 0 1 0
I want to process the file and have the following output:
sweet_25 2 0 4
guy 2 4 6
ging 0 0 3
moat_2 0 1 0
Notice that the required effect happened in lines 2 and 3 - an underscore and the text following it are removed on lines where this pattern occurs.
I have not succeeded with the following:
sed -E 's/([a-zA-Z])_[a-zA-Z]/$1/g' file.txt >out.txt
Any bash or awk advice will be welcome. Thanks
If you want to replace the whole word after the underscore, you have to repeat the character class one or more times using [a-zA-Z]+ and use \1 in the replacement.
sed -E 's/([a-zA-Z])_[a-zA-Z]+/\1/g' file.txt >out.txt
If the words should be the same before and after the underscore, you can use a repeating capture group with a backreference.
If you only want to do this for the start of the string you can prepend ^ to the pattern and omit the /g at the end of the sed command.
sed -E 's/([a-zA-Z]+)(_\1)+/\1/g' file.txt >out.txt
The pattern matches:
([a-zA-Z]+) Capture group 1, match 1 or more occurrences of a char a-zA-Z
(_\1)+ Capture group 2, repeat matching _ and the same text captured by group 1
The file out.txt will contain:
sweet_25 2 0 4
guy 2 4 6
ging 0 0 3
moat_2 0 1 0
With your shown samples, please try following awk code.
awk 'split($1,arr,"_") && arr[1] == arr[2]{$1=arr[1]} 1' Input_file
Explanation: awk's split function splits the 1st field into an array named arr using _ as the delimiter; if the 1st element of arr is EQUAL to the 2nd element, only the 1st element is kept in the first field ($1), and the trailing 1 prints every line, edited or not.
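The same program, spread out with comments (a readability sketch, no change in behaviour):
awk '
  # split the 1st field on "_" into arr; when the text before and after
  # the underscore is identical, keep only the first part
  split($1, arr, "_") && arr[1] == arr[2] { $1 = arr[1] }
  1  # print every line, edited or not
' Input_file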
You can do it more simply, like this:
sed -E 's/_[a-zA-Z]+//' file.txt >out.txt
This just replaces an underscore followed by any number of alphabetical characters with nothing.
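Run against the sample file it produces the requested output:
$ sed -E 's/_[a-zA-Z]+//' file.txt
sweet_25 2 0 4
guy 2 4 6
ging 0 0 3
moat_2 0 1 0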
$ awk 'NR~/^[23]$/{sub(/_[^ ]+/,"")} 1' file
sweet_25 2 0 4
guy 2 4 6
ging 0 0 3
moat_2 0 1 0
I would do:
awk '$1~/[[:alpha:]]_[[:alpha:]]/{sub(/_.*/,"",$1)} 1' file
Prints:
sweet_25 2 0 4
guy 2 4 6
ging 0 0 3
moat_2 0 1 0

Iterating over a text file in bash and rounding each number

My file looks like this
0 0 1 0.2 1 1
1 1 0.8 0.1 1
0.2 0.4 1 0 1
And I need to a create a new output file
0 0 1 0 1 1
1 1 1 0 1
0 0 1 0 1
i.e. if the number is greater than 0.5, it is rounded up to 1, and if it less than 0.5, it is rounded down to 0 and put into a new file.
The file is quite large, with ~ 1400000000 values. I would quite like to write a bash script to do this.
I am guessing the best way to do this would be to iterate over each value in a for loop, with an if statement inside which tests whether the number is greater or less than 0.5 and then prints 0 or 1 accordingly.
The pseudocode would look like this, but my bash isn't great so - before you tell me it isn't syntactically correct, I already know:
#!/bin/bash
#reads in each line
while read p; do
#loops through each number in each line
for i in p; do
#tests if each number is greater than or equal to 0.5 and prints accordingly
if [i => 0.5]
then
print 1
else
print 0
fi
done < test.txt >
I'm not really sure how to do this. Can anyone help? Thanks.
awk '{
for( i=1; i<=NF; i++ )
$i = $i<0.5 ? 0 : 1
}1' input_file > output_file
$i = $i<0.5 ? 0 : 1 changes each field to 0 or 1 and {...}1 will print the line with the changed values afterwards.
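With the sample input above this produces the requested output (run as a one-liner here, writing to stdout for illustration):
$ awk '{for(i=1;i<=NF;i++) $i = $i<0.5 ? 0 : 1} 1' input_file
0 0 1 0 1 1
1 1 1 0 1
0 0 1 0 1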
another awk without loops...
$ awk -v RS='[ \n]' '{printf ($1>=0.5) RT}' file
0 0 1 0 1 1
1 1 1 0 1
0 0 1 0 1
if the values are not between 0 and 1, you may want to change to
$ awk -v RS='[ \n]' '{printf "%.0f%s", $1, RT}' file
note that the default rounding is round-half-to-even (i.e. 0.5 -> 0, but 1.5 -> 2). If you always want to round halves up:
$ awk -v RS='[ \n]' '{i=int($1); printf "%d%s", i+(($1-i)>=0.5), RT}' file
That should take care of non-negative numbers. For negative numbers there are again two alternatives: round towards zero or towards negative infinity.
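A sketch of rounding half away from zero that also copes with negative values (still assuming GNU awk for the regex RS and RT):
$ awk -v RS='[ \n]' '{v=$1; printf "%d%s", (v>=0 ? int(v+0.5) : -int(-v+0.5)), RT}' file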
Here's one in Perl using regex and look-ahead:
$ perl -p -e 's/0(?=\.[6789])/1/g;s/\.[0-9]+//g' file
0 0 1 0 1 1
1 1 1 0 1
0 0 1 0 1
I went with the if it less than 0.5, it is rounded down to 0 part.

Replace the nth field of every mth line using awk or bash

For a file that contains entries similar to as follows:
foo 1 6 0
fam 5 11 3
wam 7 23 8
woo 2 8 4
kaz 6 4 9
faz 5 8 8
How would you replace the nth field of every mth line with the same element using bash or awk?
For example, if n = 1 and m = 3 and the element = wot, the output would be:
foo 1 6 0
fam 5 11 3
wot 7 23 8
woo 2 8 4
kaz 6 4 9
wot 5 8 8
I understand you can call / print every mth line using e.g.
awk 'NR%7==0' file
So far I have tried to keep this in memory but to no avail... I need to keep the rest of the file as well.
I would prefer answers using bash or awk, but sed solutions would also be helpful. I'm a beginner in all three. Please explain your solution.
awk -v m=3 -v n=1 -v el='wot' 'NR % m == 0 { $n = el } 1' file
Note, however, that the inter-field whitespace is not guaranteed to be preserved as-is, because awk splits a line into fields by any run of whitespace; as written, the output fields of modified lines will be separated by a single space.
If your input fields are consistently separated by 2 spaces, however, you can effectively preserve the input whitespace by adding -F'  ' -v OFS='  ' to the awk invocation.
-v m=3 -v n=1 -v el='wot' defines Awk variables m, n, and el
NR % m == 0 is a pattern (condition) that evaluates to true for every m-th line.
{ $n = el } is the associated action that replaces the nth field of the input line with variable el, causing the line to be rebuilt, implicitly using OFS, the output-field separator, which defaults to a space.
1 is a common Awk shorthand for printing the (possibly modified) input line at hand.
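With the sample file from the question (n=1, m=3, element = wot):
$ awk -v m=3 -v n=1 -v el='wot' 'NR % m == 0 { $n = el } 1' file
foo 1 6 0
fam 5 11 3
wot 7 23 8
woo 2 8 4
kaz 6 4 9
wot 5 8 8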
Great little exercise. While I would probably lean toward an awk solution, in bash you can also rely on parameter expansion with substring replacement to replace the nth field of every mth line. Essentially, you can read every line, preserving whitespace, then check your line count, e.g. if c is your line counter and m your variable for mth line, you could use:
if (( $((c % m )) == 0)) ## test for mth line
If the line is a replacement line, you can read each word into an array after restoring default word-splitting and then use your array element index n-1 to provide the replacement (e.g. ${line/find/replace} with ${line/"${array[$((n-1))]}"/replace}).
If it isn't a replacement line, simply output the line unchanged. A short example could be similar to the following (to which you can add additional validations as required)
#!/bin/bash

[ -n "$1" -a -r "$1" ] || {    ## filename given and readable
    printf "error: insufficient or unreadable input.\n"
    exit 1
}

n=${2:-1}      ## variables with defaults n=1, m=3, e=wot
m=${3:-3}
e=${4:-wot}
c=1            ## line count

while IFS= read -r line; do
    if (( $((c % m)) == 0 ))   ## test for mth line
    then
        IFS=$' \t\n'
        a=( $line )            ## split into array
        IFS=
        echo "${line/"${a[$((n-1))]}"/$e}"   ## nth field replaced with e
    else
        echo "$line"           ## otherwise just output the line
    fi
    ((c++))                    ## advance counter
done <"$1"
Example Use/Output
n=1, m=3, e=wot
$ bash replmn.sh dat/repl.txt
foo 1 6 0
fam 5 11 3
wot 7 23 8
woo 2 8 4
kaz 6 4 9
wot 5 8 8
n=1, m=2, e=baz
$ bash replmn.sh dat/repl.txt 1 2 baz
foo 1 6 0
baz 5 11 3
wam 7 23 8
baz 2 8 4
kaz 6 4 9
baz 5 8 8
n=3, m=2, e=99
$ bash replmn.sh dat/repl.txt 3 2 99
foo 1 6 0
fam 5 99 3
wam 7 23 8
woo 2 99 4
kaz 6 4 9
faz 5 99 8
An awk solution is shorter (and avoids problems when the nth field's value also occurs earlier in $line), but both would need similar validation of field existence, etc. Learn from both and let me know if you have any questions.
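As a sketch, the field-existence validation mentioned above could look like this in the awk version (skip the replacement when the line has fewer than n fields):
awk -v m=3 -v n=1 -v el='wot' 'NR % m == 0 && NF >= n { $n = el } 1' file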

If operator inside for loop

I have input file as below, need to do this conversion col1*0 + col2*1 + col3*2 for every 3 column triplet.
input.txt - All positive numbers, can be decimals, real file has 1000s of columns.
0 0 0 1 0 0
0 1 0 0 0 1
0 0 1 0 0 0
I have the below gawk line that does that:
gawk '{for(i=1;i<=NF;i+=3)x=(x?x FS:"")(($(i+1))+($(i+2)*2));print x;x=y}' input.txt
0 0
1 2
2 0
Additionally, I need to check if 3 numbers are all zeros, if they are all zeros then the conversion should be -9.
Pseudo code:
if($i==0 & $(i+1)==0 & $(i+2)==0) {-9} else {$(i+1)+$(i+2)*2}
#or as all numbers are positive.
if(($i+$(i+1)+$(i+2))==0) {-9} else {$(i+1)+$(i+2)*2}
Expected output:
-9 0
1 2
2 -9
Data description:
This data is output from IMPUTE2 software - a genotype imputation and haplotype phasing program. Rows are SNPs, columns are samples. Every SNP is represented by 3 columns, i.e. 3 numbers per SNP in the range 0-1 (the probabilities of alleles AA, AB, BB). So in the above example we have 3 SNPs and 2 samples. Imputation can also be represented as a dosage value, 1 number per SNP in the range 0-2. We are trying to convert the probability format into the dosage format. When IMPUTE2 can't assign any probability to any of the alleles, it outputs 0 0 0, which we should convert to the no-call value -9.
You want the sum to be different if the three given columns are all 0. For this, you can expand the ternary operator to something like:
gawk '{ for(i=1;i<=NF;i+=3) {
x=$(i+1) + $(i+2)*2; # the sum
res=res (res ? FS : "") ($i==0 && $(i+1)==0 && $(i+2)==0 ?-9:x)
}
print res; res="" # print stored line and empty for next loop
}' file
That is, append the value -9 if all the elements are 0. Otherwise, the calculated x:
res=res (res ? FS : "") ($i==0 && $(i+1)==0 && $(i+2)==0 ?-9:x)
The last parenthesised expression is the "all three columns are 0" check; its result (-9 or x) is what gets appended to res.
If all values are positive, the check can be reformatted to just compare if the sum is 0 or not.
($i + $(i+1) + $(i+2)) ? x : -9
Testing with your file apparently works:
$ gawk '{for(i=1;i<=NF;i+=3) {x=$(i+1) + $(i+2)*2; res=res (res ? FS : "") ($i==0 && $(i+1)==0 && $(i+2)==0 ?-9:x)} print res; res=""}' file
-9 0
1 2
2 -9
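For completeness, here is a minimal variant using the simpler sum check (valid since all values are non-negative, as described above):
$ gawk '{for(i=1;i<=NF;i+=3){x=$(i+1)+$(i+2)*2; res=res (res?FS:"") (($i+$(i+1)+$(i+2))?x:-9)} print res; res=""}' file
-9 0
1 2
2 -9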
another awk one-liner (assuming non-negative input values)
$ awk '{c1=$2+2*$3;c2=$5+2*$6; print c1||$1?c1:-9,c2||$4?c2:-9}' lop
-9 0
1 2
2 -9

reset row number count in awk

I have a file like this
file.txt
0 1 a
1 1 b
2 1 d
3 1 d
4 2 g
5 2 a
6 3 b
7 3 d
8 4 d
9 5 g
10 5 g
.
.
.
I want reset row number count to 0 in first column $1 whenever value of field in second column $2 changes, using awk or bash script.
result
0 1 a
1 1 b
2 1 d
3 1 d
0 2 g
1 2 a
0 3 b
1 3 d
0 4 d
0 5 g
1 5 g
.
.
.
As long as you don't mind a bit of excess memory usage, and the second column is sorted, I think this is the most fun:
awk '{$1=a[$2]+++0;print}' input.txt
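Spelled out with comments (same behaviour, just more readable):
awk '{
  # a[$2] is a per-value counter; the post-increment yields the previous
  # count (starting at 0) for each distinct value of the 2nd column,
  # and adding 0 just forces a numeric result
  $1 = a[$2]++ + 0
  print
}' input.txt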
This awk one-liner seems to work for me:
[ghoti#pc ~]$ awk 'prev!=$2{first=0;prev=$2} {$1=first;first++} 1' input.txt
0 1 a
1 1 b
2 1 d
3 1 d
0 2 g
1 2 a
0 3 b
1 3 d
0 4 d
0 5 g
1 5 g
Let's break apart the script and see what it does.
prev!=$2 {first=0;prev=$2} -- This is what resets your counter. Since the initial state of prev is empty, we reset on the first line of input, which is fine.
{$1=first;first++} -- For every line, set the first field, then increment variable we're using to set the first field.
1 -- this is awk short-hand for "print the line". It's really a condition that always evaluates to "true", and when a condition/statement pair is missing a statement, the statement defaults to "print".
Pretty basic, really.
The one catch of course is that when you change the value of any field in awk, it rewrites the line using whatever field separators are set, which by default is just a space. If you want to adjust this, you can set your OFS variable:
[ghoti#pc ~]$ awk -vOFS=" " 'p!=$2{f=0;p=$2}{$1=f;f++}1' input.txt | head -2
0 1 a
1 1 b
Salt to taste.
A pure bash solution:
file="/PATH/TO/YOUR/OWN/INPUT/FILE"
count=0
old_trigger=0

while read a b c; do
    if ((b == old_trigger)); then
        echo "$((count++)) $b $c"
    else
        count=0
        echo "$((count++)) $b $c"
        old_trigger=$b
    fi
done < "$file"
This solution (IMHO) has the advantage of using a readable algorithm. I like what the other answers offer, but they are not as easy for beginners to follow.
NOTE:
((...)) is an arithmetic command, which returns an exit status of 0 if the expression is nonzero, or 1 if the expression is zero. Also used as a synonym for let, if side effects (assignments) are needed. See http://mywiki.wooledge.org/ArithmeticExpression
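A tiny illustration of that exit-status and side-effect behaviour:
count=0
(( count++ ))    # expression evaluates to 0 (post-increment), so the exit status is 1
echo "$count"    # prints 1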
Perl solution:
perl -naE '
$dec = $F[0] if defined $old and $F[1] != $old;
$F[0] -= $dec;
$old = $F[1];
say join "\t", @F[0,1,2];'
$dec is subtracted from the first column each time. When the second column changes (its previous value is stored in $old), $dec is set to the current first-column value so that the first column starts from zero again. The defined condition is needed for the first line to work.
