Replace exact numbers in a column keeping order - bash

I have this file, and I would like to renumber the 3rd column so that its values appear in consecutive order. I also need to skip the first row (the header of the file).
Initial file:
#results from program A
8536 17 1 CGTCGCCTAT 116 147M2D
8536 17 1 CGTCGCTTAT 116 147M2D
8536 17 1 CGTTGCCTAT 116 147M2D
8536 17 1 CGTTGCTTAT 116 147M2D
2005 17 3 CTTG 61 145M
2005 17 3 TTCG 30 145M
91823 17 4 ATGAAGC 22 146M
91823 17 4 GTAGGCC 19 146M
16523 17 5 GGGGGTCGGT 45 30M1D115M
Modified file:
#results from program A
8536 17 1 CGTCGCCTAT 116 147M2D
8536 17 1 CGTCGCTTAT 116 147M2D
8536 17 1 CGTTGCCTAT 116 147M2D
8536 17 1 CGTTGCTTAT 116 147M2D
2005 17 2 CTTG 61 145M
2005 17 2 TTCG 30 145M
91823 17 3 ATGAAGC 22 146M
91823 17 3 GTAGGCC 19 146M
16523 17 4 GGGGGTCGGT 45 30M1D115M
Do you know how I could do it?

Could you please try the following:
awk 'prev!=$1{++count}{$3=count;prev=$1;$1=$1} 1' OFS="\t" Input_file
To keep the header line untouched, use the following:
awk 'FNR==1{print;next}prev!=$1{++count}{$3=count;prev=$1;$1=$1} 1' OFS="\t" Input_file
Second solution: in case your Input_file's 1st field is NOT in order, the following may help (it reads the file twice):
awk 'FNR==NR{if(!a[$1]++){b[$1]=++count};next} {$3=b[$1];$1=$1} 1' OFS="\t" Input_file Input_file
To keep the header untouched with the second solution, use the following:
awk 'FNR==1{if(++val==1){print};next}FNR==NR{if(!a[$1]++){b[$1]=++count};next} {$3=b[$1];$1=$1} 1' OFS="\t" Input_file Input_file
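A commented, expanded form of the first one-liner may make the logic clearer (same behavior; the three input rows here are a made-up sample in the shape of the question's data):

```shell
# Expanded form of: awk 'prev!=$1{++count}{$3=count;prev=$1;$1=$1} 1' OFS="\t"
printf '8536 17 9 A\n8536 17 9 B\n2005 17 9 C\n' |
awk '
  prev != $1 { ++count }   # column 1 changed: advance the group counter
  {
    $3 = count             # overwrite the 3rd field with the counter
    prev = $1              # remember column 1 for the next record
    $1 = $1                # touch a field so awk rebuilds the record with OFS
  }
  1                        # always-true pattern: print the rebuilt record
' OFS="\t"
# → 8536  17  1  A
#   8536  17  1  B
#   2005  17  2  C   (tab-separated)
```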

Another minimalist awk:
$ awk '{$3=c+=p!=$1;p=$1}1' file | column -t
8536 17 1 CGTCGCCTAT 116 147M2D
8536 17 1 CGTCGCTTAT 116 147M2D
8536 17 1 CGTTGCCTAT 116 147M2D
8536 17 1 CGTTGCTTAT 116 147M2D
2005 17 2 CTTG 61 145M
2005 17 2 TTCG 30 145M
91823 17 3 ATGAAGC 22 146M
91823 17 3 GTAGGCC 19 146M
16523 17 4 GGGGGTCGGT 45 30M1D115M
With-header version:
$ awk 'NR==1; NR>1{$3=c+=p!=$1;p=$1; print | "column -t"}' file
#results from program A
8536 17 1 CGTCGCCTAT 116 147M2D
8536 17 1 CGTCGCTTAT 116 147M2D
8536 17 1 CGTTGCCTAT 116 147M2D
8536 17 1 CGTTGCTTAT 116 147M2D
2005 17 2 CTTG 61 145M
2005 17 2 TTCG 30 145M
91823 17 3 ATGAAGC 22 146M
91823 17 3 GTAGGCC 19 146M
16523 17 4 GGGGGTCGGT 45 30M1D115M
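The compact `$3=c+=p!=$1` can be unpacked as follows (an equivalent sketch, run on a small made-up sample):

```shell
printf '8536 17 9 A\n8536 17 9 B\n2005 17 9 C\n' |
awk '{
  c += (p != $1)   # comparison yields 1 when column 1 changes, else 0
  $3 = c           # assign the running group number to field 3
  p = $1           # remember column 1 for the next record
} 1'
# → 8536 17 1 A
#   8536 17 1 B
#   2005 17 2 C
```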

Splitting a file into smaller max-n-chars files without cutting any line

Here is a sample input text file generated with the cal command:
$ cal 2743 > sample_text
In this example the file has 2180 characters:
$ wc sample_text
36 462 2180 sample_text
I want to split it into smaller files, each having no more than 700 characters, while keeping every line intact (no line may be cut).
I can view each such block with the following awk code:
$ awk '{l=length+l;if(l<=700){print l,$0}else{l=length;print "\nnext block\n",l,$0}}' sample_text
32 2743
98 January February March
164 Su Mo Tu We Th Fr Sa Su Mo Tu We Th Fr Sa Su Mo Tu We Th Fr Sa
230 1 2 1 2 3 4 5 6 1 2 3 4 5 6
296 3 4 5 6 7 8 9 7 8 9 10 11 12 13 7 8 9 10 11 12 13
362 10 11 12 13 14 15 16 14 15 16 17 18 19 20 14 15 16 17 18 19 20
428 17 18 19 20 21 22 23 21 22 23 24 25 26 27 21 22 23 24 25 26 27
494 24 25 26 27 28 29 30 28 28 29 30 31
560 31
560
626 April May June
692 Su Mo Tu We Th Fr Sa Su Mo Tu We Th Fr Sa Su Mo Tu We Th Fr Sa
next block
66 1 2 3 1 1 2 3 4 5
132 4 5 6 7 8 9 10 2 3 4 5 6 7 8 6 7 8 9 10 11 12
198 11 12 13 14 15 16 17 9 10 11 12 13 14 15 13 14 15 16 17 18 19
264 18 19 20 21 22 23 24 16 17 18 19 20 21 22 20 21 22 23 24 25 26
330 25 26 27 28 29 30 23 24 25 26 27 28 29 27 28 29 30
396 30 31
396
462 July August September
528 Su Mo Tu We Th Fr Sa Su Mo Tu We Th Fr Sa Su Mo Tu We Th Fr Sa
594 1 2 3 1 2 3 4 5 6 7 1 2 3 4
660 4 5 6 7 8 9 10 8 9 10 11 12 13 14 5 6 7 8 9 10 11
next block
66 11 12 13 14 15 16 17 15 16 17 18 19 20 21 12 13 14 15 16 17 18
132 18 19 20 21 22 23 24 22 23 24 25 26 27 28 19 20 21 22 23 24 25
198 25 26 27 28 29 30 31 29 30 31 26 27 28 29 30
264
264
330 October November December
396 Su Mo Tu We Th Fr Sa Su Mo Tu We Th Fr Sa Su Mo Tu We Th Fr Sa
462 1 2 1 2 3 4 5 6 1 2 3 4
528 3 4 5 6 7 8 9 7 8 9 10 11 12 13 5 6 7 8 9 10 11
594 10 11 12 13 14 15 16 14 15 16 17 18 19 20 12 13 14 15 16 17 18
660 17 18 19 20 21 22 23 21 22 23 24 25 26 27 19 20 21 22 23 24 25
next block
66 24 25 26 27 28 29 30 28 29 30 26 27 28 29 30 31
132 31
My problem is saving each max-700-chars block into a separate file. The following command produces only one file, file.0, whereas for this input I expected four split files: file.0, file.1, file.2 and file.3.
$ awk 'c=0;{l=length+l;if(l<=700){print>"file."c}else{c=c++;l=length;print>"file."c}}' sample_text
$ cksum *
3868619974 2180 file.0
3868619974 2180 sample_text
This should do it (your version fails because the bare pattern c=0 resets the counter on every line, and c=c++ leaves c unchanged):
BEGIN {
    maxChars = 700
    out = "file.0"
}
{
    numChars = length($0)
    totChars += numChars
    if ( totChars > maxChars ) {
        close(out)
        out = "file." ++cnt
        totChars = numChars
    }
    print > out
}
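As a quick sanity check, the script can be inlined with maxChars shrunk to 10 so a tiny generated input is enough to force a split (a sketch; the three-line sample and the shrunken limit are assumptions, the file.N names follow the answer):

```shell
printf 'aaaa\nbbbb\ncccc\n' > sample_text
awk '
  BEGIN { maxChars = 10; out = "file.0" }
  {
    numChars = length($0)
    totChars += numChars
    if ( totChars > maxChars ) {
      close(out)
      out = "file." ++cnt
      totChars = numChars
    }
    print > out
  }
' sample_text
# Re-joining the pieces reproduces the input byte for byte
# (file.* globs in order here since there are fewer than ten files):
cat file.* | cmp - sample_text && echo OK
```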

Remove rows that have a specific numeric value in a field

I have a very bulky file about 1M lines like this:
4001 168991 11191 74554 60123 37667 125750 28474
8 145 25 101 83 51 124 43
2985 136287 4424 62832 50788 26847 89132 19184
3 129 14 101 88 61 83 32 1 14 10 12 7 13 4
6136 158525 14054 100072 134506 78254 146543 41638
1 40 4 14 19 10 35 4
2981 112734 7708 54280 50701 33795 75774 19046
7762 339477 26805 148550 155464 119060 254938 59592
1 22 2 12 10 6 17 2
6 136 16 118 184 85 112 56 1 28 1 5 18 25 40 2
1 26 2 19 28 6 18 3
4071 122584 14031 69911 75930 52394 89733 30088
1 9 1 3 4 3 11 2 14 314 32 206 253 105 284 66
I want to remove rows that have a value less than 100 in the second column.
How can I do this with sed?
I would use awk to do this. Example:
awk ' $2 >= 100 ' file.txt
This will display only the rows from file.txt whose second field ($2) is greater than or equal to 100.
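For instance, on three made-up rows in the same shape as the data:

```shell
printf '4001 168991 11191\n1 40 4\n8 145 25\n' | awk '$2 >= 100'
# → 4001 168991 11191
#   8 145 25          ("1 40 4" is dropped because 40 < 100)
```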
Use the following approach (GNU sed):
sed -E '/^\w+\s+([0-9]{1,2}|[0][0-9]+)\b/d' /tmp/test.txt
(replace /tmp/test.txt with your current file path)
([0-9]{1,2}|[0][0-9]+) - matches either a one- or two-digit number (0 to 99) OR a number with a leading zero (e.g. 012, 00982)
d - delete the pattern space;
-E (--regexp-extended) - use extended regular expressions rather than basic regular expressions
To remove matched lines in place use -i option:
sed -i -E '/^\w+\s+([0-9]{1,2}|[0][0-9]+)\b/d' /tmp/test.txt
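The same three made-up rows run through the sed filter (GNU sed, since \w and \s are GNU extensions):

```shell
printf '4001 168991 11191\n1 40 4\n8 145 25\n' |
  sed -E '/^\w+\s+([0-9]{1,2}|[0][0-9]+)\b/d'
# → 4001 168991 11191
#   8 145 25
```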

awk do math between lines

I'd like to insert an empty line between lines whose $1 values differ by more than 1.
This is the sample data:
104 9
110 8
111 5
116 6
117 7
130 11
131 16
132 15
133 10
134 6
146 8
147 8
148 8
My attempt was:
awk '{a=$1; b=$2; getline; c=$1; d=$2; if (c-a>1) print a"\t"b"\n"c"\t"d;else print "\n"}' file
but the result is mixed up:
110 8
111 5
116 6
117 7
130 11
131 16
132 15
133 10
147 8
148 8
What am I missing?
This awk should work (your version loses lines because getline consumes the next record, so the script processes lines in pairs and mishandles the pair boundaries):
awk 'NR>1 && $1>p+1{print ""} {p=$1} 1' file
104 9
110 8
111 5
116 6
117 7
130 11
131 16
132 15
133 10
134 6
146 8
147 8
148 8
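The one-liner, expanded with comments and run against the first few sample lines:

```shell
printf '104 9\n110 8\n111 5\n' |
awk '
  NR > 1 && $1 > p+1 { print "" }  # gap of more than 1: emit a blank line first
  { p = $1 }                       # remember $1 for the next comparison
  1                                # print the current line
'
# → 104 9
#   (blank line)
#   110 8
#   111 5
```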

Get the average of the selected cells line by line in a file?

I have a single file with multiple columns. I want to select a few of them, compute a value from the selected cells in each line, and append the result as a new column.
For example:
Month Low.temp Max.temp Pressure Wind Rain
JAN 17 36 120 5 0
FEB 10 34 110 15 3
MAR 13 30 115 25 5
APR 14 33 105 10 4
.......
How do I add average temperature (Avg.temp) and humidity (Hum) as columns?
Avg.temp = (Low.temp+Max.temp)/2
Hum = Wind * Rain
Expected output:
Month Low.temp Max.temp Pressure Wind Rain Avg.temp Hum
JAN 17 36 120 5 0 26.5 0
FEB 10 34 110 15 3 22 45
MAR 13 30 115 25 5 21.5 125
APR 14 33 105 10 4 23.5 40
.......
I don't want to do it in Excel. Is there a simple shell command for this?
I would use awk like this:
awk 'NR==1 {print $0, "Avg.temp", "Hum"; next} {$(NF+1)=($2+$3)/2; $(NF+1)=$5*$6}1' file
or:
awk 'NR==1 {print $0, "Avg.temp", "Hum"; next} {print $0, ($2+$3)/2, $5*$6}' file
This consists of computing the new values and appending them to the original rows.
Let's see it in action, piping to column -t for a nice output:
$ awk 'NR==1 {print $0, "Avg.temp", "Hum"; next} {$(NF+1)=($2+$3)/2; $(NF+1)=$5*$6}1' file | column -t
Month Low.temp Max.temp Pressure Wind Rain Avg.temp Hum
JAN 17 36 120 5 0 26.5 0
FEB 10 34 110 15 3 22 45
MAR 13 30 115 25 5 21.5 125
APR 14 33 105 10 4 23.5 40
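If fixed decimals are wanted, sprintf can format the average (a variant of the second one-liner, not part of the original answer; only the first two data rows are used here):

```shell
printf 'Month Low.temp Max.temp Pressure Wind Rain\nJAN 17 36 120 5 0\nFEB 10 34 110 15 3\n' |
awk 'NR==1 {print $0, "Avg.temp", "Hum"; next}
     {print $0, sprintf("%.1f", ($2+$3)/2), $5*$6}'
# → Month Low.temp Max.temp Pressure Wind Rain Avg.temp Hum
#   JAN 17 36 120 5 0 26.5 0
#   FEB 10 34 110 15 3 22.0 45
```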

Split into chunks of six entries each using bash

91
58
54
108
52
18
8
81
103
110
129
137
84
15
14
18
11
17
12
6
1
28
6
14
8
8
0
0
28
24
25
23
21
13
9
4
18
17
18
30
13
3
I want to split this into chunks of six entries each: entries 1..6 first, then 7..12, then 13..18, and so on.
(A for loop? continue? break?)
You can use paste:
paste -d' ' - - - - - - < inputfile
For your input, it'd return:
91 58 54 108 52 18
8 81 103 110 129 137
84 15 14 18 11 17
12 6 1 28 6 14
8 8 0 0 28 24
25 23 21 13 9 4
18 17 18 30 13 3
Alternatively, with xargs:
$ xargs -n 6 < file_name
91 58 54 108 52 18
8 81 103 110 129 137
84 15 14 18 11 17
12 6 1 28 6 14
8 8 0 0 28 24
25 23 21 13 9 4
18 17 18 30 13 3
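Both answers can be verified on the first twelve numbers of the input (twelve divides evenly into two rows of six, avoiding a short final group):

```shell
printf '%s\n' 91 58 54 108 52 18 8 81 103 110 129 137 |
  paste -d' ' - - - - - -
# → 91 58 54 108 52 18
#   8 81 103 110 129 137
```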
