label timestamps by intervals in bash

I am using awk to split a file with three space-separated fields: 1. starting point; 2. ending point; 3. label.
I want to create new labels within defined frames, which requires an if-test, and that is where I am a little stuck.
I am looking for something like this:
num_intervals = (ending point of the last line) / 2500000
count = 1
interval = 2500000
current_interval_start = 0
current_interval_end = current_interval_start + interval
for each interval in num_intervals:
    if starting_point >= current_interval_start and ending_point <= current_interval_end then
        print count + label
    count = count + 1
    current_interval_start = current_interval_end
    current_interval_end = current_interval_start + interval
Note: if two labels fall in the same interval range, take the first one, but I could post-process this.
My data looks like this:
0 2300000 null
2300000 4300000 h
4300000 8000000 aa
8000000 11500000 t
11500000 28400001 null
What I would like as output is this:
0 2500000 null
2500000 5000000 h
5000000 7500000 aa
7500000 10000000 aa
10000000 12500000 t
12500000 15000000 null
15000000 17500000 null
17500000 20000000 null
20000000 22500000 null
22500000 25000000 null
25000000 27500000 null
27500000 30000000 null

You can do this with awk alone:
awk -v s=2500000 '{
    f = int($1/s)             # first interval index this line touches
    l = int($2/s)             # last interval index this line touches
    if (l - f > 0)
        for (i = f+1; i <= l; i++)
            a[i] = $3         # label every interval this line crosses into
}
END {
    e = int($2/s)             # $2 still holds the end point of the last line
    for (i = 0; i <= e; i++)
        if (i in a)
            print i*s, (i+1)*s, a[i]
        else
            print i*s, (i+1)*s, "null"
}' file
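Regarding the observation in the question: `a[i]=$3` means the last label that crosses into an interval wins. To keep the first label instead, guard the assignment. A minimal sketch, with the sample input inlined via here-doc for illustration:

```shell
# Same approach as above, but the FIRST label seen for an interval wins;
# only the guarded assignment inside the loop differs.
awk -v s=2500000 '{
    f = int($1/s)                  # first interval index this line touches
    l = int($2/s)                  # last interval index this line touches
    for (i = f+1; i <= l; i++)
        if (!(i in a)) a[i] = $3   # keep the first label, not the last
}
END {
    e = int($2/s)                  # $2 still holds the final end point
    for (i = 0; i <= e; i++)
        print i*s, (i+1)*s, ((i in a) ? a[i] : "null")
}' <<'EOF'
0 2300000 null
2300000 4300000 h
4300000 8000000 aa
8000000 11500000 t
11500000 28400001 null
EOF
```

On this particular data the result is the same either way, since no interval boundary receives two different labels.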

Related

Find specific keyword on column 1 and append new line on column 2 shell script

I have a text file that looks like the following:
empty 2
23 8
19 1
empty
11
I am trying to move the value in column 2 down to the next line whenever column 1 has the keyword "empty". Does anyone know how to do this? The following is the expected output:
empty
23 2
19 8
empty
11 1
Here is a script for GNU awk:
{
    col1[FNR] = $1
    col2[FNR] = sprintf("%s %s", $2, $3)
}
END {
    k2 = 0
    for (k1 = 1; k1 <= FNR; k1++) {
        if (col1[k1] != "empty") {
            k2++
            print col1[k1], col2[k2]
        }
        else print col1[k1]
    }
}
It stores the values of column 1 and (column 2 + column 3) in two different arrays. During output (in the END block) it consumes a value from the second array only when the first column is not "empty".
awk to the rescue!
$ awk 'p{t=$2;$2=p;p=t} $1=="empty"{if($2!=""){p=$2;$2=""}}1' file
empty
23 2
19 8
empty
11 1
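A sketch of that one-liner written out long-form with comments; `carry` stands in for `p`, and the explicit `carry != ""` test replaces the bare truthiness test, which behaves the same for this data:

```shell
# Long-form rendering of the one-liner: the value after "empty" is
# carried forward and becomes the second field of the next line.
awk '
carry != "" {          # a value is waiting from an earlier line
    tmp = $2           # remember the current second field
    $2 = carry         # replace it with the carried value
    carry = tmp        # carry the displaced field onward
}
$1 == "empty" && $2 != "" {
    carry = $2         # stash the value that followed "empty"
    $2 = ""            # print the bare keyword only
}
{ print }              # emit every (possibly rewritten) line
' <<'EOF'
empty 2
23 8
19 1
empty
11
EOF
```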

AWK printing fields in multiline records

I have an input file with fields spread over several lines. In this file the field pattern repeats according to the query size.
ZZZZ
21293
YYYYY XXX WWWW VV
13242 MUTUAL BOTH NO
UUUUU TTTTTTTT SSSSSSSS RRRRR QQQQQQQQ PPPPPPPP
3 0 3 0
NNNNNN MMMMMMMMM LLLLLLLLL KKKKKKKK JJJJJJJJ
2 0 5 3
IIIIII HHHHHH GGGGGGG FFFFFFF EEEEEEEEEEE DDDDDDDDDDD
5 3 0 3
My desired output is one line per complete group of fields. Empty fields should be marked with "X". Example:
21293 13242 MUTUAL BOTH NO 3 0 X 3 0 X 2 0 X 5 3 5 3 0 X 3 X
12345 67890 MUTUAL BOTH NO 3 0 X 3 0 X 2 0 X 5 3 5 3 0 X 3 X
I have been thinking about how I can get the desired output with awk/unix scripts but can't figure it out. Any ideas? Thank you very much!!!
This isn't really a great fit for awk's style of programming, which is based on fields that are delimited by a pattern, not fields with variable positions on the line. But it can be done.
When you process the first line in each pair, scan through it finding the positions of the beginning of each field name.
awk 'NR%3 == 1 {
    delete fieldpos;
    delete fieldlen;
    lastspace = 1;
    fieldindex = 0;
    for (i = 1; i <= length(); i++) {
        if (substr($0, i, 1) != " ") {
            if (lastspace) {
                fieldpos[fieldindex] = i;
                if (fieldindex > 0) {
                    fieldlen[fieldindex-1] = i - fieldpos[fieldindex-1];
                }
                fieldindex++;
            }
            lastspace = 0;
        } else {
            lastspace = 1;
        }
    }
}
NR%3 == 2 {
    for (i = 0; i < fieldindex; i++) {
        if (i in fieldlen) {
            f = substr($0, fieldpos[i], fieldlen[i]);
        } else { # last field, go to end of line
            f = substr($0, fieldpos[i]);
        }
        gsub(/^ +| +$/, "", f); # trim surrounding spaces
        if (f == "") { f = "X" }
        printf("%s ", f);
    }
}
NR%15 == 14 { print "" } # print newline after 5 data blocks
'
Assuming your fields are separated by blank chars and not tabs, GNU awk's FIELDWIDTHS is designed to handle this sort of situation:
/^ZZZZ/ { if (rec!="") print rec; rec="" }
/^[[:upper:]]/ {
    FIELDWIDTHS = ""
    while ( match($0,/\S+\s*/) ) {
        FIELDWIDTHS = (FIELDWIDTHS ? FIELDWIDTHS " " : "") RLENGTH
        $0 = substr($0,RLENGTH+1)
    }
    next
}
NF {
    for (i=1; i<=NF; i++) {
        gsub(/^\s+|\s+$/,"",$i)
        $i = ($i=="" ? "X" : $i)
    }
    rec = (rec=="" ? "" : rec " ") $0
}
END { print rec }
$ awk -f tst.awk file
2129 13242 MUTUAL BOTH NO 3 0 X 3 0 X 2 0 X 5 3 5 3 0 X 3 X
In other awks you'd use match() and substr(). Note that the above isn't perfect: it truncates a character off 21293. That's because I'm not convinced your input file is accurate; if it is, you haven't told us why that number is longer than the string on the preceding line or how to deal with that.
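A rough sketch of that match()/substr() fallback for awks without FIELDWIDTHS: record where each header token starts, then slice every data line at those columns, marking empty cells with "X". The input below is a small aligned toy sample (the pasted question data lost its original column alignment), and the last column runs to end of line rather than being width-limited, so the output is not byte-identical to the gawk version:

```shell
# Portable version of the FIELDWIDTHS idea for any POSIX awk.
awk '
/^ZZZZ/ { if (rec != "") print rec; rec = "" }
/^[[:upper:]]/ {               # header line: record column starts/widths
    n = 0; rest = $0; pos = 1
    while (match(rest, /[^ ]+ */)) {
        start[++n] = pos + RSTART - 1
        len[n] = RLENGTH
        pos += RSTART + RLENGTH - 1
        rest = substr(rest, RSTART + RLENGTH)
    }
    next
}
NF {                           # data line: slice at the header columns
    for (i = 1; i <= n; i++) {
        f = (i < n) ? substr($0, start[i], len[i]) : substr($0, start[i])
        gsub(/^ +| +$/, "", f)
        rec = rec ((rec == "") ? "" : " ") ((f == "") ? "X" : f)
    }
}
END { if (rec != "") print rec }
' <<'EOF'
ZZZZ
21293
AAAAA BB    CC
12345 67
EOF
```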

How to transform genotypes into T/F -1/0/1 format using awk?

I have a very large dataset that I would like to transform from genotypes to a coded format. The genotypes should be represented as follows:
A A -> -1
A B -> 0
B B -> 1
I have thought about this using awk but I cannot seem to get a working solution that can read two columns and output a single code in place of the genotypes. The input file looks like this:
AnimalID Locus Allele1 Allele2
1 1 A B
1 2 A A
1 3 B B
2 1 B A
2 2 B A
2 3 A A
And should be coded to an output file to look like this:
AnimalID Locus1 Locus2 Locus3
1 0 -1 1
2 0 0 -1
I am assuming this can be done using boolean T/F? Any suggestions would be welcomed. Thanks.
Here is something to get you started:
I have stored the mapping in the BEGIN block. If the locus is missing for a particular ID, this will just print a blank for it. You didn't specify what B A would mean, so I took the liberty of mapping it to 0 based on your output.
awk '
BEGIN {
    map["A","A"] = -1;
    map["A","B"] = 0;
    map["B","B"] = 1;
    map["B","A"] = 0;
}
NR>1 {
    idCount = (idCount<$1) ? $1 : idCount;
    locusCount = (locusCount<$2) ? $2 : locusCount
    code[$1,$2] = map[$3,$4]
}
END {
    printf "%s ", "AnimalID";
    for (cnt=1; cnt<=locusCount; cnt++) {
        printf "%s%s", "Locus" cnt, ((cnt==locusCount) ? "\n" : " ")
    }
    for (cnt=1; cnt<=idCount; cnt++) {
        printf "%s\t", cnt;
        for (locus=1; locus<=locusCount; locus++) {
            printf "%s%s", code[cnt,locus], ((locus==locusCount) ? "\n" : "\t")
        }
    }
}' inputFile
Output:
AnimalID Locus1 Locus2 Locus3
1 0 -1 1
2 0 0 -1

Finding a range of numbers of a file in another file using awk

I have lots of files like this:
3
10
23
.
.
.
720
810
980
And a much bigger file like this:
2 0.004
4 0.003
6 0.034
.
.
.
996 0.01
998 0.02
1000 0.23
What I want to do is find in which range of the second file my first file falls and then estimate the mean of the values in the 2nd column of that range.
Thanks in advance.
NOTE
The numbers in the files do not necessarily follow an easy pattern like 2,4,6...
Since your smaller files are sorted, you can pull out the first row and the last row to get the min and max. Then you just need to go through the big file with an awk script to compute the mean.
So for each small file (here named small) you would run the script
awk -v start=$(head -n 1 small) -v end=$(tail -n 1 small) -f script bigfile
Where script can be something simple like
BEGIN {
    sum = 0;
    count = 0;
    range_start = -1;
    range_end = -1;
}
{
    irow = int($1)
    ival = $2 + 0.0
    if (irow >= start && end >= irow) {
        if (range_start == -1) {
            range_start = NR;
        }
        sum = sum + ival;
        count++;
    }
    else if (irow > end) {
        if (range_end == -1) {
            range_end = NR - 1;
        }
    }
}
END {
    print "start =", range_start, "end =", range_end, "mean =", sum / count
}
You can try the loop below:
for r in *; do
    awk -v r=$r -F' ' \
        'NR==1 {b=$2; v=$4; next}
         {if (r >= b && r <= $2) {m = (v+$4)/2; print m; exit}; b=$2; v=$4}' bigfile.txt
done
Explanation:
On the first record it saves columns 2 and 4 into temp variables. On every subsequent record it checks whether the filename r falls between the start of the range (the previous column 2) and the end of the range (the current column 2).
It then works out the mean and prints the result.
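Under the same assumption as the first answer (each small file is sorted), the head/tail step can be folded into a single awk call that reads the small file first; the file names and contents below are placeholders created for illustration:

```shell
# Sample data standing in for one small file and the big file.
printf '3\n10\n' > small
printf '2 0.004\n4 0.003\n6 0.034\n10 0.1\n12 0.2\n' > bigfile

# Read the small (sorted) file to find its min and max, then average the
# second column of the bigfile rows whose first column falls in that range.
awk '
NR == FNR {                # first file: the sorted key list
    if (min == "") min = $1
    max = $1
    next
}
$1 >= min && $1 <= max {   # second file: rows inside [min, max]
    sum += $2
    count++
}
END { if (count) print sum / count }
' small bigfile
```

The NR == FNR test is the usual awk idiom for "still reading the first file".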

how to write bash script in ubuntu to normalize the index of text comparison

I have an input that is the result of a text comparison. It is in a very simple format with 3 columns: position, original text, and new text.
But some of the records look like this:
4 ATCG ATCGC
10 1234 123
How can I write a short script to normalize it to:
7 G GC
12 34 3
The whole original text and the whole new text are probably as below, respectively:
ACCATCGGA1234
ACCATCGCGA123
"Normalize" means moving the position in the first column forward to where the change actually occurs: on line 1 we remove the common prefix ATC and add its length, 3, to the first field; similarly, on line 2 the prefix we remove ("12") has length 2.
This script
awk '
BEGIN {OFS = "\t"}
function common_prefix_length(str1, str2, max_len, idx) {
    idx = 1
    if (length(str1) < length(str2))
        max_len = length(str1)
    else
        max_len = length(str2)
    while (substr(str1, idx, 1) == substr(str2, idx, 1) && idx < max_len)
        idx++
    return idx - 1
}
{
    len = common_prefix_length($2, $3)
    print $1 + len, substr($2, len + 1), substr($3, len + 1)
}
' << END
4 ATCG ATCGC
10 1234 123
END
outputs
7 G GC
12 34 3