Creating a table with scrambled and unordered data - bash

I'm trying to create a nice little table of values and I'm doing it using bash, but not all the values are in order. Not only that, the values also happen to be in their own files. My first few thoughts are to use cat and grep to grab the values, but from there I'm not sure what is appropriate. I feel like awk would do wonders in this situation, but I do not know awk very well.
file1 might look like this
V 0.001
A 98.6
N Measurement1
T 14:15:01
S 20.2
F 212.86
G 28.19
file2 might look like this
V 0.008
A 103.4
N Measurement2
T 16:20:31
S 21.2
F 215.86
G 28.19
The final file would look like this
N Measurement1 Measurement2
T 14:15:01 16:20:31
V 0.001 0.008
G 28.19 28.19
A 98.6 103.4
S 20.2 21.2
F 212.86 215.86

Self-commented code; the comments are provided to help with understanding the awk.
awk '
# create a new reference (one per file)
FNR==1{Ref++}
# each line
{ # add label to memory
N[$1]
# add the value to a pseudo two-dimensional array
V[Ref ":" $1] = $2
# remember the maximum value width in this series
if( length( $2 ) > M[Ref] ) M[Ref] = length( $2 )
}
# after last file
END{
# print the header (the name of each series)
printf( "N ")
for( i=1;i<=Ref;i++) printf( "%" M[i] "s ", V[ i ":N" ] )
printf( "\n")
# print the data for each label (the format width is chosen so columns stay aligned)
# don't print the series name a second time
for ( n in N ){
if( n != "N" ){
printf( "%s ", n)
for( i=1;i<=Ref;i++) printf( "%" M[i] "s ", V[ i ":" n ] )
printf( "\n")
}
}
}
' file*

grep variable pattern and output match and sequence position

Given the following string,
>Q07092
MWVSWAPGLWLLGLWATFGHGANTGAQCPPSQQEGLKLEHSSSLPANVTGFNLIHRLSLMKTSAIKKIRNPKGPLILRLGAAPVTQPTRRVFPRGLPEEFALVLTLLLKKHTHQKTWYLFQVTDANGYPQISLEVNSQERSLELRAQGQDGDFVSCIFPVPQLFDLRWHKLMLSVAGRVASVHVDCSSASSQPLGPRRPMRPVGHVFLGLDAEQGKPVSFDLQQVHIYCDPELVLEEGCCEILPAGCPPETSKARRDTQSNELIEINPQSEGKVYTRCFCLEEPQNSEVDAQLTGRISQKAERGAKVHQETAADECPPCVHGARDSNVTLAPSGPKGGKGERGLPGPPGSKGEKGARGNDCVRISPDAPLQCAEGPKGEKGESGALGPSGLPGSTGEKGQKGEKGDGGIKGVPGKPGRDGRPGEICVIGPKGQKGDPGFVGPEGLAGEPGPPGLPGPPGIGLPGTPGDPGGPPGPKGDKGSSGIPGKEGPGGKPGKPGVKGEKGDPCEVCPTLPEGFQNFVGLPGKPGPKGEPGDPVPARGDPGIQGIKGEKGEPCLSCSSVVGAQHLVSSTGASGDVGSPGFGLPGLPGRAGVPGLKGEKGNFGEAGPAGSPGPPGPVGPAGIKGAKGEPCEPCPALSNLQDGDVRVVALPGPSGEKGEPGPPGFGLPGKQGKAGERGLKGQKGDAGNPGDPGTPGTTGRPGLSGEPGVQGPAGPKGEKGDGCTACPSLQGTVTDMAGRPGQPGPKGEQGPEGVGRPGKPGQPGLPGVQGPPGLKGVQGEPGPPGRGVQGPQGEPGAPGLPGIQGLPGPRGPPGPTGEKGAQGSPGVKGATGPVGPPGASVSGPPGRDGQQGQTGLRGTPGEKGPRGEKGEPGECSCPSQGDLIFSGMPGAPGLWMGSSWQPGPQGPPGIPGPPGPPGVPGLQGVPGNNGLPGQPGLTAELGSLPIEQHLLKSICGDCVQGQRAHPGYLVEKGEKGDQGIPGVPGLDNCAQCFLSLERPRAEEARGDNSEGDPGCVGSPGLPGPPGLPGQRGEEGPPGMRGSPGPPGPIGPPGFPGAVGSPGLPGLQGERGLTGLTGDKGEPGPPGQPGYPGATGPPGLPGIKGERGYTGSAGEKGEPGPPGSEGLPGPPGPAGPRGERGPQGNSGEKGDQGFQGQPGFPGPPGPPGFPGKVGSPGPPGPQAEKGSEGIRGPSGLPGSPGPPGPPGIQGPAGLDGLDGKDGKPGLRGDPGPAGPPGLMGPPGFKGKTGHPGLPGPKGDCGKPGPPGSTGRPGAEGEPGAMGPQGRPGPPGHVGPPGPPGQPGPAGISAVGLKGDRGATGERGLAGLPGQPGPPGHPGPPGEPGTDGAAGKEGPPGKQGFYGPPGPKGDPGAAGQKGQAGEKGRAGMPGGPGKSGSMGPVGPPGPAGERGHPGAPGPSGSPGLPGVPGSMGDMVNYDEIKRFIRQEIIKMFDERMAYYTSRMQFPMEMAAAPGRPGPPGKDGAPGRPGAPGSPGLPGQIGREGRQGLPGVRGLPGTKGEKGDIGIGIAGENGLPGPPGPQGPPGYGKMGATGPMGQQGIPGIPGPPGPMGQPGKAGHCNPSDCFGAMPMEQQYPPMKTMKGPFG
I want to first grep for pattern matching 6 or more xGx repeats, where x is any character. This, I can easily do,
grep -EIho -B1 '([^G]G[^G]){6,}' file
which outputs
>Q07092
KGERGLPGPPGSKGEKGARGN
EGPKGEKGESGALGPSGLPGSTGEKGQKGEKGD
IGPKGQKGDPGFVGPEGLAGEPGPPGLPGPPGI
PGPKGDKGSSGIPGKEGP
FGLPGLPGRAGVPGLKGEKGNFGEAGPAGSPGPPGPVGPAGIKGAKGE
FGLPGKQGKAGERGLKGQKGDAGNPGDPGTPGTTGRPGLSGEPGVQGPAGPKGEKGD
AGRPGQPGPKGEQGPEGV
PGKPGQPGLPGVQGPPGLKGVQGEPGPPGR
QGPQGEPGAPGLPGIQGLPGPRGPPGPTGEKGAQGSPGVKGATGPVGPPGA
SGPPGRDGQQGQTGLRGTPGEKGPRGEKGEPGE
PGPQGPPGIPGPPGPPGVPGLQGVPGNNGLPGQPGL
EGDPGCVGSPGLPGPPGLPGQRGEEGPPGMRGSPGPPGPIGPPGFPGAVGSPGLPGLQGERGLTGLTGDKGEPGPPGQPGYPGATGPPGLPGIKGERGYTGSAGEKGEPGPPGSEGLPGPPGPAGPRGERGPQGNSGEKGDQGFQGQPGFPGPPGPPGFPGKVGSPGPPGP
KGSEGIRGPSGLPGSPGPPGPPGIQGPAGLDGLDGKDGKPGLRGDPGPAGPPGLMGPPGFKGKTGHPGLPGPKGDCGKPGPPGSTGRPGAEGEPGAMGPQGRPGPPGHVGPPGPPGQPGPAGI
VGLKGDRGATGERGLAGLPGQPGPPGHPGPPGEPGTDGAAGKEGPPGKQGFYGPPGPKGDPGAAGQKGQAGEKGRAGM
PGKSGSMGPVGPPGPAGERGHPGAPGPSGSPGLPGVPGSMGD
PGRPGPPGKDGAPGRPGAPGSPGLPGQIGREGRQGLPGVRGLPGTKGEKGDIGI
AGENGLPGPPGPQGPPGY
MGATGPMGQQGIPGIPGPPGPMGQPGKAGH
Now, I want to find the character position of all G's when they occur in 'TGA' or 'SGA'. The character positions should be based on the input and NOT the output.
Expected output,
$ some-grep-awk-code
>Q07092
TGA: 573
SGA: 384
The awk solution,
awk -v str='TGA' '{ off=0; while (pos=index(substr($0,off+1),str)) { printf("%d: %d\n", NR, pos+off); off+=length(str)+pos } }' file
outputs TGA at both character positions 25 and 573. However, I only want to identify the character position of the G in SGA/TGA when they occur in the midst of six or more xGx repeats.
Really appreciate any help!
Here's a basic awk solution:
Each sequence must be on a single line
The resulting positions are relative to the start of the line
The algorithm first searches for the parts of the line that match ([^G]G[^G]){6,}, then searches for the occurrences of SGA and TGA within those parts. The implementation is a little tedious, as awk's match() and index() functions have no offset option.
awk '
BEGIN {
regexp = "([^G]G[^G]){6,}"
search["SGA"]
search["TGA"]
}
/^>/ {
print
next
}
{
i0 = 1
s0 = $0
while ( match( s0, regexp ) ) {
head = substr(s0,RSTART,RLENGTH)
tail = substr(s0,RSTART+RLENGTH)
i0 += RSTART - 1
for (s in search) {
s1 = head
i1 = i0
while ( i = index(s1, s) ) {
s1 = substr(s1, i+1)
i1 += i
search[s] = search[s] " " i1-1
}
}
s0 = tail
i0 += RLENGTH
}
for (s in search) {
print s ":" search[s]
search[s] = ""
}
}
'
Example with simplified sequences
>TEST1
SGA.G..G.TGATGA.G..G..G.SGA.....TGA.....SGA.....G..G.SGA.G..G..G.
>TEST2
.G..G.TGA.G..G.G.....G..G..G..G.SGA.G.
>TEST1
SGA: 1 25 54
TGA: 10 13
>TEST2
SGA: 33
TGA:
TODO
Parameterize the regex and the search strings: it's not difficult per se but the current code will run into an infinite loop when a search string is empty or when the regex allows 0-length matches; you'll need to prevent that from happening.
Allow multi-line sequences
Allow overlapping matches for the regex. Basically, it means looking for the next match at RSTART+1 of the previous iteration; that will generate a lot of duplicate results that you need to discard one way or another. A rough sketch of the idea follows below.
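To illustrate that last point, here is a minimal sketch (my own, not part of the solution above) that restarts the search one character past the previous match start so that overlapping regions are also found; the duplicate and sub-match filtering it would still need is deliberately left out:
awk '
BEGIN { regexp = "([^G]G[^G]){6,}" }
!/^>/ {
  s = $0
  off = 0
  while ( match(s, regexp) ) {
    # report the absolute start position and the matched fragment
    print off + RSTART, substr(s, RSTART, RLENGTH)
    # restart one character past the previous match start so that
    # overlapping matches are also found (duplicates still need filtering)
    off += RSTART
    s = substr(s, RSTART + 1)
  }
}' file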
You could match all occurrences of the regular expression [ST]GA and look at the wider substring surrounding each match to compare that window to (.G.){6}. Here is some code to do that:
$ awk '
/^>/ { label = $0 ORS; next }
{
while (match(substr($0, pos + 1), /[ST]GA/)) {
pos += RSTART
if (len = RLENGTH) {
wbeg = pos - 18 + len # 18 is the length of .G..G..G..G..G..G.
wlen = 2 * 18 - len + (wbeg < 1 ? wbeg - 1 : 0)
wbeg = (wbeg < 1 ? 1 : wbeg) # substr must start from at least 1
window = substr($0, wbeg, wlen)
if (window ~ /.G..G..G..G..G..G./) {
str = substr($0, pos, len)
print label str ":", pos + int(len / 2)
label = ""
}
pos += len - 1
}
if (pos >= length($0)) {
break
}
}
pos = 0
}
' file
>Q07092
SGA: 384
The output only shows SGA: 384 because that is the only portion of the example input that meets the requirement:
I want to only identify the character position of G in SGA/TGA when they occur in the midst of six or more xGx repeats.
With your shown samples, please try the following awk code. It was written and tested in GNU awk but should work in any POSIX awk. You can pass any number of search strings and get all of their index positions in the line: list the values to be searched for, comma-separated, in the awk variable named keyWords, and the code will look for all of them in each line.
awk -v keyWords="SGA,TGA" '
BEGIN{
num=split(keyWords,arr1,",")
for(i=1;i<=num;i++){
checkValues[arr1[i]]
}
}
!/>/{
start=diff=prev=""
while(match($0,/(.G.){6,}/)){
lineMatch=substr($0,RSTART,RLENGTH)
start+=(RSTART>1?RSTART-1:RSTART)
diff=(start-prev)
for(key in checkValues){
if(ind=index(lineMatch,key)){
print substr(lineMatch,ind,length(key)),(RSTART?RSTART-1:1)+ind+start+diff
}
prev=start
}
$0=substr($0,RSTART+RLENGTH)
}
}
' Input_file
The output with the shown samples will be as follows:
>Q07092
SGA: 384

Average over diagonally in a Matrix

I have a matrix. e.g. 5 x 5 matrix
$ cat input.txt
1 5.6 3.4 2.2 -9.99E+10
2 3 2 2 -9.99E+10
2.3 3 7 4.4 5.1
4 5 6 7 8
5 -9.99E+10 9 11 13
Here I would like to ignore -9.99E+10 values.
I am looking for the average of all entries after dividing the matrix diagonally. Here are four possibilities (using 999 in place of -9.99E+10 to save space in the graphic):
I would like to average over all the values under different shaded triangles.
So the desired output is:
$ cat outfile.txt
P1U 3.39 (average of all values on the lower side of Possible 1, without considering -9.99E+10)
P1L 6.88 (average of all values on the upper side of Possible 1, without considering -9.99E+10)
P2U 4.90
P2L 5.59
P3U 3.31
P3L 6.41
P4U 6.16
P4L 4.16
It is proving difficult to develop a proper algorithm to write this in Fortran or in a shell script.
I am thinking of the following algorithm, but I can't work out what comes next.
step 1: #Assign -9.99E+10 to the Lower diagonal values of a[ij]
for i in {1..5};do
for j in {1..5};do
a[i,j+1]=-9.99E+10
done
done
step 2: #take the average
sum=0
for i in {1..5};do
for j in {1..5};do
sum=sum+a[i,j]
done
done
printf "%s %5.2f",P1U, sum
step 3: #Assign -9.99E+10 to the upper diagonal values of a[ij]
for i in {1..5};do
for j in {1..5};do
a[i-1,j]=-9.99E+10
done
done
step 4: #take the average
sum=0
for i in {1..5};do
for j in {1..5};do
sum=sum+a[i,j]
done
done
printf "%s %5.2f",P1L,sum
Just save all the values in an array indexed by row and column number, and then in the END section repeat the process of setting the beginning and end row and column loop delimiters as needed when defining the loops for each section:
$ cat tst.awk
{
for (colNr=1; colNr<=NF; colNr++) {
vals[colNr,NR] = $colNr
}
}
END {
sect = "P1U"
begColNr = 1; endColNr = NF; begRowNr = 1; endRowNr = NR
sum = cnt = 0
for (rowNr=begRowNr; rowNr<=endRowNr; rowNr++) {
for (colNr=begColNr; colNr<=endColNr-rowNr+1; colNr++) {
val = vals[colNr,rowNr]
if ( val != "-9.99E+10" ) {
sum += val
cnt++
}
}
}
printf "%s %.2f\n", sect, (cnt ? sum/cnt : 0)
sect = "P1L"
begColNr = 1; endColNr = NF; begRowNr = 1; endRowNr = NR
sum = cnt = 0
for (rowNr=begRowNr; rowNr<=endRowNr; rowNr++) {
for (colNr=endColNr-rowNr+1; colNr<=endColNr; colNr++) {
val = vals[colNr,rowNr]
if ( val != "-9.99E+10" ) {
sum += val
cnt++
}
}
}
printf "%s %.2f\n", sect, (cnt ? sum/cnt : 0)
}
$ awk -f tst.awk file
P1U 3.39
P1L 6.88
I assume that, given the above for handling the first diagonal halves, you'll be able to figure out the other diagonal halves, and the horizontal/vertical halves are trivial: just set begRowNr to int(NR/2)+1, or endRowNr to int(NR/2), or begColNr to int(NF/2)+1, or endColNr to int(NF/2), and then loop through the resulting full range of values for each. A sketch of one more diagonal section follows below.
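For example, here is a minimal standalone sketch following the same pattern as tst.awk for the half on or above the main diagonal (what the question labels P2U); the loop bounds are my reading of that triangle, and with the sample matrix this does reproduce the expected P2U value of 4.90:
awk '
{ for (colNr=1; colNr<=NF; colNr++) vals[colNr,NR] = $colNr }
END {
  sect = "P2U"
  begColNr = 1; endColNr = NF; begRowNr = 1; endRowNr = NR
  sum = cnt = 0
  for (rowNr=begRowNr; rowNr<=endRowNr; rowNr++) {
    # only columns on or to the right of the main diagonal
    for (colNr=rowNr; colNr<=endColNr; colNr++) {
      val = vals[colNr,rowNr]
      if ( val != "-9.99E+10" ) { sum += val; cnt++ }
    }
  }
  printf "%s %.2f\n", sect, (cnt ? sum/cnt : 0)
}' file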
You can compute them all in one iteration:
$ awk -v NA='-9.99E+10' '{for(i=1;i<=NF;i++) a[NR,i]=$i}
END {for(i=1;i<=NR;i++)
for(j=1;j<=NF;j++)
{v=a[i,j];
if(v!=NA)
{if(i+j<=6) {p["1U"]+=v; c["1U"]++}
if(i+j>=6) {p["1L"]+=v; c["1L"]++}
if(j>=i) {p["2U"]+=v; c["2U"]++}
if(i<=3) {p["3U"]+=v; c["3U"]++}
if(i>=3) {p["3D"]+=v; c["3D"]++}
if(j<=3) {p["4U"]+=v; c["4U"]++}
if(j>=3) {p["4D"]+=v; c["4D"]++}}}
for(k in p) printf "P%s %.2f\n", k,p[k]/c[k]}' file | sort
P1L 6.88
P1U 3.39
P2U 4.90
P3D 6.41
P3U 3.31
P4D 6.16
P4U 4.16
I forgot to add P2D, but from the pattern it should be clear what needs to be done (presumably an if(i>=j) counterpart to the 2U line).
To generalize further, as suggested: assert that NF==NR, otherwise the diagonals are not well defined. Let n=NF (and n=NR); you can then replace 6 with n+1 and 3 with ceil(n/2), which can be implemented as function ceil(x) {return x==int(x)?x:int(x)+1}. A sketch of the generalized script follows below.
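Here is a minimal sketch of that generalization, assuming a square matrix (n is taken from NR) and using L instead of D to match the question's labels; it is my own rewrite of the one-pass script above, with the missing P2L entry added, rather than code from the original answer. With the sample matrix, the added P2L line gives 5.59, matching the question's expected output.
awk -v NA='-9.99E+10' '
function ceil(x) { return x == int(x) ? x : int(x) + 1 }
{ for (i = 1; i <= NF; i++) a[NR,i] = $i }
END {
  n = NR                   # assumes NF == NR (square matrix)
  mid = ceil(n / 2)
  for (i = 1; i <= n; i++)
    for (j = 1; j <= n; j++) {
      v = a[i,j]
      if (v == NA) continue
      if (i+j <= n+1) { p["1U"]+=v; c["1U"]++ }   # anti-diagonal, upper half
      if (i+j >= n+1) { p["1L"]+=v; c["1L"]++ }   # anti-diagonal, lower half
      if (j >= i)     { p["2U"]+=v; c["2U"]++ }   # main diagonal, upper half
      if (i >= j)     { p["2L"]+=v; c["2L"]++ }   # main diagonal, lower half
      if (i <= mid)   { p["3U"]+=v; c["3U"]++ }   # top rows
      if (i >= mid)   { p["3L"]+=v; c["3L"]++ }   # bottom rows
      if (j <= mid)   { p["4U"]+=v; c["4U"]++ }   # left columns
      if (j >= mid)   { p["4L"]+=v; c["4L"]++ }   # right columns
    }
  for (k in p) printf "P%s %.2f\n", k, p[k]/c[k]
}' file | sort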

Replace numeric genotype code with DNA letter

How can I replace the numeric genotype code with a DNA letter?
I have a modified VCF file that looks like this:
POS REF ALT A2.bam C10.bam
448 T C 0/0:0,255,255 0/0:0,255,255
2402 C T 1/1:209,23,0 xxx:255,0,255
n...
I want to replace 0/0 with the REF letter and 1/1 with the ALT letter, and delete the rest of the string after it.
It should look like this:
POS REF ALT A2.bam C10.bam
448 T C T T
2402 C T T xxx
n...
I've been trying to do it with sed, but it didn't work and I don't know how to approach it.
Would you please try:
awk '{
if (NR > 1) {
for (i=4; i<=5; i++) {
split($i, a, ":")
$i = a[1]
if ($i == "0/0") $i = $2
if ($i == "1/1") $i = $3
}
}
print
}' file.txt
Output:
POS REF ALT A2.bam C10.bam
448 T C T T
2402 C T T xxx
n...
The for loop processes the 4th and 5th columns (A2.bam and C10.bam); a variant that handles any number of sample columns is sketched below.
First it chops off the substring after ":".
If the remaining value is equal to "0/0", then replace it with the 2nd column (REF).
In case of "1/1", use the 3rd column (ALT).
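If the file has more than two sample columns, the same idea extends to every column from the 4th onward. Here is a minimal sketch of that variant (my own generalization, assuming every column after ALT is a sample column; it is not part of the answer above):
awk '{
  if (NR > 1) {
    # process every sample column after POS/REF/ALT
    for (i = 4; i <= NF; i++) {
      split($i, a, ":")          # keep only the genotype before the first ":"
      $i = a[1]
      if ($i == "0/0") $i = $2   # homozygous reference -> REF letter
      if ($i == "1/1") $i = $3   # homozygous alternate -> ALT letter
    }
  }
  print
}' file.txt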
Hope this helps.

How to use "column" to center a chart?

I was wondering what the best way is to lay out a chart using the column command so that each column is centered instead of the default left alignment. I have been using the column -t filename command.
Current Output:
Label1        label2
Anotherlabel  label2442
label152      label42242
label78765    label373737737
Desired Output: Something like this
   Label1         label2
Anotherlabel    label2442
  label152      label42242
 label78765   label373737737
Basically, I want it to be centered instead of left aligned.
Here is an awk solution:
# Collect all lines in "data", keep track of maximum width for each field
{
data[NR] = $0
for (i = 1; i <= NF; ++i)
max[i] = length($i) > max[i] ? length($i) : max[i]
}
END {
for (i = 1; i <= NR; ++i) {
# Split record into array "arr"
split(data[i], arr)
# Loop over array
for (j = 1; j <= NF; ++j) {
# Calculate amount of padding required
pad = max[j] - length(arr[j])
# Print field with appropriate padding, see below
printf "%*s%*s%s", length(arr[j]) + int(pad/2), arr[j], \
pad % 2 == 0 ? pad/2 : int(pad/2) + 1, "", \
j == NF ? "" : "  "
}
# Newline at end of record
print ""
}
}
Called like this:
$ awk -f centre.awk infile
   Label1         label2
Anotherlabel    label2442
  label152      label42242
 label78765   label373737737
The printf statement uses padding with dynamic widths:
The first %*s takes care of left padding and the data itself: arr[j] gets printed and padded to a total width of length(arr[j]) + int(pad/2).
The second %*s prints the empty string, left padded to half of the total padding required. pad % 2 == 0 ? pad/2 : int(pad/2) + 1 checks if the total padding was an even number, and if not, adds an extra space.
The last %s prints j == NF ? "" : "  ", i.e., two spaces, unless we're at the last field.
Some older awks don't support the %*s syntax, but in that case the format string can be assembled dynamically, e.g. width = 5; "%" width "s"; a sketch of that approach follows below.
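As an illustration of that workaround, here is a minimal sketch of the same centering logic with the dynamic-width format string built by concatenation instead of %*s (my own variant of the script above, not a separately tested answer):
awk '
{
    data[NR] = $0
    for (i = 1; i <= NF; ++i)
        max[i] = length($i) > max[i] ? length($i) : max[i]
}
END {
    for (i = 1; i <= NR; ++i) {
        split(data[i], arr)
        for (j = 1; j <= NF; ++j) {
            pad = max[j] - length(arr[j])
            left = int(pad / 2)
            right = pad - left
            # assemble the dynamic-width format by string concatenation
            fmt = "%" (length(arr[j]) + left) "s%" right "s%s"
            printf fmt, arr[j], "", (j == NF ? "" : "  ")
        }
        print ""
    }
}' infile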
Here's a Python program to do what you want. It's probably too hard to do in bash, so you'll need to use a custom program or awk script. Basic algorithm:
count number of columns
[optional] make sure each line has the same number of columns
figure out the maximum length of data for each column
print each line using the max lengths
#!/usr/bin/env python3
import sys
def column():
# Read file and split each line into fields (by whitespace)
with open(sys.argv[1]) as f:
lines = [line.split() for line in f]
# Check that each line has the same number of fields
num_fields = len(lines[0])
for n, line in enumerate(lines):
if len(line) != num_fields:
print('Line {} has wrong number of columns: expected {}, got {}'.format(n, num_fields, len(line)))
sys.exit(1)
# Calculate the maximum length of each field
max_column_widths = [0] * num_fields
for line in lines:
line_widths = (len(field) for field in line)
max_column_widths = [max(z) for z in zip(max_column_widths, line_widths)]
# Now print them centered using the max_column_widths
spacing = 4
format_spec = (' ' * spacing).join('{:^' + str(n) + '}' for n in max_column_widths)
for line in lines:
print(format_spec.format(*line))
if __name__ == '__main__':
column()

Finding a range of numbers of a file in another file using awk

I have lots of files like this:
3
10
23
.
.
.
720
810
980
And a much bigger file like this:
2 0.004
4 0.003
6 0.034
.
.
.
996 0.01
998 0.02
1000 0.23
What I want to do is find in which range of the second file my first file falls and then estimate the mean of the values in the 2nd column of that range.
Thanks in advance.
NOTE
The numbers in the files do not necessarily follow an easy pattern like 2,4,6...
Since your smaller files are sorted, you can pull out the first row and the last row to get the min and max. Then you just need to go through the bigfile with an awk script to compute the mean.
So for each small file small you would run the script (a loop over all the small files is sketched after the awk code below):
awk -v start=$(head -n 1 small) -v end=$(tail -n 1 small) -f script bigfile
Where script can be something simple like
BEGIN {
sum = 0;
count = 0;
range_start = -1;
range_end = -1;
}
{
irow = int($1)
ival = $2 + 0.0
if (irow >= start && end >= irow) {
if (range_start == -1) {
range_start = NR;
}
sum = sum + ival;
count++;
}
else if (irow > end) {
if (range_end == -1) {
range_end = NR - 1;
}
}
}
END {
print "start =", range_start, "end =", range_end, "mean =", sum / count
}
You can try below:
for r in *; do
awk -v r=$r -F' ' \
'NR==1{b=$2;v=$4;next}{if(r >= b && r <= $2){m=(v+$4)/2; print m; exit}; b=$2;v=$4}' bigfile.txt
done
Explanation:
On the first pass it saves columns 2 and 4 into temporary variables. On all other passes it checks whether the filename r is between the begin range (the previous column 2) and the end range (the current column 2).
It then works out the mean and prints the result.
