I have a log file consisting of a title line and a number of lines containing numeric data:
This is the benchmarks ns/day for wsp systems
21.473
21.483
21.425
21.548
21.588
21.587
21.522
21.547
21.549
21.487
Within the log I need to automatically append a last line of the form AVERAGE = ..., calculated as the average of the values on lines 2-11. I would be very thankful for an elegant bash script that opens the log, loops over the lines, and appends the average as the last line!
Thx!
James
This awk one-liner should work for you:
awk '7;NR>1{t+=$0}END{printf "AVERAGE=%.3f\n",t/(NR-1)}' file
What it does:
the 7 is a true (non-zero) condition with no action, so every line is printed as it is
if the line number is > 1, add the value to the variable t
after the last line has been processed, the END block calculates the average and prints it in the same format as the other numbers (with printf)
If you test it on your example file, it gives:
kent$ awk '7;NR>1{t+=$0}END{printf "AVERAGE=%.3f\n",t/(NR-1)}' f
this is the benchmarks ns/day for wsp systems
21.473
21.483
21.425
21.548
21.588
21.587
21.522
21.547
21.549
21.487
AVERAGE=21.521
I would use awk for this
awk '{sum+=$1; print} END{print "AVERAGE=" sum/(NR-1)}' logfile
sum+=$1 adds the first field of each line to the variable sum.
The END block computes the average based on the number of lines (or 'records', in awk vocabulary), NR.
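If you want the average appended to the log file itself rather than printed to the terminal, a sketch of one option is below; appending to the same file being read is only safe here because the single output line is produced in END, after the whole file has already been read:
awk 'NR>1 { t += $0 } END { printf "AVERAGE = %.3f\n", t/(NR-1) }' logfile >> logfile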
For some context, I am trying to combine multiple files (in an ordered fashion) named FILENAME.xxx.xyz (xxx starts from 001 and increases by 1) into a single file (denoted as $COMBINED_FILE), then replace a number of lines of text in the $COMBINED_FILE taking values from another file (named $ACTFILE). I have two for loops to do this which work perfectly fine. However, when I have a larger number of files, this process tends to take a fairly long time. As such, I am wondering if anyone has any ideas on how to speed this process up?
Step 1:
for i in {001..999}; do
[[ ! -f ${FILENAME}.${i}.xyz ]] && break
cat ${FILENAME}.${i}.xyz >> ${COMBINED_FILE}
mv -f ${FILENAME}.${i}.xyz ${XYZDIR}/${JOB_BASENAME}_${i}.xyz
done
Step 2:
for ((j=0; j<=${NUM_CONF}; j++)); do
let "n = 2 + (${j} * ${LINES_PER_CONF})"
let "m = ${j} + 1"
ENERGY=$(awk -v NUM=$m 'NR==NUM { print $2 }' $ACTFILE)
sed -i "${n}s/.*/${ENERGY}/" ${COMBINED_FILE}
done
I forgot to mention: there are other files named FILENAME.*.xyz which I do not want to append to the $COMBINED_FILE
Some details about the files:
FILENAME.xxx.xyz are molecular xyz files of the form:
Line 1: Number of atoms
Line 2: Title
Lines 3 to (number of atoms + 2): Molecular coordinates
Line (number of atoms + 3): same as line 1
Line (number of atoms + 4): Title 2
... continues on (lines 1 through (number of atoms + 2) are associated with conformer 1, and so on)
The ACT file is a file containing the energies which has the form:
Line 1: conformer1 Energy
Line 2: conformer2 Energy2
Where conformer1 is in column 1 and the energy is in column 2.
The goal is to make the energy of each conformer the title line for that conformer in the combined file (i.e. the energy must become the title of its specific conformer).
If you know that at least one matching file exists, you should be able to do this:
cat -- ${FILENAME}.[0-9][0-9][0-9].xyz > ${COMBINED_FILE}
Note that this will match the 000 file, whereas your script counts from 001. If you know that 000 either doesn't exist or isn't a problem if it were to exist, then you should just be able to do the above.
However, moving these files to new names in another directory does require a loop, or one of the less-than-highly-portable pattern-based renaming utilities.
If you could change your workflow so that the filenames are preserved, it could just be:
mv -- ${FILENAME}.[0-9][0-9][0-9].xyz ${XYZDIR}/${JOB_BASENAME}
where we now have a directory named after the job basename, rather than a path component fragment.
The Step 2 processing should be doable entirely in Awk, rather than a shell loop; you can read the file into an associative array indexed by line number, and have random access over it.
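For instance, a rough sketch of what that single-pass version of Step 2 could look like (assuming LINES_PER_CONF, $ACTFILE and $COMBINED_FILE are set exactly as in the question; the temporary file at the end is just one way of writing the result back):
awk -v lpc="${LINES_PER_CONF}" '
    NR == FNR { energy[NR] = $2; next }                 # first file: the ACT file; remember column 2 by line number
    FNR % lpc == 2 { $0 = energy[int(FNR / lpc) + 1] }  # title lines sit at offset 2 within each conformer block
    { print }
' "${ACTFILE}" "${COMBINED_FILE}" > "${COMBINED_FILE}.tmp" && mv "${COMBINED_FILE}.tmp" "${COMBINED_FILE}"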
Awk can also accept multiple files, so the following pattern may be workable for processing the individual files:
awk 'your program' ${FILENAME}.[0-9][0-9][0-9].xyz
for instance just before concatenating and moving them away. Then you don't have to rely on a fixed LINES_PER_CONF and such. Awk has the FNR variable, which is the record number in the current file; condition/action pairs can tell when processing has moved on to the next file.
GNU Awk also has extensions BEGINFILE and ENDFILE, which are similar to the standard BEGIN and END, but are executed around each processed file; you can do some calculations over the record and in ENDFILE print the results for that file, and clear your accumulation variables for the next file. This is nicer than checking for FNR == 1, and having an END action for the last file.
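As a small illustration of that per-file pattern, here is a GNU Awk-only sketch that just counts the lines in each file; any per-file calculation could take the place of the counter:
gawk '
    BEGINFILE { n = 0 }                                 # reset the accumulator for each new file
    { n++ }                                             # per-file work goes here
    ENDFILE   { printf "%s: %d lines\n", FILENAME, n }  # FILENAME is awk'\''s built-in current-file name
' ${FILENAME}.[0-9][0-9][0-9].xyz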
If you really want to materialize all the file names without globbing, you can always use jot (it's like seq, but with more integer digits in its default mode before it switches to scientific notation):
jot -w 'myFILENAME.%03d' - 0 999 |
mawk '_<(_+=(NR == +_)*__)' \_=17 __=91 # extracting fixed interval
# samples without modulo(%) math
myFILENAME.016
myFILENAME.107
myFILENAME.198
myFILENAME.289
myFILENAME.380
myFILENAME.471
myFILENAME.562
myFILENAME.653
myFILENAME.744
myFILENAME.835
myFILENAME.926
I have allowed myself to create a new question, as some parameters have changed dramatically compared to my first question about optimising my bash script (Optimising my script which lookups into a big compressed file).
In short: I want to look up and extract all the lines where the first column of file 1 (a BAM file) matches the first column of a text file (file 2). For bioinformaticians, it's actually extracting the matching read IDs from two files.
File 1 is a binary compressed 130GB file
File 2 is a tsv file of 1 billion lines
Recently a user came up with a very elegant one-liner combining the decompression of the file and the lookup with awk, and it worked very well. But with the size of these files, it has now been running for more than 200 hours (multithreaded).
Does this "problem" have a name in algorithmics ?
What could be a good way to tackle this challenge ? (If possible with simple solutions such as sed, awk, bash .. )
Thank you a lot
Edit: Sorry for the code; since it was in the linked question, I thought it would be a duplicate. Here is the one-liner used:
#!/bin/bash
samtools view -@ 2 /data/bismark2/aligned_on_nDNA/bamfile.bam | awk -v st="$1" 'BEGIN {OFS="\t"; while (getline < st) {st_array[$1]=$2}} {if ($1 in st_array) {print $0, st_array[$1], "wh_genome"}}'
Think of this as a long comment rather than an answer. The 'merge sort' method can be summarised as: if the two current records don't match, advance by one record in the file whose record is smaller; if they do match, record the match and advance by one record in the big file.
In pseudocode, this looks something like:
currentSmall <- readFirstRecord(smallFile)
currentLarge <- readFirstRecord(largeFile)
searching <- true
while (searching)
    if (currentLarge < currentSmall)
        currentLarge <- readNextRecord(largeFile)
    else if (currentLarge = currentSmall)
        // Bingo!
        saveMatchData(currentLarge, currentSmall)
        currentLarge <- readNextRecord(largeFile)
    else if (currentLarge > currentSmall)
        currentSmall <- readNextRecord(smallFile)
    endif
    if (largeFile.EOF or smallFile.EOF)
        searching <- false
    endif
endwhile
Quite how you translate that into awk or bash is beyond my meagre knowledge of either.
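For illustration only, here is a minimal awk sketch of that same merge loop. It assumes both inputs are plain text, already sorted on their first column in the same (byte-wise) order, and the file names are just placeholders:
awk -v small="small_sorted.tsv" '
    BEGIN { more = ((getline cur < small) > 0); split(cur, s) }
    {
        # advance the small file while its key is still behind the current large-file key
        while (more && s[1] < $1) { more = ((getline cur < small) > 0); split(cur, s) }
        # a match: keep the large-file line (plus the matching value from the small file)
        if (more && s[1] == $1) print $0, s[2]
    }
' large_sorted.tsv
If both files can be kept sorted, the standard join utility implements essentially the same loop.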
I have two files containing a lot of floating-point numbers. I would like to replace one of the floating-point numbers in file 1 with a floating-point number from file 2, using line and character positions to find the numbers (and not their values).
There are a lot of topics on the subject, but I couldn't find anything that uses a second file to copy the values from.
Here are examples of my two files:
File1:
14 4
2.64895E-01 4.75834E+02 2.85629E+05 -9.65829E+01
2.76893E-01 8.53749E+02 4.56385E+05 -7.65658E+01
6.25576E-01 5.27841E+02 5.72960E+05 -7.46175E+01
8.56285E-01 4.67285E+02 5.75962E+05 -5.17586E+01
File2:
Some text on the first line
1
Some text on the third line
0
AND01 0.53758275 0.65728944
AND02 0.64889566 0.53386002
AND03 0.65729386 0.64628194
AND04 0.26586960 0.46582925
AND05 0.46480534 0.57415869
In this particular example, I would like to replace the first number of the second line of File1 (2.64895E-01) with the second floating-point number written on line 5 of File2 (0.65728944).
Note: the value of the numbers will change according to which file I consider, so I have to identify the numbers by their positions inside the files.
I am very new to using bash scripts and have only used the "sed" command until now to modify my files.
Any help is welcome :)
Thanks a lot for your inputs!
It's not hard to do it in bash, but if that's not a strict requirement, an easier and more concise solution is possible with an actual text-processing tool like awk:
awk 'NR==5 {val=$2} NR>FNR {FNR==2 && $1=val; print}' file2 file1
Explanation: read file2 first, and store the second field of the 5th record in the variable val (the first part: NR==5 {val=$2}). Then read file1 and print every line, but replace the first field of the second record (FNR is the current-file record number, and NR is the total number of records in all files so far) with the value stored in val.
In general, an awk program consists of pattern { actions } sequences; pattern is a condition under which the series of actions gets executed. $1..$NF are variables holding the field values, and each line (record) is split into fields on the field separator (the FS variable, or the -F'..' option), which defaults to whitespace.
The result (output):
14 4
0.53758275 4.75834E+02 2.85629E+05 -9.65829E+01
2.76893E-01 8.53749E+02 4.56385E+05 -7.65658E+01
6.25576E-01 5.27841E+02 5.72960E+05 -7.46175E+01
8.56285E-01 4.67285E+02 5.75962E+05 -5.17586E+01
I'm tasked with imputing masked genotypes, and I have to mask (hide) 2% of genotypes.
The file I do this in looks like this (genotype.dat):
M rs4911642
M rs9604821
M rs9605903
M rs5746647
M rs5747968
M rs5747999
M rs2070501
M rs11089263
M rs2096537
and to mask it, I simply change M to S2.
Yet, I have to do this for 110 (2%) of 5505 lines, so my strategy of using a random number generator (generating 110 numbers between 1 and 5505) and then manually changing the corresponding lines' M to S2 took almost an hour... (I know, not terribly sophisticated).
I thought about saving the numbers in a separate file (maskedlines.txt) and then telling awk to replace the first character on each of those line numbers with S2, but I could not find any adaptable example of how to do this.
Anyway, any suggestions of how to tackle this will be deeply appreciated.
Here's one simple way, if you have shuf (it's in GNU coreutils, so if you have Linux, you almost certainly have it):
sed "$(printf '%ds/M/S2/;' $(shuf -n110 -i1-5505 | sort -n))" \
genotype.dat > genotype.masked
A more sophisticated version wouldn't depend on knowing that you want 110 of 5505 lines masked; you can easily extract the line count with lines=$(wc -l < genotype.dat), and from there you can compute the percentage.
shuf is used to produce a random sample of lines, usually from a file; the -i1-5505 option means to use the integers from 1 to 5505 instead, and -n110 means to produce a random sample of 110 (without repetition). I sorted that for efficiency before using printf to create a sed edit script.
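A minimal sketch of that more general version, assuming a 2% mask rate (rounding up is my choice):
lines=$(wc -l < genotype.dat)            # total number of lines in the file
nmask=$(( (lines * 2 + 99) / 100 ))      # 2% of them, rounded up
sed "$(printf '%ds/M/S2/;' $(shuf -n "$nmask" -i 1-"$lines" | sort -n))" \
    genotype.dat > genotype.masked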
awk 'NR==FNR{a[$1]=1;next;} a[FNR]{$1="S2"} 1' maskedlines.txt genotype.dat
How it works
In sum, we first read maskedlines.txt into an associative array a. This file is assumed to have one number per line, and a of that number is set to one. We then read genotype.dat. If a for that line number is one, we change the first field to S2 to mask it. The line, whether changed or not, is then printed.
In detail:
NR==FNR{a[$1]=1;next;}
In awk, FNR is the number of records (lines) read so far from the current file and NR is the total number of lines read so far. So, when NR==FNR, we are reading the first file (maskedlines.txt). This file contains the line numbers of the lines in genotype.dat that are to be masked. For each of these line numbers, we set a to 1. We then skip the rest of the commands and jump to the next line.
a[FNR]{$1="S2"}
If we get here, we are working on the second file: genotype.dat. For each line in this file, we check to see if its line number, FNR, was mentioned in maskedlines.txt. If it was, we set the first field to S2 to mask this line.
1
This is awk's cryptic shorthand to print the current line.
I'd like to jump to a specific line in a file, line 33866. If the third number in this line is within the range -10 and +10 then I'd like to print the entire next line, 33867, to a file and stop.
If it isn't, then it should look at line 67893 (a difference of +34027); now, if it's in the range, print the next line and stop.
This should continue, next looking at line 101920 (difference of +34027) and so on until it finds a value in that range or reaches the end of the file.
Now, regardless of whether or not that printed anything, I need it to repeat the process but at a new starting line; this time the new start line is 33869 (a difference of 3), to print line 33870 to the same file.
Ideally, it would repeat n times, n being a value read from the user when the script is run.
Please stop me right there if this is too much to ask and I'll go back to banging my head against the wall and searching around the net for how to make this work on my own. Also let me know if I'm going about this the wrong way by trying to jump to a specific line and should search for the line by another means.
Any input greatly appreciated!
Edit:
Here is an example of the two lines being handled:
17.33051021 18.02125499 30.40520932
1.776579372 -23.74037576 12.48448432
with the first number starting in column 6, the second number starting in column 26, and the third in column 46 (if the minus sign is ignored, I don't think it will matter).
Reading your question, I guess your file could be pretty big. Also, I assume "the 3rd number" means the 3rd field, so I came up with this one-liner:
awk -v l=33866 -v d=34027 'NR==l&&$3>=-10&&$3<=10{p=1;next} p{print;exit} NR==l{l+=d}' file
You just need to change the two arguments: l (the first line number you need to check) and d (the difference).
After finding the right line to print, awk stops processing further lines in your file.
I didn't test it, so sorry if there are typos, but it shows my idea.
You should give some example input, etc. E.g. the 3rd number, what is that? The 3rd field? Or, in a line like aa bb 2 dfd 3 asf 555, is it the 555?
Another point: you should really show what you have done so far for your problem.
Since we don't have any input to test with, I am giving you an answer without testing.
tl=$(wc -l < input)
awk -v tl="$tl" '{
for (i=33866; i<tl; i+=34027) {
if (NR==i && $3 >= -10 && $3 <= 10) {
getline;
print;
exit;
}
}
}' input