AWK - replace with constant character in a specified number of random lines - bash

I'm tasked with imputing masked genotypes, and I have to mask (hide) 2% of genotypes.
The file I do this in looks like this (genotype.dat):
M rs4911642
M rs9604821
M rs9605903
M rs5746647
M rs5747968
M rs5747999
M rs2070501
M rs11089263
M rs2096537
and to mask it, I simply change M to S2.
Yet I have to do this for 110 (2%) of 5505 lines, so my strategy of using a random number generator (generating 110 numbers between 1 and 5505) and then manually changing the M to S2 on each corresponding line took almost an hour... (I know, not terribly sophisticated).
I thought about saving the numbers in a separate file (maskedlines.txt) and then telling awk to replace the first character on each of those lines with S2, but I could not find any adaptable example of how to do this.
Anyway, any suggestions of how to tackle this will be deeply appreciated.

Here's one simple way, if you have shuf (it's in GNU coreutils, so if you have Linux, you almost certainly have it):
sed "$(printf '%ds/M/S2/;' $(shuf -n110 -i1-5505 | sort -n))" \
genotype.dat > genotype.masked
A more sophisticated version wouldn't depend on knowing that you want 110 of 5505 lines masked; you can easily extract the line count with lines=$(wc -l < genotype.dat), and from there you can compute the percentage.
shuf is used to produce a random sample of lines, usually from a file; the -i1-5505 option means to use the integers from 1 to 5505 instead, and -n110 means to produce a random sample of 110 (without repetition). I sorted that for efficiency before using printf to create a sed edit script.
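The "more sophisticated version" mentioned above could look roughly like the following sketch. The function name mask_random_lines and its two arguments are made up for illustration, and it assumes GNU shuf is available and that every line starts with M, as in the sample.

```shell
# mask_random_lines FILE PCT
# Print FILE with PCT percent of its lines (rounded to the nearest line)
# changed from a leading "M" to "S2". Name and arguments are illustrative.
mask_random_lines() {
    local file=$1 pct=$2 lines n
    lines=$(wc -l < "$file")
    n=$(( (lines * pct + 50) / 100 ))      # e.g. 5505 lines at 2% gives 110
    if [ "$n" -eq 0 ]; then                # nothing to mask: copy unchanged
        cat -- "$file"
        return
    fi
    # Build a sed script such as "17s/^M/S2/;402s/^M/S2/;..." from n random
    # line numbers drawn without repetition.
    sed "$(printf '%ds/^M/S2/;' $(shuf -n "$n" -i "1-$lines" | sort -n))" "$file"
}
```

It would be called as mask_random_lines genotype.dat 2 > genotype.masked.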

awk 'NR==FNR{a[$1]=1;next;} a[FNR]{$1="S2"} 1' maskedlines.txt genotype.dat
How it works
In short, we first read maskedlines.txt into an associative array a. This file is assumed to have one line number per line, and the entry of a for that number is set to 1. We then read genotype.dat. If the entry of a for the current line number is 1, we change the first field to S2 to mask it. The line, whether changed or not, is then printed.
In detail:
NR==FNR{a[$1]=1;next;}
In awk, FNR is the number of records (lines) read so far from the current file and NR is the total number of lines read so far. So, when NR==FNR, we are reading the first file (maskedlines.txt). This file contains the line numbers of the lines in genotype.dat that are to be masked. For each of these line numbers, we set a[$1] to 1. We then skip the rest of the commands and jump to the next line.
a[FNR]{$1="S2"}
If we get here, we are working on the second file: genotype.dat. For each line in this file, we check to see if its line number, FNR, was mentioned in maskedlines.txt. If it was, we set the first field to S2 to mask this line.
1
This is awk's cryptic shorthand to print the current line.
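To see the whole thing end to end, here is a small demonstration on a five-line stand-in for genotype.dat. The temporary directory and the choice of lines 2 and 4 are only for illustration.

```shell
# Tiny stand-in data: five markers, of which lines 2 and 4 get masked.
workdir=$(mktemp -d)
seq 1 5 | sed 's/^/M rs/' > "$workdir/genotype.dat"
printf '%s\n' 2 4 > "$workdir/maskedlines.txt"

# Same program as above: remember the listed line numbers, then mask them.
awk 'NR==FNR { a[$1] = 1; next } a[FNR] { $1 = "S2" } 1' \
    "$workdir/maskedlines.txt" "$workdir/genotype.dat" > "$workdir/genotype.masked"

cat "$workdir/genotype.masked"
# M rs1
# S2 rs2
# M rs3
# S2 rs4
# M rs5
```

For the real file, maskedlines.txt could be produced with something like shuf -n110 -i1-5505 > maskedlines.txt, assuming GNU shuf is installed.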

Related

Is there a faster way to combine files in an ordered fashion than a for loop?

For some context, I am trying to combine multiple files (in an ordered fashion) named FILENAME.xxx.xyz (xxx starts from 001 and increases by 1) into a single file (denoted as $COMBINED_FILE), then replace a number of lines of text in the $COMBINED_FILE taking values from another file (named $ACTFILE). I have two for loops to do this which work perfectly fine. However, when I have a larger number of files, this process tends to take a fairly long time. As such, I am wondering if anyone has any ideas on how to speed this process up?
Step 1:
for i in {001..999}; do
[[ ! -f ${FILENAME}.${i}.xyz ]] && break
cat ${FILENAME}.${i}.xyz >> ${COMBINED_FILE}
mv -f ${FILENAME}.${i}.xyz ${XYZDIR}/${JOB_BASENAME}_${i}.xyz
done
Step 2:
for ((j=0; j<=${NUM_CONF}; j++)); do
let "n = 2 + (${j} * ${LINES_PER_CONF})"
let "m = ${j} + 1"
ENERGY=$(awk -v NUM=$m 'NR==NUM { print $2 }' $ACTFILE)
sed -i "${n}s/.*/${ENERGY}/" ${COMBINED_FILE}
done
I forgot to mention: there are other files named FILENAME.*.xyz which I do not want to append to the $COMBINED_FILE
Some details about the files:
FILENAME.xxx.xyz are molecular xyz files of the form:
Line 1: Number of atoms
Line 2: Title
Line 3-Number of atoms: Molecular coordinates
Line (number of atoms +1): same as line 1
Line (number of atoms +2): Title 2
... continues on (where line 1 through Number of atoms is associated with conformer 1, and so on)
The ACT file is a file containing the energies which has the form:
Line 1: conformer1 Energy
Line 2: conformer2 Energy2
Where conformer1 is in column 1 and the energy is in column 2.
The goal is to make each conformer's energy the title line of that conformer in the combined file (each energy must end up as the title of its own conformer).
If you know that at least one matching file exists, you should be able to do this:
cat -- ${FILENAME}.[0-9][0-9][0-9].xyz > ${COMBINED_FILE}
Note that this will match the 000 file, whereas your script counts from 001. If you know that 000 either doesn't exist or isn't a problem if it were to exist, then you should just be able to do the above.
However, moving these files to renamed names in another directory does require a loop, or one of the less-than-highly portable pattern-based renaming utilities.
If you could change your workflow so that the filenames are preserved, it could just be:
mv -- ${FILENAME}.[0-9][0-9][0-9].xyz ${XYZDIR}/${JOB_BASENAME}
where we now have a directory named after the job basename, rather than a path component fragment.
The Step 2 processing should be doable entirely in Awk, rather than a shell loop; you can read the file into an associative array indexed by line number, and have random access over it.
Awk can also accept multiple files, so the following pattern may be workable for processing the individual files:
awk 'your program' ${FILENAME}.[0-9][0-9][0-9].xyz
for instance just before catenating and moving them away. Then you don't have to rely on a fixed LINES_PER_CONF and such. Awk has the FNR variable, which is the record number within the current file; condition/action pairs can use it to tell when processing has moved to the next file.
GNU Awk also has extensions BEGINFILE and ENDFILE, which are similar to the standard BEGIN and END, but are executed around each processed file; you can do some calculations over the record and in ENDFILE print the results for that file, and clear your accumulation variables for the next file. This is nicer than checking for FNR == 1, and having an END action for the last file.
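As a rough sketch of the single-pass Awk idea for Step 2: the function below reads the ACT file into an array first and then rewrites every title line of the combined file. The function name retitle_with_energies is invented here, and it keeps the question's assumption that every conformer occupies the same number of lines (LINES_PER_CONF). It writes to standard output rather than editing in place.

```shell
# retitle_with_energies ACTFILE COMBINED_FILE LINES_PER_CONF
# Print COMBINED_FILE with line 2 of every conformer block replaced by the
# energy (column 2) from the matching line of ACTFILE.
retitle_with_energies() {
    awk -v lpc="$3" '
        NR == FNR { energy[FNR] = $2; next }    # first file: energy per conformer
        FNR >= 2 && (FNR - 2) % lpc == 0 {      # title line of a conformer block
            j = (FNR - 2) / lpc + 1             # 1-based conformer index
            if (j in energy) { print energy[j]; next }
        }
        { print }                               # every other line is unchanged
    ' "$1" "$2"
}
```

Usage would be along the lines of retitle_with_energies "$ACTFILE" "$COMBINED_FILE" "$LINES_PER_CONF" > tmp && mv tmp "$COMBINED_FILE". This replaces one awk and one sed -i invocation per conformer with a single pass over each file.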
If you really want to materialize all the file names without globbing, you can always use jot (it's like seq, but by default prints more integer digits before switching to scientific notation):
jot -w 'myFILENAME.%03d' - 0 999 |
mawk '_<(_+=(NR == +_)*__)' \_=17 __=91 # extracting fixed interval
# samples without modulo(%) math
myFILENAME.016
myFILENAME.107
myFILENAME.198
myFILENAME.289
myFILENAME.380
myFILENAME.471
myFILENAME.562
myFILENAME.653
myFILENAME.744
myFILENAME.835
myFILENAME.926

Replace a part of a file by a part of another file

I have two files containing a lot of floating numbers. I would like to replace one of the floating numbers from file 1 by a floating number from File 2, using lines and characters to find the numbers (and not their values).
There are a lot of topics on the subject, but I couldn't find anything that uses a second file to copy the values from.
Here are examples of my two files:
File1:
14 4
2.64895E-01 4.75834E+02 2.85629E+05 -9.65829E+01
2.76893E-01 8.53749E+02 4.56385E+05 -7.65658E+01
6.25576E-01 5.27841E+02 5.72960E+05 -7.46175E+01
8.56285E-01 4.67285E+02 5.75962E+05 -5.17586E+01
File2:
Some text on the first line
1
Some text on the third line
0
AND01 0.53758275 0.65728944
AND02 0.64889566 0.53386002
AND03 0.65729386 0.64628194
AND04 0.26586960 0.46582925
AND05 0.46480534 0.57415869
In this particular example, I would like to replace the first number of the second line of File1 (2.64895E-01) by the second floating number written on line 5 of File2 (0.65728944).
Note: the value of the numbers will change according to which file I consider, so I have to identify the numbers by their positions inside the files.
I am very new to bash scripting and have only used the sed command until now to modify my files.
Any help is welcome :)
Thanks a lot for your inputs!
It's not hard to do it in bash, but if that's not a strict requirement, an easier and more concise solution is possible with an actual text-processing tool like awk:
awk 'NR==5 {val=$3} NR>FNR {if (FNR==2) $1=val; print}' file2 file1
Explanation: read file2 first, and store the third field of the 5th record in the variable val (the first part: NR==5 {val=$3}). That field is the second floating-point number on the line, because the first field is the AND01 label. Then read file1 and print every line, but replace the first field of the second record with the value stored in val (FNR is the record number within the current file, and NR is the total number of records read from all files so far).
In general, an awk program consists of pattern { actions } sequences. pattern is a condition under which a series of actions will get executed. $1..$NF are variables with field values, and each line (record) is split into fields on the field separator (FS variable, or -F'..' option), which defaults to a single space, meaning any run of spaces or tabs.
The result (output):
14 4
0.65728944 4.75834E+02 2.85629E+05 -9.65829E+01
2.76893E-01 8.53749E+02 4.56385E+05 -7.65658E+01
6.25576E-01 5.27841E+02 5.72960E+05 -7.46175E+01
8.56285E-01 4.67285E+02 5.75962E+05 -5.17586E+01
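Since the numbers have to be identified by position, the same idea can be parameterized. This is only a sketch: the function name replace_field and its argument order are made up here, and it assumes whitespace-separated fields in both files.

```shell
# replace_field SRC SRC_LINE SRC_FIELD DST DST_LINE DST_FIELD
# Print DST with field DST_FIELD of line DST_LINE replaced by
# field SRC_FIELD of line SRC_LINE of SRC.
replace_field() {
    awk -v sl="$2" -v sf="$3" -v dl="$5" -v df="$6" '
        NR == FNR { if (FNR == sl) val = $sf; next }   # first file: pick up the value
        FNR == dl { $df = val }                        # second file: overwrite the target field
        { print }
    ' "$1" "$4"
}
```

For the example in the question, replace_field file2 5 3 file1 2 1 copies 0.65728944 into the first field of line 2 of file1. Note that assigning to a field makes awk rebuild that line with single spaces between fields.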

Averaging of the digital data

I have a log file consisting of a title line followed by a number of lines of numeric data:
This is the benchmarks ns/day for wsp systems
21.473
21.483
21.425
21.548
21.588
21.587
21.522
21.547
21.549
21.487
I need to automatically append a last line to the log, consisting of AVERAGE = followed by the average of the numbers on lines 2-11. I would be very thankful for an elegant bash script that opens the log, loops over the lines, and adds the average as the last line!
Thx!
James
This awk one-liner should work for you:
awk '7;NR>1{t+=$0}END{printf "AVERAGE=%.3f\n",t/(NR-1)}' file
What it does:
7 is a pattern that is always true (any non-zero number would do), so each line is read and printed as it is
if the line number is >1, add the value to the variable t
after the last line of the file has been printed, calculate the average and print it in the same format as the other numbers (printf function)
if you test it on your example file, it gives:
kent$ awk '7;NR>1{t+=$0}END{printf "AVERAGE=%.3f\n",t/(NR-1)}' f
This is the benchmarks ns/day for wsp systems
21.473
21.483
21.425
21.548
21.588
21.587
21.522
21.547
21.549
21.487
AVERAGE=21.521
I would use awk for this
awk '{print} NR>1 {sum+=$1} END {print "AVERAGE=" sum/(NR-1)}' logfile
sum+=$1 adds the first field of each line after the title line to a running total
The END block computes the average from the number of lines (or 'records' in awk vocabulary), NR, minus one for the title line
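Both one-liners above print to standard output. To actually add the line to the log itself, as the question asks, one possible sketch is the following. The function name append_average is invented for illustration, and the "AVERAGE = " spelling follows the question.

```shell
# append_average LOGFILE
# Append "AVERAGE = <mean of lines 2..end>" to LOGFILE.
append_average() {
    local log=$1 avg
    # Compute the mean first, so the file is not being read while it is written.
    avg=$(awk 'NR > 1 { t += $1; n++ } END { if (n) printf "%.3f", t / n }' "$log")
    if [ -n "$avg" ]; then
        printf 'AVERAGE = %s\n' "$avg" >> "$log"
    fi
}
```

Running it twice would append a second AVERAGE line (and include the first one in the sum as 0), so it is meant to be run once per finished log.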

Comparison of extra-large subsets of strings

Every day there is a file of 2,000,000 to 4,000,000 lines, each containing a unique 15-digit number, like this:
850025000010145
401115000010152
400025000010166
770025555010152
512498004158752
There is one such file for each day since the beginning of the current year. I have to compare every line of today's file with all previous files from the beginning of the year and return only those numbers that have never appeared in any of the checked files.
Which language and algorithm should I use? How to implement it?
You should be able to do this without having to write any code beyond a simple script (i.e. bash, Windows batch, Powershell, etc.). There are standard tools that make quick work of this type of thing.
First, you have some number of files that contain from 2 million to 4 million numbers. It's difficult to work with all those files, so the first thing you want to do is create a combined file that's sorted. The simple-minded way to do that is to concatenate all the files into a single file, sort it, and remove duplicates. For example, using the GNU/Linux cat and sort commands:
cat file1 file2 file3 file4 > combined
sort -u combined > combined_sort
(The -u removes duplicates)
The problem with that approach is that you end up sorting a very large file. Figure 4 million lines at 15 characters, plus newlines, on each line, and almost 100 days of files, and you're working with 7 gigabytes. A whole year's worth of data would be 25 gigabytes. That takes a long time.
So instead, sort each individual file, then merge them:
sort -u file1 >file1_sort
sort -u file2 >file2_sort
...
sort -m -u file1_sort file2_sort file3_sort > combined_sort
The -m switch merges the already-sorted files.
Now what you have is a sorted list of all the identifiers you've seen so far. You want to compare today's file with that. First, sort today's file:
sort -u today >today_sort
Now, you can compare the files and output only the lines unique to today's file:
comm -2 -3 today_sort combined_sort
-2 says suppress lines that occur only in the second file, and -3 says to suppress lines that are common to both files. So all you'll get is the lines in today_sort that don't exist in combined_sort.
Now, if you're going to do this every day, then you need to take the output from the comm command and merge it with combined_sort so that you can use that combined file tomorrow. That prevents you from having to rebuild the combined_sort file every day. So:
comm -2 -3 today_sort combined_sort > new_values
Then:
sort -m combined_sort new_values > combined_sort_new
You'd probably want to name the file with the date, so you'd have combined_sort_20140401 and combined_sort_20140402, etc.
So if you started at the beginning of the year and wanted to do this every day, your script would look something like:
sort -u "$todays_file" > todays_sorted_file
comm -2 -3 todays_sorted_file "$old_combined_sort" > todays_uniques
sort -m "$old_combined_sort" todays_uniques > "$new_combined_sort"
$todays_file, $old_combined_sort, and $new_combined_sort are parameters that you pass on the command line. So, if the script was called "daily":
daily todays_file.txt all_values_20140101 all_values_20140102
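Here is the whole pipeline run on a few made-up lines, as a sanity check of the steps above. The file names under the temporary directory are illustrative. Setting LC_ALL=C is an addition of this sketch: sort and comm need to agree on the collation order, and the C locale is also usually the fastest for plain digit strings.

```shell
export LC_ALL=C                       # same byte-wise ordering for sort and comm
workdir=$(mktemp -d)

# Two "previous" days and today's file, with some overlap.
printf '%s\n' 850025000010145 401115000010152 > "$workdir/day1"
printf '%s\n' 400025000010166 850025000010145 > "$workdir/day2"
printf '%s\n' 401115000010152 770025555010152 512498004158752 > "$workdir/today"

# Sort each file on its own, then merge the sorted files.
sort -u "$workdir/day1" > "$workdir/day1_sort"
sort -u "$workdir/day2" > "$workdir/day2_sort"
sort -m -u "$workdir/day1_sort" "$workdir/day2_sort" > "$workdir/combined_sort"

# Lines that appear only in today's file.
sort -u "$workdir/today" > "$workdir/today_sort"
comm -2 -3 "$workdir/today_sort" "$workdir/combined_sort" > "$workdir/new_values"

# Fold today's new values into the combined file for tomorrow.
sort -m "$workdir/combined_sort" "$workdir/new_values" > "$workdir/combined_sort_new"

cat "$workdir/new_values"
# 512498004158752
# 770025555010152
```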
If you must solve the problem by hand:
- Convert the strings to 64-bit integers. This saves space (2x to 4x) and speeds up calculations.
- Sort the current file of integers
- Merge the current file with the old data file (already sorted), selecting the new numbers
The merging step may look like the merge step of MergeSort.
You can store ranges of numbers in separate files to avoid extra large file sizes.
P.S. I wanted to propose to use bit map, but it will have size about 125 TB
One solution could be to build a prefix tree based on the previous n-1 files (suppose the n-th file was created today). The most time-consuming build process has to be done only once. After you build the prefix tree, you can save it as a file (google for this topic).
Run the program to check the new file:
try (BufferedReader br = new BufferedReader(new FileReader("new_file.txt"))) {
    String line;
    while ((line = br.readLine()) != null) {
        if (!tree.contains(line)) {
            // not seen in any previous file: count it and remember it
            counter++;
            tree.insert(line);
        }
    }
}
So every day you run this 'pseudo' code, get the unique numbers, and update the tree.
contains takes O(m) time where m is number of chars
insert takes O(m) time too
I would suggest Java.

Search a specific line for a value within a range. Unix bash script

I'd like to jump to a specific line in a file, line 33866. If the third number in this line is within the range -10 and +10 then I'd like to print the entire next line, 33867, to a file and stop.
If it isn't, then it should look at line 67893 (difference of +34027), and if it's in the range, print the next line and stop.
This should continue, next looking at line 101920 (difference of +34027) and so on until it finds a value in that range or reaches the end of the file.
Now regardless of whether or not that printed anything I need it to repeat the process but at a new starting line, this time the new start line is 33869 (difference 3), to print line 33870 to the same file.
Ideally, it would repeat n times, n being a value read from the user when the script is run.
Please stop me right there if this is too much to ask and I'll go back to banging my head against the wall and searching around the net for how to make this work on my own. Also let me know if I'm going about this the wrong way by trying to jump to a specific line and should search for the line by another means.
Any input greatly appreciated!
Edit:
Here is an example of the two lines being handled:
17.33051021 18.02125499 30.40520932
1.776579372 -23.74037576 12.48448432
with the first number starting in column 6, the second number starting in 26 and third in 46. (if minus is ignored I don't think it will matter)
Reading your question, I guess your file could be pretty big. I also assume "the 3rd number" is the 3rd field. So I came up with this one-liner:
awk -v l=33866 -v d=34027 'p{print;exit} NR==l{if($3>=-10&&$3<=10)p=1;else l+=d}' file
You just need to change the two arguments: l (the first line number you need to check) and d (the difference).
Once it has found the right line to print, awk stops processing further lines in your file.
I didn't test it, so apologies if there are typos, but it shows my idea.
You should give some example input, etc. For instance, what is the 3rd number? The 3rd field? Or, in a line like aa bb 2 dfd 3 asf 555, the 555?
Also, you should show what you have already tried for your problem.
Since we don't have any input to test with, I am giving you an answer without testing.
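awk -v start=33866 -v step=34027 '
# Only lines start, start+step, start+2*step, ... are candidates.
NR >= start && (NR - start) % step == 0 && $3 >= -10 && $3 <= 10 {
    # The third field is in range: print the following line and stop.
    if ((getline) > 0)
        print
    exit
}' input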
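Neither answer covers the "repeat n times with the start line shifted by 3" part of the question. One way it might be done in a single pass is sketched below. The function name scan_blocks and its arguments are invented here, and it assumes that the shift times n is no larger than the step, so that the n sequences of candidate lines never overlap. Unlike the procedure described in the question, the output comes out in file order rather than grouped by starting line.

```shell
# scan_blocks FILE START STEP STRIDE N
# For k = 0 .. N-1, look at lines START + k*STRIDE, then every STEP lines after
# that; at the first one whose 3rd field is within [-10, 10], print the line
# that follows it. Each k contributes at most one printed line.
scan_blocks() {
    awk -v start="$2" -v step="$3" -v stride="$4" -v n="$5" '
        want { print; want = 0 }             # this is the line after a hit
        NR >= start {
            off = (NR - start) % step        # position within the current block
            k = off / stride                 # which of the N sequences, if any
            if (off % stride == 0 && k < n && !(k in hit) &&
                $3 >= -10 && $3 <= 10) {
                hit[k] = 1                   # sequence k is finished
                want = 1
            }
        }
    ' "$1"
}
```

With the numbers from the question this would be called as scan_blocks file 33866 34027 3 "$n" > output, where n is read from the user beforehand, for example with read -r n.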

Resources