Random sampling in bash

I have an ensemble file with a large number of samples in it (say, 100 samples taken at different times in one ensemble). My ensemble looks like this:
20
-166.26604715
C -6.8775736572 0.7377700983 -1.2173950464
C -6.3769524449 2.0225374370 -1.4858792908
C -5.9530432940 -0.2309614983 -0.7933107594
C 0.924046 0.593909 0.306394
C 0.578941 0.740133 0.786926
C 0.43637 0.332195 0.77888
C 0.100887 0.785084 0.835159
C 0.761209 0.496077 0.426298
C 0.945798 0.821802 0.709269
C 0.157828 0.119752 0.909685
C 0.868084 0.449256 0.705432
C 0.399686 0.645049 0.696163
C 0.300211 0.591664 0.956569
C 0.156318 0.796877 0.132388
C 0.548236 0.984306 0.823073
C 0.422985 0.964365 0.793915
C 0.173531 0.568816 0.93252
C 0.205224 0.0199054 0.84918
C 0.726009 0.758101 0.197576
C 0.924046 0.593909 0.306394
20
-166.45321715
C -6.8775736572 0.7377700983 -1.2173950464
C -6.3769524449 2.0225374370 -1.4858792908
C -5.9530432940 -0.2309614983 -0.7933107594
C 0.924046 0.593909 0.306394
C 0.578941 0.740133 0.786926
C 0.43637 0.332195 0.77888
C 0.100887 0.785084 0.835159
C 0.761209 0.496077 0.426298
C 0.945798 0.821802 0.709269
C 0.157828 0.119752 0.909685
C 0.868084 0.449256 0.705432
C 0.399686 0.645049 0.696163
C 0.300211 0.591664 0.956569
C 0.156318 0.796877 0.132388
C 0.548236 0.984306 0.823073
C 0.422985 0.964365 0.793915
C 0.173531 0.568816 0.93252
C 0.205224 0.0199054 0.84918
C 0.726009 0.758101 0.197576
C 0.924046 0.593909 0.306394
20
-166.41234567
..
..
continues
The first line represents the number of atoms, so \s+20 is my repeating pattern; it repeats every 22 lines. The second line represents the energy, and from the third line on are the spatial coordinates (x, y, z). I want to randomly sample out, for example, just 4 samples (out of the 100 in this example, so 4*22 = 88 lines). Each of the 4 samples should have the same data structure as shown above (2 headers + 20 lines). I think I could use random number generators in Python, but because I am using bash for the rest of the code, I would like to see if there is a way to do it in bash. Thanks in advance!

Your sample file isn't really suitable for testing, so I created this one and changed 106 to 20 to keep it small:
20
-166.26604715
C -6.8775736572 0.7377700983 -1.2173950464
C -6.3769524449 2.0225374370 -1.4858792908
C -5.9530432940 -0.2309614983 -0.7933107594
C 0.924046 0.593909 0.306394
C 0.578941 0.740133 0.786926
C 0.43637 0.332195 0.77888
C 0.100887 0.785084 0.835159
C 0.761209 0.496077 0.426298
C 0.945798 0.821802 0.709269
C 0.157828 0.119752 0.909685
C 0.868084 0.449256 0.705432
C 0.399686 0.645049 0.696163
C 0.300211 0.591664 0.956569
C 0.156318 0.796877 0.132388
C 0.548236 0.984306 0.823073
C 0.422985 0.964365 0.793915
C 0.173531 0.568816 0.93252
C 0.205224 0.0199054 0.84918
C 0.726009 0.758101 0.197576
C 0.924046 0.593909 0.306394
So, the goal is to create a random sample of size N from the records on lines 3 through 22 (2 headers + 20 records).
$ awk -v s=4 'NR==1 {n=$1}
NR<3;
NR>2 && NR<=n+2 {print | "shuf -n"s}' file
20
-166.26604715
C 0.945798 0.821802 0.709269
C 0.548236 0.984306 0.823073
C 0.157828 0.119752 0.909685
C 0.422985 0.964365 0.793915
Here I picked a sample size of 4. The script reads the number of records, prints the first two lines, and samples the requested number of records from those that follow.
Note that this is sampling without replacement, meaning the same record cannot be picked more than once; usually that's what is desired.
You may want to print the new number of records at the top, but that's an easy change, left as an exercise...
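For example, one way to do that exercise (just a sketch, printing the requested sample size s as the new count instead of the original one):
$ awk -v s=4 'NR==1 {n=$1; print s; next}
NR==2;
NR>2 && NR<=n+2 {print | "shuf -n"s}' file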
UPDATE
For multiple data sets with the same structure (actually the number of records doesn't have to be the same), you need these modifications.
$ awk -v s=4 'BEGIN {cmd="shuf -n"s; n=-2}
r==n+2 {n=$1; close(cmd)}
{r=(NR-1)%(n+2)+1}
r<=2;
r>2 && r<=n+2 {print | cmd }' file.3
20
-166.26604715
C 0.422985 0.964365 0.793915
C 0.205224 0.0199054 0.84918
C 0.399686 0.645049 0.696163
C 0.726009 0.758101 0.197576
20
-166.26604715
C 0.43637 0.332195 0.77888
C 0.761209 0.496077 0.426298
C -6.3769524449 2.0225374370 -1.4858792908
C 0.205224 0.0199054 0.84918
20
-166.26604715
C 0.156318 0.796877 0.132388
C 0.157828 0.119752 0.909685
C -6.8775736572 0.7377700983 -1.2173950464
C -5.9530432940 -0.2309614983 -0.7933107594
r is the relative position index within each data set, and some special handling is required for line 1 (hence n=-2 in BEGIN). The command also needs to be closed after each data set to flush shuf's buffer. Otherwise the logic is essentially the same, with NR replaced by r.
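If instead you want to pick whole samples (complete 22-line blocks, i.e. 2 headers + 20 records) rather than records within each sample, a minimal shell sketch for that (the file name ensemble.xyz and the count of 100 blocks are placeholders) is to pick random block indices with shuf and cut each one out with sed:
$ shuf -i 0-99 -n 4 | while read -r i; do
    sed -n "$((i * 22 + 1)),$((i * 22 + 22))p" ensemble.xyz
  done
The selected blocks come out in random order; pipe the shuf output through sort -n first if you want them in file order.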

Related

How to count number of occurrences in a sorted text file

I have a sorted text file with the following format:
Company1 Company2 Date TransactionAmount
A B 1/1/19 20000
A B 1/4/19 200000
A B 1/19/19 324
A C 2/1/19 3456
A C 2/1/19 663633
A D 1/6/19 3632
B C 1/9/19 84335
B C 1/23/19 253
B C 1/13/19 850
B D 1/1/19 234
B D 1/8/19 635
C D 1/9/19 749
C D 1/10/19 203200
Ultimately I want a Python dictionary so that each pair maps to a list containing the number of transactions and the total amount of all transactions. For instance, (A,B) would map to [3,220324].
The file has ~250,000 lines in this format and each pair may have 1 transaction up to ~10 or so transactions. There are also tens of thousands of pairs of companies.
Here's the only way I've thought of implementing it.
my_dict = {}
file = open("my_file.txt").readlines()[1:]
for i in file:
    i = i.split()
    pair = (i[0], i[1])
    amt = int(i[3])
    if pair in my_dict:
        exist = my_dict[pair]
        exist[0] += 1
        exist[1] += amt
        my_dict[pair] = exist
    else:
        my_dict[pair] = [1, amt]
I feel like there is a faster way to do this. Any ideas?

conditional replacement in a file based on a column

I have a file with several columns that looks like this:
MARKER EA NEA N_x EA_y NEA_y N_y
rs1000000 G A 231410.0 G A 118230.0
rs10000010 T C 322079.0 C T 118230.0
rs10000017 C T 233146.0 C T 118230.0
rs10000023 G T 233860.0 T G 118230.0
rs10000027 C G 72852.4 C G 118230.0
rs10000029 T C 179950.0 NA NA NA
rs1000002 C T 233932.0 C T 118230.0
I want to replace values in columns EA and NEA with values from EA_y and NEA_y, but if EA_y and NEA_y are NA then I want to keep values in EA and NEA.
I can do it in R using ifelse, but I would like to learn how to do it with awk or similar.
Note: the file has approximately 3 million rows
Using awk you can do this easily:
awk '$5 != "NA" && $6 != "NA" {$2=$5; $3=$6} 1' file | column -t
MARKER EA_y NEA_y N_x EA_y NEA_y N_y
rs1000000 G A 231410.0 G A 118230.0
rs10000010 C T 322079.0 C T 118230.0
rs10000017 C T 233146.0 C T 118230.0
rs10000023 T G 233860.0 T G 118230.0
rs10000027 C G 72852.4 C G 118230.0
rs10000029 T C 179950.0 NA NA NA
rs1000002 C T 233932.0 C T 118230.0
I used column -t for tabular formatting of output.
Since fields 5, 6, 7 are always set to "NA" at the same time, you can use:
awk -v OFS="\t" 'NR>1&&$7!="NA"{$2=$5;$3=$6}1' file
If you want to process several files, avoid looping over the output of the ls command; it's better to use find, which gives you more control over how the paths look.
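For instance, a sketch with find (the *.txt name pattern is only an assumption about how your files are named):
find . -type f -name '*.txt' -exec awk -v OFS="\t" 'FNR>1 && $7!="NA" {$2=$5; $3=$6} 1' {} +
FNR is used instead of NR so the header test restarts for every file.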

Rewrite matrix into rules

I have a lot of rectangular matrices where each cell represents some outcome. As matrices are difficult to maintain, it is my goal to rewrite all of them into rules.
Example Matrix 1:
This is easy to turn into rules (pseudocode):
if (i <= 5 and j <=3) then A
else if (i <= 5 and j >=4) then B
else C
How do I rewrite the following matrix?
Plain text:
ij 1 2 3 4 5 6 7 8 9
1 A A A A C C C C B
2 A A A C C C C B B
3 A A C C C C B B B
4 A C C C C B B B B
5 C C C C B B B B B
6 C C C B B B B B B
7 C C B B B B B B B
8 C B B B B B B B B
9 B B B B B B B B B
The second matrix can be represented as:
if (i+j <= 5)
return A;
else if (i+j <= 9)
return C;
else
return B;
In general, you can check which side of a diagonal line a point is on by testing i+j for a / line, or i-j for a \ line.
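A quick sanity check of the rule is to regenerate the matrix body from it, for example in the shell:
for i in $(seq 1 9); do
  for j in $(seq 1 9); do
    if   (( i + j <= 5 )); then printf 'A '
    elif (( i + j <= 9 )); then printf 'C '
    else                        printf 'B '
    fi
  done
  echo
done
This reproduces the 9x9 body of the matrix shown above.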

Removing certain columns from a text file [duplicate]

This question already has answers here:
Deleting columns from a file with awk or from command line on linux
(4 answers)
Closed 8 years ago.
I have a text file that looks like this:
A B C A B C A B C A B
G T C A G T C A G T C
A B C A B C A B C A B
A B C A B C A B C A B
A D E A B D E A B D E
A B C A B C A B C A B
C B D G C B D G C B D
Is there a way to remove only certain columns and leave the other columns intact?
For example, removing only columns 2 and 5:
A C A C A B C A B
G C A T C A G T C
A C A C A B C A B
A C A C A B C A B
A E A D E A B D E
A C A C A B C A B
C D G B D G C B D
Thanks in advance.
UPDATE:
Found this answer using awk, but it drops a whole contiguous block of columns, and I only want to drop specific ones.
Awk for removing columns 3 to 5:
awk -F 'FS' 'BEGIN{FS="\t"}{for (i=1; i<=NF-1; i++) if(i<3 || i>5) {printf $i FS};{print $NF}}' input.txt
In your case you could do:
cat your_file | cut -d ' ' --complement -s -f2,5
where ' ' is the delimiter (in your case, the space).
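If your cut doesn't support --complement (it's a GNU extension), an awk sketch that does the same for columns 2 and 5:
awk '{for (i=1; i<=NF; i++) if (i!=2 && i!=5) printf "%s%s", $i, (i<NF ? OFS : ORS)}' your_file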

Which function/algorithm for this merging and filling operation?

I have written R code that merges two data frames based on the first column and, for missing data, carries down the value from above. Here is what it does:
Two input data frames:
1 a
2 b
3 c
5 d
And
1 e
4 f
6 g
My code gives this output:
1 a e
2 b e
3 c e
4 c f
5 d f
6 d g
My code is however inefficient, as it is not properly vectorized. Are there some R functions I could use? Basically, I am looking for a function that fills in missing/NA values by taking the value from the previous element and putting it in place of the NA.
I looked through the R reference manual, but could not find anything.
Here is a solution making use of zoo::na.locf
library(zoo)
a <- data.frame(id=c(1,2,3,5), v=c("a","b","c", "d"))
b <- data.frame(id=c(1,4,6), v=c("e", "f", "g"))
n <- max(c(a$id, b$id))
an <- merge(data.frame(id=1:n), a, all.x=T)
bn <- merge(data.frame(id=1:n), b, all.x=T)
an$v <- na.locf(an$v)
bn$v <- na.locf(bn$v)
data.frame(an$id, an$v, bn$v)
an.id an.v bn.v
1 1 a e
2 2 b e
3 3 c e
4 4 c f
5 5 d f
6 6 d g
