How to count the number of occurrences in a sorted text file

I have a sorted text file with the following format:
Company1 Company2 Date TransactionAmount
A B 1/1/19 20000
A B 1/4/19 200000
A B 1/19/19 324
A C 2/1/19 3456
A C 2/1/19 663633
A D 1/6/19 3632
B C 1/9/19 84335
B C 1/23/19 253
B C 1/13/19 850
B D 1/1/19 234
B D 1/8/19 635
C D 1/9/19 749
C D 1/10/19 203200
Ultimately I want a Python dictionary so that each pair maps to a list containing the number of transactions and the total amount of all transactions. For instance, (A,B) would map to [3,220324].
The file has ~250,000 lines in this format, and each pair may have anywhere from 1 to ~10 transactions. There are also tens of thousands of pairs of companies.
Here's the only way I've thought of implementing it:
my_dict = {}
file = open("my_file.txt").readlines()[1:]
for i in file:
    i = i.split()
    pair = (i[0], i[1])
    amt = int(i[3])
    if pair in my_dict:
        exist = my_dict[pair]
        exist[0] += 1
        exist[1] += amt
        my_dict[pair] = exist
    else:
        my_dict[pair] = [1, amt]
I feel like there is a faster way to do this. Any ideas?
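One common speed-up, sketched here under the assumption that the file really does start with a single header line and has whitespace-separated columns as shown: stream the file line by line instead of calling readlines(), and use collections.defaultdict so the membership test disappears. The variable names are illustrative.
from collections import defaultdict

my_dict = defaultdict(lambda: [0, 0])   # pair -> [transaction count, total amount]

with open("my_file.txt") as fh:
    next(fh)                            # skip the header line
    for line in fh:
        c1, c2, _date, amount = line.split()
        entry = my_dict[(c1, c2)]
        entry[0] += 1                   # number of transactions
        entry[1] += int(amount)         # running total
Both versions are linear in the number of lines; the main practical difference is that this one never holds the whole file in memory at once.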

Related

data.table create table from rows

I would like to analyze a table that reports job codes used by people over the course of several pay periods: I want to know how many times each person has used each job code.
The table lists people in the first column, and pay periods in subsequent columns -- I cannot transpose without creating new problems with names.
The table looks like this:
people  pp1  pp2  pp3  pp4
Bob     A    A    A    C
Ted     B    B    B    B
Alice   B    A    C    C
My desired output looks like this:
people  A  B  C
Bob     3  0  1
Ted     0  4  0
Alice   1  1  2
My code is as follows:
myDT <- data.table(
  people = c('Bob','Ted','Alice'),
  pp1 = c('A','B','B'),
  pp2 = c('A','B','A'),
  pp3 = c('A','B','C'),
  pp4 = c('C','B','C')
)
id.col=paste('pp',1:3)
myDT[ , table(as.matrix(.SD)), .SDcols = id.col, by = 1:nrow(myDT)]
but it's nowhere close to working.
A way that does work is to melt to long format and then dcast back to wide, counting how often each job code appears per person:
melt(myDT, "people") |>
  dcast(people ~ value, fun.aggregate = length)
# people A B C
# <char> <int> <int> <int>
# 1: Alice 1 1 2
# 2: Bob 3 0 1
# 3: Ted 0 4 0

Random sampling in bash

I have an ensemble with a large number of samples in it (say 100 different samples at different times in one ensemble). My ensemble looks like this:
20
-166.26604715
C -6.8775736572 0.7377700983 -1.2173950464
C -6.3769524449 2.0225374370 -1.4858792908
C -5.9530432940 -0.2309614983 -0.7933107594
C 0.924046 0.593909 0.306394
C 0.578941 0.740133 0.786926
C 0.43637 0.332195 0.77888
C 0.100887 0.785084 0.835159
C 0.761209 0.496077 0.426298
C 0.945798 0.821802 0.709269
C 0.157828 0.119752 0.909685
C 0.868084 0.449256 0.705432
C 0.399686 0.645049 0.696163
C 0.300211 0.591664 0.956569
C 0.156318 0.796877 0.132388
C 0.548236 0.984306 0.823073
C 0.422985 0.964365 0.793915
C 0.173531 0.568816 0.93252
C 0.205224 0.0199054 0.84918
C 0.726009 0.758101 0.197576
C 0.924046 0.593909 0.306394
20
-166.45321715
C -6.8775736572 0.7377700983 -1.2173950464
C -6.3769524449 2.0225374370 -1.4858792908
C -5.9530432940 -0.2309614983 -0.7933107594
C 0.924046 0.593909 0.306394
C 0.578941 0.740133 0.786926
C 0.43637 0.332195 0.77888
C 0.100887 0.785084 0.835159
C 0.761209 0.496077 0.426298
C 0.945798 0.821802 0.709269
C 0.157828 0.119752 0.909685
C 0.868084 0.449256 0.705432
C 0.399686 0.645049 0.696163
C 0.300211 0.591664 0.956569
C 0.156318 0.796877 0.132388
C 0.548236 0.984306 0.823073
C 0.422985 0.964365 0.793915
C 0.173531 0.568816 0.93252
C 0.205224 0.0199054 0.84918
C 0.726009 0.758101 0.197576
C 0.924046 0.593909 0.306394
20
-166.41234567
..
..
continues
The first line is the number of atoms, so \s+20 is my repeating pattern and it recurs every 22 lines; the second line is the energy, and from the third line on come the spatial coordinates (x, y, z). I want to randomly sample out, for example, just 4 samples (out of the 100 in this example, so 4*22 = 88 lines). Each of those samples should keep the same structure as shown above (2 header lines + 20 coordinate lines). I think I could use a random number generator in Python, but because I am using bash for the rest of the code, I would like to see if there is a way to do it in bash. Thanks in advance!
Your sample file isn't really suitable for testing, so I created this one and changed the atom count from 106 to 20 to keep it small:
20
-166.26604715
C -6.8775736572 0.7377700983 -1.2173950464
C -6.3769524449 2.0225374370 -1.4858792908
C -5.9530432940 -0.2309614983 -0.7933107594
C 0.924046 0.593909 0.306394
C 0.578941 0.740133 0.786926
C 0.43637 0.332195 0.77888
C 0.100887 0.785084 0.835159
C 0.761209 0.496077 0.426298
C 0.945798 0.821802 0.709269
C 0.157828 0.119752 0.909685
C 0.868084 0.449256 0.705432
C 0.399686 0.645049 0.696163
C 0.300211 0.591664 0.956569
C 0.156318 0.796877 0.132388
C 0.548236 0.984306 0.823073
C 0.422985 0.964365 0.793915
C 0.173531 0.568816 0.93252
C 0.205224 0.0199054 0.84918
C 0.726009 0.758101 0.197576
C 0.924046 0.593909 0.306394
So, the goal is to create a random sample of size N from the records on lines 3 to 22 (2 header lines + 20 records).
$ awk -v s=4 'NR==1 {n=$1}
NR<3;
NR>2 && NR<=n+2 {print | "shuf -n"s}' file
20
-166.26604715
C 0.945798 0.821802 0.709269
C 0.548236 0.984306 0.823073
C 0.157828 0.119752 0.909685
C 0.422985 0.964365 0.793915
Here I picked a sample size of 4. The script reads the number of records from the first line, prints the first two lines unchanged, and pipes the data records to shuf, which returns the requested number of them.
Note that this is sampling without replacement, meaning the same record cannot be picked more than once; usually that's what is desired.
You may also want to print the new number of records at the top, but that's an easy change, left as an exercise.
UPDATE
For multiple data sets with the same structure (actually the number of records doesn't have to be the same), you need these modifications:
$ awk -v s=4 'BEGIN {cmd="shuf -n"s; n=-2}
r==n+2 {n=$1; close(cmd)}
{r=(NR-1)%(n+2)+1}
r<=2;
r>2 && r<=n+2 {print | cmd }' file.3
20
-166.26604715
C 0.422985 0.964365 0.793915
C 0.205224 0.0199054 0.84918
C 0.399686 0.645049 0.696163
C 0.726009 0.758101 0.197576
20
-166.26604715
C 0.43637 0.332195 0.77888
C 0.761209 0.496077 0.426298
C -6.3769524449 2.0225374370 -1.4858792908
C 0.205224 0.0199054 0.84918
20
-166.26604715
C 0.156318 0.796877 0.132388
C 0.157828 0.119752 0.909685
C -6.8775736572 0.7377700983 -1.2173950464
C -5.9530432940 -0.2309614983 -0.7933107594
r is the relative position index within each data set, and some special handling is required for line 1 (hence n=-2 in the BEGIN block). The command also needs to be closed after each data set to flush shuf's buffers. Otherwise the logic is essentially the same, with NR replaced by r.
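If Python does end up being acceptable (the question mentions it as an option), here is a minimal sketch of the literal original goal, picking N whole samples rather than N records within a sample. It assumes each block is exactly two header lines plus the atom count announced on its first line, and the file name ensemble.xyz is only a placeholder.
import random

def sample_blocks(path, n_samples):
    # Split the ensemble into blocks: atom-count line, energy line,
    # then that many coordinate lines.
    with open(path) as fh:
        lines = fh.read().splitlines()
    blocks = []
    i = 0
    while i < len(lines):
        n_atoms = int(lines[i].split()[0])       # first header line: number of atoms
        blocks.append(lines[i:i + n_atoms + 2])  # 2 headers + n_atoms coordinates
        i += n_atoms + 2
    return random.sample(blocks, n_samples)      # whole blocks, without replacement

for block in sample_blocks("ensemble.xyz", 4):   # placeholder file name
    print("\n".join(block))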

Sorting a dataframe by values and storing index and columns

I have a pandas DataFrame which is actually a matrix. It looks as shown below:
   a  b  c
d  1  0  5
e  0  6  2
f  2  0  3
I need the values sorted, together with the index and column labels they belong to. The result should be:
index  Column  Value
e      b       6
d      c       5
f      c       3
You need stack to reshape, combined with nlargest:
df1 = df.stack().nlargest(3).rename_axis(['idx','col']).reset_index(name='val')
print (df1)
idx col val
0 e b 6
1 d c 5
2 f c 3
Or, to keep the row/column labels as a MultiIndex:
df2 = df.stack().nlargest(3).to_frame(name='val')
print (df2)
val
e b 6
d c 5
f c 3
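If the goal is to sort every value rather than just take the top three, a small variation on the same idea (swapping nlargest for sort_values) should work; the DataFrame construction below simply reproduces the example matrix.
import pandas as pd

df = pd.DataFrame({'a': [1, 0, 2], 'b': [0, 6, 0], 'c': [5, 2, 3]},
                  index=['d', 'e', 'f'])

# Stack into a Series, sort all values in descending order,
# then recover the row/column labels as ordinary columns.
df3 = (df.stack()
         .sort_values(ascending=False)
         .rename_axis(['idx', 'col'])
         .reset_index(name='val'))
print(df3)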

Using Pig, best way to count numbers within tuples

I'm working with tuples of data:
dump c;
(20
5
5
)
(1
1
1
5
10
)
The output I'm trying to achieve is a count of the occurrences of each number overall, like this:
(1,3)
(5,3)
(10,1)
(20,1)
I attempted this command, and it was unsuccessful:
d = FOREACH c GENERATE COUNT($0);
I currently do not have a schema for c (not sure whether that matters at this point):
describe c;
Schema for c unknown.
Looking for suggestions.
Input Tuple:
(20 5 5)
(1 1 1 5 10)
You could get the count across the tuple by tokenizing and then grouping it.
A = LOAD 'file' using TextLoader() as (line:chararray);
B = FOREACH A GENERATE FLATTEN(TOKENIZE(line)) as (line:chararray);
C = GROUP B BY line;
D = FOREACH C GENERATE group,COUNT(B);
dump D;
Output:
(1,3)
(5,3)
(10,1)
(20,1)

Unix / Shell: add a range of columns to a file

I've been working on the same problem for the last few days, and I've hit a formatting roadblock.
I have a program that will only run if every row has an equal number of columns. I know the total column count and how many columns need to be added with a filler value of 0, but I'm not sure how to do this. Is there some kind of range option in awk or sed for this?
Input:
A B C D E
A B C D E 1 1 1 1
Output:
A B C D E 0 0 0 0
A B C D E 1 1 1 1
The alphabet columns are always present (with different values), but this "fill in the blank" function is eluding me. I can't use R for this due to the data file size.
One way using awk:
$ awk 'NF!=n{for(i=NF+1;i<=n;i++)$i=0}1' n=9 file
A B C D E 0 0 0 0
A B C D E 1 1 1 1
Just set n to the number of columns you want to pad up to. When a line has fewer than n fields, the NF!=n block appends a 0 for each missing field, and the trailing 1 prints every line, padded or not.
