I have two files and would like to find out which parts of file 1 occur in the same order/sequence in file 2, based on one of multiple columns (col4). The files are sorted by an identifier in col1 (from 1 to n), but the identifier is not the same between the files. The column values from file 1 always occur as one contiguous block in file 2.
file1:
x 1
x 2
x 3
file2:
y 5
y 1
y 2
y 3
y 6
output:
y 1
y 2
y 3
Another thing to take into consideration is that the entries in the column to be filtered on are not unique.
I already tried
awk 'FNR==NR{ a[$2]=$2;next } ($2 in a)' file1 file2 > output
but it only works if you have unique identifiers.
To clarify with real-life data: I would like to extract the rows that occur in the same order based on column 4.
File1:
ATOM 13 O ALA A 2 37.353 35.331 -19.903 1.00 71.02 O
ATOM 18 O TRP A 3 38.607 32.133 -18.273 1.00 69.13 O
File2:
ATOM 1 N MET A 1 42.218 38.990 -18.511 1.00 64.21 N
ATOM 10 CA ALA A 2 38.451 37.475 -20.033 1.00 71.02 C
ATOM 13 O ALA A 2 37.353 35.331 -19.903 1.00 71.02 O
ATOM 18 O TRP A 3 38.607 32.133 -18.273 1.00 69.13 O
ATOM 29 CA ILE A 4 38.644 33.633 -15.907 1.00 72.47 C
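Conceptually, what I am after is something like this rough Python sketch (the file names and the hard-coded column index are just placeholders, and an awk solution would still be preferred):

KEY_COL = 3  # 0-based index of the column to match on (column 4)

def read_rows_and_keys(path):
    with open(path) as f:
        rows = [line.rstrip("\n") for line in f if line.strip()]
    return rows, [row.split()[KEY_COL] for row in rows]

rows1, needle = read_rows_and_keys("file1")
rows2, haystack = read_rows_and_keys("file2")

# slide a window over file2 and print the first block whose keys match file1's keys in order
for start in range(len(haystack) - len(needle) + 1):
    if haystack[start:start + len(needle)] == needle:
        print("\n".join(rows2[start:start + len(needle)]))
        break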
I have a list of multiple instances of the same worker with multiple instances of work and rest.
I need the name of the worker and time of the first instance of Work and the time of the last instance of Rest.
Per person I can get the first work time, if I type the actual range, with this formula:
=index(Filter(C2:D9,C2:C9="work"),1,2) (this is Col G)
I can determine how many entries of Rest there are for the same person, to then work out the time of the last rest, with:
=index(Filter(C2:D9,C2:C9="rest"),E2,2) (this is Col H)
And I can get a unique list of names with this formula:
=sort(unique(A2:A)) (this is Col F)
But I can't work out a formula to look at Col F against Cols A to D and then return the First Work Time in Col G,
and a formula to look at Col F (against Cols A to D) and return the Last Rest Time in Col H.
My data is in columns A to D and the results I would like are shown in Cols F to H
A B C D E F G H
1 Name Date Action Time Name First Work Last Rest
2 Smith, Fred 14-03-2022 rest 00:00 4 Smith, Fred 06:03 07:08
3 Smith, Fred 14-03-2022 work 06:03 2 Jones, Harry 07:48 08:08
4 Smith, Fred 14-03-2022 rest 06:05
5 Smith, Fred 14-03-2022 work 06:06
6 Smith, Fred 14-03-2022 drive 06:15
7 Smith, Fred 14-03-2022 rest 06:59
8 Smith, Fred 14-03-2022 drive 07:02
9 Smith, Fred 14-03-2022 rest 07:08
10 Jones, Harry 14-03-2022 rest 00:00
11 Jones, Harry 14-03-2022 work 07:48
12 Jones, Harry 14-03-2022 drive 08:01
13 Jones, Harry 14-03-2022 work 08:03
14 Jones, Harry 14-03-2022 drive 08:04
15 Jones, Harry 14-03-2022 work 08:07
16 Jones, Harry 14-03-2022 rest 08:08
I understand you would want the First work and Last rest to be aggregated by name (as you have indicated) and also by date, so that for each name you would get one row with First work and Last rest for each date, if there are multiple dates associated with a given name.
With a data structure as in your example (A: Name, B: Date, C: Action, D: Time), you can use the following formula (you do not need a list of names):
={QUERY(A:D,"select A, B, min(D) where C='work' group by A, B label min(D) 'First work'",1),
QUERY(A:D,"select max(D) where C='rest' group by A, B label max(D) 'Last rest'",1)}
This assumes that there is at least one work and one rest for each name and each date; otherwise the columns will get misaligned or the formula will show an error (if the number of 'work' and 'rest' result rows is not equal).
If that can happen with your actual data, then it is safer to display the results as two separate tables, for First work:
=QUERY(A:D,"select A, B, min(D) where C='work' group by A, B label min(D) 'First work'",1)
and for Last rest:
=QUERY(A:D,"select A,B, max(D) where C='rest' group by A, B label max(D) 'Last rest'",1)
I hope it is easy to understand what the queries are doing; if not, consult the documentation of the QUERY function and the Query Language, or ask if something is still not clear.
The problem is to use the group by function to find the average number of books checked out only by students of a specific department. However, it keeps outputting the average of all checked-out books from all students.
What I have so far:
γ avg(Books_Quantity) -> y (Student) ⨝ (σ Department = 'Computer_Science' (Student))
The output should be 1.75, but it instead outputs the average across all departments.
y Student.Student_ID Student.Student_Name Student.Department Student.Books_Quantity
1.5 1 John Computer_Science 2
1.5 2 Lisa Computer_Science 1
1.5 5 Xina Computer_Science 3
1.5 7 Chang Computer_Science 1
I found the answer. You have to apply the selection (σ) to the table before the aggregation, so only the Computer_Science rows are averaged. Like so:
γ avg(Books_Quantity) -> y (σ Department = 'Computer_Science' (Student))
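If it helps to see the same filter-before-aggregate order outside relational algebra, here is a rough pandas sketch (the non-CS row is made up just so the filter has something to exclude):

import pandas as pd

# Hypothetical Student table; the Computer_Science rows mirror the example output above
students = pd.DataFrame({
    "Student_ID":     [1, 2, 5, 7, 3],
    "Student_Name":   ["John", "Lisa", "Xina", "Chang", "Mary"],
    "Department":     ["Computer_Science"] * 4 + ["History"],
    "Books_Quantity": [2, 1, 3, 1, 4],
})

# select (filter) first, then aggregate, mirroring the σ placed inside the γ
cs = students[students["Department"] == "Computer_Science"]
print(cs["Books_Quantity"].mean())  # 1.75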
I have data with the following structure:
1 John US
2 Mary CN
3 Smith US
4 John US
5 Mary CN
I need to find duplicate names within each country. The result should be something like this:
{US : (1, John, US),(4,John, US)}
{CN : (2, Mary, CN),(5, Mary, CN)}. Could someone help me with a Pig script for my problem?
I'm able to load the data and group it by Country Name.
I assume you have the input in the following format:
1 John US
2 Mary CN
3 Smith US
4 John US
5 Mary CN
In that case you can come up with the following:
A = load 'data.txt' using PigStorage(' ')
    as (id:int, name:chararray, country:chararray);
-- group by the composite key (country, name) and count the rows in each group
B = foreach (group A by (country, name)) generate group.country, A,
    COUNT(A) as count;
-- keep only the groups that occur more than once, i.e. the duplicated names
C = foreach (FILTER B by count > (long)1) generate country, A;
dump C;
(CN,{(2,Mary,CN),(5,Mary,CN)})
(US,{(1,John,US),(4,John,US)})
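If it helps, here is the same grouping and filtering logic in plain Python (purely illustrative, not part of the Pig script):

from collections import defaultdict

rows = [(1, "John", "US"), (2, "Mary", "CN"), (3, "Smith", "US"),
        (4, "John", "US"), (5, "Mary", "CN")]

# group rows by the composite key (country, name), like "group A by (country, name)"
groups = defaultdict(list)
for row in rows:
    _id, name, country = row
    groups[(country, name)].append(row)

# keep only groups with more than one row, like "FILTER B by count > 1"
duplicates = [(country, grp) for (country, name), grp in groups.items() if len(grp) > 1]
print(duplicates)
# [('US', [(1, 'John', 'US'), (4, 'John', 'US')]), ('CN', [(2, 'Mary', 'CN'), (5, 'Mary', 'CN')])]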
Take for example the list (L):
John, John, John, John, Jon
We are to presume one item is correct (e.g. John in this case), and give a probability that it is correct.
First (and good!) attempt: MostFrequentItem(L).Count / L.Count (e.g. 4/5 or 80% likelihood)
But consider the cases:
John, John, Jon, Jonny
John, John, Jon, Jon
I want to consider the likelihood of the correct item being John to be higher in the first list! I know I have to count the second most frequent item and compare them.
Any ideas? This is really busting my brain!
Thx,
Andrew
As an extremely simple solution, compared to the more correct but complicated suggestions in the other answers, you could take counts of every variation, square the counts, and use those to calculate weights. So:
[John, John, Jon, Jonny]
would give John a weight of 4 and the other two a weight of 1, for a probability of about 67% that John is correct.
[John, John, Jon, Jon]
would give weights of 4 to both John and Jon, so John's probability is only 50%.
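A minimal Python sketch of that weighting (the function name is mine, just for illustration):

from collections import Counter

def squared_count_confidence(items):
    # weight each variant by the square of its count, then normalize
    weights = {name: count * count for name, count in Counter(items).items()}
    best = max(weights, key=weights.get)
    return best, weights[best] / sum(weights.values())

print(squared_count_confidence(["John", "John", "Jon", "Jonny"]))  # ('John', 0.666...)
print(squared_count_confidence(["John", "John", "Jon", "Jon"]))    # ('John', 0.5)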
Maybe Edit Distance? Just a direction to a solution, though...
Just off the top of my head, what if you compare the % occurrence vs. the % if all items had an equal number of occurrences?
In your example above
John, John, Jon, Jonny
50% John
25% Jon
25% Jonny
33.3% Normal? (I'm making up a word because I don't know what to call this. 3 items: 100%/3)
John's score = 50% - 33.3% = 16.7%
John, John, Jon, Jon
50% John
50% Jon
50% Normal (2 items, 100%/2)
John's score = 50% - 50% = 0%
If you had [John, John, John, Jon, Jon] then John's score would be 60% - 50% = 10%, which is lower than the first case but higher than the second (hopefully that's the desired result; otherwise you'll need to clarify more what the desired results should be).
In your first case [John, John, John, John, Jon] you'd get 80%-50% = 30%
For [John, John, John, John, Jon, Jonny] you'd get 66.6%-33.3% = 33.3%
That may or may not be what you want.
Where the above might factor in more is if you had John*97 + Jon + Jonny + Johnny: that would give you 97% - 25% = 72%, but John*99 + Jon would only give you a score of 99% - 50% = 49%.
You'd need to figure out how you want to handle the degenerate case of them all being the same, otherwise you'd get a score of 0% for that which is probably not what you want.
EDIT (okay I made lots of edits, but this one isn't just more examples :p)
To normalize the results, take the score as calculated above and divide it by the maximum possible score given the number of distinct values. (Okay, that sounds way more complicated than it needs to; example time.)
Example:
[John, John, Jon, Jonny] 50% - 33.3% = 16.7%. That's the previous score, but with 3 items the upper limit of your score would be 100%-33.3% = 66.6%, so if we take that into account, we have 16.7/66.6 = 25%
[John, John, Jon, Jon] gives (50-50) /50 = 0%
[John, John, John, Jon, Jon] gives (60-50) /50 = 20%
[John, John, John, John, Jon] gives (80-50)/50 = 60%
[John, John, John, John, Jon, Jonny] gives (66.6-33.3)/(66.6)= 50%
[John*97, Jon, Jonny, Johnny] gives (97-25)/75 = 96%
[John*99, Jon] gives (99-50)/50 = 98%
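Putting the normalized version into a small Python sketch (names are mine; the all-identical case is handled explicitly, as discussed above, by simply returning 100%):

from collections import Counter

def normalized_top_score(items):
    counts = Counter(items)
    name, top = counts.most_common(1)[0]
    observed = top / len(items)        # share of the most frequent value
    baseline = 1 / len(counts)         # share if every distinct value were equally common
    if len(counts) == 1:               # degenerate case: all entries identical
        return name, 1.0               # one arbitrary choice, see the note above
    return name, (observed - baseline) / (1 - baseline)

print(normalized_top_score(["John", "John", "Jon", "Jonny"]))         # ('John', 0.25)
print(normalized_top_score(["John", "John", "John", "John", "Jon"]))  # ('John', 0.6)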
I think you'd need a kind of scoring system.
Solely identifying the different tokens is not sufficient:
[John, Johen, Jon, Jhon, Johnn]
With your algorithm there is no clear winner here, even though the most probable name is 'John', the others being just 1 away in Damerau-Levenshtein distance.
So I would do a 2-steps process:
Count the occurrences of each word
For each word, add a "bonus" for each other word, inversely proportional to their distance
For the bonus, I would propose the following formula:
lhs = 'John'
rhs = 'Johen'
d = distance(lhs,rhs)
D = max( len(lhs), len(rhs) ) # Maximum distance possible
tmp = score[lhs]
score[lhs] += (1-d/D)*score[rhs]
score[rhs] += (1-d/D)*tmp
Note that you should not apply this first for (John, Johen) and then for (Johen, John).
Example:
# 1. The occurrences
John => 1
Johen => 1
Jon => 1
Jhon => 1
Johnn => 1
# 2. After executing it for John
John => 4.1 = 1 + 0.80 + 0.75 + 0.75 + 0.80
Johen => 1.8 = (1) + 0.80
Jon => 1.75 = (1) + 0.75
Jhon => 1.75 = (1) + 0.75
Johnn => 1.8 = (1) + 0.80
# 3. After executing it for Johen (not recounting John :p)
John => 4.1 = (1 + 0.80 + 0.75 + 0.75 + 0.80)
Johen => 3.8 = (1 + 0.80) + 0.60 + 0.60 + 0.80
Jon => 2.35 = (1 + 0.75) + 0.60
Jhon => 2.35 = (1 + 0.75) + 0.60
Johnn => 2.6 = (1 + 0.80) + 0.80
# 4. After executing it for Jon (not recounting John and Johen)
John => 4.1 = (1 + 0.80 + 0.75 + 0.75 + 0.80)
Johen => 3.8 = (1 + 0.80 + 0.60 + 0.60 + 0.80)
Jon => 3.7 = (1 + 0.75 + 0.60) + 0.75 + 0.60
Jhon => 3.1 = (1 + 0.75 + 0.60) + 0.75
Johnn => 3.2 = (1 + 0.80 + 0.80) + 0.60
# 5. After executing it for Jhon(not recounting John, Johen and Jon)
John => 4.1 = (1 + 0.80 + 0.75 + 0.75 + 0.80)
Johen => 3.8 = (1 + 0.80 + 0.60 + 0.60 + 0.80)
Jon => 3.7 = (1 + 0.75 + 0.60 + 0.75 + 0.60)
Jhon => 3.7 = (1 + 0.75 + 0.60 + 0.75) + 0.60
Johnn => 3.8 = (1 + 0.80 + 0.80 + 0.60) + 0.60
I'm not sure it's perfect and I have no idea how to transform this into some kind of percentage... but I think it gives a pretty accurate idea (of the most likely). Perhaps the bonus ought to be lessened (by which factor?). But check this degenerate case:
[John*99, Jon]
# 1. Occurrences
John => 99
Jon => 1
# 2. Applying bonus for John
John => 99.8 = (99) + 0.80
Jon => 80.2 = (1) + 0.80*99
As you can see, the scores can't be directly converted into some kind of percentage: Jon's 80.2 ends up far too close to John's 99.8, so any naive conversion would make the confidence in 'John' seem low here!
Note: Implementing the distance efficiently is hard, kudos to Peter Norvig for his Python solution!
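For what it's worth, here is a rough Python sketch of this scoring (function names are mine; it uses the restricted Damerau-Levenshtein / optimal string alignment distance, and applies the bonuses on the raw occurrence counts, which reproduces the numbers of the worked example above):

from collections import Counter

def osa_distance(a, b):
    # restricted Damerau-Levenshtein (optimal string alignment) distance
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]

def score_names(names):
    counts = Counter(names)
    scores = dict(counts)                 # start from the occurrence counts
    words = list(counts)
    for i, lhs in enumerate(words):
        for rhs in words[i + 1:]:         # each unordered pair exactly once
            bonus = 1 - osa_distance(lhs, rhs) / max(len(lhs), len(rhs))
            scores[lhs] += bonus * counts[rhs]
            scores[rhs] += bonus * counts[lhs]
    return scores

print(score_names(["John", "Johen", "Jon", "Jhon", "Johnn"]))
# roughly {'John': 4.1, 'Johen': 3.8, 'Jon': 3.7, 'Jhon': 3.7, 'Johnn': 3.8}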
First off, I suspect that you are using terms inconsistently. It will help if you use technical terms like "probability" and "likelihood" with strict correctness.
The probability of a thing allows us to reason from a parameter to an outcome. For example, suppose we have an unfair coin which is 60% likely to come up heads. The 60% is a parameter. From that we can reason that the probability of getting two heads in a row is 60% * 60% = 36%.
The likelihood of a thing allows us to reason from an outcome to a parameter. That is, we flip a pair of identical coins a thousand times and discover that we get two heads 36% of the time. We can compute "the likelihood of the probability of heads is 60% given the outcome that 36% of pairs were two heads".
Now, a reasonable question is "how confident can we be that we've deduced the correct parameter given the outcome?" If you flip pairs of coins a million times and get 36% double heads, it seems plausible that we can be very confident that the parameter for one coin is 60%. The likelihood is high. If we flip pairs of coins three times and get double heads 33% of the time, we have very little confidence that the parameter for one coin getting heads is close to 60%. It could be 50% or 40%, and we just got lucky one time in three. The likelihood is low.
All this is a preamble to simply asking you to clarify the question. You have an outcome: a bunch of results. Do you wish to make an estimate of the parameters that produced that outcome? Do you wish to get a confidence interval on that estimate? What exactly are you going for here?
I'm not sure why you need to calculate the second most frequent item. In the latter example, couldn't you simply look at (the number of matching entries) / (the total number of entries) and say that it's correct with 4/8 probability? Is that not a sufficient metric? You would then also say that Jon has a 3/8 probability of being correct and Jonny has 1/8?
Why would that be insufficient for your purposes?