Computing the Dot Product for calculating proximity

Computing the Dot Product for calculating proximity - algorithm

I have already asked a similar question at Calculating Word Proximity in an inverted Index.
However i felt that the question was too general and not refined enough. So here goes.
I have a List which contains the location of tokens in a document. for each token it goes as
public List<int> hitLocation;
Lets say the the document is
Java programming language has a name similar to java island in Indonesia however
local language in java bears no resemblance to the programming language called java.
and the query is
java island language
So Say i lock on to the Java HitList and attempt to directly calculate the distance between the Java HisList, Island HitList and Language Hitlist.
Now the first problem is that there are 4 java tokens occurrences in the sentence. Which one do i select. Assuming i select the first one.
I go onto the island token list and after comparing find it that it adjacent to the second occurrence of java. So i change my selection and lock onto the second occurrence of java.
Proceeding to the third token language i find that it situated at quite a distance from our selection however i find it that it is quite near the first java occurrence.
So you see the dilemma here if now again revert back to the original selection i.e the first occurrence of java the distance to second token "island" increases and if i stay with my current selection the sheer distance of the second occurrence of the token "language" will make relevance busted.
Previously there was the suggestion of dot product however i am at loss on how to proceed forward with that option.
Any other solution would also be welcomed.
I Understand that this question is quite detailed. However i have searched long and hard and haven't found any question like this on this topic.
I feel if this question is answered it will be a great addition to the community and will make anybody who is designing anything related to relevancy quite happy.
Thank You.

You seem to be using the hit lists a little differently then how they are intended to be used (at least given my understanding).
Typically people compare hit lists returned by different documents. This is how they rank one document as being "more relevant" than a different document.
That said, if you want to find all locations of some multi-word phrase like "java island" given the locations of the words "java" and "island" you would...
Get a list of locations for "java"
Get a list of locations for "island"
Sort both lists
Iterate through both lists at the same time. You start be getting the first entry of both lists. Now test this pair of entries. I.E., if these entries are "off by one" you have found one instance of "java island" (or perhaps "island java"). Get the next entry in the list that currently shows the minimum value. Test this new pair of entries. Repeat.
BTW -- The dot product is more useful when comparing 2 different documents.

Well, since you explicitly ask about the dot product suggestion, i'll try to explain a little more formally what I had in mind. Keep in mind that it's not very efficient as it might convert the complexity from basing on the lengths of the hitlists, into something based on the length of the text (unless there's some trick to cut that).
My initial thought was to convert each hitlist into a series of binary value at the text length, high where there's a hit and low otherwise.
for e.g. java would look
1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1
But since you want proximity, convert each occurrence into a pyramid, for e.g. -
3 2 1 0 0 0 1 2 3 2 1 0 0 0 1 2 3 2 0 0 0 0 0 1 2 3
Same way for island -
0 0 0 0 0 0 0 1 2 3 2 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Now a dot product would give you some sort of proximity "score" between the two vectors, since it accumulates all the locations where two words are close (the closer the better). Java and island can be said to have a mutual score of 16. For a higher threshold you could stretch the pyramid further, or play with the shape.
Now, here you add another suggestion that this method isn't very suited for, you also want to catch the exact location of highest proximity, this isn't very well defined IMHO, what if word1 matches word2 (at some level) in position1, but word2 matches word3 at the same level in position2 - what location would you want?
Also, keep in mind that this method is O(text_length * words^2), that might be good in some cases, but very bad for others (if you're searching the bible for e.g.)

Related

Compact structure for looking up the number of 1's before a given index

Given a (long) string of 0's and 1's, I need to be able to answer quickly queries of the kind: how many 1's in the string precede a given index i? One can assume that a 1 is located at index i.
I am looking for an as compact as possible a data structure that can be computed once for the given string of 0's and 1's and then used as a look-up table to answer quickly the queries as described above.
Background. In my particular case, the string of 0's and 1's encodes a grid map (such as in a video game), where 0 denotes an obstacle and 1 denotes a passable location. I store distances from all passable locations to one special location in an array. The query corresponds to this: given a passable location (i.e. an index into the string of 0's and 1's), I need to be able to determine quickly the corresponding index into the array of distances.

You're looking for "succinct indexable dictionaries", which are known by many other names, as well - you can also Google for "succinct rank select". The best solutions have ~5% overhead and constant-time lookups.
"Space-Efficient, High-Performance Rank & Select Structures on Uncompressed Bit Sequences", by Dong Zhou, David G. Andersen, Michael Kaminsky
https://github.com/efficient/rankselect
https://www.cs.cmu.edu/~dga/papers/zhou-sea2013.pdf

I am looking for an as compact as possible a data structure that can be computed once for the given string of 0's and 1's and then used as a look-up table to answer quickly the queries as described above.
This problem is about 6 decades old, and extensively solved. What you're looking at is really just a vector that you could define to be 0 for every value but 1.
If there's very little 1s compared to other values, just go with one of the many sparse vector representations that have been around with linear algebra libraries for ever.
You're not giving enough info (like, is your original vector still going to be available, or is it going to be deleted as soon as you have your data storage? I'm going to assume this), but assuming this is a test in solving real world problems on your own rather than choosing the right library to do so:
Knowing real computers are nothing like what the algorithms they teach in basic CS were optimized for, the best storage is almost always linear storage.
Because counting ones is actually much less time-intense than loading data from RAM into CPU registers, the most effective choice here is the simplest:
Take a wordlength (for example, 64) of your original vector's values, and convert them to bits set (or not set, if the value != 1) in a word; move on to the next word and the next part of your original vector.
Now, to evaluate the number of ones, you would just use a "population count" instruction that practically all CPUs nowadays have – for example, as introduced n x86(_64) by SSE4.1 as POPCNT. Use SIMD instructions to generate the sum over adjacent word population counts, and accumulate them up to the point of your index/wordlength. You can, if your problem is both large enough and you have multiple cores with individual caches, also easily divide that algorithm into multiple parallel threads, because there's no mutual dependency. You just add up the sums at the end. Having implemented similar SIMD-optimized code myself, multithreading doesn't pay off if you're limited on CPU cache, because you just end up waiting on RAM with multiple cores.
Anyone telling you to use "runlength" or "linked-list" implementations to encode the distance between 1s neglects the fact that, as mentioned, the problematic part is getting data from RAM, not the actual counting. Memory controllers always fetch a whole memory "row", not just a single value, so that whilst waiting for the first element might easily take the time it does to count the 1s in a couple hundred words worth of wordlength original values each, subsequent accesses to words from the same row are pretty fast.
This is pretty nicely illustrated (partly with invisible graphs) by Bjarne Stroustrup (being one of the evil masterminds behind C++) in this short lecture.

EDIT realised I'd answered the inverse of your problem; this tells you how many 1's come after your position. Just go forward through the bitmap instead of what I suggest below.
Try creating an int array the length of your bitmap. Working backwards, sum the number of 1's you've seen so far; eg
[ 1 0 0 1 1 0 1 0 1 1 1 0 0 0 ]
gives
[ - - - - - - - - - - - - - 0 ]
[ - - - - - - - - - - - - 0 0 ]
[ - - - - - - - - - - - 0 0 0 ]
[ - - - - - - - - - - 1 0 0 0 ]
[ - - - - - - - - - 2 1 0 0 0 ]
[ - - - - - - - - 3 2 1 0 0 0 ]
[ - - - - - - - 3 3 2 1 0 0 0 ]
...
[ 7 6 6 6 5 5 4 3 3 2 1 0 0 0 ]
Now it's just an array lookup, with the added benefit that if you want to know the number between any two points, you can work it out by subtracting one from another.

Recommendation system and baseline predictors

I have a bunch of data where the first column represents users, the second column is movies, and the third is a ten-points rating.
0 0 9
0 1 8
1 1 4
1 2 6
2 2 7
And I have to predict the third number for another ser of data (user, movie, ?):
0 2
1 0
2 0
2 1
I use this way for finding bias values https://youtube.com/watch?v=dGM4bNQcVKI and this way for predicting https://www.youtube.com/watch?v=4RSigTais8o.
Bias value for user number 0: 9 + 8 / 2 = 8.5 - 1.5 = 7.
Bias value for movie number 2: 6 + 7 / 2 = 6.5 - 1.5 = 5.
And baseline predictors:
1.5 + 7 + 5, where result is 13.5, but in contest result is: 7.052009.
But the problem description says the result of my Recommendation system should be:
0 2 7.052009
1 0 6.687943
2 0 6.995272
2 1 6.687943
Where is my mistake?

The raw average is the average of ALL the present scores ((9+8+4+6+7) / 5 = 6.8), I don't see that number anywhere, so I guess that's your error.
In the video Prof. used the raw average of 3.5 on all the calculations, including calculating bias, he skipped how to reach that number, if you add all numbers on the table of the video and divide, you get 3.5.
0 2 9.2 is the answer for the first one, using your videos as guide. The videos claims to have avoided calculus, the different final answers of the contest probably come from using the "full" method.
0 2 ?, user 0 (row 0: 9 8 x), movie 2 (column 2: x 6 7)
raw average = 6.8
bias user 0: (9+8) / 2 - 6.8 = 1.7
bias movie 2: (6+7) / 2 - 6.8 = -0.3
prediction: 6.8+1.7-0.3 = 8.2
The problem looks like a variation of the Netflix Contest, the contest' host knows the actual answers (the ratings), he doesn't give them to you, you are expected to guess/predict them, the winner of the contest is the one that gets the closest to the actual answers.
The winner of you contest got the closest, but he got there using an unknown method, or his own variation of a know method, if your goal is to match his answer exactly, you are better off asking him what method he used and how did he modify it, and try to replicate his results.
If this was homework and not a contest, then the teacher would expect you to use the "correct" method he taught you (there's no set method, just many methods that work with different accuracy), you'd have to use it exactly like he taught you. But it is a contest, your goal is to find a base method that approximates the best (the one you used is very low on accuracy), and tinker with it a bit to get even better results.
If you want to understand the link I suggest you research and later ask a statistics question, because it's just plain statistics. You can try to understand the link or research Matrix factorization on your own. Remember that to get contest winning results (or close) you won't be able to use a simple method like the one you found on the youtube video, but require a method with a lot more math.

Battlefield - Board Game Strategy [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 8 years ago.
Improve this question
Consider a square matrix, all slots filled with zeroes.This will be the battlefield. Now, to place ships, I indicate the by putting 1. A ship can be of size 1, 2, 3 - meaning two or three continuous blocks should be 1. They can also be horizontal or vertical. Now , what is the best strategy for an enemy to search for my ships. He has no idea how I have placed my ships. What could be a good strategy to search the matrix ? OR How do I make the CPU a better player when it comes to making 'smart moves' ?
Search randomly
Search, and when you find one attack the neighbouring blocks to check if it size 2/3 ship.
Also, the initial positioning of the CPU can be based on the previous winning positions and not just based on random numbers.
Any other idea ..... ??
The idea can be extended to form a larger board game of 20 x 20 matrix with multiple ships.An example is given below.
0 0 0 0 0 0
0 1 0 0 0 0
0 0 0 1 1 0
0 1 0 0 0 0
0 1 0 0 0 0
0 1 0 0 0 0
Any help would be much appreciated !!

As you have a ship with a size of 1, you basically need to enumerate through all of the fields and check the neighbors for bigger ships. You can save some work by using specific order like going through all of the rows:
1 2 3 4 5 6
7 8 9 10 11 12
13 ...
If you have to detect bigger ships, you check the following two right fields (if not out of bounds; you check the first field and if it's a part of a ship, then you check the second) as well of the two bottom fields (again boundary check). Using that traversal you ensure that you never check the left and the top fields for bigger ship. When you check for bigger ships, you should remember how many positions you have visited right and skip those after moving along.
It's just a suggestion and is relatively efficient one. With some memory usage you can avoid double visiting fields after checking bottom, but this won't lead to a performance win in the real life.

Generating Bulls and Cows secret word with given n String inputs

I am stuck on this problem for quite a while, it is basically reverse engineering bulls and cow game.
Read more here: http://rosettacode.org/wiki/Bulls_and_cows
I am not able to develop a logic for the problem given below, if you can think of a solving approach please comment the same.
Problem Statement:
Given few clue words(of form ABCD/DBCA etc) and the number of cows and bulls for each word,program
should be able to work out the actual word by evaluating the given clue words and generate the output secret word.
TEST CASES:
Input:
4
DBCC 0 2
CDAB 2 1
CAAD 1 2
CDDA 2 0
Output:
BDAA

The idea is to reduce the space of possible solutions. Before you start, all 4^4 combinations are possible. After you read the first clue [DBCC 0 2 ], you can eliminate a number of possible solutions, in this particular example you can eliminate all states which have a D in the first place, all which have a B in the second place and so on. Just eliminate all possible solutions which do not "fit" the current clue.
Do this with each clue, until only one solution is left. Another interesting problem of course is how to generate good clue patterns.

The way I did it is:
1. Generate all possible words, put them in a list (array)
2. Randomly select one of them (first question)and ask for clues
3. Take the answer (let's say it is 2,1)
4. Start comparing that question with
first, second, ..., to the last word from the list
5. if they give the same clue: count them, plac

Finding similarities in a multidimensional array

Consider a sales department that sets a sales goal for each day. The total goal isn't important, but the overage or underage is. For example, if Monday of week 1 has a goal of 50 and we sell 60, that day gets a score of +10. On Tuesday, our goal is 48 and we sell 46 for a score of -2. At the end of the week, we score the week like this:
[0,0]=10,[0,1]=-2,[0,2]=1,[0,3]=7,[0,4]=6
In this example, both Monday (0,0) and Thursday and Friday (0,3 and 0,4) are "hot"
If we look at the results from week 2, we see:
[1,0]=-4,[1,1]=2,[1,2]=-1,[1,3]=4,[1,4]=5
For week 2, the end of the week is hot, and Tuesday is warm.
Next, if we compare weeks one and two, we see that the end of the week tends to be better than the first part of the week. So, now let's add weeks 3 and 4:
[0,0]=10,[0,1]=-2,[0,2]=1,[0,3]=7,[0,4]=6
[1,0]=-4,[1,1]=2,[1,2]=-1,[1,3]=4,[1,4]=5
[2,0]=-8,[2,1]=-2,[2,2]=-1,[2,3]=2,[2,4]=3
[3,0]=2,[3,1]=3,[3,2]=4,[3,3]=7,[3,4]=9
From this, we see that the end of the week is better theory holds true. But we also see that end of the month is better than the start. Of course, we would want to next compare this month with next month, or compare a group of months for quarterly or annual results.
I'm not a math or stats guy, but I'm pretty sure there are algorithms designed for this type of problem. Since I don't have a math background (and don't remember any algebra from my earlier days), where would I look for help? Does this type of "hotspot" logic have a name? Are there formulas or algorithms that can slice and dice and compare multidimensional arrays?
Any help, pointers or advice is appreciated!

This data isn't really multidimensional, it's just a simple time series, and there are many ways to analyse it. I'd suggest you start with the Fourier Transform, it detects "rhythms" in a series, so this data would show a spike at 7 days, and also around thirty, and if you extended the data set to a few years it would show a one-year spike for seasons and holidays. That should keep you busy for a while, until you're ready to use real multidimensional data, say by adding in weather information, stock market data, results of recent sports events and so on.

The following might be relevant to you: Stochastic oscillators in technical analysis, which are used to determine whether a stock has been overbought or oversold.
I'm oversimplifying here, but essentially you have two moving calculations:
14-day stochastic: 100 * (today's closing price - low of last 14 days) / (high of last 14 days - low of last 14 days)
3-day stochastic: same calculation, but relative to 3 days.
The 14-day and 3-day stochastics will have a tendency to follow the same curve. Your stochastics will fall somewhere between 1.0 and 0.0; stochastics above 0.8 are considered overbought or bearish, below 0.2 indicates oversold or bullish. More specifically, when your 3-day stochastic "crosses" the 14-day stochastic in one of those regions, you have predictor of momentum of the prices.
Although some people consider technical analysis to be voodoo, empirical evidence indicates that it has some predictive power. For what its worth, a stochastic is a very easy and efficient way to visualize the momentum of prices over time.

It seems to me that an OLAP approach (like pivot tables in MS Excel) fit the problem perfectly.

What you want to do is quite simple - you just have to calculate the autocorrelation of your data and look at the correlogram. From the correlogram you can see 'hidden' periods of your data and then you can use this information to analyze the periods.
Here is the result - your numbers and their normalized autocorrelation.
10 1,000
-2 0,097
1 -0,121
7 0,084
6 0,098
-4 0,154
2 -0,082
-1 -0,550
4 -0,341
5 -0,027
-8 -0,165
-2 -0,212
-1 -0,555
2 -0,426
3 -0,279
2 0,195
3 0,000
4 -0,795
7 -1,000
9
I used Excel to get the values. But the sequence in column A and add the equation =CORREL($A$1:$A$20;$A1:$A20) to cell B1 and copy it then up to B19. If you the add a line diagram, you can nicely see the structure of the data.

You can already make reasonable guesses about the periods of patterns - you're looking at things like weekly and monthly. To look for weekly patterns, for example, just average all the mondays together and so on. Same goes for days of the month, for months of the year.
Sure, you could use a complex algorithm to find out that there's a weekly pattern, but you already know to expect that. If you think there really may be patterns buried there that you'd never suspect (there's a strange community of people who use a 5-day week and frequent your business), by all means, use a strong tool -- but if you know what kinds of things to look for, there's really no need.

Daniel has the right idea when he suggested correlation but I don't think autocorrelation is what you want. Instead I would suggest correlating each week with each other week. Peaks in your correlation--that is values close to 1--suggest that the values of the weeks resemble each other (I.e. are peiodic) for that particular shift.
For example when you cross correlate
0 0 1 2 0 0
with
0 0 0 1 1 0
the result would be
2 0 0 1 3 0
the highest value is 3, which corresponds to shifting (right) the second array by 4
0 0 0 1 1 0 --> 0 0 1 1 0 0
and thenn multiplying component wise
0 0 1 2 0 0
0 0 1 1 0 0
----------------------
0 + 0 + 1 + 2 + 0 + 0 = 3
Note that when you correlate you can create your own "fake" week and cross-correlate all your real weeks, the idea being that you are looking for "shapes" of your weekly values that correspond to the shape of your fake week by looking for peaks in the correlation result.
So if you are interested in finding weeks that are close near the end of the week you could use the "fake" week
-1 -1 -1 -1 1 1
and if you get a high response in the first value of the correlation this means that the real week that you correlated with has roughly this shape.

This is probably beyond the scope of what you're looking for, but one technical approach that would give you the ability to do forecasting, look at things like statistical significance, etc., would be ARIMA or similar Box-Jenkins models.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio